Improved conversion framework for OpenBabel


OpenBabel is an open-source project developing a program to interconvert the many file formats that describe molecular structure. The code described here is based on the version 1.100.2 files on the OpenBabel SourceForge site, except for cml.cpp, which came from the CML site.

Presented here are some suggestions for longer-term mods to the conversion process in OpenBabel. I feel that, although they may not be backward compatible, these features would be desirable to provide flexibility and maintainability for the future. Most have been previously discussed in the OpenBabel Discussion Forum.

I've put together some working code that separates the parts of the OB program (see diagram below). The separation is taken far enough that the parts can all be in different DLLs if necessary.

The separate parts of the OB program (see diagram below) are:

The obvious aim is to make each part sufficiently independent that it can be changed without affecting the rest of the code.

Without changing the source code, the program can be compiled as separate DLLs (with an exe for the user interface), or all together, as at present. I have tried to make it platform independent except for the GUI and the deployment of the DLLs.

The routines which read and write files in various chemical formats are now each in a class derived from a base class OBFormat, and are handled dynamically. By this I mean that no part of the framework knows about any format at the start. Each format is included by compiling its file into the main program or by making it available as a DLL/shared library. Each format registers itself with OBConversion in its constructor, which runs when a global instance of the format is created, either when the program initializes or when the DLL is loaded.
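As a rough illustration of this self-registration idea, here is a minimal C++ sketch. It is not the framework's actual code: the names Format, Registry, Register, Find and SmilesFormat are hypothetical stand-ins (Registry plays the part that OBConversion plays in the text), and the real class interfaces may differ.

```cpp
#include <cassert>
#include <map>
#include <string>

namespace sketch {

// Hypothetical stand-in for the format base class.
class Format {
public:
    virtual ~Format() = default;
    virtual std::string Description() const = 0;
};

// Hypothetical stand-in for the registry role of the conversion class.
class Registry {
public:
    // A function-local static avoids depending on the initialization
    // order of globals in different translation units.
    static std::map<std::string, Format*>& Formats() {
        static std::map<std::string, Format*> formats;
        return formats;
    }
    static void Register(const std::string& id, Format* format) {
        assert(format != nullptr);
        Formats()[id] = format;
    }
    // Returns nullptr if no format with this ID has registered itself.
    static Format* Find(const std::string& id) {
        auto it = Formats().find(id);
        return it == Formats().end() ? nullptr : it->second;
    }
};

// The format registers itself in its constructor ...
class SmilesFormat : public Format {
public:
    SmilesFormat() { Registry::Register("smi", this); }
    std::string Description() const override { return "SMILES format"; }
};

// ... which runs when this global instance is created: at program
// start-up, or when the DLL/shared library containing it is loaded.
SmilesFormat theSmilesFormat;

}  // namespace sketch
```

With this arrangement the registry needs no list of formats compiled into it; a call such as sketch::Registry::Find("smi") succeeds only because the file defining the format was linked in or loaded.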

OBConversion provides a list of the available formats, and information on them, when requested. It passes on the input and output streams from the user interface, handles the conversion options and controls the conversion process. It calls the ReadMolecule() and WriteMolecule() functions of the classes derived from OBFormat polymorphically. As at present, the conversion loops until no more molecules (etc.) are found in the input stream.
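The shape of such a conversion loop might look like the following sketch. The signatures of ReadMolecule() and WriteMolecule() shown here are assumptions made for the example, as are the toy types Base, Line and LineFormat and the free function Convert; the real framework passes more information around.

```cpp
#include <cassert>
#include <istream>
#include <memory>
#include <ostream>
#include <sstream>
#include <string>

namespace sketch {

struct Base { virtual ~Base() = default; };
struct Line : Base { std::string text; };  // toy "molecule": one line of text

// Hypothetical format interface; the signatures are assumed.
class Format {
public:
    virtual ~Format() = default;
    // Returns nullptr when nothing more can be read from the stream.
    virtual std::unique_ptr<Base> ReadMolecule(std::istream& in) = 0;
    virtual bool WriteMolecule(const Base& obj, std::ostream& out) = 0;
};

class LineFormat : public Format {
public:
    std::unique_ptr<Base> ReadMolecule(std::istream& in) override {
        std::string text;
        if (!std::getline(in, text)) return nullptr;
        auto line = std::make_unique<Line>();
        line->text = text;
        return line;
    }
    bool WriteMolecule(const Base& obj, std::ostream& out) override {
        const Line* line = dynamic_cast<const Line*>(&obj);
        if (line == nullptr) return false;
        out << line->text << '\n';
        return true;
    }
};

// The loop knows only the base interface; it runs until the input
// format finds no further object. Returns the number converted.
int Convert(Format* inFormat, Format* outFormat,
            std::istream& in, std::ostream& out) {
    assert(inFormat != nullptr && outFormat != nullptr);
    int count = 0;
    while (auto obj = inFormat->ReadMolecule(in)) {
        if (!outFormat->WriteMolecule(*obj, out)) break;
        ++count;
    }
    return count;
}

}  // namespace sketch
```

Because Convert sees only the Format interface, adding a new format does not require touching the loop.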

A new aspect of the improved framework is that the type of object converted is decided by the read format. Although this is currently usually an OBMol, it can be something more complicated (derived from OBBase), and no code changes are necessary outside the particular OBFormat classes involved. The write routines check (by RTTI) that the object they are given is one they can handle.
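A write routine's RTTI check could be sketched with dynamic_cast as below. The classes Base, Mol and Reaction are simplified stand-ins for OBBase, OBMol and a reaction class, and the functions WriteMol and WriteReaction are hypothetical; only the checking technique is the point.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

namespace sketch {

// Simplified stand-ins; the real classes carry far more.
struct Base { virtual ~Base() = default; };
struct Mol : Base { std::string title; };
struct Reaction : Base { std::vector<Mol> reactants, products; };

// A molecule writer accepts only Mol objects and refuses anything else.
bool WriteMol(const Base* obj, std::ostream& out) {
    assert(obj != nullptr);
    const Mol* mol = dynamic_cast<const Mol*>(obj);
    if (mol == nullptr) return false;  // not an object this format handles
    out << mol->title << '\n';
    return true;
}

// A reaction writer does the same check for its own object type.
bool WriteReaction(const Base* obj, std::ostream& out) {
    assert(obj != nullptr);
    const Reaction* rxn = dynamic_cast<const Reaction*>(obj);
    if (rxn == nullptr) return false;
    out << rxn->reactants.size() << " -> " << rxn->products.size() << '\n';
    return true;
}

}  // namespace sketch
```

The calling code hands over a Base pointer without knowing what it points to; a mismatch between read and write formats shows up as a refused write rather than a crash.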

I have written a minimal converter for a RXN file to illustrate this, where the object converted is a class representing a reaction. This file type contains embedded MDL MOL structures, and the code for that format is reused by calling it through OBConversion. This avoids a rat's nest of connections between different parts of the code. In addition, ReadMolecule() and WriteMolecule() have now been given access to the calling structure (OBConversion, passed as a parameter). This enables callback methods to be used, as they are in the RXN format reader, which makes calls to output the embedded molecules as they arise during the read process.
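The callback arrangement can be illustrated with a toy sketch in which the reader receives the conversion object and uses it to emit each embedded item as soon as it is read. Everything here is hypothetical: Conversion, Output, Count, ToyRxnFormat and ToyListFormat are invented for the example, molecules are reduced to plain strings, and the real parameter lists are not given in the text.

```cpp
#include <cassert>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

namespace sketch {

class Conversion;

// Hypothetical format interface: the reader is handed the conversion object.
class Format {
public:
    virtual ~Format() = default;
    virtual bool Read(std::istream& in, Conversion& conv) = 0;
    virtual bool Write(const std::string& mol, std::ostream& out) = 0;
};

class Conversion {
public:
    Conversion(Format* outFormat, std::ostream* out)
        : outFormat_(outFormat), out_(out) {
        assert(outFormat != nullptr && out != nullptr);
    }
    // Callback for readers: send one object to the output format now.
    bool Output(const std::string& mol) {
        if (!outFormat_->Write(mol, *out_)) return false;
        ++count_;
        return true;
    }
    int Count() const { return count_; }
private:
    Format* outFormat_;
    std::ostream* out_;
    int count_ = 0;
};

// Toy "reaction" reader: each whitespace-separated token stands for an
// embedded molecule, passed on through the callback as it is found.
class ToyRxnFormat : public Format {
public:
    bool Read(std::istream& in, Conversion& conv) override {
        std::string mol;
        bool any = false;
        while (in >> mol) {
            if (!conv.Output(mol)) return false;
            any = true;
        }
        return any;
    }
    bool Write(const std::string&, std::ostream&) override { return false; }
};

// Toy output format: one molecule per line.
class ToyListFormat : public Format {
public:
    bool Read(std::istream&, Conversion&) override { return false; }
    bool Write(const std::string& mol, std::ostream& out) override {
        out << mol << '\n';
        return true;
    }
};

}  // namespace sketch
```

The reader never names the output format; it talks only to the conversion object, which is what keeps the two formats decoupled.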

The changes needed to the format files are quite small. It is possible to use a wrapper file, which calls the format's original global read and write routines, and which is the same for all formats apart from a few name changes. Currently I have done this with CML. MDL MOL is done as a rewritten file, but most of the original code is unchanged. The new framework makes it possible to include a few lines in the CML wrapper file to provide a <cml> ... </cml> cover for the output when there is more than one molecule, but not otherwise. Most of the formats have not been modified or wrapped yet and the implementation includes only a few formats (MOL, SMI, CML, RXN, MOL2) at present.
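One way a wrapper could add the <cml> cover only for multi-molecule output is sketched below. The text does not say how the writer learns whether more molecules follow, so the index and isLast parameters are assumptions, and LegacyWriteCML is a hypothetical stand-in for the format's original global write routine.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

namespace sketch {

// Hypothetical stand-in for the original global write routine.
bool LegacyWriteCML(std::ostream& out, const std::string& molId) {
    out << "<molecule id=\"" << molId << "\"/>\n";
    return true;
}

// Wrapper: delegates to the legacy routine and adds the cover element
// only when the output holds more than one molecule.
// index is 1-based; isLast says whether this is the final molecule.
// Both are assumed to be supplied by the conversion loop.
bool WriteCML(const std::string& molId, int index, bool isLast,
              std::ostream& out) {
    assert(index >= 1);
    if (index == 1 && !isLast) out << "<cml>\n";   // open cover
    if (!LegacyWriteCML(out, molId)) return false;
    if (isLast && index > 1) out << "</cml>\n";    // close cover
    return true;
}

}  // namespace sketch
```

A single molecule (index 1, isLast true) passes through with no cover, while a run of two or more is wrapped exactly once.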

The options for the conversion process are of several different types and good C++ practice says that they need to be handled in the most relevant class. The -i and -o options are part of the conversion process itself, and even more particularly, are part of the command line interpretation, so they are handled there. The -f and -l options (counting molecules) are also part of the conversion and are handled in OBConversion which has the main conversion loop. Options like -h and -d, adding and removing explicit hydrogens, are concerned with the chemistry, and furthermore are also specific to OBMol as the conversion entity, so have been moved to that class. The opportunity is now there for similar transform options for conversion of other chemical entities, and for options for input as well as output formats. Lastly, the output options, currently used only by CML, are handled in the output format class.

The console interface looks very similar to the previous one, but the handling of input typed at the keyboard (SMILES really) has been slightly improved. The help info is now dynamic to reflect the available formats, and most of the options are passed on to OBConversion so that they can be updated without altering the interface code. As a little extra, I've added the option of a structural filter using SMARTS so that much of the functionality of Fabien Fontaine's obgrep is available here. For instance,

  OBabel database.xxx -osmi -s"COC";

will output to the console in SMILES format all the ethers and esters in a database file of xxx format.

The new framework allows a GUI (graphical user interface) to be a drop-in replacement for the console. See the Windows interface.

The big disadvantage of changing a framework is that it is probably not compatible with applications that already use OB programmatically. Examples show that using an OBMol from a program is only slightly more complicated than at present, but of course it is different. However, if an external program needs only to carry out a conversion, it needs to know nothing of the chemical part of the program. If OB were in a DLL, the external program would not need to #include mol.h, and most of OB could be modified without affecting the external program, provided a few functions of OBConversion were retained. This separation is not available at present, and it would make future development easier and safer. Separating the conversion process from the chemical representation as much as possible means that the details of the chemistry (and any changes that may subsequently be made to it) do not affect programs which are interested only in converting formats.

I have used this feature to allow an application of mine which previously imported MDL mol files to import a range of file formats depending on their extensions. The OBDLL and the format DLLs are not loaded until they are needed. A fragment of the code is here.
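Separately from that fragment, the general idea of choosing a format by file extension and loading it only on first use can be sketched as follows. This is not the application's code: ExtensionOf and LazyFormats are hypothetical, and the Loader callback merely stands in for whatever platform-specific call loads a format DLL.

```cpp
#include <cassert>
#include <cctype>
#include <functional>
#include <map>
#include <set>
#include <string>
#include <utility>

namespace sketch {

// Lower-cased extension without the dot; empty if the name has none.
std::string ExtensionOf(const std::string& filename) {
    const std::string::size_type dot = filename.find_last_of('.');
    const std::string::size_type slash = filename.find_last_of("/\\");
    if (dot == std::string::npos) return "";
    if (slash != std::string::npos && dot < slash) return "";  // dot in a directory name
    std::string ext = filename.substr(dot + 1);
    for (char& c : ext)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return ext;
}

class LazyFormats {
public:
    // Stands in for loading a format DLL; returns true on success.
    using Loader = std::function<bool()>;

    void Add(const std::string& ext, Loader loader) {
        assert(loader);
        loaders_[ext] = std::move(loader);
    }

    // Makes sure the format for this file is loaded, running its
    // loader only the first time it is needed.
    bool Ensure(const std::string& filename) {
        const std::string ext = ExtensionOf(filename);
        auto it = loaders_.find(ext);
        if (it == loaders_.end()) return false;  // unknown extension
        if (loaded_.count(ext) != 0) return true;
        if (!it->second()) return false;
        loaded_.insert(ext);
        return true;
    }

private:
    std::map<std::string, Loader> loaders_;
    std::set<std::string> loaded_;
};

}  // namespace sketch
```

An importing application would call Ensure with the chosen file name before asking for a conversion, so that formats it never meets are never loaded.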