The latest development from Fraunhofer Institute SCAI (www.scai.fraunhofer.de) is the integration of chemoCR, a software tool for Chemical Compound Reconstruction, into SMILA as an extraction pipelet (cf. illustration 1).
chemoCR makes the chemical information contained in depictions of chemical structures accessible to computer programs as a connection table.
To recognize chemical structures in image documents and translate them into a machine-readable form, chemoCR combines pattern recognition techniques with a rule-based chemical expert system. The method is based on identifying the most significant fragments of small molecules in the depictions. The workflow consists of three phases: image vectorization, chemical object extraction, and molecule reconstruction.
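To illustrate what a connection table is, the following minimal sketch stores a molecule as a list of atoms plus a list of bonds. The class name `ConnectionTable`, its fields, and the `neighbors` method are hypothetical names chosen for this illustration; they are not chemoCR's internal data model or output format.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ConnectionTable:
    """Hypothetical minimal connection table (not chemoCR's own format).

    atoms: element symbols, indexed from 0
    bonds: (atom_index, atom_index, bond_order) triples
    """
    atoms: list[str]
    bonds: list[tuple[int, int, int]] = field(default_factory=list)

    def neighbors(self, index: int) -> list[tuple[int, int]]:
        """Return (neighbor_index, bond_order) pairs for one atom."""
        result = []
        for a, b, order in self.bonds:
            if a == index:
                result.append((b, order))
            elif b == index:
                result.append((a, order))
        return result


# Acetic acid, CH3-C(=O)-OH, heavy atoms only (hydrogens left implicit).
acetic_acid = ConnectionTable(
    atoms=["C", "C", "O", "O"],
    bonds=[(0, 1, 1), (1, 2, 2), (1, 3, 1)],
)

for i, element in enumerate(acetic_acid.atoms):
    print(i, element, acetic_acid.neighbors(i))
```

Once a depiction has been reduced to a structure of this kind, the molecule is an ordinary graph that standard algorithms can traverse.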
Application Fields
Most chemical structure information in the literature (including patents) is present as two-dimensional graphical representations. A chemist can easily interpret these images as atoms and bonds, but to a computer they are only pixels. On its own, the computer cannot extract this chemical knowledge from the picture. As a result, you cannot search for molecules in pictures, or index documents by the pictures they contain, unless the information is also present in the figure caption. A query such as “Find me documents showing molecules containing a benzene ring” cannot be answered.
If the picture is converted into a connection table, however, several chemoinformatics algorithms exist to solve this problem. After the conversion, much information about the molecule can be computed directly or retrieved from chemical databases.
chemoCR therefore supports
• retrieval
• indexing
• property prediction
of molecules shown in structure depictions.
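The benzene-ring query mentioned above becomes a graph search once a connection table is available. The sketch below is a simplified illustration, not the algorithm used by chemoCR or by any particular chemoinformatics toolkit. It assumes the molecule is given as a list of element symbols plus `(atom, atom, bond_order)` triples in Kekulé form, and it looks for six carbon atoms in a ring with alternating single and double bonds. The function name `has_benzene_ring` is hypothetical.

```python
def has_benzene_ring(atoms, bonds):
    """True if six carbons form a ring with alternating single/double bonds."""
    # Adjacency map: atom index -> {neighbor index: bond order}
    adjacency = {i: {} for i in range(len(atoms))}
    for a, b, order in bonds:
        adjacency[a][b] = order
        adjacency[b][a] = order

    def extend(path):
        # Depth-first walk over carbon atoms, looking for a closed 6-cycle.
        if len(path) == 6:
            if path[0] not in adjacency[path[-1]]:
                return False  # the sixth atom does not close the ring
            ring = path + [path[0]]
            orders = [adjacency[ring[k]][ring[k + 1]] for k in range(6)]
            return orders in ([1, 2] * 3, [2, 1] * 3)
        for nxt in adjacency[path[-1]]:
            if atoms[nxt] == "C" and nxt not in path:
                if extend(path + [nxt]):
                    return True
        return False

    return any(
        extend([start]) for start in range(len(atoms)) if atoms[start] == "C"
    )


# Toluene: a benzene ring (atoms 0-5) with a methyl carbon (atom 6).
toluene_atoms = ["C"] * 7
toluene_bonds = [(0, 1, 2), (1, 2, 1), (2, 3, 2), (3, 4, 1),
                 (4, 5, 2), (5, 0, 1), (0, 6, 1)]

# Cyclohexane: six carbons in a ring, single bonds only.
cyclohexane_atoms = ["C"] * 6
cyclohexane_bonds = [(i, (i + 1) % 6, 1) for i in range(6)]

print(has_benzene_ring(toluene_atoms, toluene_bonds))          # True
print(has_benzene_ring(cyclohexane_atoms, cyclohexane_bonds))  # False
```

A production system would normally rely on a general substructure-matching routine with proper aromaticity perception instead of this hand-written ring check.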
Features
• Conversion of various bitmap images (e.g. BMP, GIF, PNG, multi-page TIF) into chemical file formats (e.g. SMILES, SDF)
• PDF document processing
• Handling of depictions that contain multiple molecules
• Chemical page segmentation of full page scans into text and image
• Fully automatic batch processing mode (can be used as a crawler)
• Reconstruction of the full bond information (single, double, triple, chiral bonds)
• Recognition of superatoms and their conversion into structural representation
• Scoring scheme for the reconstruction process based on known chemical scaffolds
• Training ability for the OCR process (e.g. fused letters) and teaching of new superatoms
• Customization via easy manipulation of XML parameter files
• Chemical intelligence (e.g. filling free valences)
• Recognition of R-groups (Markush structures and bridged ring systems are not supported)
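As an illustration of the "filling free valences" feature listed above, the following sketch adds implicit hydrogens to a heavy-atom connection table and derives a molecular formula from it. It is a simplified example under stated assumptions, not chemoCR's rule set: the `DEFAULT_VALENCE` table, the function names, and the restriction to neutral atoms with a single standard valence are all assumptions made for this example.

```python
from collections import Counter

# Assumed standard valences for a few neutral elements (illustrative only).
DEFAULT_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1, "Br": 1}


def implicit_hydrogens(atoms, bonds):
    """Number of hydrogens needed to saturate each atom's free valences."""
    used = [0] * len(atoms)
    for a, b, order in bonds:
        used[a] += order
        used[b] += order
    return [max(DEFAULT_VALENCE[el] - u, 0) for el, u in zip(atoms, used)]


def molecular_formula(atoms, bonds):
    """Formula with carbon first, hydrogen second, then other elements A-Z."""
    counts = Counter(atoms)
    counts["H"] += sum(implicit_hydrogens(atoms, bonds))
    order = [e for e in ("C", "H") if counts[e]]
    order += sorted(e for e in counts if e not in ("C", "H"))
    return "".join(
        e + (str(counts[e]) if counts[e] > 1 else "") for e in order
    )


# Acetic acid, CH3-C(=O)-OH, given as heavy atoms and bond orders only.
atoms = ["C", "C", "O", "O"]
bonds = [(0, 1, 1), (1, 2, 2), (1, 3, 1)]

print(implicit_hydrogens(atoms, bonds))  # [3, 0, 0, 1]
print(molecular_formula(atoms, bonds))   # C2H4O2
```

Charged atoms, radicals, and elements with several common valences would need additional rules, which is where a chemical expert system goes beyond this sketch.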