Mesa Suite Version 2.0
Fingerprint Module
Copyright (c) 2010 Mesa Analytics & Computing,  Inc

John D. MacCuish, Norah E. MacCuish, and Mitch Chapman

www.mesaac.com
support@mesaac.com








The Fingerprint Module in the Mesa Suite

The Mesa Suite is a collection of modules for research and applications in chemical information systems (diversity selection, library analysis, compound acquisition, lead hopping, HTS data analysis, database systems, etc.).  The modules can be used separately or in tandem to perform a variety of 2D and 3D tasks.  Currently there are five modules:  Shape Module, Fingerprint Module, Grouping Module, Diversity, and ChemTattoo.  Presently we support Linux, OS X, and Windows.

The Fingerprint Module is written in C++ and requires OEChem from OpenEye Scientific Software, Inc. (www.eyesopen.com).


Summary 

The Fingerprint Module consists of two programs:

gen_mesa768 - Generates 768-bit (and, optionally, count) fingerprints from SMILES string input
gen_maccs320 - Generates 320-bit fingerprints from SMILES string input


Each takes Daylight SMILES strings as input.  The gen_mesa768 fingerprint generator uses a set of simple key definitions gleaned from the MDL MACCS keys and a small set of additional PubChem keys.  The final set of definitions passed a set of statistical tests using ~1.7 million drug and lead-like compounds: complex key definitions were decomposed into simple key definitions, and duplicate key definitions, extremely highly correlated keys, and keys that generated all (or nearly all) 0 or all 1 bits were removed from the superset of MDL and PubChem keys.  The gen_maccs320 fingerprint generator uses the MDL MACCS 320 keys.1


Theory Section

Characterizing chemical structures in binary form based on 2D representations facilitates many tasks of cheminformatics.  Such binary representations are called molecular fingerprints.  Molecular fingerprints were first developed by chemical information systems (CIS) companies to make chemical database queries more efficient.  2D chemical representations are typically stored in chemical databases in a connection table (e.g. SDFile) or SMILES (Simplified Molecular Input Line ...) format.  Exact match, substructure, and superstructure queries against large database systems turn out to be relatively slow if a full query search against each member of a database has to occur for every query.  CIS companies designed molecular fingerprints as a means to screen database members and subselect the chemical structures that might in fact be a hit.  Fingerprints are used to filter out compounds that do not meet certain criteria (e.g. "contains a benzene ring"), thereby avoiding CPU-intensive comparisons for exact match or substructure searches for the majority of a database.  Filtering in this way increases search speed because the "expensive" full query search is performed only on a subset of potential answers rather than on the whole database.
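The screening step described above can be sketched as a bitwise subset test: a database entry can only contain the query substructure if every bit set in the query fingerprint is also set in the entry's fingerprint.  This is a minimal illustration, not code from the Mesa Suite.  The function names, the use of Python integers as bit strings, and the 8-bit toy values are all assumptions made for the example.

```python
def passes_screen(query_fp: int, candidate_fp: int) -> bool:
    """True if every bit set in the query is also set in the candidate."""
    return (query_fp & candidate_fp) == query_fp


def screen_database(query_fp: int, database: dict[str, int]) -> list[str]:
    """Return the names of entries that survive the screen.

    Survivors are only *possible* hits; each one still needs the full
    (expensive) structure match to confirm it.
    """
    return [name for name, fp in database.items() if passes_screen(query_fp, fp)]


# Toy 8-bit fingerprints (illustrative values, not real key assignments).
database = {
    "mol_a": 0b10110100,
    "mol_b": 0b00010100,
    "mol_c": 0b10010110,
}
query = 0b10010100
candidates = screen_database(query, database)
print(candidates)
```

Here mol_b is rejected without any structure matching because it lacks one of the query's bits; only the remaining entries would go on to the full search.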

Several methodologies exist for chemical binary representations.  For example, the Daylight Chemical Information Systems fingerprint is often referred to as a path-based approach.  This amounts to a unique subgraph matching of the graph representation of the chemical structure.  In the Daylight algorithm the fingerprint is "learned" from the structures themselves.  A molecular fingerprint is generated by hashing all the unique connection paths (subgraphs) up to a maximum size (typically 8) into a fixed-length bit string.  Fingerprints may be folded to decrease the length and increase the bit density.  Typical sizes for Daylight fingerprints are 512 or 1024 bits in length, but any power of two can be generated.2
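The hashing and folding ideas can be sketched as follows.  This is not the Daylight algorithm: path enumeration is skipped entirely (paths are supplied as plain strings), and the SHA-1 based hash, the function names, and the one-bit-per-path rule are assumptions chosen only to make the sketch deterministic and short.

```python
import hashlib


def hash_paths(paths, n_bits):
    """Set one bit per distinct path string in an n_bits-long fingerprint."""
    fp = 0
    for path in set(paths):  # duplicates of a path add no information
        digest = hashlib.sha1(path.encode("ascii")).digest()
        position = int.from_bytes(digest[:4], "big") % n_bits
        fp |= 1 << position
    return fp


def fold(fp, n_bits):
    """Fold once: OR the upper half onto the lower half (n_bits must be even)."""
    half = n_bits // 2
    mask = (1 << half) - 1
    return (fp & mask) | (fp >> half)


# Hypothetical path strings for a small molecule.
paths = ["C", "CC", "CCC", "CBr", "CCl", "CCBr", "CCCl"]
fp_1024 = hash_paths(paths, 1024)
fp_512 = fold(fp_1024, 1024)
print(bin(fp_1024).count("1"), bin(fp_512).count("1"))
```

Folding halves the length and can only keep or merge set bits, so the fraction of bits that are set never decreases.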

Molecular Design Limited (MDL) created a key-based fingerprint.  This fingerprint uses a pre-defined set of definitions and creates fingerprints based on pattern matching of the structure to the defined "key" set.  This key-based approach relies on the definitions to encapsulate the molecular descriptions a priori and does not "learn" the keys from the chemical dataset.  The original MDL public key set had 166 keys, and their private key set had 966 keys.  Their more recent publication of "drug-like" keys contains a subset of 320 keys from their 966 set.1  MDL fingerprints can therefore have a maximum bit length of 966.  No folding occurs with this type of fingerprint.
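A key-based fingerprint can be sketched as a fixed, ordered list of tests, one bit per key.  The sketch below is a toy: real key sets are matched against the molecular graph (for example with SMARTS patterns), whereas here each "key" is just a substring test on the SMILES text.  The four keys and all names are hypothetical and are not taken from any MDL or Mesa key set.

```python
# Each key is (description, test).  The tests are crude substring checks on
# the SMILES text, standing in for real substructure pattern matching.
KEYS = [
    ("contains bromine", lambda smi: "Br" in smi),
    ("contains chlorine", lambda smi: "Cl" in smi),
    ("contains a double bond", lambda smi: "=" in smi),
    ("contains a branch", lambda smi: "(" in smi),
]


def key_fingerprint(smiles: str) -> str:
    """Return one '0'/'1' character per key, in key order."""
    return "".join("1" if test(smiles) else "0" for _, test in KEYS)


for smi in ["CCC(Br)CCl", "CCCC(Br)CCl", "CCCCC(Cl)C=C"]:
    print(smi, key_fingerprint(smi))
```

The fingerprint length always equals the number of keys, which is why no folding is involved, and a feature that no key describes simply leaves no trace in the bit string.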

Barnard Chemical Information Systems (BCI) uses a dictionary approach in which the keys for the fingerprinter (the dictionary) are first generated from the dataset and then used as the definitions for the description.  This combines a bit of both the Daylight and MDL approaches.  Typically the BCI dictionary generates thousands of keys, resulting in molecular fingerprint bit lengths on the order of 5,000 bits.

Mesa uses the 320 "drug-like" keys published by MDL to generate 320-bit string representations.  Mesa used to provide the MDL 166 key set as well (in the form of 164 bits -- two keys were removed as they do not relate to drug or lead-like compounds), but this set performs so poorly in practice that the program has been removed from the module.  Flow diagrams still contain the 164-bit program, but the Mesa 768 program (or any other fixed-length fingerprint generator) can be substituted for it.  Users have been evenly divided as to the use and efficacy of the 320 or the 768 keys, often depending on their respective application domains.  At Mesa, we tend to use the 768 key set more regularly, as the keys are more easily understood and less confounding with respect to modal and frequency fingerprints.  Clustering and similarity results are comparable, as is the speed of fingerprint generation.  Naturally, however, (dis)similarity measures take longer to compute on the longer 768-bit fingerprints.  The keys are generated from SMARTS pattern matching against the chemical dataset using the SMARTS matching algorithm in OEChem from OpenEye Scientific Software.

The first three approaches mentioned above all have their advantages and disadvantages.  In the Daylight case, learning the paths from the dataset enables new chemistries to be encapsulated in the fingerprints.  Novel unique paths in the dataset will be encoded in the fingerprint, so searches against such databases will produce structure hit lists that contain this new chemistry.  The disadvantage of the Daylight fingerprinting approach is that in some cases the unique paths do not encode symmetric systems well: because multiple counts of an identical path are not included in the description, the fingerprint may not be able to distinguish between a monomer and a dimer, for example.  In the MDL case, the user depends on the key set created by MDL to encapsulate all of the chemistry in their database or chemical dataset.  The keys do take into account multiple counts of some features, which can be an advantage over the Daylight approach, but they may not be able to uniquely describe a chemical dataset if they are missing some of the chemical features in the dataset.  BCI does seem to combine the best of both strategies, but at the expense of generating very large fingerprints.  Mesa's approach is identical to the MDL approach, so our fingerprints have the same advantages and disadvantages.  The decision as to which is the "best" fingerprinter should not be taken lightly, and Mesa A&C believes the choice of a fingerprinter should be dictated by the data under study.  For example, if one has an acquisition database full of inorganic substances that one would like to cluster, using our fingerprinter would be a "bad" choice.
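The effect of ignoring multiple occurrences can be seen in a toy comparison of binary and count descriptions.  As before, the features are crude substring counts on the SMILES text rather than real substructure matches, and the feature list and function names are hypothetical.

```python
FEATURES = ["Cl", "Br", "="]  # hypothetical features, counted as substrings


def count_fingerprint(smiles: str) -> list[int]:
    """Number of occurrences of each feature."""
    return [smiles.count(feature) for feature in FEATURES]


def binary_fingerprint(smiles: str) -> list[int]:
    """Presence/absence only: any count above zero collapses to 1."""
    return [1 if count > 0 else 0 for count in count_fingerprint(smiles)]


one_chlorine = "CCCl"
two_chlorines = "ClCCCCCl"

print(binary_fingerprint(one_chlorine), binary_fingerprint(two_chlorines))
print(count_fingerprint(one_chlorine), count_fingerprint(two_chlorines))
```

The two structures are indistinguishable in the binary form but differ in the count form, which is the kind of information a count-aware key or a count fingerprint retains.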

Example Applications and Flow Diagrams

  1. The Fingerprint Module programs are typically used to generate fingerprints for input into the Measures or ChemTattooModalStats programs.  The Measures program generates similarity or dissimilarity matrices, which are necessary input for the RussianDollTransformation, SimilarityOutput, or Clustering programs in the Grouping Module.  The ChemTattooModalStats program returns the modal fingerprint at a threshold and the frequency fingerprint of the set.  For more information on these programs please see the Grouping Module Manual and the ChemTattoo Manual.
  2. The Fingerprint Module programs can generate binary string output for use with any non-Mesa program that requires such a representation.
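To show how bit-string output can feed a downstream similarity step, here is a minimal sketch that builds a similarity matrix from '0'/'1' strings.  It is not the Measures program: the Tanimoto coefficient is assumed here only because it is a common choice for binary fingerprints, and the function names and 8-bit inputs are illustrative.

```python
def tanimoto(a: str, b: str) -> float:
    """Tanimoto similarity of two equal-length '0'/'1' strings."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(a, b) if x == "1" and y == "1")
    either = sum(1 for x, y in zip(a, b) if x == "1" or y == "1")
    # Two all-zero fingerprints are treated as identical (a convention).
    return both / either if either else 1.0


def similarity_matrix(fingerprints: list[str]) -> list[list[float]]:
    """Full symmetric matrix of pairwise similarities."""
    return [[tanimoto(a, b) for b in fingerprints] for a in fingerprints]


fps = ["10010100", "01010100", "10010011"]
matrix = similarity_matrix(fps)
for row in matrix:
    print(" ".join(f"{value:.3f}" for value in row))
```

A dissimilarity matrix under the same assumption would simply use 1 minus each entry.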
   


Detailed Summary of Programs
with both I/O and Commandline Examples, and Specific References

The Fingerprint Module:   The generators take as input a file of Daylight SMILES (.smi file) in single-column format.  E.g.

CCC(Br)CCl
CCCC(Br)CCl
CCCCC(Cl)C=C

By default, bit strings are output in column form (one fingerprint per line), without spaces.  E.g.,

10010100...
01010100...
10010011...
.
.
.
However, the programs also offer the option of outputting fingerprints as a set of unsigned long integers, hexadecimal characters, or raw character bytes.
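For readers who need to convert between these representations outside the Mesa programs, the sketch below shows one way a '0'/'1' string could be packed into bytes, hexadecimal characters, or unsigned integers.  The bit order (leftmost bit is most significant), the zero padding on the right, and the 32-bit word size are assumptions for illustration; the layout actually written by the generators may differ.

```python
def _pad(bits: str, multiple: int) -> str:
    """Right-pad with '0' so the length is a multiple of `multiple`."""
    padded_length = -(-len(bits) // multiple) * multiple  # ceiling division
    return bits.ljust(padded_length, "0")


def bits_to_bytes(bits: str) -> bytes:
    """Pack 8 bits per byte, leftmost bit most significant."""
    padded = _pad(bits, 8)
    return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))


def bits_to_hex(bits: str) -> str:
    """Two hexadecimal characters per packed byte."""
    return bits_to_bytes(bits).hex()


def bits_to_words(bits: str, word_bits: int = 32) -> list[int]:
    """Split into unsigned integers of word_bits bits each."""
    padded = _pad(bits, word_bits)
    return [int(padded[i:i + word_bits], 2) for i in range(0, len(padded), word_bits)]


print(bits_to_hex("10010100"))
print(bits_to_words("1" * 32))
```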
A SMILES data file is needed for input, where the first column in the file contains just SMILES strings.  The file may contain additional columns of associated data, provided the columns are space delimited.  The user also has the option of outputting just the fingerprints, or the fingerprints and the SMILES plus any associated data.  If a SMILES cannot be parsed, the error is logged in an error log file with the offending SMILES and its index into the original SMILES data file.  If the option to return all associated data is on, the error file will contain the associated data as well.
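The input handling described above can be sketched as follows.  This is not the Mesa implementation: the 1-based index, the error-line layout, and all function names are assumptions, and the balanced-parentheses check is only a stand-in for real SMILES parsing, which requires a chemistry toolkit.

```python
def split_record(line: str) -> tuple[str, str]:
    """Split a line into the SMILES (first column) and the associated data."""
    parts = line.split(None, 1)
    data = parts[1].strip() if len(parts) > 1 else ""
    return parts[0], data


def balanced_parens(smiles: str) -> bool:
    """Stand-in validity check: branch parentheses must balance."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def read_smiles(lines, is_valid=balanced_parens, keep_data=True):
    """Return (good records, error log lines); the index is the 1-based line number."""
    good, errors = [], []
    for index, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines but keep counting them
        smiles, data = split_record(line)
        if is_valid(smiles):
            good.append((smiles, data))
        elif keep_data and data:
            errors.append(f"{index} {smiles} {data}")
        else:
            errors.append(f"{index} {smiles}")
    return good, errors


lines = ["CCC(Br)CCl id1 1.5", "CCCC(Br(CCl id2", "CCCCC(Cl)C=C"]
good, errors = read_smiles(lines)
print(good)
print(errors)
```

Keeping the original line index in the error log makes it possible to line up failed records with the source file even when only fingerprints are written to the main output.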




References

  1. Reoptimization of MDL Keys for Use in Drug Discovery, J. L. Durant, B. A. Leland, D. R. Henry, J. G. Nourse, JCICS, 2002, 42 (6), 1273-1280.
  2. Daylight Chemical Information Systems, Inc. Daylight Clustering Manual.
  3. MDL Information Systems, Inc.