Mesa Suite Version 2.0
Copyright (c) 2010 Mesa Analytics & Computing, Inc
John D. MacCuish, Norah E. MacCuish, and Mitch Chapman
The Fingerprint Module in the Mesa Suite
The Mesa Suite is a collection of application modules for research
and applications (diversity selection, library analysis, compound
acquisition, lead hopping, HTS data analysis, database systems, etc.)
and information systems. The modules can be used separately or in
combination to perform a variety of 2D and 3D tasks. Currently there
are five modules: Shape Module, Fingerprint Module, Grouping Module,
Diversity, and ChemTattoo. Presently
we support Linux, OS X, and Windows.
The Fingerprint Module is written in C++ and requires OEChem from OpenEye Scientific Software, Inc. (www.eyesopen.com).
The Fingerprint Module consists of two programs:
gen_mesa768 - Generates 768-bit (and optionally, count) fingerprints from SMILES string input
gen_maccs320 - Generates 320-bit fingerprints from SMILES string input
Each takes Daylight SMILES strings as input. The gen_mesa768 fingerprint generator uses a set of simple key definitions gleaned from the MDL MACCS keys and a small set of additional PubChem keys. Together these form a set of definitions that passed a set of statistical tests using ~1.7 million drug and lead-like compounds. Namely, complex key definitions were decomposed into simple key definitions, and duplicate key definitions, extremely highly correlated keys, and keys that generated all (or nearly all) 0 or all 1 bits were removed from the superset of MDL and PubChem keys. The gen_maccs320 fingerprint generator uses the MDL MACCS 320 keys [1].
Characterizing chemical structures in binary form based on 2D representations facilitates many cheminformatics tasks. Such binary representations are called molecular fingerprints. Molecular fingerprints were first developed by chemical information systems (CIS) companies to make chemical database queries more efficient. 2D chemical representations are typically stored in chemical databases in a connection table (e.g. SDFile) or SMILES (Simplified Molecular Input Line ...) format. Exact match, substructure, and superstructure queries against large database systems turn out to be relatively slow if a full query search against each member of a database has to occur for every query. CIS companies designed molecular fingerprints as a means to screen database members and subselect the chemical structures that might in fact be a hit. Fingerprints are used to filter out compounds that do not meet certain criteria (e.g. "contains a benzene ring"), thereby avoiding CPU-intensive comparisons for exact match or substructure searches for the majority of a database. Filtering in this way increases search speed by performing the "expensive" full query search only on a subset of potential answers rather than on the whole database.
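The screening step described above can be sketched as a bitwise subset test: a database entry can only contain the query substructure if every bit set in the query fingerprint is also set in the entry's fingerprint. This is a minimal illustration, not the Mesa implementation; the 8-bit fingerprints, the molecule names, and the function names are hypothetical.

```python
def passes_screen(query_fp: int, target_fp: int) -> bool:
    """Return True if every bit set in the query is also set in the target."""
    return (query_fp & target_fp) == query_fp


def screen_database(query_fp: int, database: dict) -> list:
    """Return the names of entries that survive the fingerprint screen.

    Survivors are only candidates: the expensive full substructure
    search still has to be run on each of them.
    """
    return [name for name, fp in database.items() if passes_screen(query_fp, fp)]


# Hypothetical 8-bit fingerprints, stored as Python integers.
database = {
    "mol_a": 0b10110100,
    "mol_b": 0b00010100,
    "mol_c": 0b10010110,
}
query = 0b10010100

candidates = screen_database(query, database)
print(candidates)
```

Only the entries in `candidates` would go on to the full query search; the rest of the database is skipped.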
Several methodologies exist for chemical binary representations.
For example, the Daylight Chemical Information Systems fingerprint is
referred to as a path-based approach. This amounts to
a unique subgraph matching of the graph representation of the chemical
structure. In the Daylight algorithm the fingerprint is "learned" from the
structures themselves. A molecular fingerprint is generated from a hash of
the unique connection paths (subgraphs) up to a maximum size (typically ...)
into a fixed-length bit string. Fingerprints may be folded to reduce
the length and increase the bit density. Typical sizes for
fingerprints are 512 or 1024 bits in length, but any power of two can
be used.
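As a rough sketch of hashing and folding, the example below hashes a list of path strings into a fixed-length bit string and then folds it in half by OR-ing the two halves together. It makes several assumptions: the path strings are hypothetical and supplied by hand (enumerating them from a molecule requires a cheminformatics toolkit), SHA-256 stands in for whatever hash a real path-based fingerprinter uses, and all function names are illustrative.

```python
import hashlib


def hash_paths(paths, n_bits):
    """Set one bit per path in an n_bits-long fingerprint (held as an int)."""
    fp = 0
    for path in paths:
        digest = hashlib.sha256(path.encode("utf-8")).digest()
        bit = int.from_bytes(digest[:8], "big") % n_bits
        fp |= 1 << bit
    return fp


def fold(fp, n_bits):
    """Fold a fingerprint in half by OR-ing the upper half onto the lower half."""
    half = n_bits // 2
    mask = (1 << half) - 1
    return (fp & mask) | (fp >> half), half


def density(fp, n_bits):
    """Fraction of bits that are set."""
    return bin(fp).count("1") / n_bits


# Hypothetical path strings for illustration only.
paths = ["C-C", "C-C-O", "C=O", "C-N", "C-C-N"]
fp = hash_paths(paths, 1024)
folded, folded_bits = fold(fp, 1024)
print(density(fp, 1024), density(folded, folded_bits))
```

Folding halves the length, and the bit density never decreases, because each folded bit is set if either of its two source bits was set.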
Molecular Design Limited (MDL) created a key-based fingerprint.
This fingerprint uses a pre-defined set of definitions and generates
fingerprints based on pattern matching of the structure to the defined
"key" set. This key-based approach relies on the definitions to
encapsulate the molecular descriptions a priori and does not learn
the keys from the chemical dataset. The original MDL public
key set was 166 keys, and their private key set comprised 966
keys. Their recent publication of "drug-like" keys contains a subset
of 320 keys from their 966 set [1]. So MDL fingerprints
could take on a maximum bit length of 966. No folding occurs with
this type of fingerprint.
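The key-based idea can be illustrated with a toy example in which each bit position is tied to one pre-defined key and the fingerprint length equals the number of keys. The four keys below are invented for illustration and are not MDL or Mesa keys. They are also crude regular expressions over the SMILES text, used here only as a stand-in for real substructure pattern matching in a cheminformatics toolkit.

```python
import re

# Hypothetical key set: one (description, pattern) pair per bit position.
# Regular expressions on SMILES text are a crude stand-in for real
# substructure keys and will misfire on many inputs.
KEYS = [
    ("contains nitrogen", re.compile(r"N|n")),
    ("contains aromatic carbon", re.compile(r"c")),
    ("contains double-bonded oxygen", re.compile(r"=O|O=")),
    ("contains chlorine or bromine", re.compile(r"Cl|Br")),
]


def key_fingerprint(smiles: str) -> str:
    """Return one '0'/'1' character per key, in key order (no hashing, no folding)."""
    return "".join("1" if pattern.search(smiles) else "0" for _, pattern in KEYS)


print(key_fingerprint("CC(=O)Nc1ccccc1"))
print(key_fingerprint("CCCl"))
```

Because the bit positions are fixed by the key list, a set bit can always be traced back to the key definition that produced it.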
Barnard Chemical Information Systems (BCI) uses a dictionary approach in which the keys for the fingerprinter are first generated from the dataset and then implemented in the description. This combines elements of both the Daylight and MDL approaches. Typically the BCI dictionary generates thousands of keys, resulting in molecular fingerprint bit lengths on the order of 5,000 bits.
Mesa uses the 320 "drug-like" keys published by MDL to generate
320-bit string representations. Mesa used to provide the MDL 166
key set as well (in the form of 164 bits -- two keys were removed as
they do not relate to drug or lead-like compounds), but this set
performs so poorly in practice that this program has been removed from
the module. Flow diagrams still contain the 164-bit program, but
the Mesa 768 can be substituted for this program (or any fixed-length
fingerprint can be substituted). Users have been evenly divided
as to the use and efficacy of the 320 or the 768 keys, often depending
on their respective application domains. At Mesa, we tend to use
the 768 key set more regularly as the keys are more easily understood
and less confounding with respect to modal and frequency
fingerprints. Clustering and similarity results are comparable,
as is the speed of generation. Naturally, however, (dis)similarity
measures take longer to compute with the 768-bit fingerprints. The
keys are generated by
SMARTS pattern matching against the chemical dataset using the SMARTS
algorithm in OEChem from OpenEye Scientific Software.
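To make the cost of a (dis)similarity measure concrete, the sketch below computes a Tanimoto coefficient between two bit-string fingerprints. The text does not say which measure the Mesa programs use; Tanimoto is assumed here only because it is a common choice for binary fingerprints. The function name and the treatment of two all-zero fingerprints are likewise assumptions.

```python
def tanimoto(fp_a: str, fp_b: str) -> float:
    """Tanimoto coefficient between two equal-length '0'/'1' strings."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    a = int(fp_a, 2)
    b = int(fp_b, 2)
    common = bin(a & b).count("1")  # bits set in both fingerprints
    union = bin(a | b).count("1")   # bits set in either fingerprint
    # Assumed convention: two all-zero fingerprints count as identical.
    return common / union if union else 1.0


similarity = tanimoto("10010100", "10110100")
print(similarity, 1.0 - similarity)
```

The work per comparison scales with the number of bits, which is consistent with 768-bit fingerprints taking longer to compare than 320-bit ones.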
The first three approaches mentioned above all have their advantages and disadvantages. In the Daylight case, learning the paths from the dataset enables new chemistries to be encapsulated in the fingerprints. Novel unique paths in the dataset will be encoded and input into the fingerprint. Searches against such databases will result in structure hit lists that contain this new chemistry. The disadvantage of the Daylight fingerprinting approach is that in some cases the unique paths do not encode symmetric systems well; it may not be able to distinguish between a monomer and a dimer, for example, because multiple counts of an identical path are not included in the description. In the MDL case, the user is dependent on the key set created by MDL to encapsulate all of the chemistry that the user has in their database or chemical dataset. The keys do take into account multiple counts of some features, which can be an advantage over the Daylight approach, but they may not be able to uniquely describe a chemical dataset if they are missing some of the chemical features in the dataset. BCI does seem to combine the best of both strategies, but at the expense of generating very large fingerprints. Mesa's approach is identical to the MDL approach, so our fingerprints have the same advantages and disadvantages. The decision as to which is the "best" fingerprinter is one that should not be taken lightly, and Mesa A&C believes the choice of a fingerprinter should be dictated by the data under study. For example, if one has an acquisition database full of inorganic substances that one would like to cluster, using our fingerprinter would be a "bad" choice.
Example Applications and Flow Diagrams
Detailed Summary of Programs
with both I/O and Commandline Examples, and Specific References
The Fingerprint Module: The generators take as input a file of Daylight SMILES (.smi file) in single-column format.
By default, bit strings are output in column form without spaces. E.g., 10010100...
However, the programs contain the option of outputting fingerprints as a set of unsigned long integers, hexadecimal characters, or raw character bytes.
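The relationship between these output forms can be sketched by converting a '0'/'1' bit string into hexadecimal characters, raw bytes, and unsigned integer words. The exact layout the Mesa programs use (word size, bit order, padding) is not specified here, so the choices below are assumptions: most significant bit first, zero padding on the right, and 32-bit words by default. The function names are illustrative.

```python
def pad(bits: str, multiple: int) -> str:
    """Right-pad a bit string with zeros to a multiple of the given size."""
    return bits + "0" * (-len(bits) % multiple)


def to_hex(bits: str) -> str:
    """One hexadecimal character per 4 bits."""
    padded = pad(bits, 4)
    return "".join(format(int(padded[i:i + 4], 2), "x") for i in range(0, len(padded), 4))


def to_bytes(bits: str) -> bytes:
    """One raw byte per 8 bits."""
    padded = pad(bits, 8)
    return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))


def to_words(bits: str, word_bits: int = 32) -> list:
    """One unsigned integer per word_bits bits (word size is an assumption)."""
    padded = pad(bits, word_bits)
    return [int(padded[i:i + word_bits], 2) for i in range(0, len(padded), word_bits)]


bits = "10010100"
print(to_hex(bits), to_bytes(bits), to_words(bits, 8))
```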