Mesa Suite Version 2.0
Grouping Module
Copyright (c) 2010 Mesa Analytics & Computing, Inc.

John D. MacCuish, Norah E. MacCuish, Mitch Chapman

www.mesaac.com
support@mesaac.com


 









The Grouping Module in the Mesa Software Application Suite

The Mesa Software Application Suite is a collection of application modules for research and applications (diversity selection, library analysis, compound acquisition, lead hopping, HTS data analysis, etc.) for chemical information systems.  The modules can be used separately or in tandem to perform a variety of 2D and 3D tasks.  Currently there are four basic modules:  Shape Module, Fingerprint Module, Grouping Module, and Chem Tattoo.  The Grouping Module has an additional add-on package of parallel facilities distinct from the sequential programs.  Presently we support Linux, Windows, and Mac OS X 10.5+.  The software consists of executable binaries and library functions (upon request).  Most of the code is written in C++, though there are a few Python and shell scripts that help augment the running of the binaries, especially the distributed process programs.


Overview of the  Grouping Module

The Grouping Module is a set of the following programs (those marked with an asterisk are available upon request):

MeasuresAll - Generates similarity/dissimilarity matrices using one of the following measures:  Tanimoto, Euclidean, Tversky, Cosine (Ochiai), and Hamman.
SimilarityOutput - Takes similarity searching output as input and outputs formatted search results.
SimilaritySets - Takes formatted similarity searching results as input and outputs a user-specified number of near neighbors (rank ordered) to derive SAR sets.
TaylorsMain - Groups compounds using asymmetric or symmetric, disjoint or non disjoint algorithms, taking output from MeasuresAll as input.
AmbiguityMeasures* - Generates a report, based on the user-defined threshold used in Clustering, on the level of ambiguity to be expected in the results.
ClusterOutput - Generates a report of clustering results in space-delimited form, with the option of returning only the cluster representative (centrotype) of each cluster.
RussianDollTransformation* - Transforms an asymmetric matrix to another asymmetric matrix that contains sub- or superstructure proximities.
Square2Sparse - Inputs a square matrix from an outside source and transforms it for use with TaylorsMain.
ClusterCompare - Compares the results of disjoint clusterings of the same data (i.e., at different thresholds, different measures, etc.).  Takes as input a file of the Clustering output file names, and outputs the all-pairs comparisons between clusterings.  The measure returned is between 0 and 1, where 1 is a perfect match (the clusterings are exactly the same).
BitStripper - A utility for stripping out unused bit columns (either all 1s or all 0s) from a collection of bit strings.  Reports the number of columns removed and how many of each kind (all-1 or all-0 columns).  Handy for large fingerprints, if there are a lot of unused bits.
HierarchicalClustering* - Contains three types of hierarchical clustering algorithms -- Group Average, Complete Link, and Ward's -- all of which use reciprocal nearest neighbors (RNN) to speed up the generation of their hierarchies.
LevelSelection* - Contains Kelley's level selection heuristic and returns a set of weights that define a measure suggesting at which level it is most appropriate, given the data, to cut the tree to form a partition.
CentroTyper* - Given a hierarchy from the HierarchicalClustering program, returns the "centroid" or "centrotype" of each cluster at a given level.  Analogous to the ClusterOutput option of returning the centrotypes of Taylor's clustering.
TreePermutation* - Generates a permutation of the hierarchical clustering results so that a dendrogram can be generated with no crossing edges.
*Available upon request.


The module's primary utility is in the TaylorsMain program, which returns disjoint (non-overlapping) or non disjoint (overlapping) clusters of binary bit string data, using either symmetric or asymmetric measures generated with the MeasuresAll program.  The Grouping Module contains four variants of the Taylor-Butina grouping algorithm.1,2,3,4  The MeasuresAll program can also perform similarity searching for an MXN matrix, where M<<N, M is a set of targets, and N is the corresponding database to be searched.  The AmbiguityMeasures program generates cluster ambiguity measures, and similarity searching facilities are provided via the MeasuresAll program to further augment data analysis.  The SimilarityOutput program rank orders the queries in terms of their similarity to the respective target and formats this output; user-defined information about each query and target can be included, and the output is a space-delimited text file.  The SimilaritySets program in turn takes SimilarityOutput output and a number specifying the maximum number of near neighbors to return from a target's list of near neighbors -- these being the top most in rank order.  The ClusterOutput program generates grouping results with user-defined information in a space-delimited text file that can be loaded into a spreadsheet program such as Microsoft Excel, or in a row format for further processing.  It also has an option to return only the cluster representatives.  The RussianDollTransformation is included for grouping by substructure or superstructure similarity.5  A matrix utility program, Square2Sparse, is provided in the event that the user has a square proximity (similarity or dissimilarity) matrix obtained from an outside source and would like to group the data using the TaylorsMain program.  This applies to any data (e.g., a matrix formed by dissimilarities among customer retention bank data).  HierarchicalClustering performs hierarchical clustering while keeping track of the amount of ambiguity related to the discrete nature of many types of data and the inherent decision ties found in the three specific forms of clustering output by this program; it returns the hierarchy and ambiguity statistics.  LevelSelection returns Kelley's level selection:  an (N-1)-vector of values, one for each corresponding level in the hierarchy, whose non-trivial local minima are possible levels at which to cut, creating a partition that represents the clustering at that level.

Grouping or clustering 2D molecular structures or properties has many important uses in the pharmaceutical industry.  Common methods often employ symmetric, hierarchical algorithms such as Ward's or complete link, non-hierarchical algorithms such as Jarvis-Patrick, or partitional algorithms such as k-means.10,11,12  Such methods require symmetric measures, e.g., Tanimoto or Euclidean distances, as input.  One drawback of such methods is that they cannot capture important asymmetric relationships, in either 2D or 3D, between molecular structures.  Also, these methods are typically disjoint (non-overlapping) clustering methods.  That is, each cluster is distinct and does not contain elements (e.g., a specific molecular structure description such as a fingerprint) from other clusters.  The Grouping Module extends the common clustering concepts by introducing asymmetric measures and clustering, and by providing either disjoint or non disjoint (overlapping) clusterings.  Overlapping clustering requires a degree of care, as the results form a cover:  as overlapping increases, the computation and storage requirements grow rapidly.  If thresholds are kept low in a dissimilarity sense, overlapping tends to be minimized.  It is best to start with low thresholds and work upwards in small increments to test the level of overlapping.

Another serious issue when grouping molecular structures with bit string fingerprints is the problem of ties in proximity or algorithmic decision ties in the clustering algorithm -- more globally, how ambiguous is the final cluster partition?  Almost all hierarchical and partitional algorithms have, to varying degrees, ties in proximity or decision tie issues.  The Grouping Module allows the user to identify and explore those issues with a report on the type and number of ties for a specific clustering, and several indices of the overall ambiguity contained within a clustering.  Clustering algorithms are also often constrained by the number of elements to be clustered.  The clustering algorithms we employ for both the disjoint and non disjoint versions of either the symmetric or asymmetric clusterings are variations on the Taylor-Butina grouping algorithm.  Our implementations are both fast and space efficient, such that large numbers of compounds can be clustered effectively on a single machine.  Rough estimates suggest that approximately 100 thousand compounds can be clustered on a machine with 1 gigabyte of RAM and a single 2.0 gigahertz core within a few hours' time.  This includes the time to generate the fingerprints (using our Fingerprint Module), compute the dissimilarities (100k choose 2 symmetric, or 100kX100k asymmetric), perform the clustering, and produce the formatted output report.

Flow diagram for clustering with either 2D or 3D fingerprints.  Note that the SMILES file input can be used as the Data Info file with the 2D fingerprint, whereas the sd file cannot be used as the data info.  The data info file has the same length as the number of data items in either the SMILES file or the sd file, and the data should be in either space- or tab-delimited columns, without missing entries.

[Diagram: clusteringfps]

Flow diagram for generic clustering of any normalized sparse matrix:

[Diagram: GenericClustering]

Flow diagram for clustering with the full complement of Mesa programs:


 

Parallel Grouping Module

There are two parts to clustering:  generating dissimilarities, and clustering with the use of the dissimilarities.  (Note, some clustering algorithms use similarities, but Mesa clustering routines use dissimilarities, and only those normalized to [0,1].)  The standard MeasuresAll program is sequential; however, it can be run in parallel ("embarrassingly" parallel in a master/slave configuration) across separate cores, processors, and even a heterogeneous network of machines to create large sparse matrices with the use of PVM.  The parallel version of the Grouping Module contains several programs and scripts that facilitate this process.  Data is distributed across the parallel or distributed architecture such that bands of the dissimilarity sparse matrix are computed (a band for each processing unit) with instances of the Measures program, with the use of pvm_master and several scripts.  The bands are then assembled in a sequential process.  The second step is the actual clustering, and this too has been parallelized.  The parallel clustering of very large sparse matrices can be performed with either ptaylor_pvm or ptaylor_mpi.  These programs also run in a master/slave configuration; however, far more communication is necessary to perform the clustering than to simply generate the dissimilarity matrix.  The results of ptaylor_* can then be manipulated by the downstream ClusterOutput program.  At present the largest clustering performed has been ~7 million compounds, using a 32 processor cluster of 2.0 gigahertz processors to generate the matrix in 3 weeks, and 12 processors running at 3.0 gigahertz with 128 gigabytes of RAM on a single shared memory machine, using the sequential Clustering program, over 5 days.  Running both steps on a single 4 core machine with 4-8 gigabytes of RAM can routinely cluster several hundred thousand data items (e.g., compounds, conformations) in a few hours -- that is, matrix generation, clustering, and output formatting -- depending on the length of the fingerprint and the data diversity.  Larger sets and shape fingerprints typically take a good deal longer, on the order of 24 to 72 hours.  Increasing the number of cores and the amount of RAM per processor allows larger sets to be tackled.  The limiting factor is typically the matrix generation.  E.g., it might take 5 hours on a 4 core machine to generate the matrix for 250 thousand fingerprints and yet less than 20 minutes to perform the actual clustering and output formatting.
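As a rough illustration of the banding idea only (this is not the pvm_master or ptaylor_* interface, and the file name and worker count are placeholders), the following Python sketch splits the rows of a thresholded Tanimoto dissimilarity computation across worker processes and then assembles the bands sequentially:

import sys
from multiprocessing import Pool

def tanimoto_dissim(fp_a, fp_b):
    # 1 - Tanimoto for two equal-length ASCII bit strings.
    c = sum(1 for x, y in zip(fp_a, fp_b) if x == '1' and y == '1')
    either = sum(1 for x, y in zip(fp_a, fp_b) if x == '1' or y == '1')
    return 0.0 if either == 0 else 1.0 - float(c) / either

def band(args):
    # Compute one band (a subset of rows) of the sparse dissimilarity matrix.
    rows, fps, threshold = args
    result = []
    for i in rows:
        nbrs = [(j, tanimoto_dissim(fps[i], fps[j]))
                for j in range(len(fps)) if j != i]
        result.append((i, [(j, d) for j, d in nbrs if d <= threshold]))
    return result

if __name__ == "__main__":
    fps = [line.strip() for line in open("SampleFingerprints.txt")]
    workers = 4
    bands = [range(k, len(fps), workers) for k in range(workers)]   # one band per worker
    with Pool(workers) as pool:
        pieces = pool.map(band, [(list(b), fps, 0.3) for b in bands])
    # Assemble the bands sequentially, as the parallel module does once the workers finish.
    sparse = dict(row for piece in pieces for row in piece)
    print(len(sparse), "rows assembled", file=sys.stderr)

In the real parallel module the workers are instances of the Measures binary coordinated by pvm_master over PVM or MPI; the sketch above only shows the shape of the decomposition.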



Theory for Grouping Module


Pre-requisites

In order for comparisons to be made between chemical structures, a chemical description must first be applied.  Typically in 2D, descriptors are molecular fingerprint descriptors, such as MDL MACCS keys, Daylight fingerprints, BCI keys, etc.6,7,8  The Mesa Grouping Module expects as a pre-requisite that chemical data sets under study be described in a binary bit string fashion -- at least up through the Measures program.  The Clustering program itself will operate on any sparse dissimilarity matrix of the form output by the Measures program, independent of its source.  The Grouping Module can read Daylight fingerprints, or any fixed-length fingerprints that can be cast into ASCII bit string format, e.g., 1010101000011.....  For details on one such binary descriptor see the Mesa Fingerprint Module Manual.

Grouping Chemical Structures based on 2D Representations



Measures Program

Once a chemical data set is described in terms of molecular fingerprints, a comparison measure or distance must be applied pair-wise to determine how similar or dissimilar two compounds are relative to each other.  The Measures program provides this important pre-clustering processing step.

   
The Measures program takes as input any fixed-length bit strings, as shown in the diagram above; these can be from the Mesa Fingerprint programs or user-supplied fingerprints.  User-supplied fingerprints must take the form of ASCII 1s and 0s (e.g., 011100001111000....); ASCII Daylight fingerprints inside the FP<> data type are also valid input to Measures.  The Measures program produces a similarity or dissimilarity matrix (user's choice) using one of the following user-selected measures:  Tversky, Tanimoto, Euclidean, Hamman, or Ochiai (Cosine).  In similarity form:
Tanimoto(A,B)  = c / [a + b + c]  (symmetric)

Euclidean(A,B) = 1 - [(a + b) / n]^(1/2)   (symmetric)

Hamman(A,B) = [c + d] / n  (symmetric)

Ochiai(A,B) = Cosine(A,B) = c / [(c + a) * (c + b)]^(1/2)  (symmetric)

Tversky(A,B) = c / [(alpha) * a + (beta) * b + c]  (asymmetric)

a : Unique bits turned on in molecule "A"
b:  Unique bits turned on in molecule "B"
c:  Common bits turned on in both molecule "A" and molecule "B"
d:  Common bits turned off in both molecule "A" and molecule "B"
n:  The total number of bits in the fingerprint

The Tanimoto, Euclidean, Hamman, and Ochiai are all symmetric measures.  This means that the comparison of compound A to compound B yields the same number as the comparison of compound B to compound A.  For the Tversky measure this is not the case:  Tversky(A,B) does not necessarily equal Tversky(B,A), which makes the measure asymmetric.  The reason for this can be explained by the role of the two parameters, alpha and beta.  In the Measures program we add the constraint that alpha + beta = 1.  If alpha is set close to 1, e.g., 0.95, then beta = 0.05.  Tversky(A,B) for this choice would be close to 1.0 (the maximum) if the unique bits of compound A (when compared to compound B) are nearly zero.  At high Tversky(A,B) values, the interpretation is that compound A "fits into" compound B.  If alpha is 0.5, then beta = 0.5 and Tversky(A,B) is monotonic with Tanimoto(A,B).  The Tversky measure therefore enables the grouping of chemical structures of different sizes that share features of the smaller of the two.  The example below illustrates this:
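To make the a, b, c bookkeeping concrete, here is a small Python sketch (illustrative only; the Measures program itself is a C++ binary) that computes the Tanimoto and Tversky similarities from ASCII bit strings and shows the asymmetry:

def bit_counts(fp_a, fp_b):
    # Returns (a, b, c): bits unique to A, bits unique to B, bits on in both.
    a = b = c = 0
    for x, y in zip(fp_a, fp_b):
        if x == '1' and y == '1':
            c += 1
        elif x == '1':
            a += 1
        elif y == '1':
            b += 1
    return a, b, c

def tanimoto(fp_a, fp_b):
    a, b, c = bit_counts(fp_a, fp_b)
    return float(c) / (a + b + c) if (a + b + c) else 0.0

def tversky(fp_a, fp_b, alpha):
    # Asymmetric: tversky(A, B, alpha) need not equal tversky(B, A, alpha).
    beta = 1.0 - alpha                 # Measures constrains alpha + beta = 1
    a, b, c = bit_counts(fp_a, fp_b)
    return float(c) / (alpha * a + beta * b + c) if c else 0.0

A = "1111000000"                       # A's on bits are a subset of B's
B = "1111111100"
print(tanimoto(A, B))                  # 0.5, and identical to tanimoto(B, A)
print(tversky(A, B, 0.95))             # ~0.95: A "fits into" B
print(tversky(B, A, 0.95))             # ~0.51: B does not fit into A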

The Measures program generates a symmetric dissimilarity matrix in the case of the four symmetric measures, and an asymmetric matrix if the Tversky similarity measure is used for the pairwise comparisons.  The dissimilarity is just 1 - similarity.  The diagrams given below demonstrate this point:

In the case of a symmetric dissimilarity matrix the off-diagonal elements are identical:  1 - Tanimoto(A,B) = 1 - Tanimoto(B,A).
In the case of the asymmetric dissimilarity matrix the off-diagonal elements are not necessarily equivalent.  Note that 1 - Tversky(A,B) = 0.35 and 1 - Tversky(B,A) = 0.05.  In this case the interpretation is that compound B "fits into" compound A, but not vice versa.

The matrix output from Measures can be directed to the Clustering program or the RussianDollTransformation program as input.  Typically, for use with the Clustering program, the Measures program will produce an NxN dissimilarity matrix.  However, it may be the case that the user would like to compare two data sets of different sizes (e.g., MXN); the Measures program will also generate rectangular rather than square matrices for this purpose.  The program can therefore be used in either similarity searching mode, or all-pairs (dis)similarity mode for clustering purposes.  The similarity searching output can then be used as input (along with a file of information, e.g., SMILES strings and ID numbers, for the targets, and another information file for the queries) to the SimilarityOutput program.  There are three different output formats for the Measures program:  matrix, sparse matrix, or ordered pairs.  The matrix format is for either an all-pairs comparison of N elements, producing a full NXN matrix (e.g., for exploratory or statistical purposes such as multidimensional scaling), or for a similarity searching comparison of M targets by N queries, producing an MXN matrix.  The sparse matrix format can be used with a threshold to produce a ragged array containing those NXN comparisons that meet a specific threshold, for clustering.  It can also be used for a thresholded similarity search, producing a ragged array of the MXN comparisons between M targets and N queries that meet a threshold criterion.  It is this output that can be used as input to the SimilarityOutput program.  The ordered pair format is useful for showing specific i, j comparisons, where the i and j indices are explicit in the output.

Example Usages:

UNIX VERSION EXAMPLES
Generate Dissimilarity Matrix for Cluster Program using Tversky with alpha = 0.95, and dissimilarity threshold cutoff of 0.3:
./Measures SampleFingerprints.txt -V -D -S -F 0 0.95 0.3 > SampleAsymMatrix.95.3.txt
Options explained: Tversky (-V), Dissimilarity (-D), Sparse Matrix (-S), Square Matrix - not similarity searching (-F), search number 0 (not similarity searching), alpha 0.95, threshold 0.3
 
Generate Dissimilarity Matrix for Cluster Program using Tanimoto measure, and dissimilarity threshold cutoff of 0.15:
./Measures SampleFingerprints.txt -T -D -S -F 0 0.15 > SampleAsymMatrix.15.txt
Options explained: Tanimoto (-T), Dissimilarity (-D), Sparse Matrix (-S), Square Matrix - not similarity searching (-F), search number 0 (not similarity searching), threshold 0.15
 
Generate Similarity Matrix from the comparison of two data sets (Querysetfingerprints.txt and Targetsetfingerprints.txt) using Euclidean measure and a threshold cutoff of 0.80:
wc -l Targetsetfingerprints.txt
100   (there are 100 Target fingerprints)
wc -l Querysetfingerprints.txt
1000   (there are 1000 Query fingerprints)

cat Targetsetfingerprints.txt Querysetfingerprints.txt  > TQfingerprints.txt

./Measures TQfingerprints.txt -E -S -S -T 100 0.80 > SampleAsymMatrix.8.txt

Options explained: The target fingerprints are counted (in this case there are 100).  The target fingerprints and the query fingerprints are concatenated into one file, targets at the top.  Euclidean (-E), Similarity (-S), Sparse Matrix (-S), Rectangular Matrix - similarity searching (-T), search number 100 (number of targets), threshold 0.80

./SimilarityOutput SampleAsymMatrix.8.txt QuerysetSmilesData.txt TargetsetSmilesData.txt > SimSearchOutput.txt

Options explained:  There are three input files:  1. The sample similarity search output (a ragged array of the sparse matrix formed by the MXN search matrix, in the example 100X1000, and the search threshold);  2. The Query SMILES data (space-delimited columns of 1000 rows of query data);  3. The Target SMILES data (space-delimited columns of 100 rows of target data).
Generate Dissimilarity Matrix for a non-Mesa program such as SAS or S-PLUS using Hamman:
./Measures SampleFingerprints.txt -H -D -M -F 0 > SampleAsymMatrix.txt
Options explained: Hamman (-H), Dissimilarity (-D), Matrix (-M), Square Matrix - not similarity searching (-F), search number 0 (not similarity searching)
Generate Similarity Ordered Pair output for input into a non-Mesa program, using the Cosine measure and a threshold cutoff of 0.90:
./Measures SampleFingerprints.txt -C -S -O -F 0 0.90 > SampleAsymMatrix.9.txt
Options explained: Cosine (-C), Similarity (-S), Ordered Pair (-O), Square Matrix - not similarity searching (-F), search number 0 (not similarity searching), threshold 0.90
Generate Dissimilarity Matrix for RussianDollTransformation using Tversky with alpha = 0.99 and threshold cutoff of 0.001:
./Measures SampleFingerprints.txt -V -D -S -F 0 0.99 0.001 > SampleAsymMatrix.99.001.txt
Options explained: Tversky (-V), Dissimilarity (-D), Sparse Matrix (-S), Square Matrix - not similarity searching (-F), search number 0 (not similarity searching), alpha 0.99, threshold 0.001

Clustering  Program

Clustering data is an iterative process, and there is no one specific clustering method that fits all data, or even a specific class of data (e.g., bit string data).  However, some algorithms and proximity measures are more suitable than others -- for instance, with bit string data -- given the type of structure the user is looking for.  An analogous situation is known in the classification or predictive modeling field:  there is no best classifier for predictive modeling.  This has been formalized in a proof, the "No Free Lunch Theorem" (see Pattern Classification, 2nd ed., Duda, Hart, and Stork).  Rather, there is typically a family of classifiers that may be more suitable for a type of data, any one of which may be the best or near optimal.  One must first search for that family of possible classifiers among many.

With regard to clustering, there are typically several levels of refinement in determining structure, in the form of groups, in the data.  The first step is to determine which family of clustering methods is suitable for the data at hand.  Next, one performs a small battery of the candidate family of clustering methods in an effort to determine both efficacy and efficiency.  The latter is important if the data set(s) are large, which may limit the candidates to just a few.  Once a specific algorithm is chosen, further iterative refinement is performed to determine cluster validity, using both a priori information (expert knowledge about the data set) and quantitative means such as cluster validity tests.  For large data sets, quantitative cluster validity tests are often very computationally expensive; expert knowledge can therefore be extremely important.  The best or near-best grouping may be obtained with level selection techniques employed in hierarchical algorithms, heuristics for choosing k in k-means algorithms, etc.  The efficacy of these methods can in turn be explored.  In the Taylor-Butina algorithm, a careful choice of the exclusion region threshold (well-known information regarding the common measures is helpful in this regard) and a non-trivial ambiguity determination can be used.
 

The Clustering program contains the core functionality of the Mesa Grouping Module.  The program inputs an NXN sparse dissimilarity matrix, either from the Measures program output, the RussianDollTransformation program, or a user-defined sparse matrix input file, and generates chemical groupings based on user-defined options, i.e., disjoint, non disjoint, symmetric, asymmetric.  The Clustering program sends output to either the ClusterOutput program or the AmbiguityMeasures program.

The Clustering program employs variations of the Taylor-Butina grouping algorithm.1,2  (Note, Taylor's algorithm is not strictly considered a clustering algorithm, but in our implementation it is effectively a clustering algorithm; thus, we use the terms clustering and grouping interchangeably.)  In its simplest symmetric disjoint form, the algorithm iteratively identifies the element that is similar to the most elements, given a similarity exclusion region threshold.  That element (the representative element) and those similar to it -- namely, those within its exclusion region -- are removed and form the first group.  This process is repeated until there are no more elements that form groups.  There are three classes of elements produced by this algorithm:  true singletons -- those elements that do not fall into any group; elements within groups (these include the representative elements); and false singletons -- those elements that meet the similarity threshold criterion with at least one other element, but due to the grouping criterion are excluded from any group.

Asymmetric Taylor-Butina Algorithm(Disjoint)
  1. Create threshold nearest neighbor table using similarities in both directions  ( Measures )
  2. Find true singletons: all those compounds with an empty nearest neighbor list.
  3. Find the compound with the largest nearest neighbor list. This becomes a group and is excluded from consideration – these compounds are removed from all nearest neighbor lists.  (The compound becomes the representative compound.)
  4. Repeat step 3 until no compounds exist with a non empty nearest neighbor list.
  5. Assign remaining compounds (false singletons) to the group that contains their nearest neighbor, but identify them as "false singletons".
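A minimal Python sketch of the disjoint procedure above, starting from an already thresholded nearest neighbor table (compound index mapped to the set of indices within its exclusion region); the production Clustering program is a C++ implementation, and the tie handling and reporting discussed later are omitted here:

def taylor_butina_disjoint(nn):
    # nn: dict mapping each compound index to the set of indices within its exclusion region.
    true_singletons = {i for i, nbrs in nn.items() if not nbrs}     # step 2
    remaining = {i: set(nbrs) for i, nbrs in nn.items() if nbrs}
    clusters = []                                                   # (representative, members)
    while any(remaining.values()):                                  # steps 3-4
        rep = max(remaining, key=lambda i: len(remaining[i]))       # largest neighbor list
        members = {rep} | remaining[rep]
        clusters.append((rep, members))
        for i in members:
            remaining.pop(i, None)                                  # exclude clustered compounds
        for nbrs in remaining.values():
            nbrs -= members                                         # drop them from all lists
    # Step 5: anything left had neighbors, but they were all clustered away -- false singletons.
    # (The full algorithm then assigns each to the group containing its nearest neighbor.)
    false_singletons = set(remaining)
    return clusters, true_singletons, false_singletons

For example, nn = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: set()} yields one cluster {0, 1, 2} with representative 0, the true singleton 3, and no false singletons.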


In the Clustering program, selecting the Disjoint option will generate clusters in which compounds are members of only one cluster.  Below is a schematic and algorithmic description of the non disjoint option.



Asymmetric Taylor-Butina Algorithm (Non Disjoint)
  1. Create threshold nearest neighbor table using similarities in both directions  ( Measures )
  2. Find true singletons: all those compounds with an empty nearest neighbor list.
  3. Find the compound with the largest nearest neighbor list. This becomes a group and is excluded from consideration – these compounds are removed from all nearest neighbor lists.  (The compound becomes the representative compound.)  A compound that falls within the threshold of more than one representative is not excluded from membership in other clusters:  shared compounds are included in every group for which they are within the threshold of the representative compound.
  4. Repeat step 3 until no compounds exist with a non empty nearest neighbor list.
Issues with Ties:

There are three possible decision ties that result from the symmetric and asymmetric disjoint Taylor-Butina algorithm as described above:

    1. Exclusion Region Ties
    2. Minimum Distance Ties
    3. False Singleton Ties

There may be more than one element that has the largest number of elements associated with it.  E.g., say during an iteration there are two elements with 100 associated elements similar to each, and 100 is the largest number of elements.  We call this an exclusion region tie, as there is more than one exclusion region candidate.  If the two groups are disjoint there is no real problem.  However, if the groups overlap at all, which exclusion group should be chosen?  Such a tie can be broken by a secondary grouping criterion, such as one that attempts to find the tightest group (most of the pairs within the group similar).  This second criterion can also be tied, and as it turns out, given the limited number of possible values for bit string measures, such a tie is not as rare as one might assume.  We call such a tie a minimum distance tie.  False singletons represent a problem for Taylor's algorithm.  Typically, they represent only 1-3% of the elements in a set, but identifying them as distinct from true singletons and having a mechanism that associates them with some group is useful.  We both identify them and associate them with the group that contains the element most similar to the false singleton in question.  This association can also be tied if there is more than one element that is most similar.  Such ties are also common for the same reason -- that bit string measures have a relatively small, finite number of possible similarities.  We call this type of tie a false singleton tie.  Our implementation reports the number of each of these ties.  Minimum distance ties and false singleton ties are broken arbitrarily and therefore introduce a degree of ambiguity into the final grouping.  The non disjoint version, either symmetric or asymmetric, does not have the false singleton tie problem, as false singletons are naturally subsumed into overlapping groups.  The next program in the Grouping Module, AmbiguityMeasures, provides an effective means of threshold selection which provides insight into the tie issues mentioned here.
 
 

Example Usages:


UNIX VERSION EXAMPLES
Generate Asymmetric Disjoint Clustering with an exclusion region threshold of 0.15 (0.85 similarity)
          wc -l SampleAsymMatrix.0.1.txt  (counts rows of matrix)
495
./Clustering SampleAsymMatrix.0.1.txt -A 0.15 495 > SampleAsym.0.15.Clustering.txt
    Options explained:  Asymmetric Disjoint (-A), threshold 0.15, 495 members of matrix, N of NXN is  495
Generate Asymmetric, Non-disjoint Clustering, with threshold of  0.1 (equivalent to 0.9 Similarity)
./Clustering SampleAsymMatrix.0.1.txt -N 0.1 495 > SampleAsymN.0.1.Clustering.txt
Generate Symmetric, Disjoint clustering with threshold 0.05 (equivalent to 0.95 Similarity)
wc -l SampleSymMatrix.0.2.txt  (counts rows of matrix)
352
./Clustering SampleSymMatrix.0.2.txt -T 0.05 352 > SampleSym.0.05.Clustering.txt
Generate Symmetric, Non-disjoint clustering with threshold 0.05 (equivalent to 0.95 Similarity)
wc -l SampleSymMatrix.0.2.txt  (counts rows of matrix)
352

./Clustering SampleSymMatrix.0.2.txt -n 0.05 352 > SampleSymN.0.05.Clustering.txt


AmbiguityMeasures  Program


 
 

Given the tie issues discussed above, we can consider the amount of ambiguity that might accrue from a disjoint clustering at a given threshold.  Such a clustering is really just one of many possible clusterings, since there may be many possible outcomes of any set of arbitrary decisions that are made.  There are a number of different indices that can capture the inherent ambiguity that results from a collection of arbitrary decisions.  We propose three simple and useful indices -- two based on the number of overlapping elements, and a third based on comparing the number of disjoint clusters with the number of non disjoint clusters produced with the same threshold and the same symmetry or asymmetry.  The AmbiguityMeasures program takes the outputs of a disjoint and a non disjoint run (both either symmetric or asymmetric) at the same exclusion region threshold and returns these three indices.

The first index is the "raw" ambiguity, which is simply the count of the compounds found in common among the clusters.  The second index is the sum of the square roots of the counts, per element found in common more than once, divided by the total number of elements; this discounts elements that are found in many clusters.  The last index is simply the ratio of the number of disjoint clusters to the number of non disjoint clusters.  Combined, these measures give the user a way to determine whether the level of ambiguity is reasonable at the chosen threshold.  For example, the user may find with a specific data set that when the clustering is performed at a 0.85 Tanimoto threshold, the ambiguity measures are low enough that the disjoint clustering is a reasonable approximation of the non disjoint clustering, whereas if a lower Tanimoto threshold is chosen (e.g., 0.7), the ambiguity indices are sufficiently high that the disjoint clustering is a poor approximation of the non disjoint clustering (where the amount of overlap in the clustering is large).  Below is a plot of the second ambiguity index versus the Tanimoto similarity threshold for a set of NCI HIV Active Ligands which were fingerprinted with the 166 MDL key fingerprints, 320 MDL key fingerprints, and the Daylight 512 bit fingerprints.
 
If one clustered this small data set one time with disjoint clustering, arbitrarily picking a threshold of 0.79 and using the MACCS key 166 bit fingerprints, the second ambiguity index for the clustering results would be much greater than 0.20.  The orange-yellow line displays an ambiguity index level at which approximately 10% of the compounds are shared among the clusters.  This means that at a threshold of 0.79 with the MACCS key 166 bit fingerprints, many more than 10% of the compounds would be put into "arbitrary" clusters.  Even for a small data set such as this, the difficulty of analyzing a clustering where the fraction of compounds involved in overlapping clusters is above 10%, or even 5%, may be too great for even the most patient researcher.  Having a sense of the ambiguity for a specific clustering is therefore crucial for understanding the inherent structure of the data.

We suggest that threshold selection not be an arbitrary decision, but rather that clusterings be done both disjoint and non disjoint at a number of threshold values, and that the AmbiguityMeasures program be utilized to make an informed decision as to the "best" threshold for a given data set.
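A Python sketch of one way to compute the three indices described above, given clusterings represented as lists of sets of compound IDs at the same threshold (the exact counting and normalization used by the AmbiguityMeasures binary may differ):

from collections import Counter
from math import sqrt

def ambiguity_indices(disjoint, non_disjoint):
    # disjoint, non_disjoint: lists of sets of compound IDs from the Clustering program.
    membership = Counter(c for cluster in non_disjoint for c in cluster)
    shared = {c: k for c, k in membership.items() if k > 1}   # compounds in more than one cluster
    n = len(membership)                                       # total number of elements
    raw = len(shared)                                         # "raw" ambiguity: compounds in common
    discounted = sum(sqrt(k) for k in shared.values()) / n if n else 0.0
    ratio = float(len(disjoint)) / len(non_disjoint) if non_disjoint else 0.0
    return raw, discounted, ratio

disjoint = [{1, 2, 3}, {4, 5}]
non_disjoint = [{1, 2, 3}, {3, 4, 5}]
print(ambiguity_indices(disjoint, non_disjoint))   # (1, sqrt(2)/5 = 0.283, 1.0)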
 
 
 

Example Usages:

UNIX VERSION EXAMPLES
Generate Ambiguity Report for 0.15 threshold clustering  of an Asymmetric Similarity Matrix
./AmbiguityMeasures SampleAsym.0.15.Clustering.txt SampleAsymN.0.15.Clustering.txt > AmbiguityReport0.15
Options explained:  Input files are the disjoint and non disjoint clustering results from the Clustering program run at the same threshold
Generate Ambiguity Report for a 0.1 threshold clustering of a Symmetric Similarity Matrix
./AmbiguityMeasures SampleSym.0.1.Clustering.txt SampleSymN.0.1.Clustering.txt > AmbiguityReport0.1

Options explained:  Input files are the disjoint and non disjoint clustering results from the Clustering program run at the same threshold


ClusterOutput  Program


 
 

The ClusterOutput program takes the Clustering program output with user information about the elements (e.g., SMILES strings, activity data, ID numbers) and outputs a clustering report.  The user can either specify a full report or just the identification of the representative element (one from each cluster).

The ClusterOutput program takes as input the Clustering program's output, a file of the data names in the same order as the input to the Measures program or MACCSKeysGenerator program (and any associated information in column form, e.g., a file that contains SMILES strings and associated activities and identification numbers), the number of columns in the file of data names, and a boolean toggle that indicates which form of output (row or column) is desired.  Row form is preferable for importing into Microsoft Excel, since the number of columns is limited to 256 while the number of rows is roughly 65,000.  If there is no data associated with the data names, the number of columns is 1.  However, if there are other columns in the file, the total number of columns, including the data name column, must be input.  There is an option such that only representative compounds will be output, in column form, with any associated data.

Typical output from the program with the Clusters by Row option set would look like:
Cluster 1
Rows of SMILES strings and any associated data, space delimited.
Cluster 2
.
.
.
True Singletons
False Singletons (none in the non disjoint versions)
Ties Statistics
Representative Compounds ("Centroids")
Rows of the representative compounds and any associated data, space delimited.
 
Typical output from the program with the Clusters by Column option set would array the output in columns:

True Singletons    False Singletons    Cluster1    Cluster2 ...     ClusterK    Ties Statistics    Representative Compounds
.
.
.

Associated data is placed in columns adjacent to the data in question.  Since this is a ragged array, blanks are filled with "NA" to make it rectangular.  This format is useful when data sets are small and the user wishes to import them into Microsoft Excel, where the number of columns available for the cluster output and associated data is only 256.

Example Usage

UNIX VERSION EXAMPLES
Generate Formatted Clustering File with representative compounds only
./ClusterOutput SampleAsym.0.15.Clustering.txt SampleCol1.smi 1 T R > SampleAsym.0.15.ClusteringFormated.txt

Options explained:  The first input file is the output file from the Clustering program for an asymmetric clustering at a threshold of 0.15, and the second input file is a file of SMILES strings associated with each row of the asymmetric similarity matrix used as input to the Clustering program.  There is only one column in the SMILES file (1), clusters are output by row, one cluster per row (T), and only representative compounds will be printed to the output file (R).
 

Generate Formatted Clustering File for small set with clusters in columns for easy input into MicroSoft Excel
./ClusterOutput SampleAsym.0.15.Clustering.txt SampleCol1.smi 1 F N > SampleAsym320.0.15.ClusteringFormated.txt
Options explained:  The first input file is the output file from the Clustering program for an asymmetric clustering at a threshold of 0.15, and the second input file is a file of SMILES strings associated with each row of the asymmetric similarity matrix used as input to the Clustering program.  There is only one column in the SMILES file (1), clusters are printed out in column format (F), and all compounds will be printed to the output file (N).

RussianDollTransformation  Program

The RussianDollTransformation program is a response to a question posed by John Bradshaw at Daylight Chemical Information Systems:  "How can one group large molecular structures in a database, based on the number of smaller structures they have in common, such that this relationship forms a proximity matrix among the larger structures?"  For example, say a group of larger molecular structures all contain the same 10 smaller structures within each large structure, while another group of large structures contains only 7 smaller structures in common.  The first group of large structures is more similar.  Namely, this transformation groups larger structures by which smaller structures fit within them.  The utility of this question becomes apparent if, say, a drug database is seeded with smaller structures that are important drug-like structures, and one wishes to understand which larger structures in the database are grouped by the number of such seeded structures.  The answer to the question lies in the use of asymmetry.  With a Tversky alpha set to 1.0 (or near 1.0), one can effectively ask which structures fit wholly (or mostly) within another structure.  With an asymmetric proximity matrix produced by such a Tversky parameterization, we can transform that matrix to ask John Bradshaw's original question, producing another asymmetric proximity matrix that, when used with an asymmetric clustering algorithm, will produce an answer to that question and to its converse -- "can we group smaller structures based on what they fit into?"  A disjoint or non disjoint clustering of the transformed matrix will produce both types of groupings.

The Russian Doll Transformation transforms an asymmetric matrix into another asymmetric matrix containing the counts of compounds that two compounds i and j mutually share as nearest neighbors at some threshold.  Asymmetrically clustering this transformation groups large molecules by which smaller molecules they mutually share as substructures, and it groups small molecules by which larger molecules they fit into.  The input is a Tversky asymmetric matrix, typically with alpha set at or near 1.  This setting of alpha behaves such that one compound "fits" within the other, or nearly fits.  Those compounds that share numerous other compounds in this manner are more similar.  There are three parameters that affect the outcome of a clustering using this transformation:  1. the Tversky alpha value;  2. the matrix threshold (a nearest-neighbor-like threshold) passed to the Russian Doll Transformation;  and 3. the clustering measure threshold.  These parameters can be modified to provide various clusterings.  For example, if one just wants to group those compounds that are very nearly the same and leave all others as singletons, set the Tversky alpha to 1 (or 0), the matrix threshold to near 0 (say, 0.1), and the clustering threshold toward the lower range of the transform matrix output (often between 0.1 and 0.5).  As these constraints are loosened, clusters contain more varied compounds and there are often fewer singletons.  The clustering threshold value should depend directly on the desired number of substructures (superstructures) found in common.  Thus, though the matrix is composed of proximities (dissimilarities), they are not necessarily related to Tanimoto-like proximities in value (e.g., where 0.7 similarity is a common threshold).  They can be quite data dependent, so several runs may be necessary to determine appropriate threshold values -- and these may be quite high depending on the application.  A glance at the matrix will give the user some idea as to what thresholds to try.  Somewhat counter-intuitively, the dissimilarities can be quite close to 1, since the counts in common are scaled by the number of elements (compounds in the data set).  The dissimilarity measure is 1 - (compounds in common)/n, where n is the number of compounds in the data set.  If in general there are very few compounds in common compared to the total number of compounds, most of the dissimilarities will be near 1.  Other scalings that more evenly distribute the values between 0 and 1 are possible, but this scaling allows the user to directly calculate the total number of counts from the dissimilarity knowing only n.
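The following Python sketch shows one plausible reading of the transform (dense matrices, with the neighbors of compound i taken as the compounds k with D[i][k] <= t); the RussianDollTransformation binary works on the sparse Measures output and may define the neighbor sets and their directionality differently, which is where the asymmetry of the production output comes from:

def russian_doll(D, t):
    # D: n x n asymmetric dissimilarity matrix (e.g., 1 - Tversky with alpha near 1).
    # t: matrix (nearest-neighbor-like) threshold.
    n = len(D)
    nbrs = [{k for k in range(n) if k != i and D[i][k] <= t} for i in range(n)]
    out = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                # Dissimilarity = 1 - (compounds i and j share as neighbors) / n.
                out[i][j] = 1.0 - float(len(nbrs[i] & nbrs[j])) / n
    return out

As noted above, when few compounds are shared relative to n, most entries sit near 1, which is why the clustering threshold passed to the Clustering program is often much higher than a typical Tanimoto-style threshold.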

The Russian Doll idea is due to John Bradshaw, and the algorithm to implement the idea is due to Norah and John MacCuish.  This is at present, to the best of our knowledge, an unpublished algorithm and an unpublished use of this idea in compound clustering.  A presentation discussing the Russian Doll algorithm and its uses was given at MUG '03 and can be found at http://www.mesaac.com/RussianDollDistrib.htm

Example Usage 

UNIX VERSION EXAMPLES
Generate an asymmetric matrix for RussianDollTransformation:  first use the Measures program to generate a dissimilarity matrix for the RussianDollTransformation program, then generate the Russian Doll asymmetric matrix for input into the Clustering program
./Measures SampleFingerprints.txt -V -D -M -F 0 0.01 > SampleAsymMatrix.0.01.txt
Options Explained:   Tversky (-V), Dissimilarity Matrix (-D), Matrix Output (-M), Not Searching (F), no search number (0), Threshold 0.01

wc -l SampleAsymMatrix.0.01.txt
495
./RussianDollTransformation SampleAsymMatrix.0.01.txt 495 0.01 > RussianDollAsymMatrix.0.01.txt
Options Explained:  Input file is the asymmetric dissimilarity matrix generated above, number of members of the matrix (495), matrix threshold 0.01

./Clustering RussianDollAsymMatrix.0.01.txt -A 0.99 495 >RussianDollClustering.txt
Options Explained:  Input file is the Russian Doll asymmetric matrix, Asymmetric Disjoint (-A), clustering threshold 0.99 (Russian Doll dissimilarities are typically close to 1), number of members of the matrix (495)

 

Example Applications

Using the flow chart at the top of this document, subsets of the programs can be utilized for the following example applications:

Compound Acquisition Decisions

In-house databases are often augmented through acquisitions of available compounds from commercial vendors.  Determining the extent to which a possible acquisition database is "similar" to an in-house database becomes crucial in making compound acquisition decisions.  Compounds are typically costly.  Thus, purchasing selections of compounds from outside vendors that will increase the number of hits in HTS screens, without generating redundant data or producing few or no hits, is what makes this task difficult.

To process, for example, an in-house database of 100,000 compounds and an acquisition database of 50,000 compounds would require  ~ 1 GByte of RAM. The steps are as follows:

  1. Acquisition compounds need to be "filtered" first.  Pre-processing of the data is crucial for proper decision making.  Typical decisions are made based on a company's criteria for what would make a "good" drug precursor.  One common strategy is to filter based on Lipinski's Rule of 5.  There are many commercial vendors that supply means to filter based on molecular properties (e.g., molecular weight, clogP, etc.); one example is the Filter program from OpenEye Scientific Software.  Data will typically come from vendors in different formats and representations.  For example, salts may be a significant portion of the database, the parent of which may be interesting to a company, but the salt form will not "fingerprint" properly.  Some companies filter out salts; this can be done easily in SMILES since you can remove all strings that have a period (".") in them.  The other possibility is to transform them into the SMILES of the respective parent (see J. Bradshaw's presentation "A beginner's guide to responsible parenting or knowing your roots" at Daylight's user group meeting MUG '98).  After a filter step, one should remove any duplicate SMILES.  Another important issue that arises from these types of comparisons is the standard representation that a given acquisition database has used in the SMILES or SDF files given for analysis.  For example, a common issue in chemical registration is how nitro groups should be represented:  one way is the charge-separated form [N+](=O)[O-], another is the pentavalent form N(=O)=O.  You need to transform the SMILES of your acquisition database to have the same standard representations for these types of groups as found in your in-house database -- so that you will be comparing apples with apples and not apples with oranges.
  2. Now that you have a filtered set of acquisition compounds from step 1, you can proceed to make decisions about the structural similarity differences between the potential purchase set and an in-house database with Mesa's Fingerprint and Grouping Modules.  The first decision you need to make is how you want to augment your in-house database.  How many of each structural class that is "different" from the in-house database do you want to select?  Some companies would like to pick only one member of each "type" represented in the acquisition set; others prefer to select a small number, let's say 4, which enables SAR from the primary screens.  Pick the number which you think best, and let's call this number X.  X is the number of compounds of a different "class" that you want to select from the acquisition set.  It may be the case that some representative compounds from your in-house collection do not have X near neighbors.  This means that you may want to add compounds from the acquisition collection to clusters of your in-house collection, so that all representative compounds in the combined set (in-house + buy set) will have the required number of near neighbors, X, e.g., 4.  Let Z be the number of near neighbors that a representative compound has in the in-house database.  Z will be determined by the Grouping Module, and it will be threshold dependent.
    1. Fingerprint the in-house and the acquisition set.
    2. Cluster the in-house data set and select representatives and true singleton compounds at several thresholds and use the AmbiguityMeasures to select the "best" threshold.
    3. Find all representative compounds that have Z < X members in their group.
    4. Perform a similarity search of the acquisition set against the representatives from Step 3 and the true singleton compounds from Step 2 with the threshold cutoff used in step 2.
    5. For each representative compound hit list, select X minus Z acquisition compounds which fall within the user-defined threshold and put them into the buy file (e.g., the representative compound from the first cluster where Z < X has 2 near neighbors in the in-house collection (Z=2) and you would like it to have 4 (X=4), so select 2 compounds from that representative's acquisition hit list and add those 2 compounds to the buy file).  For each singleton compound hit list, select X acquisition compounds which fall within the user-defined threshold and put them into the buy file.  Remove the buy file compounds from the original acquisition data set from step 1.  (A sketch of this selection step follows this list.)
    6. Remove the remaining hit list members not selected for the buy file in step 5 from the acquisition set.
    7. Cluster what remains of the acquisition data set and use the threshold from step 2.
    8. For each representative compound from the clustering in step 7 add the representative compound and X nearest neighbors into the buy file.
    9. The buy file now contains the selection of compounds to purchase from the acquisition data set.
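The bookkeeping of steps 5 and 6 can be sketched in Python as follows; the dictionaries here are hypothetical stand-ins for the parsed similarity-search hit lists and cluster report, not an actual Mesa file format:

def build_buy_list(rep_hits, singleton_hits, z_counts, X):
    # rep_hits: {representative ID: acquisition IDs within threshold, best first}   (hypothetical)
    # singleton_hits: {true singleton ID: acquisition IDs within threshold, best first}
    # z_counts: {representative ID: Z, its current number of in-house near neighbors}
    # X: desired number of near neighbors per representative or singleton.
    buy, used = [], set()
    for rep, hits in rep_hits.items():
        need = max(X - z_counts.get(rep, 0), 0)               # step 5: select X - Z compounds
        picks = [h for h in hits if h not in used][:need]
        buy.extend(picks)
        used.update(picks)
    for singleton, hits in singleton_hits.items():
        picks = [h for h in hits if h not in used][:X]        # step 5: X compounds per singleton
        buy.extend(picks)
        used.update(picks)
    # Step 6: 'used' plus any unselected hit-list members are removed from the acquisition set
    # before the remainder is clustered in step 7.
    return buy, used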
Script for Compound Acquisition (after filtering, etc.)
1.  For this example let:
In-house data file of SMILES strings = InHouse.smi
Acquisition data file of SMILES strings = Acquisition.smi
In-house data file of fingerprints = InHouse.fps
Acquisition data file of fingerprints = Acquisition.fps
Hence,
./gen_maccs320 InHouse.smi > InHouse.fps
./gen_maccs320 Acquisition.smi > Acquisition.fps

2.  In this example we use 4 thresholds to determine reasonable ambiguity, and expert opinion as to the "best" clustering.
./Measures InHouse.fps -T -D -S -F 0 0.3 > InHouseSymMatrix.0.3.txt
./Clustering InHouseSymMatrix.0.3.txt -T 0.25 100000 > InHouseSymDisjointClustering.0.25.txt
./Clustering InHouseSymMatrix.0.3.txt -T 0.20 100000 > InHouseSymDisjointClustering.0.20.txt
./Clustering InHouseSymMatrix.0.3.txt -T 0.15 100000 > InHouseSymDisjointClustering.0.15.txt
./Clustering InHouseSymMatrix.0.3.txt -T 0.10 100000 > InHouseSymDisjointClustering.0.10.txt

./Clustering InHouseSymMatrix.0.3.txt -n 0.25 100000 > InHouseSymNonDisjointClustering.0.25.txt
./Clustering InHouseSymMatrix.0.3.txt -n 0.20 100000 > InHouseSymNonDisjointClustering.0.20.txt
./Clustering InHouseSymMatrix.0.3.txt -n 0.15 100000 > InHouseSymNonDisjointClustering.0.15.txt
./Clustering InHouseSymMatrix.0.3.txt -n 0.10 100000 > InHouseSymNonDisjointClustering.0.10.txt

./AmbiguityMeasures InHouseSymDisjointClustering.0.25.txt InHouseSymNonDisjointClustering.0.25.txt >AmbiguityReport.0.25.txt
...similarly for the other 3 Ambiguity reports.

3.  In this example, say that the 0.15 threshold represented the best of the 4 clusterings in terms of both ambiguity and expert analysis of the groups formed.
Then,
./ClusterOutput InHouseSymDisjointClustering.0.15.txt InHouse.smi 1 T N > InHouseSymDisjointClusteringFormated.0.15.txt
Review cluster output report file and determine how many representative compounds meet the Z < X criterion.   Say that number is 200.
./ClusterOutput InHouseSymDisjointClustering.0.15.txt InHouse.smi 1 T R > InHouseSymDisjointClusteringFormatedRepresentatives.0.15.txt
tail -200 InHouseSymDisjointClusteringFormatedRepresentatives.0.15.txt >  Z_LT_X_RepresentativeSet.smi
The last file, Z_LT_X_RepresentativeSet.smi, is the SMILES file of the representative compounds that meet the Z < X criterion.

4.  Now we append the acquisition set of SMILES to the Z < X representative set plus the true singletons.
Clip out the true singleton SMILES from InHouseSymDisjointClusteringFormated.0.15.txt.  This can be done with the unix head and tail commands once the line numbers for the beginning and end of the true singleton section of the report file are found.  Call this file InHouseTrueSingletons.smi.
cat InHouseTrueSingletons.smi  Z_LT_X_RepresentativeSet.smi > T_S_Z_X.smi
Find the size of this combined file:
wc -l T_S_Z_X.smi
Say it equals 500
Append this combined file to the Acquisition file to obtain the similarity searching file
cat T_S_Z_X.smi Acquisition.smi > TSZXRep_Acquisition.smi
For this example the threshold found above in step two was 0.15 dissimilarity = 0.85 similarity

./gen_maccs320 TSZXRep_Acquisition.smi > TSZXRep_Acquisition.fps
./Measures TSZXRep_Acquisition.fps -T -S -S -T 500 0.85 > TSZXRep_AcquisitionSimSymMatrix.txt

5.  To perform this step one will need to write a short script (e.g., in Perl or Python) that extracts the information -- matching indices to SMILES strings -- using the Acquisition.smi and TSZXRep_AcquisitionSimSymMatrix.txt files.

6.  Again, a short script can be written to create the file of those compounds from the Acquisition.smi file not identified for inclusion in the buy file.  Let us call the new file ReducedAcquisition.smi.

7.  Perform step 2 on the ReducedAcquisition.smi file and create a formatted cluster output file of the settled-upon clustering.

8.  With a script, extract the desired compounds from the formatted cluster output file in step 7 and add these to the buy file.

9.  The buy file will contain a set of SMILES from the acquisition set to buy.  

Analysis of High Throughput Screening Hits

Large numbers of hits from HTS screens can be difficult to analyze when the hit rate is more than a couple hundred compounds.  There is only so much one can do by eye, looking at structures on a computer screen or printed out on sheets of paper.  The Mesa Grouping Module provides several mechanisms for grouping hits for analysis and SAR studies.

One approach for a relatively small set, say 500 compounds, would be to use ChemTattoo.  ChemTattoo will take an input set of SMILES strings, of actives for example, and find common features within the set.  ChemTattoo generates statistics for fingerprints of the active set.  It will generate a frequency fingerprint, which shows how often a "key" in the fingerprint is represented across the set, expressed as a fraction of the number of fingerprints in the set.  It will also generate a modal fingerprint, which is a representative fingerprint of the common "keys" in the set.  The user selects a threshold between 0 and 1 and then produces a modal fingerprint for the set at that threshold.  If the user selects 1.0 for the threshold, then the modal fingerprint will contain on bits only for those keys which are present in all 500 compounds.  If the user selects a threshold of 0.75, then the modal at that threshold will contain "on" bits only for those keys that are turned on in the fingerprints of 75% of the input set, 500 * 0.75 = 375.  Namely, 375 of the 500 compounds would have to contain the key for the key to be turned on in the modal.  The modal at given thresholds can be compared to each of the members of the active set to pull out structurally similar subsets.
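A small Python sketch of the frequency and modal fingerprint calculation just described, assuming each line of the input file is just an ASCII bit string of equal length (the file name Actives.fps matches the example further below; the ChemTattoo programs additionally report modal-to-member similarities):

def frequency_fingerprint(fps):
    # Fraction of fingerprints with each key (bit position) turned on.
    n = float(len(fps))
    return [sum(fp[k] == '1' for fp in fps) / n for k in range(len(fps[0]))]

def modal_fingerprint(fps, threshold):
    # Bit k is on in the modal iff at least threshold * len(fps) members have it on.
    return ''.join('1' if f >= threshold else '0' for f in frequency_fingerprint(fps))

actives = [line.strip() for line in open("Actives.fps")]
modal95 = modal_fingerprint(actives, 0.95)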

Depending on the diversity of the hits, the particular groupings that are desired, and the number of hits, one may need to perform a clustering step prior to using ChemTattoo for best results.  (For more information on ChemTattoo, see the ChemTattoo Module documentation.)
 
Process:

  1. SMILES from HTS hits are fingerprinted with the Fingerprint Module or another 2D or 3D fingerprinter of choice.
  2. The Clustering program is run at a relatively low similarity threshold.
  3. Choose a cluster of actives and extract the fingerprints for just those actives.

    Let the file of fingerprints for the cluster in question be called Actives.fps.  Then, to create a modal fingerprint in which a bit is on only when at least 95% of the fingerprints have that bit on, perform the following:

./ChemTattooModalStats Actives.fps 0.95 -S > ActivesModalStatsReport.txt

The -S option will generate modal-to-fingerprint similarities (using 5 different measures, including Tanimoto and Tversky) for all fingerprints in the Actives file.


References

  1. Taylor, R.  Simulation Analysis of Experimental Design Strategies for Screening Random Compounds as Potential New Drugs and Agrochemicals.  J. Chem. Inf. Comput. Sci. 1995, 35, 59-67.
  2. Butina, D.  Unsupervised Data Base Clustering Based on Daylight's Fingerprint and Tanimoto Similarity: A Fast and Automated Way To Cluster Small and Large Data Sets.  J. Chem. Inf. Comput. Sci. 1999, 39(4), 747-750.
  3. Martin, Y.  Datamining HTS Results: Do Similar Compounds Have Similar Activity?  Daylight User Group Meeting MUG '02, Santa Fe, New Mexico, February 2002.
  4. MacCuish, N. and MacCuish, J.  Tversky Shape Clustering for Screening Data.  OpenEye Scientific Software User Group Meeting CUPIV, Santa Fe, New Mexico, February 2003.
  5. MacCuish, N.; Bradshaw, J.; MacCuish, J.  Russian Doll: Structure Clustering Using Asymmetry.  Daylight User Group Meeting MUG '03, Santa Fe, New Mexico, February 2003.
  6. Durant, J. L.; Leland, B. A.; Henry, D. R.; Nourse, J. G.  Reoptimization of MDL Keys for Use in Drug Discovery.  J. Chem. Inf. Comput. Sci. 2002, 42(6), 1273-1280.
  7. Daylight Chemical Information Systems, Inc.  Daylight Clustering Manual.
  8. MDL Information Systems, Inc.
  9. MacCuish, J. and MacCuish, N.  Mesa Suite Version 1.1, Fingerprint Module, 2003.
  10. Downs and Barnard.  Clustering Methods and Their Uses in Computational Chemistry.  Reviews in Computational Chemistry, 2002.
  11. MacCuish, J.; Nicolaou, C.; MacCuish, N. E.  Ties in Proximity and Clustering Compounds.  J. Chem. Inf. Comput. Sci. 2001, 41(1), 134-146.