Mesa Suite Version 1.2
Grouping Module
Copyright (c) 2003 Mesa Analytics & Computing, LLC

John D. MacCuish and Norah E. MacCuish

www.mesaac.com
support@mesaac.com


 









The Grouping Module in the Mesa Suite

The Mesa Suite is a collection of application modules for research and applications (diversity selection, library analysis, compound acquisition, lead hopping, HTS data analysis,  etc.) for chemical information systems.  The modules can be used separately or in tandem to perform a variety of 2D and 3D tasks.  Currently there are five modules:  Shape, Fingerprint, Grouping, Diversity, and ChemTattoo.  Aside from the standalone module applications, a Khoros interface from Khoral Research is also available for each module.   Presently we support Linux, Windows, and Irix 6.5 platforms.  All but the Shape module are written in C++.  The Shape module is written in C.


Overview of the  Grouping Module

The Grouping Module is a set of seven programs:

Measures - Generates Similarity/Dissimilarity Matrices using one of the following measures:  Tanimoto, Euclidean, Tversky, Cosine (1-Ochai), and Hamman.
SimilarityOutput - Takes similarity searching output as input and outputs formated search results.
Clustering - Groups compounds using either asymmetric, symmetric, disjoint or non disjoint algorithms using output from Measures.
AmbiguityMeasures - Generates a report based on the user defined threshold used in Clustering as to the level of Ambiguity to be expected in the results.
ClusterOutput - Generates report of clustering results in space delimited formatted form.
RussianDollTransformation - Transforms an asymmetric matrix to another asymmetric matrix that contains sub- or superstructure proximities.
Square2Sparse  - Inputs a square matrix from outside source and transforms it for use with Clustering.
ClusterCompare -
Compare the results of disjoint clusterings of the same data (i.e., at different thresholds, different measures, etc.).  Takes as input a file of the Clustering output file names, and outputs the all-pairs comparisons between clusterings.  The measure returned is between 0 and 1, where 1 is a perfect match (the clusterings are exactly the same.
BitStripper - A utility for stripping out unused bit columns (either all 1s or all 0s) for a collection of bitstrings.  Reports the number of columns removed and how many of each (1 or 0 columns).  Handy for large fingerprints, if there are a lot of unused bits.
Its primary utility is in the Clustering program which returns disjoint (non overlapping) or non disjoint (overlapping) clusters of binary bit string data, using either symmetric or asymmetric measures generated with the Measures program. The Grouping Module contains four variants of  Taylor-Butina's grouping algorithm.1,2,3,4  The AmbiguityMeasures program generates cluster ambiguity measures, and similarity searching facilities are provided via the Measures program to further augment data analysis.    SimilarityOutput program rank orders the queries in terms of their similarity to the respective target and formats this output.  User defined information about each query and target can be included and the output is in the form of a space delimited text file.  The ClusterOutput program generates grouping results with user defined information in a space delimited text file that can be loaded into a spreadsheet program such as MicroSoft Office's Excel. The RussianDollTransformation is included for grouping by substructure or superstructure similarity. 5  A matrix utility program,  Square2Sparse, is provide in the event that the user has a square proximity (similarity or dissimilarity) matrix obtained from some an outside source and would like to group the data using the Clustering program.

Grouping or clustering 2D molecular structures or properties has many important uses in the pharmaceutical industry.  Common methods often employ symmetric, hierarchical algorithms such as Wards or complete link algorithms, non-hierarchical algorithms such as Jarvis-Patrick, or partitional algorithms such as k-means.10,11,12   Such  methods require symmetric measures, e.g. Tanimoto or Euclidean Distances as input.  One drawback to such methods is that they cannot capture important asymmetric relationships between molecular structures. Also, these methods are typically disjoint (non overlapping) clustering methods.  That is, each cluster is distinct and does not contain elements (e.g., a specific molecular structure description such as a fingerprint) from other clusters.  The Grouping Module extends the common clustering concepts by introducing asymmetric measures and clustering, and providing either disjoint or non disjoint (overlapping) clusterings.

Another serious issue when grouping molecular structures with bit string fingerprints is the problem of ties in proximity or algorithmic decision ties in the clustering algorithm --  more globally, how ambiguous is the final cluster partition?  Almost all hierarchical and partitional algorithms have, to varying degrees, ties in proximity or decision ties issues.  Our Grouping Module allows the user to identify and explore those issues with a report on the type and number of ties for a specific clustering, and several indices as to the overall ambiguity contained within a clustering.  Clustering algorithms are also often constrained by the overall size of the number elements to be clustered.  The clustering algorithms we employ for both the disjoint, non disjoint versions of either the symmetric or asymmetric clusterings are variations on the Taylor-Butina grouping algorithm.  Our implementations are both fast and space efficient such that large numbers of compounds can be clustered effectively on a single machine.  Rough estimates suggest that approximately 100 thousand compounds can be clustered on machine with 1 gigabyte of ram and 1 gigahertz processor within 4 to 8 hours.  This includes the time to generate the fingerprints (using our Fingerprint Module), compute the similarities, perform the clustering, and produce the formatted output report.


 

Theory for Grouping Module


 Pre-requisites

In order for comparisons to be made between chemical structures, a chemical description must first be applied.  Typically in 2D, descriptors are molecular fingerprint descriptors, such as MDL MACCS keys, Daylight fingerprints, BCI keys, etc.6,7,8  The Mesa Grouping Module expects as a pre-requisite that chemical data sets under study will be described in a binary bit string fashion -- at least up through the Measures program.  The Clustering program itself will operate on any sparse dissimilarity matrix of the form output by the Measures program, independent of its source.  The Grouping module can read Daylight Fingerprints, or any fixed length fingerprints which can be cast into ASCII bit string format, e.g. 1010101000011.....  For details on one such binary descriptor see the Mesa Fingerprint Module Manual.

Grouping Chemical Structures based on 2D Representations



Measures Program

Once a chemical data set is described in terms of molecular fingerprints, a comparison measure or distance must be applied pair wise to determine how similar or how dissimilar two compounds are relative to each other.  The Measures program provides this important pre clustering processing step.

   
The Measures program takes as input any fixed length bit strings as shown in the diagram above, these can be from the Mesa Fingerprint programs or user supplied fingerprints.  User supplied fingerprints must take the form of ASCII 1's and 0's, (e.g. 011100001111000....), ASCII Daylight fingerprints inside the FP<> data type are also valid input to Measures .  The Measures program  produces a similarity or dissimilarity matrix (user's choice) using one of the following user selected measures:   Tversky, Tanimoto, Euclidean, Hamman, or Ochia (1-Cosine).  In similarity form:
Tanimoto(A,B)  = c / [a + b + c]  (symmetric)

Euclidean(A,B) = 1 - {[(a + b)] / n}(1/2)   (symmetric)

Hamman(A,B)  = [c + d] /n  (symmetric)

Ochia(A,B) = 1 - Cosine(A,B) = c / [(c + a) * (c + b)](1/2)  (symmetric)

Tversky(A,B) = c / [(alpha) * a + (beta) * b + c]  (asymmetric)

a : Unique bits turned on in molecule "A"
b:  Unique bits turned on in molecule "B"
c:  Common bits turned on in both molecule "A" and molecule "B"
d:  Common bits turned off in both molecule "A" and molecule "B"
n:  The total number of bits in the fingerprint

The Tanimoto, Euclidean, Hamman, and Ochai are all symmetric measures.  This means that the comparison of A to B yields the same number as the comparison of compound B to compound A.   For the Tversky measure this is not the case.  The Tversky (A,B) does not necessarily equal the Tverksy (B,A), which makes the measure asymmetric.  The reason for this can be explained by the role of the two parameters, alpha and beta.  In the Measures program we add the constraint that alpha + beta = 1.  If the choice of alpha is set close to 1, e.g. .95 then beta = 0.05.  The Tversky(A,B) for this choice would be close to 1.0 (the maximum) if the unique bits of compound A (when compared to compound B) are nearly zero.  At high Tversky (A,B) values, the interpretation is that compound A "fits into" compound B.  If the choice of alpha  is 0.5, then beta = 0.5 and the Tversky (A,B) is monotonic to the Tanimoto (A,B).  The Tversky measure therefore enables the grouping of chemical structures of different sizes that share features of the smaller of the two.  Below illustrates this with an example:

The Measures program generates a symmetric dissimilarity matrix in the case of the four symmetric measures and an asymmetric matrix if the Tversky similarity measure is utilized for the pair wise comparisons.  The dissimilarity is just 1 - similarity.  The diagrams given below demonstrate this point:

In the case of a symmetric dissimilarity matrix the off diagonal elements are identical. 1- Tanimoto(A,B) = 1 - Tanimoto( B,A).
In the case of the asymmetric dissimilarity matrix the off diagonal elements are not necessarily equivalent.  Note that 1 - Tversky (A,B) = 0.35 and 1 - Tverksy(B,A) = 0.05.  In this case the interpretation is that compound B "fits into" compound A, but not vice versa.

The matrix output from Measures can be directed to the Clustering program or the RussianDollTransformation program as input.  Typically for use with the Clustering program the Measures program will produce an NxN dissimilarity matrix.  However, it maybe the case that the user would like to compare two data sets of different sizes (e.g. MXN), the Measures program will also generate rectangular rather than square matrices for this purpose.  The program therefore can be used in either similarity searching mode , or all pairs (dis)similarity mode for clustering purposes.  The similarity searching output can then be used as input  (along with a file of information (e.g., SMILES strings and ID numbers) for the targets, and another information file for the queries) to the SimilarityOutput program.  There are three different output formats for the Measures program:  matrix , sparse matrix , or ordered pairs.  The matrix format is for either an all pairs comparison of N elements, producing a full NXN matrix (e.g., for exploratory or statistical purposes such as multidimensional scaling); or for similarity searching comparison of M targets by N queries, producing a MXN matrix.   The sparse matrix format can be used with a threshold to produce a ragged array that contains those NXN comparisons for clustering that meet a specific threshold.  It can also be used for a thresholded similarity search, producing a ragged array of MXN comparisons that meet a threshold criterion, between M targets, and N queries.  It is this output that can be used as input to the SimilarityOutput program. The ordered pair format is useful for showing specific i, j comparisons, where the i and j indices are explicit in the output.

Example Usages:

WINDOWS VERSION EXAMPLES
Generate Dissimilarity Matrix for Cluster Program using Tversky with alpha = 0.95, and dissimilarity threshold cutoff of 0.3:
.\Measures SampleFingerprints.txt -V -D -S -F 0 0.95 0.3> SampleAsymMatrix.95.3.txt
Options explained: Tversky (-V), Dissimilarity (-D), Sparse Matrix (-S), Square Matrix - not similarity searching (F), search number 0 (not similarity searching), alpha 0.1, threshold 0.3.  Note that Measures can accept input from stdin with the character '-', e.g.,
 .\Measures - -V -D -S -F 0 0.95 0.3 < SampleFingerprints.txt > SampleAsymMatrix.95.3.txt
Generate Dissimilarity Matrix for Cluster Program using Tanimoto measure, and dissimilarity threshold cutoff of 0.15:
.\Measures SampleFingerprints.txt -T -D -S -F 0 0.15> SampleAsymMatrix.15.txt
Options explained: Tanimoto (-T), Dissimilarity (-D), Sparse Matrix (-S), Square Matrix - not similarity searching (F), search number 0 (not similarity searching),  threshold 0.15
 
Generate Similarity Matrix from the comparison of two data sets (Querysetfingerprints.txt and Targetsetfingerprints.txt) using Euclidean measure and a threshold cutoff of 0.80:

Need to use the DOS equivalents of wc and cat.  E.g., find/c "1" TargetsetFingerprints.txt will return the number lines in TargetsetFingerprints.txt, since all of the fingerprints have at the character "1" in them.

wc -l Targetsetfingerprints.txt
100   (there are 100 Target fingerprints)
wc -l Querysetfingerprints.txt
1000   (there are 1000 fingerprints)

cat Targetsetfingerprints.txt Querysetfingerprints.txt   > TQfingerprints.txt

.\Measures TQfingerprints.txt -E -S -S -T 100 0.80 > SampleAsymMatrix.8.txt

Options explained: The target fingerprints are counted (in this case there are 100).  The target fingerprints and the query fingerprints are concatenated into one file, targets at the top.  Euclidean (-E), Similarity (-S), Sparse Matrix (-S), Rectangular Matrix, so Similarity searching (T), search number 100 (targets),  threshold 0.80

.\SimilarityOutput SampleAsymMatrix.8.txt QuerysetSmilesData.txt TargetsetSmilesData.txt > SimSearchOutput.txt

Options explained:  There are three input files:  1. The sample similarity search (a ragged array of the sparse matrix formed by the MXN search matrix, in the example 100X1000, and the search threshold);  2. The Query Smiles data (space delimited columns of 1000 rows of query data);  3.  The  Target Smiles data (space delimited columns of 100 rows of query data).
Generate Dissimilarity Matrix for a non-Mesa Program such as SAS or SPLUS using Hamman :
.\Measures SampleFingerprints.txt -H -D -M -F 0 > SampleAsymMatrix.txt
Options explained: Hamman (-H), Dissimilarity (-D), Matrix (-M), Square Matrix - not similarity searching (F), search number 0 (not similarity searching),
Generate Similarity Ordered Pair output for input into a non-Mesa program, using Cosine measure and threshold cutoff of 0.90
.\Measures SampleFingerprints.txt -C -S -O -F 0 0.90> SampleAsymMatrix.9.txt
Options explained: Cosine (-C), Similarity (-S), Ordered Pair (-O), Square Matrix - not similarity searching (F), search number 0 (not similarity searching), threshold  = 0.90
Generate Dissimilarity Matrix for RussianDollTransformation using Tversky with alpha= .99 and threshold cutoff of 0.001
.\Measures SampleFingerprints.txt -V -D -S -F 0 0.99 0.001> SampleAsymMatrix.99.001.txt
Options explained: Tversky (-V), Dissimilarity (-D), Sparse Matrix (-S), Square Matrix - not similarity searching (F), search number 0 (not similarity searching), alpha 0.99, threshold 0.001
UNIX VERSION EXAMPLES
Generate Dissimilarity Matrix for Cluster Program using Tversky with alpha = 0.95, and dissimilarity threshold cutoff of 0.3:
./Measures SampleFingerprints.txt -V -D -S -F 0 0.95 0.3> SampleAsymMatrix.95.3.txt
Options explained: Tversky (-V), Dissimilarity (-D), Sparse Matrix (-S), Square Matrix - not similarity searching (F), search number 0 (not similarity searching), alpha 0.1, threshold 0.3
 
Generate Dissimilarity Matrix for Cluster Program using Tanimoto measure, and dissimilarity threshold cutoff of 0.15:
./Measures SampleFingerprints.txt -T -D -S F 0 0.15> SampleAsymMatrix.15.txt
Options explained: Tanimoto (-T), Dissimilarity (-D), Sparse Matrix (-S), Square Matrix - not similarity searching (F), search number 0 (not similarity searching),  threshold 0.15
 
Generate Similarity Matrix from the comparison of two data sets (Querysetfingerprints.txt and Targetsetfingerprints.txt) using Euclidean measure and a threshold cutoff of 0.80:
wc -l Targetsetfingerprints.txt
100   (there are 100 Target fingerprints)
wc -l Querysetfingerprints.txt
1000   (there are 1000 Target fingerprints)

cat Targetsetfingerprints.txt Querysetfingerprints.txt  > TQfingerprints.txt

./Measures TQfingerprints.txt -E -S -S -T 100 0.80 > SampleAsymMatrix.8.txt

Options explained: The target fingerprints are counted (in this case there are 100).  The target fingerprints and the query fingerprints are concatenated into one file, targets at the top.  Euclidean (-E), Similarity (-S), Sparse Matrix (-S), Rectangular Matrix, so Similarity searching (T), search number 100 (targets),  threshold 0.80

./SimilarityOutput SampleAsymMatrix.8.txt QuerysetSmilesData.txtTargetsetSmilesData.txt > SimSearchOutput.txt

Options explained:  There are three input files:  1. The sample similarity search (a ragged array of the sparse matrix formed by the MXN search matrix, in the example 100X1000, and the search threshold);  2. The Query Smiles data (space delimited columns of 1000 rows of query data);  3.  The  Target Smiles data (space delimited columns of 100 rows of query data).
Generate Dissimilarity Matrix for a non-Mesa Program such as SAS or SPLUS using Hamman :
./Measures SampleFingerprints.txt -H -D -M -F 0 > SampleAsymMatrix.txt
Options explained: Hamman (-H), Dissimilarity (-D), Matrix (-M), Square Matrix - not similarity searching (F), search number 0 (not similarity searching),
Generate Similarity Ordered Pair output for input into a non-Mesa program, using Cosine measure and threshold cutoff of 0.90
./Measures SampleFingerprints.txt -C -S -O -F 0 0.90> SampleAsymMatrix.9.txt
Options explained: Cosine (-C), Similarity (-S), Ordered Pair (-O), Square Matrix - not similarity searching (F), search number 0 (not similarity searching), threshold  = 0.90
Generate Dissimilarity Matrix for RussianDollTransformation using Tversky with alpha= .99 and threshold cutoff of 0.001
./Measures SampleFingerprints.txt -V -D -S -F 0 0.99 0.001> SampleAsymMatrix.99.001.txt
Options explained: Tversky (-V), Dissimilarity (-D), Sparse Matrix (-S), Square Matrix - not similarity searching (F), search number 0 (not similarity searching), alpha 0.99, threshold 0.001

Clustering  Program

Clustering data is an iterative process, and it is not the case that there is one specific clustering method that fits all data, or even a specific class of data (e.g., bit string data).  However, some algorithms and proximity measures are more suitable than others for instance with bit string data -- given the type of structure the user is looking for.  An analogous situation is known in the classification or predictive modeling field:  there is no best classifier for predictive modeling.  This has been formalized in a proof, "The No Free Lunch Theorem" (see Pattern Classification, 2nd ed. Duda, Hart, Stork).  Rather, there is typically a family of classifiers that may be more suitable for a type of data, any one of which maybe the best or near optimal.  One must first search for that family of possible classifiers among many.

With regard to clustering, typically, there are several levels of refinement in determining structure in the form of groups in the data.  The first process is to determine which family of clustering methods are suitable for the data at hand.  Next, one performs a small battery of the candidate family of clustering methods in an effort to determine both efficacy and efficiency.  The latter is important if the data set(s) are large, and thus may limit the candidates to just a few.  Once a specific algorithm is chosen, further iterative refinement is performed to determine cluster validity using both a priori information (expert knowledge about the data set) and quantitative means such as cluster validity tests.  For large data sets quantitative cluster validity tests are often very computationally expensive.  Expert knowledge can therefore be extremely important.  The best or near best grouping may be obtained with level selection techniques employed in hierarchical algorithms or heuristics for choosing in k-means algorithms, etc.  The efficacy of these methods can in turn be explored.  In Taylor-Butina's algorithm, a careful choice of the exclusion region threshold (well known information regarding the common measures is helpful in this regard) and a non-trivial ambiguity determination can be used.
 

The Clustering program contains the core functionality of the Mesa Grouping Module.  The program inputs an NXN sparse dissimilarity matrix, either from the Measures program output, the RussianDollTransformation program, or a User defined sparse matrix input file, and generates chemical groupings based on user defined options, i.e., disjoint, non disjoint, symmetric, asymmetric.   The Clustering program sends output to either the ClusterOutput Program or the AmbiguityMeasures Program

The Clustering program employs variations of Taylor-Butina grouping algorithms. 1,2 (Note, Taylor's algorithm is not strictly considered a clustering algorithm.  But in our implementation it is effectively a clustering algorithm.  Thus, we use the terms clustering and grouping interchangeably.)  In its simplest symmetric disjoint form this algorithm iteratively identifies the element that is similar to the most elements, given a similarity exclusion region threshold.  That element (the representative element) and those similar to it -- namely, those within its exclusion region -- are removed and form the first group.  This process is repeated until there are no more elements that form groups.  There are three classes of elements produced by this algorithm.  True singletons -- those elements that do not fall into any group, elements within groups (these include the representative elements), and false singletons -- those elements that meet the similarity threshold criterion between at least one other element, but due to the grouping criterion are excluded from any group.

Asymmetric Taylor-Butina Algorithm(Disjoint)
  1. Create threshold nearest neighbor table using similarities in both directions  ( Measures )
  2. Find true singletons: all those compounds with an empty nearest neighbor list.
  3. Find the compound with the largest nearest neighbor list. This becomes a group and is excluded from consideration – these compounds are removed from all nearest neighbor lists.  (The compound becomes the representative compound.)
  4. Repeat step 3 until no compounds exist with a non empty nearest neighbor list.
  5. Assign remaining compounds, false singletons, to the group that contains their nearest neighbor, but identify them as "false singletons"


In  the Cluster program selecting the Disjoint option will generate clusters in which compounds are members of only one cluster.  Below is a schematic and algorithmic description of the non disjoint option.



Asymmetric Taylor-Butina Algorithm (Non Disjoint)
  1. Create threshold nearest neighbor table using similarities in both directions  ( Measures )
  2. Find true singletons: all those compounds with an empty nearest neighbor list.
  3. Find the compound with the largest nearest neighbor list. This becomes a group and is excluded from consideration – these compounds are removed from all nearest neighbor lists.  (The compound becomes the representative compound.)  If a shared compound is within the threshold it is not excluded from membership in other clusters.  Shared compounds are included in all groups in which it is within the threshold of a representative compound.
  4. Repeat step 3 until no compounds exist with a non empty nearest neighbor list.
Issues with Ties:

There are three possible decision ties that results from the symmetric and asymmetric disjoint Taylor-Butina algorithm as described above:

    1. Exclusion Region Ties
    2. Minimum Distance Ties
    3. False Singleton Ties

There may be more than one element that has the largest number of elements associated with it.  E.g., say during an iteration there are two elements with 100 associated elements similar to each, and 100 is the largest number of elements.  We call this an exclusion region tie, as there is more than one exclusion region candidate.  If the two groups are disjoint there is no real problem.  However, if the groups overlap at all, which exclusion group should be chosen?  Such a tie can be broken by a secondary grouping criterion, such as a criterion that attempts to find the tightest group (most all pairs of the group similar).  This second criterion can also be tied, and as it turns out given the limited number of possible values for bit string measures, such a tie is not as rare as one might assume.  We call such a tie a minimum distance tie.  False singletons represent a problem for Taylor's algorithm.  Typically, they only represent 1-3% of the elements in a set, but identifying them as distinct from true singletons and having a mechanism that associates them with some group is useful.  We both identify them and associate them with the group that contains the most similar element to the false singleton in question.  This association also can be tied if there is more than one element that is the most similar.  Such ties are also common for the same reason -- that bit string measures have a relatively small finite number of similarities.  We call this type of tie a false singleton tie.  Our implementation reports the number of each of these ties.  Minimum distance ties and false singleton ties are broken arbitrarily and therefore introduce a degree of ambiguity into the final grouping.  The non disjoint version, either symmetric or asymmetric, does not have the false singleton ties problem, as false singletons are naturally subsumed into overlapping groups.  The next program in the Grouping Module, AmbiguityMeasures , provides an effective means of threshold selection which provides insight into the ties issues mentioned here.
 
 

Example Usages:

WINDOWS VERSION EXAMPLES
Generate Asymmetric Disjoint Clustering with a exclusion region threshold of 0.15 (.85 similarity)
          wc -l SampleAsymMatrix.0.1.txt  (counts rows of matrix)
495
.\Clustering SampleAsymMatrix.0.1.txt -A 0.15 495 > SampleAsym.0.15.Clustering.txt
    Options explained:  Asymmetric Disjoint (-A), threshold 0.15, 495 members of matrix, N of NXN is  495
Generate Asymmetric, Non-disjoint Clustering, with threshold of  0.1 (equivalent to 0.9 Similarity)
.\Clustering SampleAsymMatrix.0.1.txt -N 0.1 495 > SampleAsymN.0.1.Clustering.txt
Generate Symmetric, Disjoint, clustering with with threshold 0.05 (equivalent to 0.95 Similarity)
wc -l SampleSymMatrix.0.2.txt  (counts rows of matrix)
352
.\Clustering SampleSymMatrix.0.2.txt -T 0.05 352 > SampleAsym.0.05.Clustering.txt
Generate Symmetric, Non-disjoint, clustering with with threshold 0.05 (equivalent to 0.95 Similarity)
wc -l SampleSymMatrix.0.2.txt  (counts rows of matrix)
352

.\Clustering SampleSymMatrix.0.2.txt -n 0.05 352 > SampleAsym.0.05.Clustering.txt

Generate Asymmetric Disjoint Clustering with a exclusion region threshold of 0.15 (.85 simmilarity)
          wc -l SampleAsymMatrix.0.1.txt  (counts rows of matrix)
495

.\Clustering SampleAsymMatrix.0.1.txt -A 0.15 495 > SampleAsym.0.15.Clustering.txt

    Options explained:  Asymmetric Disjoint (-A), threshold 0.15, 495 members of matrix, N of NXN is  495
Generate Asymmetric, Non-disjoint Clustering, with threshold of  0.1 (equivalent to 0.9 Similarity)
.\Clustering SampleAsymMatrix.0.1.txt -N 0.1 495 > SampleAsymN.0.1.Clustering.txt
Generate Symmetric, Disjoint, clustering with with threshold 0.05 (equivalent to 0.95 Similarity)
wc -l SampleSymMatrix.0.2.txt  (counts rows of matrix)
352
.\Clustering SampleSymMatrix.0.2.txt -T 0.05 352 > SampleAsym.0.05.Clustering.txt
Generate Symmetric, Non-disjoint, clustering with with threshold 0.05 (equivalent to 0.95 Similarity)
wc -l SampleSymMatrix.0.2.txt  (counts rows of matrix)
352

.\Clustering SampleSymMatrix.0.2.txt -n 0.05 352 > SampleAsym.0.05.Clustering.txt
 
 

UNIX VERSION EXAMPLES
Generate Asymmetric Disjoint Clustering with a exclusion region threshold of 0.15 (.85 simmilarity)
          wc -l SampleAsymMatrix.0.1.txt  (counts rows of matrix)
495
./Clustering SampleAsymMatrix.0.1.txt -A 0.15 495 > SampleAsym.0.15.Clustering.txt
    Options explained:  Asymmetric Disjoint (-A), threshold 0.15, 495 members of matrix, N of NXN is  495
Generate Asymmetric, Non-disjoint Clustering, with threshold of  0.1 (equivalent to 0.9 Similarity)
./Clustering SampleAsymMatrix.0.1.txt -N 0.1 495 > SampleAsymN.0.1.Clustering.txt
Generate Symmetric, Disjoint, clustering with with threshold 0.05 (equivalent to 0.95 Similarity)
wc -l SampleSymMatrix.0.2.txt  (counts rows of matrix)
352
./Clustering SampleSymMatrix.0.2.txt -T 0.05 352 > SampleAsym.0.05.Clustering.txt
Generate Symmetric, Non-disjoint, clustering with with threshold 0.05 (equivalent to 0.95 Similarity)
wc -l SampleSymMatrix.0.2.txt  (counts rows of matrix)
352

./Clustering SampleSymMatrix.0.2.txt -n 0.05 352 > SampleAsym.0.05.Clustering.txt

Generate Asymmetric Disjoint Clustering with a exclusion region threshold of 0.15 (.85 similarity)
          wc -l SampleAsymMatrix.0.1.txt  (counts rows of matrix)
495

./Clustering SampleAsymMatrix.0.1.txt -A 0.15 495 > SampleAsym.0.15.Clustering.txt

    Options explained:   Asymmetric Disjoint (-A), threshold 0.15, 495 members of matrix, N of NXN is  495
Generate Asymmetric, Non-disjoint Clustering, with threshold of  0.1 (equivalent to 0.9 Similarity)
./Clustering SampleAsymMatrix.0.1.txt -N 0.1 495 > SampleAsymN.0.1.Clustering.txt
Generate Symmetric, Disjoint, clustering with with threshold 0.05 (equivalent to 0.95 Similarity)
wc -l SampleSymMatrix.0.2.txt  (counts rows of matrix)
352
./Clustering SampleSymMatrix.0.2.txt -T 0.05 352 > SampleAsym.0.05.Clustering.txt
Generate Symmetric, Non-disjoint, clustering with with threshold 0.05 (equivalent to 0.95 Similarity)
wc -l SampleSymMatrix.0.2.txt  (counts rows of matrix)
352

./Clustering SampleSymMatrix.0.2.txt -n 0.05 352 > SampleAsym.0.05.Clustering.txt


AmbiguityMeasures  Program


 
 

Given the ties issues discussed above,  we can consider the amount of ambiguity that might accrue given a disjoint clustering at a given threshold.  Such a clustering is really just one of many possible clusterings, since there may be many possible outcomes of any set of arbitrary decisions that are made.  There are a number of different indices that can capture the inherent ambiguity that result from a collection of arbitrary decisions.  We propose three simple and useful indices -- two that are based on the number of overlapping elements, and a third based on the difference in the number of disjoint clusters versus the number of non disjoint clusters produced with the same threshold and the same symmetry or asymmetry.  The AmbiguityMeasures program takes the outputs of a disjoint and a non disjoint run (both either symmetric or asymmetric) at the same exclusion region threshold and returns three such indices.

The first index is the "raw" ambiguity, which is simply the counts of the number of compounds found in common among the clusters.  The second index is the sum of square roots of the counts per element found in common more than once, divided by the total number of elements.  This discounts elements that are found in many clusters.  The last index is simply the ratio of the number of disjoint clusters over the number of non disjoint clusters.  Combined, these measures give the user a way in which to determine if the level of ambiguity is reasonable at the specific threshold they pick.  For example, the user may find with a specific data set that when the clustering is performed at a 0.85 Tanimoto threshold the ambiguity measures are low enough such that the disjoint clustering is a reasonable approximation of the non disjoint clustering.  Whereas, if a lower Tanimoto threshold is chosen (e.g., 0.7) the ambiguity indices are sufficiently high that the disjoint clustering is a poor approximation to the non disjoint clustering (where the amount of overlapping in the clustering is large).   Below is a plot of the second Ambiguity index versus the Tanimoto similarity threshold for a set of NCI HIV Active Ligands which were fingerprinted with the 166 MDL Key fingerprints, 320 MDL Key fingerprints, and the Daylight 512 Bit fingerprints.
 
If one wanted to cluster this small data set and arbitrarily picked a threshold of  0.79 and used MACCS key 166 Bit fingerprints, clustered one time with disjoint clustering, the second ambiguity index in the clustering results would be much greater than 0.20. The orange-yellow line displays a ambiguity index level at which approximately 10% of the compounds are shared among the clusters.   Meaning that at a threshold = 0.79 and with MACCS key 166 Bit fingerprints, many more than 10% of the compounds would be put into "arbitrary" clusters.  Even for a small data set such as this, the difficulty of analyzing a clustering where the amount of compounds involved in overlapping clusters is above 10% or even 5% may be too great for even the most patient researcher.  Having a sense of the ambiguity for a specific clustering is therefore crucial for understanding the inherent structure of the data.

We suggest that threshold selection not be an arbitrary decision, but rather that clusterings be done both disjoint and non disjoint at a number of  threshold values, and the AmbiguityMeasures program be utilized to make and informed decision as the "best" threshold to pick for a given data set.
 
 
 

Example Usages:

WINDOWS VERSION EXAMPLES


Generate Ambiguity Report for 0.15 threshold clustering  of an Asymmetric Similarity Matrix

.\AmbiguityMeasures SampleAsym.0.15.Clustering.txt SampleAsymN.0.15.Clustering.txt > AmbiguityReport0.15
Options explained:  Input files are the disjoint and non disjoint clustering results from the Clustering program run at the same threshold
Generate Ambiguity Report for 0.1 threshold clustering  of an Symmetric Similarity Matrix
.\AmbiguityMeasures SampleSym.0.1.Clustering.txt SampleSymN.0.1.Clustering.txt > AmbiguityReport0.1

Options explained:  Input files are the disjoint and non disjoint clustering results from the Clustering program run at the same threshold

UNIX VERSION EXAMPLES
Generate Ambiguity Report for 0.15 threshold clustering  of an Asymmetric Similarity Matrix
./AmbiguityMeasures SampleAsym.0.15.Clustering.txt SampleAsymN.0.15.Clustering.txt > AmbiguityReport0.15
Options explained:  Input files are the disjoint and non disjoint clustering results from the Clustering program run at the same threshold
Generate Ambiguity Report for 0.1 threshold clustering  of an Symmetric Similarity Matrix
./AmbiguityMeasures SampleSym.0.1.Clustering.txt SampleSymN.0.1.Clustering.txt > AmbiguityReport0.1

Options explained:  Input files are the disjoint and non disjoint clustering results from the Clustering program run at the same threshold


ClusterOutput  Program


 
 

The ClusterOutput program takes the Clustering program output with user information about the elements (e.g., SMILES strings, activity data, ID numbers) and outputs a clustering report.  The user can either specify a full report or just the identification of the representative element (one from each cluster).

The ClusterOutput program takes as input the Clustering program's output, a file of the data names in the same order as the input to the Measures program or MACCSKeysGenerator program (and any associated information in column form, e.g., a file that contains SMILES strings and associated activities, and identification numbers), the number of columns in the file of data names, and a boolean toggle that indicates what form of output (row or column) is desired.  Row form is desired for MicroSoft Excel importing, since the number of columns are only 256, but the number of row is roughly 65,000.  If there is no associated data with the data names, the number of columns is 1.  However, if there are other columns in the file, the total number of columns including the data name column input.  There is an option such that only representative compounds will be output in column form with any associated data.

Typical output from the program with the Clusters by Row option set would look like:
Cluster 1
Rows of SMILES strings and any associated data, space delimited.
Cluster 2
.
.
.
True Singletons
False Singletons (none in the non disjoint versions)
Ties Statistics
Representative Compounds ("Centroids")
Rows of the representative compounds and any associated data, space delimited.
 
Typical output from the program with the Clusters by Column option set would array the output in columns:

True Singletons    False Singletons    Cluster1    Cluster2 ...     ClusterK    Ties Statistics    Representative Compounds
.
.
.

Associated data is in adjacent columns to the data in question.  Since this is a ragged array, to make it rectangular, blanks are filled with "NA".  This format is useful when the for data sets are small and the user wishes to import them into MicroSoft's Excel where the number of columns for the cluster output and associated data is only 256.

Example Usage

WINDOWS VERSION EXAMPLES
Generate Formatted Clustering File with representative compounds only
.\ClusterOutput SampleAsym.0.15.Clustering.txt SampleCol1.smi 1 T R > SampleAsym.0.15.ClusteringFormated.txt

Options explained:  The first input file is the output file from the Clustering program for an asymmetric clustering at a threshold of 0.15 and the second input file is a file of  SMILES strings associated with each row of the Asymmetric similarity matrix used as input in the Clustering program, there is only one column in the SMILES file ( 1 ),  Clusters by row the output will have only one cluster in each row  (T ) and only Representative compounds will be printed into the output file ( R)
 

Generate Formatted Clustering File for small set with clusters in columns for easy input into MicroSoft Excel
.\ClusterOutput SampleAsym0.15.Clustering.txt SampleCol1.smi 1 F N> SampleAsym320.0.15.ClusteringFormated.txt
Options explained:  The first input file is the output file from the Clustering program for an asymmetric clustering at a threshold of 0.15 and the second input file is a file of  SMILES strings associated with each row of the Asymmetric similarity matrix used as input in the Clustering program, there is only one column in the SMILES file ( 1 ),  Clusters by column, the output will have  clusters printed out in column format  (F) and all compounds will be printed into the output file (N
UNIX VERSION EXAMPLES
Generate Formatted Clustering File with representative compounds only
./ClusterOutput SampleAsym.0.15.Clustering.txt SampleCol1.smi 1 T R > SampleAsym.0.15.ClusteringFormated.txt

Options explained:  The first input file is the output file from the Clustering program for an asymmetric clustering at a threshold of 0.15 and the second input file is a file of  SMILES strings associated with each row of the Asymmetric similarity matrix used as input in the Clustering program, there is only one column in the SMILES file ( 1 ),  Clusters by row the output will have only one cluster in each row  (T) and only Representative compounds will be printed into the output file (R)
 

Generate Formatted Clustering File for small set with clusters in columns for easy input into MicroSoft Excel
./ClusterOutput SampleAsym0.15.Clustering.txt SampleCol1.smi 1 F N> SampleAsym320.0.15.ClusteringFormated.txt
Options explained:  The first input file is the output file from the Clustering program for an asymmetric clustering at a threshold of 0.15 and the second input file is a file of  SMILES strings associated with each row of the Asymmetric similarity matrix used as input in the Clustering program, there is only one column in the SMILES file ( 1 ),  Clusters by column, the output will have  clusters printed out in column format  (F) and all compounds will be printed into the output file (N)

RussianDollTransformation  Program

The RussianDollTransformation program is a response to a question posed by John Bradshaw at Daylight Chemical Information Systems.  "How can one group large molecular structures in a database, based on the number of smaller structures they have in common such that this relationship forms a proximity matrix among larger structures?"  For example, say a group of larger molecular structures all contain the same 10 smaller structures within each large structure.  Another group of large structures only contain 7 smaller structures in common. The first group of large structures is more similar.  Namely, this transformation groups larger structures by what smaller structures fit within them.  The utility of this question becomes apparent if, say, a drug database is seeded with smaller structures that are important drug like structures, and one wishes to understand which larger structures in the database are grouped by the number of like seeded structures.  The answer to the question lies in the use of asymmetry.  With a Tversky alpha set to 1.0 (or near 1.0), one can effectively ask, what structures fit wholly (or mostly) within another structures.  With an asymmetric proximity matrix produced by such a Tversky parameterization, we can transform that matrix to ask John Bradshaw original question, producing another asymmetric proximity matrix, that, when used with an asymmetric clustering algorithm, will actually produce an answer to that question, and the converse of that question --  "can we group smaller structures based on what they fit into?"   A disjoint or non disjoint clustering of the transformed matrix will produce both types of groupings.

The Russian Doll Transformation transforms an asymmetric matrix into another asymmetric matrix containing the counts of compounds that two compounds i and j mutually share as nearest neighbors at some threshold.   Asymmetrically clustering this transformation groups large molecules by what smaller molecules they mutually share as substructures, and it groups small molecules by what larger molecules they fit into.  The input is a Tversky asymmetric matrix, typically with alpha set at or near 1.  This setting of alpha behaves such that one compound "fits" within the other, or nearly fits.  Those compounds that share other numerous compounds in this manner are more similar.  There are 3 parameters that effect the outcome of a clustering using this transformation:  1. the Tversky alpha value;  2.  the matrix threshold (a nearest neighbor like threshold) to the Russian Doll Transform;  and 3.  the clustering measure threshold.  These parameters can be modified to provide various clusterings.  For example, if one just wants to group those compounds that are very nearly the same and leave all others as singletons, set the Tversky to 1 (or 0), the matrix threshold to near 0 (say, 0.1), and the clustering threshold toward the lower range of the transform matrix output (often between 0.1 and 0.5).  As these constraints are loosened clusters contain more varied compounds and there are often fewer singletons.The clustering threshold value should depend directly on the desired number of substructures (superstructures) found in common. Thus, though the matrix is composed of proximities (dissimilarities), they are not necessarily related to Tanimoto-like proximities in value (e.g., where 0.7 similarity is a common threshold).  They can be quite data dependent, so several runs may be necessary to determine appropriate threshold values -- and these may be quite high depending on the application.  A glance at the matrix will give the user some idea as to what thresholds to try.  Somewhat counter intuitively, the dissimilarities can be quite close to 1 since the counts in common are scaled by the number of elements (compounds in the data set). The dissimilarity measure is 1 - (compounds in common)/n, where n is the number of compounds in the data set. If in general there are very few compounds in common compared to the total number of compounds, most of the dissimilarities are then near 1.  Other scalings that more evenly distribute the values between 0 and 1 are possible, but this scaling allows the user to directly calculate the total number of counts from the dissimilarity by knowing only n.

The Russian Doll idea was by John Bradshaw and the algorithm to implement the idea was by Norah and John MacCuish.  This is at present, to the best of our knowledge, an unpublished algorithm and unpublished use of this idea in compound clustering.  A presentation discussing the Russian Doll algorithm and its uses was given at MUG '03 and can be found at http://www.mesaac.com/RussianDollDistrib.htm

Example Usage  

WINDOWS VERSION EXAMPLES
Generate asymmetric matrix for RussianDollTransformation - must first use the Measures program to generate a Dissimilarity matrix for the   RussainDollTransformation program then Generate RussianDoll asymmetric matrix for input into the Cluster program
.\Measures SampleFingerprints.txt -V -D -M -F 0 0.01 > SampleAsymMatrix.0.01.txt
Options Explained:   Tversky (-V), Dissimilarity Matrix (-D), Matrix Output (-M), Not Searching (F), no search number (0), Threshold 0.01

wc -l SampleAsymMatrix.0.01.txt
495
.\RussianDollTransformation SampleAsymMatrix.0.01.txt 495 0.01 > RussianDollAsymMatrix.0.01.txt
Options Explained:  Input file is asymmetric similarity matrix at a dissimilarity threshold .01, Number of members of the matrix (495), dissimilarity threshold .01

.\Clustering RussianDollAsymMatrix.0.01.txt -A 0.99 495 >RussianDollClustering.txt
Options Explained:  Input file is RussianDoll asymmetric matrix at a dissimilarity threshold .01, Number of members of the matrix (495), dissimilarity threshold .99
 
 

UNIX VERSION EXAMPLES
Generate asymmetric matrix for RussianDollTransformation - must first use the Measures program to generate a Dissimilarity matrix for the   RussainDollTransformation program then Generate RussianDoll asymmetric matrix for input into the Cluster program
./Measures SampleFingerprints.txt -V -D -M -F 0 0.01 > SampleAsymMatrix.0.01.txt
Options Explained:   Tversky (-V), Dissimilarity Matrix (-D), Matrix Output (-M), Not Searching (F), no search number (0), Threshold 0.01

wc -l SampleAsymMatrix.0.05.txt
495
./RussianDollTransformation SampleAsymMatrix.0.01.txt 495 0.01 > RussianDollAsymMatrix.0.01.txt
Options Explained:  Input file is asymmetric similarity matrix at a dissimilarity threshold .01, Number of members of the matrix (495), dissimilarity threshold .01

./Clustering RussianDollAsymMatrix.0.01.txt -A 0.99 495 >RussianDollClustering.txt
Options Explained:  Input file is RussianDoll asymmetric matrix at a dissimilarity threshold .01, Number of members of the matrix (495), similarity threshold .99

 

Example Applications

Using the flow chart in the top on this document, subsets of the programs can be utilized for the following example applications:

Compound Acquisition Decisions

In-house databases are augmented many times through acquisitions of available compounds from commercial vendors.  Determining the extent to which a possible acquisition database is "similar" to an in-house database becomes crucial in making compound acquisition decisions.  Compounds are typically costly.  Thus, purchasing selections of various compounds from outside vendors that will increase the number of hits in HTS screens, without generating redundant data or producing few or no hits in the HTS screens, is what makes this task difficult.

To process, for example, an in-house database of 100,000 compounds and an acquisition database of 50,000 compounds would require  ~ 1 GByte of RAM. The steps are as follows:

  1. Acquisition compounds need to be "filtered" first.  A pre-processing of the data is crucial for proper decision making.  Typical decisions are made based on a company's criteria for what would make a "good" drug precursor.  One common strategy is to filter based on Lipinski's Rule of 5.   There are many commercial vendors that supply means to filter based on molecular properties (e.g. molecular weight, clogp, etc.), one example is the Filter program from OpenEye Scientific Software.  Data will typically come from vendors in different formats and representations.  For example, salts maybe a significant portion of the database, the parent of which may be interesting to a company but the salt form will not "fingerprint" properly.  Some companies filter out salts, this can be done easily in SMILES since you can remove all strings that have a period (".") in them.  The other possibility is to transform them into the SMILES of the respective parent (see, J. Bradshaw's presentation A beginner's guide to responsible parenting or knowing your roots at Daylight's User group meeting MUG'98).  After a filter step, one should remove any duplicate SMILES.  Another important issue that arises from these types of comparisons is the standard representation that a given acquisition database has used in the SMILES or SDF files given for analysis.  For example, a common issue in chemical registration is how should Nitro groups be represented?  One way is  , another is  .  You need to transform the SMILES of your acquisition database to have the same standard representations for these types of groups found in your in-house database -- so that you will be comparing apples with apples and not apples with oranges.
  2. Now that you have a filtered set of acquisition compounds from step 1, you can proceed to make decisions about the structural similarity differences between the potential purchase set and an in-house database with Mesa's Fingerprint and Grouping Modules.  The first decision you need to make is how do you want to augment your in-house database.  How many of each structural class that is "different" from the in-house database do you want to select?  For some companies, they would like to pick only one member of each "type" represented in the acquisition set, some companies prefer to select a small number, lets say 4, which enables SAR from the primary screens.  Pick the number which you think best and lets call this number, X.  X is the number of compounds of a different  "class" that you want to select from the acquisition set.  It may be the case that for some representative compounds from your in-house collection do not have X near neighbors.  This means that you may want to add to clusters of your in-house collection from the acquisition collection so that all representative compounds in the combined set (in-house + buy set) will have the required number of near neighbors, X, e.g., 4.   Let Z be the number of near neighbors that a Representative compound has in the in-house database. Z will be determined by the Grouping Module, and it will be threshold dependent.
    1. Fingerprint the in-house and the acquisition set.
    2. Cluster the in-house data set and select representative and true singleton compounds at several thresholds and use the AmbiguityMeasures to select the "best" threshold.
    3. Find all representative compounds that have Z < X members in their group.
    4. Perform a similarity search of the acquisition set against the representatives from Step 3 and the true singleton compounds from Step 2 with the threshold cutoff used in step 2.
    5. For each representative compound hit list, select X minus Z acquisition compounds which fall within the user defined threshold and put into the buy file (e.g.,  the representative compound from the first cluster where Z < X of the in-house database has 2 near neighbors in the in-house collection (Z=2), you'd like it to have 4 (X=4). So, select 2 compounds from the representative/acquisition hit list and add those 2 selected compounds to the buy  file.)  And,  for each singleton compound hit list, select X acquisition compounds which fall within the user defined threshold and put into the buy file.  Remove the buy file compounds from the original acquisition data set in step 1.
    6. Remove remaining hist list members not selected for the buy file in steps 5 from the acquisition set.
    7. Cluster what remains of the acquisition data set and use the threshold from step 2.
    8. For each representative compound from the clustering in step 7 add the representative compound and X nearest neighbors into the buy file.
    9. The buy file now contains the selection of compounds to purchase from the acquisition data set
Script for Step 2
1.  For this example let:
In house data file of SMILES strings = InHouse.smi
Acquistion data file of SMILES strings = Acquisition.smi
In house data file of fingerprints = InHouse.fps
Acquistion data file of fingerprints = Acquisition.fps
Hence,
./MACCSKeys320Generator InHouse.smi > InHouse.fps
./MACCSKeys320Generator Acquistion.smi > Acquisition.fps

2.  In this example we use 4 thresholds to determine reasonable ambiguity and expert opinion as to the "best" clustering.
./Measures InHouse.fps -T -D -S -F 0 0.3 > InHouseSymMatrix.0.15.txt
./Clustering InHouseSymMatrix.0.15.txt -T 0.25 100000 > InHouseSymDisjointClustering.0.25.txt
./Clustering InHouseSymMatrix.0.15.txt -T 0.20 100000 > InHouseSymDisjointClustering.0.20.txt
./Clustering InHouseSymMatrix.0.15.txt -T 0.15 100000 > InHouseSymDisjointClustering.0.15.txt
./Clustering InHouseSymMatrix.0.15.txt -T 0.10 100000 > InHouseSymDisjointClustering.0.15.txt

./Clustering InHouseSymMatrix.0.15.txt -n 0.25 100000 > InHouseSymNonDisjointClustering.0.25.txt
./Clustering InHouseSymMatrix.0.15.txt -n 0.20 100000 > InHouseSymNonDisjointClustering.0.20.txt
./Clustering InHouseSymMatrix.0.15.txt -n 0.15 100000 > InHouseSymNonDisjointClustering.0.15.txt
./Clustering InHouseSymMatrix.0.15.txt -n 0.10 100000 > InHouseSymNonDisjointClustering.0.10.txt

./AmbiguityMeasures InHouseSymDisjointClustering.0.25.txt InHouseSymNonDisjointClustering.0.25.txt >AmbiguityReport.0.25.txt
...similarly for the other 3 Ambiguity reports.

3.  In this example, say that the 0.15 threshold represented the best of the 4 clusterings in terms of both ambiguity and expert analysis of the groups formed.
Then,
./ClusterOutput InHouseSymDisjointClustering.0.15.txt InHouse.smi 1 T N > InHouseSymDisjointClusteringFormated.0.15.txt
Review cluster output report file and determine how many representative compounds meet the Z < X criterion.   Say that number is 200.
./ClusterOutput InHouseSymDisjointClustering.0.15.txt InHouse.smi 1 T R > InHouseSymDisjointClusteringFormatedRepresentatives.0.15.txt
tail -200 InHouseSymDisjointClusteringFormatedRepresentatives.0.15.txt >  Z_LT_X_RepresentativeSet.smi
The last file, Z_LT_X_RepresentativeSet.smi, is the SMILES file of the representative compounds that meet the Z < X criterion.

4.  Now we append the acquistion set of SMILES to the Z < X representative set plus the true singletons.
Clip out true singletons SMILES from InHouseSymDisjointClusteringFormated.0.15.txt.  This can be done with unix head and tail commands once the line numbers for the beginning and end of the true singleton section of the report file is found.  Call this file InHouseTrueSingletons.smi
cat InHouseTrueSingletons.smi  Z_LT_X_RepresentativeSet.smi > T_S_Z_X.smi
Find the size of this combined file:
wc -l T_S_Z_X.smi
Say it equals 500
Append this combined file to the Acquisition file to obtain the similarity searching file
cat T_S_Z_X.smi Acquisition.smi > TSZXRep_Acquisition.smi
For this example the threshold found above in step two was 0.15 dissimilarity = 0.85 similarity

./MACCSKeys320Generator TSZYRep_Acquisition.smi > TSZYRep_Acquisition.fps
./Measures TSZYRep_Acquisition.fps -T -S -S -T 500 0.85 > TSZYRep_Acquisition SimSymMatrix.txt

5.  To perform this step one will need to write a short script (e.g., in perl or python) that will extract the information - matching indices to SMILES strings using the Acquistion.smi  and TSZYRep_AcquisitionSimSymMatrix.txt files.

6.  Again, a short script can be written to create the file of those compounds not identified for inclusion into the buy file from  Acquisition.smi file.  Let us call the new file, ReducedAcquisition.smi.

7.  Perform step 2 on the ReducedAcquisition.smi file and create a formated cluster output file of the settled on clustering.

8.  With a script, extract the desired compounds from the formatted cluster output file in step 7 and add these to the buy file.

9.  The buy file will contain a set of SMILES from the acquisition set to buy.  

Analysis of High Throughput Screening Hits

Large numbers of hits from HTS screens can be difficult to analyze when the hit rate is above more than a couple hundred compounds.  There is only so much one can do by eye, looking at structures on a computer screen or printed out on sheets of paper.  The Mesa Grouping Module provides several mechanisms for grouping hits for analysis and SAR studies.

One approach for a relatively small set, say 500, compounds would be to use ChemTattoo.  ChemTattoo will take an input set of SMILES strings, of actives for example, and find common features within the set.   ChemTattoo generates statistics for fingerprints of the active set.  It will generate a frequency fingerprint which shows the number of times a "key" in the fingerprint is represented across the set in the form of the fraction of the number of fingerprints in the set.  It will also generate a modal fingerprint, which is a representative fingerprint of the common "keys" in the set.  The user selects a threshold between 0 and 1 and then produces a modal fingerprint for the set at a given threshold.  If the user selects 1.0 for the threshold, then the modal fingerprint will contain on bits for only those keys which are present in all 500 compounds.  If the user selects a threshold of .75 then  the modal at that threshold will contain "on" bits for only those keys that are turned on in the fingerprints of 75% of the input set, 500*.75  = 375.  Namely, 375 of the 500 compounds would have to contain the key for the key to be turned on in the modal.  The modal at given thresholds can be compared to each of the members of the active set and to pull out structurally similar sets.

Depending on the diversity of the hits, the particular groupings that are desired, and the size of the number of hits, one may need to perform a clustering step prior to using the ChemTattoo for best results.   (For more information on ChemTattoo, see the ChemTattoo Module documentation.)
 
Process:

  1. SMILES form HTS hits are fingerprinted with the Fingerprint Module or another 2D or 3D fingerprinter of choice.
  2. The Clustering program is run at a relatively low similarity threshold.
  3. Choose a cluster of actives and extract the fingerprints for just those actives.

    Let the file of fingerprints for the cluster in question be called Actives.fps.  Then, to create a modal at least 95% of the fingerprints have a bit on for the modal to have that bit on perform the following:

./ChemTattooModalStats Actives.fps 0.95 -S > ActivesModalStatsReport.txt

The -S option will generate modal to fingerprint similarities (of 5 different measures including the Tanimoto and Tversky) for all fingerprints in the Actives file.


References

  1. Taylor, R.  Simulation Analysis of Experimental Design Strategies for Screening Random Compounds as Potential New Drugs and Agrochemicals ,  J. Chem. Inf. Comput. Sci 1995, 35, 59-67.
  2. Darko Butina.  Unsupervised Data Base Clustering Based on Daylight's Fingerprint and Tanimoto Similarity: A Fast and Automated Way To Cluster Small and Large Data Sets.  J.. Chem. Inf. Comput. Sci, 1999, 39(4), 747-750.
  3. Yvonne Martin.  Datamining HTS Results:  Do Similar Compounds Have Similar Activity?  Daylight User Group Meeting  MUG’02, Santa Fe, New Mexico, February 2002.
  4. N. MacCuish and J. MacCuish.   Tversky Shape Clustering for Screening Data , OpenEye Scientific Software User Group Meeting CUPIV, Santa Fe, New Mexico, February 2003.
  5. N. MacCuish , J. Bradshaw and  J. MacCuish.   Russian Doll:  Structure Clustering Using Asymmetry , Daylight User Group Meeting  MUG’03, Santa Fe, New Mexico, February 2003.
  6. Reoptimization of MDL Keys for Use in Drug Discovery , J. L. Durant, B. A. Leland, D. R. Henry, J. G. Nourse, JCICS, 2002, 42 (6), 1273-1280.
  7. Daylight Chemical Information Systems, Inc. Daylight Clustering Manual.
  8. MDL Information Systems, Inc.
  9. J. MacCuish and Norah MacCuish.  Mesa Suite Version 1.1. Fingerprint Module, 2003.
  10. Clustering Methods and Their Uses in Computational Chemistry , Downs and Barnard, Reviews in Computational Chemistry, 2002
  11. Ties in Proximity and Clustering Compounds, J. MacCuish, C. Nicolaou, N. E. MacCuish, JCICS, 2001, 41 (1), 134-146)




 

Example Script

Below is an example script that is included in the program directory, called TestScript.  It contains the commandline interface details.


 #TestScript for Fingerprint and Grouping Module
#Copyright (c), 2002,2003, Mesa Analytics & Computing, LLC

#  ./MACCSKeys164Generator  164 keys
#  ./MACCSKeys320Generator  320 keys
#
#  Usage: ./MACCSKeys*Generator file | - -format -columns -errorfile
#
#  file = filename.smi | - for stdin
#
#  Space delimited .smi file may contain extra columns
#  -- SMILES MUST BE in first column.
#
#  -format = -A for ASCII (default), -C for 21 bytes of char,
#  and -U for 3 unsigned long ints.
#
#  -columns = -T for return all columns in file and error file,
#  fingerprint being the first column |
#  -F for only fingerprint returned in output (default),
#  and only smiles and index to .smi file returned in error file
#
#  -errorFile = -T for return Smiles parsing error file |
#  -F for no error file (default)
#
#  Note that other programs in the module that accept fingerprints as
#  input will use ascii fingerprints (the -A option).  The file of fingerprints
#  will need to be a single column.  So if the MACCSKeys*Generator output
#  contains extra columns then these will need to be stripped off. E.g.,
#  cut -d' ' -f1 fingerprint_and_other_columns_file.txt > fingerprintfile.txt
#  will create a single column of fingerprints in fingerprintfile.txt
#  for simplicity in the example below the -column option is -F so that just
#  the single column of fingerprints is produced for the remaining programs.


echo Number of lines =
wc -l SampleCol1.smi
echo This is the number of compounds or "Size"   

echo Generate Binary Fingerprints with 164 keys
./MACCSKeys164Generator SampleCol1.smi -A -F -T > SampleFingerprints.txt
echo Generate Binary Fingerprints with 320 keys
./MACCSKeys320Generator SampleCol1.smi -A -F -T > SampleFingerprints320.txt

#  ./ChemTattooModalStats
#
#  Usage:  ./ChemTattooModalStats fingerprintfile modalthreshold similarity | smartsfile
#  The modalthreshold can range from [0,1].
#  similarity = '-S' for similarity, '-D' for dissimilarity
#
#  Note smartsfile can be a collection of any strings, each associated with a
#  specific bit.
#  Note also, that any other option for similarity other than '-S' will
#  produce dissimilarities as the default.

echo Generate Modal Statistics given 164 bit strings
   ./ChemTattooModalStats SampleFingerprints.txt 0.95 -D >Modal164Stats
echo Generate Modal Statistics given 320 bit strings
   ./ChemTattooModalStats SampleFingerprints320.txt 0.95 -D >Modal320Stats

#  ./Measures
#
#  Usage: ./Measures fingerprintfile.txt measure similarity format searching searchnumber | alpha | sparsethreshold
#  measure = '-T' for Tanimoto, '-V' for Tversky,
#  '-E' for Euclidean, '-H' for Hamming, '-C' for Cosine
#  similarity = '-S' for similarity, '-D' for dissimilarity 
#  format = '-M' for Matrix, '-O' for Ordered Pairs, '-S' for Sparse Matrix
#  searching = '-T' for similarity searching, '-F' if not
#  searchnumber = an positive integer M < N to be searched
#  (Note: fingerprint file will have M + N fingerprints for searching = '-T')
#  alpha is in the range 0-1, and measure must = '-V'
#  sparsethreshold is in the range of (0,1), and format must be = '-S'      
#  and searching can = either '-T' or '-F'
#  Note clustering program requires that the similarity and format
#  have the -D and -S options respectively, with searching = '-F' (searchnumber = 0)
#  Similarly for alpha -- is only needed if measure = '-V'

#  Measures program as used for generating matrices for clustering
#
echo Generate Dissimilarity Matrix for 164 keys
./Measures SampleFingerprints.txt -V -D -S -F 0 0.1 0.3> SampleAsymMatrix.0.1.txt

echo Generate Dissimilarity Matrix for 320 keys
./Measures SampleFingerprints320.txt -V -D -S -F 0 0.1 0.3> SampleAsymMatrix320.0.1.txt

#  Measures program as used for similarity searching
#  The following is an artificial example that takes the first 30 compounds
#  as targets and the remaining 465 as queries and produces a similarity 30X465
#  ragged array at a threshold of the search of the queries.  This output is
#  is then input to the SimilarityOutput program to produce a formated result
#  that rank orders the queries for each target by the respective similarity.

head -30 SampleCol1.smi > SampleTargetsSmiles30.txt
tail -465 SampleCol1.smi > SampleQueriesSmiles465.txt

./Measures SampleFingerprints320.txt -T -S -S -T 30 0.7 >SimSearch30_465.txt

#./SimilarityOutput
#  Usage: ./SimilarityOutput SparseMatrixFile QueryDataFile TargetDataFile
#
 
./SimilarityOutput SimSearch30_465.txt SampleQueriesSmiles465.txt SampleTargetsSmiles30.txt > SimSearchOutput.txt

#  ./RussianDollTransformation
#
#  Usage: ./RussianDollTransformation AsymmetricMatrixFile N MatrixThreshold
#  Where N is the size of the NXN matrix
#  Where MatrixThreshold is a dissimilarity threshold  
#
#  Note, the RussianDollTransformation is best suited for a Tversky alpha of
#  1 or near 1 (e.g., 0.95).  The Matrix Threshold can be used to loosen
#  the constraint once more with higher values than 0.  The matrix is turned
#  into a dissimilarity and then thresholded with this value.  Those elements
#  in the matrix near zero are very similar and it is those that are counted
#  as having a great deal in commom depending on alpha and the Matrix
#  Threshold.  Also note, the clustering threshold value will
#  often depend on the level of substructures found in common desired, since,
#  though the matrix is composed of proximities (dissimilarities), they are
#  not necessarily related to Tanimoto-like proximities in value.  They can
#  be quite data dependent, so several runs may be necessary to determine
#  appropriate threshold values -- and these may be quite high depending
#  on the application.  A glance at the matrix will give the user
#  some idea as to what thresholds to try.  Below we create an asymmetric
#  matrix with an alpha of near 1, run the RussianDollTransformation. We
#  explore several clustering thresholds in the clustering section that
#  follows.

echo Generate asymmetric matrix for RussianDollTransformation
./Measures SampleFingerprints320.txt -V -D -M -F 0 0.05 > SampleAsymMatrix.0.05.txt
echo Generate RussianDoll asymmetric matrix
./RussianDollTransformation SampleAsymMatrix.0.05.txt 495 0.05 > RussianDollAsymMatrix.0.05.txt

#  ./Clustering
#
#  Usage: ./Clustering distanceMatrixFile.txt -ClusterAlgorithm Threshold Size
#  distanceMatrixFile = space delimited text file of matrix, or
#  typing 'stdin' will accept the same input from stdin   
#  ClusterAlgorithm = '-T' for Symmetric Taylor, '-A' for Asymmetric Taylor
#  '-n' for non-disjoint symmetric Taylor
#  '-N' for Non-disjoint Asymmetric Taylor
#  Threshold is in the range (0-1]
#  Size = n, where the distance matrix is nXn       
#
#  Note: Non-disjoint refers to "overlapping" clusters.

echo Generate Clustering with 0.15 threshold, namely, 0.85 similarity
echo Generate Asymmetric Clustering with the 164 keys used to generate the matrix
./Clustering SampleAsymMatrix.0.1.txt -A 0.15 495 > SampleAsym.0.15.Clustering.txt

echo Generate Clustering with the 320 keys used to generate the matrix
./Clustering SampleAsymMatrix320.0.1.txt -A 0.15 495 > SampleAsym320.0.15.Clustering.txt

echo Generate Clustering, asymmetric, non-disjoint, with the 320 keys used to generate the matrix
./Clustering SampleAsymMatrix320.0.1.txt -N 0.15 495 > SampleAsymN320.0.15.Clustering.txt

echo Generate Clustering with 0.08 threshold
echo Generate Clustering with the 320 keys used to generate the matrix
./Clustering SampleAsymMatrix320.0.1.txt -A 0.08 495 > SampleAsym320.0.08.Clustering.txt

echo Generate Clustering, asymmetric, non-disjoint, with the 320 keys used to generate the matrix
./Clustering SampleAsymMatrix320.0.1.txt -N 0.08 495 > SampleAsymN320.0.08.Clustering.txt


echo Generate Clustering of RussianDollTransformation using a threshold of 0.9
./Clustering RussianDollAsymMatrix.0.05.txt -A 0.95 495 >RussianDollClustering.txt

#  ./AmbiguityMeasures
#
#  Usage: ./AmbiguityMeasures ClusterResultsFileDisjoint ClusterResultsFileNondisjoint
#

echo Generate Ambiguity Report for 0.15 threshold clustering
./AmbiguityMeasures SampleAsym320.0.15.Clustering.txt SampleAsymN320.0.15.Clustering.txt > AmbiguityReport0.15 

echo Generate Ambiguity Report for 0.08 threshold clustering
./AmbiguityMeasures SampleAsym320.0.08.Clustering.txt SampleAsymN320.0.08.Clustering.txt > AmbiguityReport0.08 

#  ./ClusterOutput
#  Usage: ./ClusterOutput ClusterResultsFile SmilesActivityFile ColsInSmilesActivityFile InlineToggle
#  ColsInSmilesActivityFile = the number of columns in the SmilesActivityFile
#  InlineToggle = T | F; T = Clusters by Row, F = Clusters by Column
#  RepCompoundsToggle = R to print out representative compounds only 
#
#  Note the SmilesActivityFile can contain any number of columns after the
#  first smiles column as long as they are space delimited.
#  These can be used for denoting activity, id numbers, etc.
#
#  Note: if the number of columns in the .smi is 1 (just the smiles column)
#  enter 1 for ColsInSmilesActivityFile.  Otherwise, enter the total number
#  of columns INCLUDING the smiles column.
#
#  Note: Clusters by Column will not fit in Excel unless the number
#  of compounds clustered is small and thus the number of clusters is
#  also small.

echo Generate Formated Clustering File 164 key based, representative compounds only
./ClusterOutput SampleAsym.0.15.Clustering.txt SampleCol1.smi 1 T R > SampleAsym.0.15.ClusteringFormated.txt

echo Generate Formated Clustering File 320 key based.
./ClusterOutput SampleAsym320.0.15.Clustering.txt SampleCol1.smi 1 T N> SampleAsym320.0.15.ClusteringFormated.txt

./ClusterOutput RussianDollClustering.txt SampleCol1.smi 1 T N > RussianDollClusteringFormated.txt

#The output of ClusterOutput (say, by row) outputs each cluster, the
#representative compound foremost in the list.  Other column data in the
#.smi file are printed in columns with their respective smile.
#After the clusters are listed, the true and false singletons are listed,
#again with their respective column data if there is any.  Next, the tie
#statistic for this specific program are listed.  Note, that with non-disjoint
#clustering, false singleton ties do not occur.  Lastly, the list of
#representative compounds are printed. 
#
#The output can be sent to a file by redirecting as is done in this script. 
#The .txt file can be loaded into Excel by selecting space delimited columns
#when loading the file.