Mesa Suite Version 1.2
Grouping Module
Copyright (c) 2003 Mesa Analytics
& Computing, LLC
John D. MacCuish and Norah E. MacCuish
The Grouping Module in the Mesa Suite
The Mesa Suite is a collection of application modules for research
and applications (diversity selection, library analysis, compound
acquisition, lead hopping, HTS data analysis, etc.) for chemical
information
systems. The modules can be used separately or in tandem to
perform
a variety of 2D and 3D tasks. Currently there are five
modules:
Shape, Fingerprint, Grouping, Diversity, and ChemTattoo. Aside
from
the standalone module applications, a Khoros interface from Khoral
Research
is also available for each module. Presently we support
Linux,
Windows, and Irix 6.5 platforms. All but the Shape module are
written
in C++. The Shape module is written in C.
Overview of the Grouping Module
The Grouping Module is a set of seven programs:
Its primary utility is in the Clustering program which returns disjoint (non overlapping) or non disjoint (overlapping) clusters of binary bit string data, using either symmetric or asymmetric measures generated with the Measures program. The Grouping Module contains four variants of Taylor-Butina's grouping algorithm.1,2,3,4 The AmbiguityMeasures program generates cluster ambiguity measures, and similarity searching facilities are provided via the Measures program to further augment data analysis. SimilarityOutput program rank orders the queries in terms of their similarity to the respective target and formats this output. User defined information about each query and target can be included and the output is in the form of a space delimited text file. The ClusterOutput program generates grouping results with user defined information in a space delimited text file that can be loaded into a spreadsheet program such as MicroSoft Office's Excel. The RussianDollTransformation is included for grouping by substructure or superstructure similarity. 5 A matrix utility program, Square2Sparse, is provide in the event that the user has a square proximity (similarity or dissimilarity) matrix obtained from some an outside source and would like to group the data using the Clustering program.Measures - Generates Similarity/Dissimilarity Matrices using one of the following measures: Tanimoto, Euclidean, Tversky, Cosine (1-Ochai), and Hamman.
SimilarityOutput - Takes similarity searching output as input and outputs formated search results.
Clustering - Groups compounds using either asymmetric, symmetric, disjoint or non disjoint algorithms using output from Measures.
AmbiguityMeasures - Generates a report based on the user defined threshold used in Clustering as to the level of Ambiguity to be expected in the results.
ClusterOutput - Generates report of clustering results in space delimited formatted form.
RussianDollTransformation - Transforms an asymmetric matrix to another asymmetric matrix that contains sub- or superstructure proximities.
Square2Sparse - Inputs a square matrix from outside source and transforms it for use with Clustering.
ClusterCompare - Compare the results of disjoint clusterings of the same data (i.e., at different thresholds, different measures, etc.). Takes as input a file of the Clustering output file names, and outputs the all-pairs comparisons between clusterings. The measure returned is between 0 and 1, where 1 is a perfect match (the clusterings are exactly the same.
BitStripper - A utility for stripping out unused bit columns (either all 1s or all 0s) for a collection of bitstrings. Reports the number of columns removed and how many of each (1 or 0 columns). Handy for large fingerprints, if there are a lot of unused bits.
Grouping or clustering 2D molecular structures or properties has many important uses in the pharmaceutical industry. Common methods often employ symmetric, hierarchical algorithms such as Wards or complete link algorithms, non-hierarchical algorithms such as Jarvis-Patrick, or partitional algorithms such as k-means.10,11,12 Such methods require symmetric measures, e.g. Tanimoto or Euclidean Distances as input. One drawback to such methods is that they cannot capture important asymmetric relationships between molecular structures. Also, these methods are typically disjoint (non overlapping) clustering methods. That is, each cluster is distinct and does not contain elements (e.g., a specific molecular structure description such as a fingerprint) from other clusters. The Grouping Module extends the common clustering concepts by introducing asymmetric measures and clustering, and providing either disjoint or non disjoint (overlapping) clusterings.
Another serious issue when grouping molecular structures with bit string fingerprints is the problem of ties in proximity or algorithmic decision ties in the clustering algorithm -- more globally, how ambiguous is the final cluster partition? Almost all hierarchical and partitional algorithms have, to varying degrees, ties in proximity or decision ties issues. Our Grouping Module allows the user to identify and explore those issues with a report on the type and number of ties for a specific clustering, and several indices as to the overall ambiguity contained within a clustering. Clustering algorithms are also often constrained by the overall size of the number elements to be clustered. The clustering algorithms we employ for both the disjoint, non disjoint versions of either the symmetric or asymmetric clusterings are variations on the Taylor-Butina grouping algorithm. Our implementations are both fast and space efficient such that large numbers of compounds can be clustered effectively on a single machine. Rough estimates suggest that approximately 100 thousand compounds can be clustered on machine with 1 gigabyte of ram and 1 gigahertz processor within 4 to 8 hours. This includes the time to generate the fingerprints (using our Fingerprint Module), compute the similarities, perform the clustering, and produce the formatted output report.
Theory for Grouping Module
In order for comparisons to be made between chemical structures, a chemical description must first be applied. Typically in 2D, descriptors are molecular fingerprint descriptors, such as MDL MACCS keys, Daylight fingerprints, BCI keys, etc.6,7,8 The Mesa Grouping Module expects as a pre-requisite that chemical data sets under study will be described in a binary bit string fashion -- at least up through the Measures program. The Clustering program itself will operate on any sparse dissimilarity matrix of the form output by the Measures program, independent of its source. The Grouping module can read Daylight Fingerprints, or any fixed length fingerprints which can be cast into ASCII bit string format, e.g. 1010101000011..... For details on one such binary descriptor see the Mesa Fingerprint Module Manual.
Grouping Chemical Structures based on 2D Representations
Tanimoto(A,B) = c / [a + b + c] (symmetric)Euclidean(A,B) = 1 - {[(a + b)] / n}(1/2) (symmetric)
Hamman(A,B) = [c + d] /n (symmetric)
Ochia(A,B) = 1 - Cosine(A,B) = c / [(c + a) * (c + b)](1/2) (symmetric)
Tversky(A,B) = c / [(alpha) * a + (beta) * b + c] (asymmetric)
a : Unique bits turned on in molecule "A"
b: Unique bits turned on in molecule "B"
c: Common bits turned on in both molecule "A" and molecule "B"
d: Common bits turned off in both molecule "A" and molecule "B"
n: The total number of bits in the fingerprint
The Tanimoto, Euclidean, Hamman, and Ochai are all symmetric measures. This means that the comparison of A to B yields the same number as the comparison of compound B to compound A. For the Tversky measure this is not the case. The Tversky (A,B) does not necessarily equal the Tverksy (B,A), which makes the measure asymmetric. The reason for this can be explained by the role of the two parameters, alpha and beta. In the Measures program we add the constraint that alpha + beta = 1. If the choice of alpha is set close to 1, e.g. .95 then beta = 0.05. The Tversky(A,B) for this choice would be close to 1.0 (the maximum) if the unique bits of compound A (when compared to compound B) are nearly zero. At high Tversky (A,B) values, the interpretation is that compound A "fits into" compound B. If the choice of alpha is 0.5, then beta = 0.5 and the Tversky (A,B) is monotonic to the Tanimoto (A,B). The Tversky measure therefore enables the grouping of chemical structures of different sizes that share features of the smaller of the two. Below illustrates this with an example:
The Measures program generates a symmetric dissimilarity matrix in the case of the four symmetric measures and an asymmetric matrix if the Tversky similarity measure is utilized for the pair wise comparisons. The dissimilarity is just 1 - similarity. The diagrams given below demonstrate this point:
The matrix output from Measures can be directed to the Clustering program or the RussianDollTransformation program as input. Typically for use with the Clustering program the Measures program will produce an NxN dissimilarity matrix. However, it maybe the case that the user would like to compare two data sets of different sizes (e.g. MXN), the Measures program will also generate rectangular rather than square matrices for this purpose. The program therefore can be used in either similarity searching mode , or all pairs (dis)similarity mode for clustering purposes. The similarity searching output can then be used as input (along with a file of information (e.g., SMILES strings and ID numbers) for the targets, and another information file for the queries) to the SimilarityOutput program. There are three different output formats for the Measures program: matrix , sparse matrix , or ordered pairs. The matrix format is for either an all pairs comparison of N elements, producing a full NXN matrix (e.g., for exploratory or statistical purposes such as multidimensional scaling); or for similarity searching comparison of M targets by N queries, producing a MXN matrix. The sparse matrix format can be used with a threshold to produce a ragged array that contains those NXN comparisons for clustering that meet a specific threshold. It can also be used for a thresholded similarity search, producing a ragged array of MXN comparisons that meet a threshold criterion, between M targets, and N queries. It is this output that can be used as input to the SimilarityOutput program. The ordered pair format is useful for showing specific i, j comparisons, where the i and j indices are explicit in the output.
Example Usages:
.\Measures SampleFingerprints.txt -V -D -S -F 0 0.95 0.3> SampleAsymMatrix.95.3.txt
Options explained: Tversky (-V), Dissimilarity (-D), Sparse Matrix (-S), Square Matrix - not similarity searching (F), search number 0 (not similarity searching), alpha 0.1, threshold 0.3. Note that Measures can accept input from stdin with the character '-', e.g.,Generate Dissimilarity Matrix for Cluster Program using Tanimoto measure, and dissimilarity threshold cutoff of 0.15:
.\Measures - -V -D -S -F 0 0.95 0.3 < SampleFingerprints.txt > SampleAsymMatrix.95.3.txt
.\Measures SampleFingerprints.txt -T -D -S -F 0 0.15> SampleAsymMatrix.15.txt
Options explained: Tanimoto (-T), Dissimilarity (-D), Sparse Matrix (-S), Square Matrix - not similarity searching (F), search number 0 (not similarity searching), threshold 0.15Generate Similarity Matrix from the comparison of two data sets (Querysetfingerprints.txt and Targetsetfingerprints.txt) using Euclidean measure and a threshold cutoff of 0.80:
Need to use the DOS equivalents of wc and cat. E.g., find/c "1" TargetsetFingerprints.txt will return the number lines in TargetsetFingerprints.txt, since all of the fingerprints have at the character "1" in them.
wc -l Targetsetfingerprints.txt
100 (there are 100 Target fingerprints)
wc -l Querysetfingerprints.txt
1000 (there are 1000 fingerprints)
cat Targetsetfingerprints.txt Querysetfingerprints.txt > TQfingerprints.txt
.\Measures TQfingerprints.txt -E -S -S -T 100 0.80 > SampleAsymMatrix.8.txt
Options explained: The target fingerprints are counted (in this case there are 100). The target fingerprints and the query fingerprints are concatenated into one file, targets at the top. Euclidean (-E), Similarity (-S), Sparse Matrix (-S), Rectangular Matrix, so Similarity searching (T), search number 100 (targets), threshold 0.80Generate Dissimilarity Matrix for a non-Mesa Program such as SAS or SPLUS using Hamman :
.\SimilarityOutput SampleAsymMatrix.8.txt QuerysetSmilesData.txt TargetsetSmilesData.txt > SimSearchOutput.txt
Options explained: There are three input files: 1. The sample similarity search (a ragged array of the sparse matrix formed by the MXN search matrix, in the example 100X1000, and the search threshold); 2. The Query Smiles data (space delimited columns of 1000 rows of query data); 3. The Target Smiles data (space delimited columns of 100 rows of query data).
.\Measures SampleFingerprints.txt -H -D -M -F 0 > SampleAsymMatrix.txt
Options explained: Hamman (-H), Dissimilarity (-D), Matrix (-M), Square Matrix - not similarity searching (F), search number 0 (not similarity searching),Generate Similarity Ordered Pair output for input into a non-Mesa program, using Cosine measure and threshold cutoff of 0.90
.\Measures SampleFingerprints.txt -C -S -O -F 0 0.90> SampleAsymMatrix.9.txtGenerate Dissimilarity Matrix for RussianDollTransformation using Tversky with alpha= .99 and threshold cutoff of 0.001
Options explained: Cosine (-C), Similarity (-S), Ordered Pair (-O), Square Matrix - not similarity searching (F), search number 0 (not similarity searching), threshold = 0.90
.\Measures SampleFingerprints.txt -V -D -S -F 0 0.99 0.001> SampleAsymMatrix.99.001.txt
Options explained: Tversky (-V), Dissimilarity (-D), Sparse Matrix (-S), Square Matrix - not similarity searching (F), search number 0 (not similarity searching), alpha 0.99, threshold 0.001
./Measures SampleFingerprints.txt -V -D -S -F 0 0.95 0.3> SampleAsymMatrix.95.3.txt
Options explained: Tversky (-V), Dissimilarity (-D), Sparse Matrix (-S), Square Matrix - not similarity searching (F), search number 0 (not similarity searching), alpha 0.1, threshold 0.3Generate Dissimilarity Matrix for Cluster Program using Tanimoto measure, and dissimilarity threshold cutoff of 0.15:
./Measures SampleFingerprints.txt -T -D -S F 0 0.15> SampleAsymMatrix.15.txt
Options explained: Tanimoto (-T), Dissimilarity (-D), Sparse Matrix (-S), Square Matrix - not similarity searching (F), search number 0 (not similarity searching), threshold 0.15Generate Similarity Matrix from the comparison of two data sets (Querysetfingerprints.txt and Targetsetfingerprints.txt) using Euclidean measure and a threshold cutoff of 0.80:
wc -l Targetsetfingerprints.txt
100 (there are 100 Target fingerprints)
wc -l Querysetfingerprints.txt
1000 (there are 1000 Target fingerprints)
cat Targetsetfingerprints.txt Querysetfingerprints.txt > TQfingerprints.txt
./Measures TQfingerprints.txt -E -S -S -T 100 0.80 > SampleAsymMatrix.8.txt
Options explained: The target fingerprints are counted (in this case there are 100). The target fingerprints and the query fingerprints are concatenated into one file, targets at the top. Euclidean (-E), Similarity (-S), Sparse Matrix (-S), Rectangular Matrix, so Similarity searching (T), search number 100 (targets), threshold 0.80Generate Dissimilarity Matrix for a non-Mesa Program such as SAS or SPLUS using Hamman :
./SimilarityOutput SampleAsymMatrix.8.txt QuerysetSmilesData.txtTargetsetSmilesData.txt > SimSearchOutput.txt
Options explained: There are three input files: 1. The sample similarity search (a ragged array of the sparse matrix formed by the MXN search matrix, in the example 100X1000, and the search threshold); 2. The Query Smiles data (space delimited columns of 1000 rows of query data); 3. The Target Smiles data (space delimited columns of 100 rows of query data).
./Measures SampleFingerprints.txt -H -D -M -F 0 > SampleAsymMatrix.txt
Options explained: Hamman (-H), Dissimilarity (-D), Matrix (-M), Square Matrix - not similarity searching (F), search number 0 (not similarity searching),Generate Similarity Ordered Pair output for input into a non-Mesa program, using Cosine measure and threshold cutoff of 0.90
./Measures SampleFingerprints.txt -C -S -O -F 0 0.90> SampleAsymMatrix.9.txtGenerate Dissimilarity Matrix for RussianDollTransformation using Tversky with alpha= .99 and threshold cutoff of 0.001
Options explained: Cosine (-C), Similarity (-S), Ordered Pair (-O), Square Matrix - not similarity searching (F), search number 0 (not similarity searching), threshold = 0.90
./Measures SampleFingerprints.txt -V -D -S -F 0 0.99 0.001> SampleAsymMatrix.99.001.txt
Options explained: Tversky (-V), Dissimilarity (-D), Sparse Matrix (-S), Square Matrix - not similarity searching (F), search number 0 (not similarity searching), alpha 0.99, threshold 0.001
Clustering data is an iterative process, and it is not the case that there is one specific clustering method that fits all data, or even a specific class of data (e.g., bit string data). However, some algorithms and proximity measures are more suitable than others for instance with bit string data -- given the type of structure the user is looking for. An analogous situation is known in the classification or predictive modeling field: there is no best classifier for predictive modeling. This has been formalized in a proof, "The No Free Lunch Theorem" (see Pattern Classification, 2nd ed. Duda, Hart, Stork). Rather, there is typically a family of classifiers that may be more suitable for a type of data, any one of which maybe the best or near optimal. One must first search for that family of possible classifiers among many.
With regard to clustering, typically, there
are several levels of refinement in determining structure in the form
of groups in the data. The first process is to determine which
family of clustering methods are suitable for the data at hand.
Next, one performs a small battery of the candidate family of
clustering methods in an effort to determine both efficacy and
efficiency. The latter is important if the data set(s) are large,
and thus may limit the candidates to just a few. Once a specific
algorithm is chosen, further iterative refinement is performed to
determine cluster validity using both a priori information (expert
knowledge about the data set) and quantitative means such as cluster
validity tests. For large data sets quantitative cluster validity
tests are often very computationally expensive. Expert knowledge
can therefore be extremely important. The best or near best
grouping may be obtained with level selection techniques employed in
hierarchical algorithms or heuristics for choosing in k-means
algorithms,
etc. The efficacy of these methods can in turn be explored.
In
Taylor-Butina's algorithm, a careful choice of the exclusion region
threshold (well known information regarding the common measures is
helpful in this regard) and a non-trivial ambiguity determination can
be used.
The Clustering program contains the core functionality of the Mesa Grouping Module. The program inputs an NXN sparse dissimilarity matrix, either from the Measures program output, the RussianDollTransformation program, or a User defined sparse matrix input file, and generates chemical groupings based on user defined options, i.e., disjoint, non disjoint, symmetric, asymmetric. The Clustering program sends output to either the ClusterOutput Program or the AmbiguityMeasures Program
- Create threshold nearest neighbor table using similarities in both directions ( Measures )
- Find true singletons: all those compounds with an empty nearest neighbor list.
- Find the compound with the largest nearest neighbor list. This becomes a group and is excluded from consideration – these compounds are removed from all nearest neighbor lists. (The compound becomes the representative compound.)
- Repeat step 3 until no compounds exist with a non empty nearest neighbor list.
- Assign remaining compounds, false singletons, to the group that contains their nearest neighbor, but identify them as "false singletons"
In the Cluster program selecting the Disjoint option will generate clusters in which compounds are members of only one cluster. Below is a schematic and algorithmic description of the non disjoint option.
Issues with Ties:
- Create threshold nearest neighbor table using similarities in both directions ( Measures )
- Find true singletons: all those compounds with an empty nearest neighbor list.
- Find the compound with the largest nearest neighbor list. This becomes a group and is excluded from consideration – these compounds are removed from all nearest neighbor lists. (The compound becomes the representative compound.) If a shared compound is within the threshold it is not excluded from membership in other clusters. Shared compounds are included in all groups in which it is within the threshold of a representative compound.
- Repeat step 3 until no compounds exist with a non empty nearest neighbor list.
There are three possible decision ties that results from the symmetric and asymmetric disjoint Taylor-Butina algorithm as described above:
- Exclusion Region Ties
- Minimum Distance Ties
- False Singleton Ties
There may be more than one element that has the largest number of
elements associated with it. E.g., say during an iteration there
are two elements with 100 associated elements similar to each, and 100
is the largest number of elements. We call this an exclusion
region tie, as there is more than one exclusion region
candidate. If the two groups are disjoint there is no real
problem. However, if the groups overlap at all, which exclusion
group should be chosen? Such a tie can be broken by a secondary
grouping criterion, such as a criterion that attempts to find the
tightest group (most all pairs of the group similar). This second
criterion can also be tied, and as it turns out given the limited
number of
possible values for bit string measures, such a tie is not as rare as
one
might assume. We call such a tie a minimum distance tie.
False singletons represent a problem for Taylor's algorithm.
Typically, they only represent 1-3% of the elements in a set, but
identifying them as distinct from true singletons and having a
mechanism that associates them with some group is useful. We both
identify them and associate them with the group that contains the most
similar element to the false singleton in question. This
association also can be tied if there is more than one element that is
the most similar. Such ties are also common for the same reason
-- that bit string measures have a relatively small finite number of
similarities. We call this type of tie a false singleton tie.
Our implementation reports the number of each of these ties.
Minimum distance ties and false singleton ties are broken arbitrarily
and therefore introduce a degree of ambiguity into the final
grouping. The
non disjoint version, either symmetric or asymmetric, does not have the
false
singleton ties problem, as false singletons are naturally subsumed into
overlapping
groups. The next program in the Grouping Module,
AmbiguityMeasures
, provides an effective means of threshold selection which provides
insight
into the ties issues mentioned here.
Example Usages:
wc -l SampleAsymMatrix.0.1.txt (counts rows of matrix)495
.\Clustering SampleAsymMatrix.0.1.txt -A 0.15 495 > SampleAsym.0.15.Clustering.txt
Options explained: Asymmetric Disjoint (-A), threshold 0.15, 495 members of matrix, N of NXN is 495Generate Asymmetric, Non-disjoint Clustering, with threshold of 0.1 (equivalent to 0.9 Similarity)
Generate Symmetric, Disjoint, clustering with with threshold 0.05 (equivalent to 0.95 Similarity).\Clustering SampleAsymMatrix.0.1.txt -N 0.1 495 > SampleAsymN.0.1.Clustering.txt
wc -l SampleSymMatrix.0.2.txt (counts rows of matrix)
352
Generate Symmetric, Non-disjoint, clustering with with threshold 0.05 (equivalent to 0.95 Similarity).\Clustering SampleSymMatrix.0.2.txt -T 0.05 352 > SampleAsym.0.05.Clustering.txt
Generate Asymmetric Disjoint Clustering with a exclusion region threshold of 0.15 (.85 simmilarity)wc -l SampleSymMatrix.0.2.txt (counts rows of matrix)
352.\Clustering SampleSymMatrix.0.2.txt -n 0.05 352 > SampleAsym.0.05.Clustering.txt
wc -l SampleAsymMatrix.0.1.txt (counts rows of matrix)Generate Asymmetric, Non-disjoint Clustering, with threshold of 0.1 (equivalent to 0.9 Similarity)495Options explained: Asymmetric Disjoint (-A), threshold 0.15, 495 members of matrix, N of NXN is 495.\Clustering SampleAsymMatrix.0.1.txt -A 0.15 495 > SampleAsym.0.15.Clustering.txt
Generate Symmetric, Disjoint, clustering with with threshold 0.05 (equivalent to 0.95 Similarity).\Clustering SampleAsymMatrix.0.1.txt -N 0.1 495 > SampleAsymN.0.1.Clustering.txt
Generate Symmetric, Non-disjoint, clustering with with threshold 0.05 (equivalent to 0.95 Similarity)wc -l SampleSymMatrix.0.2.txt (counts rows of matrix)352.\Clustering SampleSymMatrix.0.2.txt -T 0.05 352 > SampleAsym.0.05.Clustering.txt
wc -l SampleSymMatrix.0.2.txt (counts rows of matrix)
352.\Clustering SampleSymMatrix.0.2.txt -n 0.05 352 > SampleAsym.0.05.Clustering.txt
wc -l SampleAsymMatrix.0.1.txt (counts rows of matrix)495
./Clustering SampleAsymMatrix.0.1.txt -A 0.15 495 > SampleAsym.0.15.Clustering.txt
Options explained: Asymmetric Disjoint (-A), threshold 0.15, 495 members of matrix, N of NXN is 495Generate Asymmetric, Non-disjoint Clustering, with threshold of 0.1 (equivalent to 0.9 Similarity)
Generate Symmetric, Disjoint, clustering with with threshold 0.05 (equivalent to 0.95 Similarity)./Clustering SampleAsymMatrix.0.1.txt -N 0.1 495 > SampleAsymN.0.1.Clustering.txt
wc -l SampleSymMatrix.0.2.txt (counts rows of matrix)
352
Generate Symmetric, Non-disjoint, clustering with with threshold 0.05 (equivalent to 0.95 Similarity)./Clustering SampleSymMatrix.0.2.txt -T 0.05 352 > SampleAsym.0.05.Clustering.txt
Generate Asymmetric Disjoint Clustering with a exclusion region threshold of 0.15 (.85 similarity)wc -l SampleSymMatrix.0.2.txt (counts rows of matrix)
352./Clustering SampleSymMatrix.0.2.txt -n 0.05 352 > SampleAsym.0.05.Clustering.txt
wc -l SampleAsymMatrix.0.1.txt (counts rows of matrix)Generate Asymmetric, Non-disjoint Clustering, with threshold of 0.1 (equivalent to 0.9 Similarity)495Options explained: Asymmetric Disjoint (-A), threshold 0.15, 495 members of matrix, N of NXN is 495./Clustering SampleAsymMatrix.0.1.txt -A 0.15 495 > SampleAsym.0.15.Clustering.txt
Generate Symmetric, Disjoint, clustering with with threshold 0.05 (equivalent to 0.95 Similarity)./Clustering SampleAsymMatrix.0.1.txt -N 0.1 495 > SampleAsymN.0.1.Clustering.txt
Generate Symmetric, Non-disjoint, clustering with with threshold 0.05 (equivalent to 0.95 Similarity)wc -l SampleSymMatrix.0.2.txt (counts rows of matrix)352./Clustering SampleSymMatrix.0.2.txt -T 0.05 352 > SampleAsym.0.05.Clustering.txt
wc -l SampleSymMatrix.0.2.txt (counts rows of matrix)
352./Clustering SampleSymMatrix.0.2.txt -n 0.05 352 > SampleAsym.0.05.Clustering.txt
Given the ties issues discussed above, we can consider the amount of ambiguity that might accrue given a disjoint clustering at a given threshold. Such a clustering is really just one of many possible clusterings, since there may be many possible outcomes of any set of arbitrary decisions that are made. There are a number of different indices that can capture the inherent ambiguity that result from a collection of arbitrary decisions. We propose three simple and useful indices -- two that are based on the number of overlapping elements, and a third based on the difference in the number of disjoint clusters versus the number of non disjoint clusters produced with the same threshold and the same symmetry or asymmetry. The AmbiguityMeasures program takes the outputs of a disjoint and a non disjoint run (both either symmetric or asymmetric) at the same exclusion region threshold and returns three such indices.
We suggest that threshold selection not be an arbitrary decision,
but rather that clusterings be done both disjoint and non disjoint at a
number of threshold values, and the AmbiguityMeasures
program be
utilized to make and informed decision as the "best" threshold to pick
for a given data set.
Example Usages:
Generate Ambiguity Report for 0.15 threshold clustering
of an Asymmetric Similarity Matrix
Generate Ambiguity Report for 0.1 threshold clustering of an Symmetric Similarity Matrix.\AmbiguityMeasures SampleAsym.0.15.Clustering.txt SampleAsymN.0.15.Clustering.txt > AmbiguityReport0.15Options explained: Input files are the disjoint and non disjoint clustering results from the Clustering program run at the same threshold
Generate Ambiguity Report for 0.15 threshold clustering of an Asymmetric Similarity Matrix.\AmbiguityMeasures SampleSym.0.1.Clustering.txt SampleSymN.0.1.Clustering.txt > AmbiguityReport0.1Options explained: Input files are the disjoint and non disjoint clustering results from the Clustering program run at the same threshold
UNIX VERSION EXAMPLES
Generate Ambiguity Report for 0.1 threshold clustering of an Symmetric Similarity Matrix./AmbiguityMeasures SampleAsym.0.15.Clustering.txt SampleAsymN.0.15.Clustering.txt > AmbiguityReport0.15Options explained: Input files are the disjoint and non disjoint clustering results from the Clustering program run at the same threshold
./AmbiguityMeasures SampleSym.0.1.Clustering.txt SampleSymN.0.1.Clustering.txt > AmbiguityReport0.1Options explained: Input files are the disjoint and non disjoint clustering results from the Clustering program run at the same threshold
The ClusterOutput program takes the Clustering program output with user information about the elements (e.g., SMILES strings, activity data, ID numbers) and outputs a clustering report. The user can either specify a full report or just the identification of the representative element (one from each cluster).
Typical output from the program with the Clusters by Row option
set
would look like:
Cluster 1
Rows of SMILES strings and any associated data, space delimited.
Cluster 2
.
.
.
True Singletons
False Singletons (none in the non disjoint versions)
Ties Statistics
Representative Compounds ("Centroids")
Rows of the representative compounds and any associated data, space
delimited.
Typical output from the program with the Clusters by Column option
set would array the output in columns:
True Singletons False
Singletons Cluster1 Cluster2 ...
ClusterK Ties
Statistics Representative Compounds
.
.
.
Associated data is in adjacent columns to the data in question.
Since this is a ragged array, to make it rectangular, blanks are
filled with "NA". This format is useful when the for data sets
are small and the user wishes to import them into MicroSoft's Excel
where the number of columns for the cluster output and associated data
is only 256.
Example Usage
.\ClusterOutput SampleAsym.0.15.Clustering.txt SampleCol1.smi 1 T R > SampleAsym.0.15.ClusteringFormated.txtGenerate Formatted Clustering File for small set with clusters in columns for easy input into MicroSoft ExcelOptions explained: The first input file is the output file from the Clustering program for an asymmetric clustering at a threshold of 0.15 and the second input file is a file of SMILES strings associated with each row of the Asymmetric similarity matrix used as input in the Clustering program, there is only one column in the SMILES file ( 1 ), Clusters by row the output will have only one cluster in each row (T ) and only Representative compounds will be printed into the output file ( R)
.\ClusterOutput SampleAsym0.15.Clustering.txt SampleCol1.smi 1 F N> SampleAsym320.0.15.ClusteringFormated.txt
Options explained: The first input file is the output file from the Clustering program for an asymmetric clustering at a threshold of 0.15 and the second input file is a file of SMILES strings associated with each row of the Asymmetric similarity matrix used as input in the Clustering program, there is only one column in the SMILES file ( 1 ), Clusters by column, the output will have clusters printed out in column format (F) and all compounds will be printed into the output file (N)Generate Formatted Clustering File with representative compounds onlyUNIX VERSION EXAMPLES
./ClusterOutput SampleAsym.0.15.Clustering.txt SampleCol1.smi 1 T R > SampleAsym.0.15.ClusteringFormated.txtGenerate Formatted Clustering File for small set with clusters in columns for easy input into MicroSoft ExcelOptions explained: The first input file is the output file from the Clustering program for an asymmetric clustering at a threshold of 0.15 and the second input file is a file of SMILES strings associated with each row of the Asymmetric similarity matrix used as input in the Clustering program, there is only one column in the SMILES file ( 1 ), Clusters by row the output will have only one cluster in each row (T) and only Representative compounds will be printed into the output file (R)
./ClusterOutput SampleAsym0.15.Clustering.txt SampleCol1.smi 1 F N> SampleAsym320.0.15.ClusteringFormated.txt
Options explained: The first input file is the output file from the Clustering program for an asymmetric clustering at a threshold of 0.15 and the second input file is a file of SMILES strings associated with each row of the Asymmetric similarity matrix used as input in the Clustering program, there is only one column in the SMILES file ( 1 ), Clusters by column, the output will have clusters printed out in column format (F) and all compounds will be printed into the output file (N)
The RussianDollTransformation program is a response to a question posed by John Bradshaw at Daylight Chemical Information Systems. "How can one group large molecular structures in a database, based on the number of smaller structures they have in common such that this relationship forms a proximity matrix among larger structures?" For example, say a group of larger molecular structures all contain the same 10 smaller structures within each large structure. Another group of large structures only contain 7 smaller structures in common. The first group of large structures is more similar. Namely, this transformation groups larger structures by what smaller structures fit within them. The utility of this question becomes apparent if, say, a drug database is seeded with smaller structures that are important drug like structures, and one wishes to understand which larger structures in the database are grouped by the number of like seeded structures. The answer to the question lies in the use of asymmetry. With a Tversky alpha set to 1.0 (or near 1.0), one can effectively ask, what structures fit wholly (or mostly) within another structures. With an asymmetric proximity matrix produced by such a Tversky parameterization, we can transform that matrix to ask John Bradshaw original question, producing another asymmetric proximity matrix, that, when used with an asymmetric clustering algorithm, will actually produce an answer to that question, and the converse of that question -- "can we group smaller structures based on what they fit into?" A disjoint or non disjoint clustering of the transformed matrix will produce both types of groupings.
The Russian Doll Transformation transforms an asymmetric matrix into another asymmetric matrix containing the counts of compounds that two compounds i and j mutually share as nearest neighbors at some threshold. Asymmetrically clustering this transformation groups large molecules by what smaller molecules they mutually share as substructures, and it groups small molecules by what larger molecules they fit into. The input is a Tversky asymmetric matrix, typically with alpha set at or near 1. This setting of alpha behaves such that one compound "fits" within the other, or nearly fits. Those compounds that share other numerous compounds in this manner are more similar. There are 3 parameters that effect the outcome of a clustering using this transformation: 1. the Tversky alpha value; 2. the matrix threshold (a nearest neighbor like threshold) to the Russian Doll Transform; and 3. the clustering measure threshold. These parameters can be modified to provide various clusterings. For example, if one just wants to group those compounds that are very nearly the same and leave all others as singletons, set the Tversky to 1 (or 0), the matrix threshold to near 0 (say, 0.1), and the clustering threshold toward the lower range of the transform matrix output (often between 0.1 and 0.5). As these constraints are loosened clusters contain more varied compounds and there are often fewer singletons.The clustering threshold value should depend directly on the desired number of substructures (superstructures) found in common. Thus, though the matrix is composed of proximities (dissimilarities), they are not necessarily related to Tanimoto-like proximities in value (e.g., where 0.7 similarity is a common threshold). They can be quite data dependent, so several runs may be necessary to determine appropriate threshold values -- and these may be quite high depending on the application. A glance at the matrix will give the user some idea as to what thresholds to try. Somewhat counter intuitively, the dissimilarities can be quite close to 1 since the counts in common are scaled by the number of elements (compounds in the data set). The dissimilarity measure is 1 - (compounds in common)/n, where n is the number of compounds in the data set. If in general there are very few compounds in common compared to the total number of compounds, most of the dissimilarities are then near 1. Other scalings that more evenly distribute the values between 0 and 1 are possible, but this scaling allows the user to directly calculate the total number of counts from the dissimilarity by knowing only n.
The Russian Doll idea was by John Bradshaw and the algorithm to implement the idea was by Norah and John MacCuish. This is at present, to the best of our knowledge, an unpublished algorithm and unpublished use of this idea in compound clustering. A presentation discussing the Russian Doll algorithm and its uses was given at MUG '03 and can be found at http://www.mesaac.com/RussianDollDistrib.htm
Example Usage
Generate asymmetric matrix for RussianDollTransformation - must first use the Measures program to generate a Dissimilarity matrix for the RussainDollTransformation program then Generate RussianDoll asymmetric matrix for input into the Cluster program.\Measures SampleFingerprints.txt -V -D -M -F 0 0.01 > SampleAsymMatrix.0.01.txt
Options Explained: Tversky (-V), Dissimilarity Matrix (-D), Matrix Output (-M), Not Searching (F), no search number (0), Threshold 0.01wc -l SampleAsymMatrix.0.01.txt
495
.\RussianDollTransformation SampleAsymMatrix.0.01.txt 495 0.01 > RussianDollAsymMatrix.0.01.txt
Options Explained: Input file is asymmetric similarity matrix at a dissimilarity threshold .01, Number of members of the matrix (495), dissimilarity threshold .01.\Clustering RussianDollAsymMatrix.0.01.txt -A 0.99 495 >RussianDollClustering.txt
Options Explained: Input file is RussianDoll asymmetric matrix at a dissimilarity threshold .01, Number of members of the matrix (495), dissimilarity threshold .99
UNIX VERSION EXAMPLES
./Measures SampleFingerprints.txt -V -D -M -F 0 0.01 > SampleAsymMatrix.0.01.txt
Options Explained: Tversky (-V), Dissimilarity Matrix (-D), Matrix Output (-M), Not Searching (F), no search number (0), Threshold 0.01wc -l SampleAsymMatrix.0.05.txt
495
./RussianDollTransformation SampleAsymMatrix.0.01.txt 495 0.01 > RussianDollAsymMatrix.0.01.txt
Options Explained: Input file is asymmetric similarity matrix at a dissimilarity threshold .01, Number of members of the matrix (495), dissimilarity threshold .01./Clustering RussianDollAsymMatrix.0.01.txt -A 0.99 495 >RussianDollClustering.txt
Options Explained: Input file is RussianDoll asymmetric matrix at a dissimilarity threshold .01, Number of members of the matrix (495), similarity threshold .99
Example Applications
Using the flow chart in the top on this document, subsets of the programs can be utilized for the following example applications:
Compound Acquisition Decisions
In-house databases are augmented many times through acquisitions of available compounds from commercial vendors. Determining the extent to which a possible acquisition database is "similar" to an in-house database becomes crucial in making compound acquisition decisions. Compounds are typically costly. Thus, purchasing selections of various compounds from outside vendors that will increase the number of hits in HTS screens, without generating redundant data or producing few or no hits in the HTS screens, is what makes this task difficult.
To process, for example, an in-house database of 100,000 compounds and an acquisition database of 50,000 compounds would require ~ 1 GByte of RAM. The steps are as follows:
Script for Step 2
1. For this example let:
In house data file of SMILES strings = InHouse.smi
Acquistion data file of SMILES strings = Acquisition.smi
In house data file of fingerprints = InHouse.fps
Acquistion data file of fingerprints = Acquisition.fps
Hence,
./MACCSKeys320Generator InHouse.smi > InHouse.fps
./MACCSKeys320Generator Acquistion.smi > Acquisition.fps2. In this example we use 4 thresholds to determine reasonable ambiguity and expert opinion as to the "best" clustering.
./Measures InHouse.fps -T -D -S -F 0 0.3 > InHouseSymMatrix.0.15.txt
./Clustering InHouseSymMatrix.0.15.txt -T 0.25 100000 > InHouseSymDisjointClustering.0.25.txt
./Clustering InHouseSymMatrix.0.15.txt -T 0.20 100000 > InHouseSymDisjointClustering.0.20.txt
./Clustering InHouseSymMatrix.0.15.txt -T 0.15 100000 > InHouseSymDisjointClustering.0.15.txt
./Clustering InHouseSymMatrix.0.15.txt -T 0.10 100000 > InHouseSymDisjointClustering.0.15.txt./Clustering InHouseSymMatrix.0.15.txt -n 0.25 100000 > InHouseSymNonDisjointClustering.0.25.txt
./Clustering InHouseSymMatrix.0.15.txt -n 0.20 100000 > InHouseSymNonDisjointClustering.0.20.txt
./Clustering InHouseSymMatrix.0.15.txt -n 0.15 100000 > InHouseSymNonDisjointClustering.0.15.txt
./Clustering InHouseSymMatrix.0.15.txt -n 0.10 100000 > InHouseSymNonDisjointClustering.0.10.txt./AmbiguityMeasures InHouseSymDisjointClustering.0.25.txt InHouseSymNonDisjointClustering.0.25.txt >AmbiguityReport.0.25.txt
...similarly for the other 3 Ambiguity reports.3. In this example, say that the 0.15 threshold represented the best of the 4 clusterings in terms of both ambiguity and expert analysis of the groups formed.
Then,
./ClusterOutput InHouseSymDisjointClustering.0.15.txt InHouse.smi 1 T N > InHouseSymDisjointClusteringFormated.0.15.txt
Review cluster output report file and determine how many representative compounds meet the Z < X criterion. Say that number is 200.
./ClusterOutput InHouseSymDisjointClustering.0.15.txt InHouse.smi 1 T R > InHouseSymDisjointClusteringFormatedRepresentatives.0.15.txt
tail -200 InHouseSymDisjointClusteringFormatedRepresentatives.0.15.txt > Z_LT_X_RepresentativeSet.smi
The last file, Z_LT_X_RepresentativeSet.smi, is the SMILES file of the representative compounds that meet the Z < X criterion.4. Now we append the acquistion set of SMILES to the Z < X representative set plus the true singletons.
Clip out true singletons SMILES from InHouseSymDisjointClusteringFormated.0.15.txt. This can be done with unix head and tail commands once the line numbers for the beginning and end of the true singleton section of the report file is found. Call this file InHouseTrueSingletons.smi
cat InHouseTrueSingletons.smi Z_LT_X_RepresentativeSet.smi > T_S_Z_X.smi
Find the size of this combined file:
wc -l T_S_Z_X.smi
Say it equals 500
Append this combined file to the Acquisition file to obtain the similarity searching file
cat T_S_Z_X.smi Acquisition.smi > TSZXRep_Acquisition.smi
For this example the threshold found above in step two was 0.15 dissimilarity = 0.85 similarity./MACCSKeys320Generator TSZYRep_Acquisition.smi > TSZYRep_Acquisition.fps
./Measures TSZYRep_Acquisition.fps -T -S -S -T 500 0.85 > TSZYRep_Acquisition SimSymMatrix.txt5. To perform this step one will need to write a short script (e.g., in perl or python) that will extract the information - matching indices to SMILES strings using the Acquistion.smi and TSZYRep_AcquisitionSimSymMatrix.txt files.
6. Again, a short script can be written to create the file of those compounds not identified for inclusion into the buy file from Acquisition.smi file. Let us call the new file, ReducedAcquisition.smi.
7. Perform step 2 on the ReducedAcquisition.smi file and create a formated cluster output file of the settled on clustering.
8. With a script, extract the desired compounds from the formatted cluster output file in step 7 and add these to the buy file.
9. The buy file will contain a set of SMILES from the acquisition set to buy.
Analysis of High Throughput Screening Hits
Large numbers of hits from HTS screens can be difficult to analyze when the hit rate is above more than a couple hundred compounds. There is only so much one can do by eye, looking at structures on a computer screen or printed out on sheets of paper. The Mesa Grouping Module provides several mechanisms for grouping hits for analysis and SAR studies.
One approach for a relatively small set, say 500, compounds would be to use ChemTattoo. ChemTattoo will take an input set of SMILES strings, of actives for example, and find common features within the set. ChemTattoo generates statistics for fingerprints of the active set. It will generate a frequency fingerprint which shows the number of times a "key" in the fingerprint is represented across the set in the form of the fraction of the number of fingerprints in the set. It will also generate a modal fingerprint, which is a representative fingerprint of the common "keys" in the set. The user selects a threshold between 0 and 1 and then produces a modal fingerprint for the set at a given threshold. If the user selects 1.0 for the threshold, then the modal fingerprint will contain on bits for only those keys which are present in all 500 compounds. If the user selects a threshold of .75 then the modal at that threshold will contain "on" bits for only those keys that are turned on in the fingerprints of 75% of the input set, 500*.75 = 375. Namely, 375 of the 500 compounds would have to contain the key for the key to be turned on in the modal. The modal at given thresholds can be compared to each of the members of the active set and to pull out structurally similar sets.
Depending on the diversity of the hits, the particular groupings
that are desired, and the size of the number of hits, one may need to
perform a clustering step prior to using the ChemTattoo for best
results. (For more information on ChemTattoo, see the ChemTattoo
Module documentation.)
Process:
References
Example Script
Below is an example script that is included in the program
directory, called TestScript. It contains the commandline
interface details.
#TestScript for Fingerprint and Grouping Module
#Copyright (c), 2002,2003, Mesa Analytics & Computing, LLC
# ./MACCSKeys164Generator 164 keys
# ./MACCSKeys320Generator 320 keys
#
# Usage: ./MACCSKeys*Generator file | - -format -columns
-errorfile
#
# file = filename.smi | - for stdin
#
# Space delimited .smi file may contain extra columns
# -- SMILES MUST BE in first column.
#
# -format = -A for ASCII (default), -C for 21 bytes of char,
# and -U for 3 unsigned long ints.
#
# -columns = -T for return all columns in file and error file,
# fingerprint being the first column |
# -F for only fingerprint returned in output (default),
# and only smiles and index to .smi file returned in error file
#
# -errorFile = -T for return Smiles parsing error file |
# -F for no error file (default)
#
# Note that other programs in the module that accept fingerprints
as
# input will use ascii fingerprints (the -A option). The
file
of fingerprints
# will need to be a single column. So if the
MACCSKeys*Generator
output
# contains extra columns then these will need to be stripped off.
E.g.,
# cut -d' ' -f1 fingerprint_and_other_columns_file.txt >
fingerprintfile.txt
# will create a single column of fingerprints in
fingerprintfile.txt
# for simplicity in the example below the -column option is -F so
that
just
# the single column of fingerprints is produced for the remaining
programs.
echo Number of lines =
wc -l SampleCol1.smi
echo This is the number of compounds or "Size"
echo Generate Binary Fingerprints with 164 keys
./MACCSKeys164Generator SampleCol1.smi -A -F -T >
SampleFingerprints.txt
echo Generate Binary Fingerprints with 320 keys
./MACCSKeys320Generator SampleCol1.smi -A -F -T >
SampleFingerprints320.txt
# ./ChemTattooModalStats
#
# Usage: ./ChemTattooModalStats fingerprintfile
modalthreshold
similarity | smartsfile
# The modalthreshold can range from [0,1].
# similarity = '-S' for similarity, '-D' for dissimilarity
#
# Note smartsfile can be a collection of any strings, each
associated
with a
# specific bit.
# Note also, that any other option for similarity other than '-S'
will
# produce dissimilarities as the default.
echo Generate Modal Statistics given 164 bit strings
./ChemTattooModalStats SampleFingerprints.txt 0.95 -D
>Modal164Stats
echo Generate Modal Statistics given 320 bit strings
./ChemTattooModalStats SampleFingerprints320.txt 0.95 -D
>Modal320Stats
# ./Measures
#
# Usage: ./Measures fingerprintfile.txt measure similarity format
searching
searchnumber | alpha | sparsethreshold
# measure = '-T' for Tanimoto, '-V' for Tversky,
# '-E' for Euclidean, '-H' for Hamming, '-C' for Cosine
# similarity = '-S' for similarity, '-D' for dissimilarity
# format = '-M' for Matrix, '-O' for Ordered Pairs, '-S' for
Sparse
Matrix
# searching = '-T' for similarity searching, '-F' if not
# searchnumber = an positive integer M < N to be searched
# (Note: fingerprint file will have M + N fingerprints for
searching
= '-T')
# alpha is in the range 0-1, and measure must = '-V'
# sparsethreshold is in the range of (0,1), and format must be =
'-S'
# and searching can = either '-T' or '-F'
# Note clustering program requires that the similarity and format
# have the -D and -S options respectively, with searching = '-F'
(searchnumber
= 0)
# Similarly for alpha -- is only needed if measure = '-V'
# Measures program as used for generating matrices for clustering
#
echo Generate Dissimilarity Matrix for 164 keys
./Measures SampleFingerprints.txt -V -D -S -F 0 0.1 0.3>
SampleAsymMatrix.0.1.txt
echo Generate Dissimilarity Matrix for 320 keys
./Measures SampleFingerprints320.txt -V -D -S -F 0 0.1 0.3>
SampleAsymMatrix320.0.1.txt
# Measures program as used for similarity searching
# The following is an artificial example that takes the first 30
compounds
# as targets and the remaining 465 as queries and produces a
similarity
30X465
# ragged array at a threshold of the search of the queries.
This
output is
# is then input to the SimilarityOutput program to produce a
formated
result
# that rank orders the queries for each target by the respective
similarity.
head -30 SampleCol1.smi > SampleTargetsSmiles30.txt
tail -465 SampleCol1.smi > SampleQueriesSmiles465.txt
./Measures SampleFingerprints320.txt -T -S -S -T 30 0.7
>SimSearch30_465.txt
#./SimilarityOutput
# Usage: ./SimilarityOutput SparseMatrixFile QueryDataFile
TargetDataFile
#
./SimilarityOutput SimSearch30_465.txt SampleQueriesSmiles465.txt
SampleTargetsSmiles30.txt
> SimSearchOutput.txt
# ./RussianDollTransformation
#
# Usage: ./RussianDollTransformation AsymmetricMatrixFile N
MatrixThreshold
# Where N is the size of the NXN matrix
# Where MatrixThreshold is a dissimilarity threshold
#
# Note, the RussianDollTransformation is best suited for a
Tversky
alpha of
# 1 or near 1 (e.g., 0.95). The Matrix Threshold can be
used
to loosen
# the constraint once more with higher values than 0. The
matrix
is turned
# into a dissimilarity and then thresholded with this
value.
Those elements
# in the matrix near zero are very similar and it is those that
are
counted
# as having a great deal in commom depending on alpha and the
Matrix
# Threshold. Also note, the clustering threshold value will
# often depend on the level of substructures found in common
desired,
since,
# though the matrix is composed of proximities (dissimilarities),
they
are
# not necessarily related to Tanimoto-like proximities in
value.
They can
# be quite data dependent, so several runs may be necessary to
determine
# appropriate threshold values -- and these may be quite high
depending
# on the application. A glance at the matrix will give the
user
# some idea as to what thresholds to try. Below we create
an
asymmetric
# matrix with an alpha of near 1, run the
RussianDollTransformation.
We
# explore several clustering thresholds in the clustering section
that
# follows.
echo Generate asymmetric matrix for RussianDollTransformation
./Measures SampleFingerprints320.txt -V -D -M -F 0 0.05 >
SampleAsymMatrix.0.05.txt
echo Generate RussianDoll asymmetric matrix
./RussianDollTransformation SampleAsymMatrix.0.05.txt 495 0.05 >
RussianDollAsymMatrix.0.05.txt
# ./Clustering
#
# Usage: ./Clustering distanceMatrixFile.txt -ClusterAlgorithm
Threshold
Size
# distanceMatrixFile = space delimited text file of matrix, or
# typing 'stdin' will accept the same input from
stdin
# ClusterAlgorithm = '-T' for Symmetric Taylor, '-A' for
Asymmetric
Taylor
# '-n' for non-disjoint symmetric Taylor
# '-N' for Non-disjoint Asymmetric Taylor
# Threshold is in the range (0-1]
# Size = n, where the distance matrix is
nXn
#
# Note: Non-disjoint refers to "overlapping" clusters.
echo Generate Clustering with 0.15 threshold, namely, 0.85 similarity
echo Generate Asymmetric Clustering with the 164 keys used to generate
the
matrix
./Clustering SampleAsymMatrix.0.1.txt -A 0.15 495 >
SampleAsym.0.15.Clustering.txt
echo Generate Clustering with the 320 keys used to generate the matrix
./Clustering SampleAsymMatrix320.0.1.txt -A 0.15 495 >
SampleAsym320.0.15.Clustering.txt
echo Generate Clustering, asymmetric, non-disjoint, with the 320 keys
used
to generate the matrix
./Clustering SampleAsymMatrix320.0.1.txt -N 0.15 495 >
SampleAsymN320.0.15.Clustering.txt
echo Generate Clustering with 0.08 threshold
echo Generate Clustering with the 320 keys used to generate the matrix
./Clustering SampleAsymMatrix320.0.1.txt -A 0.08 495 >
SampleAsym320.0.08.Clustering.txt
echo Generate Clustering, asymmetric, non-disjoint, with the 320 keys
used
to generate the matrix
./Clustering SampleAsymMatrix320.0.1.txt -N 0.08 495 >
SampleAsymN320.0.08.Clustering.txt
echo Generate Clustering of RussianDollTransformation using a threshold
of
0.9
./Clustering RussianDollAsymMatrix.0.05.txt -A 0.95 495
>RussianDollClustering.txt
# ./AmbiguityMeasures
#
# Usage: ./AmbiguityMeasures ClusterResultsFileDisjoint
ClusterResultsFileNondisjoint
#
echo Generate Ambiguity Report for 0.15 threshold clustering
./AmbiguityMeasures SampleAsym320.0.15.Clustering.txt
SampleAsymN320.0.15.Clustering.txt
> AmbiguityReport0.15
echo Generate Ambiguity Report for 0.08 threshold clustering
./AmbiguityMeasures SampleAsym320.0.08.Clustering.txt
SampleAsymN320.0.08.Clustering.txt
> AmbiguityReport0.08
# ./ClusterOutput
# Usage: ./ClusterOutput ClusterResultsFile SmilesActivityFile
ColsInSmilesActivityFile
InlineToggle
# ColsInSmilesActivityFile = the number of columns in the
SmilesActivityFile
# InlineToggle = T | F; T = Clusters by Row, F = Clusters by
Column
# RepCompoundsToggle = R to print out representative compounds
only
#
# Note the SmilesActivityFile can contain any number of columns
after
the
# first smiles column as long as they are space delimited.
# These can be used for denoting activity, id numbers, etc.
#
# Note: if the number of columns in the .smi is 1 (just the
smiles
column)
# enter 1 for ColsInSmilesActivityFile. Otherwise, enter
the
total number
# of columns INCLUDING the smiles column.
#
# Note: Clusters by Column will not fit in Excel unless the
number
# of compounds clustered is small and thus the number of clusters
is
# also small.
echo Generate Formated Clustering File 164 key based, representative
compounds
only
./ClusterOutput SampleAsym.0.15.Clustering.txt SampleCol1.smi 1 T R
>
SampleAsym.0.15.ClusteringFormated.txt
echo Generate Formated Clustering File 320 key based.
./ClusterOutput SampleAsym320.0.15.Clustering.txt SampleCol1.smi 1 T
N>
SampleAsym320.0.15.ClusteringFormated.txt
./ClusterOutput RussianDollClustering.txt SampleCol1.smi 1 T N >
RussianDollClusteringFormated.txt
#The output of ClusterOutput (say, by row) outputs each cluster, the
#representative compound foremost in the list. Other column data
in
the
#.smi file are printed in columns with their respective smile.
#After the clusters are listed, the true and false singletons are
listed,
#again with their respective column data if there is any. Next,
the
tie
#statistic for this specific program are listed. Note, that with
non-disjoint
#clustering, false singleton ties do not occur. Lastly, the list
of
#representative compounds are printed.
#
#The output can be sent to a file by redirecting as is done in this
script.
#The .txt file can be loaded into Excel by selecting space delimited
columns
#when loading the file.