This document describes work from two recent projects that illustrate some uses of hierarchical cluster analysis.  In one project the positions of water molecules in the available crystal structures for the kinase ERK2 were identified for use in evaluating their potential effects on ligand docking.  In the second project a sampling of kinase crystal structures was examined to identify groups of proteins that have very similar three-dimensional structures but low sequence similarity, for use in an evaluation of sequence alignment methods.  The strategy used here for performing the clustering is well established but is often not used by the computational chemistry community.  It involves applying a number of clustering methods to the data and then using project-specific criteria to determine which method provides the most useful clusters for the given project.  Applying this strategy to these projects showed that the methods producing the most useful clusters differed between the two examples, which counsels against assuming that any one method will provide the best results for all types of data.

General Background on Clustering Methods

 

Cluster analysis is an empirical method for organizing data into groups that are useful for a particular application.  Many clustering methods are available and it is often not clear which is the most appropriate to use for a particular data set.  There is generally no one “best” method to apply to a given type of data.  Rather, the method that yields the best results depends on the descriptor space, the number of data points, the metric used to describe the distances between data points, the distribution of data within the descriptor space, and the criteria used to determine whether the clusters are useful.  In practice, an effective approach is to apply several different clustering methods and then select the one that gives the best results for the project at hand.

 

The process of clustering involves several steps, including data preparation, selection of the clustering method(s) to apply, performing the clustering, and evaluation of the resulting clusters to determine which method produced the most useful groupings.1-3

 

Data preparation – Clustering methods generally use measures of dissimilarity between observations: larger values indicate that the observations are more different from each other or that they are farther apart in the descriptor space being used.  The descriptor space varies according to the needs of the project.  Molecular models are described in Cartesian coordinates, so distances between pairs of atoms can be used directly; RMSD or similar measures quantify differences between aggregates of atoms.  A measure based on the MCQ (mean of circular quantities) can be used for polar coordinates or torsion-angle space.  In a multidimensional descriptor space, which might include a large number of calculated molecular properties, the magnitude (norm) of the differences between the vectors of descriptors is typically used.  For connection tables or graphs that describe chemical structures, fingerprints derived from fragment keys or from the analysis of atom connectivity are represented as bit strings that are compared with Tanimoto, Jaccard, or Hamming distances or coefficients.
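
For illustration, here is a minimal R sketch of two of these preparations using the base dist function; the coords and fps variables are hypothetical stand-ins, not data from the projects:

```r
# Hypothetical inputs: 'coords' holds Cartesian coordinates, 'fps' holds
# fingerprint bit strings (as a 0/1 matrix).
coords <- matrix(rnorm(10 * 3), ncol = 3)         # 10 points in 3-D space
d_euclid <- dist(coords, method = "euclidean")    # pairwise Euclidean distances

fps <- 1 * (matrix(runif(10 * 1024), nrow = 10) > 0.9)  # 10 random bit strings
d_tanimoto <- dist(fps, method = "binary")        # Jaccard (1 - Tanimoto) distance
```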

 

Clustering Methods1,4 – Clustering methods fall into two major categories.  Hierarchical clustering proceeds iteratively, either merging observations into increasingly larger clusters (agglomerative methods) or separating the whole data set into successively smaller clusters (divisive methods).  Partitional clustering, in contrast, applies criteria supplied by the user to divide the data directly into a single set of clusters; it includes the k-means, k-nearest-neighbors (k-NN), and k-medoid (e.g. PAM) techniques.  It is better suited for large data sets than is hierarchical clustering but requires some prior knowledge of how the data should be partitioned or how many clusters should be produced.  CLARA is a partitional method that uses sampling in conjunction with PAM to handle large data sets.  The Jarvis-Patrick method, which is often used for clustering large sets of chemical-structure data, is a k-NN technique.  Clustering typically generates discrete clusters, in which each observation can belong to only one cluster.  Overlapping cluster methods allow an observation to belong to more than one cluster, and fuzzy clustering methods allow fractional memberships in clusters.
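
A hedged sketch of the partitional methods named above: kmeans() is in base R, while pam() and clara() come from the bundled cluster package.  The data x and the choice k = 4 are illustrative only; these methods need the cluster count supplied up front.

```r
library(cluster)

x <- matrix(rnorm(200 * 2), ncol = 2)   # toy two-dimensional data
km <- kmeans(x, centers = 4)            # k-means
pm <- pam(x, k = 4)                     # k-medoids (PAM)
ca <- clara(x, k = 4, samples = 10)     # PAM applied to samples, for large sets
table(km$cluster)                       # observations per k-means cluster
```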

 

Neither of the projects described herein has clear criteria that could be used in the partitional methods, both data sets are small, and discrete clusters are what is needed, so the hierarchical approach was used and overlapping and fuzzy clustering were not considered.

 

Hierarchical clustering methods2-4 – The hclust function of the R statistical package5, which was used for this work, implements agglomerative methods; divisive methods are mathematically more challenging and, while some are implemented in R, their use was not explored.  Several commonly used methods are available in the hclust function (a short R sketch applying them follows the list):

 

  • Single linkage (nearest neighbor) – The distance between clusters is taken as the shortest distance between any point in one cluster and any point in the second.  This produces arbitrarily shaped clusters but is prone to chaining, where observations are successively added to a single cluster.

 

  • Complete linkage (farthest neighbor) – The opposite of single linkage: the distance between clusters is taken as the greatest distance between any point in one cluster and any point in the second.  This tends to produce globular clusters.

 

  • Average linkage (UPGMA; “Unweighted Pair Group Method Using Arithmetic Averages”) – The distance between two clusters is taken as the average of all pairwise distances between the points in one cluster and the points in the second.

 

  • McQuitty (WPGMA; “Weighted Pair Group Method Using Arithmetic Averages”) – A mathematically simpler form of the Average Linkage method that results in the most recent additions to a cluster having more weight in the distance calculations than do earlier additions.

 

  • Centroid (UPGMC; “Unweighted Pair Group Method Using Centroids”) – The distance between two clusters is taken to be the distance between their centroids.  This can give poor results if clusters of very different sizes are merged.  As implemented in R, the centroid method uses the squares of the distances between observations as input.

 

  • Median (WPGMC; “Weighted Pair Group Method Using Centroids”) – A modification of the centroid method that mitigates the problem created by merging clusters of very different sizes.  The median method also uses the squares of the distances as input.

 

  • Ward – Instead of trying to calculate distance between clusters, the Ward methods attempt to maximize the homogeneity (minimize the variance) within each cluster.  Two variants of the method are implemented in R, one using the distance (Ward-D), the other using the square of the distance (Ward-D2) between observations to determine cluster memberships.  Ward’s methods produce globular clusters and have a strong bias for putting similar numbers of observations in each cluster.
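
As referenced above, a minimal sketch of applying these methods with hclust, assuming a dist object d is already available (for example, from the data-preparation sketch earlier):

```r
# Per the R documentation, "centroid" and "median" expect squared distances,
# so those two trees are built from d^2 and their heights are on a squared scale.
methods <- c("single", "complete", "average", "mcquitty",
             "centroid", "median", "ward.D", "ward.D2")

trees <- lapply(methods, function(m) {
  if (m %in% c("centroid", "median")) hclust(d^2, method = m)
  else hclust(d, method = m)
})
names(trees) <- methods
plot(trees[["complete"]], labels = FALSE)   # dendrogram for one method
```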

 

The single-linkage, complete-linkage, average-linkage, and McQuitty methods are graph-based and evaluate distances to neighbors to form clusters.  The centroid, median, and Ward methods are geometry-based and are influenced by the sizes and shapes of the clusters as they form.  The graph-based methods are somewhat more general; the geometry-based methods are most compatible with data that can reasonably be interpreted as geometric.

 

The application of the hierarchical clustering methods produces a cluster tree, which is a record of the iterations that the procedure went through as it merged observations and clusters.  For each step in the procedure it lists the observations or clusters that were merged as well as the distance between the clusters as measured by the particular method used.  The distance is called the cluster “height”, or the “cluster distance”.  The number of the step at which a merge occurred is its “cluster level”.

 

The cluster tree establishes relationships between the observations in the data set.  There are different numbers of clusters at each cluster level, but the clustering process does not indicate the most appropriate number of clusters.  A second operation must be performed to determine what cluster level to select for examining the clusters.
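
A short sketch, continuing from the previous one, of inspecting the tree record and cutting it; the cut level shown is a hypothetical example:

```r
# With n observations there are n - 1 merge steps, so cutting after step
# 'level' leaves n - level clusters.
hc <- trees[["complete"]]
head(hc$merge)               # observations/clusters merged at each step
head(hc$height)              # cluster distance ("height") at each step

n <- attr(d, "Size")         # number of observations in the distance matrix
level <- 100                 # hypothetical cluster level chosen from the plots
membership <- cutree(hc, k = n - level)
table(membership)            # cluster sizes at this level
```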

 

Cluster statistics1,4,6,7 – Each step in the agglomerative hierarchical clustering process forms a new cluster by merging observations or clusters produced by previous steps.  Many statistical measures exist for evaluating the nature of the new cluster, for comparing it with the clusters or observations that were merged to form it, and for comparing it with the whole population of observations.  Examination of these statistics indicates at which cluster levels to evaluate the contents of the clusters.  When a cluster level is chosen the cluster tree is “cut” at that level to show the clusters that have been formed up to that point.

 

The clustering statistics are evaluated by plotting their values against the cluster levels.  When the plotted values of all or most of the statistics become sharply worse at a particular cluster level, that level indicates a “natural” cluster break, and the cluster tree should be cut at the previous level.  Several such cluster breaks may occur in each plot; generally the first one (the lowest cluster level) is of the greatest interest.  Cluster breaks at higher levels, however, may be useful depending on the needs of the particular project.

 

Both of the examples presented in this document use six clustering statistics (a sketch computing two of them appears after the lists below):

 

Statistics that evaluate the cluster formed at a given cluster level:

 

  • Height (Cluster Distance) – The separation between the two clusters that were merged at a given step, as measured by the method specified for calculating the distance between clusters; larger values indicate that dissimilar clusters are being merged

 

  • RMSSTD – The homogeneity of the cluster formed in a given step; larger values indicate lower homogeneity

 

  • Pseudo t2 – Measures the difference between the clusters being merged at a given step; a jump in value indicates that the previous cluster level is preferred

 

  • Semipartial R2 – The loss of homogeneity upon merging the two clusters at a given step; higher values indicate that the clusters being merged are dissimilar

 

Statistics that evaluate the overall set of clusters at a given level:

 

  • R2 – Measures the proportion of the total variation in the data that is accounted for by the set of clusters at a given level; higher values indicate that the clusters are more homogeneous and better separated

 

  • Pseudo F – Evaluates the tightness of clusters; larger values suggest close-knit, better-separated structures; peaks in the plotted values suggest greater cluster separation
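
As noted above, here is a hedged sketch of two of these statistics, R2 and pseudo F (computed as the Calinski-Harabasz index), assuming x holds the observation coordinates (or MDS surrogates) and hc is one of the trees from the earlier sketch.  The remaining statistics use the same within- and between-cluster sum-of-squares bookkeeping (see reference 1).

```r
ss <- function(m) sum(scale(m, scale = FALSE)^2)   # sum of squares about the mean

cluster_stats <- function(x, hc, k) {
  cl  <- cutree(hc, k = k)
  ssw <- sum(sapply(split(as.data.frame(x), cl), ss))  # pooled within-cluster SS
  sst <- ss(x)
  r2  <- 1 - ssw / sst                                 # variance explained
  pf  <- ((sst - ssw) / (k - 1)) / (ssw / (nrow(x) - k))
  c(R2 = r2, pseudoF = pf)
}

stats <- sapply(2:50, function(k) cluster_stats(x, hc, k))  # statistics by level
```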

 

An additional evaluation uses the cophenetic distances between the observations.  Cophenetic distances are calculated from the cluster tree using the height or cluster distance measure.  They provide estimates of the distances between the observations based on the structure of the cluster tree.  If the cophenetic distances have a high correlation with the actual distances, then the cluster tree is a good reflection of the actual organization of the data. 
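
In R this check is direct with the base cophenetic function; continuing with the trees and d from the earlier sketches (for the centroid and median trees the squared distances are used, matching their input):

```r
# Values near 1 mean the tree preserves the input distances well.
coph_cor <- sapply(names(trees), function(m) {
  d_in <- if (m %in% c("centroid", "median")) d^2 else d
  cor(as.vector(d_in), as.vector(cophenetic(trees[[m]])))
})
round(coph_cor, 3)
```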

 

Cluster evaluation – Selecting criteria for deciding whether a set of clusters is useful can be one of the more challenging aspects of cluster analysis.  The criteria may be subjective or quantitative, but they ultimately determine which clustering method and which cluster level produced the most meaningful groupings for a given application, and therefore which clusters will be used for the project at hand.

 

Different clustering methods combine observations differently and at different cluster levels.  Cutting the cluster tree at the same level for different methods typically yields clusters with different members, although some clusters from different methods might be similar or even identical.  It is a mistake to assume that any one clustering method will produce clusters that are useful for all projects, or even for projects with similar data sets. 

 

There is no inherently correct set of criteria.  Conceivably, one could apply different criteria that represent different questions or hypotheses to the same data set and end up selecting cluster sets from different methods that best organize the data for the given question.

 

Both of the examples below employ relatively simple, quantitative criteria, but the second example also incorporates a more qualitative assessment.

Examples

 

Positions of crystallographic water molecules in ERK2 – In the first example, crystal structures of the serine/threonine kinase ERK2 were examined to identify the locations of water molecules that are in positions to influence the binding of a ligand to the ATP binding site, either by interacting directly with the ligand or by influencing the conformations of protein side chains that, in turn, can interact with the ligand.  Depending on how frequently water molecules appear in the various positions, they will be included in docking studies that use multiple conformations of the receptor.  The presence of water molecules in or near the ATP binding site will influence the populations of side-chain and backbone conformations used in the docking work; the water molecules also provide additional potential points of interaction for the ligand and, possibly, compete with the ligand for volume in the binding site.

 

A set of 52 crystal structures of ERK2 was downloaded from the RCSB Protein Data Bank (PDB)8,9.  Some of these structures contained more than one copy of the protein, resulting in a total of 61 instances of the ERK2 structure.  The complete three-dimensional structures of all 61 instances containing the protein, ligand, and solvent molecules were overlaid using the molecular modeling software package MOE10.  Water molecules that were within 4.5 Å of any of the ligands in the ATP-binding site were selected for the cluster analysis.  Of the 61 instances of the protein, five contained no water molecules; the remaining 56 structures yielded 772 water molecules for the cluster analysis.  The distances between all pairs of oxygen atoms in the water molecules were measured and exported to the statistical package R for clustering.

Cluster analyses were performed in R using the single-linkage, average-linkage, complete-linkage, McQuitty, centroid, median, Ward-D, and Ward-D2 methods.  The five clustering statistics other than the height were calculated for each of the resulting cluster trees.  Figure 1 shows plots of those statistics along with the cluster distance (height) for the last 150 clustering steps for all of the methods except single linkage; the last 200 steps were plotted for the single-linkage method to include obvious cluster breaks that occur earlier in the agglomeration process.

Figure 1.  Plots of the clustering statistics for all eight clustering methods.  See text.

[Thumbnail panels, one per method: Average, Centroid, Complete, Median, McQuitty, Ward, Single, Ward.D2]

Generally speaking, cluster breaks occur at the level immediately before a spike in pseudo-t2, an increase in RMSSTD, a spike in semipartial R2, an increase in the height or cluster distance, a drop in pseudo-F, and a drop in R2.  Cluster breaks are suggested by simultaneous changes in a majority of the statistics, but not every statistic has an obvious change at every apparent cluster break.  For purposes of this project, changes must occur in at least four of the six statistics to indicate a cluster break.  Additional cluster breaks might be proposed, but those presented in figure 1 represent the most distinct breaks for each cluster method.

 

The clustering statistics for the single-linkage method show many sharp breaks with a faster drop-off of the cluster homogeneity (R2) than the other methods.  Conversely, the statistics for the Ward methods show few candidate break points until much higher cluster levels and maintain higher cluster homogeneity than the other methods. 

 

The final task is to determine which cluster level from which method(s) provides the most useful set of natural clusters for the project.  The plots in figure 1 show the levels at which each cluster tree could be cut to produce natural clusters.  In this case the levels of interest are those that produce clusters containing water molecules from at least 25% of the structures while having the fewest clusters that contain multiple waters from the same structure.  The minimum of 25% is arbitrary, but only positions that have substantial occupancy are of interest.  If a cluster contains more than one water molecule from a single structure, it suggests that multiple water positions are being merged.  Table 1 lists the fraction of structures present in each cluster for the first cluster break in each of the methods, as well as how many structures are duplicated in each cluster.
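
A hedged sketch of this bookkeeping, assuming membership is a cutree assignment for the 772 waters and struct_id is a hypothetical vector naming the source crystal structure of each water:

```r
# Report clusters covering at least 25% of the 56 structures; 'dup_str'
# counts structures contributing more than one water to the same cluster.
summarize_clusters <- function(membership, struct_id, n_struct = 56) {
  do.call(rbind, lapply(split(struct_id, membership), function(s)
    data.frame(waters   = length(s),
               frac_str = length(unique(s)) / n_struct,
               dup_str  = sum(table(s) > 1))))
}

tab <- summarize_clusters(membership, struct_id)
tab[tab$frac_str >= 0.25, ]
```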

 

The complete-linkage method and the centroid method both produce the cleanest cluster separations; each has only one cluster that contains a duplicate structure.  The clusters from the median method contain two duplicated structures and those from the McQuitty method contain three duplicates; the other methods have between 27 and 83 duplicates.

Table 1.  Summaries of the cluster memberships from the earliest cluster break for each of the clustering methods.  The clusters listed must each represent at least 25% of the crystal structures included in the analysis; note that each structure contains at least several water molecules.  The number of duplicate structures indicates how many structures have more than one water molecule in a given cluster.

Table 2 shows that the cophenetic distances calculated from the tree from the single-linkage method are the least similar to the original data.  The cophenetic distances calculated for the rest of the methods have similar levels of correlation with the original data, with the Ward methods at the lower end of the range and the average-linkage and complete-linkage methods at the higher end.

 

The clusters at level 653 of the complete-linkage method generally contain more observations than do those from level 630 of the centroid method, as indicated by the fractions of structures represented listed in table 1, and so were selected for use in the project.  It is likely that the clusters from the centroid, median, and McQuitty methods also would be usable.

Table 2.  Correlation Between the Original and Cophenetic Distances

The complete-linkage method at cluster level 653 produces thirteen clusters that represent the locations at which water molecules are found in the ATP binding site of at least 25% of the ERK2 crystal structures from the PDB.  Figure 2 shows the positions of the clustered water molecules in the ERK2 ATP site.  The water molecules in cluster 2 (light yellow spheres) lie deep in the site; they appear in most (86%) of the crystal structures and form interactions with protein side chains as well as with the ligand.  Water molecules in cluster 37 (cyan) are also deep in the site and are positioned to interact with the ligand; they also occupy volume that is used by some ligands.  Clusters 4 (purple) and 22 (light purple) are near the opening of the site, and waters in these positions can interact with both a ligand and protein side chains.  Water molecules in the clusters at the lower right of the image in figure 2 could possibly interact with a large ligand, but they form many interactions with hydrophilic protein side chains and might help to organize the structure of the site.  Clusters 5 (dark yellow) and 16 (light blue) are at the periphery of the site, and waters in these positions are unlikely to affect the binding of any but the largest ligands.  For the project, water molecules will be placed in various combinations of these positions when generating conformations of the ERK2 binding site for use in multicopy docking studies.

Figure 2.  Location of the water clusters from the complete linkage method, level #653.  The clusters are labeled by cluster ID and the fraction of structures represented.  The ERK2 structure is depicted as a backbone ribbon and the kinase motifs are identified by the Kinase Annotate function in MOE: green – G-loop; orange – Hyd1; magenta – α-C helix; yellow – hinge; red – catalytic subunit; and cyan – DFG motif.

Structurally similar kinases with diverse sequences – The second example entails clustering serine/threonine kinase crystal structures to identify a set that contains similar three-dimensional structures with diverse amino acid sequences.  The sequences will be used in an evaluation of sequence alignment methods with the superposed crystal structures providing the “correct” alignments.

 

A working set of crystal structures was obtained by searching the NCBI database11 using the sequence of human p38 MAPK (NP_001306) as a BLASTP query to find related serine/threonine kinases for which crystal structures are available in the PDB.  The search was limited to 1000 hits and returned sequences that ranged from 22% to 100% identity with p38 MAPK.  The 1000 hits mapped to 2460 crystal structures in the PDB.  Many of these contain other proteins in addition to the kinase or a kinase fragment.  The crystal structures were loaded into MOE in groups of about 500 for processing.  The actual kinase protein structures were identified using the Annotate Kinase function in MOE and all non-kinase molecules were removed.  The structures were superposed in MOE and the N- and C-terminal portions of the structures were truncated as necessary to leave just the catalytic domain.  Ultimately, 289 structures of serine/threonine kinases were retained that had no missing amino acids in the catalytic domain. 

 

The metric used for clustering was the RMSD between the backbone atom positions of the crystal structures as calculated by the MOE superposition routine.  As with the first example, all eight clustering methods in the R program’s hclust function were applied.  The clustering took only a couple of seconds.

 

The calculation of the cluster statistics was less straightforward than in the first example.  Several of the cluster statistics use quantities that are derived directly from the coordinates of the observations.  Transferring the crystal coordinates from MOE to R for this purpose is cumbersome, and the mean coordinates of the structures in each cluster would depend on the structure alignments and have questionable physical meaning.  To enable the calculation of the cluster statistics, multidimensional scaling (MDS) was applied to create surrogate coordinates for the observations from the RMSD distance matrix.
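
A minimal sketch of generating the surrogate coordinates with classical MDS via base R's cmdscale, assuming d_rmsd holds the 289 × 289 RMSD values exported from MOE as a dist object:

```r
dims <- c(3, 5, 7, 9, 11)                                # dimensionalities examined
mds_coords <- lapply(dims, function(k) cmdscale(d_rmsd, k = k))
names(mds_coords) <- paste0("dim", dims)
x7 <- mds_coords[["dim7"]]   # the 7-D coordinates used below for the statistics
```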

 

Surrogate coordinates were generated with MDS for three, five, seven, nine, and eleven dimensions.  The clustering statistics were calculated using the surrogate coordinates for the cluster tree from each clustering method and for each dimensionality.  Figure 3 shows the plots of the cluster statistics for the average-linkage method for each of the MDS dimensions.  The plots of the statistics for three dimensions show fewer features than do those for the other dimensionalities.  The plots for five, seven, nine, and eleven dimensions exhibit similar features, with the sharpest features appearing at seven and nine dimensions.  The plots of the statistics for the other clustering methods showed a similar pattern, and the statistics generated for seven dimensions were used for determining the cluster breaks.

Figure 3.  Cluster statistics calculated for the average linkage method from MDS surrogate coordinates using 3, 5, 7, 9, and 11 dimensions.

[Thumbnail panels: 3, 5, 7, 9, and 11 dimensions]

The plots in figure 4 show the cluster breaks suggested for the kinase sequence clustering example by the clustering statistics with seven MDS dimensions.  

Figure 4.  Cluster statistics calculated for all clustering methods from the MDS surrogate coordinates with seven dimensions.

[Thumbnail panels, one per method: Average, Centroid, Complete, Median, McQuitty, Ward, Single, Ward.D2]

Table 3. Correlation Between Original and Cophenetic Distances

As shown in table 3, the correlations between the distances used in the clustering and the cophenetic distances are similar for most of the methods.  In this example the graph-based methods have at least slightly higher correlations than do the geometry-based methods.  The correlations for the Ward methods are somewhat weaker than the others. 

 

For this project, clusters were formed on the basis of the three-dimensional similarity of the protein crystal structures, but the ultimate goal is to identify clusters of structurally similar proteins that have low sequence similarity.  The similarities between the sequences within each cluster were measured by the average pairwise Blosum90 scores and by the average pairwise sequence identities.  Table 4 lists for each method the clusters that contain at least 20 of the 289 structures analyzed.  Clusters are listed for the earliest cluster break for each method shown in figure 4.
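
A hedged sketch of the identity calculation for one cluster, assuming seqs is a hypothetical character vector of equal-length, truncated sequences; the Blosum90 averages were computed analogously from pairwise alignment scores:

```r
avg_identity <- function(seqs) {
  m <- do.call(rbind, strsplit(seqs, ""))   # one row of residues per sequence
  pairs <- combn(nrow(m), 2)                # all pairs of sequences
  mean(apply(pairs, 2, function(p) mean(m[p[1], ] == m[p[2], ])))
}
```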

 

All of the methods generated clusters that contain structures having differing degrees of sequence similarity.  Each method includes one cluster for which the sequence similarity is low compared to the other clusters, as measured by the Blosum90 and Fraction ID scores.  The single-linkage method gave the largest cluster with the lowest values for each of these measures, so it was examined further.

Table 4.  Clusters that contain at least 20 of the 289 kinase structures from both the graph-based and the geometry-based clustering methods.  The values in the Blosum90, Fraction ID, and RMSD columns are the averages of all pairwise comparisons between the members of each cluster.

Cluster 1 from cluster level #192 of the single-linkage method contains the fewest duplicate sequences (Fraction ID) and the lowest degree of amino acid similarity (Blosum90).  Compared to the corresponding clusters from the other methods it has the most members and a similar RMSD between its member structures.  Cluster 1 contains 32 structures which, after truncation of the N- and C-termini to leave a common core, correspond to 12 unique amino acid sequences.  These represent orthologs and point mutations of the kinases PKAα, PKBΘ, and AKT2 (PKBβ), all from the AGC protein kinase group.  The amino acid differences are scattered along the lengths of the sequences.  Table 5 lists the contents of the cluster.  Sequence P31751 (AKT2) has a three-residue insertion relative to the rest of the sequences in the cluster; otherwise the sequences are all the same length.

Table 5. Membership of single-linkage #192 cluster 1.

Cluster 3 is also of interest.  It contains 32 structures that correspond to six unique sequences after the N- and C-termini are truncated to give a common core.  The cluster contains orthologs and point mutations of the kinases CK2α, CK2α’, and CK2β from the CK2 subfamily.  None of the sequences contain gaps or insertions, and the residue differences are distributed along the entire sequence as they were in cluster 1.

Table 6. Membership of single-linkage #192 cluster 3.

Cluster 23, while not of direct use for the project, is worth noting because it is identical to a cluster found by all of the clustering methods except Ward-D.  It contains 31 structures of the kinase PIM-1, some of which contain the single mutation R250G.  The PIM-1 structure is distinct enough to be recognized by almost all of the methods, and the cluster persists through many levels of the agglomerative process.  The structures in this cluster have the lowest RMSD, and its sequences the highest similarity scores, of any of the clusters.

 

Figure 5 shows the overlay of the structures for clusters 1, 3, and 23 from the single-linkage method cluster level 192.

Figure 5. Superposed crystal structures for clusters 1, 3, and 23 from single-linkage cluster level 192.  The images for clusters 1 and 3 contain one representative of each unique sequence; the image for cluster 23 contains all of the structures in the cluster.

Clusters 1 and 3 from single-linkage level 192 provide suitable sets of sequences for use in the project.  If necessary, however, additional sequences can be obtained by considering clusters at higher break levels to increase the sequence diversity.  Clusters at higher break levels contain more structures with lower similarity as measured by RMSD, and sequences that are more diverse but that include amino acid insertions or deletions.  In particular, cluster levels 232 and 236 of the single-linkage method and cluster level 240 of the average-linkage method produce a cluster that contains all of the structures in cluster 1 plus additional structures representing up to six additional unique sequences.  The additional sequences introduce more sequence variation over the length of the alignment but contain numerous insertions of one to six residues that complicate the mapping of the sequence alignment results to the structural superposition.

 

Finally, figure 6 shows where the sequences from the single-linkage level #192 clusters fall on a human kinome phylogenetic tree12.  Cluster 1 contains sequences from the AGC group, as listed in table 5.  The sequences from cluster 3 correspond to the CK2 subfamily, which is phylogenetically close to the CMGC group.  Cluster 23 corresponds to the sequence of a single gene, PIM1.

Figure 6.  Clusters 1, 3, and 23 from level #192 of the single-linkage clustering run mapped onto the human kinome phylogenetic tree (cf. Manning12).

Summary and Conclusions

 

A general strategy for applying hierarchical clustering methods was applied to two projects.  Although both projects pertain to three-dimensional structures of serine/threonine kinases, they addressed different questions and used different criteria for determining which clusters were most useful.  The clustering strategy ultimately found that different clustering methods gave the best clusters for the two projects.

 

The strategy involves several steps (a compact R skeleton follows the list):

  • Generate an appropriate distance matrix for the data
  • Apply multiple hierarchical clustering methods to the distance matrix
  • Calculate a set of statistics for each clustering method that suggest natural cluster breaks in the hierarchy
  • Select the cluster break levels of interest and record which observations belong to the clusters formed at the given break level
  • Apply criteria for selecting the set of clusters that is most useful for the project
  • Based on the criteria, select the method that provides the most useful clusters
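
As referenced above, a compact, hedged R skeleton of the strategy, reusing names from the earlier sketches; the cut level and criteria are placeholders:

```r
d     <- dist(coords)                                        # 1. distance matrix
trees <- lapply(methods, function(m) hclust(d, method = m))  # 2. all methods
names(trees) <- methods
# (squared-distance caveat for the centroid and median methods omitted here)
# 3. compute and plot statistics per method and level, e.g. with cluster_stats()
members <- lapply(trees, cutree, k = 20)                     # 4. cut at candidate breaks
# 5./6. apply the project criteria (e.g. summarize_clusters) and select a method
```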

 

In practice the first four steps don’t change very much for different data sets.  MDS was used in the second example to enable the calculation of the clustering statistics, but these steps are largely automated and transferable between data sets.  The selection of the cluster break levels is manual but straightforward; the other operations in these steps are done by scripts.

 

The run times depend on the software, the computer system being used, and the availability of parallelization.  The larger of the two examples had 772 observations, and the R script that calculated the distance matrix from the water molecule coordinates, clustered the data using all eight methods, and generated all of the clustering statistics took just over 10 minutes on a single core of a 3.2 GHz i7-3930K processor.  The distance calculation and the cluster statistics calculations scale as N(N-1)/2: 772 observations give 297,606 pairs, while 10,000 observations give 49,995,000 pairs, roughly 170 times as many, suggesting that a set containing 10,000 observations would take one to two days on the same system.

 

The selection of the criteria for deciding what constitutes a “useful” or “good” set of clusters is the more challenging part of the process.  In the examples presented above, establishing and implementing the criteria were the most time-consuming operations.  The cluster membership information generated in R for each cluster method and cluster level of interest had to be transferred to MOE, and custom scripts had to be prepared for each project to produce the summary information in tables 1 and 4.  The information in these tables enabled the final selections of the cluster method and cluster level.

 

The differences in the clusters resulting from different methods can be subtle, and more than one method can produce usable clusters.  In the first example the complete-linkage method gave the results that best met the established criteria, but the centroid, median, and McQuitty methods also produced clusters that could have been used for the project.  In the second example the clusters from the single-linkage method best matched the project’s criteria, though clusters from other methods might have been used instead with acceptable results.  The second example had more qualitative criteria for evaluating the clusters and so had more leeway for considering different methods and higher clustering levels.

 

The performance of different methods can vary dramatically with different data sets.  The single-linkage method, for example, performed best for the protein structure clustering project, with the high cophenetic correlation suggesting that the cluster tree does a good job of reproducing the structure of the data.  For the water-position clustering project, on the other hand, the single-linkage method generated clusters that contained many duplicates and had the worst cophenetic correlation.  In neither example did the popular Ward’s methods produce clusters that best met the project’s criteria, and in both cases the cophenetic correlations for the Ward’s methods were lower than for most of the other methods.  That the cophenetic correlation coefficients were generally lower in the first example than in the second suggests that the data set in the first example is noisier and contains more observations that are not well described by the cluster trees.

 

The two examples presented here illustrate the application of hierarchical clustering to small data sets for which discrete clusters are desired.  The examples show the empirical nature of clustering and the use of project-specific criteria for determining which clustering method provides the most useful results.  The effort of running multiple clustering methods and applying customized criteria to each project makes it tempting to pick a favorite method and apply it to all problems.  In truth, the examples suggest that more than one method can provide usable results for the same project.  The decision of how stringently to apply the procedures described herein depends on the needs of the project and how the clustered data are to be used.  The first example sought quantitative positions for water molecules and had specific criteria for evaluating the clusters, so higher stringency was warranted.  The second example had more qualitative goals and, while a stringent approach gave a very good set of clusters, it is clear that methods other than the one selected would have given usable, though somewhat noisier, clusters as well.

 

Cluster analysis can be a very useful tool for organizing and summarizing observations in many types of data sets used in computational chemistry.  Its use is complicated by the need to choose between many methods and to select some optimal number of clusters to use.  It is essential to have well-defined criteria for evaluating what constitutes a set of clusters that is most meaningful for each particular application.  The application of a systematic process for selecting the most appropriate clustering method and cluster level, such as described here, can provide optimal results for data sets as large as 10000 observations.

References

 

1.  SAS Institute Inc. "Introduction to Clustering Procedures" In: SAS/STAT® 13.2 User's Guide; SAS Institute Inc.: Cary, NC, 2014; Chapter 11.

 

2.  Romesburg, H. C. "Cluster Analysis for Researchers"; Lulu: North Carolina, 2004.

 

3.  Sharma, S. "Cluster Analysis" In: Applied Multivariate Techniques; Wiley: Hoboken, NJ, 1996; Chapter 7.

 

4.  Gan, G.; Ma, C.; Wu, J. "Data Clustering: Theory, Algorithms, and Applications"; American Statistical Association: Philadelphia, PA, 2007.

 

5.  R Core Team "R: A Language and Environment for Statistical Computing"; R Foundation for Statistical Computing: Vienna, Austria, 2014.

 

6.  Halkidi, M.; Batistakis, Y.; Vazirgiannis, M. "Cluster Validity Methods: Part I" ACM SIGMOD Record 2002, 31 (2).

 

7.  Halkidi, M.; Batistakis, Y.; Vazirgiannis, M. "Cluster Validity Methods: Part II" ACM SIGMOD Record 2002, 31 (3).

 

8.  Berman, H. M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T. N.; Weissig, H.; Shindyalov, I. N.; Bourne, P. E. "The Protein Data Bank" Nucleic Acids Res 2000, 28 (1), 235-42.

 

9.  http://www.rcsb.org

 

10.  Molecular Operating Environment (MOE), 2013.08; Chemical Computing Group Inc., 1010 Sherbrooke St. West, Suite #910, Montreal, QC, Canada, H3A 2R7, 2013.

 

11.  http://www.ncbi.nlm.nih.gov/

 

12.  Manning, G.; Whyte, D. B.; Martinez, R.; Hunter, T.; Sudarsanam, S. "The protein kinase complement of the human genome" Science 2002, 298 (5600), 1912-34.

About the Author: David Moreland, Ph.D.

 

Dr. Moreland has over 21 years of research experience in a large pharmaceutical company and an additional 7 years of consulting experience working with biotech, academic, and large pharma clients. With broad expertise in molecular modeling, cheminformatics/QSAR, and structural bioinformatics, he has made key contributions to projects in several therapeutic areas, including CNS, cardiovascular, antiretroviral, and antibacterial, among others. He has applied both ligand-based and structure-based techniques to help teams identify numerous early clinical candidates, including a phase II compound. He has worked on GPCR, nuclear hormone receptor, protease, and RNA targets, with both small molecules and biologics.
