Cluster Analysis of Microarray Data
This document is an introduction to cluster analysis tools that can be applied to raw microarray data. It is intended to be used as a guide and summary of general principles and applications of the selected programs, not a detailed manual of the specific parameters or statistics of the programs. That information can be obtained from the manuals and publications associated with each program (hyperlinks will be given where appropriate).
Background
A common approach in microarray studies is to measure the expression of the same genes across several experiments (e.g., different patients, time points, etc.). The end result is often an expression matrix of high dimensionality. For example, if you have 10 experiments measured across 8,000 genes, you have an 8,000 by 10 matrix (8,000 genes, each described by 10 values). To detect patterns in such data, researchers traditionally use methods that reduce the data to a form that can be displayed in just two dimensions (along an x and y axis). Many methods exist to accomplish this task, but I will focus on just one: cluster analysis.
Cluster analysis
Gene Cluster 3.0 (de Hoon et al. 2004) is an improved version of the program Cluster developed by Michael Eisen (Eisen et al. 1998). It runs on Windows, Mac OS X, and Linux/Unix, and is also available in command-line, Perl, and Python versions. It can be downloaded for free from http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/ . Cluster 3.0 includes several cluster analysis options: hierarchical, k-means, and self-organizing map (SOM). A principal component analysis (PCA) can also be calculated. A manual is available under the “help” menu at the main interface.
Using Cluster 3.0
Raw microarray data should be in a tab-delimited text file with genes in rows and experiments (individual arrays) in columns. This program is more appropriate for data sets with multiple experiments than for simple one-experiment vs. control data. Once your data set is loaded, you can select several filtering options that reduce the size of the data set by excluding unwanted rows, such as genes that do not show a significant difference in expression values between the control and experimental groups. Your filtered data set can then be adjusted by log transformation (a good idea when dealing with ratios), centered, and normalized. The data are now ready to be analyzed using one of the three available clustering algorithms. The purpose of clustering is to group genes with significant changes in expression levels that behave similarly under different conditions. A hierarchical clustering run in Cluster 3.0 writes up to three output files, with extensions *.cdt, *.gtr (gene tree), and *.atr (array tree); self-organizing maps instead write node files such as *.gnf. The file with extension *.cdt is appropriate for viewing in the program Treeview (introduced below).
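The preparation steps described above (filter, log transform, center, normalize) can be sketched in a few lines of Python. This is only an illustration of the idea, not the code Cluster 3.0 runs: the function names, the range-based filter, and the choice to scale each gene to a sum of squares of 1 are my assumptions, and missing values are represented here by None.

```python
import math

def filter_by_range(matrix, min_range):
    """Keep genes (rows) whose max - min over observed values is at least min_range."""
    kept = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        if observed and max(observed) - min(observed) >= min_range:
            kept.append(row)
    return kept

def log2_transform(matrix):
    """Replace each positive ratio with its base-2 logarithm; other values become None."""
    return [[math.log2(v) if v is not None and v > 0 else None for v in row]
            for row in matrix]

def center_rows(matrix):
    """Subtract each gene's mean from its observed values."""
    centered = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        mean = sum(observed) / len(observed) if observed else 0.0
        centered.append([v - mean if v is not None else None for v in row])
    return centered

def normalize_rows(matrix):
    """Scale each gene so that the sum of squares of its values is 1 (assumed convention)."""
    normalized = []
    for row in matrix:
        norm = math.sqrt(sum(v * v for v in row if v is not None))
        normalized.append([v / norm if v is not None and norm > 0 else v for v in row])
    return normalized

# Hypothetical expression ratios: three genes measured in three experiments.
ratios = [[1.0, 2.0, 4.0], [1.0, 1.0, 1.0], [0.5, 1.0, 2.0]]
filtered = filter_by_range(ratios, min_range=1.0)   # drops the flat second gene
logged = log2_transform(filtered)
prepared = normalize_rows(center_rows(logged))
```

Note that the first and third genes end up with identical prepared profiles: after log transformation and centering, a gene that doubles at each step looks the same regardless of its starting ratio, which is exactly the kind of shared behavior clustering is meant to detect.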
The hierarchical clustering algorithm is very similar to the average-linkage method developed by Sokal and Michener (1958). A similarity matrix is first computed for all pairs of genes using one of the nine available options (uncentered correlation is the default). The matrix is then scanned, and the two most similar genes (those with the highest value) are joined at a node. This two-gene node is placed back into the matrix, replacing the two original elements that were connected, and its similarity to each remaining element is computed. The process is repeated n-1 times, where n is the number of genes, until all genes are linked in a single dendrogram (tree relationship) (Eisen et al. 1998). The output file containing the tree can then be viewed graphically in the Treeview program.
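The merge loop described above can be illustrated with a small pure-Python sketch. It is a simplified stand-in rather than Cluster 3.0's implementation: the function names are mine, the profiles are assumed to have no missing values, and the similarity between two clusters is taken as the average uncentered correlation over all pairs of their member genes.

```python
import math

def uncentered_correlation(x, y):
    """Cosine of the angle between two expression profiles."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def average_linkage(profiles):
    """Return the list of merges as (id_a, id_b, similarity).

    Genes are numbered 0..n-1; the node created by merge k gets id n + k.
    """
    n = len(profiles)
    members = {i: [i] for i in range(n)}
    sim = {(i, j): uncentered_correlation(profiles[i], profiles[j])
           for i in range(n) for j in range(i + 1, n)}
    merges = []
    next_id = n
    while len(members) > 1:
        # Scan the matrix for the most similar pair and join it at a node.
        (a, b), best = max(sim.items(), key=lambda item: item[1])
        merges.append((a, b, best))
        merged = members.pop(a) + members.pop(b)
        # Remove the two joined elements from the matrix.
        sim = {pair: s for pair, s in sim.items() if a not in pair and b not in pair}
        # Put the new node back, scored against every remaining element.
        for other, other_members in members.items():
            total = sum(uncentered_correlation(profiles[p], profiles[q])
                        for p in merged for q in other_members)
            sim[(other, next_id)] = total / (len(merged) * len(other_members))
        members[next_id] = merged
        next_id += 1
    return merges

# Hypothetical profiles: genes 0 and 1 rise together, gene 3 is their mirror image.
profiles = [[1, 2, 3], [2, 4, 6], [3, 2, 1], [-1, -2, -3]]
merges = average_linkage(profiles)
```

For n genes the loop always records n-1 merges, and the list of merges contains everything needed to draw the dendrogram.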
In k-means clustering, you specify the number of clusters, K, in advance. The algorithm randomly assigns each gene to one of the K clusters, calculates the centroid of each cluster, and then reassigns any gene that is found to be closer to the centroid of another cluster; this is repeated until no gene changes cluster. This is a very fast algorithm, but it always reports exactly the K clusters that were requested, and it does not link them together as hierarchical clustering does. Because the appropriate number of clusters is rarely known beforehand, the method is most useful if you try several different values of K and compare the results.
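A minimal sketch of this procedure is shown below. It assumes Euclidean distance, complete data, and K no larger than the number of genes; the function names and the seed parameter are illustrative and are not taken from Cluster 3.0.

```python
import random

def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def centroid(rows):
    """Column-wise mean of a list of profiles."""
    return [sum(column) / len(rows) for column in zip(*rows)]

def k_means(profiles, k, seed=0, max_iter=100):
    """Return (labels, centroids) after iterative reassignment."""
    rng = random.Random(seed)
    order = list(range(len(profiles)))
    rng.shuffle(order)
    labels = [0] * len(profiles)
    # Random initial assignment that never leaves a cluster empty.
    for position, gene in enumerate(order):
        labels[gene] = position % k
    centroids = [None] * k
    for _ in range(max_iter):
        # Recompute each centroid; an emptied cluster keeps its previous centroid.
        for c in range(k):
            rows = [p for p, label in zip(profiles, labels) if label == c]
            if rows:
                centroids[c] = centroid(rows)
        # Move every gene to the cluster with the nearest centroid.
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                      for p in profiles]
        if new_labels == labels:
            break  # no gene changed cluster
        labels = new_labels
    return labels, centroids

# Hypothetical data: two well-separated groups of three genes each.
profiles = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = k_means(profiles, k=2)
```

Because the starting assignment is random, which group receives label 0 and which receives label 1 depends on the seed. On less cleanly separated data, different seeds can also produce different groupings, so it is worth repeating the run.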
The two previous methods cover most of the clustering needs that you will encounter. However, Cluster 3.0 also offers the self-organizing maps (SOM) (Kohonen, 1995). SOM constrains the data in a two-dimensional space (instead of multidimensional), and then organizes itself to best accommodate the data on the grid.
Principal component analysis (PCA) is another method that, like cluster analysis, reduces data to two dimensions so that patterns are tractable. Its purpose is to capture as much variation in the data as possible in only two dimensions (see Knudsen, 2002 for a nice summary of cluster analysis and PCA). Although this option is available in Cluster 3.0, it is not classified as a clustering algorithm.
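To make "capturing as much variation as possible in two dimensions" concrete, here is a small sketch that finds the first two principal components of a data set by power iteration on the covariance matrix. It is one of several ways to compute a PCA and is not necessarily how Cluster 3.0 does it; the function names are mine, and the simple starting vector would fail in the unlikely case that it is exactly perpendicular to a component.

```python
import math

def mat_vec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def power_iteration(matrix, iterations=500):
    """Approximate the largest eigenvalue and its eigenvector for a symmetric matrix."""
    d = len(matrix)
    v = [1.0 / (i + 1) for i in range(d)]  # arbitrary non-zero starting vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = mat_vec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance left in any direction
        v = [x / norm for x in w]
    value = sum(a * b for a, b in zip(v, mat_vec(matrix, v)))
    return value, v

def first_two_components(rows):
    """Return (components, scores): two unit vectors and each row's coordinates on them."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    components = []
    for _ in range(2):
        value, vector = power_iteration(cov)
        components.append(vector)
        # Deflation: remove the variance already explained by this component.
        cov = [[cov[i][j] - value * vector[i] * vector[j] for j in range(d)]
               for i in range(d)]
    scores = [[sum(c * v for c, v in zip(row, comp)) for comp in components]
              for row in centered]
    return components, scores

# Hypothetical data: most of the variation lies along the first axis.
samples = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
components, scores = first_two_components(samples)
```

Each row of scores gives the two coordinates you would plot on the x and y axes. The sign of a component is arbitrary, so a plot may appear mirrored between implementations.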
Treeview
Treeview (Eisen et al. 1998) is a companion program that allows you to graphically browse results of clustering and other analyses from Cluster. It supports tree-based and image-based browsing of hierarchical trees and offers multiple output formats for generating images for publication. Treeview is also available for download from the Eisen lab ( http://rana.lbl.gov/EisenSoftware.htm ).
After Cluster 3.0 has finished analyzing the gene expression data, the *.cdt output file from Cluster 3.0 (or earlier versions) can be loaded into Treeview. This program creates a color matrix that represents the log-transformed gene expression ratios (Cy5/Cy3 fluorescence ratio; see Fig. 1). Default colors are: green = negative log ratios, red = positive log ratios, and black = a log ratio of 0 (no change in expression). Clicking on any colored cell or location within the tree opens an enlarged panel that highlights the details of the selected gene(s).

Fig. 1. A color-coded dendrogram loaded in Treeview. Each cell represents the measured Cy5/Cy3 fluorescence ratio after log transformation. Green represents negative log ratios and red represents positive log ratios.
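The color scheme can be illustrated with a short function that maps a log ratio to a red/green/black color. This is only a sketch of the idea: the function name and the contrast parameter (the log ratio at which a cell becomes fully saturated) are my assumptions and are not read from Treeview's source.

```python
def log_ratio_to_rgb(log_ratio, contrast=3.0):
    """Map a log ratio to an (R, G, B) tuple: red if positive, green if negative, black at 0."""
    # Clamp to [-1, 1] so values beyond the contrast limit are fully saturated.
    scaled = max(-1.0, min(1.0, log_ratio / contrast))
    intensity = round(abs(scaled) * 255)
    if scaled > 0:
        return (intensity, 0, 0)   # induced: shades of red
    if scaled < 0:
        return (0, intensity, 0)   # repressed: shades of green
    return (0, 0, 0)               # unchanged: black

unchanged = log_ratio_to_rgb(0.0)
strongly_induced = log_ratio_to_rgb(3.0)
strongly_repressed = log_ratio_to_rgb(-6.0)
```

With these settings, a log ratio at or beyond the contrast value in either direction is drawn at full brightness, and smaller changes appear as darker shades.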
Other clustering and tree-viewing software:
Maple Tree: http://rana.lbl.gov/EisenSoftware.htm
TIGR MultiExperiment Viewer (MeV): http://www.tigr.org/software/tm4/mev.html
J-Express pro: http://www.molmine.com/frameset/frm_jexpress.htm
GeneCluster (PC, Mac and Unix): http://www-genome.wi.mit.edu/MPR/
Expression Profiler (web server and Linux version): http://ep.ebi.ac.uk
References
De Hoon M.J.L., S. Imoto, J. Nolan, and S. Miyano. 2004. Open source clustering software. Bioinformatics, 20:1453-1454 (download pdf from http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/index.html )
Eisen M.B., P.T. Spellman, P.O. Brown, and D. Botstein. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, 95:14863-14868.
Knudsen S. 2002. A biologist's guide to analysis of DNA microarray data. John Wiley & Sons, Inc., New York, 125 pp.
Sokal R.R. and C.D. Michener. 1958. A statistical method for evaluating systematic relationships. Univ. Kans. Sci. Bull., 38:1409-1438.