Statistical analysis of global gene expression data: some practical considerations
Ted Holzman and Eugene Kolker
BIATECH, 19310 North Creek Parkway, Suite 115, Bothell, WA 98011, USA
Available online 9 January 2004.
Introduction
The technological breakthroughs that precipitated the advent of modern DNA microarrays are now about a decade old [1]. Five years ago Nature Genetics dedicated an entire issue to this emerging technology, with an enthusiastic summary written by Lander [2]. By and large, microarrays are beginning to live up to those predictions, with ~4500 published papers, dozens of software packages and an unstinting flood of mathematical techniques, experimental designs, novel applications and improved technologies. Microarrays have proved useful in a wide variety of fields, especially in the analysis of gene expression. Despite the intensity of research and development, however, the study of global gene expression remains nontrivial. This review does not try to cover recent developments in the statistical analysis of global gene expression data (SAGGED), but rather emphasizes several approaches aimed at improving the capabilities of researchers in their daily work. For detailed information on different SAGGED-related issues, readers can consult some excellent books [3, 4, 5, 6, 7 and 8].
Applying appropriate error models and conservative estimates to microarray data helps to avoid the generation of bovine scatus (GOBS; K Nealson, unpublished), reduces the number of false predictions [9], and allows one to focus on biologically relevant observations. This review touches on the typical microarray platforms and their uses and considers a small subset of statistically sound approaches to data processing, analysis and experimental design. We also discuss the problem of making biological sense out of SAGGED and conclude with a few suggestions for further consideration.
Platforms
DNA microarrays can be separated into two categories based on the nature of the target (spotted) DNA: cDNA or DNA oligomers. cDNA is typically derived from a library, whereas DNA oligomers are synthesized in situ or synthesized externally and made to adhere to the array. The cDNA clones, by the nature of their extraction and amplification, are likely to be genuine products of transcription. As artificial constructs of defined sequences, oligomers can probe for very fine differences in nucleotide sequences with great sensitivity. Open source software for oligomer and cDNA array analyses, such as Chipinfo and lcDNA [10, 11, 12 and 13], is extremely useful, especially for newcomers.
Microarrays can be further subdivided into two classes on the basis of the probe samples that are hybridized to their targets. Single-fluor methods measure the intensity of the fluorescent signal from each spot on the array. Fluorescent intensity is assumed to be (cor)related to the degree of hybridization between the sample probe and the array target, which in turn is related to the concentration of each species of probe cDNA in the sample. In two-fluor methods, one co-hybridizes equimolar aliquots of cDNA from two different samples (e.g. a control and a treatment sample), each labeled with a different fluor. The observations here typically reflect the difference in probe concentrations between the two samples, expressed as (log) ratios, fold-changes or similar measures.
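As a concrete illustration of how two-fluor measurements are usually summarized, the following minimal Python sketch turns hypothetical background-corrected channel intensities into ratios, log2-ratios and signed fold-changes. The numbers and the symmetric fold-change convention are illustrative assumptions, not taken from any particular platform.

```python
import numpy as np

# Hypothetical background-corrected intensities for a handful of spots on a
# two-fluor array: channel 1 = control (e.g. Cy3), channel 2 = treatment (e.g. Cy5).
control = np.array([1200.0, 850.0, 430.0, 9700.0])
treatment = np.array([2400.0, 790.0, 1300.0, 9500.0])

# Two-fluor observations are commonly summarized per spot as ratios,
# log2-ratios or fold-changes of the two channels.
ratio = treatment / control
log2_ratio = np.log2(ratio)
# A symmetric fold-change convention: positive for up-, negative for down-regulation.
fold_change = np.where(ratio >= 1, ratio, -1 / ratio)

for r, lr, fc in zip(ratio, log2_ratio, fold_change):
    print(f"ratio={r:6.2f}  log2-ratio={lr:+5.2f}  fold-change={fc:+6.2f}")
```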
Applications
Microarrays have been used in a great variety of applications, most of which fall into four broad categories with different aims. The first type of study aims to distinguish the functional genomic differences between two or more states and might look at, for example, how the 'transcriptome' of an infected cell differs from that of an uninfected cell [14], which transcripts are expressed under particular growth conditions [15] or which transcripts change as a cell is exposed to increasing environmental stress [16]. The second type of application aims to differentiate between cell phenotypes and is particularly applicable to diagnostics and prognostics: to classify different tumors [17, 18 and 19], to associate expression profiles with outcome or response to treatment, and to provide a characteristic picture of the course of a disease or a treatment [20]. The third type of analysis aims to discover which genes control metabolic pathways and signaling cascades [16 and 21] and requires samples that are as homogeneous as possible with respect to cell type and state. Being the least robust of the four, it requires careful experimental design and intense data analysis: small errors in gene annotation or random variations in message concentration can cause large changes in the results obtained and their biological interpretation. The final type of application aims to determine point mutations. Oligo-arrays used in this type of analysis can help to distinguish between different alleles of a gene (genotyping) or to classify strains of pathogens [22].
Data processing
SAGGED generally consists of two steps, data processing and data analysis, both of which have generated a plethora of techniques and quite a bit of controversy. Starting with the initial images taken from an array scanner, one detects and delineates the target spots on the array. As simple as this seems, automated methods of spot detection have met with only moderate success [23 and 24]. The background and signal must be isolated. Various corrections must be applied for bias owing to differing dye-binding affinities and non-linear concentration effects (e.g. spot saturation) [25, 26 and 27]. Signal-to-noise ratios are typically calculated and then used to derive confidence intervals associated with the reported values [12 and 13]. The data are often scaled and normalized so that different arrays can be compared with one another [28 and 29]. In two-fluor experiments the significance of the difference between the channels must be determined [10, 11, 30 and 31]. These analyses differ from one array platform to another and the comparison between platforms is an area of active study [32 and 33].
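The sketch below walks through a stripped-down version of these processing steps for a hypothetical single-fluor array: background subtraction, a signal-to-noise filter and a simple median scaling between arrays. The threshold of 3, the median scaling and the simulated intensities are all illustrative assumptions; production pipelines use the platform-specific corrections and normalizations cited above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical raw spot measurements for one single-fluor array:
# foreground signal, local background, and the background standard deviation
# reported by the image-analysis software.
fg = rng.lognormal(mean=7.0, sigma=1.0, size=500)
bg = rng.normal(loc=100.0, scale=10.0, size=500)
bg_sd = np.full(500, 10.0)

# 1. Background correction (clipped at a small positive floor to keep logs defined).
signal = np.clip(fg - bg, 1.0, None)

# 2. A simple signal-to-noise ratio; spots below a threshold are flagged as unreliable.
snr = signal / bg_sd
reliable = snr > 3.0

# 3. Between-array scaling: divide each array by its median reliable signal so that
#    arrays hybridized or scanned under slightly different conditions become comparable.
def median_scale(array_signal, mask):
    return array_signal / np.median(array_signal[mask])

normalized = median_scale(signal, reliable)
print(f"{reliable.sum()} of {reliable.size} spots pass the S/N filter")
print(f"median of normalized reliable signals: {np.median(normalized[reliable]):.3f}")
```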
Data analysis
The initial processing of microarray data deals with the measurement of individual spots. Further analyses apply to the comparison of spots within an array and between arrays. Hierarchical cluster analysis, introduced to expression arrays by Eisen et al. [34], is still perhaps the most popular. This technique, familiar to anyone who works with phylogenies, produces a tree of genes or experiments with similar profiles. The choice of a metric for the similarity between the profiles of different transcripts is a topic worthy of careful consideration [35, 36 and 37]. Other clustering techniques, such as self-organizing maps and k-means clustering, produce a fixed number of categories instead of a hierarchy. Discriminant analysis [38] and principal component analysis [39] are useful in distinguishing one state from another: for example, tumors that will respond well to a particular therapy as opposed to those that will not. These 'supervised' techniques require the existence of a 'training set' (i.e. data for which the categories have already been classified). Their job is to develop processes by which unlabeled data can be classified or to determine which genes are most important to the classification. 'Unsupervised' clustering and discriminating techniques aim to discover the categories ab initio, without prior classification data [34, 35 and 40].
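As a small illustration of how the choice of metric enters a hierarchical clustering, the following sketch clusters simulated expression profiles twice, once with correlation distance (1 - Pearson r) and once with the city-block (Manhattan) distance, and reports the resulting cluster sizes. The simulated data, the average-linkage choice and the cut into four clusters are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)

# Hypothetical log-ratio profiles: 30 genes measured over 6 conditions.
profiles = rng.normal(size=(30, 6))

# Average-linkage hierarchical clustering; the distance metric is the choice
# that deserves careful thought.  Here we compare correlation distance
# with the 'city block' (Manhattan) distance.
for metric in ("correlation", "cityblock"):
    tree = linkage(profiles, method="average", metric=metric)
    labels = fcluster(tree, t=4, criterion="maxclust")  # cut into 4 clusters
    sizes = np.bincount(labels)[1:]
    print(f"{metric:>11}: cluster sizes {sizes.tolist()}")
```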
Several analyses (trend analysis, spectral analysis and regression analysis) are used to discover patterns in time, concentration, temperature or some other series of experiments that vary across a continuous variable [41]. Some promising newer approaches (Bayesian networks and path analysis) [42 and 43] organize the information from many experiments not into fixed categories, but as a set of dependencies (displayed as a network of connected relations). Such methods have been successfully used to isolate sets of genes involved in the sporulation of yeast and other global metabolic changes, and to suggest genes previously unknown to be involved in those processes.
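For the simplest of these pattern analyses, a linear trend over a time course, a sketch might look like the following; the time points and log2-ratios are invented solely to show the mechanics of regressing one transcript's profile on a continuous variable.

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical time course: log2-ratios of one transcript sampled at six time points.
time_h = np.array([0.0, 1.0, 2.0, 4.0, 8.0, 16.0])
log2_ratio = np.array([0.05, 0.20, 0.35, 0.80, 1.40, 2.10])

# Simple trend analysis: regress the expression measure on time and ask whether
# the slope differs from zero.
fit = linregress(time_h, log2_ratio)
print(f"slope = {fit.slope:.3f} log2-units/h, p = {fit.pvalue:.4f}, r^2 = {fit.rvalue**2:.3f}")
```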
The low-level measurements derived from a single array or from technical replicates of the same condition are often reported with confidence intervals or similar measures of the strength and certainty of the measurement. The propagation of these statistics to higher level analyses, like clustering, is still a matter of much investigation. So too is the measurement of confidence in cluster membership, certainty of categorization, and so on.
Yet another level of analysis is ultimately strategic in nature: power analysis and experimental design [44 and 45]. How many arrays will it take to be 95% certain of a given relationship (e.g. that gene A is upregulated in cells grown at temperature B)? What is the best (i.e. cheapest, easiest and/or most meaningful) set of comparisons to make? What is the best interval to use (i.e. the most likely to catch interesting changes using the fewest arrays) in a time-series experiment? Array experiments are still far from inexpensive and their preparation is far from foolproof. It is essential to design an experiment with robustness (e.g. can useful information still be obtained if one or several hybridizations fail?) and power in mind [46, 47 and 48].
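As a rough illustration of such a power calculation, the sketch below uses the standard normal-approximation formula for a two-sample comparison to estimate how many replicate arrays per condition are needed to detect a given log2-ratio difference. The effect size, the assumed standard deviation and the neglect of any multiple-testing correction (which would raise the requirement considerably when thousands of genes are tested at once) are all simplifying assumptions.

```python
from scipy.stats import norm

def arrays_per_group(delta, sd, alpha=0.05, power=0.95):
    """Normal-approximation sample size for a two-sided, two-sample comparison.

    delta: smallest log2-ratio difference worth detecting (a 2-fold change = 1.0)
    sd:    anticipated standard deviation of the log2-ratios across replicate arrays
    """
    d = delta / sd                      # standardized effect size
    z_a = norm.ppf(1 - alpha / 2)       # two-sided significance threshold
    z_b = norm.ppf(power)               # desired power
    return 2 * ((z_a + z_b) / d) ** 2   # replicates needed in each group

# For example, to detect a 2-fold change (delta = 1 on the log2 scale) when replicate
# arrays scatter with SD ~ 0.7, at alpha = 0.05 and 95% power:
print(f"~{arrays_per_group(1.0, 0.7):.1f} arrays per condition")
```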
Arrays for everyday work
The main emphasis of this review is on improving researchers' capabilities in their daily work by means of SAGGED rather than GOBS. Given the state of these technologies, it is beneficial to keep several key points in mind when designing and analyzing an array experiment.
Microarrays are still developing
Microarrays are still very young and under rapid development. Although numerous attempts have been made to define standards for data collection, normalization, standardization and reporting [29, 49 and 50], neither a consensus nor a de facto standard has emerged. It is useful, but not always easy, to make microarray data MIAME compliant (i.e. the Minimum Information About a Microarray Experiment is available, such that the data can be easily interpreted and verified). Reporting some minimal/core information for each experiment is not nearly so difficult. Such essential information must include reagents, type of scanner, data extraction algorithms, error models (names or references), the type and range of measurement (ratio, log-ratio, intensity, normalized intensity, etc.), and the raw data (backgrounds in each fluor channel, signals in each fluor). When both raw and processed data are reported, the results can be re-analyzed as newer approaches and techniques are created.
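One lightweight way to keep such core information with the data is a simple structured record saved next to the raw files. The field names below are illustrative only (they are not the official MIAME schema), and the file names and instrument are hypothetical.

```python
import json

# Illustrative (not official MIAME) record of the core information suggested above,
# kept alongside the raw intensity files so that results can be re-analyzed later.
experiment_record = {
    "reagents": {"labeling": "Cy3/Cy5 dUTP", "hybridization_buffer": "vendor standard"},
    "scanner": "hypothetical ScannerX 4000, 10 um resolution",
    "data_extraction": {"software": "spot-finding package vX.Y", "background": "local median"},
    "error_model": "per-spot intensity error model (name or reference goes here)",
    "measurement": {"type": "log2-ratio", "range": [-8, 8]},
    "raw_data_files": ["array01_ch1_signal.txt", "array01_ch1_background.txt",
                       "array01_ch2_signal.txt", "array01_ch2_background.txt"],
}

print(json.dumps(experiment_record, indent=2))
```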
Replicate measurements
Much baseline biological and technical information is still unknown [51 and 52]. How long do different species of mRNA persist in a cell? How much do they vary under identical conditions? To what extent do degradation products and cross-hybridization affect the measurements? How much variation results from different arrays, sampling errors, and so on? If the samples are amplified, how much bias does the PCR process introduce? Because of the complexity of these processes and the corresponding measurements, it is important to perform biological and technical replicates in each study. It is common to have four technical replicates for each experimental condition.
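A back-of-the-envelope way to see why both kinds of replicate matter is to separate array-to-array scatter from sample-to-sample scatter. The simulation below (with invented variance values) does this for one gene measured on three biological replicates with four technical replicates each; a proper analysis would use a nested or mixed-model ANOVA rather than this rough decomposition.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical log2-ratios for one gene: 3 biological replicates, each hybridized
# to 4 technical replicate arrays (the layout suggested above).
bio_means_true = rng.normal(loc=1.0, scale=0.4, size=3)               # biology-to-biology spread
data = bio_means_true[:, None] + rng.normal(scale=0.15, size=(3, 4))  # array-to-array noise

# Technical variability: scatter of technical replicates around their biological mean.
tech_var = data.var(axis=1, ddof=1).mean()
# Variability of the per-sample means (biological spread plus residual technical noise).
bio_var_of_means = data.mean(axis=1).var(ddof=1)

print(f"technical variance estimate: {tech_var:.3f}")
print(f"variance of biological-replicate means: {bio_var_of_means:.3f}")
```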
Experimental design
A standard experimental design for a time-series study compares each time point to a control. Often each array is replicated with flipped dyes to compensate for differing dye-binding effects, as well as to provide replicate measurements. Kerr and Churchill [46] showed that this design unnecessarily over-samples the control and suggested a simple and elegant alternative that reduces the number of arrays without compromising the results (see Figure 1). Further issues of experimental design are discussed elsewhere [44, 45, 46, 47 and 48].
Figure 1. Examples of experimental designs. In design 1, ten arrays are used to assay five time points. The control sample is measured ten times, five in each channel. Time points 1–5 are sampled twice, once in each channel. In design 2, six arrays are used to assay the same five time points. The control and all time points are measured twice, once in each channel. (Figure modified from [46].)
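The bookkeeping behind Figure 1 can be reproduced in a few lines. The sketch below enumerates the (channel 1, channel 2) pairings for a dye-swapped reference design and for a loop design in the spirit of Kerr and Churchill, and counts how often each sample is measured; it reproduces only the sampling arithmetic of the two designs, not their statistical analysis.

```python
from collections import Counter

samples = ["C", "T1", "T2", "T3", "T4", "T5"]

# Design 1: a reference design with dye swaps -- every time point is hybridized
# against the control twice, once in each channel (10 arrays).
design1 = [(t, "C") for t in samples[1:]] + [("C", t) for t in samples[1:]]

# Design 2: a loop design -- each sample is paired with the next one around a loop
# (6 arrays), so every sample appears twice, once in each channel.
design2 = [(samples[i], samples[(i + 1) % len(samples)]) for i in range(len(samples))]

for name, design in (("design 1", design1), ("design 2", design2)):
    counts = Counter(s for pair in design for s in pair)
    print(f"{name}: {len(design)} arrays, measurements per sample: {dict(counts)}")
```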
Measures of similarity
When making sense out of SAGGED, it is important to choose measures of similarity carefully. The default similarity measure in many software packages is Pearson's correlation coefficient (PCC). This measure indicates whether two messages are changing in the same direction at the same rate. PCC-based clusters contain transcripts that appear to change concentration in the same way under a given set of conditions. But when do we expect messages to change in the same way? Perhaps when they are coordinately regulated by the same transcription factor, or perhaps when they are part of the same operon. In metabolic pathways and signaling cascades it is rather common to see one message increase while another decreases. Although the control of these genes is part of the same process, they would not be clustered together using PCC. Other metrics, for example the 'city block' distance, might be more effective [53].
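The contrast can be made concrete with two toy profiles. In the sketch below, the anti-regulated pathway partner receives the worst possible PCC (-1) even though both messages report on the same process, whereas an amplified copy of the same profile receives a perfect PCC despite its very different magnitude. The profiles are invented, and no claim is made that the city-block distance is the right choice in general; it simply responds to magnitude rather than direction.

```python
import numpy as np
from scipy.spatial.distance import cityblock
from scipy.stats import pearsonr

# Hypothetical log2-ratio profiles across five conditions.
induced   = np.array([0.1, 0.5, 1.0, 1.5, 2.0])   # message that goes up
repressed = -induced                               # pathway partner that goes down
scaled_up = 3.0 * induced                          # same shape, much larger amplitude

for name, other in (("repressed partner", repressed), ("scaled-up copy", scaled_up)):
    r, _ = pearsonr(induced, other)
    print(f"{name:18s}  PCC = {r:+.2f}   city-block distance = {cityblock(induced, other):5.1f}")

# PCC calls the anti-regulated partner maximally dissimilar (r = -1) even though
# both messages belong to the same process, while it calls the amplified copy
# identical (r = +1) despite the very different magnitudes; the city-block
# distance responds to magnitude instead of direction.
```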
Error estimates
It is extremely useful to incorporate error estimates into SAGGED, although this requires careful implementation. For example, one of the most popular error-weighted similarity measures was introduced in Rosetta's seminal studies of a yeast compendium [54] and breast cancer [55]. Unfortunately, this approach can be erroneous, as illustrated in Figure 2. The expression ratios of two genes may fall along a straight line through the origin (i.e. be perfectly correlated), yet have an error-weighted correlation of less than 1 (e.g. 0.68 as in Figure 2a). Alternatively, two profiles might not fall along a straight line (i.e. not be perfectly correlated), yet have an error-weighted correlation of 1.0 (Figure 2b). This arises because a correlation is computed between ratios of expression levels using their standard errors, rather than a weighted correlation between the expression levels themselves. To avoid these pitfalls, error information can, for example, be incorporated intrinsically into the clustering process (BC Tjaden, A Siegel and E Kolker, unpublished). Again, there is no single perfect and universally applicable similarity measure.
Figure 2. Counter-examples for error-weighted similarity (see [54 and 55] and related supplementary material). (a) Expression profiles for two genes gI and gII (each with three measurements): gI = (100, 300, 400) and gII = (100, 300, 400), with standard errors sI = (20, 15, 40) and sII = (20, 40, 15). The plotted expression profiles fall exactly on a straight line through the origin; however, the error-weighted similarity for these genes is only 0.68. (b) Expression profiles gI = (100, 300, 400) with sI = (20, 15, 20) and gII = (100, 400, 300) with sII = (20, 20, 15). The plotted expression profiles do not fall on a straight line through the origin; however, the error-weighted similarity in this case is 1.0. (Figure adapted from BC Tjaden, A Siegel and E Kolker, unpublished.)
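The numbers in Figure 2 can be reproduced with an uncentered, error-weighted correlation in which each term is down-weighted by the product of the two standard errors. This particular formula is an assumption on our part (the measures used in [54] and [55] may differ in detail), but under it the data of panel (a) give 0.68 and those of panel (b) give 1.0:

```python
import numpy as np

def error_weighted_similarity(x, sx, y, sy):
    """Uncentered, error-weighted correlation: each term is down-weighted by the
    standard errors of the two measurements (one common form of an error-weighted
    similarity; the measures in the papers cited above may differ in detail)."""
    num = np.sum((x * y) / (sx * sy))
    den = np.sqrt(np.sum(x**2 / sx**2) * np.sum(y**2 / sy**2))
    return num / den

# Panel (a): identical profiles (a perfect straight line through the origin).
gI,  sI  = np.array([100, 300, 400.0]), np.array([20, 15, 40.0])
gII, sII = np.array([100, 300, 400.0]), np.array([20, 40, 15.0])
print(f"(a) similarity = {error_weighted_similarity(gI, sI, gII, sII):.2f}")  # ~0.68

# Panel (b): profiles that are not proportional to one another.
gI,  sI  = np.array([100, 300, 400.0]), np.array([20, 15, 20.0])
gII, sII = np.array([100, 400, 300.0]), np.array([20, 20, 15.0])
print(f"(b) similarity = {error_weighted_similarity(gI, sI, gII, sII):.2f}")  # 1.00
```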
Comparing arrays
It is nontrivial to compare the results from more than one array platform, for example, between Affymetrix and Agilent or between oligo and cDNA arrays [32 and 33]. Not only do the measurements have different properties (errors, specificities, chemistries), but it is also difficult to map between the different nomenclatures for the targets on each spot [49].
Integrative studies
Gene expression studies are best thought of as 'hypothesis generators' [56 and 57]. The variability and uncertainty of the measurements are still too poorly understood to test hypotheses with reasonable confidence [51 and 52], nor are the measurements generally sufficient to determine causal relationships. For the most part, important biological phenomena are mediated by proteins as well as by genes (messages). Hence, gene expression approaches, including SAGGED, have to be complemented by integrative studies involving analysis of protein and metabolite expression as well as biochemical, physiological and mutational methods [58, 59, 60 and 61].
Internet resources
Internet resources are becoming increasingly useful in everyday work as compilations of literature and of analysis tools or services [48, 62 and 63].
Conclusions
Expression arrays, now a decade old, are beginning to live up to the hopes outlined five years ago in Nature Genetics [2], but the computational and analytic challenges are still enormous. There are several key issues to consider when analyzing gene expression data: reporting core information for each experiment, including both raw and processed data, is easy and advantageous; biological and technical replicates are needed; careful experimental design can avoid over- or under-sampling, making the analysis simpler and more powerful; the choice of similarity measure is nontrivial and depends on the goal of the experiment; array information must be complemented with other data; and gene expression studies are 'hypothesis generators'.
Never before has the biomedical and biotechnological community been faced with such a plethora of interrelated information. The storage, classification, selection, standardization, filtering and comparison of the results from microarray experiments are areas of intensive research. So too is the application of this information to the attribution of causal effects, the elucidation of pathways and pathological processes, and the discovery of novel biological phenomena. With the emergence of proteomics and other high-throughput methodologies, huge volumes of interrelated information will help to elucidate ever more biological processes. And then the fun will really begin.