PNAS, Cold Spring Harbor Laboratory, Michael Zhang: Extracting Information from Microarrays

Proc. Natl. Acad. Sci. USA, Vol. 99, Issue 20, 12509-12511, October 1, 2002

Commentary

Extracting functional information from microarrays: A challenge for functional genomics

Michael Q. Zhang*

Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY 11724

The advent of the human and model organism genome projects has provided an increasingly complete list of genes that code for the building blocks of life on Earth. Deciphering the functions of all these genes has proven to be no easy task. The availability of mountains of transcriptional profiling data from modern large-scale gene-expression technologies such as serial analysis of gene expression (SAGE) (1), oligonucleotide arrays (2), and cDNA microarrays (3) represents a tremendous windfall for computational biologists who have largely migrated from many different fields. One article appearing in this issue of PNAS (4) introduces a novel computational approach, shortest path (SP) analysis, to assign gene functions in a transitive fashion along a correlation linkage path terminated by two known genes belonging to the same functional category.

Currently the most popular way to identify interesting genes and their functions is to perform cluster analysis on the relative expression pattern changes (Fig. 1A) in typical microarray experiments that survey a range of conditions (reviewed in ref. 5). The fundamental premise of the clustering approach is that genes having similar expression profiles across a set of conditions (cellular processes, responses, phenotypes, etc.) may share similar functions (6). Obviously the word "function" is too general to be precise and quantitative and too broad to be specific and meaningful.
Genes whose products have the same function (say, phosphorylating other proteins) do not necessarily share a similar transcriptional pattern. Conversely, genes having different functions can have a similar expression profile simply by chance or stochastic fluctuations. Although many potential caveats exist, large numbers of functionally related genes do show very similar expression patterns under a relevant set of conditions, especially genes that are coregulated by common transcription factors or whose products are components of a larger complex; this is why a simple clustering of genes with similar expression patterns can be used to assign a putative function to unknown genes via "guilt-by-association" arguments (e.g., refs. 7 and 8). Several clustering techniques such as hierarchical clustering (9), K-means (10), and self-organizing map (SOM) (11) have been adopted from other fields and applied widely to microarray data analyses. Successful as it is, clustering cannot reveal functional relations among genes whose expression patterns show very little correlation (they may be related by a time delay, for instance) (Fig. 1B).

Fig. 1. Relations among different concepts in the SP-analysis method. (A) Expression profile matrix (table). t = (t1,t2,...) is the experimental condition index; in this example it indicates a set of time points. (B) Expression profiles (patterns). g1 and g4 are not strongly correlated directly, but both are strongly correlated with the correlated set (gx,g2). gx,g2 are the transitive genes interpolating the two terminal genes along SP1 (see C and D); similarly, gy is the transitive gene interpolating g1 and g5 along SP2. (C) GO biological process tree. The Ps are process annotations for genes at a particular node. A gene may belong to more than one node ("multiple-function," such as g2). (D) Expression profile space.
gx is on the short path SP1 terminated by the known genes g1, g2, and g4 and hence is assigned a function of P1,1,1,1 (level L0) according to the GO tree in C; gy is on SP2 terminated by g1,g5 and is assigned a function of P1,1,1 (level L1). g1 is shared by both SPs and may be involved in both processes, which means the processes represented by SP1 and SP2 actually crosstalk with each other. The linked gene network can be formed by the subgraph SP1+SP2.

A major goal of microarray data analyses is to identify genes that interact with each other in a particular cellular process (or pathway) where not every player has a similar transcriptional profile. The crucial aspect of the approach of Zhou et al. (4) is to extend the coexpression concept to a more general "transitive coexpression," which seems to be an important characteristic of many biological processes: Two genes involved in the same process may not be strongly correlated in expression directly, but both can be strongly correlated with the same set of other genes. Another widely recognized point is that functional annotations should really be incorporated early in the data analysis. Not surprisingly, the starting point of the Zhou et al. work is the exploitation of the controlled vocabulary tree in the biological process categories of gene ontology (GO) (ref. 12; Fig. 1C).

In essence, this SP-analysis method starts from a pair of genes belonging to the same biological process category and the same major cellular compartment (mitochondrial, cytoplasmic, or nuclear), according to GO, and constructs the SP through a chain of pairwise, strongly correlated genes, with a distance function that further contracts the strongly correlated genes. Unknown transitive genes on the SP are assigned the function of the "lowest common ancestor" of all the process subcategories corresponding to the known genes on the same SP (Fig. 1 B-D).
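
The two steps just described (finding an SP between two terminal genes, then taking the lowest common ancestor of the known annotations) can be illustrated with a minimal Python sketch. It is not the implementation of Zhou et al.: the distance d = (1 - |r|)^6, the function names, the toy profiles, and the GO-like labels are all assumptions chosen only to mirror the g1, gx, g2, g4 example of Fig. 1.

```python
import heapq
import math

import numpy as np


def correlation_graph(profiles, power=6):
    """Complete weighted graph over genes, returned as a dict of dicts.

    Assumed distance (not specified in the text): d = (1 - |r|) ** power,
    with r the Pearson correlation. A large power contracts strongly
    correlated pairs toward zero distance.
    """
    genes = list(profiles)
    corr = np.corrcoef(np.array([profiles[g] for g in genes], dtype=float))
    graph = {g: {} for g in genes}
    for i, a in enumerate(genes):
        for j, b in enumerate(genes):
            if i != j:
                gap = max(0.0, 1.0 - abs(corr[i, j]))  # guard against rounding
                graph[a][b] = float(gap ** power)
    return graph


def shortest_path(graph, source, target):
    """Dijkstra's algorithm; returns (path, length) or (None, inf)."""
    best = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            path = [node]
            while node in prev:
                node = prev[node]
                path.append(node)
            return path[::-1], dist
        if dist > best.get(node, math.inf):
            continue  # stale heap entry
        for nbr, weight in graph[node].items():
            cand = dist + weight
            if cand < best.get(nbr, math.inf):
                best[nbr] = cand
                prev[nbr] = node
                heapq.heappush(heap, (cand, nbr))
    return None, math.inf


def lowest_common_ancestor(parent, nodes):
    """Deepest node shared by all root-ward chains in a child -> parent tree."""
    def chain(node):
        out = []
        while node is not None:
            out.append(node)
            node = parent.get(node)
        return out

    chains = [chain(n) for n in nodes]
    shared = set(chains[0]).intersection(*chains[1:])
    return next((n for n in chains[0] if n in shared), None)


def toy_profile(angle_deg):
    """Hypothetical zero-mean profile; correlation equals cos(angle difference)."""
    u = np.array([1.0, -1.0, 1.0, -1.0])
    v = np.array([1.0, 1.0, -1.0, -1.0])
    a = math.radians(angle_deg)
    return math.cos(a) * u + math.sin(a) * v


# Toy data: g1 and g4 are uncorrelated, but neighbours in the chain correlate.
profiles = {"g1": toy_profile(0), "gx": toy_profile(30),
            "g2": toy_profile(60), "g4": toy_profile(90)}
# Toy GO-like tree (child -> parent); labels are illustrative only.
go_parent = {"P1": None, "P1.1": "P1", "P1.1.1": "P1.1",
             "P1.1.1.1": "P1.1.1", "P1.1.1.2": "P1.1.1"}

path, length = shortest_path(correlation_graph(profiles), "g1", "g4")
print(path, length)
print(lowest_common_ancestor(go_parent, ["P1.1.1.1", "P1.1.1.1"]))  # same node
print(lowest_common_ancestor(go_parent, ["P1.1.1.1", "P1.1.1.2"]))  # siblings
```

On this toy data the direct g1 to g4 link has distance close to 1, so the shortest path runs through gx and g2 even though the terminal genes are uncorrelated. Known genes annotated to the same node yield that node itself, and sibling nodes yield their shared parent. A real analysis would use a full DAG-aware GO library rather than a single-parent tree, which is a simplification made here.
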
To define a sufficiently specific gene function, the total SP length must be very short, and this lowest ancestral node must be at least four levels below the root of the GO tree. In particular, if all the known genes are in the same node, the lowest common ancestor is the starting terminal gene process category itself (level L0 assignment); if they are in different nodes but all share a direct parent with the terminal genes, this parent node will be identified as the lowest common ancestor (level L1 assignment) (Fig. 1 C and D).

To test the validity of their SP method, Zhou et al. (4) applied it to the analysis of the Saccharomyces cerevisiae gene-expression profiles of the Rosetta compendium (13), which measured the responses across 300 gene-deletion and drug-treatment experiments. First, they used only the known genes (1,300 that have GO cellular process and localization annotations). The SP method was able to successfully call 64/84% (cytoplasm), 59/69% (mitochondria), and 39/51% (nuclear) transitive genes at the L0/L1 levels, and these results are highly significant as shown by further permutation tests. Encouraged by the benchmark tests, they extended the graphs of SPs of known genes to an additional 3,300 unknown ORFs and were able to assign functions (i.e., cellular process categories) to 146 ORFs that include 75 high-confidence predictions (a gene-function assignment is highly confident if the gene is the only unknown gene on the SP). Because a gene may belong to several SPs, it can get multiple-function assignments. One may choose not to make a prediction on an unknown gene if known genes on the SP fail to have as consistent an annotation as the terminal genes. As is often the case for computational biologists, Zhou et al. spent a tremendous amount of effort in trying to substantiate the biological content of their findings by extensive literature searches.
Among the 75 high-confidence annotations, 24 were found in the yeast proteome database (YPD, www.proteome.com), and 16 (83%) were confirmed by YPD-documented experiments. More encouragingly, their computational results seem to be able to correct some database annotation errors after closer scrutiny.

As stated by the authors, the strength of their method is to use the SP to link "transitive coexpressed" genes even if some of the genes (especially the terminal genes) on the SP do not have correlated expression profiles directly. A further advantage is exemplified by the "active incorporation of biological annotation into the knowledge discovery process." But the conceptual significance actually lies at a much deeper level. For example, one could also ask: If two known targets of a transcription factor are taken as the terminal nodes, could more targets along the SP be identified analogously? If not with the SP defined by this particular distance function, perhaps some other SP defined by a more appropriate distance function would have to be used. In general, one could argue that, to a certain extent, the goal of all microarray data analyses is to identify a functionally linked subnetwork hidden in the expression profiles. Suppose we view the expression profile space as consisting of clouds of points (genes). If we connect all genes (assuming we know every gene function) involved in a particular part of a cellular process (say, cell-cycle progression), we would trace out a subnetwork path. We could do the same for a different process and would get another path. The intersection would define gene(s) that are involved in both processes. If the two processes are so linked, we could actually trace out a connected subnetwork (Fig. 1D). Conversely, discovering such hidden functional linkages (paths, subnetworks, etc.) activated by response or process variables (such as time shift in the cell-cycle process) would be the central task.
The expression space does not have to be limited to relative mRNA density changes at different times or conditions; it could also include proteome information, localization variables, and tissue and developmental parameters. It is actually nontrivial to find the right metric function that defines distance relations appropriate to the cellular processes of interest and allows investigators to construct the SP links capable of tracing out the functional subnetworks. Although the particular distance function and related SPs of Zhou et al. (4) may not be sufficient for identifying all types of processes, the general methodology does represent a significant extension of our microarray data analysis repertoire beyond cluster analysis.

It is not clear how far one can take this empirical SP approach. If the two terminal genes are multifunctional, will there more likely be a single SP with multifunctional transitive genes or multiple SPs with largely single-functional transitive genes on each SP? It is more likely that the incomplete knowledge of the existing GO tree and the current resolution for most microarray data will prevent us from getting the answers to such questions. But the real key for understanding transcriptional profiles and gene-regulation networks is to link expression patterns to transcription factor-binding sites (cis-regulatory elements). Recent advances in computational (refs. 14 and 15; reviewed in ref. 16) and experimental (17, 18) technologies have opened up real opportunities for annotating gene functions not only at the phenomenological level but also at the mechanistic level.

Acknowledgements

I thank G. X. Chen, N. Banerjee, and H. J. Yuan for critical comments on the manuscript. The Zhang lab is supported by National Institutes of Health grants.

Footnotes

See companion article on page 12783.

* E-mail: [email protected].

References

1. Velculescu, V. E., Zhang, L., Vogelstein, B. & Kinzler, K. W.
(1995) Science 270, 484-487.
2. Lockhart, D. J., Dong, H., Byrne, M. C., Follettie, M. T., Gallo, M. V., Chee, M. S., Mittmann, M., Wang, C., Kobayashi, M., Horton, H. & Brown, E. L. (1996) Nat. Biotechnol. 14, 1675-1680.
3. Schena, M., Shalon, D., Davis, R. W. & Brown, P. O. (1995) Science 270, 467-470.
4. Zhou, X., Kao, M.-C. J. & Wong, W. H. (2002) Proc. Natl. Acad. Sci. USA 99, 12783-12788.
5. Quackenbush, J. (2001) Nat. Rev. Genet. 2, 418-427.
6. Zhu, J. & Zhang, M. Q. (1999) Pac. Symp. Biocomput. 5, 476-487.
7. Wen, X., Fuhrman, S., Michaels, G. S., Carr, D. B., Smith, S., Barker, J. L. & Somogyi, R. (1998) Proc. Natl. Acad. Sci. USA 95, 334-339.
8. Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D. & Futcher, B. (1998) Mol. Biol. Cell 9, 3273-3297.
9. Eisen, M. B., Spellman, P. T., Brown, P. O. & Botstein, D. (1998) Proc. Natl. Acad. Sci. USA 95, 14863-14868.
10. Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J. & Church, G. M. (1999) Nat. Genet. 22, 281-285.
11. Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D. & Lander, E. S. (1999) Science 286, 531-537.
12. Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., et al. (2000) Nat. Genet. 25, 25-29.
13. Hughes, T. R., Marton, M. J., Jones, A. R., Roberts, C. J., Stoughton, R., Armour, C. D., Bennett, H. A., Coffey, E., Dai, H., He, Y. D., et al. (2000) Cell 102, 109-126.
14. Markstein, M., Markstein, P., Markstein, V. & Levine, M. S. (2002) Proc. Natl. Acad. Sci.
USA 99, 763-768.
15. Berman, B. P., Nibu, Y., Pfeiffer, B. D., Tomancak, P., Celniker, S. E., Levine, M., Rubin, G. M. & Eisen, M. B. (2002) Proc. Natl. Acad. Sci. USA 99, 757-762.
16. Michelson, A. M. (2002) Proc. Natl. Acad. Sci. USA 99, 546-548.
17. Ren, B., Robert, F., Wyrick, J. J., Aparicio, O., Jennings, E. G., Simon, I., Zeitlinger, J., Schreiber, J., Hannett, N., Kanin, E., et al. (2000) Science 290, 2306-2309.
18. Iyer, V. R., Horak, C. E., Scafe, C. S., Botstein, D., Snyder, M. & Brown,