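By using supercomputers to compare portions of the human genome with those of other mammals, researchers at Cornell University have discovered 300 previously unidentified human genes and have determined the fuller extent of several hundred known genes.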
The findings rest on a particular idea: as organisms evolve, the parts of the genetic code that are useful to the organism change in a different way from the rest. The researchers published the results in a recent online edition of Genome Research.
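The complete human genome was sequenced several years ago, but that only means that the sequence of bases making up the genetic code is known. The exact locations of all the DNA sequences that encode proteins or carry out regulatory and other functions still have to be determined.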
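Although more than 20,000 protein-coding genes have been identified so far, the Cornell finding confirms that many genes are still missed by current biological analysis methods. These methods are very effective at finding widely expressed genes, but they miss genes that are expressed only in specific organs or early in embryonic development.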
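The team used an evolutionary perspective to identify these genes. The researchers say that evolution has been running this kind of experiment for millions of years, and that computation is the "microscope" for viewing the results.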
Siepel, who led the study, and his colleagues set out to find genes that are evolutionarily conserved: genes that are vital to all life and whose form is identical or very similar across species.
Using large computer clusters, the researchers ran three different programs to compare alignments of the human, mouse, rat and chicken genomes that had been produced by other researchers.
The whole project, from building and testing the mathematical models to the final program runs, took about three years. In the end, the team found 300 new human genes.
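Earlier, a large research project involving hundreds of scientists from more than 100 institutions in 16 countries sequenced and compared the genomes of 12 fruit fly species. The data from that project greatly advanced researchers' understanding of fruit flies. Even human genome biologists, however, will take note of one point: the project also exposed obvious shortcomings in the way genes are currently identified.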
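Thomas Kaufman of Indiana University in the United States said that researchers have made enormous progress in genomics in recent years, but that simply feeding data into a computer to obtain the "truth" of a sequence leaves many problems unsolved. The large new study tells us one thing: when you compare many different but related genomes, you are more likely to "see" the genes buried among all the A-C-T-G fragments.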
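Two papers from the project, published in Nature, present the results of the four-year genome effort and draw some conclusions about fruit flies from the data. Implicit in the conclusions of both papers is the view that, when analyzing the genome of any single species, comparing it with related genomes greatly improves the efficiency of gene identification. The researchers said that more than 40 "companion" manuscripts will be published, each analyzing a different aspect of the data from the 12 fruit fly genomes.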
Original source:
Published online before print November 7, 2007
Genome Research, DOI: 10.1101/gr.7128207
Targeted discovery of novel human exons by comparative genomics
Adam Siepel1,9, Mark Diekhans2, Brona Brejová1, Laura Langton3, Michael Stevens3, Charles L.G. Comstock3, Colleen Davis4, Brent Ewing4, Shelly Oommen5, Christopher Lau5, Hung-Chun Yu5, Jianfeng Li5, Bruce A. Roe5, Phil Green4, Daniela S. Gerhard6, Gary Temple7, David Haussler2,8, and Michael R. Brent3
1 Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York 14853, USA; 2 Center for Biomolecular Science and Engineering, University of California, Santa Cruz, California 95064, USA; 3 Laboratory for Computational Genomics, Washington University, Saint Louis, Missouri 63130, USA; 4 Howard Hughes Medical Institute and Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA; 5 Departments of Chemistry and Biochemistry, University of Oklahoma, Norman, Oklahoma 73109, USA; 6 National Cancer Institute, Bethesda, Maryland 20892, USA; 7 National Human Genome Research Institute, Bethesda, Maryland 20892, USA; 8 Howard Hughes Medical Institute, University of California, Santa Cruz, California 95064, USA
A complete and accurate set of human protein-coding gene annotations is perhaps the single most important resource for genomic research after the human-genome sequence itself, yet the major gene catalogs remain incomplete and imperfect. Here we describe a genome-wide effort, carried out as part of the Mammalian Gene Collection (MGC) project, to identify human genes not yet in the gene catalogs. Our approach was to produce gene predictions by algorithms that rely on comparative sequence data but do not require direct cDNA evidence, then to test predicted novel genes by RT–PCR. We have identified 734 novel gene fragments (NGFs) containing 2188 exons with, at most, weak prior cDNA support. These NGFs correspond to an estimated 563 distinct genes, of which >160 are completely absent from the major gene catalogs, while hundreds of others represent significant extensions of known genes. The NGFs appear to be predominantly protein-coding genes rather than noncoding RNAs, unlike novel transcribed sequences identified by technologies such as tiling arrays and CAGE. They tend to be expressed at low levels and in a tissue-specific manner, and they are enriched for roles in motor activity, cell adhesion, connective tissue, and central nervous system development. Our results demonstrate that many important genes and gene fragments have been missed by traditional approaches to gene discovery but can be identified by their evolutionary signatures using comparative sequence data. However, they suggest that hundreds—not thousands—of protein-coding genes are completely missing from the current gene catalogs.