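Recently, a genetics team at Los Alamos National Laboratory (LANL) in the United States and an international consortium jointly proposed a set of quality standards intended to clarify what publicly available genome sequencing data actually represent. The new standards could ultimately help genetics researchers develop more effective vaccines, or help public health agencies and security personnel respond more quickly to potential public health emergencies.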
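In the latest issue of Science, LANL geneticist Patrick Chain and his colleagues propose six labels for genome sequencing data. The labels classify sequence data by completeness, accuracy, and the reliability that follows from them. They are meant to accompany data in public databases, where only two labels are in use today. The proposal matters because researchers rely on such data every day to cross-reference unknown genetic data against the genetic data of known organisms; with the new classification, retrieving and comparing data should become considerably more efficient.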
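The cells of every organism contain DNA, which is built from four molecular building blocks, the bases, which join into base pairs. Base pairs arranged in a particular order form genes, and these gene sequences can carry genetic instructions that are beneficial or harmful to the organism. Genomics researchers have catalogued thousands of genetic data sets and placed them in public databases for other researchers to use. Because genetic data are so complex, however, the information in public databases ranges from rough to highly refined. Until now these data have been sorted into just two categories, "draft" and "finished", which leaves far too much uncertainty about their accuracy.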
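Chain said that sequencing technology has advanced significantly over the past few years and that publicly available genetic data have grown explosively, with billions more base pairs of sequence data now generated each day than in previous years. Different sequencing technologies have different levels of accuracy, and a high degree of uncertainty in a sequence can send researchers down a wrong path that costs a year or even several years. A standard is therefore needed that gives researchers a clear assessment of the quality of sequencing data.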
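Chain worked with genome sequencing centers large and small, including the U.S. Department of Energy Joint Genome Institute, the Sanger Institute, the Human Microbiome Project Jumpstart Consortium sequencing centers, Michigan State University, and the Ontario Institute for Cancer Research, to propose expanding the existing classification of sequence data from two categories to six. The six categories range from "Standard Draft", the minimum requirement for public submission, to "Finished", the highest standard. To qualify as "Finished", a sequence may contain at most one error per 100,000 base pairs.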
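To make the "Finished" criterion concrete, the arithmetic can be sketched in a few lines of Python. This is only an illustration of the threshold quoted above (at most one error per 100,000 base pairs); the function name and the example numbers are hypothetical, and the article does not describe how errors are counted in practice.

```python
# Illustrative sketch only: the threshold comes from the article,
# the function name and inputs are hypothetical.
BASES_PER_ALLOWED_ERROR = 100_000  # "Finished": at most 1 error per 100,000 bp


def meets_finished_threshold(error_count: int, base_pairs: int) -> bool:
    """Return True if the error rate is at most 1 per 100,000 base pairs."""
    if base_pairs <= 0:
        raise ValueError("base_pairs must be positive")
    if error_count < 0:
        raise ValueError("error_count must be non-negative")
    # Integer comparison avoids floating-point rounding at the boundary.
    return error_count * BASES_PER_ALLOWED_ERROR <= base_pairs


if __name__ == "__main__":
    # A hypothetical 5,000,000 bp genome may carry at most 50 errors.
    print(meets_finished_threshold(40, 5_000_000))
    print(meets_finished_threshold(60, 5_000_000))
```

For a hypothetical 5,000,000 bp genome, the threshold works out to at most 50 errors, so 40 errors would pass and 60 would not.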
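Chris Detter, head of the LANL genome science group and director of the Joint Genome Institute's LANL center, said the aim of the work is to let all major genome centers and genome research groups use classified genome sequencing data that fit their needs. To keep genome sequences as complete as possible, smaller research centers can also use the classification scale when producing and submitting their results, which helps other scientists understand what work has already been done. (Bioon.com)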
Original source recommended by Bioon:
Science, 9 October 2009. DOI: 10.1126/science.1180614
Genome Project Standards in a New Era of Sequencing
P. S. G. Chain (1,2,3,*), D. V. Grafham (4), R. S. Fulton (5), M. G. FitzGerald (6), J. Hostetler (7), D. Muzny (8), J. Ali (9), B. Birren (6), D. C. Bruce (1,10), C. Buhay (8), J. R. Cole (3), Y. Ding (8), S. Dugan (8), D. Field (11), G. M. Garrity (3), R. Gibbs (8), T. Graves (5), C. S. Han (1,10), S. H. Harrison (3,*), S. Highlander (8), P. Hugenholtz (1), H. M. Khouri (12), C. D. Kodira (6,*), E. Kolker (13,14), N. C. Kyrpides (1), D. Lang (12), A. Lapidus (1), S. A. Malfatti (12), V. Markowitz (15), T. Metha (6), K. E. Nelson (7), J. Parkhill (4), S. Pitluck (1), X. Qin (8), T. D. Read (16), J. Schmutz (17), S. Sozhamannan (18), P. Sterk (11), R. L. Strausberg (7), G. Sutton (7), N. R. Thomson (4), J. M. Tiedje (3), G. Weinstock (5), A. Wollam (5), Genomic Standards Consortium, Human Microbiome Project Jumpstart Consortium, J. C. Detter (10)
For over a decade, genome sequences have adhered to only two standards that are relied on for purposes of sequence analysis by interested third parties (1, 2). However, ongoing developments in revolutionary sequencing technologies have resulted in a redefinition of traditional whole-genome sequencing that requires reevaluation of such standards. With commercially available 454 pyrosequencing (followed by Illumina, SOLiD, and now Helicos), there has been an explosion of genomes sequenced under the moniker "draft"; however, these can be very poor quality genomes (due to inherent errors in the sequencing technologies, and the inability of assembly programs to fully address these errors). Further, one can only infer that such draft genomes may be of poor quality by navigating through the databases to find the number and type of reads deposited in sequence trace repositories (and not all genomes have this available), or to identify the number of contigs or genome fragments deposited to the database. The difficulty in assessing the quality of such deposited genomes has created some havoc for genome analysis pipelines and has contributed to many wasted hours. Exponential leaps in raw sequencing capability and greatly reduced prices have further skewed the time- and cost-ratios of draft data generation versus the painstaking process of improving and finishing a genome. The result is an ever-widening gap between drafted and finished genomes that only promises to continue (see the figure, page 236); hence, there is an urgent need to distinguish good from poor data sets.
1 U.S. Department of Energy Joint Genome Institute.
2 Lawrence Livermore National Laboratory.
3 Michigan State University.
4 The Sanger Institute.
5 Washington University School of Medicine.
6 The Broad Institute.
7 J. Craig Venter Institute.
8 Baylor College of Medicine.
9 Ontario Institute for Cancer Research.
10 Los Alamos National Laboratory.
11 Natural Environment Research Council Centre for Ecology and Hydrology.
12 National Center for Biotechnology Information.
13 Seattle Children's Hospital and Research Institute.
14 University of Washington School of Medicine.
15 Lawrence Berkeley National Laboratory.
16 Emory GRA (Georgia Research Alliance) Genomics Center.
17 HudsonAlpha Institute.
18 Naval Medical Research Center.