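Recently, a genetics group at Los Alamos National Laboratory (LANL) in the United States and an international consortium jointly proposed a set of quality standards intended to clarify the information carried by publicly available genome sequencing data. The new standards could ultimately help genetics researchers develop more effective vaccines, or help public health or security personnel respond more quickly to potential public health emergencies.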
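In the latest issue of Science, LANL geneticist Patrick Chain and his colleagues propose six labels for genome sequencing data, which classify the data by completeness, accuracy, and the reliability that follows from them. The labels would be available in public databases, where only two labels are currently in use. The work matters because researchers rely on such data every day to cross-reference unknown genetic data against the genetic data of known organisms; with the new classification standards, retrieving and comparing data should become far more efficient.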
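The cells of every organism contain DNA, which is built from four molecular building blocks (bases, which pair up as base pairs); arranged in particular sequences, they make up genes. These gene sequences can carry genetic instructions that are beneficial or harmful to the organism. Genome researchers have catalogued thousands of sets of genetic data and placed them in public databases for other researchers to use. However, because genetic data are complex, the genetic information in public databases ranges from rough to refined. In the past, these data were usually sorted into just two categories, "draft" and "finished," which left too much uncertainty about their accuracy.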
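Chain said that genome sequencing technology has advanced greatly over the past few years and that publicly available genetic data have grown explosively: the volume of base-pair sequence data generated each day now exceeds, by billions, what was generated over several years in the past. Different sequencing technologies have different levels of accuracy. High uncertainty in a sequence can send researchers down a wrong path that costs a year or even several years. A standard is therefore needed that gives researchers a clear assessment of the quality of genetic sequencing data.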
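Chain brought together a number of genome sequencing centers, large and small, including the U.S. Department of Energy Joint Genome Institute, the Sanger Institute, the Human Microbiome Project Jumpstart Consortium sequencing centers, Michigan State University, and the Ontario Institute for Cancer Research, to jointly propose expanding the existing classification of sequencing data from two categories to six. The six standards range from "standard draft," which represents the minimum requirement for public submission, to "finished," which represents the highest standard; the acceptance criterion for "finished" is at most one error per 100,000 base pairs.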
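The "finished" criterion above, at most one error per 100,000 base pairs, amounts to a simple rate check. The following is a minimal sketch under stated assumptions, not the consortium's own procedure: the function names and the example counts are hypothetical, and how errors are counted in practice is not described in this article. The conversion to a Phred-style quality score uses the standard definition Q = -10 · log10(p), under which an error rate of 1 in 100,000 corresponds to roughly Q50.

```python
import math

# Threshold taken from the article: at most 1 error per 100,000 base pairs.
FINISHED_MAX_ERRORS = 1
FINISHED_PER_BASES = 100_000


def meets_finished_threshold(error_count: int, total_bases: int) -> bool:
    """Return True if the error rate is no worse than 1 per 100,000 bases.

    Hypothetical helper: integer cross-multiplication avoids
    floating-point rounding when comparing the two ratios.
    """
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    if error_count < 0:
        raise ValueError("error_count must be non-negative")
    return error_count * FINISHED_PER_BASES <= FINISHED_MAX_ERRORS * total_bases


def phred_quality(error_rate: float) -> float:
    """Convert an error probability to a Phred-style score, Q = -10*log10(p)."""
    if not 0.0 < error_rate <= 1.0:
        raise ValueError("error_rate must be in (0, 1]")
    return -10.0 * math.log10(error_rate)


if __name__ == "__main__":
    # Illustrative numbers only, not taken from any real genome.
    print(meets_finished_threshold(10, 2_000_000))  # 1 error per 200,000 bases
    print(meets_finished_threshold(2, 100_000))     # 2 errors per 100,000 bases
    print(phred_quality(1 / 100_000))
```

A check like this says nothing about the other five categories, which the article does not define in detail.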
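Chris Detter, head of the LANL genome science group and director of the Joint Genome Institute's LANL center, said the aim of the work is to give all major genome centers and genome research groups access to classified genome sequencing data that meet their needs. To preserve the integrity of genome sequences as far as possible, smaller research centers can also adopt the classification levels when producing and submitting their results, which helps other scientists understand what work has already been done. (Bioon.com)
Original source recommended by Bioon.com: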
Science, 9 October 2009. DOI: 10.1126/science.1180614
Genome Project Standards in a New Era of Sequencing
P. S. G. Chain,1,2,3,* D. V. Grafham,4 R. S. Fulton,5 M. G. FitzGerald,6 J. Hostetler,7 D. Muzny,8 J. Ali,9 B. Birren,6 D. C. Bruce,1,10 C. Buhay,8 J. R. Cole,3 Y. Ding,8 S. Dugan,8 D. Field,11 G. M. Garrity,3 R. Gibbs,8 T. Graves,5 C. S. Han,1,10 S. H. Harrison,3,* S. Highlander,8 P. Hugenholtz,1 H. M. Khouri,12 C. D. Kodira,6,* E. Kolker,13,14 N. C. Kyrpides,1 D. Lang,12 A. Lapidus,1 S. A. Malfatti,12 V. Markowitz,15 T. Metha,6 K. E. Nelson,7 J. Parkhill,4 S. Pitluck,1 X. Qin,8 T. D. Read,16 J. Schmutz,17 S. Sozhamannan,18 P. Sterk,11 R. L. Strausberg,7 G. Sutton,7 N. R. Thomson,4 J. M. Tiedje,3 G. Weinstock,5 A. Wollam,5 Genomic Standards Consortium, Human Microbiome Project Jumpstart Consortium, J. C. Detter,10
For over a decade, genome sequences have adhered to only two standards that are relied on for purposes of sequence analysis by interested third parties (1, 2). However, ongoing developments in revolutionary sequencing technologies have resulted in a redefinition of traditional whole-genome sequencing that requires reevaluation of such standards. With commercially available 454 pyrosequencing (followed by Illumina, SOLiD, and now Helicos), there has been an explosion of genomes sequenced under the moniker "draft"; however, these can be very poor quality genomes (due to inherent errors in the sequencing technologies, and the inability of assembly programs to fully address these errors). Further, one can only infer that such draft genomes may be of poor quality by navigating through the databases to find the number and type of reads deposited in sequence trace repositories (and not all genomes have this available), or to identify the number of contigs or genome fragments deposited to the database. The difficulty in assessing the quality of such deposited genomes has created some havoc for genome analysis pipelines and has contributed to many wasted hours. Exponential leaps in raw sequencing capability and greatly reduced prices have further skewed the time- and cost-ratios of draft data generation versus the painstaking process of improving and finishing a genome. The result is an ever-widening gap between drafted and finished genomes that only promises to continue (see the figure, page 236); hence, there is an urgent need to distinguish good from poor data sets.
1 U.S. Department of Energy Joint Genome Institute.
2 Lawrence Livermore National Laboratory.
3 Michigan State University.
4 The Sanger Institute.
5 Washington University School of Medicine.
6 The Broad Institute.
7 J. Craig Venter Institute.
8 Baylor College of Medicine.
9 Ontario Institute for Cancer Research.
10 Los Alamos National Laboratory.
11 Natural Environmental Research Council Centre for Ecology and Hydrology.
12 National Center for Biotechnology Information.
13 Seattle Children's Hospital and Research Institute.
14 University of Washington School of Medicine.
15 Lawrence Berkeley National Laboratory.
16 Emory GRA (Georgia Research Alliance) Genomics Center.
17 HudsonAlpha Institute.
18 Naval Medical Research Center.