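Recently, a genetics group at Los Alamos National Laboratory (LANL) in the United States and an international consortium jointly proposed a set of quality standards intended to make clear how reliable publicly available genome sequencing data are. The new standards could ultimately help genetic researchers develop more effective vaccines, or help public health and security personnel respond more quickly to potential public health emergencies.
In the latest issue of Science, LANL geneticist Patrick Chain and his colleagues propose six labels for genome sequence data that classify the data by completeness, accuracy, and the resulting reliability. The labels would accompany the data in public databases, where only two labels are in use today. The work matters because researchers rely on such data every day to cross-reference unknown genetic data against the genetic data of known organisms; with the new classification, retrieving and comparing data should become considerably more efficient.
The cells of every organism contain DNA, which is built from four molecular building blocks, called bases, that pair up as base pairs; arranged in particular sequences, they make up genes. These gene sequences can carry genetic instructions that are beneficial or harmful to the organism. Genome researchers have catalogued thousands of genetic data sets and placed them in public databases for other researchers to use. However, because genetic data are complex, the genetic information in public databases ranges from rough to highly refined. Until now, these data have usually been sorted into just two categories, "draft" and "finished", which leaves too much uncertainty about their accuracy.
Chain said that gene sequencing technology has advanced greatly in the past few years and that publicly available genetic data have grown explosively: billions more base pairs of sequence data are now generated each day than was the case a few years ago. Different sequencing technologies have different levels of accuracy. A high degree of uncertainty in a sequence can send researchers down a wrong path that costs a year or even several years. A standard is therefore needed to give researchers a clear assessment of the quality of genetic sequence data.
Chain worked with a number of genome sequencing centers large and small, including the U.S. Department of Energy Joint Genome Institute, the Sanger Institute, the Human Microbiome Project Jumpstart Consortium sequencing centers, Michigan State University, and the Ontario Institute for Cancer Research, to propose expanding the existing classification of sequence data from two categories to six. The six standards range from "Standard Draft", which represents the minimum requirement for public submission, to "Finished", which represents the highest standard and is accepted only if the sequence contains at most one error per 100,000 base pairs.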
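The acceptance criterion for the "Finished" category is a simple rate, and a minimal sketch can make the arithmetic concrete. The function names and the example figures below are hypothetical and for illustration only; the article gives only the threshold of one error per 100,000 base pairs, and does not describe how errors are counted in practice.

```python
# Threshold quoted in the article: at most one error per 100,000 base pairs.
# Everything else here (names, example numbers) is an illustrative assumption.
MAX_ERRORS_PER_100K = 1


def errors_per_100k(error_count: int, total_bases: int) -> float:
    """Return the observed error rate expressed per 100,000 base pairs."""
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    return error_count * 100_000 / total_bases


def meets_finished_error_rate(error_count: int, total_bases: int) -> bool:
    """Check the error-rate criterion only; other requirements are not modeled."""
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    # Integer comparison avoids floating-point rounding at the boundary.
    return error_count * 100_000 <= MAX_ERRORS_PER_100K * total_bases


if __name__ == "__main__":
    # A hypothetical 5,000,000 bp genome with 40 and then 60 known errors.
    print(meets_finished_error_rate(40, 5_000_000))
    print(meets_finished_error_rate(60, 5_000_000))
```

Under this criterion a 5,000,000 bp genome could carry at most 50 errors, so the first hypothetical case passes and the second does not.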
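Chris Detter, leader of the LANL genome science group and director of the Joint Genome Institute's LANL center, said the aim of the work is to let all major genome centers and genome research groups use classified genome sequence data that fit their needs. To keep genome sequences as complete as possible, smaller research centers can also adopt the classification levels when producing and submitting their results, which helps other scientists understand the work that has already been done. (Bioon.com)
Original source recommended by Bioon: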
Science 9 October 2009: DOI: 10.1126/science.1180614
Genome Project Standards in a New Era of Sequencing
P. S. G. Chain,1,2,3,* D. V. Grafham,4 R. S. Fulton,5 M. G. FitzGerald,6 J. Hostetler,7 D. Muzny,8 J. Ali,9 B. Birren,6 D. C. Bruce,1,10 C. Buhay,8 J. R. Cole,3 Y. Ding,8 S. Dugan,8 D. Field,11 G. M. Garrity,3 R. Gibbs,8 T. Graves,5 C. S. Han,1,10 S. H. Harrison,3,* S. Highlander,8 P. Hugenholtz,1 H. M. Khouri,12 C. D. Kodira,6,* E. Kolker,13,14 N. C. Kyrpides,1 D. Lang,12 A. Lapidus,1 S. A. Malfatti,12 V. Markowitz,15 T. Metha,6 K. E. Nelson,7 J. Parkhill,4 S. Pitluck,1 X. Qin,8 T. D. Read,16 J. Schmutz,17 S. Sozhamannan,18 P. Sterk,11 R. L. Strausberg,7 G. Sutton,7 N. R. Thomson,4 J. M. Tiedje,3 G. Weinstock,5 A. Wollam,5 Genomic Standards Consortium Human Microbiome Project Jumpstart Consortium, J. C. Detter,10
For over a decade, genome sequences have adhered to only two standards that are relied on for purposes of sequence analysis by interested third parties (1, 2). However, ongoing developments in revolutionary sequencing technologies have resulted in a redefinition of traditional whole-genome sequencing that requires reevaluation of such standards. With commercially available 454 pyrosequencing (followed by Illumina, SOLiD, and now Helicos), there has been an explosion of genomes sequenced under the moniker "draft"; however, these can be very poor quality genomes (due to inherent errors in the sequencing technologies, and the inability of assembly programs to fully address these errors). Further, one can only infer that such draft genomes may be of poor quality by navigating through the databases to find the number and type of reads deposited in sequence trace repositories (and not all genomes have this available), or to identify the number of contigs or genome fragments deposited to the database. The difficulty in assessing the quality of such deposited genomes has created some havoc for genome analysis pipelines and has contributed to many wasted hours. Exponential leaps in raw sequencing capability and greatly reduced prices have further skewed the time- and cost-ratios of draft data generation versus the painstaking process of improving and finishing a genome. The result is an ever-widening gap between drafted and finished genomes that only promises to continue (see the figure, page 236); hence, there is an urgent need to distinguish good from poor data sets.
1 U.S. Department of Energy Joint Genome Institute.
2 Lawrence Livermore National Laboratory.
3 Michigan State University.
4 The Sanger Institute.
5 Washington University School of Medicine.
6 The Broad Institute.
7 J. Craig Venter Institute.
8 Baylor College of Medicine.
9 Ontario Institute for Cancer Research.
10 Los Alamos National Laboratory.
11 Natural Environmental Research Council Centre for Ecology and Hydrology.
12 National Center for Biotechnology Information.
13 Seattle Children's Hospital and Research Institute.
14 University of Washington School of Medicine.
15 Lawrence Berkeley National Laboratory.
16 Emory GRA (Georgia Research Alliance) Genomics Center.
17 HudsonAlpha Institute.
18 Naval Medical Research Center.