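Recently, Chuck Cannon, a researcher at the Xishuangbanna Tropical Botanical Garden of the Chinese Academy of Sciences (CAS), working with scientists from the Beijing Institute of Genomics and Texas Tech University in the United States, developed a software package that analyzes high-throughput short-read sequence data directly, simplifying comparative genomic and transcriptomic studies based on such data. The results were recently published in PLoS ONE.
According to Cannon, high-throughput sequencing, also known as "next-generation" sequencing, can sequence hundreds of thousands to millions of DNA molecules in parallel in a single run. As a result, it gives a more complete picture of a species' transcriptome and genome than was previously possible.
However, because raw reads from next-generation sequencing are only tens to one or two hundred bases long, the conventional analysis workflow requires bioinformatics tools to assemble these short reads into longer contigs or genome scaffolds before biologically meaningful results can be obtained. This has held back the use of such data in genomic studies of non-model organisms that lack a reference genome.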
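"The package we developed for direct analysis of high-throughput short-read data checks whether kmer fragments are present in the data and how often they occur, and uses this to explore sequence differences among a given number of target genomes. It can therefore break through the bioinformatics bottleneck that this kind of data often runs into," Cannon told reporters.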
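To make the idea of kmer presence and frequency concrete, here is a minimal Python sketch. It is not the package described in the article; the function names `count_kmers` and `compare_kmer_profiles` are hypothetical, and the choice to skip kmers containing ambiguous bases and to ignore reverse complements is a simplifying assumption.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def count_kmers(reads, k):
    """Count every length-k substring (kmer) across a collection of reads.

    Illustrative only: kmers containing bases other than A, C, G, T are
    skipped, and reverse complements are not merged.
    """
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= VALID_BASES:
                counts[kmer] += 1
    return counts


def compare_kmer_profiles(reads_a, reads_b, k):
    """Split kmers into those shared by both samples and those unique to each."""
    a = count_kmers(reads_a, k)
    b = count_kmers(reads_b, k)
    shared = set(a) & set(b)
    only_a = set(a) - set(b)
    only_b = set(b) - set(a)
    return shared, only_a, only_b
```

Calling `compare_kmer_profiles(["ACGT"], ["CGTA"], 3)` separates the kmers found in both inputs from those found in only one, without assembling either input first.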
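Building on their earlier work, the team also further improved the assembly-free analysis method and compared 174 complete chloroplast genomes to demonstrate the package's capabilities and workflow.
The research was funded by the CAS Knowledge Innovation Program (Important Direction Project) and Yunnan Province's High-End Scientific and Technological Talent Recruitment Program. (生物谷Bioon.com)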
doi:10.1371/journal.pone.0048995
Reference-Free Comparative Genomics of 174 Chloroplasts
Chai-Shian Kua, Jue Ruan, John Harting, Cheng-Xi Ye, Matthew R. Helmus, Jun Yu, Charles H. Cannon
Direct analysis of unassembled genomic data could greatly increase the power of short read DNA sequencing technologies and allow comparative genomics of organisms without a completed reference available. Here, we compare 174 chloroplasts by analyzing the taxonomic distribution of short kmers across genomes [1]. We then assemble de novo contigs centered on informative variation. The localized de novo contigs can be separated into two major classes: tip = unique to a single genome and group = shared by a subset of genomes. Prior to assembly, we found that ~18% of the chloroplast was duplicated in the inverted repeat (IR) region across a four-fold difference in genome sizes, from a highly reduced parasitic orchid [2] to a massive algal chloroplast [3], including gnetophytes [4] and cycads [5]. The conservation of this ratio between single copy and duplicated sequence was basal among green plants, independent of photosynthesis and mechanism of genome size change, and different in gymnosperms and lower plants. Major lineages in the angiosperm clade differed in the pattern of shared kmers and de novo contigs. For example, parasitic plants demonstrated an expected accelerated overall rate of evolution, while the hemi-parasitic genomes contained a great deal more novel sequence than holo-parasitic plants, suggesting different mechanisms at different stages of genomic contraction. Additionally, the legumes are diverging more quickly and in different ways than other major families. Small duplicated fragments of the rrn23 genes were deeply conserved among seed plants, including among several species without the IR regions, indicating a crucial functional role of this duplication.
Localized de novo assembly of informative kmers greatly reduces the complexity of large comparative analyses by confining the analysis to a small partition of data and genomes relevant to the specific question, allowing direct analysis of next-gen sequence data from previously unstudied genomes and rapid discovery of informative candidate regions.
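The "tip" and "group" classes in the abstract can be illustrated with a small Python sketch. This is a simplification under stated assumptions, not the authors' pipeline: it labels individual kmers rather than de novo contigs, the names `kmer_set` and `classify_kmers` are hypothetical, and the extra "core" label for kmers present in every genome is introduced here only for illustration.

```python
from collections import defaultdict


def kmer_set(sequence, k):
    """Return the set of distinct length-k substrings of a sequence."""
    seq = sequence.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify_kmers(genomes, k):
    """Label each kmer by how many genomes contain it.

    genomes: dict mapping a genome name to its sequence string.
    Returns a dict mapping kmer -> (label, sorted list of genome names), where
      "tip"   = found in exactly one genome,
      "group" = found in some but not all genomes,
      "core"  = found in every genome (hypothetical label, not from the paper).
    """
    carriers = defaultdict(set)
    for name, seq in genomes.items():
        for kmer in kmer_set(seq, k):
            carriers[kmer].add(name)

    result = {}
    for kmer, names in carriers.items():
        if len(names) == 1:
            label = "tip"
        elif len(names) < len(genomes):
            label = "group"
        else:
            label = "core"
        result[kmer] = (label, sorted(names))
    return result
```

With the toy input `{"g1": "ACGTA", "g2": "ACGTT", "g3": "ACGCC"}` and `k=3`, the kmer `GTA` occurs only in `g1` and is labeled a tip, while `CGT` occurs in `g1` and `g2` and is labeled a group. Restricting later analysis to the kmers and genomes in one such partition is the kind of reduction the abstract's closing sentence describes.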