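Second-generation sequencing, also known as deep sequencing, is called RNA-seq or RNA sequencing when applied to RNA, and it has become an important tool for gene expression and transcriptome analysis. Second-generation transcriptome sequencing data contain large numbers of non-protein-coding RNA (ncRNA) sequences. Because these sequences, like dark matter in the universe, are hard to identify yet functionally important, they are also called "genomic dark matter". The sheer volume of data, poor sequence conservation, and interference from noise have made identifying this "dark matter" a bottleneck in epigenetics and regulatory network research. piRNAs are the most numerous class of ncRNA. They control transposon expression mainly through sequence complementarity with transposons, and thereby regulate reproduction and development. Because piRNAs from different species share very little homology, no effective method for identifying them has been available so far.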
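Yi Zhang and colleagues in Le Kang's research group at the Institute of Zoology, Chinese Academy of Sciences, recently published a paper titled "A k-mer scheme to predict piRNAs and characterize locust piRNAs", which addresses the problem of predicting piRNAs, the most numerous class of non-coding RNA in organisms, with high precision. The paper appeared in Bioinformatics (IF=4.926), a leading journal in the field.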
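The paper proposes an algorithm that predicts piRNAs with a Fisher discriminant built on k-mer string frequencies. Its precision exceeds 90%, compared with the 61% reported by B. Doron of Harvard University. Using this method, the authors identified more than 80,000 piRNAs in the migratory locust and estimate that the locust may have about 130,000 piRNAs in total. Further analysis showed that these piRNAs differ greatly between the gregarious and solitary phases of the locust, which may offer an important clue for explaining the difference in fecundity between the two phases.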
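The idea of scoring a sequence with a Fisher discriminant over k-mer frequencies can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the choice of k=2, the small ridge term, the midpoint threshold, all function names, and the toy training sequences are hypothetical and chosen only to keep the example short.

```python
from itertools import product

def kmer_frequencies(seq, k=2):
    """Relative frequency of each of the 4**k k-mers, in a fixed A/C/G/T order."""
    seq = seq.upper().replace("U", "T")          # treat RNA and DNA alphabets alike
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:                     # skip windows with ambiguous bases
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]

def mean(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def fit_fisher(pos, neg, ridge=1e-3):
    """Fisher direction w = Sw^-1 (mu_pos - mu_neg) plus a midpoint threshold.

    The ridge term is an assumption of this sketch: k-mer frequencies sum to 1,
    so the within-class scatter matrix Sw is singular without it.
    """
    mu_pos, mu_neg = mean(pos), mean(neg)
    d = len(mu_pos)
    sw = [[ridge if i == j else 0.0 for j in range(d)] for i in range(d)]
    for vectors, mu in ((pos, mu_pos), (neg, mu_neg)):
        for v in vectors:
            diff = [x - y for x, y in zip(v, mu)]
            for i in range(d):
                for j in range(d):
                    sw[i][j] += diff[i] * diff[j]
    w = solve(sw, [p - q for p, q in zip(mu_pos, mu_neg)])
    threshold = (dot(w, mu_pos) + dot(w, mu_neg)) / 2
    return w, threshold

def predict(seq, w, threshold, k=2):
    """True if the sequence projects onto the positive side of the threshold."""
    return dot(w, kmer_frequencies(seq, k)) > threshold

# Hypothetical toy data: U/A-rich "positives" and G/C-rich "negatives".
toy_pos = ["TATTTATTTATTTATTTATTTATTTA", "TTTATTTTATTTTATTTTATTTTATT",
           "TTATTATTATTATTATTATTATTATT"]
toy_neg = ["GCGGGCGGGCGGGCGGGCGGGCGGGC", "GGGCGGGGCGGGGCGGGGCGGGGCGG",
           "GGCGGCGGCGGCGGCGGCGGCGGCGG"]
w, threshold = fit_fisher([kmer_frequencies(s) for s in toy_pos],
                          [kmer_frequencies(s) for s in toy_neg])
looks_positive = predict("UUUUAUUUUUAUUUAUUUUAUUUUUA", w, threshold)
```

Because the features are composition frequencies rather than alignments, a classifier of this kind needs no genome sequence and no cross-species homology, which is the property the article emphasizes. Real training sets would of course be far larger and less cleanly separated than the toy sequences here.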
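This new method, which identifies piRNAs in non-model organisms without relying on genome data, has both theoretical significance and broad practical value. The online tool piRNApredictor (http://59.79.168.90/piRNA/index.php) has already been used by a research institution abroad to study piRNAs in pigs.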
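This advance in piRNA prediction offers an important lesson for predicting other ncRNAs: non-conserved ncRNAs can be predicted. Because the underlying theory is general, the method can predict piRNAs in other species and, by changing the training set, can also be applied to other classes of ncRNA. In addition, the high-precision piRNA predictions produced by the online tool are of theoretical and practical value for further research on epigenetics, regulatory networks, and piRNA function. (Bioon.com)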
Original source recommended by Bioon:
Bioinformatics (2011) 27 (6): 771-776. doi: 10.1093/bioinformatics/btr016
A k-mer scheme to predict piRNAs and characterize locust piRNAs
Yi Zhang1,2, Xianhui Wang1 and Le Kang1,*
Motivation: Identifying piwi-interacting RNAs (piRNAs) of non-model organisms is a difficult and unsolved problem because piRNAs lack conservative secondary structure motifs and sequence homology in different species.
Results: In this article, a k-mer scheme is proposed to identify piRNA sequences, relying on the training sets from non-piRNA and piRNA sequences of five model species sequenced: rat, mouse, human, fruit fly and nematode. Compared with the existing ‘static’ scheme based on the position-specific base usage, our novel ‘dynamic’ algorithm performs much better with a precision of over 90% and a sensitivity of over 60%, and the precision is verified by 5-fold cross-validation in these species. To test its validity, we use the algorithm to identify piRNAs of the migratory locust based on 603 607 deep-sequenced small RNA sequences. Totally, 87 536 piRNAs of the locust are predicted, and 4426 of them matched with existing locust transposons. The transcriptional difference between solitary and gregarious locusts was described. We also revisit the position-specific base usage of piRNAs and find the conservation in the end of piRNAs. Therefore, the method we developed can be used to identify piRNAs of non-model organisms without complete genome sequences.
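For readers less familiar with the metrics quoted in the abstract, the sketch below shows how precision, sensitivity, and a 5-fold split are commonly computed. It is an illustration only: the function names and the label lists are hypothetical, and the simple interleaved fold assignment (without shuffling) is an assumption of this example, not a description of the paper's protocol.

```python
def precision_and_sensitivity(true_labels, predicted_labels):
    """Precision = TP / (TP + FP); sensitivity (recall) = TP / (TP + FN)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t and p)        # piRNA predicted as piRNA
    fp = sum(1 for t, p in pairs if not t and p)    # non-piRNA predicted as piRNA
    fn = sum(1 for t, p in pairs if t and not p)    # piRNA that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    return precision, sensitivity

def k_fold_indices(n_items, n_folds=5):
    """Yield (train, test) index lists; item i is held out in fold i % n_folds."""
    for fold in range(n_folds):
        test = [i for i in range(n_items) if i % n_folds == fold]
        train = [i for i in range(n_items) if i % n_folds != fold]
        yield train, test

# Hypothetical labels: 1 = piRNA, 0 = non-piRNA.
truth = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
precision, sensitivity = precision_and_sensitivity(truth, predicted)
folds = list(k_fold_indices(len(truth)))
```

A classifier tuned for high precision at the cost of sensitivity, as in the abstract's figures of over 90% and over 60%, returns few false piRNAs but misses a share of the true ones.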