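生物谷 (Bioon) report: Whether non-coding cDNAs serve any purpose has long been debated. A recent study found that non-coding cDNAs in mouse can encode functional RNA genes, and the latest large-scale study holds that these non-coding cDNAs can be expressed and may have special functions, for example playing an important role in processes such as the formation of CpG islands.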
We downloaded FANTOM release 2.0 cDNAs from the authors' website. Table 1 shows the data from the four categories defined by the authors, which we refer to as coding 1 (probably protein), coding 2 (marginal protein), non-coding 1 (marginal RNA), and non-coding 2 (probably RNA). Overall transcript sizes average about 2 kilobases (kb) in each category; most known RNA genes are much smaller than this. For example, the 587 mouse entries in the Rfam database (ref. 4) average 96 base pairs (bp) in length. Larger RNA genes do exist (such as H19 and Xist) and many are stored in the Erdmann database (ref. 5). Another striking difference between the given categories is the increase from 13.4% single-exon genes in coding 1 to 68.7% and 73.1% single-exon genes in non-coding 1 and non-coding 2, respectively.
As an evolutionarily neutral control, we use 'intergenic' sequences of 2 kb in length that are at least 5 kb distant from genes annotated by Ensembl, predicted by FgeneSH, or aligned to cDNAs. Transposons identified by RepeatMasker are excluded, as is the 5% of highly conserved mouse sequence that is under purifying selection (ref. 6). Conversely, we have two positive controls: one is the coding 1 category of protein-coding genes and the other is a set of all known mouse RNA genes. To avoid an overt bias towards small RNA genes, we removed genes smaller than 80 bp in Rfam, which still leaves many genes encoding splicing factors such as U1 and U6. We then added all the mouse genes in the Erdmann database, which total 40. The resultant set of 321 RNA genes is referred to as 'ncRNAs'.
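The selection of intergenic control windows described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names `merge_intervals` and `intergenic_windows` are hypothetical, coordinates are assumed to be half-open intervals on a single chromosome, and excluded regions (transposons, highly conserved sequence) are assumed to disqualify any window they overlap, which the text does not spell out.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one chromosome


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intergenic_windows(
    chrom_len: int,
    genes: List[Interval],
    excluded: List[Interval],
    window: int = 2000,   # 2 kb control sequences
    margin: int = 5000,   # at least 5 kb away from any gene
) -> List[Interval]:
    """Tile the chromosome with fixed-size windows that avoid blocked regions."""
    # Genes are padded by the margin; excluded regions are used as given
    # (assumption: repeats and conserved sequence block a window outright).
    padded = [(max(0, s - margin), min(chrom_len, e + margin)) for s, e in genes]
    blocked = merge_intervals(padded + list(excluded))

    windows: List[Interval] = []
    pos = 0
    # A zero-length sentinel at the chromosome end flushes the final gap.
    for b_start, b_end in blocked + [(chrom_len, chrom_len)]:
        while pos + window <= b_start:
            windows.append((pos, pos + window))
            pos += window
        pos = max(pos, b_end)
    return windows
```

For instance, `intergenic_windows(30000, [(10000, 12000)], [(1000, 1500)])` would skip everything from 5,000 to 17,000 because of the padded gene, and would also skip the short excluded stretch near the start.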
Genome sequences were taken from the UCSC Genome Browser with time stamp 28 June 2003 (rat) and 10 April 2003 (human). BlastZ (ref. 7) was used for the alignments, with default settings K=3,000 and H=2,200. The C=2 option enabled us to chain exons together. Although the complexities of the chaining procedure may prevent a few multi-exon genes from aligning, this should not be a problem for non-coding cDNAs as most are single-exon. We specified that the fraction of transcript length that is aligned by BlastZ must exceed a predetermined alignment threshold of 25%: this low threshold ensures that our positive controls almost always pass (Fig. 1).
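The 25% alignment threshold can be illustrated with a short sketch. It assumes the aligned segments have already been extracted from the alignment output as half-open intervals in transcript coordinates; the parsing step is omitted, and the names `aligned_fraction` and `passes_threshold` are hypothetical rather than part of BlastZ.

```python
from typing import Iterable, Tuple


def aligned_fraction(transcript_len: int, blocks: Iterable[Tuple[int, int]]) -> float:
    """Fraction of the transcript covered by at least one aligned block.

    Overlapping blocks are counted once, so the result never exceeds 1.0
    as long as the blocks lie within the transcript.
    """
    covered = 0
    cur_end = 0
    for start, end in sorted(blocks):
        start = max(start, cur_end)  # skip the part already counted
        if end > start:
            covered += end - start
            cur_end = end
    return covered / transcript_len


def passes_threshold(
    transcript_len: int,
    blocks: Iterable[Tuple[int, int]],
    threshold: float = 0.25,  # the 25% cutoff quoted in the text
) -> bool:
    """True if the aligned fraction strictly exceeds the threshold."""
    return aligned_fraction(transcript_len, blocks) > threshold
```

The strict inequality follows the wording "must exceed"; a transcript aligned over exactly 25% of its length would be rejected under this reading.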
Figure 1 Comparisons between rat (left) and human (right) data.
The crucial observation is that the distributions of sequence identity and insertion–deletion ('indel') rate are remarkably similar for non-coding 1, non-coding 2 and intergenic. Even the widths of the distributions, a reflection of the stochastic nature of the underlying evolutionary process, are highly similar. The best conserved are coding 1 and ncRNAs, and the least well conserved are non-coding 1, non-coding 2 and intergenic. The larger effect is observed in mouse-to-human, because it represents 75 million years of divergence, compared with only 14–24 million years in mouse-to-rat. For the latter comparison, the shift between distributions is small compared with their width; however, it is significant, because it is a shift in an entire distribution, whereas the oft-cited rule applies to a single point sampled from a distribution.
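As a rough illustration of the two quantities compared here, the sketch below computes sequence identity and indel rate from a gapped pairwise alignment, and then expresses a shift between two samples in units of the standard error of the mean rather than the distribution width. The definitions are assumptions, since the text does not give exact formulas: identity is taken as matches per ungapped column, and indel rate as gap events per ungapped column. One reading of the significance argument is that a shift in a whole distribution should be judged against the standard error, which shrinks with sample size. The function names are hypothetical.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def identity_and_indel_rate(a: str, b: str) -> Tuple[float, float]:
    """Return (identity, indel rate) for two gapped, equal-length sequences.

    A run of consecutive gap columns counts as a single indel event.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = aligned = gap_events = 0
    in_gap = False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            if not in_gap:
                gap_events += 1
            in_gap = True
        else:
            in_gap = False
            aligned += 1
            if x.upper() == y.upper():
                matches += 1
    if aligned == 0:
        raise ValueError("alignment has no ungapped columns")
    return matches / aligned, gap_events / aligned


def shift_in_standard_errors(sample_a: Sequence[float], sample_b: Sequence[float]) -> float:
    """Difference of the two sample means divided by its standard error."""
    delta = mean(sample_a) - mean(sample_b)
    se = math.sqrt(
        stdev(sample_a) ** 2 / len(sample_a) + stdev(sample_b) ** 2 / len(sample_b)
    )
    return delta / se
```

With a few hundred transcripts per category, a shift of about one distribution width can correspond to a difference of many standard errors, which is the sense in which a visually small shift can still be significant.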
The simplest explanation is that non-functional transcripts can be produced at low copy numbers, escape the cell's messenger RNA surveillance system, and yet inflict no damage on the cell. Table 1 highlights two theories. If these are processed pseudogenes, there should be residual similarity to known proteins, especially mouse proteins. With an E-value cutoff of 10^-2, we find that 36.5% and 19.0% of non-coding 1 and non-coding 2, respectively, are similar to mouse coding 1. Just 15.7% and 2.4% are similar to SwissProt, because SwissProt does not store translated cDNAs. If random genomic sequence is transcribed, we should find transposon remnants (ignoring short interspersed elements because they are derived from transfer RNAs). This is indeed the case for 48.4% and 46.4% of non-coding 1 and non-coding 2, respectively. Note too that the ncRNAs control set is mostly negative for pseudogenes and random genomic sequence.
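The percentages quoted above are fractions of transcripts with at least one similarity hit below the E-value cutoff. A minimal sketch of that tally is shown here, assuming the search results have already been reduced to (query ID, E-value) pairs; the function name `fraction_with_hit` is hypothetical, and treating the cutoff as inclusive is an assumption.

```python
from typing import Iterable, Tuple


def fraction_with_hit(
    query_ids: Iterable[str],
    hits: Iterable[Tuple[str, float]],
    max_evalue: float = 1e-2,  # cutoff quoted in the text
) -> float:
    """Fraction of queries with at least one hit at or below the cutoff."""
    queries = set(query_ids)
    if not queries:
        raise ValueError("no query sequences supplied")
    # Collect queries that have at least one sufficiently significant hit.
    matched = {q for q, evalue in hits if evalue <= max_evalue}
    return len(matched & queries) / len(queries)
```

For example, with four queries of which two have a hit at or below 10^-2, the function returns 0.5. Queries with no hits at all still count in the denominator, which matters when most transcripts in a category have no detectable similarity.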
Given that all of the best techniques for detecting RNA genes depend on sequence conservation (refs 8, 9), the absence of this cannot be summarily dismissed, even if isolated examples of RNA genes being weakly conserved can be found (ref. 10). Extraordinary claims require extraordinary proof. This is particularly true when much of the data support an alternative interpretation: that these are simply non-functional cDNAs.
JUN WANG*†, JIANGUO ZHANG†, HONGKUN ZHENG†, JUN LI†, DONGYUAN LIU†, HENG LI†, RAM SAMUDRALA‡, JUN YU*† & GANE KA-SHU WONG*†§
* James D. Watson Institute of Genome Sciences of Zhejiang University, Hangzhou Genomics Institute, Hangzhou 310007, China
† Beijing Institute of Genomics of the Chinese Academy of Sciences, Beijing 101300, China
‡ Computation Genomics Group, Department of Microbiology, University of Washington, Seattle, Washington 98195, USA
§ University of Washington Genome Center, Department of Medicine, Seattle, Washington 98195, USA
Related reports:
1. Okazaki, Y. et al. Nature 420, 563–573 (2002).