The study of functional associations and interactions between proteins is a central focus of proteomics. A number of computational and high-throughput experimental methods have emerged for identifying links between proteins. These methods are effective to some degree, allowing researchers to screen large-scale data for the relationships that interest them. A database that stores all such functional links would allow these data to be used to their fullest. The Predictome database provides predicted functional links between proteins in 44 genomes. Predictome draws on three computational methods (chromosomal proximity, phylogenetic profiling and domain fusion) together with large-scale experimental screenings of protein–protein interaction data. Predictome is available at http://predictome.bu.edu/.
INTRODUCTION
The function of a protein is perhaps best described in terms of its interactions with other proteins (1). An interaction between two proteins can be understood not only as a physical interaction, but also as an abstract association that implies some general relationship. For example, two proteins may be said to be linked if they are involved in the same metabolic pathway, or necessary for the enactment of a cellular process. Traditionally, the dominant computational method for detecting functional relationships between proteins has been database sequence similarity searches such as BLAST (2). Recently, several non-homology-based methods have been proposed for detecting such interactions, among them phylogenetic profiling (3–6), chromosomal proximity (7,8) and domain fusion (9–11), as well as high-throughput experimental methods (12–14). However, as useful as these methods are, no global database exists to perform a complementary analysis in interaction space as one does in sequence space using BLAST (2). Here we present a database of predicted links between proteins, Predictome, that is based on the implementation of published computational methods and publicly available data, to facilitate precisely such an analysis.
THE Predictome DATABASE
Several published databases exist which rely on experimental methods (15,16) or shared context (17,18) to link functionally related proteins. Similarly, the methods included in Predictome essentially serve to link one protein to another. The method of chromosomal proximity links two proteins if they are encoded close to one another along the genomic sequence, are transcribed in the same direction, and their orthologs are proximate in a number of other genomes (8). Two proteins are linked by a phylogenetic link if they share the same evolutionary pattern, such that their orthologs are either both present or absent in the genomes of known sequences (6). If two distinct proteins in one organism are encoded as one multi-domain protein in another organism, they are said to be fusion linked (9,10). The experimental detection of physical interactions between proteins by methods such as yeast two-hybrid analysis (19) provides a complementary experimental source of links to those links imputed by the sequence-based methods.
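As an illustration of the phylogenetic linking rule described above, the following is a minimal sketch rather than the Predictome implementation. It assumes each protein is represented by a binary presence/absence profile over a fixed list of genomes and links two proteins when their profiles are identical. The function name `phylogenetic_links`, the protein names and the toy profiles are hypothetical. Skipping proteins that are present in every genome or in none is an added assumption, made because such profiles carry no discriminating pattern.

```python
from collections import defaultdict
from itertools import combinations


def phylogenetic_links(profiles):
    """Link proteins whose presence/absence profiles are identical.

    profiles maps a protein name to a sequence of 0/1 flags, one per genome,
    where 1 means an ortholog is present in that genome.
    """
    by_pattern = defaultdict(list)
    for protein, profile in profiles.items():
        # Assumption: profiles that are all-present or all-absent are
        # uninformative and are skipped.
        if all(profile) or not any(profile):
            continue
        by_pattern[tuple(profile)].append(protein)

    links = set()
    for members in by_pattern.values():
        # Every pair of proteins sharing a pattern forms one link.
        for a, b in combinations(sorted(members), 2):
            links.add((a, b))
    return links


# Hypothetical toy data: four proteins across four genomes.
profiles = {
    "protA": [1, 0, 1, 1],
    "protB": [1, 0, 1, 1],
    "protC": [0, 1, 0, 1],
    "protD": [1, 1, 1, 1],
}
links = phylogenetic_links(profiles)
print(links)
```

In this toy example only protA and protB share a pattern, so they form the single link. A practical implementation would likely tolerate near-identical profiles rather than require an exact match.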
The usefulness of links between proteins to predict function has been previously demonstrated. For example, Marcotte et al. (20) were able to offer functional annotation for roughly half of the unannotated genes in Saccharomyces cerevisiae by examining the functional links which they form with the rest of the genes in that genome. These results led to the hypothesis that the predictive power of any link is increased when supported by multiple methods. Huynen et al. (21) studied the correlation of individual links predicted by different methods in Mycoplasma genitalium and found that the strength of an inference is increased when supported by multiple methods. Finally, the application of combinations of these methods has also been well reviewed (22–24).
Although the published predictive methods have been shown to be reasonably adept at detecting functional associations, their role in actually assisting protein annotation remains to be tested. The difficulty in proceeding from prediction to experimental validation may be attributed to the lack of a dedicated database that contains all of the links predicted by all of the methods. We believe that such a database will aid the scientific community in organizing and accessing the predictions and thus effectively bridge computational predictions with their experimental validation.
SOURCE DATASETS AND METHODS
The published methods for generating phylogenetic links, chromosomal proximity links and fusion links have been re-implemented to apply to the 44 microbial genomes currently available. Since a working definition of orthology is central to these three methods, we have chosen the Clusters of Orthologous Groups (COG) database, which provides a well-established model for detecting orthology, as the framework for generating these links (5,25).
Similar to the computational methods, high-throughput experimental methods are also capable of yielding putative links between proteins. Recently, the yeast two-hybrid method has been used as a systematic tool for establishing global sets of physical interactions between proteins (12–14), and these interactions are available from publicly accessible web sites. These data sets have been compiled and integrated into Predictome.
The usefulness of this database naturally increases as the number of methods it includes grows, and we expect that more methods will be added over time. For example, links based on the correlated expression of genes derived from DNA chips and microarrays would be of tremendous value. We also expect in librio links, based on automated literature searches for the co-occurrence of genes/proteins in the same publication (26,27), to be added in the future. In addition, users of the database have the option of submitting their own links on the submission page of the web site.
Through an analysis of the predicted inter-protein links based upon the various methods, it is possible to explore how the methods correlate with each other and with known biological pathways and processes. Figure 1 illustrates such a comparison for a subset of 15 Escherichia coli proteins involved in the tricarboxylic acid (TCA) cycle. Since all 15 genes are in the same pathway, the predicted links among them recover existing, known associations. In order to assess the overall sensitivity of the links, we examine their correlation with three reference databases: COG (5,25), KEGG (28) and GeneQuiz (29) (Table 1). This analysis provides an evaluation of the methods used to create links, as well as the selectivity of categorization in these databases. As is illustrated in Figure 1, few pairs of proteins are linked by more than one method. Table 2 shows the correlation between the sets of links generated by different methods. It is apparent from these results that false positives correspond to a substantial fraction of the links, typically 30%, and are difficult to identify given the limitations of genomic annotation. To assist users in identifying links of higher confidence, each link in Predictome is marked when the association agrees with a functional assignment in COG or pathway information in KEGG. Furthermore, users of the database can view those links produced by multiple methods, which are therefore less likely to be produced by chance.
Figure 1. Visualization of predicted links among components of the TCA cycle in E. coli. Red, links based on phylogenetic profiling; blue, gene fusion links; green, links established by chromosomal proximity. frdB/sdhB and frdA/sdhA are paralogous pairs (indicated by *).
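The idea of viewing only those links that are produced by multiple methods can be sketched as follows. This is an illustrative sketch under stated assumptions, not the query logic used by Predictome: links are assumed to be undirected pairs of gene identifiers grouped by method, and the function names (`support_by_method`, `multi_method_links`), the method labels and the gene identifiers `g1` to `g4` are all hypothetical.

```python
from collections import defaultdict


def support_by_method(links_by_method):
    """Map each undirected link to the set of methods that predict it."""
    support = defaultdict(set)
    for method, links in links_by_method.items():
        for a, b in links:
            # frozenset makes (a, b) and (b, a) the same link.
            support[frozenset((a, b))].add(method)
    return support


def multi_method_links(links_by_method, min_methods=2):
    """Keep only links predicted by at least min_methods distinct methods."""
    support = support_by_method(links_by_method)
    return {
        pair: methods
        for pair, methods in support.items()
        if len(methods) >= min_methods
    }


# Hypothetical toy data; identifiers and links are invented for illustration.
links_by_method = {
    "phylogenetic": [("g1", "g2"), ("g2", "g3")],
    "proximity": [("g2", "g1"), ("g3", "g4")],
    "fusion": [("g1", "g2")],
}
confident = multi_method_links(links_by_method)
for pair, methods in confident.items():
    print(sorted(pair), sorted(methods))
```

Here only the g1/g2 link is supported by more than one method, so it is the only link retained. Raising `min_methods` trades coverage for confidence, which mirrors the observation that few pairs are linked by more than one method.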
APPLICATIONS AND FEATURES
Predictome is implemented as a web-accessible relational database using the PostgreSQL RDBMS. The schema and instructions for use of this database can be viewed on the database web page, http://predictome.bu.edu. Users can browse the database by entering gene names or keywords, and navigate through the network of predicted links. An optional Java-based applet allows for the visualization of small sections of the network. The complete list of protein links and supporting data, as well as the technical specifications of the database system, are publicly accessible through the home page.
Joseph C. Mellor, Itai Yanai, Karl H. Clodfelter, Julian Mintseris and Charles DeLisi*
Bioinformatics Graduate Program and Department of Biomedical Engineering, Boston University, Boston, MA 02215, USA