The Protein Information Resource (PIR) is an integrated public resource of protein functional annotation data that supports genomic and proteomic research. In collaboration with MIPS (the Munich Information Center for Protein Sequences) and JIPID (the Japan International Protein Information Database), PIR produces the PIR-International Protein Sequence Database (PSD), a major annotated protein database containing about 250 000 proteins. To improve the agreement between protein annotation and experimental data, PIR has built a system that lets researchers submit, categorize and retrieve literature information. PIR classifies proteins at the superfamily, domain and motif levels. It also provides structural and functional information on proteins, with cross-references to over 40 other databases. In addition, PIR provides a non-redundant protein database of about 800 000 sequences collected from PIR-PSD, SWISS-PROT, TrEMBL, GenPept, RefSeq and PDB, giving each sequence a composite name and the associated literature. To improve database interoperability, PIR has adopted an open database framework and distributes its data in XML. The PIR web site (http://pir.georgetown.edu/) also provides general bioinformatics tools for data mining.
INTRODUCTION
The Protein Information Resource (PIR) has been providing the scientific community with annotated protein databases and analysis tools for over three decades. To better support research in functional genomics and proteomics and facilitate knowledge discovery, we have made several new advances in the last year, in addition to further enhancing the PIR-International Protein Sequence Database. Some key developments include: launch of a new submission mechanism for literature data, distribution of a new non-redundant reference protein database, enhancement of the integrated classification database, and redesign of the web site for easy navigation, information retrieval and sequence analysis.
PIR-INTERNATIONAL PROTEIN SEQUENCE DATABASE
The PIR, along with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID), continues to enhance and distribute the PIR-International Protein Sequence Database (PSD), a non-redundant, expertly annotated, fully classified and extensively cross-referenced protein sequence database in the public domain. It contains about 250 000 protein sequences with comprehensive coverage across the entire taxonomic range, including sequences from all the publicly available complete genomes.
Superfamily classification
A unique characteristic of the PIR-PSD is the superfamily/family classification (1) that provides complete and non-overlapping clustering of proteins based on global (end-to-end) sequence similarity. Sequences in the same superfamily share common domain architecture (i.e. have the same number, order and types of domains) and do not differ excessively in overall length unless they are fragments or result from alternate splicing or initiators. The automated classification system places new members into existing superfamilies and defines new superfamily clusters using parameters including the percentage of sequence identity, overlap length ratio, distance to neighboring superfamily clusters, and overall domain arrangement. Currently, >99% of sequences are classified into families of closely related sequences (at least 45% identical), and over two-thirds of sequences are classified into over 33 000 superfamilies. The automated classification is being augmented by manual curation of superfamilies, starting with those containing at least one definable domain, to provide superfamily names, brief descriptions, bibliography, list of representative and seed members, as well as domain and motif architecture characteristic of the superfamily.
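As an illustration only, the following Python sketch shows how two of the parameters named above, percentage of sequence identity and overlap length ratio, could be computed from a gapped pairwise alignment and combined into a family-level test. The function names, the exact definitions of both measures and the 0.8 overlap cutoff are assumptions made for this example; only the 45% identity threshold comes from the text, and the sketch is not the PIR classification system itself.

```python
def identity_and_overlap(aligned_a: str, aligned_b: str) -> tuple[float, float]:
    """Return (percent identity, overlap length ratio) for two gapped,
    equal-length alignment rows. Both definitions are assumptions:
    identity is counted over columns where neither row has a gap, and the
    overlap ratio is that column count divided by the longer ungapped length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    # Columns in which both sequences contribute a residue (no gap).
    overlap = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != "-" and y != "-"]
    matches = sum(1 for x, y in overlap if x == y)

    longest = max(len(aligned_a.replace("-", "")),
                  len(aligned_b.replace("-", "")))
    percent_identity = 100.0 * matches / len(overlap) if overlap else 0.0
    overlap_ratio = len(overlap) / longest if longest else 0.0
    return percent_identity, overlap_ratio


def same_family(aligned_a: str, aligned_b: str,
                min_identity: float = 45.0,   # threshold quoted in the text
                min_overlap: float = 0.8) -> bool:  # assumed cutoff
    """Hypothetical family test: global similarity requires both a high
    identity and an alignment that covers most of the longer sequence."""
    identity, ratio = identity_and_overlap(aligned_a, aligned_b)
    return identity >= min_identity and ratio >= min_overlap


if __name__ == "__main__":
    print(identity_and_overlap("MKT-AYIA", "MKTLAYLA"))
    print(same_family("MKT-AYIA", "MKTLAYLA"))
```

Requiring a minimum overlap ratio as well as a minimum identity is what distinguishes a global (end-to-end) comparison from a local domain hit: a short, highly identical region inside a much longer protein fails the test.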
Bibliography submission and literature mapping
Linking protein data to the literature that describes or characterizes the proteins is crucial for increasing the amount of experimentally verified data and improving the quality of protein annotation. Attributing protein annotations to validated experimental sources provides an effective means of avoiding the propagation of errors that may have resulted from large-scale genome annotation. We have developed a bibliography submission system that lets the scientific community submit, categorize and retrieve literature information for PSD protein entries. The submission interface guides users through mapping the paper citation to given protein entries, entering the literature data, and summarizing the literature data using categories such as genetics, tissue/cellular localization, molecular complex or interaction, function, regulation and disease. Also included is a literature information page that provides literature data mining and displays both the references cited in PIR and those submitted by users.
INTEGRATED PROTEIN CLASSIFICATION DATABASE
The iProClass (integrated Protein Classification) database (2) is designed to provide comprehensive descriptions of all proteins and to serve as a framework for data integration in a distributed networking environment. The database describes family relationships at both global (whole protein) and local (domain, motif, site) levels, as well as structural and functional classifications and features of proteins. The current version (Release 1.0, August 2001) consists of more than 270 000 non-redundant PIR-PSD and SWISS-PROT proteins organized with more than 33 000 PIR superfamilies, 100 000 families, 3400 PIR homology and Pfam domains (3), 1300 ProClass/ProSite motifs (4,5), 280 PIR post-translational modification sites, and links to over 40 databases of protein families, structures, functions, genes, genomes, literature and taxonomy. Protein sequence and superfamily summary reports provide rich annotations such as membership information with length, taxonomy and keyword statistics, extensive cross-references and graphical display of domain and motif regions. Directly linked to the iProClass sequence report are two additional PIR databases, ASDB and RESID (6). PIR-Annotation and Similarity Database (ASDB) lists pre-computed, biweekly updated FASTA neighbors of all PSD sequences with annotation information and graphical displays of sequence similarity matches. PIR-RESID documents over 280 post-translational modifications and links to PSD entries containing either experimentally determined or computationally predicted modifications with evidence tags. Future versions of iProClass and ASDB will be based on the new PIR Non-redundant Reference Protein database (NREF).
PIR-NREF
As a major resource of protein information, we aim to provide a timely and comprehensive collection of all protein sequence data that keeps pace with the genome sequencing projects and contains source attribution and minimal redundancy. The PIR-NREF protein database includes sequences from PIR, SWISS-PROT (7), TrEMBL (7), RefSeq (8), GenPept, PDB (9) and other protein databases. The NREF entries, each representing an identical amino acid sequence from the same source organism redundantly presented in one or more underlying protein databases, can serve as the basic unit for protein annotation. The NCBI taxonomy (http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/) is used as the ontology for matching source organism names at the species or strain (if known) levels. The NREF report provides source attribution (containing protein IDs, accession numbers and protein names from underlying databases), in addition to taxonomy, amino acid sequence and composite literature data. The composite protein names, including synonyms, alternate names and even misspellings, can be used to assist ontology development for protein names and the identification of mis-annotated proteins. Related sequences, including identical sequences from different organisms and closely related sequences within the same organism, are also listed. The database presently consists of about 800 000 entries and is updated biweekly.
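The grouping rule described above (one entry per identical sequence from the same source organism, with source attribution retained) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the PIR-NREF build procedure: the `SourceRecord` type, the `build_nref` function, the output layout and the accession strings are all hypothetical, and organisms are matched by a numeric NCBI taxonomy identifier only.

```python
from collections import defaultdict
from typing import NamedTuple


class SourceRecord(NamedTuple):
    """One protein record as it appears in an underlying database."""
    database: str    # e.g. "PIR-PSD", "SWISS-PROT"
    accession: str   # placeholder accessions are used in the example below
    name: str        # protein name as given by that database
    taxon_id: int    # NCBI taxonomy identifier of the source organism
    sequence: str    # amino acid sequence


def build_nref(records: list[SourceRecord]) -> list[dict]:
    """Collapse records sharing an identical sequence AND the same source
    organism into one non-redundant entry that keeps source attribution
    and the composite set of protein names."""
    groups: dict[tuple[str, int], list[SourceRecord]] = defaultdict(list)
    for rec in records:
        groups[(rec.sequence.upper(), rec.taxon_id)].append(rec)

    entries = []
    for (sequence, taxon_id), members in groups.items():
        entries.append({
            "taxon_id": taxon_id,
            "sequence": sequence,
            "sources": [(m.database, m.accession) for m in members],
            "names": sorted({m.name for m in members}),
        })
    return entries


if __name__ == "__main__":
    demo = [
        SourceRecord("PIR-PSD", "ACC0001", "crystallin", 9606, "MKTAYIAKQR"),
        SourceRecord("SWISS-PROT", "ACC0002", "alpha crystallin", 9606, "MKTAYIAKQR"),
        SourceRecord("TrEMBL", "ACC0003", "crystallin", 10090, "MKTAYIAKQR"),
    ]
    for entry in build_nref(demo):
        print(entry["taxon_id"], entry["sources"], entry["names"])
```

In this toy input the first two records merge into one entry, while the third stays separate because the same sequence comes from a different organism; under the scheme described above it would be reported as a related sequence rather than merged.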
AVAILABILITY
PIR web site
The PIR web site (http://pir.georgetown.edu) (10) connects data mining and sequence analysis tools to underlying databases for exploration of protein information and discovery of new knowledge. The site has been redesigned to include a user-friendly navigation system and more graphical interfaces and analysis tools. The PIR-PSD and iProClass pages represent primary entry points in the PIR web site. A list of the major PIR pages is shown in Table 1.
The PIR-PSD interface provides entry retrieval, batch retrieval, basic or advanced text searches, and various sequence searches. The iProClass interface also includes both sequence and text searches. The BLAST search (11) returns best-matched proteins and superfamilies, while peptide match allows protein identification based on peptide sequences. Text search involves direct search of the underlying Oracle tables using unique identifiers or combinations of text strings. The NREF database is searchable by BLAST search, peptide match and direct report retrieval based on the NREF ID or the entry identifiers of the source databases. Other sequence searches supported on the PIR web site include FASTA (12), pattern matching, hidden Markov model (HMM) (13) domain and motif search, Smith–Waterman (14) pair-wise alignment, CLUSTALW (15) multiple alignment and GeneFIND (16) family identification.
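To make the idea of peptide match concrete, the sketch below identifies proteins by exact occurrence of a query peptide. It assumes that a match means an exact, case-insensitive substring hit; the function name `peptide_match`, the in-memory dictionary standing in for the database and the example identifiers are hypothetical, and the search used on the PIR web site may work differently.

```python
def peptide_match(peptide: str, proteins: dict[str, str]) -> list[tuple[str, int]]:
    """Return (protein ID, 1-based start position) for every exact
    occurrence of `peptide`, including overlapping occurrences."""
    query = peptide.strip().upper()
    if not query:
        raise ValueError("peptide must not be empty")

    hits = []
    for protein_id, sequence in proteins.items():
        target = sequence.upper()
        start = target.find(query)
        while start != -1:
            hits.append((protein_id, start + 1))
            # Resume one residue later so overlapping matches are found.
            start = target.find(query, start + 1)
    return hits


if __name__ == "__main__":
    database = {"EXAMPLE1": "MKTAYIAKQR", "EXAMPLE2": "AYIAAYIA"}
    print(peptide_match("AYIA", database))
```

A linear scan like this is adequate for a small set of sequences; searching a database of several hundred thousand entries interactively would normally rely on a precomputed index.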
PIR FTP site
The PIR anonymous FTP site (ftp://nbrfa.georgetown.edu/pir_databases) provides direct file transfer. The distributed files include the PIR-PSD (quarterly releases and interim updates), PIR-NREF, auxiliary databases, documents and software programs. The PIR-PSD is distributed as flat files in NBRF and CODATA formats, with the corresponding sequences in FASTA format. Both PIR-PSD and PIR-NREF are also distributed in XML format with the associated document type definition (DTD) file.
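For readers working with the flat files, the following Python sketch converts NBRF-style sequence entries to FASTA. It assumes the commonly documented NBRF layout: a header line of the form `>P1;ENTRY_ID`, one description line, then sequence lines terminated by `*`. The function name `nbrf_to_fasta` and the example entry are hypothetical, and the sketch is not one of the software programs distributed on the FTP site.

```python
def nbrf_to_fasta(text: str, width: int = 60) -> str:
    """Convert NBRF-style entries to FASTA, wrapping sequences at `width`."""
    out = []
    lines = iter(text.splitlines())
    for line in lines:
        if not line.startswith(">"):
            continue  # skip anything outside an entry
        # Header: two-letter type code, semicolon, entry identifier.
        _code, _, entry_id = line[1:].partition(";")
        description = next(lines, "").strip()

        residues = []
        for seq_line in lines:  # continues on the same iterator
            chunk = "".join(seq_line.split())  # drop spaces between blocks
            if chunk.endswith("*"):            # "*" terminates the sequence
                residues.append(chunk[:-1])
                break
            residues.append(chunk)
        sequence = "".join(residues)

        out.append(f">{entry_id.strip()} {description}".rstrip())
        out.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(out) + ("\n" if out else "")


if __name__ == "__main__":
    example = (
        ">P1;EXAMPLE1\n"
        "hypothetical example protein\n"
        "  MKTAYIAKQR QISFVKSHFS\n"
        "  RQLEERLGL*\n"
    )
    print(nbrf_to_fasta(example), end="")
```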
The PIR-PSD, iProClass and PIR-NREF databases have been implemented in the Oracle 8i object-relational database system on our Unix server. To enable open source distribution, the databases are being mapped to MySQL and ported to Linux. To establish reciprocal links to PIR databases, to host a PIR mirror web site or to request the PIR database schema, please contact [email protected].