Researchers from East China University of Science and Technology and the Shanghai Center for Bioinformation Technology recently published a bioinformatics paper titled "Feature-matching pattern-based support vector machines for robust peptide mass fingerprinting" in Molecular & Cellular Proteomics (MCP; 2010 SCI impact factor 8.35), a leading international proteomics journal.
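The corresponding author is Professor Siliang Zhang of East China University of Science and Technology. He graduated from the antibiotics manufacturing engineering program at the East China Institute of Chemical Technology and has long worked on microbial reactions and fermentation engineering. His work has led to a series of major breakthroughs in production technology for biopharmaceutical products. He has received the second prize of the National Science and Technology Progress Award three times, along with several provincial and ministerial science and technology progress awards, and has made significant contributions to technical progress in China's biopharmaceutical and related industries. He has published more than 100 papers, over 20 of them indexed by SCI.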
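Peptide mass fingerprinting (PMF) is an important protein identification method in proteomics. Compared with tandem mass spectrometry (tandem MS, MS/MS), it offers high throughput, high specificity for single peptides and low sensitivity to post-translational modifications of proteins. This study aims to improve the accuracy and robustness of PMF algorithms. It divides the protein identification process into three independent but related objects and, based on the specific attributes and key issues of each object, derives a total of 35,640 features. A machine learning method (support vector machines) was trained on a standard dataset of 1733 items. Compared with four existing PMF identification algorithms (Mascot, MS-Fit, ProFound and Aldente), the new algorithm achieved significantly better sensitivity, precision and robustness. A dedicated protein identification website was also built on the basis of the new algorithm. The reviewers considered the study conceptually novel and highly applicable.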
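For readers unfamiliar with PMF, the basic idea can be sketched in a few lines of Python: digest each candidate protein in silico, compute the expected peptide masses, and count how many of them line up with the measured peaks. This is only a minimal illustration of conventional peak counting, not the FMP/SVM algorithm from the paper. The function names, the 50 ppm tolerance, and the toy sequences and peak list are all assumptions made for this example.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # mass of H2O added to every free peptide
PROTON = 1.007276   # mass of the proton for a singly charged ion


def tryptic_digest(sequence, max_missed=0):
    """Cleave after K or R unless the next residue is P.

    Returns (peptide, missed_cleavages) pairs.
    """
    sites = [i + 1 for i, aa in enumerate(sequence[:-1])
             if aa in "KR" and sequence[i + 1] != "P"]
    bounds = [0] + sites + [len(sequence)]
    peptides = []
    for start in range(len(bounds) - 1):
        for missed in range(max_missed + 1):
            end = start + 1 + missed
            if end >= len(bounds):
                break
            peptides.append((sequence[bounds[start]:bounds[end]], missed))
    return peptides


def mh_mass(peptide):
    """Monoisotopic m/z of the singly protonated peptide, [M+H]+."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


def count_matches(peaks, sequence, tol_ppm=50.0, max_missed=0):
    """Count theoretical peptides that have a peak within tol_ppm."""
    matched = 0
    for peptide, _missed in tryptic_digest(sequence, max_missed):
        theo = mh_mass(peptide)
        if any(abs(peak - theo) / theo * 1e6 <= tol_ppm for peak in peaks):
            matched += 1
    return matched


def rank_candidates(peaks, database, **kwargs):
    """Rank database proteins by the number of matched peptides."""
    scores = {name: count_matches(peaks, seq, **kwargs)
              for name, seq in database.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data (hypothetical): two short "proteins" and three measured peaks.
database = {"protA": "AAKGGRPLKMM", "protB": "WWKYYRFFK"}
peaks = [mh_mass("AAK"), mh_mass("GGRPLK") + 0.001, 1500.0]
ranking = rank_candidates(peaks, database)
print(ranking)
```

A plain match count like this ignores peak intensity, peptide uniqueness and mass modifications, which are the kinds of attributes the study turns into features for its support vector machine.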
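The study was funded by the National 973 Program project "Principles and Methods for Scale-up of Biochemical Reaction Processes" (2007CB714303) and by the open project fund of the State Key Laboratory of Bioreactor Engineering. (Bioon.com)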
DOI:10.1074/mcp.M110.005785
Feature-matching pattern-based support vector machines for robust peptide mass fingerprinting
Youyuan Li, Pei Hao, Siliang Zhang and Yixue Li
Peptide mass fingerprinting (PMF), although it has become complementary to tandem mass spectrometry (MS/MS) for protein identification, is still the subject of in-depth study because of its higher sample throughput, higher specificity for single peptides and lower sensitivity to unexpected post-translational modifications. In this study, we propose, implement and evaluate a uniform approach using support vector machines (SVMs) to incorporate individual concepts and conclusions for accurate PMF. We focus on the inherent attributes and critical issues of the theoretical spectrum, the experimental spectrum and spectrum alignment. Eighty-one feature-matching patterns (FMPs), derived from the cleavage type, uniqueness and variable masses of theoretical peptides together with the intensity rank of experimental peaks, were proposed to characterize the matching profile of the PMF procedure. We developed a new strategy to handle shared peak intensities, and 440 parameters were generated to digitize each FMP. High performance on an evaluation dataset of 137 items was finally achieved by the optimal multi-criteria SVM approach, with 491 final features selected from a feature vector of 35,640 normalized features through cross training and validation on a publicly available "gold standard" PMF dataset of 1733 items. Compared with Mascot, MS-Fit, ProFound and Aldente, the FMP algorithm has a greater ability to identify correct proteins, with the highest values for sensitivity (82%), precision (97%) and F1-measure (89%). Several conclusions have been reached via this research. Firstly, inherent attributes showed comparable or even greater robustness than other, explicit attributes; one inherent attribute, peak intensity, should receive considerable attention during protein identification. Secondly, alignment between intense experimental peaks and properly digested, unique or non-modified theoretical peptides is very likely to occur in positive PMFs. Finally, normalization by several types of harmonic factors, including missed cleavages and mass modification, can make important contributions to the performance of the procedure.
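The figures quoted in the abstract can be cross-checked with a little arithmetic. The short Python sketch below assumes that the 35,640-feature vector is simply the 81 FMPs multiplied by the 440 parameters per FMP, and that the F1-measure is the usual harmonic mean of precision and sensitivity. The abstract states neither relationship explicitly, although the numbers are consistent with both. The variable and function names are chosen for this example only.

```python
def f1_measure(precision, sensitivity):
    """Harmonic mean of precision and sensitivity (recall)."""
    return 2 * precision * sensitivity / (precision + sensitivity)


n_patterns = 81          # feature-matching patterns (FMPs)
params_per_pattern = 440  # parameters generated per FMP
total_features = n_patterns * params_per_pattern

selected_features = 491  # features kept by the final SVM model
selected_share = selected_features / total_features

f1 = f1_measure(precision=0.97, sensitivity=0.82)

print(total_features)                    # 35640
print(round(selected_share * 100, 1))    # 1.4 (percent of features kept)
print(round(f1, 2))                      # 0.89
```

Because the reported percentages are rounded, this is a consistency check rather than a reproduction of the paper's evaluation.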