A DISTRIBUTED ALGORITHM FOR PROTEIN IDENTIFICATION FROM TANDEM MASS SPECTROMETRY DATA
Katarzyna ORZECHOWSKA
orzechowska@ire.pw.edu.pl
Warsaw University of Technology, Institute of Radioelectronics and Multimedia Technology (Poland)
Tymon RUBEL
Warsaw University of Technology, Institute of Radioelectronics and Multimedia Technology (Poland)
Robert KURJATA
Warsaw University of Technology, Institute of Radioelectronics and Multimedia Technology (Poland)
Krzysztof ZAREMBA
Warsaw University of Technology, Institute of Radioelectronics and Multimedia Technology (Poland)
Abstract
Tandem mass spectrometry is an analytical technique widely used in proteomics for the high-throughput characterization of proteins in biological samples. Modern in-depth proteomic studies can require collecting millions of mass spectra representing short protein fragments (peptides). To identify the peptides, the measured spectra are most often scored against a database of amino acid sequences of known proteins. Given the volume of input data and the size of proteomic databases, this is a resource-intensive task that requires an efficient and scalable computational strategy. Here, we present SparkMS, an algorithm for peptide and protein identification from mass spectrometry data that is explicitly designed to run in a distributed computational environment. To achieve the required performance and scalability, we use Apache Spark, a modern framework that is becoming increasingly popular not only in "big data" analysis but also in bioinformatics. This paper describes the algorithm in detail and demonstrates its performance on a large proteomic dataset. Experimental results indicate that SparkMS scales with the number of worker nodes and with the increasing complexity of the search task. Furthermore, its protein identification efficiency is comparable to that of X!Tandem, a widely used proteomic search engine.
Keywords:
proteomics, mass spectrometry, distributed computing, Apache Spark
License
All articles published in Applied Computer Science are open-access and distributed under the terms of the Creative Commons Attribution 4.0 International License.