A DISTRIBUTED ALGORITHM FOR PROTEIN IDENTIFICATION FROM TANDEM MASS SPECTROMETRY DATA
Katarzyna ORZECHOWSKA
orzechowska@ire.pw.edu.pl
Warsaw University of Technology, Institute of Radioelectronics and Multimedia Technology (Poland)
Tymon RUBEL
Warsaw University of Technology, Institute of Radioelectronics and Multimedia Technology (Poland)
Robert KURJATA
Warsaw University of Technology, Institute of Radioelectronics and Multimedia Technology (Poland)
Krzysztof ZAREMBA
Warsaw University of Technology, Institute of Radioelectronics and Multimedia Technology (Poland)
Abstract
Tandem mass spectrometry is an analytical technique widely used in proteomics for the high-throughput characterization of proteins in biological samples. Modern in-depth proteomic studies can require collecting millions of mass spectra representing short protein fragments (peptides). To identify the peptides, the measured spectra are most often scored against a database of amino acid sequences of known proteins. Given the volume of input data and the size of proteomic databases, this is a resource-intensive task that requires an efficient and scalable computational strategy. Here, we present SparkMS, an algorithm for peptide and protein identification from mass spectrometry data, explicitly designed to work in a distributed computational environment. To achieve the required performance and scalability, we use Apache Spark, a modern framework that is becoming increasingly popular not only in the field of “big data” analysis but also in bioinformatics. This paper describes the algorithm in detail and demonstrates its performance on a large proteomic dataset. Experimental results indicate that SparkMS scales with the number of worker nodes and with the increasing complexity of the search task. Furthermore, its protein identification efficiency is comparable to that of X!Tandem, a widely used proteomic search engine.
Keywords:
proteomics, mass spectrometry, distributed computing, Apache Spark
License
All articles published in Applied Computer Science are open-access and distributed under the terms of the Creative Commons Attribution 4.0 International License.