A DISTRIBUTED ALGORITHM FOR PROTEIN IDENTIFICATION FROM TANDEM MASS SPECTROMETRY DATA

Katarzyna ORZECHOWSKA

orzechowska@ire.pw.edu.pl
Warsaw University of Technology, Institute of Radioelectronics and Multimedia Technology (Poland)

Tymon RUBEL


Warsaw University of Technology, Institute of Radioelectronics and Multimedia Technology (Poland)

Robert KURJATA


Warsaw University of Technology, Institute of Radioelectronics and Multimedia Technology (Poland)

Krzysztof ZAREMBA


Warsaw University of Technology, Institute of Radioelectronics and Multimedia Technology (Poland)

Abstract

Tandem mass spectrometry is an analytical technique widely used in proteomics for the high-throughput characterization of proteins in biological samples. Modern in-depth proteomic studies can require the collection of millions of mass spectra representing short protein fragments (peptides). To identify the peptides, the measured spectra are most often scored against a database of amino acid sequences of known proteins. Given the volume of input data and the size of proteomic databases, this is a resource-intensive task that requires an efficient and scalable computational strategy. Here, we present SparkMS, an algorithm for peptide and protein identification from mass spectrometry data, explicitly designed to work in a distributed computational environment. To achieve the required performance and scalability, we use Apache Spark, a modern framework that is becoming increasingly popular not only in “big data” analysis but also in bioinformatics. This paper describes the algorithm in detail and demonstrates its performance on a large proteomic dataset. Experimental results indicate that SparkMS scales with the number of worker nodes and with the increasing complexity of the search task. Furthermore, it exhibits a protein identification efficiency comparable to that of X!Tandem, a widely used proteomic search engine.


Keywords:

proteomics, mass spectrometry, distributed computing, Apache Spark

Published
2022-06-30

ORZECHOWSKA, K., RUBEL, T., KURJATA, R., & ZAREMBA, K. (2022). A DISTRIBUTED ALGORITHM FOR PROTEIN IDENTIFICATION FROM TANDEM MASS SPECTROMETRY DATA. Applied Computer Science, 18(2), 16–27. https://doi.org/10.35784/acs-2022-10

License

All articles published in Applied Computer Science are open-access and distributed under the terms of the Creative Commons Attribution 4.0 International License.

