A DISTRIBUTED ALGORITHM FOR PROTEIN IDENTIFICATION FROM TANDEM MASS SPECTROMETRY DATA
Article Sidebar
Open full text
Issue Vol. 18 No. 2 (2022)
-
USE OF SERIOUS GAMES FOR THE ASSESSMENT OF MILD COGNITIVE IMPAIRMENT IN THE ELDERLY
Moon-gee CHOI5-15
-
A DISTRIBUTED ALGORITHM FOR PROTEIN IDENTIFICATION FROM TANDEM MASS SPECTROMETRY DATA
Katarzyna ORZECHOWSKA, Tymon RUBEL, Robert KURJATA, Krzysztof ZAREMBA16-27
-
CONTRAST ENHANCEMENT OF SCANNING ELECTRON MICROSCOPY IMAGES USING A NONCOMPLEX MULTIPHASE ALGORITHM
Zaid ALSAYGH, Zohair AL-AMEEN28-42
-
STABILITY AND FAILURE OF THIN-WALLED COMPOSITE STRUCTURES WITH A SQUARE CROSS-SECTION
Błażej CZAJKA, Patryk RÓŻYŁO, Hubert DĘBSKI43-55
-
TOMATO DISEASE DETECTION MODEL BASED ON DENSENET AND TRANSFER LEARNING
Mahmoud BAKR, Sayed ABDEL-GABER, Mona NASR, Maryam HAZMAN56-70
-
KNEE JOINT OSTEOARTHRITIS DIAGNOSIS BASED ON SELECTED ACOUSTIC SIGNAL DISCRIMINANTS USING MACHINE LEARNING
Robert KARPIŃSKI71-85
-
CYBER SECURITY IN INDUSTRIAL CONTROL SYSTEMS (ICS): A SURVEY OF ROWHAMMER VULNERABILITY
Hakan AYDIN, Ahmet SERTBAŞ86-100
-
APPLICATION OF FINITE DIFFERENCE METHOD FOR MEASUREMENT SIMULATION IN ULTRASOUND TRANSMISSION TOMOGRAPHY
Konrad KANIA, Mariusz MAZUREK, Tomasz RYMARCZYK101-109
Archives
-
Vol. 20 No. 4
2025-01-31 12
-
Vol. 20 No. 3
2024-09-30 12
-
Vol. 20 No. 2
2024-08-14 12
-
Vol. 20 No. 1
2024-03-30 12
-
Vol. 19 No. 4
2023-12-31 10
-
Vol. 19 No. 3
2023-09-30 10
-
Vol. 19 No. 2
2023-06-30 10
-
Vol. 19 No. 1
2023-03-31 10
-
Vol. 18 No. 4
2022-12-30 8
-
Vol. 18 No. 3
2022-09-30 8
-
Vol. 18 No. 2
2022-06-30 8
-
Vol. 18 No. 1
2022-03-30 7
-
Vol. 17 No. 4
2021-12-30 8
-
Vol. 17 No. 3
2021-09-30 8
-
Vol. 17 No. 2
2021-06-30 8
-
Vol. 17 No. 1
2021-03-30 8
-
Vol. 16 No. 4
2020-12-30 8
-
Vol. 16 No. 3
2020-09-30 8
-
Vol. 16 No. 2
2020-06-30 8
-
Vol. 16 No. 1
2020-03-30 8
Main Article Content
DOI
Authors
Abstract
Tandem mass spectrometry is an analytical technique widely used in proteomics for the high-throughput characterization of proteins in biological samples. Modern in-depth proteomic studies require the collection of even millions of mass spectra representing short protein fragments (peptides). In order to identify the peptides, the measured spectra are most often scored against a database of amino acid sequences of known proteins. Due to the volume of input data and the sizes of proteomic databases, this is a resource-intensive task, which requires an efficient and scalable computational strategy. Here, we present SparkMS, an algorithm for peptide and protein identification from mass spectrometry data explicitly designed to work in a distributed computational environment. To achieve the required performance and scalability, we use Apache Spark, a modern framework that is becoming increasingly popular not only in the field of “big data” analysis but also in bioinformatics. This paper describes the algorithm in detail and demonstrates its performance on a large proteomic dataset. Experimental results indicate that SparkMS scales with the number of worker nodes and the increasing complexity of the search task. Furthermore, it exhibits a protein identification efficiency comparable to X!Tandem, a widely-used proteomic search engine.
Keywords:
References
Aebersold, R., & Mann, M. (2003). Mass spectrometry-based proteomics. Nature, 422(6928), 198–207. https://doi.org/10.1038/nature01511 DOI: https://doi.org/10.1038/nature01511
Bjornson, R. D., Carriero, N. J., Colangelo, C., Shifman, M., Cheung, K. H., Miller, P. L., & Williams, K. (2008). X!!Tandem, an improved method for running X!tandem in parallel on collections of commodity computers. Journal of proteome research, 7(1), 293–299. https://doi.org/10.1021/pr0701198 DOI: https://doi.org/10.1021/pr0701198
Cox, J., Neuhauser, N., Michalski, A., Scheltema, R. A., Olsen, J. V., & Mann, M. (2011). Andromeda: a peptide search engine integrated into the MaxQuant environment. Journal of proteome research, 10(4), 1794–1805. https://doi.org/10.1021/pr101065j DOI: https://doi.org/10.1021/pr101065j
Craig, R., & Beavis, R. C. (2004). TANDEM: matching proteins with tandem mass spectra. Bioinformatics (Oxford, England), 20(9), 1466–1467. https://doi.org/10.1093/bioinformatics/bth092 DOI: https://doi.org/10.1093/bioinformatics/bth092
Creasy, D. M., & Cottrell, J. S. (2004). Unimod: Protein modifications for mass spectrometry. Proteomics, 4(6), 1534–1536. https://doi.org/10.1002/pmic.200300744 DOI: https://doi.org/10.1002/pmic.200300744
Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1), 107–113. https://doi.org/10.1145/1327452.1327492 DOI: https://doi.org/10.1145/1327452.1327492
Duncan, D. T., Craig, R., & Link, A. J. (2005). Parallel tandem: a program for parallel processing of tandem mass spectra using PVM or MPI and X!Tandem. Journal of proteome research, 4(5), 1842–1847. https://doi.org/10.1021/pr050058i DOI: https://doi.org/10.1021/pr050058i
Guo, R., Zhao, Y., Zou, Q., Fang, X., & Peng, S. (2018). Bioinformatics applications on Apache Spark. GigaScience, 7(8), giy098. https://doi.org/10.1093/gigascience/giy098 DOI: https://doi.org/10.1093/gigascience/giy098
Hernandez, P., Müller, M., & Appel, R. D. (2006). Automated protein identification by tandem mass spectrometry: issues and strategies. Mass spectrometry reviews, 25(2), 235–254. https://doi.org/10.1002/mas.20068 DOI: https://doi.org/10.1002/mas.20068
Horlacher, O., Lisacek, F., & Müller, M. (2016). Mining Large Scale Tandem Mass Spectrometry Data for Protein Modifications Using Spectral Libraries. Journal of proteome research, 15(3), 721–731. https://doi.org/10.1021/acs.jproteome.5b00877 DOI: https://doi.org/10.1021/acs.jproteome.5b00877
Käll, L., Storey, J. D., MacCoss, M. J., & Noble, W. S. (2008). Assigning significance to peptides identified by tandem mass spectrometry using decoy databases. Journal of proteome research, 7(1), 29–34. https://doi.org/10.1021/pr700600n DOI: https://doi.org/10.1021/pr700600n
Kim, S., & Pevzner, P. A. (2014). MS-GF+ makes progress towards a universal database search tool for proteomics. Nature communications, 5, 5277. https://doi.org/10.1038/ncomms6277 DOI: https://doi.org/10.1038/ncomms6277
Lewis, S., Csordas, A., Killcoyne, S., Hermjakob, H., Hoopmann, M. R., Moritz, R. L., Deutsch, E. W., & Boyle, J. (2012). Hydra: a scalable proteomic search engine which utilizes the Hadoop distributed computing framework. BMC bioinformatics, 13, 324. https://doi.org/10.1186/1471-2105-13-324 DOI: https://doi.org/10.1186/1471-2105-13-324
Milloy, J. A., Faherty, B. K., & Gerber, S. A. (2012). Tempest: GPU-CPU computing for high-throughput database spectral matching. Journal of proteome research, 11(7), 3581–3591. https://doi.org/10.1021/pr300338p DOI: https://doi.org/10.1021/pr300338p
Orzechowska, K., & Rubel, T. (2021). An SVM-based peptide identification algorithm integrated into a database search engine. Proceedings of the XXII Polish Conference on Biocybernetics and Biomedical Engineering.
Paulo, J. A. (2013). Practical and Efficient Searching in Proteomics: A Cross Engine Comparison. WebmedCentral, 4(10), WMCPLS0052. https://doi.org/10.9754/journal.wplus.2013.0052 DOI: https://doi.org/10.9754/journal.wplus.2013.0052
Paziewska, A., Polkowski, M., Rubel, T., Karczmarski, J., Wiechowska-Kozlowska, A., Dabrowska, M., Mikula, M., Dadlez, M., & Ostrowski, J. (2018). Mass Spectrometry-Based Comprehensive Analysis of Pancreatic Cyst Fluids. BioMed research international, 2018, 7169595. https://doi.org/10.1155/2018/7169595 DOI: https://doi.org/10.1155/2018/7169595
Perkins, D. N., Pappin, D. J., Creasy, D. M., & Cottrell, J. S. (1999). Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis, 20(18), 3551–3567. https://doi.org/10.1002/(SICI)1522-2683(19991201)20:18<3551::AID-ELPS3551>3.0.CO;2-2 DOI: https://doi.org/10.1002/(SICI)1522-2683(19991201)20:18<3551::AID-ELPS3551>3.0.CO;2-2
Pratt, B., Howbert, J. J., Tasman, N. I., & Nilsson, E. J. (2012). MR-Tandem: parallel X!Tandem using Hadoop MapReduce on Amazon Web Services. Bioinformatics (Oxford, England), 28(1), 136–137. https://doi.org/10.1093/bioinformatics/btr615 DOI: https://doi.org/10.1093/bioinformatics/btr615
Rappsilber, J. (2011). The beginning of a beautiful friendship: Cross-linking/mass spectrometry and modelling of proteins and multi-protein complexes. Journal of Structural Biology, 173(3), 530–540. https://doi.org/10.1016/j.jsb.2010.10.014 DOI: https://doi.org/10.1016/j.jsb.2010.10.014
Sadygov, R. G., Cociorva, D., & Yates, J. R., 3rd (2004). Large-scale database searching using tandem mass spectra: looking up the answer in the back of the book. Nature methods, 1(3), 195–202. https://doi.org/10.1038/nmeth725 DOI: https://doi.org/10.1038/nmeth725
Taus, T., Köcher, T., Pichler, P., Paschke, C., Schmidt, A., Henrich, C., & Mechtler, K. (2011). Universal and confident phosphorylation site localization using phosphoRS. Journal of proteome research, 10(12), 5354–5362. https://doi.org/10.1021/pr200611n DOI: https://doi.org/10.1021/pr200611n
UniProt Consortium. (2019). UniProt: a worldwide hub of protein knowledge. Nucleic acids research, 47(D1), D506–D515. https://doi.org/10.1093/nar/gky1049 DOI: https://doi.org/10.1093/nar/gky1049
Vizcaíno, J. A., Csordas, A., Del-Toro, N., Dianes, J. A., Griss, J., Lavidas, I., Mayer, G., Perez-Riverol, Y., Reisinger, F., Ternent, T., Xu, Q. W., Wang, R., & Hermjakob, H. (2016). 2016 update of the PRIDE database and its related tools. Nucleic acids research, 44(22), 11033. https://doi.org/10.1093/nar/gkw880 DOI: https://doi.org/10.1093/nar/gkw880
Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., & Stoica, I. (2010). Spark: Cluster Computing with Working Sets. Proceedings of the 2nd USENIX conference on Hot topics in cloud computing (HotCloud'10). USENIX Association.
Article Details
Abstract views: 342
License
All articles published in Applied Computer Science are open-access and distributed under the terms of the Creative Commons Attribution 4.0 International License.
