MR-MSPOLYGRAPH: A MapReduce Implementation of a Hybrid Spectral Library-Database Search Method for Large-scale Peptide Identification



MR-MSPolygraph, a MapReduce-based implementation for parallelizing peptide identification from mass spectrometry data, is presented. The underlying serial method, MSPolygraph, uses a novel hybrid approach that matches an experimental spectrum against a combination of a protein sequence database and a spectral library. The MapReduce implementation can run on any Hadoop cluster environment. Experimental results demonstrate that, relative to the serial version, MR-MSPolygraph reduces the time to solution from weeks to hours when processing tens of thousands of experimental spectra. Speedup and other related performance studies are also reported on a 400-core Hadoop cluster, using spectral data sets from environmental microbial communities as inputs.


Downloads & Instructions

If you want to download the SEQUENTIAL version of the software, then please visit


Data | Download (size) | Description
Source code for MR-MSPolygraph | .zip | Source code for MR-MSPolygraph.
 | | A comprehensive microbial database containing 2.65 million protein sequences.
Spectral library | | Contains a spectral library for S. oneidensis MR-1 (1,752 peptides) along with its parameter files.
Experimental spectra | .zip | Contains 1,000 experimental spectra derived from Synechococcus sp. PCC 7002.
Parameter files | .zip | Contains other parameter files (*.frq, *.dat) that are required at runtime.
Run file | | The file supplied as the input argument to polygraph. It must be present in the working directory from which polygraph is run. Edit this file before use.
Spectral index file | index_1k.dat | Lists the paths to all experimental spectral *.dta files that need to be matched. Edit this file before use.
Hadoop run script | script | A shell script showing an example hadoop command that can be used to run mspolygraph_mr. Edit this file before use.
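
Because the spectral index file has to be edited to point at your own *.dta files, it can be convenient to generate it rather than write it by hand. The following is a minimal sketch, not part of the MR-MSPolygraph distribution. It assumes the index file holds one path per line, which is a guess about the format; compare against the supplied index_1k.dat before relying on it. The function name write_spectral_index and its arguments are hypothetical.

```python
from pathlib import Path


def write_spectral_index(spectra_dir, index_path):
    """Write the absolute path of every *.dta file found under spectra_dir
    to index_path, one path per line (ASSUMED format, see the note above).

    Returns the number of paths written.
    """
    # rglob searches spectra_dir and all of its subdirectories.
    paths = sorted(str(p.resolve()) for p in Path(spectra_dir).rglob("*.dta"))
    Path(index_path).write_text("".join(p + "\n" for p in paths))
    return len(paths)
```

For example, write_spectral_index("spectra", "my_index.dat") would list every .dta file below a local spectra directory; both names here are placeholders.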

    Download verification: md5sum checksums for all files above
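
On a system without the md5sum utility, the same verification can be done with Python's standard hashlib module. This is a hedged sketch that assumes the checksum listing uses the usual md5sum output layout (a hexadecimal digest, whitespace, then the file name on each line); the function names md5_of and verify_checksums are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(checksum_file, base_dir="."):
    """Check each file named in an md5sum-style listing (ASSUMED layout).

    Returns a dict mapping file name to True (digest matches) or
    False (file missing or digest differs).
    """
    results = {}
    for line in Path(checksum_file).read_text().splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        expected, name = parts
        # md5sum marks binary-mode entries with a leading '*'.
        name = name.strip().lstrip("*")
        target = Path(base_dir) / name
        results[name] = target.is_file() and md5_of(target) == expected.lower()
    return results
```

Any entry reported as False should be downloaded again before use.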

  (Currently, our web server supports compressed file downloads only in .zip format. To extract the files in a Linux console, use the unzip command.)


Instructions:    readme.txt



Please cite the following two source papers for this work:


1) A. Kalyanaraman, W.R. Cannon, B. Latt, D.J. Baxter (2011). MapReduce implementation of a hybrid spectral library-database search method for large-scale peptide identification. Bioinformatics, In press, doi: 10.1093/bioinformatics/btr523.  Preprint

2) W.R. Cannon, M. Rawlins, D.J. Baxter, S.J. Callister, M.S. Lipton, D.A. Bryant (2011). Large improvements in MS/MS based peptide identification rates using a hybrid analysis. Journal of Proteome Research, 10(5):2306-2317.



A. Kalyanaraman:     < a n a n t h @ e e c s . w s u . e d u >

                                School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA 99164-2752.

W.R. Cannon:            < w i l l i a m . c a n n o n @ p n n l . g o v >

                                Computational Biology and Bioinformatics Group, Pacific Northwest National Laboratory, Richland WA 99352



        Funding from NSF IIS 0916463.