Thursday, September 27, 2007

Mad plotz

A recent submission to a journal has caused us a few headaches over the past few weeks, as the editors sent the paper back to us stating that we did not meet the minimal reporting requirements for the type of experiment that was performed. Which is poetic justice in a way, since for the past few years I have been promoting the use of minimal reporting requirements and standard data formats.

I think this particular journal, however, has gone a bit too far in asking for annotated spectra for every identification in the result set. For most low-throughput experiments this is not such a big deal, but we had thousands of identifications and even wrote an algorithm to automatically assign a quality score to each of them, so that this kind of manual validation of spectra should not be necessary.

But I digress. Instead of fighting it, we decided to give the editors what they want: annotated spectra for every hit. It turns out that this is not such a trivial thing to do. Even gathering all of the data was a tough job, since the experiment was performed many years ago on instrumentation that is nearing its end of life. A lot of file parsing and data reorganization had to be done before any development work to produce the images the journal wanted could even begin.

A bit of background and some numbers will help convey the scale of the task we undertook. The experiment was a proteomics profile of two developmental stages in zebrafish. We used two methodologies, 2D gels and LCMS, to fractionate the samples and ran them through mass spectrometers. The 2D gels gave fewer identifications than the LCMS, but it was still a lot of data. For instance, these results alone contained 30,000 peptide identifications! That reduces to about 2,000 proteins for which the journal has asked for annotated spectra. Needless to say, the brute force method of taking screenshots of each spectrum from the instrument software would not work.

I wrote a few scripts and libraries to parse the raw data and the final results table and come up with the figure above. It brings mzXML, Excel, and MGF files together with Ruby, C libraries, and R to produce the nice picture you see, but it took me two weeks to work out the specifics. How on earth could a regular bencher do this?
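To give a flavour of what that glue code looks like, here is a minimal sketch in Ruby of one piece of such a pipeline: reading peak lists out of an MGF file, matching spectra against a results table, and writing out a small R script per hit that draws the spectrum as a stick plot titled with the matched peptide. The file names, the column layout of the results table, and the output directory are illustrative assumptions, not the actual files from our experiment.

require 'csv'
require 'fileutils'

mgf_file     = ARGV[0] || 'run01.mgf'              # assumed MGF peak list
results_file = ARGV[1] || 'identifications.csv'    # assumed columns: spectrum_title,peptide
out_dir      = 'annotated_spectra'

# Map spectrum titles to peptide sequences from the (assumed) results table.
peptide_for = {}
CSV.foreach(results_file, headers: true) do |row|
  peptide_for[row['spectrum_title']] = row['peptide']
end

FileUtils.mkdir_p(out_dir)

# Walk the MGF file one BEGIN IONS / END IONS block at a time.
title, peaks = nil, []
File.foreach(mgf_file) do |line|
  line = line.strip
  case line
  when 'BEGIN IONS'
    title, peaks = nil, []
  when /\ATITLE=(.+)/
    title = Regexp.last_match(1)
  when 'END IONS'
    peptide = peptide_for[title]
    next unless peptide   # only plot spectra that made it into the result set
    safe = title.gsub(/\W+/, '_')
    # Emit a small R script that renders the peak list as a stick spectrum.
    File.open(File.join(out_dir, "#{safe}.R"), 'w') do |r|
      r.puts "mz  <- c(#{peaks.map { |p| p[0] }.join(', ')})"
      r.puts "int <- c(#{peaks.map { |p| p[1] }.join(', ')})"
      r.puts "png('#{safe}.png', width = 800, height = 500)"
      r.puts "plot(mz, int, type = 'h', xlab = 'm/z', ylab = 'intensity', main = '#{peptide}')"
      r.puts "dev.off()"
    end
  when /\A\d/
    # Peak lines: "m/z intensity". Other headers (PEPMASS, CHARGE, ...) are ignored.
    mz, intensity = line.split
    peaks << [mz.to_f, intensity.to_f]
  end
end

Run against a real MGF file and results table, every identified spectrum ends up as a little R script that can be rendered with Rscript in a loop. The real pipeline also had to pull precursor and fragment annotations out of the mzXML files and the 2D gel results, which is exactly the part that ate most of those two weeks.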

I think the journal is in for a rude awakening once the backlash of angry rebuttals from paper submitters starts flowing in. I would also like to see their reaction to the gigantic pile of spectra we are about to send them.