def science { }: bioinformatics

Showing posts with label bioinformatics. Show all posts

Tuesday, October 7, 2008

Heuristically Beneficial Artifacts

I've recently embarked on a study to see what effect clustering mass spectra has on the design of MRM experiments. Specifically one clusters spectra from the shotgun proteomics discovery phase of the experiment to see how it affects the selection of peptide species and the transitions that a mass spec looks for during the second targeted sequencing phase of the experiment.

While the work is ongoing, a very interesting (at least to me) result has come out of plotting the unfiltered SEQUEST results from the non-clustered and clustered version of the mass spec data. The say a picture is worth a thousand words, so:

The chart is showing the XCORR values of the clustered spectra plotted against the XCORR values for their respective members, independent of what peptide was identified for each. The color coding represent scores for peptides that are part of the decoy database, where:

blue = a decoy hit was scored in the clustered spectra, but not the original
green = a decoy hit was scored in the original spectra, but not the clustered spectra
red = both spectra scored a decoy peptide

What you'll note is that spectra that all clustered score a different set of peptides higher than do the original spectra, and vice versa. At least for this data set, which is SILAC labeled, we note that each method is identifying a different population of peptides. This goes against all current publications, so I have to be careful in the interpretation of these results and will need a significant amount of validation experiments, but this is exciting stuff!

Particularly, it is going to be very interesting to see whether clustering actually helps or hinders ion selection when designing MRM experiments. If it does help ion selection, I have already come up with the catchy name, Heuristically Beneficial Artifacts^TM. If not, well, we will at least have looked at the real-world effects of a methodology when applied to a new type of experimental goal.
More on this as it develops.

Tuesday, December 11, 2007

Ask not for what you think you need...

I have been reflecting a lot recently on why researchers think they need grand solutions for relatively small problems. Specifically there is a perception among benchers that are conducting large-ish experiments that they need some sort of LIMS to manage their data. Frankly as I've stated many times (at least in conversations with other IT folk, and even with benchers) most researchers don't want a LIMS. What they really want is fancy file storage.

Yet, the town folk keep insisting "we need LIMS, please give...". LIMS are probably the only thing that potentially fits the bill for experimental data management, so that's what benchers ask for, when most likely a digital asset management application would suffice. Heck, even all those bit torrent sites can potentially do the job that researchers need.

If you are a coder that is continually asked to provide a LIMS to researchers, or if you are a bencher that is interested in LIMS for data management, here are a set of questions you can ask before going any further down the rough and tumbly road that is LIMS adoption:

Are you in a regulated environment?
Are you willing to mandate use of the LIMS?
Do you have adequate personnel to support the LIMS locally (e.g. do you have a dedicated person that will actively promote the use of the LIMS, train folks, configure the system, do extensive follow-up, etc. Vendor support will only be of help at the start of the adoption process.)
Do you have a lot of spare cash? (Think 6 figures to buy an initial bank of licenses)
Will you have a lot of spare cash for the next 3-5 years? (Think 5 figures to keep annual support and maintenance up to date.)

If you answered "No" to any of these questions, seriously reconsider buying what is traditionally known as a LIMS. Instrument control, automatic data acquisition, yada, yada, all those marketing features used to sell a LIMS don't mean squat if no one uses it in the first place.

Saturday, September 29, 2007

Signs from above

So my earlier idea of using an iPhone or iPod touch as the gateway into an electronic lab notebook for tracking protocols and doing real-time data entry has been given some conflicting signals from the powers that be. One of those powers (my boss) answered my long (but not too long) email proposal with a one word reply. What was the word you ask? ... "cute" ... No signature even.

Another higher power (the internet) quickly followed that crushing monosyllabic bitch-slap with an uplifting rumour from AppleInsider that Apple is developing a PDA device based on the iPhone & Touch UI, but with a bigger screen. Bigger screen (woohoo!) and the potential to install non-Apple third party apps (highly unlikely given Apple's recent activities) would be an ideal device to use as a lab e-notebook tablet.

Oh, whom do I listen to??? Should I drop the project completely? Or should I table the project until the a new, (relatively) low-cost, and developer-friendly device pops into existence?

In the end, I decided to listen to my inner voice, you know the impatient one, and just bought myself a touch to play around with for the time being. If this mythical iPAD ever sees the light of day, I'll raise the issue again with the G-man and get him to foot the bill for what will probably be a more expensive device. Either way I win ;) Unless of course he calls my bluff ...

Thursday, September 27, 2007

Mad plotz

A recent submission to a journal has caused us a few headaches over the past few weeks, as the editors sent the paper back to us stating that we did not meet the minimal reporting requirements for the type of experiment that was performed. Which is poetic justice in a way, since for the past few years I have been promoting the use of minimal reporting requirements and standard data formats.

I think this particular journal, however, has gone a bit too far in asking for annotated spectra for every identification in the result set. For most low-throughput experiments this is not such a big deal, but we had thousands of identifications and even wrote an algorithm to automatically assign a quality score to those identifications so that such manual validation of spectra should not be necessary.

But I digress. Instead of fighting it, we decided to give the editors what they want, annotated spectra for every hit. It turns out that this is not such a trivial thing to do. Even gathering all of the data was a tough job, since the experiment was performed many years ago on instrumentation that is nearing its end of life. A lot of file parsing and data reorganization had to be done, prior to any development effort to produce the images that the journal wanted.

A bit of background and some numbers will help us understand the enormous task we undertook. The experiment was a proteomics profile of two developmental stages in zebra fish. We used two methodologies, 2D gels and LCMS, to fractionate the samples and ran them through mass spectrometers. The 2D gels gave fewer identifications than the LCMS, but it was still a lot of data. For instance, just these results contaied of 30,000 peptide identifications! You can reduce that to about 2,000 proteins that the journal has asked for annotated spectra. Needless to say, the brute force method of taking screen shots of each spectra from the program would not work.

I wrote a few scripts and libraries to parse the raw data and the final result table to come up with the above figure. This is bringing mzXML, excel, and MGF files together with Ruby, C libraries, and the R statistical tool to produce the nice picture you see, but it took me two weeks to figure out the specifics. How on earth could a regular bencher do this?

I think the journal is in for a rude awakening once the backlash of angry rebuttals from paper submitters start flowing in. I would also like to see their reaction to the gigantic pile of spectra we are about to send them.

Monday, August 20, 2007

The SNP per Gene count

IN my last post I related an example where a scientist came to me to parse a file for the number of SNPs per gene in an excel file. The simplest solution would be to use a hash keyed on the gene symbol and the value tracks the number of times you have seen a particular gene symbol. Here is the program:


require 'rubygems'
require 'fastercsv'

genecount= Hash.new()
FasterCSV.foreach(ARGV[0], :headers => true) do |row|
# headers => id,snp_id,genome_build,chromosome,coordinate,gene_symbol,priority,snp_per_gene
if (genecount[row["gene_symbol"]])
  genecount[row["gene_symbol"]] += 1
else
  genecount[row["gene_symbol"]] = 1
end
end


output = File.open("#{ARGV[0]}.rev.csv", "w")
output.puts("gene_symbol,snp_count")
genecount.each_pair do |g,c|
output.puts "\"#{g}\",#{c}"
end
output.close

Friday, August 17, 2007

Hacks Before Code

I often find that when you are trying to solve two problems at once, you do a poor job of both. Case in point, someone just came into my office asking how they would go about getting the number of SNPs per gene from some excel file they have. I start to explain set theory and databases and you could see visible signs of mental shutdown ensue (the slacking jaw, the glazed eyes). Trying a different tactic and showing them a script as I wrote it to create a hash keyed by gene and the value being the count of SNPs from the file gave equal results.

So I am trying out Something New. I am going to push that researchers learn to program in a context that is completely separate from science, and is hopefully fun enough that they stick with it for more than a few days. Enter Hackety.org, a project spearheaded by _why the lucky stiff that seeks to (insert Fake Steve Jobs "voice") re-instill the child like wonder back into learning how to program.

With HacketyHack, I hope that researchers are motivated to learn aspects of programming in an entertaining environment before they have to do any real work, which of course will suck some of the fun out of the activity.

I'll be putting together lessons to augment the existing 7 exercises of HacketyHack in the coming months with real but simple bioinformatics tutorials. So download that hack-box and get coding folks!

Tut 1: Rename a set of files

Today I had a researcher come to me asking if I can write a script to rename a set of result files following some convention. This article will cover that bit of coding, but first some background:

1) I use Ruby, and Ruby on Rails, for my day-to-day operations. While there are some rough edges in Ruby's library support, it get's most things done efficiently, and of course you can't get much better than RoR for web apps. So any code in this blog will usually be Ruby code.

2) We have a commercial Laboratory Information Management System (LIMS) that creates identifiers for experiments, samples, and result files. The twist here is that most (3/4) of the experiments have already been accomplished before introduction of the LIMS. So while the LIMS is capable of outputing queue files for the instruments to name the files according to LIMS' convention, this does not apply here and we must retrofit the LIMS IDs into the existing result files.

Why is this important at all? Well, the LIMS can automatically assign the result file to the annotated experiment in the system on file upload if the result file has the correct identifier in the name. While you could do this manually, you would not want to do this for the 1000 result files that were/are going to be produced. See first post on time wasting by researchers that don't know how to code. At least this one is smart enough to know there is a better way.

The good news is that as long as the filename contains the LIMS ID, it does not matter what the rest of the name is, so we only have to figure out a way to relate the existing filename to proposed LIMS ID. This turns out to be easier than expected since they both have a sequential number that corresponds to the source sample in them.

E.g. :
existing file name = 07Aug05_SF_ASA_583.RAW
LIMS ID = APA1742A583MS3
proposed rename = APA1742A583MS3_07Aug05_SF_ASA_583.RAW

Thus a simple regular expression can pull out the proper sample number from the result filename and LIMS ID and do the renaming. Without further adieu, the script:


#!/usr/bin/env ruby
require 'rubygems'
require 'fastercsv'

# output a useage message if no inputs are given
unless ARGV[0]
puts "Need input queue file and directory of RAW files"
puts "USAGE:"
puts "ruby rename.rb INPUT_QUEUE_FILE INPUT_DIR"
exit(0)
end

#define a LIMS ID lookup hash keyed by the sample number
lims_ids = {}

# use FasterCSV to parse the LIMS instrument queue file for the LIMS IDs
# We need the third column for the filename (remember that arrays start with zero ;)
FasterCSV.foreach(ARGV[0]) do |row|
if (row[2] =~ /(\d+)MS3\-/)
  k = $1
  row[2] =~ /^(\S+MS3)\-/
  lims_ids[k] = $1
end
end

# change to the directory with all of the result files
# and read the files that have a "RAW" extension
Dir.chdir(ARGV[1])
raw_files = Dir.glob("*.{RAW,raw}")

# go through the set of files and rename them
raw_files.each do |f|
puts f
f =~ /(\d+)\.RAW$/i
puts $1
if (lims_ids[$1])
  system("mv #{f} #{lims_ids[$1]}_#{f}")
end
end