def science { }: August 2007

Thursday, August 30, 2007

Now with comments...

OK, OK, so before I said I didn't believe in comments ... but I guess they have a time and place. Like a blog that gives code examples. I have to admit that sometimes it is nice to see some point clarified by the blogger to some reader's question, but for the most part, I still think they are not so useful.

As a compromise, I decided to enable comments on posts, but you'll have to verify that you are indeed a real-live person with a captcha for each post.

Also, if you do have a blog, or have something more substantial to add, I'd prefer you post a comment on your own blog and use blogger.com's "link to this post" back-link functionality. I think this would make for a better conversation and also increase Google scores ;)

Tuesday, August 28, 2007

Gbarcode using GD script

BTW, here is the script I used to create the barcode in the previous post using the gbarcode and gd2 gems:


require 'rubygems'
require 'gd2'
require 'gbarcode'

include Gbarcode
include GD2

b = barcode_create("TEST1234567890")
barcode_encode(b,BARCODE_128)

w = 20
h = 100
x = 10

y1 = 10
y2 = h - 20

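# run widths from the encoded barcode, one character per run
bars = b.partial.split(//).map {|e| e.to_i}
# grow the image width from the initial 20 by the total of the run widths
bars.each {|e| w += e}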

i = Image::IndexedColor.new(w,h)
i.palette << ...
c = Canvas.new(i)
color = Color::BLACK
font = Font::Small
f = File.open(...
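The drawing step boils down to walking the run widths and working out where each black bar starts. Here is a rough sketch of that idea in plain Ruby, with no gems. It assumes the width string holds only digits and alternates space, bar, space, ... starting with a space; that is my assumption about the format, not something the script above confirms. The bar_rects helper and the sample width string are made up for illustration.

```ruby
# Turn a string of run widths into [x, width] pairs for the black bars.
# Assumption: runs alternate space, bar, space, ... starting with a space,
# and every character is a digit.
def bar_rects(widths, x_start = 10)
  x = x_start
  rects = []
  widths.each_char.with_index do |ch, idx|
    w = ch.to_i
    # odd positions are bars; even positions are spaces and only move x along
    rects << [x, w] if idx.odd? && w > 0
    x += w
  end
  rects
end

bar_rects("0212222").each do |x, w|
  puts "bar at x=#{x}, width #{w}"
end
```

Each pair could then be handed to whatever image library draws the filled rectangles.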

For the public good

For a while now, I have been working on designing data standards for the research community. Oftentimes, the standards process is not so much based on efficient and useful design, but more on compromises between a large and diverse set of users. So far this process has led to complex standards that I have to take some partial credit for. Frankly I would rather not, but a publication record is a must for any sort of success at academic institutions.

But something good did come out of my dissatisfaction with the public contributions I have made thus far. I was motivated to contribute something to the open source community that was completely unrelated to data standards: a barcode creation library (a gem) for Ruby, Gbarcode. Several items helped in deciding that this would be a good project:

I needed to create barcodes for a project ;)
the existing ruby barcode gem only produced Code 39
the images that it produced were not readable by my scanner
the project was dead (last release was in July 2005)

I looked around and the current open source projects were GNU barcode (C) and Barbecue (Java). I decided to try my hand at SWIG wrapping the GNU barcode C API. Long story short, SWIG is not the most intuitive tool, but I was able to make some strides in creating the interface file and passing in Ruby strings to create the barcodes with.

One major hurdle of the project was creating an MS Windows-compatible gem. Gems are notorious for not supporting Windows. On Unix, Linux and Mac OS X, gems usually install just fine, since they are compiled on install. On Windows, it is not at all straightforward to pre-compile the parts of the library written in C. Since I wanted this to be useful for the widest audience, I looked around for other gems that did support Windows and found that Hpricot has a nice rake task and environment for compiling and packaging the gem for Windows. Thanks to _why I was able to make this work, with a bit of leg work. I wouldn't recommend going to the SVN repo to look at what I did, since it is in an unstable state. Just go to _why's site and look at what he did.

One criticism I have with Gbarcode is that the only supported output format is PostScript. In order to use it for web sites (specifically RoR), you would have to run the output through ImageMagick, or some other image processing software. Much to my pleasant surprise, I looked the other day on Google, and several sites have covered how to do this. Just search for ruby and barcode and it should come right up.

The drawback to the approaches listed in those how-to's is that RMagick (and ImageMagick) are memory hogs. Since people are actually starting to use Gbarcode, I have started thinking about re-coding it to make it more Ruby-ish (currently, since it is a binding of the C lib, it uses C-style method calls) and to use Cairo as the image producing library. I tried GD, but the barcodes come out less than optimal:

I don't know why the bottom part of the barcode merges bars. Maybe it happens on the way to encoding the PNG, but nothing I tried fixes this. My hope is that with the Cairo integration, this artifact goes away.

Tuesday, August 21, 2007

A word about comments on blogs...

Bencher #1 also asked why I didn't allow comments on the blog. There is a good reason for this, see here and here.

UPDATE: Plus he can always come down the hall to complain ;)

UPDATE (8.23.07): For the impatient, I summarize:

A blog should be about one voice; it is not a debate forum.
If you have something to say, start your own blog

Back two steps...

So bencher #1 that changed the regex in the file rename program to fit his needs says I come off a bit arrogant on this blog. Yeah, I sorta have to agree, and he makes some other good points:

his role is to do research and write papers
he brings in the grant money
learned to program since the compy86 came out

Fair enough. But learning a newer and more applicable language never hurts and could save time when I am not available.

UPDATE: I did change the wording a bit on the last post. Happy?

Progress!

It turns out that this blog is not a waste of time ;)

From my previous post, the bencher that requested a script to rename files actually modified it to fit his needs. This is great news and was unexpected.

He downloaded and installed Ruby on his computer, ran the script himself, and found that some filenames did not match a LIMS ID in the input file. He knew these files did have a LIMS ID, so he started investigating the code and found the comments that identified where regular expressions were matching LIMS IDs to file names.

So what if he got stuck on a few files and had to ask for some regular expression help, at least he was trying and that is A Good Thing.

Monday, August 20, 2007

Tag Line Woes

So the tag line for this blog has changed a few times already. Call me fickle, but I have not found the "perfect" one yet, as each has had its ups and downs:

Where agile methods and bioinformatics collide
Where agile methods and bioinformatics meet
For researchers and developers alike

And now the current one: "For researchers (learn to code) and developers (learn to speak) alike". It doesn't exactly roll off the tongue. Or get my point across. Anywho, I am sure this won't be the last iteration.

BTW, the title also went through some changes on the first day, since "Agile Science" and a few other related titles were already taken in the blogosphere.

The SNP per Gene count

In my last post I related an example where a scientist came to me to parse an Excel file for the number of SNPs per gene. The simplest solution is a hash keyed on the gene symbol, whose value tracks the number of times you have seen that gene symbol. Here is the program:


require 'rubygems'
require 'fastercsv'

# SNP rows seen per gene symbol; unseen symbols start at 0
genecount = Hash.new(0)
FasterCSV.foreach(ARGV[0], :headers => true) do |row|
  # headers => id,snp_id,genome_build,chromosome,coordinate,gene_symbol,priority,snp_per_gene
  genecount[row["gene_symbol"]] += 1
end

# write the counts to <input file>.rev.csv; the block form closes the file for us
File.open("#{ARGV[0]}.rev.csv", "w") do |output|
  output.puts("gene_symbol,snp_count")
  genecount.each_pair do |g, c|
    output.puts "\"#{g}\",#{c}"
  end
end
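If you want to try the counting idea without an input file, here is a small self-contained sketch. It assumes a newer Ruby (1.9 or later), where FasterCSV was folded into the standard csv library. The rows are made up and use only a few of the column names from the header comment above.

```ruby
require 'csv'

# Made-up rows; the gene symbols and SNP IDs are for illustration only.
data = <<~CSV
  id,snp_id,gene_symbol
  1,rs0001,BRCA1
  2,rs0002,TP53
  3,rs0003,BRCA1
CSV

# A hash with a default of 0 means every gene symbol can simply be incremented.
counts = Hash.new(0)
CSV.parse(data, headers: true).each do |row|
  counts[row["gene_symbol"]] += 1
end

counts.each { |gene, n| puts "#{gene},#{n}" }
```

The default value is the whole trick: there is no need to check whether a gene symbol has been seen before.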

Friday, August 17, 2007

Hacks Before Code

I often find that when you are trying to solve two problems at once, you do a poor job of both. Case in point, someone just came into my office asking how they would go about getting the number of SNPs per gene from some Excel file they have. I started to explain set theory and databases and you could see visible signs of mental shutdown ensue (the slacking jaw, the glazed eyes). Trying a different tactic and showing them a script as I wrote it to create a hash keyed by gene and the value being the count of SNPs from the file gave equal results.

So I am trying out Something New. I am going to push that researchers learn to program in a context that is completely separate from science, and is hopefully fun enough that they stick with it for more than a few days. Enter Hackety.org, a project spearheaded by _why the lucky stiff that seeks to (insert Fake Steve Jobs "voice") re-instill the childlike wonder back into learning how to program.

With HacketyHack, I hope that researchers are motivated to learn aspects of programming in an entertaining environment before they have to do any real work, which of course will suck some of the fun out of the activity.

I'll be putting together lessons to augment the existing 7 exercises of HacketyHack in the coming months with real but simple bioinformatics tutorials. So download that hack-box and get coding folks!

ITMArT: A request tracking system

For a few (3-4) months I have been working on a user request and order management tracking system. Most of that time has been spent wrestling with RoR's ajax functionality and making the UI as intuitive as I possibly can. Basically I took the "getting real" book at face value and started with the interface.

What remains, though, is a lot of "under-the-hood" plumbing: small things like getting user accounts to work with the CAS SSO server, access control lists and group management. Oh, and email alerts... yeeesh. Well at least it looks pretty.

The search works well and the cart concept seems to be pretty easy to follow. The order processing, though, still leaves something to be desired. Reporting is air-ware at the moment.

I'll keep posting tidbits about this project often (since it currently takes 90% of my time) so stay tuned!

Tut 1: Rename a set of files

Today I had a researcher come to me asking if I can write a script to rename a set of result files following some convention. This article will cover that bit of coding, but first some background:

1) I use Ruby, and Ruby on Rails, for my day-to-day operations. While there are some rough edges in Ruby's library support, it gets most things done efficiently, and of course you can't get much better than RoR for web apps. So any code in this blog will usually be Ruby code.

2) We have a commercial Laboratory Information Management System (LIMS) that creates identifiers for experiments, samples, and result files. The twist here is that most (3/4) of the experiments had already been run before the introduction of the LIMS. So while the LIMS is capable of outputting queue files for the instruments to name the files according to the LIMS' convention, this does not apply here and we must retrofit the LIMS IDs into the existing result files.

Why is this important at all? Well, the LIMS can automatically assign the result file to the annotated experiment in the system on file upload if the result file has the correct identifier in the name. While you could do this manually, you would not want to do this for the 1000 result files that were/are going to be produced. See first post on time wasting by researchers that don't know how to code. At least this one is smart enough to know there is a better way.

The good news is that as long as the filename contains the LIMS ID, it does not matter what the rest of the name is, so we only have to figure out a way to relate the existing filename to the proposed LIMS ID. This turns out to be easier than expected since they both have a sequential number in them that corresponds to the source sample.

E.g. :
existing file name = 07Aug05_SF_ASA_583.RAW
LIMS ID = APA1742A583MS3
proposed rename = APA1742A583MS3_07Aug05_SF_ASA_583.RAW

Thus a simple regular expression can pull out the proper sample number from the result filename and LIMS ID and do the renaming. Without further ado, the script:


#!/usr/bin/env ruby
require 'rubygems'
require 'fastercsv'

# output a usage message unless both inputs are given
unless ARGV.length == 2
  puts "Need input queue file and directory of RAW files"
  puts "USAGE:"
  puts "ruby rename.rb INPUT_QUEUE_FILE INPUT_DIR"
  exit(1)
end

#define a LIMS ID lookup hash keyed by the sample number
lims_ids = {}

# use FasterCSV to parse the LIMS instrument queue file for the LIMS IDs
# We need the third column for the filename (remember that arrays start with zero ;)
FasterCSV.foreach(ARGV[0]) do |row|
if (row[2] =~ /(\d+)MS3\-/)
  k = $1
  row[2] =~ /^(\S+MS3)\-/
  lims_ids[k] = $1
end
end

# change to the directory with all of the result files
# and read the files that have a "RAW" extension
Dir.chdir(ARGV[1])
raw_files = Dir.glob("*.{RAW,raw}")

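# go through the set of files and rename them
raw_files.each do |f|
  puts f
  # the sample number is the run of digits just before the extension
  sample = f[/(\d+)\.RAW$/i, 1]
  puts sample
  if lims_ids[sample]
    # File.rename avoids shelling out, so spaces in file names are safe
    File.rename(f, "#{lims_ids[sample]}_#{f}")
  end
end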
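The matching step can be tried on its own, without touching any files. The sketch below uses the example file name and LIMS ID from above. The helper names (sample_number, lims_sample_number, proposed_name) are made up for illustration, and it assumes the LIMS ID ends in "MS3" as in the example.

```ruby
# Pull the trailing sample number out of a result file name.
def sample_number(filename)
  filename[/(\d+)\.RAW\z/i, 1]
end

# Pull the sample number that sits just before "MS3" in a LIMS ID.
# Assumption: the ID ends in "MS3", as in the example above.
def lims_sample_number(lims_id)
  lims_id[/(\d+)MS3\z/, 1]
end

# Build the new name, or return nil when no LIMS ID matches the file.
def proposed_name(filename, lims_ids)
  id = lims_ids[sample_number(filename)]
  id && "#{id}_#{filename}"
end

# Lookup hash keyed by sample number, like lims_ids in the script.
lims_ids = {}
["APA1742A583MS3"].each { |id| lims_ids[lims_sample_number(id)] = id }

puts proposed_name("07Aug05_SF_ASA_583.RAW", lims_ids)
puts proposed_name("07Aug05_SF_ASA_999.RAW", lims_ids).inspect
```

Printing the proposed names first, before renaming anything, is a cheap way to spot the files whose numbers do not line up with a LIMS ID.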

Thursday, August 16, 2007

It's Alive!

Good premise for a blog ... check.
Catchy title ... check.
Bad first-post title ... check.

OK! Start the blog!

Well, they say that third time's the charm, and since my other two blogs have gone stale and moldy, this would be the third. And it's not even a New Year's resolution! Actual internal motivation started this one ;)

So what is Def Sci about, you ask? Well basically I think that bioinformatics applications (and scientific software in general) are over-engineered and do not pay enough attention to the UI aspects of the software. Many projects and otherwise good ideas fail to grasp the essential point that agile web development shops have been pushing of late: namely, that without an intuitive and simple interface, you'll never get adoption and thus never get an evangelical early adopter community to beat your drums. I am here to espouse my views on simple and agile development, with a skew towards biomedical informatics.

Open source developers and commercial vendors, lend me your ears, because I have a direct line to actual researchers and deal with them daily! I know what they like and don't like about products! I know their needs! I am the bridge, if you will, between those that would sell me something and those that would use it.

But wait that's not all! I have a message for the biomedical researcher as well:

"Learn some programming, bub."

I kid you not when I say I have walked into a meeting and someone tells me they spent weeks trawling through protein result lists in Excel determining which were the same/different across a couple of conditions and experiments and did I have a better way to do this. My answer: "That's three lines of code." OK, maybe four. Would take me 1 minute to code, and most of that time is spent typing, since I never did learn to type with more than 6 fingers.
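For the curious, here is roughly what those three lines could look like in Ruby, as a sketch only. The protein IDs are made up, and it assumes each condition's list has already been read into an array.

```ruby
# Hypothetical protein IDs from two conditions.
control = %w[P12345 Q67890 O11111]
treated = %w[Q67890 O11111 P99999]

shared       = control & treated   # seen in both conditions
only_control = control - treated   # seen only in the control
only_treated = treated - control   # seen only after treatment

puts "shared: #{shared.join(', ')}"
puts "control only: #{only_control.join(', ')}"
puts "treated only: #{only_treated.join(', ')}"
```

Array intersection and difference do all the work; the rest is just printing.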

So my audience, prepare yourself for a mix of posts ranging from problems folks approach me with (and the solutions I come up with), to reviews of open source software out there in the wild, to comments on projects that I am working on, to just plain old rants (like this one :) )