Friday, August 17, 2007

Tut 1: Rename a set of files

Today a researcher asked me whether I could write a script to rename a set of result files according to a naming convention. This article covers that bit of coding, but first some background:

1) I use Ruby, and Ruby on Rails, for my day-to-day operations. While there are some rough edges in Ruby's library support, it gets most things done efficiently, and of course you can't get much better than RoR for web apps. So any code in this blog will usually be Ruby code.

2) We have a commercial Laboratory Information Management System (LIMS) that creates identifiers for experiments, samples, and result files. The twist here is that most (3/4) of the experiments were completed before the LIMS was introduced. So while the LIMS is capable of outputting queue files that make the instruments name their files according to the LIMS convention, that does not help here, and we must retrofit the LIMS IDs into the existing result file names.

Why is this important at all? Well, the LIMS can automatically assign a result file to the annotated experiment in the system on file upload if the result file has the correct identifier in its name. While you could make the assignment manually, you would not want to do so for the 1000 result files that have been or will be produced. See the first post on time wasted by researchers who don't know how to code. At least this one is smart enough to know there is a better way.

The good news is that as long as the filename contains the LIMS ID, it does not matter what the rest of the name is, so we only have to figure out a way to relate the existing filename to the proposed LIMS ID. This turns out to be easier than expected, since both contain a sequential number that corresponds to the source sample.

E.g. :
existing file name = 07Aug05_SF_ASA_583.RAW
LIMS ID = APA1742A583MS3
proposed rename = APA1742A583MS3_07Aug05_SF_ASA_583.RAW
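As a quick check of the idea, here is a minimal sketch that pulls the sample number out of both example strings and builds the proposed name. It assumes the bare LIMS ID ends in "MS3", as in the example above; the variable names are only for illustration.

```ruby
# Example values taken from the post above.
file_name = "07Aug05_SF_ASA_583.RAW"
lims_id   = "APA1742A583MS3"

# Sample number in the file name: the digits just before the .RAW extension.
file_sample = file_name[/(\d+)\.raw\z/i, 1]

# Sample number in the LIMS ID: the digits just before the trailing "MS3"
# (assumption: the ID always ends this way).
lims_sample = lims_id[/(\d+)MS3\z/, 1]

# When both numbers agree, the LIMS ID is prefixed to the old name.
new_name = "#{lims_id}_#{file_name}" if file_sample && file_sample == lims_sample

puts file_sample
puts lims_sample
puts new_name
```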

Thus a simple regular expression can pull out the proper sample number from the result filename and the LIMS ID and do the renaming. Without further ado, the script:

#!/usr/bin/env ruby
require 'rubygems'
require 'fastercsv'

# output a usage message unless both inputs are given
unless ARGV[0] && ARGV[1]
  puts "Need input queue file and directory of RAW files"
  puts "USAGE:"
  puts "ruby rename.rb INPUT_QUEUE_FILE INPUT_DIR"
  # non-zero exit status signals that the script did not run
  exit(1)
end

# define a LIMS ID lookup hash keyed by the sample number
lims_ids = {}

# use FasterCSV to parse the LIMS instrument queue file for the LIMS IDs
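# We need the third column for the filename (remember that arrays start with zero ;)
FasterCSV.foreach(ARGV[0]) do |row|
  # a single match captures the full LIMS ID ($1) and its sample number ($2);
  # the lazy \S*? lets (\d+) take the whole run of digits before "MS3"
  if row[2] =~ /^(\S*?(\d+)MS3)-/
    lims_ids[$2] = $1
  end
end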

# change to the directory with all of the result files
# and read the files that have a "RAW" extension
Dir.chdir(ARGV[1])
raw_files = Dir.glob("*.{RAW,raw}")

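# go through the set of files and rename them
raw_files.each do |f|
  puts f
  # the sample number is the run of digits right before the extension
  sample = f[/(\d+)\.RAW$/i, 1]
  puts sample
  # File.rename avoids a shell call, so names with spaces or quotes are safe
  if lims_ids[sample]
    File.rename(f, "#{lims_ids[sample]}_#{f}")
  end
end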
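
Before letting a script like this loose on 1000 files, it can help to preview what it would do. The following is only a sketch: rename_plan is a hypothetical helper that is not part of the script above, and every file name except the first is made up for illustration. It builds the old name to new name mapping without touching the disk, and it skips files that already carry their LIMS prefix so that a second run does no harm.

```ruby
# Hypothetical helper (not part of the original script): returns a hash of
# old name => new name for every file whose sample number has a LIMS ID.
def rename_plan(file_names, lims_ids)
  plan = {}
  file_names.each do |name|
    # digits right before the .RAW extension, or nil if the name does not fit
    sample = name[/(\d+)\.raw\z/i, 1]
    lims_id = sample && lims_ids[sample]
    next unless lims_id
    # skip files that already carry the prefix, so a second run is harmless
    next if name.start_with?("#{lims_id}_")
    plan[name] = "#{lims_id}_#{name}"
  end
  plan
end

# Illustrative data: one known sample, one unknown, one already renamed.
lims_ids = { "583" => "APA1742A583MS3" }
files = [
  "07Aug05_SF_ASA_583.RAW",
  "07Aug05_SF_ASA_999.RAW",
  "APA1742A583MS3_07Aug05_SF_ASA_583.RAW"
]

rename_plan(files, lims_ids).each { |old, new| puts "#{old} -> #{new}" }
```

Only the first file ends up in the plan; the second has no LIMS ID in the lookup hash and the third is already prefixed.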