Friday, January 22, 2010

Ruby + AWS == Easy Map-Reduce

As read lengths grow, we have found that the current crop of short read aligners don't seem to do very well with RNA-seq data. In a recent experiment we were only able to map about 40% of the reads using Bowtie. BLAT was able to map an additional 45% of the reads, but was orders of magnitude slower than Bowtie (3 hours vs. 8 days).

I decided to try my hand at using Amazon Web Services (AWS) to parallelize the BLAT search. Naturally I investigated Hadoop, and in particular the AWS Elastic MapReduce (EMR) service, which handles the particulars of setting up a Hadoop cluster for you. While Hadoop is a great project, and lots of people are having great success with it, there is a certain level of investment required to learn how to mold your particular data analysis pipeline to fit Hadoop's assumptions. For example, Hadoop assumes that the input file is large and that it will need to split the input for parallel processing on the worker nodes. The default method for splitting the input is on line breaks, which is not useful for FASTA sequence files. There are ways to customize this behavior, but I had to have results in two days for John Hogenesch to present at a cloud computing session for Science Online 2010.

Lucky for me, I had previously investigated another Map-Reduce framework, CloudCrowd, which was developed by DocumentCloud, a non-profit that archives source documents (scanned PDFs, page thumbnails, OCR) for news articles. The system is elegantly simple, built around the notion of a central work queue and worker nodes communicating over HTTP. A "job" is defined by specifying a set of input documents, an action to perform on those documents, and any other options you want to pass along. The action is a simple Ruby class. CloudCrowd takes care of distributing the input files to worker nodes and cleans up after itself pretty nicely. One last point: the documentation is fairly complete, easy to follow, and provides clear examples that work right out of the box on install.

For our use case, the action we want to perform is to align all of the unmapped sequences using BLAT. In order for CloudCrowd to run BLAT, we must define a Ruby class that inherits from CloudCrowd::Action and implements a system call to BLAT.
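A minimal sketch of such an action follows; the class name, the option keys (database_url, blat_args), the curl fetch of the database, and the output naming are illustrative assumptions rather than the exact production code:

    # actions/blat_align.rb
    # Sketch of a CloudCrowd action that aligns one FASTA chunk with BLAT.
    class BlatAlign < CloudCrowd::Action

      def process
        # CloudCrowd has already downloaded the input file for this work unit;
        # input_path points at the local copy.
        chunk  = input_path
        output = File.basename(chunk, '.*') + '.psl'

        # Fetch the 2bit search database (assumed to be passed in as a public
        # or pre-authenticated S3 URL in the job options).
        db = 'search_db.2bit'
        system("curl -s -o #{db} '#{options['database_url']}'") unless File.exists?(db)

        # Run the alignment; any extra BLAT arguments come through the job options.
        system("blat #{db} #{chunk} #{output} #{options['blat_args']}")

        # Push the PSL result into the asset store (S3); the returned URL
        # becomes the output of this work unit.
        save(output)
      end

    end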



CloudCrowd::Action takes care of a lot of the details for staging a work unit, such as creating working directories, downloading input files, launching the subprocesses, and cleaning up after itself when the job completes.

One difference between Hadoop and CloudCrowd is that while Hadoop will split the input and parcel out work items to nodes, CloudCrowd will send an entire input file to a single node. This means that I had to split the sequences into multiple FASTA files before sending them on to CloudCrowd. Using 100K sequences per file, we had 303 FASTA files to align. Note that this is not lost work, as I would have had to define the split routine for Hadoop anyway. All of the input files, the BLAT executable, and the search database were transferred to an S3 bucket.
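The split itself is nothing fancy. A rough sketch in plain Ruby, where the chunk size and file naming simply follow the 100K-sequences-per-file convention described above:

    # split_fasta.rb -- break a large FASTA file into 100K-sequence chunks.
    # Usage: ruby split_fasta.rb unmapped.fa
    CHUNK_SIZE = 100_000

    chunk, count, out = 0, 0, nil
    File.open(ARGV[0]) do |fasta|
      fasta.each_line do |line|
        if line.start_with?('>')
          # Start a new chunk file every CHUNK_SIZE sequences.
          if count % CHUNK_SIZE == 0
            out.close if out
            out = File.open(format("unmapped_%03d.fa", chunk), 'w')
            chunk += 1
          end
          count += 1
        end
        out.write(line) if out
      end
    end
    out.close if out

The resulting chunks, along with the BLAT binary and the search database, can then be pushed to the bucket with whatever S3 tool you prefer (s3cmd, s3sync, and friends all work).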




Next I provisioned a small EC2 instance to act as the central CloudCrowd master, giving it a stable IP and configuring CloudCrowd's connection options for S3 and HTTP basic authentication for the nodes (see the CloudCrowd documentation). This manual process is fine for the master, but I wanted the worker nodes to automatically configure themselves, start the CloudCrowd node process, connect to the master, and get jobs. You can provide EC2 nodes with a shell script that gets executed on startup in what they call "user data". Here is my script (modified to protect my credentials and the data ;) ).
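In sketch form it does something like the following; the bucket names, URLs, package list, and exact crowd invocation are placeholders rather than the verbatim script:

    #!/bin/bash
    # EC2 "user data" boot script for a CloudCrowd worker node (sketch).

    # Point apt at current public mirrors, then bring the stock
    # Ubuntu 9.10 (ami-55739e3c) image up to date.
    # echo "deb http://<mirror>/ubuntu karmic main universe" >> /etc/apt/sources.list
    apt-get update -y
    apt-get upgrade -y

    # Ruby, RubyGems, build tools, and the CloudCrowd gem.
    apt-get install -y ruby ruby-dev rubygems build-essential curl
    gem install cloud-crowd --no-ri --no-rdoc
    export PATH=$PATH:/var/lib/gems/1.8/bin   # where Ubuntu's rubygems puts gem binaries

    # BLAT binary, staged in our S3 bucket alongside the data.
    curl -s -o /usr/local/bin/blat "http://mybucket.s3.amazonaws.com/blat"
    chmod +x /usr/local/bin/blat

    # Pre-packaged CloudCrowd app (config plus the BLAT action, including AWS
    # credentials), fetched with a pre-authenticated URL.
    curl -s -o /tmp/blat_crowd.tgz "https://s3.amazonaws.com/mybucket/blat_crowd.tgz?AWSAccessKeyId=...&Signature=..."
    tar xzf /tmp/blat_crowd.tgz -C /opt

    # Start a worker, pointing it at the packaged configuration directory.
    crowd -c /opt/blat_crowd/config node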



That script takes the public Ubuntu 9.10 image (ami-55739e3c), adds the latest public repositories to the apt sources, updates all of the packages, and installs the latest Ruby and RubyGems packages. In addition, it installs the CloudCrowd gem, downloads and installs BLAT, uses a pre-authenticated URL to fetch the pre-packaged CloudCrowd app that I made (which has my AWS credentials and actually defines the BLAT job), and runs a CloudCrowd node process.

Now, as nodes became available, they connected to the master and started working. If a node went down, the system automatically coped with the failure by resending its work to other nodes. We could also add new nodes as we saw fit to decrease the time to completion. This fluid setup allowed me to take advantage of EC2 "spot instances", which have a variable, demand-based pricing model that tends to be cheaper than the on-demand price (12¢ vs. 34¢ per hour), with the caveat that your instance may come up at some random time and will get shut down whenever the price exceeds your requested maximum. For this particular job, I bid $0.25 and the actual price fluctuated between $0.12 and $0.18 per hour.

Finally, we create an HTTP request using the following Ruby script, which designates the input files and database located in a public S3 bucket and defines the parameters to use.
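A sketch of that submission script, using the rest-client gem; the bucket, host, credentials, and BLAT arguments are placeholders:

    # submit_job.rb -- post the BLAT job to the central CloudCrowd server.
    require 'rubygems'
    require 'json'
    require 'rest_client'

    bucket = 'http://mybucket.s3.amazonaws.com'

    job = {
      'action' => 'blat_align',
      # One input URL per FASTA chunk; each becomes an independent work unit.
      'inputs' => (0...303).map { |i| "#{bucket}/unmapped_#{format('%03d', i)}.fa" },
      'options' => {
        'database_url' => "#{bucket}/search_db.2bit",
        'blat_args'    => '-minScore=20 -minIdentity=90'
      }
    }

    # POST the JSON job description to the central server's /jobs endpoint,
    # using the same HTTP basic auth credentials the nodes were configured with.
    RestClient.post('http://user:password@crowd-master.example.com:9173/jobs',
                    :job => job.to_json)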



The system is not perfect and could use a lot of tweaking, but not bad for a day's work.