Friday, December 23, 2016

Post on BAM index file parsing.

I posted a note on parsing binary files using Python over on my Rusty Bioinformatics blog. The example uses BAM index files, which are well described but do not have a software library that specifically targets reading and writing these types of files. Enjoy!
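
For a flavor of what that kind of parsing looks like, here is a minimal sketch using Python's standard `struct` module. It is not the code from the linked post. The function name `read_bai_summary` is made up for illustration, and the little-endian field layout (magic bytes, reference count, bins with chunks, then the linear index) is my reading of the BAI layout in the SAM/BAM specification, so treat it as an assumption to check against the spec.

```python
import io
import struct


def read_bai_summary(stream):
    """Return one (n_bins, n_intervals) tuple per reference in a BAI stream.

    Hypothetical helper for illustration; the layout is assumed from the
    SAM/BAM specification, with all integers little-endian.
    """
    if stream.read(4) != b"BAI\x01":
        raise ValueError("not a BAI file: bad magic bytes")
    (n_ref,) = struct.unpack("<i", stream.read(4))
    summary = []
    for _ in range(n_ref):
        (n_bin,) = struct.unpack("<i", stream.read(4))
        for _ in range(n_bin):
            # Each bin: uint32 bin id, int32 chunk count.
            _bin_id, n_chunk = struct.unpack("<Ii", stream.read(8))
            # Skip the chunks: two uint64 virtual offsets per chunk.
            stream.read(16 * n_chunk)
        (n_intv,) = struct.unpack("<i", stream.read(4))
        # Skip the linear index: one uint64 offset per interval.
        stream.read(8 * n_intv)
        summary.append((n_bin, n_intv))
    return summary


# A tiny synthetic index built in memory: one reference, one bin with one
# chunk, and two linear-index intervals. The values are arbitrary.
fake = (
    b"BAI\x01"
    + struct.pack("<i", 1)          # n_ref
    + struct.pack("<i", 1)          # n_bin
    + struct.pack("<Ii", 4681, 1)   # bin id, n_chunk
    + struct.pack("<QQ", 0, 100)    # chunk_beg, chunk_end
    + struct.pack("<i", 2)          # n_intv
    + struct.pack("<QQ", 0, 50)     # linear index offsets
)
print(read_bai_summary(io.BytesIO(fake)))  # prints [(1, 2)]
```

The sketch only counts records and skips their payloads. A real reader would keep the chunk and interval offsets and open the `.bai` file in binary mode.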

Saturday, August 22, 2015

DNA reverse complement program in Rust

After a very long hiatus, I am starting to write for pleasure again. In addition to posting some opinion pieces here, I am working on another blog, Rusty Bioinformatics, that is more focused on (re)learning bioinformatics programming.

The first post of note walks through writing a simple program, in the Rust programming language, that prints the reverse complement of a DNA sequence given on the command line. Find it here:

DNA reverse complement in Rust
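
The linked post does this in Rust. Purely as a language-neutral illustration of the idea, and not the code from that post, here is a minimal Python sketch. The function name `reverse_complement` and the choice to handle only A, C, G, T and N (in either case) are my assumptions; other characters pass through unchanged.

```python
# Translation table mapping each base to its complement, preserving case.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement every base, then reverse the whole sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))  # prints GCAT
```

A command-line version would only need to read the sequence from `sys.argv` and print the result.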


The AWS genomics ecosystem

This past week, my good friend and colleague at AWS, Chris Crosbie, published a blog post about analyzing whole genome sequence data with one of our genomic analysis platform partners, Seven Bridges Genomics, importing the resulting VCF files to Amazon Redshift, and doing a simple analysis of the data within R and Bioconductor.

The post is lengthy and technically dense, but it is delightful in that it showcases one of the major advantages of working with genomics data on AWS: namely the rich and diverse ecosystem of tools and platform providers that a researcher can bundle together to fit their exact needs.

There has certainly been a failing on my part to communicate the fullness of this ecosystem, and Chris’s post has been a good reminder to me that I need to communicate better (and more often) to the broader bioinformatics community about how to effectively pare down their choices when facing an analysis challenge.

PS: If you like Chris’s post, be sure to read the other one we put together with Matt Wood about uploading dbGaP data to Amazon S3.

Sunday, August 4, 2013

Exciting New Adventures

Last week marked a major turning point in my career, as it was my last week working for Penn. I started in September 1997, almost 16 years of service to a fantastic organization that allowed me to learn, experiment, and grow as a person, engineer, and scientist. It has made me what I am today and I will forever be grateful to the organization and everyone I have worked for and with. Leaving it was an incredibly hard choice to make, but I am confident it was the right thing to do.

On Monday, August 5th, I start at Amazon Web Services as a Sr. Solutions Architect. While I have some idea of what the role entails, which is generally to help people use cloud computing for solving scientific and high-performance computing problems, I am under no illusion that I know what it will actually be like in practice, or what career paths just opened up for me.

I am bouncing-off-the-walls excited for tomorrow to come.

Saturday, June 22, 2013

The new Mac Pro, Intel Phi, and the future of Bioinformatics

There was a time not too long ago when you could get a relatively easy publication in bioinformatics simply by slapping together a current algorithm and a GPU, and benchmarking it against a single-threaded run of the same algorithm on the same data set. I tended to dismiss all of these efforts because, once you looked closely at what they actually did, you could find ways to easily outperform the GPU implementation using standard threading, or Map-Reduce over a cluster, at much less cost in the short and long terms. Data staging to the GPU was generally the Achilles' heel of these algorithms, and there was just no getting around that step in high-throughput genomics.

The other detriment to these sorts of algorithms was that most bioinformaticians tended to stick to interpreted languages such as Perl, Ruby and Python. The programmer productivity gains were tremendous and the languages were “fast enough”.

There are two major changes in the ecosystem that have me rethinking my long-standing biases. First and foremost, properly conducted next-generation sequencing experiments are quite ridiculous in size. Certainly a lot bigger than anything produced by any technology before it (in biology). New algorithms, like STAR for RNA-Seq alignment, show us the value of compiled languages paired with big RAM and big data pipes.

Second, the introduction of faster bus speeds (Thunderbolt 2), the Intel Phi architecture, and the new Mac Pro with its pair of standard pro-class GPUs, will drastically alter the economics and complexity of staging data to GPUs for computation.

In parallel to this, you have Continuum IO developing Python modules for fast scientific computation and I/O on GPUs (Numba and Blaze). Last but not least, we have C++11 and GNU gcc 4.8, with a slew of features being added to the standard library that are useful for bioinformatics (hello regular expressions ;) ).

It is my prediction that the future of NGS bioinformatics, or at least the cutting edge, will lie not with Hadoop and other “Big Data” infrastructures, but rather with high-performance algorithms that take advantage of the new architectures coming to the pro-sumer market.

Don't do this

This past week I gave my boss and co-workers notice that I was moving on from UPENN. Normally you wouldn't do this sort of thing until you actually have an offer letter in hand, and for most people I would recommend waiting for that letter.

So why did I do it? The main reason is that I love the group I work for. Specifically, Garret FitzGerald and John Hogenesch are fantastic researchers, mentors, and all-around great people. I have a broad role and I don't want them to be caught with their pants down when I do have that offer letter. Even with the four weeks' notice that I will be giving them, it still won't be enough time to properly hand off my duties to other people. I wanted them to have as much time as possible to prepare and to prioritize my responsibilities.

 You may be thinking to yourself right about now "if you love it so much, why are you leaving?" Simply put, I currently don't see a long term future for me at UPENN. Once I realized that, I started really asking myself "what am I good at?", "what do I like to do the most?", and "what do I want to do in the future?" To answer the first two, I like and am extremely good at talking to researchers about their problems and about how to solve those problems with informatics. I like to build prototypes and design long-term scalable architectures that address actual research questions. In short, a Solutions Architect role fits my skillset very nicely.

 The last question has to do with the problem domain I want to work in. Genomics, my current research field, is poised to change medicine in ways we can't yet predict. We are already seeing the value of genetic sequencing in the diagnostics field, especially with respect to cancer. We are starting to see that our internal microbe populations play a larger part in our overall health than previously thought. Most of the current work studying the microbiome would not be possible without next generation sequencing. I want to be part of that sea change. I want to enable the use of high-throughput sequencing, and other 'omics technologies, in biomedical research that will have a huge and immediate impact on medicine. It should be noted that I could probably do that now in the group I am in, and I had already taken that into account when I thought long and hard about wanting to move on.

 So where to now? Good question. I'll have an answer for you, my captive audience, in about three weeks.

Wednesday, January 18, 2012

Getting Promise Pegasus to actually email you in case of drive failure

UPDATE: ContrarySheep (a.k.a. GriffithStudio) has posted an excellent update to these scripts that takes advantage of launchd to run these processes. See this gist for more information.

So the Promise Pegasus Thunderbolt external storage array is a super piece of hardware for the price. One thing they list as a feature is the ability to email a user in case of a drive failure. One problem, though, is that I could not get this to work for me.

A nice feature of the system, though, is that the Promise Utility GUI that ships with the array also installs a fairly full-featured command line utility. For instance, you can get a quick report of the drives in the array by issuing the following:


I wrote a quick little Ruby script to grab the drive information and send me an email in case the status is not "OK".

DISCLAIMER: This snippet is provided as-is. I am not guaranteeing that it works at all, and it assumes that the status will change from "OK" to something else if there is an issue with a drive. Use at your own risk.