def science { }: June 2013

Saturday, June 22, 2013

The new Mac Pro, Intel Phi, and the future of Bioinformatics

There was a time, not too long ago, when you could get a relatively easy publication in bioinformatics simply by slapping together a current algorithm and a GPU, and benchmarking it against a single-threaded run of the same algorithm on the same data set. I tended to dismiss all of these efforts because, once you looked closely at what they actually did, you could find ways to easily outperform the GPU implementation using standard threading, or Map-Reduce over a cluster, at much less cost in the short and long terms. Data staging to the GPU was generally the Achilles' heel of these algorithms, and there was just no getting around that step in high-throughput genomics.

The other detriment to these sorts of algorithms was that most bioinformaticians tended to stick to interpreted languages such as Perl, Ruby and Python. The programmer productivity gains were tremendous and the languages were “fast enough”.

Two major changes in the ecosystem have me rethinking my long-standing biases. First and foremost, properly conducted next-generation sequencing experiments are quite ridiculous in size, certainly a lot bigger than anything produced by earlier technologies in biology. New algorithms, like STAR for RNA-Seq alignment, show us the value of compiled languages paired with big RAM and big data pipes.

Second, the introduction of faster bus speeds (Thunderbolt 2), the Intel Phi architecture, and the new Mac Pro with its pair of standard pro-class GPUs will drastically alter the economics and complexity of staging data to GPUs for computation.
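To make the staging argument concrete, here is a back-of-envelope sketch. It is a toy cost model, not a benchmark: the function names, the 20x kernel speedup, the 100 GB of reads, and the bus bandwidths are all made-up numbers chosen purely for illustration.

```python
def gpu_time(cpu_seconds, kernel_speedup, bytes_to_stage, bus_bytes_per_s):
    """Wall time for an offloaded job: stage the data over the bus, then compute."""
    staging = bytes_to_stage / bus_bytes_per_s
    return staging + cpu_seconds / kernel_speedup


def threaded_time(cpu_seconds, n_threads, parallel_fraction=1.0):
    """Amdahl's-law estimate for the same job spread over n CPU threads."""
    serial = cpu_seconds * (1.0 - parallel_fraction)
    return serial + cpu_seconds * parallel_fraction / n_threads


# Hypothetical job: 100 s single-threaded, 100 GB of reads to move.
single_thread = 100.0
reads_bytes = 100e9

# Assumed 5 GB/s path to the GPU: staging dominates the run.
gpu_slow_bus = gpu_time(single_thread, 20.0, reads_bytes, 5e9)

# Plain threading on 8 cores, assuming 95% of the work parallelizes.
cpu_8_threads = threaded_time(single_thread, 8, parallel_fraction=0.95)

# Same GPU kernel with an assumed 4x faster path: the balance flips.
gpu_fast_bus = gpu_time(single_thread, 20.0, reads_bytes, 20e9)

for label, seconds in [("GPU, slow bus", gpu_slow_bus),
                       ("8 CPU threads", cpu_8_threads),
                       ("GPU, fast bus", gpu_fast_bus)]:
    print(f"{label}: {seconds:.2f} s ({single_thread / seconds:.1f}x vs single thread)")
```

With these invented numbers, the "20x" GPU kernel loses to eight ordinary threads once staging is counted, and only pulls ahead when the data path gets faster. That is the whole argument in miniature; real pipelines would need real measurements.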

In parallel to this, you have Continuum IO developing Python modules for fast scientific computation and I/O on GPUs (Numba and Blaze). Last but not least, we have C++11 and GCC 4.8, with a slew of features being added to the standard library that are useful for bioinformatics (hello regular expressions ;) ).

It is my prediction that the future of NGS bioinformatics, or at least the cutting edge, will lie not with Hadoop and other “Big Data” infrastructures, but rather with high-performance algorithms that take advantage of the new architectures coming to the prosumer market.

Don't do this

This past week I gave my boss and co-workers notice that I was moving on from UPENN. Normally you wouldn't do this sort of thing until you actually have an offer letter in hand, and for most people I would recommend the latter strategy.

So why did I do it? The main reason is that I love the group I work for. Specifically, Garret FitzGerald and John Hogenesch are fantastic researchers, mentors, and all-around great people. I have a broad role and I don't want them to be caught with their pants down when I do have that offer letter. Even with the four weeks' notice that I will be giving them, it still won't be enough time to properly hand off my duties to other people. I specifically wanted them to have as much time as possible to prepare and to prioritize my responsibilities.

You may be thinking to yourself right about now, "if you love it so much, why are you leaving?" Simply put, I currently don't see a long-term future for me at UPENN. Once I realized that, I started really asking myself "what am I good at?", "what do I like to do the most?", and "what do I want to do in the future?" To answer the first two, I like and am extremely good at talking to researchers about their problems and about how to solve those problems with informatics. I like to build prototypes and design long-term scalable architectures that address actual research questions. In short, a Solutions Architect role fits my skillset very nicely.

The last question has to do with the problem domain I want to work in. Genomics, my current research field, is poised to change medicine in ways we can't yet predict. We are already seeing the value of genetic sequencing in the diagnostics field, especially with respect to cancer. We are starting to see that our internal microbe populations play a larger part in our overall health than previously thought. Most of the current work studying the microbiome would not be possible without next-generation sequencing. I want to be part of that sea change. I want to enable the use of high-throughput sequencing, and other 'omics technologies, in biomedical research that will have a huge and immediate impact on medicine. It should be noted that I could probably do that now in the group I am in, and I had already taken that into account when I thought long and hard about wanting to move on.

So where to now? Good question. I'll have an answer for you, my captive audience, in about three weeks.