Saturday, June 22, 2013

The new Mac Pro, Intel Phi, and the future of Bioinformatics

There was a time not too long ago when you could get a relatively easy publication in bioinformatics simply by slapping together a current algorithm and a GPU, then benchmarking it against a single-threaded run of the same algorithm on the same data set. I tended to dismiss all of these efforts, because once you looked closely at what they actually did, you could find ways to easily outperform the GPU implementation using standard threading, or Map-Reduce over a cluster, at much lower cost in both the short and long term. Data staging to the GPU was generally the Achilles heel of these algorithms, and there was just no getting around that step in high-throughput genomics.
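To make the "standard threading or Map-Reduce" baseline concrete, here is a minimal sketch in Python of a map/reduce-style k-mer count over a worker pool. The function names (`count_kmers`, `merge_counts`, `parallel_kmer_counts`) and the toy reads are hypothetical, chosen only for illustration; this is not taken from any of the GPU papers alluded to above.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def count_kmers(read, k):
    """Map step: count every k-mer in a single read."""
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))


def merge_counts(left, right):
    """Reduce step: fold one partial tally into another."""
    left.update(right)
    return left


def parallel_kmer_counts(reads, k, workers=4):
    """Count k-mers across all reads by mapping over a pool of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields per-read tallies in input order
        partials = pool.map(lambda read: count_kmers(read, k), reads)
        return reduce(merge_counts, partials, Counter())


if __name__ == "__main__":
    reads = ["ACGTAC", "GTACGT", "TTACGA"]  # toy reads, not real data
    print(parallel_kmer_counts(reads, k=3).most_common(3))
```

One caveat on the design: in CPython, threads will not speed up a pure-Python inner loop like this one because of the global interpreter lock. For CPU-bound work the usual route is to swap in `ProcessPoolExecutor` (with a module-level function in place of the lambda), or to push the hot loop into compiled code. The map and reduce structure stays the same either way.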

The other strike against these sorts of algorithms was that most bioinformaticians tended to stick to interpreted languages such as Perl, Ruby and Python. The programmer productivity gains were tremendous and the languages were “fast enough”.

There are two major changes in the ecosystem that have me rethinking my long-standing biases. First and foremost, the data sets from properly conducted next-generation sequencing experiments are quite ridiculous in size, certainly a lot bigger than anything produced by earlier technologies in biology. New algorithms, like STAR for RNA-Seq alignment, show us the value of compiled languages paired with big RAM and big data pipes.

Second, the introduction of faster bus speeds (Thunderbolt 2), the Intel Xeon Phi architecture, and the new Mac Pro with its pair of standard pro-class GPUs will drastically alter the economics and complexity of staging data to GPUs for computation.

In parallel to this, you have Continuum Analytics developing Python modules for fast scientific computation and I/O on GPUs (Numba and Blaze). Last but not least, we have C++11 and GCC 4.8, which bring a slew of standard-library features useful for bioinformatics (hello regular expressions ;) ).

It is my prediction that the future of NGS bioinformatics, or at least the cutting edge, will lie not with Hadoop and other “Big Data” infrastructures, but rather with high-performance algorithms that take advantage of the new architectures coming to the prosumer market.