def science { }

Wednesday, February 1, 2017

To be (AWS certified), or not to be (AWS certified)

A couple of my bioinformatics colleagues have recently asked me about whether they should be AWS certified, and, if so, which certification would be best.

My general answer is that AWS certification doesn't hurt to have, but it should not be considered a "must have" requirement for bioinformatics positions. The main reason I say this is that the certification tests, at least at the Associate level, tend to cover a broad set of AWS services, and some will most certainly not apply to many bioinformatics positions. As a community, we tend to go very deep in certain technologies as a function of our projects. Bioinformatics spans a varied set of technology stacks including web applications, batch analysis of large corpora of data, machine learning, etc. But any individual bioinformatics practitioner will tend to focus on only one of these types of projects for long stretches of time, usually measured in years.

Still, it does not hurt to have an Associate-level AWS certification as either a Solutions Architect or a Developer. These two in particular will teach the foundational services that one would encounter in using AWS for bioinformatics, as well as give you a grounding in how to navigate the documentation and leverage the AWS ecosystem when you need to dive deeper on some service or domain.

Friday, December 23, 2016

Post on BAM index file parsing.

I posted a note on parsing binary files using Python over on my Rusty Bioinformatics blog. The example uses BAM index files, which are well described but do not have a software library that specifically targets reading and writing these types of files. Enjoy!
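For a sense of what reading such a file involves, here is a minimal sketch using Python's standard `struct` module. It is not the code from the linked post. It assumes the BAI layout given in the SAM/BAM specification (a 4-byte magic `BAI\1`, a little-endian `int32` reference count, then per-reference bins, chunks, and linear-index intervals), and the function names `read_exact` and `summarize_bai` are made up for illustration.

```python
import struct

BAI_MAGIC = b"BAI\x01"  # magic string at the start of a BAM index file


def read_exact(stream, size):
    """Read exactly `size` bytes or raise if the stream ends early."""
    data = stream.read(size)
    if len(data) != size:
        raise ValueError("unexpected end of BAI data")
    return data


def summarize_bai(stream):
    """Return one (n_bin, n_chunk_total, n_intv) tuple per reference.

    `stream` is any binary file-like object positioned at byte 0.
    Offsets themselves are skipped; only the counts are collected.
    """
    # "<" means little-endian with no padding: 4-byte string + int32.
    magic, n_ref = struct.unpack("<4si", read_exact(stream, 8))
    if magic != BAI_MAGIC:
        raise ValueError("not a BAI file (bad magic)")

    summary = []
    for _ in range(n_ref):
        (n_bin,) = struct.unpack("<i", read_exact(stream, 4))
        n_chunk_total = 0
        for _ in range(n_bin):
            # Each bin: uint32 bin number, int32 chunk count.
            _bin_id, n_chunk = struct.unpack("<Ii", read_exact(stream, 8))
            # Each chunk is a pair of uint64 virtual file offsets.
            read_exact(stream, 16 * n_chunk)
            n_chunk_total += n_chunk
        # Linear index: int32 count, then one uint64 offset per interval.
        (n_intv,) = struct.unpack("<i", read_exact(stream, 4))
        read_exact(stream, 8 * n_intv)
        summary.append((n_bin, n_chunk_total, n_intv))
    return summary
```

To try it on a real index, open the file in binary mode, for example `with open("sample.bam.bai", "rb") as fh: print(summarize_bai(fh))`, where the file name is a placeholder.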

Saturday, August 22, 2015

DNA reverse complement program in Rust

After a very long hiatus, I am starting to write for pleasure again. In addition to posting some opinion pieces here, I am working on another blog, Rusty Bioinformatics, that is more focused on (re)learning bioinformatics programming.

The first post of note walks through writing a simple program, in the Rust programming language, that prints the reverse complement of a DNA sequence given on the command line. Find it here:

DNA reverse complement in Rust

Enjoy!
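If you have not met the operation before, the computation itself is small. The sketch below is in Python rather than Rust and is not the program from the linked post. The function name `reverse_complement` and the choice to pass unknown characters through unchanged are assumptions made for illustration.

```python
# Translation table mapping each base to its complement, preserving case.
# N (any base) maps to itself.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence string.

    Characters outside A, C, G, T, N are left unchanged (an assumption;
    a stricter version might raise an error instead).
    """
    # Complement every base, then reverse the whole string.
    return seq.translate(_COMPLEMENT)[::-1]
```

For example, `reverse_complement("ATGC")` gives `"GCAT"`.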

The AWS genomics ecosystem

This past week, my good friend and colleague at AWS, Chris Crosbie, published a blog post about analyzing whole genome sequence data with one of our genomic analysis platform partners, Seven Bridges Genomics, importing the resulting VCF files to Amazon Redshift, and doing a simple analysis of the data within R and Bioconductor.

The post is lengthy and technically dense, but it is delightful in that it showcases one of the major advantages of working with genomics data on AWS: namely the rich and diverse ecosystem of tools and platform providers that a researcher can bundle together to fit their exact needs.

There has certainly been a failing on my part to communicate the fullness of this ecosystem, and Chris’s post has been a good reminder to me that I need to communicate better (and more often) to the broader bioinformatics community about how to effectively pare down their choices when facing some analysis challenge.

PS: If you like Chris’s post, be sure to read the other one we put together with Matt Wood about uploading dbGaP data to Amazon S3.

Sunday, August 4, 2013

Exciting New Adventures

Last week marked a major turning point in my career, as it was my last week working for Penn. I started in September 1997, almost 16 years of service to a fantastic organization that allowed me to learn, experiment, and grow as a person, engineer, and scientist. It has made me what I am today and I will forever be grateful to the organization and everyone I have worked for and with. Leaving it was an incredibly hard choice to make, but I am confident it was the right thing to do.

On Monday, August 5th, I start at Amazon Web Services as a Sr. Solutions Architect. While I have some idea of what the role entails, which is generally to help people use cloud computing for solving scientific and high-performance computing problems, I am under no illusion that I know what it will actually be like in practice. Or what career paths just opened up for me.

I am bouncing-off-the-walls excited for tomorrow to come.

Saturday, June 22, 2013

The new Mac Pro, Intel Phi, and the future of Bioinformatics

There was a time not too long ago when you could get a relatively easy publication in bioinformatics simply by slapping together a current algorithm and a GPU, and benchmarking it against a single-threaded run of the same algorithm on the same data set. I tended to dismiss all of these efforts, because once you looked closely at what they actually did, you could find ways to easily outperform the GPU implementation using standard threading, or Map-Reduce over a cluster, at much less cost in the short and long terms. Data staging to the GPU was generally the Achilles' heel of these algorithms, and there was just no getting around that step in high-throughput genomics.

The other detriment to these sorts of algorithms was that most bioinformaticians tended to stick to interpreted languages such as Perl, Ruby and Python. The programmer productivity gains were tremendous and the languages were “fast enough”.

There are two major changes in the ecosystem that have me rethinking my long-standing biases. First and foremost, properly conducted next-generation sequencing experiments are quite ridiculous in size. Certainly a lot bigger than any technology before it (in biology). New algorithms, like STAR for RNA-Seq alignment, show us the value of compiled languages paired with big RAM and big data pipes.

Second, the introduction of faster bus speeds (Thunderbolt 2), the Intel Phi architecture, and the new Mac Pro with its pair of standard pro-class GPUs, will drastically alter the economics and complexity of staging data to GPUs for computation.

In parallel to this, you have Continuum IO developing Python modules for fast scientific computation and I/O on GPUs (Numba and Blaze). Last but not least, we have C++11 and GCC 4.8, with a slew of features being added to the standard library that are useful for bioinformatics (hello regular expressions ;) ).

It is my prediction that the future of NGS bioinformatics, or at least the cutting edge, will lie not with Hadoop and other “Big Data” infrastructures, but rather with high-performance algorithms that take advantage of the new architectures coming to the pro-sumer market.

Don't do this

This past week I gave my boss and co-workers notice that I was moving on from UPENN. Normally you wouldn't do this sort of thing until you actually have an offer letter in hand, and for most people I would recommend waiting for that letter.

So why did I do it? The main reason is that I love the group I work for. Specifically Garret FitzGerald and John Hogenesch are fantastic researchers, mentors, and all-around great people. I have a broad role and I don't want them to be caught with their pants down when I do have that offer letter. Even with the four weeks' notice that I will be giving them, it still won't be enough time to properly hand off my duties to other people. I specifically wanted them to have as much time as possible to prepare and to prioritize my responsibilities.

You may be thinking to yourself right about now "if you love it so much, why are you leaving?" Simply put, I currently don't see a long-term future for me at UPENN. Once I realized that, I started really asking myself "what am I good at?", "what do I like to do the most?", and "what do I want to do in the future?" To answer the first two, I like and am extremely good at talking to researchers about their problems and about how to solve those problems with informatics. I like to build prototypes and design long-term scalable architectures that address actual research questions. In short, a Solutions Architect role fits my skillset very nicely.

The last question has to do with the problem domain I want to work in. Genomics, my current research field, is poised to change medicine in ways we can't yet predict. We are already seeing the value of genetic sequencing in the diagnostics field, especially with respect to cancer. We are starting to see that our internal microbe populations play a larger part in our overall health than previously thought. Most of the current work studying the microbiome would not be possible without next-generation sequencing. I want to be part of that sea change. I want to enable the use of high-throughput sequencing, and other 'omics technologies, in biomedical research that will have a huge and immediate impact on medicine. It should be noted that I could probably do that now in the group I am in, and I had already taken that into account when I thought long and hard about wanting to move on.

So where to now? Good question. I'll have an answer for you, my captive audience, in about three weeks.