tag:blogger.com,1999:blog-79407300100224851112024-03-14T00:04:07.353-04:00def science { }bioinformatics on the cloud.Angelhttp://www.blogger.com/profile/09938400316307441118noreply@blogger.comBlogger50125tag:blogger.com,1999:blog-7940730010022485111.post-78069855768072842852017-02-01T08:45:00.000-05:002017-02-01T08:45:29.930-05:00To be (AWS certified), or not to be (AWS certified)A couple of my bioinformatics colleagues have recently asked me about whether they should be <a href="https://aws.amazon.com/certification/" target="_blank">AWS certified</a>, and, if so, which certification would be best.<br />
<br />
My general answer is that AWS certification doesn't hurt to have, but it should not be considered a "must have" requirement for bioinformatics positions. The main reason I say this is that the certification exams, at least at the Associate level, tend to cover a broad set of AWS services, some of which will most certainly not apply to many bioinformatics positions. As a community, we tend to go very deep into certain technologies as a function of our projects. Bioinformatics spans a varied set of technology stacks, including web applications, batch analysis of large corpora of data, machine learning, etc. But any individual bioinformatics practitioner will tend to focus on only one of these types of projects for long stretches of time, usually measured in years.<br />
<br />
Still, it does not hurt to have an Associate-level AWS certification in either <a href="https://aws.amazon.com/certification/certified-solutions-architect-associate/" target="_blank">Solutions Architecture</a> or <a href="https://aws.amazon.com/certification/certified-developer-associate/" target="_blank">Software Development</a>. These two in particular cover the foundational services that one would encounter in using AWS for bioinformatics, and give you a grounding in how to navigate the documentation and leverage the AWS ecosystem when you need to dive deeper into some service or domain.<br />
<br />
<br />Angelhttp://www.blogger.com/profile/09938400316307441118noreply@blogger.comtag:blogger.com,1999:blog-7940730010022485111.post-84466630070490387252016-12-23T21:26:00.004-05:002016-12-23T21:26:42.479-05:00Post on BAM index file parsing. I posted a note on parsing <a href="http://delagoya.github.io/rusty-bio/binary-file-parsing-in-bioinformatics" target="_blank">binary files using Python</a> over on my <a href="http://delagoya.github.io/rusty-bio/" target="_blank">Rusty Bioinformatics</a> blog. The example uses BAM index files, which are well described but do not have a software library that specifically targets reading and writing these types of files. Enjoy!Angelhttp://www.blogger.com/profile/09938400316307441118noreply@blogger.comtag:blogger.com,1999:blog-7940730010022485111.post-27264446283609316662015-08-22T21:39:00.001-04:002015-08-22T21:39:36.051-04:00DNA reverse complement program in RustAfter a very long hiatus, I am starting to write for pleasure again. In addition to posting some opinion pieces here, I am working on another blog, <a href="http://delagoya.github.io/rusty-bio/">Rusty Bioinformatics</a>, that is more focused on (re)learning bioinformatics programming.<br />
<br />
The first post of note is a walk-through of writing a simple program that prints the reverse complement of a DNA sequence on the command line, written in the Rust programming language. Find it here:<br />
<br />
<a href="http://delagoya.github.io/rusty-bio/dna-reverse-complement" target="_blank">DNA reverse complement in Rust</a><br />
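The post walks through the Rust version; for comparison, the core operation (complement each base, then reverse the string) is nearly a one-liner in Ruby. This is just a sketch, not the code from the post:

```ruby
# Reverse complement of a DNA sequence: map each base to its complement
# (A<->T, C<->G, preserving case), then reverse the string.
def reverse_complement(seq)
  seq.tr('ACGTacgt', 'TGCAtgca').reverse
end

# Command-line use: ruby revcomp.rb ATGC  => GCAT
puts reverse_complement(ARGV[0]) unless ARGV.empty?
```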
<br />
<br />
Enjoy!Angelhttp://www.blogger.com/profile/09938400316307441118noreply@blogger.comtag:blogger.com,1999:blog-7940730010022485111.post-62966293950565388792015-08-22T21:30:00.001-04:002015-08-22T21:30:43.685-04:00The AWS genomics ecosystem<div class="p1">
<span class="s1">This past week, my good friend and colleague at AWS, <a href="https://www.linkedin.com/pub/christopher-crosbie/61/349/814"><span class="s2">Chris Crosbie</span></a>, published <a href="https://blogs.aws.amazon.com/bigdata/post/TxB9H9MGP4JBBQ/Extending-Seven-Bridges-Genomics-with-Amazon-Redshift-and-R"><span class="s2">a blog post</span></a> about analyzing whole genome sequence data with one of our genomic analysis platform partners, <a href="https://www.sbgenomics.com/"><span class="s2">Seven Bridges Genomics</span></a>, importing the resulting VCF files to <a href="http://aws.amazon.com/redshift/"><span class="s2">Amazon Redshift</span></a>, and doing a simple analysis of the data within <a href="https://www.bioconductor.org/help/bioconductor-cloud-ami/"><span class="s2">R and Bioconductor</span></a>.</span></div>
<div class="p1">
<span class="s1"><br /></span></div>
<div class="p1">
<span class="s1">The post is lengthy and technically dense, but it is delightful in that it showcases one of the major advantages of working with genomics data on AWS: namely the rich and diverse ecosystem of tools and platform providers that a researcher can bundle together to fit their exact needs.</span></div>
<div class="p1">
<span class="s1"><br /></span></div>
<div class="p1">
<span class="s1">There has certainly been a failing on my part to communicate the fullness of this ecosystem, and Chris’s post has been a good reminder to me that I need to communicate better (and more often) to the broader bioinformatics community about how to effectively pare down their choices when facing some analysis challenge.</span></div>
<div class="p1">
</div>
<div class="p1">
<span class="s1"><br /></span></div>
<div class="p1">
<span class="s1">PS: If you like Chris’s post, be sure to read the other one we put together with Matt Wood about <a href="http://tothestars.io/2015/5/controlled-access-genomics-data-in-the-cloud.html"><span class="s2">uploading dbGaP data to Amazon S3</span></a>.</span></div>
Angelhttp://www.blogger.com/profile/09938400316307441118noreply@blogger.comtag:blogger.com,1999:blog-7940730010022485111.post-23306083618878544782013-08-04T09:33:00.002-04:002013-08-04T09:33:50.943-04:00Exciting New AdventuresLast week marked a major turning point in my career, as it was my last week working for Penn. I started in September 1997, almost 16 years of service to a fantastic organization that allowed me to learn, experiment, and grow as a person, engineer, and scientist. It has made me what I am today and I will forever be grateful to the organization and everyone I have worked for and with. Leaving it was an incredibly hard choice to make, but I am confident it was the right thing to do.<br />
<br />
Monday August 5th I start at Amazon Web Services as a Sr. Solutions Architect. While I have some idea of what the role entails, which is generally to help people use cloud computing for solving scientific and high-performance computing problems, I am under no illusion that I will know what it will actually be like in practice. Or what career paths just opened up for me.<br />
<br />
I am bouncing-off-the-walls excited for tomorrow to come.Angelhttp://www.blogger.com/profile/09938400316307441118noreply@blogger.comtag:blogger.com,1999:blog-7940730010022485111.post-84528010403558941142013-06-22T11:49:00.001-04:002013-06-22T11:49:53.721-04:00The new Mac Pro, Intel Phi, and the future of BioinformaticsThere was a time not too long ago when you could get a relatively easy publication in bioinformatics simply by slapping together a current algorithm and a GPU, and benchmarking it against the same data set using a single-threaded version of the algorithm. I tended to dismiss all of these efforts because, once you looked closely at what they actually did, you could find ways to easily outperform the GPU implementation using standard threading, or Map-Reduce over a cluster, at much less cost in the short and long terms. Data staging to the GPU was generally the Achilles' heel of these algorithms, and there was just no getting around that step in high-throughput genomics.<br />
<br />
The other detriment to these sorts of algorithms was that most bioinformaticians tended to stick to interpreted languages such as Perl, Ruby and Python. The programmer productivity gains are tremendous and the languages are “fast enough”. <br />
<br />
There are two major changes in the ecosystem that are making me rethink my long-standing biases. First and foremost, properly conducted next-generation sequencing experiments are quite ridiculous in size. Certainly a lot bigger than any technology before it (in biology). New algorithms, like <a href="https://code.google.com/p/rna-star/">STAR</a> for RNA-Seq alignment, show us the value of compiled languages paired with big RAM and big data pipes.<br />
<br />
Second, the introduction of faster bus speeds (Thunderbolt 2), the Intel Phi architecture, and the new Mac Pro with its pair of standard pro-class GPUs, will drastically alter the economics and complexity of staging data to GPUs for computation. <br />
<br />
In parallel to this, you have <a href="http://continuum.io/">Continuum IO</a> developing Python modules for fast scientific computation and I/O on GPUs (<a href="https://github.com/numba/numba">Numba</a> and <a href="http://blaze.pydata.org/">Blaze</a>). Last but not least, we have <a href="http://en.wikipedia.org/wiki/C%2B%2B11">C++11</a> and <a href="http://gcc.gnu.org/projects/cxx0x.html">GNU gcc 4.8</a>, with a slew of features useful for bioinformatics being added to the standard library (<em>hello regular expressions</em> <code>;)</code> ).<br />
<br />
It is my prediction that the future of NGS bioinformatics, or at least the cutting edge, will lie not with Hadoop and other “Big Data” infrastructures, but rather with high-performance algorithms that take advantage of the new architectures coming to the prosumer market. Angelhttp://www.blogger.com/profile/09938400316307441118noreply@blogger.comtag:blogger.com,1999:blog-7940730010022485111.post-34984434519886495672013-06-22T09:28:00.001-04:002013-06-22T09:29:13.871-04:00Don't do thisThis past week I gave my boss and co-workers notice that I was moving on from UPENN. <b>Normally you wouldn't do this sort of thing until you actually have an offer letter in hand</b>, and for most people I would recommend that strategy.<br />
<br />
So why did I do it?
The main reason is that I love the group I work for. Specifically, <a href="http://www.itmat.upenn.edu/faculty_fitzgerald.shtml">Garret FitzGerald</a> and <a href="http://www.med.upenn.edu/apps/faculty/index.php/g275/p8127424">John Hogenesch</a> are fantastic researchers, mentors, and all-around great people. I have a broad role and I don't want them to be caught with their pants down when I do have that offer letter. Even with the four weeks' notice that I will be giving them, it still won't be enough time to properly hand off my duties to other people. I specifically wanted them to have as much time as possible to prepare and to prioritize my responsibilities.<br />
<br />
You may be thinking to yourself right about now "if you love it so much, why are you leaving?" Simply put, I currently don't see a long term future for me at UPENN. Once I realized that, I started really asking myself "what am I good at?", "what do I like to do the most?", and "what do I want to do in the future?"
To answer the first two, I like and am extremely good at talking to researchers about their problems and about how to solve those problems with informatics. I like to build prototypes and design long-term scalable architectures that address actual research questions. In short, a Solutions Architect role fits my skillset very nicely.<br />
<br />
The last question has to do with the problem domain I want to solve. Genomics, my current research field, is poised to change medicine in ways we can't yet predict. We are already seeing the value of genetic sequencing in the diagnostics field, especially with <a href="http://www.nytimes.com/2013/05/14/opinion/my-medical-choice.html">respect to cancer</a>. We are starting to see that our internal microbe populations play a larger part in our overall health than previously thought. Most of the current work studying the microbiome would not be possible without next generation sequencing. I want to be part of that sea change. I want to enable the use of high-throughput sequencing, and other 'omics technologies, in biomedical research that will have a huge and immediate impact on medicine. It should be noted that I could probably do that now in the group I am in, and I had already taken that into account when I thought long and hard about wanting to move on.<br />
<br />
So where to now? Good question. I'll have an answer for you, my captive audience, in about three weeks. Angelhttp://www.blogger.com/profile/09938400316307441118noreply@blogger.comtag:blogger.com,1999:blog-7940730010022485111.post-22497227654141700862012-01-18T15:31:00.001-05:002012-08-30T09:50:34.169-04:00Getting Promise Pegasus to actually email you in case of drive failure<p>
<b>UPDATE:</b> <a href="http://contrarysheep.com"> ContrarySheep</a> (e.g. <a href="http://www.christophergriffith.com/#p=-1&a=0&at=0"> GriffithStudio</a>) has posted an excellent update to these scripts that take advantage of launchd to run these processes. See <a href="https://gist.github.com/3490955"> this gist</a> for more information.
</p>
So the Promise Pegasus thunderbolt external storage array is a super piece of hardware for the price. One thing they list as a feature is the ability to email a user in case of a drive failure. One problem, though, is that I could not get this to work for me. <br />
<br />
A nice feature of the system, though, is that the Promise Utility GUI that ships with the array also installs quite a full-featured command-line utility. For instance, you can get a quick report of the drives in the array by issuing the following:<br />
<br />
<br />
<script src="https://gist.github.com/1635346.js?file=example.out.txt"></script><br />
<br />
<b>w00t!</b><br />
<br />
<br />
I wrote a quick little Ruby script to grab the drive information and send me an email in case the status is not "OK".<br />
<br />
<script src="https://gist.github.com/1635346.js?file=pegasus_check.rb"></script> <br />
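In case the embedded gist ever disappears, the approach looks roughly like this. This is a sketch, not the gist's code: the `promiseutil` arguments, the column position of the status field, and the `mail` delivery are all assumptions you should check against your own system.

```ruby
# Sketch of a Pegasus drive check: parse the promiseutil report, collect any
# drive whose status column is not "OK", and email a warning if one is found.
# ASSUMPTIONS: the CLI arguments, the report layout (drive rows begin with a
# numeric ID, status in the 4th column), and the mail command are illustrative.
STATUS_COLUMN = 3

def failing_drives(report)
  report.lines.filter_map do |line|
    fields = line.split
    next unless fields.first =~ /\A\d+\z/   # drive rows start with a numeric ID
    fields.first if fields[STATUS_COLUMN] != "OK"
  end
end

# Not run automatically; wire this into cron/launchd yourself.
def check_pegasus!(address = "you@example.com")
  report = `promiseutil -C phydrv -v`       # assumed invocation; verify locally
  bad = failing_drives(report)
  return if bad.empty?
  system("mail", "-s", "Pegasus drive warning: #{bad.join(', ')}", address)
end
```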
<br />
<b>DISCLAIMER:</b> This snippet is provided as-is. I am not guaranteeing this works at all, and it assumes that the status will change from "OK" to something else if there is an issue with a drive. Use at your own risk.Angelhttp://www.blogger.com/profile/09938400316307441118noreply@blogger.comtag:blogger.com,1999:blog-7940730010022485111.post-66514859319666249852011-02-21T09:48:00.000-05:002011-02-21T09:48:59.182-05:00Compile BLAT on x86_64 Ubuntu 10.04Recently I needed to compile BLAT for x86_64 on an Ubuntu 10.04 server. The BLAT pre-compiled binaries for Linux are 32-bit. Here is how you do it:<br />
<br />
<br />
<br />
<ol><li><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">apt-get install build-essential</span></li>
<li>Download BLAT source</li>
<li>remove the -Werror compiler flag (which treats warnings as errors)</li>
<li><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">mkdir -p ~/bin/x86_64</span></li>
<li><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">export MACHTYPE=x86_64</span></li>
<li>make</li>
</ol><div>Step (3) above requires editing the <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">&lt;blat source&gt;/inc/common.mk</span><span class="Apple-style-span" style="font-family: inherit;"> file: line 17 goes from </span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> </span></div><div><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><br />
</span></div><code><br />
HG_WARN_ERR = -DJK_WARN -Wall -Werror<br />
</code><br />
<div><span class="Apple-style-span" style="font-family: inherit;"><br />
</span></div><div><span class="Apple-style-span" style="font-family: inherit;">to </span></div><code><br />
HG_WARN_ERR = -DJK_WARN -Wall<br />
</code><br />
Make places the executables into the <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">~/bin/x86_64</span> directory. Either add this to <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">$PATH</span> or copy the files to <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">/usr/local/bin</span>Angelhttp://www.blogger.com/profile/09938400316307441118noreply@blogger.comtag:blogger.com,1999:blog-7940730010022485111.post-47876961702508925292010-01-22T21:38:00.010-05:002010-01-22T22:25:03.206-05:00Ruby + AWS == Easy Map-ReduceAs read lengths grow, we have found that the current crop of short read aligners don't seem to do very well with RNA-seq data. In a recent experiment we were only able to map about 40% of the reads using <a href="http://bowtie-bio.sourceforge.net/index.shtml">Bowtie</a>. <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC187518/?tool=pubmed">BLAT</a> was able to map an additional 45% of the reads, but was orders of magnitude slower than Bowtie (3 hours vs. 8 days).<br /><br />I decided to try my hand at using <a href="http://aws.amazon.com">Amazon Web Services</a> (AWS) to parallelize the BLAT search. Naturally I investigated <a href="http://hadoop.apache.org/">Hadoop</a>, and in particular the <a href="http://aws.amazon.com/elasticmapreduce/">AWS Elastic Map Reduce</a> (EMR) service, which handles the particulars of setting up a Hadoop cluster for you. While Hadoop is a great project, and lots of people are having great success with it, there is a certain level of investment that must be made to learn how to mold your particular data analysis pipeline to fit Hadoop's assumptions. For example, Hadoop assumes that the input file is large and that it will need to split the input for parallel processing on the worker nodes. The default method for splitting the input is on line breaks, which is not useful for FASTA sequence files. 
There are ways to customize this behavior, but I had to have results in two days for <a href='http://www.med.upenn.edu/pharm/faculty/hogenesch/hogenesch.html'>John Hogenesch</a> to present at a cloud computing session for <a href="http://www.scienceonline2010.com/">Science Online 2010</a>.<br /><br />Lucky for me, I had previously investigated another Map Reduce framework, <a href="http://github.com/documentcloud/cloud-crowd">CloudCrowd</a>, which was developed by <a href="http://www.documentcloud.org/">Document Cloud</a>, a non-profit agency that archives source documents (scanning PDFs, thumbnailing pages, OCR) for news articles. The system is elegantly simple, built around the notion of a central work queue and worker nodes communicating over HTTP requests. A "job" is defined by specifying a set of input documents, an action to perform on those documents, and any other options that you want to pass along. The action is a simple Ruby class. CloudCrowd takes care of distributing the input files to worker nodes, and cleans up after itself pretty nicely. One last point is that the documentation is fairly complete, easy to use, and provides clear examples that work off the bat on install.<br /><br />For our use case, the action we want to perform is to align all of the unmapped sequences using BLAT. In order for CloudCrowd to run BLAT, we must define a Ruby class that inherits from <code>CloudCrowd::Action</code> and implements a system call to BLAT:<br /><br /><script src="http://gist.github.com/284378.js?file=blat.rb"></script><br /><br /><code>CloudCrowd::Action</code> takes care of a lot of the details for staging a work unit, such as creating working directories, downloading input files, launching the subprocesses, and cleaning up after itself when the job completes.<br /><br />One difference between Hadoop and CloudCrowd is that while Hadoop will split the input and parcel out work items to nodes, CloudCrowd will send an entire input file to a single node. 
This means that I had to split the sequences into multiple FASTA files before sending them on to CloudCrowd. Using 100K sequences per file, we had 303 FASTA files to align. Note that this is not lost work, as I would have had to define the split routine for Hadoop anyway. All of the input files, the BLAT executable, and the search database were transferred to an S3 bucket.<br /><br /><script src="http://gist.github.com/284378.js?file=split_fasta.rb"></script><br /><br /><br />Next I provisioned a small EC2 instance to act as the central CloudCrowd master, giving it a stable IP and configuring the connection options for CloudCrowd to connect to S3 and HTTP basic authentication for the nodes (see the CloudCrowd documentation). This manual process is fine for the master, but I wanted the worker nodes to automatically configure themselves, start the CloudCrowd node process, connect to the master, and get jobs. You can provide EC2 nodes with a shell script that gets executed on start in what they call "user data". Here is my script (modified to protect my credentials and the data ;) ):<br /> <br /><script src="http://gist.github.com/284378.js?file=bootstrap_node.sh"></script><br /><br />That script takes the public Ubuntu 9.10 image (ami-55739e3c), adds the latest public repositories to the apt sources, updates all the packages, and installs the latest Ruby and RubyGems packages. In addition it installs the CloudCrowd gem, downloads and installs BLAT, uses a pre-authenticated URL to fetch the pre-packaged CloudCrowd instance that I made (which has my AWS credentials) that actually defines the BLAT job, and runs a CloudCrowd node process.<br /><br />Now as nodes became available, they connected to the master and started working. If a node went down, the system would automatically cope with the failure by resending the work to other nodes. We could also add new nodes as we saw fit to decrease the time to completion. 
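The FASTA-splitting step mentioned above is only a few lines of Ruby. This is a sketch of the idea (the gist's actual code may differ); it splits on FASTA header lines so that no sequence record is broken across files:

```ruby
# Split a multi-sequence FASTA file into chunks of at most `per_file`
# sequences each. Records are never broken: we only start a new chunk
# when a header line ('>') pushes the count past the limit.
def split_fasta(lines, per_file = 100_000)
  chunks, count = [[]], 0
  lines.each do |line|
    if line.start_with?('>')
      count += 1
      if count > per_file
        chunks << []   # start a new chunk at this record boundary
        count = 1
      end
    end
    chunks.last << line
  end
  chunks
end

# Command-line use: writes chunk_000.fa, chunk_001.fa, ... (names illustrative)
if __FILE__ == $PROGRAM_NAME && !ARGV.empty?
  split_fasta(File.foreach(ARGV[0]).to_a).each_with_index do |chunk, i|
    File.write(format('chunk_%03d.fa', i), chunk.join)
  end
end
```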
This type of fluid setup allowed me to take advantage of EC2 "spot instances," which have a variable pricing model based on demand and tend to be cheaper than the full price (12¢ vs. 34¢ per hour), with the caveat that your instance may come up at some random time and get shut down whenever the price exceeds your requested maximum. For this particular job, I bid $0.25 and the actual price fluctuated from $0.12 to $0.18 per hour.<br /><br />Finally, we create an HTTP request using the following Ruby script, which designates the input files and database located in a public S3 bucket, as well as defines the parameters to use:<br /> <br /><script src="http://gist.github.com/284378.js?file=submit_blat_job.rb"></script><br /><br />The system is not perfect and could use a lot of tweaking, but not bad for a day's work.Angelhttp://www.blogger.com/profile/09938400316307441118noreply@blogger.comtag:blogger.com,1999:blog-7940730010022485111.post-51442329275332425992008-10-07T12:42:00.007-04:002008-10-07T12:50:13.404-04:00Heuristically Beneficial Artifacts<div xmlns="http://www.w3.org/1999/xhtml">I've recently embarked on a study to see what effect clustering mass spectra has on the design of MRM experiments. Specifically, one clusters spectra from the shotgun proteomics discovery phase of the experiment to see how it affects the selection of peptide species and the transitions that a mass spec looks for during the second targeted sequencing phase of the experiment.<br /><br />While the work is ongoing, a very interesting (at least to me) result has come out of plotting the unfiltered SEQUEST results from the non-clustered and clustered versions of the mass spec data. 
They say a picture is worth a thousand words, so:<br /><br /><img src="http://lh3.ggpht.com/delagoya/SOuO-ZoWbsI/AAAAAAAAA4Q/2NInGLLJ-ds/%5BUNSET%5D.png" style="max-width: 800px;" height="633" width="422" /><br />The chart shows the XCORR values of the clustered spectra plotted against the XCORR values for their respective members, independent of what peptide was identified for each. The color coding represents scores for peptides that are part of the decoy database, where:<br /><br /><ul><li>blue = a decoy hit was scored in the clustered spectra, but not the original</li><li>green = a decoy hit was scored in the original spectra, but not the clustered spectra</li><li>red = both spectra scored a decoy peptide</li></ul>What you'll note is that the clustered spectra score a different set of peptides higher than do the original spectra, and <span style="font-style: italic;">vice </span>versa. At least for this data set, which is SILAC labeled, we note that each method is identifying a different population of peptides. This goes against all current publications, so I have to be careful in the interpretation of these results and will need a significant number of validation experiments, but this is exciting stuff!<br /><br />Particularly, it is going to be very interesting to see whether clustering actually helps or hinders ion selection when designing MRM experiments. If it does help ion selection, I have already come up with the catchy name, Heuristically Beneficial Artifacts<small><sup> TM</sup></small>. 
If not, well, we will at least have looked at the real-world effects of a methodology when applied to a new type of experimental goal.<br />More on this as it develops.<br /></div>Angelhttp://www.blogger.com/profile/09938400316307441118noreply@blogger.comtag:blogger.com,1999:blog-7940730010022485111.post-63116033930984306872008-10-06T14:11:00.003-04:002008-10-06T14:25:30.690-04:00Not so much about science...So in looking back over the last few entries, it seems I have strayed from my original mission, which was to cover issues regarding informatics and <span style="font-weight: bold;">how it relates to science</span>. The first part seems to have dominated the posts recently, so I am dropping all pretenses and starting <a href="http://appmecha.wordpress.com/">a new blog</a>, over at Blogger competitor <a href="http://wordpress.com/">Wordpress</a>.<br /><br />Why a new platform? I figured I would give Wordpress a spin to see how it compares to Blogger. Also I like the template they had for the blog ;) So far, the interface is similar, but looks a bit cleaner than Blogger's. Overall, it seems to have the same functionality and workflow, with the addition of one thing that I can see coming in very useful for a site like the one I envision: permanent pages.<br /><br />In effect, if I write something that I want to be prominently displayed at all times, I create a page. This can come in handy, say, if I give a set of instructions for setting up your development environment, or for providing a listing of resources. As regular blog posts these types of items are "discoverable" by searching, but they will eventually go away as new posts drown them out.<br /><br />Another argument for a page, as opposed to a post, is to keep content from becoming stale. How many times have you thought you found an answer to a question in a blog post, but it turns out said post is a bit old and the solution no longer applies? 
I guess <a href="http://stackoverflow.com">stackoverflow</a> seeks to remedy this, but bloggers can do their part by posting "important" and generally applicable posts with prominence as a page; thus they are always reminded to update it, since it is always visible.<br /><br />Anywho, there I go again, talking strictly about web applications. Well, I guess I'll shut up now and you can check <a href="http://appmecha.wordpress.com/">AppMecha</a> for more posts related to these issues.Angelhttp://www.blogger.com/profile/09938400316307441118noreply@blogger.comtag:blogger.com,1999:blog-7940730010022485111.post-20214102450154708752008-09-15T10:10:00.001-04:002008-09-15T10:10:25.322-04:00Why Thn.gs should send a chill down every ISV's spine<div xmlns='http://www.w3.org/1999/xhtml'>Independent software vendors (ISVs) have recently been touting some of the most useful applications that I have seen in my entire career. From 37signals to GitHub to CulturedCode, the genius of these applications is simple interfaces to complex sets of requirements. That last one, CulturedCode, has a problem though. Their <a href='http://www.culturedcode.com/things/'>flagship, about-to-launch product</a> just got <a href='http://thn.gs/'>pwned</a> by <a href='http://shergin.com/' target='_blank'>some Russians</a>. <br/><br/>While 37signals and GitHub are already web applications, with a service-based revenue stream, CC's product is an OS X desktop application that works beautifully as a get-things-done task organizer, with a perpetual licensing model. The aforementioned Russians took every piece of this application and made it into what appears to be an almost complete equivalent of the desktop application. And if that were not bad enough, they integrated it into Google Gears for offline use. <br/><br/>What can CC do about this? Not much, unless they already have a huge bankroll for lawyers. 
And this situation should send chills down the spine of every small software shop releasing small useful tools. The very nature of these shops constrains feature creep, which in turn forces the designers and developers to squeeze the most they can out of what they have in place, which in turn makes the software simple and powerful at the same time. But it also makes these applications vulnerable to xeroxing using the web, (relatively) cheap labor pools, and a robust distribution network with firewall-like immunity to legal action in the form of international borders ( "litigation-wall" ?).<br/><br/>Anyone want to take bets on who is next on the feed tray? My guess is a <a href='http://www.delicious-monster.com/'>Delicious Library</a> web clone. <br/><br/>So the take-home message is: design software that is not easily replicated, either by feature or connectivity, or realize that your app really is easy to xerox and have a large marketing scheme to drown out any news of the enemy.<br/><br/>Speaking of Google, there was that <a href='http://blog.searchenginewatch.com/blog/080408-123318'>little hiccup in relations</a> related to releasing a clone of 37signals' Campfire with the release of App Engine. And then there is Chrome+Gears, the browser-DB combination that makes web applications even more desktop-like. Not that they are alone in this: Adobe (AIR) and Microsoft (SilverLight will certainly "extend" the reach of IE) are walking the same browser-as-platform path, and trying to build their market share on the "everything should be free" web culture, and that really raises my mercury. Fucking piracy enablers.</div>
difficult<div xmlns='http://www.w3.org/1999/xhtml'>I have thought a lot in recent months about how to best leverage cloud computing resources, or utility computing, that are increasingly becoming available to the general development community, and one issue in particular makes me cringe: LICENSING. <br/><br/>It just so happens that in my field, proteomics, the open source tools gather a lot of press, but really most research submitted for publication still uses commercial algorithms for the initial data analysis, even though plenty of published research has shown that results from commercial and open source algorithms are comparable. There is an inherent level of trust manuscript reviewers have in the commercial offerings that is hard to overcome, hence most researchers still opt to use the commercial algorithms as the gold standard. <br/><br/>Not that this is a bad thing, mind you. As someone who supports the informatics efforts of many researchers, I find that the commercial offerings are much more stable, and are subsequently easier to support and maintain than most of the current crop of open source offerings. <br/><br/>The trouble lies with the rigid licensing models of commercial offerings. Specifically, you must purchase perpetual licenses for a certain number of compute cores. Such a model is just not compatible with what I would like to do as a service center, namely to provide software-as-a-service billing to researchers. True, the high up-front licensing cost can be amortized over the life of the support contract, but it assumes that the computers running the algorithm are already procured and dedicated to the software (also not a bad assumption in most cases, since the hardware costs pale in comparison to the licensing). <br/><br/>In effect, there is no way to make a utility computing model, such as one offered by Amazon Web Services, work with these sorts of license restrictions. 
The set-up and tear-down cost of a compute job is too high for this to be a viable full-time solution. <br/><br/>What I would like to do is augment my current computing capacity during crunch times. Dedicated licensing prevents this, as does the way most networked algorithms work, but that's another post.<br/><br/><br/><br/></div>Angelhttp://www.blogger.com/profile/09938400316307441118noreply@blogger.comtag:blogger.com,1999:blog-7940730010022485111.post-80379590640950733572008-04-14T22:21:00.010-04:002008-04-28T10:40:31.939-04:00Pharma's FuturesAlternate title: "Bet Big, lose a long long time, then win so big that it makes your losses look like pocket change. Or maybe I'll just lose my shirt. Maybe both simultaneously, who knows?"<br /><br />This post is motivated by a set of lectures I attended as part of the <a href="http://www.itmat.upenn.edu/symposium.shtml">3rd Annual ITMAT Symposium on Translational Medicine</a> (Full disclosure: ITMAT is currently my employer). Here are the first two lectures given today (for posterity, in case the link above goes to the ether some day) that are the focus of this post:<br /><br />Nassim Nicholas Taleb, PhD, London Business School, “Errors in the Analyses of Market Potential for Drugs”<br /><br />Dale Nordenberg, Healthcare Industry Advisory, PricewaterhouseCoopers LLP, “Global Trends and Drug Development”<br /><br /><talk listing="">The first talk was by the author of <a href="http://www.amazon.com/Black-Swan-Impact-Highly-Improbable/dp/1400063515/ref=pd_bbs_sr_1?ie=UTF8&s=books&qid=1209391101&sr=8-1">The Black Swan</a>, a very entertaining read that outlines the very large effects that highly improbable events have on the economic markets. 
The essence of his book, and his talk, was that in non-normally distributed markets, such as the securities market, a rare event can either capture huge amounts of wealth or lead to huge amounts of losses, and the incremental gains/losses reported over the years are really just noise in the data, and hence unimportant in a sense. Dr. Taleb backed up his claim with a few charts of trading earnings over a decade, most notably showing that on two days in which rare occurrences took place, he both gained and lost 98% of his total revenue <span style="font-style: italic;">over the ten years he had data for.</span> Two days in ten years. Yikes.<br /><br />Taleb went on to claim that these types of events are equally applicable to big pharma. Currently the top 6 or so blockbuster drugs capture over 90% of the revenue, and that's nowhere near normal. Recent <a href="http://www.portfolio.com/news-markets/top-5/2008/01/30/Vioxx-Pain-for-Merck">high-profile litigation</a> also showed that rare events can hurt a company's bottom line as well.<br /><br />The second talk, by Dale Nordenberg, focused on what industry pundits and analysts are calling "<a href="http://www.google.com/url?sa=t&ct=res&cd=1&url=http%3A%2F%2Fwww.pwc.com%2Fextweb%2Fpwcpublications.nsf%2Fdocid%2F91BF330647FFA402852572F2005ECC22&ei=FeAVSOXgG5XKepq-oaYC&usg=AFQjCNELHj-5soNbenip4NpqaRWb3CM4VQ&sig2=rHb5-c7229WKchS8py04Pg">Pharma 2020</a>", or the shift in the industry towards providing personalized medicine. This is pharma's equivalent of the so-called "long tail" economics model popularized by Internet sales: selling a larger variety of drugs, each tailored for use by fewer people.<br /><br />There is no doubt in my mind that the gains in safety and efficacy of future drugs offered to the public would be enormous if pharma did indeed move in this direction, but I am a bit skeptical about the financial incentive of big pharma to follow the piper. 
It essentially assumes that pharma can reach a normal earnings distribution if this model is followed, which in turn assumes that big pharma will be insulated from the risks of non-normally distributed earnings environments.<br /><br />But if we take Taleb's lecture at face value in the context of current litigation practices in the US, then we already know that the risk model pharma faces is one where a rare event will have a very large detrimental effect. Without the equally large and insulating effects of blockbusters on revenue, how will a company survive a class action? If I were a CEO, I would probably be paying lip-service to the Pharma 2020 vision, in the hope that such actions would lead to protection from Congress, but my horse's name would still be "Blockbuster McGee".<br /></talk>Angelhttp://www.blogger.com/profile/09938400316307441118noreply@blogger.comtag:blogger.com,1999:blog-7940730010022485111.post-4829222222068048012008-04-14T09:42:00.006-04:002008-04-28T10:44:26.935-04:00Google App Engine: constraints are goodThere has been a lot of rejoicing and jeering over Google's web application deployment offering, Google App Engine. Most of the gripes (that I have seen) have been about the constraints: only supporting a single language, limited database capabilities, and no file or OS access of any kind. Frankly, I think there is an element of FUD to all of these.<br /><br />First, on language choice: App Engine has been receiving a lot of flak over the choice of a hobbled Python over and above <span style="font-weight: bold;">all</span> other languages. There have also been a couple of gripes about the lack of real foreign key constraints in the relational layer.<br /><br />Boo frickin hooo. Stop whining and get coding, and you'll come to the same conclusion that all artists, composers, coders, and generally anybody that ever created anything did: namely, that constraints are sometimes the inspiration for the creation. 
Sometimes it is the constraints that shape the work more than anything else. Lack thereof may sometimes lead to interesting experimentation, but I put forth that actively not following a system of conduct is itself a constraint. Trying to be original is hard work, made all the harder by not framing your work within something familiar.<br /><br />But I also should stop bitching and get to work. My idea has the potential to reshape the way collaborative science is conducted, but I need to deliver the tool to the audience. Seems like G-Apps would be a perfect test bed.Angelhttp://www.blogger.com/profile/09938400316307441118noreply@blogger.comtag:blogger.com,1999:blog-7940730010022485111.post-44472007211043906562008-04-10T21:08:00.003-04:002008-04-10T21:14:43.556-04:00If you can't beat 'em, join 'emJust as Twitter was effectively killing my (and <a href="http://blog.cbcg.net/articles/2008/04/05/twitter-and-me">others'</a>) blogging fecundity, I saw that they posted a handy-dandy little link on the dash for inserting Tweets into your blog. Sweet. Now I can have the best of both worlds.<br /><br />The Twitter giveth and taketh away.Angelhttp://www.blogger.com/profile/09938400316307441118noreply@blogger.comtag:blogger.com,1999:blog-7940730010022485111.post-66756554931944124992008-03-13T09:13:00.004-04:002008-03-13T09:22:57.673-04:00Twitter is killing my blog<a href="http://www.twitter.com/delagoya">Twitter</a> is killing my blog initiative. 
In fact, this post can just be a straight copy-paste of some recent tweets (in reverse chronological order):<br /><br /><ul><li><a href="http://twitter.com/delagoya/statuses/770388224">Ah, just saw that on XP you can't see my character cleverness. Sux4U dude</a> (2008-03-12 14:16, via <a href="http://snook.ca/snitter/">Snitter</a>)</li><li><a href="http://twitter.com/delagoya/statuses/770386566">suppose a hash assignment {:foo ➠ :bar }</a> (2008-03-12 14:13, via Snitter)</li><li><a href="http://twitter.com/delagoya/statuses/770385883">and also wondering if I can use ruby's syntatic suger with random windings</a> (2008-03-12 14:11, via Snitter)</li><li><a href="http://twitter.com/delagoya/statuses/770385561">It's like the evolution of large emails ☞ short emails ☞ IM</a> (2008-03-12 14:10, via Snitter)</li><li><a href="http://twitter.com/delagoya/statuses/770384818">notice that twitter is killing my blog initiative. Why blog when a short tweet will do?</a> (2008-03-12 14:09, via Snitter)</li></ul><br />Really? It's that easy? Can you beat that ease-of-brain-dump, blogger.com? I think not. Some Picasa functions came close for posting quick notes on photos, but there isn't really a native Blogger equivalent (that I know of). 
The whole create post -> edit post -> publish post life cycle is too long for a quick set of thoughts you want to jot down.<br /><br />I could probably research this more, but really, why would I, when Twitter is always there and I actually have an audience that comments back on my posts. And so, back to my original point: Twitter is killing my blog.Angelhttp://www.blogger.com/profile/09938400316307441118noreply@blogger.comtag:blogger.com,1999:blog-7940730010022485111.post-53345059793936782032008-02-15T12:45:00.002-05:002008-02-15T13:10:10.955-05:00Sequel's lackluster to_*Sequel is a great bare-bones ORM, but the bare-bones quality of Sequel::Model leaves something to be desired. For instance, the obj.to_json method just calls the default Ruby inspect method, which prints out the class name and memory address. Not helpful. There is also no to_xml() for easy REST incorporation. It almost makes me want to go back to ActiveRecord, but then what's the point of using Merb?<br /><br />Anywho, it's not that hard to extend the functionality of Sequel::Model, so I have started writing some gems to make development with Merb + Sequel a little easier. Like real to_xml & to_json methods on the model instances and instance collections. 
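As a taste of the direction, here is a minimal sketch of the kind of serializer mix-in I have in mind. This is illustrative code, not the gem itself; it assumes only that the host class exposes #values as a hash of columns to values (which Sequel::Model does), and the Post stand-in class is hypothetical:

```ruby
require 'json'

# Illustrative serializer mix-in (not the actual gem). Assumes the host
# class exposes #values as a Hash of column => value, like Sequel::Model.
module SimpleSerialization
  def to_json(*)
    JSON.generate(values)
  end

  def to_xml(root = self.class.name.downcase)
    inner = values.map { |col, val| "<#{col}>#{val}</#{col}>" }.join
    "<#{root}>#{inner}</#{root}>"
  end
end

# Hypothetical stand-in for a Sequel::Model instance
class Post
  include SimpleSerialization
  def values
    { id: 1, title: "hello" }
  end
end

puts Post.new.to_json  # {"id":1,"title":"hello"}
puts Post.new.to_xml   # <post><id>1</id><title>hello</title></post>
```

A real version would also need to handle collections and escape XML entities, but it captures the basic idea.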
More on this as it develops.Angelhttp://www.blogger.com/profile/09938400316307441118noreply@blogger.comtag:blogger.com,1999:blog-7940730010022485111.post-49977512722628885432008-02-15T08:23:00.005-05:002008-03-13T09:38:53.965-04:00Merb TLS mail plugin gem<span style="font-weight: bold;">UPDATE: </span>Now available on github: <a href="http://github.com/delagoya/merb_tlsmail/tree/master">merb_tlsmail github page</a><br /><br />My <a href="http://defsci.blogspot.com/2008/02/secure-smtp-server-tls-from-merb-apps.html">previous post</a> on sending mail via a TLS SMTP server on <a href="http://merbivore.org/">merb</a> covered monkey patching Merb::Mailer.<br /><br />I took the time to code this up as a gem, using Merb's meta-programming routines to extend Merb::Mailer in a standard way (for Merb, that is). See <a href="http://merb.lighthouseapp.com/projects/7588/tickets/26-plugin-for-tls-smtp">this open ticket</a> in the <a href="http://merb.lighthouseapp.com/">Merb lighthouse issue tracker</a> to download the gem until it is released as a proper plugin.Angelhttp://www.blogger.com/profile/09938400316307441118noreply@blogger.comtag:blogger.com,1999:blog-7940730010022485111.post-30530314390103350922008-02-12T16:35:00.000-05:002008-02-12T16:50:37.484-05:00Secure SMTP server (TLS) from merb appsIt seems that Merb's Mailer class expects either a local sendmail client or a non-TLS-enabled SMTP server. This is not a problem unique to Merb, but rather a deficiency in Ruby 1.8.<br /><br />I took some time to look around and found that Rails has the same problem, and it was fixed via a plugin, not a gem as is the "merb way". There was also a gem that packaged the Net::SMTP classes from Ruby 1.9, which do have TLS support. It isn't hard to guess what I did next.<br /><br />I monkey patched Merb::Mailer to overwrite the net_smtp method and added two config options to merb_init.rb. 
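The gist of the patch, sketched here from memory rather than copied from the pastie (the host, port, and credentials below are placeholders, and the helper name is my own): use the TLS-capable Net::SMTP from Ruby 1.9 and enable STARTTLS before authenticating.

```ruby
require 'net/smtp'

# Sketch of the TLS mail idea: upgrade the SMTP connection with STARTTLS
# before authenticating. All connection parameters here are placeholders.
def send_tls_mail(message, from, to,
                  host: 'smtp.example.com', port: 587,
                  user: 'me', pass: 'secret')
  smtp = Net::SMTP.new(host, port)
  smtp.enable_starttls               # negotiate TLS on the connection
  smtp.start('localhost', user, pass, :plain) do |session|
    session.send_message(message, from, to)
  end
end
```

In the monkey patch described above, roughly this logic replaces Merb::Mailer's net_smtp method, with the connection details coming from the two config options in merb_init.rb.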
See the pastie for the code example <a href="http://pastie.caboo.se/151190">here</a>:<br /><br /><a href="http://pastie.caboo.se/151190">http://pastie.caboo.se/151190</a>Angelhttp://www.blogger.com/profile/09938400316307441118noreply@blogger.comtag:blogger.com,1999:blog-7940730010022485111.post-35829102635816973952008-02-07T10:39:00.001-05:002008-02-07T11:19:43.664-05:00Where has the Semantic Web failed us?News about the <a href="http://en.wikipedia.org/wiki/Semantic_web">Semantic Web</a> being the "next big thing" has been hitting web application developers over the head for years, as if it were news about the iPhone. But where are the products? Who uses it? Except for academic papers, a standards process that nobody pays attention to, and ontology narcissists, nobody uses RDF or OWL or any of those supposedly "next generation" tool sets and languages. OK, maybe <a href="http://www.powerset.com/">Powerset</a> will, but I'll believe that when I see it.<br /><br />Certainly <a href="http://swoogle.umbc.edu/">Swoogle</a> is no Google, although it is starting to address what I see as the most overlooked part of the semantic web: <a href="http://en.wikipedia.org/wiki/Usability">usability</a>. It seems that developers and proponents of Web 3.0 think regular users of the web are a lot smarter than <a href="http://en.wikipedia.org/wiki/The_Nation_of_Gods_and_Earths">they are</a>. Swoogle does actually show a pretty <a href="http://swoogle.umbc.edu/index.php?option=com_frontpage&service=digest&queryType=digest_swd&searchString=http%3A%2F%2Fmged.sourceforge.net%2Fontologies%2FMGEDOntology.owl">nicely formatted report on the metadata it has for a result</a>, if you know what you are looking at, that is. Yet the main result link leads to the originating ontology, which is an RDF XML file. Yeah, that's helpful. 
Even if Joe Public is aware enough to click on the <span style="font-style: italic;">metadata</span> link, instead of <a href="http://www.i-am-bored.com/bored_link.cfm?link_id=9644">the big red button</a> that is the main link, I can't ever imagine him making heads or tails of the report, or for that matter caring.<br /><br />Why is that? Why is usability not even a concern for the majority of ontology & semantic web developers? What makes this situation even more of a disaster is that tagging (and tag clouds) is so widespread and ridiculously easy to understand. How is tagging any different from RDF annotations? A little more text, that's what. Oh, and querying RDF is a bitch, so developers are also affected by the situation, making adoption of this "standard" that much more unlikely.<br /><span style="font-style: italic;"><br />PS: I am not part of, and hold no affiliation to, NG&E, but "the 85%" is one of those stereotypes that rings true to me</span>.Angelhttp://www.blogger.com/profile/09938400316307441118noreply@blogger.comtag:blogger.com,1999:blog-7940730010022485111.post-87983986877552270542008-01-04T11:19:00.000-05:002008-01-07T11:28:38.443-05:00Zed.. very humorousZed Shaw's <a href="http://www.zedshaw.com/rants/rails_is_a_ghetto.html">latest rant</a> is a hoot. When I first read it, it was clearly the draft he mentioned it was. I'm glad he posted it as a draft, though, because the next iteration gave DHH a chance to clarify, and also gave Zed a chance to frame the whole rant a bit better with his admission that he himself was the main person responsible for almost going to the poor house.Angelhttp://www.blogger.com/profile/09938400316307441118noreply@blogger.comtag:blogger.com,1999:blog-7940730010022485111.post-44487817184909763322007-12-11T15:18:00.001-05:002007-12-11T15:40:43.821-05:00Ask not for what you think you need...I have been reflecting a lot recently on why researchers think they need grand solutions for relatively small problems. 
Specifically, there is a perception among benchers conducting large-ish experiments that they need some sort of LIMS (laboratory information management system) to manage their data. Frankly, as I've stated many times (at least in conversations with other IT folk, and even with benchers), most researchers don't want a LIMS. What they really want is fancy file storage.<br /><br />Yet the town folk keep insisting "we need LIMS, please give...". LIMS are probably the only thing that potentially fits the bill for experimental data management, so that's what benchers ask for, when most likely a digital asset management application would suffice. Heck, even all those bit torrent sites can potentially do the job that researchers need.<br /><br />If you are a coder that is continually asked to provide a LIMS to researchers, or if you are a bencher interested in a LIMS for data management, here is a set of questions to ask before going any further down the rough and tumbly road that is LIMS adoption:<br /><br /><ol><li>Are you in a regulated environment?</li><li>Are you willing to <span style="font-weight: bold;">mandate</span> use of the LIMS?</li><li>Do you have adequate personnel to support the LIMS locally? (e.g. do you have a dedicated person that will <span style="font-weight: bold;">actively</span> promote the use of the LIMS, train folks, configure the system, do extensive follow-up, etc.? Vendor support will only be of help at the start of the adoption process.)<br /></li><li>Do you have a lot of spare cash? (Think 6 figures to buy an initial bank of licenses.)</li><li>Will you have a lot of spare cash for the next 3-5 years? (Think 5 figures to keep annual support and maintenance up to date.)</li></ol>If you answered "No" to any of these questions, seriously reconsider buying what is traditionally known as a LIMS. 
Instrument control, automatic data acquisition, yada, yada, all those marketing features used to sell a LIMS don't mean squat if no one uses it in the first place.Angelhttp://www.blogger.com/profile/09938400316307441118noreply@blogger.comtag:blogger.com,1999:blog-7940730010022485111.post-12484862524736769622007-11-29T09:34:00.000-05:002007-11-29T09:56:52.801-05:00Python 25 & MySQL on LeopardThere was a bit of trouble when I tried to use MacPorts to install the excellent <a href="http://trac.edgewall.org/">Trac</a> project management and issue tracking application. Specifically, the <a href="http://db.macports.org/port/show/3013">py25-mysql</a> port did not compile correctly.<br /><br />Trolling through the InterWeb, I found <a href="http://www.keningle.com/?p=11">this post</a> about a fix to compile the Python module from scratch. The post, however, has more instructions than are necessary, so here is my revised procedure:<br /><br /><ol><li>Download the source from <a href="http://downloads.sourceforge.net/mysql-python/MySQL-python-1.2.2.tar.gz?modtime=1172959928&big_mirror=0">here</a>. </li><li>Unpack the archive and edit the _mysql.c file to comment out lines 37-39:<br /><code><br />// #ifndef uint<br />// #define uint unsigned int<br />// #endif<br /></code><br /></li><li>Edit the site.cfg file to set the mysql_config path. For me this was:<br /><code><br />mysql_config = /usr/local/mysql/bin/mysql_config<br /></code></li><br /><li>Compile and install as normal:<br /><code><br />python setup.py config<br />python setup.py build<br />sudo python setup.py install<br /></code></li></ol><br /><br />Now if only someone would make these changes in the port file, then Trac installs would be super easy, instead of just easy-ish.Angelhttp://www.blogger.com/profile/09938400316307441118noreply@blogger.com