Dapper : the screen scraper for everyone

I’ve been meaning to write about the Dapper ‘screen scraping’ service for a while, since I think it’s mostly useful and pretty cool.

(Yes, this service is called Dapper, sharing a name with the popular Ubuntu GNU/Linux release. I’m a little suspicious that this was a deliberate marketing trick to pull in search traffic intended for Ubuntu …)

Techcrunch describes Dapper well … ‘create an API for any site’. Essentially, Dapper is good at analyzing web pages which have a fixed format (eg Google search results) and will extract the content in a predictable fashion to provide the raw data in XML, RSS, JSON or CSV format. Depending on the type of data you extract from the page, Dapper can also display data as a Google Map, a Google Gadget or Netvibes module, send email alerts or output iCalendar format.

By making it simple to extract information from sites that do not provide their data as an RSS feed or other useful format, it is possible for non-programmers to use Dapper and ‘liberate’ these sites, producing feeds compatible with their favorite news feed reader from an otherwise dead and lifeless web page. Web application programmers can waste less time writing screen scrapers, and waste less time fixing broken code caused by slight changes in the page format. It may not be something you use for a mission critical component, but for a quick mashup or to get started quickly before writing your own more robust parser, I think Dapper will prove (and is proving) very valuable.

However, there is a downside. From playing with Dapper on-and-off for the past few months, I’ve established that Dapper works quite well extracting data from very uniform pages like Google search hits [however I didn’t do that since it’s against the terms of service. “No screen-scraping for you”, says the Google-Nazi] or the front page of digg .. but usually fails on pages that don’t follow a strict pattern. Getting the wrong data (or junk) in the wrong field 5% of the time may be tolerable for the occasional frivolous RSS feed, but it is annoying enough that for more important applications it is a real show stopper.

One of the reasons it took me several months to get around to posting about Dapper is that I desperately wanted a killer example of extracting data from a bioinformatics database or web site. I’ve found that most decent projects already make their data available in some useful format like XML or CSV and don’t really require scraping with Dapper, while some of the less organized projects which only provide say, HTML tables, [I won’t name names … in fact I’ve forgotten about them already … “no citation for you” says the citation-Soup-Nazi ] often failed to work well with Dapper’s page analysis unless the page formatting was strictly uniform.

Pedro

I replicated Pedro’s openKapow Ensembl orthologue search in Dapper as an example. It’s not the best example since, as Pedro notes, Ensembl is one of the ‘good guys’ that already provide results in XML format.

First, I fed Dapper four URLs for Ensembl gene report pages , which contain a section with predicted orthologues. Apparently, giving Dapper several pages of the same format helps the analysis:


Then, I selected the Gene ID in the orthologues list .. Dapper colours fields it detects as the same type. There is a cryptic unlabeled slider which determines the ‘greediness’ of the selection:

After selecting “Save and continue”, Dapper asks for the newly defined field to be named. In this case, I chose the same name as Pedro (“ort_geneID”), just for the hell of it:


This process was repeated to create a field for the species name, which I named “ort_spp”. Dapper allows ‘Fields’ to be grouped into ‘Groups’, so I grouped the “ort_geneID” and “ort_spp” fields into a group called “orthologue”: (data not shown 🙂 ).

Now, we save the Dapp. In “Advanced Options”, I changed the Ensembl gene ID part of the URL to {geneID}. This tells Dapper to make this part of the URL a query field, so that the user can provide any gene ID they like and have the orthologue results scraped:

Finally, we can test the saved Dapp, and retrieve XML formatted results for a particular gene ID:


The gene ID can be changed in the Dapper XML transform URL (http://www.dapper.net/RunDapp?dappName=EnsemblOrthologues&v=1&variableArg_0=ENSG00000100347) to get XML results for orthologues of other human genes.
Various other transforms, like a cruft-free CSV version are also possible. Feel free to have a play yourself with my Ensembl Orthologues Dapp (like I can stop you now ! It’s public & live & irrevocable).
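
For anyone who would rather script this than paste URLs into a browser, here is a minimal Python sketch of the same idea. The URL parameters (dappName, v, variableArg_0) are taken straight from the transform URL above. The XML layout is an assumption on my part: I'm guessing that each "orthologue" group becomes an element containing "ort_geneID" and "ort_spp" children, and the sample document and its IDs are made up purely for illustration. The function names are hypothetical too.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

RUN_DAPP = "http://www.dapper.net/RunDapp"


def dapp_url(gene_id, dapp_name="EnsemblOrthologues", version=1):
    """Build the RunDapp XML URL for one Ensembl gene ID."""
    query = urlencode(
        {"dappName": dapp_name, "v": version, "variableArg_0": gene_id}
    )
    return f"{RUN_DAPP}?{query}"


def parse_orthologues(xml_text):
    """Return (gene ID, species) pairs from the Dapp's XML output.

    ASSUMPTION: every <orthologue> group element holds one <ort_geneID>
    and one <ort_spp> child. The real output may be laid out differently.
    """
    root = ET.fromstring(xml_text)
    pairs = []
    for group in root.iter("orthologue"):
        gene = group.findtext("ort_geneID", default="").strip()
        species = group.findtext("ort_spp", default="").strip()
        pairs.append((gene, species))
    return pairs


# Made-up sample in the ASSUMED layout, so the sketch runs offline.
SAMPLE_XML = """<results>
  <orthologue>
    <ort_geneID>EXAMPLE_GENE_1</ort_geneID>
    <ort_spp>Example species one</ort_spp>
  </orthologue>
  <orthologue>
    <ort_geneID>EXAMPLE_GENE_2</ort_geneID>
    <ort_spp>Example species two</ort_spp>
  </orthologue>
</results>"""

if __name__ == "__main__":
    print(dapp_url("ENSG00000100347"))
    for gene, species in parse_orthologues(SAMPLE_XML):
        print(gene, species, sep="\t")
```

To use it against the live service you would fetch dapp_url(...) with urllib.request.urlopen and pass the response body to parse_orthologues, adjusting the element names to whatever the Dapp actually returns.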

An Amazon EC2 cluster for BLAST searching ?

I’ve just been reading about the new Amazon Elastic Compute Cloud (EC2), which is essentially a pay-as-you-go cluster, based on Xen virtual machine images. You can create and upload your own image using their tools, or use one of the pre-rolled GNU/Linux distro images already shared by other users of the EC2 system.

While it seems aimed at web service ‘startups’ that want a competitively priced hosting option which can quickly scale, I thought I’d attempt to figure out the economics of using something like this for some scientific computing. Would it be a cheap / easy / reliable alternative to the home-rolled Beowulf cluster ?

The advertised specs per node are: a 1.7 GHz x86 (Xeon) processor, 1.75 GB of RAM, 160 GB of local disk, and 250 Mb/s of network bandwidth. Nodes cost US$0.10 per instance-hour. Bandwidth between nodes within EC2 is free, as is bandwidth from EC2 nodes to the S3 storage service; Internet traffic, however, costs US$0.20 / GB.

First, let’s think about BLAST, the bread-and-butter sequence search tool for many bioinformaticians. As far as I understand (and from my own experience), NCBI BLAST works best when the entire database can be cached in RAM … otherwise lots of disk thrashing ensues and the search time is bounded by disk I/O. The NCBI ‘nr’ (non-redundant) protein sequence database is currently about 3.3 GB (and growing) … so it won’t fit in RAM on one of these EC2 nodes. While I don’t mind paying to thrash Amazon’s servers’ disks a little, it will slow the search down. However, if we use mpiBLAST the database gets split into chunks evenly distributed across the nodes, so if we were to use 3 nodes, the ‘nr’ database would be split into 1.1 GB chunks and should fit in the RAM of each node (leaving ~600 MB of RAM for the OS and other overhead).

Now the speed of the network interconnects between nodes matters, since we are no longer computing on a single node … but from what I’ve read, 250 Mb/s should be enough for mpiBLAST to run such that it is not bounded by the internode communication speed. (Actually, EC2 instances have shared gigabit interconnects, but since several instances might share the same ethernet card, gigabit performance ‘per node’ can’t be expected. I guess the 250 Mb/s figure means that there are probably four EC2 instances per physical server/ethernet card ??)

So with 3 nodes, this would cost US$0.30 per hour to run a (scalable) BLAST service. The performance should scale better than linearly with the number of nodes added. If you need the job done faster, just resplit the database and create more EC2 VM instances (mpiBLAST should work with the Portable Batch System to do this transparently, but I guess this would require some code to interface PBS with the control of your EC2 ‘elastic cluster’). It would only cost US$0.66 to upload the database to EC2 in the first place, and about US$0.50 per month to store it in S3.
This seems well within reach of many academic departments, and would really suit ‘sporadic’ users with occasional big jobs ….
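
For anyone who wants to fiddle with the numbers, here is the back-of-the-envelope arithmetic as a small Python sketch. The node specs and prices are the ones quoted in this post. The S3 storage rate of US$0.15 per GB per month is my assumption (it is what the ~US$0.50 per month figure works out to for 3.3 GB), and so is reserving ~0.6 GB of RAM per node for the OS. The function names are just for illustration.

```python
import math

# Figures quoted in the post.
NODE_RAM_GB = 1.75            # RAM per EC2 node
NODE_PRICE_PER_HOUR = 0.10    # US$ per instance-hour
INTERNET_PRICE_PER_GB = 0.20  # US$ per GB of Internet traffic
NR_SIZE_GB = 3.3              # size of the NCBI 'nr' database

# ASSUMPTIONS, not quoted prices or measurements.
S3_PRICE_PER_GB_MONTH = 0.15  # inferred from ~US$0.50/month for 3.3 GB
OS_OVERHEAD_GB = 0.6          # RAM set aside for the OS on each node


def nodes_needed(db_gb, ram_gb=NODE_RAM_GB, overhead_gb=OS_OVERHEAD_GB):
    """Smallest node count whose database chunk fits in the usable RAM."""
    usable_gb = ram_gb - overhead_gb
    return math.ceil(db_gb / usable_gb)


def estimate(db_gb, nodes):
    """Chunk size per node and the rough costs for a given node count."""
    return {
        "chunk_gb": db_gb / nodes,
        "hourly_usd": nodes * NODE_PRICE_PER_HOUR,
        "upload_usd": db_gb * INTERNET_PRICE_PER_GB,
        "s3_monthly_usd": db_gb * S3_PRICE_PER_GB_MONTH,
    }


if __name__ == "__main__":
    n = nodes_needed(NR_SIZE_GB)
    print(f"nodes needed: {n}")
    for name, value in estimate(NR_SIZE_GB, n).items():
        print(f"{name}: {value:.2f}")
```

Changing NR_SIZE_GB (the database keeps growing) or OS_OVERHEAD_GB shows how quickly the node count, and therefore the hourly bill, creeps up.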

Now for applications like molecular dynamics (MD) simulations (e.g., GROMACS, NAMD, CHARMM etc etc), a lot more internode communication bandwidth is required. Looking at these benchmarks for GROMACS, it looks like things should scale nicely to two or four EC2 nodes, but after that the scaling would probably drop off, due to the less-than-gigabit ethernet. That doesn’t mean you won’t get more speed for more nodes, just that at some point adding more nodes will give greatly diminishing returns. While I’m speculating here, I’d say it’s probably better to leave this type of number crunching to the ‘real’ supercomputers or home-rolled purpose-built clusters; EC2 may not be worth the cost/effort here for big long-running calculations. Others are using MPI applications on EC2 already though, and I’d love to be proved wrong.

One of the current difficulties for running database-driven web applications on EC2 is that the virtual machine instances do not have persistent storage … either a connection to a database running somewhere else needs to be used, or the precious data needs to be moved off each EC2 instance before shutting the server down. If it crashes before shifting data off … goodbye database. I’m sure Amazon will come up with a solution to this, since it seems to be often requested on their forums. Having non-persistent data wouldn’t be such a big deal for mpiBLAST … the servers should rarely crash, the results could be stored in Amazon S3 or sent to a remote machine as they arrive, and the sequence database can also be stored in S3 (for about US$0.50 per month … dirt cheap). There are already a few FUSE S3fs implementations floating about (like s3fs-fuse) … I haven’t tried them yet, but essentially they should allow S3 storage to be mapped transparently to the Linux filesystem. My guess is it would be a bad idea to host a large MySQL database file on S3 using s3fs-fuse (there is a 5 GB filesize limit for starters) … but for lots of little-ish files, as are often generated by bioinformatics software, s3fs-fuse might just do the trick.

Whew ! .. Now I’m really itching to spend some spare change and a few hours to see if running mpiBLAST on EC2 is as good an idea as it sounds.

Doh ! Just tried to set up an account and the Amazon EC2 limited beta is currently full … I’ll have to wait .. :(.

A few additional links I was also looking at while writing this post .. wow … someone has some ‘issues’ with the NCBI BLAST implementation: http://blast.wustl.edu/blast/Memory.html
and http://blast.wustl.edu/blast/cparms.html