<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:creativeCommons="http://backend.userland.com/creativeCommonsRssModule"
>

<channel>
	<title>Your bones got a little machine &#187; two-point-oh</title>
	<atom:link href="http://blog.pansapiens.com/category/two-point-oh/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.pansapiens.com</link>
	<description>Ideas are cheap, implementation is expensive; act accordingly.</description>
	<lastBuildDate>Mon, 17 May 2010 02:20:10 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.6</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<creativeCommons:license>http://creativecommons.org/publicdomain/zero/1.0/</creativeCommons:license>
		<item>
		<title>Delicious geohashes &#8230; mmmm &#8230; tagging *drool*</title>
		<link>http://blog.pansapiens.com/2008/12/29/delicious-geohashes-mmmm-tagging-drool/</link>
		<comments>http://blog.pansapiens.com/2008/12/29/delicious-geohashes-mmmm-tagging-drool/#comments</comments>
		<pubDate>Mon, 29 Dec 2008 06:21:53 +0000</pubDate>
		<dc:creator>Andrew Perry</dc:creator>
				<category><![CDATA[ideas]]></category>
		<category><![CDATA[two-point-oh]]></category>
		<category><![CDATA[web2.0]]></category>
		<category><![CDATA[android]]></category>
		<category><![CDATA[delicious]]></category>
		<category><![CDATA[geohash]]></category>
		<category><![CDATA[gps]]></category>

		<guid isPermaLink="false">http://blog.pansapiens.com/?p=94</guid>
		<description><![CDATA[Since I got a new toy for Christmas, I&#8217;ve become interested in geolocation and the fun things you can do when you have an internet-connected GPS-enabled device in your pocket. I&#8217;m also a compulsive delicious tagger, so I quickly discovered the existing practice for geotagging delicious bookmarks.
Essentially, this seems to be: add the tag &#8216;geotagged&#8217;, [...]]]></description>
			<content:encoded><![CDATA[<p>Since I got a <a href="http://www.flickr.com/photos/pansapiens/3145097863/">new toy</a> for Christmas, I&#8217;ve become interested in geolocation and the fun things you can do when you have an internet-connected GPS-enabled device in your pocket. I&#8217;m also a compulsive delicious tagger, so I quickly discovered the <a href="http://en.wikipedia.org/wiki/Geotagging#Geotagging_in_tag-based_systems">existing practice for geotagging delicious bookmarks</a>.</p>
<p>Essentially, this seems to be: add the tag &#8216;<strong><a href="http://delicious.com/tag/geotagged">geotagged</a></strong>&#8217;, along with the tags &#8216;<strong>geo:lat=<em>X.xxx</em></strong>&#8217; and &#8216;<strong>geo:lon=<em>X.xxx</em></strong>&#8217;, where the <strong><em>X.xxx</em></strong>&#8217;s are the latitude and longitude numbers that are likely to come straight out of your GPS, in decimal degrees (WGS84).</p>
<p>This is all very nice, but the problem with tags in this format is that there is no easy or efficient way to use them to retrieve all items tagged for a particular <em>locality</em>. Sure, if I&#8217;m standing right on top of the <a href="http://maps.google.com/maps?f=q&amp;hl=en&amp;geocode=&amp;q=eureka+tower&amp;sll=-37.773358,144.946055&amp;sspn=0.010194,0.014913&amp;ll=-37.821362,144.964213&amp;spn=0.010187,0.014913&amp;t=h&amp;z=16">Eureka Tower</a> at <em>-37.821362,144.964213</em>, I can search for tags <strong>geo:lat=-37.821362</strong> and <strong>geo:lon=144.964213</strong> to find all the geotagged links for that <em>exact</em> location, but what if I&#8217;m standing 50 metres away across the street, looking up at the tower, and want to search for links near my current location?<span id="more-94"></span></p>
<p>Enter the <a href="http://geohash.org/">geohash</a>, a hash function for geolocation coordinates invented by <a href="http://labix.org/">Gustavo Niemeyer</a> (not to be confused with the <a href="http://www.xkcd.com/426/">xkcd Spontaneous Adventure Generation algorithm</a> of the same name). Wikipedia gives a reasonable explanation of <a href="http://en.wikipedia.org/wiki/Geohash">how geohashes work</a> &#8230; essentially the latitude and longitude are encoded together as a single string like <em><a href="http://geohash.org/r1r0fdzdwg">r1r0fdzdwg</a></em>. Geohashes have the useful property of arbitrary precision: each extra character narrows the area covered, so geohashes with the same prefix represent locations in the same vicinity. This means that the location across the street from the Eureka Tower, at geohash <em><a href="http://geohash.org/r1r0fdy7sm"><strong>r1r0fd</strong>y7sm</a></em>, shares the prefix <em>r1r0fd</em> with the geohash closest to the top of the Eureka Tower, at <em><a href="http://geohash.org/r1r0fdzdwg"><strong>r1r0fd</strong>zdwg</a></em>.</p>
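<p>To make the prefix property concrete, here is a minimal sketch of a geohash encoder in Python. It follows the scheme described on Wikipedia (interleave longitude and latitude bits by repeated halving, then emit one base-32 character per five bits); the function name and its default length are my own choices for illustration, not part of any particular library.</p>

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, length=10):
    """Encode WGS84 decimal degrees as a geohash string of the given length."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0          # number of bits collected for the current character
    value = 0         # value of the current 5-bit group
    use_lon = True    # bits alternate, starting with longitude
    while len(chars) != length:
        rng, coord = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:
            value = value * 2 + 1
            rng[0] = mid   # keep the upper half of the interval
        else:
            value = value * 2
            rng[1] = mid   # keep the lower half of the interval
        use_lon = not use_lon
        bits += 1
        if bits == 5:
            chars.append(BASE32[value])
            bits, value = 0, 0
    return "".join(chars)


# Eureka Tower coordinates from the text, at two precisions.
print(geohash_encode(-37.821362, 144.964213, 6))
print(geohash_encode(-37.821362, 144.964213, 10))
```

<p>Because every character only refines the interval chosen by the characters before it, the shorter result is always a prefix of the longer one.</p>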
<p>My proposal for delicious geotaggers is that in addition to the <strong>geo:lat</strong> and <strong>geo:lon</strong> tags, several truncated geo:hash tags should also be used. If I were to bookmark something related to the Eureka Tower, I may tag it:</p>
<pre><strong>geotagged
geo:lat=-37.821362
geo:lon=144.964213
geo:hash=r1r0fdzdwg
geo:hash=r1r0fdz
geo:hash=r1r0f</strong></pre>
<p>Then, anyone searching for the tag <a href="http://delicious.com/pansapiens/geo:hash=r1r0f"><strong>geo:hash=r1r0f</strong></a> will find every item within the area that this geohash covers &#8230; this would include not only the Eureka Tower, but the <a href="http://geohash.org/r1r0fe76n">Rialto Towers</a> too.</p>
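<p>Generating such a tag set is mechanical. A small sketch, assuming the full geohash is already known; the function name and the default truncation lengths are hypothetical and simply mirror the Eureka Tower example:</p>

```python
def geo_tags(lat, lon, geohash, lengths=(10, 7, 5)):
    """Build the delicious tag list for one bookmark.

    `lengths` picks which truncations of the geohash are emitted; the
    default mirrors the example tags and is only an illustrative choice.
    """
    tags = ["geotagged", f"geo:lat={lat}", f"geo:lon={lon}"]
    # Longest (most precise) truncation first, each length at most once.
    for n in sorted(set(lengths), reverse=True):
        tags.append("geo:hash=" + geohash[:n])
    return tags


for tag in geo_tags(-37.821362, 144.964213, "r1r0fdzdwg"):
    print(tag)
```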
<p>For each bookmarked item, the number of truncated geohashes used as tags roughly determines the distance ranges (ie <a href="http://mappinghacks.com/2008/05/29/geohash-implemented-in-python/">bounding boxes</a>) that can be searched. Exactly which truncations to use, and how many, is an open problem that I haven&#8217;t yet settled on the best solution for. Is it best to &#8216;overload&#8217; with every possible geohash truncation (eg include tags geo:hash=r1r0fdzdwg, geo:hash=r1r0fdzdw, geo:hash=r1r0fdzd, geo:hash=r1r0fdz &#8230;etc&#8230; to geo:hash=r)? This is probably overkill. A better approach would be to choose just a few key truncations that roughly correlate to a range of sensibly sized patches on the Earth&#8217;s surface, eg, bounding boxes with diagonal lengths of:</p>
<ul>
<li>geo:hash=r1r0fdzdwg <strong>~60 cm</strong> ['exact']</li>
<li>geo:hash=r1r0fdzd <strong>~20 m</strong></li>
<li>geo:hash=r1r0fdz <strong>~150 m</strong></li>
<li>geo:hash=r1r0fd <strong>~600 m</strong></li>
<li>geo:hash=r1r0f <strong>~4.8 km</strong></li>
<li>geo:hash=r1r0 <strong>~19.5 km</strong></li>
<li>geo:hash=r1r <strong>~150 km</strong></li>
</ul>
<p>These ranges map loosely to those deemed useful by Brightkite, which lets you search for events around you within 20 m, 200 m, 2 km, 4 km, 10 km, 50 km and 100 km. Maybe we only need a few of these. If only a few truncations were provided by the tagger, the user can always execute multiple searches, starting from the full geohash of their current location and truncating back, character by character, (effectively expanding the search radius) until they start to get hits. There may also be techniques whereby the last character(s) of the truncated hash can be incremented/decremented to search neighboring bounding boxes (eg for r1r0fd, also search for r1r0fc, r1r0fe tags), although I need to think about this a little more.</p>
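<p>The truncate-and-retry search could look something like the sketch below. The <em>lookup</em> callable is a stand-in for a real tag query against delicious, and the toy index is made up purely so the example runs; only the truncation loop itself comes from the idea described above.</p>

```python
def widening_tags(geohash, min_length=1):
    """Yield geo:hash tags from the most precise to the coarsest."""
    for n in range(len(geohash), min_length - 1, -1):
        yield "geo:hash=" + geohash[:n]


def search_nearby(geohash, lookup, min_length=3):
    """Return (tag, hits) for the most precise tag that has any hits.

    `lookup` is a hypothetical callable mapping a tag to a list of
    bookmarks; a real client would query delicious here.
    """
    for tag in widening_tags(geohash, min_length):
        hits = lookup(tag)
        if hits:
            return tag, hits
    return None, []


# Toy in-memory index standing in for a tag search.
index = {"geo:hash=r1r0f": ["Eureka Tower", "Rialto Towers"]}
print(search_nearby("r1r0fdy7sm", lambda tag: index.get(tag, [])))
```

<p>Starting from the across-the-street geohash, the search falls back character by character until it reaches a truncation that some bookmark was actually tagged with.</p>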
<p>Of course, the best solution for more useful geotagging within delicious would be for delicious/Yahoo to explicitly support some style of geotagging and provide a geotag-aware search facility &#8230; but until that day, geohashes may well do the job well enough. Next step for me: write a proof of concept application that actually produces and makes use of these types of tags &#8230;.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.pansapiens.com/2008/12/29/delicious-geohashes-mmmm-tagging-drool/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	<creativeCommons:license>http://creativecommons.org/publicdomain/zero/1.0/</creativeCommons:license>
	</item>
		<item>
		<title>Searching bioinformatic databases with YubNub</title>
		<link>http://blog.pansapiens.com/2008/11/12/searching-bioinformatic-databases-with-yubnub/</link>
		<comments>http://blog.pansapiens.com/2008/11/12/searching-bioinformatic-databases-with-yubnub/#comments</comments>
		<pubDate>Wed, 12 Nov 2008 11:29:16 +0000</pubDate>
		<dc:creator>Andrew Perry</dc:creator>
				<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[two-point-oh]]></category>
		<category><![CDATA[web2.0]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[yubnub]]></category>

		<guid isPermaLink="false">http://blog.pansapiens.com/?p=87</guid>
		<description><![CDATA[You may already be familiar with YubNub; it describes itself as &#8220;the social command line for the web&#8221;. Most commands consist of two (or more) words &#8230; one for the search engine, the other for the query.
For example, typing:
gg open science on friendfeed
into the YubNub search box searches Google for &#8220;open science on friendfeed&#8221;, via [...]]]></description>
			<content:encoded><![CDATA[<p>You may already be familiar with <a href="http://yubnub.org/">YubNub</a>; it describes itself as &#8220;the social command line for the web&#8221;. Most commands consist of two (or more) words &#8230; one for the search engine, the other for the query.</p>
<p>For example, typing:</p>
<blockquote><p><em><strong>gg open science on friendfeed</strong></em></p></blockquote>
<p>into the YubNub search box searches Google for &#8220;<em>open science on friendfeed</em>&#8221;, via YubNub.</p>
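<p>The same command can be fired from a script by building the URL directly. This is only a sketch: it assumes YubNub accepts the whole command line in a single <em>command</em> query parameter on its parser endpoint, so check that against the live site before relying on it.</p>

```python
from urllib.parse import urlencode

# Assumed endpoint: YubNub's command parser, taking the whole command line
# in a single "command" query parameter.
YUBNUB_PARSER = "http://yubnub.org/parser/parse"


def yubnub_url(command_line):
    """Return the URL that submits a command line to YubNub."""
    return YUBNUB_PARSER + "?" + urlencode({"command": command_line})


print(yubnub_url("gg open science on friendfeed"))
```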
<p>I thought I&#8217;d highlight a few life science- and bioinformatics-related YubNub commands I find myself using quite often in my day-to-day work. Some are commands I created, others someone else created. This is the beauty of YubNub &#8230; often someone has already made the &#8216;obvious&#8217; command &#8230; it&#8217;s worth just trying to search with a command you expect to exist, since it often does.</p>
<p>Onward, with the list:</p>
<p><span id="more-87"></span></p>
<ul>
<li><a href="http://yubnub.org/kernel/man?args=pubmed"><strong>pubmed</strong></a> &#8212; Searches PubMed</li>
<li><a href="http://yubnub.org/kernel/man?args=hubmed"><strong>hubmed</strong></a> &#8212; Searches <a href="http://www.hubmed.org/">HubMed</a> (Alf Eaton&#8217;s featureful alternative interface to PubMed)</li>
<li><a href="http://yubnub.org/kernel/man?args=gopubmed"><strong>gopubmed</strong></a> &#8212; Searches <a href="http://www.gopubmed.org/">GoPubMed</a> (an ontology enhanced PubMed search)</li>
<li><a href="http://yubnub.org/kernel/man?args=doi"><strong>doi</strong></a> &#8212; Redirects you based on a Digital Object Identifier (DOI), via <span class="muted">http://dx.doi.org/</span></li>
<li><a href="http://yubnub.org/kernel/man?args=pdb"><strong>pdb</strong></a> &#8212; Searches the Protein DataBank for 3D structures. Usually the search term should be a 4 letter pdb code.</li>
<li><a href="http://yubnub.org/kernel/man?args=uniprot"><strong>uniprot</strong></a> &#8212; Searches the Uniprot database (use an accession, id or keyword as the query).</li>
<li><a href="http://yubnub.org/kernel/man?args=ihop"><strong>ihop</strong></a> &#8212; Searches <a href="http://www.ihop-net.org">iHOP</a>, information Hyperlinked over Proteins, for views of the biomedical literature guided by gene networks. Nothing to do with <a href="http://www.google.com/search?q=ihop">pancakes (or prayer)</a>.</li>
</ul>
<p>There is also a class of more general, non-biomedical commands which I often use:</p>
<ul>
<li><a href="http://yubnub.org/kernel/man?args=gg"><strong>gg</strong></a> &#8212; The Google.</li>
<li><strong><a href="http://yubnub.org/kernel/man?args=gim">gim</a> &#8212; </strong>The Google Image Search.</li>
<li><a href="http://yubnub.org/kernel/man?args=wp"><strong>wp</strong></a> &#8212; Good ol&#8217; Wikipedia.</li>
<li><strong><a href="http://yubnub.org/kernel/man?args=ucc">ucc</a> </strong>&#8211; The universal currency converter at XE.com. Use it like <strong><em>ucc 399 aud usd</em></strong>, to convert $399 Australian dollars to US dollars. Then, if you have your cash in Australian dollars, weep about the recent drop in the exchange rate <img src='http://blog.pansapiens.com/wp-includes/images/smilies/icon_razz.gif' alt=':P' class='wp-smiley' /> </li>
<li><strong><a href="http://yubnub.org/kernel/man?args=man">man</a></strong> &#8212; Like *nix man &#8216;manual pages&#8217;, but for YubNub commands. Eg, <strong><em>man ucc</em></strong> will give the manual page describing how to use the <em>ucc</em> command.</li>
<li><strong><a href="http://yubnub.org/kernel/man?args=ls">ls</a></strong> &#8212; A bit like the *nix shell ls, this command lists existing YubNub commands that contain your query in their name, description or url. eg. searching <strong><em><a href="http://yubnub.org/kernel/ls?args=protein">ls protein</a></em></strong> gives you a short list of all the commands related to proteins.</li>
</ul>
<p>I&#8217;ve installed the <a href="http://mycroft.mozdev.org/search-engines.html?name=yubnub">YubNub opensearch plugin</a> so I can search directly from the search box (or location bar) in Firefox. Maybe one day <a href="https://wiki.mozilla.org/Labs/Ubiquity">Ubiquity</a> will fulfill this purpose, since in many ways it is the natural progression of the YubNub idea. But for the moment YubNub is the fastest, most streamlined way I&#8217;ve found to quickly fire off a search when I need to hunt down a reference, protein sequence or 3D structure. Nothing like instant gratification <img src='http://blog.pansapiens.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://blog.pansapiens.com/2008/11/12/searching-bioinformatic-databases-with-yubnub/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	<creativeCommons:license>http://creativecommons.org/publicdomain/zero/1.0/</creativeCommons:license>
	</item>
		<item>
		<title>Software review: producing two dimensional diagrams of membrane proteins</title>
		<link>http://blog.pansapiens.com/2008/06/26/software-review-producing-two-dimensional-diagrams-of-membrane-proteins/</link>
		<comments>http://blog.pansapiens.com/2008/06/26/software-review-producing-two-dimensional-diagrams-of-membrane-proteins/#comments</comments>
		<pubDate>Wed, 25 Jun 2008 20:30:22 +0000</pubDate>
		<dc:creator>Andrew Perry</dc:creator>
				<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[code]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[publication]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[science]]></category>
		<category><![CDATA[software]]></category>
		<category><![CDATA[structural biology]]></category>
		<category><![CDATA[two-point-oh]]></category>
		<category><![CDATA[web2.0]]></category>
		<category><![CDATA[beta-barrels]]></category>
		<category><![CDATA[graphics]]></category>
		<category><![CDATA[structure]]></category>
		<category><![CDATA[visualization]]></category>

		<guid isPermaLink="false">http://blog.pansapiens.com/?p=60</guid>
		<description><![CDATA[
I recently needed to make a simple, two dimensional figure of a beta-barrel membrane protein. I went hunting for programs that might take a sequence and/or structure and produce a pretty looking diagram to save me constructing everything by hand. Here are two I found and tried.

TMRPres2D
Ioannis C. Spyropoulos, Theodore D. Liakopoulos, Pantelis G. Bagos [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://blog.pansapiens.com/wp-content/uploads/2008/06/tmrpres2d_lamb_ecoli.png" rel="lightbox[60]"><img class="alignright size-medium wp-image-61" title="TMRPres2D LAMB_ECOLI" src="http://blog.pansapiens.com/wp-content/uploads/2008/06/tmrpres2d_lamb_ecoli-300x250.png" alt="E. coli LamB, presented using TMRPres2D. Note that the cytoplasmic/extracellular labels are incorrect, and should say extracellular/periplasmic." width="300" height="250" /></a><strong><a href="http://bioinformatics.biol.uoa.gr/TMRPres2D/"></a></strong></p>
<p>I recently needed to make a simple, two dimensional figure of a beta-barrel membrane protein. I went hunting for programs that might take a sequence and/or structure and produce a pretty looking diagram to save me constructing everything by hand. Here are two I found and tried.</p>
<p><span id="more-60"></span></p>
<p><strong><a href="http://bioinformatics.biol.uoa.gr/TMRPres2D/">TMRPres2D</a></strong></p>
<p><span>Ioannis C. Spyropoulos, Theodore D. Liakopoulos, Pantelis G. Bagos and Stavros J. Hamodrakas</span><span><strong> TMRPres2D: high quality visual representation of transmembrane protein models</strong><span style="text-decoration: underline;"> Bioinformatics</span>. 2004;  20: 3258-3260. (<a href="http://resolveref.appspot.com/ref/Bioinformatics/2004/20/3258">link</a>)<br />
</span><br />
<strong>Pros:</strong></p>
<ul>
<li> Cross-platform (Java)</li>
<li> Simple interface, GUI (zero learning curve)</li>
<li> <a href="http://blog.pansapiens.com/wp-content/uploads/2008/06/tmrpres2d_secy_bucai.png" rel="lightbox[60]"><img class="alignright size-medium wp-image-62" title="TMRPres2D SECY_BUCAI" src="http://blog.pansapiens.com/wp-content/uploads/2008/06/tmrpres2d_secy_bucai-300x197.png" alt="TMRPres2D diagram of SECY_BUCAI. Labels \" width="300" height="197" /></a>Lots of input options (defines transmembrane regions directly from SwissProt or PIR annotations online, takes input from several transmembrane region predictors)</li>
<li> Lots of output formats and options (Postscript, gif, jpg, png, svg, bmp)</li>
<li> Various colouring options (hydrophobicity, charge, &#8220;printer friendly&#8221;)</li>
<li> Makes reasonable looking diagrams of helical transmembrane proteins</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li> Doesn&#8217;t handle beta-barrel membrane proteins gracefully (strand drawing is overlapped, messy).</li>
<li>The membrane is assumed to be a eukaryotic plasma membrane, with labels &#8220;cytoplasmic/extracellular&#8221; (which should be, for instance, &#8220;extracellular/periplasm&#8221; for a bacterial outer membrane protein). This is easily changed on the diagram with external editing.</li>
</ul>
<p><strong><a href="http://www.pharmazie.uni-kiel.de/chem/Prof_Beitz/textopo.htm">TeXtopo</a></strong></p>
<p>Beitz, E. (2000), <strong>TeXtopo: shaded membrane protein topology  	plots in LaTeX2e</strong>. <em>Bioinformatics</em> <strong>16</strong>: 1050-1051. (<a href="http://resolveref.appspot.com/ref/Bioinformatics/2000/16/1050">link</a>). See the <a href="http://resolveref.appspot.com/ref/Bioinformatics/2000/16/1050">original publication</a> or <a href="http://www.uni-kiel.de/Pharmazie/chem/Prof_Beitz/biotex.html">Professor Eric Beitz&#8217;s site</a> for a better example than my image.</p>
<p><a href="http://blog.pansapiens.com/wp-content/uploads/2008/06/secy_textopo.png" rel="lightbox[60]"><img class="alignright size-medium wp-image-64" title="SecY textopo diagram" src="http://blog.pansapiens.com/wp-content/uploads/2008/06/secy_textopo-300x214.png" alt="" width="300" height="214" /></a></p>
<p><strong>Pros:</strong></p>
<ul>
<li>Beautiful, clean, publication quality diagrams, courtesy of LaTeX</li>
<li>Multiple input options (Swissprot format, PHD, HMMTOP, user defined)</li>
<li>Multiple sequence annotation options including colouring by various physicochemical properties (hydrophobicity, charge), sequence conservation or user defined schemes.</li>
<li>Will depict membrane embedded half-loops and lipid anchors.</li>
<li>Versatile output (Postscript, pdf, dvi, basically anything that LaTeX can be rendered as)</li>
<li>Also can generate attractive looking helical wheel plots</li>
<li>Did I mention the output is clean and looks great &#8230; ?</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>Steep learning curve for the uninitiated, despite extensive documentation (ie LaTeX code, no GUI)</li>
<li>No support for beta-barrel membrane proteins</li>
</ul>
<p>If I ever need to make a 2D diagram of a helical membrane protein for a publication, TeXtopo would be my first choice. For quickly getting an overview of some transmembrane prediction results or a protein with defined transmembrane regions in Uniprot, TMRPres2D is the quickest and easiest method.</p>
<p>In the end, since neither program would do a decent job at cleanly depicting the strands of a beta-barrel in a simple 2D plot, I ended up coding my own hackish solution (<a href="http://blog.pansapiens.com/wp-content/uploads/2008/06/svg_barrel.tar.gz">svg_barrel.tar.gz</a> or <a href="http://blog.pansapiens.com/wp-content/uploads/2008/06/svg_barrel_gui_win32.zip">svg_barrel_gui_win32.zip</a>) using Python and a tweaked version of <em>SVGdraw.py</em>. This allowed me to generate some SVG graphics to use as a starting point, and then hand edit the result in Inkscape to align strands to loosely match the real hydrogen bonding patterns. I also added some simple Bezier curves for the loops, since neat placement of loop residues was the tricky part that I decided I didn&#8217;t have time to tackle.</p>
<p>Here&#8217;s the end result, after hand editing:<br />
    <object type="image/svg+xml" width="400" height="400" data="http://blog.pansapiens.com/wp-content/uploads/2008/06/lamb_2d_barrel.svg"><br />
      <img src="http://blog.pansapiens.com/wp-content/uploads/2008/06/lamb_2d_barrel.jpg" alt="SVG barrel diagram"><br />
    </object></p>
<p>And here is the 3D version, as a point of reference:</p>
<p><a href="http://blog.pansapiens.com/wp-content/uploads/2008/06/lamb_ray.jpg" rel="lightbox[60]"><img class="aligncenter size-medium wp-image-69" title="LamB (1MPQ)" src="http://blog.pansapiens.com/wp-content/uploads/2008/06/lamb_ray.jpg" alt="generated using PyMol (raytraced)" width="272" height="300" /></a></p>
<p>The 2D vector diagram could do with some work to aid in a more accurate representation (unfortunately &#8216;flat&#8217; views of a 3D barrel always have to make some compromises), but it does the job. The goal was to keep it simple &#8230; simple it is. One day I may extend this code to actually use known structure coordinates to automatically align the strands (saving tedious manual alignment), and write some code that properly lays out the loops.</p>
<p>Does anyone know of any other programs with similar functionality that I&#8217;ve missed?</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.pansapiens.com/2008/06/26/software-review-producing-two-dimensional-diagrams-of-membrane-proteins/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	<creativeCommons:license>http://creativecommons.org/publicdomain/zero/1.0/</creativeCommons:license>
	</item>
		<item>
		<title>ResolveRef updated : now with auto-suggest and source code</title>
		<link>http://blog.pansapiens.com/2008/06/06/resolveref-updated-now-with-auto-suggest-and-source-code/</link>
		<comments>http://blog.pansapiens.com/2008/06/06/resolveref-updated-now-with-auto-suggest-and-source-code/#comments</comments>
		<pubDate>Fri, 06 Jun 2008 00:26:37 +0000</pubDate>
		<dc:creator>Andrew Perry</dc:creator>
				<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[code]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[software]]></category>
		<category><![CDATA[two-point-oh]]></category>
		<category><![CDATA[web2.0]]></category>
		<category><![CDATA[gae]]></category>
		<category><![CDATA[Google App Engine]]></category>
		<category><![CDATA[resolveref]]></category>

		<guid isPermaLink="false">http://blog.pansapiens.com/?p=56</guid>
		<description><![CDATA[I updated ResolveRef last night and checked in the most current sourcecode to svn at Google Code.
New features include:


Suggest/autocomplete for journal title field, using the journal title lists provided by PubMed.
A &#8220;Verify&#8221; button. Allows a ResolveRef URL to be constructed with the web form and verified as working and valid without actually forwarding the user [...]]]></description>
			<content:encoded><![CDATA[<p>I updated <a href="http://resolveref.appspot.com/">ResolveRef</a> last night and checked in the most current sourcecode to svn <a href="http://code.google.com/p/resolveref/">at Google Code</a>.</p>
<p>New features include:</p>
<p><a href="http://blog.pansapiens.com/wp-content/uploads/2008/06/resolveref1.png" rel="lightbox[56]"><img class="alignright size-medium wp-image-58" title="ResolveRef" src="http://blog.pansapiens.com/wp-content/uploads/2008/06/resolveref1-230x300.png" alt="ResolveRef, now prettier, with comments box by disqus." width="230" height="300" /></a></p>
<ul>
<li>Suggest/autocomplete for journal title field, using the journal title lists provided by PubMed.</li>
<li>A &#8220;Verify&#8221; button. Allows a ResolveRef URL to be constructed with the web form and verified as working and valid without actually forwarding the user to the article.</li>
<li>Some bugfixes (handled the case where there is no DOI in the PubMed record, handled network timeouts to PubMed)</li>
<li>Refreshed visuals</li>
<li>Disqus comments box for feedback</li>
</ul>
<p>In the interest of just getting something working quickly, I implemented the suggest feature in the laziest, possibly most RAM- and CPU-hungry way possible: the &#8220;jQuery Suggest&#8221; code queries the web app with substrings as you type each character, and on the server side the app uses a regex to scan a ~1.5 MB list of journal titles held in RAM. I&#8217;ve already noticed a few &#8220;<em>This request used a high amount of CPU</em>&#8221; warnings in the logs, with the threat &#8220;<em>High CPU requests have a small quota, and if you exceed this quota, your app will be temporarily disabled</em>&#8221;. If my nasty hack starts heating up Google&#8217;s datacentre too much, I might have to disable the &#8216;suggest&#8217; feature until I can implement it &#8220;properly&#8221;.</p>
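<p>For comparison, here is a rough sketch of the lazy regex scan next to one cheaper alternative, a binary search over a sorted title list. Neither function is the actual ResolveRef code; the names and the five-title sample list are placeholders, and the prefix version is just one way a &#8220;proper&#8221; implementation might go.</p>

```python
import bisect
import re

# A tiny illustrative sample; the real list is the PubMed journal title file.
TITLES = sorted(["Bioinformatics", "Biochemistry", "J Biol Chem",
                 "Nature", "Proc Natl Acad Sci U S A"], key=str.lower)
KEYS = [t.lower() for t in TITLES]


def suggest_regex(fragment, limit=10):
    """Naive version: scan every title with a regex on each keystroke."""
    pattern = re.compile(re.escape(fragment), re.IGNORECASE)
    return [t for t in TITLES if pattern.search(t)][:limit]


def suggest_prefix(fragment, limit=10):
    """Cheaper sketch: binary-search a sorted list for titles with this prefix."""
    key = fragment.lower()
    start = bisect.bisect_left(KEYS, key)
    out = []
    for i in range(start, min(start + limit, len(KEYS))):
        if not KEYS[i].startswith(key):
            break
        out.append(TITLES[i])
    return out


print(suggest_regex("bio"))
print(suggest_prefix("bio"))
```

<p>The trade-off is that the prefix lookup only matches from the start of a title, whereas the regex scan also finds the fragment in the middle of one, at the cost of touching every title on every request.</p>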
<p><span id="more-56"></span></p>
<h3>Reflections, discoveries</h3>
<p>This idea of implementing Openref-style article identifiers has been a fun experiment, and a nice way to learn more about the ins-and-outs of PubMed. When working on implementing the &#8216;suggest&#8217; feature, a major drawback became even more apparent &#8230; journal titles (the <em>[TA]</em> field) used by PubMed are not always easily guessable, and many common abbreviations used in reference lists do not appear to exist in <a href="http://www.ncbi.nlm.nih.gov/entrez/citmatch_help.html#JournalLists">PubMed&#8217;s downloadable flat-file journal title lists</a>. This is the list that ResolveRef uses to make the &#8216;suggestions&#8217;, so having &#8216;missing&#8217; journal titles presents a problem if I want users to be able to painlessly construct ResolveRef URLs.</p>
<p><em>Proc. Natl. Acad. Sci. U.S.A. </em>is a perfect example. Many article bibliographies use <em>PNAS</em> &#8211; that would be my guess if I were trying to create a ResolveRef URL for a <em>PNAS</em> paper &#8211; and yet this journal title does not exist as far as PubMed&#8217;s official journal list is concerned. Issues surrounding this problem were <a href="http://baoilleach.blogspot.com/2008/01/doi-or-doh-proposal-for-restful-unique.html">discussed on Noel&#8217;s original OpenRef post</a>. The odd thing is that if I search the <a href="http://www.ncbi.nlm.nih.gov/sites/entrez?db=journals">PubMed Journals database</a> for &#8220;PNAS&#8221;, <a href="http://www.ncbi.nlm.nih.gov/sites/entrez?Db=nlmcatalog&amp;doptcmdl=Expanded&amp;cmd=search&amp;Term=7505876[NlmId]">it finds it</a>, and gives me a record where <em>PNAS</em> is listed under &#8220;Other titles(s)&#8221;. If someone could point me to where I can get these extra fields containing additional names for a journal that are not provided in the downloadable flat-files, it would be much appreciated (I bet Alf knows the answer. Or maybe I should email the folks at PubMed). If I can get a better list of titles, the &#8216;suggest&#8217; feature in ResolveRef would suddenly become a whole lot more useful. Another way around this may be to use CrossRef, and I&#8217;m looking into that, <a href="http://depth-first.com/articles/2008/05/06/hacking-doi-interconvert-bibliographic-references-and-dois-with-crossref-and-openurl">but I get the feeling that usage of the CrossRef API is more restricted</a>, so I haven&#8217;t bothered with it so far.</p>
<h3>Thoughts about the future of ResolveRef / OpenRef</h3>
<p>At this stage, ResolveRef URLs are not actually identifiers. They simply act like a frontend to a single-hit PubMed search, and several <em>different</em> ResolveRef URLs can return the <em>same</em> DOI URL (and hence the same journal article). A proper identifier would have a one-to-one mapping between the human-readable ResolveRef URLs and a DOI. In the future, I may attempt to get ResolveRef to &#8216;normalize&#8217; URLs by allowing only a single journal title for each journal and forcing the use of volume numbers if present. The user could use the web interface to enter the values, and ResolveRef would return a normalized URL. Only normalized URLs would successfully forward to the DOI URL; others would return an error with &#8220;Did you mean ..<em>insert normalized URL ..?</em>&#8221;. One drawback is that this would reduce the guessability of ResolveRef URLs, but the advantage is that they could be treated like identifiers: one article would have one and only one valid ResolveRef URL. By requiring a tool (like the ResolveRef web form) to help users build a valid URL, and removing some of the guessability, ResolveRef would move a little closer to a <a href="http://hublog.hubmed.org/archives/001601.html">reinvention of OpenURL</a> (although I think OpenRef/ResolveRef URLs are still more readable and cleaner than OpenURLs, and are much more guessable if you have a bibliography in front of you).</p>
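<p>A minimal sketch of how that normalization step might behave. Everything here is hypothetical: the alias table, the underscore-for-space convention in the path, and the return strings are invented for illustration, and the year/volume/page numbers are arbitrary placeholders rather than a real citation.</p>

```python
# Hypothetical alias table: every accepted spelling maps to one canonical title.
CANONICAL = {
    "pnas": "Proc Natl Acad Sci U S A",
    "proc natl acad sci u s a": "Proc Natl Acad Sci U S A",
    "bioinformatics": "Bioinformatics",
}


def normalize(path):
    """Return the canonical journal/year/volume/page path, or None if unknown."""
    journal, year, volume, page = path.strip("/").split("/")
    title = CANONICAL.get(journal.replace("_", " ").lower())
    if title is None:
        return None
    return "/".join([title.replace(" ", "_"), year, volume, page])


def resolve(path):
    """Forward only normalized paths; otherwise suggest the normalized one."""
    normalized = normalize(path)
    if normalized is None:
        return "error: unknown journal"
    if normalized == path.strip("/"):
        return "forward: " + normalized
    return "Did you mean " + normalized + " ?"


print(resolve("PNAS/2008/4/1996"))
print(resolve("Proc_Natl_Acad_Sci_U_S_A/2008/4/1996"))
```

<p>With a table like this, many guessable spellings lead to one canonical URL, but only that canonical URL is ever forwarded.</p>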
<p>A key cosmetic (and philosophical) difference between OpenURL and OpenRef/ResolveRef URLs is that OpenURL uses HTTP GET fields, eg <em>?title=bla&amp;issn=12345</em>, while OpenRef/ResolveRef uses the URL path itself eg, <em>somejournalname/2008/4/1996</em>. It&#8217;s a bit like one scheme was designed in the age of CGI scripts, while the other was designed for web applications capable of more RESTful behaviour. In my mind OpenURL is more versatile but much uglier, while OpenRef is cleaner and simpler but can only reference journal articles. OpenRef-style URLs will never be able to reference the breadth of resources that an OpenURL can theoretically handle. Maybe a hybrid solution could work &#8230; some kind of OpenURL server that could &#8220;speak OpenRef&#8221; &#8230; accepting OpenRef-style URLs where possible, while still dealing with regular OpenURL style &#8220;<em>?bla=blarg&amp;</em>&#8221; query strings for everything else.</p>
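<p>The two URL styles side by side, as a short sketch. The path builder uses the journal/year/volume/page order seen in existing ResolveRef links; the query-string builder is generic, and its resolver host and field names are placeholders rather than real OpenURL keys.</p>

```python
from urllib.parse import quote, urlencode

# Base taken from the ResolveRef links used on this blog.
RESOLVEREF_BASE = "http://resolveref.appspot.com/ref/"


def openref_url(journal, year, volume, page):
    """Path-style reference: journal/year/volume/page."""
    parts = [quote(str(p), safe="") for p in (journal, year, volume, page)]
    return RESOLVEREF_BASE + "/".join(parts)


def query_style_url(base, **fields):
    """OpenURL-like reference: the same data as GET fields.

    The field names are placeholders, not real OpenURL keys.
    """
    return base + "?" + urlencode(fields)


print(openref_url("Bioinformatics", 2004, 20, 3258))
print(query_style_url("http://resolver.example.org/", title="Bioinformatics",
                      volume=20, spage=3258))
```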
<p>As far as I can tell, OpenURLs are not <em>identifiers</em> with a one-to-one URL-to-article mapping &#8211; this is a drawback, since you could not do a Google search to reliably find sites that reference an article via its OpenURL &#8230; you theoretically could do this with a normalized OpenRef/ResolveRef URL, since there would be only one unique string used to reference any one article (as Noel pointed out, OpenRef strings have some properties akin to InChI strings). Obviously to do this cleanly, ResolveRef would need a nicer domain (something akin to dx.doi.org).</p>
<p>Anyhow, I&#8217;m not expecting ResolveRef / OpenRef to make any impact on anything anywhere anytime soon. I&#8217;m not a librarian, I don&#8217;t sit on an <a href="http://listserv.oclc.org/scripts/wa.exe?A0=OPENURL">NISO/ANSI committee</a>, and I don&#8217;t see publishers seeing a need to adopt anything beyond the DOI. But it&#8217;s been nice to play around with, and I&#8217;m likely to continue doing so.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.pansapiens.com/2008/06/06/resolveref-updated-now-with-auto-suggest-and-source-code/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	<creativeCommons:license>http://creativecommons.org/publicdomain/zero/1.0/</creativeCommons:license>
	</item>
		<item>
		<title>ResolveRef : looking at the logs</title>
		<link>http://blog.pansapiens.com/2008/06/01/resolveref-looking-at-the-logs/</link>
		<comments>http://blog.pansapiens.com/2008/06/01/resolveref-looking-at-the-logs/#comments</comments>
		<pubDate>Sun, 01 Jun 2008 08:29:24 +0000</pubDate>
		<dc:creator>Andrew Perry</dc:creator>
				<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[biopython]]></category>
		<category><![CDATA[code]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[software]]></category>
		<category><![CDATA[two-point-oh]]></category>
		<category><![CDATA[web2.0]]></category>
		<category><![CDATA[gae]]></category>
		<category><![CDATA[Google App Engine]]></category>
		<category><![CDATA[resolveref]]></category>

		<guid isPermaLink="false">http://blog.pansapiens.com/?p=54</guid>
		<description><![CDATA[One of the nice features of Google App Engine is you can easily view logs for your application to quickly see requests generating errors. Browsing the logs of ResolveRef, I&#8217;ve been able to identify a few classes of query which, for one reason or another, weren&#8217;t working.

Firstly, there is the &#8220;just testing and don&#8217;t actually [...]]]></description>
			<content:encoded><![CDATA[<p>One of the nice features of <a href="http://code.google.com/appengine/">Google App Engine</a> is you can easily view logs for your application to quickly see requests generating errors. Browsing the logs of <a href="http://resolveref.appspot.com/">ResolveRef</a>, I&#8217;ve been able to identify a few classes of query which, for one reason or another, weren&#8217;t working.</p>
<p><span id="more-54"></span></p>
<p>Firstly, there is the &#8220;just testing and don&#8217;t actually have a citation on hand to key-in&#8221; class of users, who tried requests something like:</p>
<blockquote>
<h5><span class="file">/ref/xx/2007//</span></h5>
</blockquote>
<p>Not much sympathy here &#8230; it&#8217;s pretty much like dialing a random phone number and hoping someone will pick up.</p>
<p>Then there is a class of users who appear to have sensible intentions, but provide incomplete ResolveRef URLs, eg:</p>
<blockquote>
<h5><span class="file">/ref/Organic%20Letters/2000//</span></h5>
</blockquote>
<p>Maybe I poorly described ResolveRef in the initial announcement, maybe the documentation in the &#8220;About&#8221; box on the ResolveRef site is unclear or maybe these users just didn&#8217;t read the docs in the first place. When I described the service as &#8220;A RESTful way to do PubMed searches&#8221;, maybe it would have been more accurate to say &#8220;A simple, RESTful way to resolve a <em><strong>single</strong></em> journal article using only the human-readable citation information&#8221;. ResolveRef does not give a <em>list</em> of results to a PubMed search; it forwards to a <em>single hit</em> (ideally the requested article), or gives an error if it can&#8217;t be found. By the looks of it, many users seem to want to use ResolveRef as a way to retrieve a list of results. While this goes against the original spirit of ResolveRef being a resolver for an [almost] <em>unique identifier</em> for journal articles (akin to <a href="http://baoilleach.blogspot.com/2008/01/doi-or-doh-proposal-for-restful-unique.html">Noel&#8217;s OpenRef proposal</a>), I may be tempted to update ResolveRef to return a list of hits in the future (or just forward to the <a href="http://hubmed.org">HubMed</a> or PubMed results page).</p>
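<p>The single-hit behaviour, plus a friendlier answer for incomplete URLs like the ones above, could be sketched in Python roughly as follows. This is an illustration only, not the deployed code; the function names, the message wording and the idea of passing in a list of hits are all assumptions.</p>

```python
FIELD_NAMES = ("journal", "year", "volume", "page")


def missing_fields(path):
    """Name the citation fields left empty in a /ref/journal/year/volume/page path."""
    values = path.split("/")[2:]  # drop the leading empty string and "ref"
    values = (values + [""] * len(FIELD_NAMES))[:len(FIELD_NAMES)]
    return [name for name, value in zip(FIELD_NAMES, values) if not value]


def choose_response(path, hits):
    """Forward only when the citation is complete and the search gave exactly one hit."""
    missing = missing_fields(path)
    if missing:
        return ("error", "incomplete citation, missing: " + ", ".join(missing))
    if len(hits) == 1:
        return ("forward", hits[0])
    return ("error", "expected exactly one match, found {}".format(len(hits)))
```

<p>A request like <em>/ref/Organic%20Letters/2000//</em> would then be told which fields are missing, rather than silently failing.</p>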
<p>There are also some <em>actual</em> bugs which throw nasty Python tracebacks (I think this one was actually me trying to use ResolveRef to look up a reference at work):</p>
<blockquote><p><strong><br />
/ref/Protein%20Sci/1999/8/689</strong></p></blockquote>
<p>This threw an error since ResolveRef (stupidly) assumed that every PubMed record has an associated DOI &#8230; however for some reason this Protein Science article does not have a DOI recorded in PubMed, so it fails to resolve with ResolveRef. This is (yet another) drawback to using PubMed as a backend. I&#8217;m thinking I may need to make ResolveRef <a href="http://depth-first.com/articles/2008/05/06/hacking-doi-interconvert-bibliographic-references-and-dois-with-crossref-and-openurl">interface with CrossRef</a> somehow too, since that may act as a backup (or complete replacement) for these cases.</p>
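<p>A defensive version of that step might look something like the Python sketch below. It assumes the PubMed record has already been parsed into a plain dictionary with hypothetical &#8220;doi&#8221; and &#8220;pmid&#8221; keys, treats the backup lookup (CrossRef or otherwise) as a function supplied by the caller, and uses a PubMed page URL pattern that is also an assumption.</p>

```python
def target_url(record, fallback_lookup=None):
    """Pick a URL to forward to without assuming that the record carries a DOI."""
    doi = record.get("doi")
    if not doi and fallback_lookup is not None:
        # Hypothetical hook: a second resolver that may find a DOI from the citation.
        doi = fallback_lookup(record)
    if doi:
        return "http://dx.doi.org/" + doi
    pmid = record.get("pmid")
    if pmid:
        # Assumed URL pattern for the PubMed page of a record.
        return "https://pubmed.ncbi.nlm.nih.gov/{}/".format(pmid)
    return None
```

<p>Returning <em>None</em> leaves it to the request handler to show a proper error page instead of a traceback.</p>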
<p>There also seem to be occasional errors generated when the HTTP connection from the Google App servers to PubMed fails; my fault entirely &#8230; that type of exception should always be anticipated and caught in a networked application.</p>
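<p>For what it&#8217;s worth, the general shape of the fix is simple. The sketch below uses the modern Python standard library rather than the App Engine fetch API that ResolveRef itself runs on, so treat it as an illustration of the pattern and not as the actual patch; the function name and error message are made up.</p>

```python
import socket
import urllib.error
import urllib.request


def fetch(url, opener=urllib.request.urlopen, timeout=10):
    """Return (body, None) on success or (None, message) if the request fails."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read(), None
    except (urllib.error.URLError, socket.timeout, ConnectionError) as exc:
        # Anticipate network failures and report them instead of crashing.
        return None, "upstream request failed: {}".format(exc)
```

<p>The <em>opener</em> argument is only there so the failure path can be exercised without a real network connection.</p>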
<p>Apart from guessing how people may like to use the application by examining the logs, <span class="gray"><a href="http://appgallery.appspot.com/about_app?app_id=agphcHBnYWxsZXJ5chMLEgxBcHBsaWNhdGlvbnMYnAcM"><em>edoardo.marcora</em> also suggested that autocomplete/suggest</a> for the journal field would be nice. I agree &#8230; this was a feature I was working on prior to the initial release, but it was taking too long so I just launched ResolveRef without it.</span></p>
<p>A new version is in the pipeline and will be ready for release soon. I&#8217;ll also put it on Google Code, warts and all. I already have the &#8220;suggest&#8221; functionality working, and once I resolve the few bugs discussed above, I&#8217;ll push out an update. Stay tuned.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.pansapiens.com/2008/06/01/resolveref-looking-at-the-logs/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
	<creativeCommons:license>http://creativecommons.org/publicdomain/zero/1.0/</creativeCommons:license>
	</item>
		<item>
		<title>Dapper : the screen scraper for everyone</title>
		<link>http://blog.pansapiens.com/2007/03/19/dapper-the-screen-scraper-for-everyone/</link>
		<comments>http://blog.pansapiens.com/2007/03/19/dapper-the-screen-scraper-for-everyone/#comments</comments>
		<pubDate>Sun, 18 Mar 2007 19:24:00 +0000</pubDate>
		<dc:creator>Andrew Perry</dc:creator>
				<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[two-point-oh]]></category>

		<guid isPermaLink="false">http://blog.pansapiens.com/2007/03/19/dapper-the-screen-scraper-for-everyone/</guid>
		<description><![CDATA[I&#8217;ve been meaning to write about the Dapper &#8216;screen scraping&#8217; service for a while, since I think it&#8217;s mostly useful and pretty cool.
(Yes, this service is called Dapper, sharing a name with the popular Ubuntu GNU/Linux release. I&#8217;m a little suspicious that maybe this was a deliberate marketing trick to pull search traffic intended for [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been meaning to write about the <a href="http://www.dapper.net/">Dapper</a> &#8216;screen scraping&#8217; service for a while, since I think it&#8217;s mostly useful and pretty cool.</p>
<p><span style="font-style: italic;">(Yes, this service is called Dapper, sharing a name with the popular Ubuntu GNU/Linux release. I&#8217;m a little suspicious that maybe this was a deliberate marketing trick to pull search</span><span style="font-style: italic;"> traffic intended for Ubuntu &#8230;.).</span></p>
<p><a href="http://www.techcrunch.com/2006/08/17/create-an-api-for-any-site-with-dapper/">Techcrunch describes Dapper well</a> &#8230; &#8216;create an API for any site&#8217;. Essentially, Dapper is good at analyzing web pages which have a fixed format (eg Google search results) and will extract the content in a predictable fashion to provide the raw data in XML, RSS, JSON or CSV format. Depending on the type of data you extract from the page, Dapper can also display data as a Google Map, a Google Gadget or Netvibes module, send email alerts or output iCalendar format.</p>
<p>By making it simple to extract information from sites that do not provide their data as an RSS feed or other useful format, it is possible for non-programmers to use Dapper and &#8216;liberate&#8217; these sites, producing feeds compatible with their favorite news feed reader from an otherwise dead and lifeless web page. Web application programmers can waste less time writing screen scrapers, and waste less time fixing broken code caused by slight changes in the page format. It may not be something you use for a mission critical component, but for a quick mashup or to get started quickly before writing your own more robust parser, I think Dapper will prove (<a href="http://www.dappit.com/dapplications/">and is proving</a>) very valuable.</p>
<p>However, there is a downside. From playing with Dapper on-and-off for the past few months, I&#8217;ve established that Dapper works quite well extracting data from very uniform pages like Google search hits [<span style="font-style: italic;">however I didn't do that since it's against the terms of service. <a href="http://www.scroogle.org/">"No screen-scraping for you"</a>, says the <a href="http://www.google.com/intl/en/terms_of_service.html">Google-Nazi</a></span>] or the front page of <a href="http://digg.com/">digg</a> .. but usually fails on pages that don&#8217;t follow a strict pattern. Getting the wrong data (or junk) in the wrong field 5% of the time may be tolerable for the occasional <a href="http://feeds.taquitos.net/SnackReviews">frivolous RSS feed</a>, but it is annoying enough that for more important applications it is a real show stopper.</p>
<p>One of the reasons it took me several months to get around to posting about Dapper is that I desperately wanted a killer example of extracting data from a bioinformatics database or web site. I&#8217;ve found that most decent projects already make their data available in some useful format like XML or CSV and don&#8217;t really require scraping with Dapper, while some of the less organized projects which only provide say, HTML tables, [<span style="font-style: italic;">I won't name names ... in fact I've forgotten about them already ... "no citation for you" says the citation-<a href="http://en.wikipedia.org/wiki/Soup_Nazi">Soup-Nazi</a> </span>] often failed to work well with Dapper&#8217;s page analysis unless the page formatting was strictly uniform.</p>
<p>Pedro<span class="post-author"> Beltrão has given an <a href="http://pbeltrao.blogspot.com/2007/03/bioinformatic-web-scrapingmash-ups-made.html">example of using openKapow as a scraper for bioinformatics</a> &#8230; Dapper and <a href="http://openkapow.com/Default.aspx">openKapow</a> seem to be competing in the same space, however Dapper is entirely web-based and openKapow requires some free desktop software. I haven&#8217;t tried openKapow</span><span class="post-author"> yet, but <a href="http://www.techcrunch.com/2007/03/02/5-ways-to-mix-rip-and-mash-your-data/">Techcrunch&#8217;s rundown on several scraping services</a> lists the use case of Dapper as &#8220;Quickly Scrape Data&#8221; and openKapow as &#8220;Robustly Scrape Data&#8221;, which is telling.</span></p>
<p>I replicated <a href="http://pbeltrao.blogspot.com/2007/03/bioinformatic-web-scrapingmash-ups-made.html">Pedro&#8217;s openKapow Ensembl orthologue search</a> in Dapper as an example. It&#8217;s not the best example since, as Pedro notes, Ensembl is one of the &#8216;good guys&#8217; that already provide results in XML format.</p>
<p>First, I fed Dapper four URLs for <a href="http://www.ensembl.org/Homo_sapiens/geneview?gene=ENSG00000173726">Ensembl gene report pages</a> , which contain a section with predicted orthologues. Apparently, giving Dapper several pages of the same format helps the analysis:</p>
<p><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp2.blogger.com/_2BBBfWXV3-w/Rf2JKi-99TI/AAAAAAAAAA0/vlwVSJnc_rY/s1600-h/dapper_ensemble1.png" rel="lightbox[23]"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://bp2.blogger.com/_2BBBfWXV3-w/Rf2JKi-99TI/AAAAAAAAAA0/vlwVSJnc_rY/s320/dapper_ensemble1.png" alt="" id="BLOGGER_PHOTO_ID_5043337972007433522" border="0" /></a><br />Then, I selected the Gene ID in the orthologues list .. Dapper colours fields it detects as the same type. There is a cryptic unlabeled slider which determines the &#8216;greediness&#8217; of the selection:</p>
<p><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp1.blogger.com/_2BBBfWXV3-w/Rf2KDS-99UI/AAAAAAAAAA8/OgoQGQ8z-bE/s1600-h/dapper_ensemble2.png" rel="lightbox[23]"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://bp1.blogger.com/_2BBBfWXV3-w/Rf2KDS-99UI/AAAAAAAAAA8/OgoQGQ8z-bE/s320/dapper_ensemble2.png" alt="" id="BLOGGER_PHOTO_ID_5043338946965009730" border="0" /></a></p>
<p>After selecting &#8220;Save and continue&#8221;, Dapper asks for the newly defined field to be named. In this case, I chose the same name as Pedro (&#8220;ort_geneID&#8221;), just for the hell of it:</p>
<p><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp0.blogger.com/_2BBBfWXV3-w/Rf2KzC-99VI/AAAAAAAAABE/xKAgkEQ791g/s1600-h/dapper_ensemble3.png" rel="lightbox[23]"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://bp0.blogger.com/_2BBBfWXV3-w/Rf2KzC-99VI/AAAAAAAAABE/xKAgkEQ791g/s320/dapper_ensemble3.png" alt="" id="BLOGGER_PHOTO_ID_5043339767303763282" border="0" /></a><br />This process was repeated to create a field for the species name, which I named &#8220;ort_spp&#8221;. Dapper allows &#8216;Fields&#8217; to be grouped into &#8216;Groups&#8217;, so I grouped the &#8220;ort_geneID&#8221; and &#8220;ort_spp&#8221; fields into a group called &#8220;orthologue&#8221;: (<span style="font-style: italic;">data not shown <img src='http://blog.pansapiens.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  </span>).</p>
<p>Now, we save the Dapp. In &#8220;Advanced Options&#8221;, I changed the Ensembl gene ID part of the URL to {geneID}. This tells Dapper to make this part of the URL a query field, so that the user can provide any gene ID they like and have the orthologue results scraped:</p>
<p><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp1.blogger.com/_2BBBfWXV3-w/Rf2LsS-99WI/AAAAAAAAABM/Ck5MQE0nThE/s1600-h/dapper_ensemble4.png" rel="lightbox[23]"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://bp1.blogger.com/_2BBBfWXV3-w/Rf2LsS-99WI/AAAAAAAAABM/Ck5MQE0nThE/s320/dapper_ensemble4.png" alt="" id="BLOGGER_PHOTO_ID_5043340750851274082" border="0" /></a></p>
<p>Finally, we can test the saved Dapp, and retrieve <a href="http://www.dapper.net/RunDapp?dappName=EnsemblOrthologues&#038;v=1&amp;variableArg_0=ENSG00000100347">XML formatted results for a particular gene ID</a>:</p>
<p><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp3.blogger.com/_2BBBfWXV3-w/Rf2NEy-99XI/AAAAAAAAABU/COxrEKCHDlo/s1600-h/dapper_ensemble5.png" rel="lightbox[23]"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://bp3.blogger.com/_2BBBfWXV3-w/Rf2NEy-99XI/AAAAAAAAABU/COxrEKCHDlo/s320/dapper_ensemble5.png" alt="" id="BLOGGER_PHOTO_ID_5043342271269696882" border="0" /></a><br />The gene ID can be changed in the Dapper XML transform URL (http://www.dapper.net/RunDapp?dappName=EnsemblOrthologues&#038;v=1&amp;<br />variableArg_0=<span style="font-weight: bold; color: rgb(255, 0, 0);">ENSG00000100347</span>) to get XML results for orthologues of other human genes.<br />Various other transforms, like a <a href="http://www.dapper.net/transform.php?dappName=EnsemblOrthologues&#038;transformer=CSV&amp;variableArg_0=ENSG00000163599&#038;extraArg_fields%5B%5D=ort_spp&amp;extraArg_fields%5B%5D=ort_geneID">cruft-free CSV version</a> are also possible. Feel free to have a play yourself with my <a href="http://www.dapper.net/dapp-howto-use.php?dappName=EnsemblOrthologues">Ensembl Orthologues Dapp</a> (<span style="font-style: italic;">like I can stop you now ! It&#8217;s public &#038; live &amp; irrevocable</span>).</p>
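<p>If you want to generate these URLs for a batch of genes, a few lines of Python will do it. This sketch only builds the URL strings from the query parameters visible in the links above; it doesn&#8217;t fetch anything, and the function names are my own invention.</p>

```python
from urllib.parse import urlencode

DAPPER_BASE = "http://www.dapper.net/"


def run_dapp_url(gene_id, dapp_name="EnsemblOrthologues"):
    """Build the XML results URL for one Ensembl gene ID."""
    query = urlencode([("dappName", dapp_name), ("v", "1"), ("variableArg_0", gene_id)])
    return DAPPER_BASE + "RunDapp?" + query


def csv_transform_url(gene_id, fields=("ort_spp", "ort_geneID"), dapp_name="EnsemblOrthologues"):
    """Build the CSV transform URL, with one extraArg_fields[] entry per field."""
    params = [("dappName", dapp_name), ("transformer", "CSV"), ("variableArg_0", gene_id)]
    params += [("extraArg_fields[]", field) for field in fields]
    return DAPPER_BASE + "transform.php?" + urlencode(params)
```

<p>Calling <em>run_dapp_url</em> in a loop over a list of gene IDs gives one results URL per gene.</p>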
]]></content:encoded>
			<wfw:commentRss>http://blog.pansapiens.com/2007/03/19/dapper-the-screen-scraper-for-everyone/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
	<creativeCommons:license>http://creativecommons.org/publicdomain/zero/1.0/</creativeCommons:license>
	</item>
		<item>
		<title>An Amazon EC2 cluster for BLAST searching ?</title>
		<link>http://blog.pansapiens.com/2007/03/04/an-amazon-ec2-cluster-for-blast-searching/</link>
		<comments>http://blog.pansapiens.com/2007/03/04/an-amazon-ec2-cluster-for-blast-searching/#comments</comments>
		<pubDate>Sun, 04 Mar 2007 04:47:00 +0000</pubDate>
		<dc:creator>Andrew Perry</dc:creator>
				<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[two-point-oh]]></category>

		<guid isPermaLink="false">http://blog.pansapiens.com/2007/03/04/an-amazon-ec2-cluster-for-blast-searching/</guid>
		<description><![CDATA[I&#8217;ve just been reading about the new Amazon Elastic Compute Cloud (EC2), which is essentially a pay-as-you go cluster, based on Xen virtual machine images. You can create and upload your own image using their tools, or use one of the pre-rolled GNU/Linux distro images already shared by other users of the EC2 system.
While it [...]]]></description>
			<content:encoded><![CDATA[<p><span><span>I&#8217;ve just been reading about the new <a href="http://www.amazon.com/ec2">Amazon Elastic Compute Cloud (EC2)</a>, which is essentially a pay-as-you-go cluster, based on <a href="http://www.cl.cam.ac.uk/research/srg/netos/xen/">Xen</a> virtual machine images. You can create and upload your own image using their tools, or use one of the pre-rolled GNU/Linux distro images already shared by other users of the EC2 system.</p>
<p>While it seems aimed at web service &#8216;startups&#8217; that want a competitively priced hosting option which can quickly scale, I thought I&#8217;d attempt to figure out the economics of using something like this for some scientific computing. Would it be a cheap / easy / reliable alternative to the home-rolled Beowulf cluster ?</p>
<p>The advertised specs per node are: 1.7 GHz x86 (Xeon) processor, 1.75 GB of RAM, 160 GB of local disk, and 250 Mb/s of network bandwidth. Nodes cost US$0.10 per instance hour. Bandwidth between nodes within EC2 is free, as is bandwidth from EC2 nodes to the <a href="http://www.amazon.com/s3">S3 storage service</a>; however, Internet traffic costs US$0.20 / GB.</p>
<p>First, let&#8217;s think about BLAST, the bread-and-butter sequence search tool for many bioinformaticians. Now as far as I understand (and from my own experience), NCBI BLAST works best when the entire database can be cached in RAM &#8230; otherwise lots of disk thrashing ensues and the search time is bounded by the disk I/O. The NCBI &#8216;nr&#8217; (non-redundant) protein sequence database is currently about 3.3 GB (and growing) &#8230; so it won&#8217;t fit in RAM on one of these EC2 nodes. While I don&#8217;t mind paying to thrash Amazon&#8217;s servers&#8217; disks a little, it will slow the search down. However, if we use <a href="http://www.mpiblast.org/">mpiBLAST</a> the database gets split into chunks evenly distributed across the nodes, so if we were to use 3 nodes, the &#8216;nr&#8217; database would be split into 1.1 GB chunks and should fit in the RAM of each node (leaving ~600 MB RAM for the OS and other overhead). However, now the speed of the network interconnects between nodes matters, since we are no longer computing on a single node &#8230; but from <a href="http://www.linuxjournal.com/article/7936">what</a> <a href="http://www.mpiblast.org/downloads/pubs/cwce03.pdf">I&#8217;ve read</a> 250 Mb/s should be enough for mpiBLAST to run such that it is not bounded by the internode communication speed. (Actually EC2 instances have shared gigabit interconnects, but since several instances might share the same ethernet card, gigabit performance &#8216;per node&#8217; can&#8217;t be expected. I guess the 250 Mb/s figure means that there are probably four EC2 instances per physical server/ethernet card ??). So with 3 nodes, this would cost US$0.30 per hour to run a (scalable) BLAST service. The performance should scale better than linearly with the number of nodes added. 
If you need the job done faster, just resplit the database and create more EC2 VM instances (mpiBLAST should work with the Portable Batch System to do this transparently, but I guess this would require some code to interface PBS with the control of your EC2 &#8216;elastic cluster&#8217;). It would only cost US$0.66 to upload the database to EC2 in the first place, and about US$0.50 to store it in S3 per month. This seems well within reach of many academic departments, and would really suit &#8216;sporadic&#8217; users with occasional big jobs &#8230;.</p>
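<p>The back-of-envelope numbers above can be reproduced with a few lines of Python. The node specs and prices are the advertised figures quoted earlier; treating ~600 MB per node as a fixed reserve for the OS is my own simplifying assumption, and the function name is made up.</p>

```python
import math

NODE_RAM_GB = 1.75          # advertised RAM per EC2 node
OS_OVERHEAD_GB = 0.6        # assumed reserve for the OS and other overhead
NODE_CENTS_PER_HOUR = 10    # US$0.10 per instance hour
TRANSFER_CENTS_PER_GB = 20  # US$0.20 per GB of Internet traffic


def plan(db_size_gb):
    """Estimate how many nodes keep each database chunk in RAM, and the cost."""
    usable_gb = NODE_RAM_GB - OS_OVERHEAD_GB
    nodes = math.ceil(db_size_gb / usable_gb)
    return {
        "nodes": nodes,
        "chunk_gb": round(db_size_gb / nodes, 2),
        "cents_per_hour": nodes * NODE_CENTS_PER_HOUR,
        "upload_cents": round(db_size_gb * TRANSFER_CENTS_PER_GB),
    }


print(plan(3.3))
```

<p>For the 3.3 GB database this gives 3 nodes, 1.1 GB chunks, 30 cents per hour and a one-off 66 cents for the upload, matching the figures above. The monthly S3 storage cost is left out because the storage price per GB isn&#8217;t among the advertised figures quoted here.</p>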
<p>Now for applications like molecular dynamics simulations (MD) (ie, GROMACS, NAMD, CHARMM etc etc), a lot more internode communication bandwidth is required. Looking at these <a href="http://biowulf.nih.gov/apps/gromacs/bench-3.3.1.html">benchmarks for GROMACS</a>, it looks like things should scale nicely to two or four EC2 nodes, but after that the scaling would probably drop off, due to the less-than-gigabit ethernet. That doesn&#8217;t mean you won&#8217;t get more speed for more nodes, just that at some point adding more nodes will give greatly diminishing returns. While I&#8217;m speculating here, I&#8217;d say it&#8217;s probably better to leave this type of number crunching to the &#8216;real&#8217; supercomputers or home-rolled purpose-built clusters; EC2 may not be worth the cost/effort here for big long running calculations. <a href="http://developer.amazonwebservices.com/connect/search.jspa?q=mpi">Others are using MPI applications on EC2</a> already though, and I&#8217;d love to be proved wrong.</p>
<p>One of the current difficulties for running database-driven web applications on EC2 is that the virtual machine instances do not have persistent storage &#8230; either a connection to a database running somewhere else needs to be used, or the precious data needs to be moved off each EC2 instance before shutting the server down. If it crashes before shifting data off &#8230; goodbye database. I&#8217;m sure Amazon will come up with a solution to this, since it seems often requested on their forums. Having non-persistent data wouldn&#8217;t be such a big deal for mpiBLAST &#8230; the servers should rarely crash, the results could be stored in Amazon S3 or sent to a remote machine as they arrive, and the sequence database can also be stored in S3 (for about US$0.50 per month &#8230; dirt cheap). There are already a few <a href="http://fuse.sourceforge.net/">FUSE</a> S3fs implementations floating about (like <a href="http://code.google.com/p/s3fs-fuse/">s3fs-fuse</a>) &#8230; I haven&#8217;t tried them yet, but essentially they should allow S3 storage to be mapped transparently to the Linux filesystem. My guess is it would be a bad idea to host a large MySQL database file on S3 using s3fs-fuse (there is a 5 GB file size limit for starters) &#8230; but for lots of little-ish files, as are often generated by bioinformatics software, s3fs-fuse might just do the trick.</p>
<p>Whew ! .. Now I&#8217;m really itching to spend some spare change and a few hours to see if running mpiBLAST on EC2 is as good an idea as it sounds.</p>
<p><span style="font-style: italic;">Doh ! Just tried to set up an account and the Amazon EC2 limited beta is currently full &#8230; I&#8217;ll have to wait .. <img src='http://blog.pansapiens.com/wp-includes/images/smilies/icon_sad.gif' alt=':(' class='wp-smiley' /> .</span></p>
<p><span style="font-style: italic;">A few additional links I was also looking at while writing this post .. w</span><span style="font-style: italic;">ow &#8230; someone has some &#8216;issues&#8217; with the NCBI BLAST implementation: <a href="http://blast.wustl.edu/blast/Memory.html">http://blast.wustl.edu/blast/Memory.html</a></span><br /><span style="font-style: italic;">and <a href="http://blast.wustl.edu/blast/cparms.html">http://blast.wustl.edu/blast/cparms.html</a></span></p>
<p></span></span></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.pansapiens.com/2007/03/04/an-amazon-ec2-cluster-for-blast-searching/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
	<creativeCommons:license>http://creativecommons.org/publicdomain/zero/1.0/</creativeCommons:license>
	</item>
		<item>
		<title>First Online EMBL PhD Symposium</title>
		<link>http://blog.pansapiens.com/2006/11/29/first-online-embl-phd-symposium/</link>
		<comments>http://blog.pansapiens.com/2006/11/29/first-online-embl-phd-symposium/#comments</comments>
		<pubDate>Tue, 28 Nov 2006 22:17:00 +0000</pubDate>
		<dc:creator>Andrew Perry</dc:creator>
				<category><![CDATA[meetings]]></category>
		<category><![CDATA[science]]></category>
		<category><![CDATA[two-point-oh]]></category>

		<guid isPermaLink="false">http://blog.pansapiens.com/2006/11/29/first-online-embl-phd-symposium/</guid>
		<description><![CDATA[This looks interesting &#8230; the First Online EMBL PhD Symposium, a sort of &#8216;online&#8217; conference for the life sciences. Everybody with a scientific background is invited to participate. Registration is free.
The programme  (Career Development Session, Omics Session / Systems Biology, Scientific Communication 2.0 and Participant&#8217;s Contributions) and speakers list makes it look sort of [...]]]></description>
			<content:encoded><![CDATA[<p>This looks interesting &#8230; the <a href="http://onlinesymposium.predocs.org/">First Online EMBL PhD Symposium</a>, a sort of &#8216;online&#8217; conference for the life sciences. Everybody with a scientific background is invited to participate. Registration is free.</p>
<p>The <a href="http://onlinesymposium.predocs.org/media">programme</a>  <span class="contenttype-folder"><span class="state-published visualIconPadding">(Career Development Session, Omics Session / Systems Biology</span>, <span class="state-published visualIconPadding">Scientific Communication 2.0</span> </span>and <span class="contenttype-folder"><span class="state-published visualIconPadding">Participant&#8217;s Contributions) </span></span>and <a href="http://onlinesymposium.predocs.org/media/overview-of-the-speakers/">speakers</a> list makes it look sort of like a &#8220;Biology 2.0&#8221; conference.</p>
<p><span class="contenttype-folder"><span class="state-published visualIconPadding"></span></span><span class="documentByLine"></span>Apart from the (possible) IRC sessions, hopefully the fact that everything is stored as video/audio + comments on their content management system means the &#8216;inconvenient&#8217; timezone in Australia won&#8217;t limit my participation too much.</p>
<p>(via the worldwide bioinformatics cabal <img src='http://blog.pansapiens.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> , <a href="http://nsaunders.wordpress.com/2006/11/28/embl-online-phd-symposium/">Neil</a> via <a href="http://pbeltrao.blogspot.com/2006/11/embl-online-phd-symposium-via-notes.html">Pedro</a>, <a href="http://nftb.net/?p=64">Roland</a> and <a href="http://www.ghastlyfop.com/blog/2006/11/embls-online-phd-symposium.html">Stew</a>)</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.pansapiens.com/2006/11/29/first-online-embl-phd-symposium/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	<creativeCommons:license>http://creativecommons.org/publicdomain/zero/1.0/</creativeCommons:license>
	</item>
	</channel>
</rss>
