Archive for the 'bioinformatics' Category

texshade: useful, and still kickin’

I’ve been looking at doing an analysis with some protein subfamily sequence logos, using Eric Beitz’s texshade. While it’s a little strange that it does the actual analysis part (rather than just the rendering) using LaTeX, it’s the only implementation of the method I know of, and it beats reimplementing it from the paper.

Although it was published in 2006 (and earlier in 2000), with the original URLs now dead, I noticed the latest update for the version of texshade in CTAN (v1.18) was on 15th of April, 2008 … ie texshade was updated just 14 days ago !

It happens all too often that published bioinformatics tools cease to be updated or even disappear from the Web not long after the peer-reviewed publication is released. Kudos to Eric for not abandoning his software.

Announcing ResolveRef on Google App Engine

About two weeks ago, tipped off by Neil, I heard about Google App Engine. I managed to get a beta account, and I’ve finally had a chance to do something (hopefully) useful with it.

In the absence of any quickly achievable ideas for a bioinformatics app, I ported over the OpenRef application I wrote on top of TurboGears a few months back.

Just like the original, the new app, ResolveRef, is essentially a RESTful way of doing PubMed queries.

The Biosciences in Google’s Summer of Code

The Google Summer of Code project participants have been selected for 2008. I scanned the list to see how projects specifically aimed at the biosciences and bioinformatics fared:

  • GenMAPP (Gene Map Annotator and Pathway Profiler), a tool for visualizing gene expression data on top of graphical representations of biological pathways.
  • The NESCent (National Evolutionary Synthesis Centre) Phyloinformatics project has a range of potential projects to do with phylogenetic analysis, covering things like phyloXML integration with BioPerl and BioRuby, phyloinformatics web services and tree analysis using the MapReduce algorithm (with Hadoop).
  • OMII-UK, which covers a range of tools including the Taverna Workbench for workflow design and execution.
  • Also participating is OpenMRS, a medical record system aimed at developing countries.

There are also at least two platforms for cluster, parallel or grid computing on the list; I spotted the Globus Toolkit and OAR, but there are probably a few more in that broad category (eg, OMII-UK oversees a bunch of Grid related projects too).

It’s worth noting that I’ve ignored a bunch of really important pieces of software that are less field-specific, but are actually lower level components of the platforms critical for most large bioinformatics projects. Things like Python, Perl, R, various Open Source databases, and collaboration tools like wikis (MoinMoin) and CMSs (eg Drupal) are also participating.

I don’t think coding for bioinformatics applications is as attractive to students as working on some of the other “sexier” projects available (eg the SecondLife client, or the Apache Webserver), but kudos to Google for letting a few bioinformatics tools into the fray. Hopefully the students who hack on them learn something, and hone their coding skills (you never know, they may even help improve these tools too :) ).

An OpenRef implementation

Recently, Noel O’Boyle of Noel O’Blog proposed a new RESTful scheme for resolving publications, as an alternative to using DOI or PubMed ID (PMID) identifiers. Essentially, this would allow resolution of a publication like:

EL Willighagen, NM O’Boyle, H Gopalakrishnan, D Jiao, R Guha, C Steinbeck and D J Wild Userscripts for the Life Sciences BMC Bioinformatics 2007, 8, 487.

Using something like this:

openref://BMC Bioinformatics/2007/8/487

or

http://dx.openref.org/BMC Bioinformatics/2007/8/487

The scheme simply uses the journal title, publication year, volume and first page number. Read his post for a more detailed explanation.
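
As a rough illustration of how such an identifier could be derived from citation fields and split back apart, here is a minimal Python 3 sketch. The function names `make_openref` and `parse_openref` are hypothetical (they are not part of Noel’s proposal), and treating the volume as optional is an assumption.

```python
def make_openref(journal, year, page, volume=None):
    """Build an openref:// identifier from basic citation fields."""
    parts = [journal, str(year)]
    if volume is not None:
        parts.append(str(volume))
    parts.append(str(page))
    return "openref://" + "/".join(parts)


def parse_openref(ref):
    """Split an openref:// identifier back into its citation fields."""
    prefix = "openref://"
    if not ref.startswith(prefix):
        raise ValueError("not an openref identifier: %r" % ref)
    parts = ref[len(prefix):].split("/")
    if len(parts) == 3:
        journal, year, page = parts
        volume = None
    elif len(parts) == 4:
        journal, year, volume, page = parts
    else:
        raise ValueError("expected Journal/Year[/Volume]/Page: %r" % ref)
    return {"journal": journal, "year": year, "volume": volume, "page": page}


ref = make_openref("BMC Bioinformatics", 2007, 487, volume=8)
print(ref)
print(parse_openref(ref))
```

A journal title containing a "/" would break this naive split, which is one of the corner cases a fleshed-out scheme would need to address.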

While I think the scheme needs a little fleshing out, the idea is nice, since, as Noel highlights, the “OpenRef” URL can be derived from the typical citation style used by academics, while the DOI and the PMID cannot (although the DOI is often printed on the journal article these days, it’s generally not used in a reference list at the end of a paper). I’m sure there are lots of corner cases that could ultimately work to over-complicate this scheme and force it to lose its simplicity … but at the moment it remains appealing.

It dawned upon me that an OpenRef resolver would actually be pretty straightforward to write with Turbogears (or just straight CherryPy), and a bit of Biopython EUtils magic to search PubMed.

So, without further ado … here’s the essential code for my quick implementation. It requires that you have installed Turbogears and made a quickstart project with tg-admin (see the Turbogears docs on how to do this). The code below should be added to the Root class in controllers.py, in addition to the autogenerated code that tg-admin makes for you:

from turbogears import controllers, expose, flash, redirect
from model import *

# from openref import model
from Bio import EUtils
from Bio.EUtils import DBIdsClient

from xml.dom import minidom
import urllib

class Root(controllers.RootController):

  # we use *args and **kw here to accept a variable number of
  # arguments and keyword arguments
  # (eg Journal/Year/Page or Journal/Year/Volume/Page)
  # turbogears passes arguments to the function from the URL like
  # http://webapp:8080/arg1/arg2/arg3?keyword=stuff&keyword2=morestuff
  @expose()
  def openref(self, journal, *args, **kw):
   
      # deals with openref://Journal/Year/Page
      # (no volume argument)
      if len(args) == 2:
          year, page = args
          query = '"%s"[TA] AND "%s"[DP] AND "%s"[PG]' % \
                    (journal, year, page)
      # deal with openref://Journal/Year/Volume/Page
      # (including volume number)
      elif len(args) == 3:
          year, volume, page = args
          query = '"%s"[TA] AND "%s"[DP] AND "%s"[VI] AND "%s"[PG]' % \
                    (journal, year, volume, page)
   
      # search NCBI PubMed with EUtils
      client = DBIdsClient.DBIdsClient()
      result = client.search(query, retmax = 1)
      res = result[0].efetch(retmode = "xml", rettype = "xml").read()
   
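      # get doi link from eutils XML result, example:
      #
      #  <ArticleIdList>
      #    <ArticleId IdType="pii">S0022-2836(07)01626-9</ArticleId>
      #    <ArticleId IdType="doi">10.1016/j.jmb.2007.12.021</ArticleId>
      #    <ArticleId IdType="pubmed">18187149</ArticleId>
      #  </ArticleIdList>
      #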
      xml_doc = minidom.parseString(res)
      for tag in xml_doc.getElementsByTagName("ArticleId"):
          if tag.getAttribute("IdType") == "doi":
              doi = tag.childNodes[0].data
          if tag.getAttribute("IdType") == "pubmed":
              pmid = tag.childNodes[0].data
   
      # make the DOI resolution URL
      doi_url = urllib.basejoin("http://dx.doi.org/", doi)
      # make the Entrez Pubmed resolution URL
      # (adjacent string literals are joined, so no stray whitespace
      # ends up inside the URL)
      pubmed_url = ("http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?"
                    "cmd=Retrieve&db=PubMed&"
                    "list_uids=%s&dopt=Abstract") % (pmid)
      # and lets not forget a URL to HubMed
      hubmed_url = "http://www.hubmed.org/display.cgi?uids=%s" % (pmid)
   
      # decide where to redirect to based on "?redirect=xxx" argument
      # (default to the DOI if it is missing or not recognised)
      url = doi_url
      if kw.get("redirect") == "pubmed":
          url = pubmed_url
      elif kw.get("redirect") == "hubmed":
          url = hubmed_url
           
      raise redirect(url)
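
The query-building step can also be written as a small pure function, with the search expressed as a plain NCBI E-utilities `esearch` request. The sketch below (Python 3, standard library only) builds the query string and the URL but does not send anything; `build_query` and `esearch_url` are hypothetical helper names, and the field tags are the same ones used in the code above.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(journal, *args):
    """Turn Journal/Year/Page or Journal/Year/Volume/Page into a PubMed query."""
    if len(args) == 2:
        year, page = args
        fields = [(journal, "TA"), (year, "DP"), (page, "PG")]
    elif len(args) == 3:
        year, volume, page = args
        fields = [(journal, "TA"), (year, "DP"), (volume, "VI"), (page, "PG")]
    else:
        raise ValueError("expected Year/Page or Year/Volume/Page")
    return " AND ".join('"%s"[%s]' % (value, tag) for value, tag in fields)


def esearch_url(query, retmax=1):
    """Build (but do not send) an E-utilities esearch URL for PubMed."""
    params = {"db": "pubmed", "term": query, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


query = build_query("BMC Bioinformatics", "2007", "8", "487")
print(query)
print(esearch_url(query))
```

Keeping the query construction separate from the controller makes it easy to test without a web framework or a network connection.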

 

Since this is seat-of-the-pants Friday arvo coding, there is very little in the way of error handling or exceptions in the above code. I might add some niceties like that later. If the PubMed query constructed from the URL gives no PubMed hit(s), or the PubMed result doesn’t contain a DOI, you’ll get some ugly and inelegant errors.
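
One way to soften those failure modes would be to look for the IDs defensively and fall back to PubMed when no DOI is present. The following is a minimal Python 3 sketch, not the code running on the demo site; `extract_ids`, `choose_target` and the trimmed `SAMPLE` record are hypothetical and hand-written for illustration.

```python
from xml.dom import minidom

# Trimmed, hand-written stand-in for the relevant part of a PubMed XML record.
SAMPLE = """<PubmedArticle><PubmedData><ArticleIdList>
<ArticleId IdType="pii">S0022-2836(07)01626-9</ArticleId>
<ArticleId IdType="doi">10.1016/j.jmb.2007.12.021</ArticleId>
<ArticleId IdType="pubmed">18187149</ArticleId>
</ArticleIdList></PubmedData></PubmedArticle>"""

PUBMED_URL = ("http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?"
              "cmd=Retrieve&db=PubMed&list_uids=%s&dopt=Abstract")


def extract_ids(xml_text):
    """Return {'doi': ..., 'pubmed': ...}, with None for any missing ID."""
    ids = {"doi": None, "pubmed": None}
    doc = minidom.parseString(xml_text)
    for tag in doc.getElementsByTagName("ArticleId"):
        id_type = tag.getAttribute("IdType")
        if id_type in ids and tag.firstChild is not None:
            ids[id_type] = tag.firstChild.data.strip()
    return ids


def choose_target(ids, preferred="doi"):
    """Pick a redirect URL, falling back to PubMed when there is no DOI."""
    if preferred == "doi" and ids["doi"]:
        return "http://dx.doi.org/" + ids["doi"]
    if ids["pubmed"]:
        return PUBMED_URL % ids["pubmed"]
    return None  # nothing usable: the caller should report "not found"


ids = extract_ids(SAMPLE)
print(ids)
print(choose_target(ids))
```

A controller built on this could return a "not found" page whenever `choose_target` gives `None`, instead of letting an undefined variable surface as a traceback.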

Assuming that you run this Turbogears app locally on the default port 8080, you should be able to get redirected to the Willighagen et al Userscripts paper by going to:

http://localhost:8080/openref/BMC Bioinformatics/2007/8/487

(Firefox will properly escape the space character in the URL .. I’m not sure what other browsers may do).
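
If you are building these URLs from a script rather than typing them into a browser, it is safer to percent-encode the path yourself. A minimal Python 3 sketch using only the standard library (the `path` value is just the example from above):

```python
from urllib.parse import quote, unquote

path = "/openref/BMC Bioinformatics/2007/8/487"
encoded = quote(path)  # "/" is left alone by default; the space becomes %20
print(encoded)  # /openref/BMC%20Bioinformatics/2007/8/487
print(unquote(encoded) == path)  # True
```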

By default you will be redirected to wherever dx.doi.org decides to send you (which is often the journal article at the publisher’s site, but there is no rule that says this must be the case), but you can also choose to be redirected to PubMed or HubMed using:

http://localhost:8080/openref/BMC Bioinformatics/2007/8/487?redirect=pubmed
or
http://localhost:8080/openref/BMC Bioinformatics/2007/8/487?redirect=hubmed

I’ve got a working example running at http://openref.pansapiens.com/ if anyone would like to try it out (eg, try http://openref.pansapiens.com/openref/BMC Bioinformatics/2007/8/487 ). No promises that it will stay up for long (Turbogears apps seem to die quite a lot on my cheap little virtual hosting account … I’m using supervisor2 now, which may help keep things more available).

It should be stressed that this, as is, is only a quick and dirty hack to demonstrate the proof of concept. It’s really only translating the ‘paths’ in the URLs provided by the user into PubMed queries, and uses the existing DOI infrastructure to ultimately redirect the user to the article; in reality I’d expect that an “OpenRef” resolver would have to be more independent and sophisticated than this. I can’t imagine who would maintain a separate OpenRef database in order to make it independent of DOIs and PubMed.

Unfortunately the domain openref.org has already been registered .. and not by Noel. Maybe it’s already time for a new name for this fledgling resolution scheme :) ??

ARIA version 2.2 released

I don’t usually post about NMR (Nuclear Magnetic Resonance) and structural biology related stuff, but I’ve always intended to. In this post I’m pulling out all the stops on specialist lingo and assumed background knowledge, so hopefully it isn’t too incomprehensible to the non-structural biology crowd :).

ARIA version 2.2 has been released in the last few weeks. ARIA is an automated NOE assignment and structure calculation package, which (in theory) takes some of the pain and slowness out of producing protein (and DNA and/or RNA) structures from Nuclear Magnetic Resonance data. I’ll say up front: I haven’t tried this version yet, but some of the improvements look exciting.

Here are two new features worth noting … followed by what I think it all means:

  • The assignment method has been improved with the introduction of a network-anchoring analysis (Herrmann et al., 2002) for filtering of the initial assignments.
  • The integration of the CCPN has been completed. The imported CCPN distance constraints lists can enter the ARIA process for calibration, violation analysis and network-anchoring analysis. The final constraint lists can be exported as well.

In the past I have done some quick and dirty tests comparing the quality of protein structures produced using ARIA 2.1 vs. Peter Güntert’s CYANA 1.0.7 and 2.1, using the exact same NMR peak input lists (with slightly noisy data containing a number of incorrectly picked peaks). CYANA always won hands down, assigning more NOE crosspeaks correctly and producing an ensemble of model structures with much lower RMSD and generally better protein structure quality scores (ie using pretty much any decent pairwise pseudo-energy potential, and Procheck). Also, ARIA produced ‘knotted’ structures which were almost certainly incorrect, while CYANA did not. Other postdocs and students in my former lab had done similar independent tests with ARIA 1.2 vs. CYANA 1.0.7, and had come to similar conclusions.

The disclaimer: It should be noted here that assessment of the quality of an ensemble of NMR structure coordinates can be problematic, and is really the topic of another long post (and probably tens if not hundreds of peer-reviewed journal articles). So saying “CYANA version X is better than ARIA version X” based on the RMSD of the final calculated ensemble is a bit unfair … in fact using RMSD of the ensemble to gauge structure quality is just plain wrong in this context. In my (unpublished, non-peer-reviewed) tests, it is possible that ARIA could be producing high RMSD but essentially ‘correct’ structures, while CYANA could be producing tightly defined but ‘incorrect’ structures, but I doubt it. The gap between the output of each program was wide enough to suggest that under real-world conditions where the input peak list contained a number of ‘noise’ peaks, ARIA was failing to give a set of consistent solutions (probably due to lack of NOE assignments), while CYANA was giving a set of tightly defined structures (which may or may not have represented the ‘correct’ solution). Other evaluations (protein structure quality measures, Procheck, comparison to known structures of similar proteins) indicated that the CYANA structures were not grossly ‘incorrect’, so I’d say CYANA was just giving a better defined (ie lower ensemble RMSD) set of plausible solutions.

My gut feeling is that ARIA 2.2 will perform much better than past versions, due to one key feature that has been ‘borrowed’ from CYANA: the introduction of a network-anchoring analysis. In a nutshell, network-anchoring scores essentially weight distance constraints (or NOE assignments) based on how ‘connected’ that constraint is within the graph formed by other constraints. This means that in effect a single, isolated constraint pulling two residues on opposite sides of a protein together is down-weighted, while if multiple constraints link those residues (or their neighboring residues) then those constraints are considered more trusted and hence weighted more heavily. For better or worse (usually better), this score simulates what the human NMR spectroscopist would do when assigning NOE crosspeaks manually … usually two residues in contact will show multiple NOE crosspeaks connecting them and involve multiple different nuclei, whereas a single lonely NOE between two nuclei which are distant from each other in the primary protein sequence is heavily scrutinized and regarded with suspicion since it is likely to be mis-assigned. I’m very keen to test ARIA 2.2 on my old data set and see if I’m actually right (I may be able to try it with network anchoring turned on, and off, and see just what sort of contribution that score is making).
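
To make the idea concrete, here is a toy sketch in Python 3. This is emphatically not the scoring used by ARIA or CYANA (see Herrmann et al., 2002 for the real method); the function name `support_scores`, the ±1 residue window, the simple support count and the example constraints are all assumptions chosen purely for illustration.

```python
def support_scores(constraints, window=1):
    """Count, for each residue-pair constraint, how many *other* constraints
    link residues within `window` positions of the same two residues."""
    scores = {}
    for idx, (i, j) in enumerate(constraints):
        support = 0
        for other_idx, (a, b) in enumerate(constraints):
            if other_idx == idx:
                continue
            # treat constraints as undirected: try both orientations
            direct = abs(a - i) <= window and abs(b - j) <= window
            swapped = abs(b - i) <= window and abs(a - j) <= window
            if direct or swapped:
                support += 1
        scores[(i, j)] = support
    return scores


# Hypothetical residue-pair constraints: a cluster around residues 10 and 45,
# plus one isolated long-range constraint between residues 3 and 80.
constraints = [(10, 45), (10, 46), (11, 45), (9, 44), (3, 80)]
for pair, score in support_scores(constraints).items():
    print(pair, score)
```

In this made-up example the isolated (3, 80) constraint gets a support of 0 while (10, 45) gets 3, which is the kind of contrast a network-anchoring style filter would use to down-weight the lonely constraint.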

Another completed feature, the integration between ARIA and the CCPN libraries/analysis package, should also be a big plus. I haven’t used the CCPN analysis software yet, but a few years ago I wrote some code to help make CYANA and the Sparky NMR assignment program work together better. The result was functional, but very hackish (and I’m probably the only person in the world who understands how it was intended to be used, since I still haven’t got around to writing any documentation. Naughty, naughty). CCPN + ARIA may turn out to be the better option for spectral analysis and structure calculation in the future, as opposed to my currently preferred Sparky + CYANA combination.

I’m really itching to find a good reason to do an NMR structure project now … back to work !!