Archive for the 'bioinformatics' Category

An OpenRef implementation

Recently, Noel O’Boyle of Noel O’Blog proposed a new RESTful scheme for resolving publications, as an alternative to using DOI or PubMed ID (PMID) identifiers. Essentially, this would allow resolution of a publication like:

EL Willighagen, NM O’Boyle, H Gopalakrishnan, D Jiao, R Guha, C Steinbeck and D J Wild Userscripts for the Life Sciences BMC Bioinformatics 2007, 8, 487.

Using something like this:

openref://BMC Bioinformatics/2007/8/487

or

http://dx.openref.org/BMC Bioinformatics/2007/8/487

Simply using the journal title, publication year, volume and first page number. Read his post for a more detailed explanation.
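
To make the derivation concrete, here is a minimal sketch (Python 3, standard library only) of how such a URL could be assembled from the parts of a citation. The function name `openref_url` and its parameters are my own illustration, not part of Noel's proposal, and dx.openref.org is just the hypothetical resolver address used in the example above.

```python
from urllib.parse import quote

def openref_url(journal, year, page, volume=None, base="http://dx.openref.org/"):
    """Build an OpenRef-style URL from the parts of a typical citation."""
    parts = [journal, year]
    if volume is not None:
        parts.append(volume)
    parts.append(page)
    # Percent-encode each path segment so spaces in journal names are safe.
    return base + "/".join(quote(str(p), safe="") for p in parts)

print(openref_url("BMC Bioinformatics", 2007, 487, volume=8))
```

The volume is optional here, to cover both the Journal/Year/Page and Journal/Year/Volume/Page forms.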

While I think the scheme needs a little fleshing out, the idea is nice since, as Noel highlights, the “OpenRef” URL can be derived from the typical citation style used by academics, while the DOI and the PMID cannot (although the DOI is often printed on the journal article these days, it’s generally not used in a reference list at the end of a paper). I’m sure there are lots of corner cases that could ultimately over-complicate this scheme and force it to lose its simplicity … but at the moment it remains appealing.

It dawned upon me that an OpenRef resolver would actually be pretty straightforward to write with Turbogears (or just straight CherryPy), and a bit of Biopython EUtils magic to search PubMed.

So, without further ado … here’s the essential code for my quick implementation. It requires that you have installed Turbogears and made a quickstart project with tg-admin (see the Turbogears docs on how to do this). The code below should be added to the Root class in controllers.py, in addition to the autogenerated code that tg-admin makes for you:

from turbogears import controllers, expose, flash, redirect
from model import *

# from openref import model
from Bio import EUtils
from Bio.EUtils import DBIdsClient

from xml.dom import minidom
import urllib

class Root(controllers.RootController):

  # we use *args and **kw here to accept a variable number of
  # arguments and keyword arguments
  # (eg Journal/Year/Page or Journal/Year/Volume/Page)
  # turbogears passes arguments to the function from the URL like
  # http://webapp:8080/arg1/arg2/arg3?keyword=stuff&keyword2=morestuff
  @expose()
  def openref(self, journal, *args, **kw):
   
      # deals with openref://Journal/Year/Page
      # (no volume argument)
      if len(args) == 2:
          year, page = args
          query = '"%s"[TA] AND "%s"[DP] AND "%s"[PG]' % \
                    (journal, year, page)
      # deal with openref://Journal/Year/Volume/Page
      # (including volume number)
      elif len(args) == 3:
          year, volume, page = args
          query = '"%s"[TA] AND "%s"[DP] AND "%s"[VI] AND "%s"[PG]' % \
                    (journal, year, volume, page)
   
      # search NCBI PubMed with EUtils
      client = DBIdsClient.DBIdsClient()
      result = client.search(query, retmax = 1)
      res = result[0].efetch(retmode = "xml", rettype = "xml").read()
   
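      # get doi link from eutils XML result, example:
      #  <ArticleIdList>
      #    <ArticleId IdType="pii">S0022-2836(07)01626-9</ArticleId>
      #    <ArticleId IdType="doi">10.1016/j.jmb.2007.12.021</ArticleId>
      #    <ArticleId IdType="pubmed">18187149</ArticleId>
      #  </ArticleIdList>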
      xml_doc = minidom.parseString(res)
      for tag in xml_doc.getElementsByTagName("ArticleId"):
          if tag.getAttribute("IdType") == "doi":
              doi = tag.childNodes[0].data
          if tag.getAttribute("IdType") == "pubmed":
              pmid = tag.childNodes[0].data
   
      # make the DOI resolution URL
      doi_url = urllib.basejoin("http://dx.doi.org/", doi)
      # make the Entrez Pubmed resolution URL
      pubmed_url = ("http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?"
                    "cmd=Retrieve&db=PubMed&"
                    "list_uids=%s&dopt=Abstract") % pmid
      # and let's not forget a URL to HubMed
      hubmed_url = "http://www.hubmed.org/display.cgi?uids=%s" % pmid
   
      # decide where to redirect to based on "?redirect=xxx" argument;
      # fall back to the DOI URL if it is missing or not recognised
      targets = {"doi": doi_url,
                 "pubmed": pubmed_url,
                 "hubmed": hubmed_url}
      url = targets.get(kw.get("redirect"), doi_url)
           
      raise redirect(url)

 

Since this is seat-of-the-pants Friday arvo coding, there is very little in the way of error handling or exceptions in the above code. I might add some niceties like that later. If the PubMed query constructed from the URL gives no PubMed hits, or the PubMed result doesn’t contain a DOI, you’ll get some ugly and inelegant errors.
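
As a rough idea of what those niceties might look like, here is a small standalone sketch (Python 3, standard library only) that pulls the DOI and PMID out of the XML and returns None for whichever one is missing, rather than blowing up later. The helper name `extract_article_ids` and the trimmed-down sample XML are hypothetical; the real EUtils response contains a lot more than this fragment.

```python
from xml.dom import minidom

def extract_article_ids(xml_text):
    """Return {'doi': ..., 'pubmed': ...}; a value is None when the ID is absent."""
    ids = {"doi": None, "pubmed": None}
    doc = minidom.parseString(xml_text)
    for tag in doc.getElementsByTagName("ArticleId"):
        id_type = tag.getAttribute("IdType")
        # Skip empty elements and ID types we do not care about.
        if id_type in ids and tag.firstChild is not None:
            ids[id_type] = tag.firstChild.data.strip()
    return ids

# Hypothetical, heavily trimmed fragment of a PubMed XML record.
SAMPLE = """<ArticleIdList>
  <ArticleId IdType="doi">10.1016/j.jmb.2007.12.021</ArticleId>
  <ArticleId IdType="pubmed">18187149</ArticleId>
</ArticleIdList>"""

ids = extract_article_ids(SAMPLE)
if ids["doi"] is None:
    print("No DOI in the PubMed record; fall back to the PubMed page")
else:
    print("http://dx.doi.org/" + ids["doi"])
```

The controller could then redirect to PubMed when there is no DOI, or return a plain "not found" page when the search gives no hits at all.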

Assuming that you run this Turbogears app locally on the default port 8080, you should be able to get redirected to the Willighagen et al Userscripts paper by going to:

http://localhost:8080/openref/BMC Bioinformatics/2007/8/487

(Firefox will properly escape the space character in the URL .. I’m not sure what other browsers may do).

By default you will be redirected to wherever dx.doi.org decides to send you (which is often the journal article at the publisher’s site, but there is no rule that says this must be the case), but you can also choose to be redirected to PubMed or HubMed using:

http://localhost:8080/openref/BMC Bioinformatics/2007/8/487?redirect=pubmed
or
http://localhost:8080/openref/BMC Bioinformatics/2007/8/487?redirect=hubmed

I’ve got a working example running at http://openref.pansapiens.com/ if anyone would like to try it out (eg, try http://openref.pansapiens.com/openref/BMC Bioinformatics/2007/8/487 ). No promises that it will stay up for long (Turbogears apps seem to die quite a lot on my cheap little virtual hosting account … I’m using supervisor2 now, which may help keep things more available).

It should be stressed that this is only a quick and dirty hack to demonstrate the proof of concept. It’s really only translating the ‘paths’ in the URLs provided by the user into PubMed queries, and uses the existing DOI infrastructure to ultimately redirect the user to the article; in reality I’d expect that an “OpenRef” resolver would have to be more independent and sophisticated than this. I can’t imagine who would maintain a separate OpenRef database in order to make it independent of DOIs and PubMed.
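
Stripped of the web framework, the path-to-query translation amounts to something like the following sketch (Python 3, standard library only). The function name `pubmed_query_from_url` is made up for illustration; the field tags ([TA], [DP], [VI], [PG]) are the same ones used in the controller code above.

```python
from urllib.parse import unquote, urlsplit

def pubmed_query_from_url(url):
    """Turn /openref/Journal/Year[/Volume]/Page into a PubMed query string."""
    segments = [unquote(s) for s in urlsplit(url).path.split("/") if s]
    if not segments or segments[0] != "openref":
        raise ValueError("not an openref path: %r" % url)
    fields = segments[1:]
    if len(fields) == 3:
        tags = ["TA", "DP", "PG"]        # journal, year, first page
    elif len(fields) == 4:
        tags = ["TA", "DP", "VI", "PG"]  # journal, year, volume, first page
    else:
        raise ValueError("expected Journal/Year[/Volume]/Page")
    return " AND ".join('"%s"[%s]' % (value, tag)
                        for value, tag in zip(fields, tags))

print(pubmed_query_from_url(
    "http://localhost:8080/openref/BMC%20Bioinformatics/2007/8/487"))
```

Everything after that (running the search, pulling out the DOI) leans on PubMed and dx.doi.org, which is exactly why this is a proof of concept rather than an independent resolver.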

Unfortunately the domain openref.org has already been registered .. and not by Noel. Maybe it’s already time for a new name for this fledgling resolution scheme :) ??

ARIA version 2.2 released

I don’t usually post about NMR (Nuclear Magnetic Resonance) and structural biology related stuff, but I’ve always intended to. In this post I’m pulling out all the stops on specialist lingo and assumed background knowledge, so hopefully it isn’t too incomprehensible to the non-structural biology crowd :).

ARIA version 2.2 has been released in the last few weeks. ARIA is an automated NOE assignment and structure calculation package, which (in theory) takes some of the pain and slowness out of producing protein (and DNA and/or RNA) structures from Nuclear Magnetic Resonance data. I’ll say up front; I haven’t tried this version yet, but some of the improvements look exciting.

Here are two new features worth noting … followed by what I think it all means:

  • The assignment method has been improved with the introduction of a network-anchoring analysis (Herrmann et al., 2002) for filtering of the initial assignments.
  • The integration of the CCPN has been completed. The imported CCPN distance constraints lists can enter the ARIA process for calibration, violation analysis and network-anchoring analysis. The final constraint lists can be exported as well.

In the past I have done some quick and dirty tests comparing the quality of protein structures produced using ARIA 2.1 vs. Peter Güntert’s CYANA 1.07 and 2.1, using the exact same NMR peak input lists (with slightly noisy data containing a number of incorrectly picked peaks). CYANA always won hands down, assigning more NOE crosspeaks correctly and producing an ensemble of model structures with much lower RMSD and generally better protein structure quality scores (ie using pretty much any decent pairwise pseudo-energy potential, and Procheck). Also, ARIA produced ‘knotted’ structures which were almost certainly incorrect, while CYANA did not. Other postdocs and students in my former lab had done similar independent tests with ARIA 1.2 vs. CYANA 1.0.7, and had come to similar conclusions.

The disclaimer: It should be noted here that assessment of the quality of an ensemble of NMR structure coordinates can be problematic, and is really the topic of another long post (and probably tens if not hundreds of peer-reviewed journal articles). So saying “CYANA version X is better than ARIA version X” based on the RMSD of the final calculated ensemble is a bit unfair … in fact using RMSD of the ensemble to gauge structure quality is just plain wrong in this context. In my (unpublished, non-peer-reviewed) tests, it is possible that ARIA could be producing high RMSD but essentially ‘correct’ structures, while CYANA could be producing tightly defined but ‘incorrect’ structures, but I doubt it. The gap between the output of each program was wide enough to suggest that under real-world conditions where the input peak list contained a number of ‘noise’ peaks, ARIA was failing to give a set of consistent solutions (probably due to lack of NOE assignments), while CYANA was giving a set of tightly defined structures (which may or may not have represented the ‘correct’ solution). Other evaluations (protein structure quality measures, Procheck, comparison to known structures of similar proteins) indicated that the CYANA structures were not grossly ‘incorrect’, so I’d say CYANA was just giving a better defined (ie lower ensemble RMSD) set of plausible solutions.

My gut feeling is that ARIA 2.2 will perform much better than past versions, due to one key feature that has been ‘borrowed’ from CYANA: the introduction of a network-anchoring analysis. In a nutshell, network-anchoring scores essentially weight distance constraints (or NOE assignments) based on how ‘connected’ that constraint is within the graph formed by other constraints. This means that in effect a single, isolated constraint pulling two residues on opposite sides of a protein together is down-weighted, while if multiple constraints link those residues (or their neighboring residues) then those constraints are considered more trusted and hence weighted more heavily. For better or worse (usually better), this score simulates what the human NMR spectroscopist would do when assigning NOE crosspeaks manually … usually two residues in contact will show multiple NOE crosspeaks connecting them and involve multiple different nuclei, whereas a single lonely NOE between two nuclei which are distant from each other in the primary protein sequence is heavily scrutinized and regarded with suspicion since it is likely to be mis-assigned. I’m very keen to test ARIA 2.2 on my old data set and see if I’m actually right (I may be able to try it with network anchoring turned on, and off, and see just what sort of contribution that score is making).
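
To give a feel for the idea, here is a deliberately crude toy (Python 3) that counts, for each residue-pair constraint, how many other constraints link the same residues or their immediate sequence neighbours. This is only my own illustration of the ‘connectedness’ intuition: it is not the algorithm from Herrmann et al., and not what ARIA or CYANA actually compute (they work at the level of atoms and candidate assignments, not bare residue pairs). The function name, the `window` parameter and the example residue numbers are all made up.

```python
def anchoring_support(constraints, window=1):
    """For each (residue_i, residue_j) constraint, count how many OTHER
    constraints link the same residues or neighbours within `window`."""
    scores = []
    for idx, (i, j) in enumerate(constraints):
        support = 0
        for other_idx, (k, l) in enumerate(constraints):
            if other_idx == idx:
                continue
            # Treat pairs as unordered: (k, l) may support (i, j) either way round.
            direct = abs(k - i) <= window and abs(l - j) <= window
            swapped = abs(l - i) <= window and abs(k - j) <= window
            if direct or swapped:
                support += 1
        scores.append(support)
    return scores

# Four mutually supporting contacts around residues 10 and 45,
# plus one isolated long-range constraint between residues 3 and 80.
constraints = [(10, 45), (10, 46), (11, 45), (9, 44), (3, 80)]
for pair, score in zip(constraints, anchoring_support(constraints)):
    print(pair, score)
```

The isolated (3, 80) constraint ends up with no support at all, which is the kind of lonely NOE a spectroscopist would eye with suspicion.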

Another completed feature, the integration between ARIA and the CCPN libraries/analysis package should also be a big plus. I haven’t used the CCPN analysis software yet, but a few years ago I wrote some code to help make CYANA and the Sparky NMR assignment program work together better. The result was functional, but very hackish (and I’m probably the only person in the world who understands how it was intended to be used, since I still haven’t got around to writing any documentation. Naughty, naughty). CCPN + ARIA may turn out to be the better option for spectral analysis and structure calculation in the future, as opposed to my currently preferred Sparky + CYANA combination.

I’m really itching to find a good reason to do an NMR structure project now … back to work !!

Searching biological databases from the Firefox search bar

There is currently no “Google for Bioinformatics”, and so biologists/bioinformaticians typically need to search a number of separate databases to find the data they desire. While the Biobar Firefox extension helps search a dizzying array of biological databases (individually), I think sometimes it offers too much, and contains too many databases that I rarely if ever use. As useful as it is, most of the time I keep the Biobar toolbar hidden to reclaim the screen real-estate.

Instead, I prefer the more lightweight “search plugins” (accessed via the little search box up near the URL location bar) to fully fledged extensions. Here are some Firefox search plugins for common ‘bioinformatic’ search engines which I found scattered across the far reaches of the web:

  • HubMed is a clean and slick interface to search the PubMed database, with some features that the NCBI search doesn’t have. You may already be familiar with it, but sadly the majority of life scientists appear not to know about it / use it. I prefer HubMed to the regular NCBI Entrez interface. You can install the HubMed Firefox search plugin at HubMed.
  • In case you want the plain vanilla NCBI PubMed interface, the UCSF library provides a PubMed search plugin.
  • The Mycroft project by Mozilla seems to be an official repository of Firefox search plugins. On Mycroft, I found plugins for searching SwissProt and the Protein Data Bank (PDB).

I also wanted some that I couldn’t find, so I made them. Here are Firefox search plugins for Uniprot and Pfam:

Here is the code for the Pfam one, as an example:

<?xml version="1.0" encoding="UTF-8"?><OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/"                      xmlns:moz="http://www.mozilla.org/2006/browser/search/"> <shortname>Pfam</shortname> <description>Search the Pfam database</description> <image width="16" height="16">data:image/x-icon;base64,AAABAAEAICAAAAAAAACoCAAAFgAAACgAAAAgAAAAQAAAAAEACAAAAAAAgAQAAAAAAAAAAAAAAAAAAAAAAAAAAAAA%2F%2F%2F%2FAMWLMwDew54AjEcBAO22WwD658kAoWclAMSdbADYn0gA1LKBALiLVADt1rMA%2FffkAKp0PACcXQ0AtnomAMiXVwDxvmsA0ZY4ALuJQwDjqk8AzqZ3AKVoFQD27dkA2LuRAJJRFACVVAEA5s6lALyVYgCbXR0Aq3ElAPPhvwClbDIAzqBkAL2BLgDFkEAAtYJKAPS9YACQTwkA%2Fu3QALJxHgDZnD8A5bJfAMWTTwD%2B89cAnmIUAOiwVQDXtYkAl1gXAN6%2FlgDsuWYAxJpeAN6kTADoz60AwYg6AL6RWwDQrH0Ay6NwAMmPOQCNTgEAllcMAKVtIwDNlD4A9uPGAPfAZgCocjEAxp1mAJhYBQDZv5sAi0oLAKxvHgDUmkIAt4hIAJpeFACoaxoAn2EOAKBlGQDwuF4Aun8qALJ2JQCocDgA3KFEAJRSCAD%2B9twAjksFAJVWEQDt2rcAkFEOAMKVXQD57M4AklACAKBfGACkaigA%2BvDWAPXnyQDr068A7bNWALeHTQD24sEAu5NdAJlZCgDgxqEA6rZjALyOWADcvZIAzZE4ANi5jQDEmGIAj0oAAJVWBgDl0KoA67NaAMCELwDChzQA8%2BLDAJhZEgClbS4AqnM1AOSpSwDy3L4Ai0cFAJxfEQCiYxQAnWAbAJteIQChZikAx402ANyhSAD669EA8Ni2AJBNAgCXVwIAzqt6AOjTsACSUwUAl10WAKhpGADutl0A5a9UAMugbgCSUAwA1rB%2FAOCmSgD86s4A9ujGAI1MAgCbYBUApmoZAK9yHwCjai8AqHAmANWzhADFnm8Ajk8EAI9LCADlzasAk1MBAKBeDgCfZBYAm2AeAKpzOQDQlD0At4ROALuQXQD879QA%2BujLAJFPAACVVgMAmFcFAJRTFQCVVhgAmFsYAKVmFwDZtocA9L1jAPK7YQDLkDsAvJNdAMeVUAD77c8AkFEFAOGqUQDhqE8A3J9CAM2TOgD57NAA9OTHAPThwACPUAIAjk0HAOXNpwCWWAQA27uTAJ9jGACkZhUApmkbAKNsJADwul8A77dbAO60WQDRq38A67RYAOmxVwDPqXkA36ZNANmeRgC%2BlmEA%2FOvQAPvqzAD86MkA7di2AI1KAQCPTAEAjUoFAI9OAACOTQMA6NCrAJJOAgCRUQEAk1QDAJZVAwCTVQYAklENAJRWCACYWAcAmFcKAJtdDwCYWRkAnmMWANW0iQDyvWIA07ODAOu2XADnsFcAuHsmALh%2BKQDcpEsAtIJIALuTXACOSwMAkE0HAI9NCQCSUQUAkE8MAJhZAwCcXw8Anl8RAPK7XwDtuFwA7bVWAP745QD66s4A%2BebJAJBPAwCQUQMAPuWfn5%2Bfn5MfJIjln%2BWfLkIZgdGmwX0un5%2Bf5eWf9s%2FlwETAwMCp3EznUNjdRET1nY2jDLTROINERETARERE75%2FARMDARETdbgUmfz29wMCphNRWyV5v1sCERERERERkn8BEwMBERIT%2Bf7DGt5XznanARNvymdPx3cBERERERO%2BfwETAwMD1RBvgs
bDHTj96kqjARKiNWd6oqUREREREZJ%2FARMDARERERNyHT3DHinBP4p2onaBJg4RERERERESyn8BEwMBERERERN08xM1O%2BcY1S5LkA5x5hERERERERLKfwETAwERERERERMCd8yMvxunGajQY6FWEREREREREsp%2FARMDAREREREREwMCoW%2FcTYQXHUhF1G4TARERERESyn8BEwMBERERERPVEwIQ8RguzKshOjw%2Fc9cBERERERO%2BfwETAwEREREREqYSd8UIy0LuOO8jnudj1REREREREsp%2FARMDAwMBERMCE1B7Mpga0kQhYieqLW4TARERERETvn8BEwMDAwMCEGydiV1rSWml1VRvblO2HqMBEREREwGSfwETAwEREqRsnjIHSkIJovoOEwIT%2FiYTAREREREREZJ%2FARMDAwETA2EO0kHMWdIOEhMCEW60C%2FoRERERERERkn8BEwMDARBtWYy1gDlUbhKlEhDzjtou9wERERERERLKfwETAwMBEW10NhZvahEREwJ1T7PmvooOERERERERE75%2FARMDARPWdSjlWnYTAwN3%2Fw87H%2BcutG%2FVERERERMDvn8BEwMDAwN1NfKeERIT%2B4jv5ik7LR9vARPVEREREwO%2BfwETAwETAW0JFqp2dtVAVTgXGCXvbwEREREREREREsp%2FARMDAwIRbB%2Ftrq2VI%2BIpOLxCH%2F4SERERERERERESyn8BEwMBEwBs9vKWGLLj6BQIPvahERESpREREREREwO%2BfwETAwEREhJKkugZACnJcg6jAwEREREREREREREREZJ%2FARMDAwETAWykiQP3ReEMa2oREqURERERERERERESyn8BEwMBE9dxuyncUYNAGKGZ28Bup9URE9URERERERO%2BfwETAwMD1G55BcdXfbCDQBrTM5NX1RKlEREREREREZJ%2FARMDARESo4euHGxvYljC6Bvy%2FIYOERERERERERMBkn0T1wMBERBsXR731RIRbBCXZgZBf7tepRMDAwMDAwLKfwET1RMDAncPplL2oRPXd2jE6IChzoZ3AwMDAwMDA759EwMBEREQbTEHLUJqdwITAnW1RHFSY1oTARERERERk5UTAwMDAhN29gPjGSHv%2F3UTAhL305jbWwITAwMBERGTFn%2BXln5%2Bf5UqXKzMSZzfC5eXCn0p%2Brqzl5Z%2Bfn5%2F2HQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA </image> <url type="text/html" method="GET" template="http://www.sanger.ac.uk/cgi-bin/Pfam/qquerypfam.pl?">   <param name="terms" value="{searchTerms}"> </url></opensearchdescription>
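
One thing to watch if you copy the listing above: XML element names are case-sensitive, and the OpenSearch 1.1 spec spells them ShortName, Description, Image and Url inside OpenSearchDescription. The all-lowercase tags shown here are, I suspect, an artifact of the blog software rather than what is in the actual file. As a hedged sketch, a description with the right casing can be generated in Python 3 like this. The helper name `opensearch_description` is made up, the icon is left out, and putting `terms={searchTerms}` directly in the template (instead of a separate parameter element) is my assumption of an equivalent form.

```python
import xml.etree.ElementTree as ET

OS_NS = "http://a9.com/-/spec/opensearch/1.1/"

def opensearch_description(short_name, description, template):
    """Return a minimal OpenSearch 1.1 description document as a string."""
    ET.register_namespace("", OS_NS)  # emit OS_NS as the default namespace
    root = ET.Element("{%s}OpenSearchDescription" % OS_NS)
    ET.SubElement(root, "{%s}ShortName" % OS_NS).text = short_name
    ET.SubElement(root, "{%s}Description" % OS_NS).text = description
    ET.SubElement(root, "{%s}Url" % OS_NS,
                  {"type": "text/html", "template": template})
    return ET.tostring(root, encoding="unicode")

xml_text = opensearch_description(
    "Pfam",
    "Search the Pfam database",
    "http://www.sanger.ac.uk/cgi-bin/Pfam/qquerypfam.pl?terms={searchTerms}",
)
print(xml_text)
```

Generating the file this way also guarantees that every opening tag gets a matching close, which hand-edited XML pasted into a blog editor does not.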

To add this to your list of search engines, you also need to put the appropriate “link” tag in the head of an html page that you want the search plugin to be detected from, like this:

<link rel="search" href="opensearch_uniprot.xml" type="application/opensearchdescription+xml" title="Uniprot" />
<link rel="search" href="opensearch_pfam.xml" type="application/opensearchdescription+xml" title="Pfam" />

OpenSearch plugins are also partially supported by some versions of Internet Explorer, but I haven’t tested it (there is no POST support for OpenSearch plugins in IE 7). [insert obligatory Firefox fanboy stab at IE here].

Chances are, you’d like to make one to search your favorite database. Here is the documentation I used for creating OpenSearch plugins for Firefox:


Edit: Argghh .. looks like this code is showing up fine on the web page, but is a bit broken when displayed from the RSS feed in akregator .. I assume other feed readers may also be having trouble .. Anyone know a reliable way to post code in Blogger ? Wordpress is starting to look attractive …

Cleaning up the cesspool that is the PDB

Well .. maybe cesspool is a little strong … there’s a lot of great data in the Protein Data Bank, it’s just that in the early days it was allowed to grow very large without enforcing better standardization of the data. Things that are being fixed include updating citations for structures from “To be published” to the actual publication if it exists (with PubMed ID), linking to sequence databases (ie UniProt), bringing atom names to standard IUPAC nomenclature (Hooray!!) and loads of other things I haven’t mentioned. Don’t fret … none of the raw experimental data or coordinates are going to be changed :)

From the PDB remediation overview document (pdf):

When the RCSB PDB first addressed the remediation issues in 1998, it was with the intention of providing a uniform and consistent content across all formats. It was surprising and very disappointing to find that many PDB users at the time strongly objected to any changes in the released PDB entries, even if these changes addressed serious but correctable errors (e.g., consistency between chemical and coordinate sequence). As a result of this prevailing attitude toward changes in PDB format entries, the RCSB PDB released its corrections in a new set of mmCIF format data files and left the data in PDB file format unchanged. Since that initial release of mmCIF data, new data items and uniformity corrections have been added to the released mmCIF data files.

I’ve used coordinates from PDB format files for a lot of things over the years, but I’ve got to admit, I’ve never used an mmCIF file. The PDB file format is almost always supported by all legacy (and recent) structural biology analysis software, while using mmCIF is rarely an option (unless it’s converted to PDB format first). If I’d known the mmCIF versions in the database have been ‘remediated’ I may have been more inclined to use them (or the somewhat equivalent XML/PDBML files) for some tasks, since the non-uniformity in atom naming in legacy PDB files can become a royal pain in the butt ….

Anyhow, everyone has until July 2007 to check out the new remediated files before the ‘mainline’ PDB changes over and provides these by default. All new structure releases will follow the remediated format after July. The old versions will still remain available … but who would want them … we are getting standardized goodness !!

Dapper : the screen scraper for everyone

I’ve been meaning to write about the Dapper ‘screen scraping’ service for a while, since I think it’s mostly useful and pretty cool.

(Yes, this service is called Dapper, sharing a name with the popular Ubuntu GNU/Linux release. I’m a little suspicious that maybe this was a deliberate marketing trick to pull search traffic intended for Ubuntu ….).

Techcrunch describes Dapper well … ‘create an API for any site’. Essentially, Dapper is good at analyzing web pages which have a fixed format (eg Google search results) and will extract the content in a predictable fashion to provide the raw data in XML, RSS, JSON or CSV format. Depending on the type of data you extract from the page, Dapper can also display data as a Google Map, a Google Gadget or Netvibes module, send email alerts or output iCalendar format.

By making it simple to extract information from sites that do not provide their data as an RSS feed or other useful format, it is possible for non-programmers to use Dapper and ‘liberate’ these sites, producing feeds compatible with their favorite news feed reader from an otherwise dead and lifeless web page. Web application programmers can waste less time writing screen scrapers, and waste less time fixing broken code caused by slight changes in the page format. It may not be something you use for a mission critical component, but for a quick mashup or to get started quickly before writing your own more robust parser, I think Dapper will prove (and is proving) very valuable.

However, there is a downside. From playing with Dapper on-and-off for the past few months, I’ve established that Dapper works quite well extracting data from very uniform pages like Google search hits [however I didn't do that since it's against the terms of service. "No screen-scraping for you", says the Google-Nazi] or the front page of digg .. but usually fails on pages that don’t follow a strict pattern. Getting the wrong data (or junk) in the wrong field 5% of the time may be tolerable for the occasional frivolous RSS feed, but it is annoying enough that for more important applications it is a real show stopper.

One of the reasons it took me several months to get around to posting about Dapper is that I desperately wanted a killer example of extracting data from a bioinformatics database or web site. I’ve found that most decent projects already make their data available in some useful format like XML or CSV and don’t really require scraping with Dapper, while some of the less organized projects which only provide say, HTML tables, [I won't name names ... in fact I've forgotten about them already ... "no citation for you" says the citation-Soup-Nazi ] often failed to work well with Dapper’s page analysis unless the page formatting was strictly uniform.

Pedro

I replicated Pedro’s openKapow Ensembl orthologue search in Dapper as an example. It’s not the best example since, as Pedro notes, Ensembl is one of the ‘good guys’ that already provide results in XML format.

First, I fed Dapper four URLs for Ensembl gene report pages, which contain a section with predicted orthologues. Apparently, giving Dapper several pages of the same format helps the analysis:


Then, I selected the Gene ID in the orthologues list .. Dapper colours fields it detects as the same type. There is a cryptic unlabeled slider which determines the ‘greediness’ of the selection:

After selecting “Save and continue”, Dapper asks for the newly defined field to be named. In this case, I chose the same name as Pedro (“ort_geneID”), just for the hell of it:


This process was repeated to create a field for the species name, which I named “ort_spp”. Dapper allows ‘Fields’ to be grouped into ‘Groups’, so I grouped the “ort_geneID” and “ort_spp” fields into a group called “orthologue”: (data not shown :) ).

Now, we save the Dapp. In “Advanced Options”, I changed the Ensembl gene ID part of the URL to {geneID}. This tells Dapper to make this part of the URL a query field, so that the user can provide any gene ID they like and have the orthologue results scraped:

Finally, we can test the saved Dapp, and retrieve XML formatted results for a particular gene ID:


The gene ID can be changed in the Dapper XML transform URL (http://www.dapper.net/RunDapp?dappName=EnsemblOrthologues&v=1&variableArg_0=ENSG00000100347) to get XML results for orthologues of other human genes.
Various other transforms, like a cruft-free CSV version, are also possible. Feel free to have a play yourself with my Ensembl Orthologues Dapp (like I can stop you now ! It’s public & live & irrevocable).
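
If you wanted to script this, swapping the gene ID into the transform URL is a one-liner. Here is a small sketch (Python 3, standard library only) that just builds the URLs without fetching anything. The function name `ensembl_orthologues_url` is made up, and the second gene ID in the loop is simply another human Ensembl ID chosen for illustration.

```python
from urllib.parse import urlencode

DAPPER_RUN_URL = "http://www.dapper.net/RunDapp"

def ensembl_orthologues_url(gene_id):
    """Build the XML transform URL of the EnsemblOrthologues Dapp for one gene."""
    params = [
        ("dappName", "EnsemblOrthologues"),
        ("v", "1"),
        ("variableArg_0", gene_id),  # the {geneID} query field defined in the Dapp
    ]
    return DAPPER_RUN_URL + "?" + urlencode(params)

for gene_id in ["ENSG00000100347", "ENSG00000139618"]:
    print(ensembl_orthologues_url(gene_id))
```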




Creative Commons Attribution 3.0 Unported