Archive for the 'science' Category

ARIA version 2.2 released

I don’t usually post about NMR (Nuclear Magnetic Resonance) and structural biology related stuff, but I’ve always intended to. In this post I’m pulling out all the stops on specialist lingo and assumed background knowledge, so hopefully it isn’t too incomprehensible to the non-structural biology crowd :).

ARIA version 2.2 has been released in the last few weeks. ARIA is an automated NOE assignment and structure calculation package, which (in theory) takes some of the pain and slowness out of producing protein (and DNA and/or RNA) structures from Nuclear Magnetic Resonance data. I’ll say up front: I haven’t tried this version yet, but some of the improvements look exciting.

Here are two new features worth noting … followed by what I think it all means:

  • The assignment method has been improved with the introduction of a network-anchoring analysis (Herrmann et al., 2002) for filtering of the initial assignments.
  • The integration of the CCPN has been completed. The imported CCPN distance constraints lists can enter the ARIA process for calibration, violation analysis and network-anchoring analysis. The final constraint lists can be exported as well.

In the past I have done some quick and dirty tests comparing the quality of protein structures produced using ARIA 2.1 vs. Peter Güntert’s CYANA 1.07 and 2.1, using the exact same NMR peak input lists (with slightly noisy data containing a number of incorrectly picked peaks). CYANA always won hands down, assigning more NOE crosspeaks correctly and producing an ensemble of model structures with much lower RMSD and generally better protein structure quality scores (i.e. using pretty much any decent pairwise pseudo-energy potential, and Procheck). Also, ARIA produced ‘knotted’ structures which were almost certainly incorrect, while CYANA did not. Other postdocs and students in my former lab had done similar independent tests with ARIA 1.2 vs. CYANA 1.0.7, and had come to similar conclusions.

The disclaimer: It should be noted here that assessment of the quality of an ensemble of NMR structure coordinates can be problematic, and is really the topic of another long post (and probably tens if not hundreds of peer-reviewed journal articles). So saying “CYANA version X is better than ARIA version X” based on the RMSD of the final calculated ensemble is a bit unfair … in fact using RMSD of the ensemble to gauge structure quality is just plain wrong in this context. In my (unpublished, non-peer reviewed) tests, it is possible that ARIA could be producing high RMSD but essentially ‘correct’ structures, while CYANA could be producing tightly defined but ‘incorrect’ structures, but I doubt it. The gap between the output of each program was wide enough to suggest that under real-world conditions where the input peak list contained a number of ‘noise’ peaks, ARIA was failing to give a set of consistent solutions (probably due to lack of NOE assignments), while CYANA was giving a set of tightly defined structures (which may or may not have represented the ‘correct’ solution). Other evaluations (protein structure quality measures, Procheck, comparison to known structures of similar proteins) indicated that the CYANA structures were not grossly ‘incorrect’, so I’d say CYANA was just giving a better defined (i.e. lower ensemble RMSD) set of plausible solutions.
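To make the ‘ensemble RMSD’ idea concrete, here is a rough Python sketch of one common way to compute it: the RMSD of each model from the mean structure, averaged over the ensemble. It assumes the models have already been superimposed and list the same atoms in the same order. The function name and the toy coordinates are made up for illustration; this is not what ARIA or CYANA run internally.

```python
import math

def ensemble_rmsd(models):
    """Average RMSD of each model from the ensemble's mean structure.

    `models` is a list of models; each model is a list of (x, y, z) tuples
    with atoms in the same order. Assumes the models are already superimposed.
    """
    n_models = len(models)
    n_atoms = len(models[0])
    # Mean position of every atom across the ensemble.
    mean = [
        tuple(sum(m[a][k] for m in models) / n_models for k in range(3))
        for a in range(n_atoms)
    ]
    per_model = []
    for m in models:
        # Sum of squared deviations from the mean structure over all atoms.
        sq = sum(
            (m[a][k] - mean[a][k]) ** 2 for a in range(n_atoms) for k in range(3)
        )
        per_model.append(math.sqrt(sq / n_atoms))
    return sum(per_model) / n_models

# Toy two-atom 'ensembles' (hypothetical coordinates, in Angstroms).
tight = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
loose = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [(0.0, 2.0, 0.0), (1.0, 2.0, 0.0)]]
print(ensemble_rmsd(tight), ensemble_rmsd(loose))
```

Note that nothing in this calculation refers to the ‘true’ structure. The number only measures how well the models agree with each other (precision), which is exactly why a low ensemble RMSD on its own says nothing about correctness.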

My gut feeling is that ARIA 2.2 will perform much better than past versions, due to one key feature that has been ‘borrowed’ from CYANA: the introduction of a network-anchoring analysis. In a nutshell, network-anchoring scores weight distance constraints (or NOE assignments) based on how ‘connected’ each constraint is within the graph formed by the other constraints. This means that in effect a single, isolated constraint pulling two residues on opposite sides of a protein together is down-weighted, while if multiple constraints link those residues (or their neighboring residues) then those constraints are considered more trustworthy and hence weighted more heavily. For better or worse (usually better), this score simulates what the human NMR spectroscopist would do when assigning NOE crosspeaks manually … usually two residues in contact will show multiple NOE crosspeaks connecting them, involving multiple different nuclei. A single lonely NOE between two nuclei which are distant from each other in the primary protein sequence, however, is heavily scrutinized and regarded with suspicion, since it is likely to be mis-assigned. I’m very keen to test ARIA 2.2 on my old data set and see if I’m actually right (I may be able to try it with network anchoring turned on, and off, and see just what sort of contribution that score is making).
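The down-weighting idea can be shown with a toy Python sketch. This is only my illustration of the general principle, not the published network-anchoring algorithm or anything taken from ARIA or CYANA: the `Constraint` record, the `window` parameter and the `s / (1 + s)` weighting are all assumptions made up for the example.

```python
from collections import namedtuple

# Hypothetical record for this sketch: one candidate NOE-derived constraint
# between two residues, identified by a unique label.
Constraint = namedtuple("Constraint", ["res_i", "res_j", "label"])

def network_support(candidate, constraints, window=1):
    """Count other constraints linking the same or neighbouring residue pairs."""
    support = 0
    for other in constraints:
        if other is candidate:
            continue
        # Try both orientations, since a constraint between i and j is undirected.
        for a, b in ((other.res_i, other.res_j), (other.res_j, other.res_i)):
            if abs(a - candidate.res_i) <= window and abs(b - candidate.res_j) <= window:
                support += 1
                break
    return support

def anchoring_weights(constraints, window=1):
    """Map each constraint label to a weight in [0, 1); isolated constraints get 0."""
    weights = {}
    for c in constraints:
        s = network_support(c, constraints, window)
        weights[c.label] = s / (1.0 + s)
    return weights

# Three mutually supporting constraints around residues 10/45, and one loner.
constraints = [
    Constraint(10, 45, "HA-HN"),
    Constraint(10, 45, "HB-HD"),
    Constraint(11, 46, "HN-HG"),
    Constraint(3, 80, "lonely"),
]
weights = anchoring_weights(constraints)
print(weights)
```

Here the three constraints clustered around residues 10 and 45 each pick up support from the other two, while the isolated constraint between residues 3 and 80 gets a weight of zero. A real implementation would work at the level of individual atoms and would also take covalent structure into account.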

Another completed feature, the integration between ARIA and the CCPN libraries/analysis package, should also be a big plus. I haven’t used the CCPN analysis software yet, but a few years ago I wrote some code to help make CYANA and the Sparky NMR assignment program work together better. The result was functional, but very hackish (and I’m probably the only person in the world who understands how it was intended to be used, since I still haven’t got around to writing any documentation. Naughty, naughty). CCPN + ARIA may turn out to be the better option for spectral analysis and structure calculation in the future, as opposed to my currently preferred Sparky + CYANA combination.

I’m really itching to find a good reason to do an NMR structure project now … back to work !!

Cleaning up the cesspool that is the PDB

Well .. maybe cesspool is a little strong … there’s a lot of great data in the Protein Data Bank, it’s just that in the early days it was allowed to grow very large without enforcing better standardization of the data. Things that are being fixed include updating citations for structures from “To be published” to the actual publication if it exists (with PubMed ID), linking to sequence databases (e.g. UniProt), bringing atom names to standard IUPAC nomenclature (Hooray!!) and loads of other things I haven’t mentioned. Don’t fret … none of the raw experimental data or coordinates are going to be changed :)

From the PDB remediation overview document (pdf):

When the RCSB PDB first addressed the remediation issues in 1998, it was with the intention of providing a uniform and consistent content across all formats. It was surprising and very disappointing to find that many PDB users at the time strongly objected to any changes in the released PDB entries, even if these changes addressed serious but correctable errors (e.g., consistency between chemical and coordinate sequence). As a result of this prevailing attitude toward changes in PDB format entries, the RCSB PDB released its corrections in a new set of mmCIF format data files and left the data in PDB file format unchanged. Since that initial release of mmCIF data, new data items and uniformity corrections have been added to the released mmCIF data files.

I’ve used coordinates from PDB format files for a lot of things over the years, but I’ve got to admit, I’ve never used an mmCIF file. The PDB file format is almost always supported by all legacy (and recent) structural biology analysis software, while using mmCIF is rarely an option (unless it’s converted to PDB format first). If I’d known the mmCIF versions in the database have been ‘remediated’ I may have been more inclined to use them (or the somewhat equivalent XML/PDBML files) for some tasks, since the non-uniformity in atom naming in legacy PDB files can become a royal pain in the butt ….

Anyhow, everyone has until July 2007 to check out the new remediated files before the ‘mainline’ PDB changes over and provides these by default. All new structure releases will follow the remediated format after July. The old versions will still remain available … but who would want them … we are getting standardized goodness !!

I submitted my PhD thesis, and all I got was this crappy balloon

Well, it’s not really all that crappy … the balloon is a nice happy gesture to mark the occasion. I even got to pick the colour. It took me far too long to write and submit this thing, so it’s a relief to not have to look at it for a few months. My thesis, entitled “The structure of outer mitochondrial protein import receptors”, may well be the first Creative Commons Licensed thesis submitted in Australia (although I doubt it). Once it’s been examined (hopefully I pass), I’ll release it online and allow everyone to poke holes and rip it to shreds (or they can poke at the associated peer reviewed publication instead .. unfortunately it’s probably not Open Access).

Afterthought: One thing that slowed down the final submission was the bloody LaTeX typesetting. I’m a LaTeX novice, and while I really like the final result, LaTeX is an abomination (much like Perl).


Update, 15th October, 2007.

I’ve finally got around to submitting the final post-examination version of my thesis to the University of Melbourne ePrints server. You can get a PDF copy of my thesis here. I used the xmpincl LaTeX package to embed XMP Creative Commons licensing data into the final PDF version generated by pdflatex. I probably didn’t get the format of the licensing XML exactly right, but I’m sure it will be good enough that search engines can (or will one day) determine the correct licensing for the work.

First Online EMBL PhD Symposium

This looks interesting … the First Online EMBL PhD Symposium, a sort of ‘online’ conference for the life sciences. Everybody with a scientific background is invited to participate. Registration is free.

The programme (Career Development Session, Omics Session / Systems Biology, Scientific Communication 2.0 and Participant’s Contributions) and speakers list make it look sort of like a “Biology 2.0” conference.

Apart from the (possible) IRC sessions, hopefully the fact that everything is stored as video/audio + comments on their content management system means the ‘inconvenient’ timezone in Australia won’t limit my participation too much.

(via the worldwide bioinformatics cabal :), Neil via Pedro, Roland and Stew)

International Genetically Engineered Machine competition videos

The 2006 iGEM Jamboree (International Genetically Engineered Machine competition) happened at the start of this month. This is a synthetic biology ‘competition’ where teams of talented undergraduates from around the world engineer an organism for a specific purpose … like E. coli that produce mint or banana smell, or form simple logic gates that could potentially be used to make a ‘biological computer’.

They are encouraged to use BioBricks from the Registry of Standard Biological Parts, which at the moment essentially comprises a series of well-characterized DNA constructs (promoters, repressors, selection markers, lots of fluorescent protein coding sequences, etc) with standardized restriction sites that can be mixed and matched to produce new and interesting behaviours in bacteria, yeast or mammalian cells. BioBricks are sent out to teams in 96-well format, so everyone has a good basic set of starting components.

Videos of the student presentations have finally turned up on Google Video. (Unfortunately, the videos only show the speakers, not the slides for the presentation … which makes some parts pretty hard to follow).

I watched the presentation by the University of Arizona team. They printed bacteria onto paper using a stock-standard inkjet printer, with the ink simply removed from the cartridges and replaced with a solution of bacteria. They could then transfer this to agar plates to grow in whatever pattern they printed. Very simple, but inkjet hardware hacking crossed with molecular biology is just plain cool. As a side discovery, they noticed some weird fractal patterns in colonies under the confocal microscope, apparently based on variation in the fluorescent protein expression level of cells in a single colony.

I wonder how much interest there would be from undergrads (and their supervising academics) to start an Australian iGEM team for 2007? Funding would also be a tricky issue, as always.