One of the nice features of Google App Engine is that you can easily view logs for your application to quickly see which requests are generating errors. Browsing the logs of ResolveRef, I’ve been able to identify a few classes of query which, for one reason or another, weren’t working.
About two weeks ago, tipped off by Neil, I heard about Google App Engine. I managed to get a beta account, and I’ve finally had a chance to do something (hopefully) useful with it.
In the absence of any quickly achievable ideas for a bioinformatics app, I ported over the OpenRef application I wrote on top of TurboGears a few months back.
(spurred on by my own comment here)
Does anyone know if the Clustal alignment file format (e.g. ClustalW output) has a strict definition somewhere?
Some Googling suggests it has never been “formally” described .. eg, from the ClustalX help:
“CLUSTAL format output is a self explanatory alignment format. It shows the sequences aligned in blocks. It can be read in again at a later date to (for example) calculate a phylogenetic tree or add a new sequence with a profile alignment.”
Well, it is fairly self explanatory, and as a result there are lots of parsers around for Clustal format alignment data, and lots of programs that claim to output alignments in “Clustal format”. I say claim, since many programs output Clustal alignments with a different header from the original ClustalW program (e.g. “MUSCLE” instead of “CLUSTAL”) .. and some parsers don’t handle that very gracefully (e.g. Biopython’s Bio.Clustalw).
Unfortunately, these ‘pseudo-Clustal’ formats aren’t going away, so it is probably up to the parsers to be a little more flexible. Fortunately, the variation is usually only in the header on the first line of the file, so it should be trivial to fix the Biopython parser so that it is more forgiving. One idea would be to simply add an optional keyword flag like “ignore_header = True” to the Bio.Clustalw.parse_file() function. This way, something like:
alignment = Bio.Clustalw.parse_file(my_muscle_align_file, alphabet=IUPAC.protein, ignore_header=True)
should happily slurp up most variations on the Clustal format.
Eventually I’ll get this to the Biopython mailing list (I’ll probably write a proper patch first).
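To make the idea concrete, here’s a rough standalone sketch in plain Python. It is not the Biopython code or a patch to it: the function name parse_clustal, its ignore_header argument and the sample alignment are all made up for illustration. It assumes the only thing that varies between programs is the first non-blank line, and that conservation lines always start with whitespace.

```python
# Hypothetical sample: a made-up two-sequence alignment with a MUSCLE-style header.
SAMPLE = (
    "MUSCLE (3.8) multiple sequence alignment\n"
    "\n"
    "seq1            MKT-AYIAKQ\n"
    "seq2            MKTQAYIAKQ\n"
    "                *** ******\n"
    "\n"
    "seq1            RQISFV\n"
    "seq2            RQ-SFV\n"
    "                ** ***\n"
)


def parse_clustal(text, ignore_header=True):
    """Return {sequence_id: aligned_sequence} from Clustal-style text.

    Illustrative helper only (not part of Biopython). With
    ignore_header=True the first non-blank line is skipped without
    checking that it starts with "CLUSTAL".
    """
    lines = text.splitlines()

    # Find the header: the first non-blank line of the file.
    idx = 0
    while idx < len(lines) and not lines[idx].strip():
        idx += 1
    if idx == len(lines):
        raise ValueError("no alignment data found")

    header = lines[idx]
    if not ignore_header and not header.startswith("CLUSTAL"):
        raise ValueError("unexpected header: %r" % header)

    records = {}
    for line in lines[idx + 1:]:
        # Blank lines separate blocks; lines that start with whitespace
        # carry the conservation symbols (*, :, .) and hold no sequence.
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        # fields[0] is the sequence id, fields[1] the aligned chunk; an
        # optional trailing residue count (fields[2]) is ignored.
        name, chunk = fields[0], fields[1]
        records[name] = records.get(name, "") + chunk
    return records


if __name__ == "__main__":
    for name, seq in parse_clustal(SAMPLE).items():
        print(name, seq)
```

Defaulting ignore_header to True is just one possible choice; a strict default would keep the existing behaviour for anyone relying on the header check.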