Bioinformatics data (non-)formats

(spurred on by my own comment here)

Does anyone know if the Clustal alignment file format (e.g. ClustalW output) has a strict definition anywhere?

Some Googling suggests it has never been “formally” described. For example, from the ClustalX help:

“CLUSTAL format output is a self explanatory alignment format. It shows the sequences aligned in blocks. It can be read in again at a later date to (for example) calculate a phylogenetic tree or add a new sequence with a profile alignment.”

Well, it is fairly self explanatory, and as a result there are lots of parsers around for Clustal format alignment data, and lots of programs that claim to output alignments in “Clustal format”. I say claim, since many programs output Clustal alignments with a different header from the original ClustalW program (e.g. “MUSCLE” instead of “CLUSTAL”), and some parsers don’t handle that very gracefully (e.g. Biopython’s Bio.Clustalw).

Unfortunately, these ‘pseudo-Clustal’ formats aren’t going away, so it is probably up to the parsers to be a little more flexible. Fortunately, the variation is usually only in the header on the first line of the file, so it should be trivial to fix the Biopython parser so that it is more forgiving. One idea would be to simply add an optional keyword flag like “ignore_header = True” to the Bio.Clustalw.parse_file() function. This way, something like:

alignment = Bio.Clustalw.parse_file(my_muscle_align_file, alphabet=IUPAC.protein, ignore_header=True)

should happily slurp up most variations on the Clustal format.
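
To make the idea concrete, here is a rough, self-contained sketch of what a header-forgiving parser could look like. It is not Biopython code: the function name parse_clustal and the sample alignment are made up for illustration, and it assumes the only variation is the first line, with sequence lines of the form "name, chunk, optional residue count" and conservation lines starting with whitespace.

```python
def parse_clustal(lines, ignore_header=False):
    """Return {name: aligned_sequence} from Clustal-style text.

    Hypothetical helper for illustration only, not part of Biopython.
    """
    lines = iter(lines)

    # The header is the first non-blank line of the file.
    header = ""
    for line in lines:
        if line.strip():
            header = line
            break
    if not ignore_header and not header.startswith("CLUSTAL"):
        raise ValueError("Unexpected header: %r" % header.strip())

    alignment = {}
    for line in lines:
        # Blank lines separate blocks; lines that start with whitespace
        # hold the conservation symbols (*, :, .) and no sequence data.
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        # fields[0] is the sequence name, fields[1] the aligned chunk;
        # a trailing residue count, if present, is ignored.
        name, chunk = fields[0], fields[1]
        alignment[name] = alignment.get(name, "") + chunk
    return alignment


# Made-up example of MUSCLE-style "Clustal like" output.
muscle_output = """MUSCLE (3.8) multiple sequence alignment


seq1      MKT-AYIAKQ
seq2      MKTWAYIAKQ
          *** ******

seq1      RQISFVK
seq2      RQ-SFVK
          ** ****
"""

result = parse_clustal(muscle_output.splitlines(), ignore_header=True)
```

With ignore_header left at its default of False, the same input raises a ValueError because the first line does not start with “CLUSTAL”, which is roughly the ungraceful behaviour described above.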

Eventually I’ll get this to the Biopython mailing list (I’ll probably write a proper patch first).

3 thoughts on “Bioinformatics data (non-)formats”

  1. Just FYI, Biopython 1.51 onwards accepts MUSCLE’s “ClustalW like” output, which starts with “MUSCLE” instead of “CLUSTAL”.

    For older versions of Biopython, just use the muscle -clwstrict option instead of -clw to get “real” CLUSTAL style output.


  2. Cheers, thanks for the update Peter. I never did get around to writing that patch, but it’s good to see that Biopython now handles these variations in the Clustal format.

    I think I’m going to make it my New Year’s resolution to post these ideas / queries etc directly to the appropriate Biopython mailing list rather than letting them languish on my poor neglected blog for years 🙂

  3. Good plan 🙂

    If you post this kind of thing to our mailing list or bugzilla, you should get a much faster response too.
