Skip to content

Latest commit

 

History

History
230 lines (184 loc) · 9 KB

File metadata and controls

230 lines (184 loc) · 9 KB

Reading Multiple-sequence Alignment Files

The previous sections looked at Biopython's SeqIO module for sequence file input and output (reading and writing sequence files).

Now we come to the AlignIO module which as the name suggests is for alignment input and ouput. Note that this is focused on dealing with multiple sequence alignments of the kind typically used in phylogenetics - a separate SearchIO module targets pairwise alignments generated by search tools like BLAST.

These examples use a number of real example sequence alignment files, see the sample data instructions in the introduction for how to download them.

We're going to look at a small seed alignment for one of the PFAM domains, the A2L zinc ribbon domain (A2L_zn_ribbon; PF08792). This was picked almost at random - it is small enough to see the entire alignment on screen, and has some obvious gap-rich columns.

From the alignments tab on the Pfam webpage, you can download the raw alignment in several different formats (Selex, Stockholm, FASTA, and MSF). Biopython is able to work with FASTA (very simple) and Stockholm format (richly annotated).

Loading a single Alignment

As in SeqIO, under AlignIO we have both AlignIO.parse(...) for looping over multiple separate alignments, and AlignIO.read(...) for loading a file containing a single alignment.

Most of the time you will be working with alignment files which contain a single alignment, so normally you will use AlignIO.read(..).

Here is an example loading the Pfam seed alignment for the A2L zinc ribbon domain (A2L_zn_ribbon; PF08792):

>>> from Bio import AlignIO
>>> alignment = AlignIO.read("PF08792_seed.sth", "stockholm")
>>> print(alignment)
SingleLetterAlphabet() alignment with 14 rows and 37 columns
SIPVVCT---CGNDKDFY--KDDDIYICQLCNAETVK VF282_IIV6/150-181
DIIENCKY--CGSFDIE---KVKDIYTCGDCTQTYTT Q9YW27_MSEPV/2-33
SDNIKCKY--CNSFNII---KNKDIYSCCDCSNCYTT Q9EMK1_AMEPV/2-33
AQDWRCDD--CNATLVYV--KKDAQRVCLECGKSTFF Q6XM16_9PHYC/83-115
SKEWICEV--CNKELVYI--RKDAERVCPDCGLSHPY Q8QNH7_ESV1K/101-133
NDDSKCIK--CGGPVLMQ--AARSLLNCQECGYSAAV Q4A276_EHV8U/148-180
KSQNVCSVPDCDGEKILN--QNDGYMVCKKCGFSEPI YR429_MIMIV/213-247
LKYKECKY--CHTDMVFN--TTQFGLQCPNCGCIQEL VF385_ASFB7/145-177
RNLKSCSN--CKHNGLI---TEYNHEFCIFCQSVFQL Q6VZA9_CNPV/2-33
MNLRMCGG--CRRNGLV---SDADYEFCLFCETVFPM Q6TVP3_ORFSA/1-32
MNLRLCSG--CRHNGIV---SEQGYEYCIFCESVFQK VLTF3_VACCC/1-32
MNLKMCSG--CSHNGIV---SEHGYEFCIFCESIFQS Q8V3K7_SWPV1/1-32
NALRHCHG--CKHNGLV---LEQGYEFCIFCQAVFQH O11357_MCV1/5-36
DQIYTCT---CGGQMELWVNSTQSDLVCNECGATQPY Y494R_PBCV1/148-181

Printing a Biopython alignment object will give you a display as above (but truncated for larger alignments).

In many ways, the alignment acts like a list of SeqRecord objects (just like you would get from SeqIO). The length of the alignment is the number of rows for example, and you can loop over the rows as individual SeqRecord objects:

>>> print(len(alignment))
14
>>> for record in
>>> for record in alignment:
...     print(record.id + " has " + str(record.seq.count("-")) + " gaps")
...
VF282_IIV6/150-181 has 5 gaps
Q9YW27_MSEPV/2-33 has 5 gaps
Q9EMK1_AMEPV/2-33 has 5 gaps
Q6XM16_9PHYC/83-115 has 4 gaps
Q8QNH7_ESV1K/101-133 has 4 gaps
Q4A276_EHV8U/148-180 has 4 gaps
YR429_MIMIV/213-247 has 2 gaps
VF385_ASFB7/145-177 has 4 gaps
Q6VZA9_CNPV/2-33 has 5 gaps
Q6TVP3_ORFSA/1-32 has 5 gaps
VLTF3_VACCC/1-32 has 5 gaps
Q8V3K7_SWPV1/1-32 has 5 gaps
O11357_MCV1/5-36 has 5 gaps
Y494R_PBCV1/148-181 has 3 gaps

Exercise: Write a python script called count_gaps.py which reports the number of records, the total number of gaps, and the mean (average) number of gaps per record:

$ python count_gaps.py
PF08792_seed.sth had 14 records,
Total gaps 61, average per record 4.35714285714

Tip: If you get zero as the average, and are using Python 2, add the following special import line to the start of your Python file. This will give natural division (as used in Python 3) rather than integer division (used by default in Python 2):

.. sourcecode:: python
from __future__ import division

Writing Multiple-sequence Alignment Files

As you might guess from using SeqIO.convert(...) and SeqIO.write(...), there are matching AlignIO.convert() and AlignIO.write(...) functions.

For example, this will convert the Stockholm formatted alignment into a relaxed PHYLIP format file:

from Bio import AlignIO
input_filename = "PF08792_seed.sth"
output_filename = "PF08792_seed_converted.phy"
AlignIO.convert(input_filename, "stockholm", output_filename, "phylip-relaxed")

Exercise: Modify this example to convert the Stockholm file into a FASTA alignment file.

This AlignIO.convert(...) example is equivalent to using AlignIO.read(...) and AlignIO.write(...) explicitly:

from Bio import AlignIO
input_filename = "PF08792_seed.sth"
output_filename = "PF08792_seed_converted.phy"
alignment = AlignIO.read(input_filename, "stockholm")
AlignIO.write(alignment, output_filename, "phylip-relaxed")

This form is most useful if you wish to modify the alignment in some way, which we will do next.

Sorting the rows

Downloading from Pfam gives you the option of picking the order the rows appear in - by default this is according to the tree order (clustering similar sequences together), but it can also be alphabetical (using the identifiers).

We downloaded the file using the tree order, but here is how you can sort the rows by identifier within Biopython:

>>> from Bio import AlignIO
>>> alignment = AlignIO.read("PF08792_seed.sth", "stockholm")
>>> alignment.sort()
>>> print(alignment)
SingleLetterAlphabet() alignment with 14 rows and 37 columns
NALRHCHG--CKHNGLV---LEQGYEFCIFCQAVFQH O11357_MCV1/5-36
NDDSKCIK--CGGPVLMQ--AARSLLNCQECGYSAAV Q4A276_EHV8U/148-180
MNLRMCGG--CRRNGLV---SDADYEFCLFCETVFPM Q6TVP3_ORFSA/1-32
RNLKSCSN--CKHNGLI---TEYNHEFCIFCQSVFQL Q6VZA9_CNPV/2-33
AQDWRCDD--CNATLVYV--KKDAQRVCLECGKSTFF Q6XM16_9PHYC/83-115
SKEWICEV--CNKELVYI--RKDAERVCPDCGLSHPY Q8QNH7_ESV1K/101-133
MNLKMCSG--CSHNGIV---SEHGYEFCIFCESIFQS Q8V3K7_SWPV1/1-32
SDNIKCKY--CNSFNII---KNKDIYSCCDCSNCYTT Q9EMK1_AMEPV/2-33
DIIENCKY--CGSFDIE---KVKDIYTCGDCTQTYTT Q9YW27_MSEPV/2-33
SIPVVCT---CGNDKDFY--KDDDIYICQLCNAETVK VF282_IIV6/150-181
LKYKECKY--CHTDMVFN--TTQFGLQCPNCGCIQEL VF385_ASFB7/145-177
MNLRLCSG--CRHNGIV---SEQGYEYCIFCESVFQK VLTF3_VACCC/1-32
DQIYTCT---CGGQMELWVNSTQSDLVCNECGATQPY Y494R_PBCV1/148-181
KSQNVCSVPDCDGEKILN--QNDGYMVCKKCGFSEPI YR429_MIMIV/213-247

Exercise: Write a Python script sort_alignment_by_id.py which uses AlignIO.read(..) and AlignIO.write(..) to convert PF08792_seed.sth into a sorted FASTA file.

By default the alignment's sort method uses the identifers as the sort key, but much like how sorting a Python list works, you can override this.

Advanced Exercise: Define your own function taking a single argument (a SeqRecord) which returns the number of gaps in the sequence. Use this to sort the alignment and print it to screen (or save it as a new file):

from Bio import AlignIO

def count_gaps(record):
    """Counts number of gaps in record's sequence."""
    return 0  # Fill in code

filename = "PF08792_seed.sth"
alignment = AlignIO.read(filename, "stockholm")
alignment.sort(key=count_gaps)
print(alignment)

Expected output:

$ python sort_gaps.py
SingleLetterAlphabet() alignment with 14 rows and 37 columns
KSQNVCSVPDCDGEKILN--QNDGYMVCKKCGFSEPI YR429_MIMIV/213-247
DQIYTCT---CGGQMELWVNSTQSDLVCNECGATQPY Y494R_PBCV1/148-181
AQDWRCDD--CNATLVYV--KKDAQRVCLECGKSTFF Q6XM16_9PHYC/83-115
SKEWICEV--CNKELVYI--RKDAERVCPDCGLSHPY Q8QNH7_ESV1K/101-133
NDDSKCIK--CGGPVLMQ--AARSLLNCQECGYSAAV Q4A276_EHV8U/148-180
LKYKECKY--CHTDMVFN--TTQFGLQCPNCGCIQEL VF385_ASFB7/145-177
SIPVVCT---CGNDKDFY--KDDDIYICQLCNAETVK VF282_IIV6/150-181
DIIENCKY--CGSFDIE---KVKDIYTCGDCTQTYTT Q9YW27_MSEPV/2-33
SDNIKCKY--CNSFNII---KNKDIYSCCDCSNCYTT Q9EMK1_AMEPV/2-33
RNLKSCSN--CKHNGLI---TEYNHEFCIFCQSVFQL Q6VZA9_CNPV/2-33
MNLRMCGG--CRRNGLV---SDADYEFCLFCETVFPM Q6TVP3_ORFSA/1-32
MNLRLCSG--CRHNGIV---SEQGYEYCIFCESVFQK VLTF3_VACCC/1-32
MNLKMCSG--CSHNGIV---SEHGYEFCIFCESIFQS Q8V3K7_SWPV1/1-32
NALRHCHG--CKHNGLV---LEQGYEFCIFCQAVFQH O11357_MCV1/5-36