Codon Alignment Challenge¶
One thing biologists care a lot about is the way amino acids change through time. This is found by sequencing DNA at several timepoints, codon aligning various timepoints, and comparing timepoints to see which amino acids change at what time. Why don’t we just convert to amino acids and align there? Because different reading frames and deletions cause suboptimal translations into the amino acid realm, and therefore decrease the usefulness of the alignment. Plus, codon alignment can reveal synonymous vs nonsynonymous mutations.
For a visualization of this type of alignment, head over here. The sequences shown here are from an HIV envelope. If you click on on amino acid index, you can see a graph showing how the amino acid in that position evolved over time. You can explore the site - I recommend the tree section because it is pretty.
I made a fake fasta of sequences that need to be codon aligned over at
/srv/WI20/not_aligned.fasta. Here is my step by step guide on making your own codon aligner to make sure those sequences are nice and neat.
0. Get the imports¶
#to call mafft import subprocess #to use the Seq type from Bio import Seq #to read/write fasta from Bio import SeqIO #to translate from Bio.Alphabet import generic_dna
1. Align with mafft and replace initial gaps with N’s¶
Why we do this?¶
Due to DNA fragmentation, sequencing starts at many different points, and we don’t know ahead of time where the points are. If we don’t identify where one sequence starts relative to another, we cannot begin comparing them. Why the N’s? That’s because the biological meaning of N’s is different than that of gaps. Gaps in an alignment indicate deletions or insertions from one sequence to another, we predict that something was removed or added. In the case of different starting points, we know that something is supposed to be there, since we have information from other sequences. Thus it is not an insertion or deletion, but unkown nucleotides. We represent this with N’s.
a. Align the file with mafft. Syntax for this is provided in lesson 8
b. Go through each of the sequences in the mafft aligned file
for seq_record in SeqIO.parse("nuc_aligned.fasta", "fasta"):
c. Count the number of initial gaps on each sequence
d. Degap each sequence and place it into a Seq object
The syntax for this step is whack, so let me give it to you:
d. Prepend each sequence with n’s. The number of n’s should be equal to the number of initial gaps that sequence had.
2. Correct the reading frame and length¶
We have now lined up sequences with different starting points, but what if every single sequence starts at the wrong reading frame? All of the translated animo acids will be useless. One way of making the choice between starting each sequence from the first, second, or third nucleotide by seeing which one results in the longest total distance between stop codons in all of the sequences.
In the interest of time, let us assume that the reading frame is correct. All you need to do is to make sure that the alignment has a length that is a multiple of three.
a. Go through each index from 0 to the number of sequences
b. Use the modulus to find whether the current sequences has a length divisible by three.
What is a modulus? it’s this
% symbol. In math, this symbol will find the remainder for you. So 15%3=0, 16%3=1, 17%3=2, 18%3=0 and so on.
c Based on the remainder you found in b, figure out how much of the sequence you should cut off the end. Syntax hint:
seq[0:-1] will give you the sequence with the last nucleotide cut off.
3. Translate the DNA into amino acids.¶
How? Google translating DNA with BioPython.
4. Mafft align the amino acid sequences¶
It’s the same syntax as step 1, just put your animo acids in a file.
5. Backtranslate to codons¶
We need to backtranslate from amino acids into codons by inserting the gaps from the amino acid alignment into the nucleotide codons themselves.
a. Go through each amino acid in the aligned amino acid array
b. If it’s a regular amino acid, go ahead and take next three nucelotides from the degapped nucleotide array (the one with the n’s). If it’s a gap, think about how many gaps that corresponds to in nuceotide space (hint: it’s three) and insert those gaps.
Verify your answers:
The code itself can be found at
The answers for each question are currently being made. They will be located in