Lesson 6: Bioinformatics x Python Packages

Recording from 2020 https://drive.google.com/file/d/1-WxYMeo-SRU_IcOli_aE-WfE3ALmBw2M/view?usp=sharing


Congratulations! After your hard work honing your Unix and Python skills, you’ve been selected for a bioinformatic research project! Soon, you’re going to be trusted with genetic data from an unknown source. Your goal will be to use bioinformatics to determine where this data came from.

Warm Up (Packages)

Before you’re allowed to work with real genetic data, you’ll need to familiarize yourself with Python packages. Think of packages as Python programs someone else has written that you can borrow and use for your own purposes.

For example, earlier you determined the GC content of your sequence by writing your own algorithm. Instead, it’s a lot easier to find a Python package that already has a built-in GC counter, and just run it by doing something like sequence.count_gc(), for example.

The Python package we’ll be using for our bioinformatics purposes is called Biopython. This is already installed on our cluster. If it wasn’t, you could install it by doing (DO NOT ACTUALLY DO THIS): pip3 install --user biopython.

Instead, prove to yourself that Biopython is already on the cluster by doing:

pip3 list

Warm Up (Biopython)

Great, we have Biopython. Let’s figure out what it does and how to use it.

Seq Objects

Much like the fundamental building block of genetics is DNA, the fundamental building block of Biopython (at least for what we’re doing) is the Seq object. As we move forward in the course, we’ll introduce other Biopython features.

First, you’ll need to import the Seq package (which is part of the Biopython library). This gives you access to use the Seq object in your file:

 from Bio.Seq import Seq

Next, create a Seq object (think of that as a nucleotide sequence) and assign it to a variable:


Now, print out that variable:


The output (while boring) is probably what you’d expect–the sequence you specified. But, as you know, you could have done the same thing with a simple string! What makes Seq’s especially useful?

Biopython Methods

The answer - methods! See, the Seq object knows what’s inside is either DNA or RNA, so it has specific genetics-related methods you can easily use! For example, you can translate this RNA! Do:


Check out this table to see what amino acids this translated to.

Note: Methods belonging to the Seq type will all be executed in the format above: SeqObj.method(). All but one method you’ll use today will be in this format. The other is a method outside the type, which usess Seq as a parameter: method(SeqObj).

The First Glimpse


You’ll be using the same genetic data as before.

!wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1qalp5p_feGsd-G_FtOq6v-OAb5US3rDM' -O unidentified.fasta

This time, instead of having various files for various methods, we’re going to be using just one file. Isn’t that handy? Copy it to your directory:

!wget https://raw.githubusercontent.com/ubicucsd/crash_course/master/source/scripts/biopython.py -O biopythonExercise.py

This program will provide you with feedback as you progress through the next few steps. Run the program as you complete each step. You’ll find the Biopython package documentation useful.

Hint: You really don’t want to look through this vast documentation. Can you figure out which specific sub-packages the biopython.py file uses, and only look for that documentation? Ask an instructor if you’re stuck for more than 3 minutes.

Hint 2: Every single task below can be completed in one line of code. Although you need not be this efficient, be wary of writing long methods. The whole point of Biopython is that you don’t need to think about any of the logic; it’s all done for you.

Transcription Simulation

Reverse Complement

Follow the instructions in the file.

RNA Transcription

Follow the instructions in the file.

A Potential Breakthrough

Recall the two markers that could suggest your unknown sample may, in fact, be related to the one seen in rats:

  1. The sequence “ATGGAGCTGACTGTGGAGGCATG” is often present.
  2. The GC Content is above 55%.

Follow the instructions in the file.

And as you might have guessed from the closing comments of the last lesson, instead of parsing files manually, we can use the file parsers in Biopython to get the fasta sequence from our fasta file easily. Create the biopython SeqRecord object which stores all headers in id attribute and sequences in seq attribute:

record = SeqIO.read("unidentified.fasta", "fasta") # Read SeqIO documentation
print(record.id, record.seq)

Now let’s Blast!

result_handle = NCBIWWW.qblast("blastn", "nt", record.seq)
blast_qresult = SearchIO.read(result_handle, "blast-xml")
# Get the description of top 5 hits

We’ll return to using Biopython in a future lesson on Alignment.


Exercises are adapted from Rosalind, the official Biopython tutorial textbook, and a course from the University of Trento.