Assignment 3

In this assignment, we will use the Viterbi algorithm on a hidden Markov model to detect C/G rich regions in a DNA sequence.

DNA sequences are made up of the four letters A, C, T and G (each letter corresponding to a particular nucleic acid). It turns out that, in the genomes of many organisms (including humans), the DNA letters are not distributed randomly, but rather have some regions with more A's and T's and some regions with more C's and G's. It turns out that C/G rich regions tend to have more genes, whereas A/T rich regions tend have more non-functional DNA. In this assignment we will analyze the genome of the bacterium Escherichia coli (E. coli), which is commonly found in the human gut.

We will use a hidden Markov model to detect C/G rich regions in a DNA sequence. We will use an HMM with two states, corresponding to low- and high-C/G regions respectively. The prior, transition and emission probabilities are given in the template file.

You need to implement two functions. HMM.logprob() calculates the probability of a particular sequence of states and characters. HMM.viterbi() uses the Viterbi algorithm to calculate the most likely sequence of states given a sequence of DNA characters. In other words, logprob() computes log(P(sequence, states)) and viterbi() computes argmax_states P(sequence, states).

Files

Download the code template here. You should write your code between the comments marked "Begin/End your code". You may add extra functions as you like.

The input file is a single sequence of letters A, C, T, G. Download a small example here. When you're ready to run on the full E. coli genome, download it here (5m letters, 5MB file).

Your program should output a file in the following format. The first line is the (natural) log probability of the sequence, as calculated by HMM.logprob(). The second and third lines is number of inferred 0 and 1 states respectively in your inferred sequence of states. The fourth line is a sequence of states (0 or 1) with the highest probability of outputting the observed sequence, calculated using the Viterbi algorithm. Use the write_output() function to produce this output file. Download the correct output file for the small example here.

Guidelines

Please use Python 3 for this assignment.

100 points total, worth 10% of the course grade. Turn in on CourSys. Submit a compressed directory (.zip or .tar.gz) with (1) your code in one file name "a3.py" and (2) your E. coli output named ecoli_output.txt.

You should be able to test your code running the following commands:

python a3_template.py "small.txt" 
python a3_template.py "ecoli.txt"

Grading: Your solution will be evaluated on accuracy and clarity. Solutions that are hard to understand will be given a maximum of 50/100 points. Please use a logical structure, use informative variable names, and document your code thoroughly so that your solution is understandable.

You are encouraged to work in a group. Feel free to discuss solution strategies and check each other’s work. However, you must write all the text and code you submit on your own. Joint submissions are not allowed, nor is copying someone else's text or code. Plagiarism is not okay, and will be taken very seriously. If you’re not sure whether something is okay, please ask.

If you are having trouble, please take advantage of office hours and the discussion forum. If you see a question from another student on the discussion forum that you think you know the answer to, please respond. (Do your best to lead your fellow student to a solution, rather than simply giving the solution directly.)

Updated Fri Oct. 25 2019, 11:50 by liuhengl.