Project 01
Assignment D
Gene prediction
Always ensure you're working within your "BIOSC 1540 Project" for this assignment. This will help you keep your work organized and easily accessible. If you encounter any issues with Galaxy, check their support and training .
Please set the
STUDENT_ID
variable in the cell below to your student ID number.
STUDENT_ID = 1111111
Q01 ¶
After assembling the genome, the next step is to annotate it to identify genes and their potential functions. We'll use Prokka, a rapid prokaryotic genome annotation tool. Follow the instructions below for both your parental and evolved assembly from assignment P01C .
-
In the Galaxy interface, search for
Prokka
in the tools panel. -
Click on
Prokka: Prokaryotic genome annotation
. -
In the tool interface, set the following parameters:
- "Contigs to annotate": Select the contigs file from your SPAdes output.
- "Genus name": Enter "Staphylococcus"
- "Species name": Enter "aureus"
- Set "Use genus-specific BLAST database" to yes.
- Click "Run Tool".
After the annotation is complete, examine the output files. Pay particular attention to:
- GFF3 file (.gff) : This contains the annotations in a standard format. You can use this file to visualize the annotations in a genome browser.
- GenBank file (.gbk) : This file contains both the sequence and the annotations. It can be viewed in many sequence analysis tools.
- Statistics file (.txt) : This provides the number and types of features identified.
- Protein FASTA file (.faa) : This contains the amino acid sequences of predicted proteins.
PARENTAL_CDS = 0
PARENTAL_rRNA = 0
PARENTAL_tRNA = 0
PARENTAL_tmRNA = 0
EVOLVED_CDS = 0
EVOLVED_rRNA = 0
EVOLVED_tRNA = 0
EVOLVED_tmRNA = 0
Download your nucleotide FASTA file (
fna
) for both parental and evolved annotations.
You can do this by clicking on the dataset collection on the right and then clicking the save icon.
Rename your files as follows:
-
Parental:
parental-cds.fasta
-
Evolved:
evolved-cds.fasta
Upload both of these files with your homework submission.
Q02 ¶
In this problem, you will develop a transition matrix for a first-order Markov chain based on DNA sequences. The goal is to determine how likely it is for one nucleotide (A, C, G, or T) to transition to another within a given sequence.
Background and Motivation
A Markov chain is a mathematical model that describes a sequence of events where the probability of each event depends only on the state of the previous event. When applied to DNA sequences, a first-order Markov chain assumes that the likelihood of observing a particular nucleotide at any position depends solely on the nucleotide immediately before it.
Mathematically, this is expressed as:
$$ P(X_n = x_n \mid X_1 = x_1, X_2 = x_2, ..., X_{n-1} = x_{n-1}) = P(X_n = x_n \mid X_{n-1} = x_{n-1}) $$
where $X_n$ represents the nucleotide at position $n$. This assumption simplifies modeling by ignoring dependencies beyond the most recent nucleotide.
By constructing a transition matrix, we can quantitatively describe the patterns of nucleotide transitions in a given DNA sequence. This is useful in applications such as genome annotation, motif discovery, and evolutionary analysis.
Your Task
You will implement a function to compute a transition matrix from a given DNA sequence or set of sequences. The transition matrix captures the probabilities of moving from one nucleotide to another based on observed frequencies in the dataset.
To achieve this, you will:
- For each consecutive pair of nucleotides (e.g., A → C, C → G), count how many times each transition occurs within the sequence.
- Convert raw counts into probabilities by dividing each transition count by the total occurrences of the preceding nucleotide.
- Ensure that your implementation handles very short sequences (length < 2), as well as cases where certain transitions are rare or entirely absent.
def compute_transition_matrix(seq_list: list[str]) -> dict[str, dict[str, float]]:
"""
Compute a first-order Markov transition matrix from a list of DNA sequences.
This function calculates the probability of transitioning from one nucleotide
(A, C, G, T) to another by counting occurrences in a collection of sequences.
It assumes a first-order Markov process where the probability of each nucleotide
only depends on the previous one.
Args:
seq_list: A list of DNA sequences (each sequence should consist of the characters A, C, G, and T).
Returns:
A nested dictionary representing the transition probabilities, e.g.:
{
'A': {'A': 0.3, 'C': 0.2, 'G': 0.3, 'T': 0.2},
'C': {'A': 0.2, 'C': 0.3, 'G': 0.1, 'T': 0.4},
'G': {'A': 0.4, 'C': 0.1, 'G': 0.4, 'T': 0.1},
'T': {'A': 0.1, 'C': 0.3, 'G': 0.2, 'T': 0.4}
}
Each value is the probability of transitioning from the outer key nucleotide to the inner key nucleotide,
calculated using the counts from all sequences provided.
"""
# TODO: Define a list of the nucleotides you want to track, e.g. ["A", "C", "G", "T"].
# TODO: Initialize a dictionary to keep track of counts of each nucleotide (single_counts).
# Make sure each nucleotide starts at 0.
# TODO: Initialize a nested dictionary (pair_counts) to track counts of each pair of nucleotides.
# For each "from" nucleotide, store a dictionary for each "to" nucleotide, also starting at 0.
# TODO: Loop over each sequence in seq_list:
# 1. Convert it to uppercase to standardize.
# 2. Loop through each pair (n1, n2) in the sequence.
# 3. If n1 and n2 are valid nucleotides, update single_counts and pair_counts accordingly.
# TODO: Initialize a dictionary (transition_matrix) to store the final transition probabilities.
# TODO: For each nucleotide n1 in your single_counts, compute the transition probabilities:
# 1. Calculate the total count for n1 (the denominator).
# 2. For each possible "to" nucleotide n2, compute probability = pair_counts[n1][n2] / denominator.
# 3. Handle the case where denominator is zero to avoid division errors.
# TODO: Return the transition_matrix dictionary.
pass # Remove or replace with your implementation
Below is an simple list of sequences you can use to test your function.
seqs = [
"GCATGTAGGCGCTGGACTCGCTAGTAGTTTTGGGGCTGGAGACCGGAAAACATGTGCTACCTCACTTAGTACTAGCGGGGCAAGACATGCTGCTCTGCGAGTTATGACAGCGGAGAATTACTTTAGGATTTATTAAATCCGAGCCGGCATCCTTTTTCGTCTATGTCTACGAAAATTACAATGGCCGCCTCAGTGATGCGCGTAACCTAGTACGATGCCTAGTGAATT",
"GGCGATAAGTTAAATTGTGTCAAGGGATGTCTTCGGAGTTCGAGCAACTGCATACCCCCAGTTAACGTCGTCCTGCCGGCAACGAGCAGCAATACAAGAGCGCCACTATCCTCCCCTACAAACGTATGCACCAAGCCAAGTCCCCATATCAAGGTATCCACGAGCTCAAGGTACTGTCTATAGTCTGCTGCTACAG",
"TTCGGAGCGTCCACCGCCTGTCCAAATTTCCATTGTAATGTTGTTGTTAAGGTTGGTAATATGTAGCCCCTGGTAGCAAGACTACGCAGTGAAGGTTCGCCCTACGGACTCTGCGACCAAAGTCGCCCGCGCCGCCAATGACCTCTGCGTTGTGCGCGATTGGTTCCGGATCTCGGGAGCTAGGTCCCGCTGGATTTTGTGGGCAAGCCCTCTCTCTCTTACTTCACCGTGATTATTCCTGGAAACCGCATTTCTAGACTGACCAGTTAGCGT",
]
Your function should give around the following values for
seqs
.
Note that I rounded these values to the third decimal point.
{
'A': {'A': 0.255, 'C': 0.230, 'G': 0.280, 'T': 0.236},
'C': {'A': 0.208, 'C': 0.279, 'G': 0.240, 'T': 0.273},
'G': {'A': 0.209, 'C': 0.308, 'G': 0.209, 'T': 0.273},
'T': {'A': 0.258, 'C': 0.236, 'G': 0.258, 'T': 0.247},
}
Q03 ¶
In this problem, you will use Markov models to distinguish between coding and non-coding DNA sequences. Many gene prediction algorithms rely on the fact that coding regions (exons) and non-coding regions (introns or intergenic sequences) exhibit different statistical properties. By modeling these differences using first-order Markov chains, we can estimate how likely a given sequence is to belong to either category and classify it accordingly.
Background and Motivation
In genomics, one of the challenges in gene prediction is identifying which regions of a DNA sequence encode proteins (coding regions) and which do not (non-coding regions). One approach is to train separate Markov models for these two types of sequences and then use these models to classify unknown sequences.
A first-order Markov model assumes that the probability of a nucleotide occurring at a given position depends only on the nucleotide that precedes it. By training one Markov model on a dataset of known coding regions and another on non-coding regions, we can compare how likely a new sequence is under each model.
Given a DNA sequence, we will compute its log-likelihood under both models and classify it as coding if it is more probable under the coding model, or non-coding otherwise.
Your Task
You will implement a method to compute the probability an unknown DNA sequence is following
- Given a new sequence, calculate the probability of observing that sequence under both the coding and non-coding Markov models.
- To avoid numerical underflow, compute the log-probability instead of the raw probability.
Log-Probability Calculation
For a given sequence $S = n_1 n_2 n_3 ... n_L$, where each $n_i$ represents a nucleotide (A, C, G, or T), the probability of the sequence under a first-order Markov model is:
$$ P(S) = \prod_{i=1}^{L-1} P(n_{i+1} \mid n_i) $$
Since multiplying many small probabilities can lead to very small numbers (numerical underflow), we compute the log-probability instead:
$$ \log P(S) = \sum_{i=1}^{L-1} \log P(n_{i+1} \mid n_i) $$
You can assume all values in the transition matrix will be nonzero and all possible transitions will be in the
transition_matrix
(i.e., no missing keys).
# Note: These values are from our oversimplified hypothesis that
# coding regions are high in GC content.
# https://slides.com/aalexmmaldonado/biosc1540-l04b#/5/3/2
transition_coding = {
"A": {"A": 0.2, "C": 0.3, "G": 0.4, "T": 0.1},
"C": {"A": 0.1, "C": 0.4, "G": 0.3, "T": 0.2},
"G": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
"T": {"A": 0.1, "C": 0.3, "G": 0.4, "T": 0.2},
}
transition_noncoding = {
"A": {"A": 0.3, "C": 0.2, "G": 0.3, "T": 0.2},
"C": {"A": 0.2, "C": 0.3, "G": 0.1, "T": 0.4},
"G": {"A": 0.4, "C": 0.1, "G": 0.4, "T": 0.1},
"T": {"A": 0.1, "C": 0.3, "G": 0.2, "T": 0.4},
}
import math
def sequence_log_probability(
seq: str, transition_matrix: dict[str, dict[str, float]]
) -> float:
"""
Compute the log probability of a given sequence under the provided Markov transition matrix.
This function calculates the log probability of observing a DNA sequence based on a first-order
Markov model. It multiplies the probabilities of each transition between consecutive nucleotides,
but to avoid numerical underflow (since probabilities can be very small), it computes the sum of the
logarithms of the probabilities instead of their product.
Args:
seq: A string representing the DNA sequence (consisting of A, C, G, T).
transition_matrix: A nested dictionary representing the transition probabilities between nucleotides.
The matrix should have the structure:
{
'A': {'A': prob, 'C': prob, 'G': prob, 'T': prob},
'C': {'A': prob, 'C': prob, 'G': prob, 'T': prob},
'G': {'A': prob, 'C': prob, 'G': prob, 'T': prob},
'T': {'A': prob, 'C': prob, 'G': prob, 'T': prob}
}
Returns:
The log probability (a float) of the sequence under the Markov model.
"""
# TODO: Convert seq to uppercase to match the keys in transition_matrix.
# TODO: Initialize a variable to accumulate the total log probability (e.g., log_prob = 0.0).
# TODO: Loop over the sequence indices to get each pair of consecutive nucleotides.
# Remember to stop at the second-to-last character to avoid index issues.
# TODO: For each pair (n1, n2), retrieve the transition probability from transition_matrix.
# TODO: Add the logarithm of that transition probability to the running log probability sum.
# TODO: Return the total accumulated log probability.
pass # Remove or replace with your implementation
For
seq = "CGGTGTTTTGGCCTGCCTCCACGAGCCGCTCACCCGGCCCGTCCGAGCAC"
, your
sequence_log_probability
should give around
-63.6
for coding and
-76.0
for non-coding.
Since
-63.6 > -76.0
, this sequence would be classified as coding.
If you would like to generate random sequences with varying GC content, you can use the function below.
import random
def generate_dna_sequence(length=1000, gc_bias=0.6):
"""
Generate a random DNA sequence with slightly high GC content.
Parameters:
- length (int): Length of the DNA sequence.
- gc_bias (float): Desired proportion of G and C nucleotides (0.5 is balanced, >0.5 favors GC).
Returns:
- str: Generated DNA sequence.
"""
gc_nucleotides = ["G", "C"]
at_nucleotides = ["A", "T"]
sequence = [
(
random.choice(gc_nucleotides)
if random.random() < gc_bias
else random.choice(at_nucleotides)
)
for _ in range(length)
]
return "".join(sequence)