CByte 03
Getting Great Genes 👖
Ahhh! Relaxation! After assembling that genome, your life has been going pretty great—nothing like a day at the beach, a view of the waves, and a– wait. What's that? Is that a UFO?
ring ring
ring ring
My lab mentor? What could this be?
click!
murmur murmur
Uh-huh. Yeah. UFO—yeah I see it right now.
murmur murmur
Alien jeans? Oh, you mean genes! What about them?
murmur murmur
We have to find them? Aww man. There goes my vacation!
By the end of this CByte, you will have hands-on experience with methods used for gene identification and annotation, including finding open reading frames and analyzing the features of a candidate gene.
Finding Open Reading Frames (ORFs)
ATP possible: 20
Dealing with these alien genomes shouldn't be too bad! You remember from your lecture that open reading frames are the key to finding genes—while not every one is a gene, a gene will certainly be one of them! Your lab mentor agrees, and thankfully, they've gotten their hands on some of the alien start and stop codons! Things should be smooth sailing from here.
Finding open reading frames is oftentimes the first step in gene identification in practice. Without open reading frames, computational biologists do not have candidate genes to evaluate.
Your task is to implement the following function:
def return_open_reading_frames(
    seq: str, start_codons: list[str], stop_codons: list[str]
) -> list[tuple[str, int]]:
    """
    A function that returns all open reading frames found within the sequence "seq". Consider what would
    happen if, from a start codon, there are two valid stop codons: what would happen in the cell?

    You can assume that the input is a string representing a valid nucleotide sequence and that it
    follows the format described below.

    Args:
        seq: A string, representing a nucleotide sequence.
        start_codons: A list of strings representing start codons to search for.
        stop_codons: A list of strings representing stop codons to search for.

    Returns:
        - A list of tuples of strings and integers, representing the found open reading frames
          and their starting indices.

    Example:
        seq = "ATGAAACCCGGGTTTTAAGGGAATGCGTACGATGTAG"
        start_codons = ["ATG"]
        stop_codons = ["TAA", "TAG", "TGA"]

        output: [('ATGAAACCCGGGTTTTAA', 0), ('ATGCGTACGATGTAG', 22), ('ATGTAG', 31)]
    """
    orfs: list[tuple[str, int]] = []

    # TODO: Using the provided start/stop codons, identify and return all potential open reading frames.
    pass

    return orfs
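The docstring example can be traced with a short sketch. This is only one possible reading of the task, not the reference solution: `find_orfs_sketch` is a hypothetical helper, and it assumes that codons are three nucleotides long, that every start codon opens its own ORF, that an ORF ends at the first in-frame stop codon, and that a start codon with no in-frame stop is skipped.

```python
def find_orfs_sketch(seq, start_codons, stop_codons):
    """Hypothetical helper: one ORF per start codon, ending at the first in-frame stop."""
    starts = set(start_codons)
    stops = set(stop_codons)
    orfs = []
    for i in range(len(seq) - 2):
        if seq[i:i + 3] not in starts:
            continue
        # Walk codon by codon in the reading frame set by this start codon.
        for j in range(i + 3, len(seq) - 2, 3):
            if seq[j:j + 3] in stops:
                orfs.append((seq[i:j + 3], i))
                break  # assumption: the first in-frame stop ends the ORF
    return orfs


example_seq = "ATGAAACCCGGGTTTTAAGGGAATGCGTACGATGTAG"
print(find_orfs_sketch(example_seq, ["ATG"], ["TAA", "TAG", "TGA"]))
# [('ATGAAACCCGGGTTTTAA', 0), ('ATGCGTACGATGTAG', 22), ('ATGTAG', 31)]
```

Note that the ORF starting at index 31 sits inside the one starting at index 22. Under these assumptions both are reported, which matches the example output in the docstring.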
Coding Scores
ATP possible: 30
Now that you have all of those ORFs, your lab mentor has devised a devilish plan that has never before been seen (ever! trust me!): you're going to score them. And get this! You're not only going to score them, young undergrad, but you're going to score them based on hexamers! These are k-mers of size 6. You oftentimes see patterns of hexamers in genes, which means we can use them as clues to what might be an actual gene in this extraterrestrial problem.
In practice, this is called a "coding score", and is used in real gene identification software.
Your task is to implement the following function:
def score_orf_coding(
    seq: str,
    start_codons: list[str],
    stop_codons: list[str],
    hexamer_scores: dict[str, int],
) -> tuple[str, int]:
    """
    A function that identifies the highest-scoring open reading frame (ORF) in a nucleotide sequence
    based on coding scores. Each ORF is divided into hexamers (six-nucleotide sequences), advancing
    one codon (three nucleotides) at a time, so consecutive hexamers overlap by one codon. The hexamer
    scores are summed to determine the coding score for an ORF. The highest scoring ORF is returned,
    with ties broken alphabetically.

    You can assume that the input is a string representing a valid nucleotide sequence and that it
    follows the format described below.

    Args:
        seq: A string representing a nucleotide sequence.
        start_codons: A list of strings representing start codons to search for.
        stop_codons: A list of strings representing stop codons to search for.
        hexamer_scores: A dictionary mapping hexamer sequences (6-mers) to point values.

    Returns:
        - A tuple of a string and integer, representing the highest scoring open reading frame
          and its starting index.

    Example:
        seq = "ATGAAACCCGGGTTTTAAGGGAATGCGTACGATGTAG"
        start_codons = ["ATG"]
        stop_codons = ["TAA", "TAG", "TGA"]
        hexamer_scores = {
            "ATGAAA": 5, "AAACCC": 2, "CCCGGG": 3, "GGGTTT": 1,
            "TTTAAA": 4, "ATGCGT": 3, "CGTACG": 2, "TACGAT": 2,
            "GATGTA": 5, "ATGTAG": 6
        }

        output: ('ATGAAACCCGGGTTTTAA', 0, 11)
    """
    best_orf = (None, None)

    # TODO: Using the inputted start/stop codons and hexamer scores, return the highest scoring ORF and its start index.
    pass

    return best_orf
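The hexamer stepping and the tie-break rule can be illustrated with a small sketch. Both helpers below are hypothetical, and the sketch assumes that a hexamer missing from `hexamer_scores` contributes 0 points. The docstring example lists a third value that matches the summed score (11), so the sketch keeps the score next to the ORF and its start index; drop it if only the two documented values are wanted.

```python
def coding_score_sketch(orf, hexamer_scores):
    """Hypothetical helper: sum the scores of hexamers taken every 3 nucleotides."""
    total = 0
    for i in range(0, len(orf) - 5, 3):
        # Assumption: hexamers that are not in the dictionary score 0.
        total += hexamer_scores.get(orf[i:i + 6], 0)
    return total


def pick_best_sketch(scored):
    """Hypothetical helper: highest score wins, ties go to the alphabetically first ORF."""
    return min(scored, key=lambda item: (-item[2], item[0]))


example_scores = {
    "ATGAAA": 5, "AAACCC": 2, "CCCGGG": 3, "GGGTTT": 1,
    "TTTAAA": 4, "ATGCGT": 3, "CGTACG": 2, "TACGAT": 2,
    "GATGTA": 5, "ATGTAG": 6,
}
# ORFs and start indices taken from the docstring example of the previous task.
example_orfs = [("ATGAAACCCGGGTTTTAA", 0), ("ATGCGTACGATGTAG", 22), ("ATGTAG", 31)]
scored = [(orf, start, coding_score_sketch(orf, example_scores)) for orf, start in example_orfs]
print(scored)
# [('ATGAAACCCGGGTTTTAA', 0, 11), ('ATGCGTACGATGTAG', 22, 11), ('ATGTAG', 31, 6)]
print(pick_best_sketch(scored))
# ('ATGAAACCCGGGTTTTAA', 0, 11)
```

The first two ORFs tie at 11 points, so the example output depends on the alphabetical tie-break: "ATGAAA..." sorts before "ATGCGT...".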
Rules From Out of this World!
ATP possible: 50
Turns out, the aliens weren't even that bad! In fact, one of the aliens—err, I mean, one of your new colleagues—was proficient in Python, and is now a part of your lab! In talking with them, you find out a ton of interesting things—they're an Eagles fan (do they even get cable on their planet?), they make a mean risotto, they actually know of additional scoring methods for genes, they've always dreamed of becoming an acto- wait a second, what was that last one? Additional scoring methods? Aww man!
While coding scores are useful, gene identification in practice combines many scores. Genes (and their surrounding regions) show a variety of patterns and traits, and we can use those patterns and traits to pick out likely genes from a collection of ORFs.
Your task is to implement the following function:
def score_orf(
    seq: str,
    start_codons: list[str],
    stop_codons: list[str],
    hexamer_scores: dict[str, int],
    rbs_scores: dict[tuple[str, int], int],  # this is the upstream score
    gc_content_scores: dict[tuple[int, int], int],
) -> tuple[str, int]:
    """
    A function that identifies the highest-scoring open reading frame (ORF) in a nucleotide sequence
    based on coding score, ribosome binding site (RBS) score, and GC content score. Each ORF is divided into
    hexamers (six-nucleotide sequences), advancing one codon (three nucleotides) at a time. Hexamer, RBS,
    and GC content scores are summed to determine the final score for an ORF. The highest scoring ORF is
    returned, with ties broken alphabetically.

    You can assume that the input is a string representing a valid nucleotide sequence and that it
    follows the format described below.

    Args:
        seq: A string representing a nucleotide sequence.
        start_codons: A list of strings representing start codons to search for.
        stop_codons: A list of strings representing stop codons to search for.
        hexamer_scores: A dictionary mapping hexamer sequences (6-mers) to point values.
        rbs_scores: A dictionary mapping tuples of strings (representing RBSs) and
                    integers (representing the number of bases upstream for the
                    start of the RBS) to point values. Upstream values will be a
                    minimum of 3.
        gc_content_scores: A dictionary mapping tuples of integers (representing ranges
                           of GC composition percentage) to point values. Consider the
                           ranges, boundary inclusive. There will be no overlapping ranges
                           (i.e., no repeat of any start or stop in the dictionary).

    Returns:
        - A tuple of a string and integer, representing the highest scoring open reading frame
          and its starting index.

    Example:
        seq = "ATGAAACCCGGGTTTTAAGGGAATGCGTACGATGTAG"
        start_codons = ["ATG"]
        stop_codons = ["TAA", "TAG", "TGA"]
        hexamer_scores = {
            "ATGAAA": 5, "AAACCC": 2, "CCCGGG": 3, "GGGTTT": 1,
            "TTTAAA": 4, "ATGCGT": 3, "CGTACG": 2, "TACGAT": 2,
            "GATGTA": 5, "ATGTAG": 6
        }
        rbs_scores = {
            ("GGA", 3): 2,
            ("AAA", 4): 1,
            ("TTT", 6): 3
        }
        gc_content_scores = {
            (0, 40): 1,
            (41, 60): 2,
            (61, 80): 3,
            (81, 100): 4
        }

        output: ('ATGCGTACGATGTAG', 22, 15)
    """
    best_orf = (None, None)

    # TODO: Using the inputted start/stop codons, hexamer scores, RBS scores, and
    #       GC content scores, return the highest scoring ORF and its start index.
    pass

    return best_orf
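The two extra scores can be sketched as below. The helper names are hypothetical and the sketch rests on several assumptions the docstring does not settle: GC percentage is computed over the ORF itself, a percentage that falls between two ranges (for example 40.5) scores 0, an RBS entry `(motif, upstream)` matches when the motif begins exactly `upstream` bases before the start codon, and points from several matching RBS entries are added together.

```python
def gc_score_sketch(orf, gc_content_scores):
    """Hypothetical helper: look up the points for the ORF's GC percentage."""
    gc_percent = 100 * sum(base in "GC" for base in orf) / len(orf)
    for (low, high), points in gc_content_scores.items():
        if low <= gc_percent <= high:  # ranges are boundary inclusive
            return points
    return 0  # assumption: values that fall between ranges score 0


def rbs_score_sketch(seq, start, rbs_scores):
    """Hypothetical helper: add points for each RBS motif found at its upstream offset."""
    total = 0
    for (motif, upstream), points in rbs_scores.items():
        begin = start - upstream
        # Skip motifs that would begin before the sequence does.
        if begin >= 0 and seq[begin:begin + len(motif)] == motif:
            total += points  # assumption: multiple matches are summed
    return total


example_seq = "ATGAAACCCGGGTTTTAAGGGAATGCGTACGATGTAG"
example_rbs = {("GGA", 3): 2, ("AAA", 4): 1, ("TTT", 6): 3}
example_gc = {(0, 40): 1, (41, 60): 2, (61, 80): 3, (81, 100): 4}

print(rbs_score_sketch(example_seq, 22, example_rbs))   # 2
print(gc_score_sketch("ATGCGTACGATGTAG", example_gc))   # 2
```

For the ORF starting at index 22, these two values plus a coding score of 11 (hexamers taken one codon apart) give 15, which agrees with the third value in the docstring's example output.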