CBit 08
Now we're NumPy and Matplotlib wizards! With our practice in getting a grasp of the basics of these packages, we can begin to tackle word problems. These are similar to what you might see on homework, or when you need to program something for a real science questions/problems! By the end of this CBit/recitation , you will have gained practice with decomposing word problems into their computational constituents, and how to implement code to solve those problems . You'll additionally gain experience with a method called "pseudocoding" —a great way to problem solve computationally, before having to write any code!
Below, you'll find a set of 4 problems focusing on using NumPy and Matplotlib concepts. CBit recitation will walk through these as a group as we try to solve them together!
Gibraltar (cool name, I know!) has just finished using his Base-Quantifier-Doohickey for a set of sequences, quantifying the number of nitrogenous bases within a set of sequences and putting these values into a table. The table is organized as a row for each sequence and four columns containing the counts of A, T, G, C (in that order) for a given sequence/row. Unfortunately, however, Gibraltar wasn't able to ensure that the input sequences were all of the same size, making these raw counts difficult to work with for downstream analyses. It'd be more beneficial to know the percent base composition for each sequence. He needs your help!
Can you help Gibraltar convert the raw base counts to proportions of the total sequence containing that base?
import numpy as np
import numpy.typing as npt
# This is code used for generating a NumPy array of sequences!
def generate_sequence_array(
num_sequences: int, min_length: int, max_length: int
) -> npt.NDArray[np.int64]:
"""
A function that generates a random NumPy array of sequences and their base amounts.
Sequences are limited in their size.
The array is formatted such that rows represent sequences, and four columns
represent the counts of A, T, G, and C (in that order) for a given sequence.
Args:
num_sequences: An integer, representing the number of sequences (rows)
in the array.
min_length: An integer, representing the minimum size of a sequence in
the array.
max_length: An integer, representing the maximum size of a sequence in
the array.
Returns:
A NumPy array in the previously specified format.
"""
sequences: list[list[int]] = []
# We create "num_sequences" sequences!
for _ in range(num_sequences):
# First, we choose the sequence length
seq_length = np.random.randint(min_length, max_length + 1)
# We then create a sequence!
# We create it purely randomly across all four bases -- while this isn't
# biologically accurate, it's a good method to just get some random data to use.
curr_seq = []
# We add bases one-by-one
for i in range(seq_length):
# Randomly selecting a value from [0,3], where we associate each value with a base
base_selection = np.random.randint(0, 4)
if base_selection == 0:
# And we add them accordingly
curr_seq.append("A")
elif base_selection == 1:
curr_seq.append("T")
elif base_selection == 2:
curr_seq.append("G")
else:
curr_seq.append("C")
# We then count the presence of each of these bases in the above sequence
num_A = curr_seq.count("A")
num_T = curr_seq.count("T")
num_G = curr_seq.count("G")
num_C = curr_seq.count("C")
# And we append the list of those values to our total sequence list
sequences.append([num_A, num_T, num_G, num_C])
# And then we can return the array!
return np.array(sequences, dtype=np.int64)
def convert_counts_to_proportions(
sequence_array: npt.NDArray[np.int64],
) -> npt.NDArray[np.float64]:
"""
A function that takes an array of sequences and their base amounts, and converts
base counts to proportions (i.e., percentages) of a sequence of that base.
Args:
sequence_array: A NumPy formatted such that rows represent sequences, and
four columns represent the counts of A, T, G, and C (in that order)
for a given sequence. Must be a 2D array.
Returns:
A NumPy array, where base counts have been converted to proportions of a
given sequence that is that base.
Examples:
```python
>>> convert_counts_to_proportions(np.array([[2, 0, 2, 0]]))
[[0.5, 0.0, 0.5, 0.0]]
>>> convert_counts_to_proportions(np.array([[2, 0, 2, 0], [1, 1, 1, 1]]))
[[0.5, 0.0, 0.5, 0.0], [0.25, 0.25, 0.25, 0.25]]
```
"""
proportion_sequence_array = None
# TODO: Convert the input array to contain proportions instead of counts, as specified above.
return proportion_sequence_array
Awesome! Now you have everything as a proportion. For the study, the next things Gibraltar needs to find is the average proportion for each base across all sequences. Help him out!
def find_average_base_proportion(sequence_array: npt.NDArray) -> dict[str, int | None]:
"""
A function that returns the average proportion of each nitrogenous base across
all sequences in `sequence_array`.
Args:
sequence_array: A NumPy formatted such that rows represent
sequences, and four columns represent the proportions of A, T, G, and C
(in that order) for a given sequence.
Returns:
A dictionary, mapping a string (representing a base) to a float
(representing the average sequence proportion of that base across all
sequences).
"""
base_to_avg_proportion = {
"A": None,
"T": None,
"G": None,
"C": None,
}
# TODO: Find the average sequence proportion of each base across all sequences,
# placing values into the above dictionary.
return base_to_avg_proportion
Perfect! With that done, we can get into the meat and potatoes of the analysis. Gibraltar has come up with a (totally not arbitrary) measure of "good" and "bad" sequences, based on thresholds of base composition. Right now, he's thinking that:
- Good sequences will have a minimum of 52% A and T content, and a maximum of 48% G and C content.
- Bad sequence are everything else!
However, science moves so fast, that those boundaries are constantly changing! Gibraltar needs you to design a function that can take variable goodness thresholds (changing minima and maxima thresholds) and return the number of good sequences, given an array of sequences in the previous format. Additionally, he needs you to plot each sequence on a scatterplot based on their A/T and G/C content, and highlight the "good" sequences.
import matplotlib.pyplot as plt
def find_num_good_sequences(
sequence_array: npt.NDArray[np.int64], good_min_at: float, good_max_gc: float
) -> int:
"""
A function that finds the number of "good" sequences within "sequence_array",
dependent on the minimum AT proportion (good_min_at) and the maximum GC proportion (good_max_gc).
Args:
sequence_array: A NumPy formatted such that rows represent sequences,
and four columns represent the proportions of A, T, G, and C
(in that order) for a given sequence.
good_min_at: A float representing the minimum proportion of A/T for a
"good" sequence.
good_max_at: A float representing the maximum proportion of G/C for a
"good" sequence.
Returns:
An integer, representing the number of "good" sequences in "sequence_array".
"""
num_good: int = 0
# TODO: Find the number of "good" sequences in "sequence_array", based on the
# above specifications.
return num_good
def plot_sequence_base_composition(
sequence_array: npt.NDArray[np.int64], good_min_g: float, good_max_c: float
) -> None:
"""
A function that plots sequences based on their G proportion (x-axis) and C
proportion (y-axis) on a scatterplot, coloring the points for "good" sequences
(defined by "good_min_g" and "good_max_c") red, and other points grey.
Args:
sequence_array: A NumPy formatted such that rows represent sequences, and
four columns represent the proportions of A, T, G, and C (in that order)
for a given sequence.
good_min_g: A float representing the minimum proportion of G for a "good"
sequence.
good_max_c: A float representing the maximum proportion of C for a "good"
sequence.
Returns:
A scatter plot, with the above specifications.
"""
# TODO: Plot sequences on a scatterplot based on their G proportion (x-axis) and
# C proportion (y-axis), coloring "good" sequences red, and other sequences grey.