logo
BIOSC 1540 - Spring 2025
CB08
Initializing search
    oasci/pitt-biosc1540-2025s
    • Home
    • Syllabus
    • Lectures
    • Assessments
    • CBytes
    • Team
    • Students
    oasci/pitt-biosc1540-2025s
    • Home
    • Syllabus
    • Lectures
    • Assessments
    • CBytes
      • CB01
      • CB02
      • CB03
      • CB04
      • CB05
      • CB06
      • CB07
      • CB08
      • CB09
    • Team
    • Students

    CBit 08

    Colab Image

    Now we're NumPy and Matplotlib wizards! With our practice in getting a grasp of the basics of these packages, we can begin to tackle word problems. These are similar to what you might see on homework, or when you need to program something for a real science questions/problems! By the end of this CBit/recitation , you will have gained practice with decomposing word problems into their computational constituents, and how to implement code to solve those problems . You'll additionally gain experience with a method called "pseudocoding" —a great way to problem solve computationally, before having to write any code!

    Below, you'll find a set of 4 problems focusing on using NumPy and Matplotlib concepts. CBit recitation will walk through these as a group as we try to solve them together!

    Gibraltar (cool name, I know!) has just finished using his Base-Quantifier-Doohickey for a set of sequences, quantifying the number of nitrogenous bases within a set of sequences and putting these values into a table. The table is organized as a row for each sequence and four columns containing the counts of A, T, G, C (in that order) for a given sequence/row. Unfortunately, however, Gibraltar wasn't able to ensure that the input sequences were all of the same size, making these raw counts difficult to work with for downstream analyses. It'd be more beneficial to know the percent base composition for each sequence. He needs your help!

    Can you help Gibraltar convert the raw base counts to proportions of the total sequence containing that base?

    In [ ]:
    Copied!
    import numpy as np
    import numpy.typing as npt
    
    import numpy as np import numpy.typing as npt
    In [ ]:
    Copied!
    # This is code used for generating a NumPy array of sequences!
    def generate_sequence_array(
        num_sequences: int, min_length: int, max_length: int
    ) -> npt.NDArray[np.int64]:
        """
        A function that generates a random NumPy array of sequences and their base amounts.
        Sequences are limited in their size.
    
        The array is formatted such that rows represent sequences, and four columns
        represent the counts of A, T, G, and C (in that order) for a given sequence.
    
        Args:
            num_sequences: An integer, representing the number of sequences (rows)
                in the array.
            min_length: An integer, representing the minimum size of a sequence in
                the array.
            max_length: An integer, representing the maximum size of a sequence in
                the array.
    
        Returns:
            A NumPy array in the previously specified format.
        """
    
        sequences: list[list[int]] = []
    
        # We create "num_sequences" sequences!
        for _ in range(num_sequences):
    
            # First, we choose the sequence length
            seq_length = np.random.randint(min_length, max_length + 1)
    
            # We then create a sequence!
            # We create it purely randomly across all four bases -- while this isn't
            # biologically accurate, it's a good method to just get some random data to use.
            curr_seq = []
    
            # We add bases one-by-one
            for i in range(seq_length):
                # Randomly selecting a value from [0,3], where we associate each value with a base
                base_selection = np.random.randint(0, 4)
                if base_selection == 0:
                    # And we add them accordingly
                    curr_seq.append("A")
                elif base_selection == 1:
                    curr_seq.append("T")
                elif base_selection == 2:
                    curr_seq.append("G")
                else:
                    curr_seq.append("C")
    
            # We then count the presence of each of these bases in the above sequence
            num_A = curr_seq.count("A")
            num_T = curr_seq.count("T")
            num_G = curr_seq.count("G")
            num_C = curr_seq.count("C")
    
            # And we append the list of those values to our total sequence list
            sequences.append([num_A, num_T, num_G, num_C])
    
        # And then we can return the array!
        return np.array(sequences, dtype=np.int64)
    
    # This is code used for generating a NumPy array of sequences! def generate_sequence_array( num_sequences: int, min_length: int, max_length: int ) -> npt.NDArray[np.int64]: """ A function that generates a random NumPy array of sequences and their base amounts. Sequences are limited in their size. The array is formatted such that rows represent sequences, and four columns represent the counts of A, T, G, and C (in that order) for a given sequence. Args: num_sequences: An integer, representing the number of sequences (rows) in the array. min_length: An integer, representing the minimum size of a sequence in the array. max_length: An integer, representing the maximum size of a sequence in the array. Returns: A NumPy array in the previously specified format. """ sequences: list[list[int]] = [] # We create "num_sequences" sequences! for _ in range(num_sequences): # First, we choose the sequence length seq_length = np.random.randint(min_length, max_length + 1) # We then create a sequence! # We create it purely randomly across all four bases -- while this isn't # biologically accurate, it's a good method to just get some random data to use. curr_seq = [] # We add bases one-by-one for i in range(seq_length): # Randomly selecting a value from [0,3], where we associate each value with a base base_selection = np.random.randint(0, 4) if base_selection == 0: # And we add them accordingly curr_seq.append("A") elif base_selection == 1: curr_seq.append("T") elif base_selection == 2: curr_seq.append("G") else: curr_seq.append("C") # We then count the presence of each of these bases in the above sequence num_A = curr_seq.count("A") num_T = curr_seq.count("T") num_G = curr_seq.count("G") num_C = curr_seq.count("C") # And we append the list of those values to our total sequence list sequences.append([num_A, num_T, num_G, num_C]) # And then we can return the array! return np.array(sequences, dtype=np.int64)
    In [ ]:
    Copied!
    def convert_counts_to_proportions(
        sequence_array: npt.NDArray[np.int64],
    ) -> npt.NDArray[np.float64]:
        """
        A function that takes an array of sequences and their base amounts, and converts
        base counts to proportions (i.e., percentages) of a sequence of that base.
    
        Args:
            sequence_array: A NumPy formatted such that rows represent sequences, and
                four columns represent the counts of A, T, G, and C (in that order)
                for a given sequence. Must be a 2D array.
    
        Returns:
            A NumPy array, where base counts have been converted to proportions of a
                given sequence that is that base.
    
        Examples:
            ```python
            >>> convert_counts_to_proportions(np.array([[2, 0, 2, 0]]))
            [[0.5, 0.0, 0.5, 0.0]]
            >>> convert_counts_to_proportions(np.array([[2, 0, 2, 0], [1, 1, 1, 1]]))
            [[0.5, 0.0, 0.5, 0.0], [0.25, 0.25, 0.25, 0.25]]
            ```
        """
        proportion_sequence_array = None
    
        # TODO: Convert the input array to contain proportions instead of counts, as specified above.
    
        return proportion_sequence_array
    
    def convert_counts_to_proportions( sequence_array: npt.NDArray[np.int64], ) -> npt.NDArray[np.float64]: """ A function that takes an array of sequences and their base amounts, and converts base counts to proportions (i.e., percentages) of a sequence of that base. Args: sequence_array: A NumPy formatted such that rows represent sequences, and four columns represent the counts of A, T, G, and C (in that order) for a given sequence. Must be a 2D array. Returns: A NumPy array, where base counts have been converted to proportions of a given sequence that is that base. Examples: ```python >>> convert_counts_to_proportions(np.array([[2, 0, 2, 0]])) [[0.5, 0.0, 0.5, 0.0]] >>> convert_counts_to_proportions(np.array([[2, 0, 2, 0], [1, 1, 1, 1]])) [[0.5, 0.0, 0.5, 0.0], [0.25, 0.25, 0.25, 0.25]] ``` """ proportion_sequence_array = None # TODO: Convert the input array to contain proportions instead of counts, as specified above. return proportion_sequence_array

    Awesome! Now you have everything as a proportion. For the study, the next things Gibraltar needs to find is the average proportion for each base across all sequences. Help him out!

    In [ ]:
    Copied!
    def find_average_base_proportion(sequence_array: npt.NDArray) -> dict[str, int | None]:
        """
        A function that returns the average proportion of each nitrogenous base across
        all sequences in `sequence_array`.
    
        Args:
            sequence_array: A NumPy formatted such that rows represent
                sequences, and four columns represent the proportions of A, T, G, and C
                (in that order) for a given sequence.
    
        Returns:
            A dictionary, mapping a string (representing a base) to a float
                (representing the average sequence proportion of that base across all
                sequences).
        """
    
        base_to_avg_proportion = {
            "A": None,
            "T": None,
            "G": None,
            "C": None,
        }
    
        # TODO: Find the average sequence proportion of each base across all sequences,
        # placing values into the above dictionary.
    
        return base_to_avg_proportion
    
    def find_average_base_proportion(sequence_array: npt.NDArray) -> dict[str, int | None]: """ A function that returns the average proportion of each nitrogenous base across all sequences in `sequence_array`. Args: sequence_array: A NumPy formatted such that rows represent sequences, and four columns represent the proportions of A, T, G, and C (in that order) for a given sequence. Returns: A dictionary, mapping a string (representing a base) to a float (representing the average sequence proportion of that base across all sequences). """ base_to_avg_proportion = { "A": None, "T": None, "G": None, "C": None, } # TODO: Find the average sequence proportion of each base across all sequences, # placing values into the above dictionary. return base_to_avg_proportion

    Perfect! With that done, we can get into the meat and potatoes of the analysis. Gibraltar has come up with a (totally not arbitrary) measure of "good" and "bad" sequences, based on thresholds of base composition. Right now, he's thinking that:

    • Good sequences will have a minimum of 52% A and T content, and a maximum of 48% G and C content.
    • Bad sequence are everything else!

    However, science moves so fast, that those boundaries are constantly changing! Gibraltar needs you to design a function that can take variable goodness thresholds (changing minima and maxima thresholds) and return the number of good sequences, given an array of sequences in the previous format. Additionally, he needs you to plot each sequence on a scatterplot based on their A/T and G/C content, and highlight the "good" sequences.

    In [ ]:
    Copied!
    import matplotlib.pyplot as plt
    
    import matplotlib.pyplot as plt
    In [ ]:
    Copied!
    def find_num_good_sequences(
        sequence_array: npt.NDArray[np.int64], good_min_at: float, good_max_gc: float
    ) -> int:
        """
        A function that finds the number of "good" sequences within "sequence_array",
        dependent on the minimum AT proportion (good_min_at) and the maximum GC proportion (good_max_gc).
    
        Args:
            sequence_array: A NumPy formatted such that rows represent sequences,
                and four columns represent the proportions of A, T, G, and C
                (in that order) for a given sequence.
            good_min_at: A float representing the minimum proportion of A/T for a
                "good" sequence.
            good_max_at: A float representing the maximum proportion of G/C for a
                "good" sequence.
    
        Returns:
            An integer, representing the number of "good" sequences in "sequence_array".
        """
        num_good: int = 0
    
        # TODO: Find the number of "good" sequences in "sequence_array", based on the
        # above specifications.
    
        return num_good
    
    
    def plot_sequence_base_composition(
        sequence_array: npt.NDArray[np.int64], good_min_g: float, good_max_c: float
    ) -> None:
        """
        A function that plots sequences based on their G proportion (x-axis) and C
        proportion (y-axis) on a scatterplot, coloring the points for "good" sequences
        (defined by "good_min_g" and "good_max_c") red, and other points grey.
    
        Args:
            sequence_array: A NumPy formatted such that rows represent sequences, and
                four columns represent the proportions of A, T, G, and C (in that order)
                for a given sequence.
            good_min_g: A float representing the minimum proportion of G for a "good"
                sequence.
            good_max_c: A float representing the maximum proportion of C for a "good"
                sequence.
    
        Returns:
            A scatter plot, with the above specifications.
        """
    
        # TODO: Plot sequences on a scatterplot based on their G proportion (x-axis) and
        # C proportion (y-axis), coloring "good" sequences red, and other sequences grey.
    
    def find_num_good_sequences( sequence_array: npt.NDArray[np.int64], good_min_at: float, good_max_gc: float ) -> int: """ A function that finds the number of "good" sequences within "sequence_array", dependent on the minimum AT proportion (good_min_at) and the maximum GC proportion (good_max_gc). Args: sequence_array: A NumPy formatted such that rows represent sequences, and four columns represent the proportions of A, T, G, and C (in that order) for a given sequence. good_min_at: A float representing the minimum proportion of A/T for a "good" sequence. good_max_at: A float representing the maximum proportion of G/C for a "good" sequence. Returns: An integer, representing the number of "good" sequences in "sequence_array". """ num_good: int = 0 # TODO: Find the number of "good" sequences in "sequence_array", based on the # above specifications. return num_good def plot_sequence_base_composition( sequence_array: npt.NDArray[np.int64], good_min_g: float, good_max_c: float ) -> None: """ A function that plots sequences based on their G proportion (x-axis) and C proportion (y-axis) on a scatterplot, coloring the points for "good" sequences (defined by "good_min_g" and "good_max_c") red, and other points grey. Args: sequence_array: A NumPy formatted such that rows represent sequences, and four columns represent the proportions of A, T, G, and C (in that order) for a given sequence. good_min_g: A float representing the minimum proportion of G for a "good" sequence. good_max_c: A float representing the maximum proportion of C for a "good" sequence. Returns: A scatter plot, with the above specifications. """ # TODO: Plot sequences on a scatterplot based on their G proportion (x-axis) and # C proportion (y-axis), coloring "good" sequences red, and other sequences grey.
    2025-03-28 10:54:52
    CC BY-NC-SA 4.0 by OASCI
    Made with Material for MkDocs