Q01
Points: 50
In RNA-seq experiments, sequencing reads are not sampled uniformly across transcripts. Multiple factors influence this non-uniform distribution, including transcript length, GC content, sequence complexity, secondary structure, and technical biases in library preparation and sequencing. Understanding these biases is crucial for accurate quantification of gene expression levels.
In this problem, you will explore how transcript length affects the probability of generating sequencing reads. While real RNA-seq experiments involve additional complexities, this simplified model will help you understand one of the fundamental biases in RNA-seq data.
import random
import numpy as np
Toy transcriptome
Real transcriptomes contain thousands of transcripts with varying lengths and expression levels. For this exercise, you will create a simplified "toy" transcriptome containing a small number of transcripts.
Implement the `define_transcriptome` function that generates a random transcriptome with a specified number of transcripts.
Each transcript should have:
- A unique identifier (e.g., "T1", "T2", etc.)
- A randomly assigned length (between 100 and 1000 bases)
- A randomly assigned expression count (between 100 and 10,000)
Use your student ID as the random seed to ensure reproducible results.
STUDENT_ID = 1111111 # Specifies a seed for reproducible random numbers.
Here's a list of programming concepts needed to implement the `define_transcriptome` function:
- Dictionary operations
  - Creating empty dictionaries
  - Setting key-value pairs
  - Nested dictionary structures
- For loops and range
  - Iterating a specific number of times
  - Using `range` with a parameter
- Random number generation
  - `random.randint()` for generating random integers in a range
def define_transcriptome(n_transcripts: int) -> dict[str, dict[str, int]]:
    """
    Randomly generate n_transcripts transcripts with lengths between 100 and
    1000 bases and expression counts between 100 and 10,000.

    Args:
        n_transcripts: Number of transcripts in the transcriptome.

    Returns:
        Virtual transcriptome containing transcript names as keys with transcript
        length and count as values.
    """
    random.seed(STUDENT_ID)
    # TODO: Initialize an empty dictionary to store our transcriptome
    transcriptome: dict[str, dict[str, int]] = {}
    # TODO: Loop through and create n_transcripts different transcripts
    # TODO: Generate a transcript name in the format "T1", "T2", etc.
    # TODO: Generate a random transcript length between 100 and 1000 bases
    # TODO: Generate a random transcript count between 100 and 10000
    # TODO: Add the transcript to the transcriptome dictionary
    # TODO: Return the completed transcriptome dictionary
    return transcriptome
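For reference, here is one possible completion, a minimal sketch that simply follows the TODO comments. The `_example` suffix is ours, and the `"count"` key name is an assumption (the `"length"` key is used later in this assignment); this is not the official solution.

def define_transcriptome_example(n_transcripts: int) -> dict[str, dict[str, int]]:
    random.seed(STUDENT_ID)
    transcriptome: dict[str, dict[str, int]] = {}
    for i in range(n_transcripts):
        name = f"T{i + 1}"                  # "T1", "T2", ...
        length = random.randint(100, 1000)  # transcript length in bases
        count = random.randint(100, 10000)  # expression count ("count" key is our choice)
        transcriptome[name] = {"length": length, "count": count}
    return transcriptome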
Length bias
The probability of generating a read from a transcript is influenced by its length. Longer transcripts tend to produce more reads than shorter ones. We can model this relationship using a logistic (sigmoid) function that represents the probability P(l) of generating a read as a function of transcript length l.
Implement the `read_probability` function that calculates the probability of generating a read from a fragment of a given length using the following logistic function:
$$ P(l) = \frac{1}{1 + e^{-k (l - l_0)}} $$
where:
- $l$ is the fragment length.
- $l_0$ is the inflection point (the length at which the probability is 0.5). This parameter sets the “midpoint” of the transition.
- $k$ is a steepness parameter that controls how rapidly the probability transitions from 0 to 1.
Here's a list of programming concepts needed to implement the function body of `read_probability`:
- Mathematical operations
  - Working with exponential functions
  - Understanding logistic functions
  - Arithmetic operations
- NumPy array operations
  - Working with NumPy arrays as function inputs
  - Element-wise operations on arrays
def read_probability(length: float, k: float = 0.006, l0: float = 400.0) -> float:
    """
    Calculate the probability of generating a read from a fragment of length l.

    Args:
        length: Fragment length(s).
        k: Steepness of the logistic function (default 0.006).
        l0: Inflection point (length at which the probability is 0.5, default 400 bp).

    Returns:
        Probability values.
    """
    # TODO: Calculate the probability using a logistic function
    # This represents the probability of generating a read based on fragment length
    # TODO: Return the calculated probability value
    return 0.0
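One possible body is a direct transcription of the logistic formula above (the `_example` name is ours; using `np.exp` keeps the computation element-wise, so the function accepts both scalars and NumPy arrays):

def read_probability_example(length, k: float = 0.006, l0: float = 400.0):
    # Logistic function P(l) = 1 / (1 + exp(-k * (l - l0))), element-wise via np.exp.
    return 1.0 / (1.0 + np.exp(-k * (length - l0)))

As a quick sanity check, `read_probability_example(400.0)` returns exactly 0.5 (the inflection point), and the value approaches 1 for long fragments.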
Simulate Sequencing Read Sampling
Now that we have a model for how transcript length affects read probability, we can simulate the process of generating RNA-seq reads with this bias.
Implement the `simulate_reads` function that generates a specified number of reads from the transcriptome, where the probability of selecting each transcript is determined by its length according to the `read_probability` function.
Here's a list of programming concepts needed to implement the function body of `simulate_reads`:
- Dictionary iteration and access
  - Iterating through dictionary items using `.items()`
  - Accessing nested dictionary values
- Function calling
  - Calling external functions (`read_probability`) with parameters
- Variable accumulation
  - Summing values in a loop
- Mathematical operations
  - Division for normalization
  - Calculating probabilities/weights
- List manipulation
  - Creating empty lists
  - Appending elements to lists
- Weighted random sampling
  - Using `random.choices()` with a weights parameter
- For loops with range
  - Iterating a specific number of times
- List indexing
  - Accessing the first element of a list with index [0]
def simulate_reads(
    num_reads: int, transcriptome: dict[str, dict[str, int]]
) -> list[str]:
    """
    Simulate RNA-seq read generation where transcripts are sampled based solely on
    a length bias modeled by the read_probability function.

    Args:
        num_reads: The number of reads to simulate.
        transcriptome: Dictionary where keys are transcript identifiers and values
            are dictionaries with the key "length".

    Returns:
        A list of transcript identifiers, one per simulated read.
    """
    # TODO: Calculate the total weight by summing the read probability for each transcript
    total_weight = 0.0
    # TODO: Create a dictionary of normalized weights for each transcript
    weights: dict[str, float] = {}
    # TODO: Create separate lists for transcript IDs and their corresponding weights
    transcript_list: list[str] = []
    weight_list: list[float] = []
    # TODO: Simulate reads by randomly selecting transcripts based on their weights
    reads: list[str] = []
    # TODO: Return the list of simulated reads
    return reads
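One way the TODOs could be filled in, assuming `read_probability` has been completed as above (a sketch, not the official solution):

def simulate_reads_example(
    num_reads: int, transcriptome: dict[str, dict[str, int]]
) -> list[str]:
    # Total (unnormalized) weight across all transcripts.
    total_weight = 0.0
    for info in transcriptome.values():
        total_weight += read_probability(info["length"])
    # Normalized sampling weight per transcript.
    weights: dict[str, float] = {}
    for name, info in transcriptome.items():
        weights[name] = read_probability(info["length"]) / total_weight
    # Parallel lists for random.choices().
    transcript_list = list(weights.keys())
    weight_list = list(weights.values())
    # Draw one transcript identifier per simulated read, weighted by length bias.
    reads: list[str] = []
    for _ in range(num_reads):
        reads.append(random.choices(transcript_list, weights=weight_list)[0])
    return reads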
Q02
Points: 35
In RNA-seq analysis, determining which transcript each sequencing read originated from is a fundamental challenge. Reads don't always map uniquely to a single transcript due to sequence similarities between gene isoforms, paralogous genes, or sequencing errors. This ambiguity creates uncertainty in transcript quantification that needs to be addressed systematically.
One way to represent these potential read-to-transcript assignments is through a binary assignment matrix. This matrix provides a foundation for more advanced transcript quantification methods.
The assignment matrix (Z) represents the relationship between reads and transcripts. Each entry Z[i,j] indicates whether read j could have originated from transcript i.
Understand and use the `initialize_assignment_matrix` function that creates an empty binary matrix with transcripts as rows and reads as columns.
Here's a list of programming concepts needed to implement the function body of `initialize_assignment_matrix`:
- NumPy array creation
  - Using `np.zeros()` to create a zero-filled array
  - Working with multi-dimensional arrays
- Tuple creation
  - Creating a tuple of dimensions for the array shape
def initialize_assignment_matrix(num_transcripts: int, num_reads: int) -> np.ndarray:
    """
    Initialize an empty binary assignment matrix Z for transcript-read assignments.

    This matrix represents which transcript each read originated from, where:
    - Each row corresponds to a transcript (i = 1, 2, ..., num_transcripts)
    - Each column corresponds to a read (j = 1, 2, ..., num_reads)
    - Z[i,j] = 1 indicates read j was generated from transcript i
    - Z[i,j] = 0 indicates read j was not generated from transcript i

    Initially, all values are set to 0 as no assignments have been made.

    Args:
        num_transcripts: Number of transcripts in the transcriptome.
        num_reads: Number of reads to be assigned.

    Returns:
        A zero-initialized numpy array of shape (num_transcripts, num_reads)
        representing the assignment matrix.
    """
    # TODO: Create a numpy array filled with zeros
    # TODO: Use np.zeros() to create this array
    # TODO: Return the initialized assignment matrix
    pass
Create a 5×10 assignment matrix (5 transcripts, 10 reads) and verify its dimensions.
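A minimal sketch of the body plus the requested check might look like this (the `_example` name is ours; `np.zeros` takes the shape as a tuple):

def initialize_assignment_matrix_example(num_transcripts: int, num_reads: int) -> np.ndarray:
    # Shape is passed as a (rows, columns) tuple; dtype defaults to float64.
    return np.zeros((num_transcripts, num_reads))

Z = initialize_assignment_matrix_example(5, 10)
assert Z.shape == (5, 10)  # 5 transcripts (rows) x 10 reads (columns)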
In reality, some reads may align to multiple transcripts due to sequence similarity. We can model this by allowing a read to be assigned to more than one transcript.
Understand and use the `assign_reads_to_transcripts` function that probabilistically assigns reads to transcripts with controlled ambiguity.
Here's a list of programming concepts needed to implement the function body of `assign_reads_to_transcripts`:
- NumPy array operations
  - Reading the array shape attribute
  - Indexing and modifying array elements
- Random number generation
  - Using `random.randint()` for random integer selection
  - Using `random.random()` for probability comparison
- Conditional statements
  - Comparing random values against a probability threshold
  - Checking for duplicate values
- For loops
  - Iterating through a range of indices
  - Nested loops
- List creation and manipulation
  - Creating a list with an initial element
  - Checking if values are in a list
  - Appending to lists
- Variable unpacking
  - Extracting dimensions from array shape
- Assignment operations
  - Setting specific elements in a NumPy array to a value
def assign_reads_to_transcripts(
    Z: np.ndarray, ambiguity_prob: float = 0.1
) -> np.ndarray:
    """
    Assign reads probabilistically to transcripts, allowing for ambiguity in assignment.

    This function simulates the read assignment process where some reads may have ambiguous
    origins (i.e., they could have come from multiple transcripts). For each read:
    1. One transcript is randomly assigned as the source
    2. With probability 'ambiguity_prob', a second transcript may also be assigned

    The assignment is recorded in the binary matrix Z, where Z[i,j] = 1 indicates
    that read j is assigned to transcript i.

    Args:
        Z: A binary assignment matrix of shape (num_transcripts, num_reads) initialized
            with zeros. Each column represents a read, and each row represents a transcript.
        ambiguity_prob: Probability (between 0 and 1) that a read will be ambiguously
            assigned to two different transcripts. Default is 0.1 (10%).

    Returns:
        np.ndarray: The updated assignment matrix Z with reads assigned to transcripts.
            For each column (read), there will be either one or two 1's, indicating
            which transcript(s) the read is assigned to.

    Note:
        The function uses STUDENT_ID as a seed for reproducibility in the random assignments.
    """
    np.random.seed(STUDENT_ID)
    random.seed(STUDENT_ID)
    # TODO: Extract the dimensions of the assignment matrix Z
    # The shape of Z gives (num_transcripts, num_reads)
    # TODO: Iterate through each read (columns of Z)
    # TODO: Randomly assign one transcript to this read
    #   Use random.randint() to select a random transcript index
    # TODO: With probability 'ambiguity_prob', assign a second transcript
    #   1. Generate a random number between 0 and 1
    #   2. If this number is less than ambiguity_prob, select another random transcript
    #   3. Make sure the second transcript is different from the first one
    # TODO: Update the assignment matrix Z
    #   Set Z[t, j] = 1 for each transcript t that is assigned to read j
    # TODO: Return the updated assignment matrix
    return Z
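One possible completion, assuming `Z` has at least two rows so the resampling loop that enforces a distinct second transcript always terminates (a sketch, not the official solution):

def assign_reads_to_transcripts_example(
    Z: np.ndarray, ambiguity_prob: float = 0.1
) -> np.ndarray:
    np.random.seed(STUDENT_ID)
    random.seed(STUDENT_ID)
    num_transcripts, num_reads = Z.shape
    for j in range(num_reads):
        # Primary assignment: one uniformly random transcript index.
        assigned = [random.randint(0, num_transcripts - 1)]
        # With probability ambiguity_prob, add a second, different transcript.
        if random.random() < ambiguity_prob:
            second = random.randint(0, num_transcripts - 1)
            while second in assigned:  # resample until distinct (needs >= 2 transcripts)
                second = random.randint(0, num_transcripts - 1)
            assigned.append(second)
        for t in assigned:
            Z[t, j] = 1
    return Z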
Experiment with different `ambiguity_prob` values (0.05, 0.2, 0.5) and analyze how they affect the proportion of ambiguous reads.
For further analysis, we often need to normalize the assignment matrix so that each read's assignments sum to 1 across all transcripts.
Understand and use the `normalize_assignment_matrix` function that normalizes each column of the assignment matrix.
Here's a list of programming concepts needed to implement the function body of `normalize_assignment_matrix`:
- NumPy array operations
  - Using `sum()` with an axis parameter to calculate column sums
  - Setting `keepdims=True` to maintain array dimensionality
  - Element-wise division of arrays
- Array broadcasting
  - Division of a matrix by a row vector with compatible dimensions
- Conditional array assignment
  - Using boolean indexing to modify specific elements (handling zero sums)
def normalize_assignment_matrix(Z: np.ndarray) -> np.ndarray:
    """
    Normalize each column of the assignment matrix so that read assignments sum to 1
    across transcripts.

    In RNA-seq quantification, reads may be assigned to multiple transcripts with varying
    confidence. This function normalizes these assignments to ensure that the total
    probability mass for each read (column) sums to exactly 1, representing a proper
    probability distribution over possible transcript origins.

    The normalization is performed by dividing each entry in a column by the sum of that
    column. For example, if a read is assigned to two transcripts with equal weights, each
    will receive a normalized value of 0.5.

    Args:
        Z: Assignment matrix of shape (num_transcripts, num_reads) where Z[i,j] represents
            the assignment weight of read j to transcript i. Values are typically binary
            (0 or 1) before normalization, but can be any non-negative value.

    Returns:
        Normalized assignment matrix of the same shape as Z, where each column
        sums to 1 (except for columns that were all zeros, which remain zeros).
        After normalization, Z[i,j] represents the probability that read j
        originated from transcript i.

    Note:
        Columns that sum to zero (no assignments) remain as zeros after normalization
        to avoid division by zero errors.
    """
    # TODO: Calculate the sum of each column in the assignment matrix
    # Use the sum() method with axis=0 and keepdims=True to maintain array dimensions
    # TODO: Handle the case of columns that sum to zero
    # Replace zero sums with 1 to avoid division by zero
    # TODO: Normalize each value by dividing by the column sum
    # This ensures each column (read) has a total assignment probability of 1
    # TODO: Return the normalized assignment matrix
    return None
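One possible body, following the TODOs (a sketch; `column_sums` is a fresh array returned by `sum()`, so modifying it in place does not touch `Z`):

def normalize_assignment_matrix_example(Z: np.ndarray) -> np.ndarray:
    column_sums = Z.sum(axis=0, keepdims=True)  # shape (1, num_reads)
    column_sums[column_sums == 0] = 1           # all-zero columns stay all zeros
    return Z / column_sums                      # broadcasts the row vector across rows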
For further analysis, we will need initial estimates of transcript abundances. These values represent the relative proportions of each transcript in the sample.
Understand and use the `initialize_abundances` function that creates random initial abundance estimates.
Here's a list of programming concepts needed to implement the function body of `initialize_abundances`:
- NumPy random number generation
  - Using `np.random.seed()` for reproducibility
  - Using `np.random.rand()` to generate random values
- NumPy array creation
  - Creating a 1D array of random values
- NumPy array operations
  - Using `np.sum()` to calculate the sum of all elements
- Array normalization
  - Element-wise division of an array by a scalar
def initialize_abundances(num_transcripts: int) -> np.ndarray:
    """
    Initialize random transcript abundance values and normalize them to sum to 1.

    This function creates an initial set of random abundance values that represent the
    relative proportion of each transcript in the sample. These abundances can be
    interpreted as the probability that a randomly selected RNA molecule in the sample
    originated from each transcript.

    The random values are drawn from a uniform distribution between 0 and 1, and then
    normalized so that they sum to 1, creating a valid probability distribution over
    the transcripts.

    Args:
        num_transcripts: The number of transcripts in the transcriptome for which
            abundances need to be initialized.

    Returns:
        A 1D array of length num_transcripts containing the normalized
        abundance values. Each value is between 0 and 1, and the sum of all
        values is 1.

    Note:
        Uses STUDENT_ID as a random seed for reproducibility of results.
    """
    # TODO: Set the random seed using STUDENT_ID for reproducibility
    # TODO: Generate an array of random values between 0 and 1
    # Use np.random.rand() to create an array of size num_transcripts
    # TODO: Normalize the abundance values to sum to 1
    # Divide each value by the sum of all values
    # TODO: Return the normalized abundances
    return None
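A minimal sketch of the body (the `_example` name is ours):

def initialize_abundances_example(num_transcripts: int) -> np.ndarray:
    np.random.seed(STUDENT_ID)                    # reproducible draws
    abundances = np.random.rand(num_transcripts)  # uniform values in [0, 1)
    return abundances / np.sum(abundances)        # rescale to sum to 1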
Q03
Points: 10
Log likelihoods
For $N$ independent reads $X = \{x_1, \ldots, x_N\}$, the likelihood factorizes as a product, $L(\theta \mid X) = \prod_{i=1}^{N} f(x_i \mid \theta)$. In the broader context of statistical inference, this formulation helps in parameter estimation and hypothesis testing by allowing us to work with log-likelihoods, since taking the logarithm turns the product into a sum, which is computationally more convenient:
$$ \log L(\theta \mid X) = \sum_{i=1}^{N} \log f(x_i \mid \theta) $$
This independence assumption underpins many standard methods in RNA-seq data analysis, such as estimating transcript abundances or testing differential expression. While it is a useful idealization, it's also important to acknowledge that in real-world data there might be some dependencies between reads (e.g., due to technical or biological factors). However, the independence model often provides a reasonable approximation that allows researchers to make meaningful inferences about gene expression patterns.
RNA-seq parameters
In many statistical models, we start by denoting the unknown parameters as $\theta$, which represents the set of values we aim to estimate. In the context of RNA-seq, these parameters are specifically the transcript abundances. We redefine our parameters as:
$$ \pi = (\pi_1, \pi_2, \dots, \pi_K) $$
where each $\pi_k$ corresponds to the relative abundance of transcript $k$. This change in notation from $\theta$ to $\pi$ reflects our focus on a particular biological quantity—transcript abundance—instead of a generic parameter vector.
Complete-data model
In RNA-seq analysis, we encounter a fundamental challenge: our observed data is incomplete. While we can observe the RNA-seq reads $X = \{x_1, x_2, \ldots, x_N\}$, we lack crucial information about which specific transcript generated each read. This uncertainty creates what we call an incomplete-data model.
The likelihood function for this incomplete-data model is:
$$ L(\pi \mid X) = \prod_{j=1}^{N} f(x_j \mid \pi) $$
Where:
- $\pi = (\pi_1, \pi_2, \ldots, \pi_K)$ represents the transcript abundances we wish to estimate
- $f(x_j \mid \pi)$ is the probability of observing read $j$ given the transcript abundances
This formulation implicitly integrates over all possible transcript origins for each read. Specifically, for each read $x_j$, the function $f(x_j \mid \pi)$ represents a mixture probability:
$f(x_j \mid \pi) = \sum_{k=1}^{K} P(x_j \mid \text{transcript } k) \cdot \pi_k$
This sum considers all possible transcripts that could have generated the read. When we take the logarithm of the likelihood, we get a "log of sum" structure that is mathematically challenging to optimize:
$\log L(\pi \mid X) = \sum_{j=1}^{N} \log\left(\sum_{k=1}^{K} P(x_j \mid \text{transcript } k) \cdot \pi_k\right)$
The complexity of this integration makes direct optimization of the likelihood with respect to $\pi$ mathematically challenging and computationally intensive, as the parameters are entangled in a non-linear, non-convex function.
To address this challenge, we introduce latent indicator variables $\delta_{kj}$ for each read $j$ and transcript $k$:
$$ \delta_{kj} = \begin{cases} 1, & \text{if read } j \text{ originated from transcript } k, \\ 0, & \text{otherwise.} \end{cases} $$
These binary variables represent the true, but hidden, assignment of reads to transcripts. For each read $j$, exactly one $\delta_{kj}$ equals 1 (the transcript that actually produced the read), while all others equal 0.
If we knew the true values of all $\delta_{kj}$ (which we don't in practice), we could formulate a complete-data model. The likelihood function would then simplify to:
$$ L(\pi \mid \delta) = \prod_{j=1}^{N} \prod_{k=1}^{K} (\pi_k)^{\delta_{kj}} $$
This expression elegantly captures that each read contributes exactly one term to the likelihood: the abundance $\pi_k$ of its originating transcript. The inner product ensures that only the term corresponding to the true transcript (where $\delta_{kj} = 1$) contributes to the likelihood, while all other terms (where $\delta_{kj} = 0$) become 1 and have no effect.
Taking the logarithm of the complete-data likelihood converts the product into a sum:
$$ \log L(\pi \mid \delta) = \sum_{j=1}^{N} \sum_{k=1}^{K} \delta_{kj} \log \pi_k $$
This transformation yields a mathematically tractable expression that forms the foundation for the Expectation-Maximization (EM) algorithm's application to transcript quantification. The EM algorithm will iteratively estimate the expected values of $\delta_{kj}$ based on current abundance estimates, then update the abundance estimates to maximize this expected complete-data log-likelihood.
Soft assignments
In practice, since the true assignments $\delta_{kj}$ are unobserved, we compute soft assignments, denoted by $\gamma_{kj}$. These soft assignments represent the probability that read $j$ originates from transcript $k$, computed based on the current estimates of $\pi$ (and any additional bias factors). For each read $j$, $\gamma_{kj}$ is a continuous value between 0 and 1, and they satisfy:
$$ \sum_{k=1}^{K} \gamma_{kj} = 1. $$
Thus, while $\delta_{kj}$ represents the ideal, binary latent indicator variable, $\gamma_{kj}$ serves as its probabilistic counterpart in the expectation step of the EM algorithm.
Accounting for Transcript Length Bias
RNA-seq data exhibits an important biological reality: longer transcripts are more likely to generate reads simply because they provide more opportunities for fragmentation and sequencing. To account for this inherent bias, we must incorporate transcript length into our model.
For each transcript $k$, we denote its length as $L_k$. When computing the probability that a read comes from a given transcript, we weight the contribution by both the transcript's relative abundance $\pi_k$ and its length $L_k$.
For each read $j$ and transcript $k$, we first compute a weighted value:
$$ w_{kj} = \pi_k \cdot L_k \cdot Z_{kj} $$
Where:
- $\pi_k$ is the current estimate of transcript $k$'s abundance
- $L_k$ is the length of transcript $k$
- $Z_{kj}$ is an indicator (typically 0 or 1) of whether read $j$ could have originated from transcript $k$ based on sequence compatibility
This weighted value reflects the combined influence of the transcript's abundance, its length, and whether the read could have originated from it based on sequence alignment.
Next, we normalize these weights to compute the soft assignment $\gamma_{kj}$:
$$ \gamma_{kj} = \frac{w_{kj}}{\sum_{l=1}^{K} w_{lj}} = \frac{\pi_k \cdot L_k \cdot Z_{kj}}{\sum_{l=1}^{K} \pi_l \cdot L_l \cdot Z_{lj}} $$
This normalization ensures that, for each read $j$, the probabilities sum to 1.
Here's a list of programming concepts needed to implement the function body of `get_soft_assignments`:
- Function calling
  - Calling external functions (`read_probability`) with multiple parameters
- NumPy array operations
  - Array broadcasting with `np.newaxis`
  - Element-wise multiplication of arrays
  - Using `sum()` with an axis parameter and the keepdims option
- Conditional array assignment
  - Using boolean indexing to modify specific elements (handling zero sums)
- Array reshaping
  - Using indexing with `np.newaxis` to convert 1D arrays to column vectors
- Array normalization
  - Element-wise division of arrays
def get_soft_assignments(Z, abundances, lengths, k=0.006, l0=400.0):
    """
    Compute the probabilistic (soft) assignments of reads to transcripts using an
    Expectation-Maximization approach.

    In RNA-seq quantification, this represents the E-step where we calculate the
    probability (gamma) that each read originated from each transcript. The calculation
    incorporates:
    1. Current transcript abundance estimates (pi)
    2. Transcript length bias modeling using a logistic function
    3. Binary compatibility matrix indicating which reads could come from which transcripts

    The resulting probabilities form a proper distribution for each read (each column
    sums to 1).

    Args:
        Z: Binary assignment matrix of shape (K, N), where K is the number of transcripts
            and N is the number of reads. Z[k,j] = 1 indicates read j could have originated
            from transcript k.
        abundances: Current estimates of transcript relative abundances (pi), a 1D array of
            length K summing to 1.
        lengths: Transcript lengths in base pairs, a 1D array of length K.
        k: Steepness parameter for the logistic read probability function (default 0.006).
            Controls how quickly the probability changes with length.
        l0: Inflection point for the read probability function (default 400 bp).
            The length at which the probability equals 0.5.

    Returns:
        Soft assignment matrix (gamma) of shape (K, N), where gamma[k,j]
        represents the probability that read j originated from transcript k.
        Each column sums to 1, forming a probability distribution.
    """
    # TODO: Calculate read probabilities based on transcript lengths
    # Use the read_probability function with the provided lengths, k, and l0 parameters
    # TODO: Compute the weighted contributions for each transcript and read
    # The weight matrix should be calculated as: abundances * read_probs * Z
    # Use broadcasting by adding np.newaxis to abundances and read_probs
    # TODO: Compute the sum of weights for each read (across all transcripts)
    # Sum along axis 0 (transcript axis) and keep dimensions for proper broadcasting
    # TODO: Handle the case of reads with no assignments
    # Replace any zero sums with 1 to avoid division by zero
    # TODO: Normalize weights to get the soft assignment probabilities (gamma)
    # Divide the weights by their column sums to ensure each column sums to 1
    # TODO: Return the soft assignment matrix (gamma)
    return None
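One possible body, following the TODOs. Note that, per the comments above, the logistic read probability $P(L_k)$ stands in for the raw length $L_k$ in the weight formula $w_{kj} = \pi_k \cdot L_k \cdot Z_{kj}$; it plays the same role as a length-dependent weight. This sketch assumes `abundances` and `lengths` are NumPy arrays and `read_probability` is the vectorized logistic from Q01:

def get_soft_assignments_example(Z, abundances, lengths, k=0.006, l0=400.0):
    read_probs = read_probability(lengths, k, l0)     # shape (K,): length-bias factor
    # Reshape the (K,) vectors into (K, 1) columns so they scale each row of Z.
    weights = abundances[:, np.newaxis] * read_probs[:, np.newaxis] * Z  # (K, N)
    column_sums = weights.sum(axis=0, keepdims=True)  # shape (1, N)
    column_sums[column_sums == 0] = 1                 # reads with no assignment stay zero
    return weights / column_sums                      # each column now sums to 1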
Connection to the Expected Log Likelihood
In the context of the EM algorithm, we work with the expected complete-data log likelihood. Using our soft assignments $\gamma_{kj}$ as the expected values of the latent variables $\delta_{kj}$, this expected log likelihood (ignoring constant terms) becomes:
$$ \mathbb{E}[\log L(\pi \mid \delta)] = \sum_{j=1}^{N} \sum_{k=1}^{K} \gamma_{kj} \log (\pi_k \cdot L_k) $$
This formulation allows us to iteratively:
- Compute soft assignments $\gamma_{kj}$ based on current estimates of $\pi$ (E-step)
- Update $\pi$ to maximize the expected log likelihood given these soft assignments (M-step)
Here's a list of programming concepts needed to implement the function body of `expectation_step`:
- Function calling
  - Calling complex functions with multiple parameters (`get_soft_assignments`)
  - Passing parameters to helper functions
- NumPy array operations
  - Array broadcasting with `np.newaxis`
  - Element-wise multiplication
  - Using `np.sum()` for array reduction
- Mathematical operations
  - Logarithmic calculations with `np.log()`
  - Handling numerical stability (adding epsilon to avoid log(0))
def expectation_step(Z, abundances, lengths, k=0.006, l0=400.0):
    """
    Perform the Expectation (E) step of the Expectation-Maximization algorithm for
    RNA-seq quantification.

    The E-step consists of two main parts:
    1. Computing the posterior probabilities (gamma) that each read originated from each
       transcript given the current abundance estimates
    2. Computing the expected log-likelihood of the data given these probabilities

    This function implements the E-step in the iterative EM algorithm for transcript
    abundance estimation. The computed gamma values will be used in the subsequent
    Maximization step to update transcript abundance estimates.

    Args:
        Z: Binary assignment matrix of shape (K, N), where K is the number of transcripts
            and N is the number of reads. Z[k,j] = 1 indicates read j could have originated
            from transcript k.
        abundances: Current estimates of transcript relative abundances (pi), a 1D array of
            length K summing to 1.
        lengths: Transcript lengths in base pairs, a 1D array of length K.
        k: Steepness parameter for the logistic read probability function (default 0.006).
            Controls how quickly the probability changes with length.
        l0: Inflection point for the read probability function (default 400 bp).
            The length at which the probability equals 0.5.

    Returns:
        A tuple containing:
        - gamma (np.ndarray): Soft assignment matrix of shape (K, N), where gamma[k,j]
          represents the probability that read j originated from transcript k.
        - expected_log_likelihood (float): The expected log-likelihood of the data given
          the current parameter estimates, used for monitoring convergence.

    Note:
        A small epsilon (1e-12) is added to the weighted abundances before taking the
        logarithm to avoid numerical issues with log(0).
    """
    # TODO: Part 1 - Compute the soft assignment matrix (gamma)
    # Use the get_soft_assignments function with Z, abundances, lengths, k, and l0
    # TODO: Part 2 - Begin computing the expected log-likelihood
    # Calculate read probabilities using the read_probability function
    # TODO: Calculate weighted abundances by multiplying abundances by read probabilities
    # TODO: Add a small epsilon (e.g., 1e-12) to weighted abundances to avoid log(0)
    # TODO: Compute the logarithm of the weighted abundances
    # TODO: Calculate the expected log-likelihood
    # Multiply gamma by log_weighted_abundances (with appropriate broadcasting)
    # Sum all elements to get the final scalar expected log-likelihood value
    # TODO: Return both gamma and the expected log-likelihood as a tuple
    return None, None
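A sketch of how the two parts could be combined, assuming `get_soft_assignments` and `read_probability` have been completed (the `_example` name is ours):

def expectation_step_example(Z, abundances, lengths, k=0.006, l0=400.0):
    # Part 1: soft assignments (gamma).
    gamma = get_soft_assignments(Z, abundances, lengths, k, l0)
    # Part 2: expected log-likelihood under the current parameters.
    read_probs = read_probability(lengths, k, l0)   # shape (K,)
    weighted = abundances * read_probs + 1e-12      # epsilon guards against log(0)
    log_weighted = np.log(weighted)                 # shape (K,)
    # Broadcast the per-transcript log terms down the columns of gamma, then sum.
    expected_log_likelihood = np.sum(gamma * log_weighted[:, np.newaxis])
    return gamma, expected_log_likelihood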
Q04
Points: 5
After computing soft assignments in the expectation step, we need to update our transcript abundance estimates ($\pi_k$) to better explain the observed data. This is the "M-step" in the EM algorithm.
The key insight behind the maximization step is straightforward: transcripts that likely generated more reads should have higher abundance estimates. But how do we formalize this intuition mathematically?
Recall that our expected complete-data log likelihood is:
$$ \mathbb{E}[\log L(\pi \mid \delta)] = \sum_{j=1}^{N} \sum_{k=1}^{K} \gamma_{kj} \log (\pi_k \cdot L_k) $$
This is our objective function: we want to find values of $\pi_k$ that maximize this expression. Since the transcript lengths $L_k$ are fixed constants, our focus is on optimizing the $\pi_k$ values.
We face an important constraint: the $\pi_k$ values represent relative abundances and must sum to 1:
$$ \sum_{k=1}^{K} \pi_k = 1 $$
Deriving the Update Rule
To find the values of $\pi_k$ that maximize the expected log likelihood subject to the constraint that they sum to 1, we use the method of Lagrange multipliers. This leads to a surprisingly elegant solution.
Let's define the "effective count" for each transcript as:
$$ C_k = \sum_{j=1}^{N} \gamma_{kj} $$
This represents the expected number of reads originating from transcript $k$ under our current model. Intuitively, if many reads have high probability of coming from transcript $k$ (high $\gamma_{kj}$ values), then $C_k$ will be large.
The optimal update rule turns out to be:
$$ \pi_k^{\text{new}} = \frac{C_k}{\sum_{l=1}^{K} C_l} $$
This formula has an intuitive interpretation: the new abundance of transcript $k$ should be proportional to its effective read count, normalized by the total effective count across all transcripts.
Simplifying the Update Rule
We can simplify this further by noting that $\sum_{j=1}^{N} \sum_{k=1}^{K} \gamma_{kj} = N$, since for each read $j$, the soft assignments $\gamma_{kj}$ sum to 1 across all transcripts. Therefore:
$$ \sum_{l=1}^{K} C_l = \sum_{l=1}^{K} \sum_{j=1}^{N} \gamma_{lj} = \sum_{j=1}^{N} \sum_{l=1}^{K} \gamma_{lj} = \sum_{j=1}^{N} 1 = N $$
This gives us our final update rule:
$$ \pi_k^{\text{new}} = \frac{\sum_{j=1}^{N} \gamma_{kj}}{N} $$
This elegant formula tells us that the updated abundance of each transcript is simply the average of its soft assignment values across all reads.
Here's a list of programming concepts needed to implement the function body of `maximization_step`:
- NumPy array operations
  - Using `sum()` with an axis parameter to sum rows
  - Calculating the total sum of an array
- Array normalization
  - Element-wise division by a scalar
def maximization_step(gamma: np.ndarray) -> np.ndarray:
    """
    Perform the Maximization (M) step of the Expectation-Maximization algorithm for
    RNA-seq quantification.

    The M-step updates the transcript abundance estimates (pi) based on the posterior
    probabilities (gamma) computed in the E-step. This maximizes the expected
    log-likelihood with respect to the model parameters.

    The new abundance for each transcript is calculated as the sum of its probabilistic
    read assignments divided by the total number of reads. This represents the maximum
    likelihood estimate (MLE) under the model assumptions.

    Args:
        gamma: Soft assignment matrix of shape (K, N), where K is the number of transcripts
            and N is the number of reads. gamma[k,j] represents the probability that
            read j originated from transcript k. Each column of gamma should sum to 1.

    Returns:
        np.ndarray: Updated transcript abundance estimates (pi), a 1D array of length K
            where each value represents the relative abundance of a transcript.
            The returned abundances sum to 1, forming a proper probability distribution.
    """
    # TODO: Calculate the effective count for each transcript
    # Sum the gamma values along axis 1 (read axis) to get the expected number of reads
    # from each transcript
    # TODO: Calculate the total effective count
    # Sum all the effective counts to get the total number of reads
    # TODO: Calculate the new abundance estimates
    # Divide each transcript's effective count by the total count to normalize
    # TODO: Return the updated abundance estimates
    return None
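One possible body, a direct transcription of the update rule $\pi_k^{\text{new}} = C_k / N$ derived above:

def maximization_step_example(gamma: np.ndarray) -> np.ndarray:
    effective_counts = gamma.sum(axis=1)  # C_k: expected reads per transcript, shape (K,)
    total = effective_counts.sum()        # equals N when every column of gamma sums to 1
    return effective_counts / total       # new abundances sum to 1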
With both the expectation step and maximization step defined, the EM algorithm proceeds iteratively:
1. Start with initial abundance estimates $\pi^{(0)}$
2. E-step: Compute soft assignments $\gamma$ based on current abundances
3. M-step: Update abundances using the formula derived above
4. Repeat steps 2-3 until convergence
Each iteration is guaranteed to improve the likelihood of the observed data under our model, eventually converging to a local optimum.
The beauty of this approach is that it transforms a complex optimization problem with hidden variables into a series of simpler, tractable optimization steps, allowing us to efficiently estimate transcript abundances from RNA-seq data.
Here's a list of programming concepts needed to implement the function body of `run_em`:
- NumPy array operations
  - Creating uniform arrays using `np.ones()` and division
  - Accessing array shape attributes
- Function calling
  - Calling complex functions with multiple parameters (`expectation_step`, `maximization_step`)
  - Handling multiple return values
- For loops
  - Iterating a specific number of times
  - Using a loop counter variable
def run_em(Z: np.ndarray, lengths: np.ndarray, num_iterations: int = 10) -> np.ndarray:
    """
    Run the Expectation-Maximization algorithm to estimate transcript abundances from
    RNA-seq data.

    This function implements the complete EM algorithm for transcript abundance estimation.
    It iteratively refines abundance estimates by alternating between:
    1. E-step: Computing the probabilities of read assignments given current abundances
    2. M-step: Updating abundance estimates based on these probabilities

    The algorithm continues for a fixed number of iterations or until convergence.
    It prints diagnostic information during each iteration to track progress.

    Args:
        Z: Binary assignment matrix of shape (K, N), where K is the number of
            transcripts and N is the number of reads. Z[k,j] = 1 indicates read j
            could have originated from transcript k.
        lengths: Transcript lengths in base pairs, a 1D array of length K.
        num_iterations: Number of EM iterations to perform (default: 10).
            More iterations typically result in better convergence
            but take longer to run.

    Returns:
        Final estimated transcript abundances (pi), a 1D array of length K
        where each value represents the relative abundance of a transcript.
        The returned abundances sum to 1, forming a proper probability distribution.

    Note:
        This implementation uses uniform initialization for transcript abundances.
        The algorithm typically converges within 10-50 iterations for most RNA-seq
        datasets, but this can vary depending on dataset complexity.
    """
    # TODO: Determine the number of transcripts K from the shape of Z
    # TODO: Initialize abundances uniformly
    # Create an array of K elements, all with the same value 1/K
    # TODO: Run the EM algorithm for the specified number of iterations
    # TODO: Print the current iteration number and abundance estimates
    # TODO: Perform the Expectation step
    # Call expectation_step with the current abundances to get gamma and log-likelihood
    # TODO: Perform the Maximization step
    # Call maximization_step with gamma to get updated abundances
    # TODO: Return the final abundance estimates
    return None
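A sketch of the loop, assuming `expectation_step` and `maximization_step` have been completed as above, followed by a commented outline of how the whole pipeline fits together:

def run_em_example(Z: np.ndarray, lengths: np.ndarray, num_iterations: int = 10) -> np.ndarray:
    K = Z.shape[0]               # number of transcripts (rows of Z)
    abundances = np.ones(K) / K  # uniform initialization: pi_k = 1/K
    for iteration in range(num_iterations):
        print(f"Iteration {iteration + 1}: abundances = {abundances}")
        gamma, log_likelihood = expectation_step(Z, abundances, lengths)
        abundances = maximization_step(gamma)
    return abundances

# Rough end-to-end outline (assumes the skeletons above are completed):
# transcriptome = define_transcriptome(5)
# lengths = np.array([t["length"] for t in transcriptome.values()])
# Z = initialize_assignment_matrix(5, 1000)
# Z = assign_reads_to_transcripts(Z, ambiguity_prob=0.1)
# pi = run_em(Z, lengths, num_iterations=10)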