Project 02
Assignment C
In this assignment, we analyze bacterial gene expression data collected from stainless steel 316 surfaces after three days of exposure in two distinct conditions: on the ground and in space.
The primary goal is to determine whether bacterial gene expression differs significantly between these conditions.
To achieve this, we employ Negative Binomial models, a common statistical approach for modeling overdispersed count data in RNA sequencing and microbial studies.
By the end of this assignment, you will gain hands-on experience in statistical modeling, parameter estimation, and hypothesis testing in a biological context.
import requests
import io
import pandas as pd
import numpy as np
import numpy.typing as npt
The function defined below downloads the dataset and returns a Pandas DataFrame containing gene counts for bacterial growth on stainless steel 316 after three days in space or on the ground.
def get_count_data():
    csv_path = "https://github.com/oasci/pitt-biosc1540-2025s/raw/refs/heads/main/content/data/gene-expr/ground-space-day3/ss316-3day-bygravity-counts.csv"
    response = requests.get(csv_path)
    if response.status_code == 200:
        csv_text = response.text
        data = pd.read_csv(io.StringIO(csv_text))
        return data
    else:
        print(f"Failed to fetch file. Status code: {response.status_code}")
        return None
df = get_count_data()
To analyze gene expression data, we need to know how to retrieve specific values from the dataframe.
The dataset is stored in a Pandas DataFrame (df), where each row represents a gene and each column contains information about its expression levels under different conditions. One way to extract data is by using integer-based indexing with .iloc[].
For example:
print(df.iloc[19])
Gene.ID     PA14_00210
Ground.1           223
Ground.2           282
Ground.3           777
Ground.4           444
Space.1             42
Space.2            148
Space.3            112
Space.4            118
Name: 19, dtype: object
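If we only need the numeric counts from a row, positional slicing works on the extracted row as well. Below is a minimal sketch, assuming the column order shown above (Gene.ID first, then four ground replicates, then four space replicates):

row = df.iloc[19]
# Columns 1-4 hold the ground replicates; columns 5-8 hold the space replicates.
ground_counts = row.iloc[1:5].to_numpy(dtype=np.float64)
space_counts = row.iloc[5:9].to_numpy(dtype=np.float64)
print(ground_counts)  # [223. 282. 777. 444.]
print(space_counts)   # [ 42. 148. 112. 118.]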
Q01
Points: 15
The Negative Binomial (NB) distribution is commonly used to model count data, particularly in cases where the variance exceeds the mean (i.e., overdispersion). The probability mass function (PMF) gives the probability of observing a particular count $x$, given the mean expression level $\mu$ and the dispersion parameter $\alpha$.
A key property of the NB distribution is its variance:
$$ \mathrm{Var}(X) = \mu + \alpha \mu^2. $$
This expression shows how dispersion ($\alpha$) affects the spread of the data:
- When $\alpha \to 0$, the variance approaches $\mu$, reducing the distribution to a Poisson model where variance equals the mean.
- When $\alpha > 0$, the variance grows larger than the mean, allowing the model to capture overdispersion often seen in RNA-seq data.
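To make the role of $\alpha$ concrete, the following small sketch (with illustrative numbers, not values from the dataset) compares the NB variance to the Poisson variance for a fixed mean:

mu = 100.0
for alpha in [0.0, 0.1, 1.0]:
    # NB variance; alpha = 0 recovers the Poisson case, where variance equals the mean.
    var = mu + alpha * mu**2
    print(f"alpha={alpha}: Var(X) = {var:.1f} (Poisson would give {mu:.1f})")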
In order to estimate the parameters $\mu$ (mean) and $\alpha$ (dispersion), we employ maximum likelihood estimation (MLE). The likelihood function represents the probability of observing the data under a given set of model parameters. Since likelihood values are typically very small when dealing with multiple observations, we work with the log-likelihood function instead. Taking the natural logarithm of the likelihood conveniently transforms the product over observations into a summation:
$$ \log L(\mu, \alpha) = \sum_{i} \log \mathrm{NB}(x_i; \mu, \alpha). $$
After substituting the NB model, the full log-likelihood function becomes:
$$ \log L(\mu, \alpha) = \sum_{i} \left[ \log \Gamma(x_i + \tfrac{1}{\alpha}) - \log \Gamma(x_i+1) - \log \Gamma(\tfrac{1}{\alpha}) + \frac{1}{\alpha} \log \left(\frac{1/\alpha}{1/\alpha + \mu}\right) + x_i \log \left(\frac{\mu}{1/\alpha + \mu}\right) \right]. $$
For optimization purposes, we typically minimize rather than maximize functions. To achieve this, we define the negative log-likelihood (NLL) by multiplying the log-likelihood function by -1:
$$ \mathcal{L}(\mu, \alpha) = - \sum_{i} \left[ \log \Gamma(x_i + \tfrac{1}{\alpha}) - \log \Gamma(x_i+1) - \log \Gamma(\tfrac{1}{\alpha}) + \frac{1}{\alpha} \log \left(\frac{1/\alpha}{1/\alpha + \mu}\right) + x_i \log \left(\frac{\mu}{1/\alpha + \mu}\right) \right]. $$
This negative log-likelihood function serves as the objective function in parameter estimation. By minimizing $\mathcal{L}(\mu, \alpha)$, we find the optimal values of $\mu$ and $\alpha$ that maximize the likelihood of observing the given gene expression data.
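As a sanity check on this formula, the same quantity can be evaluated with scipy.stats.nbinom; this is not required by the assignment, but it is useful for verifying your implementation. A minimal sketch, assuming the parameterization $n = 1/\alpha$ and $p = (1/\alpha)/(1/\alpha + \mu)$:

import numpy as np
from scipy.stats import nbinom

def nb_nll_check(mu, alpha, counts):
    # Convert (mu, alpha) to SciPy's (n, p) parameterization of the NB distribution.
    r = 1.0 / alpha
    p = r / (r + mu)
    return -np.sum(nbinom.logpmf(np.asarray(counts), r, p))

print(nb_nll_check(2.0, 0.5, [3, 0, 2]))  # ~5.1397, matching the docstring example below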
from scipy.special import gammaln

def negbin_nll_contrib(
    mu: float, alpha: float, counts: npt.NDArray[np.uint32]
) -> np.float64:
    """Compute individual negative log-likelihood contributions for Negative Binomial.

    Args:
        mu: Mean of the negative binomial distribution (> 0).
        alpha: Dispersion parameter (> 0).
        counts: Observed counts (non-negative integers).

    Returns:
        Negative log-likelihood contributions summed over all counts.

    Examples:
        >>> negbin_nll_contrib(2.0, 0.5, [3, 0, 2])
        5.139712336371398
    """
    # TODO: Check if mu and alpha are valid (both must be > 0), return np.inf if not.
    # TODO: Convert counts to a NumPy array.
    # TODO: Compute r as the inverse of alpha.
    # TODO: Implement the formula to compute the negative log-likelihood.
    # TODO: Return the computed negative log-likelihood.
Q02
Points: 30
In statistical modeling, a null model serves as a baseline hypothesis against which we compare more complex models. In the context of differential gene expression analysis, the null model represents the assumption that there is no difference in expression between the two experimental conditions. By comparing the null model to a more flexible model that allows for condition-specific differences, we can assess whether gene expression is significantly affected by the experimental conditions.
In the null model, we assume that all 8 samples (4 from ground and 4 from space) share the same mean expression level, denoted as $\mu$. That is, we ignore any potential effect of condition and fit a single-group model, treating the samples as if they all come from the same underlying population. Mathematically, this means that the expression count for a given gene, across all samples, follows an NB distribution with a single shared mean:
$$ X_i \sim \mathrm{NB}(\mu, \alpha), $$
where:
- $X_i$ represents the observed count for sample $i$.
- $\mu$ is the shared mean expression level across all samples.
- $\alpha$ is the dispersion parameter accounting for variability in gene expression.
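To build intuition for what the null model asserts, one can simulate all eight samples from a single shared NB distribution. A minimal sketch with illustrative (not fitted) parameters, using NumPy's (n, p) parameterization where $n = 1/\alpha$ and $p = n/(n + \mu)$:

rng = np.random.default_rng(seed=0)
mu, alpha = 200.0, 0.5  # illustrative shared mean and dispersion
r = 1.0 / alpha
p = r / (r + mu)
# Under the null model, all 8 samples are draws from the same distribution.
samples = rng.negative_binomial(r, p, size=8)
print(samples)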
def negbin_nloglik_single(
    params: list[float], counts: npt.NDArray[np.uint32]
) -> np.float64:
    """Computes the negative log-likelihood for a single negative binomial distribution.

    The negative binomial distribution is parameterized by its mean (`mu`) and
    dispersion parameter (`alpha`). This function calculates the negative
    log-likelihood (NLL) of observing the provided count data given the parameters.
    A lower NLL indicates a better fit of the parameters to the data.

    Args:
        params: A length-2 array or list containing:
            - mu: Mean of the negative binomial distribution (must be > 0).
            - alpha: Dispersion parameter (inverse of the size
              parameter `r`, must be > 0).
        counts: 1D array or list of non-negative integer counts observed.

    Returns:
        The negative log-likelihood value. Returns infinity if parameters are invalid
        (i.e., non-positive).

    Examples:
        >>> counts = [3, 0, 2, 4, 1]
        >>> params = [2.0, 0.5]
        >>> negbin_nloglik_single(params, counts)
        9.07545186841686

        # Invalid parameters example
        >>> params_invalid = [-1, 0.5]
        >>> negbin_nloglik_single(params_invalid, counts)
        inf
    """
    # TODO: Check that params contains exactly two elements, otherwise raise a ValueError.
    # TODO: Extract mu and alpha from params.
    # TODO: Compute the negative log-likelihood using the negbin_nll_contrib function.
    # TODO: Return the computed negative log-likelihood.
Q03
Points: 10
To estimate the parameters $\mu$ (mean expression level) and $\alpha$ (dispersion) of the NB distribution, we need to find the values that minimize the negative log-likelihood (NLL). Since there is no closed-form solution for these parameters, we use numerical optimization techniques to perform maximum likelihood estimation (MLE).
In this section, we define the function fit_reduced_model, which fits the reduced model, meaning that we assume all samples (from both conditions) share the same mean expression level $\mu$. This corresponds to the null hypothesis, where there is no differential expression between groups. We will use SciPy's minimize function to find the optimal values of $\mu$ and $\alpha$. The steps of the optimization process are as follows:
- Initialize parameters: $\mu$ is initialized as the mean of the observed counts, ensuring a reasonable starting point, and $\alpha$ is initialized to 0.1 as a rough starting estimate of dispersion.
- Set parameter bounds: both $\mu$ and $\alpha$ must be strictly positive to ensure valid probability calculations, so we impose lower bounds of $10^{-9}$ to prevent numerical errors.
- Minimize the negative log-likelihood (NLL): the optimizer iteratively updates $\mu$ and $\alpha$ to find the values that minimize the NLL, thereby maximizing the likelihood of the observed data.

The function fit_reduced_model takes the observed count data as input, fits the model, and returns the estimated parameters $\mu$ and $\alpha$ along with the minimized negative log-likelihood value. A generic example of the minimize pattern is sketched below.
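For reference, here is the initialize/bound/minimize pattern applied to a toy objective; this is only a sketch of how scipy.optimize.minimize is called, not part of the assignment:

import numpy as np
from scipy.optimize import minimize

def toy_objective(params):
    # A bowl-shaped function with its minimum at (2, 3).
    x, y = params
    return (x - 2.0) ** 2 + (y - 3.0) ** 2

initial_guess = [1.0, 1.0]
bounds = [(1e-9, None), (1e-9, None)]  # the same lower-bound style used below
result = minimize(toy_objective, initial_guess, bounds=bounds)
print(result.x)    # optimized parameters, approximately [2. 3.]
print(result.fun)  # objective value at the optimum, approximately 0.0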
from scipy.optimize import minimize

def fit_reduced_model(
    counts: npt.NDArray[np.float64],
) -> tuple[np.float64, np.float64, np.float64]:
    """Fits a reduced negative binomial model (single distribution) to the provided
    count data.

    This function estimates the parameters (`mu` and `alpha`) of a single negative
    binomial distribution that best fit the given counts. It uses numerical
    optimization to minimize the negative log-likelihood calculated by the
    `negbin_nloglik_single` function.

    Args:
        counts: 1D array or list of non-negative integer counts observed.

    Returns:
        Estimated mean parameter of the negative binomial distribution.
        Estimated dispersion parameter (inverse size parameter).
        Negative log-likelihood of the fitted parameters.

    Examples:
        >>> counts = np.array([168, 162, 645, 345, 41, 87, 137, 84])
        >>> mu_hat, alpha_hat, nll = fit_reduced_model(counts)
        >>> print(f"mu_hat: {mu_hat:.4f}, alpha_hat: {alpha_hat:.4f}, NLL: {nll:.4f}")
        mu_hat: 208.6250, alpha_hat: 0.6038, NLL: 50.2213
    """
    # TODO: Compute an initial guess for mu using np.mean(counts) + 1e-9.
    # TODO: Define an initial guess for alpha as 0.1.
    # TODO: Set bounds for optimization as [(1e-9, None), (1e-9, None)].
    # TODO: Use scipy.optimize.minimize to optimize negbin_nloglik_single with the
    #       initial parameters and bounds.
    # TODO: Extract optimized mu_hat and alpha_hat from result.x.
    # TODO: Extract the final negative log-likelihood from result.fun.
    # TODO: Return mu_hat, alpha_hat, and neg_loglike.
Q04
Points: 10
In the alternative model, we relax the assumption of a single shared mean expression level across all samples and instead allow for condition-specific means. This means we model gene expression separately for the ground and space groups, enabling us to test whether the gene is differentially expressed between conditions.
Unlike the null model, which assumes a single mean $\mu$ for all samples, the full model introduces separate means:
- $\mu_1$: The mean expression level for the ground condition.
- $\mu_2$: The mean expression level for the space condition.
- $\alpha$: The dispersion parameter, which remains shared between the two groups to maintain model simplicity.
This means that each observed count $x_i$ follows a Negative Binomial distribution:
$$ X_i \sim \mathrm{NB}(\mu_1, \alpha), \quad \text{if sample } i \text{ belongs to ground} $$
$$ X_i \sim \mathrm{NB}(\mu_2, \alpha), \quad \text{if sample } i \text{ belongs to space} $$
Here, we assume that only the mean expression levels differ between groups while keeping dispersion constant across conditions. The reason for keeping a single $\alpha$ is to avoid overfitting and to ensure stable parameter estimation, especially with limited sample sizes.
The alternative model is more flexible than the null model because it accounts for potential differences in gene expression between the two conditions. This flexibility allows us to formally test whether differential expression exists by comparing the likelihood of the null model (single mean $\mu$) against the alternative model (separate means $\mu_1$ and $\mu_2$).
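The bookkeeping for condition-specific means reduces to boolean masking on the label array. A minimal sketch, reusing the made-up counts from the docstring example below:

counts = np.array([104, 92, 304, 200, 32, 28, 37, 58])
group_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # 0 = ground, 1 = space
ground = counts[group_labels == 0]
space = counts[group_labels == 1]
print(ground.mean(), space.mean())  # 175.0 38.75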
def negbin_nloglik_two_groups(
    params: list[float],
    counts: npt.NDArray[np.uint32],
    group_labels: npt.NDArray[np.uint32],
) -> np.float64:
    """Computes negative log-likelihood for a two-group negative binomial model.

    This function calculates the negative log-likelihood (NLL) for observing two
    separate groups of count data, each modeled with its own mean (`mu1`, `mu2`) but
    sharing a common dispersion parameter (`alpha`).

    Args:
        params: A length-3 array or list containing:
            - mu1: Mean of negative binomial distribution for group 1 (must be > 0).
            - mu2: Mean of negative binomial distribution for group 2 (must be > 0).
            - alpha: Common dispersion parameter (must be > 0).
        counts: 1D numpy array of non-negative integer counts from both groups combined.
        group_labels: 1D numpy array with binary labels (0 or 1), identifying
            group membership corresponding to `counts`.

    Returns:
        Negative log-likelihood value. Returns infinity for invalid parameters.

    Examples:
        >>> counts = np.array([104, 92, 304, 200, 32, 28, 37, 58])
        >>> group_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
        >>> params = [232.0, 36.0, 0.1]
        >>> negbin_nloglik_two_groups(params, counts, group_labels)
        40.73053338608538
    """
    if len(params) != 3:
        raise ValueError(
            "params must contain exactly three elements: [mu1, mu2, alpha]."
        )
    if len(counts) != len(group_labels):
        raise ValueError("counts and group_labels must be arrays of the same length.")
    # TODO: Extract mu1, mu2, and alpha from params.
    # TODO: Validate that mu1, mu2, and alpha are all positive, otherwise return np.inf.
    # TODO: Convert counts and group_labels to NumPy arrays using np.asarray().
    # TODO: Separate counts into group1_counts (where group_labels == 0)
    #       and group2_counts (where group_labels == 1).
    # TODO: Compute negative log-likelihood for each group using negbin_nll_contrib().
    # TODO: Sum the negative log-likelihood contributions from both groups.
    # TODO: Return the final computed negative log-likelihood.
Q05
Points: 5
We aim to fit a two-group negative binomial model to bacterial gene count data to assess differential gene expression between two conditions: ground and space. The objective is to estimate $\mu_1, \mu_2$, and $\alpha$ by minimizing the negative log-likelihood using the observed count data.
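Per-group sample means are natural starting values for $\mu_1$ and $\mu_2$, matching the initialization strategy described in the TODOs below. A minimal, self-contained sketch:

import numpy as np

counts = np.array([104, 92, 304, 200, 32, 28, 37, 58])
group_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
x0 = [
    np.mean(counts[group_labels == 0]) + 1e-9,  # initial guess for mu1 (group 1)
    np.mean(counts[group_labels == 1]) + 1e-9,  # initial guess for mu2 (group 2)
    0.1,                                        # rough initial guess for alpha
]
print(x0[0], x0[1], x0[2])  # ~175.0, ~38.75, 0.1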
def fit_full_model(
    counts: npt.NDArray[np.float64], group_labels: npt.NDArray[np.uint32]
) -> tuple[np.float64, np.float64, np.float64, np.float64]:
    """Fits a two-group negative binomial model to the provided count data.

    This function estimates separate mean parameters (`mu1`, `mu2`) for two groups
    and a shared dispersion parameter (`alpha`) by minimizing the negative
    log-likelihood.

    Args:
        counts: 1D array or list of non-negative integer counts observed from
            both groups.
        group_labels: 1D array or list of binary labels (0 or 1) indicating group
            membership for each count in `counts`.

    Returns:
        Estimated mean for group 1.
        Estimated mean for group 2.
        Estimated shared dispersion parameter.
        Negative log-likelihood of the fitted parameters.

    Examples:
        >>> counts = np.array([104, 92, 304, 200, 32, 28, 37, 58])
        >>> group_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
        >>> mu1_hat, mu2_hat, alpha_hat, nll = fit_full_model(counts, group_labels)
        >>> print(f"mu1: {mu1_hat:.4f}, mu2: {mu2_hat:.4f}, alpha: {alpha_hat:.4f}, NLL: {nll:.4f}")
        mu1: 175.0000, mu2: 38.7500, alpha: 0.1508, NLL: 38.9149
    """
    # TODO: Convert counts and group_labels to NumPy arrays using np.asarray().
    # TODO: Separate counts into group1_counts (where group_labels == 0)
    #       and group2_counts (where group_labels == 1).
    # TODO: Compute initial guesses for mu1 and mu2 using np.mean(group1_counts) and
    #       np.mean(group2_counts), each plus a small number to avoid zeros.
    # TODO: Define an initial guess for alpha as 0.1.
    # TODO: Set parameter bounds as [(1e-9, None), (1e-9, None), (1e-9, None)].
    # TODO: Use scipy.optimize.minimize to optimize negbin_nloglik_two_groups
    #       with initial parameters and bounds.
    # TODO: Extract optimized mu1_hat, mu2_hat, and alpha_hat from result.x.
    # TODO: Extract the final negative log-likelihood from result.fun.
    # TODO: Return mu1_hat, mu2_hat, alpha_hat, and nll.
Q06
Points: 30
If the alternative model provides a significantly better fit to the data than the null model, it suggests that the gene is differentially expressed. The strength of this evidence is quantified using the likelihood-ratio test (LRT) statistic, which is compared to a chi-squared distribution to determine statistical significance.
The likelihood ratio statistic is computed as:
$$ LR = 2 \times (\mathcal{L}_{\text{full}} - \mathcal{L}_{\text{reduced}}) $$
where:
- $\mathcal{L}_{\text{reduced}}$ is the log-likelihood of the reduced (null) model.
- $\mathcal{L}_{\text{full}}$ is the log-likelihood of the full (alternative) model.
The factor of $2$ comes from Wilks' theorem: under the null hypothesis, twice the difference in log-likelihoods between nested models asymptotically follows a chi-squared distribution.
However, we typically compute negative log-likelihoods, which changes the expression to
$$ LR = -2 \times (\mathcal{NL}_{\text{full}} - \mathcal{NL}_{\text{reduced}}) = 2 \times (\mathcal{NL}_{\text{reduced}} - \mathcal{NL}_{\text{full}}), $$
where the factor of $-2$ simply absorbs the sign flip introduced by negating the log-likelihoods.
The likelihood ratio statistic follows a chi-squared distribution with degrees of freedom equal to the difference in the number of parameters between the two models:
$$ LR \sim \chi^2_{\text{df}} $$
where df (degrees of freedom) is typically the number of additional parameters in the full model compared to the reduced model.
The statistical significance of the likelihood ratio is assessed by computing the p-value from the chi-squared distribution:
$$ p = P(\chi^2_{\text{df}} > LR) $$
A smaller p-value (e.g., $p < 0.05$) suggests that the full model provides a significantly better fit than the null model, implying that the gene is differentially expressed between space and ground conditions.
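Computing the p-value is a one-liner with scipy.stats.chi2; as a quick check, the numbers from the docstring example below can be reproduced directly:

from scipy.stats import chi2

lr_stat = 12.0054               # LR statistic from the docstring example below
p_val = chi2.sf(lr_stat, df=1)  # upper-tail probability P(chi2_1 > LR)
print(p_val)                    # ~0.0005305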
from scipy.stats import chi2

def likelihood_ratio_test(
    nll_reduced: float, nll_full: float, df: int = 1
) -> tuple[float, float]:
    """Calculates the likelihood ratio test statistic and corresponding p-value for
    nested models.

    Args:
        nll_reduced: Negative log-likelihood of the reduced (simpler) model.
        nll_full: Negative log-likelihood of the full model (more parameters).
        df: Degrees of freedom, typically the difference in number of parameters
            between the two models.

    Returns:
        Computed Likelihood Ratio test statistic.
        Corresponding p-value from the upper tail of the chi-squared distribution.

    Example:
        >>> nll_reduced = 44.9176
        >>> nll_full = 38.9149
        >>> lr_stat, p_val = likelihood_ratio_test(nll_reduced, nll_full)
        >>> print(f"LR Stat: {lr_stat:.4f}, p-value: {p_val:.4g}")
        LR Stat: 12.0054, p-value: 0.0005305
    """
    # TODO: Compute the LR statistic.
    # TODO: Compute the p-value using chi2.sf(lr_stat, df).
    # TODO: Return lr_stat and p_val.
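Once the functions above are implemented, a complete test for a single gene might look like the following sketch, assuming the column layout shown earlier (four ground replicates followed by four space replicates):

row = df.iloc[19]
counts = row.iloc[1:9].to_numpy(dtype=np.float64)  # 4 ground + 4 space counts
group_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # 0 = ground, 1 = space

_, _, nll_reduced = fit_reduced_model(counts)
_, _, _, nll_full = fit_full_model(counts, group_labels)
lr_stat, p_val = likelihood_ratio_test(nll_reduced, nll_full, df=1)
print(f"LR = {lr_stat:.4f}, p = {p_val:.4g}")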