Differential gene expression
This activity is designed to guide you through the process of differential gene expression analysis. It focuses on examining the variations in gene expression across different experimental conditions to identify genes that are significantly upregulated or downregulated. The analysis integrates data preprocessing, statistical testing, and visualization techniques to enable a comprehensive understanding of the biological differences present in the dataset. Through this notebook, you will learn to interpret gene expression patterns and draw meaningful conclusions that can contribute to a broader understanding of molecular mechanisms.
Obtaining the data
In computational biology, working with large datasets is a fundamental part of data analysis. One of the most common formats for storing structured data is the CSV (Comma-Separated Values) file. A CSV file is essentially a plain text file where each row represents a record, and columns are separated by commas. This format makes it easy to share and manipulate tabular data.
In this course, we will use Pandas, a powerful Python library designed for data manipulation and analysis, to handle our data. Before we can analyze any dataset, we must first obtain and load it into a format that allows efficient processing.
Note: Polars is way better (in my "professional" opinion), but it is not as popular as pandas.
import requests
import io
import pandas as pd
import numpy as np
The dataset we will be working with contains gene expression data and is hosted on GitHub.
To download and load this dataset into memory, we can use the `requests` library to make an HTTP request and retrieve the file. Below is a Python function that performs this task.
Note that this is only really done because of how I am hosting the data.
def get_gene_expr_data() -> pd.DataFrame:
    csv_path = "https://github.com/oasci/pitt-biosc1540-2025s/raw/refs/heads/main/content/data/gene-expr/SSvLIS-day3/ground-day3-gene-counts-SS-and-LIS.csv"
    response = requests.get(csv_path)
    if response.status_code == 200:
        csv_text = response.text
        data = pd.read_csv(io.StringIO(csv_text))
        return data
    # Raise instead of silently returning None, which would violate the
    # declared return type and fail later with a confusing error
    raise RuntimeError(f"Failed to fetch file. Status code: {response.status_code}")
This function does the following:
- Uses the `requests` library to send an HTTP GET request to the GitHub-hosted CSV file.
- Checks that the request was successful by verifying the status code, raising an error otherwise.
- Reads the CSV file's content and converts it into a Pandas DataFrame using `pd.read_csv()`.
- Returns the DataFrame for further analysis.
A DataFrame in Pandas is a two-dimensional, table-like data structure similar to an Excel spreadsheet. It consists of:
- Rows (each row represents an observation or record).
- Columns (each column represents a variable or feature).
- Indexing (a way to reference rows and columns efficiently).
You can think of a DataFrame as an enhanced Excel sheet that allows programmatic manipulation, filtering, and analysis of data. Unlike an Excel sheet, however, Pandas provides powerful tools to handle missing data, apply complex transformations, and perform statistical computations efficiently.
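As a minimal illustration (using made-up values, not our gene expression data), here is how a small DataFrame exposes rows, columns, and an index:
# A tiny, made-up DataFrame illustrating rows, columns, and the index
toy = pd.DataFrame(
    {
        "Gene": ["geneA", "geneB", "geneC"],
        "sample1": [10, 250, 3],
        "sample2": [12, 198, 0],
    }
)
print(toy)          # the full table
print(toy.columns)  # column labels
print(toy.index)    # row labels (0, 1, 2 by default)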
Once we retrieve the CSV file, we use `pd.read_csv()` to load the text data into a DataFrame:
data = pd.read_csv(io.StringIO(csv_text))
Here, `io.StringIO(csv_text)` treats the downloaded CSV content as if it were a file, allowing `pd.read_csv()` to parse it directly.
df = get_gene_expr_data()
Exploring the data
Now that we have successfully loaded our dataset into a Pandas DataFrame, it is essential to understand how to explore, manipulate, and extract meaningful information from it. In this section, we will cover fundamental operations such as viewing data, indexing, slicing, filtering, and selecting subsets of a DataFrame.
Viewing the First and Last Few Rows
We can use the `.head()` and `.tail()` methods to preview the data:
# Display the first five rows
print(df.head())
# Display the last five rows
print(df.tail())
By default, `.head()` and `.tail()` return the first and last five rows, respectively. You can specify a different number of rows as an argument, e.g., `df.head(10)` for the first ten rows.
Checking the Structure of the Data
To understand the columns, data types, and non-null values, we use:
print(df.info())
This provides details such as:
- The number of rows and columns.
- Column names and their data types.
- The number of non-null values in each column.
Summarizing the Data
To obtain summary statistics of numeric columns, we use:
print(df.describe())
The `.describe()` method provides useful insights such as the mean, standard deviation, minimum, and maximum values of numerical features.
Selecting Columns
You can select a column using bracket notation (`[]`) or dot notation (`.`):
genes = df["Gene"]
print(genes)
To select multiple columns, pass a list:
ss_data = df[["SS316.1", "SS316.2", "SS316.3", "SS316.4"]]
print(ss_data)
Selecting Rows
To access specific rows, Pandas provides two primary methods:
- `.loc[]` (label-based selection)
- `.iloc[]` (integer index-based selection)
# Select a row by index label
row_5 = df.loc[5]
print(row_5)
# Select multiple rows; note that .loc slicing includes BOTH endpoints,
# so this returns rows 5 through 10 (six rows), unlike standard Python slicing
rows_5_to_10 = df.loc[5:10]
print(rows_5_to_10)
Slicing the DataFrame
You can slice both rows and columns using `.iloc[]`:
# Select rows 5 to 10 and columns 1 to 3
subset = df.iloc[5:11, 1:4]
print(subset)
This follows Python's standard slicing rules (`start:stop`, where `stop` is exclusive).
Introduction to NumPy and Its Relationship with Pandas
As we continue our exploration of computational tools for biology, it is important to understand NumPy, a fundamental package for numerical computing in Python. Pandas is built on top of NumPy, meaning that under the hood, Pandas leverages NumPy's efficient array operations.
NumPy (Numerical Python) is a powerful library that provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays efficiently. NumPy is optimized for performance, making it much faster than native Python lists when performing numerical computations.
Key Features of NumPy
- Supports multi-dimensional arrays (ndarrays).
- Provides mathematical functions for linear algebra, statistics, and other numerical tasks.
- Optimized for speed and memory efficiency.
- Used by many scientific computing libraries, including Pandas, SciPy, and matplotlib.
Pandas DataFrames internally store their data using NumPy arrays. We can extract the underlying NumPy representation of a DataFrame (or a column) using the `.to_numpy()` method.
For example, if we have a Pandas DataFrame called `df`, we can retrieve its NumPy array as follows:
numpy_array = df.to_numpy()
print(numpy_array)
This will return a NumPy ndarray, where each row corresponds to a record, and each column corresponds to a feature from the DataFrame.
Extracting a NumPy Array from a DataFrame
Let’s now apply this knowledge to our gene expression dataset.
Suppose we want to extract all data except the first column (which is labeled `"Gene"`).
We can achieve this by selecting all columns except the first one and then converting the result into a NumPy array.
# Drop the first column ("Gene") and convert the rest of the DataFrame to a NumPy array
data_array = df.iloc[:, 1:].to_numpy()
# Display the resulting NumPy array
print(data_array)
print(type(data_array)) # Confirm it's a NumPy array
- We use `.iloc[:, 1:]` to select all rows (`:`) and all columns starting from index 1 (`1:`), effectively excluding the first column.
- We then apply `.to_numpy()` to convert the DataFrame selection into a NumPy array.
- Finally, we print the array and its type to confirm that it is a NumPy ndarray.
While Pandas provides a user-friendly interface for handling tabular data, NumPy is faster and more memory-efficient for numerical operations. NumPy arrays are typically preferred when performing:
- Mathematical transformations on entire datasets.
- Statistical analyses that require fast computations.
- Machine learning preprocessing, where numerical arrays are required for training models.
Indexing and Slicing in NumPy
Like Python lists and Pandas DataFrames, NumPy arrays support indexing and slicing.
Basic Indexing
NumPy arrays are zero-indexed, meaning the first element has an index of `0`.
# Access the first row
first_row = data_array[0]
print(first_row)
# Access the first element in the first row
first_element = data_array[0, 0] # Equivalent to data_array[0][0]
print(first_element)
# Access the first column (all rows)
first_column = data_array[:, 0]
print(first_column)
Slicing
Slicing allows us to select subsets of the array.
# Select first three rows
subset_rows = data_array[:3]
print(subset_rows)
# Select first three columns
subset_columns = data_array[:, :3]
print(subset_columns)
# Select rows 2 to 4 and columns 1 to 3
subset = data_array[2:5, 1:4]
print(subset)
Here, slicing follows the standard Python format: `start:stop`, where `stop` is exclusive.
Common NumPy Operations
NumPy provides built-in functions to perform mathematical operations efficiently across entire arrays.
Basic Mathematical Operations
NumPy supports element-wise arithmetic operations without needing explicit loops:
# Add 10 to every element
data_plus_10 = data_array + 10
print(data_plus_10[:5])
# Multiply all elements by 2
data_times_2 = data_array * 2
print(data_times_2[:5])
# Compute the square root of each element
data_sqrt = np.sqrt(data_array)
print(data_sqrt[:5])
# Compute the natural logarithm of each element
data_log = np.log(data_array + 1) # Adding 1 to avoid log(0)
print(data_log[:5])
Statistical Functions
NumPy makes it easy to compute descriptive statistics:
# Mean of entire dataset
mean_value = np.mean(data_array)
print(mean_value)
# Standard deviation
std_dev = np.std(data_array)
print(std_dev)
# Minimum and maximum values
min_value = np.min(data_array)
print(min_value)
max_value = np.max(data_array)
print(max_value)
# Column-wise mean (mean for each sample across all genes)
column_means = np.mean(data_array, axis=0)
print(column_means)
# Row-wise mean (mean for each gene across all samples)
row_means = np.mean(data_array, axis=1)
print(row_means)
Boolean Indexing and Filtering
Boolean indexing and filtering are powerful techniques that enable us to select and manipulate data elements based on specific conditions. Instead of iterating through each element with a loop, boolean indexing allows us to apply a condition directly to an array, producing a new array composed solely of elements that meet that condition. When a condition is applied to an array, it generates an array of Boolean values, where each value corresponds to an element in the original array. A value of True indicates that the condition is met for that element, while False means it is not. This Boolean array can then be used to index the original array, effectively filtering the data.
# Find all values greater than 300
mask = data_array > 300
print(mask)
# Extract the values selected by the mask
high_values = data_array[mask]
print(high_values)
# Get rows where at least one value is greater than 500
rows_with_high_values = data_array[np.any(data_array > 500, axis=1)]
print(rows_with_high_values)
Plotting
Matplotlib is a widely used Python library for creating static, animated, and interactive visualizations. It provides a high degree of customization, allowing users to tailor their plots to best represent their data. In the context of differential gene expression analysis, visualization plays a crucial role in identifying patterns, trends, and significant changes across different conditions.
Matplotlib operates primarily through its `pyplot` module, which offers a collection of functions similar to those in MATLAB, making it easy to generate plots, customize axes, adjust labels, and modify visual properties. A typical workflow involves creating a figure and axes, plotting the desired data, and refining the visualization with titles, labels, legends, and other stylistic adjustments.
import matplotlib.pyplot as plt
To compare the expression levels of a specific gene across two conditions, we need to extract the relevant expression data from our dataset. In this case, we are analyzing gene expression from two groups: SS316 and LIS, each with four replicates. By selecting a specific gene using its index in the dataset, we can retrieve the expression values corresponding to these two conditions.
In the following code, we define the column names associated with each condition and extract the expression counts for a single gene at a given index. We then combine these counts into a single array for easier manipulation and create a corresponding array of group labels to indicate which samples belong to each condition. This will be useful for further statistical analysis and visualization.
ss316_cols = ["SS316.1", "SS316.2", "SS316.3", "SS316.4"]
lis_cols = ["LIS.1", "LIS.2", "LIS.3", "LIS.4"]
index = 755
ss_counts = df.iloc[index][ss316_cols].values
lis_counts = df.iloc[index][lis_cols].values
print(ss_counts)
print(lis_counts)
counts_all = np.concatenate([ss_counts, lis_counts])
print(counts_all)
group_labels = np.array([0] * 4 + [1] * 4)
print(group_labels)
Now that we have extracted the expression values for this gene, `counts_all` contains the combined expression data from both conditions, while `group_labels` assigns numerical labels (0 for SS316 and 1 for LIS) to indicate the condition each sample belongs to. These arrays can now be used for downstream analysis, such as statistical testing to determine whether the gene exhibits significant differential expression between the two conditions. Additionally, we can use these values to generate visualizations that highlight the differences in expression levels.
To better understand the differences in expression levels between the two conditions, we can create a scatter plot to visualize the raw count data. This plot will display the observed expression values for a single gene across the SS316 and LIS groups.
By plotting individual data points, we can assess variability within each condition and identify potential differences in expression. This is particularly useful for spotting trends, outliers, or cases where expression levels may overlap between conditions.
In the following code, we generate a scatter plot where:
- Each point represents the observed expression count for a given replicate.
- The SS316 samples are plotted at x = 0, while LIS samples are plotted at x = 1.
- Different colors distinguish the two conditions, enhancing visual clarity.
This visualization will help us determine whether the expression levels differ significantly between the two groups before proceeding with further statistical analysis.
fig, ax = plt.subplots(figsize=(6, 4))
# X positions for SS316: [0, 0, 0, 0], LIS: [1,1,1,1]
x_ss = np.zeros_like(ss_counts)
x_lis = np.ones_like(lis_counts)
# Scatter the raw counts
ax.scatter(x_ss, ss_counts, color="#003594", label="SS316 Observed", alpha=0.7)
ax.scatter(x_lis, lis_counts, color="#f46036", label="LIS Observed", alpha=0.7)
# Clean up the plot
ax.set_xlim(-0.5, 1.5)
ax.set_xticks([0, 1])
ax.set_xticklabels(["SS316", "LIS"])
ax.set_ylabel("Counts")
ax.legend(loc="best")
plt.show()
Negative Binomial Distribution
In the context of RNA sequencing data, gene expression counts often exhibit overdispersion, meaning their variance exceeds the mean. The Negative Binomial (NB) distribution is a widely used probabilistic model for such count data, as it accounts for this extra variability beyond what the Poisson distribution can capture.
Probability Mass Function
The Negative Binomial (NB) distribution is commonly used to model count data, particularly in cases where the variance exceeds the mean (overdispersion). The probability mass function (PMF) gives the probability of observing a particular count $x$, given the mean expression level $\mu$ and the dispersion parameter $\alpha$. The PMF is expressed as:
$$ \mathrm{NB}(x; \mu, \alpha) = \binom{x + \tfrac{1}{\alpha}-1}{x} \left(\frac{1/\alpha}{1/\alpha + \mu}\right)^{\frac{1}{\alpha}} \left(\frac{\mu}{1/\alpha + \mu}\right)^{x}. $$
where:
- $x$: The observed count (e.g., number of RNA-seq reads mapped to a gene).
- $\mu$: The expected value (mean) of the distribution. This represents the average expression level of a gene.
- $\alpha$: The dispersion parameter, controlling how much the variance deviates from the mean.
- $r = \frac{1}{\alpha}$: A reparameterization often used in the traditional Negative Binomial definition, representing the number of “successes” before the process stops in a sequential-trials interpretation.
- $\binom{x + r -1}{x}$: The generalized binomial coefficient, which counts the number of ways to distribute $x$ identical objects into $r$ groups when order does not matter. It is defined in terms of the gamma function as:
$$ \binom{x + r -1}{x} = \frac{\Gamma(x + r)}{\Gamma(x+1) \Gamma(r)} $$
where $\Gamma(n)$ is the gamma function, which generalizes the factorial such that $\Gamma(n) = (n-1)!$ for positive integers.
- The two fractions inside the PMF correspond to the two outcomes in the sequential-trials interpretation, where each trial succeeds with probability $p = \frac{1/\alpha}{1/\alpha + \mu}$:
  - $\left(\frac{1/\alpha}{1/\alpha + \mu}\right)^{1/\alpha}$: the probability contribution of the $r = 1/\alpha$ “successes.”
  - $\left(\frac{\mu}{1/\alpha + \mu}\right)^{x}$: the probability contribution of the $x$ observed counts (the “failures” in this interpretation).
A key property of the Negative Binomial distribution is its variance:
$$ \mathrm{Var}(X) = \mu + \alpha \mu^2. $$
This expression shows how dispersion ($\alpha$) affects the spread of the data:
- When $\alpha \to 0$, the variance approaches $\mu$, reducing the distribution to a Poisson model where variance equals the mean.
- When $\alpha > 0$, the variance grows larger than the mean, allowing the model to capture overdispersion often seen in RNA-seq data.
By using the Negative Binomial model, we accommodate extra variability beyond what a Poisson model can handle, making it a more realistic representation of gene expression data in differential analysis.
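To make the variance formula concrete, here is a quick simulation check (illustrative parameters $\mu = 100$ and $\alpha = 0.3$, chosen arbitrarily; the conversion to NumPy's $(r, p)$ parameterization is derived later in this notebook):
# Empirically verify Var(X) = mu + alpha * mu^2 for illustrative parameters
mu_demo, alpha_demo = 100.0, 0.3  # arbitrary demonstration values
r_demo = 1.0 / alpha_demo
p_demo = r_demo / (r_demo + mu_demo)
demo = np.random.negative_binomial(r_demo, p_demo, size=100_000)
print("empirical mean:", demo.mean())      # close to 100
print("empirical variance:", demo.var())   # close to 100 + 0.3 * 100**2 = 3100
print("theoretical variance:", mu_demo + alpha_demo * mu_demo**2)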
Log-Likelihood
In order to estimate the parameters $\mu$ (mean) and $\alpha$ (dispersion), we employ maximum likelihood estimation (MLE). The likelihood function represents the probability of observing the given data given a set of model parameters. For the Negative Binomial (NB) distribution, the likelihood function is the product of the individual probabilities for all observed counts $x_i$:
$$ L(\mu, \alpha) = \prod_{i} \mathrm{NB}(x_i; \mu, \alpha). $$
Since likelihood values are typically very small when dealing with multiple observations, we work with the log-likelihood function instead. The log-likelihood is computed by taking the natural logarithm of the likelihood function, which conveniently transforms the product into a summation:
$$ \log L(\mu, \alpha) = \sum_{i} \log \mathrm{NB}(x_i; \mu, \alpha). $$
Using the previously defined probability mass function (PMF) for the Negative Binomial distribution:
$$ \mathrm{NB}(x; \mu, \alpha) = \binom{x + \tfrac{1}{\alpha}-1}{x} \left(\frac{1/\alpha}{1/\alpha + \mu}\right)^{\frac{1}{\alpha}} \left(\frac{\mu}{1/\alpha + \mu}\right)^{x}, $$
we take the logarithm of each term, breaking the PMF into three components:
$$ \log \mathrm{NB}(x; \mu, \alpha) = \log \binom{x + \tfrac{1}{\alpha}-1}{x} + \frac{1}{\alpha} \log \left(\frac{1/\alpha}{1/\alpha + \mu}\right) + x \log \left(\frac{\mu}{1/\alpha + \mu}\right). $$
Now, using the gamma function to express the binomial coefficient:
$$ \binom{x + \tfrac{1}{\alpha}-1}{x} = \frac{\Gamma(x + \tfrac{1}{\alpha})}{\Gamma(x+1) \Gamma(\tfrac{1}{\alpha})}, $$
its logarithm is:
$$ \log \binom{x + \tfrac{1}{\alpha}-1}{x} = \log \Gamma(x + \tfrac{1}{\alpha}) - \log \Gamma(x+1) - \log \Gamma(\tfrac{1}{\alpha}). $$
Thus, the full log-likelihood function becomes:
$$ \log L(\mu, \alpha) = \sum_{i} \left[ \log \Gamma(x_i + \tfrac{1}{\alpha}) - \log \Gamma(x_i+1) - \log \Gamma(\tfrac{1}{\alpha}) + \frac{1}{\alpha} \log \left(\frac{1/\alpha}{1/\alpha + \mu}\right) + x_i \log \left(\frac{\mu}{1/\alpha + \mu}\right) \right]. $$
Taking the negative
For optimization purposes, we typically minimize rather than maximize functions. To achieve this, we define the negative log-likelihood (NLL) by multiplying the log-likelihood function by -1:
$$ \mathcal{L}(\mu, \alpha) = -\sum_{i} \log \mathrm{NB}(x_i; \mu, \alpha). $$
Substituting the expanded log-likelihood expression, we obtain:
$$ \mathcal{L}(\mu, \alpha) = - \sum_{i} \left[ \log \Gamma(x_i + \tfrac{1}{\alpha}) - \log \Gamma(x_i+1) - \log \Gamma(\tfrac{1}{\alpha}) + \frac{1}{\alpha} \log \left(\frac{1/\alpha}{1/\alpha + \mu}\right) + x_i \log \left(\frac{\mu}{1/\alpha + \mu}\right) \right]. $$
This negative log-likelihood function serves as the objective function in parameter estimation. By minimizing $\mathcal{L}(\mu, \alpha)$, we find the optimal values of $\mu$ and $\alpha$ that maximize the likelihood of observing the given gene expression data.
Since the function involves logarithms and the gamma function, numerical optimization techniques such as gradient descent or Newton's method are typically used for parameter estimation. The gamma function terms are particularly important when implementing this in code, as they allow us to compute likelihoods for non-integer values of the dispersion parameter.
This formulation ensures that the Negative Binomial model properly accounts for the overdispersion commonly observed in RNA-seq data, improving the reliability of differential gene expression analysis.
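As a numerical sanity check (a sketch, not part of the original pipeline), we can evaluate this log-PMF with `scipy.special.gammaln` and compare it against SciPy's built-in `scipy.stats.nbinom.logpmf`, using the conversion $r = 1/\alpha$ and $p = r/(r+\mu)$:
from scipy.special import gammaln
from scipy.stats import nbinom

def nb_logpmf(x, mu, alpha):
    # Log of the NB PMF, written term-by-term as in the derivation above
    r = 1.0 / alpha
    return (
        gammaln(x + r) - gammaln(x + 1) - gammaln(r)
        + r * np.log(r / (r + mu))
        + x * np.log(mu / (r + mu))
    )

x_demo, mu_demo, alpha_demo = 7, 10.0, 0.5  # arbitrary example values
r_demo = 1.0 / alpha_demo
p_demo = r_demo / (r_demo + mu_demo)
print(nb_logpmf(x_demo, mu_demo, alpha_demo))  # our expression
print(nbinom.logpmf(x_demo, r_demo, p_demo))   # SciPy's built-in; should match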
Null Model
In statistical modeling, a null model serves as a baseline hypothesis against which we compare more complex models. In the context of differential gene expression analysis, the null model represents the assumption that there is no difference in expression between the two experimental conditions. By comparing the null model to a more flexible model that allows for condition-specific differences, we can assess whether gene expression is significantly affected by the experimental conditions.
Single-Group (Reduced) Model
In the null model, we assume that all 8 samples (4 from SS316 and 4 from LIS) share the same mean expression level, denoted as $\mu$. That is, we ignore any potential effect of condition and fit a single-group model, treating the samples as if they all come from the same underlying population. Mathematically, this means that the expression count for a given gene, across all samples, follows a Negative Binomial distribution with a single shared mean:
$$ X_i \sim \mathrm{NB}(\mu, \alpha), $$
where:
- $X_i$ represents the observed count for sample $i$.
- $\mu$ is the shared mean expression level across all samples.
- $\alpha$ is the dispersion parameter accounting for variability in gene expression.
This model acts as our reduced model, meaning it does not differentiate between the two experimental conditions. By ignoring the grouping of samples into SS316 and LIS, we effectively test the assumption that condition has no effect on expression levels for this gene.
To estimate the parameters $\mu$ (mean) and $\alpha$ (dispersion) of the Negative Binomial distribution, we need to compute the negative log-likelihood (NLL). Since likelihood-based estimation requires numerical optimization, we define a function that computes the NLL for a given set of parameters and observed count data.
The function below, `negbin_nloglik_single`, implements the negative log-likelihood for a single Negative Binomial distribution, assuming all samples share the same mean $\mu$. This corresponds to the null model, where no distinction is made between experimental groups.
The key components of the function are:
- Input parameters: `mu` (mean expression level) and `alpha` (dispersion parameter).
- The log-gamma function (`lgamma`) to handle factorial terms, ensuring numerical stability for large values.
- Logarithmic terms to avoid computational underflow when dealing with very small probabilities.
- A penalty for invalid parameter values: if `mu` or `alpha` are non-positive, the function returns infinity (`np.inf`), preventing optimization algorithms from exploring invalid regions.
The function iterates over all observed counts and accumulates the negative log-likelihood by summing the log-probabilities computed from the Negative Binomial probability mass function.
from math import lgamma, log

def negbin_nloglik_single(params, counts):
    """
    Negative log-likelihood for a single negative binomial distribution.
    params = [mu, alpha]
    counts = array-like of counts from *all samples*, ignoring group.
    """
    mu, alpha = params
    # If parameters go non-positive, penalize with infinity (invalid region)
    if mu <= 0 or alpha <= 0:
        return np.inf
    r = 1.0 / alpha
    nll = 0.0
    for x in counts:
        term = (
            lgamma(x + r)
            - lgamma(r)
            - lgamma(x + 1)
            + r * log(r / (r + mu))
            + x * log(mu / (r + mu))
        )
        nll -= term
    return nll
This function provides the foundation for fitting the Negative Binomial model to gene expression data. Given a set of observed counts, `negbin_nloglik_single` evaluates how well a proposed pair of parameters ($\mu$, $\alpha$) explains the data.
To estimate the optimal parameters, we will later use numerical optimization techniques, such as gradient-based methods or likelihood maximization algorithms, to find the values of $\mu$ and $\alpha$ that minimize the negative log-likelihood.
This function is particularly useful for testing the null model, where all samples are assumed to have the same mean expression level. In the next steps, we will extend this approach to compare models that account for condition-specific differences in gene expression.
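For instance, we can evaluate the NLL at two candidate parameter pairs (a quick illustration using the `counts_all` array defined earlier) and confirm that a mean near the data yields a lower, i.e. better, NLL than an inflated one:
# Lower NLL = better fit; compare two candidate parameter pairs
print(negbin_nloglik_single([np.mean(counts_all), 0.1], counts_all))
print(negbin_nloglik_single([np.mean(counts_all) * 10, 0.1], counts_all))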
Fitting/optimization
To estimate the parameters $\mu$ (mean expression level) and $\alpha$ (dispersion) of the Negative Binomial distribution, we need to find the values that minimize the negative log-likelihood (NLL). Since there is no closed-form solution for these parameters, we use numerical optimization techniques to perform maximum likelihood estimation (MLE).
In this section, we define the function `fit_reduced_model`, which fits the reduced model, meaning that we assume all samples (from both conditions) share the same mean expression level $\mu$. This corresponds to the null hypothesis, where there is no differential expression between groups.
To find the optimal values of $\mu$ and $\alpha$, we use the L-BFGS-B optimization algorithm, implemented in SciPy's `minimize` function. The steps of the optimization process are as follows:
- Initialize parameters:
  - $\mu$ is initialized as the mean of the observed counts, ensuring a reasonable starting point.
  - $\alpha$ is initialized to 0.1 as a rough starting estimate of dispersion.
- Set parameter bounds:
  - Both $\mu$ and $\alpha$ must be strictly positive to ensure valid probability calculations. We impose lower bounds of $10^{-9}$ to prevent numerical errors.
- Minimize the negative log-likelihood (NLL):
  - The optimizer iteratively updates $\mu$ and $\alpha$ to find the values that minimize the NLL, thereby maximizing the likelihood of the observed data.
The function `fit_reduced_model` takes the observed count data as input, fits the model, and returns the estimated parameters $\mu$ and $\alpha$ along with the minimized negative log-likelihood value.
from scipy.optimize import minimize

def fit_reduced_model(counts):
    """
    Fit a single NB distribution to all counts (reduced model).
    Returns (mu_hat, alpha_hat, neg_loglike).
    """
    mu_init = np.mean(counts) + 1e-9
    alpha_init = 0.1  # an arbitrary guess
    init_params = [mu_init, alpha_init]
    bnds = [(1e-9, None), (1e-9, None)]
    result = minimize(
        negbin_nloglik_single,
        init_params,
        args=(counts,),
        method="L-BFGS-B",
        bounds=bnds,
    )
    mu_hat, alpha_hat = result.x
    nll = result.fun
    return mu_hat, alpha_hat, nll
mu_r, alpha_r, nll_r = fit_reduced_model(counts_all)
After running this function:
- `mu_r` is the estimated mean expression level across all samples.
- `alpha_r` is the estimated dispersion parameter, which accounts for variability in expression beyond Poisson assumptions.
- `nll_r` is the final negative log-likelihood value, representing the goodness-of-fit of the model.
These parameter estimates provide the best-fitting single-group model, which serves as our baseline for later hypothesis testing. In the next steps, we will compare this reduced model to a more complex alternative model that allows different mean expression levels for the two conditions (SS316 and LIS) to determine if differential expression exists.
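It is worth inspecting the fitted values before moving on (a simple check; this output is not part of the original activity):
# Inspect the reduced-model fit
print(f"mu_hat = {mu_r:.2f}, alpha_hat = {alpha_r:.4f}, NLL = {nll_r:.2f}")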
Simulating data
To better understand the properties of the Negative Binomial distribution and validate our estimated parameters, we can generate synthetic count data using the fitted parameters $\mu$ (mean expression) and $\alpha$ (dispersion). This allows us to visualize the probability mass function (PMF) and compare the theoretical distribution to the observed RNA-seq data.
The function `sample_nb` generates random samples from the Negative Binomial distribution using NumPy's `np.random.negative_binomial`. However, NumPy's parameterization differs from the one we have been using, so we first need to convert the parameters appropriately.
NumPy’s Negative Binomial distribution is parameterized in terms of:
- $r$: The number of successes before the process stops.
- $p$: The probability of success in each trial.
In contrast, our parameterization is based on:
- $\mu$: The mean expression level.
- $\alpha$: The dispersion parameter.
From the relationship between these representations, we derive:
$$ r = \frac{1}{\alpha}, \quad p = \frac{r}{r + \mu} = \frac{1/\alpha}{1/\alpha + \mu}. $$
Using these values, we generate Negative Binomial-distributed samples where the mean and dispersion match our estimated parameters.
# Negative binomial parameter conversion:
# If X ~ NB(r, p) in NumPy's parameterization, E[X] = r * (1 - p) / p.
# With mean mu and dispersion alpha, we set r = 1/alpha and p = r / (r + mu),
# so the sampled mean matches mu.
def sample_nb(mu, alpha, size=1000):
    r = 1.0 / alpha
    p = r / (r + mu)
    # np.random.negative_binomial(n, p) draws the number of failures before
    # n successes, where each trial succeeds with probability p
    samples = np.random.negative_binomial(r, p, size=size)
    return samples
samples_r = sample_nb(mu_r, alpha_r, size=5000)
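As a quick check (illustrative, not part of the original output), the simulated draws should approximately reproduce the fitted mean and the NB variance $\mu + \alpha \mu^2$:
# Sample statistics should sit near the theoretical NB values
print("sample mean:", samples_r.mean(), "vs mu_hat:", mu_r)
print("sample variance:", samples_r.var(), "vs mu + alpha*mu^2:", mu_r + alpha_r * mu_r**2)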
Visualizing fit
To assess how well our fitted Negative Binomial model captures the observed gene expression data, we visualize both raw counts from the dataset and simulated counts generated using the estimated parameters $\mu$ and $\alpha$. This plot provides an intuitive way to compare real and modeled data distributions, helping us evaluate the model's fit.
What This Plot Shows
- The blue points represent observed expression levels for SS316 samples.
- The orange points represent observed expression levels for LIS samples.
- These points are plotted at x-positions of 0 (SS316) and 1 (LIS) to clearly separate the two conditions.
- The violin plot overlays the scatter plot and represents the distribution of 5,000 samples drawn from the fitted Negative Binomial distribution.
- This helps visualize the shape and spread of the simulated counts compared to real data.
- If the violin plot closely follows the scatter points, our model captures the distribution of gene expression well.
Purpose of This Visualization
- Model Validation: By overlaying the observed data with the simulated distribution, we check whether the Negative Binomial model reasonably explains the variability in gene expression.
- Checking Distributional Assumptions: If the simulated data significantly differs from the observed counts, this suggests that the estimated parameters do not fully capture the dispersion characteristics of the data.
- Comparing Conditions: While this visualization does not explicitly test for differential expression, it provides an intuitive way to compare gene expression levels between SS316 and LIS.
The following code generates this visualization, combining both observed and simulated data into a single plot.
fig, ax = plt.subplots(figsize=(6, 4))
# X positions for SS316: [0, 0, 0, 0], LIS: [1,1,1,1]
x_ss = np.zeros_like(ss_counts)
x_lis = np.ones_like(lis_counts)
# Scatter the raw counts
ax.scatter(
    x_ss, ss_counts, color="#003594", label="SS316 Observed", alpha=0.9, zorder=10
)
ax.scatter(
    x_lis, lis_counts, color="#f46036", label="LIS Observed", alpha=0.9, zorder=10
)
# Under the null model both groups share one fitted distribution, so the
# same simulated samples are shown at both x-positions
data_to_violin = [samples_r, samples_r]
parts = ax.violinplot(
    dataset=data_to_violin,
    positions=[0, 1],
    showmeans=False,
    showextrema=False,
    showmedians=False,
)
parts["bodies"][0].set_facecolor("#784A66")
parts["bodies"][0].set_edgecolor("#784A66")
parts["bodies"][1].set_facecolor("#784A66")
parts["bodies"][1].set_edgecolor("#784A66")
# Clean up the plot
ax.set_xlim(-0.5, 1.5)
ax.set_ylim(0)
ax.set_xticks([0, 1])
ax.set_xticklabels(["SS316", "LIS"])
ax.set_ylabel("Counts")
plt.show()
Alternative model
In the alternative model, we relax the assumption of a single shared mean expression level across all samples and instead allow for condition-specific means. This means we model gene expression separately for the SS316 and LIS groups, enabling us to test whether the gene is differentially expressed between conditions.
Unlike the null model, which assumes a single mean $\mu$ for all samples, the full model introduces separate means:
- $\mu_1$: The mean expression level for the SS316 condition.
- $\mu_2$: The mean expression level for the LIS condition.
- $\alpha$: The dispersion parameter, which remains shared between the two groups to maintain model simplicity.
This means that each observed count $x_i$ follows a Negative Binomial distribution:
$$ X_i \sim \mathrm{NB}(\mu_1, \alpha), \quad \text{if sample } i \text{ belongs to SS316} $$
$$ X_i \sim \mathrm{NB}(\mu_2, \alpha), \quad \text{if sample } i \text{ belongs to LIS} $$
Here, we assume that only the mean expression levels differ between groups while keeping dispersion constant across conditions. The reason for keeping a single $\alpha$ is to avoid overfitting and to ensure stable parameter estimation, especially with limited sample sizes.
The alternative model is more flexible than the null model because it accounts for potential differences in gene expression between the two conditions. This flexibility allows us to formally test whether differential expression exists by comparing the likelihood of the null model (single mean $\mu$) against the alternative model (separate means $\mu_1$ and $\mu_2$).
If the alternative model provides a significantly better fit to the data than the null model, it suggests that the gene is differentially expressed between SS316 and LIS. The strength of this evidence is typically assessed using a likelihood ratio test (LRT), which quantifies how much better the alternative model explains the data compared to the null model.
By estimating the parameters $\mu_1$, $\mu_2$, and $\alpha$, we will be able to determine whether gene expression varies systematically between conditions, which is a key objective in differential expression analysis.
def negbin_nloglik_two_groups(params, counts, group_labels):
    """
    Negative log-likelihood for a two-group NB model.
    params = [mu1, mu2, alpha]
    counts: all counts from both groups combined
    group_labels: 0 for group1 (SS316), 1 for group2 (LIS)
    """
    mu1, mu2, alpha = params
    if mu1 <= 0 or mu2 <= 0 or alpha <= 0:
        return np.inf
    r = 1.0 / alpha
    nll = 0.0
    # Separate counts by group
    group1_counts = counts[group_labels == 0]
    group2_counts = counts[group_labels == 1]
    # Group 1 (SS316) contributions, using mean mu1
    for x in group1_counts:
        term = (
            lgamma(x + r)
            - lgamma(r)
            - lgamma(x + 1)
            + r * log(r / (r + mu1))
            + x * log(mu1 / (r + mu1))
        )
        nll -= term
    # Group 2 (LIS) contributions, using mean mu2
    for x in group2_counts:
        term = (
            lgamma(x + r)
            - lgamma(r)
            - lgamma(x + 1)
            + r * log(r / (r + mu2))
            + x * log(mu2 / (r + mu2))
        )
        nll -= term
    return nll
def fit_full_model(counts_all, group_labels):
    """
    Fit two-group NB (mu1, mu2, alpha).
    Returns (mu1_hat, mu2_hat, alpha_hat, neg_loglike).
    """
    group1_counts = counts_all[group_labels == 0]
    group2_counts = counts_all[group_labels == 1]
    mu1_init = np.mean(group1_counts) + 1e-9
    mu2_init = np.mean(group2_counts) + 1e-9
    alpha_init = 0.1
    init_params = [mu1_init, mu2_init, alpha_init]
    bnds = [(1e-9, None), (1e-9, None), (1e-9, None)]
    result = minimize(
        negbin_nloglik_two_groups,
        init_params,
        args=(counts_all, group_labels),
        method="L-BFGS-B",
        bounds=bnds,
    )
    mu1_hat, mu2_hat, alpha_hat = result.x
    nll = result.fun
    return mu1_hat, mu2_hat, alpha_hat, nll
mu_ss, mu_lis, alpha_f, nll_f = fit_full_model(counts_all, group_labels)
samples_f_ss = sample_nb(mu_ss, alpha_f, size=5000)
samples_f_lis = sample_nb(mu_lis, alpha_f, size=5000)
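Before plotting, it helps to compare the two fits side by side (an illustrative check; the full model's NLL should never exceed the reduced model's, since it has strictly more freedom):
# The full model can only fit as well as or better than the reduced model
print(f"reduced: mu = {mu_r:.2f}, NLL = {nll_r:.2f}")
print(f"full:    mu_ss = {mu_ss:.2f}, mu_lis = {mu_lis:.2f}, NLL = {nll_f:.2f}")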
fig, ax = plt.subplots(figsize=(6, 4))
# X positions for SS316: [0, 0, 0, 0], LIS: [1,1,1,1]
x_ss = np.zeros_like(ss_counts)
x_lis = np.ones_like(lis_counts)
# Scatter the raw counts
ax.scatter(
    x_ss, ss_counts, color="#003594", label="SS316 Observed", alpha=0.9, zorder=10
)
ax.scatter(
    x_lis, lis_counts, color="#f46036", label="LIS Observed", alpha=0.9, zorder=10
)
data_to_violin = [samples_f_ss, samples_f_lis]
parts = ax.violinplot(
    dataset=data_to_violin,
    positions=[0, 1],
    showmeans=False,
    showextrema=False,
    showmedians=False,
)
parts["bodies"][0].set_facecolor("#003594")
parts["bodies"][0].set_edgecolor("#003594")
parts["bodies"][1].set_facecolor("#f46036")
parts["bodies"][1].set_edgecolor("#f46036")
# Clean up the plot
ax.set_xlim(-0.5, 1.5)
ax.set_ylim(0)
ax.set_xticks([0, 1])
ax.set_xticklabels(["SS316", "LIS"])
ax.set_ylabel("Counts")
plt.show()
Likelihood Ratio Test (LRT)
The Likelihood Ratio Test (LRT) is a statistical method used to compare two nested models and determine whether adding additional parameters significantly improves the fit to the data. In the context of differential gene expression analysis, we use the LRT to test whether a gene exhibits significant differences in expression between conditions (SS316 and LIS).
The test compares:
- The Reduced Model (Null Model): Assumes a single mean $\mu$ for all samples, implying no differential expression.
- The Full Model (Alternative Model): Allows separate means $\mu_1$ and $\mu_2$ for the two conditions, permitting differential expression.
If the full model significantly improves the likelihood of the data compared to the reduced model, we reject the null hypothesis and conclude that the gene is differentially expressed.
- Compute the Negative Log-Likelihoods (NLLs)
- $\text{NLL}_r$: The negative log-likelihood of the reduced model, where all samples share a single mean $\mu$.
- $\text{NLL}_f$: The negative log-likelihood of the full model, which allows separate means $\mu_1$ (SS316) and $\mu_2$ (LIS).
Since lower negative log-likelihood values indicate a better fit, we expect $\text{NLL}_f$ to be smaller than $\text{NLL}_r$ if allowing different means provides a better explanation of the data.
- Compute the Likelihood Ratio Test (LRT) Statistic
The likelihood ratio statistic is defined as:
$$ \chi^2 = 2 (\text{NLL}_r - \text{NLL}_f) $$
This statistic quantifies how much better the full model explains the data compared to the reduced model. A large $\chi^2$ value suggests a significant improvement in fit, favoring the hypothesis that differential expression exists.
- Determine the Degrees of Freedom
The degrees of freedom for the test correspond to the number of additional parameters in the full model compared to the reduced model. In our case:
- The reduced model has one mean parameter $\mu$.
- The full model has two mean parameters $\mu_1$ and $\mu_2$.
- The dispersion parameter $\alpha$ is shared in both models, so it does not contribute to the degrees of freedom change.
Thus, the difference in the number of parameters is:
$$ \text{Degrees of Freedom} = 2 - 1 = 1 $$
- Compute the p-value
To determine statistical significance, we compare $\chi^2$ to a chi-square distribution with 1 degree of freedom. The p-value is given by:
$$ p = 1 - F_{\chi^2_1}(\chi^2) $$
where $F_{\chi^2_1}(\chi^2)$ is the cumulative distribution function (CDF) of the chi-square distribution with 1 degree of freedom.
- If $p$ is small (e.g., $p < 0.05$), we reject the null hypothesis and conclude that the gene is differentially expressed between SS316 and LIS.
- If $p$ is large, we do not have enough evidence to claim differential expression.
By performing the Likelihood Ratio Test, we statistically assess whether the mean expression levels for a gene differ between conditions. If the full model significantly improves the likelihood of the data compared to the reduced model, it suggests that the gene exhibits condition-dependent expression, which is crucial for identifying biologically meaningful differences in gene regulation.
This test provides a robust statistical foundation for determining differential gene expression, ensuring that the observed differences are unlikely to be due to random chance alone.
from scipy.stats import chi2

# LRT statistic
lr_stat = 2.0 * (nll_r - nll_f)
# p-value from the chi-square survival function; sf(x) = 1 - cdf(x), but it
# is more numerically stable for very small p-values
p_val = chi2.sf(lr_stat, df=1)
print(p_val)
Log2 Fold Change (Log2FC)
One of the key measures used in differential gene expression analysis is the Log2 Fold Change (Log2FC). It quantifies how much a gene’s expression level differs between two conditions, making it easier to interpret biological significance.
The fold change (FC) is the ratio of mean expression levels between two conditions:
$$ \text{FC} = \frac{\mu_2}{\mu_1} $$
where:
- $\mu_1$ is the mean expression level in the SS316 condition.
- $\mu_2$ is the mean expression level in the LIS condition.
Since gene expression data is often highly variable and spans several orders of magnitude, it is common to take the log base 2 of the fold change:
$$ \log_2(\text{FC}) = \log_2\left(\frac{\mu_2}{\mu_1}\right). $$
This transformation ensures that the measure is symmetric around zero:
- Log2FC > 0 → The gene is more highly expressed in LIS than in SS316.
- Log2FC < 0 → The gene is more highly expressed in SS316 than in LIS.
- Log2FC = 0 → No difference in expression between conditions.
Why Use Log2 Transform?
- Better Interpretability: Instead of dealing with large ratios, a Log2FC of +1 means the gene is doubled in expression, and a Log2FC of -1 means the expression is halved.
- Symmetry: Unlike raw fold changes, the log transformation makes the scale equal in both directions (upregulation vs. downregulation); see the short example after this list.
- Reducing Skewness: Gene expression values often have a highly skewed distribution, and the log transformation helps normalize the data, making it more suitable for statistical analysis.
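Here is that symmetry in numbers (arbitrary example values):
# A doubling and a halving sit the same distance from zero on the log2 scale
print(np.log2(200 / 100))  # doubled -> +1.0
print(np.log2(50 / 100))   # halved  -> -1.0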
Handling Low Expression Values
When dealing with RNA-seq data, some genes have very low or zero counts, which can make the ratio $\frac{\mu_2}{\mu_1}$ unstable. To prevent division by zero or extreme values, a small constant (e.g., pseudocount $\epsilon$) is sometimes added to both means:
$$ \log_2\left(\frac{\mu_2 + \epsilon}{\mu_1 + \epsilon}\right). $$
This prevents extreme log fold changes caused by very low expression levels.
Interpretation in Differential Expression Analysis
- Genes with large absolute Log2FC values (e.g., $|\text{Log2FC}| > 1$) are often considered biologically significant, as they show strong upregulation or downregulation.
- Statistical significance is typically assessed alongside Log2FC using the Likelihood Ratio Test (LRT) or other statistical methods. A gene with high Log2FC but a large p-value is not considered reliably differentially expressed.
- Volcano plots often display Log2FC on the x-axis and statistical significance (e.g., -log10 p-value) on the y-axis to highlight genes that are both significantly differentially expressed and have large fold changes; a sketch of such a plot follows the code below.
log2_fc = np.log2((mu_lis + 1e-9) / (mu_ss + 1e-9))
print(log2_fc)
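To connect this to the volcano plots mentioned above, here is a sketch that reuses this notebook's functions to test many genes and plot Log2FC against $-\log_{10} p$. The 200-gene subset is an assumption purely to keep the pure-Python optimization fast; replace it with `range(len(df))` to analyze every gene.
# Sketch: run the reduced-vs-full NB comparison across many genes,
# then draw a volcano plot of the results
log2fcs, pvals = [], []
for i in range(200):  # assumed subset for speed; use range(len(df)) for all genes
    ss = df.iloc[i][ss316_cols].values.astype(float)
    lis = df.iloc[i][lis_cols].values.astype(float)
    counts = np.concatenate([ss, lis])
    if counts.sum() == 0:
        continue  # skip genes with no reads at all (unstable fits)
    labels = np.array([0] * 4 + [1] * 4)
    _, _, nll_red = fit_reduced_model(counts)
    m1, m2, _, nll_full = fit_full_model(counts, labels)
    stat = 2.0 * (nll_red - nll_full)
    pvals.append(chi2.sf(max(stat, 0.0), df=1))  # guard tiny negatives from optimizer noise
    log2fcs.append(np.log2((m2 + 1e-9) / (m1 + 1e-9)))

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(log2fcs, -np.log10(np.array(pvals) + 1e-300), s=8, alpha=0.6)
ax.axhline(-np.log10(0.05), color="gray", linestyle="--")  # p = 0.05 guide line
ax.set_xlabel("Log2 fold change (LIS vs SS316)")
ax.set_ylabel("-log10 p-value")
plt.show()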
Connecting to DESeq2
So far, we have implemented a differential gene expression analysis pipeline using the Negative Binomial model, the Likelihood Ratio Test (LRT), and Log2 Fold Change (Log2FC). These concepts closely align with what is done in DESeq2, a widely used R package for analyzing RNA-seq data. Let’s compare our approach to DESeq2, highlighting the similarities and limitations of our implementation.
Similarities to DESeq2
- Negative Binomial distribution for count data
  - Both our Python-based approach and DESeq2 assume that gene expression follows a Negative Binomial (NB) distribution, which accounts for the overdispersion commonly observed in RNA-seq data.
  - Like in our implementation, DESeq2 models the variance as $\text{Var}(X) = \mu + \alpha \mu^2$, ensuring that dispersion is taken into account.
- Estimation of mean and dispersion ($\mu$ and $\alpha$)
  - In our method, we estimate $\mu$ (mean) and $\alpha$ (dispersion) by minimizing the negative log-likelihood (NLL).
  - DESeq2 also estimates these parameters but does so using shrinkage estimation, which helps improve stability, especially for genes with low counts.
- Differential expression testing with the Likelihood Ratio Test (LRT)
  - We used the LRT to compare a reduced model (single mean for all samples) to a full model (separate means for conditions).
  - DESeq2 performs a very similar LRT-based approach when testing for differential expression across multiple conditions, comparing nested models.
- Log2 Fold Change (Log2FC) interpretation
  - Both our approach and DESeq2 compute Log2FC to quantify how much a gene's expression changes between conditions, using the same equation: $\log_2\left(\frac{\mu_2}{\mu_1}\right)$.
  - DESeq2 applies shrinkage correction (via the "apeglm" or "ashr" methods) to prevent extremely large or unreliable Log2FC values, whereas our implementation does not currently include this correction.
Limitations of Our Approach Compared to DESeq2
While our Python-based implementation follows the core statistical principles behind DESeq2, there are several key limitations:
- Dispersion estimation is more simplistic
  - In our approach, each gene has a single estimated $\alpha$ (dispersion parameter).
  - DESeq2 shares information across genes to estimate dispersions more robustly, preventing overfitting and improving estimates for lowly expressed genes.
- No automatic normalization
  - DESeq2 applies median ratio normalization (MRN), which corrects for differences in sequencing depth and RNA composition across samples.
  - Our implementation assumes that raw counts are comparable without explicitly normalizing them.
- No shrinkage for low counts
  - DESeq2 applies Bayesian shrinkage to both dispersion estimates and Log2FC values, stabilizing results for low-expression genes.
  - Our implementation does not incorporate shrinkage, which may lead to more extreme fold changes for genes with low counts.
- No multiple testing correction
  - When testing thousands of genes, we must control for false discoveries.
  - DESeq2 automatically adjusts p-values using the Benjamini-Hochberg (FDR) correction, whereas our approach currently outputs raw p-values; a minimal sketch of this correction follows below.
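Here is a minimal NumPy sketch of the Benjamini-Hochberg procedure (the example p-values are made up; in practice you would pass in the raw p-values from testing every gene):
def bh_adjust(pvalues):
    """Benjamini-Hochberg FDR adjustment (minimal sketch)."""
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    order = np.argsort(p)                        # indices that sort p ascending
    ranked = p[order] * m / np.arange(1, m + 1)  # p * m / rank
    # Enforce monotonicity from the largest rank downward
    adjusted = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.minimum(adjusted, 1.0)       # cap at 1, restore input order
    return out

example_pvals = np.array([0.001, 0.009, 0.04, 0.20, 0.5])  # made-up values
print(bh_adjust(example_pvals))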
Practice
Now that we have gone through the process of identifying differentially expressed genes using the Negative Binomial model, the Likelihood Ratio Test (LRT), and Log2 Fold Change (Log2FC), it’s time to apply these methods to a different gene. This hands-on exercise will reinforce the concepts covered and allow you to practice working with real gene expression data.
Instructions
- Pick a New Gene:
  - Choose a different gene by selecting a new index from the dataset.
  - Make sure the gene has nonzero counts in at least some samples to avoid numerical issues.
- Extract Expression Data:
  - Retrieve the expression counts for both SS316 and LIS conditions.
- Fit the Reduced Model (Null Model):
  - Estimate $\mu$ (single mean for all samples).
  - Estimate $\alpha$ (dispersion).
  - Compute the negative log-likelihood ($\text{NLL}_r$).
- Fit the Full Model (Alternative Model):
  - Estimate separate means $\mu_1$ (SS316) and $\mu_2$ (LIS).
  - Keep $\alpha$ shared between the two groups.
  - Compute the negative log-likelihood ($\text{NLL}_f$).
- Perform the Likelihood Ratio Test (LRT):
  - Calculate the test statistic $\chi^2 = 2 (\text{NLL}_r - \text{NLL}_f)$.
  - Compute the p-value and interpret statistical significance.
- Calculate Log2 Fold Change (Log2FC):
  - Determine whether the gene is upregulated in SS316 or LIS.
- Visualize the Data:
  - Create a scatter plot of observed counts for SS316 and LIS.
  - Overlay a violin plot using simulated counts from the fitted Negative Binomial model.
Questions to Consider
- Does this gene show strong evidence of differential expression?
- What does the Log2 Fold Change tell you about the direction of expression changes?
- How well does the Negative Binomial model capture the observed distribution of counts?
This activity will help reinforce your understanding of differential expression analysis by applying the full workflow to a new gene. Try working in small groups, compare results, and discuss any patterns or challenges you encounter!