In-class activity: Dynamic programming ¶
The activity implements the Needleman-Wunsch algorithm for global sequence alignment. Two sequences, seq1 and seq2, are provided as inputs. The code consists of several functions that build and process a scoring matrix, perform a recursive traceback to recover all optimal alignments, and remove duplicate results. The overall goal is to compute the optimal alignment score and display the best alignments between the two sequences.
seq1 = "GCATGTAGGCGCTGGACTCGCTAGTAGTACAATGGCCGCCTCAGTGATGCGCGTAACCTAGTACGATGCCTAGTGAATT"
seq2 = "GGCGATAAGTTAAATTGTGTCAAGGGATGTCTTCGGAGTTCGAGCAACTGCATACCCCCAGTTAACGTCGTCC"
The first step in the algorithm is to initialize a scoring matrix that will be used to compute the alignment.
The function initialize_matrix takes two sequences and a gap penalty as input.
Within this function, the lengths of
seq1
and
seq2
are determined.
An (m+1) by (n+1) matrix is created where m and n are the lengths of
seq1
and
seq2
respectively.
Initially, the matrix is filled with zeros, and then the first row and first column are populated with cumulative gap penalties.
This initialization is essential because it sets up the boundaries for scoring alignments that start with gaps.
At present, the function raises a
NotImplementedError
, which indicates that the implementation details need to be filled in.
def initialize_matrix(seq1: str, seq2: str, gap: int) -> list[list[int]]:
"""Initialize the scoring matrix for global alignment.
Args:
seq1: First sequence which will be along the row.
seq2: Second sequence which will be along the column.
gap: Penalty for adding a gap.
"""
# TODO:
# 1. Determine the lengths of seq1 (m) and seq2 (n).
# 2. Create an (m+1) x (n+1) matrix filled with zeros using for loops.
# 3. Initialize the first column and the first row with the appropriate gap penalties.
#
# Return the matrix once it has been set up.
raise NotImplementedError("TODO: Implement the scoring matrix initialization")
Once the scoring matrix has been initialized, the
get_scoring_matrix
function fills it in using the Needleman-Wunsch recurrence relation.
This function iterates through the matrix starting from the second row and second column.
At each cell, three possible scores are computed: one by moving diagonally from the top-left (which corresponds to aligning the two current characters from
seq1
and
seq2
, with a score for either a match or a mismatch), one by moving upward (which corresponds to introducing a gap in
seq2
), and one by moving leftward (which corresponds to introducing a gap in
seq1
).
The maximum of these three scores is then stored in the current cell.
This dynamic programming approach systematically builds up the optimal alignment score for every possible prefix of the two sequences.
Like the initialization function, the implementation of this function has been left as a TODO.
def get_scoring_matrix(
seq1: str, seq2: str, match: int, mismatch: int, gap: int
) -> list[list[int]]:
"""
Fill in the scoring matrix using the Needleman-Wunsch recurrence.
Args:
seq1: First sequence which will be along the row.
seq2: Second sequence which will be along the column.
match: Score for forming a sequence match.
mismatch: Score for forming a sequence mismatch.
gap: Penalty for adding a gap.
"""
# TODO:
# 1. Loop through the matrix from i=1..m and j=1..n.
# 2. For each cell (i, j):
# - Compute diag (matrix[i-1][j-1] + match or mismatch)
# - Compute up (matrix[i-1][j] + gap)
# - Compute left (matrix[i][j-1] + gap)
# - Take the maximum of these three values.
# - Store this maximum in matrix[i][j].
raise NotImplementedError("TODO: Implement matrix filling using the NW recurrence")
After the scoring matrix is fully computed, the algorithm performs a traceback to recover the actual alignments.
The traceback function begins at the bottom-right cell of the matrix and recursively explores the possible moves that could have resulted in the current cell’s score.
The function checks three directions: a diagonal move, which aligns the corresponding characters from both sequences; an upward move, which represents aligning a character from
seq1
with a gap; and a leftward move, which represents aligning a character from
seq2
with a gap.
At each recursive call, the current characters or a gap are appended to the alignments being constructed.
When the recursion reaches the top-left corner of the matrix, an alignment has been completed.
This recursive approach ensures that all optimal alignments are recovered.
def traceback(
seq1: str,
seq2: str,
match: int,
mismatch: int,
gap: int,
matrix: list[list[int]],
i: int,
j: int,
) -> list[tuple[str, str]]:
"""
Recursively performs a traceback on the Needleman-Wunsch scoring matrix to recover all
optimal sequence alignments between seq1 and seq2.
Starting from the cell (i, j) in the matrix, the function traces back to the origin (0, 0),
constructing alignments based on the moves (diagonal, up, left) that could have produced the
optimal score in the cell. At each step, it checks whether the score in the current cell is
consistent with one of the possible moves and recurses accordingly.
Args:
seq1: The first sequence (aligned along the rows of the matrix).
seq2: The second sequence (aligned along the columns of the matrix).
match: The score for matching characters.
mismatch: The score for mismatching characters.
gap: The penalty for inserting a gap.
matrix: The filled scoring matrix from the Needleman-Wunsch algorithm.
i: The current row index in the matrix.
j: The current column index in the matrix.
Returns:
A list of tuples, where each tuple consists of two strings representing
an optimal alignment of seq1 and seq2. The first element of the tuple
is the alignment for seq1, and the second is the alignment for seq2.
"""
# Base case: if we have reached the top-left corner (0, 0), return an empty alignment.
if i == 0 and j == 0:
return [("", "")]
# List to accumulate all possible optimal alignments.
alignments = []
# 1. Check for a diagonal move:
# This move aligns seq1[i-1] with seq2[j-1]. It is valid if we are not on the first row or column.
if i > 0 and j > 0:
# Score from the diagonal cell.
diag_score = matrix[i - 1][j - 1]
# Compute expected score if the move was diagonal (either match or mismatch).
expected = diag_score + (match if seq1[i - 1] == seq2[j - 1] else mismatch)
# Check if the current cell's score is consistent with a diagonal move.
if matrix[i][j] == expected:
# Recursively traceback from the diagonal cell.
diag_alignments = traceback(
seq1, seq2, match, mismatch, gap, matrix, i - 1, j - 1
)
# For each alignment, append the current characters.
for a in diag_alignments:
aligned1 = a[0] + seq1[i - 1]
aligned2 = a[1] + seq2[j - 1]
alignments.append((aligned1, aligned2))
# 2. Check for an upward move:
# This move represents aligning seq1[i-1] with a gap (i.e., a gap in seq2).
if i > 0:
# Expected score if coming from above, with a gap penalty.
expected = matrix[i - 1][j] + gap
# Check if the move is valid by comparing with the current cell's score.
if matrix[i][j] == expected:
# Recursively traceback from the cell above.
up_alignments = traceback(
seq1, seq2, match, mismatch, gap, matrix, i - 1, j
)
for a in up_alignments:
aligned1 = a[0] + seq1[i - 1]
aligned2 = a[1] + "-"
alignments.append((aligned1, aligned2))
# 3. Check for a leftward move:
# This move represents aligning a gap with seq2[j-1] (i.e., a gap in seq1).
if j > 0:
# Expected score if coming from the left, with a gap penalty.
expected = matrix[i][j - 1] + gap
# Check if the left move produced the current score.
if matrix[i][j] == expected:
# Recursively traceback from the left cell.
left_alignments = traceback(
seq1, seq2, match, mismatch, gap, matrix, i, j - 1
)
for a in left_alignments:
aligned1 = a[0] + "-"
aligned2 = a[1] + seq2[j - 1]
alignments.append((aligned1, aligned2))
# Return all accumulated alignments from the current cell.
return alignments
After the traceback step, the
remove_duplicates
function is used to filter out any duplicate alignments from the results.
This function iterates over the list of alignments and collects only the unique ones.
A dictionary is used internally to keep track of the alignments that have already been seen.
By preserving the order of first occurrence, the function ensures that the final output contains a set of distinct optimal alignments.
def remove_duplicates(alignments: list[tuple[str, str]]) -> list[tuple[str, str]]:
"""
Remove duplicate alignments from the provided list.
This function iterates through the list of alignment tuples and collects only unique
alignments. An alignment is considered duplicate if it has already been encountered
during the iteration. The order of the unique alignments is preserved based on their
first occurrence in the input list.
Args:
alignments: A list of alignment tuples, where each tuple
represents an alignment of two sequences (e.g., (aligned_seq1, aligned_seq2)).
Returns:
A list of unique alignment tuples with duplicates removed.
"""
seen = {} # Dictionary to track seen alignments.
unique = [] # List to store unique alignments in the order they are found.
# Iterate over each alignment in the provided list.
for alignment in alignments:
# If the alignment has not been seen before, add it to the dictionary and unique list.
if alignment not in seen:
seen[alignment] = True # Mark this alignment as seen.
unique.append(alignment) # Add the unique alignment to the result list.
return unique
The
needleman_wunsch
function ties all these components together. It begins by computing the full scoring matrix using the get_scoring_matrix function.
The optimal alignment score is found in the bottom-right cell of this matrix.
Next, the function performs a traceback starting from this cell to generate all optimal alignments between the two sequences.
Once all possible alignments have been recovered, duplicate alignments are removed using the
remove_duplicates
function.
The function finally returns the unique alignments along with the final alignment score.
The main section of the code then calls
needleman_wunsch
with the given sequences and prints the alignment score followed by up to five of the optimal alignments.
def needleman_wunsch(
seq1: str, seq2: str, match: int = 1, mismatch: int = -1, gap: int = -2
) -> tuple[list[tuple[str, str]], int]:
"""
Perform global sequence alignment using the Needleman-Wunsch algorithm.
This function computes a scoring matrix based on the provided match, mismatch, and gap
penalties. It then performs a traceback through the matrix to recover all optimal alignments
between seq1 and seq2. Duplicate alignments are removed, and the final optimal alignment
score is retrieved from the bottom-right cell of the scoring matrix.
Parameters:
seq1: The first sequence to be aligned.
seq2: The second sequence to be aligned.
matc: The score awarded for matching characters. Default is 1.
mismatch: The penalty for mismatching characters. Default is -1.
gap: The penalty for introducing a gap. Default is -2.
Returns:
A list of unique optimal alignments. Each alignment is a tuple
containing two strings representing the aligned sequences.
The final alignment score, located in the bottom-right cell of the scoring matrix.
"""
# Compute the full scoring matrix using the Needleman-Wunsch algorithm.
# The matrix encapsulates the optimal scores for aligning all prefixes of seq1 and seq2.
matrix = get_scoring_matrix(seq1, seq2, match, mismatch, gap)
# The optimal alignment score is stored in the bottom-right cell of the matrix.
final_score = matrix[len(seq1)][len(seq2)]
# Perform a traceback starting from the bottom-right cell to recover all optimal alignments.
alignments = traceback(
seq1, seq2, match, mismatch, gap, matrix, len(seq1), len(seq2)
)
# Remove any duplicate alignments to ensure that only unique optimal alignments are returned.
unique_alignments = remove_duplicates(alignments)
# Return the list of unique alignments along with the final alignment score.
return unique_alignments, final_score
When you run the final code, the Needleman-Wunsch algorithm will compute the optimal global alignment between
seq1
and
seq2
.
The alignment score is printed first, followed by the aligned sequences for each optimal alignment.
You have the option to adjust the match, mismatch, and gap parameters, which will influence the computed scores and the resulting alignments.
# Call the new smith_waterman function.
alignments, alignment_score = needleman_wunsch(seq1, seq2)
print("=== Needleman-Wunsch alignments ===\n")
print(f"Alignment Score: {alignment_score}\n")
# Iterate through all optimal alignments.
for idx, (aligned1, aligned2) in enumerate(alignments[:5], start=1):
print(f"Alignment {idx}:")
print(f" Seq1 Alignment: {aligned1}")
print(f" Seq2 Alignment: {aligned2}")
print()