Q01
Points: 10
In this part of the project, you will work with paired-end FASTQ reads from Pseudomonas aeruginosa biofilms grown under Ground and Space Day 3 conditions. The goal is to investigate how the bacteria respond to microgravity by analyzing gene expression data.
Accessing data
Navigate to this shared Galaxy history, which contains all the data needed for this part of the project. You should see the following files:
- ASM1462v1 Genes : GFF3 file containing genes from the Pseudomonas aeruginosa strain used in this study.
- ASM1462v1 rRNA : FASTA file containing Pseudomonas aeruginosa rRNA sequences.
- ASM1462v1 : Assembled Pseudomonas aeruginosa genome.
- Space - Day 3 - FASTQ : RNA-seq reads from Pseudomonas aeruginosa after three days on the International Space Station.
- Ground - Day 3 - FASTQ : RNA-seq reads from Pseudomonas aeruginosa after three days on the ground (i.e., control).
Click "Import this history" near the top of the screen and then click "Copy History". You only need to copy the active, non-deleted datasets. You will use only this new Galaxy history for this assignment.
rRNA mapping
The objective of this step is to map your sequencing reads to a reference rRNA FASTA file, thereby identifying and filtering out ribosomal RNA sequences from your dataset. Ribosomal RNAs are highly abundant in total RNA samples, and if not removed, they can account for a large proportion of your sequencing data. This overwhelming presence can obscure the signals from messenger RNAs (mRNAs), which are crucial for understanding gene expression patterns.
By mapping and subsequently removing rRNA sequences, you significantly enhance the quality and interpretability of your data. This process minimizes background noise, allowing for a more precise quantification of mRNA transcripts. Improved data clarity is essential for downstream analyses, such as differential gene expression, where accurate measurement of transcript levels is paramount. Moreover, eliminating rRNA contamination reduces computational load during analysis and contributes to more robust statistical conclusions.
Begin by launching the Bowtie2 tool in Galaxy and configuring it as follows. You will run one mapping per condition (Space and Ground), supplying that condition's forward and reverse paired-end reads as input. Follow these steps:
- Library Type:
  - Choose "paired" as your library type.
- Input FASTQ Files:
  - For each sample category, assign the forward and reverse FASTQ files to the appropriate input fields.
- Output Options:
  - Enable the option to write unaligned reads to separate file(s) by setting it to yes.
  - Ensure that the option for writing aligned reads to separate files is set to no.
- Reference Selection:
  - Instead of using a built-in genome index, use a genome from the history and build the index.
  - Select the rRNA FASTA file (i.e., ASM1462v1 rRNA) as the reference.
- Additional Settings:
  - Do not set paired-end options.
  - Do not set read groups.
  - Do not tweak the analysis mode.
  - Do not tweak SAM/BAM options.
  - Save the Bowtie2 mapping statistics to your history (set this option to true).
- Execution:
  - With all parameters set as specified, run the Bowtie2 mapping process for each sample condition (Space and Ground) separately.
This configuration will align the paired-end reads from each sample to the rRNA reference, allowing you to filter out ribosomal RNA contamination.
Once the Bowtie2 mapping process is completed, it's essential to carefully review the outputs in your Galaxy history to ensure that the alignment ran as expected.
Locate the mapping statistics file generated by Bowtie2 in your Galaxy history. This file provides key performance metrics of your alignment, including the overall alignment rate, which is critical for assessing the quality of your mapping. Identify the percentage of reads that were successfully aligned to the rRNA reference.
The variables OVERALL_ALIGNMENT_RATE_GROUND and OVERALL_ALIGNMENT_RATE_SPACE below are initially set to 0.0. Replace these placeholders with the actual alignment rates reported in the mapping statistics for each condition. These values will serve as a benchmark for evaluating subsequent steps in your RNA-Seq analysis workflow.
OVERALL_ALIGNMENT_RATE_GROUND = 0.0
OVERALL_ALIGNMENT_RATE_SPACE = 0.0
Question: Please comment on the meaning and significance of these values.
PUT YOUR ANSWER HERE
Q02
Points: 15
In computational biology, read mapping is a crucial step in sequence alignment, where short DNA sequences (reads) are matched against a reference genome. One way to accelerate this process is by constructing a $k$-mer index, which serves as a lookup table for identifying potential matching regions in the reference genome. Instead of scanning the entire reference sequence for each query, we can use this index to efficiently locate potential seed matches.
In this problem, you will implement a function, `build_kmer_index`, that constructs a hash table (Python dictionary) mapping each $k$-mer (subsequence of length $k$) to the list of positions where it appears in the reference genome.
This index serves as the first step in seed-and-extend alignment algorithms, where we first locate potential matching regions before refining the alignment.
Why Does This Matter?
This indexing approach is fundamental in sequence alignment algorithms such as BLAST and Burrows-Wheeler Transform-based mappers. By precomputing locations of k-mers, we significantly speed up read alignment, making it feasible to process large genomic datasets.
This problem will help students understand:
- The importance of indexing in bioinformatics.
- How to use Python data structures efficiently.
- How preprocessing data (like creating a k-mer index) leads to faster searches in large datasets.
Programming Concepts Needed
To complete this task, students will apply the following Python concepts:
- Dictionaries (`dict`): We will use a dictionary to store k-mers as keys and their starting positions as values. Efficiently retrieving values from a dictionary allows quick lookups.
- Lists (`list`): Each dictionary key (a k-mer) maps to a list of positions, since a k-mer can occur multiple times in the reference sequence.
- For Loops (`for`): We iterate over the reference sequence to extract k-mers. The loop ensures that we capture every possible k-mer of length $k$.
- String Slicing (`str[start:end]`): Extracts each k-mer from the reference sequence using Python string slicing.
- Conditionals (`if-else`): Checks whether a k-mer already exists in the dictionary so we can append to an existing list or create a new entry.
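To see how these pieces fit together before you fill in the skeleton below, here is a minimal sketch on a toy reference (the names `toy_ref` and `index` are illustrative only, not part of the required code):

```python
# Toy illustration of the slice-then-append-or-create pattern for a k-mer index.
toy_ref = "ATCGATCGA"
k = 3

index: dict[str, list[int]] = {}
for i in range(len(toy_ref) - k + 1):  # every valid k-mer start position
    kmer = toy_ref[i:i + k]            # string slicing extracts the k-mer
    if kmer in index:
        index[kmer].append(i)          # seen before: append this position
    else:
        index[kmer] = [i]              # first occurrence: start a new list

print(index)  # {'ATC': [0, 4], 'TCG': [1, 5], 'CGA': [2, 6], 'GAT': [3]}
```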
def build_kmer_index(reference: str, k: int) -> dict[str, list[int]]:
"""
Builds a hash table (i.e., dictionary) of k-mers from the reference sequence.
For each k-mer (substring of length k) in the reference, we record all starting
positions where the k-mer appears. This index will help us quickly look up
positions in the reference that match a k-mer from a query.
Args:
reference: The reference DNA sequence.
k: The length of k-mers to extract.
Returns:
A dictionary with k-mers as keys and a list of starting positions as values.
Examples:
>>> build_kmer_index("ATCGATCGA", 3)
{'ATC': [0, 4], 'TCG': [1, 5], 'CGA': [2, 6], 'GAT': [3]}
>>> build_kmer_index("AAGGAA", 2)
{'AA': [0, 4], 'AG': [1], 'GG': [2], 'GA': [3]}
"""
# TODO: Initialize an empty dictionary to hold k-mer positions.
# TODO: Loop through the reference sequence.
# TODO: Extract the k-mer substring starting at the current index.
# TODO: Check if the k-mer already exists in the dictionary.
# TODO: If it exists, append the current index to its list.
# TODO: If it does not exist, create a new list with the current index.
# TODO: Return the dictionary containing all k-mers and their positions.
Q03
Points: 15
Now that we've built a k-mer index, the next step is to use this index to efficiently identify candidate alignment start positions for a query sequence (short DNA read). Given a new read (query), we need to determine where in the reference genome we should start searching for an alignment.
In traditional sequence alignment, an exhaustive search would compare the query against every possible position in the reference. However, by using a precomputed k-mer index, we can quickly vote on likely starting positions based on shared k-mers between the reference and query.
Your task is to implement `map_query_to_reference`, a function that takes a query sequence, a precomputed k-mer index, and a k-mer length, and returns a list of potential start positions where an alignment might begin.
A candidate alignment start position is a position in the reference genome where the query sequence is likely to align. We identify these positions by looking for exact matches of short k-mers between the reference and query.
- Each k-mer (substring of length `k`) in the query may also exist in the reference.
- If a k-mer is found in the reference, it gives us a hint about where the query might align.
- However, because the k-mer could appear midway through the query, we must adjust the reference position to estimate where the query actually starts.
Why Does This Matter?
- Many read mapping algorithms (like BLAST, BWA, and minimap2) first identify seed positions where an alignment is likely before performing more computationally expensive dynamic programming-based alignment (like Smith-Waterman).
- A k-mer index allows us to quickly find regions of the reference that share exact substrings (k-mers) with the query, serving as potential alignment start points.
- This process reduces the search space, making read mapping significantly faster.
Programming Concepts Needed
To complete this task, students will apply the following Python concepts:
- Dictionaries (`dict`): The k-mer index is stored in a dictionary mapping k-mers to their reference positions. Efficient key lookups allow quick retrieval of candidate positions.
- Lists (`list`): A list is used to store candidate alignment positions. Removing duplicates ensures unique candidate positions.
- For Loops (`for`): A loop slides a window of size `k` across the query sequence to extract k-mers. Another loop iterates over each reference position where a k-mer is found.
- String Slicing (`str[start:end]`): Extracts k-mers from the query sequence.
- Conditionals (`if-else`): Ensures only valid (non-negative) positions are included and avoids duplicate entries by using a set.
- Sorting (`sorted()`): Ensures candidate positions are returned in ascending order for consistency.
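As a concrete illustration of the "voting" arithmetic described above, here is a small sketch using a hypothetical `toy_index` built from the reference "ATCGATCGA" with k = 3 (illustrative names only):

```python
# Each shared k-mer votes for: reference position minus the k-mer's offset in the query.
toy_index = {"ATC": [0, 4], "TCG": [1, 5], "CGA": [2, 6], "GAT": [3]}
query = "TCGA"
k = 3

candidates: set[int] = set()
for offset in range(len(query) - k + 1):     # slide a k-length window over the query
    kmer = query[offset:offset + k]
    for ref_pos in toy_index.get(kmer, []):  # every place this k-mer occurs in the reference
        start = ref_pos - offset             # shift back by the k-mer's offset in the query
        if start >= 0:                       # keep only positions inside the reference
            candidates.add(start)

print(sorted(candidates))  # [1, 5]
```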
def map_query_to_reference(
query: str, kmer_index: dict[str, list[int]], k: int
) -> list[int]:
"""
Maps a query (read) to candidate alignment positions in the reference using the k-mer index.
For each k-mer in the query, if it exists in the k-mer index, we compute a potential
alignment start position by subtracting the k-mer's offset in the query from the k-mer's
position in the reference. This "votes" for a candidate alignment start position.
Args:
query: The query read sequence.
kmer_index: A dictionary mapping k-mers to lists of positions in the reference.
k: The k-mer length that was used to build the index.
Returns:
A sorted list of unique candidate alignment start positions in the reference.
Examples:
>>> kmer_dict = build_kmer_index("ATCGATCGA", 3)
>>> map_query_to_reference("TCGA", kmer_dict, 3)
[1, 5]
>>> kmer_dict = build_kmer_index("AAGGAAGGA", 2)
>>> map_query_to_reference("GGA", kmer_dict, 2)
[2, 6]
>>> map_query_to_reference("GGA", kmer_dict, 4)
[]
>>> kmer_dict = build_kmer_index("AGCTTAGCTAAGCTT", 3)
>>> map_query_to_reference("TAGCT", kmer_dict, 3)
[4, 9]
"""
# TODO: Initialize an empty list to store candidate alignment positions.
# TODO: Loop over the query sequence with a window of length k.
# Hint: Use range(len(query) - k + 1) to avoid index errors.
# TODO: Extract the current k-mer from the query using slicing.
# TODO: Check if the k-mer exists in the kmer_index.
# TODO: For each occurrence (position) of the k-mer in the reference,
# TODO: Compute the candidate alignment position (ref position minus current query index).
# TODO: Append the computed candidate position to your candidate positions list.
# TODO: Remove any duplicate candidate positions and filter out negative positions.
# Hints:
# - Use a set to track which positions have been added.
# - Only include positions that are non-negative.
# TODO: Return the sorted list of unique candidate alignment positions.
Q04
Points: 58
The Burrows-Wheeler Transform (BWT) is a fundamental data structure used in genomic sequence alignment and compression. It enables efficient searching, indexing, and compression of large DNA sequences. The BWT is a critical component of tools like BWA and Bowtie, which are widely used for mapping sequencing reads to a reference genome.
At a high level, BWT rearranges the input string in a way that groups similar characters together, making it highly compressible and enabling fast substring searches through FM-indexing. However, before we compute the BWT, we must ensure that the input text is properly formatted.
The first function in our BWT pipeline is `prepare_text`. Before computing the BWT, we need to make sure the text contains a unique end marker (`$`). This marker helps us differentiate between different rotations of the string, ensuring that our suffix sorting is well-defined.
Why is the `$` marker important?
- Indicates the end of the string → Ensures that the transformation can be reversed.
- Creates a unique suffix → Prevents ambiguity in suffix sorting.
- Allows efficient alignment → Many bioinformatics algorithms rely on `$` for indexing genomic data.
Your task is to implement `prepare_text`, which guarantees that any input string ends with a `$` marker.
Programming Concepts Needed
To complete this task, students will apply the following Python concepts:
- String Methods (`str.endswith()`): Used to check if the input already ends with `$`.
- String Concatenation (`+`): If the string does not end with `$`, we append it.
- Conditionals (`if-else`): Ensures we do not modify the string unnecessarily.
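A quick illustration of the check-then-append pattern on a short made-up string (`raw` is just an example name):

```python
# Append '$' only when it is not already the last character.
raw = "banana band"
if not raw.endswith("$"):
    raw = raw + "$"
print(raw)  # banana band$
```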
test_str = "bananas and bandanas are in the banana band"
def prepare_text(text: str) -> str:
"""
Cleans the input text and ensures that it ends with a unique end marker '$'.
If the text already ends with '$', it is returned unchanged.
Args:
text: The raw input string.
Returns:
The cleaned string that is guaranteed to end with '$'.
Examples:
>>> test_str = "bananas and bandanas are in the banana band"
>>> prepare_text(test_str)
'bananas and bandanas are in the banana band$'
"""
# TODO: Check if the input text ends with '$'.
# TODO: If the text does not end with '$', append '$' to the text.
# TODO: Return the cleaned text.
The Burrows-Wheeler Transform relies on a sorted list of all cyclic rotations of the input text. Each rotation is a version of the string where characters are cyclically shifted to the left. The key motivation behind this step is that sorting these rotations reveals patterns in the sequence that make substring searches faster.
Why do we generate rotations?
- Sorting these rotations is the foundation of BWT → The last column of the sorted rotations gives us the transformed text.
- Reveals hidden structure in the sequence → Allows better compression and efficient searching.
- Prepares the text for efficient FM-indexing, which is used in read alignment tools like BWA and Bowtie.
Your task is to implement `generate_rotations`, which cyclically shifts the input text to create all possible rotations.
Programming Concepts Needed
To complete this task, students will apply the following Python concepts:
- String Concatenation (`+`): Constructs cyclic rotations by shifting characters and combining substrings.
- String Slicing (`str[start:end]`): Used to extract the substrings that make up each rotation.
- Loops (`for`): Iterates through all possible shift positions to generate rotations.
- Lists (`list`): Stores and returns all generated rotations.
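As a tiny illustration of the slicing pattern described above, here is the same idea on a five-character toy string, written as a list comprehension rather than the explicit loop the skeleton asks for:

```python
# Rotation i moves the first i characters to the end: text[i:] + text[:i].
text = "band$"
rotations = [text[i:] + text[:i] for i in range(len(text))]
print(rotations)  # ['band$', 'and$b', 'nd$ba', 'd$ban', '$band']
```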
def generate_rotations(text: str) -> list[str]:
"""
Generates all cyclic rotations of the input text.
Each rotation is created by taking a substring from the current index
to the end and concatenating it with the substring from the beginning
of the text up to the current index.
Args:
text: The input string.
Returns:
A list of all rotations of the text.
Examples:
>>> clean_str = "bananas and bandanas are in the banana band$"
>>> generate_rotations(clean_str)
[
'bananas and bandanas are in the banana band$',
'ananas and bandanas are in the banana band$b',
'nanas and bandanas are in the banana band$ba',
...
'd$bananas and bandanas are in the banana ban',
'$bananas and bandanas are in the banana band'
]
"""
# TODO: Initialize an empty list to store the rotations.
# TODO: Loop over the indices of the input text.
# TODO: For the current index i, create a rotation by concatenating
# the substring from i to the end with the substring from the beginning up to i.
# TODO: Append the generated rotation to the list.
# TODO: Return the list containing all rotations of the text.
Now that we have generated all cyclic rotations of our input text, the next step in the Burrows-Wheeler Transform is to sort these rotations lexicographically.
However, we must handle a special case: the end-of-string marker `$` should always be considered the smallest character (i.e., it should sort before all other characters).
This function, `bwt_sort_key`, defines a custom sorting key to ensure that when we sort rotations, `$` is always ranked lower than any other character.
Why do we need a custom sorting key?
- Lexicographic sorting is crucial → The structure of the BWT relies on sorting the cyclic rotations of the input string.
- Special handling of `$` → Since `$` is a non-alphabetic character, we need to ensure it sorts before any letter.
- Facilitates fast FM-index searching → Sorting suffixes correctly is key to efficient substring searches.
To achieve this, we create a tuple of numeric values where:
- Characters are represented by their ASCII values (via `ord()`).
- `$` is assigned a value of `-1`, ensuring it appears before any other character when sorted.
Programming Concepts Needed
To complete this task, students will apply the following Python concepts:
- ASCII Values (`ord()`): Converts characters into numeric ASCII values to be used as sorting keys.
- Conditionals (`if-else`): Ensures `$` is assigned a unique low value (`-1`).
- Lists (`list`): Stores the numeric representation of each character before converting to a tuple.
- Tuples (`tuple()`): Since sorting functions require immutable keys, the numeric list is converted into a tuple.
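The same idea can be sketched in one line on a short toy rotation (a generator expression standing in for the loop-and-append approach the skeleton describes):

```python
# '$' maps to -1 so it sorts before every ASCII character; everything else uses ord().
rotation = "d$ban"
key = tuple(-1 if ch == "$" else ord(ch) for ch in rotation)
print(key)  # (100, -1, 98, 97, 110)
```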
def bwt_sort_key(rotation: str) -> tuple[int, ...]:
"""
Creates a sort key for a given rotation string, ensuring that the '$' character
is treated as smaller than all other ASCII characters.
This function iterates over each character in the rotation. If the character
is '$', it appends -1 to the key list (since -1 is less than any ASCII value).
Otherwise, it appends the ASCII value of the character using ord().
Args:
rotation: A rotation string from the Burrows-Wheeler Transform process.
Returns:
A tuple of numbers representing the sort key for the rotation.
Examples:
>>> test_str = "bananas and bandanas are in the banana band$"
>>> bwt_sort_key(test_str)
(98, 97, 110, 97, ..., 110, 100, -1)
"""
# TODO: Initialize an empty list to store the key values.
# TODO: Iterate over each character in the input rotation string.
# TODO: For each character, check if it is '$'.
# TODO: If it is '$', append -1 to the key list.
# TODO: Otherwise, append the ASCII value (obtained using ord()) of the character.
# TODO: Convert the list of key values into a tuple.
# TODO: Return the tuple as the sort key.
Now that we have generated all cyclic rotations of our input text and defined a custom sorting key, the next step in the Burrows-Wheeler Transform process is to sort these rotations lexicographically. This is the key step that structures the transformation and prepares for efficient searching and compression.
Why do we need to sort rotations?
- The essence of the Burrows-Wheeler Transform is built on the idea of sorting all cyclic shifts of the input text.
- By sorting these shifts, we create a structured representation of the text that allows faster searching and compression.
- The sorted rotations help identify patterns in the sequence, which is fundamental for FM-index-based read mapping.
This function, `sort_rotations`, takes in a list of rotations and sorts them using our previously defined sorting key (`bwt_sort_key`), ensuring that the `$` marker appears first.
Programming Concepts Needed
To complete this task, students will apply the following Python concepts:
- Sorting (`sorted()`): Uses Python's built-in sorting function to arrange rotations.
- Custom Sorting Key (`key=function`): Applies the previously implemented `bwt_sort_key` function to handle the special ordering of `$`.
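Here is a small sketch of `sorted()` with a `key=` argument, reusing the toy rotations of "band$" from earlier; the compact `bwt_sort_key` below is an assumed stand-in for your own implementation:

```python
# sorted() compares the tuples returned by the key function, so the rotation
# starting with '$' is ranked smallest and comes first.
def bwt_sort_key(rotation: str) -> tuple[int, ...]:
    return tuple(-1 if ch == "$" else ord(ch) for ch in rotation)

rotations = ["band$", "and$b", "nd$ba", "d$ban", "$band"]
print(sorted(rotations, key=bwt_sort_key))
# ['$band', 'and$b', 'band$', 'd$ban', 'nd$ba']
```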
def sort_rotations(rotations: list[str]) -> list[str]:
"""
Sorts the list of rotations in lexicographical order,
ensuring that the '$' character is treated as smaller than all other ASCII characters.
Args:
rotations: A list of rotations (strings).
Returns:
A list of rotations sorted according to the custom key.
"""
# TODO: Sort the list of rotations using the custom key function bwt_sort_key.
# TODO: Return the sorted list of rotations.
Now that we have sorted all cyclic rotations of the input text, we can finally construct the Burrows-Wheeler Transform. This is done by extracting the last column from the sorted rotations.
Why do we extract the last column?
- The last column of the sorted cyclic rotations is the key to the Burrows-Wheeler Transform.
- The BWT rearranges the original string into a highly compressible form by grouping similar characters together.
- This transformation allows for fast substring searches using the FM-index, which is widely used in genomic sequence alignment.
By extracting the last character of each sorted rotation, we obtain the BWT of the original input text.
Programming Concepts Needed
To complete this task, students will apply the following Python concepts:
- String Indexing (`str[-1]`): Retrieves the last character of each sorted rotation.
- Lists (`list`): Stores the last characters before converting them into a string.
- Looping (`for`): Iterates through each sorted rotation to extract the last column.
- String Joining (`"".join()`): Efficiently combines the characters into the final BWT string.
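Continuing the "band$" toy example, the BWT falls out of a simple join over the sorted rotations:

```python
# The BWT is the last character of each sorted rotation, joined in order.
sorted_rotations = ["$band", "and$b", "band$", "d$ban", "nd$ba"]
bwt = "".join(rotation[-1] for rotation in sorted_rotations)
print(bwt)  # db$na
```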
def extract_last_column(sorted_rotations: list[str]) -> str:
"""
Extracts the last character from each rotation to form the BWT string.
Args:
sorted_rotations: The sorted list of rotations.
Returns:
The BWT of the original string.
Examples:
>>> rotations = generate_rotations("bananas and bandanas are in the banana band$")
>>> sorted_rotations = sort_rotations(rotations)
>>> extract_last_column(sorted_rotations)
'dsseadennnbbndb b nn $  nnnhrt iaaaaaaaaaaa '
"""
# TODO: Initialize an empty list to store the last characters from each rotation.
# TODO: Loop over each rotation in the sorted_rotations list.
# TODO: Extract the last character of the current rotation.
# TODO: Append the extracted character to the list of last characters.
# TODO: Combine the list of last characters into a single string.
# TODO: Return the resulting string (the BWT of the original text).
Q05
Points: 2
Building on our introduction to the BWT, we now turn to an essential step in reconstructing the original string or aligning reads using LF-mapping. In the context of read mapping, once we have compressed the reference sequence via the BWT, we often want to efficiently locate where a given read might match. LF-mapping is central to this process because it tells us, for each character in the BWT (the “last column”), which row and character in the first column it corresponds to. By repeatedly applying LF-mapping, we can recover the positions in the original string or track how short reads align.
test_bwt = "dsseadennnbbndb b nn $  nnnhrt iaaaaaaaaaaa "
However, to perform LF-mapping, we first need to construct the so-called “first column,” which lists the same characters found in the BWT but in sorted order. This sorting step is crucial because it re-establishes the lexicographic relationships among the characters that were disrupted by the transform. By knowing the positions of characters in the first column, we can track exactly which occurrence of each character in the BWT corresponds to each occurrence in the sorted first column.
From a programming standpoint, this function is a straightforward exercise in string manipulation and sorting in Python:
- You will work with a string (the BWT) and transform it by applying the built-in sorted function with the `bwt_sort_key` from the previous question, which returns a list of characters sorted in ascending order (with `$` first).
- You will then join these sorted characters back together into a single string to form the first column.
- This highlights the use of Python’s core data structures—specifically, lists for intermediate sorting—and the string operations available to compose your final output.
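On the small "band$" example used in the sketches above, this amounts to the following (again assuming a compact `bwt_sort_key` as a stand-in for your Q04 implementation):

```python
# The first column is just the BWT's characters re-sorted, with '$' first.
def bwt_sort_key(rotation: str) -> tuple[int, ...]:
    return tuple(-1 if ch == "$" else ord(ch) for ch in rotation)

toy_bwt = "db$na"  # BWT of "band$"
first_col = "".join(sorted(toy_bwt, key=bwt_sort_key))
print(first_col)   # $abdn
```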
def create_first_column(bwt: str) -> str:
"""
Given a BWT, construct the 'first column' by sorting the characters in BWT.
Args:
bwt: The Burrows-Wheeler Transform string.
Returns:
A string representing the sorted characters of the BWT (the first column).
Examples:
>>> test_bwt = "dsseadennnbbndb b nn $  nnnhrt iaaaaaaaaaaa "
>>> create_first_column(test_bwt)
'$       aaaaaaaaaaaabbbbdddeehinnnnnnnnnrsst'
"""
# TODO: Sort the characters in the BWT using the custom sort key bwt_sort_key.
# Hint: Use Python's built-in sorted() function and join the characters to form a string.
# TODO: Return the sorted string (the first column).
Once we have created the sorted list of characters (i.e., the first column) from our BWT, we need an efficient way to locate the exact row where each character first appears in this column. Tracking these first occurrences is important for reconstructing the original string or aligning reads using LF-mapping because it tells us the offset in the first column for any character in the BWT. By knowing exactly where each character starts, we can map occurrences of that character in the BWT (the “last column”) back to the correct position in the first column.
In practical terms, if we encounter a certain character at position $i$ in the BWT, we will need to know where that character’s block begins in the first column to calculate how many times it has appeared up to that point. This “first occurrence” dictionary provides that starting index for every character.
Your task is to write a function that, given the first column string, computes a dictionary that maps every character in the first column to the index of its first occurrence. You will:
- Iterate over the characters in the first column.
- For each character, if it has not yet been added to the dictionary, store its current index.
- Return the completed dictionary.
This function is another critical step in our LF-mapping pipeline, as we will ultimately use the returned dictionary to navigate through the BWT’s rows when reconstructing or aligning reads.
Programming Concepts Needed
- You will use a Python dictionary to map each character to its first occurrence index.
- You will iterate through the first column once, which is an O(n) operation where n is the length of the string.
- A simple `if char not in dict` check ensures each character is recorded only once.
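A minimal sketch of the enumerate-and-record-once pattern, using the toy first column from the "band$" example:

```python
# Record each character's index only the first time it is seen.
first_col = "$abdn"
first_occurrence: dict[str, int] = {}
for i, ch in enumerate(first_col):
    if ch not in first_occurrence:
        first_occurrence[ch] = i
print(first_occurrence)  # {'$': 0, 'a': 1, 'b': 2, 'd': 3, 'n': 4}
```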
def compute_first_occurrence(first_col: str) -> dict[str, int]:
"""
Compute the position of the first occurrence of each character in the first column.
Args:
first_col: The first column (sorted characters of BWT).
Returns:
A dictionary mapping each character to its first occurrence index in `first_col`.
Examples:
>>> test_bwt = "dsseadennnbbndb b nn $  nnnhrt iaaaaaaaaaaa "
>>> first_column = create_first_column(test_bwt)
>>> compute_first_occurrence(first_column)
{
'$': 0,
' ': 1,
'a': 8,
'b': 20,
'd': 24,
'e': 27,
'h': 29,
'i': 30,
'n': 31,
'r': 40,
's': 41,
't': 43
}
Imagine your first column is generated from a BWT that includes the characters
`['$', ' ', 'a', 'a', 'b', ...]`. If `$` appears first, then `'$': 0` will be
added to the dictionary. Once the code sees a new character like `'a'` at
index 2, it records `'a': 2`. When it encounters another `'a'` later,
it ignores it because `'a'` is already in the dictionary.
"""
# TODO: Initialize an empty dictionary to store the first occurrence positions.
# TODO: Iterate over the characters in first_col with their index.
# Hint: Use the enumerate() function.
# TODO: For each character, check if it is not already present in the dictionary.
# TODO: If it is not present, assign the current index as its value.
# TODO: Return the dictionary with the first occurrence of each character.
After computing the first occurrence dictionary and obtaining the BWT string, the next critical step is to implement the backward search algorithm. This algorithm allows us to quickly identify all positions (i.e., rows) in the BWT where a given search pattern occurs. In a nutshell, backward search begins with the full range of the BWT and refines it by processing the pattern in reverse—from the last character to the first. For each character in the pattern, we update the range using two key pieces of information:
- First Occurrence tells us where the block of a particular character starts in the sorted first column.
- By counting how many times the character has appeared in the BWT up to certain positions, we adjust our search range accordingly by using the occurrence count.
By iteratively narrowing down the range (defined by the `top` and `bottom` indices), backward search eventually isolates a contiguous block in the BWT where the entire pattern is present. If at any point the range becomes invalid (i.e., `top` exceeds `bottom`), it indicates that the pattern does not exist within the BWT.
Your task is to write a function that performs this backward search. Given the BWT string, the search pattern, and the first occurrence dictionary, the function should:
- Initialize the search range to cover the entire BWT.
- Iterate through the pattern in reverse order.
- For each character, update the `top` and `bottom` indices by:
  - determining the starting index of the character in the first column using the dictionary, and
  - counting the occurrences of the character up to the current `top` and `bottom` indices.
- Return the list of row indices (from `top` to `bottom`) where the pattern is found.
- Return an empty list if the pattern is not present (i.e., when the search range becomes invalid).
This backward search function is a key component in various string-matching and bioinformatics applications, such as reconstructing the original string from its BWT or aligning sequencing reads efficiently.
Programming Concepts Needed
- Index Manipulation: Handling the `top` and `bottom` indices that define the current search range.
- Reverse Iteration: Traversing the pattern from end to start to progressively refine the range.
- Counting Occurrences: Using simple counting (or more advanced techniques for larger datasets) to determine the number of times a character appears up to a certain index in the BWT.
- Conditional Checks: Ensuring the search range remains valid throughout the process.
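Here is the whole range-narrowing idea run once on the toy "band$" example (all names are illustrative and carried over from the earlier sketches, not required by the assignment):

```python
# Backward search for "an" over the BWT of "band$".
toy_bwt = "db$na"
first_occ = {"$": 0, "a": 1, "b": 2, "d": 3, "n": 4}
pattern = "an"

top, bottom = 0, len(toy_bwt) - 1
for symbol in reversed(pattern):          # process the pattern right to left
    if symbol not in first_occ:
        top, bottom = 1, 0                # force an empty range: the symbol never occurs
        break
    # Count this symbol in the last column above each boundary of the current window.
    top = first_occ[symbol] + toy_bwt[:top].count(symbol)
    bottom = first_occ[symbol] + toy_bwt[:bottom + 1].count(symbol) - 1
    if top > bottom:
        break                             # range collapsed: no match

matches = list(range(top, bottom + 1)) if top <= bottom else []
print(matches)  # [1] -> row 1 of the sorted rotations, i.e. 'and$b'
```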
def backward_search(
bwt: str, pattern: str, first_occurrence: dict[str, int]
) -> list[int]:
"""
Find all rows in the BWT where 'pattern' occurs via the backward search algorithm.
This version returns a list of row indices [top, top+1, ..., bottom].
Args:
bwt: The Burrows-Wheeler Transform string.
pattern: The substring to search for.
first_occurrence: Dictionary mapping a character to its first occurrence index in
the sorted first column (e.g., {'$':0, 'a':1, 'b':4, ...}).
Returns:
A list of row indices in the BWT that match 'pattern'.
If no occurrences are found, this list is empty.
Examples:
>>> test_str = "bananas and bandanas are in the banana band"
>>> test_bwt = "dsseadennnbbndb b nn $  nnnhrt iaaaaaaaaaaa "
>>> search_str = "ban"
>>> first_column = create_first_column(test_bwt)
>>> first_occ = compute_first_occurrence(first_column)
>>> backward_search(test_bwt, search_str, first_occ)
[20, 21, 22, 23]
"""
# TODO: Initialize 'top' to 0 and 'bottom' to len(bwt) - 1.
# TODO: Loop over the pattern from the last character to the first character.
# Hint: Use a loop like: for i in range(len(pattern) - 1, -1, -1)
# TODO: Extract the current symbol from the pattern.
# TODO: Check if the current symbol is not in first_occurrence.
# If it's not, return an empty list (no matches found).
# TODO: Count the number of occurrences of the symbol in bwt from the start up to 'top'.
# Hint: Use bwt[:top].count(symbol)
# TODO: Count the number of occurrences of the symbol in bwt from the start up to 'bottom + 1'.
# Hint: Use bwt[:bottom + 1].count(symbol)
# TODO: Update 'top' to be first_occurrence[symbol] plus the count from bwt[:top].
# TODO: Update 'bottom' to be first_occurrence[symbol] plus the count from bwt[:bottom + 1] minus 1.
# TODO: If the updated 'top' is greater than 'bottom', return an empty list as no match exists.
# TODO: After processing all characters of the pattern, return a list of indices from 'top' to 'bottom' (inclusive).
# Hint: Use range(top, bottom + 1) to generate the list of row indices.
LF-mapping is the operation that links each character in the BWT “last column” to its corresponding position in the “first column.” By pairing a character with its rank —that is, which occurrence of that character it is in the BWT from left to right—you can pinpoint the exact row in the full BWT matrix where that character belongs. This process is essential for reconstructing the original string from the BWT or for aligning reads quickly against a reference in advanced alignment algorithms.
When traversing the BWT, you will repeatedly ask, "Given this character and how many times we have seen it so far (its rank), which row and character in the first column does this correspond to?" Specifically, the LF-mapping you build here records, for each `(character, rank)` pair in the BWT, the row index at which that pair appears. Later, you will combine this with your `first_occurrence` information to jump back and forth between the last and first columns (an essential step in full reconstruction or read alignment).
Programming Concepts Needed
- You will loop through each position `i` in the BWT (e.g., using `for i, char in enumerate(bwt):`).
- A dictionary called `lf_map` will map `(char, rank)` pairs to row indices. This requires you to use tuples `(char, rank)` as keys.
- You will maintain a separate dictionary (`char_count`) that tracks how many times each character has appeared so far. This uses basic dictionary get-or-initialize logic.
- Storing `(char, rank)` as a key demonstrates how Python supports composite keys in dictionaries.
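The same bookkeeping on the toy BWT of "band$" (every character occurs only once there, so all ranks are 0, but the pattern generalizes):

```python
# Map each (character, rank) pair in the BWT to the row where it occurs.
toy_bwt = "db$na"
char_count: dict[str, int] = {}          # occurrences of each character seen so far
lf_map: dict[tuple[str, int], int] = {}
for row, ch in enumerate(toy_bwt):
    rank = char_count.get(ch, 0)         # this occurrence's rank within the BWT
    lf_map[(ch, rank)] = row
    char_count[ch] = rank + 1
print(lf_map)
# {('d', 0): 0, ('b', 0): 1, ('$', 0): 2, ('n', 0): 3, ('a', 0): 4}
```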
def compute_lf_mapping(bwt: str) -> dict[tuple[str, int], int]:
"""
Compute the LF-mapping, which maps (character, rank) in BWT
to its position in the first column.
Args:
bwt: The Burrows-Wheeler Transform string.
Returns:
A dictionary where the key is (character, rank_in_bwt) and
the value is the row index in the BWT (last column) at which that
(character, rank_in_bwt) pair appears.
Examples:
>>> test_bwt = "dsseadennnbbndb b nn $  nnnhrt iaaaaaaaaaaa "
>>> compute_lf_mapping(test_bwt)
{
('d', 0): 0,
('s', 0): 1,
('s', 1): 2,
('e', 0): 3,
...
('a', 10): 41,
('a', 11): 42,
(' ', 6): 43
}
"""
# TODO: Initialize an empty dictionary to keep track of the count of each character.
# TODO: Initialize an empty dictionary for the LF-mapping.
# TODO: Loop over the indices and characters of the bwt string using enumerate().
# TODO: If the character hasn't been seen before, initialize its count to 0.
# TODO: Create a key as a tuple (character, current count) representing its rank.
# TODO: Map this key to the current index in the LF-mapping dictionary.
# TODO: Increment the count for this character.
# TODO: Return the completed LF-mapping dictionary.
After running
backward_search
to find the matching row indices in the BWT, you often want to get each match’s starting position in the
original
text (not just the row in the BWT).
To do this, you can walk backward from a given row to the sentinel character (e.g.,
'$'
) in the BWT, counting how many steps it takes.
That step count is effectively the
suffix array index
—the starting position of that matching suffix in the original text.
The function
get_suffix_indices_for_matches
accomplishes this by building an inverse of your
lf_map
so you can quickly go from “row index” back to “(char, rank).”
Each step in this backward walk lets you track one character earlier in the original string, until you finally reach the sentinel.
Programming Concepts Needed
-
You must invert the LF map (which goes
(char, rank) -> row_index
) into a structure that answersrow_index -> (char, rank)
. -
A
while
loop continues walking backward until it encounters the sentinel ('$'
), at which point it stops. -
Like LF mapping, the inverse uses tuples
(char, rank)
as dictionary keys, but here you also store them as dictionary values to allow the reverse lookup. -
You maintain a counter (
steps
) to count how many backward steps you take, which equates to the position offset in the original string. -
The function processes each matching row from
backward_search
in turn, generating a final list of suffix positions in the original text.
def get_suffix_indices_for_matches(
bwt: str,
matched_rows: list[int],
first_occurrence: dict[str, int],
lf_map: dict[tuple[str, int], int],
) -> list[int]:
"""
Given a list of rows in the BWT that match a pattern, determine each matching
suffix's starting position in the original text by walking backward to the sentinel.
Args:
bwt: The Burrows-Wheeler Transform string of the original text,
which must contain a unique sentinel (e.g., '$').
matched_rows: A list of row indices in the BWT that match a pattern
(e.g., from the backward_search function).
first_occurrence: Dictionary mapping each character to its first occurrence
index in the first column.
lf_map: Dictionary mapping (char, rank_in_bwt) to the BWT row index at which that pair appears.
For example, {('a', 0): 5, ('a', 1): 6, ...}.
Returns:
A list of integers representing the starting positions of each matching suffix
in the original text. The i-th element of the list corresponds to the
i-th element of matched_rows.
Examples:
>>> test_str = "bananas and bandanas are in the banana band"
>>> test_bwt = "dsseadennnbbndb b nn $  nnnhrt iaaaaaaaaaaa "
>>> search_str = "ban"
>>> first_column = create_first_column(test_bwt)
>>> first_occ = compute_first_occurrence(first_column)
>>> row_matches = backward_search(test_bwt, search_str, first_occ)
>>> lf_map = compute_lf_mapping(test_bwt)
>>> get_suffix_indices_for_matches(test_bwt, row_matches, first_occ, lf_map)
[32, 0, 39, 12]
Notes:
This function constructs the 'inverse LF-map' so that we can go from
row -> (char, rank). Once we have (char, rank), we use:
```python
new_row = first_occurrence[char] + rank
```
to jump back one character in the original rotation.
The total number of such backward steps to reach '$' is exactly the
suffix's starting position in the original text (assuming the sentinel marks
the beginning of the text).
"""
# TODO: Build the inverse mapping of lf_map so that you can go from row index -> (char, rank).
# TODO: Initialize an empty list to store the starting positions of each suffix.
# TODO: Iterate over each row in matched_rows.
# TODO: Initialize a counter for the number of backward steps.
# TODO: While the character in bwt at the current row is not the sentinel '$':
# TODO: Retrieve the (char, rank) corresponding to the current row using the inverse mapping.
# TODO: Update row by computing the new row: first_occurrence[char] + rank.
# TODO: Increment the steps counter.
# TODO: After reaching the sentinel, append the number of steps to the list of positions.
# TODO: Return the list of starting positions.
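To make the backward walk described above concrete, here is a hedged sketch on the toy "band$" example (toy names only; row 1 is the row that matched the pattern "an" in the earlier backward-search sketch):

```python
# Walk backward from a matched row to '$'; the number of LF steps taken is the
# suffix's starting position in the original text "band$".
toy_bwt = "db$na"
first_occ = {"$": 0, "a": 1, "b": 2, "d": 3, "n": 4}
lf_map = {("d", 0): 0, ("b", 0): 1, ("$", 0): 2, ("n", 0): 3, ("a", 0): 4}

inverse_lf = {row: pair for pair, row in lf_map.items()}  # row -> (char, rank)

row, steps = 1, 0                  # start from the matched row
while toy_bwt[row] != "$":
    ch, rank = inverse_lf[row]
    row = first_occ[ch] + rank     # LF step: jump to this occurrence's row in the first column
    steps += 1
print(steps)  # 1 -> the match "an" starts at index 1 of "band$"
```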