STUDENT_ID = 1111111
Q01
First, we are going to download the reference genome for Staphylococcus aureus MRSA252.
- In the Galaxy interface, search for "NCBI Datasets Genomes" in the tools panel.
- Click on "NCBI Datasets Genomes download genome sequence, annotation and metadata".
- In the tool interface, set the following parameters:
  - "Enter comma separated list of accessions": enter GCF_000011505.1
- Click "Run Tool".
A
In the assembly information page linked above (or click here), please fill out the following three statistics.
GENOME_LENGTH = 0
N_GENES = 0
N_CDS_WITH_PROTEIN = 0
Q02
We will use SPAdes, a versatile genome assembler designed for both small genomes and single-cell projects, to assemble our genome. Follow the instructions below for both your cleaned parental and evolved FASTQ files (after fastp).
- In the Galaxy interface, search for "SPAdes" in the tools panel.
- Click on "SPAdes genome assembler for genomes of regular and single-cell projects".
- In the tool interface, set the following parameters:
  - "Operation mode": Select "Assembly and error correction"
  - "Single-end or paired-end short-reads": Choose "Paired-end: list of dataset pairs"
  - "FASTA/FASTQ file(s)": Select your paired-end output from the fastp calculation (from P01B).
  - Under "Pipeline options", select "Isolate: highly recommended for high-coverage isolate and multi-cell data (--isolate)".
- Click "Run Tool".
Once this finishes, which could take from minutes to hours, we will use quast to quantify the quality of our assembly. Follow the instructions below for both your parental and evolved SPAdes assemblies.
- In the Galaxy interface, search for "Quast" in the tools panel.
- Click on "Quast Genome assembly Quality".
- In the tool interface, set the following options:
  - "Assembly mode?" set to "Individual assembly (1 contig file per sample)"
  - "Contigs/scaffolds file" select your SPAdes Scaffolds.
  - "Use a reference genome?" should be "Yes"
  - For "Reference genome", click the "Dataset icon" (looks like a folder) and select "NCBI Genome Dataset: genome fasta".
- Click "Run Tool".
A
Please fill out the following variables from your quast HTML report.
PARENTAL_GENOME_FRACTION = 0.0
PARENTAL_DUPLICATION_RATIO = 0.0
PARENTAL_TOTAL_ALIGNED_LENGTH = 0.0
PARENTAL_NUM_MISASSEMBLIES = 0.0
PARENTAL_N_CONTIGS = 0.0
PARENTAL_N50 = 0.0
PARENTAL_L50 = 0.0
PARENTAL_GC_PERCENT = 0.0
EVOLVED_GENOME_FRACTION = 0.0
EVOLVED_DUPLICATION_RATIO = 0.0
EVOLVED_TOTAL_ALIGNED_LENGTH = 0.0
EVOLVED_NUM_MISASSEMBLIES = 0.0
EVOLVED_N_CONTIGS = 0.0
EVOLVED_N50 = 0.0
EVOLVED_L50 = 0.0
EVOLVED_GC_PERCENT = 0.0
B
Using the assembly statistics provided for the parental and evolved genomes, determine which assembly (if any) is more reliable. Your answer should compare quantifiable metrics such as N50, L50, total assembly length, number of contigs, and any other relevant quality indicators. Be sure to explain your reasoning and how these metrics support your conclusion.
(Provide your detailed response here, comparing the metrics for the two assemblies and justifying which is more reliable based on the data.)
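As a reminder of what two of these metrics measure: N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length, and L50 is the number of contigs in that set. The sketch below computes both from a list of contig lengths. The function name `n50_l50` and the example lengths are hypothetical, chosen only for illustration; they are not taken from quast or from any real assembly.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths."""
    # Walk the contigs from longest to shortest.
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        # The first contig that pushes the running sum to half the total
        # defines N50 (its length) and L50 (how many contigs were needed).
        if running >= half_total:
            return length, count
    raise ValueError("contig_lengths must not be empty")


# Hypothetical contig lengths, for illustration only.
example_lengths = [100, 80, 50, 30, 20, 20]
print(n50_l50(example_lengths))  # (80, 2)
```

A larger N50 together with a smaller L50 generally indicates a more contiguous assembly, but neither number says anything about correctness, which is why misassemblies and genome fraction are reported alongside them.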
C
Compare the genome fraction between the reference genome and the parental assembly. Use this comparison to interpret the quality and completeness of your parental assembly. Discuss what the genome fraction indicates about how well the assembly represents the reference genome, including potential gaps, missing regions, or errors.
(Provide your detailed response here, referencing the genome fraction comparison and explaining what it reveals about the assembly quality and its alignment with the reference genome.)
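Genome fraction is the percentage of reference bases covered by at least one aligned contig. The sketch below illustrates that idea under simplifying assumptions: alignments are given as half-open, 0-based `(start, end)` intervals on a single reference sequence. The function name `genome_fraction` and the example intervals are hypothetical; quast derives its value from real alignments, not from this code.

```python
def genome_fraction(aligned_intervals, reference_length):
    """Percent of reference bases covered by at least one aligned interval.

    Intervals are half-open (start, end) pairs in 0-based reference coordinates.
    """
    covered = 0
    current_end = 0  # end of the region already counted
    for start, end in sorted(aligned_intervals):
        start = max(start, current_end)  # skip bases counted by earlier intervals
        if end > start:
            covered += end - start
            current_end = end
    return 100.0 * covered / reference_length


# Hypothetical alignments on a 1000-base reference: bases 600-799 are never covered.
print(genome_fraction([(0, 400), (300, 600), (800, 1000)], 1000))  # 80.0
```

Overlapping alignments are counted once here, so the result cannot exceed 100; redundancy in the assembly is what the duplication ratio captures instead.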
Q03
Using the reads provided in the FASTA file, perform a manual greedy assembly.
>read_1
TGTACGTA
>read_2
TACATTAA
>read_3
TAAGCGAG
>read_4
GCCACTAG
>read_5
AGCGTT
>read_6
CATTAAGC
>read_7
ATTAAGCG
>read_8
ACGTACAT
>read_9
GTACGTAC
>read_10
ACGTACAT
Instructions for submission:
- Write your step-by-step work for the greedy assembly directly in the notebook below this prompt.
- Clearly show the overlaps you identify at each step, the reads you merge, and the resulting sequences after each merge.
- Provide the final assembled sequence(s) at the end of your explanation.
Example Format for Answer:
- Step 1:
  - Overlap between `read_1` and `read_9`: `TGTACGTA` overlaps with `GTACGTAC` (overlap = 7 bases).
  - Merge: `TGTACGTAC`.
  - Remaining reads: `[merged_read, read_2, read_3, ...]`.
- Step 2:
  - Overlap between `merged_read` and `read_10`: `ACGTACAT` overlaps with `TACAT...` (overlap = 5 bases).
  - Merge: `TGTACGTACAT`.
  - Remaining reads: `[merged_read, read_2, read_3, ...]`.

(Continue this process for all steps.)

- Final Assembly:
  - Assembled sequence(s): `TGTACGTACAT...` (or list of contigs if not fully assembled).
Put your steps here.
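If you want to check the overlaps you find by hand, a small helper can compute the longest suffix-prefix match between two sequences. This is a minimal sketch, not a full greedy assembler: the names `overlap_length` and `merge` are hypothetical, and it only handles a single ordered pair of sequences on the same strand.

```python
def overlap_length(left, right, min_length=1):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    # Try the longest possible overlap first and shrink until one matches.
    for length in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def merge(left, right):
    """Join `right` onto `left`, writing the overlapping bases only once."""
    return left + right[overlap_length(left, right):]


# read_1 and read_9 from the FASTA above.
print(overlap_length("TGTACGTA", "GTACGTAC"))  # 7
print(merge("TGTACGTA", "GTACGTAC"))           # TGTACGTAC
```

A greedy assembly repeats this for every ordered pair of remaining sequences, merges the pair with the largest overlap, and stops when no overlap above the chosen minimum remains.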
B
Using the reads provided in the above FASTA file, construct a De Bruijn graph to model the relationships between overlapping sequences. A De Bruijn graph is a powerful tool in genome assembly, where $(k-1)$-mers form the nodes, and $k$-mers define the directed edges connecting the nodes.
`Mermaid.js` uses a simple, markdown-like syntax within code blocks to create diagrams, making it accessible and easy to integrate into documentation.
For constructing flowcharts, such as De Bruijn graphs, you begin by specifying the diagram type with `flowchart` followed by a directional keyword like `LR` (left-to-right) or `TB` (top-to-bottom).
Nodes are defined using unique identifiers and can be styled with double parentheses for circular nodes, e.g., `ATC((ATC))`.
Directed edges between nodes are represented with arrows (`-->`), indicating the flow from one node to another, such as `ATC --> TCG`.
This syntax allows you to visually map relationships by connecting nodes based on their interactions, which is particularly useful for illustrating the overlapping $(k-1)$-mers and $k$-mers in a De Bruijn graph.
Additionally, `Mermaid.js` supports comments using `%%` for clarity and can be customized with styles to enhance the readability and aesthetics of your graph.
Here’s a brief example of Mermaid.js syntax for a De Bruijn graph:

    flowchart LR
        %% Define Nodes
        ATC((ATC))
        TCG((TCG))
        CGT((CGT))
        %% Define Edges
        ATC --> TCG
        TCG --> CGT
        CGT --> ATC

In this example, `flowchart LR` sets the layout direction, nodes like `ATC`, `TCG`, and `CGT` are created with circular styling, and the arrows (`-->`) establish the directed connections between them.
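To connect the graph definition to the Mermaid syntax, the sketch below derives edges from reads and prints them as a flowchart. Each $k$-mer contributes one edge from its $(k-1)$-mer prefix to its $(k-1)$-mer suffix. The function names `de_bruijn_edges` and `to_mermaid` and the input read `"ATCGT"` are hypothetical, and repeated $k$-mers are collapsed into a single edge here, which is a simplification: a multigraph version would keep one edge per occurrence.

```python
def de_bruijn_edges(reads, k):
    """Return sorted, de-duplicated (prefix, suffix) edges, one per distinct k-mer."""
    edges = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Nodes are the (k-1)-mer prefix and suffix of each k-mer.
            edges.add((kmer[:-1], kmer[1:]))
    return sorted(edges)


def to_mermaid(edges, direction="LR"):
    """Render the edges as Mermaid flowchart text with circular nodes."""
    nodes = sorted({node for edge in edges for node in edge})
    lines = [f"flowchart {direction}", "    %% Define Nodes"]
    lines += [f"    {node}(({node}))" for node in nodes]
    lines.append("    %% Define Edges")
    lines += [f"    {src} --> {dst}" for src, dst in edges]
    return "\n".join(lines)


# Hypothetical single read; with k=4 it yields the 4-mers ATCG and TCGT.
print(to_mermaid(de_bruijn_edges(["ATCGT"], k=4)))
```

The printed text can be pasted into a Mermaid code block to render the graph.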
Q04
K-mers are substrings of length $k$ extracted from DNA sequences. Consecutive k-mers are offset by one base, so they overlap by $k-1$ bases.
For example, the 4-mers in the sequence "ATCGTAC" are "ATCG", "TCGT", "CGTA", and "GTAC".
By calculating the frequency of each k-mer across all the reads, we can identify repetitive patterns and unique sequences.
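The sliding-window idea can be sketched in a few lines. A sequence of length $L$ contains $L - k + 1$ k-mers, one starting at each position that leaves room for $k$ bases. The helper name `kmers` is hypothetical and is shown only to illustrate the extraction step.

```python
def kmers(sequence, k):
    """List every k-mer in `sequence`, in order of starting position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


four_mers = kmers("ATCGTAC", 4)
print(four_mers)  # ['ATCG', 'TCGT', 'CGTA', 'GTAC']

# Each k-mer shares its last k-1 bases with the first k-1 bases of the next one.
print(all(a[1:] == b[:-1] for a, b in zip(four_mers, four_mers[1:])))  # True
```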
A
Your first task is to write a function, `kmer_frequency`, that will:
- Take two arguments as input: a list of DNA sequences and a k-mer size $k$.
- Return a single dictionary where the keys are unique k-mers and the values are the number of times each k-mer appears across all reads.
Imagine you’re working with the following DNA reads:
["ATCGTACGTA", "GCGTACGTAA", "TACGTAGCGA"]
and you’re analyzing 4-mers ($k=4$) for a genome estimated to be 50 bases long.
Your function should identify and count all 4-mers across the reads. For example, "ATCG" appears once, while "CGTA" appears five times. The resulting dictionary might look like this:
{"ATCG": 1, "TCGT": 1, "CGTA": 5, "GTAC": 2, "TACG": 3, "ACGT": 3, "GCGT": 1, "GTAA": 1, "GTAG": 1, "TAGC": 1, "AGCG": 1, "GCGA": 1}
# TODO: Write kmer_frequency function here
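One possible way to approach this, shown as a sketch rather than the expected solution, is to slide a window of width $k$ over each read and tally the substrings with `collections.Counter`. The parameter names `reads` and `k` are assumptions, and the sketch assumes reads shorter than $k$ simply contribute no k-mers.

```python
from collections import Counter


def kmer_frequency(reads, k):
    """Count every k-mer across all reads and return a plain dictionary."""
    counts = Counter()
    for read in reads:
        # A read of length L has L - k + 1 windows; shorter reads yield none.
        counts.update(read[i:i + k] for i in range(len(read) - k + 1))
    return dict(counts)


reads = ["ATCGTACGTA", "GCGTACGTAA", "TACGTAGCGA"]
frequencies = kmer_frequency(reads, 4)
print(frequencies["ATCG"], frequencies["CGTA"])  # 1 5
```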
B
Once you’ve calculated the k-mer frequencies, you’ll need to find the k-mer that appears most frequently. This step is crucial for identifying repetitive sequences, which may represent highly abundant genes or genomic regions prone to sequencing errors. If multiple k-mers have the same highest frequency, choose the lexicographically smallest one (alphabetical order).
You’ll write a second function, `most_frequent_kmer`, that:
- Takes the k-mer frequency dictionary as input.
- Returns two outputs: the k-mer with the highest frequency and its count.
From the above dictionary, "CGTA" is the most frequent k-mer, appearing 5 times.
If there were a tie, you would choose the lexicographically smallest k-mer.
# TODO: Write most_frequent_kmer function here
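The tie-breaking rule can be expressed as a single sort key: highest count first, then alphabetical order. The sketch below is one possible formulation, not the required one. Raising an error on an empty dictionary is an assumption, since the prompt does not say what should happen in that case, and the small dictionaries are hypothetical inputs.

```python
def most_frequent_kmer(frequencies):
    """Return (kmer, count) for the most frequent k-mer, breaking ties alphabetically."""
    if not frequencies:
        raise ValueError("frequencies must not be empty")
    # Negating the count makes min() prefer the highest count; the k-mer itself
    # is the second key, so ties resolve to the lexicographically smallest.
    kmer = min(frequencies, key=lambda km: (-frequencies[km], km))
    return kmer, frequencies[kmer]


print(most_frequent_kmer({"CGTA": 5, "TACG": 3, "ACGT": 3}))  # ('CGTA', 5)
print(most_frequent_kmer({"TACG": 3, "ACGT": 3, "GTAC": 2}))  # ('ACGT', 3)
```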