Project 01
Assignment C

Genome assembly

Always ensure you are working within the same "BIOSC 1540 Project" that you used for P01B. If you encounter any issues with Galaxy, check their support and training.

Please set the STUDENT_ID variable in the cell below to your student ID number.

In [ ]:

         
STUDENT_ID = 1111111

Q01 ¶

First, we are going to download the reference genome for Staphylococcus aureus MRSA252.

In the Galaxy interface, search for "NCBI Datasets Genomes" in the tools panel.
Click on "NCBI Datasets Genomes download genome sequence, annotation and metadata".
In the tool interface, set the following parameters:
- "Enter comma separated list of accessions": enter GCF_000011505.1
Click "Run Tool".

A ¶

In the assembly information page linked above (or click here), please fill out the following three statistics.

In [ ]:

         
GENOME_LENGTH = 0
N_GENES = 0
N_CDS_WITH_PROTEIN = 0

B ¶

Using the PGAP documentation, what does "CDSs (with protein)" mean?

Put your answer here.

Q02 ¶

We will use SPAdes, a versatile genome assembler designed for both small genomes and single-cell projects, to assemble our genome. Follow the instructions below for both your cleaned parental and evolved FASTQ files (after fastp).

In the Galaxy interface, search for "SPAdes" in the tools panel.
Click on "SPAdes genome assembler for genomes of regular and single-cell projects".
In the tool interface, set the following parameters:
- "Operation mode": Select "Assembly and error correction"
- "Single-end or paired-end short-reads": Choose "Paired-end: list of dataset pairs"
- "FASTA/FASTQ file(s)": Select your paired-end output from the fastp calculation (from P01B).
- Under "Pipeline options", make sure "Isolate: highly recommended for high-coverage isolate and multi-cell data (--isolate)" is checked.
Click "Run Tool".

Once this finishes, which could take anywhere from minutes to hours, we will use QUAST to quantify the quality of our assembly. Follow the instructions below for both your parental and evolved SPAdes assemblies.

In the Galaxy interface, search for "Quast" in the tools panel.
Click on "Quast Genome assembly Quality".
In the tool interface, set the following options:
- "Assembly mode?" set to "Individual assembly (1 contig file per sample)"
- "Contigs/scaffolds file" select your SPAdes Scaffolds.
- "Use a reference genome?" should be "Yes"
- For "Reference genome", click the "Dataset icon" (looks like a folder) and select "NCBI Genome Dataset: genome fasta".
Click "Run Tool".

A ¶

Please fill out the following variables from your QUAST HTML report.

In [1]:

         
PARENTAL_GENOME_FRACTION = 0.0
PARENTAL_DUPLICATION_RATIO = 0.0
PARENTAL_TOTAL_ALIGNED_LENGTH = 0.0
PARENTAL_NUM_MISASSEMBLIES = 0.0

PARENTAL_N_CONTIGS = 0.0
PARENTAL_N50 = 0.0
PARENTAL_L50 = 0.0
PARENTAL_GC_PERCENT = 0.0

In [ ]:

         
EVOLVED_GENOME_FRACTION = 0.0
EVOLVED_DUPLICATION_RATIO = 0.0
EVOLVED_TOTAL_ALIGNED_LENGTH = 0.0
EVOLVED_NUM_MISASSEMBLIES = 0.0

EVOLVED_N_CONTIGS = 0.0
EVOLVED_N50 = 0.0
EVOLVED_L50 = 0.0
EVOLVED_GC_PERCENT = 0.0

B ¶

Using the assembly statistics provided for the parental and evolved genomes, determine which assembly (if any) is more reliable. Your answer should compare quantifiable metrics such as N50, L50, total assembly length, number of contigs, and any other relevant quality indicators. Be sure to explain your reasoning and how these metrics support your conclusion.
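As a reminder of what two of these metrics measure, here is a minimal sketch of how N50 and L50 can be computed from a list of contig lengths. The function name n50_l50 and the example lengths are hypothetical and chosen only for illustration. QUAST reports these values for you, so this is not part of the required workflow.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths.

    N50 is the length of the contig at which the running total, taken
    from longest to shortest, first reaches half of the assembly length.
    L50 is the number of contigs needed to reach that point.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("contig_lengths must not be empty")


# Hypothetical contig lengths (bp), not taken from any real assembly.
example = [100, 80, 50, 30, 20, 10]
n50, l50 = n50_l50(example)
print(n50, l50)  # 80 2
```

A larger N50 together with a smaller L50 generally points to a more contiguous assembly, but neither metric says anything about correctness on its own.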

(Provide your detailed response here, comparing the metrics for the two assemblies and justifying which is more reliable based on the data.)

C ¶

Compare the genome fraction between the reference genome and the parental assembly. Use this comparison to interpret the quality and completeness of your parental assembly. Discuss what the genome fraction indicates about how well the assembly represents the reference genome, including potential gaps, missing regions, or errors.
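To make the idea of genome fraction concrete, the sketch below computes the percentage of reference bases covered by at least one alignment. The interval representation (0-based start, exclusive end), the function name genome_fraction, and the numbers are assumptions made for illustration. QUAST derives its value from its own alignments, so treat this as a conceptual model rather than its implementation.

```python
def genome_fraction(aligned_intervals, reference_length):
    """Percentage of reference bases covered by at least one alignment.

    aligned_intervals: list of (start, end) pairs on the reference,
    0-based with an exclusive end (an assumption of this sketch).
    """
    covered = 0
    current_start, current_end = None, None
    for start, end in sorted(aligned_intervals):
        if current_end is None or start > current_end:
            # Disjoint from the running interval: bank it, start a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or touching: extend the running interval.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return 100.0 * covered / reference_length


# Hypothetical alignments on a 1,000 bp reference.
intervals = [(0, 400), (350, 600), (700, 950)]
print(genome_fraction(intervals, 1000))  # 85.0
```

In this toy case the uncovered stretches (600 to 700 and 950 to 1000) are the kind of gaps or missing regions the question asks you to discuss.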

(Provide your detailed response here, referencing the genome fraction comparison and explaining what it reveals about the assembly quality and its alignment with the reference genome.)

Q03 ¶

Using the reads provided in the FASTA file, perform a manual greedy assembly.

>read_1
TGTACGTA
>read_2
TACATTAA
>read_3
TAAGCGAG
>read_4
GCCACTAG
>read_5
AGCGTT
>read_6
CATTAAGC
>read_7
ATTAAGCG
>read_8
ACGTACAT
>read_9
GTACGTAC
>read_10
ACGTACAT

Instructions for submission:

Write your step-by-step work for the greedy assembly directly in the notebook below this prompt.
Clearly show the overlaps you identify at each step, the reads you merge, and the resulting sequences after each merge.
Provide the final assembled sequence(s) at the end of your explanation.

Example Format for Answer:

Step 1:
- Overlap between read_1 and read_9: TGTACGTA overlaps with GTACGTAC (overlap = 7 bases).
- Merge: TGTACGTAC.
- Remaining reads: [merged_read, read_2, read_3, ...].
Step 2:
- Overlap between merged_read and read_10: TGTACGTAC overlaps with ACGTACAT (overlap = 6 bases).
- Merge: TGTACGTACAT.
- Remaining reads: [merged_read, read_2, read_3, ...].

(Continue this process for all steps.)

Final Assembly:
- Assembled sequence(s): TGTACGTACAT... (or list of contigs if not fully assembled).
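If you want to check an overlap you found by hand, a suffix-prefix comparison can be sketched as follows. The helper names overlap and merge are hypothetical, and the sketch handles only one pair of reads at a time, so the greedy choice of which pair to merge at each step is still yours to work out.

```python
def overlap(left, right, min_length=1):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_possible = min(len(left), len(right))
    for size in range(max_possible, min_length - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge(left, right):
    """Join two reads, writing their shared overlap only once."""
    return left + right[overlap(left, right):]


read_1 = "TGTACGTA"
read_9 = "GTACGTAC"
print(overlap(read_1, read_9))  # 7
print(merge(read_1, read_9))    # TGTACGTAC
```

Note that overlap is directional: overlap(read_1, read_9) and overlap(read_9, read_1) can differ, so both orders need to be considered.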

Put your steps here.

B ¶

Using the reads provided in the above FASTA file, construct a De Bruijn graph to model the relationships between overlapping sequences. A De Bruijn graph is a powerful tool in genome assembly, where $(k-1)$-mers form the nodes, and $k$-mers define the directed edges connecting the nodes.

Mermaid.js uses a simple, markdown-like syntax within code blocks to create diagrams, making it easy to integrate into documentation. To construct a flowchart, such as a De Bruijn graph, begin by specifying the diagram type with flowchart followed by a direction keyword such as LR (left-to-right) or TB (top-to-bottom). Nodes are defined with unique identifiers and can be drawn as circles using double parentheses, e.g., ATC((ATC)). Directed edges are written with arrows (-->), such as ATC --> TCG, which indicates flow from ATC to TCG. Connecting nodes this way lets you map the $(k-1)$-mer nodes and $k$-mer edges of a De Bruijn graph. Mermaid.js also supports comments beginning with %% and custom styles to improve the readability of your graph.

Here’s a brief example of Mermaid.js syntax for a De Bruijn graph:

flowchart LR
    %% Define Nodes
    ATC((ATC))
    TCG((TCG))
    CGT((CGT))
    
    %% Define Edges
    ATC --> TCG
    TCG --> CGT
    CGT --> ATC

In this example, flowchart LR sets the layout direction, nodes like ATC, TCG, and CGT are created with circular styling, and the arrows (-->) establish the directed connections between them.
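The node and edge definitions can also be generated rather than typed by hand. The following is a minimal sketch, assuming each $k$-mer contributes one edge from its first $k-1$ bases to its last $k-1$ bases. The function names de_bruijn_edges and to_mermaid, the toy read, and the value of k are hypothetical and are not part of the assignment.

```python
def de_bruijn_edges(reads, k):
    """Return one (prefix, suffix) edge per k-mer occurrence in the reads."""
    edges = []
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # The edge runs from the first k-1 bases to the last k-1 bases.
            edges.append((kmer[:-1], kmer[1:]))
    return edges


def to_mermaid(edges):
    """Render distinct edges as Mermaid flowchart text."""
    nodes = sorted({node for edge in edges for node in edge})
    lines = ["flowchart LR"]
    lines += [f"    {node}(({node}))" for node in nodes]
    lines += [f"    {src} --> {dst}" for src, dst in sorted(set(edges))]
    return "\n".join(lines)


# Hypothetical toy read and k, chosen only for illustration.
print(to_mermaid(de_bruijn_edges(["ATCGT"], k=4)))
```

This sketch collapses repeated edges into a single arrow. If edge multiplicity matters for your graph, keep the duplicates in the list instead of passing it through set.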

Q04 ¶

K-mers are substrings of length $k$ extracted from DNA sequences. Consecutive k-mers are offset by one base, so they overlap by $k-1$ bases. For example, the 4-mers in the sequence "ATCGTAC" are "ATCG", "TCGT", "CGTA", and "GTAC". By calculating the frequency of each k-mer across all the reads, we can identify repetitive patterns and unique sequences.
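The sliding window behind this example can be sketched in a few lines. The helper name kmers is hypothetical, and the sketch only lists the k-mers of a single sequence. A sequence of length $n$ yields $n - k + 1$ k-mers.

```python
def kmers(sequence, k):
    """List every k-mer in `sequence`, moving the window one base at a time."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


print(kmers("ATCGTAC", 4))  # ['ATCG', 'TCGT', 'CGTA', 'GTAC']
```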

A ¶

Your first task is to write a function, kmer_frequency, that will:

Take two arguments as input: a list of DNA sequences and a k-mer size $k$.
Return a single dictionary where the keys are unique k-mers and the values are the number of times each k-mer appears across all reads.

Imagine you’re working with the following DNA reads:
["ATCGTACGTA", "GCGTACGTAA", "TACGTAGCGA"]
and you’re analyzing 4-mers ($k=4$) for a genome estimated to be 50 bases long.

Your function should identify and count all 4-mers across the reads. For example, "ATCG" appears once, while "CGTA" appears five times. The resulting dictionary might look like this:

{"ATCG": 1, "TCGT": 1, "CGTA": 5, "GTAC": 2, "TACG": 3, "ACGT": 3, "GCGT": 1, "GTAA": 1, "GTAG": 1, "TAGC": 1, "AGCG": 1, "GCGA": 1}

In [ ]:

         
# TODO: Write kmer_frequency function here

B ¶

Once you’ve calculated the k-mer frequencies, you’ll need to find the k-mer that appears most frequently. This step is crucial for identifying repetitive sequences, which may represent highly abundant genes or genomic regions prone to sequencing errors. If multiple k-mers have the same highest frequency, choose the lexicographically smallest one (alphabetical order).

You’ll write a second function, most_frequent_kmer, that:

Takes the k-mer frequency dictionary as input.
Returns two outputs: the k-mer with the highest frequency and its count.

From the above dictionary, "CGTA" is the most frequent k-mer, appearing 5 times. If there were a tie, you would choose the lexicographically smallest k-mer.

In [ ]:

         
# TODO: Write most_frequent_kmer function here
