Project 01
Assignment B

DNA sequencing

Before beginning, you need to set up an account on Galaxy and create a project for this assignment. Galaxy is a web-based platform for data-intensive biomedical research that will allow you to perform all the necessary analyses for this project.

Go to https://usegalaxy.org/
Click on the "Login or Register" button at the top.
If you don't have an account, click on "Register here" and fill out the registration form. Make sure to use a valid email address as you'll need to verify it.
Once you've registered and logged in, click "Create new history+", name your new project, and provide a brief description if you wish. Click "Create".

Always ensure you're working within your project for this assignment. This will help you keep your work organized and easily accessible.

Galaxy will also save your work automatically, but it's a good practice to regularly check that your analyses are being saved correctly.

If you encounter any issues with Galaxy, check their support and training resources.

Q01

In this problem, you'll import the assigned sequencing data for your parent and evolved isolates into Galaxy.

Set the STUDENT_ID variable in the cell below to your student ID number.

In [1]:

STUDENT_ID = 1111111

Follow these instructions carefully:

Your instructor has assigned you two SRA accession numbers, one for a parent isolate and one for an evolved isolate. You can find this list on Canvas.
In the Galaxy interface, click on "Tools" in the left sidebar.
In the search bar at the top of the Tools panel, type "Download and Extract Reads in FASTQ format from NCBI SRA" and click on it.
In the tool interface:
- For "select input type", choose "SRR accession".
- In the "Accession" field, enter your assigned SRA accession number.
- Leave all other settings as default.
Click "Run Tool". Repeat these steps for your second accession.

Copy and paste the first five FASTQ entries from your parent forward reads in the multiline string below.

In [2]:

PARENT_FASTQ_FIVE = """

"""

Copy and paste the first five FASTQ entries from your evolved forward reads in the multiline string below.

In [3]:

EVOLVED_FASTQ_FIVE = """

"""
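For reference, a FASTQ entry spans four lines: an @-prefixed header, the sequence, a + separator, and the quality string. The sketch below shows one way to pull the first few entries out of FASTQ text. The helper name first_fastq_entries and the two sample reads are made up for illustration, and the code assumes every record occupies exactly four lines with no line wrapping.

```python
def first_fastq_entries(fastq_text: str, n: int = 5) -> list[list[str]]:
    """Return the first n four-line FASTQ records found in a block of text."""
    # Drop blank lines so leading/trailing newlines in a pasted string are ignored.
    lines = [line for line in fastq_text.splitlines() if line.strip()]
    records = []
    # Step through complete four-line groups; a trailing partial record is skipped.
    for start in range(0, len(lines) - 3, 4):
        header, sequence, separator, quality = lines[start:start + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed record at non-blank line {start + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in {header}")
        records.append([header, sequence, separator, quality])
        if len(records) == n:
            break
    return records


# Made-up reads, not real assignment data.
example_fastq = """
@read1
ACGT
+
IIII
@read2
GGCA
+
FFFF
"""

entries = first_fastq_entries(example_fastq, n=5)
print(len(entries))  # 2, because the example only holds two records
```

Counting lines in groups of four, rather than searching for "@", matters because a quality string may itself begin with "@".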

Q02

Quality control is a crucial step in any sequencing data analysis. It helps you identify any issues with your sequencing data that might affect downstream analyses. We'll use FastQC, a widely used tool for quality control of high throughput sequencing data.

In the Galaxy interface, search for "FastQC" in the tools panel.
In the tool interface:
- For "Raw read data from your current history", select "Dataset Collection" and choose "Paired-end data (fastq-dump)".
- Leave all other settings as default.
Click "Run Tool" near the bottom.

Do this for both your parent and evolved isolates. Once the job is complete, you'll see new items in your history for each FastQC report (Webpage and RawData).

Click on "FastQC on collection: RawData". The report contains several modules, each assessing a different aspect of your sequence data. Please fill in the variables below.

PARENT_FORWARD_PER_BASE_SEQ_QUALITY_1: Under the line ">>Per base sequence quality pass" there is a table. Enter the value in the "Mean" column for the row where "#Base" is 1.
PARENT_FORWARD_PER_BASE_SEQ_QUALITY_LAST: In the same table, enter the value in the "Mean" column for the last row.
PARENT_FORWARD_PER_SEQ_GC_CONTENT_34: Under the line ">>Per sequence GC content pass", go to the row where "#GC Content" is 34 and enter the value under "Count".
PARENT_FORWARD_DUPLICATION_LEVEL_1: Under the line ">>Sequence Duplication Levels pass", go to the row where "#Duplication Level" is 1 and enter the value under "Percentage of total".
PARENT_FORWARD_ADAPTER_CONTENT_UNIVERSAL_ADAPTER_LAST: Under the line ">>Adapter Content pass", go to the last row and enter the value in the second column ("Illumina Universal Adapter").

In [4]:

PARENT_FORWARD_PER_BASE_SEQ_QUALITY_1 = 0.0
PARENT_FORWARD_PER_BASE_SEQ_QUALITY_LAST = 0.0
PARENT_FORWARD_PER_SEQ_GC_CONTENT_34 = 0.0
PARENT_FORWARD_DUPLICATION_LEVEL_1 = 0.0
PARENT_FORWARD_ADAPTER_CONTENT_UNIVERSAL_ADAPTER_LAST = 0.0

In [5]:

EVOLVED_FORWARD_PER_BASE_SEQ_QUALITY_1 = 0.0
EVOLVED_FORWARD_PER_BASE_SEQ_QUALITY_LAST = 0.0
EVOLVED_FORWARD_PER_SEQ_GC_CONTENT_34 = 0.0
EVOLVED_FORWARD_DUPLICATION_LEVEL_1 = 0.0
EVOLVED_FORWARD_ADAPTER_CONTENT_UNIVERSAL_ADAPTER_LAST = 0.0
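The FastQC RawData output is tab-separated text in which each module starts with a ">>" line, lists its column names on a "#" line, and ends with ">>END_MODULE". If you would rather read the values programmatically than by eye, a parser might look like the sketch below. The function name parse_fastqc_modules and the sample numbers are illustrative assumptions, not values from any real report.

```python
def parse_fastqc_modules(raw_text: str) -> dict[str, list[dict[str, str]]]:
    """Parse FastQC raw data text into {module name: list of row dicts}."""
    modules: dict[str, list[dict[str, str]]] = {}
    name = None
    header = None
    for line in raw_text.splitlines():
        if line.startswith(">>END_MODULE"):
            name, header = None, None
        elif line.startswith(">>"):
            # Module lines look like ">>Per base sequence quality<TAB>pass".
            name = line[2:].split("\t")[0]
            modules[name] = []
        elif name is None:
            continue  # file-level lines outside any module, e.g. "##FastQC ..."
        elif line.startswith("#"):
            # The last '#' line before the data rows holds the column names.
            header = line.split("\t")
        elif header is not None and line.strip():
            modules[name].append(dict(zip(header, line.split("\t"))))
    return modules


# Made-up excerpt in the same layout as a RawData file.
sample_report = (
    "##FastQC\tx.y.z\n"
    ">>Per base sequence quality\tpass\n"
    "#Base\tMean\tMedian\n"
    "1\t32.1\t33.0\n"
    "2\t32.5\t33.0\n"
    ">>END_MODULE\n"
)

rows = parse_fastqc_modules(sample_report)["Per base sequence quality"]
first_mean = float(rows[0]["Mean"])   # Mean for #Base 1
last_mean = float(rows[-1]["Mean"])   # Mean for the last row
```

Column names keep their leading "#" (for example "#Base") so they match the labels used in the instructions above.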

Q03

After assessing the quality of your raw sequencing data, the next step is to trim adapters and perform quality control. We'll use fastp, a fast all-in-one preprocessing tool for FASTQ files.

In the Galaxy interface, search for "fastp" in the tools panel.
Click on "fastp: fast all-in-one preprocessing for FASTQ files".
In the tool interface, set the following parameters:
- "Single-end or paired reads": Select "Paired Collection".
- "Select paired collection(s)": Choose your paired-end reads.
Click "Run Tool".

Do this for both your parent and evolved isolates.

Click on the eye icon next to the HTML report to view it. The report contains information about the trimming and filtering process.

Please fill in the following variables using the values from each report.

In [6]:

PARENT_INSERT_SIZE_PEAK = 0.0
PARENT_TOTAL_READS_BEFORE = 0.0
PARENT_TOTAL_READS_AFTER = 0.0
PARENT_GC_AFTER_FILTERING = 0.0
PARENT_ADAPTER_PERCENTAGE_READ1 = 0.0

In [7]:

EVOLVED_INSERT_SIZE_PEAK = 0.0
EVOLVED_TOTAL_READS_BEFORE = 0.0
EVOLVED_TOTAL_READS_AFTER = 0.0
EVOLVED_GC_AFTER_FILTERING = 0.0
EVOLVED_ADAPTER_PERCENTAGE_READ1 = 0.0

Q04

You have run a sequencing dataset through FastQC and observed the following:

A significant drop in quality scores at the 3' end of reads.
A GC content distribution that is slightly bimodal.
High duplication levels for several reads.

Explain the possible causes of these observations and suggest preprocessing steps to address each issue.

Put your answer here.

Q05

A genome exhibits a single GC content peak at 70%. Discuss how this might affect sequencing accuracy, amplification efficiency, and downstream analysis. Provide two potential strategies to mitigate these challenges.

Put your answer here.

Q06

Write a Python function calculate_average_quality that takes a list of quality strings from a FASTQ file as input and returns the average Phred quality score for all the reads combined.

Input Example:

["47287653825380557902185865586", "GGDIHIFGEHGGIGGIHGFGIIFIHF", ":<B;<;;5<;6;@9:?=8:@<9>9>=<:<A;?>=;:"]

Expected Output:

27.945054945054945

Details:

Each quality string corresponds to one read.
Use the Phred+33 quality formula to convert ASCII characters to quality scores: subtract 33 from each character's ASCII code.
Calculate the average of all scores across all reads.
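As a starting point, the sketch below converts a single quality character to a score and a score to an error probability. It assumes the Phred+33 offset used by Sanger and Illumina 1.8+ FASTQ files, which is consistent with the expected output above. The helper names are illustrative, and the loop over all reads is left for you to write.

```python
def phred33_to_score(symbol: str) -> int:
    """Convert one Phred+33 quality character to its integer quality score."""
    # Assumption: offset 33 (Sanger / Illumina 1.8+ encoding).
    return ord(symbol) - 33


def score_to_error_probability(score: float) -> float:
    """Convert a Phred score Q to the base-call error probability 10^(-Q/10)."""
    return 10 ** (-score / 10)


print(phred33_to_score("I"))  # 40, since ord("I") is 73
print(score_to_error_probability(phred33_to_score("I")))  # about 1 error in 10,000 calls
```

Higher scores mean lower error probabilities: every 10-point increase in Q corresponds to a tenfold drop in the chance that the base call is wrong.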

In [ ]:

def calculate_average_quality(quality_strings: list[str]) -> float:
    """
    Calculates the average Phred quality score across all reads.

    Args:
        quality_strings: List of quality strings from a FASTQ file.

    Returns:
        The average Phred quality score.
    """
    total_score = 0
    total_bases = 0

    # TODO: Compute the average score
    average_score = 0

    return average_score
