Project 1
Assignment B
DNA sequencing
Before beginning, you need to set up an account on Galaxy and create a project for this assignment. Galaxy is a web-based platform for data-intensive biomedical research that will allow you to perform all the necessary analyses for this project.
- Go to https://usegalaxy.org/
- Click on the "Login or Register" button at the top.
- If you don't have an account, click on "Register here" and fill out the registration form. Make sure to use a valid email address as you'll need to verify it.
- Once you've registered and logged in, click "Create new history+", name your new project, and provide a brief description if you wish. Click "Create".
Always ensure you're working within your project for this assignment. This will help you keep your work organized and easily accessible.
Galaxy will also save your work automatically, but it's a good practice to regularly check that your analyses are being saved correctly.
If you encounter any issues with Galaxy, check their support and training resources.
Q01
In this problem, you'll import the assigned sequencing data for your parent and evolved isolates into Galaxy.
Set the STUDENT_ID variable in the cell below to your student ID number.
STUDENT_ID = 1111111
Follow these instructions carefully:
- Your instructor has assigned you two SRA accession numbers, one for a parent isolate and one for an evolved isolate. You can find this list on Canvas.
- In the Galaxy interface, click on "Tools" in the left sidebar.
- In the search bar at the top of the Tools panel, type "Download and Extract Reads in FASTQ format from NCBI SRA" and click on it.
- In the tool interface:
  - For "select input type", choose "SRR accession".
  - In the "Accession" field, enter your assigned SRA accession number.
  - Leave all other settings as default.
- Click "Run Tool".
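The downloaded reads are FASTQ text, where each entry conventionally spans four lines: a header starting with "@", the sequence, a "+" separator, and the quality string. If you would rather extract the first few entries programmatically than count lines by hand, the following is a minimal sketch under that four-line assumption. The helper name first_fastq_records and the two example reads are made up for illustration.

```python
from itertools import islice


def first_fastq_records(lines, n=5):
    """Return the first n FASTQ records, each as one 4-line string.

    Assumes the common four-line layout (header, sequence, '+', quality)
    with no wrapped sequence lines.
    """
    records = []
    stripped = (line.rstrip("\n") for line in lines)
    while len(records) < n:
        record = list(islice(stripped, 4))
        if len(record) < 4:
            break  # ran out of complete records
        records.append("\n".join(record))
    return records


# Tiny made-up example; real reads come from your Galaxy history.
example = """@read1
ACGT
+
IIII
@read2
GGCA
+
FFFF
"""
print("\n".join(first_fastq_records(example.splitlines(), n=1)))
```

The same function accepts an open file handle in place of the list of lines, since it only iterates over its input.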
Copy and paste the first five FASTQ entries from your parent forward reads in the multiline string below.
PARENT_FASTQ_FIVE = """
"""
Copy and paste the first five FASTQ entries from your evolved forward reads in the multiline string below.
EVOLVED_FASTQ_FIVE = """
"""
Q02
Quality control is a crucial step in any sequencing data analysis. It helps you identify issues with your sequencing data that might affect downstream analyses. We'll use FastQC, a widely used tool for quality control of high-throughput sequencing data.
- In the Galaxy interface, search for "FastQC" in the tools panel.
- In the tool interface:
  - For "Raw read data from your current history", select "Dataset Collection" and choose "Paired-end data (fastq-dump)".
  - Leave all other settings as default.
- Click "Run Tool" near the bottom.
Do this for both your parent and evolved isolates. Once each job is complete, you'll see new items in your history for each FastQC report (Webpage and RawData).
Click on the "FastQC on collection: RawData" item.
The report contains several modules, each assessing a different aspect of your sequence data.
Please fill in the variables below.
- PARENT_FORWARD_PER_BASE_SEQ_QUALITY_1: Under the line ">>Per base sequence quality pass" there is a table. Enter the value in the "Mean" column for the row where "#Base" is 1.
- PARENT_FORWARD_PER_BASE_SEQ_QUALITY_LAST: Under the line ">>Per base sequence quality pass" there is a table. Enter the value in the "Mean" column for the last row in the table.
- PARENT_FORWARD_PER_SEQ_GC_CONTENT_34: Under the line ">>Per sequence GC content pass", go to the row where "#GC Content" is 34 and enter the value under "Count".
- PARENT_FORWARD_DUPLICATION_LEVEL_1: Under the line ">>Sequence Duplication Levels pass", go to the row where "#Duplication Level" is 1 and enter the value under "Percentage of total".
- PARENT_FORWARD_ADAPTER_CONTENT_UNIVERSAL_ADAPTER_LAST: Under the line ">>Adapter Content pass", go to the last row and enter the value in the second column ("Illumina Universal Adapter").
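The RawData item is tab-separated text in which each module opens with a ">>" line, lists its column names on a line starting with "#", and closes with ">>END_MODULE". If you want to double-check a value you read by eye, the following is a minimal sketch of pulling one module's table out of that text, assuming that layout. The helper name read_fastqc_module is hypothetical, the sample numbers are made up, and the sample header is shortened to three columns.

```python
def read_fastqc_module(text, module_name):
    """Return (header, rows) for one module of a FastQC raw-data report.

    header is a list of column names; rows is a list of lists of strings.
    """
    header, rows, inside = [], [], False
    for line in text.splitlines():
        if line.startswith(">>END_MODULE"):
            if inside:
                break  # finished the module we wanted
        elif line.startswith(">>"):
            # Module lines look like ">>Name<TAB>status".
            inside = line[2:].split("\t")[0] == module_name
        elif inside and line.startswith("#"):
            header = line[1:].split("\t")  # last '#' line wins
        elif inside and line:
            rows.append(line.split("\t"))
    return header, rows


# Made-up sample imitating the report layout (values are not real data).
sample = (
    ">>Per base sequence quality\tpass\n"
    "#Base\tMean\tMedian\n"
    "1\t32.1\t33.0\n"
    "2\t32.4\t33.0\n"
    ">>END_MODULE\n"
)
header, rows = read_fastqc_module(sample, "Per base sequence quality")
mean_col = header.index("Mean")
print(rows[0][mean_col], rows[-1][mean_col])
```

Values come back as strings, so convert them with float() before assigning them to the variables below.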
PARENT_FORWARD_PER_BASE_SEQ_QUALITY_1 = 0.0
PARENT_FORWARD_PER_BASE_SEQ_QUALITY_LAST = 0.0
PARENT_FORWARD_PER_SEQ_GC_CONTENT_34 = 0.0
PARENT_FORWARD_DUPLICATION_LEVEL_1 = 0.0
PARENT_FORWARD_ADAPTER_CONTENT_UNIVERSAL_ADAPTER_LAST = 0.0
EVOLVED_FORWARD_PER_BASE_SEQ_QUALITY_1 = 0.0
EVOLVED_FORWARD_PER_BASE_SEQ_QUALITY_LAST = 0.0
EVOLVED_FORWARD_PER_SEQ_GC_CONTENT_34 = 0.0
EVOLVED_FORWARD_DUPLICATION_LEVEL_1 = 0.0
EVOLVED_FORWARD_ADAPTER_CONTENT_UNIVERSAL_ADAPTER_LAST = 0.0
Q03
After assessing the quality of your raw sequencing data, the next step is to trim adapters and filter out low-quality reads. We'll use fastp, a fast all-in-one preprocessing tool for FASTQ files.
- In the Galaxy interface, search for "fastp" in the tools panel.
- Click on "fastp: fast all-in-one preprocessing for FASTQ files".
- In the tool interface, set the following parameters:
  - "Single-end or paired reads": Select "Paired Collection".
  - "Select paired collection(s)": Choose your paired-end reads.
- Click "Run Tool".
Do this for both your parent and evolved isolates.
Click on the eye icon next to the HTML report to view it. The report contains information about the trimming and filtering process.
Please fill out the following information.
PARENT_INSERT_SIZE_PEAK = 0.0
PARENT_TOTAL_READS_BEFORE = 0.0
PARENT_TOTAL_READS_AFTER = 0.0
PARENT_GC_AFTER_FILTERING = 0.0
PARENT_ADAPTER_PERCENTAGE_READ1 = 0.0
EVOLVED_INSERT_SIZE_PEAK = 0.0
EVOLVED_TOTAL_READS_BEFORE = 0.0
EVOLVED_TOTAL_READS_AFTER = 0.0
EVOLVED_GC_AFTER_FILTERING = 0.0
EVOLVED_ADAPTER_PERCENTAGE_READ1 = 0.0
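Besides the HTML report, fastp can also write a JSON report with the same summary statistics; whether that output is enabled in your Galaxy run is an assumption here. The sketch below uses a made-up fragment that imitates the JSON layout (a "summary" section with "before_filtering" and "after_filtering", and an "insert_size" section with "peak") to show how the read counts relate to each other. The numbers are not real data.

```python
import json

# Made-up fragment imitating the layout of a fastp JSON report.
report_text = """
{
  "summary": {
    "before_filtering": {"total_reads": 1000, "gc_content": 0.51},
    "after_filtering": {"total_reads": 950, "gc_content": 0.50}
  },
  "insert_size": {"peak": 250}
}
"""
report = json.loads(report_text)

reads_before = report["summary"]["before_filtering"]["total_reads"]
reads_after = report["summary"]["after_filtering"]["total_reads"]
retained = reads_after / reads_before  # fraction of reads that passed filtering

print(f"{reads_before} -> {reads_after} reads ({retained:.1%} retained)")
print("insert size peak:", report["insert_size"]["peak"])
```

A large gap between the two read counts is worth noting, because it means filtering removed a substantial share of your data.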
Q04
You have run a sequencing dataset through FastQC and observed the following:
- A significant drop in quality scores at the 3' end of reads.
- A GC content distribution that is slightly bimodal.
- High duplication levels for several reads.
Explain the possible causes of these observations and suggest preprocessing steps to address each issue.
Put your answer here.
Q05
A genome exhibits a single GC content peak at 70%. Discuss how this might affect sequencing accuracy, amplification efficiency, and downstream analysis. Provide two potential strategies to mitigate these challenges.
Put your answer here.
Q06
Write a Python function calculate_average_quality that takes a list of quality strings from a FASTQ file as input and returns the average Phred quality score for all the reads combined.
Input Example:
["47287653825380557902185865586", "GGDIHIFGEHGGIGGIHGFGIIFIHF", ":<B;<;;5<;6;@9:?=8:@<9>9>=<:<A;?>=;:"]
Expected Output:
27.945054945054945
Details:
- Each quality string corresponds to one read.
- Use the Phred quality formula to convert ASCII characters to quality scores.
- Calculate the average of all scores across all reads.
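As a reminder of the conversion itself, the sketch below handles a single quality character, assuming the Phred+33 encoding (Sanger / Illumina 1.8+), which is consistent with the expected output above. The helper names phred_score and error_probability are made up for illustration and are not part of the required solution.

```python
PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def phred_score(char: str) -> int:
    """Convert one ASCII quality character to its Phred score."""
    return ord(char) - PHRED_OFFSET


def error_probability(score: int) -> float:
    """Probability that the base call is wrong: P = 10^(-Q / 10)."""
    return 10 ** (-score / 10)


print(phred_score("I"))  # 'I' is ASCII 73, so the score is 40
print(error_probability(phred_score("I")))
```

Note that the requested average is taken over every base in every read, so longer reads contribute more scores than shorter ones.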
def calculate_average_quality(quality_strings: list[str]) -> float:
    """
    Calculates the average Phred quality score across all reads.

    Args:
        quality_strings: List of quality strings from a FASTQ file.

    Returns:
        The average Phred quality score.
    """
    total_score = 0
    total_bases = 0
    # TODO: Compute the average score
    average_score = 0
    return average_score