FASTA ¶
In this notebook, we'll work with a DNA sequence and learn some basic programming concepts in Python. These exercises are designed to help you get familiar with strings, loops, and dictionaries, which are useful tools for bioinformatics.
Let's start with a simple DNA sequence stored as a string in Python.
Strings are sequences of characters, and in bioinformatics, they are often used to represent DNA sequences, where each character represents a nucleotide (
A
,
T
,
G
,
C
).
DNA_SEQ = "GAGCAATTTCGAGGATCGCTTGTTGGTATTACTCGGGCTTTTC"
We have a DNA sequence stored in a variable called
DNA_SEQ
.
Your first task is to print the sequence to see what it looks like.
print(DNA_SEQ)
GAGCAATTTCGAGGATCGCTTGTTGGTATTACTCGGGCTTTTC
The length of a DNA sequence tells us how many nucleotides it contains. Python provides a built-in function called len() that can calculate the length of a string.
seq_len = len(DNA_SEQ)
print(seq_len)
43
DNA sequences consist of four nucleotides: Adenine (A), Thymine (T), Guanine (G), and Cytosine (C). To analyze the sequence, let's count how many times each nucleotide appears.
We’ll use a dictionary to store the counts for each nucleotide, and a for loop to iterate over the sequence.
def count_nucleotides(seq):
nucleotide_counts = {"A": 0, "T": 0, "G": 0, "C": 0}
for nucleotide in seq:
if nucleotide in nucleotide_counts:
nucleotide_counts[nucleotide] += 1
return nucleotide_counts
nuc_counts = count_nucleotides(DNA_SEQ)
print(nuc_counts)
{'A': 7, 'T': 16, 'G': 12, 'C': 8}
print(seq_len)
print(nuc_counts["A"] + nuc_counts["T"] + nuc_counts["G"] + nuc_counts["C"])
43 43
The GC content of a DNA sequence is the percentage of G and C nucleotides in the sequence. This is an important metric in genomics.
To calculate the GC content:
- Add the counts of G and C.
- Divide this sum by the total length of the sequence.
- Multiply by 100 to get a percentage.
def get_gc_content(seq):
nuc_counts = count_nucleotides(seq)
gc_content = (nuc_counts["G"] + nuc_counts["C"]) / len(seq) * 100
return gc_content
seq_gc_content = get_gc_content(DNA_SEQ)
# Print the GC content
print(seq_gc_content)
46.51162790697674
print(f"GC Content: {seq_gc_content:.1f}%")
GC Content: 46.5%
Now, let us consider a much longer DNA sequence.
DNA_SEQ_LONG = """GAGCAATTTCGAGGATCGCTTGTTGGTATTACTCGGGCTTTTCGTTTTTTAAACTCAAGG
TACGCAAGGACCGCGCGCAGTCGACATGAACGTCAGTTGCTATGGGAACGTTGACAGAAT
GTGAGGAAGGCAGCGCATGCACTCTGCCTGGCAATTTGTTGTGTATTCGGTGGTTGGGCT
CGGTGAAGGTGTATTACGACCGAACGACGGATTGGATCGACAAGCATGGCACGTATGATC
ACTTAGTGTGACTCACTGGTGAAGACACTCTGCAGTCCGTTGGAACTCATGTTGGAAGGC
AGGCGGGGACTCACTGTTAGACGGGCGGTTGCCTTTCGATGCGCCAGATGACGAAAGCTC
GTTCCTATGGCCCTCGACGCGGGAGTCTTACCTGACCTTTGCTCTCGCTAAAAGAGATAC
CTTTGGCACCGGGCTCGATCCTGTGTAGGCCGCAGACGATGCGGGCCAAATTAGTTCAGC
AAAAACCTAGTGGTTTGCCCAGGCATGAACCACTTTGTCCTGACCGGAGAGAATTGATCA
ATGAGGATGATTCGTTGCTTAGGCCGAGAACGGCCCACGCTCTCCACCGGGAGCAAAAAC
GAGCTTCGTTAACTGGGGGGGGGCTATGCCGTCGCGGCCCTGCCTGGACACAAAAAGATG
AAAATCAGAAATAGGGAGTGGTAGAACAATATTTAATGAACTTCCTTCCATATATCTTGT
TGGAAATCGCAATGTATACCACGGTACGTTGTGGCACGAAAAGCCTAAACCACCATATCT
CTAGAGAGGAATTAAAACGCCCGGATGCGGCCAGCAGCATGTTACGTTGCAGGTCGGAAA
CAATGGATTCCTTAGCCCTCTACTACGCGTTTCCCGACTTTTGGTTACCATGGGTTTTGT
AGCGATAAAACTATCCGTATACTACTGGTCGCTCTTGTCTAGTTATTCGAGTACTATATC
CAATATTCTGTCTATAGCGTTTGTGATGCCCTATTGTATACATCCTAGCGGCCTACTCGT
ACATATTGTCTTTTACTCTAATTCAACGTTAATTACCCTCCGCCCCCCTTCCACATACAG
CCCGACCGCTCGACGTCTGTGCGGTTCGTGCCTCCCTGCCCCTGTAGTAAACTTGTTAGT
AAAGCAACTACCGACCACTGTGCACAACTGACAACCATTTTTGGTTGCCGTCATTGAATG
TACGTGGAGCACTTTCGGAACACAGGTCTCGATAAACCTCCGGAGGAGACGAGTTCAGAC
AAAACTAGCTTGCAAGGTCAAAGCGACACTGGAAGAAGAGTCCGTTATCTACGTTGCGCC
CTGCGATCTTGGTCGAGTATGACTGTGCGCGGGCGTGCCTCCCACAACTGCCACACTCTT
CCGTCGGTGTGCAGGATGGTTTCGGATCGTGAGACACGGGCAATAACCATTGGGATGCAC
TGGGTCTGACGAGGATCATCCGGTGAGCGTCGAGTCCGTCAGAGACTGCTTCCGTATTCT
TCGTAATGTCCAGATCGGCATCACGCACTAACCAGGTCGTGTCAAATGCCCGTTGACACC
TGAAGTAGTCGGAATTTCTCTATACAAGCTCAGCAGGATACCCTCACGCGACAATTAGAC
ATTCACTGCGCCAAACACCACCTGTTCGTGCCGTATCTGGACTAGGACAAGAACGGCGCT
ACTCCCTCCGACGTAAATCCCTCATTTGTGTGTTTAGCCAACACCCGCATCGGTCCTCAT
AGACCTTTCCCCGGTGCGTTTTACCCCTGGGTTGACCTCGCGGGCGCAAAAGTACCTACA
CGTAACGTTTGAACACTACAAAAACATAGTTCCTGATCTTCTCGTGATACTGTCCTCTTA
TTAAAGAAAATGAGTGTAGTCTTTCACCATACTATCGGTCGTGGATTTTCCACGCCTTCA
TGTAGATTATCAACGCTTGACTCGGCCTAGAGTAACGGAGTCCTCACGTCAAGGCGTGTT
GACTGAGCTACGGGCAGATGTTTAGGTTTGGATTTGGAGGAAGGCCTATACCTGTATTAG
TTCCATGAGCGCAGTGCACTGCAGTGTGTACGGAGGCCTGCCTCGCGGATACCTCACCAA
AAGAGAATAGCACCCCTGGGTAGCCTTTACGNAAATAGATTGATATGGTCATAGGTGGAC
AGTCGGGCATCTAGCTCATACTGTCGTCCCGTTAGAGAATTACCTGATGGGTTTCAACCT
AAGGCCCGTAGCTTTTACGCCCCAACTGACACCCCCCTGTTAACCCCGACAATCCAAGTC
TCTAACGCAGGAGAGGCTAGGAACGGCCGAAGCCGTCCAGGCGCGCCGATTGTACATCCG
CCTATGCCCCGAGCGCTGGCTTCGGGGGTTCTCCCGGATCTTCATAACATATCGCCATTG
ATGTGTACACTTTCTCGCAGTGGCCAAATCGTTAACAGTGTAAGTGGTATTGGGCGCGAA
AAGCTAACCCGCCGGGAATGACTTTGCGTTTAGGCATACGTAATACCCGATTGCATTGGT
AAGGTCAGCGCACGCCGAGCCTCGCAGGACGACCTCAATTCACAACGAATACCCAATAGA
GTAGAAACACTACTTTATACGCACGAATAGATGGGCCCTAAAATCCGGGACACCGGCGGT
GCAGATTACCCGGCGCGATCCCTCCTAATCATCAAACACGCCTCCTCCGACCTCAACCGG
ACGGGCTAGCTAACGTAGAAAACCGCCCGTCACAGTTCTCAACAGCTTGTGGATCACAGG
CAGCCATATTGCTGACGTCTCACGGGGCGTCGAGCACGTCAGGCTACAGAAACCTTGCAG
CGGCAGTGGCGGAGATAATCTCTATTTCAGCCGAGCCGGACTGGCCTCCTTAGTTCGGAG
AAGCAGGGACCCCTACGCGGCGAGATTCGCTACACAGAACGCGACTATCTGACCGACTTT
GCCCTAGCTTAAGTCAGTTATACCAAAGATGTTCCGACGATTCCTTCTTTGTTCGGACCT
ATCCGAAATCATTTACACCTCAGGCGACCACTTACGGCACAATTCGGCTTGGCCAAACAC
ATGGTCCCGCCGGATAAAACTCGATTTGGATATGCCTGTCATAAGTCGATCCCTGCACCT
GGGGGTGGATAGTCATGTTGCTACTTCTGCCTATTCGGGGGGTGATATGGTACTGCCCAA
CTCTCATCCGTTCTAGCATATCGGCAAGGAAAGCTCACAGGCCAGCACCGGACTCCATCA
CCGTCCGCTTAATTATAAAATCCACTTTGAAAAGTTGTGCAATGAAAGGACCCCAAGGCA
GTTGTCGTACCGGTGAAGTAGGCAGTTGATATGGTTACTTACGACACGGTTAATTAAACT
TTGCGGGAGCCCCAACGATGATGGCCTGAGTGCGATAAGACCGTGTACGGACTCAATTAA
CATCTTAGCTTGGGATGTTTACAAATAGGCGAAACCATTAGTAACGAACGGCTATATTTA
TGGGTGCTTAACGTGTCGATCCAGGCCGAAGGAGAGCTAGGTTATCTCGTAACAGCAGCC
AGCTGCCACCCCCGGATCGAGCTTCTTTTGTACACCTAAAGAGCAGTTACGGCCGGATGA
CGCAACCATAGCTGTGAACGATAACGTCTGATCAGCGAGTATCCGGTTCATCATGGCCTT
GGTTTTAAAAAATAGATGTTAGCGGACTGTCCGCATATTTCAGCAATTGCCGCGTAATGC
GGATAGATGCTCCATTGCAAGAGTTTGTGAAAAGTGGTCCTTAGGCAGGTGTTGAATGAG
GTTCGTTATTCGCGCTTATTTGACGCTAAATTTTTACGAAACAAACGCCCCGTGCAAACG
CCGGTCCCGGATGTACTCACGCCGTTCGTTGTTGGGGCGAAGGCAAGCTAACTGCTGACC
TTGCGTTAGTGCTCCACCCGAGCCAGAAGTCTTGTATATTTCCCATCAGGCTACACCAAT
ACTTTGTAAAGTCCGTATTGAGGCCGTTGTAGCTGTTCCGCGTATAGCACAACGCTCCAG
GCTTCTGCCCGGTAGGCTCAGCATGAACGACACGTAATGTATTAGAACGTCCAGGCAAAT
GCTCCTAGAACTATGGAGGGCACTGCTATGCGGGGATGGCCGAGGGGCCTGATGCGTTCG
AAAGTCAGCAGTGGGACAGTCAGCCTAGCTGTCGGTCACTCCCCCCGTCTTAATGGCGTG
TTCCCGCCCCCGTGCGGCCAGCTATTACCCCAGAGTACCGTTGCGATTTCAGGGTACGCG
TCATTGGGTGTGTGATGTGGATGATGCCATGCCTCTCGGCATTAAGTCTTAGGGCTTCTA
AACCATTTCTACAACCACAATAGGCGGTAGCAGTTCGGTAGATGGAAGCAACTACCGCAA
TGATGGGAGATTCTGGTCGATCTTTTTAGCAGGTGACATAACGGCCCAGCGAATATCTTG
ATCGATTGGGGGAGCCCGAAAGAGTTAAAGCACATCGTCTCCCAGATAGTCAGCGCGCAG
CCGACCTGGGGTCCGAATGAACGAGCTGCCGGTTAGATTTGCCCTCGACTATATGCAAGC
CGATTGACGTTTTGGCAGCAAACTCGCATTTCCCTTAAGAATCGCGGACCGGAGTTGCTC
GTGCAAACACCGTTGATTACAATGTGTGCGCTTGAGGGCGGCGAACAGACTCCCGTGAGG
AAGCATCCCTCTCGGAATTAGGCGCGACCCGCCGAGGCTACGGGTTGTTGCCTCTGATCT
TTTCCATCAATCAGTTGCTAGATCCGTCATTATGTACGATCGCGCAAACTTTTAAAATTA
GGAGCTCGAGAACTCGTTTTTTCTCCTATACTAGAACAGTGACAGATACGCTTATCATGC
GCAAGGAGGATGAGGTAGACAGGAGGTGGACGGCTATCATATATAGTGCTAGACCTGTAG
TTACGTCGACAGGCTAAAGCCTACATATCGGCAAAGTTTACAGGCGAAAATGAGCATCAG
TTCCCCAGGCGTATATCTGG"""
Let's see how long the sequence is.
seq_len_long = len(DNA_SEQ_LONG)
print(seq_len_long)
5083
Now let's count the nucleotides.
nuc_counts_long = count_nucleotides(DNA_SEQ_LONG)
print(nuc_counts_long)
{'A': 1221, 'T': 1229, 'G': 1265, 'C': 1284}
Perform a sanity check that they match.
print(seq_len_long)
print(
nuc_counts_long["A"]
+ nuc_counts_long["T"]
+ nuc_counts_long["G"]
+ nuc_counts_long["C"]
)
5083 4999
What's wrong? Well, let's find a way to print all unique characters present in the string.
print(set(DNA_SEQ_LONG))
{'A', 'N', 'T', 'G', '\n', 'C'}
We can remove these characters by using the
replace
method.
DNA_SEQ_LONG_CLEAN = DNA_SEQ_LONG.replace("\n", "")
print(set(DNA_SEQ_LONG_CLEAN))
{'A', 'T', 'G', 'N', 'C'}
We have to write a new function to count all unique characters, regardless if they are nucleotides or not.
def count_characters(seq):
char_counts = {}
for nucleotide in seq:
if nucleotide in char_counts:
char_counts[nucleotide] += 1
else:
char_counts[nucleotide] = 1
return char_counts
char_counts_long = count_characters(DNA_SEQ_LONG_CLEAN)
print(char_counts_long)
{'G': 1265, 'A': 1221, 'C': 1284, 'T': 1229, 'N': 1}
Redo our sanity check.
print(len(DNA_SEQ_LONG_CLEAN))
print(
char_counts_long["A"]
+ char_counts_long["T"]
+ char_counts_long["G"]
+ char_counts_long["C"]
+ char_counts_long["N"]
)
5000 5000
Download and read file ¶
import requests
# URL to the raw content of the file
fasta_url = "https://github.com/oasci/pitt-biosc1540-2025s/raw/refs/heads/main/content/data/fasta/synthetic.fasta"
# Fetch the file content
response = requests.get(fasta_url)
# Handles HTTP request codes
if response.status_code == 200:
FASTA_FILE = response.text
else:
print(f"Failed to fetch file. Status code: {response.status_code}")
def get_seqs_from_fasta(fasta):
seqs = []
for line in fasta.split("\n"):
if len(line) > 0 and line[0] != ">":
seqs.append(line)
return seqs
seqs = get_seqs_from_fasta(FASTA_FILE)
print(len(seqs))
5000