Module 0

Biological Foundations & Data Ecosystems

Modern biology runs on open databases and standardised file formats. This module reviews the central dogma, genome architecture, and the principal public repositories (NCBI, UniProt, PDB, Ensembl), then surveys the file formats (FASTA, FASTQ, SAM/BAM, VCF, GFF3) that carry biological data through analytical pipelines.

1. Central Dogma & Data Flow

The flow of information from DNA to RNA to protein, together with the regulation of gene expression, gives rise to a chain of distinct data types: genome sequences, transcript abundances, protein structures, and interaction networks. Each data type has its canonical repositories: GenBank (DNA sequences), Ensembl (annotated genomes), RefSeq (curated reference sequences), ArrayExpress/GEO (expression), UniProt (proteins), PDB (structures), STRING (interactions), and Reactome (pathways).
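
As a minimal sketch of the DNA → RNA → protein flow, the snippet below transcribes a toy coding-strand sequence and translates it with the standard genetic code. The function names (`transcribe`, `translate`) and the 12-base sequence are illustrative assumptions, not part of any library; in practice Biopython offers equivalent functionality.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T replaced by U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate from the first codon up to (not including) the first stop."""
    dna = mrna.upper().replace("U", "T")
    protein = []
    # Ignore any trailing bases that do not form a complete codon.
    for i in range(0, len(dna) - len(dna) % 3, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)


gene = "ATGGCCAAGTAA"  # hypothetical toy sequence: Met-Ala-Lys-stop
mrna = transcribe(gene)
protein = translate(mrna)
print(mrna)     # AUGGCCAAGUAA
print(protein)  # MAK
```

Each intermediate here corresponds to a data type listed above: `gene` is what a nucleotide database stores, `mrna` is what expression experiments quantify, and `protein` is what a protein database catalogues.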

2. Genome Architecture

Eukaryotic genomes are organised into chromosomes containing protein-coding genes (~20 000 in humans), non-coding RNA genes, regulatory elements, and large tracts of repetitive sequence. Protein-coding exon sequence accounts for only about 1-2% of the human genome; introns are removed from the pre-mRNA by splicing, and UTRs regulate translation and mRNA stability. Prokaryotic genomes are far more compact and gene-dense. Understanding this architecture is a prerequisite for annotation, variant calling, and RNA-seq analysis.
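
To make the idea of exon content concrete, here is a small sketch that computes what fraction of a region is covered by exons. The coordinates are invented for illustration (a hypothetical 10 kb locus), and the helper names are assumptions; intervals are treated as half-open `(start, end)` pairs and overlapping exons from different transcripts are merged before counting.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or abuts) the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def covered_fraction(intervals, region_length):
    """Fraction of a region of the given length covered by the intervals."""
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / region_length


# Hypothetical exons from two overlapping transcripts in a 10 kb locus.
exons = [(100, 300), (250, 400), (2000, 2150), (9000, 9250)]
frac = covered_fraction(exons, 10_000)
print(merge_intervals(exons))  # [(100, 400), (2000, 2150), (9000, 9250)]
print(f"{frac:.1%}")           # 7.0%
```

Merging first matters because annotation files routinely list the same exon under several transcripts; summing raw lengths would overcount.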

3. Key File Formats

FASTA: plain-text sequences (>header + sequence).
FASTQ: reads + per-base Phred quality scores (Illumina output; ONT reads after basecalling).
SAM/BAM/CRAM: alignments to a reference genome.
VCF: variant call format; genotype calls relative to reference.
GFF3 / GTF: gene annotation (exon, CDS, transcript coordinates).
BED: simple genomic intervals.
mmCIF / PDB: atomic coordinates of macromolecular structures.
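
The formats above are simple enough that a few lines of standard-library Python illustrate their conventions. The sketch below is not a replacement for a real parser: it reads FASTA records, decodes a FASTQ quality string assuming the common Phred+33 encoding, and converts a GFF3 interval (1-based, inclusive) to BED coordinates (0-based, half-open). All function names and the sample data are illustrative assumptions.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality_line]


def error_probability(q):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def gff3_to_bed(start, end):
    """GFF3 is 1-based inclusive; BED is 0-based half-open."""
    return start - 1, end


fasta_text = ">seq1 toy record\nACGT\nACGT\n>seq2\nGGCC\n"
records = list(read_fasta(io.StringIO(fasta_text)))
scores = phred_scores("II5!")

print(records)                 # [('seq1 toy record', 'ACGTACGT'), ('seq2', 'GGCC')]
print(scores)                  # [40, 40, 20, 0]
print(gff3_to_bed(100, 200))   # (99, 200)
```

Off-by-one errors between 1-based and 0-based formats are among the most common bugs when moving intervals between annotation files, which is why the conversion is worth writing down explicitly.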

Simulation: Data Repository Growth

import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for headless (server-side) rendering
import matplotlib.pyplot as plt

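# Illustrative GenBank model, 1982-2024: exponential growth with a fixed doubling time
DOUBLING_YEARS = 1.5  # ~18 months, matching the doubling time printed below
years = np.arange(1982, 2025)
records = 1000 * 2 ** ((years - 1982) / DOUBLING_YEARS)
records = np.clip(records, 1000, 3e13)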

# UniProt accumulation: illustrative power-law curve, capped at 2.5e8 entries
years_u = np.arange(1986, 2025)
uniprot = 5000 + 60000 * (years_u - 1986) ** 2.4
uniprot = np.clip(uniprot, 5000, 2.5e8)

# PDB structures: illustrative power-law curve, capped at 2.1e5 entries
years_p = np.arange(1972, 2025)
pdb = 10 + 10 * (years_p - 1972) ** 2.1
pdb = np.clip(pdb, 10, 2.1e5)

fig, ax = plt.subplots(figsize=(11, 6), facecolor='#0a0a1a')
ax.set_facecolor('#111827')
ax.tick_params(colors='#cbd5e1')
for s in ax.spines.values():
    s.set_color('#334155')
ax.semilogy(years, records, color='#2dd4bf', lw=2.6, label='GenBank sequences')
ax.semilogy(years_u, uniprot, color='#fbbf24', lw=2.6, label='UniProt entries')
ax.semilogy(years_p, pdb, color='#f472b6', lw=2.6, label='PDB structures')
ax.set_xlabel('Year', color='#cbd5e1')
ax.set_ylabel('Records (log scale)', color='#cbd5e1')
ax.set_title('Growth of biological data repositories',
             color='#5eead4', fontweight='bold')
ax.grid(True, color='#334155', alpha=0.3, which='both')
ax.legend(facecolor='#1e293b', edgecolor='#334155', labelcolor='#cbd5e1')
plt.tight_layout()
plt.savefig('output.png', dpi=120, bbox_inches='tight', facecolor='#0a0a1a')
print('GenBank doubled every ~18 months (Moore-law-like sequencing growth)')
print('2024: >3 trillion nucleotides, >250M protein sequences, >200k structures')
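
The "~18 months" doubling time quoted in the script can be turned into more intuitive quantities with a little arithmetic. The following sketch assumes idealised exponential growth with a constant doubling time; the helper names are illustrative and the numbers describe the model, not measured repository statistics.

```python
import math


def growth_factor(years, doubling_time_years):
    """Fold increase after the given number of years."""
    return 2 ** (years / doubling_time_years)


def annual_growth_rate(doubling_time_years):
    """Fractional increase per year implied by a doubling time."""
    return 2 ** (1 / doubling_time_years) - 1


def years_to_grow(factor, doubling_time_years):
    """Years needed to grow by the given factor."""
    return doubling_time_years * math.log2(factor)


T_D = 1.5  # doubling time in years (~18 months)
print(f"{annual_growth_rate(T_D):.1%}")       # 58.7%
print(f"{years_to_grow(1000, T_D):.1f}")      # 14.9
print(growth_factor(3.0, T_D))                # 4.0
```

Under this assumption a repository grows by roughly 59% per year and by a factor of 1000 in about 15 years, which is why the curves are plotted on a logarithmic axis.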

4. Python Tooling

Biopython (Cock et al. 2009) is the de facto standard Python library for bioinformatics; samtools, bcftools, and htslib are the workhorses for sequencing data. scikit-bio, pyranges, pysam, and gffutils cover downstream analyses. AI-era tools such as ESM, AlphaFold, and DNABERT (M8) are increasingly part of standard pipelines.

Key References

• Cock, P. J. A. et al. (2009). “Biopython: freely available Python tools for computational molecular biology.” Bioinformatics, 25, 1422–1423.

• Sayers, E. W. et al. (2022). “Database resources of the National Center for Biotechnology Information.” Nucleic Acids Res., 50, D20–D26.

• UniProt Consortium (2023). “UniProt: the Universal Protein Knowledgebase in 2023.” Nucleic Acids Res., 51, D523–D531.

• Berman, H. M. et al. (2000). “The Protein Data Bank.” Nucleic Acids Res., 28, 235–242.
