DNA Sequencing

2019-10-04

MBP Tech Talks

Fall 2019

Fundamentals of Genome Sequencing and Applications

Motivations

Lots of research involves genome sequencing or and analysis
Few resources covering the fundamentals
Information is often scattered through blog posts, presentations
Multidisciplinary - difficult for a single person to discuss thoroughly

Goals for the fall semester

Understand foundations of genome sequencing
- How DNA is captured and read
- What makes good sequencing data
Sequence alignment
Wet and dry sides of sequencing experiments
How to think about good experiment design when sequencing

Outline

Part 1: Fundamentals of genome sequencing

History of reading DNA
Massively-parallel sequencing
Standard file formats
Sequence alignment
Alignment metrics

Outline

Part 2: Applications of genome sequencing

Mutation detection
Chromatin accessibililty
Histone modifications and protein binding
Transcriptome sequencing
Chromatin organization
DNA methylation

Session structure

HSB 100, Friday 12:00 - 14:00
No food or drinks
Each session has 2 parts, break in the middle

Fundamentals of DNA sequencing

DNA molecules

Alberts, Molecular biology of the cell, 6ed. pg. 176

DNA is double-stranded polymer
DNA contains 4 nucleotides: adenine (A), cytosine (C), guanine (G), thymine (T)
DNA has a sugar-phosphate backbone
Nucleotides come in complementary pairs: A-T, C-G

DNA molecules

Alberts, Molecular biology of the cell, 6ed. pg. 177

DNA has a right-handed orientation
Strands are oriented in opposite directions
We measure direction by counting 5’ to 3’

DNA encodes information for organisms

Alberts, Molecular biology of the cell, 6ed. pg. 178

Complementary strands allow for replication
Sequences themselves code for proteins inside cells

DNA sequencing

Genome: set of all DNA inside an organism
Sequencing: the process of measuring the order of these nucleotides in a set of cells

Sanger sequencing

Just from a high level, since details aren’t totally relevant, just the idea

Gel electrophoresis
- Small fragments of DNA are loaded into lanes of gel
- Small electric potential pulls DNA through gel
- Larger resistance on longer fragments
Load polymerase, primers, and ddNTPs to template strand
- Primers to act as substrate for polymerase
- Polymerase to bind ddNTPs to template strand
- Add ddNTPs one at a time to control which nucleotide gets added next

Sanger sequencing

Mardis, 2013

Separate small portion after adding a particular ddNTP
- Keep all portions after adding a given ddNTP together
- Load these in a lane
Bright band in the lane gives nucleotide
Vertical position gives order in sequence

Sanger sequencing

Pros

Sequence DNA fragments
Can visually see the order of nucleotides

Cons

Limited by sharpness of bands
Limited by length of gel
X-ray gel exposure, image development, one nucleotide at a time (laborious)

Sequencing by synthesis

Use ideas of Sanger sequencing:
- Primers, polymerase, nucleotides
- Synthesize complementary strand base-by-base
Instead of measuring all at once via gel, measure fluorescence

Sequencing by synthesis

Massively parallel sequencing

Complement of captured bases to reveal the original DNA sequence
Image of the entire slide allows simultaneous capturing of millions of fragments of DNA

Storing DNA sequences

FASTA format

Sequence name
Sequence string

>Name
SEQUENCE

>reference-seq-name1
ATCTATACTTTATCTTTATCTTTA
>reference-seq-name2
ATTTTATCGCGTAGCTAGCTGGCT

Pairs of lines are the fundamental units
Everything encoded in plain text

FASTA format Pros

Simple to understand
Encodes DNA in plain letters
Easy to parse, edit, etc

FASTA format Cons

Assumes sequences are exact
- Often not true in practice
Often extremely large files
- Human genome: 3 Gbp
- Tend to be gzipped to compress size
No computational optimization
- Random access
- ASCII encoding is overkill when you only need 2 bits

FASTQ format

Very similar to FASTA

Sequence name
Sequence string
Description
Quality scores (uncertainty in measurement)

Analogous to working with sig figs, % err in measurements

FASTQ format

Quality scores

Consider same set of reads as before
Fragments spread out on a substrate
Ideally we’d be able to measure a single strand of DNA
More robust measurements come from having clones of the same fragment

FASTQ format

Quality scores

FASTQ format

Quality scores

Consensus call is blue
Not unanimous

\(p = \mathbb{P}[\text{incorrect call}] = \frac{1}{9}\)

FASTQ format

Quality scores

\(p = \mathbb{P}[\text{incorrect call}] = \frac{3}{9} = \frac{1}{3}\)

This is a simple model for calculating errors in calls
More complicated methods can be used to calculate \(p\), based on the chemistry, the asymmetric errors, input sample base distribution, etc

FASTQ format

Quality scores

Modern sequencers are very accurate
Often \(p \approx 0\)
Phred quality score \(q = -10\log_{10}(p)\)
\(q \in [0, \infty)\),
\(q = 10 \implies p = \frac{1}{10}\)
\(q = 20 \implies p = \frac{1}{100}\)
\(q = 30 \implies p = \frac{1}{1000}\)
\(q = 40 \implies p = \frac{1}{10000}\)

FASTQ format

Quality scores

Encode \(q\) as a single ASCII character
ASCII(round(\(q\)) + 33)
\(q = 0 \implies\) !
\(q = 40 \implies\) I

@reference-seq-name1
ATCTATACTTTATCTTTATCTTTA
+
GFFFFBBBCBCCBBAAA:::;;;;
@reference-seq-name2
ATTTTATCGCGTAGCTAGCTGGCT
+
FFFEEBBBCBCCBBAAA:::;;;;

DNA sequencing data is full of errors

Say you have 1000 fragments, each 100 bp long, each base has \(q = 40\)
What is the probability you have no errors anywhere in your data?

DNA sequencing data is full of errors

\(\mathbb{P}[\text{no errors}]\)
\((\mathbb{P}[\text{no errors in read}])^{1000}\)
\(((1-\mathbb{P}[\text{incorrect base call}])^{100})^{1000}\)
\((1-p)^{100000} \approx 4.54 \cdot 10^{-5}\)
\(\mathbb{E}[\text{incorrect base calls}] \approx np = 100000 \cdot \frac{1}{10000} = 10\)

DNA sequencing data is full of errors

Sequencing is a random sampling problem

Many steps involve complicated chemistry
DNA doesn’t come in these pre-defined chunks that we can measure
All sequencing is random sample of input space
Similar to taking polls of demographics to extrapolate to the population
Prime example of Hidden Markov Model

Many steps in sequencing are stochastic
- Composition of input DNA
- Fragmentation
- Adapter ligation
- Annealing to substrate
- Amplification
- Errors in base calls
- Trillions of possible fragments, only millions sequenced
Single measurements are not deterministic

Break

Assessing DNA sequencing data quality with FastQC

FastQC

Tool to assess quality of sequencing data
Produces HTML report

fastqc {FASTQ}

FastQC Examples

Summary

A brief history of DNA sequencing and various methods
Phred quality scores
Storing DNA sequences (FASTA, FASTQ)
FASTQ quality control metrics
Sequencing as a random sampling measurement

What I didn’t cover

Alternative sequencing technologies (PacBio, Oxford Nanopore)
Chemistry of these technologies
What to do with these data

Next time

Exact sequencing alignment
String alignment to a reference
Naive alignment
Boyer-Moore algorithm