Sequence Alignment

2019-10-18

Last session

DNA sequencing technologies
Phred quality scores
FASTQ quality control
Sequencing as a random sampling

Outline

Sequencing reads as strings
Algorithms for finding string matches
Algorithmic efficiency
Tutorial: aligning sequencing data

Exact string matching

DNA sequences as strings

@reference-seq-name1
ATCTATACTTTATCTTTATCTTTA
+
GFFFFBBBCBCCBBAAA:::;;;;
@reference-seq-name2
ATTTTATCGCGTAGCTAGCTGGCT
+
FFFEEBBBCBCCBBAAA:::;;;;

Bunch of sequences and quality scores
How do they relate to each other?

We have the sequences themselves
As the title suggests, this general process is called “sequence alignment”
Broad approach is going to be to line them up (match characters)
To do this, we’re going to have to build up the idea of what it means for 2 strings of letters to be similar
First is the idea of a “string”, how we’re going to model the sequences

Strings

String: an ordered sequence of characters belonging to an alphabet
\(S_n = \left\{ s_1, …, s_n | s_i \in \Sigma \right\},\) where \(\Sigma\) is the alphabet
DNA: \(\Sigma = \{A, C, G, T\}\)
RNA: \(\Sigma = \{A, C, G, U\}\)
Protein: \(\Sigma = \{D, E, R, ..., M, W, C\}\)

String properties

Length of a string: \(|S_n| = n\)
\(S[i] = (i + 1)th\) character of S
Start counting \(i\) at 0

String comparisons

Substring: \(S\) is a substring of \(T\) if there exists strings \(u\), \(v\), where \(T = uSv\)
Prefix: \(S\) is a prefix of \(T\) if there exists string \(v\), where \(T = Sv\)
Suffix: \(S\) is a suffix of \(T\) if there exists string \(u\), where \(T = uS\)

We’ve built up most of the machinery we need

Exact string matching

Take two strings \(S, T\)
Assume \(|S| < |T|\)
Find all positions (indices) in \(T\) where \(S\) matches exactly
Question: What’s an easy way to find all of these positions?

Exact string matching

Examples

\(S =\) “ATCG”, \(T =\) “TATCGTGA”
\(S =\) “AAA”, \(T =\) “AAAAAAAAAAAAAAAA”
\(S =\) “ACGG”, \(T =\) “GTGTGTGTGTGTGTGTGTGT”

Exact string matching

Naive algorithm

Check first character of \(S\) against first character of \(T\)
If match, check second of each
Repeat #2 until no match or end of \(S\)
Remove first character of \(T\), and repeat #1-3 above
Stop when \(|T| < |S|\)

Exact string matching

Naive algorithm

\(S =\) “ATCG”, \(T =\) “TATCGTGA”
\(S =\) “AAA”, \(T =\) “AAAAAAAAAAAAAAAA”
\(S =\) “ACGG”, \(T =\) “GTGTGTGTGTGTGTGTGTGT”
See naive-alignment.py.

Exact string matching

Naive algorithm

How efficient is this method?
How many comparisons will we make, in the worst case?
Computer science: “big-O” notation
\(|S|\) comparisons for each shift of \(S\)
\(|T| - |S| + 1\) shifts in total
\(O(|S|(|T| - |S| + 1)) = O(|S||T|)\)

Exact string matching

Boyer-Moore algorithm

Can improvements be made?
Want to reduce total number of comparisons in our worst case scenarios

Bad character: When mismatch found, shift \(S\) until the mismatch becomes a match
Longer skip: Try alignments in one direction, compare characters in opposite direction
Good suffix: Check characters that previously matched prior to shifting \(S\) still match after shift

See boyer-moore.py.

Exact string matching

Boyer-Moore algorithm

Can show that Boyer-Moore is \(O(|T|)\)
Worst case, asymptotically linear with the length of \(T\)
Compare with naive: \(O(|T| |S|)\)

Summary so far

Two methods to align strings exactly
Key ideas:
- Efficiency of algorithm is critical
- Preprocess strings to skip alignments
Questions?

Break

Non-exact string matching

Previous methods only produced exact matches
DNA contains mutations, insertions, deletions
Need to do string matching with errors

Edit distance

Consider \(S, T\) where \(|S| < |T|\)
\(E(S, T) =\) min number of operations to transform \(S\) into \(T\)
3 operations on characters:
- Substitution
- Insertion
- Deletion

Edit distance

Examples

\(S =\) AATTC, \(T =\) AATTG
\(S =\) AATTC, \(T =\) AAGGTGT
\(S =\) ACCGT, \(T =\) TCCTAGT
Can show \(E(S, T) \sim O(|S| |T|)\)

Non-exact alignment

Exact alignment: finding positions where edit distance = 0
Non-exact alignment: finding positions with smallest edit distance

Reduced molecular biology problem to a calculation

Global vs local alignment

Global: match entirety of \(S\) to entirety of \(T\)
- Needleman-Wunsch algorithm
Local: match entirety of \(S\) to a part of \(T\)
- Smith-Waterman algorithm

Modern sequence alignment algorithms

Preprocessing reference and input
Edit distance for local alignment
Different penalties for mismatches
Including sequence quality scores

Performing sequence alignment

Tools

Anaconda (https://docs.conda.io/en/latest/miniconda.html)
Bowtie2 (https://github.com/BenLangmead/bowtie2)

Data

https://bit.ly/2BkwQF0
Reference (FASTA)
Bowtie2 index (.bt2)
DNA sequences (FASTQ)

Alignment process

DNA sequencing file (FASTQ)
Perform sequence alignment (Bowtie2)
Look at file of alignments (SAM/BAM file)

Summary

DNA sequences as strings of characters
Naive and Boyer-Moore Algorithms
Edit distance
Aligned a FASTQ to a reference genome