2019-10-18

Last session

  • DNA sequencing technologies
  • Phred quality scores
  • FASTQ quality control
  • Sequencing as a random sampling

Outline

  • Sequencing reads as strings
  • Algorithms for finding string matches
  • Algorithmic efficiency
  • Tutorial: aligning sequencing data

Exact string matching

DNA sequences as strings

@reference-seq-name1
ATCTATACTTTATCTTTATCTTTA
+
GFFFFBBBCBCCBBAAA:::;;;;
@reference-seq-name2
ATTTTATCGCGTAGCTAGCTGGCT
+
FFFEEBBBCBCCBBAAA:::;;;;
  • Bunch of sequences and quality scores
  • How do they relate to each other?
  • We have the sequences themselves
  • As the title suggests, this general process is called “sequence alignment”
  • Broad approach is going to be to line them up (match characters)
  • To do this, we’re going to have to build up the idea of what it means for 2 strings of letters to be similar
  • First is the idea of a “string”, how we’re going to model the sequences

Strings

  • String: an ordered sequence of characters belonging to an alphabet
  • \(S_n = \left\{ s_1, …, s_n | s_i \in \Sigma \right\},\) where \(\Sigma\) is the alphabet
  • DNA: \(\Sigma = \{A, C, G, T\}\)
  • RNA: \(\Sigma = \{A, C, G, U\}\)
  • Protein: \(\Sigma = \{D, E, R, ..., M, W, C\}\)

String properties

  • Length of a string: \(|S_n| = n\)
  • \(S[i] = (i + 1)th\) character of S
  • Start counting \(i\) at 0

String comparisons

  • Substring: \(S\) is a substring of \(T\) if there exists strings \(u\), \(v\), where \(T = uSv\)
  • Prefix: \(S\) is a prefix of \(T\) if there exists string \(v\), where \(T = Sv\)
  • Suffix: \(S\) is a suffix of \(T\) if there exists string \(u\), where \(T = uS\)
  • We’ve built up most of the machinery we need

Exact string matching

  • Take two strings \(S, T\)
  • Assume \(|S| < |T|\)
  • Find all positions (indices) in \(T\) where \(S\) matches exactly
  • Question: What’s an easy way to find all of these positions?

Exact string matching

Examples

  • \(S =\) “ATCG”, \(T =\) “TATCGTGA”
  • \(S =\) “AAA”, \(T =\) “AAAAAAAAAAAAAAAA”
  • \(S =\) “ACGG”, \(T =\) “GTGTGTGTGTGTGTGTGTGT”

Exact string matching

Naive algorithm

  1. Check first character of \(S\) against first character of \(T\)
  2. If match, check second of each
  3. Repeat #2 until no match or end of \(S\)
  4. Remove first character of \(T\), and repeat #1-3 above
  5. Stop when \(|T| < |S|\)

Exact string matching

Naive algorithm

  • \(S =\) “ATCG”, \(T =\) “TATCGTGA”

  • \(S =\) “AAA”, \(T =\) “AAAAAAAAAAAAAAAA”

  • \(S =\) “ACGG”, \(T =\) “GTGTGTGTGTGTGTGTGTGT”

  • See naive-alignment.py.

Exact string matching

Naive algorithm

  • How efficient is this method?

  • How many comparisons will we make, in the worst case?

  • Computer science: “big-O” notation

  • \(|S|\) comparisons for each shift of \(S\)

  • \(|T| - |S| + 1\) shifts in total

  • \(O(|S|(|T| - |S| + 1)) = O(|S||T|)\)

Exact string matching

Boyer-Moore algorithm

  • Can improvements be made?
  • Want to reduce total number of comparisons in our worst case scenarios
  1. Bad character: When mismatch found, shift \(S\) until the mismatch becomes a match
  2. Longer skip: Try alignments in one direction, compare characters in opposite direction
  3. Good suffix: Check characters that previously matched prior to shifting \(S\) still match after shift
  • See boyer-moore.py.

Exact string matching

Boyer-Moore algorithm

  • Can show that Boyer-Moore is \(O(|T|)\)
  • Worst case, asymptotically linear with the length of \(T\)
  • Compare with naive: \(O(|T| |S|)\)

Summary so far

  • Two methods to align strings exactly
  • Key ideas:
    • Efficiency of algorithm is critical
    • Preprocess strings to skip alignments
  • Questions?

Break

Non-exact string matching

  • Previous methods only produced exact matches
  • DNA contains mutations, insertions, deletions
  • Need to do string matching with errors

Edit distance

  • Consider \(S, T\) where \(|S| < |T|\)
  • \(E(S, T) =\) min number of operations to transform \(S\) into \(T\)
  • 3 operations on characters:
    • Substitution
    • Insertion
    • Deletion

Edit distance

Examples

  • \(S =\) AATTC, \(T =\) AATTG
  • \(S =\) AATTC, \(T =\) AAGGTGT
  • \(S =\) ACCGT, \(T =\) TCCTAGT
  • Can show \(E(S, T) \sim O(|S| |T|)\)
  • 1
  • 4
  • 3

Non-exact alignment

  • Exact alignment: finding positions where edit distance = 0
  • Non-exact alignment: finding positions with smallest edit distance
  • Reduced molecular biology problem to a calculation

Global vs local alignment

  • Global: match entirety of \(S\) to entirety of \(T\)
    • Needleman-Wunsch algorithm
  • Local: match entirety of \(S\) to a part of \(T\)
    • Smith-Waterman algorithm

Modern sequence alignment algorithms

  • Preprocessing reference and input
  • Edit distance for local alignment
  • Different penalties for mismatches
  • Including sequence quality scores

Performing sequence alignment

Tools

Data

Alignment process

  1. DNA sequencing file (FASTQ)
  2. Perform sequence alignment (Bowtie2)
  3. Look at file of alignments (SAM/BAM file)

Summary

Summary

  • DNA sequences as strings of characters
  • Naive and Boyer-Moore Algorithms
  • Edit distance
  • Aligned a FASTQ to a reference genome