Skip to content

claytonjwong/Algorithms-DNA-Sequencing

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

35 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Algorithms for DNA Sequencing

DNA Sequencing

https://en.wikipedia.org/wiki/DNA_sequencing

DNA sequencing is the process of determining the nucleic acid sequence – the order of nucleotides in DNA. It includes any method or technology that is used to determine the order of the four bases: adenine, guanine, cytosine, and thymine. The advent of rapid DNA sequencing methods has greatly accelerated biological and medical research and discovery.

3 Laws of Assembly

  1. If the suffix of read A is similar to the prefix of read B then A and B might overlap in the genome
  2. More coverage leads to more and longer overlaps
  3. Repeats make assembly difficult

Lectures

  1. Introduction
  2. Why Study This?
  3. DNA sequencing past and present
  4. Genomes as strings, reads as substrings
  5. String definitions and Python examples
  6. How DNA gets copied
  7. How second-generation sequencers work
  8. Sequencing errors and base qualities
  9. Working with sequencing reads (FASTQ format)
  10. Sequencers give pieces to genomic puzzles
  11. Read alignment and why it's hard
  12. Naive exact matching

Assignment

Resources

Lectures

  1. Boyer-Moore: Basics
  2. Boyer-Moore: Putting It All Together
  3. Diversion: Repetitive Elements
  4. Preprocessing
  5. Indexing and K-mers
  6. Data Structures for Indexing
  7. Hash Tables for Indexing
  8. Variations on K-mer Indexes
  9. Indexing by Suffix
  10. Approximate Matching, Hamming, and Edit Distance
  11. Pigeonhole Principle

Assignment

Resources

Lectures

  1. Edit Distance (part 1)
  2. Edit Distance (part 2)
  3. Edit Distance (part 3)
  4. Edit Distance (part 4)
  5. Global and Local Alignment
  6. De Novo Shotgun Assembly
  7. 1st and 2nd Laws of Assembly: Overlaps and Coverage
  8. Overlap Graph

Assignment

Resources

Lectures

  1. Shortest Common Substring
  2. Greedy Shortest Common Substring
  3. 3rd Law of Assembly: Repeats are Bad
  4. De Bruijn Graphs and Eulerian Walks
  5. When Eulerian Walks Go Wrong
  6. Assemblers in Practice
  7. The Future is Long?
  8. Wrap Up

Resources

Assignment

Utility Functions

def read_FAST_A(filename):
    genome = ''
    with open(filename, 'r') as f:
        for line in f:
            # ignore header line with genome information
            if not line[0] == '>':
                genome += line.rstrip()
    return genome
def readFAST_Q(filename):
    sequences = []
    qualities = []
    with open(filename) as fh:
        while True:
            fh.readline()  # skip name line
            seq = fh.readline().rstrip()  # read base sequence
            fh.readline()  # skip placeholder line
            qual = fh.readline().rstrip() # base quality line
            if len(seq) == 0:
                break
            sequences.append(seq)
            qualities.append(qual)
    return sequences, qualities

External Resources

Supplemental

  • Jupyter Notebooks can be executed from the command line:
$ jupyter notebook 1_notebook.ipynb
[I 11:45:05.991 NotebookApp] The Jupyter Notebook is running at:
[I 11:45:05.991 NotebookApp] http://localhost:8889/?token=070644d6de70204df12235b2356476b577d0744b5df41422
[I 11:45:05.991 NotebookApp]  or http://127.0.0.1:8889/?token=070644d6de70204df12235b2356476b577d0744b5df41422
[I 11:45:05.991 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).

[C 11:45:05.998 NotebookApp]
    To access the notebook, open this file in a browser:
        file:///Users/.../Library/Jupyter/runtime/nbserver-38748-open.html
    Or copy and paste one of these URLs:
        http://localhost:8889/?token=070644d6de70204df12235b2356476b577d0744b5df41422
     or http://127.0.0.1:8889/?token=070644d6de70204df12235b2356476b577d0744b5df41422
  • The Python file for each Jupyter Notebook can be executed using ipython. If python is used to execute then the following error will occur:
NameError: name 'get_ipython' is not defined

Auxillary

K-mers are a fundamental concept for creating "words" from a DNA sequencing read. These "words" are abstracted to computer science string algorithms (ie. simply finding pattern in text).

For example, a DNA substring consisting of two neucleotides is a 2-mer (regardless of Mr. Schwarzenegger's beliefs):

About

🧬 Algorithms for DNA Sequencing by Johns Hopkins University

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •