https://en.wikipedia.org/wiki/DNA_sequencing
DNA sequencing is the process of determining the nucleic acid sequence – the order of nucleotides in DNA. It includes any method or technology that is used to determine the order of the four bases: adenine, guanine, cytosine, and thymine. The advent of rapid DNA sequencing methods has greatly accelerated biological and medical research and discovery.
- If the suffix of read A is similar to the prefix of read B then A and B might overlap in the genome
- See Week 3 - Lecture 7 for details
- More coverage leads to more and longer overlaps
- See Week 3 - Lecture 7 for details
- Repeats make assembly difficult
- See Week 4 - Lecture 3 for details
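The first point above can be sketched in a few lines of Python. This is only an illustration: it assumes exact matching (the notes say "similar", which would also allow mismatches), and the function name `overlap` and the `min_length` parameter are chosen here rather than taken from the notes.

```python
def overlap(a, b, min_length=3):
    """Return the length of the longest suffix of `a` that exactly matches
    a prefix of `b`, provided it is at least `min_length` long; else 0."""
    start = 0
    while True:
        # Look for the first min_length characters of b somewhere in a.
        start = a.find(b[:min_length], start)
        if start == -1:
            return 0  # no candidate position left
        # A candidate counts only if the whole suffix of a is a prefix of b.
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1  # otherwise keep searching further right


print(overlap('TTACGT', 'CGTACCGT'))  # suffix 'CGT' matches prefix 'CGT'
print(overlap('TTACGT', 'GTACCGT'))   # no overlap of length >= 3
```

Requiring a minimum length keeps short chance matches from being reported as overlaps.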
- Introduction
- Why Study This?
- DNA sequencing past and present
- Genomes as strings, reads as substrings
- String definitions and Python examples
- How DNA gets copied
- How second-generation sequencers work
- Sequencing errors and base qualities
- Working with sequencing reads (FASTQ format)
- Sequencers give pieces to genomic puzzles
- Read alignment and why it's hard
- Naive exact matching
- Boyer-Moore: Basics
- Boyer-Moore: Putting It All Together
- Diversion: Repetitive Elements
- Preprocessing
- Indexing and K-mers
- Data Structures for Indexing
- Hash Tables for Indexing
- Variations on K-mer Indexes
- Indexing by Suffix
- Approximate Matching, Hamming, and Edit Distance
- Pigeonhole Principle
- Edit Distance (part 1)
- Edit Distance (part 2)
- Edit Distance (part 3)
- Edit Distance (part 4)
- Global and Local Alignment
- De Novo Shotgun Assembly
- 1st and 2nd Laws of Assembly: Overlaps and Coverage
- Overlap Graph
- Shortest Common Superstring
- Greedy Shortest Common Superstring
- 3rd Law of Assembly: Repeats are Bad
- De Bruijn Graphs and Eulerian Walks
- When Eulerian Walks Go Wrong
- Assemblers in Practice
- The Future is Long?
- Wrap Up
def read_FAST_A(filename):
    genome = ''
    with open(filename, 'r') as f:
        for line in f:
            # ignore header line with genome information
            if not line.startswith('>'):
                genome += line.rstrip()
    return genome
def readFAST_Q(filename):
    sequences = []
    qualities = []
    with open(filename) as fh:
        while True:
            fh.readline()  # skip name line
            seq = fh.readline().rstrip()  # read base sequence
            fh.readline()  # skip placeholder line
            qual = fh.readline().rstrip()  # base quality line
            if len(seq) == 0:
                break  # an empty sequence line means end of file
            sequences.append(seq)
            qualities.append(qual)
    return sequences, qualities
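The quality strings returned by a FASTQ reader are encoded characters, not numbers. A minimal sketch of decoding them follows, assuming the common Phred+33 encoding (older files may use a different offset). The helper names `phred33_to_q` and `q_to_error_probability` are illustrative, not part of the notes.

```python
def phred33_to_q(char):
    """Convert one Phred+33 encoded character to its integer quality Q."""
    return ord(char) - 33  # assumes a Phred+33 encoded file


def q_to_error_probability(q):
    """Convert a quality Q to the estimated probability the base call is wrong."""
    return 10.0 ** (-q / 10.0)


qual = 'II#5'  # a made-up quality string, for illustration only
scores = [phred33_to_q(c) for c in qual]
print(scores)
print([q_to_error_probability(q) for q in scores])
```

A higher Q means a more confident base call: Q = 30 corresponds to an estimated error probability of about 1 in 1000.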
- Jupyter Notebooks can be executed from the command line:
$ jupyter notebook 1_notebook.ipynb
[I 11:45:05.991 NotebookApp] The Jupyter Notebook is running at:
[I 11:45:05.991 NotebookApp] http://localhost:8889/?token=070644d6de70204df12235b2356476b577d0744b5df41422
[I 11:45:05.991 NotebookApp] or http://127.0.0.1:8889/?token=070644d6de70204df12235b2356476b577d0744b5df41422
[I 11:45:05.991 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 11:45:05.998 NotebookApp]
To access the notebook, open this file in a browser:
file:///Users/.../Library/Jupyter/runtime/nbserver-38748-open.html
Or copy and paste one of these URLs:
http://localhost:8889/?token=070644d6de70204df12235b2356476b577d0744b5df41422
or http://127.0.0.1:8889/?token=070644d6de70204df12235b2356476b577d0744b5df41422
- The Python file for each Jupyter Notebook can be executed using `ipython`. If `python` is used to execute it, the following error will occur:
NameError: name 'get_ipython' is not defined
K-mers are a fundamental concept for creating "words" from a DNA sequencing read. These "words" let us apply general computer science string algorithms (i.e., simply finding a pattern in a text).
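As a rough sketch of how such "words" can be produced, the function below slides a window of length k across a read. The name `kmers` and the sample read are assumptions made for illustration.

```python
def kmers(read, k):
    """Return every length-k substring (k-mer) of `read`, left to right."""
    if k <= 0:
        raise ValueError('k must be a positive integer')
    # A read of length n contains n - k + 1 overlapping k-mers.
    return [read[i:i + k] for i in range(len(read) - k + 1)]


print(kmers('GATTACA', 2))  # ['GA', 'AT', 'TT', 'TA', 'AC', 'CA']
```

If k is larger than the read, the result is an empty list.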
For example, a DNA substring consisting of two nucleotides is a 2-mer (regardless of Mr. Schwarzenegger's beliefs):