Added pre-processing command to annotate raw FAST5s with basecalled sequence from a FASTQ. Also fixed several minor bugs and updated documentation. Added a note to docs for RNA addressing #9. Corrected minor bug in SAM parsing, closes #10. Re-factored Tombo events access code, closes #11 and closes #14. Minor fix to wiggle output function, closes #13. Minor fix to extract genome sequence from reads function, closes #15. Bumped to version 1.1.1
marcus1487 committed Dec 21, 2017
1 parent 93f28a1 commit c7c08a7
Showing 19 changed files with 912 additions and 618 deletions.
25 changes: 19 additions & 6 deletions README.rst
@@ -1,6 +1,9 @@
=======
Summary
=======
=============
Tombo Summary
=============

.. image:: https://travis-ci.org/nanoporetech/tombo.svg?branch=master
    :target: https://travis-ci.org/nanoporetech/tombo

Tombo is a suite of tools primarily for the identification of modified nucleotides from nanopore sequencing data.

@@ -10,6 +13,14 @@ Tombo also provides tools for the analysis and visualization of raw nanopore sig
Installation
============

|bioconda_badge| |pypi_badge|

.. |bioconda_badge| image:: https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat-square
    :target: http://bioconda.github.io/recipes/ont-tombo/README.html

.. |pypi_badge| image:: https://badge.fury.io/py/ont-tombo.svg
    :target: https://badge.fury.io/py/ont-tombo

Basic tombo installation (python2.7 support only)

::
@@ -44,16 +55,18 @@ Re-squiggle (Raw Data Alignment)

..
FAST5 files need not contain Events data, but must contain Fastq slot.
Only R9.4/5 data is supported at this time.

DNA or RNA is automatically determined from FAST5s (set explicitly with `--dna` or `--rna`).

Only R9.4/5 data (DNA or RNA) is supported at this time. Processing of other samples may produce sub-optimal results.
FAST5 files need not contain Events data, but must contain Fastq slot. See `annotate_raw_with_fastqs` for pre-processing of raw FAST5s.

Identify Modified Bases
^^^^^^^^^^^^^^^^^^^^^^^

::

# comparing to an alternative 5mC model
# comparing to an alternative 5mC model (recommended method)
tombo test_significance --fast5-basedirs path/to/native/dna/fast5s/ \
--alternate-bases 5mC --statistics-file-basename sample_compare

5 changes: 0 additions & 5 deletions build_docs.sh

This file was deleted.

2 changes: 1 addition & 1 deletion docs/conf.py
@@ -14,7 +14,7 @@
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
sys.path.insert(0, os.path.abspath('..'))
sys.path.insert(0, os.path.abspath('../tombo'))

# -- General configuration -----------------------------------------------------

3 changes: 3 additions & 0 deletions docs/examples.rst
@@ -20,6 +20,9 @@ For more details see the :doc:`re-squiggle documentation </resquiggle>`.

.. code-block:: bash

    # optionally annotate raw FAST5s with FASTQ files produced from the same reads
    tombo annotate_raw_with_fastqs --fast5-basedir <fast5s-base-directory> --fastq-filenames reads.fastq
    tombo resquiggle [-h] <fast5s-base-directory> <reference-fasta> --minimap2-executable ./minimap2
-----------------------
8 changes: 8 additions & 0 deletions docs/index.rst
@@ -10,6 +10,14 @@ Tombo also provides tools for the analysis and visualization of raw nanopore sig
Installation
------------

|bioconda_badge| |pypi_badge|

.. |bioconda_badge| image:: https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat-square
    :target: http://bioconda.github.io/recipes/ont-tombo/README.html

.. |pypi_badge| image:: https://badge.fury.io/py/ont-tombo.svg
    :target: https://badge.fury.io/py/ont-tombo

Basic Tombo installation (python2.7 support only)::

# install via bioconda environment
49 changes: 34 additions & 15 deletions docs/resquiggle.rst
@@ -8,11 +8,11 @@ The re-squiggle algorithm is the basis for the Tombo framework. The re-squiggle

TL;DR:

* Re-squiggle must be run before any other Tombo command
* Re-squiggle must be run before any other Tombo command (aside from the `annotate_raw_with_fastqs` pre-processing sub-command)
* Minimally the command takes a directory containing FAST5 files, a genome reference and an executable mapper.
* FAST5 files must contain basecalls (as produced by albacore with fast5 mode), but need not contain the "Events" table
* FAST5 files must contain basecalls (as produced by albacore in fast5 mode or added with `annotate_raw_with_fastqs`), but need not contain the "Events" table
* Tombo currently only supports R9.4 and R9.5 data (via included default models). Other data may produce sub-optimal results
* DNA and RNA reads will be detected automatically and processed accordingly
* DNA and RNA reads will be detected automatically and processed accordingly (set explicitly with `--dna` or `--rna`)

-----------------
Algorithm Details
@@ -95,18 +95,6 @@ Resolve Skipped Bases

After the dynamic programming step, skipped bases must be resolved using the raw signal to obtain a matching of each genomic base to a bit of raw signal. A region around each skipped genomic base is identified. Then a dynamic programming algorithm very similar to the last step is performed, except the raw signal is used instead of events and the skip move is not allowed. Additionally, the algorithm forces each genomic base to be assigned at least 3 raw observations to produce more robust assignments. After this procedure the full genomic sequence has raw signal assigned.
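
The minimum-observations constraint described above can be illustrated with a small dynamic program. The following is a simplified sketch, not Tombo's implementation: the function name ``assign_signal``, the squared-error cost, and the per-base expected ``levels`` input are all assumptions made for illustration. It splits a stretch of raw signal into one contiguous segment per base, with no skips allowed and at least ``min_obs`` observations per base.

```python
def assign_signal(signal, levels, min_obs=3):
    """Return the start index of each base's segment of `signal`.

    Hypothetical helper: `levels` holds one expected signal level per base.
    Every base receives at least `min_obs` raw observations.
    """
    n, k = len(signal), len(levels)
    if n < k * min_obs:
        raise ValueError("not enough raw observations for this many bases")

    # Prefix sums give the squared-error cost of any segment in O(1).
    ps, ps2 = [0.0], [0.0]
    for x in signal:
        ps.append(ps[-1] + x)
        ps2.append(ps2[-1] + x * x)

    def cost(a, b, level):
        # Sum of (x - level)^2 over signal[a:b].
        return (ps2[b] - ps2[a]) - 2 * level * (ps[b] - ps[a]) + (b - a) * level * level

    inf = float("inf")
    # best[i][j]: minimal cost of assigning signal[:j] to the first i bases.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for i in range(1, k + 1):
        # Leave room for the remaining (k - i) bases.
        for j in range(i * min_obs, n - (k - i) * min_obs + 1):
            for s in range((i - 1) * min_obs, j - min_obs + 1):
                c = best[i - 1][s] + cost(s, j, levels[i - 1])
                if c < best[i][j]:
                    best[i][j] = c
                    back[i][j] = s

    # Trace back the segment starts from the last base to the first.
    starts, j = [], n
    for i in range(k, 0, -1):
        j = back[i][j]
        starts.append(j)
    return starts[::-1]
```

Calling ``assign_signal(signal, levels)`` returns one start index per base; the end of each segment is the start of the next (or the end of the signal for the last base).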

------------------
Tombo FAST5 Format
------------------

The re-squiggle algorithm writes the sequence to signal assignment back into the read FAST5 files (found in the ``--corrected-group`` slot; the default value is the default for all other Tombo commands to read in this data). When running the re-squiggle algorithm a second time on a set of reads, the ``--overwrite`` option is required in order to write over the previous Tombo results.

The Tombo slot contains several useful bits of information. The ``--corrected-group`` slot contains attributes for the signal normalization (shift, scale, upper_limit, lower_limit and outlier_threshold) as well as a binary flag indicating whether the read is DNA or RNA. Within the ``Alignment`` group, the genomic mapped start, end, strand and chromosome are stored, along with the mapping statistics (number of clipped start and end bases, and matching, mismatching, inserted and deleted bases). Note that this information is not enabled at this time, but should be added back soon.

The ``Events`` slot contains a matrix containing the matching of raw signal to genomic sequence. This slot contains a single attribute (``read_start_rel_to_raw``) giving the zero-based offset into the raw signal slot for genomic sequence matching. The events table then starts matching sequence from this offset. Each entry in the ``Events`` table indicates the normalized mean signal level (``norm_mean``), optionally (triggered by the ``--include-event-stdev`` option) the normalized signal standard deviation (``norm_stdev``), the start position of this base (``start``), the length of this event in raw signal values (``length``) and the genomic base (``base``).

This information is accessed as needed for down-stream Tombo processing commands.

-------------------------------
Common Failed Read Descriptions
-------------------------------
@@ -136,6 +124,22 @@ Common Failed Read Descriptions

* These errors indicate that the dynamic programming algorithm produced a poorly scored matching of genomic sequence to raw signal. This likely indicates that either the genomic mapping is incorrect or that the raw signal is of low quality in some sense.

------------------
Tombo FAST5 Format
------------------

The re-squiggle algorithm writes the sequence to signal assignment back into the read FAST5 files (found in the ``--corrected-group`` slot; the default value is the default for all other Tombo commands to read in this data). When running the re-squiggle algorithm a second time on a set of reads, the ``--overwrite`` option is required in order to write over the previous Tombo results.

The Tombo slot contains several useful bits of information. The ``--corrected-group`` slot contains attributes for the signal normalization (shift, scale, upper_limit, lower_limit and outlier_threshold) as well as a binary flag indicating whether the read is DNA or RNA. Within the ``Alignment`` group, the genomic mapped start, end, strand and chromosome are stored, along with the mapping statistics (number of clipped start and end bases, and matching, mismatching, inserted and deleted bases). Note that this information is not enabled at this time, but should be added back soon.

The ``Events`` slot contains a matrix containing the matching of raw signal to genomic sequence. This slot contains a single attribute (``read_start_rel_to_raw``) giving the zero-based offset into the raw signal slot for genomic sequence matching. The events table then starts matching sequence from this offset. Each entry in the ``Events`` table indicates the normalized mean signal level (``norm_mean``), optionally (triggered by the ``--include-event-stdev`` option) the normalized signal standard deviation (``norm_stdev``), the start position of this base (``start``), the length of this event in raw signal values (``length``) and the genomic base (``base``).

This information is accessed as needed for down-stream Tombo processing commands.

This data generally adds ~75% to the memory footprint of a minimal FAST5 file (containing raw and sequence data). This may vary across files and sample types.

**Important RNA note**: RNA reads pass through the pore in the 3' to 5' direction during sequencing. As such, the raw signal and albacore events are stored in the reverse direction from the genome. Tombo events are stored in the opposite direction (corresponding to the genome sequence direction, not sequencing time direction) for several practical reasons. Thus if events are to be compared to the raw signal, the raw signal must be reversed.
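
As a rough illustration of how the ``Events`` fields and the RNA note fit together, the sketch below slices a raw signal into per-base segments. It is a minimal example under stated assumptions, not Tombo's own access code: the helper name ``signal_per_base`` is hypothetical, events are represented as plain dictionaries with the ``start``, ``length`` and ``base`` fields described above, ``start`` is assumed to be relative to ``read_start_rel_to_raw``, and for RNA the offset is assumed to apply to the reversed raw signal.

```python
def signal_per_base(raw_signal, events, read_start_rel_to_raw, is_rna=False):
    """Return a list of (base, raw signal segment) pairs.

    Hypothetical helper. `events` is a sequence of dicts with the keys
    "start", "length" and "base" (mirroring the Events table fields).
    """
    # RNA raw signal is stored in sequencing-time (3' to 5') order, while
    # Tombo events follow the genome direction, so reverse the signal first.
    signal = raw_signal[::-1] if is_rna else raw_signal

    segments = []
    for event in events:
        # Assumption: event starts are offsets from read_start_rel_to_raw.
        begin = read_start_rel_to_raw + event["start"]
        end = begin + event["length"]
        segments.append((event["base"], signal[begin:end]))
    return segments
```

With real data the raw signal and events would be read from the FAST5 file; plain lists and dictionaries are used here only to keep the example self-contained.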

----------------
Tombo Index File
----------------
@@ -153,3 +157,18 @@ Additional Command Line Options
``--obs-per-base-filter``

* This option applies a filter to "stuck" reads (too many observations per genomic base). This filter is applied only to the index file and can be cleared later. See the Filters section for more details.

---------------------
Pre-process Raw Reads
---------------------

Nanopore raw signal-space data consumes more memory than sequence-space data. As such, many users will produce only FASTQ basecalls initially from a set of raw reads in FAST5 format. The Tombo framework requires the linking of these basecalls with the raw signal-space data. The `annotate_raw_with_fastqs` sub-command is provided to assist with this workflow.

Given a directory (or nested directories) of FAST5 raw read files and a set of FASTQ format basecalls from these reads, the `annotate_raw_with_fastqs` sub-command adds the sequence information from the FASTQ files to the appropriate FAST5 files. This command generally adds 15-25% to the disk usage for the raw reads.

This functionality requires that the FASTQ sequence header lines begin with the read identifier from the FAST5 file. This has been tested with the Oxford Nanopore Technologies supported basecaller, albacore. Output from third-party basecallers may not be processed correctly.

.. code-block:: bash

    tombo annotate_raw_with_fastqs --fast5-basedir <fast5s-base-directory> --fastq-filenames reads.fastq
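
The header requirement above can be sketched in a few lines of Python. This is an illustrative example only, not the code behind ``annotate_raw_with_fastqs``: the function name ``index_fastq_by_read_id`` is hypothetical, and the sketch assumes simple four-line FASTQ records in which the read identifier is the first whitespace-delimited token after the leading ``@``.

```python
def index_fastq_by_read_id(fastq_lines):
    """Map read identifier -> (sequence, quality) for four-line FASTQ records.

    Hypothetical helper; assumes each header begins with "@<read_id>".
    """
    lines = [line.rstrip("\n") for line in fastq_lines]
    records = {}
    # Step through the records four lines at a time.
    for i in range(0, len(lines) - 3, 4):
        header, seq, _plus, qual = lines[i:i + 4]
        fields = header[1:].split()
        if not header.startswith("@") or not fields:
            raise ValueError("FASTQ header does not begin with a read identifier")
        records[fields[0]] = (seq, qual)
    return records
```

The resulting dictionary could then be looked up with the read identifier stored in each FAST5 file; reads whose identifier is missing from the dictionary would have no basecalls to attach.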
31 changes: 18 additions & 13 deletions setup.py
@@ -6,14 +6,14 @@
from setuptools import setup, Extension
from setuptools.command.build_ext import build_ext as _build_ext

# Get the version number from version.py, and exe_path
verstrline = open(os.path.join('tombo', 'version.py'), 'r').read()
# Get the version number from _version.py, and exe_path
verstrline = open(os.path.join('tombo', '_version.py'), 'r').read()
vsre = r"^TOMBO_VERSION = ['\"]([^'\"]*)['\"]"
mo = re.search(vsre, verstrline)
if mo:
    __version__ = mo.group(1)
else:
    raise RuntimeError('Unable to find version string in "tombo/version.py".'.format(__pkg_name__))
    raise RuntimeError('Unable to find version string in "tombo/_version.py".'.format(__pkg_name__))

def readme():
    with open('README.rst') as f:
@@ -34,6 +34,20 @@ def readme():
if not sys.version_info[0] == 2:
    sys.exit("Sorry, Python 3 is not supported (yet)")

ext_modules = [
    Extension("tombo.dynamic_programming",
              ["tombo/dynamic_programming.pyx"],
              include_dirs=include_dirs,
              language="c++"),
    Extension("tombo.c_helper",
              ["tombo/c_helper.pyx"],
              include_dirs=include_dirs,
              language="c++")
]

for e in ext_modules:
    e.cython_directives = {"embedsignature": True}

setup(
    name = "ont-tombo",
    version = __version__,
@@ -57,16 +71,7 @@ def readme():
        ]
    },
    include_package_data=True,
    ext_modules=[
        Extension("tombo.dynamic_programming",
                  ["tombo/dynamic_programming.pyx"],
                  include_dirs=include_dirs,
                  language="c++"),
        Extension("tombo.c_helper",
                  ["tombo/c_helper.pyx"],
                  include_dirs=include_dirs,
                  language="c++")
    ],
    ext_modules=ext_modules,
    test_suite='nose2.collector.collector',
    tests_require=['nose2'],
)
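
One detail of the version lookup in setup.py is worth illustrating: the pattern is anchored with ``^``, and without ``re.MULTILINE`` that anchor only matches at the very start of the file contents, so the lookup succeeds only when the assignment is on the first line. The sketch below shows a variant that tolerates leading lines. It is not part of this commit; the helper name ``find_version`` and the sample file contents are assumptions for illustration.

```python
import re

# Same pattern as in setup.py, but compiled with re.MULTILINE so that "^"
# matches at the start of any line, not only at the start of the text.
VERSION_RE = re.compile(r"^TOMBO_VERSION = ['\"]([^'\"]*)['\"]", re.MULTILINE)


def find_version(text):
    """Return the TOMBO_VERSION string found in `text` (hypothetical helper)."""
    match = VERSION_RE.search(text)
    if match is None:
        raise RuntimeError("Unable to find version string")
    return match.group(1)
```

For example, ``find_version('# comment\nTOMBO_VERSION = "1.1.1"\n')`` would still locate the assignment on the second line.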