Added pre-processing command to annotate raw FAST5s with basecalled sequence from a FASTQ.

Also fixed several minor bugs and updated documentation. Added a note to docs for RNA addressing #9. Corrected minor bug in SAM parsing, closes #10. Re-factored Tombo events access code, closes #11 and closes #14. Minor fix to wiggle output function, closes #13. Minor fix to extract genome sequence from reads function, closes #15. Bumped to version 1.1.1.
nanoporetech · Dec 21, 2017 · c7c08a7
1 parent 93f28a1
Showing 19 changed files with 912 additions and 618 deletions.
diff --git a/README.rst b/README.rst
@@ -1,6 +1,9 @@
-=======
-Summary
-=======
+=============
+Tombo Summary
+=============
+
+.. image:: https://travis-ci.org/nanoporetech/tombo.svg?branch=master
+    :target: https://travis-ci.org/nanoporetech/tombo
 
 Tombo is a suite of tools primarily for the identification of modified nucleotides from nanopore sequencing data.
 
@@ -10,6 +13,14 @@ Tombo also provides tools for the analysis and visualization of raw nanopore sig
 Installation
 ============
 
+|bioconda_badge| |pypi_badge|
+
+.. |bioconda_badge| image:: https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat-square
+    :target: http://bioconda.github.io/recipes/ont-tombo/README.html
+
+.. |pypi_badge| image:: https://badge.fury.io/py/ont-tombo.svg
+    :target: https://badge.fury.io/py/ont-tombo
+
 Basic tombo installation (python2.7 support only)
 
 ::
@@ -44,16 +55,18 @@ Re-squiggle (Raw Data Alignment)
 
 ..
 
-    FAST5 files need not contain Events data, but must contain Fastq slot.
+    Only R9.4/5 data is supported at this time.
+
+    DNA or RNA is automatically determined from FAST5s (set explicitly with `--dna` or `--rna`).
 
-    Only R9.4/5 data (DNA or RNA) is supported at this time. Processing of other samples may produce sub-optimal results.
+    FAST5 files need not contain Events data, but must contain a Fastq slot. See `annotate_raw_with_fastqs` for pre-processing of raw FAST5s.
 
 Identify Modified Bases
 ^^^^^^^^^^^^^^^^^^^^^^^
 
 ::
 
-    # comparing to an alternative 5mC model
+    # comparing to an alternative 5mC model (recommended method)
     tombo test_significance --fast5-basedirs path/to/native/dna/fast5s/ \
         --alternate-bases 5mC --statistics-file-basename sample_compare
 

diff --git a/build_docs.sh b/build_docs.sh
diff --git a/docs/conf.py b/docs/conf.py
@@ -14,7 +14,7 @@
 # If extensions (or modules to document with autodoc) are in another directory,
 # add these directories to sys.path here. If the directory is relative to the
 # documentation root, use os.path.abspath to make it absolute, like shown here.
-sys.path.insert(0, os.path.abspath('..'))
+sys.path.insert(0, os.path.abspath('../tombo'))
 
 # -- General configuration -----------------------------------------------------
 

diff --git a/docs/examples.rst b/docs/examples.rst
@@ -20,6 +20,9 @@ For more details see the :doc:`re-squiggle documentation </resquiggle>`.
 
 .. code-block:: bash
 
+    # optionally annotate raw FAST5s with FASTQ files produced from the same reads
+    tombo annotate_raw_with_fastqs --fast5-basedir <fast5s-base-directory> --fastq-filenames reads.fastq
+
     tombo resquiggle [-h] <fast5s-base-directory> <reference-fasta> --minimap2-executable ./minimap2
 
 -----------------------

diff --git a/docs/index.rst b/docs/index.rst
@@ -10,6 +10,14 @@ Tombo also provides tools for the analysis and visualization of raw nanopore sig
 Installation
 ------------
 
+|bioconda_badge| |pypi_badge|
+
+.. |bioconda_badge| image:: https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat-square
+    :target: http://bioconda.github.io/recipes/ont-tombo/README.html
+
+.. |pypi_badge| image:: https://badge.fury.io/py/ont-tombo.svg
+    :target: https://badge.fury.io/py/ont-tombo
+
 Basic Tombo installation (python2.7 support only)::
 
     # install via bioconda environment

diff --git a/docs/resquiggle.rst b/docs/resquiggle.rst
@@ -8,11 +8,11 @@ The re-squiggle algorithm is the basis for the Tombo framework. The re-squiggle
 
 TL;DR:
 
-*  Re-squiggle must be run before any other Tombo command
+*  Re-squiggle must be run before any other Tombo command (aside from the `annotate_raw_with_fastqs` pre-processing sub-command)
 *  Minimally the command takes a directory containing FAST5 files, a genome reference and an executable mapper.
-*  FAST5 files must contain basecalls (as produced by albacore with fast5 mode), but need not contain the "Events" table
+*  FAST5 files must contain basecalls (as produced by albacore in fast5 mode or added with `annotate_raw_with_fastqs`), but need not contain the "Events" table
 *  Tombo currently only supports R9.4 and R9.5 data (via included default models). Other data may produce sub-optimal results
-*  DNA and RNA reads will be detected automatically and processed accordingly
+*  DNA and RNA reads will be detected automatically and processed accordingly (set explicitly with `--dna` or `--rna`)
 
 -----------------
 Algorithm Details
@@ -95,18 +95,6 @@ Resolve Skipped Bases
 
 After the dynamic programming step, skipped bases must be resolved using the raw signal to obtain a matching of each genomic base to a bit of raw signal. A region around each skipped genomic base is identified. Then a dynamic programming algorithm very similar to the last step is performed, except the raw signal is used instead of events and the skip move is not allowed. Additionally, the algorithm forces each genomic base to be assigned at least 3 raw observations to produce more robust assignments. After this procedure the full genomic sequence has raw signal assigned.
 
-------------------
-Tombo FAST5 Format
-------------------
-
-The result of the re-squiggle algorithm writes the sequence to signal assignment back into the read FAST5 files (found in the ``--corrected-group`` slot; the default value is the default for all other Tombo commands to read in this data). When running the re-squiggle algorithm a second time on a set of reads, the --overwrite option is required in order to write over the previous Tombo results.
-
-The Tombo slot contains several useful bits of information. The ``--corrected-group`` slot contains attributes for the signal normalization (shift, scale, upper_limit, lower_limit and outlier_threshold) as well as a binary flag indicating whether the read is DNA or RNA. Within the ``Alignment`` group, the gemomic mapped start, end, strand and chromosome are stored. The mapping statistics (number clipped start and end bases, matching, mismatching, insertioned and deleted bases). Note that this information is not enabled at this time, but should be added back soon.
-
-The ``Events`` slot contains a matrix containing the matching of raw signal to genomic sequence. This slot contains a single attribute (``read_start_rel_to_raw``) giving the zero-based offset into the raw signal slot for genomic sequence matching. The events table then starts matching sequence from this offset. Each entry in the ``Events`` table indicates the normalized mean signal level (``norm_mean``), optionally (triggered by the ``--include-event-stdev`` option) the normalized signal standard deviation (``norm_stdev``), the start position of this base (``start``), the length of this event in raw signal values (``length``) and the genomic base (``base``).
-
-This information is accessed as needed for down-stream Tombo processing commands.
-
 -------------------------------
 Common Failed Read Descriptions
 -------------------------------
@@ -136,6 +124,22 @@ Common Failed Read Descriptions
 
 *  These errors indicate that the dynamic programming algorithm produced a poorly scored matching of genomic sequence to raw signal. This likely indicates that either the genomic mapping is incorrect or that the raw signal is of low quality in some sense.
 
+------------------
+Tombo FAST5 Format
+------------------
+
+The re-squiggle algorithm writes the sequence to signal assignment back into the read FAST5 files (found in the ``--corrected-group`` slot; the default value is the default for all other Tombo commands to read in this data). When running the re-squiggle algorithm a second time on a set of reads, the ``--overwrite`` option is required in order to write over the previous Tombo results.
+
+The Tombo slot contains several useful bits of information. The ``--corrected-group`` slot contains attributes for the signal normalization (shift, scale, upper_limit, lower_limit and outlier_threshold) as well as a binary flag indicating whether the read is DNA or RNA. Within the ``Alignment`` group, the genomic mapped start, end, strand and chromosome are stored. The mapping statistics (number of clipped start and end bases, and of matching, mismatching, inserted and deleted bases) are stored here as well. Note that this information is not enabled at this time, but should be added back soon.
+
+The ``Events`` slot contains a matrix containing the matching of raw signal to genomic sequence. This slot contains a single attribute (``read_start_rel_to_raw``) giving the zero-based offset into the raw signal slot for genomic sequence matching. The events table then starts matching sequence from this offset. Each entry in the ``Events`` table indicates the normalized mean signal level (``norm_mean``), optionally (triggered by the ``--include-event-stdev`` option) the normalized signal standard deviation (``norm_stdev``), the start position of this base (``start``), the length of this event in raw signal values (``length``) and the genomic base (``base``).
+
+This information is accessed as needed for down-stream Tombo processing commands.
+
+This data generally adds ~75% to the memory footprint of a minimal FAST5 file (containing raw and sequence data). This may vary across files and sample types.
+
+**Important RNA note**: RNA reads pass through the pore in the 3' to 5' direction during sequencing. As such, the raw signal and albacore events are stored in the reverse direction from the genome. Tombo events are stored in the opposite direction (corresponding to the genome sequence direction, not sequencing time direction) for several practical reasons. Thus if events are to be compared to the raw signal, the raw signal must be reversed.
+
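Review note: the ``Events`` layout and the RNA direction note above can be illustrated with a minimal, pure-Python sketch. The helper name `per_base_means` and the toy values are hypothetical and not part of Tombo. It assumes that each event `start` is relative to `read_start_rel_to_raw`, and that for RNA the offset is applied after the raw signal has been reversed; neither detail is confirmed by this diff.

```python
def per_base_means(raw, read_start_rel_to_raw, starts, lengths, is_rna=False):
    """Mean raw signal level per genomic base (hypothetical helper)."""
    raw = list(raw)
    if is_rna:
        # RNA raw signal is stored in sequencing-time (3' to 5') order, while
        # Tombo events follow the genome direction, so reverse the raw signal.
        # Assumption: the offset below then applies to the reversed signal.
        raw = raw[::-1]
    # Assumption: event starts are relative to read_start_rel_to_raw.
    segment = raw[read_start_rel_to_raw:]
    means = []
    for start, length in zip(starts, lengths):
        values = segment[start:start + length]
        means.append(sum(values) / float(len(values)))
    return means


# Toy signal: two leading observations are skipped, then two bases follow.
toy_raw = [9, 9, 1, 1, 1, 4, 4, 4, 4]
dna = per_base_means(toy_raw, 2, starts=[0, 3], lengths=[3, 4])
rna = per_base_means(toy_raw, 0, starts=[0, 4], lengths=[4, 3], is_rna=True)
```

In the RNA call the same raw values are read back to front, which is the reversal the note asks for before events are compared to raw signal.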
 ----------------
 Tombo Index File
 ----------------
@@ -153,3 +157,18 @@ Additional Command Line Options
 ``--obs-per-base-filter``
 
 *  This option applies a filter to "stuck" reads (too many observations per genomic base). This filter is applied only to the index file and can be cleared later. See the Filters section for more details.
+
+---------------------
+Pre-process Raw Reads
+---------------------
+
+Nanopore raw signal-space data consumes more memory than sequence-space data. As such, many users will produce only FASTQ basecalls initially from a set of raw reads in FAST5 format. The Tombo framework requires the linking of these basecalls with the raw signal-space data. The `annotate_raw_with_fastqs` sub-command is provided to assist with this workflow.
+
+Given a directory (or nested directories) of FAST5 raw read files and a set of FASTQ format basecalls from these reads, the `annotate_raw_with_fastqs` sub-command adds the sequence information from the FASTQ files to the appropriate FAST5 files. This command generally adds 15-25% to the disk usage for the raw reads.
+
+This functionality requires that the FASTQ sequence header lines begin with the read identifier from the FAST5 file. This has been tested with the Oxford Nanopore Technologies supported basecaller, albacore. Third-party basecallers may not be processed correctly.
+
+.. code-block:: bash
+
+    tombo annotate_raw_with_fastqs --fast5-basedir <fast5s-base-directory> --fastq-filenames reads.fastq
+
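Review note: the header requirement described in the new "Pre-process Raw Reads" section can be checked before running the sub-command. The sketch below is not Tombo's implementation; `fastq_records_by_read_id` and the example read IDs are hypothetical. It assumes four-line FASTQ records (no wrapped sequence lines) and that the read identifier is the first whitespace-delimited token of the header.

```python
import io


def fastq_records_by_read_id(handle):
    """Map read ID -> (sequence, quality) from four-line FASTQ records."""
    records = {}
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError("Malformed FASTQ header: %r" % header)
        # Assumption: the read ID is the first whitespace-delimited token.
        read_id = header[1:].split()[0]
        records[read_id] = (seq, qual)
    return records


example = io.StringIO(
    u"@read-0001 runid=abc ch=12\nACGT\n+\n!!!!\n"
    u"@read-0002 runid=abc ch=40\nTTGA\n+\n####\n"
)
records = fastq_records_by_read_id(example)

# Hypothetical read IDs as they might be collected from the FAST5 files.
fast5_read_ids = ["read-0001", "read-0003"]
missing = [rid for rid in fast5_read_ids if rid not in records]
```

A non-empty `missing` list would suggest that some FAST5 reads have no matching FASTQ record, for example when a third-party basecaller writes a different header layout.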
diff --git a/setup.py b/setup.py
@@ -6,14 +6,14 @@
 from setuptools import setup, Extension
 from setuptools.command.build_ext import build_ext as _build_ext
 
-# Get the version number from version.py, and exe_path
-verstrline = open(os.path.join('tombo', 'version.py'), 'r').read()
+# Get the version number from _version.py, and exe_path
+verstrline = open(os.path.join('tombo', '_version.py'), 'r').read()
 vsre = r"^TOMBO_VERSION = ['\"]([^'\"]*)['\"]"
 mo = re.search(vsre, verstrline)
 if mo:
     __version__ = mo.group(1)
 else:
-    raise RuntimeError('Unable to find version string in "tombo/version.py".'.format(__pkg_name__))
+    raise RuntimeError('Unable to find version string in "tombo/_version.py".'.format(__pkg_name__))
 
 def readme():
     with open('README.rst') as f:
@@ -34,6 +34,20 @@ def readme():
 if not sys.version_info[0] == 2:
     sys.exit("Sorry, Python 3 is not supported (yet)")
 
+ext_modules = [
+    Extension("tombo.dynamic_programming",
+              ["tombo/dynamic_programming.pyx"],
+              include_dirs=include_dirs,
+              language="c++"),
+    Extension("tombo.c_helper",
+              ["tombo/c_helper.pyx"],
+              include_dirs=include_dirs,
+              language="c++")
+]
+
+for e in ext_modules:
+    e.cython_directives = {"embedsignature": True}
+
 setup(
     name = "ont-tombo",
     version = __version__,
@@ -57,16 +71,7 @@ def readme():
         ]
     },
     include_package_data=True,
-    ext_modules=[
-        Extension("tombo.dynamic_programming",
-                  ["tombo/dynamic_programming.pyx"],
-                  include_dirs=include_dirs,
-                  language="c++"),
-        Extension("tombo.c_helper",
-                  ["tombo/c_helper.pyx"],
-                  include_dirs=include_dirs,
-                  language="c++")
-    ],
+    ext_modules=ext_modules,
     test_suite='nose2.collector.collector',
     tests_require=['nose2'],
 )
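Review note: `re.search` is called here without `re.MULTILINE`, so `^` in `vsre` anchors only at the start of the file contents. The pattern therefore matches only if the `TOMBO_VERSION` assignment is the very first line of `tombo/_version.py`, which this diff does not show. A minimal sketch of a more tolerant variant follows; `parse_version` and the sample text are hypothetical.

```python
import re

# Same pattern as in setup.py, compiled with MULTILINE so that '^' matches at
# the start of any line rather than only at the start of the file.
VERSION_RE = re.compile(r"^TOMBO_VERSION = ['\"]([^'\"]*)['\"]", re.MULTILINE)


def parse_version(text):
    """Return the version string found in the text of a version module."""
    mo = VERSION_RE.search(text)
    if mo is None:
        raise RuntimeError('Unable to find version string.')
    return mo.group(1)


# Hypothetical file contents with a comment ahead of the assignment.
version_text = '# hypothetical leading comment\nTOMBO_VERSION = "1.1.1"\n'
version = parse_version(version_text)

# Without MULTILINE the original pattern does not match this text.
strict = re.search(r"^TOMBO_VERSION = ['\"]([^'\"]*)['\"]", version_text)
```

If `_version.py` keeps the assignment on its first line, the existing code behaves the same as this variant.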