Merge branch 'master' into issue-303

# Conflicts:
#	tiny/cwl/tools/tiny-count.cwl
#	tiny/rna/counter/counter.py
MontgomeryLab · May 2, 2023 · e2704c0
2 parents 4715ec6 + 5aac71b
commit e2704c0
Showing 38 changed files with 657 additions and 364 deletions.
diff --git a/START_HERE/paths.yml b/START_HERE/paths.yml
@@ -1,14 +1,17 @@
 ############################## MAIN INPUT FILES FOR ANALYSIS ##############################
 #
-# Edit this section to provide the path to your Samples and Features sheets. Relative and
-# absolute paths are both allowed. All relative paths are relative to THIS config file.
+# Relative and absolute paths are both allowed.
+# All relative paths are evaluated relative to THIS config file.
 #
 # Directions:
 #   1. Fill out the Samples Sheet with files to process + naming scheme. [samples.csv]
 #   2. Fill out the Features Sheet with selection rules [features.csv]
-#   3. Set samples_csv and features_csv (below) to point to these files
+#   3. Set samples_csv and features_csv to point to these files
 #   4. Add annotation files and per-file alias preferences to gff_files (optional)
 #
+# If using the tinyRNA workflow, additionally set ebwt and/or reference_genome_files
+# in the BOWTIE-BUILD section.
+#
 ######-------------------------------------------------------------------------------######
 
 ##-- Path to Sample & Features Sheets (relative paths are relative to this config file) --##
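
The header comment in this hunk says that relative paths are evaluated relative to the config file itself, not the working directory. A minimal sketch of that rule follows. The helper name `resolve_from_config` is hypothetical and is not taken from tinyRNA; `PurePosixPath` is used only so the example behaves the same on any OS.

```python
from pathlib import PurePosixPath


def resolve_from_config(config_file: str, entry: str) -> PurePosixPath:
    """Hypothetical helper: interpret `entry` relative to the config file's directory."""
    path = PurePosixPath(entry)
    if path.is_absolute():
        # Absolute paths are used unchanged
        return path
    # Relative paths are anchored at the directory that holds the config file
    return PurePosixPath(config_file).parent / path


print(resolve_from_config("/proj/START_HERE/paths.yml", "./samples.csv"))
```

Under this rule, moving the whole `START_HERE` directory keeps relative entries valid, while absolute entries keep pointing at their original location.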

diff --git a/START_HERE/run_config.yml b/START_HERE/run_config.yml
@@ -41,9 +41,8 @@ run_native: false
 
 ######------------------------- BOWTIE INDEX BUILD OPTIONS --------------------------######
 #
-# If you do not already have bowtie indexes, they can be built for you by setting
-# run_bowtie_build (above) to true and adding your reference genome file(s) to your
-# paths_config file.
+# If you do not already have bowtie indexes, they can be built for you
+# (see the BOWTIE-BUILD section in the Paths File)
 #
 # We have specified default parameters for small RNA data based on our own "best practices".
 # You can change the parameters here.

diff --git a/START_HERE/samples.csv b/START_HERE/samples.csv
@@ -1,4 +1,4 @@
-FASTQ/SAM Files,Sample/Group Name,Replicate number,Control,Normalization
+Input Files,Sample/Group Name,Replicate number,Control,Normalization
 ./fastq_files/cond1_rep1.fastq.gz,condition1,1,TRUE,
 ./fastq_files/cond1_rep2.fastq.gz,condition1,2,,
 ./fastq_files/cond1_rep3.fastq.gz,condition1,3,,
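
This hunk renames the first Samples Sheet column from `FASTQ/SAM Files` to `Input Files`. The sketch below shows how a script of your own might read that column with the standard `csv` module. It is not tinyRNA's reader, and whether tinyRNA still accepts the old header is not shown in this diff; the function name `read_input_files` is hypothetical.

```python
import csv
import io

# Two rows copied from the sheet above, with the new header
SHEET = """Input Files,Sample/Group Name,Replicate number,Control,Normalization
./fastq_files/cond1_rep1.fastq.gz,condition1,1,TRUE,
./fastq_files/cond1_rep2.fastq.gz,condition1,2,,
"""


def read_input_files(sheet_text: str) -> list:
    """Hypothetical helper: return the values of the `Input Files` column."""
    reader = csv.DictReader(io.StringIO(sheet_text))
    if "Input Files" not in (reader.fieldnames or []):
        raise ValueError("expected an 'Input Files' column in the Samples Sheet")
    return [row["Input Files"] for row in reader]


print(read_input_files(SHEET))
```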

diff --git a/START_HERE/tiny-count_TUTORIAL.md b/START_HERE/tiny-count_TUTORIAL.md
@@ -11,7 +11,7 @@ Alternatively, if you have already installed tinyRNA, you can use the `tiny-coun
 
 ## Your Data Files
 Gather the following files for the analysis:
-1. **SAM files** containing small RNA reads aligned to a reference genome, one file per sample
+1. **SAM or BAM files** containing small RNA reads aligned to a reference genome, one file per sample
 2. **GFF3 or GFF2/GTF file(s)** containing annotations for features that you want to assign reads to
 
 ## Configuration Files
@@ -24,7 +24,7 @@ tiny-count --get-templates
 Next, fill out the configuration files that were copied:
 
 ### 1. The Samples Sheet (samples.csv)
-Edit this file to add the paths to your SAM files, and to define the group name, replicate number, etc. for each sample.
+Edit this file to add the paths to your SAM or BAM files, and to define the group name, replicate number, etc. for each sample.
 
 ### 2. The Paths File (paths.yml)
 Edit this file to add the paths to your GFF annotation(s) under the `gff_files` key. You can leave the `alias` key as-is for now. All other keys in this file are used in the tinyRNA workflow.

diff --git a/doc/Configuration.md b/doc/Configuration.md
@@ -119,7 +119,7 @@ The final output directory name has three components:
 The `run_directory` suffix in the Paths File supports subdirectories; if provided, the final output directory will be named as indicated above, but the subdirectory structure specified in `run_directory` will be retained. 
 
 ## Samples Sheet Details
-|  _Column:_ | FASTQ/SAM Files     | Sample/Group Name | Replicate Number | Control | Normalization |
+|  _Column:_ | Input Files         | Sample/Group Name | Replicate Number | Control | Normalization |
 |-----------:|---------------------|-------------------|------------------|---------|---------------|
 | _Example:_ | cond1_rep1.fastq.gz | condition1        | 1                | True    | RPM           |
 

diff --git a/doc/Parameters.md b/doc/Parameters.md
@@ -101,7 +101,7 @@ A custom Cython implementation of HTSeq's StepVector is used for finding feature
 ### Is Pipeline
 | Run Config Key | Commandline Argument |
 |----------------|----------------------|
-|                | `--is-pipeline`      |
+|                | `--in-pipeline`      |
 
 This commandline argument tells tiny-count that it is running as a workflow step rather than a standalone/manual run. Under these conditions tiny-count will look for all input files in the current working directory regardless of the paths defined in the Samples Sheet and Features Sheet.
 
@@ -152,7 +152,7 @@ Optional arguments:
   -sv {Cython,HTSeq}, --stepvector {Cython,HTSeq}
                         Select which StepVector implementation is used to find
                         features overlapping an interval. (default: Cython)
-  -p, --is-pipeline     Indicates that tiny-count was invoked as part of a
+  -p, --in-pipeline     Indicates that tiny-count was invoked as part of a
                         pipeline run and that input files should be sourced as
                         such. (default: False)
   -d, --report-diags    Produce diagnostic information about
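
The flag renamed in this file, `-p, --in-pipeline`, is a boolean switch that makes tiny-count look for inputs in the current working directory. The sketch below shows how such a switch could be declared with `argparse` and applied to a path. It is an illustration only: `build_parser` and `locate_input` are hypothetical names, and tiny-count's actual parser and file lookup are not shown in this diff.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser containing only the renamed flag."""
    parser = argparse.ArgumentParser(prog="tiny-count-sketch")
    parser.add_argument(
        "-p", "--in-pipeline", action="store_true",
        help="Source input files from the current working directory.",
    )
    return parser


def locate_input(path: str, in_pipeline: bool) -> str:
    """Assumed behavior: in pipeline mode, keep only the file name so it is
    looked up in the current working directory."""
    return os.path.basename(path) if in_pipeline else path


args = build_parser().parse_args(["--in-pipeline"])
# argparse converts the dash to an underscore: args.in_pipeline
print(locate_input("/data/run/sample1.sam", args.in_pipeline))
```

Scripts that previously passed `--is-pipeline` would need to pass `--in-pipeline` (or `-p`) after this change.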

diff --git a/doc/tiny-count.md b/doc/tiny-count.md
@@ -7,7 +7,16 @@ For an explanation of tiny-count's parameters in the Run Config and by commandli
 tiny-count offers a variety of options for refining your analysis. You might find that repeat analyses are required while tuning these options to your goals. Using the command `tiny recount`, tinyRNA will run the workflow starting at the tiny-count step using inputs from a prior end-to-end run to save time. See the [pipeline resume documentation](Pipeline.md#resuming-a-prior-analysis) for details and prerequisites.
 
 ## Running as a Standalone Tool
-If you would like to run tiny-count as a standalone tool, not as part of an end-to-end or resumed analysis, you can do so with the command `tiny-count`. The command has [one required argument](Parameters.md#full-tiny-count-help-string): your Paths File. Your Samples Sheet will need to list SAM files rather than FASTQ files in the `FASTQ/SAM Files` column. SAM files from third party sources are also supported, and if they have been produced from reads collapsed by tiny-collapse or fastx, tiny-count will honor the reported read counts.
+Skip to [Feature Selection](#feature-selection) if you are using the tinyRNA workflow.
+
+If you would like to run tiny-count as a standalone tool, not as part of an end-to-end or resumed analysis, you can do so with the command `tiny-count`. The command has [one required argument](Parameters.md#full-tiny-count-help-string): your Paths File. Your Samples Sheet will need to list SAM or BAM alignment files rather than FASTQ files in the `Input Files` column. Alignment files from third party sources are also supported, and if they have been produced from reads collapsed by tiny-collapse or fastx, tiny-count will honor the reported read counts.
+
+#### Input File Requirements
+The SAM/BAM files provided during standalone runs _must_ be ordered so that multi-mapping read alignments are listed adjacent to one another. This adjacency convention is required for proper normalization by genomic hits. For this reason, files with ambiguous order will be rejected unless they were produced by an alignment tool that we recognize for following the adjacency convention. At this time, this includes Bowtie, Bowtie2, and STAR (an admittedly incomplete list).
+
+#### BAM File Tips
+- Use the `--no-PG` option with `samtools view` when converting alignments 
+- Pysam will issue two warnings about missing index files; they can be ignored
 
 #### Using Non-collapsed Sequence Alignments
 While third-party SAM files from non-collapsed reads are supported, there are some caveats. These files will result in substantially higher resource usage and runtimes; we strongly recommend collapsing prior to alignment. Additionally, the sequence-related stats produced by tiny-count will no longer represent _unique_ sequences. These stats will instead refer to all sequences with unique QNAMEs (that is, multi-alignment bundles still carry a sequence count of 1).
@@ -139,8 +148,10 @@ Examples:
 
 ## Count Normalization
 Small RNA reads passing selection will receive a normalized count increment. By default, read counts are normalized twice before being assigned to a feature. Both normalization steps can be disabled in `run_config.yml` if desired. Counts for each small RNA sequence are divided: 
-1. By the number of loci it aligns to in the genome.
-2. By the number of _selected_ features for each of its alignments.
+1. By the number of loci it aligns to in the genome (genomic hits).
+2. By the number of _selected_ features for each of its alignments (feature hits).
+
+>**Important**: For proper normalization by genomic hits, input files must be ordered such that multi-mapping read alignments are listed adjacent to one another. 
 
 ## The Details
 You may encounter the following cases when you have more than one unique GFF file listed in your Paths File:
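
The documentation changes in this file tie normalization by genomic hits to the adjacency of multi-mapping alignments. The sketch below illustrates why: if genomic hits are counted as the length of each run of adjacent alignments sharing a QNAME, a file that scatters those alignments yields the wrong divisor. Both function names are hypothetical, and tiny-count's actual implementation is not shown in this diff.

```python
from itertools import groupby


def bundle_sizes(qnames):
    """Yield (qname, genomic_hits) for each run of adjacent identical QNAMEs.
    Raises if a QNAME reappears after its run has ended."""
    seen = set()
    for name, run in groupby(qnames):
        if name in seen:
            raise ValueError(f"alignments for {name!r} are not adjacent")
        seen.add(name)
        yield name, sum(1 for _ in run)


def normalized_increment(read_count, genomic_hits, feature_hits):
    """Divide by genomic hits, then by feature hits, per the two steps above."""
    return read_count / genomic_hits / feature_hits


print(list(bundle_sizes(["seq_a", "seq_a", "seq_a", "seq_b"])))
# A sequence counted 12 times, aligned to 3 loci, where one alignment
# overlaps 2 selected features: each feature receives 12 / 3 / 2
print(normalized_increment(12, 3, 2))
```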

diff --git a/tests/testdata/config_files/paths.yml b/tests/testdata/config_files/paths.yml
@@ -1,14 +1,17 @@
 ############################## MAIN INPUT FILES FOR ANALYSIS ##############################
 #
-# Edit this section to provide the path to your Samples and Features sheets. Relative and
-# absolute paths are both allowed. All relative paths are relative to THIS config file.
+# Relative and absolute paths are both allowed.
+# All relative paths are evaluated relative to THIS config file.
 #
 # Directions:
 #   1. Fill out the Samples Sheet with files to process + naming scheme. [samples.csv]
 #   2. Fill out the Features Sheet with selection rules [features.csv]
-#   3. Set samples_csv and features_csv (below) to point to these files
+#   3. Set samples_csv and features_csv to point to these files
 #   4. Add annotation files and per-file alias preferences to gff_files (optional)
 #
+# If using the tinyRNA workflow, additionally set ebwt and/or reference_genome_files
+# in the BOWTIE-BUILD section.
+#
 ######-------------------------------------------------------------------------------######
 
 ##-- Path to Sample & Features Sheets (relative paths are relative to this config file) --##

diff --git a/tests/testdata/config_files/run_config_template.yml b/tests/testdata/config_files/run_config_template.yml
@@ -41,9 +41,8 @@ run_native: false
 
 ######------------------------- BOWTIE INDEX BUILD OPTIONS --------------------------######
 #
-# If you do not already have bowtie indexes, they can be built for you by setting
-# run_bowtie_build (above) to true and adding your reference genome file(s) to your
-# paths_config file.
+# If you do not already have bowtie indexes, they can be built for you
+# (see the BOWTIE-BUILD section in the Paths File)
 #
 # We have specified default parameters for small RNA data based on our own "best practices".
 # You can change the parameters here.

diff --git a/tests/testdata/config_files/samples.csv b/tests/testdata/config_files/samples.csv
@@ -1,4 +1,4 @@
-FASTQ/SAM Files,Sample/Group Name,Replicate Number,Control,Normalization
+Input Files,Sample/Group Name,Replicate Number,Control,Normalization
 ../../../START_HERE/fastq_files/cond1_rep1.fastq.gz,condition1,1,TRUE,
 ../../../START_HERE/fastq_files/cond1_rep2.fastq.gz,condition1,2,,
 ../../../START_HERE/fastq_files/cond1_rep3.fastq.gz,condition1,3,,

diff --git a/tests/testdata/counter/bam/Lib304_test.bam b/tests/testdata/counter/bam/Lib304_test.bam
diff --git a/tests/testdata/counter/bam/single.bam b/tests/testdata/counter/bam/single.bam
diff --git a/tests/testdata/counter/discontinuous.gff3 → ...s/testdata/counter/gff/discontinuous.gff3 b/tests/testdata/counter/discontinuous.gff3 → ...s/testdata/counter/gff/discontinuous.gff3
diff --git a/...estdata/counter/identity_choice_test.gff3 → ...ata/counter/gff/identity_choice_test.gff3 b/...estdata/counter/identity_choice_test.gff3 → ...ata/counter/gff/identity_choice_test.gff3
diff --git a/tests/testdata/counter/single.gff3 → tests/testdata/counter/gff/single.gff3 b/tests/testdata/counter/single.gff3 → tests/testdata/counter/gff/single.gff3
diff --git a/tests/testdata/counter/single2.gff3 → tests/testdata/counter/gff/single2.gff3 b/tests/testdata/counter/single2.gff3 → tests/testdata/counter/gff/single2.gff3
diff --git a/tests/testdata/counter/Lib304_test.sam → tests/testdata/counter/sam/Lib304_test.sam b/tests/testdata/counter/Lib304_test.sam → tests/testdata/counter/sam/Lib304_test.sam
diff --git a/...testdata/counter/identity_choice_test.sam → ...data/counter/sam/identity_choice_test.sam b/...testdata/counter/identity_choice_test.sam → ...data/counter/sam/identity_choice_test.sam
diff --git a/tests/testdata/counter/non-collapsed.sam → tests/testdata/counter/sam/non-collapsed.sam b/tests/testdata/counter/non-collapsed.sam → tests/testdata/counter/sam/non-collapsed.sam
@@ -1,2 +1,4 @@
+@HD	SO:unsorted
 @SQ	SN:I	LN:21
+@PG	ID:bowtie
 NON_COLLAPSED_QNAME	16	I	15064570	255	21M	*	0	0	CAAGACAGAGCTTCACCGTTC	IIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:21	NM:i:0	XM:i:2
diff --git a/tests/testdata/counter/single.sam → tests/testdata/counter/sam/single.sam b/tests/testdata/counter/single.sam → tests/testdata/counter/sam/single.sam
@@ -1,2 +1,4 @@
+@HD	SO:unsorted
 @SQ	SN:I	LN:21
+@PG	ID:bowtie
 0_count=5	16	I	15064570	255	21M	*	0	0	CAAGACAGAGCTTCACCGTTC	IIIIIIIIIIIIIIIIIIIII	XA:i:0	MD:Z:21	NM:i:0	XM:i:2
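
The test SAM files above gain `@HD SO:unsorted` and `@PG ID:bowtie` header lines, which matches the documented rule that files with ambiguous order are rejected unless they come from a recognized aligner. The sketch below shows one way such a header check could look. The decision logic, the case-insensitive name matching, and the names `parse_header` and `adjacency_is_plausible` are all assumptions, not tiny-count's code.

```python
# Assumption: lowercase names of the tools listed in doc/tiny-count.md
RECOGNIZED = {"bowtie", "bowtie2", "star"}


def parse_header(lines):
    """Return (sort_order, program_ids) from the leading '@' lines of a SAM file."""
    sort_order, programs = None, []
    for line in lines:
        if not line.startswith("@"):
            break  # header ends at the first alignment record
        tag, *fields = line.rstrip("\n").split("\t")
        values = dict(f.split(":", 1) for f in fields if ":" in f)
        if tag == "@HD":
            sort_order = values.get("SO")
        elif tag == "@PG" and "ID" in values:
            programs.append(values["ID"])
    return sort_order, programs


def adjacency_is_plausible(sort_order, programs):
    """Assumed rule: queryname order is fine; missing, unknown, or unsorted
    order is accepted only for a recognized aligner; anything else is rejected."""
    if sort_order == "queryname":
        return True
    if sort_order in (None, "unknown", "unsorted"):
        return any(p.lower() in RECOGNIZED for p in programs)
    return False


header = ["@HD\tSO:unsorted", "@SQ\tSN:I\tLN:21", "@PG\tID:bowtie"]
print(adjacency_is_plausible(*parse_header(header)))
```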
diff --git a/tests/unit_test_helpers.py b/tests/unit_test_helpers.py
@@ -131,7 +131,7 @@ def get_dir_checksum_tree(root_path: str) -> dict:
     return dir_tree
 
 
-def make_parsed_sam_record(Name="0_count=1", Seq="CAAGACAGAGCTTCACCGTTC", Chrom='I', Start=15064570, Strand=True, NM=0):
+def make_parsed_alignment(Name="0_count=1", Seq="CAAGACAGAGCTTCACCGTTC", Chrom='I', Start=15064570, Strand=True, NM=0):
     return {
         "Name": Name,
         "Length": len(Seq),