Commit

Merge branch 'master' into issue-303
# Conflicts:
#	tiny/cwl/tools/tiny-count.cwl
#	tiny/rna/counter/counter.py
AlexTate committed May 2, 2023
2 parents 4715ec6 + 5aac71b commit e2704c0
Showing 38 changed files with 657 additions and 364 deletions.
9 changes: 6 additions & 3 deletions START_HERE/paths.yml
@@ -1,14 +1,17 @@
############################## MAIN INPUT FILES FOR ANALYSIS ##############################
#
-# Edit this section to provide the path to your Samples and Features sheets. Relative and
-# absolute paths are both allowed. All relative paths are relative to THIS config file.
+# Relative and absolute paths are both allowed.
+# All relative paths are evaluated relative to THIS config file.
#
# Directions:
# 1. Fill out the Samples Sheet with files to process + naming scheme. [samples.csv]
# 2. Fill out the Features Sheet with selection rules [features.csv]
-# 3. Set samples_csv and features_csv (below) to point to these files
+# 3. Set samples_csv and features_csv to point to these files
# 4. Add annotation files and per-file alias preferences to gff_files (optional)
#
+# If using the tinyRNA workflow, additionally set ebwt and/or reference_genome_files
+# in the BOWTIE-BUILD section.
+#
######-------------------------------------------------------------------------------######

##-- Path to Sample & Features Sheets (relative paths are relative to this config file) --##
5 changes: 2 additions & 3 deletions START_HERE/run_config.yml
@@ -41,9 +41,8 @@ run_native: false

######------------------------- BOWTIE INDEX BUILD OPTIONS --------------------------######
#
-# If you do not already have bowtie indexes, they can be built for you by setting
-# run_bowtie_build (above) to true and adding your reference genome file(s) to your
-# paths_config file.
+# If you do not already have bowtie indexes, they can be built for you
+# (see the BOWTIE-BUILD section in the Paths File)
#
# We have specified default parameters for small RNA data based on our own "best practices".
# You can change the parameters here.
2 changes: 1 addition & 1 deletion START_HERE/samples.csv
@@ -1,4 +1,4 @@
-FASTQ/SAM Files,Sample/Group Name,Replicate number,Control,Normalization
+Input Files,Sample/Group Name,Replicate number,Control,Normalization
./fastq_files/cond1_rep1.fastq.gz,condition1,1,TRUE,
./fastq_files/cond1_rep2.fastq.gz,condition1,2,,
./fastq_files/cond1_rep3.fastq.gz,condition1,3,,
4 changes: 2 additions & 2 deletions START_HERE/tiny-count_TUTORIAL.md
@@ -11,7 +11,7 @@ Alternatively, if you have already installed tinyRNA, you can use the `tiny-coun

## Your Data Files
Gather the following files for the analysis:
-1. **SAM files** containing small RNA reads aligned to a reference genome, one file per sample
+1. **SAM or BAM files** containing small RNA reads aligned to a reference genome, one file per sample
2. **GFF3 or GFF2/GTF file(s)** containing annotations for features that you want to assign reads to

## Configuration Files
@@ -24,7 +24,7 @@ tiny-count --get-templates
Next, fill out the configuration files that were copied:

### 1. The Samples Sheet (samples.csv)
-Edit this file to add the paths to your SAM files, and to define the group name, replicate number, etc. for each sample.
+Edit this file to add the paths to your SAM or BAM files, and to define the group name, replicate number, etc. for each sample.

### 2. The Paths File (paths.yml)
Edit this file to add the paths to your GFF annotation(s) under the `gff_files` key. You can leave the `alias` key as-is for now. All other keys in this file are used in the tinyRNA workflow.
2 changes: 1 addition & 1 deletion doc/Configuration.md
@@ -119,7 +119,7 @@ The final output directory name has three components:
The `run_directory` suffix in the Paths File supports subdirectories; if provided, the final output directory will be named as indicated above, but the subdirectory structure specified in `run_directory` will be retained.

## Samples Sheet Details
-| _Column:_ | FASTQ/SAM Files | Sample/Group Name | Replicate Number | Control | Normalization |
+| _Column:_ | Input Files | Sample/Group Name | Replicate Number | Control | Normalization |
|-----------:|---------------------|-------------------|------------------|---------|---------------|
| _Example:_ | cond1_rep1.fastq.gz | condition1 | 1 | True | RPM |

4 changes: 2 additions & 2 deletions doc/Parameters.md
@@ -101,7 +101,7 @@ A custom Cython implementation of HTSeq's StepVector is used for finding feature
### Is Pipeline
| Run Config Key | Commandline Argument |
|----------------|----------------------|
-| | `--is-pipeline` |
+| | `--in-pipeline` |

This commandline argument tells tiny-count that it is running as a workflow step rather than a standalone/manual run. Under these conditions tiny-count will look for all input files in the current working directory regardless of the paths defined in the Samples Sheet and Features Sheet.

@@ -152,7 +152,7 @@ Optional arguments:
-sv {Cython,HTSeq}, --stepvector {Cython,HTSeq}
Select which StepVector implementation is used to find
features overlapping an interval. (default: Cython)
-p, --is-pipeline Indicates that tiny-count was invoked as part of a
-p, --in-pipeline Indicates that tiny-count was invoked as part of a
pipeline run and that input files should be sourced as
such. (default: False)
-d, --report-diags Produce diagnostic information about
17 changes: 14 additions & 3 deletions doc/tiny-count.md
@@ -7,7 +7,16 @@ For an explanation of tiny-count's parameters in the Run Config and by commandli
tiny-count offers a variety of options for refining your analysis. You might find that repeat analyses are required while tuning these options to your goals. Using the command `tiny recount`, tinyRNA will run the workflow starting at the tiny-count step using inputs from a prior end-to-end run to save time. See the [pipeline resume documentation](Pipeline.md#resuming-a-prior-analysis) for details and prerequisites.

## Running as a Standalone Tool
-If you would like to run tiny-count as a standalone tool, not as part of an end-to-end or resumed analysis, you can do so with the command `tiny-count`. The command has [one required argument](Parameters.md#full-tiny-count-help-string): your Paths File. Your Samples Sheet will need to list SAM files rather than FASTQ files in the `FASTQ/SAM Files` column. SAM files from third party sources are also supported, and if they have been produced from reads collapsed by tiny-collapse or fastx, tiny-count will honor the reported read counts.
+Skip to [Feature Selection](#feature-selection) if you are using the tinyRNA workflow.
+
+If you would like to run tiny-count as a standalone tool, not as part of an end-to-end or resumed analysis, you can do so with the command `tiny-count`. The command has [one required argument](Parameters.md#full-tiny-count-help-string): your Paths File. Your Samples Sheet will need to list SAM or BAM alignment files rather than FASTQ files in the `Input Files` column. Alignment files from third party sources are also supported, and if they have been produced from reads collapsed by tiny-collapse or fastx, tiny-count will honor the reported read counts.
+
+#### Input File Requirements
+The SAM/BAM files provided during standalone runs _must_ be ordered so that multi-mapping read alignments are listed adjacent to one another. This adjacency convention is required for proper normalization by genomic hits. For this reason, files with ambiguous order will be rejected unless they were produced by an alignment tool that we recognize for following the adjacency convention. At this time, this includes Bowtie, Bowtie2, and STAR (an admittedly incomplete list).
+
+#### BAM File Tips
+- Use the `--no-PG` option with `samtools view` when converting alignments
+- Pysam will issue two warnings about missing index files; they can be ignored

#### Using Non-collapsed Sequence Alignments
While third-party SAM files from non-collapsed reads are supported, there are some caveats. These files will result in substantially higher resource usage and runtimes; we strongly recommend collapsing prior to alignment. Additionally, the sequence-related stats produced by tiny-count will no longer represent _unique_ sequences. These stats will instead refer to all sequences with unique QNAMEs (that is, multi-alignment bundles still carry a sequence count of 1).
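
The adjacency convention described under Input File Requirements can be illustrated with a minimal sketch, assuming plain-text SAM lines. The helper names `qnames_from_sam` and `multimappers_are_adjacent` are hypothetical and are not part of tiny-count. According to the text above, tiny-count decides based on whether the file's order is ambiguous and which aligner produced it; this sketch instead scans read names directly, which is only one way to test the same property.

```python
def qnames_from_sam(lines):
    """Yield the QNAME (first tab-separated field) of each alignment record."""
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines (@HD, @SQ, @PG, ...)
        yield line.split("\t", 1)[0]


def multimappers_are_adjacent(qnames):
    """Return True if every read name forms one contiguous run of records."""
    finished = set()  # names whose run of records has already ended
    previous = None
    for name in qnames:
        if name != previous:
            if name in finished:
                return False  # the name reappeared after other reads intervened
            if previous is not None:
                finished.add(previous)
            previous = name
    return True


# Hypothetical records, truncated to the first few SAM columns for brevity
sam_lines = [
    "@HD\tVN:1.0\tSO:unsorted",
    "read1\t0\tI\t100",
    "read1\t0\tI\t900",
    "read2\t16\tI\t400",
]
print(multimappers_are_adjacent(qnames_from_sam(sam_lines)))
```

A coordinate-sorted file would typically fail this check, because alignments of the same multi-mapping read end up separated by alignments of other reads.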
@@ -139,8 +148,10 @@ Examples:
## Count Normalization
Small RNA reads passing selection will receive a normalized count increment. By default, read counts are normalized twice before being assigned to a feature. Both normalization steps can be disabled in `run_config.yml` if desired. Counts for each small RNA sequence are divided:
-1. By the number of loci it aligns to in the genome.
-2. By the number of _selected_ features for each of its alignments.
+1. By the number of loci it aligns to in the genome (genomic hits).
+2. By the number of _selected_ features for each of its alignments (feature hits).
+
+>**Important**: For proper normalization by genomic hits, input files must be ordered such that multi-mapping read alignments are listed adjacent to one another.
## The Details
You may encounter the following cases when you have more than one unique GFF file listed in your Paths File:
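
The two divisions described under Count Normalization can be sketched as follows. This is an illustration under stated assumptions, not tiny-count's implementation: the function name `normalized_increment` and the keyword arguments `by_genomic_hits` and `by_feature_hits` are hypothetical stand-ins for the two steps that the documentation says can be disabled in `run_config.yml`.

```python
def normalized_increment(read_count, genomic_hits, feature_hits,
                         by_genomic_hits=True, by_feature_hits=True):
    """Count increment that one alignment contributes to each selected feature.

    read_count   -- number of reads represented by the sequence
    genomic_hits -- number of loci the sequence aligns to in the genome
    feature_hits -- number of selected features for this particular alignment
    The two boolean flags are hypothetical switches for the two steps.
    """
    if genomic_hits < 1 or feature_hits < 1:
        raise ValueError("genomic_hits and feature_hits must be at least 1")
    increment = float(read_count)
    if by_genomic_hits:
        increment /= genomic_hits   # step 1: divide by genomic hits
    if by_feature_hits:
        increment /= feature_hits   # step 2: divide by feature hits
    return increment


# A sequence with 12 reads aligning to 3 loci; at this locus 2 features are selected
print(normalized_increment(12, genomic_hits=3, feature_hits=2))
```

Step 1 depends on knowing how many alignments a read has, which is why the adjacency requirement matters: a reader that processes alignments in bundles can only count a read's loci if they arrive together.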
9 changes: 6 additions & 3 deletions tests/testdata/config_files/paths.yml
@@ -1,14 +1,17 @@
############################## MAIN INPUT FILES FOR ANALYSIS ##############################
#
-# Edit this section to provide the path to your Samples and Features sheets. Relative and
-# absolute paths are both allowed. All relative paths are relative to THIS config file.
+# Relative and absolute paths are both allowed.
+# All relative paths are evaluated relative to THIS config file.
#
# Directions:
# 1. Fill out the Samples Sheet with files to process + naming scheme. [samples.csv]
# 2. Fill out the Features Sheet with selection rules [features.csv]
-# 3. Set samples_csv and features_csv (below) to point to these files
+# 3. Set samples_csv and features_csv to point to these files
# 4. Add annotation files and per-file alias preferences to gff_files (optional)
#
+# If using the tinyRNA workflow, additionally set ebwt and/or reference_genome_files
+# in the BOWTIE-BUILD section.
+#
######-------------------------------------------------------------------------------######

##-- Path to Sample & Features Sheets (relative paths are relative to this config file) --##
5 changes: 2 additions & 3 deletions tests/testdata/config_files/run_config_template.yml
@@ -41,9 +41,8 @@ run_native: false

######------------------------- BOWTIE INDEX BUILD OPTIONS --------------------------######
#
-# If you do not already have bowtie indexes, they can be built for you by setting
-# run_bowtie_build (above) to true and adding your reference genome file(s) to your
-# paths_config file.
+# If you do not already have bowtie indexes, they can be built for you
+# (see the BOWTIE-BUILD section in the Paths File)
#
# We have specified default parameters for small RNA data based on our own "best practices".
# You can change the parameters here.
2 changes: 1 addition & 1 deletion tests/testdata/config_files/samples.csv
@@ -1,4 +1,4 @@
-FASTQ/SAM Files,Sample/Group Name,Replicate Number,Control,Normalization
+Input Files,Sample/Group Name,Replicate Number,Control,Normalization
../../../START_HERE/fastq_files/cond1_rep1.fastq.gz,condition1,1,TRUE,
../../../START_HERE/fastq_files/cond1_rep2.fastq.gz,condition1,2,,
../../../START_HERE/fastq_files/cond1_rep3.fastq.gz,condition1,3,,
Binary file added tests/testdata/counter/bam/Lib304_test.bam
Binary file not shown.
Binary file added tests/testdata/counter/bam/single.bam
Binary file not shown.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -1,2 +1,4 @@
@HD SO:unsorted
@SQ SN:I LN:21
@PG ID:bowtie
NON_COLLAPSED_QNAME 16 I 15064570 255 21M * 0 0 CAAGACAGAGCTTCACCGTTC IIIIIIIIIIIIIIIIIIIII XA:i:0 MD:Z:21 NM:i:0 XM:i:2
@@ -1,2 +1,4 @@
@HD SO:unsorted
@SQ SN:I LN:21
@PG ID:bowtie
0_count=5 16 I 15064570 255 21M * 0 0 CAAGACAGAGCTTCACCGTTC IIIIIIIIIIIIIIIIIIIII XA:i:0 MD:Z:21 NM:i:0 XM:i:2
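
The QNAME in the record above, `0_count=5`, appears to carry the read count of a collapsed sequence, while `NON_COLLAPSED_QNAME` in the preceding test file carries none. A minimal sketch of how such a count could be recovered is shown here, assuming the count is always encoded as a trailing `_count=N`. The helper `read_count_from_qname` is hypothetical and is not tiny-count's parser. The fallback of 1 follows the documentation's statement that alignments from non-collapsed reads carry a sequence count of 1.

```python
import re

# Assumed naming scheme: "<id>_count=<N>" at the end of the QNAME
_COUNT_SUFFIX = re.compile(r"_count=(\d+)$")


def read_count_from_qname(qname):
    """Return the collapsed read count encoded in a QNAME, or 1 if absent."""
    match = _COUNT_SUFFIX.search(qname)
    return int(match.group(1)) if match else 1


for qname in ("0_count=5", "NON_COLLAPSED_QNAME"):
    print(qname, read_count_from_qname(qname))
```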
2 changes: 1 addition & 1 deletion tests/unit_test_helpers.py
@@ -131,7 +131,7 @@ def get_dir_checksum_tree(root_path: str) -> dict:
return dir_tree


-def make_parsed_sam_record(Name="0_count=1", Seq="CAAGACAGAGCTTCACCGTTC", Chrom='I', Start=15064570, Strand=True, NM=0):
+def make_parsed_alignment(Name="0_count=1", Seq="CAAGACAGAGCTTCACCGTTC", Chrom='I', Start=15064570, Strand=True, NM=0):
return {
"Name": Name,
"Length": len(Seq),