CZ ID's AMR workflow implements the Resistance Gene Identifier (RGI) tool for AMR sequence detection. The RGI tool is used to compare quality-controlled reads and assembled contigs against AMR reference sequences from the Comprehensive Antibiotic Resistance Database (CARD). Further documentation on how to use the CZ ID AMR workflow, including a workflow visualization, can be found in the CZ ID help center.
- Fixed an issue that caused assembly to fail for RawSample inputs.
- Docker build now uses the official RGI 6.0.3 release and applies a patch, instead of installing from a fork.
- Adds SPAdes 3.11.1 to the AMR docker image.
- Adds a script to collect information from SPAdes logs when no contigs are assembled in RunSpades.
- Fixes an issue where the `RunRedup` task would hang when processing samples with large amounts of duplicate sequence data. `grep` has been replaced with `awk` when processing the duplicate cluster sizes tsv and clusters csv files.
- Fixes an issue where `RunRedup` output fasta files were inconsistently named.
- `RunSpades` now uses the container specified in `docker_image_id` instead of `host_filtering_docker_image_id`.
- Harmonizes file naming in `RunRedup` to use underscores (_) instead of dashes (-).
- Now uses a fork of rgi version 6.0.3, with changes to fix an issue with parsing BAM files. Previously the AMR workflow used a fork of rgi version 6.0.0.
- Samples that have valid k-mer coverage histograms but do not generate any contigs from SPAdes assembly will no longer cause an error in `RunRgiMain`, which does not allow for empty contigs files.
- Fixed a bug where, for a sample that had duplicate reads, the `RunRedup` task failed to include reads from the subsampled reads file.
- Fixed a bug where the non-host reads output file was in FASTQ format instead of the expected FASTA format.
- Duplicate reads that correspond with a read present in the subsampled reads used as input are now interleaved with subsampled reads before being used as input to the rest of the workflow. This should result in more accurate reporting of reads-based statistics in the workflow output (number of reads, reads per million, coverage depth, and depth per million reads).
- Input parameters that expect a non-empty array are now declared as `Array[T]+` instead of `Array[T]`.
- Variables that were in camelCase have been renamed to snake_case for consistency with the rest of the workflow.
- The workflow's input parameters for declaring sample input files have changed. Samples must now be declared using either the `RawSample` struct or the `FilteredSample` struct. The `FilteredSample` struct, meant for samples that have previously undergone host filtering in CZ ID, requires a file containing all non-host reads as well as two tsv files describing duplicate read clusters and their sizes, in addition to the previously needed subsampled reads and contigs files.
- Fixed a bug that caused the calculated gene coverage percentage to be incorrect in some cases, i.e. when only a reverse read was present in the contigs of a paired-end sample.
- Fixed a bug that caused contig IDs to be improperly parsed when creating the indexed contigs BAM/BAI files. This caused the workflow to halt in some cases.
- The workflow now expects the `card_ontology` file to be organized by drug class, which should be the top-level keys in the JSON structure of the file. The code that generates the primary AMR report now queries the file by drug class instead of gene name to obtain high-level drug classes.
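
As a rough illustration of the duplicate-read handling described in this list, the sketch below selects the duplicate reads whose cluster representative survived subsampling, using a single pass and a set lookup. The two-column cluster layout (representative ID, member ID) and the function name `duplicates_for_subsample` are assumptions made for this example, not the workflow's actual code.

```python
import csv
import io


def duplicates_for_subsample(clusters_csv, subsampled_ids):
    """Return duplicate read IDs whose cluster representative was subsampled.

    Assumes each CSV row is (representative_read_id, member_read_id) and that
    a representative may also be listed as a member of its own cluster.
    """
    keep = set(subsampled_ids)  # set gives constant-time membership checks
    duplicates = []
    for row in csv.reader(clusters_csv):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        representative, member = row
        if representative in keep and member != representative:
            duplicates.append(member)
    return duplicates


# Hypothetical cluster data, for illustration only.
example = io.StringIO("r1,r1\nr1,r1_dup\nr2,r2\nr3,r3\nr3,r3_dup\n")
print(duplicates_for_subsample(example, ["r1", "r2"]))  # ['r1_dup']
```

Building the set once keeps the work proportional to the size of the clusters file, which matters for samples with very large numbers of duplicates.
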
1.2.13 is identical to 1.2.12.
- Added ARO accession number as a column in the primary AMR report.
- Fixed a bug that prevented contigs without corresponding read IDs from being indexed in the contigs BAM/BAI files. All contigs in a sample for a given gene can now be accessed by using the gene's ARO accession number as a key.
- Contigs in the indexed BAM/BAI files are now indexed by ARO accession, not read IDs.
- Fixed a bug that caused the workflow to halt if an ontology entry did not have a `highLevelDrugClasses` key.
- Fixed a bug that caused identified genes to not show up in the primary AMR report if they did not have a high-level drug class in the ontology.
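
The two ontology fixes above amount to tolerating a missing `highLevelDrugClasses` key. A minimal sketch of that pattern follows; it assumes the ontology is a dictionary keyed by gene name whose entries hold a list of drug class names. The function name, the sample data, and the `;` separator are hypothetical.

```python
def high_level_drug_classes(ontology, gene_name):
    """Return the gene's high-level drug classes, or [] when none are recorded."""
    entry = ontology.get(gene_name, {})
    # .get() with a default avoids a KeyError for entries lacking the key.
    return entry.get("highLevelDrugClasses", [])


# Hypothetical ontology content, for illustration only.
ontology = {
    "geneA": {"highLevelDrugClasses": ["drug class X"]},
    "geneB": {},
}

for gene in ("geneA", "geneB", "geneC"):
    classes = high_level_drug_classes(ontology, gene)
    # Genes with no drug class still produce a row, with an empty field.
    print(gene, ";".join(classes), sep="\t")
```
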
NOTE: Due to a bug in our release script, there were no releases with version numbers 1.2.6 - 1.2.9.
- Added seqkit 2.4.0 to the AMR Dockerfile.
- Added workflow parameter `File card_ontology`. The workflow expects this to be a customized JSON file containing the contents of the CARD ontology, with AMR gene names as top-level keys.
- Added high-level drug classes to primary AMR report. The workflow queries the `card_ontology` file.
- Added sample ontology file for testing purposes.
- The interleaved non-host reads file now renames duplicate sequence IDs with seqkit. The renamed IDs have a forward slash and a counter appended to them: `/1`, `/2`, etc.
- Fixed bug in calculating gene coverage. Contigs are now grouped by ARO accession instead of model ID before calculating coverage.
- Pinned `dask` version to `2023.5.0` to prevent installation of newer versions incompatible with Python 3.8.
- Updated `requests` to version 2.31.0.
- Added this README file.
- Updated CARD database files.
- Nudged results from running RGI on contigs are now included in the primary AMR report. These rows have the value "Nudged" in the "Cut_Off_contig_amr" column.
- RGI is now called with the `--include_nudge` flag on contigs in task `RunRgiMain`.
- Fixed a bug that caused creation of the indexed contig BAM file to error out if SPAdes assembly failed. The task now outputs an empty BAM file.
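
The duplicate-ID renaming noted earlier in this list can be pictured with the following sketch, which mimics the `/1`, `/2` suffix scheme in plain Python rather than calling seqkit. It assumes that only IDs occurring more than once are renamed and that every occurrence, including the first, receives a counter; the workflow's actual seqkit invocation may behave differently. `rename_duplicate_ids` is a hypothetical helper.

```python
from collections import Counter


def rename_duplicate_ids(ids):
    """Append "/<n>" to each occurrence of an ID that appears more than once.

    `ids` must be a list (or other re-iterable sequence) of sequence IDs.
    """
    totals = Counter(ids)  # how often each ID occurs overall
    seen = Counter()       # occurrences encountered so far
    renamed = []
    for seq_id in ids:
        if totals[seq_id] > 1:
            seen[seq_id] += 1
            renamed.append(f"{seq_id}/{seen[seq_id]}")
        else:
            renamed.append(seq_id)  # unique IDs are left unchanged
    return renamed


print(rename_duplicate_ids(["readA", "readA", "readB"]))
# ['readA/1', 'readA/2', 'readB']
```
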
Version 1.2.1 is unchanged from 1.2.0.
- Fixed a bug that caused all gene names in the AMR reports to be entirely lowercase.
- Added Python test for the WDL task which generates reports as TSV files, `RunResultsPerSample`.
- Fixed a bug that occurred when determining ARO accessions for rows in the primary AMR report.
- Removed carriage return characters from workflow test data.
- Fixed the filename specified for the non-host reads file produced by `ZipOutputs`.
- The non-host reads file output by `ZipOutputs` is no longer compressed with gzip.
- Contigs indexed by gene are now output as a BAM/BAI file pair from new task `tsvToSam`.
- Added a new column to the primary AMR report, `read_gene_id`, that can be used to look up sequences in the above-mentioned contigs BAM file.
- The workflow will now only interleave non-host reads as part of the `ZipOutputs` task if two files are present. Interleaving is skipped for workflows where all non-host reads are contained in a single file.
- The non-host reads used for workflow runs on unfiltered samples now use the subsampled output from host filtering. Previously, the workflow used reads that were output from the host filtering process before subsampling.
- Dockerfile: Add SeqFu for interleaving input files.
- Non-host reads interleaved in a single file are now an output of `ZipOutputs`.
- Contigs are also now an output of `ZipOutputs`.
- Renamed `ZipOutputs` input parameter `output_files` to `contigs_in`.
- Removed input parameter `Int? total_reads`.
- Removed task `GetTotalReads`.
- New optional input parameter `Int? total_reads`.
- New task `GetTotalReads`.
- Load json with all CARD models when running `rgi load`.
- More detailed % coverage, now divided into contig coverage and read coverage. New columns have been added to the primary AMR report and some columns have been renamed in order to reflect this.
- New workflow input parameter `File wildcard_index`.
- Fixed typo in `MakeGeneCoverage` task.
- Fixed a bug in `MakeGeneCoverage` that did not account for zero-length output when writing results to `gene_coverage.tsv`.
- Output `RunSpades.scaffolds` is now optional type `File?` to reflect the fact that the `scaffolds.fasta` file may be missing in some SPAdes runs.
- AMR workflow files have been moved from `amr/` to `workflows/amr/`.
- Dockerfile: `bedtools` installed via apt instead of source to eliminate the long compile time when building the image.
- Dockerfile: install RGI from commit 20b22dab on the rzlim08 fork.
- CARD databases are now loaded at runtime, and have been removed from the Dockerfile.
- For contigs, RGI is no longer called with the `--include_nudge` flag.
- Updated `RunRgiMain` Python test.
- Sample name now supported as input and appears in workflow output.
- Synthesized report `primary_AMR_report.tsv`, a table containing the most important workflow output.
- Intermediate file with allele analysis output from k-mer taxonomic classification on non-host reads is now a workflow output.
- Mapped read stats from RGI main analysis of non-host reads available in several output files as workflow outputs.
- Mapped reads from RGI main analysis of non-host reads available as a BAM file with BAI index in workflow outputs.
- Intermediate outputs added to outputs.zip, organized into subfolders.
- Added test for RGI main analysis of failed assembly output.
- Workflow now continues properly with a sample that failed assembly (no contigs).
- Instead of erroring out, workflow now writes empty AMR report if there is no output from RGI.
- Use main branch of rzlim08 RGI fork.
- Updated lxml dependency to version `4.9.1`.
- Rename final summary file to `comprehensive_AMR_metrics.tsv`.
- Primary AMR workflow output now in `primary_AMR_report.tsv`, from task `RunResultsPerSample`.
- Several intermediate output files have had their names changed to be consistent with snake_case.
- Run SPAdes with a fixed number of threads (36) to prevent issues with using `$(nproc --all)`.
- SPAdes stdout stream redirected to stderr for logging purposes.
- Support for raw reads as input in paired fasta/fastq files.
- Host filtering via calling the short read mNGS host filter workflow.
- WDL task `RunSpades` to assemble contigs from host filter output with the SPAdes genome assembler, using the short read mNGS Docker image.
- WDL task `MakeGeneCoverage` to calculate % coverage of reference genome for identified sequences.
- Uses different rzlim08 RGI fork branch that adds gene hit coordinates.
- RGI main tasks now take in sequences from either top-level workflow inputs or host filtering/SPAdes output using `select_first()`.
- Python requirements.txt file to run workflow without using conda.
- Dockerfile based on Ubuntu 20.04 image, with Python 3.8 installed.
- rzlim08 RGI fork that fixes header parsing.
- CARD RGI dependencies installed in Docker build.
- CARD databases added as part of Docker image build.
- WDL workflow support for samples with non-host reads and contigs.
- Workflow input parameter `Array[File] non_host_reads`.
- Workflow input parameter `File contigs`.
- Workflow input parameter `File card_json`.
- Workflow input parameter `File kmer_db`.
- Workflow input parameter `File amr_kmer_db`.
- Workflow input parameter `File wildcard_data`.
- WDL task `RunRgiMain` to identify input contigs using `rgi main`.
- WDL task `RunRgiBwtKma` to identify input non-host reads using `rgi bwt`.
- WDL task `RunRgiKmerMain` to run k-mer taxonomic classification on input contigs.
- WDL task `RunRgiKmerBwt` to run k-mer taxonomic classification on non-host reads.
- WDL task `RunResultsPerSample` that collects RGI output into TSV files; primary output in final_summary.tsv.
- WDL task `ZipOutputs` to collect primary analysis in one file location.
- Python tests for `RgiMain` and `RgiBwtKma` WDL tasks.
- Test contigs and non-host reads data.
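
The percent-coverage calculation performed by `MakeGeneCoverage` is described only briefly in this changelog. One plausible way to compute it, sketched here under assumptions, is to merge the aligned intervals for each ARO accession and divide the covered length by the reference gene length. The tuple layout, the 0-based half-open coordinates, the placeholder accessions, and the function name `percent_coverage_by_aro` are illustrative and not taken from the workflow.

```python
from collections import defaultdict


def percent_coverage_by_aro(hits, gene_lengths):
    """Compute percent of each reference gene covered by its hits.

    hits: iterable of (aro_accession, start, end), 0-based half-open.
    gene_lengths: dict mapping aro_accession to reference gene length.
    """
    grouped = defaultdict(list)
    for aro, start, end in hits:
        grouped[aro].append((start, end))

    coverage = {}
    for aro, intervals in grouped.items():
        covered = 0
        current_start, current_end = None, None
        for start, end in sorted(intervals):
            if current_end is None or start > current_end:
                # Disjoint interval: bank the previous one and start anew.
                if current_end is not None:
                    covered += current_end - current_start
                current_start, current_end = start, end
            else:
                # Overlapping or adjacent: extend the current interval.
                current_end = max(current_end, end)
        covered += current_end - current_start
        coverage[aro] = 100.0 * covered / gene_lengths[aro]
    return coverage


# Placeholder accessions and coordinates, for illustration only.
hits = [("aro_1", 0, 50), ("aro_1", 40, 80), ("aro_2", 10, 20)]
print(percent_coverage_by_aro(hits, {"aro_1": 100, "aro_2": 40}))
# {'aro_1': 80.0, 'aro_2': 25.0}
```

Merging before summing keeps overlapping contigs from being counted twice, so the result cannot exceed 100% as long as hits stay within the gene.
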
Filename | Provenance |
---|---|
s3://czid-public-references/card/2023-05-22/* | All files downloaded from https://card.mcmaster.ca/download on May 23, 2023. CARD version: 3.2.6 Wildcard version: 4.0.0 |