Merge pull request #30 from phac-nml/add-sample-name
Update: Add `sample_name` for IRIDA-Next integration
sgsutcliffe authored Oct 18, 2024
2 parents ed95bf0 + 5f8b3ec commit 9a41c6b
Showing 17 changed files with 224 additions and 29 deletions.
5 changes: 5 additions & 0 deletions .editorconfig
@@ -30,3 +30,8 @@ indent_style = unset
# ignore python
[*.{py}]
indent_style = unset

# ignore nf-test json file
[tests/data/irida/*.json]
insert_final_newline = unset
trim_trailing_whitespace = unset
23 changes: 19 additions & 4 deletions .github/workflows/linting.yml
@@ -1,6 +1,6 @@
name: nf-core linting
# This workflow is triggered on pushes and PRs to the repository.
# It runs the `nf-core lint` and markdown lint tests to ensure
# It runs the `nf-core pipelines lint` and markdown lint tests to ensure
# that the code meets the nf-core guidelines.
on:
push:
@@ -41,17 +41,32 @@ jobs:
python-version: "3.12"
architecture: "x64"

- name: read .nf-core.yml
uses: pietrobolcato/[email protected]
id: read_yml
with:
config: ${{ github.workspace }}/.nf-core.yml

- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install nf-core
pip install nf-core==${{ steps.read_yml.outputs['nf_core_version'] }}
- name: Run nf-core pipelines lint
if: ${{ github.base_ref != 'master' }}
env:
GITHUB_COMMENTS_URL: ${{ github.event.pull_request.comments_url }}
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
GITHUB_PR_COMMIT: ${{ github.event.pull_request.head.sha }}
run: nf-core -l lint_log.txt pipelines lint --dir ${GITHUB_WORKSPACE} --markdown lint_results.md

- name: Run nf-core lint
- name: Run nf-core pipelines lint --release
if: ${{ github.base_ref == 'master' }}
env:
GITHUB_COMMENTS_URL: ${{ github.event.pull_request.comments_url }}
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
GITHUB_PR_COMMIT: ${{ github.event.pull_request.head.sha }}
run: nf-core -l lint_log.txt lint --dir ${GITHUB_WORKSPACE} --markdown lint_results.md
run: nf-core -l lint_log.txt pipelines lint --release --dir ${GITHUB_WORKSPACE} --markdown lint_results.md

- name: Save PR number
if: ${{ always() }}
2 changes: 1 addition & 1 deletion .github/workflows/linting_comment.yml
@@ -11,7 +11,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Download lint results
uses: dawidd6/action-download-artifact@09f2f74827fd3a8607589e5ad7f9398816f540fe # v3
uses: dawidd6/action-download-artifact@bf251b5aa9c2f7eeb574a96ee720e24f801b7c11 # v6
with:
workflow: linting.yml
workflow_conclusion: completed
5 changes: 4 additions & 1 deletion .nf-core.yml
@@ -1,6 +1,6 @@
repository_type: pipeline

nf_core_version: "2.14.1"
nf_core_version: "3.0.1"
lint:
files_exist:
- assets/nf-core-gasnomenclature_logo_light.png
@@ -27,6 +27,9 @@ lint:
- custom_config
- manifest.name
- manifest.homePage
- params.max_cpus
- params.max_memory
- params.max_time
readme:
- nextflow_badge

1 change: 1 addition & 0 deletions .prettierignore
@@ -10,3 +10,4 @@ testing/
testing*
*.pyc
bin/
tests/data/irida/sample_name_add_iridanext.output.json
9 changes: 9 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,15 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## Development

### Changed

- Added the ability to include a `sample_name` column in the input samplesheet.csv, allowing compatibility with the IRIDA-Next input configuration.
  - Special characters in `sample_name` will be replaced with `"_"`
  - If no `sample_name` is supplied, the `sample` value will be used
  - To avoid repeated `sample_name` values, duplicates are suffixed with the unique `sample` value from the input file

## [0.2.3] - 2024/09/25

### `Changed`
10 changes: 10 additions & 0 deletions README.md
@@ -20,6 +20,16 @@ The structure of this file is defined in [assets/schema_input.json](assets/schem

Details on the columns can be found in the [Full samplesheet](docs/usage.md#full-samplesheet) documentation.

## IRIDA-Next Optional Input Configuration

`gasnomenclature` accepts the [IRIDA-Next](https://github.com/phac-nml/irida-next) format for samplesheets which can contain an additional column: `sample_name`

`sample_name`: An **optional** column that overrides `sample` for outputs (filenames and sample names) and reference assembly identification.

`sample_name` allows more flexibility in naming output files and identifying samples. Unlike `sample`, `sample_name` is not required to contain unique values. `Nextflow` requires unique sample names, so when a `sample_name` is repeated, the unique `sample` value is appended as a suffix. Non-alphanumeric characters (excluding `_`, `-`, `.`) will be replaced with `"_"`.

An [example samplesheet](tests/data/samplesheets/samplesheet-sample_name.csv) has been provided with the pipeline.
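The renaming rules above can be sketched in Python. This is an illustrative approximation, not the pipeline's actual implementation; the expected names mirror the test samplesheet and error-report filenames added in this change:

```python
import re

def sanitize(sample, sample_name, seen):
    # Fall back to the unique `sample` ID when no sample_name is given
    name = sample_name or sample
    # Replace characters other than alphanumerics, "_", "-", "." with "_"
    name = re.sub(r"[^A-Za-z0-9_.-]", "_", name)
    # On a repeated name, append the unique `sample` value to keep it distinct
    if name in seen:
        name = f"{name}_{sample}"
    seen.add(name)
    return name

seen = set()
print(sanitize("sampleQ", "sample 1", seen))  # sample_1
print(sanitize("sample1", "sample#2", seen))  # sample_2
print(sanitize("sample2", "sample#2", seen))  # sample_2_sample2
```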

# Parameters

The main parameters are `--input` as defined above and `--output` for specifying the output results directory. You may wish to provide `-profile singularity` to specify the use of singularity containers and `-r [branch]` to specify which GitHub branch you would like to run.
9 changes: 7 additions & 2 deletions assets/schema_input.json
@@ -1,5 +1,5 @@
{
"$schema": "http://json-schema.org/draft-07/schema",
"$schema": "https://json-schema.org/draft-07/schema",
"$id": "https://raw.githubusercontent.com/phac-nml/gasnomenclature/main/assets/schema_input.json",
"title": "phac-nml/gasnomenclature pipeline - params.input schema",
"description": "Schema for the file provided with params.input",
@@ -10,10 +10,15 @@
"sample": {
"type": "string",
"pattern": "^\\S+$",
"meta": ["id"],
"meta": ["irida_id"],
"unique": true,
"errorMessage": "Sample name must be provided and cannot contain spaces"
},
"sample_name": {
"type": "string",
"meta": ["id"],
"errorMessage": "Sample name is optional, if provided will replace sample for filenames and outputs"
},
"mlst_alleles": {
"type": "string",
"format": "file-path",
9 changes: 7 additions & 2 deletions conf/iridanext.config
@@ -4,13 +4,18 @@ iridanext {
path = "${params.outdir}/iridanext.output.json.gz"
overwrite = true
files {
idkey = "irida_id"
samples = ["**/input/*_error_report.csv"]
}
metadata {
samples {
keep = [
"address"
]
csv {
path = "**/filter/new_addresses.csv"
idcol = "id"
path = "**/filter/new_addresses.tsv"
sep = "\t"
idcol = 'irida_id'
}
}
}
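The config above keys sample files and metadata by the new `irida_id` meta value and reads addresses from a tab-separated `new_addresses.tsv`. A hedged sketch of how the plugin's metadata step can be understood (illustrative only; the example rows match the expected `sample_name_add_iridanext.output.json` in this change):

```python
import csv, io

# Example TSV matching the configured layout (sep = "\t", idcol = "irida_id")
new_addresses = (
    "irida_id\tid\taddress\n"
    "sampleQ\tsample_1\t2.2.3\n"
    "sampleR\tsample4\t2.2.3\n"
)

metadata = {}
for row in csv.DictReader(io.StringIO(new_addresses), delimiter="\t"):
    # keep = ["address"]: only the address column becomes sample metadata,
    # keyed by the irida_id column rather than the (possibly renamed) id
    metadata[row["irida_id"]] = {"address": row["address"]}

print(metadata)  # {'sampleQ': {'address': '2.2.3'}, 'sampleR': {'address': '2.2.3'}}
```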
2 changes: 1 addition & 1 deletion docs/output.md
@@ -93,7 +93,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
<summary>Output files</summary>

- `filter/`
- `new_addresses.csv`
- `new_addresses.tsv`

</details>

26 changes: 25 additions & 1 deletion docs/usage.md
@@ -12,7 +12,7 @@ You will need to create a samplesheet with information about the samples you wou
--input '[path to samplesheet file]'
```

### Full samplesheet
### Full Standard Samplesheet

The input samplesheet must contain three columns: `sample`, `mlst_alleles`, `address`. The sample names within a samplesheet should be unique. All other columns will be ignored.

@@ -33,6 +33,28 @@ sampleF,sampleF.mlst.json,

An [example samplesheet](../assets/samplesheet.csv) has been provided with the pipeline.

### IRIDA-Next Optional Samplesheet Configuration

`gasnomenclature` accepts the [IRIDA-Next](https://github.com/phac-nml/irida-next) format for samplesheets which contain the following columns: `sample`, `sample_name`, `mlst_alleles`, `address`. The sample IDs within a samplesheet should be unique.

A final samplesheet file consisting of mlst_alleles and addresses may look something like the one below:

```csv title="samplesheet.csv"
sample,sample_name,mlst_alleles,address
sampleA,S1,sampleA.mlst.json.gz,1.1.1
sampleQ,S2,sampleQ.mlst.json.gz,2.2.2
sampleF,,sampleF.mlst.json,
```

| Column | Description |
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `sample` | Custom sample name. Samples should be unique within a samplesheet. |
| `sample_name`  | **Optional** sample name used in outputs (filenames and sample names). If omitted, `sample` is used. |
| `mlst_alleles` | Full path to an MLST JSON file describing the loci/alleles for the sample against some MLST scheme. A way to generate this file is via [locidex]. File can optionally be gzipped and must have the extension ".mlst.json", ".mlst.subtyping.json" (or with an additional ".gz" if gzipped). |
| `address`      | Hierarchical clustering address. If left empty for a sample, the pipeline will assign a cluster address. |

An [example samplesheet](tests/data/samplesheets/samplesheet-sample_name.csv) has been provided with the pipeline.

## Running the pipeline

The typical command for running the pipeline is as follows:
@@ -185,3 +207,5 @@ We recommend adding the following line to your environment to limit this (typica
```bash
NXF_OPTS='-Xms1g -Xmx4g'
```

[locidex]: https://github.com/phac-nml/locidex
9 changes: 6 additions & 3 deletions modules/local/filter_query/main.nf
@@ -13,7 +13,7 @@ process FILTER_QUERY {
val out_format

output:
path("new_addresses.*"), emit: csv
path("new_addresses.*"), emit: tsv
path("versions.yml"), emit: versions

script:
@@ -24,13 +24,16 @@

"""
# Filter the query samples only; keep only the 'id' and 'address' columns
csvtk cut -t -f 2 ${query_ids} > query_list.txt # Need to use the second column to pull meta.id because there is no header
csvtk add-header ${query_ids} -t -n irida_id,id > id.txt
csvtk grep \\
${addresses} \\
-f 1 \\
-P ${query_ids} \\
-P query_list.txt \\
--delimiter "${delimiter}" \\
--out-delimiter "${out_delimiter}" | \\
csvtk cut -f id,address > ${outputFile}.${out_extension}
csvtk cut -t -f id,address > tmp.tsv
csvtk join -t -f id id.txt tmp.tsv > ${outputFile}.${out_extension}
cat <<-END_VERSIONS > versions.yml
"${task.process}":
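The revised `csvtk` pipeline above filters the full address table down to the query samples and re-attaches each sample's `irida_id` via a join. A rough Python equivalent, assuming small in-memory TSV inputs (a sketch, not the module's actual mechanism):

```python
import csv, io

def filter_query(query_ids_tsv: str, addresses_tsv: str) -> str:
    """Mimic the csvtk steps: filter addresses to query samples, then
    prepend each sample's irida_id, producing irida_id/id/address rows."""
    # query_ids is headerless: column 1 = irida_id, column 2 = id (meta.id)
    queries = {row[1]: row[0]
               for row in csv.reader(io.StringIO(query_ids_tsv), delimiter="\t")}
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["irida_id", "id", "address"])
    for row in csv.DictReader(io.StringIO(addresses_tsv), delimiter="\t"):
        if row["id"] in queries:  # csvtk grep -f 1 -P query_list.txt
            writer.writerow([queries[row["id"]], row["id"], row["address"]])
    return out.getvalue()
```

With a query list of `sampleQ\tsample_1` and an address table containing `sample_1`, this yields a `new_addresses.tsv` row of the form `sampleQ\tsample_1\t2.2.3`, matching the columns asserted in the updated tests.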
5 changes: 2 additions & 3 deletions nextflow_schema.json
@@ -1,5 +1,5 @@
{
"$schema": "http://json-schema.org/draft-07/schema",
"$schema": "https://json-schema.org/draft-07/schema",
"$id": "https://raw.githubusercontent.com/phac-nml/gasnomenclature/main/nextflow_schema.json",
"title": "phac-nml/gasnomenclature pipeline parameters",
"description": "Gas Nomenclature assignment pipeline",
@@ -84,8 +84,7 @@
},
"pd_count_missing": {
"type": "boolean",
"description": "Count missing alleles as different",
"default": false
"description": "Count missing alleles as different"
}
}
},
39 changes: 39 additions & 0 deletions tests/data/irida/sample_name_add_iridanext.output.json
@@ -0,0 +1,39 @@
{
"files": {
"global": [

],
"samples": {
"sampleQ": [
{
"path": "input/sample_1_error_report.csv"
}
],
"sample1": [
{
"path": "input/sample_2_error_report.csv"
}
],
"sample2": [
{
"path": "input/sample_2_sample2_error_report.csv"
}
],
"sampleR": [
{
"path": "input/sample4_error_report.csv"
}
]
}
},
"metadata": {
"samples": {
"sampleQ": {
"address": "2.2.3"
},
"sampleR": {
"address": "2.2.3"
}
}
}
}
6 changes: 6 additions & 0 deletions tests/data/samplesheets/samplesheet-sample_name.csv
@@ -0,0 +1,6 @@
sample,sample_name,mlst_alleles,address
sampleQ,sample 1,https://raw.githubusercontent.com/phac-nml/gasnomenclature/dev/tests/data/reports/sampleQ.mlst.json,
sample1,sample#2,https://raw.githubusercontent.com/phac-nml/gasnomenclature/dev/tests/data/reports/sample1.mlst.json,1.1.1
sample2,sample#2,https://raw.githubusercontent.com/phac-nml/gasnomenclature/dev/tests/data/reports/sample2.mlst.json,1.1.1
sample3,,https://raw.githubusercontent.com/phac-nml/gasnomenclature/dev/tests/data/reports/sample3.mlst.json,1.1.2
sampleR,sample4,https://raw.githubusercontent.com/phac-nml/gasnomenclature/dev/tests/data/reports/sampleF.mlst.json,
54 changes: 47 additions & 7 deletions tests/pipelines/main.nf.test
@@ -221,9 +221,9 @@ nextflow_pipeline {
assert lines.contains("sampleR,[\'sampleF\'],Query sampleR ID and JSON key in sampleF.mlst.json DO NOT MATCH. The 'sampleF' key in sampleF.mlst.json has been forcefully changed to 'sampleR': User should manually check input files to ensure correctness.")

// Check filter_query csv file
lines = path("$launchDir/results/filter/new_addresses.csv").readLines()
assert lines.contains("sampleQ,2.2.3")
assert lines.contains("sampleR,2.2.3")
lines = path("$launchDir/results/filter/new_addresses.tsv").readLines()
assert lines.contains("sampleQ\tsampleQ\t2.2.3")
assert lines.contains("sampleR\tsampleR\t2.2.3")

// Check IRIDA Next JSON output
assert path("$launchDir/results/iridanext.output.json").json == path("$baseDir/tests/data/irida/mismatched_iridanext.output.json").json
@@ -271,8 +271,8 @@ nextflow_pipeline {
assert lines.contains('sample3,"[\'extra_key\', \'sample3\']","MLST JSON file (sample3_multiplekeys.mlst.json) contains multiple keys: [\'extra_key\', \'sample3\']. The MLST JSON file has been modified to retain only the \'sample3\' entry"')

// Check filtered query csv results
lines = path("$launchDir/results/filter/new_addresses.csv").readLines()
assert lines.contains("sampleQ,1.1.3")
lines = path("$launchDir/results/filter/new_addresses.tsv").readLines()
assert lines.contains("sampleQ\tsampleQ\t1.1.3")

// Check IRIDA Next JSON output
assert path("$launchDir/results/iridanext.output.json").json == path("$baseDir/tests/data/irida/multiplekeys_iridanext.output.json").json
@@ -320,8 +320,8 @@ nextflow_pipeline {
assert lines.contains('sample3,"[\'extra_key\', \'sample4\']",No key in the MLST JSON file (sample3_multiplekeys_nomatch.mlst.json) matches the specified sample ID \'sample3\'. The first key \'extra_key\' has been forcefully changed to \'sample3\' and all other keys have been removed.')

// Check filtered query csv results
lines = path("$launchDir/results/filter/new_addresses.csv").readLines()
assert lines.contains("sampleQ,1.1.3")
lines = path("$launchDir/results/filter/new_addresses.tsv").readLines()
assert lines.contains("sampleQ\tsampleQ\t1.1.3")

// Check IRIDA Next JSON output
assert path("$launchDir/results/iridanext.output.json").json == path("$baseDir/tests/data/irida/multiplekeys_iridanext.output.json").json
@@ -354,4 +354,44 @@ nextflow_pipeline {
assert (workflow.stdout =~ /sample2_empty.mlst.json is missing the 'profile' section or is completely empty!/).find()
}
}

test("Testing when sample_name column is included on input"){
// For integration in IRIDA-Next there needs to be an option to have the input file include a sample_name column

tag "add-sample-name"

when{
params {
input = "$baseDir/tests/data/samplesheets/samplesheet-sample_name.csv"
outdir = "results"
}
}

then {
assert workflow.success
assert path("$launchDir/results").exists()

// Check outputs
def lines = []

// Ensure that the error_reports are generated for query and reference samples based on sample_name swap with sample
lines = path("$launchDir/results/input/sample_1_error_report.csv").readLines()
assert lines.contains("sample_1,[\'sampleQ\'],Query sample_1 ID and JSON key in sampleQ.mlst.json DO NOT MATCH. The 'sampleQ' key in sampleQ.mlst.json has been forcefully changed to 'sample_1': User should manually check input files to ensure correctness.")

lines = path("$launchDir/results/input/sample_2_error_report.csv").readLines()
assert lines.contains("sample_2,[\'sample1\'],Reference sample_2 ID and JSON key in sample1.mlst.json DO NOT MATCH. The 'sample1' key in sample1.mlst.json has been forcefully changed to 'sample_2': User should manually check input files to ensure correctness.")

lines = path("$launchDir/results/input/sample_2_sample2_error_report.csv").readLines()
assert lines.contains("sample_2_sample2,[\'sample2\'],Reference sample_2_sample2 ID and JSON key in sample2.mlst.json DO NOT MATCH. The 'sample2' key in sample2.mlst.json has been forcefully changed to 'sample_2_sample2': User should manually check input files to ensure correctness.")

// Check filter_query csv file
lines = path("$launchDir/results/filter/new_addresses.tsv").readLines()
assert lines.contains("sampleQ\tsample_1\t2.2.3")
assert lines.contains("sampleR\tsample4\t2.2.3")

// Check IRIDA Next JSON output
assert path("$launchDir/results/iridanext.output.json").json == path("$baseDir/tests/data/irida/sample_name_add_iridanext.output.json").json

}
}
}