Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace pipes with tmp file #37

Open
wants to merge 7 commits into
base: dev
Choose a base branch
from
Open

Conversation

sgsutcliffe
Copy link
Contributor

@sgsutcliffe sgsutcliffe commented Jan 27, 2025

This PR is tackling two things:
The benchmark exposed a scalability issue with using bash commands for append_clusters and updating the GAS version to fix some issues with gas call

Addressing the Issue #36

Problem

When dealing with large database files to include with the --db_profiles and db_clusters the process APPEND_CLUSTERS() was failing due to a 141 error, which appears to be cause by the pipes in the bash script, specificallt in pipe between get_address function and the following awk command to get the number of levels of the genomic address.

    # Check if two files have consistent delimeter splits in the address column
    init_splits=\$(get_address "${initial_clusters}" | awk -F '${params.gm_delimiter}' '{print NF}')
    add_splits=\$(get_address "${additional_clusters}" | awk -F '${params.gm_delimiter}' '{print NF}')

I was currently testing 422 Salmonella samples with about 2500 loci and three levels in both the initial and additional clusters when I noticed the bug.

Solution

Tried removing the pipefail for the process but it didn't work. The current temporary solution is to create a tmp file as an intermediate between get_address and the awk command.

Addressing the Issue #38 and Issue #33

There were a number of issues we had in gasnomenclature were coming from gas specifically with gas call. We are updating the version of the genomic address service aka gas to 0.1.4 to address these issues. Will need to revert the tests changed in PR #31

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • Make sure your code lints (nf-core lint).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

Copy link

This PR is against the main branch ❌

  • Do not close this PR
  • Click Edit and change the base to dev
  • This CI test will remain failed until you push a new commit

Hi @sgsutcliffe,

It looks like this pull-request is has been made against the phac-nml/gasnomenclature main branch.
The main branch on phac-nml repositories should always contain code from the latest release.
Because of this, PRs to main are only allowed if they come from the phac-nml/gasnomenclature dev branch.

You do not need to close this PR, you can change the target branch to dev by clicking the "Edit" button at the top of this page.
Note that even after this, the test will continue to show as failing until you push a new commit.

Thanks again for your contribution!

Copy link

github-actions bot commented Jan 27, 2025

nf-core pipelines lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 5f56e85

+| ✅ 144 tests passed       |+
#| ❔  23 tests were ignored |#
!| ❗   4 tests had warnings |!

❗ Test warnings:

❔ Tests ignored:

  • files_exist - File is ignored: assets/nf-core-gasnomenclature_logo_light.png
  • files_exist - File is ignored: docs/images/nf-core-gasnomenclature_logo_dark.png
  • files_exist - File is ignored: docs/images/nf-core-gasnomenclature_logo_light.png
  • files_exist - File is ignored: .github/workflows/awstest.yml
  • files_exist - File is ignored: .github/workflows/awsfulltest.yml
  • nextflow_config - Config variable ignored: manifest.name
  • nextflow_config - Config variable ignored: manifest.homePage
  • nextflow_config - Config variable ignored: params.max_cpus
  • files_unchanged - File ignored due to lint config: LICENSE or LICENSE.md or LICENCE or LICENCE.md
  • files_unchanged - File ignored due to lint config: .github/CONTRIBUTING.md
  • files_unchanged - File ignored due to lint config: .github/ISSUE_TEMPLATE/bug_report.yml
  • files_unchanged - File ignored due to lint config: .github/PULL_REQUEST_TEMPLATE.md
  • files_unchanged - File ignored due to lint config: .github/workflows/branch.yml
  • files_unchanged - File ignored due to lint config: assets/email_template.html
  • files_unchanged - File ignored due to lint config: assets/email_template.txt
  • files_unchanged - File ignored due to lint config: assets/sendmail_template.txt
  • files_unchanged - File does not exist: assets/nf-core-gasnomenclature_logo_light.png
  • files_unchanged - File does not exist: docs/images/nf-core-gasnomenclature_logo_light.png
  • files_unchanged - File does not exist: docs/images/nf-core-gasnomenclature_logo_dark.png
  • files_unchanged - File ignored due to lint config: docs/README.md
  • actions_awstest - 'awstest.yml' workflow not found: /home/runner/work/gasnomenclature/gasnomenclature/.github/workflows/awstest.yml
  • actions_awsfulltest - actions_awsfulltest
  • pipeline_name_conventions - pipeline_name_conventions

✅ Tests passed:

Run details

  • nf-core/tools version 3.0.1
  • Run at 2025-02-04 18:21:29

@sgsutcliffe sgsutcliffe changed the base branch from main to dev January 27, 2025 19:01
Copy link
Member

@apetkau apetkau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for fixing this Steven 😄 . In addition to in-line comments could you also add a message to CHANGELOG.md about this fix?

@@ -4,7 +4,7 @@ process APPEND_CLUSTERS {

container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/csvtk:0.22.0--h9ee0642_1' :
'biocontainers/csvtk:0.22.0--h9ee0642_1' }"
'biocontainers/csvtk:0.22.0--0' }"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you revert to container with tag 0 instead of h9ee0642_1?

Copy link
Contributor Author

@sgsutcliffe sgsutcliffe Feb 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A mistake, when I was trying things out in tests to see if it was a csvtk version issue a6d6de9

fi
}

# Check if two files have consistent delimeter splits in the address column
init_splits=\$(get_address "${initial_clusters}" | awk -F '${params.gm_delimiter}' '{print NF}')
add_splits=\$(get_address "${additional_clusters}" | awk -F '${params.gm_delimiter}' '{print NF}')
get_address "${initial_clusters}" > tmp1.txt
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a small thing, but could you rename tmp1 and tmp2 to correspond to the type of data they contain (e.g., initial_clusters and additional_clusters)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -1,6 +1,5 @@
process CLUSTER_FILE {
tag "Create cluster file for GAS call"
label 'process_single'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was removed since the resource label isn't needed for a process that runs native code am I correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A mistake when I was testing things related to the benchmarking. 3850e90

@sgsutcliffe
Copy link
Contributor Author

The tests that stopped working when we updated GAS to 0.1.3, see the PR due to the threshold issue were:

  • 'Test pipeline with single threshold set to 1'
  • 'Test pipeline with threshold set to 1,0'
  • 'Full pipeline hashes and missing data'
  • 'Full pipeline hashes and missing data count missing as differences'
  • 'Full pipeline remove loci with missing data'

When I updated the GAS container to 0.1.4, with the idea of fixing the issue that caused these tests to fail due to the threshold issue, only one test failed:

  • Full pipeline hashes and missing data

Which was not what I was expecting. It is possible it was the other isssue addressed with GAS 0.1.4 which was regarding the linkage.method, which appeared to be hard-coded to single. I thought a lot of tests would have failed with 0.1.4 due to this switch since the default was average in the nextflow.config. It's possible the linkage.method was causing an issue with gasnomenclature.

I will look more closely into the tests to make sure they are behaving as we would expect, based on the GAS versions that were being used.

@sgsutcliffe
Copy link
Contributor Author

sgsutcliffe commented Feb 4, 2025

I have confirmed that it is actually the linkage.method that was causing the issue initially but obfuscated by threshold bug.

The results for the example the test 'Test pipeline with threshold set to 1,0' that failed was:

id	address
sample1	1.1
sample2	1.1
sample3	1.1
sampleQ	1.2

This is output was generated by a gas call with the settings of linkage.method = "single" and threshold is >= but the nextflow.config has linkage.method = "average" which was not actually working due to linkage.method being hard-coded to single.

In GAS 0.1.3 the linkage.method was still hard-coded to be "single" BUT introduced the new bug of threshold as >, which results in an output of gas call of:

id	address
sample1	1.1
sample2	1.1
sample3	1.1
sampleQ	2.2

In GAS 0.1.4 we made two fixes 1) that linkage.method was no longer hard-coded to be single and 2) threshold is correctly back to >=

The output of the test stayed the same as GAS 0.1.3 because now gas call has both fixes implemented. If we were to set the test to single for linkage.method we would get the originally test output for sampleQ = 1.2 but since the default in the config is actually average and that's what the test uses sampleQ = 2.2 is correct.

Copy link
Member

@DarianHole DarianHole left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks Steven!! Two small questions from me and that is all

Comment on lines +31 to +33
echo "\$ref_headers"
echo "\$add_headers"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this added for testing purposes or is it for something else I'm overlooking?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was! Thanks! My test changes keep showing up -- I think I messed up the version I commited.

Comment on lines +51 to +54
csvtk join -t -f sample_id combined_profiles.tsv sample_counts.tsv | csvtk mutate2 -t -n new_sample_id -e '(\$source == "db" && \$frequency > 1) ? "db_" + \$sample_id : \$sample_id' > tmp.txt
csvtk cut -t -f 2-\${n} tmp.txt > tmp2.txt
csvtk cut -t -f new_sample_id tmp.txt | csvtk rename -t -f new_sample_id -n sample_id > tmp3.txt
paste tmp3.txt tmp2.txt > profiles_ref.tsv
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aaron had mentioned it in the previous review; do we want to adjust the temporary file names to match what they contain here as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a6d6de9 don't know what happened to the changes!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! I forgot that I did a similar thing on APPEND_PROFILES I will make those changes. It was a different issue I was having that came up in the benchmark I forgot I had made those changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants