Replace pipes with tmp file #37
base: dev
This PR is against the
Thanks so much for fixing this, Steven 😄. In addition to the in-line comments, could you also add a message to CHANGELOG.md about this fix?
@@ -4,7 +4,7 @@ process APPEND_CLUSTERS {

 container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
 'https://depot.galaxyproject.org/singularity/csvtk:0.22.0--h9ee0642_1' :
-'biocontainers/csvtk:0.22.0--h9ee0642_1' }"
+'biocontainers/csvtk:0.22.0--0' }"
Why did you revert to the container with tag 0 instead of h9ee0642_1?
A mistake from when I was trying things out in the tests to see whether it was a csvtk version issue (a6d6de9).
 fi
 }

 # Check if two files have consistent delimeter splits in the address column
-init_splits=\$(get_address "${initial_clusters}" | awk -F '${params.gm_delimiter}' '{print NF}')
-add_splits=\$(get_address "${additional_clusters}" | awk -F '${params.gm_delimiter}' '{print NF}')
+get_address "${initial_clusters}" > tmp1.txt
This is just a small thing, but could you rename tmp1 and tmp2 to correspond to the type of data they contain (e.g., initial_clusters and additional_clusters)?
modules/local/cluster_file/main.nf
@@ -1,6 +1,5 @@
 process CLUSTER_FILE {
 tag "Create cluster file for GAS call"
-label 'process_single'
This was removed because the resource label isn't needed for a process that runs native code, am I correct?
A mistake when I was testing things related to the benchmarking. 3850e90
The tests that stopped working when we updated GAS to 0.1.3, see the PR due to the threshold issue were:
When I updated the GAS container to 0.1.4, with the idea of fixing the issue that caused these tests to fail due to the threshold issue, only one test failed:
Which was not what I was expecting. It is possible it was the other issue addressed with GAS 0.1.4, which was regarding the
I will look more closely into the tests to make sure they are behaving as we would expect, based on the GAS versions that were being used.
I have confirmed that it is actually the The results for the example the test
This is output was generated by a In GAS 0.1.3 the
In GAS 0.1.4 we made two fixes 1) that The output of the test stayed the same as GAS 0.1.3 because now
Looks great, thanks Steven!! Two small questions from me and that is all
+echo "\$ref_headers"
+echo "\$add_headers"
Was this added for testing purposes or is it for something else I'm overlooking?
It was! Thanks! My test changes keep showing up. I think I messed up the version I committed.
csvtk join -t -f sample_id combined_profiles.tsv sample_counts.tsv | csvtk mutate2 -t -n new_sample_id -e '(\$source == "db" && \$frequency > 1) ? "db_" + \$sample_id : \$sample_id' > tmp.txt
csvtk cut -t -f 2-\${n} tmp.txt > tmp2.txt
csvtk cut -t -f new_sample_id tmp.txt | csvtk rename -t -f new_sample_id -n sample_id > tmp3.txt
paste tmp3.txt tmp2.txt > profiles_ref.tsv
Aaron had mentioned it in the previous review; do we want to adjust the temporary file names to match what they contain here as well?
I had, in a6d6de9. I don't know what happened to the changes!
Oh! I forgot that I did a similar thing in APPEND_PROFILES. I will make those changes. They were for a different issue that came up in the benchmark, and I forgot I had made them.
This PR is tackling two things:
1) The benchmark exposed a scalability issue with using bash commands for append_clusters.
2) Updating the GAS version to fix some issues with gas call.
Addressing the Issue #36
Problem
When dealing with large database files to include with --db_profiles and db_clusters, the process APPEND_CLUSTERS() was failing with a 141 error, which appears to be caused by the pipes in the bash script, specifically the pipe between the get_address function and the following awk command that gets the number of levels of the genomic address.
I was testing 422 Salmonella samples with about 2500 loci and three levels in both the initial and additional clusters when I noticed the bug.
Solution
I tried removing pipefail for the process, but it didn't work. The current temporary solution is to create a tmp file as an intermediate between get_address and the awk command.
Addressing the Issue #38 and Issue #33
A number of the issues we had in gasnomenclature were coming from gas, specifically from gas call. We are updating the version of the genomic address service (aka gas) to 0.1.4 to address these issues. We will need to revert the tests changed in PR #31.
PR checklist
Make sure your code lints (nf-core lint).
Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
CHANGELOG.md is updated.
README.md is updated (including new tool citations and authors/contributors).
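A minimal sketch of the temporary-file workaround described under Issue #36, for illustration only. The get_address function below is a hypothetical stand-in (the real helper is not shown in this PR), the address is assumed to sit in column 2 of a TSV, and the file names and the '.' delimiter are made up. The yes | head line only illustrates the general way a pipeline can report 141 (128 + SIGPIPE 13) under pipefail. It is not a diagnosis of the actual failure, since removing pipefail alone did not help here.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for the pipeline's get_address helper (assumption):
# print the address column (assumed to be column 2) of the first data row.
get_address() {
    awk -F '\t' 'NR == 2 { print $2 }' "$1"
}

workdir=$(mktemp -d)
delimiter='.'   # stands in for params.gm_delimiter (assumed value)
printf 'id\taddress\nsampleA\t1.2.3\nsampleB\t1.2.4\n' > "$workdir/initial_clusters.tsv"

# General mechanism: with pipefail set, a producer that is still writing when
# its reader exits is normally killed by SIGPIPE, and the pipeline then
# reports a non-zero status (typically 128 + 13 = 141).
pipe_status=0
yes | head -n 1 > /dev/null || pipe_status=$?
echo "pipeline exit status: $pipe_status"

# Workaround: write the producer's output to a file, then have awk read the
# file, so no pipe connects the two commands.
get_address "$workdir/initial_clusters.tsv" > "$workdir/initial_address.txt"
init_splits=$(awk -F "$delimiter" '{ print NF }' "$workdir/initial_address.txt")
echo "address levels: $init_splits"

rm -r "$workdir"
```

Naming the intermediate file after its contents (initial_address.txt rather than tmp1.txt) follows the renaming request made in the review.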