pubDir should reuse s3 objects rather than re-copying them #115
Comments
Actually drernie, I was thinking that this is not just related to fusion. In a typical run without fusion on AWS Batch, you can ask the instances running your processes to directly copy the output files to s3 using the aws client.
In my pipeline I've created processes that look like this:
process FASTQC {
    tag "$meta.name"
    // Nextflow expects directives such as publishDir before the input/output blocks
    publishDir "$outdir/QC/fastqc", mode: 'copy', overwrite: true

    input:
    tuple val(meta), path(reads)

    output:
    path "*_fastqc.zip", emit: zip
    path "*_fastqc.html", emit: html
    path "versions.yml", emit: versions
    path "*.quilt", emit: quilt

    script:
    """
    fastqc -t 2 --memory 2000 ${reads[0]} &
    fastqc -t 2 --memory 2000 ${reads[1]} &
    wait
    # Record the published location of every output in a per-task .quilt manifest
    touch ${task.process}_${meta.name}.quilt
    ls *_fastqc.zip | xargs -I {} echo $outdir/QC/fastqc/{} >> ${task.process}_${meta.name}.quilt
    ls *_fastqc.html | xargs -I {} echo $outdir/QC/fastqc/{} >> ${task.process}_${meta.name}.quilt
    ls versions.yml | xargs -I {} echo $outdir/QC/fastqc/{} >> ${task.process}_${meta.name}.quilt
    """
}
(Notice the .quilt file output.) It seems a bit hacky, but this way I capture all the files that are being published to s3. At the end of the pipeline I call a Python script that uses the Quilt API to process the *.quilt files, then creates and pushes the package. I'm trying this as a workaround to be able to push s3 locations directly into the packages.
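For illustration only, the end-of-pipeline script described above might look roughly like the sketch below. It is not the script actually used in this pipeline: the function names `read_manifests`, `logical_key`, and `build_package`, and the `key_prefix` argument, are hypothetical. It assumes every line of a *.quilt file is a full s3:// URI and that `quilt3` is installed wherever `build_package` is called.

```python
from pathlib import Path
from urllib.parse import urlparse


def read_manifests(root):
    """Return the unique, non-empty lines of every *.quilt file under root."""
    uris, seen = [], set()
    for manifest in sorted(Path(root).rglob("*.quilt")):
        for line in manifest.read_text().splitlines():
            uri = line.strip()
            if uri and uri not in seen:
                seen.add(uri)
                uris.append(uri)
    return uris


def logical_key(uri, key_prefix=""):
    """Map s3://bucket/<key_prefix>/QC/x.zip to the package path QC/x.zip."""
    path = urlparse(uri).path.lstrip("/")
    key_prefix = key_prefix.strip("/")
    if key_prefix and path.startswith(key_prefix + "/"):
        return path[len(key_prefix) + 1:]
    return path


def build_package(uris, key_prefix=""):
    """Assumption: quilt3 is installed; entries point at the existing S3 objects."""
    import quilt3  # imported lazily so the helpers above work without quilt3

    pkg = quilt3.Package()
    for uri in uris:
        pkg.set(logical_key(uri, key_prefix), uri)
    return pkg
```

The resulting package could then be pushed with `pkg.push(...)`. Whether that push re-copies the objects depends on the quilt3 options used (its documentation describes a `selector_fn` argument for choosing which entries are copied); that part is untested here.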
A process does all its work with local files on the instance. Since input files live on s3 (most typical), locally, or on a shared disk, they must first be copied to the instance disk. The same goes for outputs: they must be copied back to s3 (most typical again), to a local directory, or to a shared disk.
The advantage of doing transfers while processing is that they overlap with other work, so no extra time is spent at the end, which is nice but not critical. In my mind the big disadvantage of the plugin right now is that result files are copied locally to the instance running the head Nextflow job (the one controlling the whole workflow). This is usually a small instance, as Nextflow doesn't need many resources. However, the output of the workflow can be massive (ours is typically several TBs), so that will overwhelm the head instance.
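As a rough sketch of the worker-side alternative, each task could upload its own outputs with the aws client and report back only the resulting S3 URIs, so nothing large passes through the head instance. The helper name `upload_outputs` and its `run` parameter are hypothetical; the sketch assumes the `aws` CLI is installed and has credentials on the worker instance.

```python
import subprocess
from pathlib import PurePosixPath


def upload_outputs(local_paths, dest_prefix, run=subprocess.run):
    """Copy each local file under dest_prefix with `aws s3 cp`; return the S3 URIs."""
    uris = []
    for path in local_paths:
        uri = dest_prefix.rstrip("/") + "/" + PurePosixPath(path).name
        # check=True raises CalledProcessError if the copy fails
        run(["aws", "s3", "cp", "--only-show-errors", str(path), uri], check=True)
        uris.append(uri)
    return uris
```

The `run` argument exists only so the command construction can be exercised without touching S3. In real use the default `subprocess.run` would be left in place.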
I guess the solution I proposed will work well when creating packages, but it might also run into issues if one has to modify a large preexisting package.
From: Dr. Ernie Prabhakar ***@***.***>
Sent: Thursday, August 24, 2023 9:01 PM
To: quiltdata/nf-quilt ***@***.***>
Cc: Raul Alcantara ***@***.***>; Comment ***@***.***>
Subject: Re: [quiltdata/nf-quilt] pubDir should reuse FUSION s3 objects rather than re-copying them (Issue #115)
Thanks! Can you clarify what this means?
"you can ask the instances running your processes to directly copy the output files to s3 using the aws client"
Specifically:
"capture the full s3: path of all the output files from the pipeline. (which is not easily accessible from Nextflow)"
makes it sound like "you can ask" involves some non-standard process, which may be difficult for us to automate.
A couple questions:
1. Does "directly copy the output files to s3" actually mean "only write them to s3", with NO local storage at all?
2. What is the benefit of doing the "direct copy" in the middle of the pipeline, rather than the end?
3. Could we do that "direct copy" to the same destination S3 path that would be used by Quilt? In that case, we may be able to intelligently avoid an upload.
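To make question 3 concrete, the "intelligently avoid an upload" check could be as simple as comparing each output location against the destination prefix the package would use. This is only a sketch of the idea, not how nf-quilt behaves: `split_uploads` and `package_prefix` are hypothetical names.

```python
def split_uploads(locations, package_prefix):
    """Split outputs into (reuse, copy): objects already under package_prefix
    can be referenced in place; everything else still needs an upload."""
    prefix = package_prefix.rstrip("/") + "/"
    reuse, copy = [], []
    for location in locations:
        (reuse if location.startswith(prefix) else copy).append(location)
    return reuse, copy
```

For example, with `package_prefix="s3://bucket/pkg"`, an output at `s3://bucket/pkg/QC/a.zip` lands in `reuse`, while a local path such as `work/a.zip` lands in `copy`.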
When running in Amazon EC2, users can manually sync "local" files from the instance onto the S3 backing store.
Right now, the nf-quilt 0.4.x plugin:
Questions: