pubDir should reuse s3 objects rather than re-copying them #115

drernie · 2023-08-22T16:15:57Z

When running in Amazon EC2, ~~the FUSION filesystem already~~ users can manually sync "local" files from the instance onto the S3 backing store.

Right now, the nf-quilt 0.4.x plugin:

1. copies those files to a temporary directory
2. then recopies them onto Amazon S3

Questions:

1. Is the native pubDir for S3 smart enough to reuse the pre-synced S3 objects, rather than copying them?
2. If so, can nf-quilt do the same thing?

rulixxx · 2023-08-24T10:55:13Z

Actually drernie, I was thinking that this is not just related to Fusion. In a typical run without Fusion in AWS Batch, you can ask the instances running your processes to directly copy the output files to s3 using the aws client.
So what I'm trying to do at the moment is to capture the full s3: path of all the output files from the pipeline (which is not easily accessible from Nextflow). Then proceed to do:
p.set("LIB301.cram", "s3://data-output/crams/LIB301.cram")
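A minimal sketch of that idea, assuming the published objects all sit under one known S3 prefix. The helper names `logical_key_for` and `build_package` are hypothetical, and the bucket is the illustrative one from the line above. `quilt3.Package.set` is called with the S3 URI as the entry, so the package refers to the object by location rather than to a local copy.

```python
from urllib.parse import urlparse


def logical_key_for(s3_uri: str, prefix: str) -> str:
    """Return the package-relative key for an object stored under `prefix`.

    Both arguments are s3://bucket/key URIs; `prefix` is the published
    output root (an assumption of this sketch).
    """
    uri, root = urlparse(s3_uri), urlparse(prefix)
    if uri.scheme != "s3" or uri.netloc != root.netloc:
        raise ValueError(f"{s3_uri} is not under {prefix}")
    root_path = root.path.strip("/")
    key = uri.path.lstrip("/")
    if root_path:
        if not key.startswith(root_path + "/"):
            raise ValueError(f"{s3_uri} is not under {prefix}")
        key = key[len(root_path) + 1:]
    return key


def build_package(s3_uris, prefix):
    """Register each S3 object in a new package by its URI."""
    import quilt3  # third-party; imported lazily so the helper above runs without it

    pkg = quilt3.Package()
    for uri in s3_uris:
        # Same shape as the p.set(...) call above: logical key, then S3 URI.
        pkg.set(logical_key_for(uri, prefix), uri)
    return pkg
```

For example, `logical_key_for("s3://data-output/crams/LIB301.cram", "s3://data-output/crams")` gives `"LIB301.cram"`. Whether a later `push` copies the objects again depends on the destination and on options such as `selector_fn`, so that part is left out of the sketch.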

drernie · 2023-08-24T20:00:57Z

Thanks! Can you clarify what this means?

> you can ask the instances running your processes to directly copy the output files to s3 using the aws client

Specifically:

> capture the full s3: path of all the output files from the pipeline. (which is not easily accessible from Nextflow)
makes it sound like "you can ask" using some non-standard process, which may be difficult for us to automate.

A couple questions:

1. Does "directly copy the output files to s3" actually mean "only write them to s3", with NO local storage at all?
2. What is the benefit of doing the "direct copy" in the middle of the pipeline, rather than the end?
3. Could we do that "direct copy" to the same destination S3 path that would be used by Quilt? In that case, we may be able to intelligently avoid an upload.
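To make the last question concrete, here is a rough sketch of the check it implies. It is not how nf-quilt currently behaves; `needs_upload` and its parameters are hypothetical names for illustration. The idea is only that if a file was already published to the exact S3 location the package would use, the upload can be skipped.

```python
def needs_upload(published_uri: str, package_root: str, logical_key: str) -> bool:
    """Return True when the object is not already at its package destination.

    `package_root` is the S3 prefix the package would write to and
    `logical_key` is the file's path inside the package (both assumed inputs).
    """
    destination = f"{package_root.rstrip('/')}/{logical_key.lstrip('/')}"
    return published_uri != destination
```

A real implementation would probably also confirm that the object exists and compare sizes or checksums before trusting it.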

rulixxx · 2023-08-24T20:48:03Z

In my pipeline I've created processes that look like this:

```
process FASTQC {
    tag "$meta.name"

    input:
    tuple val(meta), path(reads)

    output:
    path "*_fastqc.zip", emit: zip
    path "*_fastqc.html", emit: html
    path "versions.yml", emit: versions
    path "*.quilt", emit: quilt

    publishDir "$outdir/QC/fastqc", mode: 'copy', overwrite: true

    script:
    """
    fastqc -t 2 --memory 2000 ${reads[0]} &
    fastqc -t 2 --memory 2000 ${reads[1]} &
    wait
    touch ${task.process}_${meta.name}.quilt
    ls *_fastqc.zip | xargs -i echo $outdir/QC/fastqc/{} >> ${task.process}_${meta.name}.quilt
    ls *_fastqc.html | xargs -i echo $outdir/QC/fastqc/{} >> ${task.process}_${meta.name}.quilt
    ls versions.yml | xargs -i echo $outdir/QC/fastqc/{} >> ${task.process}_${meta.name}.quilt
    """
}
```

(notice the .quilt file output)

Seems a bit hacky, but this way I capture all the files that are being published to s3, and at the end of the pipeline I call a Python script that uses the Quilt API to process the *.quilt files, then creates and pushes the package. I'm trying this as a workaround to be able to directly push s3 locations into the packages.

A process does all its work with local files on the instance. Since input files may be on s3 (most typical), on a local disk or on a shared disk, they must first be copied to the instance disk. The same goes for outputs: they must be copied back to s3 (most typical again), a local disk or a shared disk.

The advantage of doing transfers while processing is that they overlap with other work, so one doesn't have to spend extra time at the end, which is nice but not critical.

In my mind the big disadvantage of the plugin right now is that result files are copied locally to the instance running the head Nextflow job (the one controlling the whole workflow). This is usually a small instance, as Nextflow doesn't need many resources. However, the output of the workflow can be massive; ours is typically several TBs, so that will overwhelm the head instance.

I guess the solution I proposed will work well when creating packages, but if one has to modify a large preexisting package it might also run into issues.
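For reference, the first step of that end-of-pipeline script could look roughly like this. It is a sketch, not the actual script: `read_manifests` is a hypothetical name, and it assumes the *.quilt files have been gathered into one directory and contain one published URI per line, as the process above writes them.

```python
from pathlib import Path


def read_manifests(directory):
    """Collect the unique published URIs listed in every *.quilt file."""
    uris = set()
    for manifest in sorted(Path(directory).glob("*.quilt")):
        for line in manifest.read_text().splitlines():
            line = line.strip()
            if line:  # skip blank lines
                uris.add(line)
    # A set plus sort: versions.yml is listed by every task, so duplicates are expected.
    return sorted(uris)
```

Only these small text files reach the head instance; the multi-TB outputs stay on s3 and would be referenced by URI when the package is built.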


drernie changed the title ~~pubDir to packages directly from FUSION s3~~ pubDir should reuse FUSION s3 objects rather than re-copying them Aug 22, 2023

drernie changed the title ~~pubDir should reuse FUSION s3 objects rather than re-copying them~~ pubDir should reuse s3 objects rather than re-copying them Aug 24, 2023

drernie mentioned this issue Sep 4, 2024

explicit output URI #235

Closed

drernie mentioned this issue Sep 12, 2024

pass dest to quiltcore-java #239

Closed
