
pubDir should reuse s3 objects rather than re-copying them #115

Open
drernie opened this issue Aug 22, 2023 · 3 comments

Comments

@drernie
Member

drernie commented Aug 22, 2023

When running in Amazon EC2 with the Fusion filesystem, users can already manually sync "local" files from the instance onto the S3 backing store.

Right now, the nf-quilt 0.4.x plugin:

  • copies those files to a temporary directory
  • then recopies them onto Amazon S3

Questions:

  1. Is the native pubDir for S3 smart enough to reuse the pre-synced S3 objects, rather than copying them?
  2. If so, can nf-quilt do the same thing?
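Question 1 above amounts to a path-equality check: if the file the workflow staged is already the destination S3 object, the publish step can skip the copy. A minimal sketch of that check in Python (function names and URIs are illustrative assumptions, not nf-quilt's actual API):

```python
# Hypothetical sketch: skip the re-copy when the staged file is already
# the desired S3 object (e.g. pre-synced there by Fusion).
from urllib.parse import urlparse

def same_s3_object(source_uri: str, dest_uri: str) -> bool:
    """Return True when both URIs point at the same S3 bucket and key."""
    src, dst = urlparse(source_uri), urlparse(dest_uri)
    return (src.scheme == dst.scheme == "s3"
            and src.netloc == dst.netloc   # bucket
            and src.path == dst.path)      # key

def publish(source_uri: str, dest_uri: str) -> str:
    if same_s3_object(source_uri, dest_uri):
        return "reused"   # object already in place: no upload needed
    return "copied"       # fall back to the current download/re-upload path

print(publish("s3://data-output/crams/LIB301.cram",
              "s3://data-output/crams/LIB301.cram"))  # reused
```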
@drernie drernie changed the title pubDir to packages directly from FUSION s3 pubDir should reuse FUSION s3 objects rather than re-copying them Aug 22, 2023
@rulixxx

rulixxx commented Aug 24, 2023

Actually @drernie, I was thinking that this is not just related to Fusion. In a typical run without Fusion on AWS Batch, you can ask the instances running your processes to directly copy the output files to S3 using the AWS client.
So what I'm trying to do at the moment is to capture the full s3: path of all the output files from the pipeline (which is not easily accessible from Nextflow), then proceed to do:
p.set("LIB301.cram", "s3://data-output/crams/LIB301.cram")
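The `p.set` call above pairs a package-entry name with the captured S3 URI. A hedged sketch of deriving that mapping from a list of captured output paths (Python with illustrative names — the actual Quilt package API call is the `p.set` shown above):

```python
# Hypothetical sketch: given the full s3:// URIs captured from the pipeline,
# derive the package-entry name for each output from the URI's basename.
from pathlib import PurePosixPath
from urllib.parse import urlparse

def package_entries(s3_uris):
    """Map each output's basename to its full S3 URI."""
    return {PurePosixPath(urlparse(uri).path).name: uri for uri in s3_uris}

entries = package_entries(["s3://data-output/crams/LIB301.cram"])
print(entries)  # {'LIB301.cram': 's3://data-output/crams/LIB301.cram'}
```

Each (name, uri) pair would then be fed to a `p.set(name, uri)`-style call as in the example above.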

@drernie
Member Author

drernie commented Aug 24, 2023

Thanks! Can you clarify what this means?

> you can ask the instances running your processes to directly copy the output files to s3 using the aws client

Specifically, the phrase:

> capture the full s3: path of all the output files from the pipeline (which is not easily accessible from Nextflow)

makes it sound like "you can ask" using some non-standard process, which may be difficult for us to automate.

A couple questions:

  1. Does "directly copy the output files to s3" actually mean "only write them to s3", with NO local storage at all?
  2. What is the benefit of doing the "direct copy" in the middle of the pipeline, rather than the end?
  3. Could we do that "direct copy" to the same destination S3 path that would be used by Quilt? In that case, we may be able to intelligently avoid an upload.

@rulixxx

rulixxx commented Aug 24, 2023 via email

@drernie drernie changed the title pubDir should reuse FUSION s3 objects rather than re-copying them pubDir should reuse s3 objects rather than re-copying them Aug 24, 2023