
Add an option for large, public files #1

Open
akarve wants to merge 2 commits into master

Conversation

@akarve commented Sep 6, 2018

No description provided.

@betatim (Member) commented Sep 6, 2018

I don't think Quilt addresses the problem of what to do with a multi-gigabyte file, so I would move it to the "medium files" section. It does solve the problem of a very large dataset that consists of medium-sized files (for which it is reasonable to fetch each file in full), and it makes it convenient to work with these datasets because it fetches files on demand (I think).

I would reserve the "large files" section for approaches that only fetch roughly as much data as the code actually processes. Example: process 10 MB from each file in a dataset of 10 files, each 100 GB in size. In this case only around 100 MB should be transferred into the binder for an approach to qualify for the "large files" section. (If Quilt does actually do this, please correct me.)

I'd also shorten the sentence a bit to "Quilt lets you fetch individual files from a dataset, which makes it convenient to work with them. Check out an example [link to the Quilt example]." Right now it reads too much like an advert for my taste. Maybe someone else wants to comment on that aspect to break our tie.

@akarve (Author) commented Sep 6, 2018

To clarify how Quilt works: multi-GB files are common. Users can select and stream an arbitrary subset of files from a large package. So, for example, 50,000 x 1TB files would actually work in Quilt (it's S3 on the backend) as long as the user has the upload bandwidth. Is that in the spirit of "large files"?

I am open to any rewrites. So if you want to edit in place, that is cool. I do not like ads either :-) Quilt should feel like Docker or GitHub, which are widely mentioned in this project without being annoying.

@akarve (Author) commented Sep 7, 2018

It seems by "large files" you mean "slicing", which is worth clarifying in the docs, as slicing is orthogonal to file size.

@betatim (Member) commented Sep 12, 2018

I disagree that random access to a remote file is orthogonal to file size.

For large files you do not want to have to copy the whole file if you just want to read a few MB from somewhere in the middle of it. You want to only fetch that data.

> Users can select and stream an arbitrary subset of files from a large package. So, for example, 50,000 x 1TB files would actually work in Quilt (it's S3 on the backend) as long as the user has the upload bandwidth. Is that in the spirit of "large files"?

How much data would be transferred if I access the 5th byte of one of the 50,000 files? If the answer is O(1 byte), then I would say Quilt qualifies. If the answer is "way more than a few bytes" or "1 TB", then I would say it doesn't. A way of accessing data that would qualify is something like GeoTIFF or xrootd.
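
For illustration, here is a rough sketch (not specific to Quilt; the URL is a placeholder) of the kind of access that would qualify: an HTTP Range request that transfers only the bytes actually asked for, assuming the server honors Range headers.

```python
# Minimal sketch: read a few bytes from a remote file without downloading it,
# assuming the server supports HTTP Range requests. The URL is made up.
import requests

URL = "https://example.com/data/huge-file.bin"  # placeholder for a very large object

# Request only bytes 4-4 (the 5th byte); roughly one byte crosses the network.
resp = requests.get(URL, headers={"Range": "bytes=4-4"})
resp.raise_for_status()

print(resp.status_code)  # 206 Partial Content when ranges are honored
print(resp.content)      # the single requested byte
```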

@akarve (Author) commented Sep 12, 2018

See if something like the following table makes sense for Binder users. If so, I will expand it and rework the PR.

One question I still have, and would like to answer in the PR: at what point do Binder containers start to tip over in terms of memory and disk?

Hosting data for use in Binder

| Service    | Max file size | Max repo size | Slicing * | Cost   | Notes                |
| ---------- | ------------- | ------------- | --------- | ------ | -------------------- |
| GitHub     | 100 MB        | 1 GB          |           |        | Large repos get slow |
| GitHub LFS | 2 GB          |               |           | Paid † | Transfer can be slow |
| Quilt      | 5 TB          | 50,000 files  |           |        | Requires Python      |
| S3         | 5 TB          | See below     |           | Paid † | Boto3 Python client  |

\* Slicing selects a physical or logical chunk of a file. Slicing reduces disk and memory pressure when working with large files.
† AWS (S3) and GitHub LFS allow some free data transfer, but are paid beyond certain volumes.

Slicing with S3
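
As an example, a minimal sketch of that kind of slice using boto3 (bucket and key names below are placeholders; this assumes you have access to the bucket): a ranged GET transfers only the requested bytes, regardless of the object's total size.

```python
# Sketch of S3 "slicing" with boto3: a Range request pulls only the bytes
# asked for, so reading from a multi-TB object costs only the slice you read.
# Bucket and key names are hypothetical.
import boto3

s3 = boto3.client("s3")

resp = s3.get_object(
    Bucket="my-public-bucket",      # hypothetical bucket
    Key="datasets/huge-file.bin",   # hypothetical key
    Range="bytes=0-1048575",        # first 1 MB of the object
)
chunk = resp["Body"].read()
print(len(chunk))  # ~1 MB transferred, independent of total object size
```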

@betatim (Member) commented Sep 17, 2018

There are currently no automatically enforced limits on the size of a repository's image on mybinder.org. There is also no automatically enforced limit on the total volume of data you can transfer in. We rely on people being nice and not needing that much data.

The bigger the image (for example, from baking files into the Docker image via postBuild), the slower your repo will launch, because it takes longer to transfer the image to the node on which your container is spawned.

I like the table! I prefer "random access" to "slicing"; maybe we can use both in the column header? One other column that would be useful is "POSIX-like": telling users whether they can "just see" the files in their container with ls, cp, etc., or whether they have to access the data through a special library (boto plus something that knows the ranges, GeoTIFF, etc.).
