Add an option for large, public files #1
base: master
Conversation
I don't think Quilt addresses the problem of what to do with a multi-gigabyte file; I would move it to the medium files section. It does solve the problem of a very large dataset that consists of medium-sized files (for which it is reasonable to fetch the entire file), and it makes it convenient to work with these datasets as it will fetch files on demand (I think). I would reserve the "large files" section for approaches that only fetch roughly as much data as the code actually processes. Example: process 10 MB from each file in a dataset of 10 files, each 100 GB in size. In this case we should only transfer around 100 MB into the binder for an approach to qualify for the "large files" section. (If Quilt does actually do this, please correct me.) I'd also shorten the sentence a bit to "Quilt lets you fetch individual files from a dataset, which makes it convenient to work with them. Check out an example."
To clarify how Quilt works: multi-GB files are common. Users can select and stream an arbitrary subset of files from a large package. So, for example, 50,000 x 1 TB files would actually work in Quilt (it's S3 on the backend) as long as the user has the upload bandwidth. Is that in the spirit of "large files"? I am open to any rewrites, so if you want to edit in place, that is cool. I do not like ads either :-) Quilt should feel like Docker or GitHub, which are widely mentioned in this project without being annoying.
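For concreteness, a minimal sketch of fetching one file out of a larger package, assuming the quilt3 Python client and hypothetical package/registry names. Only that one entry is downloaded; whether a read can stop partway through a single entry is exactly the question raised below:

```python
import quilt3

# Hypothetical package and registry names, for illustration only.
# browse() pulls down the package manifest, not the data itself.
pkg = quilt3.Package.browse("examples/bigdata", registry="s3://example-bucket")

# fetch() downloads just this one entry; the rest of the package stays in S3.
pkg["subset/file-00001.csv"].fetch("file-00001.csv")
```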
It seems by "large files" you mean "slicing", which is worth clarifying in the docs, as slicing is orthogonal to file size.
I disagree that random access to a remote file is orthogonal to file size. For large files you do not want to have to copy the whole file if you just want to read a few MB from somewhere in the middle of it. You want to only fetch that data.
How much data would be transferred if I access the 5th byte of one of the 50,000 files? If the answer is O(1 byte) then I would say Quilt qualifies. If the answer is "way more than a few bytes" or "1 TB" then I would say it doesn't. A way to access data that would qualify is something like GeoTIFF or xrootd.
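For reference, a minimal sketch of an access pattern that would qualify, using a plain S3 byte-range request (bucket and key names are hypothetical); the transfer is proportional to the range requested, not to the object size:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket/key. Range is an HTTP byte range, so reading the
# 5th byte of a 1 TB object transfers O(1 byte) plus request overhead.
resp = s3.get_object(
    Bucket="example-bucket",
    Key="bigdata/file-00001.bin",
    Range="bytes=4-4",
)
fifth_byte = resp["Body"].read()
```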
See if something like the following table makes sense for Binder users. If so, I will amplify it and rework the PR. One question I still have, and would like to answer in the PR: when do binder containers start to tip over in terms of memory and disk?

[Table: "Hosting data for use in Binder"; rows include "Slicing with S3".]
There are currently no automatically enforced limits on the size of a repository's image on mybinder.org. There is also no automatically enforced limit on the total volume of data you can transfer in. We rely on people being nice and not needing that much data. The bigger the image is (by baking files into the docker image via e.g. postBuild), the slower your repo will launch, as it will take longer to transfer the image to the node on which your container is being spawned.

I like the table! I prefer "random access" to "slicing"; maybe we can use both in the column header? One other column that would be useful is "POSIX-like": telling users if they can "just see" the files in their container with …
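As a sketch of what "POSIX-like" random access could look like from inside a Binder container, assuming the s3fs Python package and a hypothetical public bucket (a separate FUSE mount would be needed to make the same files visible to shell tools):

```python
import s3fs

# Hypothetical public bucket; anon=True avoids needing credentials in Binder.
fs = s3fs.S3FileSystem(anon=True)

# s3fs returns file-like objects, so seek()/read() become ranged S3 GETs:
# only the ~10 MB requested here is transferred, not the whole object.
with fs.open("example-bucket/bigdata/file-00001.bin", "rb") as f:
    f.seek(500 * 1024 * 1024)         # jump 500 MB into the file
    chunk = f.read(10 * 1024 * 1024)  # read a 10 MB slice
```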