
Add an option for large, public files #1

Open
akarve wants to merge 2 commits into master

Conversation

@akarve commented Sep 6, 2018

No description provided.

@betatim (Member) commented Sep 6, 2018

I don't think Quilt addresses the problem of what to do with a multi-gigabyte file, so I would move it to the "medium files" section. It does solve the problem of a very large dataset that consists of medium-sized files (for which it is reasonable to fetch each file in full), and it makes it convenient to work with these datasets because it fetches files on demand (I think).

I would reserve the "large files" section for approaches that only fetch roughly as much data as the code actually processes. Example: process 10 MB from each file in a dataset of 10 files, each 100 GB in size. In this case only around 100 MB should be transferred into the binder for an approach to qualify for the "large files" section. (If Quilt does actually do this, please correct me.)

I'd also shorten the sentence a bit to "Quilt lets you fetch individual files from a dataset, which makes it convenient to work with them. Check out an example [link to the Quilt example]." Right now it reads too much like an advert for my taste. Maybe someone else wants to comment on that aspect to break our tie.

@akarve (Author) commented Sep 6, 2018

To clarify how Quilt works: multi-GB files are common. Users can select and stream an arbitrary subset of files from a large package. So, for example, 50,000 x 1TB files would actually work in Quilt (it's S3 on the backend) as long as the user has the upload bandwidth. Is that in the spirit of "large files"?

I am open to any rewrites. So if you want to edit in place, that is cool. I do not like ads either :-) Quilt should feel like Docker or GitHub, which are widely mentioned in this project without being annoying.

@akarve (Author) commented Sep 7, 2018

It seems by "large files" you mean "slicing", which is worth clarifying in the docs, as slicing is orthogonal to file size.

@betatim (Member) commented Sep 12, 2018

I disagree that random access to a remote file is orthogonal to file size.

For large files you do not want to have to copy the whole file if you just want to read a few MB from somewhere in the middle of it. You want to only fetch that data.

> Users can select and stream an arbitrary subset of files from a large package. So, for example, 50,000 x 1TB files would actually work in Quilt (it's S3 on the backend) as long as the user has the upload bandwidth. Is that in the spirit of "large files"?

How much data would be transferred if I access the 5th byte of one of the 50,000 files? If the answer is O(1 byte), then I would say Quilt qualifies. If the answer is "way more than a few bytes" or "1 TB", then I would say it doesn't. A way of accessing data that would qualify is something like GeoTIFF or xrootd.
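
For illustration, here is a rough sketch (not specific to Quilt; the URL is a placeholder) of the kind of access that would qualify: an HTTP Range request that transfers only the bytes actually asked for, assuming the server honors Range headers.

```python
# Minimal sketch: read a few bytes from a remote file without downloading it,
# assuming the server supports HTTP Range requests. The URL is made up.
import requests

URL = "https://example.com/data/huge-file.bin"  # placeholder for a very large object

# Request only bytes 4-4 (the 5th byte); roughly one byte crosses the network.
resp = requests.get(URL, headers={"Range": "bytes=4-4"})
resp.raise_for_status()

print(resp.status_code)  # 206 Partial Content when ranges are honored
print(resp.content)      # the single requested byte
```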

@akarve (Author) commented Sep 12, 2018

See if something like the following table makes sense for Binder users. If so, I will expand it and rework the PR.

One question I still have, and would like to answer in the PR: at what point do Binder containers start to tip over in terms of memory and disk?

Hosting data for use in Binder

| Service    | Max file size | Max repo size | Slicing * | Cost   | Notes                |
| ---------- | ------------- | ------------- | --------- | ------ | -------------------- |
| GitHub     | 100 MB        | 1 GB          |           |        | Large repos get slow |
| GitHub LFS | 2 GB          |               |           | Paid † | Transfer can be slow |
| Quilt      | 5 TB          | 50,000 files  |           |        | Requires Python      |
| S3         | 5 TB          | See below     |           | Paid † | Boto3 Python client  |

\* Slicing selects a physical or logical chunk of a file. Slicing reduces disk and memory pressure when working with large files.
† AWS (S3) and GitHub LFS allow some free data transfer, but are paid beyond certain volumes.

Slicing with S3
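
As an example, a minimal sketch of that kind of slice using boto3 (bucket and key names below are placeholders; this assumes you have access to the bucket): a ranged GET transfers only the requested bytes, regardless of the object's total size.

```python
# Sketch of S3 "slicing" with boto3: a Range request pulls only the bytes
# asked for, so reading from a multi-TB object costs only the slice you read.
# Bucket and key names are hypothetical.
import boto3

s3 = boto3.client("s3")

resp = s3.get_object(
    Bucket="my-public-bucket",      # hypothetical bucket
    Key="datasets/huge-file.bin",   # hypothetical key
    Range="bytes=0-1048575",        # first 1 MB of the object
)
chunk = resp["Body"].read()
print(len(chunk))  # ~1 MB transferred, independent of total object size
```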

@betatim (Member) commented Sep 17, 2018

There are currently no automatically enforced limits on the size of a repository's image on mybinder.org. There is also no automatically enforced limit on the total volume of data you can transfer in. We rely on people being nice and not needing that much data.

The bigger the image (for example, from baking files into the Docker image via postBuild), the slower your repo will launch, because it takes longer to transfer the image to the node on which your container is spawned.

I like the table! I prefer "random access" to "slicing"; maybe we can use both in the column header? One other column that would be useful is "POSIX-like": telling users whether they can "just see" the files in their container with ls, cp, etc., or whether they have to access the data through a special library (boto plus something that knows the ranges, GeoTIFF, etc.).
