Stream RO-Crate Zip #212
Conversation
How it works (`test_zipstream.py`):

```python
from rocrate.rocrate import ROCrate

crate = ROCrate(gen_preview=True)

# add a file
testfile = crate.add_file("file.txt", properties={
    "name": "my test file",
}, record_size=True)

# add a folder
crate.add_dataset(
    source="folder",
    dest_path="internal_folder",
)

# add a remote dataset
crate.add_dataset(
    source="https://raw.githubusercontent.com/ResearchObject/ro-crate-py/refs/heads/master/",
    dest_path="remote/folder/",
    validate_url=False,
    fetch_remote=True,
    properties={
        "hasPart": [{"@id": "README.md"}]
    }
)

# write the zip stream to a file.
# It could also be used as the response to a URL request
# to stream the content with a very low memory footprint and no disk usage
with open("out.zip", "wb") as out:
    for chunk in crate.stream_zip():
        out.write(chunk)
```

```console
$> python test_zipstream.py
$> unzip -l out.zip
Archive:  out.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
       19  1980-01-01 00:00   file.txt
        5  1980-01-01 00:00   internal_folder/test.txt
    17395  1980-01-01 00:00   remote/folder/README.md
     6027  1980-01-01 00:00   ro-crate-preview.html
     1371  1980-01-01 00:00   ro-crate-metadata.json
---------                     -------
    24817                     5 files
```
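The comment in the script mentions using the stream as a URL response. As a rough illustration of that idea, here is a minimal WSGI sketch; `fake_stream` below is a hypothetical stand-in for `crate.stream_zip()`, not part of the library:

```python
def fake_stream():
    # Stand-in for crate.stream_zip(); a real app would construct the
    # ROCrate and iterate its zip stream here.
    yield b"PK\x03\x04"  # a zip archive starts with a local file header
    yield b"...rest of the archive..."

def app(environ, start_response):
    """WSGI app streaming a crate as a zip download.

    Because the response body is a generator, the server sends each
    chunk as it is produced instead of buffering the whole archive
    in memory or on disk.
    """
    start_response("200 OK", [
        ("Content-Type", "application/zip"),
        ("Content-Disposition", 'attachment; filename="crate.zip"'),
    ])
    return fake_stream()
```

The same pattern maps directly onto streaming responses in frameworks like Flask or Django, which also accept an iterable of byte chunks as the response body.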
Seems like I broke some test cases, but I think we should first discuss whether this approach is worth pursuing before I start to clean things up.
I tried this:

```python
from rocrate.rocrate import ROCrate

in_crate_dir = "test/test-data/ro-crate-galaxy-sortchangecase"
out_crate_zip = "/tmp/ro-crate-galaxy-sortchangecase.zip"

crate = ROCrate(in_crate_dir)
with open(out_crate_zip, "wb") as f:
    for chunk in crate.stream_zip():
        f.write(chunk)
```

And the output crate is missing
Yes, there are some implementation issues because I tried to replace the existing write code. It's probably easier, as you suggested, to take a step back and implement it without changing the write code. That will lead to a lot of code duplication, though.
Let's avoid code duplication then. Moving forward, please follow these steps:
I spent some time today figuring out why so many tests were failing. There were quite a few edge cases I had missed, which required some refactoring to address. (I force-pushed since I basically started from scratch.) Updates:

Things to do / to discuss:
I ran the above snippet again. There is something wrong with the zip that gets created:

```console
$ unzip -d ro-crate-galaxy-sortchangecase{,.zip}
Archive:  ro-crate-galaxy-sortchangecase.zip
warning [ro-crate-galaxy-sortchangecase.zip]:  245282 extra bytes at beginning or within zipfile
  (attempting to process anyway)
  inflating: ro-crate-galaxy-sortchangecase/sort-and-change-case.ga
  inflating: ro-crate-galaxy-sortchangecase/LICENSE
  inflating: ro-crate-galaxy-sortchangecase/README.md
  inflating: ro-crate-galaxy-sortchangecase/test/test1/sort-and-change-case-test.yml
  inflating: ro-crate-galaxy-sortchangecase/ro-crate-metadata.json
```
My bad on not properly testing between two commits ... It seems like `ZipFile` treats a `BytesIO` differently than a subclass of `RawIOBase` or something like that. When using `BytesIO`, the initial bytes of the buffer are repeatedly written to the stream, even when I tried to
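For context, the distinction plausibly comes down to seekability: a `BytesIO` reports itself as seekable, so `ZipFile` may seek back into data that has already been drained, whereas a `RawIOBase` subclass is unseekable by default and forces `ZipFile` onto its streaming code path (data descriptors after each member). A minimal sketch of the unseekable-buffer approach, with a hypothetical class name rather than the PR's actual code:

```python
import io
import zipfile

class StreamBuffer(io.RawIOBase):
    """Unseekable write target for ZipFile (illustrative sketch).

    RawIOBase.seekable() is False by default, so ZipFile takes its
    streaming code path and never seeks back into bytes we have
    already drained and yielded.
    """

    def __init__(self):
        self._chunks = []

    def writable(self):
        return True

    def write(self, b):
        self._chunks.append(bytes(b))
        return len(b)

    def drain(self):
        """Return and clear everything written since the last drain."""
        data = b"".join(self._chunks)
        self._chunks = []
        return data

# Usage: drain after each member, and once more after close to pick up
# the central directory.
buf = StreamBuffer()
parts = []
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("file.txt", b"hello world")
    parts.append(buf.drain())
parts.append(buf.drain())

archive = b"".join(parts)
# the reassembled stream is a valid zip archive
assert zipfile.ZipFile(io.BytesIO(archive)).read("file.txt") == b"hello world"
```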
The chunks yielded by `stream_zip()` look odd:

```pycon
>>> from rocrate.rocrate import ROCrate
>>> in_crate_dir = "test/test-data/ro-crate-galaxy-sortchangecase"
>>> crate = ROCrate(in_crate_dir)
>>> chunks = [_ for _ in crate.stream_zip()]
>>> len(chunks)
310
>>> from collections import Counter
>>> Counter(len(_) for _ in chunks).most_common()
[(0, 304), (53, 1), (916, 1), (3607, 1), (348, 1), (168, 1), (1108, 1)]
```

I was expecting a more regular stream, and not to see empty chunks.
Good point. The non-uniform chunk sizes are due not only to the different input stream sizes, but also to the fact that the size of the chunks going into the zip differs from the size of the chunks coming out (compression, zip file format overhead, ...). So even if the input streams yielded nicely sized chunks, the output sizes would still differ. It shouldn't be too complicated to wait for the buffer to fill up to a given chunk size before yielding, though. It might actually be sensible to make the chunk size a parameter that gets passed through to the underlying read operations.
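The buffering idea could look something like the following sketch (a hypothetical `rechunk` helper, not part of the PR): it accumulates incoming bytes, drops empty chunks, and only yields once a full chunk has been collected, with a shorter final chunk at the end.

```python
def rechunk(chunks, chunk_size=64 * 1024):
    """Re-buffer an irregular stream of byte chunks into fixed-size chunks.

    Empty input chunks disappear naturally; only the last yielded chunk
    may be shorter than chunk_size.
    """
    buf = bytearray()
    for chunk in chunks:
        buf.extend(chunk)
        while len(buf) >= chunk_size:
            yield bytes(buf[:chunk_size])
            del buf[:chunk_size]
    if buf:
        yield bytes(buf)

# irregular input with an empty chunk, re-buffered to 4-byte chunks
parts = [b"", b"abc", b"defgh", b"ijklmnopqr"]
result = list(rechunk(parts, chunk_size=4))
# result == [b"abcd", b"efgh", b"ijkl", b"mnop", b"qr"]
```

Wrapping the output of `stream_zip()` in such a helper would give consumers uniform chunk sizes without touching the zip-writing logic itself.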
I've made some adjustments after testing with the crate at https://zenodo.org/records/8095888, see 2178b42. @dnlbauer, before merging I need you to update the author and copyright info. Author info is in
@simleo thanks! The fix for large files was important.
This is a draft implementation showing how #205 could be solved.

The implementation is rather straightforward for file entities, where I could simply add a method to stream the file contents.

For dataset entities, it was more challenging to find a good solution. Creating a zip stream requires that all the file handling happen in one place, where the zip stream object is created (`rocrate.stream_zip` in this case). However, dataset entities currently handle the file writing internally and don't expose the files they contain in an easy-to-use way. I solved this with a generator that not only yields the data of the contained files, but also yields file paths. The solution still looks a bit "hacky" to me, though, and I didn't want to do more refactoring.
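The mixed generator described above could be sketched roughly like this (hypothetical names; the actual PR code may differ). Each file is announced with a `(relative_path, b"")` pair, followed by `(None, chunk)` pairs carrying its contents, so a single zip-writing loop knows when to open a new archive member:

```python
import os
from typing import Iterator, Optional, Tuple

def iter_dataset_files(root: str, chunk_size: int = 8192) -> Iterator[Tuple[Optional[str], bytes]]:
    """Yield (dest_path, data) pairs for every file under root.

    A (relative_path, b"") pair announces a new file; the (None, chunk)
    pairs that follow carry its contents. The consumer building the zip
    stream opens a new member each time a path is announced.
    """
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            yield rel, b""  # announce a new archive member
            with open(full, "rb") as f:
                while chunk := f.read(chunk_size):
                    yield None, chunk
```

An alternative design would be to yield `(path, open_file_like)` pairs and let the consumer read each stream itself, which avoids the in-band "announcement" convention at the cost of handing out open file objects.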