
Tracking Zarr 3.0.0 Performance and Integration #352

Open · 7 tasks
CSSFrancis opened this issue Jan 10, 2025 · 5 comments

@CSSFrancis
Member

Describe the functionality you would like to see.

I'm going to try to implement support for zarr 3.0.0 over the next couple of days. I'll test performance here and see if I can put together a short guide to optimizing it. Specifically, I want to look at optimal sharding for 4D datasets, both to support efficient data slicing and to improve storage on Windows computers. I'm not sure how zarr + dask + sharding will ultimately perform; I assume dask is not quite smart enough to handle sharding effectively. However, if the performance is good enough, it might be worth returning the zarr array rather than automatically converting it to a dask array, and only doing the conversion when necessary.

As far as implementation goes:

  • Implement Local Store
  • Implement ZipStore
  • Compare speed between v2.0.0 and v3.0.0
    • Mac OS
    • Windows
  • Test Sharding Implementation
  • Test GPU Direct Storage performance
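Before benchmarking the sharding item above, the shard/chunk geometry can be sketched with plain arithmetic; the 4D shape, chunk shape, and shard shape below are hypothetical examples, not settings from this issue:

```python
import math

def chunks_per_shard(shape, chunk_shape, shard_shape):
    """Count the chunks packed inside one shard and the number of
    shard files a dataset needs. Zarr 3 requires the shard shape to
    be a multiple of the chunk shape along every axis."""
    for c, s in zip(chunk_shape, shard_shape):
        if s % c != 0:
            raise ValueError("shard shape must be a multiple of chunk shape")
    inner = math.prod(s // c for c, s in zip(chunk_shape, shard_shape))
    files = math.prod(math.ceil(d / s) for d, s in zip(shape, shard_shape))
    return inner, files

# A hypothetical 4D dataset: 256x256 scan, 128x128 detector.
inner, files = chunks_per_shard(
    shape=(256, 256, 128, 128),
    chunk_shape=(32, 32, 128, 128),    # one chunk spans the signal dims
    shard_shape=(128, 128, 128, 128),  # 16 chunks packed per shard
)
print(inner, files)  # 16 chunks per shard, 4 shard files on disk
```

Packing many chunks per shard is what should help on Windows, where writing thousands of small chunk files is slow.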
@CSSFrancis
Member Author

Just a comment that they are not planning to implement a generic object store in zarr 3.0.0 for security reasons. I'm curious whether something like https://zarr.readthedocs.io/en/stable/api/zarr/codecs/index.html#zarr.codecs.VLenBytesCodec or https://zarr.readthedocs.io/en/stable/api/zarr/codecs/index.html#zarr.codecs.VLenUTF8Codec would satisfy most of our use cases.
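Conceptually, a VLenBytes-style codec stores each array element as an independent byte string, which is already enough to round-trip ragged numeric data. A minimal stdlib sketch of that idea (the helper names here are made up for illustration, not the codec's API):

```python
from array import array

def ragged_to_bytes(rows, typecode="d"):
    # Each ragged row becomes an independent byte string -- the
    # shape of data a VLenBytes-style codec stores per element.
    return [array(typecode, row).tobytes() for row in rows]

def bytes_to_ragged(blobs, typecode="d"):
    out = []
    for blob in blobs:
        a = array(typecode, [])
        a.frombytes(blob)
        out.append(list(a))
    return out

rows = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]
assert bytes_to_ragged(ragged_to_bytes(rows)) == rows
```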

@ericpre
Member

ericpre commented Jan 10, 2025

Indeed, it is worth making the distinction between ragged arrays (arrays with a variable-length axis of "standard" dtype) and object dtype.
Assuming that it is possible to get zarr 3 to work with ragged arrays of standard dtype, other potential issues may be:

  • Models and components: almost all of them should be fine, because they are converted to Python dictionaries which are then saved to hdf5/zarr arrays or attributes. I am not sure about custom components (not implemented in hyperspy or extensions) which are not expression-based, because they may be pickled or something similar...
  • Arrays of string dtype?
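For the first bullet, the dict-based path can be sanity-checked by testing whether a component's dictionary survives a JSON round trip, which is roughly the constraint for storing it as zarr/hdf5 attributes. A sketch with a hypothetical expression-based component:

```python
import json

# A hypothetical expression-based component reduced to a plain dict:
# everything is JSON-safe, so it can live in zarr/hdf5 attributes.
component = {
    "name": "Gaussian",
    "expression": "A * exp(-(x - centre)**2 / (2 * sigma**2))",
    "parameters": {"A": 1.0, "centre": 0.0, "sigma": 1.5},
}

def attribute_safe(obj):
    """Return True if obj survives a JSON round trip unchanged."""
    try:
        return json.loads(json.dumps(obj)) == obj
    except (TypeError, ValueError):
        return False

print(attribute_safe(component))            # True
print(attribute_safe({"fn": lambda x: x}))  # False: would need pickling
```

Anything that fails this check is exactly the custom-component case that would fall back to pickling or similar.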

@CSSFrancis
Member Author

> Indeed, it is worth making the distinction between ragged arrays (arrays with a variable-length axis of "standard" dtype) and object dtype.

This seems relatively easy; there is already the VLenBytesCodec, which seems like it could be easily extended, although, like the UTF encoding, it isn't directly supported.

> Assuming that it is possible to get zarr 3 to work with ragged arrays of standard dtype, other potential issues may be:

> • Models and components: almost all of them should be fine, because they are converted to Python dictionaries which are then saved to hdf5/zarr arrays or attributes. I am not sure about custom components (not implemented in hyperspy or extensions) which are not expression-based, because they may be pickled or something similar...

I think this case was always going to be a problem. Maybe it isn't something we should be supporting, as it sounds like a fairly big security risk: someone could make a component that pickles some object and then runs malicious code.
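The risk is easy to demonstrate: unpickling runs whatever callable the payload's `__reduce__` names, so merely loading a file is enough to execute attacker-controlled code. A deliberately benign sketch:

```python
import pickle

class Malicious:
    def __reduce__(self):
        # On unpickle, this callable runs with these args. An attacker
        # would put os.system or similar here; eval of "2 + 2" stands
        # in as a harmless demonstration.
        return (eval, ("2 + 2",))

payload = pickle.dumps(Malicious())
result = pickle.loads(payload)  # executes eval("2 + 2")
print(result)  # 4 -- arbitrary code ran just by loading the bytes
```

This is why the pickle docs warn never to unpickle data from an untrusted source, and why zarr 3 dropped the generic object codec.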

> • Array of string dtype?

Arrays of string dtype are supported via VLenUTF8Codec, although that is "technically" not part of the v3 standard: https://zarr.readthedocs.io/en/stable/_modules/zarr/codecs/vlen_utf8.html#VLenUTF8Codec
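A vlen-utf8 style encoding is essentially length-prefixed UTF-8 byte strings packed into one buffer. A stdlib sketch of the idea (not the codec's actual wire format):

```python
import struct

def encode_vlen_utf8(strings):
    # Element count, then a length prefix followed by the UTF-8
    # bytes of each element, packed into a single buffer.
    buf = bytearray(struct.pack("<I", len(strings)))
    for s in strings:
        raw = s.encode("utf-8")
        buf += struct.pack("<I", len(raw)) + raw
    return bytes(buf)

def decode_vlen_utf8(buf):
    (n,) = struct.unpack_from("<I", buf, 0)
    pos, out = 4, []
    for _ in range(n):
        (length,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        out.append(buf[pos:pos + length].decode("utf-8"))
        pos += length
    return out

labels = ["HAADF", "EELS map", "µ-probe"]
assert decode_vlen_utf8(encode_vlen_utf8(labels)) == labels
```

Since the format is this simple, files written with the codec should stay recoverable even if another reader has to reimplement it.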

@CSSFrancis
Member Author

That does scare me a little bit, though... I'd rather not write a bunch of files which end up being unreadable by other readers... Although maybe as long as we can read every file version, there isn't a huge risk of losing data.

@CSSFrancis
Member Author

CSSFrancis commented Feb 6, 2025

Oddly enough, blosc + hdf5 actually seems fairly promising and is closer to what I expected from zarr 3: https://www.blosc.org/posts/pytables-b2nd-slicing/. Especially the two-level blocking structure.

@magnunor This could solve the endless debate of equal-sized chunks vs chunks which span the signal dimensions :) I've been testing this a little and hope to make a bit of a write-up which I can post.
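That tradeoff can be put in numbers: any chunk a slice touches must be decompressed in full, so reading one detector frame `data[i, j, :, :]` costs very differently under the two chunking schemes (the 4D shape, chunk shapes, and float32 itemsize below are hypothetical examples):

```python
import math

SHAPE = (256, 256, 128, 128)  # hypothetical 4D dataset (nav, nav, sig, sig)
ITEMSIZE = 4                  # float32

def bytes_to_read_one_frame(chunks):
    """Bytes decompressed to serve data[i, j, :, :], assuming the
    slice is aligned to chunk boundaries: every touched chunk is
    read in full."""
    extents = (1, 1, SHAPE[2], SHAPE[3])  # extent of the slice per axis
    touched = math.prod(math.ceil(e / c) for e, c in zip(extents, chunks))
    return touched * math.prod(chunks) * ITEMSIZE

signal_spanning = bytes_to_read_one_frame((32, 32, 128, 128))
equal_cubes = bytes_to_read_one_frame((64, 64, 64, 64))
print(signal_spanning // 2**20, "MiB vs", equal_cubes // 2**20, "MiB")
# 64 MiB vs 256 MiB
```

The signal-spanning layout normally pays the mirror-image penalty on navigation-axis slices; a second, finer blocking level inside each chunk, as in the blosc2 post above, is what would soften that side of the tradeoff.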

Maybe this deserves more of a discussion elsewhere...
