
Tracking Zarr 3.0.0 Performance and Integration #352

Open · 7 tasks
CSSFrancis opened this issue Jan 10, 2025 · 5 comments

@CSSFrancis
Member

Describe the functionality you would like to see.

I'm going to try to implement support for zarr 3.0.0 over the next couple of days. I'll test performance here and see if I can put together a short guide to optimizing it. Specifically, I want to look at optimal sharding for 4D datasets, both to support efficient data slicing and to improve storage on Windows computers. I'm not sure how zarr + dask + sharding will ultimately perform; I assume dask is not quite smart enough to handle sharding effectively. However, if the performance is good enough, it might be worth returning the zarr array rather than automatically converting it to a dask array, and only doing the conversion when necessary.

As far as implementation goes:

  • Implement Local Store
  • Implement ZipStore
  • Compare speed between v2.0.0 and v3.0.0
    • Mac OS
    • Windows
  • Test Sharding Implementation
  • Test GPU Direct Storage performance
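Before benchmarking the sharding item above, the shard/chunk geometry can be sketched with plain arithmetic; the 4D shape, chunk shape, and shard shape below are hypothetical examples, not settings from this issue:

```python
import math

def chunks_per_shard(shape, chunk_shape, shard_shape):
    """Count the chunks packed inside one shard and the number of
    shard files a dataset needs. Zarr 3 requires the shard shape to
    be a multiple of the chunk shape along every axis."""
    for c, s in zip(chunk_shape, shard_shape):
        if s % c != 0:
            raise ValueError("shard shape must be a multiple of chunk shape")
    inner = math.prod(s // c for c, s in zip(chunk_shape, shard_shape))
    files = math.prod(math.ceil(d / s) for d, s in zip(shape, shard_shape))
    return inner, files

# A hypothetical 4D dataset: 256x256 scan, 128x128 detector.
inner, files = chunks_per_shard(
    shape=(256, 256, 128, 128),
    chunk_shape=(32, 32, 128, 128),    # one chunk spans the signal dims
    shard_shape=(128, 128, 128, 128),  # 16 chunks packed per shard
)
print(inner, files)  # 16 chunks per shard, 4 shard files on disk
```

Packing many chunks per shard is what should help on Windows, where writing thousands of small chunk files is slow.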
@CSSFrancis
Member Author

Just a comment that they are not planning to implement a generic object store in zarr 3.0.0 for security reasons. I'm curious whether something like https://zarr.readthedocs.io/en/stable/api/zarr/codecs/index.html#zarr.codecs.VLenBytesCodec or https://zarr.readthedocs.io/en/stable/api/zarr/codecs/index.html#zarr.codecs.VLenUTF8Codec would satisfy most of our use cases.
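Conceptually, a VLenBytes-style codec stores each array element as an independent byte string, which is already enough to round-trip ragged numeric data. A minimal stdlib sketch of that idea (the helper names here are made up for illustration, not the codec's API):

```python
from array import array

def ragged_to_bytes(rows, typecode="d"):
    # Each ragged row becomes an independent byte string -- the
    # shape of data a VLenBytes-style codec stores per element.
    return [array(typecode, row).tobytes() for row in rows]

def bytes_to_ragged(blobs, typecode="d"):
    out = []
    for blob in blobs:
        a = array(typecode, [])
        a.frombytes(blob)
        out.append(list(a))
    return out

rows = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]
assert bytes_to_ragged(ragged_to_bytes(rows)) == rows
```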

@ericpre
Member

ericpre commented Jan 10, 2025

Indeed, it is worth making the distinction between ragged arrays (arrays with a variable-length axis of "standard" dtype) and object dtype.
Assuming that it is possible to get zarr 3 to work with ragged arrays of standard dtype, other potential issues may be:

  • Models and components: almost all of them should be fine, because they are converted to Python dictionaries which are then saved to hdf5/zarr arrays or attributes. I am not sure about custom components (not implemented in hyperspy or extensions) which are not expression-based, because they may be pickled or something similar...
  • Arrays of string dtype?
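For the first bullet, the dict-based path can be sanity-checked by testing whether a component's dictionary survives a JSON round trip, which is roughly the constraint for storing it as zarr/hdf5 attributes. A sketch with a hypothetical expression-based component:

```python
import json

# A hypothetical expression-based component reduced to a plain dict:
# everything is JSON-safe, so it can live in zarr/hdf5 attributes.
component = {
    "name": "Gaussian",
    "expression": "A * exp(-(x - centre)**2 / (2 * sigma**2))",
    "parameters": {"A": 1.0, "centre": 0.0, "sigma": 1.5},
}

def attribute_safe(obj):
    """Return True if obj survives a JSON round trip unchanged."""
    try:
        return json.loads(json.dumps(obj)) == obj
    except (TypeError, ValueError):
        return False

print(attribute_safe(component))            # True
print(attribute_safe({"fn": lambda x: x}))  # False: would need pickling
```

Anything that fails this check is exactly the custom-component case that would fall back to pickling or similar.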

@CSSFrancis
Member Author

> Indeed, it is worth making the distinction between ragged arrays (arrays with a variable-length axis of "standard" dtype) and object dtype.

This seems relatively easy; there is already the VLenBytesCodec, which seems like it could be easily extended, although, like the UTF encoding, it isn't directly supported.

> Assuming that it is possible to get zarr 3 to work with ragged arrays of standard dtype, other potential issues may be:

> • Models and components: almost all of them should be fine, because they are converted to Python dictionaries which are then saved to hdf5/zarr arrays or attributes. I am not sure about custom components (not implemented in hyperspy or extensions) which are not expression-based, because they may be pickled or something similar...

I think this case was always going to be a problem. Maybe it isn't something we should be supporting, as it sounds like a fairly big security risk: someone could make a component that pickles some object and then runs malicious code.
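The risk is easy to demonstrate: unpickling runs whatever callable the payload's `__reduce__` names, so merely loading a file is enough to execute attacker-controlled code. A deliberately benign sketch:

```python
import pickle

class Malicious:
    def __reduce__(self):
        # On unpickle, this callable runs with these args. An attacker
        # would put os.system or similar here; eval of "2 + 2" stands
        # in as a harmless demonstration.
        return (eval, ("2 + 2",))

payload = pickle.dumps(Malicious())
result = pickle.loads(payload)  # executes eval("2 + 2")
print(result)  # 4 -- arbitrary code ran just by loading the bytes
```

This is why the pickle docs warn never to unpickle data from an untrusted source, and why zarr 3 dropped the generic object codec.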

> • Array of string dtype?

Arrays of string dtype are supported via VLenUTF8Codec, although that is "technically" not part of the v3 standard: https://zarr.readthedocs.io/en/stable/_modules/zarr/codecs/vlen_utf8.html#VLenUTF8Codec
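A vlen-utf8 style encoding is essentially length-prefixed UTF-8 byte strings packed into one buffer. A stdlib sketch of the idea (not the codec's actual wire format):

```python
import struct

def encode_vlen_utf8(strings):
    # Element count, then a length prefix followed by the UTF-8
    # bytes of each element, packed into a single buffer.
    buf = bytearray(struct.pack("<I", len(strings)))
    for s in strings:
        raw = s.encode("utf-8")
        buf += struct.pack("<I", len(raw)) + raw
    return bytes(buf)

def decode_vlen_utf8(buf):
    (n,) = struct.unpack_from("<I", buf, 0)
    pos, out = 4, []
    for _ in range(n):
        (length,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        out.append(buf[pos:pos + length].decode("utf-8"))
        pos += length
    return out

labels = ["HAADF", "EELS map", "µ-probe"]
assert decode_vlen_utf8(encode_vlen_utf8(labels)) == labels
```

Since the format is this simple, files written with the codec should stay recoverable even if another reader has to reimplement it.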

@CSSFrancis
Member Author

That does scare me a little bit, though... I'd rather not write a bunch of files which end up being unreadable by other readers... Although maybe as long as we can read every file version, there isn't a huge risk of losing data.

@CSSFrancis
Member Author

CSSFrancis commented Feb 6, 2025

Oddly enough, blosc + hdf5 actually seems fairly promising and is closer to what I expected from zarr 3: https://www.blosc.org/posts/pytables-b2nd-slicing/. Especially the two-level blocking structure.

@magnunor This could solve the endless debate of equal-sized chunks vs chunks which span the signal dimensions :) I've been testing this a little and hope to make a bit of a write-up which I can post.
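That tradeoff can be put in numbers: any chunk a slice touches must be decompressed in full, so reading one detector frame `data[i, j, :, :]` costs very differently under the two chunking schemes (the 4D shape, chunk shapes, and float32 itemsize below are hypothetical examples):

```python
import math

SHAPE = (256, 256, 128, 128)  # hypothetical 4D dataset (nav, nav, sig, sig)
ITEMSIZE = 4                  # float32

def bytes_to_read_one_frame(chunks):
    """Bytes decompressed to serve data[i, j, :, :], assuming the
    slice is aligned to chunk boundaries: every touched chunk is
    read in full."""
    extents = (1, 1, SHAPE[2], SHAPE[3])  # extent of the slice per axis
    touched = math.prod(math.ceil(e / c) for e, c in zip(extents, chunks))
    return touched * math.prod(chunks) * ITEMSIZE

signal_spanning = bytes_to_read_one_frame((32, 32, 128, 128))
equal_cubes = bytes_to_read_one_frame((64, 64, 64, 64))
print(signal_spanning // 2**20, "MiB vs", equal_cubes // 2**20, "MiB")
# 64 MiB vs 256 MiB
```

The signal-spanning layout normally pays the mirror-image penalty on navigation-axis slices; a second, finer blocking level inside each chunk, as in the blosc2 post above, is what would soften that side of the tradeoff.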

Maybe this deserves more of a discussion elsewhere...
