
Add Zarr Reader(s) #262

Open · 2 tasks
norlandrhagen opened this issue Oct 18, 2024 · 5 comments
Labels
enhancement · readers · zarr-python

Comments

norlandrhagen (Collaborator) commented Oct 18, 2024

Opening up a tracking issue for Zarr V2 and Zarr V3/Icechunk compatible readers. It might also be useful to open up multiple existing Zarr stores and virtualizarr'ize them into a single Zarr store (rough usage sketch after the task list below). I think @maxrjones raised this possibility a while ago, but I can't seem to find the issue.

From what I understand from @TomNicholas, the open_virtual_dataset_from_v3_store reader is more for chunk manifest / manifest.json style v3 stores.

  • Zarr V2 compatible reader
  • Zarr V3 compatible reader
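
For illustration, a hypothetical sketch of what opening several existing Zarr stores virtually and combining them could look like once such a reader exists. This is speculative: the store paths are placeholders, and passing Zarr stores to open_virtual_dataset is exactly what this issue proposes, not something that works today.

```python
# Hypothetical usage sketch -- the Zarr reader tracked in this issue doesn't exist yet.
# Paths are placeholders; passing Zarr stores to open_virtual_dataset is the proposal here.
import xarray as xr
from virtualizarr import open_virtual_dataset

stores = ["s3://bucket/ds-2020.zarr", "s3://bucket/ds-2021.zarr"]  # placeholder paths
virtual_dsets = [open_virtual_dataset(store) for store in stores]

# Combine the virtual datasets along time, then write a single set of references.
combined = xr.concat(virtual_dsets, dim="time", coords="minimal", compat="override")
combined.virtualize.to_kerchunk("combined.json", format="json")
```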
TomNicholas (Member) commented Oct 19, 2024

the open_virtual_dataset_from_v3_store reader is more for chunk manifest / manifest.json style v3 stores.

That's correct - I think it's arguably made obsolete by the open-sourcing of Icechunk... (cc @LDeakin)

Zarr V3/Icechunk compatible readers

Icechunk uses a different storage layout on disk, so the V3 reader won't work on it. And the Zarr store interface doesn't define a way to return virtual refs. So pulling virtual refs out of Icechunk is a separable issue, and would require Icechunk to add an API to support it (see earth-mover/icechunk#104).

TomNicholas added the zarr-python and enhancement labels on Oct 19, 2024
TomNicholas (Member) commented Oct 22, 2024

Okay, so after chatting with @norlandrhagen, here's how we think the zarr reader for virtualizarr should work (a rough code sketch follows the numbered steps):

  1. Open the store using zarr-python v3 (behind a protected import). This should handle both v2 and v3 stores for us.
  2. Use zarr-python to list the variables in the store, and check that all loadable_variables are present
  3. For each virtual variable:
    a. Use zarr-python to get the attributes, dimension names, and coordinate names (which come from the .zmetadata or zarr.json)
    b. Use zarr-python to also get the dtype and chunk grid info + everything else needed to create the virtualizarr.zarr.ZArray object (eventually we can skip this step and use a zarr-python array metadata class directly instead of virtualizarr.zarr.ZArray, see Replace VirtualiZarr.ZArray with zarr ArrayMetadata #175)
    c. Use the knowledge of the store location, variable name, and the zarr format to deduce which directory / S3 prefix the chunks must live in.
    d. List all the chunks in that directory using fsspec.ls(detail=True), as that should also return the nbytes of each chunk. Remember that chunks are allowed to be missing.
    e. The offset of each chunk is just 0 (ignoring sharding for now), and the length is the file size fsspec returned. The paths are just all the paths fsspec listed.
    f. Parse the path and length information returned by fsspec into the structure that we can pass to ChunkManifest.__init__
    g. Create a ManifestArray from our ChunkManifest and ZArray
    h. Wrap that ManifestArray in an xarray.Variable, using the dims and attrs we read before
  4. Get the loadable_variables by just using xr.open_zarr on the same store (should use drop_variables to avoid handling the virtual variables that we already have).
  5. Use separate_coords to set the correct variables as coordinate variables (and avoid building indexes whilst doing it)
  6. Merge all the variables into one xr.Dataset and return it.
  7. All the above should be wrapped in a virtualizarr.readers.zarr.open_virtual_dataset function, which then should be called as a method from a ZarrVirtualBackend(VirtualBackend) subclass.
  8. Finally add that ZarrVirtualBackend to the list of readers in virtualizarr.backend.py
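
To make the steps above concrete, here's a rough, heavily simplified sketch of what the reader could look like. It is a sketch only: the exact signatures of the virtualizarr internals it touches (ZArray, ChunkManifest, ManifestArray) and some of the zarr-python v3 calls are assumptions, and coordinate/index handling is skipped.

```python
# Rough sketch of the steps above, not a working implementation. Signatures of
# virtualizarr internals (ZArray, ChunkManifest, ManifestArray) and some zarr-python
# v3 details are assumptions; the real reader would live in virtualizarr.readers.zarr.
import xarray as xr


def open_virtual_dataset(store_path: str, loadable_variables=()) -> xr.Dataset:
    import zarr  # step 1: protected import; zarr-python v3 reads both v2 and v3 stores
    import fsspec
    from virtualizarr.manifests import ChunkManifest, ManifestArray
    from virtualizarr.zarr import ZArray

    group = zarr.open_group(store_path, mode="r")
    all_vars = set(group.array_keys())  # step 2: list variables in the store
    assert set(loadable_variables) <= all_vars

    fs = fsspec.filesystem("file")  # or s3/gcs etc., inferred from store_path in practice
    virtual_vars = {}
    for name in all_vars - set(loadable_variables):  # step 3
        arr = group[name]
        attrs = dict(arr.attrs)
        # 3a: dims come from _ARRAY_DIMENSIONS (v2) or dimension_names (v3)
        dims = attrs.pop("_ARRAY_DIMENSIONS", None) or getattr(arr.metadata, "dimension_names", None)
        # 3b: dtype / chunk grid info -> ZArray (assumed signature)
        zarray = ZArray(shape=arr.shape, chunks=arr.chunks, dtype=arr.dtype)
        # 3c: deduce the directory / prefix the chunks live in (v3 nests chunks under "c/")
        chunk_dir = f"{store_path}/{name}"
        entries = {}
        for info in fs.ls(chunk_dir, detail=True):  # 3d: missing chunks are simply absent
            key = info["name"].rsplit("/", 1)[-1]
            if key in (".zarray", ".zattrs", "zarr.json", "c"):
                continue  # skip metadata objects
            # 3e/3f: offset is 0, length is the listed file size
            entries[key] = {"path": info["name"], "offset": 0, "length": info["size"]}
        manifest = ChunkManifest(entries=entries)                     # 3f
        marr = ManifestArray(zarray=zarray, chunkmanifest=manifest)   # 3g
        virtual_vars[name] = xr.Variable(dims=dims, data=marr, attrs=attrs)  # 3h

    # step 4: load the non-virtual variables with plain xarray
    loadable = xr.open_zarr(store_path, drop_variables=list(all_vars - set(loadable_variables)))
    # steps 5-6: coordinate handling (separate_coords) omitted in this sketch
    return xr.Dataset({**virtual_vars, **{k: loadable[k].variable for k in loadable_variables}})
```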

For later:

  • support shards
  • hopefully call a .chunk_lengths method or equivalent on zarr-python instead of needing fsspec.ls
  • optimizations (e.g. using async interface to list lengths of chunks for each variable concurrently)

This reader should be called just zarr, as it can handle both v2 and v3. We're not using kerchunk's ZarrToZarr because this approach should be more performant (it can use async), pushes more of the implementation upstream, and works with zarr-python v3.

The existing zarr_v3 reader (very badly named) should either be folded into this one or just deleted. If it's folded in, the way to do that is to get the chunk manifest dict information from the manifest.json file instead of fsspec.ls. Then we should also alias the zarr_v3 reader to point to the zarr reader, with a deprecation warning (a possible shim is sketched below).
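
A minimal sketch of what that deprecation alias could look like; the module path and the new function name here are assumptions.

```python
# Minimal sketch of a deprecation alias, assuming the new reader is exposed as
# virtualizarr.readers.zarr.open_virtual_dataset; names are placeholders.
import warnings


def open_virtual_dataset_from_v3_store(*args, **kwargs):
    warnings.warn(
        "the 'zarr_v3' reader is deprecated; use the 'zarr' reader instead",
        DeprecationWarning,
        stacklevel=2,
    )
    from virtualizarr.readers.zarr import open_virtual_dataset
    return open_virtual_dataset(*args, **kwargs)
```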

TomNicholas (Member) commented Oct 23, 2024

@norlandrhagen FYI we can use Store.getsize from zarr-developers/zarr-python#2426 instead of fsspec.ls!!
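
For reference, a sketch of how that might be used to get chunk sizes concurrently, assuming getsize is an async method on the store that takes a chunk key (as added in that PR):

```python
# Sketch only: assumes zarr-python's Store exposes an async getsize(key) -> int,
# per zarr-developers/zarr-python#2426.
import asyncio


async def chunk_lengths(store, array_name: str, chunk_keys: list[str]) -> dict[str, int]:
    # Query sizes for all chunks of one array concurrently, instead of a blocking fsspec.ls().
    sizes = await asyncio.gather(
        *(store.getsize(f"{array_name}/{key}") for key in chunk_keys)
    )
    return dict(zip(chunk_keys, sizes))
```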

norlandrhagen (Collaborator, Author) commented

Oh great find!

TomNicholas (Member) commented

To support generating virtual references to sharded v3 data (which we should definitely leave for a later PR), see zarr-developers/zarr-python#1661 (comment).
