
Add Zarr Reader(s) #262

Open · 2 tasks
norlandrhagen opened this issue Oct 18, 2024 · 5 comments
Labels
enhancement · readers · zarr-python

Comments

norlandrhagen (Collaborator) commented Oct 18, 2024

Opening up a tracking issue for Zarr V2 and Zarr V3/Icechunk compatible readers. It might also be useful to open up multiple existing Zarr stores and virtualizarr'ize them into a single Zarr store (rough usage sketch after the task list below). I think @maxrjones raised this possibility a while ago, but I can't seem to find the issue.

From what I understand from @TomNicholas, the open_virtual_dataset_from_v3_store reader is more for chunk manifest / manifest.json style v3 stores.

  • Zarr V2 compatible reader
  • Zarr V3 compatible reader
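
For illustration, a hypothetical sketch of what opening several existing Zarr stores virtually and combining them could look like once such a reader exists. This is speculative: the store paths are placeholders, and passing Zarr stores to open_virtual_dataset is exactly what this issue proposes, not something that works today.

```python
# Hypothetical usage sketch -- the Zarr reader tracked in this issue doesn't exist yet.
# Paths are placeholders; passing Zarr stores to open_virtual_dataset is the proposal here.
import xarray as xr
from virtualizarr import open_virtual_dataset

stores = ["s3://bucket/ds-2020.zarr", "s3://bucket/ds-2021.zarr"]  # placeholder paths
virtual_dsets = [open_virtual_dataset(store) for store in stores]

# Combine the virtual datasets along time, then write a single set of references.
combined = xr.concat(virtual_dsets, dim="time", coords="minimal", compat="override")
combined.virtualize.to_kerchunk("combined.json", format="json")
```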
TomNicholas (Member) commented Oct 19, 2024

the open_virtual_dataset_from_v3_store reader is more for chunk manifest / manifest.json style v3 stores.

That's correct - I think it's arguably made obsolete by the open-sourcing of Icechunk... (cc @LDeakin)

Zarr V3/Icechunk compatible readers

Icechunk uses a different storage layout on disk, so the V3 reader won't work on it. And the Zarr store interface doesn't define a way to return virtual refs. So pulling virtual refs out of Icechunk is a separable issue, and would require Icechunk to add an API to support it (see earth-mover/icechunk#104).

TomNicholas added the zarr-python and enhancement labels on Oct 19, 2024
TomNicholas (Member) commented Oct 22, 2024

Okay, so after chatting with @norlandrhagen, here's how we think the zarr reader for virtualizarr should work (a rough code sketch follows the numbered steps):

  1. Open the store using zarr-python v3 (behind a protected import). This should handle both v2 and v3 stores for us.
  2. Use zarr-python to list the variables in the store, and check that all loadable_variables are present
  3. For each virtual variable:
    a. Use zarr-python to get the attributes, dimension names, and coordinate names (which come from the .zmetadata or zarr.json)
    b. Use zarr-python to also get the dtype and chunk grid info + everything else needed to create the virtualizarr.zarr.ZArray object (eventually we can skip this step and use a zarr-python array metadata class directly instead of virtualizarr.zarr.ZArray, see Replace VirtualiZarr.ZArray with zarr ArrayMetadata #175)
    c. Use the knowledge of the store location, variable name, and the zarr format to deduce which directory / S3 prefix the chunks must live in.
    d. List all the chunks in that directory using fsspec.ls(detail=True), as that should also return the nbytes of each chunk. Remember that chunks are allowed to be missing.
    e. The offset of each chunk is just 0 (ignoring sharding for now), and the length is the file size fsspec returned. The paths are just all the paths fsspec listed.
    f. Parse the path and length information returned by fsspec into the structure that we can pass to ChunkManifest.__init__
    g. Create a ManifestArray from our ChunkManifest and ZArray
    h. Wrap that ManifestArray in an xarray.Variable, using the dims and attrs we read before
  4. Get the loadable_variables by just using xr.open_zarr on the same store (should use drop_variables to avoid handling the virtual variables that we already have).
  5. Use separate_coords to set the correct variables as coordinate variables (and avoid building indexes whilst doing it)
  6. Merge all the variables into one xr.Dataset and return it.
  7. All the above should be wrapped in a virtualizarr.readers.zarr.open_virtual_dataset function, which then should be called as a method from a ZarrVirtualBackend(VirtualBackend) subclass.
  8. Finally add that ZarrVirtualBackend to the list of readers in virtualizarr.backend.py
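
To make the steps above concrete, here's a rough, heavily simplified sketch of what the reader could look like. It is a sketch only: the exact signatures of the virtualizarr internals it touches (ZArray, ChunkManifest, ManifestArray) and some of the zarr-python v3 calls are assumptions, and coordinate/index handling is skipped.

```python
# Rough sketch of the steps above, not a working implementation. Signatures of
# virtualizarr internals (ZArray, ChunkManifest, ManifestArray) and some zarr-python
# v3 details are assumptions; the real reader would live in virtualizarr.readers.zarr.
import xarray as xr


def open_virtual_dataset(store_path: str, loadable_variables=()) -> xr.Dataset:
    import zarr  # step 1: protected import; zarr-python v3 reads both v2 and v3 stores
    import fsspec
    from virtualizarr.manifests import ChunkManifest, ManifestArray
    from virtualizarr.zarr import ZArray

    group = zarr.open_group(store_path, mode="r")
    all_vars = set(group.array_keys())  # step 2: list variables in the store
    assert set(loadable_variables) <= all_vars

    fs = fsspec.filesystem("file")  # or s3/gcs etc., inferred from store_path in practice
    virtual_vars = {}
    for name in all_vars - set(loadable_variables):  # step 3
        arr = group[name]
        attrs = dict(arr.attrs)
        # 3a: dims come from _ARRAY_DIMENSIONS (v2) or dimension_names (v3)
        dims = attrs.pop("_ARRAY_DIMENSIONS", None) or getattr(arr.metadata, "dimension_names", None)
        # 3b: dtype / chunk grid info -> ZArray (assumed signature)
        zarray = ZArray(shape=arr.shape, chunks=arr.chunks, dtype=arr.dtype)
        # 3c: deduce the directory / prefix the chunks live in (v3 nests chunks under "c/")
        chunk_dir = f"{store_path}/{name}"
        entries = {}
        for info in fs.ls(chunk_dir, detail=True):  # 3d: missing chunks are simply absent
            key = info["name"].rsplit("/", 1)[-1]
            if key in (".zarray", ".zattrs", "zarr.json", "c"):
                continue  # skip metadata objects
            # 3e/3f: offset is 0, length is the listed file size
            entries[key] = {"path": info["name"], "offset": 0, "length": info["size"]}
        manifest = ChunkManifest(entries=entries)                     # 3f
        marr = ManifestArray(zarray=zarray, chunkmanifest=manifest)   # 3g
        virtual_vars[name] = xr.Variable(dims=dims, data=marr, attrs=attrs)  # 3h

    # step 4: load the non-virtual variables with plain xarray
    loadable = xr.open_zarr(store_path, drop_variables=list(all_vars - set(loadable_variables)))
    # steps 5-6: coordinate handling (separate_coords) omitted in this sketch
    return xr.Dataset({**virtual_vars, **{k: loadable[k].variable for k in loadable_variables}})
```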

For later:

  • support shards
  • hopefully call a .chunk_lengths method or equivalent on zarr-python instead of needing fsspec.ls
  • optimizations (e.g. using async interface to list lengths of chunks for each variable concurrently)

This reader should be called just zarr, as it can handle both v2 and v3. We're not using kerchunk's ZarrToZarr because this approach should be more performant (it can use async), pushes more of the implementation upstream, and works with zarr-python v3.

The existing zarr_v3 reader (very badly named) should either be folded into this one or just deleted. If it's folded in, the way to do that is to get the chunk manifest dict information from the manifest.json file instead of fsspec.ls. Then we should also alias the zarr_v3 reader to point to the zarr reader, with a deprecation warning (a possible shim is sketched below).
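
A minimal sketch of what that deprecation alias could look like; the module path and the new function name here are assumptions.

```python
# Minimal sketch of a deprecation alias, assuming the new reader is exposed as
# virtualizarr.readers.zarr.open_virtual_dataset; names are placeholders.
import warnings


def open_virtual_dataset_from_v3_store(*args, **kwargs):
    warnings.warn(
        "the 'zarr_v3' reader is deprecated; use the 'zarr' reader instead",
        DeprecationWarning,
        stacklevel=2,
    )
    from virtualizarr.readers.zarr import open_virtual_dataset
    return open_virtual_dataset(*args, **kwargs)
```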

TomNicholas (Member) commented Oct 23, 2024

@norlandrhagen FYI we can use Store.getsize from zarr-developers/zarr-python#2426 instead of fsspec.ls!!
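
For reference, a sketch of how that might be used to get chunk sizes concurrently, assuming getsize is an async method on the store that takes a chunk key (as added in that PR):

```python
# Sketch only: assumes zarr-python's Store exposes an async getsize(key) -> int,
# per zarr-developers/zarr-python#2426.
import asyncio


async def chunk_lengths(store, array_name: str, chunk_keys: list[str]) -> dict[str, int]:
    # Query sizes for all chunks of one array concurrently, instead of a blocking fsspec.ls().
    sizes = await asyncio.gather(
        *(store.getsize(f"{array_name}/{key}") for key in chunk_keys)
    )
    return dict(zip(chunk_keys, sizes))
```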

norlandrhagen (Collaborator, Author) commented

Oh great find!

TomNicholas (Member) commented

To support generating virtual references to sharded v3 data (which we should definitely leave for a later PR), see zarr-developers/zarr-python#1661 (comment).
