Manifest arrays use arrayv3metadata #429

Open

abarciauskas-bgse wants to merge 57 commits into base: zarr-python-3.0

Changes from 16 commits

Commits
2a01bfa
Added zarray_to_v3metadata and test
abarciauskas-bgse Feb 4, 2025
17fd547
Working on manifest array tests
abarciauskas-bgse Feb 5, 2025
e5666ab
Fix test_manifests/test_array#TestConcat tests
abarciauskas-bgse Feb 5, 2025
5a8cc4c
Passing TestStack tests and add fixture
abarciauskas-bgse Feb 5, 2025
4c0b616
All test_manifests/test_array tests passing
abarciauskas-bgse Feb 6, 2025
ac2f787
Compressors should be list
abarciauskas-bgse Feb 6, 2025
5503c60
Passing dmrpp tests
abarciauskas-bgse Feb 6, 2025
1272051
Merge branch 'main' into manifest-arrays-use-arrayv3metadata
abarciauskas-bgse Feb 6, 2025
1f36755
Passing test_hdf.py tests
abarciauskas-bgse Feb 6, 2025
7098803
Start to work on kerchunk tests
abarciauskas-bgse Feb 6, 2025
ce2284c
Add method to convert array v3 metadata to v2 metadata for kerchunk (…
abarciauskas-bgse Feb 7, 2025
c9853d5
Fix fixtures and mark xfail netcdf3
abarciauskas-bgse Feb 7, 2025
209dae3
Test for convert_v3_to_v2_metadata
abarciauskas-bgse Feb 7, 2025
e7205ef
Deduplicate fixture for array v3 metadata
abarciauskas-bgse Feb 7, 2025
d65e457
Parse filters and compressors from v3 metdata for v2 metadata
abarciauskas-bgse Feb 7, 2025
190c20f
Rewrite extract_codecs
abarciauskas-bgse Feb 7, 2025
47f5ddd
Refactor convert_to_codec_pipeline
abarciauskas-bgse Feb 8, 2025
5d15608
Fix hdf integration tests
abarciauskas-bgse Feb 8, 2025
908bc52
Test for convert_to_codec_pipeline
abarciauskas-bgse Feb 8, 2025
4a8bfdd
Refactor get_codecs and its tests
abarciauskas-bgse Feb 8, 2025
d05cec3
Fix most integration tests and writer tests
abarciauskas-bgse Feb 9, 2025
ff23eeb
Fix xarray tests
abarciauskas-bgse Feb 9, 2025
8560f2d
Working on integration tests
abarciauskas-bgse Feb 9, 2025
97d0a71
Add expected type
abarciauskas-bgse Feb 10, 2025
669ce52
Mark datetime tests xfail
abarciauskas-bgse Feb 10, 2025
b794dab
Upgrade xarray for tests
abarciauskas-bgse Feb 10, 2025
825142d
xfail some unsupported zarr-python 3 data types
abarciauskas-bgse Feb 10, 2025
6684125
Require zarr
abarciauskas-bgse Feb 10, 2025
5e82de4
Remove zarr dep
abarciauskas-bgse Feb 10, 2025
f57b48d
import zarr, explicit dependency
abarciauskas-bgse Feb 10, 2025
b811959
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 10, 2025
8c5139b
Add zarr as a dependency
abarciauskas-bgse Feb 11, 2025
eb2a86c
Merge branch 'manifest-arrays-use-arrayv3metadata' of github.com:zarr…
abarciauskas-bgse Feb 11, 2025
15ac7a7
Merge branch 'main' into manifest-arrays-use-arrayv3metadata
abarciauskas-bgse Feb 11, 2025
5359762
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 11, 2025
c808351
Min numcodecs version
abarciauskas-bgse Feb 12, 2025
bd50167
numcodecs>=0.15.1 in environment and upstream.yml conda env files
abarciauskas-bgse Feb 12, 2025
ed97704
Working on mypy errors
abarciauskas-bgse Feb 12, 2025
a3c190e
Fix mypy errors and tests
abarciauskas-bgse Feb 12, 2025
95886b9
Remove ZArray class
abarciauskas-bgse Feb 12, 2025
a0f72b2
Just return metadata's shape
abarciauskas-bgse Feb 12, 2025
aad511f
Create update metadata function
abarciauskas-bgse Feb 12, 2025
b357b04
Fix typing for update_metadata
abarciauskas-bgse Feb 12, 2025
08e877a
Check for regular chunk grid in manifest instantiation
abarciauskas-bgse Feb 12, 2025
f040459
Remove obsolete codecs code
abarciauskas-bgse Feb 12, 2025
495d660
Fix chunks function and add docstring
abarciauskas-bgse Feb 12, 2025
a262f0b
Remove custom zattrs type
abarciauskas-bgse Feb 12, 2025
bcd68a0
Move some imports and make update_metadata a private method
abarciauskas-bgse Feb 12, 2025
f0ce778
Remove zarr.py
abarciauskas-bgse Feb 12, 2025
0518488
Add zarr to other ci env files
abarciauskas-bgse Feb 13, 2025
0712979
Fixture array_v3_metadata uses array_v3_metadata_dict
abarciauskas-bgse Feb 13, 2025
c40915d
No need for union type for CodecPipeline
abarciauskas-bgse Feb 13, 2025
cdaca53
Use type alias
abarciauskas-bgse Feb 13, 2025
2415e07
Add comment
abarciauskas-bgse Feb 13, 2025
9366d69
Update virtualizarr/manifests/array_api.py
abarciauskas-bgse Feb 13, 2025
d590cfc
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 13, 2025
6394207
Revised copy_and_replace_metadata to be in utils and called correctly
abarciauskas-bgse Feb 13, 2025
58 changes: 58 additions & 0 deletions conftest.py
@@ -6,6 +6,9 @@
import pytest
import xarray as xr
from xarray.core.variable import Variable
from zarr.core.metadata.v3 import ArrayV3Metadata

from virtualizarr.zarr import convert_to_codec_pipeline


def pytest_addoption(parser):
@@ -150,3 +153,58 @@ def simple_netcdf4(tmp_path: Path) -> str:
ds.to_netcdf(filepath)

return str(filepath)


@pytest.fixture
def array_v3_metadata():

Review comment (Member):
Could we reimplement this fixture to internally just call the array_v3_metadata_dict fixture below?


Reply (Collaborator, Author):
I did this in 0712979. I do like having just one method to create array v3 metadata for tests; however, it does mean tests are a bit more verbose, since every codecs argument must include an ArrayBytesCodec (which is always {"name": "bytes", "configuration": {"endian": "little"}}).

But I'll think on ways to streamline this more...


Review comment:
Have a look at the signature of this function, which has a lot of sane defaults and works for both v2 and v3 metadata: https://github.com/zarr-developers/zarr-python/blob/99621ecf0b81400e323828111363fe21cf0c7592/src/zarr/core/array.py#L4008-L4030. I think we could consider adding an ArrayV3Metadata.build method with a signature like this, which should make creating metadata documents a lot easier.

def _create_metadata(
shape: tuple,
chunks: tuple,
compressors: list[dict] = [{"id": "zlib", "level": 1}],
filters: list[dict] | None = None,
):
return ArrayV3Metadata(
shape=shape,
data_type="int32",
chunk_grid={"name": "regular", "configuration": {"chunk_shape": chunks}},
chunk_key_encoding={"name": "default"},
fill_value=0,
codecs=convert_to_codec_pipeline(
compressors=compressors,
filters=filters,
dtype=np.dtype("int32"),
),
attributes={},
dimension_names=None,
storage_transformers=None,
)

return _create_metadata


@pytest.fixture
def array_v3_metadata_dict():
def _create_metadata_dict(
shape: tuple,
chunks: tuple,
codecs: list[dict] = [
{"configuration": {"endian": "little"}, "name": "bytes"},
{
"name": "numcodecs.zlib",
"configuration": {"level": 1},
},
],
):
return {
"shape": shape,
"data_type": "int32",
"chunk_grid": {"name": "regular", "configuration": {"chunk_shape": chunks}},
"chunk_key_encoding": {"name": "default"},
"fill_value": 0,
"codecs": codecs,
"attributes": {},
"dimension_names": None,
"storage_transformers": None,
}

return _create_metadata_dict
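
As a usage illustration of the two fixtures above, a test might combine array_v3_metadata with the new metadata-based ManifestArray constructor roughly as follows. This is only a sketch: the test name and the chunk manifest entry are hypothetical and not taken from this PR.

    import numpy as np

    from virtualizarr.manifests import ChunkManifest, ManifestArray


    def test_manifest_array_from_v3_metadata(array_v3_metadata):
        # a single 2x2 chunk covering the whole 2x2 array
        metadata = array_v3_metadata(shape=(2, 2), chunks=(2, 2))
        manifest = ChunkManifest(
            entries={"0.0": {"path": "s3://bucket/data.nc", "offset": 100, "length": 50}}
        )
        marr = ManifestArray(metadata=metadata, chunkmanifest=manifest)

        assert marr.shape == (2, 2)
        assert marr.chunks == (2, 2)
        assert marr.dtype == np.dtype("int32")  # the fixture's default dtype
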
6 changes: 2 additions & 4 deletions virtualizarr/codecs.py
@@ -54,10 +54,8 @@ def _get_manifestarray_codecs(
normalize_to_zarr_v3: bool = False,
) -> Union[Codec, tuple["ArrayArrayCodec | ArrayBytesCodec | BytesBytesCodec", ...]]:
"""Get codecs for a ManifestArray based on its zarr_format."""
if normalize_to_zarr_v3 or array.zarray.zarr_format == 3:
return (array.zarray.serializer(),) + array.zarray._v3_codec_pipeline()
elif array.zarray.zarr_format == 2:
return array.zarray.codec
if normalize_to_zarr_v3 or array.metadata.zarr_format == 3:
return array.metadata.codecs
else:
raise ValueError("Unsupported zarr_format for ManifestArray.")

45 changes: 28 additions & 17 deletions virtualizarr/manifests/array.py
@@ -2,13 +2,13 @@
from typing import Any, Callable, Union

import numpy as np
from zarr.core.metadata.v3 import ArrayV3Metadata, RegularChunkGrid

from virtualizarr.manifests.array_api import (
MANIFESTARRAY_HANDLED_ARRAY_FUNCTIONS,
_isnan,
)
from virtualizarr.manifests.manifest import ChunkManifest
from virtualizarr.zarr import ZArray


class ManifestArray:
@@ -24,27 +24,27 @@ class ManifestArray:
"""

_manifest: ChunkManifest
_zarray: ZArray
_metadata: ArrayV3Metadata

def __init__(
self,
zarray: ZArray | dict,
metadata: ArrayV3Metadata | dict,
chunkmanifest: dict | ChunkManifest,
) -> None:
"""
Create a ManifestArray directly from the .zarray information of a zarr array and the manifest of chunks.
Parameters
----------
zarray : dict or ZArray
metadata : dict or ArrayV3Metadata
chunkmanifest : dict or ChunkManifest
"""

if isinstance(zarray, ZArray):
_zarray = zarray
if isinstance(metadata, ArrayV3Metadata):
_metadata = metadata
else:
# try unpacking the dict
_zarray = ZArray(**zarray)
_metadata = ArrayV3Metadata(**metadata)

if isinstance(chunkmanifest, ChunkManifest):
_chunkmanifest = chunkmanifest
@@ -55,32 +55,43 @@ def __init__(
f"chunkmanifest arg must be of type ChunkManifest or dict, but got type {type(chunkmanifest)}"
)

# TODO check that the zarray shape and chunkmanifest shape are consistent with one another
# TODO check that the metadata shape and chunkmanifest shape are consistent with one another
# TODO also cover the special case of scalar arrays

self._zarray = _zarray
self._metadata = _metadata
self._manifest = _chunkmanifest

@property
def manifest(self) -> ChunkManifest:
return self._manifest

@property
def zarray(self) -> ZArray:
return self._zarray
def metadata(self) -> ArrayV3Metadata:
return self._metadata

@property
def chunks(self) -> tuple[int, ...]:
return tuple(self.zarray.chunks)
"""
Individual chunk size by number of elements.
"""
if isinstance(self._metadata.chunk_grid, RegularChunkGrid):
return self._metadata.chunk_grid.chunk_shape
else:
raise NotImplementedError(
"Only RegularChunkGrid is currently supported for chunk size"
)

@property
def dtype(self) -> np.dtype:
dtype_str = self.zarray.dtype
return np.dtype(dtype_str)
dtype_str = self.metadata.data_type
return dtype_str.to_numpy()

@property
def shape(self) -> tuple[int, ...]:
return tuple(int(length) for length in list(self.zarray.shape))
"""
Array shape by number of elements along each dimension.
"""
return tuple(int(length) for length in list(self.metadata.shape))

@property
def ndim(self) -> int:
@@ -155,7 +166,7 @@ def __eq__(  # type: ignore[override]
if self.shape != other.shape:
raise NotImplementedError("Unsure how to handle broadcasting like this")

if self.zarray != other.zarray:
if self.metadata != other.metadata:
return np.full(shape=self.shape, fill_value=False, dtype=np.dtype(bool))
else:
if self.manifest == other.manifest:
@@ -263,7 +274,7 @@ def rename_paths(
ChunkManifest.rename_paths
"""
renamed_manifest = self.manifest.rename_paths(new)
return ManifestArray(zarray=self.zarray, chunkmanifest=renamed_manifest)
return ManifestArray(metadata=self.metadata, chunkmanifest=renamed_manifest)


def _possibly_expand_trailing_ellipsis(key, ndim: int):
37 changes: 22 additions & 15 deletions virtualizarr/manifests/array_api.py
@@ -53,6 +53,8 @@ def concatenate(
The signature of this function is array API compliant, so that it can be called by `xarray.concat`.
"""
from zarr.core.metadata.v3 import ArrayV3Metadata

from .array import ManifestArray

if axis is None:
@@ -100,12 +102,12 @@ def concatenate(
lengths=concatenated_lengths,
)

# chunk shape has not changed, there are just now more chunks along the concatenation axis
new_zarray = first_arr.zarray.replace(
shape=tuple(new_shape),
)
metadata_copy = first_arr.metadata.to_dict().copy()
metadata_copy["shape"] = tuple(new_shape)
# ArrayV3Metadata.from_dict removes extra keys zarr_format and node_type
new_metadata = ArrayV3Metadata.from_dict(metadata_copy)

return ManifestArray(chunkmanifest=concatenated_manifest, zarray=new_zarray)
return ManifestArray(chunkmanifest=concatenated_manifest, metadata=new_metadata)


@implements(np.stack)
@@ -120,6 +122,8 @@ def stack(
The signature of this function is array API compliant, so that it can be called by `xarray.stack`.
"""
from zarr.core.metadata.v3 import ArrayV3Metadata

from .array import ManifestArray

if not isinstance(axis, int):
@@ -170,12 +174,13 @@ def stack(
new_chunks = list(old_chunks)
new_chunks.insert(axis, 1)

new_zarray = first_arr.zarray.replace(
chunks=tuple(new_chunks),
shape=tuple(new_shape),
)
metadata_copy = first_arr.metadata.to_dict().copy()
metadata_copy["shape"] = tuple(new_shape)
metadata_copy["chunk_grid"]["configuration"]["chunk_shape"] = tuple(new_chunks)
# ArrayV3Metadata.from_dict removes extra keys zarr_format and node_type
new_metadata = ArrayV3Metadata.from_dict(metadata_copy)

return ManifestArray(chunkmanifest=stacked_manifest, zarray=new_zarray)
return ManifestArray(chunkmanifest=stacked_manifest, metadata=new_metadata)


@implements(np.expand_dims)
@@ -190,6 +195,7 @@ def broadcast_to(x: "ManifestArray", /, shape: tuple[int, ...]) -> "ManifestArray":
"""
Broadcasts a ManifestArray to a specified shape, by either adjusting chunk keys or copying chunk manifest entries.
"""
from zarr.core.metadata.v3 import ArrayV3Metadata

from .array import ManifestArray

@@ -236,12 +242,13 @@ def broadcast_to(x: "ManifestArray", /, shape: tuple[int, ...]) -> "ManifestArray":
lengths=broadcasted_lengths,
)

new_zarray = x.zarray.replace(
chunks=new_chunk_shape,
shape=new_shape,
)
metadata_copy = x.metadata.to_dict().copy()
metadata_copy["shape"] = tuple(new_shape)
metadata_copy["chunk_grid"]["configuration"]["chunk_shape"] = tuple(new_chunk_shape)
# ArrayV3Metadata.from_dict removes extra keys zarr_format and node_type
new_metadata = ArrayV3Metadata.from_dict(metadata_copy)

return ManifestArray(chunkmanifest=broadcasted_manifest, zarray=new_zarray)
return ManifestArray(chunkmanifest=broadcasted_manifest, metadata=new_metadata)


def _prepend_singleton_dimensions(shape: tuple[int, ...], ndim: int) -> tuple[int, ...]:
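
The same to_dict/from_dict round trip appears in concatenate, stack, and broadcast_to above, and a later commit in this PR ("Revised copy_and_replace_metadata to be in utils and called correctly") moves it into a shared helper. A minimal sketch of such a helper, assuming this signature (the actual one in the PR may differ):

    from zarr.core.metadata.v3 import ArrayV3Metadata


    def copy_and_replace_metadata(
        old_metadata: ArrayV3Metadata,
        new_shape: tuple[int, ...],
        new_chunks: tuple[int, ...] | None = None,
    ) -> ArrayV3Metadata:
        """Return a copy of old_metadata with an updated shape and, optionally, chunk shape."""
        metadata_dict = old_metadata.to_dict().copy()
        metadata_dict["shape"] = tuple(new_shape)
        if new_chunks is not None:
            metadata_dict["chunk_grid"]["configuration"]["chunk_shape"] = tuple(new_chunks)
        # from_dict drops the extra zarr_format and node_type keys that to_dict adds
        return ArrayV3Metadata.from_dict(metadata_dict)

With this in place, concatenate, stack, and broadcast_to could each call copy_and_replace_metadata instead of repeating the dict manipulation.
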
32 changes: 23 additions & 9 deletions virtualizarr/readers/dmrpp.py
@@ -12,7 +12,6 @@
from virtualizarr.readers.common import VirtualBackend
from virtualizarr.types import ChunkKey
from virtualizarr.utils import _FsspecFSFromFilepath, check_for_collisions
from virtualizarr.zarr import ZArray


class DMRPPVirtualBackend(VirtualBackend):
@@ -378,6 +377,10 @@ def _parse_variable(self, var_tag: ET.Element) -> Variable:
-------
xr.Variable
"""
from zarr.core.metadata.v3 import ArrayV3Metadata

from virtualizarr.zarr import convert_to_codec_pipeline

# Dimension info
dims: dict[str, int] = {}
dimension_tags = self._find_dimension_tags(var_tag)
@@ -414,16 +417,27 @@ def _parse_variable(self, var_tag: ET.Element) -> Variable:
# Fill value is placed in zarr array's fill_value and variable encoding and removed from attributes
encoding = {k: attrs.get(k) for k in self._ENCODING_KEYS if k in attrs}
fill_value = attrs.pop("_FillValue", None)
# create ManifestArray and ZArray
zarray = ZArray(
chunks=chunks_shape,
dtype=dtype,
fill_value=fill_value,
filters=filters,
order="C",
# create ManifestArray
metadata = ArrayV3Metadata(

Review comment on lines +419 to +420 (Member):
This call seems to have quite a lot of boilerplate that authors of virtualizarr readers will never need; since it's generally useful, perhaps we should provide our own constructor function in .utils?

shape=shape,
data_type=dtype,
chunk_grid={
"name": "regular",
"configuration": {"chunk_shape": chunks_shape},
},
chunk_key_encoding={"name": "default"},
fill_value=fill_value,
codecs=convert_to_codec_pipeline(
compressors=filters,
dtype=dtype,
filters=None,
serializer="auto",
),
attributes=attrs,
dimension_names=None,

Review comment on lines +433 to +434 (Member):
Do we want to specify attrs here? I feel like we want neither dimension_names nor attrs, because in virtualizarr those both are stored in the xarray objects instead.

storage_transformers=None,
)
marr = ManifestArray(zarray=zarray, chunkmanifest=chunkmanifest)
marr = ManifestArray(metadata=metadata, chunkmanifest=chunkmanifest)
return Variable(dims=dims.keys(), data=marr, attrs=attrs, encoding=encoding)

def _parse_attribute(self, attr_tag: ET.Element) -> dict[str, Any]:
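
Following the review comment above about boilerplate, one option is a small constructor helper in virtualizarr (for example in .utils) that readers such as the DMR++ and HDF backends could share. The function name and keyword defaults below are assumptions for illustration, not code from this PR.

    import numpy as np
    from zarr.core.metadata.v3 import ArrayV3Metadata

    from virtualizarr.zarr import convert_to_codec_pipeline


    def create_v3_array_metadata(
        shape: tuple[int, ...],
        chunk_shape: tuple[int, ...],
        data_type: np.dtype,
        fill_value=None,
        compressors: list[dict] | None = None,
        filters: list[dict] | None = None,
        attributes: dict | None = None,
    ) -> ArrayV3Metadata:
        """Build ArrayV3Metadata, defaulting the fields that readers never vary."""
        return ArrayV3Metadata(
            shape=shape,
            data_type=data_type,
            chunk_grid={"name": "regular", "configuration": {"chunk_shape": chunk_shape}},
            chunk_key_encoding={"name": "default"},
            fill_value=fill_value,
            codecs=convert_to_codec_pipeline(
                compressors=compressors,
                filters=filters,
                dtype=data_type,
            ),
            attributes=attributes or {},
            dimension_names=None,
            storage_transformers=None,
        )

A reader like _parse_variable above could then pass only its shape, chunk shape, dtype, fill value, and codec list, leaving the remaining fields to the helper's defaults.
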
29 changes: 19 additions & 10 deletions virtualizarr/readers/hdf/hdf.py
@@ -28,7 +28,6 @@
from virtualizarr.readers.hdf.filters import cfcodec_from_dataset, codecs_from_dataset
from virtualizarr.types import ChunkKey
from virtualizarr.utils import _FsspecFSFromFilepath, check_for_collisions, soft_import
from virtualizarr.zarr import ZArray

h5py = soft_import("h5py", "For reading hdf files", strict=False)

@@ -285,6 +284,9 @@ def _dataset_to_variable(
"""
# This chunk determination logic mirrors zarr-python's create
# https://github.com/zarr-developers/zarr-python/blob/main/zarr/creation.py#L62-L66
from zarr.core.metadata.v3 import ArrayV3Metadata

from virtualizarr.zarr import convert_to_codec_pipeline

chunks = dataset.chunks if dataset.chunks else dataset.shape
codecs = codecs_from_dataset(dataset)
@@ -306,20 +308,27 @@ def _dataset_to_variable(
if isinstance(fill_value, np.generic):
fill_value = fill_value.item()
filters = [codec.get_config() for codec in codecs]
zarray = ZArray(
chunks=chunks, # type: ignore
compressor=None,
dtype=dtype,
fill_value=fill_value,
filters=filters,
order="C",

metadata = ArrayV3Metadata(
shape=dataset.shape,
zarr_format=2,
data_type=dtype,
chunk_grid={"name": "regular", "configuration": {"chunk_shape": chunks}},
chunk_key_encoding={"name": "default"},
fill_value=fill_value,
codecs=convert_to_codec_pipeline(
compressors=None,
dtype=dtype,
filters=filters,
serializer="auto",
),
attributes=attrs,
dimension_names=None,
storage_transformers=None,
)
dims = HDFVirtualBackend._dataset_dims(dataset, group=group)
manifest = HDFVirtualBackend._dataset_chunk_manifest(path, dataset)
if manifest:
marray = ManifestArray(zarray=zarray, chunkmanifest=manifest)
marray = ManifestArray(metadata=metadata, chunkmanifest=manifest)
variable = xr.Variable(data=marray, dims=dims, attrs=attrs)
else:
variable = xr.Variable(data=np.empty(dataset.shape), dims=dims, attrs=attrs)