Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Manifest arrays use arrayv3metadata #429

Open
wants to merge 49 commits into
base: zarr-python-3.0
Choose a base branch
from

Conversation

abarciauskas-bgse
Copy link
Collaborator

@abarciauskas-bgse abarciauskas-bgse commented Feb 6, 2025

This is still very much a WIP - many tests and implementations still need to be fixed.

A few notes:

  • It was suggested we remove ZArray completely as a part of this work, as opposed to using a conversion function for ZArrays to ArrayV3Metadata. So we should be able to remove ZArray as a part of this pr.
  • It was suggested not to use zarr's _parse_chunk_encoding_v3 function since it is a private function and may change, which is why some of that logic is replicated in convert_to_codec_pipeline

Checklist

  • Closes ManifestArray should use zarr-python's ArrayV3Metadata #424
  • Manifest tests passing
  • Library (codecs, etc) tests passing
  • Reader tests passing
  • test_integration tests passing
  • test_xarray tests passing
  • Writer tests passing
  • Cleanup dead code
  • Consider reorganizing codecs and zarr modules
  • Tests added for new functions
  • Full type hint coverage
  • Changes are documented in docs/releases.rst
  • New functions/methods are listed in api.rst
  • New functionality has documentation

virtualizarr/manifests/array.py Outdated Show resolved Hide resolved
virtualizarr/manifests/array.py Outdated Show resolved Hide resolved
virtualizarr/manifests/array_api.py Outdated Show resolved Hide resolved
@TomNicholas TomNicholas added zarr-python Relevant to zarr-python upstream internals labels Feb 6, 2025
@norlandrhagen
Copy link
Collaborator

Yes my first suggestion was similar to your third suggestion.

It sounds like we are leaning towards a feature branch (zarr-python-3.0?) but I'll wait for others to chime in. cc @norlandrhagen @ayushnag

That option seems great to me. Thanks for moving this along @abarciauskas-bgse

@abarciauskas-bgse
Copy link
Collaborator Author

abarciauskas-bgse commented Feb 12, 2025

I'm going to continue to review this tomorrow but the tests are passing and I've done an initial reorganization of the code that was in zarr.py. So if any of @TomNicholas @norlandrhagen @ayushnag @sharkinsspatial @jsignell @mpiannucci want to start to review please go ahead 🙏🏽

I also changed the base to a new branch of main zarr-python-3.0,

@abarciauskas-bgse abarciauskas-bgse changed the base branch from main to zarr-python-3.0 February 12, 2025 23:58
Copy link
Member

@TomNicholas TomNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is absolutely great @abarciauskas-bgse ! Comments are really just minor.

@@ -13,7 +13,7 @@ dependencies:
- ujson
- universal_pathlib
- hdf5plugin
- numcodecs
- numcodecs>=0.15.1
- imagecodecs>=2024.6.1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should list zarr explicitly in all the envs, and in fact in this upstream one we could install it from main.

Comment on lines +218 to +219
@pytest.fixture
def array_v3_metadata():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we reimplement this fixture to internally just call the array_v3_metadata_dict fixture below?

```
ZArray(shape=(2920, 25, 53), chunks=(2920, 25, 53), dtype=int16, compressor=None, filters=None, fill_value=None)
ArrayV3Metadata(shape=(2920, 25, 53),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The explosion of information in this printed repr is an argument for either an xarray short repr (@keewis is that public API?) or for replacing these markdown files with rendered output of notebooks (see #83).

from .manifests.array import ManifestArray

CodecPipeline = Tuple[
Union["ArrayArrayCodec", "ArrayBytesCodec", "BytesBytesCodec"], ...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you prefer since python 3.10 you can use | instead of Union.

def extract_codecs(
codecs: CodecPipeline,
) -> tuple[
tuple[ArrayArrayCodec, ...], ArrayBytesCodec | None, tuple[BytesBytesCodec, ...]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pretty complicated type that I had to stare at to work out what it is. Use TypeAlias with an informative name?

Also is it definitely the right type? Seems weird that this would be valid: ((,), None, (,))

],
def test_manifest_array_zarr_v3_with_codecs(self, create_manifestarray):
"""Test get_codecs with ManifestArray using multiple v3 codecs."""
test_codecs = [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could also be a global variable TEST_CODECS, or even a fixture providing multiple sets of test codecs. See also @dcherian's codecs hypothesis strategy that we could use in future (zarr-developers/zarr-python#2822).

order="C",
shape=(5, 1, 20),
zarr_format=2,
# FAILING: TypeError: no implementation found for 'numpy.concatenate' on types that implement __array_function__: [<class 'virtualizarr.manifests.array.ManifestArray'>, <class 'numpy.ndarray'>]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bad bug and shouldn't be forgotten - it implies that somehow inside the manifest / ManifestArray concatenation implementation some array is the wrong type.

(This can't be being thrown from the top-level np.concatenate you can see in the test, so it must be being thrown by the lower-level np.concatenate call that is actually used to merge the manifests internally in .array_api.)

"compressor", "chunks", and "shape".
Returns:
-------
ArrayV3Metadata
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ArrayV3Metadata
ArrayV3Metadata

codecs = zarray._v3_codecs()

# create array if it doesn't already exist
# TODO: Should codecs be an argument to zarr's AsyncrGroup.create_array?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're asking for an upstream change here right?

@@ -67,13 +90,65 @@ def remove_file_uri_prefix(path: str):
return path


def convert_v3_to_v2_metadata(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this (a) not live in the kerchunk-specific writer module, (b) actually live in zarr-python upstream? Or is there no use for it upstream?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
internals zarr-python Relevant to zarr-python upstream
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

ManifestArray should use zarr-python's ArrayV3Metadata
4 participants