Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add default compressors to config #2470

Merged
merged 45 commits into from
Dec 19, 2024

Conversation

brokkoli71
Copy link
Member

This PR adds:

  • default compressor for zarr_format=2
  • specification of default compressors dependent on dtype in zarr.config

fixes #2267

Should _get_default_array_bytes_codec for zarr_format=3 also be configurable in zarr.config?

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@brokkoli71 brokkoli71 marked this pull request as draft November 6, 2024 15:36
@brokkoli71
Copy link
Member Author

brokkoli71 commented Nov 6, 2024

I am a little confused: in which cases should we use VLenBytes and in which cases VLenUTF8? Or do we need both at the same time?

@brokkoli71
Copy link
Member Author

Also, do we need a default compressor or are default filters sufficient?

@brokkoli71 brokkoli71 marked this pull request as ready for review November 13, 2024 15:20
@jhamman jhamman added the V3 label Nov 29, 2024
@normanrz normanrz self-requested a review November 29, 2024 15:39
Copy link
Member

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @brokkoli71. I think @normanrz is also going to review this but I wanted to also bring up an additional point.

In Zarr2 we told people to set zarr.storage.default_compressor = SomeCompressor() This was simple but also an odd way to manage config. What we have now is much better. However, I wonder if we should do something to catch folks trying do set the default_compressor variable. Thoughts?

src/zarr/codecs/_v2.py Outdated Show resolved Hide resolved
src/zarr/core/config.py Outdated Show resolved Hide resolved
src/zarr/core/metadata/v2.py Outdated Show resolved Hide resolved
Copy link
Contributor

@dstansby dstansby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Couple of things:

  • Print statements need removing
  • The config naming could probably be improved

This would also be a good oppurtunity to update or add user facing documentation on what default compressors are used for what types of data - is that something you could add in this PR?

src/zarr/core/config.py Outdated Show resolved Hide resolved
src/zarr/codecs/_v2.py Outdated Show resolved Hide resolved
src/zarr/core/config.py Outdated Show resolved Hide resolved
src/zarr/core/metadata/v2.py Outdated Show resolved Hide resolved
@normanrz
Copy link
Member

normanrz commented Dec 6, 2024

I think we should also have defaults for v3:

          "v3_default_codecs": {
                "numeric": ["bytes", "zstd"],
                "string": ["vlen-utf8"],
                "bytes": ["vlen-bytes"],
            },

@LDeakin
Copy link
Contributor

LDeakin commented Dec 6, 2024

I think we should also have defaults for v3:

          "v3_default_codecs": {
                "numeric": ["bytes", "zstd"],
                "string": ["vlen-utf8"],
                "bytes": ["vlen-bytes"],
            },

zstd isn't in the spec yet: zarr-developers/zarr-specs#256

@brokkoli71
Copy link
Member Author

brokkoli71 commented Dec 8, 2024

thanks for your feedback, i will integrate it this week 👍🏼

@dstansby dstansby added this to the 3.0.0 milestone Dec 9, 2024
Copy link
Member

@normanrz normanrz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Could you please go through the docstrings again and add some info about the default compressor, filters, codecs? Then, I think this is good to go.

@brokkoli71 brokkoli71 requested a review from normanrz December 18, 2024 12:33
@normanrz normanrz enabled auto-merge (squash) December 18, 2024 12:37
@normanrz normanrz requested a review from dstansby December 18, 2024 12:38
Copy link
Contributor

@dstansby dstansby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left lots of small requests for changes, mainly around the docstrings - in general they are great and clear, but I think worth fixing a lot of the little issues while we're here.

I left most of the docstring comments in asynchronous.py, but they also apply to the other files that have updated docstrintgs.

src/zarr/api/asynchronous.py Outdated Show resolved Hide resolved
src/zarr/api/asynchronous.py Outdated Show resolved Hide resolved
src/zarr/api/asynchronous.py Outdated Show resolved Hide resolved
src/zarr/api/asynchronous.py Outdated Show resolved Hide resolved
this collection specify the transformation from array values to stored bytes.
V3 only. V2 arrays should use `filters` and `compressor` instead.
If no codecs are provided, default codecs will be used:
- For numeric arrays, the default is `BytesCodec` and `ZstdCodec`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- For numeric arrays, the default is `BytesCodec` and `ZstdCodec`.
- For numeric arrays, the default is `BytesCodec` and `ZstdCodec`.

Can we also document the default compression level (and any other parameters) here?

src/zarr/api/asynchronous.py Outdated Show resolved Hide resolved
src/zarr/api/asynchronous.py Show resolved Hide resolved
src/zarr/core/array.py Outdated Show resolved Hide resolved
tests/test_v2.py Outdated Show resolved Hide resolved
src/zarr/storage/__init__.py Outdated Show resolved Hide resolved
auto-merge was automatically disabled December 18, 2024 14:12

Head branch was pushed to by a user without write access

@brokkoli71 brokkoli71 requested a review from dstansby December 18, 2024 16:36
@normanrz normanrz merged commit 4cb8ddd into zarr-developers:main Dec 19, 2024
27 checks passed
@normanrz normanrz deleted the default-compressor branch December 19, 2024 17:28
@d-v-b
Copy link
Contributor

d-v-b commented Dec 21, 2024

is there a reason why none of the default compressors / codecs have a configuration?

@d-v-b
Copy link
Contributor

d-v-b commented Dec 21, 2024

and a second question, why aren't strings / bytes compressed with zstd by default, like numeric dtypes?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[v3] default compressor / codec pipeline
6 participants