
Improve Filter interface #17

Open
minnerbe opened this issue Mar 15, 2023 · 2 comments

Comments

@minnerbe
Contributor

Trying to implement variable length string arrays #16 made it evident that having numcodecs-like filters would be a huge step towards supporting general data types; this is the result of multiple discussions with @bogovicj and @axtimwalde, see also this corresponding issue.

However, the Filter interface in its current state contains no methods and is not used anywhere. I suggest fleshing out the Filter interface such that implementers of this interface

  • are de-/serializable from/to JSON with an annotation interface similar to @CompressionType;
  • can be daisy-chained.
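For the first point, one possible sketch of such an annotation is shown below. All names here (FilterType, DeltaFilter) are hypothetical and only illustrate the mechanism; they are not part of the actual n5 API.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical: a runtime annotation analogous to @CompressionType that ties
// a Filter implementation to the identifier it would carry in JSON metadata.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface FilterType {
    String value(); // identifier used in the serialized JSON
}

// Hypothetical filter; only the registration mechanism is sketched here.
@FilterType("delta")
class DeltaFilter {
}

public class Main {
    public static void main(String[] args) {
        // A registry could scan for this annotation to map JSON ids to classes.
        String id = DeltaFilter.class.getAnnotation(FilterType.class).value();
        System.out.println(id); // prints "delta"
    }
}
```

A registry keyed on the annotation value would then let the JSON de-/serializer instantiate the right filter class by name, mirroring how compression types are resolved.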

For the second point to work, the methods for applying and inverting a filter must share the same input and output type. I see two possibilities for this type:

  • Plain buffers (as in numcodecs). This would require either changing the BlockReader and BlockWriter interfaces to work with buffers instead of DataBlocks, which seems unnatural given their names, or manually exposing the raw data of a DataBlock after creation, which seems to go against the intention of the concept.
  • DataBlocks. This would allow filters to create a new DataBlock if necessary (e.g., when the size of the raw data changes), or to modify the data in-place if possible.
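To make the daisy-chaining requirement concrete, here is a minimal sketch assuming the plain-buffer variant (the first possibility above). The Filter interface, the DeltaFilter example, and the chaining helpers are all hypothetical, not existing n5 code:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical: because encode and decode share the same input/output type,
// filters compose into a chain applied in order and inverted in reverse.
interface Filter {
    byte[] encode(byte[] data);
    byte[] decode(byte[] data);
}

// Example filter: delta-encodes bytes (stores differences between neighbors),
// a typical numcodecs-style pre-compression transform.
class DeltaFilter implements Filter {
    public byte[] encode(byte[] data) {
        byte[] out = data.clone();
        for (int i = out.length - 1; i > 0; i--) out[i] -= out[i - 1];
        return out;
    }
    public byte[] decode(byte[] data) {
        byte[] out = data.clone();
        for (int i = 1; i < out.length; i++) out[i] += out[i - 1];
        return out;
    }
}

public class Main {
    // Encoding applies filters front to back; decoding inverts them back to front.
    static byte[] encodeAll(List<Filter> filters, byte[] data) {
        for (Filter f : filters) data = f.encode(data);
        return data;
    }
    static byte[] decodeAll(List<Filter> filters, byte[] data) {
        for (int i = filters.size() - 1; i >= 0; i--) data = filters.get(i).decode(data);
        return data;
    }
    public static void main(String[] args) {
        List<Filter> chain = Arrays.asList(new DeltaFilter(), new DeltaFilter());
        byte[] original = {10, 12, 15, 15, 20};
        byte[] roundTrip = decodeAll(chain, encodeAll(chain, original));
        System.out.println(Arrays.equals(original, roundTrip)); // prints "true"
    }
}
```

The DataBlock variant would look the same except that encode/decode would take and return a DataBlock, letting a filter allocate a new block when the payload size changes.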

Adding filters would allow de-/serializing custom objects in a way that is compatible with the Python implementation of zarr.
A downside is that, for general objects, a DataBlock cannot know the number of deserialized bytes before deserialization. This would probably necessitate changes in the DataBlock interface and in the way DataBlocks are created in the reading process (right now, they pre-allocate an array of the right size to hold the decompressed data).

@mkitti
Contributor

mkitti commented Mar 15, 2023

Consider using https://commons.apache.org/proper/commons-compress/ as a generalized compression framework. Also consider upgrading to Blosc2.

@bogovicj
Collaborator

related: saalfeldlab/n5#87
