Support bitshuffle filter for blosc compressor #1

Closed
wants to merge 1 commit into from

Conversation

eschnett

No description provided.

Owner

@braingram braingram left a comment

Thanks for this PR!

I made some inline comments. Unfortunately I'm advocating for removal of most of the suggested changes (in favor of keeping the compression keyword arguments matching those expected by blosc compress/decompress).

asdf doesn't really provide a good (or any?) mechanism for documenting compression keyword arguments (nor does it save them to the asdf file). Until this can be improved (and perhaps even after), it makes sense to me to keep the arguments, by default, consistent with the library providing the compression.

I'm open to discussing this further and am not at all opposed to making changes to asdf to make any of this more straightforward to use/develop.

@@ -7,10 +7,34 @@ class BloscCompressor(Compressor):
     def compress(self, data, **kwargs):
         import blosc

-        yield blosc.compress(data, **kwargs)
+        # Type size, necessary for shuffle filters
+        typesize = data.itemsize
Owner

Good catch! It appears that blosc defaults to 8, whereas here (with the memoryview) the itemsize (based on the blosc examples) should be 1.

Author

No, the itemsize should be the size of the datatype ultimately stored. This is an optimization hint for the shuffle operators to expose more regularity. For example, when float32 elements are stored, the itemsize should be 4; when complex128 data are stored, it should be 16; etc.

I'm not familiar with all the details of asdf, and using data.itemsize might be the wrong choice. What I'm looking for here is the datatype (that asdf knows) of the ndarray.
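
For illustration (not part of the PR), this is roughly how the typesize hint reaches blosc when the element type is known; the array and its size here are made up:

import blosc
import numpy as np

arr = np.arange(1024, dtype=np.float32)
# typesize tells the shuffle filter how wide each element is:
# 4 for float32, 16 for complex128, and so on.
compressed = blosc.compress(arr.tobytes(), typesize=arr.dtype.itemsize)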

Owner

ASDF blocks are a little weird because they might be shared across arrays (even if the arrays have different dtypes). Because of this, every block is just bytes.

Here's one example:

import asdf
import numpy as np
base_arr = np.arange(10, dtype='uint8')
view1 = base_arr.view(dtype='uint32')
view2 = base_arr.view(dtype='uint64')
asdf.AsdfFile({'v1': view1, 'v2': view2}).write_to('test.asdf')

The example produces a file with only one ASDF block but two arrays, each with a different dtype:

#ASDF 1.0.0
#ASDF_STANDARD 1.5.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: The ASDF Developers, homepage: 'http://github.com/asdf-format/asdf',
  name: asdf, version: 3.0.1.dev0}
history:
  extensions:
  - !core/extension_metadata-1.0.0
    extension_class: asdf.extension._manifest.ManifestExtension
    extension_uri: asdf://asdf-format.org/core/extensions/core-1.5.0
    software: !core/software-1.0.0 {name: asdf, version: 3.0.1.dev0}
v1: !core/ndarray-1.0.0
  source: 0
  datatype: uint32
  byteorder: little
  shape: [8]
v2: !core/ndarray-1.0.0
  source: 0
  datatype: uint64
  byteorder: little
  shape: [4]
...

See this issue for some discussion on asdf saving 'base' arrays: asdf-format/asdf#1669

The standard is not very detailed about this behavior but does contain a short statement about blocks being bytes and not having a datatype:

Blocks represent a contiguous chunk of binary data and nothing more. Information about how to interpret the block, such as the data type or array shape, is stored entirely in ndarray structures in the tree, as described in ndarray.

Is there much of a performance gain to be had for setting itemsize?

typesize = data.itemsize

# Compression level
clevel = kwargs.get("level", 9)
Owner

I'd prefer to keep this as clevel to match the blosc.compress signature. As the default is already 9, I'd say let's let **kwargs handle the clevel.

Author

I was assuming that there is a (small) standard set of arguments that asdf supports, and that people could use these without knowing details about the compressor.

Apparently that's not the case; it's level for zlib, compresslevel for bzip2, etc.

Owner

It's a bit of the "wild west" when it comes to compression keyword arguments. I'd be open to suggestions if you have any ideas for a less "leaky" api. In general compression has not received much attention in asdf (until now).

clevel = kwargs.get("level", 9)

# Shuffle filter
shuffle_string = kwargs.get("filter", "bitshuffle")
Owner

Similar to level vs clevel, I'd say we avoid adding a layer that remaps arguments. Without the remapping we can document this as "kwargs is anything supported by blosc.compress" (or something similar). If we map arguments we'll want to document the new mapping.
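
For reference (example values, not from the PR), a direct blosc call with the unmapped keyword names looks like this:

import blosc

payload = b"\x00" * 4096
compressed = blosc.compress(
    payload,
    typesize=4,                # element width in bytes
    clevel=9,                  # compression level
    shuffle=blosc.BITSHUFFLE,  # shuffle filter
    cname="blosclz",           # codec name
)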

shuffle = blosc.BITSHUFFLE

# CODEC name
cname = kwargs.get("codec", "blosclz")
Owner

Same here.


# Number of threads
nthreads = kwargs.get("threads", 1)
self._api.set_nthreads(nthreads)
Owner

Should this be:

Suggested change
-        self._api.set_nthreads(nthreads)
+        blosc.set_nthreads(nthreads)

If so, I'd say it's cleaner to leave this up to the user to call prior to saving or opening an asdf file and leave it out of the api for this compressor.

Author

Putting it into the compressor allows people to use the compressor without themselves depending on the blosc package.
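
A sketch of that idea (not code from the PR; it assumes the compressor simply forwards an nthreads keyword to blosc):

def compress(self, data, **kwargs):
    import blosc

    # Hypothetical: pull nthreads out of the compression kwargs so callers
    # can set the thread count without importing blosc themselves.
    nthreads = kwargs.pop("nthreads", None)
    if nthreads is not None:
        blosc.set_nthreads(nthreads)
    yield blosc.compress(data, **kwargs)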

Owner

That's a good point: it would mean a user could just pass 'nthreads' in the compression kwargs and not need to call blosc.

Does set_nthreads also affect decompression? Looking at how asdf handles decompression, there is no mechanism for providing keyword arguments even though the Compressor supports them (that is certainly a bug and a gap in the api).

Author

I think decompression is also multi-threaded. I'm not sure.


def decompress(self, data, out, **kwargs):
import blosc

# TODO: call `self._api.decompress_ptr` instead to avoid copying the output
Owner

Great idea! I wasn't sure how to do this with the current structure (where data is an iterator that spits out chunks). Is there a way to iteratively decompress with blosc or does it need access to the full data to decompress (I'm not at all familiar with how it works)?

Author

Internally blosc supports all the things you mention. The Python API is very limited and only supports two decompression calls – one that allocates (and we copy the result) and one that stores via a given pointer.

Since asdf seems to want to allocate the output buffer ahead of time, the second API would be nice to use. For this we need to convert asdf's pre-allocated buffer into a pointer. I don't know enough Python for this.

Alternatively, we could change asdf's decompress API to not take the out buffer as input, but rather to let the decompress function allocate the output buffer.

How blosc's decompression routine processes its input is unrelated. The Python API doesn't support reading in chunks.

blosc compression and decompression are limited in that they can handle at most 2 GByte. blosc2 handles arbitrary sizes, but there isn't any Python package for it available yet. That Python package should probably support chunked compression and decompression, the blosc2 C library supports it.
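
For what it's worth, here is a sketch of how the pre-allocated buffer could be handed to blosc as a pointer (assuming out is writable and supports the buffer protocol, e.g. a numpy array or memoryview; decompress_into is a hypothetical helper name):

import ctypes

import blosc

def decompress_into(chunks, out):
    # Expose the pre-allocated output buffer as a raw writable address.
    c_buf = (ctypes.c_char * out.nbytes).from_buffer(out)
    address = ctypes.addressof(c_buf)
    # blosc's Python API needs the whole compressed block at once.
    return blosc.decompress_ptr(b"".join(chunks), address)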

Owner

Thanks! I like your idea of allowing decompress to allocate the buffer. This, combined with perhaps passing the file pointer (instead of chunks read out of the file), might allow blosc to use decompress_ptr and avoid copying both the input chunks and the output buffer. I think these would be only minor changes in asdf, but I'd have to look in more detail to be sure.

@braingram
Owner

I just moved this repo from braingram to asdf-format (braingram/asdf-compression is now a fork of asdf-format/asdf-compression). Please feel free to re-open this PR against https://github.com/asdf-format/asdf-compression (to hopefully garner some more feedback from other asdf developers). I'm going to close this PR for now but please don't take this as me dismissing your contribution.

@braingram braingram closed this Nov 14, 2023
@braingram
Owner

I opened an issue on asdf-compression referencing your comment about the blosc decompression and decompress_ptr: asdf-format#4

@eschnett
Author

Well, omitting the things that you'd rather keep external, there are two things left, and these are actually just the things I don't know how to do:

  • access the size of the actual ndarray data type
  • convert asdf's out buffer to a pointer

I started opening a new PR, but I realized that it would just have two TODO entries:

diff --git a/asdf_compression/blsc/compressor.py b/asdf_compression/blsc/compressor.py
index 19886f2..61875e9 100644
--- a/asdf_compression/blsc/compressor.py
+++ b/asdf_compression/blsc/compressor.py
@@ -7,10 +7,15 @@ class BloscCompressor(Compressor):
     def compress(self, data, **kwargs):
         import blosc

-        yield blosc.compress(data, **kwargs)
+        # Type size, necessary for shuffle filter efficiency
+        # TODO: This should be the type size of the data actually stored, not just "byte" or "uint8"
+        typesize = data.itemsize
+
+        yield blosc.compress(data, typesize=typesize, **kwargs)

     def decompress(self, data, out, **kwargs):
         import blosc

+        # TODO: call `self._api.decompress_ptr` instead to avoid copying the output
         out[:] = blosc.decompress(b"".join(data), **kwargs)
         return out.nbytes

@braingram
Owner

If you'd like, I wouldn't be opposed to a PR with "TODOs". Another option would be to open issues. The changes you've proposed here are great, as they've exposed a number of gaps in the asdf extension api for block compression, and I just want to make sure that these contributions aren't lost. Thanks again for opening the PR!
