Support bitshuffle filter for blosc compressor #1

Closed · wants to merge 1 commit
26 changes: 25 additions & 1 deletion asdf_compression/blsc/compressor.py
@@ -7,10 +7,34 @@ class BloscCompressor(Compressor):
    def compress(self, data, **kwargs):
        import blosc

        yield blosc.compress(data, **kwargs)
        # Type size, necessary for shuffle filters
        typesize = data.itemsize
Owner:
Good catch! It appears that blosc defaults to 8 whereas here (with the memoryview) the itemsize (based off the blosc examples) should be 1.
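For context, an illustrative check (assuming the data arrives as a raw bytes-like buffer rather than a typed array):

import numpy as np

arr = np.arange(4, dtype='float32')
memoryview(arr).itemsize            # 4: the memoryview keeps the float32 format
memoryview(arr.tobytes()).itemsize  # 1: a plain bytes buffer is untyped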

Author:
No, the itemsize should match the element size of the data ultimately stored. It is an optimization hint that lets the shuffle filters expose more regularity: when float32 elements are stored, the itemsize should be 4; when complex128 data are stored, it should be 16; and so on.

I'm not familiar with all the details of asdf, and using data.itemsize might be the wrong choice. What I'm really looking for here is the datatype (which asdf knows) of the ndarray.
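As an illustration, the hint would look roughly like this (a sketch only; in practice the value would come from whatever dtype information asdf can supply):

import blosc
import numpy as np

arr = np.linspace(0.0, 1.0, 1024, dtype='float32')
# typesize tells the (bit)shuffle filter how wide each element is, so corresponding
# bits/bytes of consecutive elements are grouped together before compression.
packed = blosc.compress(arr.tobytes(), typesize=arr.dtype.itemsize, shuffle=blosc.BITSHUFFLE)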

Owner:
ASDF blocks are a little weird because they might be shared across arrays (even if the arrays have different dtypes). Because of this every block is just bytes.

Here's one example:

import asdf, numpy as np
base_arr = np.arange(32, dtype='uint8')
view1 = base_arr.view(dtype='uint32')
view2 = base_arr.view(dtype='uint64')
asdf.AsdfFile({'v1': view1, 'v2': view2}).write_to('test.asdf')

The example produces a file with only 1 ASDF block but with 2 arrays each with a different dtype:

#ASDF 1.0.0
#ASDF_STANDARD 1.5.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: The ASDF Developers, homepage: 'http://github.com/asdf-format/asdf',
  name: asdf, version: 3.0.1.dev0}
history:
  extensions:
  - !core/extension_metadata-1.0.0
    extension_class: asdf.extension._manifest.ManifestExtension
    extension_uri: asdf://asdf-format.org/core/extensions/core-1.5.0
    software: !core/software-1.0.0 {name: asdf, version: 3.0.1.dev0}
v1: !core/ndarray-1.0.0
  source: 0
  datatype: uint32
  byteorder: little
  shape: [8]
v2: !core/ndarray-1.0.0
  source: 0
  datatype: uint64
  byteorder: little
  shape: [4]
...

See this issue for some discussion on asdf saving 'base' arrays: asdf-format/asdf#1669

The standard is not very detailed about this behavior but does contain a short statement about blocks being bytes and not having a datatype:

Blocks represent a contiguous chunk of binary data and nothing more. Information about how to interpret the block, such as the data type or array shape, is stored entirely in ndarray structures in the tree, as described in ndarray.

Is there much of a performance gain to be had for setting itemsize?


        # Compression level
        clevel = kwargs.get("level", 9)
Owner:
I'd prefer to keep this as clevel to match the blosc.compress signature. Since the default is already 9, I'd say we let **kwargs handle clevel.

Author:
I was assuming that there is a (small) standard set of arguments that asdf supports, and that people could use these without knowing details about the compressor.

Apparently that's not the case; it's level for zlib, compresslevel for bzip2, etc.

Owner:
It's a bit of the "wild west" when it comes to compression keyword arguments. I'd be open to suggestions if you have any ideas for a less "leaky" API. In general, compression has not received much attention in asdf (until now).
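One possible shape of that pass-through approach (a sketch only, not what this PR implements) is to forward the user's keyword arguments to blosc unchanged, so the documented contract is simply "kwargs is anything blosc.compress accepts":

    def compress(self, data, **kwargs):
        import blosc

        # kwargs (clevel, cname, shuffle, typesize, ...) are forwarded verbatim,
        # so no asdf-specific remapping of argument names has to be documented.
        yield blosc.compress(data, **kwargs)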


        # Shuffle filter
        shuffle_string = kwargs.get("filter", "bitshuffle")
Owner:
Similar to level vs clevel, I'd say we avoid adding a layer that remaps arguments. Without the remapping we can document this as "kwargs is anything supported by blosc.compress" (or something similar). If we map arguments we'll want to document the new mapping.

        if shuffle_string == "noshuffle":
            shuffle = blosc.NOSHUFFLE
        elif shuffle_string == "byteshuffle":
            shuffle = blosc.BYTESHUFFLE
        else:
            # Fallback and default
            shuffle = blosc.BITSHUFFLE

        # CODEC name
        cname = kwargs.get("codec", "blosclz")
Owner:
Same here.


        # Number of threads
        nthreads = kwargs.get("threads", 1)
        self._api.set_nthreads(nthreads)
Owner:
Should this be:

Suggested change:
-        self._api.set_nthreads(nthreads)
+        blosc.set_nthreads(nthreads)

If so, I'd say it's cleaner to leave this up to the user to call prior to saving or opening an asdf file, and to leave it out of the API for this compressor.

Author:
Putting it into the compressor allows people to use the compressor without themselves depending on the blosc package.

Owner:
That's a good point: it would mean a user could just pass 'nthreads' in the compression kwargs and not need to call blosc directly.

Does set_nthreads also affect decompression? Looking at how asdf handles decompression, there is no mechanism for providing keyword arguments even though the Compressor supports them (that is certainly a bug and a gap in the API).

Author:
I think decompression is also multi-threaded. I'm not sure.
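For what it's worth, python-blosc's set_nthreads appears to be a global library setting (it returns the previous value), so it should apply to decompression as well; a quick check might look like this (an assumption about the python-blosc API, not something verified in this PR):

import blosc

previous = blosc.set_nthreads(4)  # global setting; returns the prior thread count
# ... blosc.compress(...) and blosc.decompress(...) now both use up to 4 threads ...
blosc.set_nthreads(previous)      # restore the previous setting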


        yield blosc.compress(data, typesize=typesize, clevel=clevel, shuffle=shuffle, cname=cname, **kwargs)

    def decompress(self, data, out, **kwargs):
        import blosc

        # TODO: call `self._api.decompress_ptr` instead to avoid copying the output
Owner:
Great idea! I wasn't sure how to do this with the current structure (where data is an iterator that spits out chunks). Is there a way to iteratively decompress with blosc or does it need access to the full data to decompress (I'm not at all familiar with how it works)?

Author:
Internally blosc supports all the things you mention. The Python API is very limited and only supports two decompression calls – one that allocates (and we copy the result) and one that stores via a given pointer.

Since asdf seems to want to allocate the output buffer ahead of time, the second API would be nice to use. For this we need to convert asdf's pre-allocated buffer into a pointer. I don't know enough Python for this.

Alternatively, we could change asdf's decompress API to not take the out buffer as input, but rather let the decompress function allocate the output buffer itself.

How blosc's decompression routine processes its input is a separate question; the Python API doesn't support reading the input in chunks.

blosc compression and decompression are limited in that they can handle at most 2 GByte. blosc2 handles arbitrary sizes, but there isn't any Python package for it available yet. That Python package should probably support chunked compression and decompression; the blosc2 C library supports it.
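For reference, the pointer-based path looks roughly like this (a sketch following the python-blosc examples; the address of a pre-allocated numpy buffer comes from its __array_interface__):

import blosc
import numpy as np

src = np.arange(1_000_000, dtype='float64')
packed = blosc.compress_ptr(src.__array_interface__['data'][0],
                            src.size, src.dtype.itemsize,
                            clevel=9, shuffle=blosc.BITSHUFFLE)

out = np.empty_like(src)  # pre-allocated output buffer, as asdf provides
nbytes_written = blosc.decompress_ptr(packed, out.__array_interface__['data'][0])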

Owner:
Thanks! I like your idea of allowing decompress to allocate the buffer. That, combined with perhaps passing the file pointer (instead of chunks read out of the file), might allow blosc to use decompress_ptr and avoid copying both the input chunks and the output buffer. I think these would be only minor changes in asdf, but I'd have to look in more detail to be sure.

        out[:] = blosc.decompress(b"".join(data), **kwargs)
        return out.nbytes
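A minimal sketch of the "let decompress allocate" idea discussed above (this assumes a hypothetical change to asdf's Compressor.decompress signature; it is not what the current API supports):

    def decompress(self, data, out=None, **kwargs):
        import blosc

        decompressed = blosc.decompress(b"".join(data), **kwargs)
        if out is None:
            # Hypothetical variant: hand back the buffer blosc already allocated
            # instead of copying into a caller-provided one.
            return decompressed
        out[:] = decompressed
        return out.nbytes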