feat: support fpzip and kempressed codecs #391
size_t get_nf() { return nf; }

void decode_headers(unsigned char *data) {
  FPZ* fpz = fpzip_read_from_buffer(static_cast<void*>(data));
This seems to be a strange/broken API provided by fpzip, since it accepts data
but does not know the number of bytes available.
Please investigate what the correct bounds check is and add it.
I checked the header file and the missing num_bytes appears to be correct. https://github.com/LLNL/fpzip/blob/develop/include/fpzip.h#L192-L196
Reading the code, the minimum byte stream (for the header) seems to be at least 6 x uint32_t.
Previously, though, I found that a minimum of 28 bytes is required, although I am not sure why the last 4 are needed. I'll add that as a check.
Actually, I see that the decoding itself, even after the header has been read, also does no bounds checking. I don't think there is any safe way to use this API as is. The upstream repository needs to be fixed to include proper bounds checking.
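For reference, the header-length guard discussed above could look roughly like the sketch below. The 28-byte minimum is the empirical figure from this thread, not a documented fpzip constant, and the constant and helper names are hypothetical. As noted, a guard like this only protects the header read; it does not make the subsequent decoding safe.

```cpp
#include <cstddef>

// Assumption: 28 bytes is the minimum header size observed in this thread.
// It is not taken from the fpzip documentation.
constexpr std::size_t kMinFpzipHeaderBytes = 28;

// Hypothetical helper: returns true only if the buffer is non-null and
// long enough to contain a header of the assumed minimum size.
// The caller would check this before calling fpzip_read_from_buffer.
bool has_minimum_header(const unsigned char* data, std::size_t num_bytes) {
  return data != nullptr && num_bytes >= kMinFpzipHeaderBytes;
}
```

decode_headers would need to take the buffer length as an extra argument for a check like this to be possible.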
Okay, I'm addressing this with the upstream maintainer. Will revisit this PR probably in a few days or weeks.
Now that there are a number of wasm modules in Neuroglancer, and given that esbuild still doesn't support code splitting for non-es module workers, it would be good to change the bundling to load the wasm files separately rather than embedding them as
I'll investigate ESBuild some more, but what would you recommend as a good place to get started?
It may be as simple as changing neuroglancer/config/esbuild.js Line 152 in 5f62034
Interesting, I just tried this (sorry it took so long) and it appeared to split the
The reason they aren't being lazy loaded is that we have e.g.
Ah, I thought that might be the case. If I have a moment, I can try seeing how to change it to be on-demand.
Hi Jeremy,
Our automated segmentation pipeline operates on a different principle than Google BRAIN's FFN, which I understand produces a direct segmentation from the output of the network. We produce a voxel affinity map from a boundary detector, which is later post-processed into a segmentation through an "agglomeration" step.
These data are float32 and 3 channel, so 12x larger than the original image uncompressed. For petascale inference, this became expensive to store, so in 2018/2019 we investigated alternative compression algorithms. We found fpzip, a lossless compression algorithm for floating point data. Nico Kemnitz did some experimentation that exploited the fact that our affinities are in the range 0 to 1: the resulting "kempressed" encoding adds 2 to the data and switches the Z and channel axes before fpzip compression to get higher compression. Overall, a 2x to 3x improvement in compression is achieved, making the large scale storage of affinities temporarily viable instead of impossible. A table can be seen here: https://github.com/seung-lab/cloud-volume/wiki/Advanced-Topic:-fpzip-and-kempressed-Encodings
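To make the transform concrete, here is a minimal sketch of the pre-compression step (add 2, swap the Z and channel axes) and its inverse. It is an illustration only, not the code in this PR: it assumes a dense x, y, z, channel array with the x axis varying fastest in memory, and the function names are hypothetical.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: linear index into a 4D array whose first axis
// varies fastest in memory (assumed layout). ni, nj, nk are the extents
// of the first three axes.
inline std::size_t idx4(std::size_t i, std::size_t j, std::size_t k,
                        std::size_t l, std::size_t ni, std::size_t nj,
                        std::size_t nk) {
  return i + ni * (j + nj * (k + nk * l));
}

// Forward transform: input is indexed (x, y, z, c); output is indexed
// (x, y, c, z) with 2 added to every value.
std::vector<float> kempress_transform(const std::vector<float>& in,
                                      std::size_t nx, std::size_t ny,
                                      std::size_t nz, std::size_t nc) {
  if (in.size() != nx * ny * nz * nc) {
    throw std::invalid_argument("input size does not match dimensions");
  }
  std::vector<float> out(in.size());
  for (std::size_t c = 0; c < nc; ++c)
    for (std::size_t z = 0; z < nz; ++z)
      for (std::size_t y = 0; y < ny; ++y)
        for (std::size_t x = 0; x < nx; ++x)
          out[idx4(x, y, c, z, nx, ny, nc)] =
              in[idx4(x, y, z, c, nx, ny, nz)] + 2.0f;
  return out;
}

// Inverse transform: input is indexed (x, y, c, z); output is indexed
// (x, y, z, c) with 2 subtracted from every value.
std::vector<float> kempress_inverse(const std::vector<float>& in,
                                    std::size_t nx, std::size_t ny,
                                    std::size_t nz, std::size_t nc) {
  if (in.size() != nx * ny * nz * nc) {
    throw std::invalid_argument("input size does not match dimensions");
  }
  std::vector<float> out(in.size());
  for (std::size_t c = 0; c < nc; ++c)
    for (std::size_t z = 0; z < nz; ++z)
      for (std::size_t y = 0; y < ny; ++y)
        for (std::size_t x = 0; x < nx; ++x)
          out[idx4(x, y, z, c, nx, ny, nz)] =
              in[idx4(x, y, c, z, nx, ny, nc)] - 2.0f;
  return out;
}
```

After the shift, float32 values in the range 2 to 3 all share the same sign and exponent bits, which may be part of why fpzip compresses them better. Note that subtracting 2 on decode is exact only for values representable after the shift; arbitrary float32 inputs in 0 to 1 can lose low-order mantissa bits.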
Unfortunately, we haven't been able to visualize this data easily without decompressing it or quantizing it, which leads to underutilization of this codec.
This PR adds "fpzip" and "kempressed" encoding support to Neuroglancer using the fpzip-1.3.0 library (https://github.com/LLNL/fpzip). CloudVolume supports fpzip and kempression via Python bindings (https://github.com/seung-lab/fpzip).
I experimented with different em++ settings to optimize size. -Oz produces a reasonable binary of about 37 KB. -O3 is closer to 1 MB but may be faster.
The fpzip library is BSD licensed since 1.3.0.
Thanks for your consideration Jeremy!