Discussion: compressed block table sizes #45

Open
anforowicz opened this issue Jan 2, 2025 · 7 comments
@anforowicz
Contributor

Hello!

I just wanted to share some data and thoughts that stem from my experiments with using different table sizes in CompressedBlock. This is very exploratory/preliminary and I hope that we can discuss various ideas and directions before we commit to any particular solution (a random internet blogpost tells me that this may result in a better outcome :-P).

One experiment I've done is using the default table sizes (4096/512) when the estimated image bytes (width x height x samples-per-pixel x bits-per-sample / 8) is above a certain threshold, but using smaller tables (512/128) otherwise (see the Chromium CL here; a sketch of the idea follows below). The results I've got (see section "Measurements 2024-12-30 - 2025-01-02" in my doc) have been positive, but the magnitude of the improvement has been disappointing to me.
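For concreteness, here is a minimal sketch of that heuristic. The function names and the 64 KiB threshold are hypothetical, for illustration only (the actual Chromium CL may differ):

/// Estimated decoded size: width x height x samples-per-pixel x bits-per-sample / 8.
fn estimated_image_bytes(width: u64, height: u64, samples_per_pixel: u64, bits_per_sample: u64) -> u64 {
    width * height * samples_per_pixel * bits_per_sample / 8
}

/// Returns (litlen table entries, dist table entries).
/// The 64 KiB threshold is illustrative, not a measured value.
fn pick_table_sizes(estimate: u64) -> (usize, usize) {
    if estimate >= 64 * 1024 {
        (4096, 512) // default sizes
    } else {
        (512, 128) // small tables
    }
}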

The results above were also surprisingly flat - I'd expected that small images would significantly benefit from small tables (and big images from big/default tables). One hypothesis that could explain this is that image size is not a good predictor for the size of zlib compressed blocks - e.g. maybe some big images use lots of relatively short compressed-zlib-blocks. So I tried another experiment to gather this kind of data on my old 2023 corpus of ~1650 PNG images from top 500 websites (see also the tool bits here and here) - the results can be found in a spreadsheet here. I think the following bits of data are interesting:

  • There is quite a wide range of compressed block sizes. Even when looking at the 100 biggest images, the block sizes range from ~3kB (at 10%-ile) to ~44kB (at 90%-ile).
  • Some images use a mix of compressed blocks with 1) fixed/default-symbol-encoding and 2) custom Huffman trees.
  • Some images use uncompressed blocks.

I also think that it is a bit icky that in my experiments the public API of fdeflate "leaks" the implementation detail of Huffman table sizes. One idea to avoid this is to:

  • Decouple CompressedBlock and fn read_compressed from Decompressor, so that Decompressor can internally choose to use small or big table sizes (with dynamic dispatch via something like Box<dyn CompressedBlockRead[er]>). I think that moving fn read_compressed to impl...CompressedBlock can be made easier by packaging/encapsulating bits of Decompressor (to make it easier to pass them as a &mut reference to fn read_compressed) - for example, buffer+nbits could become fields of a BitBuffer struct and queued_rle+queued_backref could become variants of an enum QueuedOutput.
  • Add fdeflate::Decompressor::set_output_size_estimate(estimate: usize) which can be used to decide the initial table sizes (see the sketch after this list). (Note that png::ZlibStream already has such an estimate available - it calls it max_total_output.)
  • Track the size of the last compressed blocks and switch the table sizes if that size is below/above a certain threshold.
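For illustration, the set_output_size_estimate idea could look roughly like this. The use_small_tables field and the threshold value are hypothetical, not fdeflate's actual internals:

pub struct Decompressor {
    // (existing decompressor state elided)
    use_small_tables: bool, // hypothetical field driving table selection
}

impl Decompressor {
    /// Hypothetical API sketch: lets the caller (e.g. png::ZlibStream, which
    /// already has max_total_output) hint the expected output size up front.
    pub fn set_output_size_estimate(&mut self, estimate: usize) {
        // Illustrative threshold, not a measured value.
        const SMALL_TABLE_THRESHOLD: usize = 64 * 1024;
        self.use_small_tables = estimate < SMALL_TABLE_THRESHOLD;
    }
}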
@anforowicz
Contributor Author

It turns out that looking closer at the previous spreadsheet (titled "png and zlib statistics") shows that 68% of PNG files have a single Zlib block (and therefore wouldn't benefit from changing table sizes dynamically/at-runtime). I took the 902 such files and ran the png crate's decoder benchmark 1) first with --save-baseline=my_baseline and the default 4096/512 table sizes and 2) then with --baseline=my_baseline and small 512/128 table sizes. And then I used an ad-hoc tool to combine the measured throughput gains with file/image/PNG metadata into a separate spreadsheet here (titled "small fdeflate table gains for single-zlib-block PNGs").

Some notes about and based on that spreadsheet:

  • Individual deltas
    • Small Huffman tables are harmful for only 16.7% of files and beneficial for the remaining 83.3% of files.
    • At 99%-ile small tables result in a throughput delta (for an individual PNG file) of 55%, at 90%-ile a gain of 36%, and at 50%-ile a gain of 9%.
    • OTOH, at 10%-ile there is a loss of 1.63%, and at 1%-ile the loss becomes 7%.
    • We can see that the throughput_delta column is most correlated with the first_idat_length column. I interpret this as saying that the frame_info_buffer_size and filesize columns are not as good a predictor for the size-and-nature of the compressed Zlib stream.
  • Sum of deltas weighted by frame_info_buffer_size
    • Weighting by frame_info_buffer_size makes sense, because the denominator of throughput (in the decoder benchmark) is based on info.buffer_size() (see here). The runtime_delta column has nonsensical units (since there is no baseline time captured), but I think it is still useful for further calculations.
    • Weighted runtime delta (when switching everything to small tables) would be -19.93%
    • Max runtime delta (when applying small tables to all except the few biggest files) would be -21.09%
    • The chart of individual throughput_deltas makes it seem that regressions are randomly scattered throughout the whole range of first_idat_len (i.e. the regressions do not seem to be mostly present in big/long Zlib streams).

One thing that I don't fully understand is why I am getting much smaller gains when trying the same things through Chromium-based benchmarks (rather than through png benchmarks and criterion). For example, the spreadsheet shows a cumulative gain of 10% for a first_idat_len threshold of 1000, but the Chromium measurements show only a gain of 5% for the same threshold. Some of that can probably be explained by additional processing done in Chromium (e.g. alpha premultiplication).

Next steps and open questions:

  • I should probably run Chromium measurements for first_idat_len set to 21967 (based on the spreadsheet), and also when always using small tables.
  • I wonder if using single-zlib-block inputs may result in a bias. Maybe compressors which emit multiple Zlib blocks use more efficient/lopsided Huffman trees for each block (which may benefit from bigger tables by hitting the double_literal case more often)?
  • Maybe this is enough to proceed with landing additional commits - up to making CompressedBlock generic (but without changing fdeflate's public API just yet, and without changing default table sizes yet).

@anforowicz
Contributor Author

Quick correction of my previous comment: I previously wrote that "weighting by frame_info_buffer_size makes sense". I take that back. I've rerun tests on ~900 single-zlib-block PNGs (from top 500 websites, gathered in 2023), but recording the decoding time (rather than the throughput delta). I've also recorded (see the ad-hoc tool here and here):

  • Size of the first IDAT block (to test this as a heuristic for predicting Zlib stream length)
  • Some Huffman encoding properties that I hoped would predict when a 4096-sized table is faster than a 512-sized one (a sketch of how these counts can be computed follows below):
    • doubles9 and doubles12: how many LITERAL_ENTRY entries in litlen_table cover two bytes instead of just one.
    • secondaries9 and secondaries12: how many litlen_table symbols don't fit and have to spill over into the secondary table.
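A rough sketch of how such counts could be computed. The Entry enum is a hypothetical stand-in for fdeflate's packed u32 table entries, not its real layout:

/// Hypothetical stand-in for fdeflate's packed litlen table entries.
enum Entry {
    Literal { bytes: u8 }, // LITERAL_ENTRY covering 1 or 2 output bytes
    Secondary,             // spills over into the secondary table
    Other,                 // length codes, end-of-block, etc.
}

/// Returns (doubles, secondaries) for a primary litlen table.
fn table_metrics(litlen_table: &[Entry]) -> (usize, usize) {
    let mut doubles = 0;     // entries emitting two bytes per lookup
    let mut secondaries = 0; // entries requiring a secondary-table lookup
    for entry in litlen_table {
        match entry {
            Entry::Literal { bytes: 2 } => doubles += 1,
            Entry::Secondary => secondaries += 1,
            _ => {}
        }
    }
    (doubles, secondaries)
}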

The spreadsheet with my results can be found here. Some notes:

  • In these measurements the 512-sized table is faster than the 4096-sized table (overall tab in the spreadsheet):
    • For PNG files with the first IDAT smaller than 1716 bytes, the overall sum of the runtime improves by 17% (1716 bytes is the 25%-ile across all [single-zlib-block-OR-multiple-blocks] PNGs from my top 500 corpus - see here). For IDATs smaller than 500 bytes the improvement is almost 27%.
    • For bigger files the impact is smaller, so the overall impact is an improvement of just 4%.
    • I also experimented with what would happen if we used a separate read_compressed for files whose symbol distribution has no "double" litlen entries (the potential improvement is minimal, but this is a hot loop). The results were disappointing - no "doubles" happens for just 3.6% of PNG files, and for such files the improvement was just 2%.
  • In the overall-first_idat_sizes tab in the spreadsheet I tried to see if the delta between 512 and 4096 tables changes for bigger files. I am not sure how to approach this in a rigorous way, but eye-balling the chart of the average + 95%-CI, the delta seems to randomly oscillate around "no difference".
  • correlation1 tab in the spreadsheet
    • This reinforces results from my previous comment that first_idat_size is strongly correlated with time-ratio-4096-vs-512.
    • OTOH, secondaries9 has an even bigger correlation coefficient.
    • Eye-balling both charts, it seems that the biggest difference is for small files, and then the results are random noise around "no difference". I tried to improve on "eye-balling" with the avg+95ci chart in the overall-first_idat_sizes and overall-secondaries9 tabs.
  • correlation2 tab tries to see if doubles9, secondaries9 or other heuristics may better predict time-ratio-4096-vs-512 if we exclude files smaller than 1000 bytes.
    • The correlation coefficients of all heuristics drop.
    • The chart for secondaries9 (the best correlation coefficient in correlation1 tab) is still not convincing to me.

My plan for the next step is to put together an fdeflate PR that transforms fn read_compressed into a free function:

fn read_compressed_impl<
    // Const-generic sizes let the compiler monomorphize separate
    // 4096-entry and 512-entry variants of the hot loop.
    const LITLEN_TABLE_SIZE: usize,
    const DIST_TABLE_SIZE: usize,
>(
    // Primary + secondary lookup tables for litlen and distance codes.
    litlen_table: &[u32; LITLEN_TABLE_SIZE],
    secondary_table: &[u16],
    dist_table: &[u32; DIST_TABLE_SIZE],
    dist_secondary_table: &[u16],
    // State previously held directly in Decompressor.
    bit_buffer: &mut BitBuffer,
    remaining_input: &mut &[u8],
    output: &mut [u8],
    mut output_index: usize,
    queued_output: &mut Option<QueuedOutput>,
) -> ... { ... }

Making this function const-generic will enable easy switching/experimenting between 4096-sized and 512-sized tables. Having this as a free function will enable using table sizes different from CompressedBlock's - e.g. even if CompressedBlock uses a 4096-sized table, we can use the 512-sized function to decode fixed-symbol blocks (possibly after expanding FIXED_DIST_TABLE to 128 entries to avoid binary size bloat - the idea is to restrict the const-generic parameters to 4096/512 and 512/128, rather than also having 512/32 and arguing why 32 is okay because there are no secondary tables for fixed symbols). A minimal dispatch sketch follows below.
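To illustrate the monomorphization, here is a self-contained toy sketch (not fdeflate's real API or entry format) showing two instantiations of a const-generic decode loop selected at runtime:

/// Toy stand-in for the real hot loop: one table lookup per input byte.
fn decode_with_table<const TABLE_SIZE: usize>(table: &[u32; TABLE_SIZE], input: &[u8]) -> u64 {
    let mask = TABLE_SIZE - 1; // TABLE_SIZE must be a power of two
    input.iter().map(|&b| table[b as usize & mask] as u64).sum()
}

static SMALL_TABLE: [u32; 512] = [0; 512];
static BIG_TABLE: [u32; 4096] = [0; 4096];

fn decode(input: &[u8], estimated_output_size: usize) -> u64 {
    // Both variants are monomorphized into the binary; a runtime
    // heuristic (hypothetical threshold here) picks one of them.
    if estimated_output_size < 16 * 1024 {
        decode_with_table(&SMALL_TABLE, input)
    } else {
        decode_with_table(&BIG_TABLE, input)
    }
}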


In the long-term we should try to collectively decide whether to:

  • Keep using 4096-sized table
  • Switch to always using 512-sized table
  • Somehow decide whether to use one or the other at runtime (but I currently can't see what heuristics to use for making such a runtime decision)

I think we don't necessarily need to make this decision in the short-term. Chromium can carry a small patch that changes the table sizes for Chromium experiments. And we could even have Chromium pick the table sizes through a field trial parameter (monomorphizing both options into the binary, but picking one based on the parameter's value).

@anforowicz
Contributor Author

I've realized that in the initial version I had incorrectly counted doubles (including codes 256 and higher). I've edited the spreadsheet to fix this. This doesn't change the overall results AFAICT.

@fintelia
Contributor

fintelia commented Jan 8, 2025

I'd suggest trying table sizes between 512 and 4096. It turns out that zlib-ng uses 10-bit tables (i.e. 1024 entries) and libdeflate uses 11-bit tables (2048 entries). In fact, Chromium switched from 9-bits to 10-bits a while back for its zlib decompressor.

Another trick that may be helpful is that you can actually get an estimate of the size of a deflate block based on the length of the longest symbol: Huffman coding attempts to assign an n-bit code to a symbol that occurs with frequency 1/2^n, and the end-of-block symbol occurs once per block, so if it is assigned an n-bit code then the block will likely contain roughly 2^n symbols (or more if the maximum length is used).
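For example, a minimal sketch of that estimate, assuming the code length assigned to the end-of-block symbol (symbol 256) has already been recovered from the block's Huffman header:

/// Rough symbol-count estimate for a deflate block: end-of-block occurs
/// exactly once, and an n-bit code implies a frequency of ~1/2^n, so the
/// block likely holds ~2^n symbols (more if n hit the 15-bit maximum).
fn estimate_block_symbols(eob_code_length: u8) -> u64 {
    1u64 << eob_code_length
}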

@anforowicz
Contributor Author

I'd suggest trying table sizes between 512 and 4096.

Ack. I may indeed want to re-measure more table sizes with a bigger corpus. But just as a quick clarification, the last spreadsheet did cover 512/128, 1024/512, 2048/512, and 4096/512 table sizes - the overall tab points out that 1024/512 and 2048/512 are slower than 512/128 (but it is probably worth remeasuring with a bigger corpus).

It turns out that zlib-ng uses 10-bit tables (i.e. 1024 entries) and libdeflate uses 11-bit tables (2048 entries). In fact, Chromium switched from 9-bits to 10-bits a while back for its zlib decompressor.

Thank you for these pointers. I didn't have these in focus.

Let me try skimming over https://dougallj.wordpress.com/2022/08/20/ and making quick notes about the proposed optimizations:

  • Bigger root tables: unclear (this is what this issue tries to investigate)
  • Limiting table sizes to the maximum code length used: unclear if applicable
    • fdeflate's single table entry can cover two symbols, which limits how much the table size can be lowered
    • and besides, fdeflate seems reasonably efficient (?):
      • fdeflate does no iterations when histogram[length] is 0 (see here).
      • fdeflate copies the table prefix (see here).
  • Adler 32 with UDOT: N/A
    • By default the png crate disables adler32 calculations (see here). I've double-checked that fdeflate's adler32-related code is not hit at runtime.
  • Reading bits: unsure
  • Reading extra bits: not sure (potential for 5.5% speed up)
    • I didn't fully understand the points here, or how they map to fdeflate.
  • Fast-Path for Literals: N/A?
    • fdeflate handles not just two, but three literal-in-primary-table entries (see here)
    • In theory for smaller table sizes we could handle even more? But expressing that clearly in generic code (and still compiling to efficient assembly) seems tricky...
  • Overlapping Mispredict Latency: N/A?
  • Table-building tweaks: maybe applicable (for ~1% gain)
    • fdeflate computes histogram in a simple loop here
  • Reading bits with zero refill latency (from the follow-up post of the same name): N/A?
    • fdeflate will use litlen_entry4 in the next loop iteration, so IIUC buffer refill already happens "in parallel". See here.

Another trick that may be helpful is that you can actually get an estimate of the size of a deflate block based on the length of the longest symbol: Huffman coding attempts to assign an n-bit code to a symbol that occurs with frequency 1/2^n and the end-of-block symbol occurs once per block, so if it is assigned a n-bit code then block will likely contain roughly 2^n symbols (or more if the maximum length is used).

Good point. I tried to weight the doubles/singles/colds counts by symbol frequencies, but unfortunately it didn't really result in higher correlation coefficients. Maybe this is because we don't have a good estimate of the number of codewords in the input (first_idat_size is the closest we have). Or maybe this is because I measured end-to-end PNG decoding (so the measurements include non-decompression-related noise). My results can be seen in a spreadsheet here (and I also have a link to my ad-hoc code to calculate frequency-weighted Zlib block properties). A sketch of the weighting idea follows below.
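For reference, a hypothetical sketch of one way such frequency weighting can work, assuming a symbol with an n-bit code occurs with frequency ~1/2^n (my actual ad-hoc code may differ):

/// Sum of implied frequencies (1/2^len) over the symbols selected by
/// the is_counted predicate (e.g. the "double" litlen entries).
fn frequency_weighted_count(code_lengths: &[u8], is_counted: impl Fn(usize) -> bool) -> f64 {
    code_lengths
        .iter()
        .enumerate()
        .filter(|&(symbol, &len)| len > 0 && is_counted(symbol))
        .map(|(_, &len)| 1.0 / f64::from(1u32 << len))
        .sum()
}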

@kornelski
Contributor

BTW, https://lib.rs/cargo-show-asm is like a local godbolt that can show any function. Much easier than reproducing fragments of code online.

@anforowicz
Contributor Author

FWIW I've tried gathering data for a bigger corpus:

  • All of the PNGs from the top-500-websites corpus that I gathered in 2023 (i.e. not just the PNGs with a single zlib block)
  • PNGs from the QOI corpus

The results are in a spreadsheet here:

  • 2048/512 seems better than 4096/512, even for big images (see the overall impact and impact on higher percentiles tabs)
  • 512/128 seems better than 4096/512 and 2048/512 for small images
    • Using 512/128 when first_idat_size is smaller than ~650 bytes (which covers roughly half of the smallest 25% of images) gives an improvement of ~4.96% compared to always using 4096/512, and ~2.84% compared to 2048/512 (see the impact on bottom quartile tab).
      • This impact is smaller than the 17% I reported earlier. I think this just reinforces that the results are highly sensitive to the choice of testing corpus.
    • The impact is dwarfed by the decoding times of bigger PNGs (a 0.01% improvement seen in the overall impact tab)
  • Correlation coefficients are significantly smaller than in the previous spreadsheet.

Given the mixed results, I am not sure if it's worth pursuing this direction further. Still, I think that we may want to enable continued experimentation in the future by merging #49.
