Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recent changes in experimental branch #71

Closed
phoboslab opened this issue Dec 6, 2021 · 62 comments
Closed

Recent changes in experimental branch #71

phoboslab opened this issue Dec 6, 2021 · 62 comments

Comments

@phoboslab
Copy link
Owner

phoboslab commented Dec 6, 2021

With all that we learned through the analysis and ideas of a lot of people here, I refined QOI quite a bit. More than I thought I would.

The current state is in the experimental branch.

First of all, benchmark results for the new test suite using
qoibench 1 images/ --nopng --onlytotals

## Total for images/textures_photo/
        decode ms   encode ms   decode mpps   encode mpps   size kb    rate
master:       8.2        11.6        127.43         90.52      2522   61.6%
experi:       5.9         8.2        178.14        127.56      1981   48.4%

## Total for images/textures_pk01/
master:       0.7         1.0        186.23        126.11       184   36.4%
experi:       0.6         0.9        214.67        145.87       178   35.2%

## Total for images/screenshot_game/
master:       2.7         3.9        231.42        162.40       534   21.6%
experi:       2.6         3.4        245.06        187.25       519   21.0%

## Total for images/textures_pk/
master:       0.3         0.5        138.31         93.64        83   48.1%
experi:       0.3         0.4        159.63        110.87        75   43.5%

## Total for images/textures_pk02/
master:       2.0         2.8        155.27        110.31       504   42.5%
experi:       1.7         2.3        182.73        133.00       479   40.4%

## Total for images/icon_64/
master:       0.0         0.0        251.06        163.38         4   28.3%
experi:       0.0         0.0        343.60        266.60         5   31.3%

## Total for images/icon_512/
master:       0.6         0.9        474.50        308.36        80    7.8%
experi:       0.6         0.7        474.62        378.76       102   10.1%

## Total for images/photo_kodak/
master:       2.9         4.2        137.76         92.77       771   50.2%
experi:       2.4         3.5        166.17        111.66       671   43.7%

## Total for images/textures_plants/
master:       3.8         6.2        281.64        170.37       951   22.9%
experi:       3.3         5.0        324.00        211.07       922   22.2%

## Total for images/screenshot_web/
master:      18.1        28.2        449.27        287.79      2775    8.7%
experi:      17.5        23.2        464.81        350.15      2649    8.3%

## Total for images/pngimg/
master:       6.5        10.0        279.91        180.57      1415   20.0%
experi:       5.9         8.6        307.44        210.93      1445   20.5%

## Total for images/photo_tecnick/
master:      10.1        15.2        142.74         95.00      2710   48.2%
experi:       8.8        13.6        163.36        105.69      2527   44.9%

## Total for images/photo_wikipedia/
master:       7.8        11.7        138.75         92.50      2260   53.4%
experi:       6.7        10.4        161.91        104.37      2102   49.6%

# Grand total for images/
master:       2.1         3.1        220.85        148.50       485   26.8%
experi:       1.9         2.7        245.67        173.24       465   25.7%

As you can see throughput improved a lot, as did the compression ratio for all files without an alpha channel (icon_*/ and pngimg/ suffered a bit, but the overall compression ratio for these files is already quite high. textures_plants/ still saw improvements). For photos or photo-like images QOI now often beats libpng!

What changed? After I switched the tags for QOI_RUN (previously 2-bit tag) and QOI_GDIFF_16 (previously 4-bit tag) I noticed that QOI_GDIFF covered almost all(!) cases that were previously encoded by QOI_DIFF_16/24. So... why not remove them?

#define QOI_OP_INDEX  0x00 // 00xxxxxx
#define QOI_OP_DIFF   0x40 // 01xxxxxx (aka QOI_DIFF_8)
#define QOI_OP_LUMA   0x80 // 10xxxxxx (aka QOI_GDIFF_16)
#define QOI_OP_RUN    0xc0 // 11xxxxxx
#define QOI_OP_RGB    0xfe // 11111110 (aka QOI_COLOR with RGB)
#define QOI_OP_RGBA   0xff // 11111111 (aka QOI_COLOR with RGBA)

(see the experimental file format documentation for the details)

That is, most tags are now 2-bit, while the run-length is limited to 62 and thus leaves some room for the two 8-bit QOI_OP_RGB and QOI_OP_RGBA tags. So QOI would be even simpler than before and (probably?) gain a lot more possibilities for performance improvements:

  • there is no multi-byte run
  • there are no tags with a variable byte-length
  • there are no more values that cross byte boundaries

Yes, it means that a change in the alpha channel will always be encoded as a 5-byte QOI_OP_RGBA, but using the current test suit of images, this seems to be totally fine. The alpha channel is mostly either 255 or 0. The famous dice.png and FLIF's fish.png seem to be awfully "artificial" uses of PNG. (For comparison, in the experimental branch with the original tag-layout and QOI_DIFF_16/24 still present, the overal compression ratio was at 24.6% - but the win in simplicity and performance is imho worth this 1%).

The hash function changed to the following:

#define QOI_COLOR_HASH(C) (C.rgba.r * 3 + C.rgba.g * 5 + C.rgba.b * 7)

This is seriously the best performing hash function I could find and I tried quite a few. This also ignores the alpha channel, making it even more of a second-class citizen.

You may not like it (and I'm truly sorry for all the work that would need to be done in existent implementations), but I strongly believe that this is The Right Thing To Do™.

Thoughts?

@Wulf0x67E7
Copy link

The only nitpick I can think of would be the GRB- instead of RGB-bit-order of QOI_OP_LUMA. Is there any technical reason to use the green channel as the guide instead of the red one?

@phoboslab
Copy link
Owner Author

phoboslab commented Dec 6, 2021

The idea was that changes in lightness (luma) would be most represented in the green channel, since that is the color that the human eye is most sensitive to. I just tried it with a 6-bit red channel instead and got consistently worse results — but not by much. Overall compression rate would be at 25.9%.

Maybe it's worth changing it anyway, for clarity's sake.

@Wulf0x67E7
Copy link

Wulf0x67E7 commented Dec 6, 2021

Huh, I didn't think that it would have any impact at all on a lossless format and that those human-perception-specific tricks only really matter when compressing lossily.

I'd personally change it, but it really is incredibly nitpicky and doesn't matter too much either way. ¯\_(ツ)_/¯

@oscardssmith
Copy link

I think the reason it matters is probably that it is fairly common for images to get subtly lighter or darker while maintaining the same hue, and in those cases, green on average is what changes the most.

@rmn20
Copy link

rmn20 commented Dec 6, 2021

Wouldn't removing DIFF_24 greatly degrade the compression ratio of semi-transparent textures? (like hair textures, or maybe water textures) I think these changes make qoi useless for semi-transparent images. Maybe in that case its better to stop encoding alpha levels at all and only write rgb + 0/1 alpha?

@phoboslab
Copy link
Owner Author

phoboslab commented Dec 6, 2021

Yes, it performs worse for these, but I am not convinced that this is an issue.

From my game dev experience even textures for water, clouds, dust etc. use fairly distinct alpha values. Many of these don't even have an alpha channel at all and are just blended with GL_ONE GL_ONE (fire, flares, etc.) or GL_ZERO GL_SRC_COLOR (water, smoke, etc.).

The two worst performing examples I found are actually the ones most prominently featured by other image formats: dice.png and fish.png. As I said before, I believe these are fairly "artificial" examples.

## images/misc/fish.png size: 1969x1307
        decode ms   encode ms   decode mpps   encode mpps   size kb    rate
libpng:      28.1       207.8         91.44         12.39       520    5.2%
stbi:        18.5       197.8        139.20         13.01       804    8.0%
qoi_master:   5.9         9.2        438.07        280.36       862    8.6%
qoi_experi:   5.6         6.9        457.53        372.52      1148   11.4%

## images/misc/dice.png size: 800x600
        decode ms   encode ms   decode mpps   encode mpps   size kb    rate
libpng:       6.7        51.6         71.92          9.30       237   12.7%
stbi:         5.2        45.8         92.34         10.48       341   18.2%
qoi_master:   1.3         1.8        362.65        263.83       355   19.0%
qoi_experi:   1.2         1.3        414.23        356.69       508   27.1%

Compression rate wasn't really great before (compared to libpng/stbi), but it's still very good compared to uncompressed RGBA.

@oscardssmith
Copy link

One last change that might be worth trying is changing QOI_INDEX to actually use a LRU cache. This will probably be a little slower, but it would remove the effect of hash collisions on the file size.

@vsonnier
Copy link

vsonnier commented Dec 7, 2021

Sorry @phoboslab I think there is a mistake in computiing the color differences:

char vr = px.rgba.r - px_prev.rgba.r;
char vg = px.rgba.g - px_prev.rgba.g;
char vb = px.rgba.b - px_prev.rgba.b;

char vg_r = vr - vg;
char vg_b = vb - vg;

You need int to hold [-255; 255] differences in vxxx, else someting funky will happen when differences are beyond +-127. I suppose the computation itself is fine by integer promotion ? Note that the master version indeed holds the differences in ints.

Edit :
Got it the wrap around arithmetic, although the signed char troubled me a loong time: I finally thought of it in terms of

type Wrap_Around_Difference_Type is mod 256; 

in Ada, then everything was clear.

@chocolate42
Copy link

Here's a tweak to the index, tested against the old data set because ISP's suck.
qoi-exluma: Experimental branch baseline
qoi-luma61: The index table is reduced to 61 values (prime) and the hash function is simply px.v%61. No other changes. Possible union endian issues that can be resolved with a compiler check (?)
qoi-luma61.run64: The RGB and RGBA tags have been moved to the end of the index chunk so rle has a full 64 values, which seems to help decode speed a surprising amount (something something power of two?). There are still two unused 8 bit codes, but with a 6 bit RUN_8 there's no need for RUN_16 and I didn't want to theorycraft further here.

## Benchmarking ../qoitests/images/tango512/*.png -- 1 runs
## Totals (AVG) size: 0x0
               decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:              3.1        20.5         84.87         12.76        51
stbi:                2.1        20.6        124.21         12.74        69
qoi-exluma:          0.7         0.8        353.49        336.18       102
qoi-luma61:          0.7         0.8        371.70        319.96        86
qoi-luma61.run64:    0.6         0.8        457.27        331.28        86

## Benchmarking ../qoitests/images/kodak/*.png -- 1 runs
## Totals (AVG) size: 0x0
               decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:              9.6       179.6         41.10          2.19       717
stbi:               10.1        98.9         39.05          3.98       979
qoi-exluma:          3.1         4.2        127.20         94.29       671
qoi-luma61:          2.9         4.5        135.15         88.22       675
qoi-luma61.run64:    2.6         4.4        148.52         89.88       675

## Benchmarking ../qoitests/images/misc/*.png -- 1 runs
## Totals (AVG) size: 0x0
               decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:             12.0       102.1         74.30          8.71       283
stbi:               10.4        97.2         85.39          9.14       415
qoi-exluma:          2.8         3.3        316.36        270.04       513
qoi-luma61:          2.8         3.3        322.46        272.58       504
qoi-luma61.run64:    2.4         3.2        364.61        280.07       503

## Benchmarking ../qoitests/images/textures/*.png -- 1 runs
## Totals (AVG) size: 0x0
               decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:              3.0        41.7         42.96          3.12       163
stbi:                2.7        23.3         48.05          5.57       232
qoi-exluma:          0.8         1.1        163.08        120.30       178
qoi-luma61:          0.7         1.1        181.62        115.97       179
qoi-luma61.run64:    0.6         1.1        201.48        119.41       179

## Benchmarking ../qoitests/images/screenshots/*.png -- 1 runs
## Totals (AVG) size: 0x0
               decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:             59.4       620.1        138.56         13.27      2219
stbi:               52.0       755.2        158.20         10.90      2821
qoi-exluma:         28.5        26.1        288.85        315.97      2476
qoi-luma61:         27.2        27.3        302.62        301.02      2478
qoi-luma61.run64:   23.9        27.1        344.47        303.73      2475

## Benchmarking ../qoitests/images/wallpaper/*.png -- 1 runs
## Totals (AVG) size: 0x0
               decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:            191.4      2922.5         48.97          3.21      9224
stbi:              220.4      1814.6         42.53          5.16     13299
qoi-exluma:         76.9        85.8        121.91        109.21      9923
qoi-luma61:         75.4        94.0        124.24         99.72      9965
qoi-luma61.run64:   64.2        85.0        146.09        110.21      9964

qoi.luma61.run64.h.txt

@phoboslab
Copy link
Owner Author

phoboslab commented Dec 7, 2021

Very cool. Reducing the index to 61 is rather clever! And I'd really like to know why a 64 run-length would help that much with performance.

@vsonnier this should be fine, as long as the decoder wraps these values around, too.

Edit: I can't really reproduce these throughput numbers. qoi-luma61.run64 performs slightly worse on my machine.

# Grand total for images
                decode ms   encode ms   decode mpps   encode mpps   size kb    rate
qoi-exluma:           1.9         2.7        244.79        172.72       465   25.7%
qoi-luma61.run64:     1.9         2.7        239.50        169.05       463   25.6%

Compression is better though. Images with alpha suffer quite a bit when the hash doesn't include alpha value. Might be worse to change it to:

#define QOI_COLOR_HASH(C) (C.rgba.r * 3 + C.rgba.g * 5 + C.rgba.b * 7 + C.rgba.a * 11)

(or just C.v % 61 as you proposed)

@nigeltao
Copy link

nigeltao commented Dec 7, 2021

Some numbers vs qoi-demo10. qoi-experi is commit 28954f7. qoi-l61r64 is the luma61.run64 mentioned above.

For compressed size, qoi-demo10 wins on icon_512, icon_64 and pngimg, which all involve alpha. qoi-experi wins everywhere else, including on textures_plants (which also involves alpha). I know simplicity of format is a goal, but perhaps consider having two slightly different sets of opcodes (or opcode bit assignments) depending on whether channels == 3 or channels == 4. Edit: now that I've added the luma61.run64 numbers, the qoi-demo10 vs qoi-l61r64 gap is noticably smaller on icon_512, icon_64 and pngimg.

One more quick request is to raise QOI_PADDING from 4 to 7, so that the decoder can always do an 8 byte (64 bit) read.

        decode ms   encode ms   decode mpps   encode mpps   size kb    rate

## Total for images/icon_512
libpng:       5.0        33.9         52.74          7.74        51    5.0%
stbi:         3.9        32.8         66.47          8.00        69    6.8%
qoi-experi:   1.5         1.2        180.47        224.24       102   10.1%
qoi-demo10:   0.8         1.2        309.23        218.27        76    7.5%
qoi-l61r64:   1.0         1.4        254.71        190.54        86    8.4%

## Total for images/icon_64
libpng:       0.2         0.9         23.26          4.35         3   23.6%
stbi:         0.1         0.9         28.05          4.45         4   27.9%
qoi-experi:   0.0         0.0        103.12        106.03         5   31.3%
qoi-demo10:   0.0         0.0        148.43        110.04         4   27.6%
qoi-l61r64:   0.0         0.0        141.35        106.50         4   28.8%

## Total for images/photo_kodak
libpng:      15.2       295.7         25.88          1.33       717   46.7%
stbi:        16.9       157.8         23.33          2.49       979   63.8%
qoi-experi:   5.7         6.7         69.14         58.89       671   43.7%
qoi-demo10:   4.1         7.1         95.72         55.76       772   50.3%
qoi-l61r64:   4.5         6.6         87.26         59.26       675   44.0%

## Total for images/photo_tecnick
libpng:      44.9      1143.8         32.08          1.26      2414   42.9%
stbi:        57.9       608.3         24.88          2.37      3533   62.8%
qoi-experi:  20.9        26.3         68.99         54.67      2527   44.9%
qoi-demo10:  13.0        25.1        111.04         57.44      2737   48.7%
qoi-l61r64:  17.2        25.3         83.93         57.02      2529   45.0%

## Total for images/photo_wikipedia
libpng:      42.9       865.8         25.28          1.25      2046   48.3%
stbi:        54.6       456.6         19.87          2.38      2893   68.3%
qoi-experi:  15.7        19.9         69.17         54.39      2102   49.6%
qoi-demo10:   9.5        19.2        113.99         56.49      2289   54.0%
qoi-l61r64:  12.8        19.3         84.51         56.30      2103   49.7%

## Total for images/pngimg
libpng:      56.4       533.3         32.08          3.39      1201   17.0%
stbi:        59.6       399.9         30.35          4.52      1751   24.8%
qoi-experi:  15.5        15.7        116.99        114.90      1445   20.5%
qoi-demo10:   9.0        15.7        201.83        114.98      1429   20.2%
qoi-l61r64:  11.7        16.4        154.61        110.63      1437   20.3%

## Total for images/screenshot_game
libpng:      18.6       216.7         34.08          2.92       448   18.1%
stbi:        20.9       150.9         30.24          4.19       634   25.7%
qoi-experi:   6.3         6.3        101.13        100.93       519   21.0%
qoi-demo10:   4.1         6.1        156.06        104.53       535   21.7%
qoi-l61r64:   4.9         6.5        130.18         97.85       517   20.9%

## Total for images/screenshot_web
libpng:      92.0      1060.7         88.33          7.66      2402    7.6%
stbi:        78.8      1210.3        103.12          6.71      3076    9.7%
qoi-experi:  50.5        38.8        160.98        209.22      2649    8.3%
qoi-demo10:  28.3        40.4        287.55        201.06      2680    8.4%
qoi-l61r64:  37.7        44.7        215.30        181.93      2649    8.3%

## Total for images/textures_photo
libpng:      39.6       725.0         26.45          1.45      1977   48.3%
stbi:        46.6       370.7         22.49          2.83      2554   62.4%
qoi-experi:  13.9        15.8         75.30         66.35      1981   48.4%
qoi-demo10:  10.1        18.3        104.22         57.40      2506   61.2%
qoi-l61r64:  11.2        15.1         93.39         69.39      1990   48.6%

## Total for images/textures_pk
libpng:       0.9        24.6         47.24          1.81        89   51.5%
stbi:         0.8        17.0         55.03          2.62       121   70.0%
qoi-experi:   0.7         0.8         61.19         58.01        75   43.5%
qoi-demo10:   0.6         0.7         77.52         60.26        78   45.1%
qoi-l61r64:   0.6         0.8         79.75         58.52        75   43.3%

## Total for images/textures_pk01
libpng:       5.3        66.5         24.65          1.95       163   32.3%
stbi:         5.1        37.1         25.61          3.50       232   45.8%
qoi-experi:   1.6         1.7         81.33         77.75       178   35.2%
qoi-demo10:   1.0         1.6        136.54         81.54       180   35.6%
qoi-l61r64:   1.1         1.6        116.92         78.89       179   35.3%

## Total for images/textures_pk02
libpng:      11.7       205.0         25.97          1.48       427   36.1%
stbi:        12.1       100.1         25.07          3.03       623   52.5%
qoi-experi:   4.2         4.4         71.95         69.25       479   40.4%
qoi-demo10:   2.8         4.3        108.01         70.79       492   41.5%
qoi-l61r64:   3.1         4.3         96.66         70.61       481   40.5%

## Total for images/textures_plants
libpng:      31.9       340.6         33.35          3.12       857   20.6%
stbi:        30.7       241.5         34.66          4.41      1191   28.7%
qoi-experi:   8.0         9.2        132.21        116.02       922   22.2%
qoi-demo10:   4.3         9.5        247.12        111.57       957   23.0%
qoi-l61r64:   6.0         9.4        177.48        113.29       922   22.2%

# Grand total for images
libpng:      13.5       187.9         34.47          2.47       423   23.3%
stbi:        14.7       121.4         31.53          3.82       601   33.2%
qoi-experi:   4.7         5.0         98.37         92.79       465   25.7%
qoi-demo10:   3.0         4.9        156.82         94.45       484   26.7%
qoi-l61r64:   3.7         5.1        127.03         91.58       463   25.6%

        decode ms   encode ms   decode mpps   encode mpps   size kb    rate

This was on a mid-range x86_64 laptop (2016, Skylake):

$ cat /proc/cpuinfo | grep model.name | uniq
model name	: Intel(R) Core(TM) m3-6Y30 CPU @ 0.90GHz

@phoboslab
Copy link
Owner Author

phoboslab commented Dec 8, 2021

Regarding a 7-Byte padding: wouldn't it make sense to put and "end-marker" there, instead of just 0x00 bytes?

If we then use 0xfd = QOI_RGB and 0xfe = QOI_RGBA we could pad the stream with 0xff bytes. Since QOI dosn't allow for more than 4 consecutive 0xff bytes anywhere in the stream, this could be a nice replacement for the now-missing size header, when streaming images.

Edit: thinking about it some more, a 7-byte 0x00 shouldn't occur in a stream either, since repeated QOI_INDEX chunks should be encoded as QOI_RUN instaed.

@phoboslab
Copy link
Owner Author

Further changes in experimental:

  • QOI_COLOR_HASH now includes the alpha channel: r * 3 + g * 5 + b * 7 + a * 11 - this helps quite a bit with the icons and pngimg from the test suite.
  • colorspace in the header is now an enum, where 0 = sRGB with linear alpha, 1 = all channels linear. It seems any other channel configuration does not make much sense (see Add initial support for the QOI image format Tom94/tev#148). The 0 value for sRGB provides a sensible default.
  • the data is now padded with 8 0x00 bytes. These also indicate "end of image" to alleviate the absent size header.

@Lokathor
Copy link

Lokathor commented Dec 8, 2021

Just adding a size field to the header would be far mor efficient than making people who want to skip the image iterate every byte looking for a run of four 0xFF or eight 0x00.

@phoboslab
Copy link
Owner Author

We've had this discussion (see #28 (comment)) and I believe the benefit of allowing a "streaming" encoder that does not need seeking outweigh the drawbacks of not having the size in the header.

@oscardssmith
Copy link

What about allowing encoders to put 0 as the size if they don't want it? (I.E. if a decoder reads a size of zero, it has to scan, but if it sees a nonzero size it can skip?)

@Lokathor
Copy link

Lokathor commented Dec 8, 2021

Yeah if the encoder absolutely needs to never go back it can just put a zero, but that's probably very rare. More likely these are files being put on disk, and you can just adjust an earlier part of the file once the compression is completed.

@ikskuh
Copy link

ikskuh commented Dec 8, 2021

What about allowing encoders to put 0 as the size if they don't want it? (I.E. if a decoder reads a size of zero, it has to scan, but if it sees a nonzero size it can skip?)

That would completly defeat the idea of a streaming reader as you then don't know the file size beforehand, but if you want to skip images in a stream, you would need to have a defined header. So you can also just leave it out.

Yeah if the encoder absolutely needs to never go back it can just put a zero, but that's probably very rare. More likely these are files being put on disk, and you can just adjust an earlier part of the file once the compression is completed.

It would make it harder for, let's say, embedded devices that have low memory anyways to require the device to store the image both encoded and unencoded in RAM just for the sake of a stream field. Parsing/seeking the full image stream also doesn't take much time on most machines, especially if no writes are done.

A stronger use case for streaming encoders is stuff like additional compression/encryption where you convert some bytes, push them into the compressor/encrypter and pull some bytes from them. All compression/encryption APIs i know provide this pattern and it would be cool to just not have to allocate additional memory for the compressed image. Especially as it might require either overallocation or reallocation which are both costly. Imho, status quo is already good enough also for special use cases. The tagging of four consecutive 0xFF is imho a good middle ground, especially as a conforming non-streaming parser doesn't have to handle that case at all (it's always a invalid)

@phoboslab: The changes for color hash and colorspace are good 👍

@Lokathor
Copy link

Lokathor commented Dec 8, 2021

That would completly defeat the idea of a streaming reader as you then don't know the file size beforehand, but if you want to skip images in a stream, you would need to have a defined header.

If the size is zero, you have to walk the bytes to find the end. If the size is non-zero you can skip that many bytes immediately.

But if there's no size at all then you also have to walk the bytes to find the end, so things are just as bad as if someone had put a size of zero.

It would make it harder for, let's say, embedded devices that have low memory anyways to require the device to store the image both encoded and unencoded in RAM just for the sake of a stream field.

Why would you store the entire encoded image in ram just to have the total encoded size? A counter of how many encoded bytes have been written would be plenty. Then you write that value down in the file after the encoding is done (or, again, if you can't go back for some reason you'd know that ahead of time and mark a size of 0 in your output).

@ikskuh
Copy link

ikskuh commented Dec 8, 2021

Some reasoning on why i don't think the size field is a good idea:

If the size is zero, you have to walk the bytes to find the end. If the size is non-zero you can skip that many bytes immediately.

That very much depends. I can already see people embedded different QOI images in QOI images by specifying invalid size fields that won't skip the full image. If you want to write safe software, you have to run the decoder anyways to see if the size field is actually valid and can be trusted as a measure to seek forward. Otherwise people might craft malicious streams that will crash your decoder. So removing the field prevents an attack vector for maliciously crafted data.

Why would you store the entire encoded image in ram just to have the total encoded size?

I can't because my (fictional, but realistic) device has only 32k RAM. This means i can neither store the full source image (but have to stream it from somewhere) nor can i store the full output image (but have to stream it somewhere). So in that use case, it would require me to always write zero to that field.

But to implement something like seeking back to disk, you have to create two interfaces to your application: One that allows streaming encoding, and one that allows non-streaming encoding. The latter one would require a bit different API for non-generic purpose implementations and would make the general library/application harder without much benefit in general (i won't argue that there isn't a benefit, i just don't value it that much)

The only use case where such a field is from use is: When you want to skip over the QOI data and the source data is seekable (aka. disk or memory buffer). For all other use cases we either have to decode the data anyways (so no benefit from the size field) or we don't have seeking capabilities (so the size field isn't worth much either, as we have to read the data anyways and skip-decoding QOI is way faster than memory or even I/O devices).
And in case of disk i/o, we already read way more data from the disk than we actually need (read_ahead_kb is 128kB for my primary disk), this means we can already decode 128kB of data without I/O penalty. If we seek, we will pay the additional cost for a syscall which might also take longer than just decoding the data we already got anyways

But the last word is at @phoboslab. I have stated my case and don't have more arguments here.

@Lokathor
Copy link

Lokathor commented Dec 8, 2021

The decoder should of course not blindly trust a size field and simply jump that far ahead in memory. It still has to check how big the current buffer is and not jump past the end of the buffer. This is 101 level stuff. However, if we're speaking about hostile input there's nothing ensuring that the input will ever signal an end of the stream, so it could trick a simplistic decoder into some sort of overflow that way as well.

And I would say that "the source data is seekable" is the overwhelmingly common case. Most things are on disk or are pulled completely into memory via network. Usually you've got seekable data.

@phoboslab
Copy link
Owner Author

I initially put the size field in the header in the first version, because I was immensely frustrated that you absolutely can not know if you found the end of a frame in MPEG-TS. A simple END_OF_DATA would have solved this.

I do not like the optional size header. I goes against my desire for simplicity. All these "maybe it's like that, but maybe not... you have to check" things is what makes PNG, JPEG and many other formats way more complicated than they should be. So far, I tried hard to avoid all these optional features.

Anyway, what's the real use-cases for when you need to skip a QOI image?

  1. you want to have some custom data after the image
  2. you have a bunch of QOI files lumped together
  3. you receive a stream of QOI images

Solutions:

  1. put a fixed size trailer at the end of the file, that tells you where to find your custom data
  2. use a container format that provides a size field or index of each image
  3. prefix each image with a size field in the stream. If this is a video stream, you need a custom header anyway to denote framerate and whatnot

@Wulf0x67E7
Copy link

Then why not also get rid of width and/or height?

You could always get one by dividing the final number of decoded pixels by the other.

Right now you already have to check that there aren't too many/few pixels encoded inside a stream for a given size, how would an encoded byte size be different?

And wouldn't a format that doesn't require you to use some external (and likely needlessly complicated and non-standard) addon to actually use it for a big portion of its use-cases, just because its header is missing one simple value, be the simpler one?

@ikskuh
Copy link

ikskuh commented Dec 8, 2021

Then why not also get rid of width and/or height?
You could always get one by dividing the final number of decoded pixels by the other.

This is a good question! For one, we need a single control mechanism for stream integrity. Width and Height are usually information of interest, in a header, even if you don't care for the data around (think of the file tool which can show you image sizes). As the QOI stream doesn't have a end of file signal, we need other means to compute that.

Right now you already have to check that there aren't too many/few pixels encoded inside a stream for a given size, how would an encoded byte size be different?

You have to check if your stream is still in the image anyways, so a secondary buffer that might even be wrong (or optional) doesn't help here. It also reduces decoding speed (my implementation was 10% faster for not checking that additional value 😮). Also you gave a cool idea:

Proposal:
If the stream ends with less than width×height pixels decoded, the remaining pixels are filled with the last pixel value. This might bring a huge reduction for image files like logos, icons, … which have a "bar" at the bottom. This is cheap optimization for files or known-size memory buffers.

And wouldn't a format that doesn't require you to use some external (and likely needlessly complicated and non-standard) addon to actually use it for a big portion of its use-cases, just because its header is missing one simple value, be the simpler one?

If you really want that field to exist, it must be non-optional and thus remove the possibility of stream-encode QOI images. This is a huge decision to have either/or as a feature. If it's non-optional, people will just put a 0 there, because they are lazy.

@oscardssmith
Copy link

Your proposal isn't a huge optimization. It saves a maximum of 2 bytes, since a single QOI_RUN_16 tag will handle this.

@phoboslab
Copy link
Owner Author

Then why not also get rid of width and/or height?

Non sequitur. The width/height is mandated to be always present; a decoder can rely on these.

actually use it for a big portion of its use-cases

Which are?

It saves a maximum of 2 bytes, since a single QOI_RUN_16 tag will handle this

There is no QOI_RUN_16 anymore (in the experimental branch). The longest run is now 62. But your point still stands and I'd rather be explicit: if there's pixels in the image, they need to be encoded.

@oscardssmith
Copy link

I was relying on the assumption that by the time we finalize, we will either have special handling for multiple QOI_RUN_8 tags or a QOI_RUN_16/QOI_RUN_24 with an 8 bit tag that would deal with this.

@Wulf0x67E7
Copy link

@MasterQ32 The field being optional means any valid data stream with it has to also be valid without it! You can, in fact, just completely ignore it if your use case doesn't need/can't use it!

IT DOESN'T PREVENT STREAMING IN ANY WAY, SHAPE, OR FORM WHATSOEVER!

It not strictly being needed, I can understand. But saying that it would completely prevent streaming is just nonsense.

@phoboslab

Non sequitur. The width/height is mandated to be always present; a decoder can rely on these.

Exactly my point! Every single argument for size being unreliable also makes width and height be unreliable as well, but they are obviously needed and (in one way or another) already being checked, so size, which doesn't actually need to be checked or even considered when not needed, should be even less problematic.

actually use it for a big portion of its use-cases

Which are?

Any time you want to decode more than one image at once (parallel), without using custom/third-party add-ons like additional headers or indices.

F.e. You have a large number of small images (game textures?) and want to avoid the overheads of both working with lots of small files or having to fully decode a single enormous meta-file to find a specific image. You could use some kind of archive, but that would either be completely custom and non-standard or third-party and too general/overly complicated. With a size field, you could simply pack them one after the other and only need to iterate headers while staying fully within the vanilla spec.

I agree that it is more of a (very) nice to have than an absolute necessity.

The main reason I'm still arguing is because of the reason for its exclusion. Because the statement that it would prevent streaming is one which I, as stated above, strongly disagree with.

If you decide to not include a size field because of complexity, then that is ultimately fine.

@phoboslab
Copy link
Owner Author

Well yes, there is no mandatory size field because it would prevent streaming encode. There is no optional size field because I don't like the idea of it being optional.

you could simply pack them one after the other

That's a rather constructed scenario. Not having an index in this case seems like a bad idea. Even Doom WADs have one :)

I was relying on the assumption that by the time we finalize, we will either have special handling for multiple QOI_RUN_8 tags or a QOI_RUN_16/QOI_RUN_24 with an 8 bit tag that would deal with this.

It's not needed. A run length of 62 is sufficient. The benefits of not having variable-width chunks (or fewer tag types) far outweigh the gain in compression rate. I just tested it again: screenshot_web has the biggest(!) win in compression rate, being 0.2% better than current experimental. It does even less for textures_*, icon_* and pngimg and absolutely nothing for photos.

## Total for images/textures_photo
        decode ms   encode ms   decode mpps   encode mpps   size kb    rate
ex:             0         8.0             0        130.28      1981   48.4%
ex.r16:         0         8.6             0        122.39      1981   48.4%

## Total for images/textures_pk01
ex:             0         0.9             0        150.01       178   35.2%
ex.r16:         0         0.9             0        141.63       178   35.1%

## Total for images/screenshot_game
ex:             0         3.1             0        201.80       519   21.0%
ex.r16:         0         3.5             0        179.90       516   20.9%

## Total for images/textures_pk
ex:             0         0.4             0        115.40        75   43.5%
ex.r16:         0         0.4             0        106.81        75   43.5%

## Total for images/textures_pk02
ex:             0         2.2             0        135.83       479   40.4%
ex.r16:         0         2.4             0        126.77       478   40.4%

## Total for images/icon_64
ex:             0         0.0             0        252.73         4   28.7%
ex.r16:         0         0.0             0        221.30         4   28.7%

## Total for images/icon_512
ex:             0         0.6             0        468.22        85    8.4%
ex.r16:         0         0.7             0        386.56        83    8.2%

## Total for images/photo_kodak
ex:             0         3.4             0        116.76       671   43.7%
ex.r16:         0         3.7             0        107.62       671   43.7%

## Total for images/textures_plants
ex:             0         4.5             0        235.19       922   22.2%
ex.r16:         0         5.1             0        208.62       914   22.0%

## Total for images/screenshot_web
ex:            0        18.6             0         437.47      2649    8.3%
ex.r16:        0        23.4             0         347.76      2576    8.1%

## Total for images/pngimg
ex:             0         7.7             0        235.06      1436   20.3%
ex.r16:         0         8.7             0        207.12      1422   20.1%

## Total for images/photo_tecnick
ex:             0        13.3             0        108.04      2527   44.9%
ex.r16:         0        14.3             0        100.99      2527   44.9%

## Total for images/photo_wikipedia
ex:             0        10.2             0        106.29      2102   49.6%
ex.r16:         0        10.9             0         99.41      2102   49.6%

# Grand total for images
ex:             0         2.5             0        185.77       463   25.6%
ex.r16:         0         2.8             0        167.49       461   25.5%

@chocolate42
Copy link

F.e. You have a large number of small images (game textures?) and want to avoid the overheads of both working with lots of small files or having to fully decode a single enormous meta-file to find a specific image. You could use some kind of archive, but that would either be completely custom and non-standard or third-party and too general/overly complicated. With a size field, you could simply pack them one after the other and only need to iterate headers while staying fully within the vanilla spec.

Your use case for a size field is basically to have a slightly more compact tarball format that can only store qoi images and without any metadata like filename so you'll need custom nonsense to find a specific image anyway. Better to just use a tarball IMO. Tar is the standard that should have been instead of every game company delighting in creating their own format, but it's not universally suitable thanks to how it works. In an ideal world there'd be a second standard archive format for compact indexed archival (a binaryily-searchable index of UTF-8 filenames with uint64_t LE file sizes, plus a file count and that's it), but there isn't AFAIK.

@phoboslab
Copy link
Owner Author

removing alpha deltas makes it almost pointless to support the alpha channel in qoi, since it will compress really poorly

This is absolutely not the case.

## Total for images/pngimg
        decode ms   encode ms   decode mpps   encode mpps   size kb    rate
libpng:      30.2       292.0         59.96          6.19      1201   17.0%
stbi:        27.2       234.7         66.50          7.71      1751   24.8%
qoi.master:   6.9        10.5        262.50        172.09      1415   20.0%
qoi.exp:      6.2         7.8        290.55        231.44      1436   20.3%

@nigeltao
Copy link

nigeltao commented Dec 9, 2021

A quick experiment, starting with luma61-run64. Call the original l61r64 and this experiment l61v02.

What's new is using two of the previously-unused 8-bit opcodes (QOI_OP_Z comes from the QOI_OP_DIFF block, QOI_OP_A comes from the QOI_OP_INDEX block). For opaque source images (with a = 0xFF everywhere), these two opcodes won't be used, so there's no difference in the output .qoi file for l61r64 and l61v02.

  • QOI_OP_Z sets r = g = b = a = 0.
  • QOI_OP_A is the old QOI_OP_COLOR(A) as suggested by @chocolate42.
#define QOI_OP_INDEX   0x00 // 00xxxxxx  +0 bytes  i6
#define QOI_OP_DIFF    0x40 // 01xxxxxx  +0 bytes  r2g2b2
#define QOI_OP_LUMA    0x80 // 10xxxxxx  +1 bytes  g6s4c4 (s=r-g, c=b-g)
#define QOI_OP_RUN     0xc0 // 11xxxxxx  +0 bytes  n6
#define QOI_OP_Z       0x6a // 01101010  +0 bytes  transparent black (all 0)
#define QOI_OP_A       0x3d // 00111101  +1 bytes  a8
#define QOI_OP_RGB     0x3e // 00111110  +3 bytes  r8g8b8
#define QOI_OP_RGBA    0x3f // 00111111  +4 bytes  r8g8b8a8

On @erikcorry's two images:

10861 parrot.png
 9019 parrot-master.qoi
 9590 parrot-experi.qoi
 9553 parrot-l61r64.qoi
 9202 parrot-l61v02.qoi

1466 youtube.png
 942 youtube-master.qoi
1565 youtube-experi.qoi
1593 youtube-l61r64.qoi
 930 youtube-l61v02.qoi

On the non-opaque sub-directories of @phoboslab's test suite (the "size kb" is the important column; the mpps numbers could be optimized):

        decode ms   encode ms   decode mpps   encode mpps   size kb    rate

## Total for images/icon_512
qoi-master:   1.5         1.5        170.74        173.94        80    7.8%
qoi-experi:   1.5         1.2        180.47        224.24       102   10.1%
qoi-l61r64:   1.0         1.4        254.71        190.54        86    8.4%
qoi-l61v02:   1.2         1.3        226.15        206.77        79    7.7%

## Total for images/icon_64
qoi-master:   0.0         0.0        117.66         99.81         4   28.3%
qoi-experi:   0.0         0.0        103.12        106.03         5   31.3%
qoi-l61r64:   0.0         0.0        141.35        106.50         4   28.8%
qoi-l61v02:   0.0         0.0        133.57        103.37         4   26.9%

## Total for images/pngimg
qoi-master:  16.8        19.4        107.53         93.19      1415   20.0%
qoi-experi:  15.5        15.7        116.99        114.90      1445   20.5%
qoi-l61r64:  11.7        16.4        154.61        110.63      1437   20.3%
qoi-l61v02:  13.6        16.8        133.24        107.41      1422   20.1%

## Total for images/textures_pk02
qoi-master:   4.4         5.5         68.97         55.69       504   42.5%
qoi-experi:   4.2         4.4         71.95         69.25       479   40.4%
qoi-l61r64:   3.1         4.3         96.66         70.61       481   40.5%
qoi-l61v02:   3.6         4.5         85.54         66.95       480   40.5%

## Total for images/textures_plants
qoi-master:   8.9        11.5        120.21         92.40       951   22.9%
qoi-experi:   8.0         9.2        132.21        116.02       922   22.2%
qoi-l61r64:   6.0         9.4        177.48        113.29       922   22.2%
qoi-l61v02:   7.2         9.8        148.25        108.32       915   22.0%

[other images/* results not shown]

# Grand total for images
qoi-master:   5.0         6.1         92.49         76.72       485   26.8%
qoi-experi:   4.7         5.0         98.37         92.79       465   25.7%
qoi-l61r64:   3.7         5.1        127.03         91.58       463   25.6%
qoi-l61v02:   4.1         5.2        112.78         89.77       462   25.5%

        decode ms   encode ms   decode mpps   encode mpps   size kb    rate

qoi-l61v02.h.txt

@chocolate42
Copy link

chocolate42 commented Dec 9, 2021

What's new is using two of the previously-unused 8-bit opcodes (QOI_OP_Z comes from the QOI_OP_DIFF block, QOI_OP_A comes from the QOI_OP_INDEX block).

I've been guilty of it in earlier iterations but have come to the conclusion that overlapping encodings should be avoided (unless it's nailed down in the spec that one should be used). By that I mean your encoder might emit QOI_OP_RUN instead of QOI_OP_DIFF, but a different encoder might not so a QOI_OP_DIFF representing no change should not have an overlapping encoding. The 3 tags at the end of a 61 value index are fair game because no conforming encoder should use them for anything else.

There is a replacement for QOI_OP_DIFF that compresses the corpus better and only uses 63 values, so in the above case there would be space for QOI_OP_Z. Lets call it QOI_OP_LUMA1. It's a 6 bit mashup of LUMA's vg_r/vg_b and ANS coding the values (like GDELTA) so we can use ranges other than powers of 2 (which we do to increase green's range and better fit around 0). vg_r=-1..1, vg=-3..3, vg_b=-1..1. Apologies for lack of example but the code below should be relatively plug and play. The last value is unused.

Encode

	if (
		vg_r > -2 && vg_r < 2 &&
		vg > -4 && vg < 4 && 
		vg_b > -2 && vg_b < 2
	) {
		bytes[p++] = QOI_OP_LUMA1 | (((vg_b + 1) * 21) + ((vg_r + 1) * 7) + (vg + 3));
	}

Decode

else if ((b1 & QOI_MASK_2) == QOI_OP_LUMA1) {
	b1 ^= QOI_OP_LUMA1;
	int vg = (b1 % 7) - 3;
	b1/=7;
	px.rgba.r += vg - 1 + (b1 % 3);
	px.rgba.g += vg;
	b1/=3;
	px.rgba.b += vg - 1 + (b1 % 3);
}

Just changing the latest experi commit from DIFF to LUMA1 does this

# Grand total for ../qoi_benchmark_suite/images
qoi-ex2103:      2.3         3.2        199.36        144.40       463   25.6%
qoi-gd8b:      2.8         3.4        168.58        138.35       459   25.4%

There are similar gains to be had giving LUMA the ANS treatment (but smaller, the 8 bit version is more critical) but that's bikeshed territory for now.

edit: It may look like processing speed is tanked but perhaps not much when optimised, delta7 in the bikeshed uses a similar but wider 7 bit LUMA/ANS encoding and still achieves decent speeds.

edit: Actually as a mix of luma and ans it should be called lama

@nigeltao
Copy link

It may look like processing speed is tanked but perhaps not much when optimised, delta7 in the bikeshed uses a similar but wider 7 bit LUMA/ANS encoding and still achieves decent speeds.

In case anyone else is wondering, here's a link to delta7 code and benchmark numbers.

@jmi2k
Copy link

jmi2k commented Dec 10, 2021

Edit: thinking about it some more, a 7-byte 0x00 shouldn't occur in a stream either, since repeated QOI_INDEX chunks should be encoded as QOI_RUN instaed.

Is this an explicit condition? I recall reading somewhere around here that an encoder should be free to make any kind of decision about how to encode the data as long as valid opcodes are used (an extreme case would be an encoder emitting QOI_OP_RGB for each pixel. Stupid but technically correct). Looks like 8 zeros in a row would be a valid QOI opcode sequence, even if useless. What do you think?

@chocolate42
Copy link

Speaking of padding, wouldn't requiring the total size to be a multiple of 8 bytes zero padded be better than always outputting 7 zero bytes? Making the header 16 bytes should also eliminate unaligned loads and/or make handling slightly simpler (wouldn't it? I don't know much about the nitty gritty of optimising).

If the above is true why not go one step further and require total size to be a multiple of 32 so SIMD can make some safe optimising assumptions (in particular for AVX2, but many SIMD are in this sort of range or variable which should still benefit)?

@Lokathor
Copy link

There is nearly nothing that can be made SIMD about this format, other than maybe a memcpy.

@phoboslab
Copy link
Owner Author

phoboslab commented Dec 10, 2021

Is this an explicit condition? I recall reading somewhere around here that an encoder should be free to make any kind of decision about how to encode the data as long as valid opcodes are used (an extreme case would be an encoder emitting QOI_OP_RGB for each pixel. Stupid but technically correct). Looks like 8 zeros in a row would be a valid QOI opcode sequence, even if useless. What do you think?

Technically correct. I'm hesitant to sacrifice one more bit of QOI_INDEX for the end-code, though. So I guess we should just make that explicit: "Consecutive QOI_INDEX chunks with the same index value are illegal and must be encoded as a QOI_RUN instead."

Or we switch the QOI_RUN and QOI_INDEX tags and make the 61 values wide as proposed by @chocolate42 - which after some testing had no compression benefit and is somewhat aesthetically less pleasing to me :/

Edit: Narf. Specifying that an encoder must not produce consecutive QOI_INDEX chunks would actually not be enough to have a distinct "ending tag". Even one QOI_INDEX: 0 may be misleading: If a stream ends with a QOI_INDEX: 0 and is then followed by the 8-bytes padding, a decoder that attempts to split QOI files will split one byte early.

@chocolate42
Copy link

Prime numbers are always aesthetic ;)

Powers of two are useful for performance and for some ops it makes sense, IMO index and run are two ops where powers of two do not make sense. I'm experimenting with grouping run/index/small-ops into a single mask to optimise the opcode distribution. How does an index size of 47 tickle your fancy?

@phoboslab
Copy link
Owner Author

I'm generally against introducing more opcodes if they do not increase compression by a lot. Without refactoring QOI to use some notion of blocks and/or more complicated encoding schemes, we will not beat a well optimized PNG anyway. QOI is all about simplicity with a reasonable compression rate.

Thinking about the end-code some more: even if the padding is specified to be an otherwise unused 8-bit tag a decoder may still split QOI files too early if they just happen to end with a QOI_RGBA where the color value(s) match this tag. So the end-code must be a multi-byte sequence that can not occur within the stream 🤔

@Lokathor
Copy link

I think simplicity is maybe the wrong metric to aim for. You only implement the encoder/decoder some small number of times.

Fast operations with decent compression seems like a better goal point. So whatever makes things go fast without hurting compression should probably be favored. Eg: using powers of 2 over prime numbers.

@ikskuh
Copy link

ikskuh commented Dec 10, 2021

I think simplicity is maybe the wrong metric to aim for. You only implement the encoder/decoder some small number of times.

If we go that route, we'll end up PNG in the end. As you can always argue that something might improve compression rate. Simplicity is nice as you can use QOI to explain people how some basic compression works in general and you can easily proof correctness of implementation.

I prefer simple and small libs with marginal drawbacks to complexer ones nowadays because i can manage and maintain them. I don't wanna maintain a libpng, but libqoi is just so tiny, i can do that

@Lokathor
Copy link

You say that, but PNG does at least two things that are not helpful for compression ratio or decompression speed. Possibly more. It's really not so great, it's just what we all use at the moment.

As you can always argue that something might improve compression rate.

I would expect some concrete evidence if someone was making a specific claim of increased compression or better decompression speed.

The example of powers of two compared to primes is just an avenue to investigate, I don't think anyone was trying to advocate a specific spec change without evidence.

@chocolate42
Copy link

I'm generally against introducing more opcodes if they do not increase compression by a lot.

I don't expect half of these to work well, the space just means they can be compared additively for bikeshedding. Index is pretty weak although I doubt it can be removed entirely as it's good at repeating patterns like simple textures (how common are they though, and aren't full index formats more suitable?). Setting the index size to 3 on a variant only increases the average filesize of the corpus from 453 to 475, if index was really pulling its weight I'd expect the filesize to balloon more than that.

Fast operations with decent compression seems like a better goal point. So whatever makes things go fast without hurting compression should probably be favored. Eg: using powers of 2 over prime numbers.

px.v %prime instead of multiplying the components by small primes was an attempt to make something simple and fast. That it managed to not hurt compression as much as the smaller cache indicated it should have was a surprise.

I'm not advocating for anything in particular, just picking a few things that look like low hanging fruit and testing them to make sure they're fit for purpose.

@phoboslab
Copy link
Owner Author

Don't get me wrong, I like seeing all those ideas. I just wanted to emphasize again that trading simplicity for small gains in compression ratio is imho not the right choice for this codec. So adding more op-codes is a hard sell for me.

If you find an alternative to QOI_OP_INDEX (or any other op), I'm all ears :)

@xfmoulet
Copy link
Contributor

I think simplicity is maybe the wrong metric to aim for. You only implement the encoder/decoder some small number of times.

My understanding is that This format / codec optimizes compression speed , simplicity (of algos, implementation and use) AND compression rate. The number of interest, ports, uses and number of tentative improvements after a week of being public clearly shows this optimization was useful to some people.

@chocolate42
Copy link

If you find an alternative to QOI_OP_INDEX (or any other op), I'm all ears :)

Haven't been able to replace QOI_OP_INDEX, but have managed to get decent encode/decode/compression stats using a 32 value index. Details in the bikeshed, tl;dr adding some opcodes that work like LUMA but for different byte size outputs does this:

# Grand total for ../qoi_benchmark_suite/images
qoi-ex2103:     2.310       3.192        200.98        145.42       463   25.6%
qoi-delta9h:    2.563       3.391        181.12        136.87       428   23.6%

DIFF was removed in favour of a LUMA variant and INDEX was reduced to 31/32 values from the start of testing. This isn't necessarily optimal, at some point that might get revisited.

@phoboslab
Copy link
Owner Author

Since I'm happy with the general direction of the experimental branch, I've merged it back into master. Still open for further changes of course.

@jmi2k
Copy link

jmi2k commented Dec 11, 2021

Just brainstorming, but would combining QOI_OP_INDEX and QOI_OP_LUMA (or QOI_OP_RGB) into a single fused op? For those times when you need a color very closely resembling another color in the index (a good hash should make the index have an evenly-distributed sample of the last pixels found), it might improve compression. Or it might not, but maybe it's worth exploring.

@oscardssmith
Copy link

I tried a simple version of that, and wasn't able to find anything especially compelling.

@nigeltao
Copy link

Yeah, I also tried that. The compression sizes were roughly the same (you have to remove some other ops to introduce N fused ops) but the compression speed also roughly got N times slower, where N is the size of the color cache (what the index indexes).

@Wulf0x67E7
Copy link

... but the compression speed also roughly got N times slower, where N is the size of the color cache (what the index indexes).

You should still only need to do a single lookup, as long as you're using a hash function that only considers the upper 4/5/6 bits.

Another idea: Make DIFFs relative to a simple prediction based on the last two pixels, something like prediction = 2*pixel - prev_pixel (predict more change), or prediction = pixel + (prev_pixel - pixel)/2 (predict the average)

@nigeltao
Copy link

nigeltao commented Dec 12, 2021

Since I'm happy with the general direction of the experimental branch, I've merged it back into master.

I tried optimizing master's qoi_decode function. I've called it demo2e (the 2e is based on the master commit hash: 2ee2169). It's not as fast as demo10, but in some sense, demo10 only has three (variable width) ops: DELTA, INDEX, RUN.

The demo2e encoder is identical to the master encoder. On compression size, master/demo2e wins on photographic test images, demo10 wins on icons and pngimg, screenshots slightly favor master/demo2e. Roughly speaking.

Edit: I've also added lz4d2e numbers, which wrap the demo2e implementation with an LZ4 compression after QOI encode and a LZ4 decompression before QOI decode. lz4d10 does likewise for demo10.

The LZ4-enriched file format isn't quite right. Since QOI allows streaming encodes, LZ4+QOI should use LZ4's streaming mode instead of LZ4's block mode (for now, I've inserted an 8 byte uncompressed length after the 14 byte QOI header). But as a quick experiment to get ballpark compression numbers, LZ4's block mode was easiest to get some code running.

Decode / encode speeds obviously take a hit. The compression gains are most impressive for images/icon_512 and images/screenshot_*.

master is commit 2ee2169 (2021-12-11).

        decode ms   encode ms   decode mpps   encode mpps   size kb    rate

## Total for images/icon_512
libpng:       5.0        33.9         52.74          7.74        51    5.0%
qoi-master:   1.5         1.3        171.27        198.88        85    8.4%
qoi-demo2e:   0.9         1.4        300.26        183.00        85    8.4%
qoi-lz4d2e:   1.3         2.5        194.73        104.05        54    5.3%
qoi-demo10:   0.8         1.2        309.23        218.27        76    7.5%
qoi-lz4d10:   0.9         1.5        280.25        177.87        53    5.2%

## Total for images/icon_64
libpng:       0.2         0.9         23.26          4.35         3   23.6%
qoi-master:   0.0         0.0        104.20        100.77         4   28.7%
qoi-demo2e:   0.0         0.0        146.61        105.32         4   28.7%
qoi-lz4d2e:   0.0         0.1        121.38         66.90         4   25.5%
qoi-demo10:   0.0         0.0        148.43        110.04         4   27.6%
qoi-lz4d10:   0.0         0.0        151.51         86.91         4   25.6%

## Total for images/photo_kodak
libpng:      15.2       295.7         25.88          1.33       717   46.7%
qoi-master:   6.1         6.7         64.95         58.81       671   43.7%
qoi-demo2e:   5.3         6.7         74.72         58.59       671   43.7%
qoi-lz4d2e:   6.4         9.2         61.89         42.91       671   43.7%
qoi-demo10:   4.1         7.1         95.72         55.76       772   50.3%
qoi-lz4d10:   4.3         8.0         91.37         49.27       772   50.3%

## Total for images/photo_tecnick
libpng:      44.9      1143.8         32.08          1.26      2414   42.9%
qoi-master:  22.6        26.3         63.73         54.84      2527   44.9%
qoi-demo2e:  18.3        25.8         78.73         55.80      2527   44.9%
qoi-lz4d2e:  20.2        28.8         71.36         50.07      2534   45.1%
qoi-demo10:  13.0        25.1        111.04         57.44      2737   48.7%
qoi-lz4d10:  13.6        26.5        105.76         54.42      2744   48.8%

## Total for images/photo_wikipedia
libpng:      42.9       865.8         25.28          1.25      2046   48.3%
qoi-master:  17.0        20.2         63.66         53.81      2102   49.6%
qoi-demo2e:  14.2        19.8         76.43         54.71      2102   49.6%
qoi-lz4d2e:  17.1        24.6         63.41         44.13      2109   49.8%
qoi-demo10:   9.5        19.2        113.99         56.49      2289   54.0%
qoi-lz4d10:   9.9        20.5        109.44         52.93      2297   54.2%

## Total for images/pngimg
libpng:      56.4       533.3         32.08          3.39      1201   17.0%
qoi-master:  16.4        16.7        110.41        108.00      1436   20.3%
qoi-demo2e:  11.9        17.4        151.75        104.09      1436   20.3%
qoi-lz4d2e:  13.1        21.3        138.40         84.77      1341   19.0%
qoi-demo10:   9.0        15.7        201.83        114.98      1429   20.2%
qoi-lz4d10:   9.5        17.5        190.62        103.52      1355   19.2%

## Total for images/screenshot_game
libpng:      18.6       216.7         34.08          2.92       448   18.1%
qoi-master:   6.7         6.5         93.95         97.27       519   21.0%
qoi-demo2e:   5.2         6.8        121.06         93.26       519   21.0%
qoi-lz4d2e:   6.3         9.4        100.66         67.01       464   18.8%
qoi-demo10:   4.1         6.1        156.06        104.53       535   21.7%
qoi-lz4d10:   4.3         7.1        147.22         89.45       485   19.6%

## Total for images/screenshot_web
libpng:      92.0      1060.7         88.33          7.66      2402    7.6%
qoi-master:  55.7        44.8        145.98        181.17      2649    8.3%
qoi-demo2e:  32.5        47.7        249.95        170.34      2649    8.3%
qoi-lz4d2e:  32.8        62.2        247.97        130.61      2152    6.8%
qoi-demo10:  28.3        40.4        287.55        201.06      2680    8.4%
qoi-lz4d10:  26.0        46.0        312.87        176.74      2284    7.2%

## Total for images/textures_photo
libpng:      39.6       725.0         26.45          1.45      1977   48.3%
qoi-master:  15.0        16.0         69.87         65.44      1981   48.4%
qoi-demo2e:  12.1        15.7         86.86         66.79      1981   48.4%
qoi-lz4d2e:  15.7        23.4         66.98         44.77      1977   48.3%
qoi-demo10:  10.1        18.3        104.22         57.40      2506   61.2%
qoi-lz4d10:  11.0        25.0         95.22         42.02      2441   59.6%

## Total for images/textures_pk
libpng:       0.9        24.6         47.24          1.81        89   51.5%
qoi-master:   0.8         0.8         59.21         57.16        75   43.5%
qoi-demo2e:   0.6         0.8         71.81         53.45        75   43.5%
qoi-lz4d2e:   0.8         1.3         54.76         33.88        72   41.5%
qoi-demo10:   0.6         0.7         77.52         60.26        78   45.1%
qoi-lz4d10:   0.6         0.9         75.72         50.26        77   44.5%

## Total for images/textures_pk01
libpng:       5.3        66.5         24.65          1.95       163   32.3%
qoi-master:   1.6         1.7         80.68         75.78       178   35.2%
qoi-demo2e:   1.2         1.7        110.07         75.88       178   35.2%
qoi-lz4d2e:   1.4         2.2         95.41         58.77       168   33.1%
qoi-demo10:   1.0         1.6        136.54         81.54       180   35.6%
qoi-lz4d10:   1.0         1.8        129.16         72.31       173   34.1%

## Total for images/textures_pk02
libpng:      11.7       205.0         25.97          1.48       427   36.1%
qoi-master:   4.5         4.5         67.71         67.38       479   40.4%
qoi-demo2e:   3.4         4.6         88.95         65.69       479   40.4%
qoi-lz4d2e:   4.3         6.6         70.47         45.74       465   39.3%
qoi-demo10:   2.8         4.3        108.01         70.79       492   41.5%
qoi-lz4d10:   2.9         4.8        103.59         62.67       480   40.5%

## Total for images/textures_plants
libpng:      31.9       340.6         33.35          3.12       857   20.6%
qoi-master:   8.6         9.9        123.62        107.79       922   22.2%
qoi-demo2e:   5.8         9.8        181.93        108.01       922   22.2%
qoi-lz4d2e:   6.5        12.3        163.09         86.62       914   22.0%
qoi-demo10:   4.3         9.5        247.12        111.57       957   23.0%
qoi-lz4d10:   4.4        10.2        239.81        104.78       957   23.0%

# Grand total for images
libpng:      13.5       187.9         34.47          2.47       423   23.3%
qoi-master:   5.1         5.2         91.89         89.38       463   25.6%
qoi-demo2e:   3.8         5.3        121.20         87.21       463   25.6%
qoi-lz4d2e:   4.5         7.0        103.46         66.50       438   24.2%
qoi-demo10:   3.0         4.9        156.82         94.45       484   26.7%
qoi-lz4d10:   3.1         5.5        149.73         83.73       463   25.6%

        decode ms   encode ms   decode mpps   encode mpps   size kb    rate

qoi-demo2e.h.txt

qoi-lz4d2e.h.txt

qoi-lz4d10.h.txt

@nigeltao
Copy link

I've updated my most recent comment to add numbers from an LZ4 compression experiment.

@Niphiil
Copy link

Niphiil commented Dec 29, 2021

There is nearly nothing that can be made SIMD about this format, other than maybe a memcpy.

SIMD maybe can be used to encode animation. If you use the pixels of the previous frame as a source of previous pixel data and use the same algorithm in the third dimension frame by frame, as before pixel by pixel. To store such data for each pixel, after the first frame, you can make a separate field with a fixed size (some magic number, maybe 16 or 64 bytes). After filling any one field in a chunk, the previous chunk closes and starts a new one. This is not very effective if the animation consists of a single rainbow pixel, but can probably work adequately in general cases. It's probably wise to use 0xffff_ffff as a marker for the end of the chunk, then some of the unfilled chunks can be compressed during finalization. Anyway its looks like other format like "QOA".

The two worst performing examples I found are actually the ones most prominently featured by other image formats: dice.png and fish.png. As I said before, I believe these are fairly "artificial" examples.

I think you are wrong about the uselessness of the alpha channel, because sometimes the alpha channel can store different data (elevation map, normal map, etc.), which should also be compressed intelligently. Maybe QOI_OP_LUMA can use 4 bits for primary colors and 2 bits for alpha?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests