Recent changes in experimental branch #71
The only nitpick I can think of would be the GRB- instead of RGB-bit-order of QOI_OP_LUMA. Is there any technical reason to use the green channel as the guide instead of the red one? |
The idea was that changes in lightness (luma) would be most represented in the green channel, since that is the color that the human eye is most sensitive to. I just tried it with a 6-bit red channel instead and got consistently worse results — but not by much. Overall compression rate would be at 25.9%. Maybe it's worth changing it anyway, for clarity's sake. |
Huh, I didn't think that it would have any impact at all on a lossless format and that those human-perception-specific tricks only really matter when compressing lossily. I'd personally change it, but it really is incredibly nitpicky and doesn't matter too much either way. ¯\_(ツ)_/¯ |
I think the reason it matters is probably that it is fairly common for images to get subtly lighter or darker while maintaining the same hue, and in those cases, green on average is what changes the most. |
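The range check being discussed can be sketched like this — a minimal sketch assuming the experimental branch's QOI_OP_LUMA layout (6-bit green diff as the guide, 4-bit red/blue diffs relative to it), not the reference code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch: does a pixel-to-pixel difference fit QOI_OP_LUMA?
   The 6-bit green diff acts as the guide; red and blue are stored
   as 4-bit offsets from it. */
static bool fits_op_luma(int8_t vr, int8_t vg, int8_t vb) {
    int8_t vg_r = vr - vg;  /* red diff relative to green diff */
    int8_t vg_b = vb - vg;  /* blue diff relative to green diff */
    return vg   >= -32 && vg   <= 31 &&
           vg_r >=  -8 && vg_r <=  7 &&
           vg_b >=  -8 && vg_b <=  7;
}
```

This is why "gets subtly lighter while keeping the hue" compresses well: when all three channels move together, vg_r and vg_b stay near zero even for large vg.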
Wouldn't removing DIFF_24 greatly degrade the compression ratio of semi-transparent textures? (like hair textures, or maybe water textures) I think these changes make qoi useless for semi-transparent images. Maybe in that case its better to stop encoding alpha levels at all and only write rgb + 0/1 alpha? |
Yes, it performs worse for these, but I am not convinced that this is an issue. From my game dev experience, even textures for water, clouds, dust etc. use fairly distinct alpha values. Many of these don't even have an alpha channel at all and are just blended with … The two worst performing examples I found are actually the ones most prominently featured by other image formats: dice.png and fish.png. As I said before, I believe these are fairly "artificial" examples.
Compression rate wasn't really great before (compared to libpng/stbi), but it's still very good compared to uncompressed RGBA. |
One last change that might be worth trying is changing QOI_INDEX to actually use a LRU cache. This will probably be a little slower, but it would remove the effect of hash collisions on the file size. |
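The LRU idea could look something like the following — a sketch only, with the cache size and the packed `uint32_t` pixel representation being assumptions; a real implementation would care much more about the linear scan's cost:

```c
#include <stdint.h>
#include <string.h>

/* Sketch of an LRU cache replacing the hashed QOI_INDEX: a linear-scan
   cache where a hit moves the entry to the front. Slower than hashing,
   but free of hash collisions. */
#define CACHE_SIZE 64

typedef struct { uint32_t px[CACHE_SIZE]; int len; } lru_t;

/* Returns the index the pixel was found at, or -1 on a miss; either
   way the pixel ends up in slot 0 afterwards. */
static int lru_lookup(lru_t *c, uint32_t px) {
    for (int i = 0; i < c->len; i++) {
        if (c->px[i] == px) {
            memmove(&c->px[1], &c->px[0], (size_t)i * sizeof px);
            c->px[0] = px;
            return i;
        }
    }
    if (c->len < CACHE_SIZE) c->len++;
    memmove(&c->px[1], &c->px[0], (size_t)(c->len - 1) * sizeof px);
    c->px[0] = px;
    return -1;
}
```

Note the encoder would emit the pre-move index, and the decoder must perform the exact same move-to-front so both sides agree.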
Sorry @phoboslab, I think there is a mistake in computing the color differences:

```c
char vr = px.rgba.r - px_prev.rgba.r;
char vg = px.rgba.g - px_prev.rgba.g;
char vb = px.rgba.b - px_prev.rgba.b;
char vg_r = vr - vg;
char vg_b = vb - vg;
```

You need … Edit: `type Wrap_Around_Difference_Type is mod 256;` in Ada, then everything was clear. |
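The wrap-around point can be demonstrated with plain unsigned 8-bit arithmetic (a small sketch; modular arithmetic guarantees the diff/apply roundtrip regardless of how the bytes are later reinterpreted as signed values):

```c
#include <stdint.h>

/* Differences taken modulo 256 (unsigned 8-bit wrap) always reconstruct
   the original value, which is why e.g. 2 - 255 = 3 (mod 256) is a
   perfectly valid "diff". */
static uint8_t wrap_diff(uint8_t cur, uint8_t prev) {
    return (uint8_t)(cur - prev);   /* wraps mod 256 */
}
static uint8_t wrap_apply(uint8_t prev, uint8_t diff) {
    return (uint8_t)(prev + diff);  /* wraps back */
}
```

In C this falls out of unsigned arithmetic; in Ada the `mod 256` type makes the same wrap-around explicit.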
Here's a tweak to the index, tested against the old data set because ISPs suck.
|
Very cool. Reducing the index to 61 is rather clever! And I'd really like to know why a 64 run-length would help that much with performance. @vsonnier this should be fine, as long as the decoder wraps these values around, too. Edit: I can't really reproduce these throughput numbers. qoi-luma61.run64 performs slightly worse on my machine.
Compression is better though. Images with alpha suffer quite a bit when the hash doesn't include the alpha value. Might be worth changing it to: `#define QOI_COLOR_HASH(C) (C.rgba.r * 3 + C.rgba.g * 5 + C.rgba.b * 7 + C.rgba.a * 11)` (or just … |
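The suggested alpha-aware hash, reduced to a 64-entry index, could look like this (a sketch; the struct layout and the `% 64` reduction are assumptions made for the sake of the example):

```c
#include <stdint.h>

/* Sketch of the proposed hash including the alpha term, indexing a
   64-entry color cache. */
typedef struct { uint8_t r, g, b, a; } rgba_t;

static unsigned color_hash(rgba_t c) {
    return (c.r * 3 + c.g * 5 + c.b * 7 + c.a * 11) % 64;
}
```

With the `* 11` term, two pixels that differ only in alpha usually land in different cache slots instead of evicting each other.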
Some numbers vs qoi-demo10. qoi-experi is commit 28954f7. qoi-l61r64 is the luma61.run64 mentioned above. For compressed size, qoi-demo10 wins on icon_512, icon_64 and pngimg, which all involve alpha. qoi-experi wins everywhere else, including on textures_plants (which also involves alpha). I know simplicity of format is a goal, but perhaps consider having two slightly different sets of opcodes (or opcode bit assignments) depending on whether channels == 3 or channels == 4. Edit: now that I've added the luma61.run64 numbers, the qoi-demo10 vs qoi-l61r64 gap is noticeably smaller on icon_512, icon_64 and pngimg. One more quick request is to raise QOI_PADDING from 4 to 7, so that the decoder can always do an 8-byte (64-bit) read.
This was on a mid-range x86_64 laptop (2016, Skylake):
|
Regarding a 7-byte padding: wouldn't it make sense to put an "end-marker" there, instead of just … If we then use … Edit: thinking about it some more, a 7-byte … |
Further changes in experimental:
|
Just adding a size field to the header would be far more efficient than making people who want to skip the image iterate every byte looking for a run of four 0xFF or eight 0x00. |
We've had this discussion (see #28 (comment)) and I believe the benefits of allowing a "streaming" encoder that does not need seeking outweigh the drawbacks of not having the size in the header. |
What about allowing encoders to put |
Yeah if the encoder absolutely needs to never go back it can just put a zero, but that's probably very rare. More likely these are files being put on disk, and you can just adjust an earlier part of the file once the compression is completed. |
That would completely defeat the idea of a streaming reader, as you then don't know the file size beforehand. But if you want to skip images in a stream, you would need to have a defined header anyway. So you can also just leave it out.
It would make it harder for, let's say, embedded devices that have low memory anyways to require the device to store the image both encoded and unencoded in RAM just for the sake of a size field. Parsing/seeking the full image stream also doesn't take much time on most machines, especially if no writes are done. A stronger use case for streaming encoders is stuff like additional compression/encryption where you convert some bytes, push them into the compressor/encrypter and pull some bytes from them. All compression/encryption APIs I know provide this pattern and it would be cool to just not have to allocate additional memory for the compressed image. Especially as it might require either overallocation or reallocation, which are both costly. Imho, status quo is already good enough also for special use cases. The tagging of four consecutive 0xFF is imho a good middle ground, especially as a conforming non-streaming parser doesn't have to handle that case at all (it's always invalid) @phoboslab: The changes for color hash and colorspace are good 👍 |
If the size is zero, you have to walk the bytes to find the end. If the size is non-zero you can skip that many bytes immediately. But if there's no size at all then you also have to walk the bytes to find the end, so things are just as bad as if someone had put a size of zero.
Why would you store the entire encoded image in ram just to have the total encoded size? A counter of how many encoded bytes have been written would be plenty. Then you write that value down in the file after the encoding is done (or, again, if you can't go back for some reason you'd know that ahead of time and mark a size of 0 in your output). |
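The "count bytes, then patch the size in afterwards" approach can be sketched like this (a sketch under assumptions: a 4-byte little-endian size field at offset 0 of a hypothetical container, not anything from the QOI spec):

```c
#include <stdio.h>
#include <stdint.h>

/* Sketch: reserve a size field, stream the payload, then seek back and
   patch in the real size. An encoder that truly cannot seek leaves the
   placeholder zero, exactly as discussed above. */
static long write_with_patched_size(FILE *f, const uint8_t *data, size_t len) {
    uint32_t zero = 0;
    fwrite(&zero, 4, 1, f);              /* placeholder size field */
    fwrite(data, 1, len, f);             /* stream the encoded bytes */
    if (fseek(f, 0, SEEK_SET) == 0) {    /* seekable: patch the real size */
        uint32_t size = (uint32_t)len;
        fwrite(&size, 4, 1, f);
    }
    return 4 + (long)len;
}
```

Only a running byte counter is needed, never the whole encoded image in RAM.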
Some reasoning on why i don't think the size field is a good idea:
That very much depends. I can already see people embedding different QOI images in QOI images by specifying invalid … |
I can't, because my (fictional, but realistic) device has only 32k RAM. This means I can neither store the full source image (but have to stream it from somewhere) nor can I store the full output image (but have to stream it somewhere). So in that use case, it would require me to always write zero to that field. But to implement something like seeking back to disk, you have to create two interfaces to your application: one that allows streaming encoding, and one that allows non-streaming encoding. The latter one would require a somewhat different API for non-generic-purpose implementations and would make the general library/application harder without much benefit in general (I won't argue that there isn't a benefit, I just don't value it that much). The only use case where such a field is of use is: when you want to skip over the QOI data and the source data is seekable (i.e. disk or memory buffer). For all other use cases we either have to decode the data anyways (so no benefit from the size field) or we don't have seeking capabilities (so the size field isn't worth much either, as we have to read the data anyways, and skip-decoding QOI is way faster than memory or even I/O devices). But the last word is at @phoboslab. I have stated my case and don't have more arguments here. |
The decoder should of course not blindly trust a size field and simply jump that far ahead in memory. It still has to check how big the current buffer is and not jump past the end of the buffer. This is 101 level stuff. However, if we're speaking about hostile input there's nothing ensuring that the input will ever signal an end of the stream, so it could trick a simplistic decoder into some sort of overflow that way as well. And I would say that "the source data is seekable" is the overwhelmingly common case. Most things are on disk or are pulled completely into memory via network. Usually you've got seekable data. |
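The "101 level" bounds check described above is small enough to spell out (a sketch; names are made up for illustration):

```c
#include <stdbool.h>
#include <stddef.h>

/* Sketch: only jump ahead by a size field if the jump stays inside the
   buffer. Checking pos first avoids unsigned underflow in buf_len - pos. */
static bool can_skip(size_t pos, size_t size_field, size_t buf_len) {
    return pos <= buf_len && size_field <= buf_len - pos;
}
```

A decoder that applies this check cannot be lured past the end of its buffer by a hostile size field, although, as noted, a missing end-of-stream signal is a separate hazard.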
I initially put the size field in the header in the first version, because I was immensely frustrated that you absolutely can not know if you found the end of a frame in MPEG-TS. A simple … I do not like the optional size header. It goes against my desire for simplicity. All these "maybe it's like that, but maybe not... you have to check" things are what make PNG, JPEG and many other formats way more complicated than they should be. So far, I have tried hard to avoid all these optional features. Anyway, what's the real use-case for when you need to skip a QOI image?
Solutions:
|
Then why not also get rid of width and/or height? You could always get one by dividing the final number of decoded pixels by the other. Right now you already have to check that there aren't too many/few pixels encoded inside a stream for a given size, how would an encoded byte size be different? And wouldn't a format that doesn't require you to use some external (and likely needlessly complicated and non-standard) addon to actually use it for a big portion of its use-cases, just because its header is missing one simple value, be the simpler one? |
This is a good question! For one, we need a single control mechanism for stream integrity. Width and Height are usually information of interest, in a header, even if you don't care for the data around (think of the
You have to check if your stream is still in the image anyways, so a secondary buffer that might even be wrong (or optional) doesn't help here. It also reduces decoding speed (my implementation was 10% faster for not checking that additional value 😮). Also you gave a cool idea: Proposal:
If you really want that field to exist, it must be non-optional and thus remove the possibility of stream-encode QOI images. This is a huge decision to have either/or as a feature. If it's non-optional, people will just put a 0 there, because they are lazy. |
Your proposal isn't a huge optimization. It saves a maximum of 2 bytes, since a single QOI_RUN_16 tag will handle this. |
Non sequitur. The width/height is mandated to be always present; a decoder can rely on these.
Which are?
There is no QOI_RUN_16 anymore (in the experimental branch). The longest run is now 62. But your point still stands and I'd rather be explicit: if there's pixels in the image, they need to be encoded. |
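With the longest run capped at 62, a run of N identical pixels simply takes one run opcode per 62 pixels; a trivial sketch of that chunking:

```c
/* Sketch: number of QOI_OP_RUN opcodes needed to encode n identical
   pixels when the maximum run length is 62 (experimental branch). */
static int run_opcodes_needed(int n) {
    return (n + 61) / 62;   /* ceiling division */
}
```

So dropping QOI_RUN_16 costs at most one extra byte per 62 repeated pixels, which is the trade being weighed here.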
I was relying on the assumption that by the time we finalize, we will either have special handling for multiple QOI_RUN_8 tags or a QOI_RUN_16/QOI_RUN_24 with an 8 bit tag that would deal with this. |
@MasterQ32 The field being optional means any valid data stream with it has to also be valid without it! You can, in fact, just completely ignore it if your use case doesn't need/can't use it! IT DOESN'T PREVENT STREAMING IN ANY WAY, SHAPE, OR FORM WHATSOEVER! It not strictly being needed, I can understand. But saying that it would completely prevent streaming is just nonsense.
Exactly my point! Every single argument for size being unreliable also makes width and height be unreliable as well, but they are obviously needed and (in one way or another) already being checked, so size, which doesn't actually need to be checked or even considered when not needed, should be even less problematic.
Any time you want to decode more than one image at once (in parallel), without using custom/third-party add-ons like additional headers or indices. E.g. you have a large number of small images (game textures?) and want to avoid the overheads of both working with lots of small files and having to fully decode a single enormous meta-file to find a specific image. You could use some kind of archive, but that would either be completely custom and non-standard, or third-party and too general/overly complicated. With a size field, you could simply pack them one after the other and only need to iterate headers while staying fully within the vanilla spec. I agree that it is more of a (very) nice-to-have than an absolute necessity. The main reason I'm still arguing is the reason given for its exclusion, because the statement that it would prevent streaming is one which I, as stated above, strongly disagree with. If you decide to not include a size field because of complexity, then that is ultimately fine. |
Well yes, there is no mandatory size field because it would prevent streaming encode. There is no optional size field because I don't like the idea of it being optional.
That's a rather constructed scenario. Not having an index in this case seems like a bad idea. Even Doom WADs have one :)
It's not needed. A run length of 62 is sufficient. The benefits of not having variable-width chunks (or fewer tag types) far outweigh the gain in compression rate. I just tested it again:
|
Your use case for a size field is basically to have a slightly more compact tarball format that can only store qoi images and without any metadata like filenames, so you'll need custom nonsense to find a specific image anyway. Better to just use a tarball IMO. Tar is the standard that should have been used instead of every game company delighting in creating its own format, but it's not universally suitable thanks to how it works. In an ideal world there'd be a second standard archive format for compact indexed archival (a binary-searchable index of UTF-8 filenames with uint64_t LE file sizes, plus a file count and that's it), but there isn't AFAIK. |
This is absolutely not the case.
|
A quick experiment, starting with luma61-run64. Call the original l61r64 and this experiment l61v02. What's new is using two of the previously-unused 8-bit opcodes (QOI_OP_Z comes from the QOI_OP_DIFF block, QOI_OP_A comes from the QOI_OP_INDEX block). For opaque source images (with a = 0xFF everywhere), these two opcodes won't be used, so there's no difference in the output .qoi file for l61r64 and l61v02.
On @erikcorry's two images:
On the non-opaque sub-directories of @phoboslab's test suite (the "size kb" is the important column; the mpps numbers could be optimized):
|
I've been guilty of it in earlier iterations but have come to the conclusion that overlapping encodings should be avoided (unless it's nailed down in the spec which one should be used). By that I mean your encoder might emit QOI_OP_RUN instead of QOI_OP_DIFF, but a different encoder might not, so a QOI_OP_DIFF representing no change should not have an overlapping encoding. The 3 tags at the end of a 61-value index are fair game because no conforming encoder should use them for anything else. There is a replacement for QOI_OP_DIFF that compresses the corpus better and only uses 63 values, so in the above case there would be space for QOI_OP_Z. Let's call it QOI_OP_LUMA1. It's a 6-bit mashup of LUMA's vg_r/vg_b and ANS coding the values (like GDELTA) so we can use ranges other than powers of 2 (which we do to increase green's range and better fit around 0). vg_r=-1..1, vg=-3..3, vg_b=-1..1. Apologies for lack of example but the code below should be relatively plug and play. The last value is unused. Encode
Decode
Just changing the latest experi commit from DIFF to LUMA1 does this
There are similar gains to be had giving LUMA the ANS treatment (but smaller, the 8 bit version is more critical) but that's bikeshed territory for now. edit: It may look like processing speed is tanked but perhaps not much when optimised, delta7 in the bikeshed uses a similar but wider 7 bit LUMA/ANS encoding and still achieves decent speeds. edit: Actually as a mix of luma and ans it should be called lama |
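The Encode/Decode snippets referenced above were not captured in this copy of the thread; as an illustration only, packing the stated ranges (vg_r in -1..1, vg in -3..3, vg_b in -1..1, giving 3·7·3 = 63 used values) could be done with plain mixed-radix arithmetic:

```c
/* Illustrative mixed-radix packing of the QOI_OP_LUMA1 ranges described
   above. This is a sketch of the idea, not the code from the comment. */
static int luma1_pack(int vg_r, int vg, int vg_b) {
    return ((vg_r + 1) * 7 + (vg + 3)) * 3 + (vg_b + 1);  /* 0..62 */
}
static void luma1_unpack(int v, int *vg_r, int *vg, int *vg_b) {
    *vg_b = v % 3 - 1;  v /= 3;
    *vg   = v % 7 - 3;  v /= 7;
    *vg_r = v - 1;
}
```

Non-power-of-2 radices are exactly what lets green get a wider range than red/blue inside the same 6 bits.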
In case anyone else is wondering, here's a link to delta7 code and benchmark numbers. |
Is this an explicit condition? I recall reading somewhere around here that an encoder should be free to make any kind of decision about how to encode the data as long as valid opcodes are used (an extreme case would be an encoder emitting |
Speaking of padding, wouldn't requiring the total size to be a multiple of 8 bytes zero padded be better than always outputting 7 zero bytes? Making the header 16 bytes should also eliminate unaligned loads and/or make handling slightly simpler (wouldn't it? I don't know much about the nitty gritty of optimising). If the above is true why not go one step further and require total size to be a multiple of 32 so SIMD can make some safe optimising assumptions (in particular for AVX2, but many SIMD are in this sort of range or variable which should still benefit)? |
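The amount of zero padding needed for the proposed alignment is a one-liner (a sketch; the multiple-of-8 value is the proposal's, not anything from the spec):

```c
/* Sketch: zero bytes needed to round a stream length up to the next
   multiple of 8, as suggested above. */
static unsigned padding_to_multiple_of_8(unsigned len) {
    return (8 - (len % 8)) % 8;
}
```

The same formula with 32 instead of 8 covers the AVX2 variant of the suggestion.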
There is nearly nothing that can be made SIMD about this format, other than maybe a memcpy. |
Technically correct. I'm hesitant to sacrifice one more bit of Or we switch the Edit: Narf. Specifying that an encoder must not produce consecutive |
Prime numbers are always aesthetic ;) Powers of two are useful for performance and for some ops it makes sense, IMO index and run are two ops where powers of two do not make sense. I'm experimenting with grouping run/index/small-ops into a single mask to optimise the opcode distribution. How does an index size of 47 tickle your fancy? |
I'm generally against introducing more opcodes if they do not increase compression by a lot. Without refactoring QOI to use some notion of blocks and/or more complicated encoding schemes, we will not beat a well optimized PNG anyway. QOI is all about simplicity with a reasonable compression rate. Thinking about the end-code some more: even if the padding is specified to be an otherwise unused 8-bit tag a decoder may still split QOI files too early if they just happen to end with a |
I think simplicity is maybe the wrong metric to aim for. You only implement the encoder/decoder some small number of times. Fast operations with decent compression seems like a better goal point. So whatever makes things go fast without hurting compression should probably be favored. Eg: using powers of 2 over prime numbers. |
If we go that route, we'll end up with PNG in the end, as you can always argue that something might improve the compression rate. Simplicity is nice because you can use QOI to explain to people how some basic compression works in general, and you can easily prove correctness of an implementation. I prefer simple and small libs with marginal drawbacks to more complex ones nowadays, because I can manage and maintain them. I don't wanna maintain a libpng, but libqoi is just so tiny, I can do that. |
You say that, but PNG does at least two things that are not helpful for compression ratio or decompression speed. Possibly more. It's really not so great, it's just what we all use at the moment.
I would expect some concrete evidence if someone was making a specific claim of increased compression or better decompression speed. The example of powers of two compared to primes is just an avenue to investigate, I don't think anyone was trying to advocate a specific spec change without evidence. |
I don't expect half of these to work well, the space just means they can be compared additively for bikeshedding. Index is pretty weak although I doubt it can be removed entirely as it's good at repeating patterns like simple textures (how common are they though, and aren't full index formats more suitable?). Setting the index size to 3 on a variant only increases the average filesize of the corpus from 453 to 475, if index was really pulling its weight I'd expect the filesize to balloon more than that.
px.v %prime instead of multiplying the components by small primes was an attempt to make something simple and fast. That it managed to not hurt compression as much as the smaller cache indicated it should have was a surprise. I'm not advocating for anything in particular, just picking a few things that look like low hanging fruit and testing them to make sure they're fit for purpose. |
Don't get me wrong, I like seeing all those ideas. I just wanted to emphasize again that trading simplicity for small gains in compression ratio is imho not the right choice for this codec. So adding more op-codes is a hard sell for me. If you find an alternative to |
My understanding is that this format / codec optimizes compression speed, simplicity (of algorithms, implementation and use) AND compression rate. The amount of interest, ports, uses and the number of tentative improvements after a week of being public clearly shows this optimization was useful to some people. |
Haven't been able to replace QOI_OP_INDEX, but have managed to get decent encode/decode/compression stats using a 32 value index. Details in the bikeshed, tl;dr adding some opcodes that work like LUMA but for different byte size outputs does this:
DIFF was removed in favour of a LUMA variant and INDEX was reduced to 31/32 values from the start of testing. This isn't necessarily optimal, at some point that might get revisited. |
Since I'm happy with the general direction of the experimental branch, I've merged it back into master. Still open for further changes of course. |
Just brainstorming, but would combining |
I tried a simple version of that, and wasn't able to find anything especially compelling. |
Yeah, I also tried that. The compression sizes were roughly the same (you have to remove some other ops to introduce N fused ops) but the compression speed also roughly got N times slower, where N is the size of the color cache (what the index indexes). |
You should still only need to do a single lookup, as long as you're using a hash function that only considers the upper 4/5/6 bits. Another idea: Make DIFFs relative to a simple prediction based on the last two pixels, something like |
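The exact formula at the end of that comment was not preserved in this copy; one plausible instance of a two-pixel predictor (purely illustrative, not the commenter's formula) is linear extrapolation clamped to byte range:

```c
#include <stdint.h>

/* Illustrative two-pixel predictor: linear extrapolation 2*p1 - p2,
   clamped to 0..255. DIFFs would then be taken against this prediction
   instead of against the previous pixel alone. */
static uint8_t predict2(uint8_t p1, uint8_t p2) {
    int v = 2 * (int)p1 - (int)p2;
    if (v < 0)   v = 0;
    if (v > 255) v = 255;
    return (uint8_t)v;
}
```

For a smooth gradient the prediction is exact and the stored diff collapses to zero, which is the hoped-for win.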
I tried optimizing master's The demo2e encoder is identical to the master encoder. On compression size, master/demo2e wins on photographic test images, demo10 wins on icons and pngimg, screenshots slightly favor master/demo2e. Roughly speaking. Edit: I've also added lz4d2e numbers, which wrap the demo2e implementation with an LZ4 compression after QOI encode and a LZ4 decompression before QOI decode. lz4d10 does likewise for demo10. The LZ4-enriched file format isn't quite right. Since QOI allows streaming encodes, LZ4+QOI should use LZ4's streaming mode instead of LZ4's block mode (for now, I've inserted an 8 byte uncompressed length after the 14 byte QOI header). But as a quick experiment to get ballpark compression numbers, LZ4's block mode was easiest to get some code running. Decode / encode speeds obviously take a hit. The compression gains are most impressive for
|
I've updated my most recent comment to add numbers from an LZ4 compression experiment. |
SIMD could maybe be used to encode animation: use the pixels of the previous frame as the source of previous pixel data and apply the same algorithm in the third dimension, frame by frame, as before pixel by pixel. To store such data for each pixel after the first frame, you could make a separate field with a fixed size (some magic number, maybe 16 or 64 bytes). After filling any one field in a chunk, the previous chunk closes and a new one starts. This is not very effective if the animation consists of a single rainbow pixel, but can probably work adequately in general cases. It's probably wise to use 0xffff_ffff as a marker for the end of the chunk, then some of the unfilled chunks can be compressed during finalization. Anyway, it looks like another format, like "QOA".
I think you are wrong about the uselessness of the alpha channel, because sometimes the alpha channel can store different data (elevation map, normal map, etc.), which should also be compressed intelligently. Maybe QOI_OP_LUMA can use 4 bits for primary colors and 2 bits for alpha? |
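The inter-frame idea above boils down to temporal prediction: diff each pixel against the same position in the previous frame rather than the previous pixel. A minimal sketch (per-channel bytes, wrapping diffs so the transform stays lossless):

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch: compute wrapping per-byte deltas between two frames. A run of
   zero deltas (static background) then compresses extremely well with
   the existing run-length opcode. */
static void frame_delta(const uint8_t *prev, const uint8_t *cur,
                        uint8_t *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = (uint8_t)(cur[i] - prev[i]);
}
```

Unlike the intra-frame ops, this loop has no serial dependency between pixels, which is where the SIMD appeal comes from.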
With all that we learned through the analysis and ideas of a lot of people here, I refined QOI quite a bit. More than I thought I would.
The current state is in the experimental branch.
First of all, benchmark results for the new test suite using `qoibench 1 images/ --nopng --onlytotals`:

As you can see, throughput improved a lot, as did the compression ratio for all files without an alpha channel (`icon_*/` and `pngimg/` suffered a bit, but the overall compression ratio for these files is already quite high; `textures_plants/` still saw improvements). For photos or photo-like images QOI now often beats libpng!

What changed? After I switched the tags for `QOI_RUN` (previously a 2-bit tag) and `QOI_GDIFF_16` (previously a 4-bit tag), I noticed that `QOI_GDIFF` covered almost all(!) cases that were previously encoded by `QOI_DIFF_16/24`. So... why not remove them?

(see the experimental file format documentation for the details)

That is, most tags are now 2-bit, while the run-length is limited to 62 and thus leaves some room for the two 8-bit `QOI_OP_RGB` and `QOI_OP_RGBA` tags. So QOI would be even simpler than before and (probably?) gain a lot more possibilities for performance improvements.

Yes, it means that a change in the alpha channel will always be encoded as a 5-byte `QOI_OP_RGBA`, but using the current test suite of images, this seems to be totally fine. The alpha channel is mostly either 255 or 0. The famous `dice.png` and FLIF's `fish.png` seem to be awfully "artificial" uses of PNG. (For comparison, in the experimental branch with the original tag layout and `QOI_DIFF_16/24` still present, the overall compression ratio was at 24.6% - but the win in simplicity and performance is imho worth this 1%.)

The hash function changed to the following:

This is seriously the best performing hash function I could find and I tried quite a few. This also ignores the alpha channel, making it even more of a second-class citizen.
You may not like it (and I'm truly sorry for all the work that would need to be done in existent implementations), but I strongly believe that this is The Right Thing To Do™.
Thoughts?