python-zstandard produces incomplete data after decompress #196
I need to verify this with the supplied sample data, but my knee-jerk reaction is that it has something to do with zstd frames. Various APIs in this library stop reading at frame boundaries by default. In hindsight this was a bad default behavior, and the release notes have called out a desire to change it so that multiple frames are read transparently. Or you found a legitimate issue where we fail to handle a special case where the number of bytes read falls directly on a power-of-2 boundary. 1048576 (2^20) bytes decompressed is very suspicious!
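For context, every non-skippable zstd frame begins with the 4-byte magic number 0xFD2FB528, stored little-endian. Below is a naive, stdlib-only sketch that counts frame starts in a payload. It is an illustration, not a real parser: those four bytes can in principle also occur inside compressed data, so real code must walk the frame headers instead of scanning for the magic.

```python
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # zstd frame magic 0xFD2FB528, little-endian

def count_frame_starts(payload: bytes) -> int:
    """Naively count occurrences of the zstd frame magic.

    Illustration only: the magic bytes can also appear inside
    compressed data, so a robust parser must follow frame headers.
    """
    count = 0
    pos = payload.find(ZSTD_MAGIC)
    while pos != -1:
        count += 1
        pos = payload.find(ZSTD_MAGIC, pos + 1)
    return count
```

A payload with more than one frame start is exactly the situation where stopping at the first frame boundary silently truncates the output.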
Thanks for taking a look at this @indygreg. I've posted a $300 bounty on the upstream issue, since this bug affects data integrity for urllib3 users. If you find the root cause and publish a fix for this release, you can submit a claim for that bounty. Thanks again!
My current theory is that python-zstandard's DecompressionObj_decompress() function is stopping after the first frame. I sketched sample C code that seems to match what that function is doing:
This results in a C buffer with 1048576 bytes because ZSTD_decompressStream() is only expected to decompress one frame. I'm still reading up on the zstd APIs to try to understand what it looks like to move on to another frame.
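The "stop after one frame" behavior has a close analogue in the Python standard library: a bz2 (or zlib/lzma) decompressor object stops at the end of its stream and parks the remaining bytes in `unused_data`, and moving on to the next frame means feeding those bytes to a fresh decompressor. A stdlib-only sketch, using bz2 streams as a stand-in for zstd frames:

```python
import bz2

# Two concatenated streams stand in for a multi-frame zstd payload.
payload = bz2.compress(b"frame one ") + bz2.compress(b"frame two")

d = bz2.BZ2Decompressor()
first = d.decompress(payload)

print(first)                # b'frame one '  -- only the first stream came out
print(d.eof)                # True: this decompressor object is finished
print(bool(d.unused_data))  # True: the second stream is parked here
```

Feeding `d.unused_data` to a new `BZ2Decompressor` yields the second stream; a multi-frame zstd loop would follow the same shape.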
Yeah, this seems to be the correct explanation. The sample data has multiple frames and python-zstandard stops after the first frame. I have two main follow-up questions:
OK, so reading through the documentation and other older bug reports, it appears that I should be using ZstdCompressor().stream_reader() to get support for reading across frames. I will talk to the urllib3 maintainers about switching to that API.
Sorry, one typo. I meant to say that I should be using ZstdDecompressor().stream_reader(). |
I plan to change the API to decompress across frames by default. As this is technically an API-breaking change, it will require a new 0.x version instead of a 0.x.y point release.
Actually, reading the docs for ….

As the existing behavior was explicitly documented, following prior conventions of this library, I'm not comfortable making a breaking change in the next 0.x release. The best we can do is add an off-by-default flag to allow reading across frames. I think the overhead for calling … is low. Would urllib3 be comfortable adapting their usage to instantiate multiple …?
We would be very comfortable adapting our usage. By the way, this is ironic, because urllib3 2.0 broke users who did not read the documentation for 1.x, so we know that nobody reads the documentation, especially the documentation for ….
@grossag Do you have everything you need to fix this bug on the urllib3 side?
Thanks for the response! I have been reading through zstd and python-zstandard code for a couple of days now and have some questions about the usefulness of APIs that read a single frame. Overall, I don't understand the use of any API that only supports decoding a single frame, requires the caller to tell it how much to read, and doesn't tell you how much it read. A single frame in and of itself is not useful unless I can call the same API again to read the next frame. So I think there is a really strong argument for the one-shot API reading across frames, because otherwise I can't really see any use for it: it doesn't tell you how much it read, so it's not like I can move forward in my source bytearray and call the API again to read the next frame. I would understand if there were a version of ….
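For what it's worth, the multi-frame one-shot behavior argued for here already has precedent in the Python standard library: `bz2.decompress()` and `lzma.decompress()` are documented to transparently decode a payload made of multiple concatenated streams. A quick stdlib-only illustration:

```python
import bz2
import lzma

# Two independent streams concatenated back to back -- the same
# shape as a multi-frame zstd payload.
payload = bz2.compress(b"hello ") + bz2.compress(b"world")
print(bz2.decompress(payload))      # b'hello world' -- reads across streams

payload_xz = lzma.compress(b"foo") + lzma.compress(b"bar")
print(lzma.decompress(payload_xz))  # b'foobar'
```

So a one-shot API decoding across frames would match what stdlib users already expect from the other compression modules.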
It appears that I can use ….

If ….

The only object I can find that supports reading across frames is ….
Part of this is due to bad design in this library. In my defense, when I started this library in 2016, when zstandard was relatively young, you rarely saw zstd payloads with multiple frames. In the years since, zstandard has skyrocketed in popularity and has built out new features, such as adaptive compression. Its APIs have also evolved, making writing multiple frames a bit simpler.

Part of this is also due to bad design of the (de)compression APIs in the Python standard library. The only reason that …. The ….

Anyway, you aren't the only ones complaining about the lack of multi-frame support. It is clear we need to evolve this library to support a now-common requirement in 2023. The goalposts moved and we need to adapt. This will be the major issue I focus on the next time I sit down to hack on this project.
All APIs need to gain the ability to transparently read across frames. This commit teaches the stdlib decompressionobj interface to transparently read across frames. Behavior is controlled via a boolean named argument. It defaults to False to preserve existing behavior. Related to #196. I haven't tested this very thoroughly. But the added fuzzing test not failing is usually a good indicator that this works as intended.
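The behavior this commit message describes can be sketched with a stdlib stand-in. The `read_across_frames` name is taken from the commit's description of a boolean named argument defaulting to False; bz2 streams stand in for zstd frames, and this wrapper is a hypothetical illustration, not python-zstandard's actual implementation:

```python
import bz2

def decompress(payload: bytes, read_across_frames: bool = False) -> bytes:
    """Hypothetical sketch of the flag described in the commit message.

    False keeps the old stop-at-first-frame behavior; True loops,
    handing leftover bytes to a fresh decompressor per frame.
    """
    chunks = []
    while payload:
        d = bz2.BZ2Decompressor()
        chunks.append(d.decompress(payload))
        if not read_across_frames or not d.eof:
            break  # old behavior, or a truncated final stream
        payload = d.unused_data
    return b"".join(chunks)

blob = bz2.compress(b"one ") + bz2.compress(b"two")
```

Defaulting the flag to False preserves backward compatibility, which is why the change can land in a point release rather than waiting for a breaking 0.x bump.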
FYI, in urllib3 a PR has been merged using …. Moving forward, I personally believe the ….
Thanks for the response, @indygreg! As @Rogdham mentioned, he was able to put together a solution that worked well in the context of the existing urllib3 design. I had put together urllib3/urllib3#3021, but the ….

So anyway, it seems like there are still many ways to use your APIs to decode data with multiple frames. It would be great to get the one-shot API to support multiple frames, but in the meantime, urllib3 has a solution to its issue. Thanks for your support!
I added support for reading across frames.
I'm really sorry it took me so long to get to this. I gave it a test today by cloning urllib3 and python-zstandard, reverting urllib3/urllib3@aca0f01, and adding …. Overall, I didn't get any truncated data, but urllib3 was broken in a small way and I have a question for you about the expected behavior. Here is the test urllib3 code:
In my tests, I confirmed that I had no truncation and everything worked correctly if I commented out the two lines in ….
I originally filed this against urllib3 in urllib3/urllib3#3008 but the investigation so far is showing that urllib3 is providing the correct data to python-zstandard.
Environment info
OS: Windows-10-10.0.22621-SP0
Python: 3.10.11
urllib3: 2.0.1
pip freeze results:
appdirs==1.4.4
certifi==2022.9.24
charset-normalizer==2.1.1
colorama==0.4.6
flake8==5.0.4
idna==3.4
mccabe==0.7.0
numpy==1.23.4
pycodestyle==2.9.1
pydiffx==1.1
pyflakes==2.5.0
python-gitlab==3.12.0
RBTools==4.0
requests==2.28.1
requests-toolbelt==0.10.1
six==1.16.0
texttable==1.6.7
tqdm==4.64.1
typing_extensions==4.4.0
urllib3==2.0.1
zstandard==0.21.0
Sample code
Example source data to decompress
I have attached a zip of text.txt.zstd because GitHub doesn't support attaching .zstd files. The file text.txt.zstd contains the raw data passed to ZstdDecoder.decompress(). Extract text.txt.zstd from the zip, then run:
My output is:
Whereas downloading zstd 1.5.5 and running
zstd.exe text.txt.zstd -d -o hi.txt
produces 8724469 bytes and a valid text file.