Some Invalid UTF-8 Sequences Cause Panic #22597
while (++s < send) {
while (LIKELY(state != 1) && ++s < send) {
[MEMO] We need to ensure that the `state` after the first transition isn't `1`. We expect the `state` to be a multiple of `19`, so referencing `PL_strict_utf8_dfa_tab` with a `state` of `1` is completely invalid.
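A minimal sketch of why a state of `1` cannot be used as an index, assuming the transition table is laid out as rows of 19 entries and looked up as `table[state + type]` (the row width comes from the comment above; the lookup form and all function names here are illustrative assumptions, not the perl source):

```c
#include <assert.h>
#include <stddef.h>

/* Row width taken from the comment above: valid states are multiples of 19. */
enum { NUM_CLASSES = 19 };

/* A state is usable as a row offset only if it is a multiple of the row width. */
static int is_row_offset(unsigned state)
{
    return state % NUM_CLASSES == 0;
}

/* Row that the lookup index (state + type) actually falls in. */
static unsigned aliased_row(unsigned state, unsigned type)
{
    assert(type < NUM_CLASSES);
    return (state + type) / NUM_CLASSES;
}

/* Column that the lookup index (state + type) actually falls in. */
static unsigned aliased_col(unsigned state, unsigned type)
{
    assert(type < NUM_CLASSES);
    return (state + type) % NUM_CLASSES;
}
```

Under these assumptions, a lookup with state `1` and type `2` reads row 0, column 3, which is an entry belonging to a different byte type, so the result is unrelated to the input that was actually seen.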
Added to the commit immediately preceding the commit which the poster states broke utf8n_to_uvchr, without adding code changes from that p.r.
You assert that the problem began in a460925 (@khwilliamson):
I wanted to explore this argument in a test-driven manner. So I checked out the commit immediately before that one:
Then configured (
I then ran that test program through the harness:
I then cherry-picked a460925 into that branch, rebuilt and re-tested.
So it appears that one of the two unit tests you are proposing to add would have PASSed both before the breaking commit and at the breaking commit. The breaking commit (a460925) appears to have broken only the code in your second unit test. Can you clarify? (My diagnostic branch can be found here.)
NOTE: If this pull request is accepted, the changes may be backported to maintenance releases for perl-5.36, perl-5.38 and perl-5.40.
@hiratara, if there is something specifically wrong with Encode, you should file a bug ticket in that upstream distribution's bug tracker: https://rt.cpan.org/Dist/Display.html?Name=Encode. That would enable @dankogai to make changes for Encode against any version of perl.
Thank you for your detailed investigation.
You are correct that the breaking commit (a460925) affects only the second unit test in my pull request. I have added detailed explanations to the test cases I included. Please refer to them for further clarification.
> if there is something specifically wrong with Encode, you should file a bug ticket in that upstream distribution's bug tracker

`Encode.pm` behaves unexpectedly in certain edge cases because of this issue, but I believe `Encode.pm` itself is not at fault. All the author of `Encode.pm` can do is avoid using `utf8n_to_uvchr` and implement their own version instead.
https://github.com/dankogai/p5-encode/blob/51e8cc56415253dfe27d69204b925b4df74b8a59/Encode.xs#L448
Furthermore, other modules besides `Encode.pm` may also be affected by this issue, so I think it's important to address it at the core level rather than filing separate bug reports for each module.
t/op/lex.t
Outdated
fresh_perl_like(
    qq(use utf8; \xC2\xE3\x81\x82),
    qr/^Malformed UTF-8 character:/,
    {stderr => 1},
    'Error handling for invalid UTF-8 sequences starting with leading bytes',
);
[MEMO] This test case did not fail even after commit a460925 because the leading byte is valid. Its `type` is `2`:
Line 6760 in 698a4c0
1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, /*C0-CF*/
and it transitions to `N1` correctly:
Line 6822 in 698a4c0
/*N0*/ 0, 1, N1, N2, N4, N7, N6, N3, N5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
I added this test case to ensure that my fix doesn't break other functionality.
t/op/lex.t
Outdated
fresh_perl_like(
    qq(use utf8; \xFF\xE3\x81\x82),
    qr/^Malformed UTF-8 character:/,
    {stderr => 1},
    'Error handling for invalid UTF-8 sequences starting with unassigned bytes',
);
[MEMO] This case reproduces the issue. Since the leading byte `\xFF` has `type == 1` and transitions to `state == 1`, the DFA should immediately terminate as a failure.
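The early-termination behavior described in these two memos can be sketched with a toy DFA. This is a simplified illustration, not perl's `PL_strict_utf8_dfa_tab`: it uses 6 hypothetical byte classes instead of 19, does not reject overlongs or surrogates, and every name in it is made up for the example. Like the real table, it uses `0` for accept, `1` for reject, and row offsets for all other states.

```c
#include <assert.h>
#include <stddef.h>

/* Toy DFA: 6 byte classes; non-terminal states are row offsets (multiples of 6). */
enum { NCLASS = 6 };
enum { ACC = 0, REJ = 1, N1 = 1 * NCLASS, N2 = 2 * NCLASS, N3 = 3 * NCLASS };

/* Hypothetical byte classes: 0 ASCII, 1 illegal, 2 continuation,
 * 3 two-byte lead, 4 three-byte lead, 5 four-byte lead. */
static unsigned byte_class(unsigned char b)
{
    if (b < 0x80) return 0;
    if (b < 0xC0) return 2;
    if (b >= 0xC2 && b <= 0xDF) return 3;
    if (b >= 0xE0 && b <= 0xEF) return 4;
    if (b >= 0xF0 && b <= 0xF4) return 5;
    return 1; /* C0, C1, F5..FF */
}

static const unsigned char dfa[4 * NCLASS] = {
    /* start */ ACC, REJ, REJ, N1,  N2,  N3,
    /* N1    */ REJ, REJ, ACC, REJ, REJ, REJ,
    /* N2    */ REJ, REJ, N1,  REJ, REJ, REJ,
    /* N3    */ REJ, REJ, N2,  REJ, REJ, REJ,
};

/* Length of the first character if it is well formed under the toy rules, else 0. */
static size_t first_char_len(const unsigned char *s, size_t len)
{
    const unsigned char *start = s;
    const unsigned char *send = s + len;
    unsigned state;

    assert(s != NULL);
    if (len == 0) return 0;

    state = dfa[byte_class(*s)];      /* first transition, from the start row */
    if (state == ACC) return 1;

    /* REJ is not a row offset, so the loop must stop before indexing with it. */
    while (state != REJ && ++s < send) {
        state = dfa[state + byte_class(*s)];
        if (state == ACC) return (size_t)(s - start) + 1;
    }
    return 0;                         /* rejected or truncated */
}
```

With this sketch, `\xFF\xE3\x81\x82` is rejected on the first byte and `\xC2\xE3\x81\x82` on the second, while the well-formed `\xE3\x81\x82` yields a length of 3. Dropping the `state != REJ` check would let the loop index the table with `1 + class`, which reads shifted entries of the start row.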
@@ -411,6 +411,16 @@ This prevents integer overflows when appending to a large C<SV> for
C<readpipe> aka C<qx//> and C<readline>.
L<https://www.perlmonks.org/?node_id=11161665>

=item *
commit message: the commit hashes won't be valid once the PR is merged (we rebase and merge)
This looks good to me; the commit messages need to indicate what is being changed. You could preface each title with `utf8n_to_uvchr:` and that would be good enough.
The tests belong instead in ext/XS-APItest/t/utf8.t. There already is a function accessible from that file. I see the tests for translating to a code point are minimal. The functions that check if a string is well-formed UTF-8 have extensive tests, and they don't have this bug. I would adapt some of those tests to work on this.
…8367f3f749f4715851c9ca86f9e
ae2d098
to
4a9fdaa
@jkeenan, I've updated the commit messages and removed the changes from I've force-pushed the updated commits to this branch and backed up the original branch here: https://github.com/hiratara/perl5/tree/tmp/panic-with-invalid-utf8-BK
One UTF-8 malformation is when the string has a start byte in it before the expected end of the character. This test file tested the case where the unexpected byte came in the final position. GH Perl#22597 found bugs where the unexpected byte came immediately after the first byte. This commit adds tests for unexpected bytes in all possible positions. If the fix for GH Perl#22597 is reverted, this new revised file has 1400 failures.
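The idea of testing an unexpected start byte in every position can be sketched as follows. This is an illustrative C analogue under simplified rules, not the test code in ext/XS-APItest/t/utf8.t; the function names are hypothetical and the length table ignores overlongs and surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sequence length implied by a lead byte, or 0 if it cannot start a character. */
static size_t expected_len(unsigned char b)
{
    if (b < 0x80) return 1;
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 0;
}

static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Index of the first non-continuation byte found before the expected end of
 * the character, or -1 if there is none. */
static int unexpected_start_pos(const unsigned char *s, size_t len)
{
    size_t need, i;

    assert(s != NULL && len > 0);
    need = expected_len(s[0]);
    for (i = 1; i < need && i < len; i++)
        if (!is_continuation(s[i])) return (int)i;
    return -1;
}

/* Overwrite each position after the first with start_byte, one at a time, and
 * count how many of those malformations are reported at the right index. */
static int count_detected(const unsigned char *good, size_t len,
                          unsigned char start_byte)
{
    unsigned char buf[4];
    int detected = 0;
    size_t pos;

    assert(len >= 2 && len <= sizeof buf);
    for (pos = 1; pos < len; pos++) {
        memcpy(buf, good, len);
        buf[pos] = start_byte;
        if (unexpected_start_pos(buf, len) == (int)pos) detected++;
    }
    return detected;
}
```

For a well-formed four-byte character, a correct checker should detect the malformation at all three positions, including the one immediately after the first byte that the earlier tests did not cover.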
To be clear, this PR adds tests for this #22646
Hello,
Since commit a460925, `utf8n_to_uvchr` has been broken, causing some peculiar behavior when handling invalid UTF-8 sequences. For example, invalid UTF-8 sequences cause a panic message:
This bug also affects `Encode.pm`:
This PR fixes those problems.