Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some Invalid UTF-8 Sequences Cause Panic #22597

Merged
merged 4 commits into from
Sep 20, 2024

Conversation

hiratara
Copy link
Contributor

@hiratara hiratara commented Sep 15, 2024

Hello,

Since commit a460925, utf8n_to_uvchr has been broken, causing some peculiar behavior when handling invalid UTF-8 sequences.

For example, invalid UTF-8 sequences cause a panic message:

# This is expected
$ perl -E 'say "use strict; use warnings; use utf8; \"\xf4\xE3\x81\x82\""' | perl
Malformed UTF-8 character: \xf4\xe3\x81\x82 (unexpected non-continuation byte 0xe3, immediately after start byte 0xf4; need 4 bytes, got 1) at - line 1.
Malformed UTF-8 character (fatal) at - line 1.

# This isn't expected
$ perl -E 'say "use strict; use warnings; use utf8; \"\xff\xE3\x81\x82\""' | perl
panic: _force_out_malformed_utf8_message should be called only when there are errors found at - line 1.

This bug also affects Encode.pm:

# This is expected
$ perl -MEncode -E 'say encode "UTF-8", decode "UTF-8", "\xf4\xE3\x81\x82", Encode::FB_PERLQQ'
\xF4あ

# This isn't expected
$ perl -MEncode -E 'say encode "UTF-8", decode "UTF-8", "\xff\xE3\x81\x82", Encode::FB_PERLQQ'
\xFF\xE3\x81\x82

This PR fixes those problems.

  • This set of changes requires a perldelta entry, and it is included.
  • This set of changes requires a perldelta entry, and I need help writing it.
  • This set of changes does not require a perldelta entry.

while (++s < send) {
while (LIKELY(state != 1) && ++s < send) {
Copy link
Contributor Author

@hiratara hiratara Sep 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MEMO] We need to ensure that the state after the first transition isn't 1. We expect the state to be a multiple of 19, so referencing PL_strict_utf8_dfa_tab with a state of 1 is completely invalid.

jkeenan added a commit to jkeenan/perl5 that referenced this pull request Sep 15, 2024
Added to the commit immediately preceding the commit which the poster
states broke utf8n_to_uvchr, without adding code changes from that p.r.
@jkeenan
Copy link
Contributor

jkeenan commented Sep 15, 2024

You assert that the problem began in a460925 (@khwilliamson):

commit a460925186154b270b7a647a4f30b2f01fd97c4b
Author:     Karl Williamson <[email protected]>
AuthorDate: Thu May 13 10:50:23 2021
Commit:     Karl Williamson <[email protected]>
CommitDate: Tue Feb 1 22:33:48 2022

    Refactor utf8 to code point conversion

I wanted to explore this argument in a test-driven manner. So I checked out the commit immediately before that one:

commit c0e63b1341c14ebd84a23fdbba9759d3fd686498
Author:     Karl Williamson <[email protected]>
AuthorDate: Tue Feb 1 20:58:08 2022
Commit:     Karl Williamson <[email protected]>
CommitDate: Tue Feb 1 22:01:13 2022

    builtin.c: Fix up 730f927
    
    That commit got pushed without a needed fix.

Then configured (-des -Dusedevel) and built an ordinary perl at that point, then added the two unit tests from your pull request to t/op/lex.t.

$ git show | cat
commit 8fbf143586f593a5f599f1b633dc558b7269a3a7
Author: James E Keenan <[email protected]>
Date:   Sun Sep 15 12:51:20 2024 -0400

    Add 2 unit tests from pull request GH #22597
    
    Added to the commit immediately preceding the commit which the poster
    states broke utf8n_to_uvchr, without adding code changes from that p.r.

diff --git a/t/op/lex.t b/t/op/lex.t
index e78fad2c42..a5703fc53e 100644
--- a/t/op/lex.t
+++ b/t/op/lex.t
@@ -7,7 +7,8 @@ use warnings;
 
 BEGIN { chdir 't' if -d 't'; require './test.pl'; }
 
-plan(tests => 36);
+#plan(tests => 36);
+plan(tests => 38);
 
 {
     print <<'';   # Yow!
@@ -284,3 +285,20 @@ EOM
 
 fresh_perl_like('flock  _$', qr/Not enough arguments for flock/, {stderr => 1},
                 "[perl #129190] intuit_method() invalidates PL_bufptr");
+
+# test-driven development; first, add the tests from GH #22597
+
+
+fresh_perl_like(
+    qq(use utf8; \xC2\xE3\x81\x82),
+    qr/^Malformed UTF-8 character:/,
+    {stderr => 1},
+    'Error handling for invalid UTF-8 sequences starting with leading bytes',
+);
+
+fresh_perl_like(
+    qq(use utf8; \xFF\xE3\x81\x82),
+    qr/^Malformed UTF-8 character:/,
+    {stderr => 1},
+    'Error handling for invalid UTF-8 sequences starting with unassigned bytes',
+);

I then ran that test program through the harness:

$ cd t;./perl harness -v op/lex.t; cd -

ok 1
ok 2
...
ok 37 - Error handling for invalid UTF-8 sequences starting with leading bytes
ok 38 - Error handling for invalid UTF-8 sequences starting with unassigned bytes
ok
All tests successful.
Files=1, Tests=38,  0 wallclock secs ( 0.02 usr  0.00 sys +  0.04 cusr  0.09 csys =  0.15 CPU)
Result: PASS
/home/jkeenan/gitwork/perl

I then cherry-picked a460925 into that branch, rebuilt and re-tested.

$ cd t;./perl harness -v op/lex.t; cd -

ok 1
ok 2
...
ok 37 - Error handling for invalid UTF-8 sequences starting with leading bytes
not ok 38 - Error handling for invalid UTF-8 sequences starting with unassigned bytes
# Failed test 38 - Error handling for invalid UTF-8 sequences starting with unassigned bytes at ./test.pl line 1088
#      got 'panic: _force_out_malformed_utf8_message should be called only when there are errors found at - line 1.'
# expected /(?^:^Malformed UTF-8 character:)/
# PROG: 
# use utf8; �あ
# STATUS: 65280
Failed 1/38 subtests 

Test Summary Report
-------------------
op/lex.t (Wstat: 0 Tests: 38 Failed: 1)
  Failed test:  38
Files=1, Tests=38,  0 wallclock secs ( 0.02 usr  0.00 sys +  0.05 cusr  0.08 csys =  0.15 CPU)
Result: FAIL

So it appears that one of the two unit tests you are proposing to add would have PASSed both before the breaking commit and at the breaking commit. The breaking commit (a460925) appears to have broken only the code in your second unit test.

Can you clarify?

(My diagnostic branch can be found here.)

NOTE: If this pull request is accepted, the changes may be backported to maintenance releases for perl-5.36, perl-5.38 and perl-5.40.

@jkeenan jkeenan added the Bug label Sep 15, 2024
@jkeenan
Copy link
Contributor

jkeenan commented Sep 15, 2024

@hiratara, if there is something specifically wrong with Encode, you should file a bug ticket in that upstream distribution's bug tracker: https://rt.cpan.org/Dist/Display.html?Name=Encode. That would enable @dankogai to make changes for Encode against any version of perl.

Copy link
Contributor Author

@hiratara hiratara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkeenan

Thank you for your detailed investigation.

You are correct that the breaking commit (a460925) affects only the second unit test in my pull request. I have added detailed explanations to the test cases I included. Please refer to them for further clarification.

if there is something specifically wrong with Encode, you should file a bug ticket in that upstream distribution's bug tracker

Encode.pm behaves unexpectedly in certain edge cases because of this issue, but I believe Encode.pm itself is not at fault. All the author of Encode.pm can do is avoid using utf8n_to_uvchr and implement their own version instead.

https://github.com/dankogai/p5-encode/blob/51e8cc56415253dfe27d69204b925b4df74b8a59/Encode.xs#L448

Furthermore, other modules besides Encode.pm may also be affected by this issue, so I think it's important to address it at the core level rather than filing separate bug reports for each module.

t/op/lex.t Outdated
Comment on lines 403 to 408
fresh_perl_like(
qq(use utf8; \xC2\xE3\x81\x82),
qr/^Malformed UTF-8 character:/,
{stderr => 1},
'Error handling for invalid UTF-8 sequences starting with leading bytes',
);
Copy link
Contributor Author

@hiratara hiratara Sep 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MEMO] This test case did not fail even after commit a460925 because the leading byte is valid.

Its type is 2:

perl5/perl.h

Line 6760 in 698a4c0

1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, /*C0-CF*/

and it transitions to N1 correctly:

perl5/perl.h

Line 6822 in 698a4c0

/*N0*/ 0, 1, N1, N2, N4, N7, N6, N3, N5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

I added this test case to ensure that my fix doesn't break other functionalities.

t/op/lex.t Outdated
Comment on lines 410 to 415
fresh_perl_like(
qq(use utf8; \xFF\xE3\x81\x82),
qr/^Malformed UTF-8 character:/,
{stderr => 1},
'Error handling for invalid UTF-8 sequences starting with unassigned bytes',
);
Copy link
Contributor Author

@hiratara hiratara Sep 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MEMO] This case reproduces the issue.

Since the leading byte \xFF has type == 1 and transitions to state == 1, the DFA should immediately terminate as a failure.

@@ -411,6 +411,16 @@ This prevents integer overflows when appending to a large C<SV> for
C<readpipe> aka C<qx//> and C<readline>.
L<https://www.perlmonks.org/?node_id=11161665>

=item *
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commit message: the commit hashes won't be valid once the PR is merged (we rebase and merge)

Copy link
Contributor

@khwilliamson khwilliamson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me; the commit messages need to indicate what is being changed. You could preface each title with utf8n_to_uvchr: and that would be good enough

@khwilliamson
Copy link
Contributor

The tests belong instead in ext/XS-APItest/t/utf8.t. There already is a function accessible from that file test_utf8n_to_uvchr_error() that allows for arbitrary strings to be tested. If you don't feel like doing this, I will do so, so just pull out that portion of your commits.

I see the tests for translating to code point are minimal.. The functions that check if a string is well-formed UTF-8 have extensive tests, and they don't have this bug. I would adapt some of those tests to work on this.

@hiratara
Copy link
Contributor Author

@jkeenan, I've updated the commit messages and removed the changes from t/op/lex.t. I attempted to add tests to ext/XS-APItest/t/utf8.t, but found it too complicated for me to handle. Could you please take care of adding the test cases instead?

I've force-pushed the updated commits to this branch and backed up the original branch here: https://github.com/hiratara/perl5/tree/tmp/panic-with-invalid-utf8-BK

@khwilliamson khwilliamson merged commit 29501f0 into Perl:blead Sep 20, 2024
33 checks passed
khwilliamson added a commit to khwilliamson/perl5 that referenced this pull request Oct 8, 2024
One UTF-8 malformation is when the string has a start byte in it before
the expected end of the character.  This test file tested the case where
the unexpected byte came in the final position.  GH Perl#22597 found bugs
where the undexpected byte came immediately after the first byte.

This commit adds tests for unexpected bytes in all possible positions.
If the fix for GH Perl#22597 is reverted, this new revised file has 1400
failures.
@khwilliamson
Copy link
Contributor

To be clear, this PR adds tests for this #22646

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants