Some Invalid UTF-8 Sequences Cause Panic #22597

hiratara · 2024-09-15T12:33:12Z

Hello,

Since commit a460925, utf8n_to_uvchr has been broken, causing some peculiar behavior when handling invalid UTF-8 sequences.

For example, invalid UTF-8 sequences cause a panic message:

# This is expected
$ perl -E 'say "use strict; use warnings; use utf8; \"\xf4\xE3\x81\x82\""' | perl
Malformed UTF-8 character: \xf4\xe3\x81\x82 (unexpected non-continuation byte 0xe3, immediately after start byte 0xf4; need 4 bytes, got 1) at - line 1.
Malformed UTF-8 character (fatal) at - line 1.

# This isn't expected
$ perl -E 'say "use strict; use warnings; use utf8; \"\xff\xE3\x81\x82\""' | perl
panic: _force_out_malformed_utf8_message should be called only when there are errors found at - line 1.

This bug also affects Encode.pm:

# This is expected
$ perl -MEncode -E 'say encode "UTF-8", decode "UTF-8", "\xf4\xE3\x81\x82", Encode::FB_PERLQQ'
\xF4あ

# This isn't expected
$ perl -MEncode -E 'say encode "UTF-8", decode "UTF-8", "\xff\xE3\x81\x82", Encode::FB_PERLQQ'
\xFF\xE3\x81\x82

This PR fixes those problems.

This set of changes requires a perldelta entry, and it is included.
This set of changes requires a perldelta entry, and I need help writing it.
This set of changes does not require a perldelta entry.

hiratara · 2024-09-15T12:37:17Z

inline.h

-        while (++s < send) {
+        while (LIKELY(state != 1) && ++s < send) {


[MEMO] We need to ensure that the state after the first transition isn't 1. We expect the state to be a multiple of 19, so referencing PL_strict_utf8_dfa_tab with a state of 1 is completely invalid.

Added to the commit immediately preceding the commit which the poster states broke utf8n_to_uvchr, without adding code changes from that p.r.

jkeenan · 2024-09-15T17:11:58Z

You assert that the problem began in a460925 (@khwilliamson):

commit a460925186154b270b7a647a4f30b2f01fd97c4b
Author:     Karl Williamson <[email protected]>
AuthorDate: Thu May 13 10:50:23 2021
Commit:     Karl Williamson <[email protected]>
CommitDate: Tue Feb 1 22:33:48 2022

    Refactor utf8 to code point conversion

I wanted to explore this argument in a test-driven manner. So I checked out the commit immediately before that one:

commit c0e63b1341c14ebd84a23fdbba9759d3fd686498
Author:     Karl Williamson <[email protected]>
AuthorDate: Tue Feb 1 20:58:08 2022
Commit:     Karl Williamson <[email protected]>
CommitDate: Tue Feb 1 22:01:13 2022

    builtin.c: Fix up 730f927
    
    That commit got pushed without a needed fix.

Then configured (-des -Dusedevel) and built an ordinary perl at that point, then added the two unit tests from your pull request to t/op/lex.t.

$ git show | cat
commit 8fbf143586f593a5f599f1b633dc558b7269a3a7
Author: James E Keenan <[email protected]>
Date:   Sun Sep 15 12:51:20 2024 -0400

    Add 2 unit tests from pull request GH #22597
    
    Added to the commit immediately preceding the commit which the poster
    states broke utf8n_to_uvchr, without adding code changes from that p.r.

diff --git a/t/op/lex.t b/t/op/lex.t
index e78fad2c42..a5703fc53e 100644
--- a/t/op/lex.t
+++ b/t/op/lex.t
@@ -7,7 +7,8 @@ use warnings;
 
 BEGIN { chdir 't' if -d 't'; require './test.pl'; }
 
-plan(tests => 36);
+#plan(tests => 36);
+plan(tests => 38);
 
 {
     print <<'';   # Yow!
@@ -284,3 +285,20 @@ EOM
 
 fresh_perl_like('flock  _$', qr/Not enough arguments for flock/, {stderr => 1},
                 "[perl #129190] intuit_method() invalidates PL_bufptr");
+
+# test-driven development; first, add the tests from GH #22597
+
+
+fresh_perl_like(
+    qq(use utf8; \xC2\xE3\x81\x82),
+    qr/^Malformed UTF-8 character:/,
+    {stderr => 1},
+    'Error handling for invalid UTF-8 sequences starting with leading bytes',
+);
+
+fresh_perl_like(
+    qq(use utf8; \xFF\xE3\x81\x82),
+    qr/^Malformed UTF-8 character:/,
+    {stderr => 1},
+    'Error handling for invalid UTF-8 sequences starting with unassigned bytes',
+);

I then ran that test program through the harness:

$ cd t;./perl harness -v op/lex.t; cd -

ok 1
ok 2
...
ok 37 - Error handling for invalid UTF-8 sequences starting with leading bytes
ok 38 - Error handling for invalid UTF-8 sequences starting with unassigned bytes
ok
All tests successful.
Files=1, Tests=38,  0 wallclock secs ( 0.02 usr  0.00 sys +  0.04 cusr  0.09 csys =  0.15 CPU)
Result: PASS
/home/jkeenan/gitwork/perl

I then cherry-picked a460925 into that branch, rebuilt and re-tested.

$ cd t;./perl harness -v op/lex.t; cd -

ok 1
ok 2
...
ok 37 - Error handling for invalid UTF-8 sequences starting with leading bytes
not ok 38 - Error handling for invalid UTF-8 sequences starting with unassigned bytes
# Failed test 38 - Error handling for invalid UTF-8 sequences starting with unassigned bytes at ./test.pl line 1088
#      got 'panic: _force_out_malformed_utf8_message should be called only when there are errors found at - line 1.'
# expected /(?^:^Malformed UTF-8 character:)/
# PROG: 
# use utf8; �あ
# STATUS: 65280
Failed 1/38 subtests 

Test Summary Report
-------------------
op/lex.t (Wstat: 0 Tests: 38 Failed: 1)
  Failed test:  38
Files=1, Tests=38,  0 wallclock secs ( 0.02 usr  0.00 sys +  0.05 cusr  0.08 csys =  0.15 CPU)
Result: FAIL

So it appears that one of the two unit tests you are proposing to add would have PASSed both before the breaking commit and at the breaking commit. The breaking commit (a460925) appears to have broken only the code in your second unit test.

Can you clarify?

(My diagnostic branch can be found here.)

NOTE: If this pull request is accepted, the changes may be backported to maintenance releases for perl-5.36, perl-5.38 and perl-5.40.

jkeenan · 2024-09-15T17:16:12Z

@hiratara, if there is something specifically wrong with Encode, you should file a bug ticket in that upstream distribution's bug tracker: https://rt.cpan.org/Dist/Display.html?Name=Encode. That would enable @dankogai to make changes for Encode against any version of perl.

hiratara

@jkeenan

Thank you for your detailed investigation.

You are correct that the breaking commit (a460925) affects only the second unit test in my pull request. I have added detailed explanations to the test cases I included. Please refer to them for further clarification.

if there is something specifically wrong with Encode, you should file a bug ticket in that upstream distribution's bug tracker

Encode.pm behaves unexpectedly in certain edge cases because of this issue, but I believe Encode.pm itself is not at fault. All the author of Encode.pm can do is avoid using utf8n_to_uvchr and implement their own version instead.

https://github.com/dankogai/p5-encode/blob/51e8cc56415253dfe27d69204b925b4df74b8a59/Encode.xs#L448

Furthermore, other modules besides Encode.pm may also be affected by this issue, so I think it's important to address it at the core level rather than filing separate bug reports for each module.

hiratara · 2024-09-15T22:34:43Z

t/op/lex.t

+fresh_perl_like(
+    qq(use utf8; \xC2\xE3\x81\x82),
+    qr/^Malformed UTF-8 character:/,
+    {stderr => 1},
+    'Error handling for invalid UTF-8 sequences starting with leading bytes',
+);


[MEMO] This test case did not fail even after commit a460925 because the leading byte is valid.

Its type is 2:

perl5/perl.h

Line 6760 in 698a4c0

1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, /*C0-CF*/

and it transitions to N1 correctly:

perl5/perl.h

Line 6822 in 698a4c0

/*N0*/ 0, 1, N1, N2, N4, N7, N6, N3, N5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

I added this test case to ensure that my fix doesn't break other functionalities.

hiratara · 2024-09-15T22:37:43Z

t/op/lex.t

+fresh_perl_like(
+    qq(use utf8; \xFF\xE3\x81\x82),
+    qr/^Malformed UTF-8 character:/,
+    {stderr => 1},
+    'Error handling for invalid UTF-8 sequences starting with unassigned bytes',
+);


[MEMO] This case reproduces the issue.

Since the leading byte \xFF has type == 1 and transitions to state == 1, the DFA should immediately terminate as a failure.

tonycoz · 2024-09-15T23:55:06Z

pod/perldelta.pod

@@ -411,6 +411,16 @@ This prevents integer overflows when appending to a large C<SV> for
 C<readpipe> aka C<qx//> and C<readline>.
 L<https://www.perlmonks.org/?node_id=11161665>

+=item *


commit message: the commit hashes won't be valid once the PR is merged (we rebase and merge)

khwilliamson

This looks good to me; the commit messages need to indicate what is being changed. You could preface each title with utf8n_to_uvchr: and that would be good enough

khwilliamson · 2024-09-18T00:10:29Z

The tests belong instead in ext/XS-APItest/t/utf8.t. There already is a function accessible from that file test_utf8n_to_uvchr_error() that allows for arbitrary strings to be tested. If you don't feel like doing this, I will do so, so just pull out that portion of your commits.

I see the tests for translating to code point are minimal.. The functions that check if a string is well-formed UTF-8 have extensive tests, and they don't have this bug. I would adapt some of those tests to work on this.

…re code

…8367f3f749f4715851c9ca86f9e

hiratara · 2024-09-20T00:58:12Z

@jkeenan, I've updated the commit messages and removed the changes from t/op/lex.t. I attempted to add tests to ext/XS-APItest/t/utf8.t, but found it too complicated for me to handle. Could you please take care of adding the test cases instead?

I've force-pushed the updated commits to this branch and backed up the original branch here: https://github.com/hiratara/perl5/tree/tmp/panic-with-invalid-utf8-BK

One UTF-8 malformation is when the string has a start byte in it before the expected end of the character. This test file tested the case where the unexpected byte came in the final position. GH Perl#22597 found bugs where the undexpected byte came immediately after the first byte. This commit adds tests for unexpected bytes in all possible positions. If the fix for GH Perl#22597 is reverted, this new revised file has 1400 failures.

khwilliamson · 2024-10-08T18:50:44Z

To be clear, this PR adds tests for this #22646

hiratara commented Sep 15, 2024

View reviewed changes

jkeenan added a commit to jkeenan/perl5 that referenced this pull request Sep 15, 2024

Add 2 unit tests from pull request GH Perl#22597

8fbf143

Added to the commit immediately preceding the commit which the poster states broke utf8n_to_uvchr, without adding code changes from that p.r.

jkeenan added the Bug label Sep 15, 2024

hiratara commented Sep 15, 2024

View reviewed changes

tonycoz reviewed Sep 15, 2024

View reviewed changes

tonycoz requested a review from khwilliamson September 16, 2024 00:01

khwilliamson reviewed Sep 16, 2024

View reviewed changes

hiratara added 4 commits September 20, 2024 09:20

utf8n_to_uvchr: check the state of DFA after the first transition

f2bf404

utf8n_to_uvchr: consolidate condition checks within while loop to sha…

afc13b2

…re code

perldelta for 5f14d2d3e828d17ab57c1a8cd1bae994bdce8fba, e294b95a336db…

1cb16ed

…8367f3f749f4715851c9ca86f9e

chore: run Porting/updateAUTHORS.pl

4a9fdaa

hiratara force-pushed the panic-with-invalid-utf8 branch from ae2d098 to 4a9fdaa Compare September 20, 2024 00:20

khwilliamson merged commit 29501f0 into Perl:blead Sep 20, 2024
33 checks passed

hiratara mentioned this pull request Oct 3, 2024

Request patch for malformed UTF-8 detection #22630

Merged

khwilliamson mentioned this pull request Oct 9, 2024

APItest/t/utf8_warn_base: Add tests #22646

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Some Invalid UTF-8 Sequences Cause Panic #22597

Some Invalid UTF-8 Sequences Cause Panic #22597

hiratara commented Sep 15, 2024 •

edited

Loading

hiratara Sep 15, 2024 •

edited

Loading

jkeenan commented Sep 15, 2024

jkeenan commented Sep 15, 2024

hiratara left a comment •

edited

Loading

hiratara Sep 15, 2024 •

edited

Loading

hiratara Sep 15, 2024 •

edited

Loading

tonycoz Sep 15, 2024

khwilliamson left a comment •

edited

Loading

khwilliamson commented Sep 18, 2024

hiratara commented Sep 20, 2024

khwilliamson commented Oct 8, 2024

		while (++s < send) {
		while (LIKELY(state != 1) && ++s < send) {

Some Invalid UTF-8 Sequences Cause Panic #22597

Some Invalid UTF-8 Sequences Cause Panic #22597

Conversation

hiratara commented Sep 15, 2024 • edited Loading

hiratara Sep 15, 2024 • edited Loading

Choose a reason for hiding this comment

jkeenan commented Sep 15, 2024

jkeenan commented Sep 15, 2024

hiratara left a comment • edited Loading

Choose a reason for hiding this comment

hiratara Sep 15, 2024 • edited Loading

Choose a reason for hiding this comment

hiratara Sep 15, 2024 • edited Loading

Choose a reason for hiding this comment

tonycoz Sep 15, 2024

Choose a reason for hiding this comment

khwilliamson left a comment • edited Loading

Choose a reason for hiding this comment

khwilliamson commented Sep 18, 2024

hiratara commented Sep 20, 2024

khwilliamson commented Oct 8, 2024

hiratara commented Sep 15, 2024 •

edited

Loading

hiratara Sep 15, 2024 •

edited

Loading

hiratara left a comment •

edited

Loading

hiratara Sep 15, 2024 •

edited

Loading

hiratara Sep 15, 2024 •

edited

Loading

khwilliamson left a comment •

edited

Loading