Adopt GB18030-2022 into GBK more fully
https://bugs.webkit.org/show_bug.cgi?id=279903

Reviewed by Alex Christensen.

GBK and gb18030 have shared the same backing table for quite a while
now. When that table was updated to account for GB18030-2022, GBK was
affected as well.

However, the encoder-side table (gb18030AsymmetricEncode) was kept
disabled for GBK, even though enabling it lets GBK keep encoding the
former private-use code points, such as U+E81E, to the same two-byte
sequences as before.

whatwg/encoding#336 now standardizes that GBK and gb18030 remain
aligned in these matters, and this change implements that.

The corresponding tests are from this PR:
web-platform-tests/wpt#48240

* LayoutTests/imported/w3c/web-platform-tests/encoding/legacy-mb-schinese/gbk/gbk-decoder.any.js:
* LayoutTests/imported/w3c/web-platform-tests/encoding/legacy-mb-schinese/gbk/gbk-encoder-expected.txt:
* LayoutTests/imported/w3c/web-platform-tests/encoding/legacy-mb-schinese/gbk/gbk-encoder.html:
* Source/WebCore/PAL/pal/text/TextCodecCJK.cpp:
(PAL::gb18030AsymmetricEncode):
(PAL::gbEncodeShared):

Canonical link: https://commits.webkit.org/283987@main
annevk committed Sep 20, 2024
1 parent c9bd63f commit c8a6878
Showing 4 changed files with 84 additions and 7 deletions.
LayoutTests/imported/w3c/web-platform-tests/encoding/legacy-mb-schinese/gbk/gbk-decoder.any.js
@@ -1,3 +1,5 @@
// Additional tests can be found in ../gb18030/gb18030-decoder.any.js

const gbkPointers = [
6432, 7533, 7536, 7672, 7673, 7674, 7675, 7676, 7677, 7678, 7679, 7680, 7681, 7682, 7683, 7684,
23766, 23770, 23771, 23772, 23773, 23774, 23776, 23777, 23778, 23779, 23780, 23781, 23782, 23784, 23785, 23786,
LayoutTests/imported/w3c/web-platform-tests/encoding/legacy-mb-schinese/gbk/gbk-encoder-expected.txt
@@ -11,4 +11,41 @@ PASS gbk encoder: legacy ICU special case 3
PASS gbk encoder: legacy WebKit case 1
PASS gbk encoder: legacy WebKit case 2
PASS gbk encoder: legacy WebKit case 3
PASS gbk encoder: U+10FFFF
PASS gbk encoder: GB18030-2022 1
PASS gbk encoder: GB18030-2022 2
PASS gbk encoder: GB18030-2022 3
PASS gbk encoder: GB18030-2022 4
PASS gbk encoder: GB18030-2022 5
PASS gbk encoder: GB18030-2022 6
PASS gbk encoder: GB18030-2022 7
PASS gbk encoder: GB18030-2022 8
PASS gbk encoder: GB18030-2022 9
PASS gbk encoder: GB18030-2022 10
PASS gbk encoder: GB18030-2022 11
PASS gbk encoder: GB18030-2022 12
PASS gbk encoder: GB18030-2022 13
PASS gbk encoder: GB18030-2022 14
PASS gbk encoder: GB18030-2022 15
PASS gbk encoder: GB18030-2022 16
PASS gbk encoder: GB18030-2022 17
PASS gbk encoder: GB18030-2022 18
PASS gbk encoder: GB18030-2022 19
PASS gbk encoder: GB18030-2022 20
PASS gbk encoder: GB18030-2022 21
PASS gbk encoder: GB18030-2022 22
PASS gbk encoder: GB18030-2022 23
PASS gbk encoder: GB18030-2022 24
PASS gbk encoder: GB18030-2022 25
PASS gbk encoder: GB18030-2022 26
PASS gbk encoder: GB18030-2022 27
PASS gbk encoder: GB18030-2022 28
PASS gbk encoder: GB18030-2022 29
PASS gbk encoder: GB18030-2022 30
PASS gbk encoder: GB18030-2022 31
PASS gbk encoder: GB18030-2022 32
PASS gbk encoder: GB18030-2022 33
PASS gbk encoder: GB18030-2022 34
PASS gbk encoder: GB18030-2022 35
PASS gbk encoder: GB18030-2022 36

LayoutTests/imported/w3c/web-platform-tests/encoding/legacy-mb-schinese/gbk/gbk-encoder.html
@@ -23,4 +23,43 @@
encode("\u00A5", "%26%23165%3B", "legacy WebKit case 1");
encode("\u22EF", "%26%238943%3B", "legacy WebKit case 2");
encode("\u301C", "%26%2312316%3B", "legacy WebKit case 3");
encode("\u{10FFFF}", "%26%231114111%3B", "U+10FFFF");

// GB18030-2022
encode("\uFE10", "%A6%D9", "GB18030-2022 1");
encode("\uFE12", "%A6%DA", "GB18030-2022 2");
encode("\uFE11", "%A6%DB", "GB18030-2022 3");
encode("\uFE13", "%A6%DC", "GB18030-2022 4");
encode("\uFE14", "%A6%DD", "GB18030-2022 5");
encode("\uFE15", "%A6%DE", "GB18030-2022 6");
encode("\uFE16", "%A6%DF", "GB18030-2022 7");
encode("\uFE17", "%A6%EC", "GB18030-2022 8");
encode("\uFE18", "%A6%ED", "GB18030-2022 9");
encode("\uFE19", "%A6%F3", "GB18030-2022 10");
encode("\u9FB4", "%FEY", "GB18030-2022 11");
encode("\u9FB5", "%FEa", "GB18030-2022 12");
encode("\u9FB6", "%FEf", "GB18030-2022 13");
encode("\u9FB7", "%FEg", "GB18030-2022 14");
encode("\u9FB8", "%FEm", "GB18030-2022 15");
encode("\u9FB9", "%FE~", "GB18030-2022 16");
encode("\u9FBA", "%FE%90", "GB18030-2022 17");
encode("\u9FBB", "%FE%A0", "GB18030-2022 18");
encode("\uE78D", "%A6%D9", "GB18030-2022 19");
encode("\uE78E", "%A6%DA", "GB18030-2022 20");
encode("\uE78F", "%A6%DB", "GB18030-2022 21");
encode("\uE790", "%A6%DC", "GB18030-2022 22");
encode("\uE791", "%A6%DD", "GB18030-2022 23");
encode("\uE792", "%A6%DE", "GB18030-2022 24");
encode("\uE793", "%A6%DF", "GB18030-2022 25");
encode("\uE794", "%A6%EC", "GB18030-2022 26");
encode("\uE795", "%A6%ED", "GB18030-2022 27");
encode("\uE796", "%A6%F3", "GB18030-2022 28");
encode("\uE81E", "%FEY", "GB18030-2022 29");
encode("\uE826", "%FEa", "GB18030-2022 30");
encode("\uE82B", "%FEf", "GB18030-2022 31");
encode("\uE82C", "%FEg", "GB18030-2022 32");
encode("\uE832", "%FEm", "GB18030-2022 33");
encode("\uE843", "%FE~", "GB18030-2022 34");
encode("\uE854", "%FE%90", "GB18030-2022 35");
encode("\uE864", "%FE%A0", "GB18030-2022 36");
</script>
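The expected strings in these tests are percent-encoded bytes, so "%FEY" stands for the two bytes 0xFE 0x59 ('Y' is 0x59) and "%26%23165%3B" is the character reference "&#165;". A minimal sketch of that notation follows. It assumes that only the RFC 3986 unreserved characters are left literal, which reproduces the strings shown here; the actual WPT harness may use a different set. percentEncodeSketch is a hypothetical helper, not part of the tests or of WebKit.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: renders encoder output bytes the way the expected
// strings above are written. Assumption: only ASCII letters, digits and
// "-._~" stay literal; every other byte becomes %XX with uppercase hex.
static std::string percentEncodeSketch(const std::vector<uint8_t>& bytes)
{
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (uint8_t byte : bytes) {
        bool unreserved = (byte >= 'A' && byte <= 'Z') || (byte >= 'a' && byte <= 'z')
            || (byte >= '0' && byte <= '9')
            || byte == '-' || byte == '.' || byte == '_' || byte == '~';
        if (unreserved) {
            out.push_back(static_cast<char>(byte));
        } else {
            out.push_back('%');
            out.push_back(hex[byte >> 4]);
            out.push_back(hex[byte & 0x0F]);
        }
    }
    return out;
}
```

Read this way, "GB18030-2022 11" and "GB18030-2022 29" expect the same bytes, 0xFE 0x59, for U+9FB4 and U+E81E respectively.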
13 changes: 6 additions & 7 deletions Source/WebCore/PAL/pal/text/TextCodecCJK.cpp
@@ -899,7 +899,7 @@ static const GB18030EncodeIndex& gb18030EncodeIndex()
// https://unicode-org.atlassian.net/browse/ICU-22357
// The 2-byte values are handled correctly by values from gb18030()
// but these need to be exceptions from gb18030Ranges().
-static std::optional<uint32_t> gb18030AsymmetricEncode(char32_t codePoint)
+static std::optional<uint16_t> gb18030AsymmetricEncode(UChar codePoint)
{
switch (codePoint) {
case 0xE81E: return 0xFE59;
@@ -1031,12 +1031,11 @@ static Vector<uint8_t> gbEncodeShared(StringView string, Function<void(char32_t,
unencodableHandler(codePoint, result);
continue;
}
-if (isGBK == IsGBK::Yes) {
-if (codePoint == 0x20AC) {
-result.append(0x80);
-continue;
-}
-} else if (auto encoded = gb18030AsymmetricEncode(codePoint)) {
+if (isGBK == IsGBK::Yes && codePoint == 0x20AC) {
+result.append(0x80);
+continue;
+}
+if (auto encoded = gb18030AsymmetricEncode(codePoint)) {
result.append(*encoded >> 8);
result.append(*encoded);
continue;
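The control flow after this change can be sketched in isolation. This is a simplified, standalone illustration, not the WebKit implementation: it uses std::vector instead of WTF::Vector, the names asymmetricEncodeSketch and encodeSpecialCases are hypothetical, and the table holds only three mappings taken from the code and tests above. The point it shows is that the asymmetric lookup is now reached for GBK as well as gb18030, while the single-byte euro sign remains GBK-only.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

enum class IsGBK : bool { No, Yes };

// Hypothetical subset of the asymmetric table: three mappings that appear
// in the diff or its tests. The real table has more entries.
static std::optional<uint16_t> asymmetricEncodeSketch(char16_t codePoint)
{
    switch (codePoint) {
    case 0xE78D: return 0xA6D9;
    case 0xE81E: return 0xFE59;
    case 0xE864: return 0xFEA0;
    default: return std::nullopt;
    }
}

// Returns true if codePoint was handled by one of the special cases.
// Everything else would fall through to the shared table lookup, which
// this sketch leaves out.
static bool encodeSpecialCases(char32_t codePoint, IsGBK isGBK, std::vector<uint8_t>& result)
{
    assert(codePoint <= 0x10FFFF);
    // GBK only: U+20AC EURO SIGN is the single byte 0x80.
    if (isGBK == IsGBK::Yes && codePoint == 0x20AC) {
        result.push_back(0x80);
        return true;
    }
    // Both encodings: former private-use code points keep their two-byte form.
    if (codePoint <= 0xFFFF) {
        if (auto encoded = asymmetricEncodeSketch(static_cast<char16_t>(codePoint))) {
            result.push_back(static_cast<uint8_t>(*encoded >> 8));
            result.push_back(static_cast<uint8_t>(*encoded & 0xFF));
            return true;
        }
    }
    return false;
}
```

In the removed code the asymmetric lookup sat in the else branch of the GBK check, so a GBK encoder never consulted it; flattening the two conditions is what aligns the two encoders.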