Support GB18030-2022 #335
One legacy encoding was updated and relevant regulation requires software to match. As such the Encoding Standard should match as well. This aims to make the minimum number of changes necessary and does not impact GBK, only gb18030. Updated tests are in https://github.com/WebKit/WebKit/tree/main/LayoutTests/imported/w3c/web-platform-tests/encoding/legacy-mb-schinese/gb18030. If these are deemed satisfactory they will be exported.
The tests are upstreamed in web-platform-tests/wpt#48221 and I noticed there that Chromium previously imported WebKit's changes and exported them to web-platform-tests. This was done by @yoshisatoyanagisawa and @tkent-google: web-platform-tests/wpt@2ddb8af. Thus it seems Chromium has partial compliance with these changes, but not with the most recent changes where WebKit (once again) aligned with the Unicode recommendation around gb18030. One thing that might be good to test is that GBK is not impacted by these new mappings. (At least, I've been assuming we don't want that.) @achristensen07 @ricea @hsivonen @domenic thoughts and review appreciated!
<p>If <a for="gb18030 decoder">is GBK</a> is false and there is a row in the table below whose
first column is <a>gb18030 first</a>, second column is <a>gb18030 second</a>, third column is
<a>gb18030 third</a>, and fourth column is <var>byte</var>, then set <var>code point</var> to
the fifth column on the same row:
Based on test results in Firefox I calculated the first override here and it appears that this is already the existing behavior of gb18030 as written in the standard. As such this table isn't needed. It's not clear to me why the Unicode recommendation calls out these four-byte sequences mapping to code points if that matches existing behavior.
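For reference, the existing four-byte behavior comes down to a linear pointer computed from the four bytes, which is then looked up in index gb18030 ranges. A minimal sketch of that pointer arithmetic, as I read the decoder in the standard; the function name is hypothetical and the range lookup itself is left out:

```python
def gb18030_four_byte_pointer(first: int, second: int, third: int, fourth: int) -> int:
    """Return the linear pointer for a well-formed four-byte gb18030 sequence."""
    # Bytes 1 and 3 range over 0x81-0xFE (126 values), bytes 2 and 4 over 0x30-0x39 (10 values).
    if not (0x81 <= first <= 0xFE and 0x30 <= second <= 0x39
            and 0x81 <= third <= 0xFE and 0x30 <= fourth <= 0x39):
        raise ValueError("not a well-formed four-byte gb18030 sequence")
    return ((first - 0x81) * (10 * 126 * 10)
            + (second - 0x30) * (10 * 126)
            + (third - 0x81) * 10
            + (fourth - 0x30))


# 0x81 0x30 0x81 0x30 is the first four-byte sequence, so its pointer is 0.
print(gb18030_four_byte_pointer(0x81, 0x30, 0x81, 0x30))
# 0x90 0x30 0x81 0x30 is where the supplementary planes start.
print(gb18030_four_byte_pointer(0x90, 0x30, 0x81, 0x30))
```

Any four-byte sequence whose pointer already resolves through the ranges to the recommended code point needs no separate override row.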
<p>If <a for="gb18030 encoder">is GBK</a> is false and there is a row in the table below whose
first column is <var>code point</var>, then return the two bytes on the same row listed in the
second column:
As long as we layer this on top of the existing gb18030 index, we can remove all the U+EXXX PUA entries below as they are already in that table and map to the correct bytes.
An implementation that updates that table would need these U+EXXX mappings though, as evidenced by WebKit. (Which I assume has a completely separate implementation for GBK.)
With those U+EXXX mappings removed however it probably makes the most sense to create a small "gb18030 2022 index" which can be reused across the decoder and encoder as a special case when GBK is not in use. Bit more work, but overall would present the data more neatly.
WebKit (and presumably Chromium) having updated the table (as evidenced by vitorroriz/WebKit@b7d4f07) means GBK is impacted at the moment as well. In particular for the symmetrical mappings at the top of #312 (comment). What do we want to do for GBK? Perhaps it's okay if that gets slightly less PUA as well? @hsivonen what do you think?
As discussed on Matrix, given that WebKit and Chromium did not get complaints for decoding less PUA for GBK, let's keep GB18030 and GBK aligned even though it means diverging from Windows Code Page 936.
GBK-1995 was never an official standard. It assigned some characters at GB+FExx to U+E8xx, but GB18030-2000 replaced 52 of those Chinese characters with Unicode Extension A code points while leaving the GB+FExx byte codes unchanged; for example, GB+FE9F【䶮】maps to U+4DAE instead of U+E863. For information interchange, we should use the official GB18030-2022 mapping table even under the GBK label, so that duplicate Unicode code points no longer map to the same two-byte GB characters.
The Windows CP 936 approach is the wrong way to reach the GB18030 standard: it keeps the U+E8xx characters in TTF font files (Source Han Sans and iOS never did this), and conversion programs use these PUA characters and emit 0x3F for four-byte GB18030 characters.
On the other hand, the full BMP PUA range is U+E000-U+F8FF, and the supplementary planes (U+10000-U+10FFFF) map to GB18030 linearly starting at GB+90308130, so no mapping table is needed for them, either in code or for future GB18030 amendments.
See https://zhuanlan.zhihu.com/p/661610604 (in Chinese) for details on a GB18030 conversion program for WHATWG.
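To illustrate the point about supplementary code points needing no table: a minimal sketch, assuming the linear layout that starts at GB+90308130 (pointer 189000 in the Encoding Standard's terms). The function name is hypothetical.

```python
def encode_supplementary(code_point: int) -> bytes:
    """Encode a code point in U+10000-U+10FFFF as four gb18030 bytes, without a table."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    # Pointer 189000 corresponds to the byte sequence 0x90 0x30 0x81 0x30.
    pointer = 189000 + (code_point - 0x10000)
    byte1, rest = divmod(pointer, 10 * 126 * 10)
    byte2, rest = divmod(rest, 10 * 126)
    byte3, byte4 = divmod(rest, 10)
    return bytes([byte1 + 0x81, byte2 + 0x30, byte3 + 0x81, byte4 + 0x30])


# Compare the arithmetic against Python's built-in gb18030 codec.
for cp in (0x10000, 0x1F600, 0x10FFFF):
    print(f"U+{cp:04X}", encode_supplementary(cp).hex(), chr(cp).encode("gb18030").hex())
```

The loop prints the arithmetic result next to the codec's output for the same code point, so any disagreement would be visible immediately.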
This implements the Unicode Technical Committee recommendation around GB18030-2022 in a manner suitable for this standard, taking into account existing practice and the closeness between GBK and gb18030. In particular, using the text file attached to https://www.unicode.org/L2/L2023/23003r-gb18030-recommendations.pdf this does the following:
1. Merges the first set of 18 mappings, which are bidirectional, directly into index gb18030, replacing existing PUA entries. This ends up impacting GBK and gb18030.
2. The second set of 18 mappings (from PUA to bytes) are encoded as an encoder-only table, for both GBK and gb18030.
3. The third set of 18 mappings (from bytes to code points) are ignored, as they are already covered by index gb18030 ranges. (Presumably they are included because the recommendation covers the transition from "Previous Mappings" to "Current Mappings" to "Recommended Mappings", whereas we are going directly from "Previous Mappings" to "Recommended Mappings".)
The reason for changing GBK as well is that Chromium and WebKit already have code in the wild that impacts GBK to some degree (although the encoder-only table is currently excluded for GBK, including it would make the most sense compatibility-wise) and no fallout has been recorded. Additionally, GBK is already positioned as a rough subset of gb18030 in this standard, with the decoder being shared completely.
Tests: encoding/legacy-mb-schinese has some GB18030-2022 coverage already. The aim is to complete that with web-platform-tests/wpt#48239 and web-platform-tests/wpt#48240.
This supersedes #335. This fixes #27 and fixes #312.
Closing this in favor of #336.
This implements the Unicode Technical Committee recommendation around GB18030-2022 in a manner suitable for this standard, taking into account existing practice and the closeness between GBK and gb18030. In particular, using the text file attached to https://www.unicode.org/L2/L2023/23003r-gb18030-recommendations.pdf this does the following:
1. Merges the first set of 18 mappings, which are bidirectional, directly into index gb18030, replacing existing PUA entries. This ends up impacting GBK and gb18030.
2. The second set of 18 mappings (from PUA to bytes) are encoded as an encoder-only table, for both GBK and gb18030.
3. The third set of 18 mappings (from bytes to code points) are ignored, as they are already covered by index gb18030 ranges. (Presumably they are included because the recommendation covers the transition from "Previous Mappings" to "Current Mappings" to "Recommended Mappings", whereas we are going directly from "Previous Mappings" to "Recommended Mappings".)
The reason for changing GBK as well is that Chromium and WebKit already have code in the wild that impacts GBK to some degree (although the encoder-only table is currently excluded for GBK, including it would make the most sense compatibility-wise) and no fallout has been recorded. Additionally, GBK is already positioned as a rough subset of gb18030 in this standard, with the decoder being shared completely.
Tests: encoding/legacy-mb-schinese has some GB18030-2022 coverage already. This is completed with web-platform-tests/wpt#48239 and web-platform-tests/wpt#48240.
This supersedes #335. This fixes #27 and fixes #312.
This also updates the description of index gb18030 ranges to account for #22 (the change from GB18030-2000 to -2005), which it did not do until now.
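A rough sketch of how steps 1 and 2 interact: the merged index decodes the two-byte sequences to the new code points, while the encoder-only table still lets the old PUA code points encode to the same bytes. The two entries in each table are illustrative only and reflect my reading of the recommendation; they should be checked against the attached text file. All names here are hypothetical.

```python
# Excerpt of the merged index (step 1): bytes -> code point.
# Illustrative entries only; the old PUA values noted in the comments are assumptions to verify.
INDEX_EXCERPT = {
    b"\xA6\xD9": 0xFE10,  # previously U+E78D
    b"\xFE\x59": 0x9FB4,  # previously U+E81E
}

# Excerpt of the encoder-only table (step 2): old PUA code point -> bytes.
ENCODER_ONLY_EXCERPT = {
    0xE78D: b"\xA6\xD9",
    0xE81E: b"\xFE\x59",
}


def decode_two_bytes(data: bytes):
    """Decode via the merged index; returns None when the excerpt has no entry."""
    return INDEX_EXCERPT.get(bytes(data))


def encode_code_point(code_point: int):
    """Consult the encoder-only table first, then fall back to the merged index."""
    if code_point in ENCODER_ONLY_EXCERPT:
        return ENCODER_ONLY_EXCERPT[code_point]
    for data, mapped in INDEX_EXCERPT.items():
        if mapped == code_point:
            return data
    return None


# The old PUA code point still encodes, but the bytes decode to the new code point.
encoded = encode_code_point(0xE78D)
print(encoded.hex(), hex(decode_two_bytes(encoded)))
```

The mapping is deliberately not round-trippable for the PUA code points: both the old and the new code point encode to the same bytes, and those bytes decode only to the new one.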
Updated tests are upstreamed in web-platform-tests/wpt#48221. (See also the comment below.)