diff --git a/encoding.bs b/encoding.bs
index 812821c..b5a539d 100644
--- a/encoding.bs
+++ b/encoding.bs
@@ -3,7 +3,7 @@ Group: WHATWG
 H1: Encoding
 Shortname: encoding
 Text Macro: TWITTER encodings
-Text Macro: LATESTRD 2023-06
+Text Macro: LATESTRD 2024-12
 Abstract: The Encoding Standard defines encodings and their JavaScript API.
 Translation: ja https://triple-underscore.github.io/Encoding-ja.html
 Markup Shorthands: css off
diff --git a/review-drafts/2024-12.bs b/review-drafts/2024-12.bs
new file mode 100644
index 0000000..0be499d
--- /dev/null
+++ b/review-drafts/2024-12.bs
@@ -0,0 +1,3584 @@
+
+Group: WHATWG
+Status: RD
+Date: 2024-12-16
+H1: Encoding
+Shortname: encoding
+Text Macro: TWITTER encodings
+Text Macro: LATESTRD 2024-12
+Abstract: The Encoding Standard defines encodings and their JavaScript API.
+Translation: ja https://triple-underscore.github.io/Encoding-ja.html
+Markup Shorthands: css off
+Translate IDs: dictdef-textdecoderoptions textdecoderoptions,dictdef-textdecodeoptions textdecodeoptions,index section-index
+
The UTF-8 encoding is the most appropriate encoding for interchange of Unicode, the +universal coded character set. Therefore for new protocols and formats, as well as +existing formats deployed in new contexts, this specification requires (and defines) the +UTF-8 encoding. + +
The other (legacy) encodings have been defined to some extent in the past. However, +user agents have not always implemented them in the same way, have not always used the +same labels, and often differ in dealing with undefined and former proprietary areas of +encodings. This specification addresses those gaps so that new user agents do not have to +reverse engineer encoding implementations and existing user agents can converge. + +
In particular, this specification defines all those encodings, their algorithms to go +from bytes to scalar values and back, and their canonical names and identifying labels. +This specification also defines an API to expose part of the encoding algorithms to +JavaScript. + +
User agents have also significantly deviated from the labels listed in the +IANA Character Sets registry. +To stop spreading legacy encodings further, this specification is exhaustive about the +aforementioned details and therefore has no need for the registry. In particular, this +specification does not provide a mechanism for extending any aspect of encodings. + + + +
There is a set of encoding security issues when the producer and consumer do not agree on the +encoding in use, or on the way a given encoding is to be implemented. For instance, an attack was +reported in 2011 where a Shift_JIS lead byte 0x82 was used to “mask” a 0x22 trail byte in a +JSON resource of which an attacker could control some field. The producer did not see the problem +even though this is an illegal byte combination. The consumer decoded it as a single U+FFFD and +therefore changed the overall interpretation as U+0022 is an important delimiter. Decoders of +encodings that use multiple bytes for scalar values now require that in case of an illegal byte +combination, a scalar value in the range U+0000 to U+007F, inclusive, cannot be “masked”. For the +aforementioned sequence the output would be U+FFFD U+0022. (As an unfortunate exception to this, the +gb18030 decoder will “mask” up to one such byte at end-of-queue.) + +
This is a larger issue for encodings that map anything that is an ASCII byte to something +that is not an ASCII code point, when there is no lead byte present. These are +“ASCII-incompatible” encodings and other than ISO-2022-JP and UTF-16BE/LE, which are +unfortunately required due to deployed content, they are not supported. (Investigation is +ongoing +whether more labels of other such encodings can be mapped to the replacement encoding, rather +than the unknown encoding fallback.) An example attack is injecting carefully crafted content into a +resource and then encouraging the user to override the encoding, resulting in, e.g., script +execution. + +
Encoders used by URLs found in HTML and HTML's form feature can also result in slight information +loss when an encoding is used that cannot represent all scalar values. E.g., when a resource uses +the windows-1252 encoding a server will not be able to distinguish between an end user +entering “💩” and “&#128169;” into a form. + +
The problems outlined here go away when exclusively using UTF-8, which is one of the many reasons +UTF-8 is now the mandatory encoding for all things. + +
See also the Browser UI chapter. + + + +
This specification depends on the Infra Standard. [[!INFRA]] + +
Hexadecimal numbers are prefixed with "0x". + +
In equations, all numbers are integers, addition is represented by "+", subtraction by "−", +multiplication by "×", integer division by "/" (returns the quotient), modulo by "%" (returns the +remainder of an integer division), logical left shifts by "<<", logical right shifts by ">>", +bitwise AND by "&", and bitwise OR by "|". + +
For logical right shifts operands must have at least twenty-one bits precision. + +
An I/O queue is a type of list with +items of a particular type (i.e., bytes or scalar values). +End-of-queue is a special item that can be +present in I/O queues of any type and it signifies that there are no more +items in the queue. + +
There are two ways to use an I/O queue: in immediate mode, to represent I/O data + stored in memory, and in streaming mode, to represent data coming in from the network. Immediate + queues have end-of-queue as their last item, whereas streaming queues need not have it, and + so their read operation might block. + +
It is expected that streaming I/O queues will be created empty, and that new + items will be pushed to it as data comes in from the + network. When the underlying network stream closes, an end-of-queue item is to be + pushed into the queue. + +
Since reading from a streaming I/O queue might block, streaming + I/O queues are not to be used from an event loop. They are to be used + in parallel instead. +
To read an item from an +I/O queue ioQueue, run these steps: + +
If ioQueue is empty, then wait until its size is + at least 1. + +
If ioQueue[0] is end-of-queue, then return end-of-queue. + +
Remove ioQueue[0] and return it. +
To read a number number of items from +ioQueue, run these steps: + +
Let readItems be « ». + +
Perform the following step number times: + +
  Append to readItems the result of reading an item from ioQueue. + +
+Remove end-of-queue from readItems. + +
Return readItems. +
To peek a number number of items +from an I/O queue ioQueue, run these steps: + +
Wait until either ioQueue's size is equal to or greater than + number, or ioQueue contains end-of-queue, whichever + comes first. + +
Let prefix be « ». + +
For each n in the range 1 to number, inclusive: + +
If ioQueue[n] is end-of-queue, break. + +
Otherwise, append ioQueue[n] to prefix. +
Return prefix. +
To push an item +item to an I/O queue ioQueue, run these steps: + +
If the last item in ioQueue is end-of-queue, then: + +
  If item is end-of-queue, do nothing. + +
  Otherwise, insert item before the last item in ioQueue. + +
Otherwise, append item to ioQueue. +
To push a sequence of items to an I/O queue +ioQueue is to push each item in the sequence to ioQueue, in the given order. + +
To restore an item other +than end-of-queue to an I/O queue, perform the list +prepend operation. To restore a list of +items excluding end-of-queue to an I/O queue, insert those +items, in the given order, before the first item in the queue. + +
Inserting the bytes « 0xF0, 0x9F » in an I/O queue +« 0x92, 0xA9, end-of-queue », results in an I/O queue +« 0xF0, 0x9F, 0x92, 0xA9, end-of-queue ». The next item to be read would be 0xF0. + +
To convert an I/O queue ioQueue into a +list, string, or byte sequence, return the result of +reading an indefinite number of items from +ioQueue. + +
To convert a list, string, or +byte sequence input into an I/O queue, run these steps: + +
Assert: if input is a list, then it does not contain + end-of-queue. + +
Return an I/O queue containing the items in input, + in order, followed by end-of-queue. +
The Infra standard is expected to define some infrastructure around type conversions. +See whatwg/infra issue #319. [[INFRA]] + +
I/O queues are defined as lists, not +queues, because they feature a restore operation. However, +this restore operation is an internal detail of the algorithms in this specification, and is not to +be used by other standards. Implementations are free to find alternative ways to implement such +algorithms, as detailed in [[#implementation-considerations]]. + +
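The read, peek, and restore operations above can be sketched in JavaScript for the immediate-mode case. This is only an illustration under stated assumptions: `END_OF_QUEUE`, `ImmediateIoQueue`, and its method names are hypothetical, blocking reads of streaming queues are not modelled, and `peek` treats the queue as zero-indexed, as the read steps do.

```javascript
// Hypothetical sentinel standing in for the end-of-queue item.
const END_OF_QUEUE = Symbol("end-of-queue");

class ImmediateIoQueue {
  // Convert a list of items into an I/O queue: the items, in order,
  // followed by end-of-queue.
  constructor(items) {
    this.items = [...items, END_OF_QUEUE];
  }

  // Read one item. End-of-queue is returned but never removed.
  read() {
    if (this.items[0] === END_OF_QUEUE) return END_OF_QUEUE;
    return this.items.shift();
  }

  // Read `number` items, then drop any end-of-queue results.
  readMany(number) {
    const readItems = [];
    for (let i = 0; i < number; i++) readItems.push(this.read());
    return readItems.filter((item) => item !== END_OF_QUEUE);
  }

  // Return up to `number` leading items without removing them.
  peek(number) {
    const prefix = [];
    for (let n = 0; n < number; n++) {
      if (this.items[n] === END_OF_QUEUE) break;
      prefix.push(this.items[n]);
    }
    return prefix;
  }

  // Restore items, in the given order, before the first item.
  restore(items) {
    this.items.unshift(...items);
  }
}

const queue = new ImmediateIoQueue([0x92, 0xa9]);
queue.restore([0xf0, 0x9f]);
const peeked = queue.peek(3);    // leaves the queue unchanged
const firstRead = queue.read();  // 0xF0, as in the example above
const rest = queue.readMany(10); // stops contributing items at end-of-queue
```

Pushing is left out of the sketch on purpose; a real implementation would also need it, along with a way to wait for data in streaming mode.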
To obtain a scalar value from surrogates, given a leading surrogate +leading and a trailing surrogate trailing, return +0x10000 + ((leading − 0xD800) << 10) + (trailing − 0xDC00). + + + +
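As a quick check of the formula, the following sketch applies it to the surrogate pair that JavaScript strings use for "💩". The function name `scalarValueFromSurrogates` is an illustrative choice, not a name defined by this specification.

```javascript
// Combine a leading and a trailing surrogate into one scalar value.
function scalarValueFromSurrogates(leading, trailing) {
  return 0x10000 + ((leading - 0xd800) << 10) + (trailing - 0xdc00);
}

// "💩" is stored in a JavaScript string as the pair 0xD83D 0xDCA9.
const pile = "\uD83D\uDCA9";
const scalar = scalarValueFromSurrogates(pile.charCodeAt(0), pile.charCodeAt(1));
console.log(scalar.toString(16)); // "1f4a9"
console.log(scalar === pile.codePointAt(0)); // true
```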
An encoding defines a mapping from a scalar value sequence to +a byte sequence (and vice versa). Each encoding has a +name, and one or more +labels. + +
This specification defines three encodings with the same +names as encoding schemes defined in the Unicode standard: UTF-8, UTF-16LE, and +UTF-16BE. The encodings differ from the encoding schemes by byte order +mark (also known as BOM) handling not being part of the encodings themselves and +instead being part of wrapper algorithms in this specification, whereas byte order mark handling is +part of the definition of the encoding schemes in the Unicode Standard. UTF-8 used +together with the UTF-8 decode algorithm matches the encoding scheme of the same name. +This specification does not provide wrapper algorithms that would combine with UTF-16LE and +UTF-16BE to match the similarly-named encoding schemes. [[UNICODE]] + + +
Each encoding has an associated decoder and most of them have an +associated encoder. Instances of decoders and encoders have a +handler algorithm and might also have state. A handler algorithm takes an input +I/O queue and an item, and returns +finished, one or more items, error +optionally with a code point, or continue. + +
The replacement and UTF-16BE/LE encodings have +no encoder. + +
An error mode as used below is "replacement" or "fatal" for
+a decoder and "fatal" or "html" for an encoder.
+
+
An XML processor would set error mode to "fatal".
+[[XML]]
+
+
"html" exists as error mode due to HTML forms requiring a
+non-terminating legacy encoder. The "html" error mode causes
+a sequence to be emitted that cannot be distinguished from legitimate input and can therefore lead
+to silent data loss. Developers are strongly encouraged to use the UTF-8
+encoding to prevent this from happening. [[HTML]]
+
+
To process a queue +given an encoding's decoder or encoder instance +encoderDecoder, I/O queue input, I/O queue +output, and error mode mode: + +
While true: + +
Let result be the result of processing an item with the result of + reading from input, encoderDecoder, input, + output, and mode. + +
If result is not continue, then return result. +
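The loop above, together with the per-item processing it relies on, can be sketched as follows. Everything here is an assumption made for illustration: queues are plain arrays, `END` is a hypothetical end-of-queue sentinel, results are small objects with a `type` field, and `toyAsciiEncoderHandler` is a made-up ASCII-only encoder rather than one of the encoders defined by this specification. The handling of each error mode is written out as this sketch understands it.

```javascript
const END = Symbol("end-of-queue"); // hypothetical sentinel

// Toy handler: ASCII code points map to one byte, anything else is an error.
function toyAsciiEncoderHandler(input, item) {
  if (item === END) return { type: "finished" };
  if (item <= 0x7f) return { type: "items", items: [item] };
  return { type: "error", codePoint: item };
}

// Process one item and apply the error mode to the handler's result.
function processItem(item, handler, input, output, mode) {
  const result = handler(input, item);
  if (result.type === "finished") {
    output.push(END);
    return result;
  }
  if (result.type === "items") {
    output.push(...result.items);
  } else if (result.type === "error") {
    if (mode === "replacement") {
      output.push(0xfffd);
    } else if (mode === "html") {
      // Emit "&#", the code point in decimal digits, then ";" as bytes.
      const reference = "&#" + result.codePoint + ";";
      for (const ch of reference) output.push(ch.charCodeAt(0));
    } else {
      return result; // "fatal": hand the error back to the caller
    }
  }
  return { type: "continue" };
}

// Keep processing items until something other than continue comes back.
function processQueue(handler, input, output, mode) {
  while (true) {
    const item = input[0] === END ? END : input.shift();
    const result = processItem(item, handler, input, output, mode);
    if (result.type !== "continue") return result;
  }
}

// Helper for the demo: run the toy encoder and show the bytes as text.
function encodeToText(codePoints, mode) {
  const output = [];
  const result = processQueue(toyAsciiEncoderHandler, [...codePoints, END], output, mode);
  const bytes = output.filter((b) => b !== END);
  return { result: result.type, text: String.fromCharCode(...bytes) };
}

const htmlMode = encodeToText([0x41, 0x1f4a9], "html");   // text: "A&#128169;"
const fatalMode = encodeToText([0x41, 0x1f4a9], "fatal"); // stops at the error
```

In "html" mode the unencodable code point becomes a numeric character reference, which is why a consumer cannot tell it apart from the same characters typed literally.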
To process an item +given an item item, encoding's encoder or +decoder instance encoderDecoder, I/O queue input, +I/O queue output, and error mode mode: + +
Assert: if encoderDecoder is an encoder instance, mode is
+ not "replacement".
+
+
Assert: if encoderDecoder is a decoder instance, mode is
+ not "html".
+
+
Assert: if encoderDecoder is an encoder instance, item is + not a surrogate. + +
Let result be the result of running encoderDecoder's handler on + input and item. + +
If result is finished: + +
Push end-of-queue to output. + +
Return result. +
Otherwise, if result is one or more items: + +
Assert: if encoderDecoder is a decoder instance, result + does not contain any surrogates. + +
Push result to output. +
Otherwise, if result is an error, switch on mode and run the + associated steps: + +
replacement
"
+ html
"
+ fatal
"
+ Return continue. +
The table below lists all encodings +and their labels user agents must support. +User agents must not support any other encodings +or labels. + +
For each encoding, ASCII-lowercasing its +name yields one of its labels. + +
Authors must use the UTF-8 encoding and must use its
+(ASCII case-insensitive) "utf-8" label to identify it.
+
+
New protocols and formats, as well as existing formats deployed in new contexts, must use the
+UTF-8 encoding exclusively. If these protocols and formats need to expose the
+encoding's name or label, they must expose it
+as "utf-8".
+
+
+
To +get an encoding +from a string label, run these steps: + +
Remove any leading and trailing ASCII whitespace from + label. + +
If label is an ASCII case-insensitive match for any of the labels listed + in the table below, then return the corresponding encoding; otherwise return failure. +
This is a more basic and restrictive algorithm of mapping labels to +encodings than +section 1.4 of Unicode Technical Standard #22 +prescribes, as that is necessary to be compatible with deployed content. + +
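The two steps above can be sketched in JavaScript. The `LABELS` map is a hypothetical six-entry excerpt of the label table, and `getEncoding` is an illustrative name; a real implementation needs every row of the table.

```javascript
// Excerpt of the label table; a real implementation needs every row.
const LABELS = new Map([
  ["utf-8", "UTF-8"],
  ["utf8", "UTF-8"],
  ["latin1", "windows-1252"],
  ["ascii", "windows-1252"],
  ["sjis", "Shift_JIS"],
  ["utf-16", "UTF-16LE"],
]);

function getEncoding(label) {
  // Strip ASCII whitespace only: TAB, LF, FF, CR, and SPACE.
  const trimmed = label.replace(/^[\t\n\f\r ]+|[\t\n\f\r ]+$/g, "");
  // Lowercase ASCII letters only, for an ASCII case-insensitive match.
  const lowered = trimmed.replace(/[A-Z]/g, (c) => c.toLowerCase());
  return LABELS.has(lowered) ? LABELS.get(lowered) : null; // null = failure
}

console.log(getEncoding(" UTF8\n")); // "UTF-8"
console.log(getEncoding("Latin1"));  // "windows-1252"
console.log(getEncoding("utf-7"));   // null
```

The sketch avoids `String.prototype.trim()` and `toLowerCase()` on the whole string because both act on more than ASCII, which would accept labels the algorithm rejects.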
Name | Labels
---|---
The Encoding |
UTF-8 | "unicode-1-1-utf-8", "unicode11utf8", "unicode20utf8", "utf-8", "utf8", "x-unicode20utf8"
Legacy single-byte encodings |
IBM866 | "866", "cp866", "csibm866", "ibm866"
ISO-8859-2 | "csisolatin2", "iso-8859-2", "iso-ir-101", "iso8859-2", "iso88592", "iso_8859-2", "iso_8859-2:1987", "l2", "latin2"
ISO-8859-3 | "csisolatin3", "iso-8859-3", "iso-ir-109", "iso8859-3", "iso88593", "iso_8859-3", "iso_8859-3:1988", "l3", "latin3"
ISO-8859-4 | "csisolatin4", "iso-8859-4", "iso-ir-110", "iso8859-4", "iso88594", "iso_8859-4", "iso_8859-4:1988", "l4", "latin4"
ISO-8859-5 | "csisolatincyrillic", "cyrillic", "iso-8859-5", "iso-ir-144", "iso8859-5", "iso88595", "iso_8859-5", "iso_8859-5:1988"
ISO-8859-6 | "arabic", "asmo-708", "csiso88596e", "csiso88596i", "csisolatinarabic", "ecma-114", "iso-8859-6", "iso-8859-6-e", "iso-8859-6-i", "iso-ir-127", "iso8859-6", "iso88596", "iso_8859-6", "iso_8859-6:1987"
ISO-8859-7 | "csisolatingreek", "ecma-118", "elot_928", "greek", "greek8", "iso-8859-7", "iso-ir-126", "iso8859-7", "iso88597", "iso_8859-7", "iso_8859-7:1987", "sun_eu_greek"
ISO-8859-8 | "csiso88598e", "csisolatinhebrew", "hebrew", "iso-8859-8", "iso-8859-8-e", "iso-ir-138", "iso8859-8", "iso88598", "iso_8859-8", "iso_8859-8:1988", "visual"
ISO-8859-8-I | "csiso88598i", "iso-8859-8-i", "logical"
ISO-8859-10 | "csisolatin6", "iso-8859-10", "iso-ir-157", "iso8859-10", "iso885910", "l6", "latin6"
ISO-8859-13 | "iso-8859-13", "iso8859-13", "iso885913"
ISO-8859-14 | "iso-8859-14", "iso8859-14", "iso885914"
ISO-8859-15 | "csisolatin9", "iso-8859-15", "iso8859-15", "iso885915", "iso_8859-15", "l9"
ISO-8859-16 | "iso-8859-16"
KOI8-R | "cskoi8r", "koi", "koi8", "koi8-r", "koi8_r"
KOI8-U | "koi8-ru", "koi8-u"
macintosh | "csmacintosh", "mac", "macintosh", "x-mac-roman"
windows-874 | "dos-874", "iso-8859-11", "iso8859-11", "iso885911", "tis-620", "windows-874"
windows-1250 | "cp1250", "windows-1250", "x-cp1250"
windows-1251 | "cp1251", "windows-1251", "x-cp1251"
windows-1252 | "ansi_x3.4-1968", "ascii", "cp1252", "cp819", "csisolatin1", "ibm819", "iso-8859-1", "iso-ir-100", "iso8859-1", "iso88591", "iso_8859-1", "iso_8859-1:1987", "l1", "latin1", "us-ascii", "windows-1252", "x-cp1252"
windows-1253 | "cp1253", "windows-1253", "x-cp1253"
windows-1254 | "cp1254", "csisolatin5", "iso-8859-9", "iso-ir-148", "iso8859-9", "iso88599", "iso_8859-9", "iso_8859-9:1989", "l5", "latin5", "windows-1254", "x-cp1254"
windows-1255 | "cp1255", "windows-1255", "x-cp1255"
windows-1256 | "cp1256", "windows-1256", "x-cp1256"
windows-1257 | "cp1257", "windows-1257", "x-cp1257"
windows-1258 | "cp1258", "windows-1258", "x-cp1258"
x-mac-cyrillic | "x-mac-cyrillic", "x-mac-ukrainian"
Legacy multi-byte Chinese (simplified) encodings |
GBK | "chinese", "csgb2312", "csiso58gb231280", "gb2312", "gb_2312", "gb_2312-80", "gbk", "iso-ir-58", "x-gbk"
gb18030 | "gb18030"
Legacy multi-byte Chinese (traditional) encodings |
Big5 | "big5", "big5-hkscs", "cn-big5", "csbig5", "x-x-big5"
Legacy multi-byte Japanese encodings |
EUC-JP | "cseucpkdfmtjapanese", "euc-jp", "x-euc-jp"
ISO-2022-JP | "csiso2022jp", "iso-2022-jp"
Shift_JIS | "csshiftjis", "ms932", "ms_kanji", "shift-jis", "shift_jis", "sjis", "windows-31j", "x-sjis"
Legacy multi-byte Korean encodings |
EUC-KR | "cseuckr", "csksc56011987", "euc-kr", "iso-ir-149", "korean", "ks_c_5601-1987", "ks_c_5601-1989", "ksc5601", "ksc_5601", "windows-949"
Legacy miscellaneous encodings |
replacement | "csiso2022kr", "hz-gb-2312", "iso-2022-cn", "iso-2022-cn-ext", "iso-2022-kr", "replacement"
UTF-16BE | "unicodefffe", "utf-16be"
UTF-16LE | "csunicode", "iso-10646-ucs-2", "ucs-2", "unicode", "unicodefeff", "utf-16", "utf-16le"
x-user-defined | "x-user-defined"
All encodings and their labels are also available as a +non-normative encodings.json resource. + +
The set of supported encodings is primarily based +on the intersection of the sets supported by major browser engines when the development of this +standard started, while removing encodings that were rarely used legitimately but that could be used +in attacks. The inclusion of some encodings is questionable in the light of anecdotal evidence of +the level of use by existing Web content. That is, while they have been broadly supported by +browsers, it is unclear if they are broadly used by Web content. However, an effort has not been +made to eagerly remove single-byte encodings that were broadly supported by browsers or are +part of the ISO 8859 series. In particular, the necessity of the inclusion of IBM866, +macintosh, x-mac-cyrillic, ISO-8859-3, ISO-8859-10, ISO-8859-14, +and ISO-8859-16 is doubtful for the purpose of supporting existing content, but there are no +plans to remove these.
+ + +To get an output encoding from an encoding +encoding, run these steps: + +
If encoding is replacement or UTF-16BE/LE, then return + UTF-8. + +
Return encoding. +
The get an output encoding algorithm is useful for URL parsing and HTML +form submission, which both need exactly this. + + + +
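As a small sketch, with encodings represented by their names as plain strings (an assumption of this example) and `getOutputEncoding` as an illustrative name:

```javascript
// Map encodings that have no encoder to UTF-8; leave the rest unchanged.
function getOutputEncoding(encoding) {
  if (encoding === "replacement" || encoding === "UTF-16BE" || encoding === "UTF-16LE") {
    return "UTF-8";
  }
  return encoding;
}

console.log(getOutputEncoding("UTF-16LE"));     // "UTF-8"
console.log(getOutputEncoding("windows-1252")); // "windows-1252"
```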
Most legacy encodings make use of an index. An +index is an ordered list of entries, each entry consisting of a pointer and a +corresponding code point. Within an index pointers are unique and code points can be +duplicated. + +
An efficient implementation likely has two +indexes per encoding. One optimized for its +decoder and one for its encoder. + +
To find the pointers and their corresponding code points in an index, +let lines be the result of splitting the resource's contents on U+000A. +Then remove each item in lines that is the empty string or starts with U+0023. +Then the pointers and their corresponding code points are found by splitting each item in lines on U+0009. +The first subitem is the pointer (as a decimal number) and the second is the corresponding code point (as a hexadecimal number). +Other subitems are not relevant. + +
To signify changes an index includes an +Identifier and a Date. If an Identifier has +changed, so has the index. + +
The index code point for pointer in +index is the code point corresponding to +pointer in index, or null if +pointer is not in index. + +
The index pointer for code point in +index is the first pointer corresponding to +code point in index, or null if +code point is not in index. + +
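The parsing rules and the two lookups above might be sketched like this. The function names are illustrative, and the `sample` text is made up in the resource format purely to exercise the code; it is not the content of any real index file.

```javascript
// Parse the text of an index resource into [pointer, codePoint] entries.
function parseIndex(text) {
  return text
    .split("\n")
    .filter((line) => line !== "" && !line.startsWith("#"))
    .map((line) => {
      const [pointer, codePoint] = line.split("\t");
      // Pointer is decimal, code point is hexadecimal; other subitems are ignored.
      return [parseInt(pointer, 10), parseInt(codePoint, 16)];
    });
}

// Index code point: the code point for `pointer`, or null.
function indexCodePoint(pointer, index) {
  const entry = index.find(([p]) => p === pointer);
  return entry ? entry[1] : null;
}

// Index pointer: the first pointer for `codePoint`, or null.
function indexPointer(codePoint, index) {
  const entry = index.find(([, c]) => c === codePoint);
  return entry ? entry[0] : null;
}

// Made-up sample with a comment, a blank line, and a duplicated code point.
const sample = "# comment line\n\n  0\t0x0402\tfirst\n  1\t0x0403\tsecond\n  7\t0x0402\tduplicate\n";
const index = parseIndex(sample);
```

A linear `find` keeps the sketch short; as noted above, an efficient implementation would more likely build separate lookup structures for decoding and encoding.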
There is a non-normative visualization for each index other than + index gb18030 ranges and index ISO-2022-JP katakana. index jis0208 also has an + alternative Shift_JIS visualization. Additionally, there is a visualization of the Basic + Multilingual Plane coverage of each index other than index gb18030 ranges and + index ISO-2022-JP katakana. + +
The legend for the visualizations is: + +
These are the indexes defined by this +specification, excluding index single-byte, which have their own table: + +
Index | Notes + | |||
---|---|---|---|---|
index Big5 + | index-big5.txt + | index Big5 visualization + | index Big5 BMP coverage + | This matches the Big5 standard in combination with the + Hong Kong Supplementary Character Set and other common extensions. + |
index EUC-KR + | index-euc-kr.txt + | index EUC-KR visualization + | index EUC-KR BMP coverage + | This matches the KS X 1001 standard and the Unified Hangul Code, more commonly known together + as Windows Codepage 949. It covers the Hangul Syllables block of Unicode in its entirety. The + Hangul block whose top left corner in the visualization is at pointer 9026 is in the Unicode + order. Taken separately, the rest of the Hangul syllables in this index are in the Unicode order, + too. + |
index gb18030 + | index-gb18030.txt + | index gb18030 visualization + | index gb18030 BMP coverage + | This matches the GB18030-2022 standard for code points encoded as two bytes, except for + 0xA3 0xA0 which maps to U+3000 to be compatible with deployed content. This index covers the + CJK Unified Ideographs block of Unicode in its entirety. Entries from that block that are above or + to the left of (the first) U+3000 in the visualization are in the Unicode order. + + |
index gb18030 ranges + | index-gb18030-ranges.txt + | This index works differently from all others. Listing all code points would result + in over a million items whereas they can be represented neatly in 207 ranges combined with trivial + limit checks. It therefore only superficially matches the GB18030-2000 standard for code points + encoded as four bytes. The change for the GB18030-2005 revision is handled inline by the + index gb18030 ranges code point and index gb18030 ranges pointer algorithms below + that accompany this index. And the changes for the GB18030-2022 revision are handled differently + again to not further increase the number of byte sequences mapping to Private Use code points. The + relevant Private Use code points are mapped in the gb18030 encoder directly through a side + table to preserve compatibility with how they were mapped before. + | ||
index jis0208 + | index-jis0208.txt + | index jis0208 visualization, Shift_JIS visualization + | index jis0208 BMP coverage + | This is the JIS X 0208 standard including formerly proprietary + extensions from IBM and NEC. + + |
index jis0212 + | index-jis0212.txt + | index jis0212 visualization + | index jis0212 BMP coverage + | This is the JIS X 0212 standard. It is only used by the EUC-JP decoder + due to lack of widespread support elsewhere. + + |
index ISO-2022-JP katakana + | index-iso-2022-jp-katakana.txt + | This maps halfwidth to fullwidth katakana as per Unicode Normalization Form KC, except that + U+FF9E and U+FF9F map to U+309B and U+309C rather than U+3099 and U+309A. It is only used by the + ISO-2022-JP encoder. [[UNICODE]] + |
The index gb18030 ranges code point for pointer is +the return value of these steps: + +
If pointer is greater than 39419 and less than + 189000, or pointer is greater than 1237575, return null. + +
If pointer is 7457, return code point U+E7C7. + + +
Let offset be the last pointer in index gb18030 ranges that is less than + or equal to pointer and let code point offset be its corresponding code + point. + +
Return a code point whose value is + code point offset + pointer − offset. +
The index gb18030 ranges pointer for code point is +the return value of these steps: + +
If code point is U+E7C7, return pointer 7457. + +
Let offset be the last code point in index gb18030 ranges that is less + than or equal to code point and let pointer offset be its corresponding + pointer. + +
Return a pointer whose value is + pointer offset + code point − offset. +
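Both range algorithms can be sketched as below. `RANGES` is a hypothetical three-entry stand-in for index gb18030 ranges, which really has 207 entries, so results for pointers that fall between the listed entries are illustrative only. The function names are likewise assumptions of this example.

```javascript
// Hypothetical stand-in for index gb18030 ranges; each entry is
// [pointer, codePoint]. The real index has 207 entries.
const RANGES = [
  [0, 0x0080],
  [36, 0x00a5],
  [189000, 0x10000],
];

function rangesCodePoint(pointer) {
  if ((pointer > 39419 && pointer < 189000) || pointer > 1237575) return null;
  if (pointer === 7457) return 0xe7c7;
  // Last entry whose pointer is less than or equal to `pointer`.
  let [offset, codePointOffset] = RANGES[0];
  for (const [p, c] of RANGES) {
    if (p <= pointer) [offset, codePointOffset] = [p, c];
  }
  return codePointOffset + pointer - offset;
}

function rangesPointer(codePoint) {
  if (codePoint === 0xe7c7) return 7457;
  // Last entry whose code point is less than or equal to `codePoint`.
  let [pointerOffset, offset] = RANGES[0];
  for (const [p, c] of RANGES) {
    if (c <= codePoint) [pointerOffset, offset] = [p, c];
  }
  return pointerOffset + codePoint - offset;
}

console.log(rangesCodePoint(1237575).toString(16)); // "10ffff"
console.log(rangesPointer(0x10ffff));               // 1237575
```

The upper limit in the first step lines up with the last range: pointer 1237575 is exactly 0x10FFFF − 0x10000 past pointer 189000.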
The index Shift_JIS pointer for code point is the return value of these +steps: + +
Let index be index jis0208 excluding all entries whose pointer is in + the range 8272 to 8835, inclusive. + + +
The index jis0208 contains duplicate code points so the exclusion of + these entries causes later code points to be used. + +
Return the index pointer for code point in + index. +
The index Big5 pointer for code point is the return value of +these steps: + +
Let index be index Big5 excluding all entries whose pointer is less + than (0xA1 - 0x81) × 157. + +
Avoid returning Hong Kong Supplementary Character Set extensions literally. + +
If code point is U+2550, U+255E, U+2561, U+256A, U+5341, or U+5345, + return the last pointer corresponding to code point in + index. + + +
There are other duplicate code points, but for those the first pointer is + to be used. + +
Return the index pointer for code point in + index. +
All indexes are also available as a non-normative +indexes.json resource. (Index gb18030 ranges has a slightly +different format here, to be able to represent ranges.) + + + +
The algorithms defined below (UTF-8 decode, UTF-8 decode without BOM, + UTF-8 decode without BOM or fail, and UTF-8 encode) are intended for usage by other + standards. + +
For decoding, UTF-8 decode is to be used by new formats. For identifiers or byte + sequences within a format or protocol, use UTF-8 decode without BOM or + UTF-8 decode without BOM or fail. + +
For encoding, UTF-8 encode is to be used. + +
Standards are to ensure that the input I/O queues they pass to UTF-8 encode (as well as + the legacy encode) are effectively I/O queues of scalar values, i.e., they contain no + surrogates. + +
These hooks (as well as decode and encode) will block until the input I/O queue + has been consumed in its entirety. In order to use the output tokens as they are pushed into the + stream, callers are to invoke the hooks with an empty output I/O queue and read from it + in parallel. Note that some care is needed when using + UTF-8 decode without BOM or fail, as any error found during decoding will prevent the + end-of-queue item from ever being pushed into the output I/O queue. +
To UTF-8 decode an I/O queue of bytes ioQueue given an optional I/O +queue of scalar values output (default « »), run these steps: + +
Let buffer be the result of peeking three bytes from + ioQueue, converted to a byte sequence. + +
If buffer is 0xEF 0xBB 0xBF, then read three bytes from + ioQueue. (Do nothing with those bytes.) + +
Process a queue with an instance of UTF-8's decoder,
+ ioQueue, output, and "replacement".
+
+
Return output. +
To UTF-8 decode without BOM an I/O queue of bytes ioQueue given an +optional I/O queue of scalar values output (default « »), run these steps: + +
Process a queue with an instance of UTF-8's decoder,
+ ioQueue, output, and "replacement".
+
+
Return output. +
To UTF-8 decode without BOM or fail an I/O queue of bytes ioQueue +given an optional I/O queue of scalar values output (default « »), run these steps: + + +
Let potentialError be the result of processing a queue with an instance of
+ UTF-8's decoder, ioQueue, output, and
+ "fatal".
+
+
If potentialError is an error, then return failure. + +
Return output. +
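From JavaScript, the three decode hooks above behave much like a `TextDecoder` configured in three different ways. The mapping below is an analogy offered as a sketch, not a statement of how any standard invokes these hooks; `decodeOrFail` is a hypothetical helper, and returning `null` stands in for failure.

```javascript
const bytes = new Uint8Array([0xef, 0xbb, 0xbf, 0x68, 0x69]); // BOM + "hi"

// Like UTF-8 decode: a leading BOM is dropped, errors become U+FFFD.
const decoded = new TextDecoder("utf-8").decode(bytes);

// Like UTF-8 decode without BOM: the BOM bytes are decoded as U+FEFF.
const withoutBom = new TextDecoder("utf-8", { ignoreBOM: true }).decode(bytes);

// Like UTF-8 decode without BOM or fail: an error throws instead.
function decodeOrFail(input) {
  try {
    return new TextDecoder("utf-8", { fatal: true, ignoreBOM: true }).decode(input);
  } catch (error) {
    return null; // stands in for "failure"
  }
}

const invalid = new Uint8Array([0x68, 0xff]); // 0xFF never occurs in UTF-8
const failed = decodeOrFail(invalid);
const replaced = new TextDecoder("utf-8").decode(invalid);
```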
To UTF-8 encode an I/O queue of scalar values ioQueue given an +optional I/O queue of bytes output (default « »), return the result of +encoding ioQueue with encoding UTF-8 and output. + + +
Standards are strongly discouraged from using decode, BOM sniff, and + encode, except as needed for compatibility. Standards needing these legacy hooks will + most likely also need to use get an encoding (to turn a label into an encoding) + and get an output encoding (to turn an encoding into another + encoding that is suitable to pass into encode). + +
For the extremely niche case of URL percent-encoding, custom encoder error handling is needed. + The get an encoder and encode or fail algorithms are to be used for that. Other + algorithms are not to be used directly. +
To decode an I/O queue of bytes ioQueue given a fallback encoding +encoding and an optional I/O queue of scalar values output (default « »), run +these steps: + +
Let BOMEncoding be the result of BOM sniffing ioQueue. + +
If BOMEncoding is non-null: + +
Set encoding to BOMEncoding. + +
Read three bytes from ioQueue, if BOMEncoding is + UTF-8; otherwise read two bytes. (Do nothing with those bytes.) +
For compatibility with deployed content, the byte order mark is more authoritative
+ than anything else. In a context where HTTP is used this is in violation of the semantics of the
+ `Content-Type` header.
+
+
Process a queue with an instance of encoding's decoder,
+ ioQueue, output, and "replacement".
+
+
Return output. +
To BOM sniff an I/O queue of bytes ioQueue, run these steps: + +
Let BOM be the result of peeking 3 bytes from + ioQueue, converted to a byte sequence. + +
For each of the rows in the table below, starting with the first one and going down, if + BOM starts with the bytes given in the first column, then + return the encoding given in the cell in the second column of that row. Otherwise, + return null. + +
Byte order mark | Encoding + |
---|---|
0xEF 0xBB 0xBF | UTF-8 + |
0xFE 0xFF | UTF-16BE + |
0xFF 0xFE | UTF-16LE + |
This hook is a workaround for the fact that decode has no way to communicate +back to the caller that it has found a byte order mark and is therefore not using the provided +encoding. The hook is to be invoked before decode, and it will return an encoding +corresponding to the byte order mark found, or null otherwise. + +
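The table lookup can be sketched in a few lines. Here the input is a plain array or `Uint8Array` rather than an I/O queue, and `bomSniff` returns encoding names as strings; both are simplifications made for this example.

```javascript
// Return the encoding indicated by a leading byte order mark, or null.
function bomSniff(bytes) {
  const bom = bytes.slice(0, 3); // peek at most three bytes
  if (bom[0] === 0xef && bom[1] === 0xbb && bom[2] === 0xbf) return "UTF-8";
  if (bom[0] === 0xfe && bom[1] === 0xff) return "UTF-16BE";
  if (bom[0] === 0xff && bom[1] === 0xfe) return "UTF-16LE";
  return null;
}

console.log(bomSniff([0xef, 0xbb, 0xbf, 0x41])); // "UTF-8"
console.log(bomSniff([0xff, 0xfe, 0x41, 0x00])); // "UTF-16LE"
console.log(bomSniff([0x41, 0x42]));             // null
```

Because `slice` only copies, the input is left untouched, which mirrors the use of peeking rather than reading in the steps above.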
To encode an I/O queue of scalar values ioQueue given an encoding +encoding and an optional I/O queue of bytes output (default « »), run these +steps: + +
Let encoder be the result of getting an encoder from encoding. + +
Process a queue with encoder, ioQueue, output, and
+ "html".
+
+
Return output. +
This is a legacy hook for HTML forms. Layering UTF-8 encode on top +is safe as it never triggers errors. [[HTML]] + +
To get an encoder from an +encoding encoding: + +
Assert: encoding is not replacement or UTF-16BE/LE. + +
Return an instance of encoding's encoder. +
To encode or fail an I/O queue of scalar values ioQueue given an +encoder instance encoder and an I/O queue of bytes output, run +these steps: + +
Let potentialError be the result of processing a queue with
+ encoder, ioQueue, output, and "fatal".
+
+
Push end-of-queue to output. + +
If potentialError is an error, then return error's + code point's value. + +
Return null. +
This is a legacy hook for URL percent-encoding. The caller will have to keep an + encoder instance alive as the ISO-2022-JP encoder can be in two different + states when returning an error. That also means that if the caller emits bytes to encode the + error in some way, these have to be in the range 0x00 to 0x7F, inclusive, excluding 0x0E, 0x0F, + 0x1B, 0x5C, and 0x7E. [[URL]] + +
In particular, if upon returning an error the ISO-2022-JP encoder is in the
+ Roman state, the caller cannot output 0x5C (\) as it will not
+ decode as U+005C (\). For this reason, applications using encode or fail for unintended
+ purposes ought to take care to prevent the use of the ISO-2022-JP encoder in combination
+ with replacement schemes, such as those of JavaScript and CSS, that use U+005C (\) as part of the
+ replacement syntax (e.g., \u2603) or make sure to pass the replacement syntax through
+ the encoder (in contrast to URL percent-encoding).
+
+
The return value is either the number representing the code point that could not be + encoded or null, if there was no error. When it returns non-null the caller will have to + invoke it again, supplying the same encoder instance and a new output I/O queue. +
This section uses terminology from Web IDL. Browser user agents must support this API. JavaScript +implementations should support this API. Other user agents or programming languages are encouraged +to use an API suitable to their needs, which might not be this one. [[!WEBIDL]] + +
The following example uses the {{TextEncoder}} object to encode + an array of strings into an + {{ArrayBuffer}}. The returned buffer contains the number + of strings (as a big-endian 32-bit unsigned integer), + followed by the length of the first string (as a + big-endian 32-bit unsigned integer), the + UTF-8 encoded string data, the length of the second string, the string data, + and so on. +
+function encodeArrayOfStrings(strings) {
+  var encoder, encoded, len, bytes, view, offset;
+
+  encoder = new TextEncoder();
+  encoded = [];
+
+  // First pass: encode each string and total the bytes needed
+  // (one 32-bit count, then a 32-bit length plus the data per string).
+  len = Uint32Array.BYTES_PER_ELEMENT;
+  for (var i = 0; i < strings.length; i++) {
+    len += Uint32Array.BYTES_PER_ELEMENT;
+    encoded[i] = encoder.encode(strings[i]);
+    len += encoded[i].byteLength;
+  }
+
+  bytes = new Uint8Array(len);
+  view = new DataView(bytes.buffer);
+  offset = 0;
+
+  // Second pass: write the count, then each length-prefixed string.
+  // DataView.setUint32() writes big-endian by default.
+  view.setUint32(offset, strings.length);
+  offset += Uint32Array.BYTES_PER_ELEMENT;
+  for (var i = 0; i < encoded.length; i++) {
+    len = encoded[i].byteLength;
+    view.setUint32(offset, len);
+    offset += Uint32Array.BYTES_PER_ELEMENT;
+    bytes.set(encoded[i], offset);
+    offset += len;
+  }
+  return bytes.buffer;
+}
+
+ The following example decodes an {{ArrayBuffer}} containing data encoded in the + format produced by the previous example, or an equivalent algorithm for encodings other than + UTF-8, back into an array of strings. + +
+function decodeArrayOfStrings(buffer, encoding) {
+  var decoder, view, offset, num_strings, strings, len;
+
+  decoder = new TextDecoder(encoding);
+  view = new DataView(buffer);
+  offset = 0;
+  strings = [];
+
+  // Read the string count, then each length-prefixed string in turn.
+  num_strings = view.getUint32(offset);
+  offset += Uint32Array.BYTES_PER_ELEMENT;
+  for (var i = 0; i < num_strings; i++) {
+    len = view.getUint32(offset);
+    offset += Uint32Array.BYTES_PER_ELEMENT;
+    // Decode only the len bytes that belong to this string.
+    strings[i] = decoder.decode(
+      new DataView(view.buffer, offset, len));
+    offset += len;
+  }
+  return strings;
+}
++interface mixin TextDecoderCommon { + readonly attribute DOMString encoding; + readonly attribute boolean fatal; + readonly attribute boolean ignoreBOM; +}; ++ +
The {{TextDecoderCommon}} interface mixin defines common getters that are shared between +{{TextDecoder}} and {{TextDecoderStream}} objects. These objects have an associated: + +
replacement
".
+The serialize I/O queue algorithm, given a +{{TextDecoderCommon}} decoder and an I/O queue of scalar values +ioQueue, runs these steps: + +
Let output be the empty string. + +
While true: + +
Let item be the result of reading from ioQueue. + +
If item is end-of-queue, then return output. + +
If decoder's encoding is UTF-8 or + UTF-16BE/LE, and decoder's ignore BOM and + BOM seen are false, then: + +
+ +Append item to output. +
This algorithm is intentionally different with respect to BOM handling from +the decode algorithm used by the rest of the platform to give API users more +control. + +
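The effect of ignore BOM on serialization can be observed from script. The following is a small sketch, with variable names chosen for this example only, that decodes the same UTF-8 bytes with and without the `ignoreBOM` option.

```javascript
// UTF-8 byte order mark (EF BB BF) followed by "hi".
const bomBytes = new Uint8Array([0xEF, 0xBB, 0xBF, 0x68, 0x69]);

// Default (ignoreBOM is false): the leading U+FEFF is dropped from the output.
const withoutBOM = new TextDecoder("utf-8").decode(bomBytes);

// ignoreBOM set to true: the leading U+FEFF is kept in the output.
const withBOM = new TextDecoder("utf-8", { ignoreBOM: true }).decode(bomBytes);
```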
The encoding
+getter steps are to return this's encoding's
+name, ASCII lowercased.
+
+
The fatal
getter
+steps are to return true if this's error mode is
+"fatal
", otherwise false.
+
+
The
+ignoreBOM
+getter steps are to return this's ignore BOM.
+
+
+
+dictionary TextDecoderOptions { + boolean fatal = false; + boolean ignoreBOM = false; +}; + +dictionary TextDecodeOptions { + boolean stream = false; +}; + +[Exposed=*] +interface TextDecoder { + constructor(optional DOMString label = "utf-8", optional TextDecoderOptions options = {}); + + USVString decode(optional AllowSharedBufferSource input, optional TextDecodeOptions options = {}); +}; +TextDecoder includes TextDecoderCommon; ++ +
A {{TextDecoder}} object has an associated +do not flush, which is a boolean, +initially false. + +
decoder = new TextDecoder([label = "utf-8" [, options]])
+ Returns a new {{TextDecoder}} object. +
If label is either not a label or is a label for + replacement, throws a {{RangeError}}. + +
decoder . encoding
+ decoder . fatal
+ Returns true if error mode is "fatal
", otherwise
+ false.
+
+
decoder . ignoreBOM
+ Returns the value of ignore BOM. + +
decoder . decode([input [, options]])
+ Returns the result of running encoding's decoder.
+ The method can be invoked zero or more times with options's stream
set to
+ true, and then once without options's stream
(or set to false), to process
+ a fragmented input. If the invocation without options's stream
(or set to
+ false) has no input, it's clearest to omit both arguments.
+
+
+var string = "", decoder = new TextDecoder(encoding), buffer;
+while(buffer = next_chunk()) {
+ string += decoder.decode(buffer, {stream:true});
+}
+string += decoder.decode(); // end-of-queue
+
+ If the error mode is "fatal
" and
+ encoding's decoder returns error,
+ throws a {{TypeError}}.
+
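As a brief sketch of the error behavior summarized above, the following contrasts the default error mode with "fatal", and shows an unknown label being rejected. The variable names and the label string `"not-a-real-label"` are hypothetical and exist only for this example.

```javascript
// 0xFF can never appear in well-formed UTF-8.
const malformed = new Uint8Array([0x61, 0xFF, 0x62]);

// Default error mode: the offending byte is replaced with U+FFFD.
const lenient = new TextDecoder().decode(malformed);

// "fatal" error mode: the same input causes decode() to throw a TypeError.
let fatalError = null;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(malformed);
} catch (e) {
  fatalError = e;
}

// A string that is not a label is rejected by the constructor with a RangeError.
let labelError = null;
try {
  new TextDecoder("not-a-real-label");
} catch (e) {
  labelError = e;
}
```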
The
+new TextDecoder(label, options)
+constructor steps are:
+
+
Let encoding be the result of getting an encoding from label. + +
If encoding is failure or replacement, then throw a {{RangeError}}. + +
If options["{{TextDecoderOptions/fatal}}"] is true, then set this's
+ error mode to "fatal
".
+
+
Set this's ignore BOM to + options["{{TextDecoderOptions/ignoreBOM}}"]. +
The decode(input, options)
+method steps are:
+
+
If this's do not flush is false, then set this's + decoder to a new instance of this's + encoding's decoder, this's + I/O queue to the I/O queue of bytes + « end-of-queue », and this's BOM seen to false. + +
Set this's do not flush to + options["{{TextDecodeOptions/stream}}"]. + +
If input is given, then push a + copy of input to this's + I/O queue. + +
Implementations are strongly encouraged to use an implementation strategy that
+ avoids this copy. When doing so they will have to make sure that changes to input do
+ not affect future calls to decode()
.
+
+
Let output be the I/O queue of scalar values + « end-of-queue ». + +
While true: + +
Let item be the result of reading from this's + I/O queue. + +
If item is end-of-queue and this's + do not flush is true, then return the result of running + serialize I/O queue with this and output. + +
The way streaming works is to not handle end-of-queue here when + this's do not flush is true and to not set it to false. That way + in a subsequent invocation this's decoder is not set anew in + the first step of the algorithm and its state is preserved. + +
Otherwise: + +
Let result be the result of processing an item with item, + this's decoder, this's + I/O queue, output, and this's + error mode. + +
If result is finished, then return the result of running + serialize I/O queue with this and output. + +
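To make the role of do not flush concrete, here is a minimal sketch in which a single three-byte UTF-8 sequence is split across two calls. The variable names are illustrative only.

```javascript
// "€" is U+20AC, which UTF-8 encodes as E2 82 AC. Split it across two chunks.
const streamingDecoder = new TextDecoder();

// The first call sees an incomplete sequence; with stream set to true the
// decoder keeps its state and emits nothing yet.
const first = streamingDecoder.decode(new Uint8Array([0xE2, 0x82]), { stream: true });

// The second call completes the sequence.
const second = streamingDecoder.decode(new Uint8Array([0xAC]), { stream: true });

// A final call without arguments flushes the decoder.
const last = streamingDecoder.decode();
```

Had the first call been made without `stream`, the truncated sequence would have been treated as an error instead of being carried over.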
+interface mixin TextEncoderCommon { + readonly attribute DOMString encoding; +}; ++ +
The {{TextEncoderCommon}} interface mixin defines common getters that are shared between +{{TextEncoder}} and {{TextEncoderStream}} objects. + +
The encoding
+getter steps are to return "utf-8
".
+
+
+
+dictionary TextEncoderEncodeIntoResult { + unsigned long long read; + unsigned long long written; +}; + +[Exposed=*] +interface TextEncoder { + constructor(); + + [NewObject] Uint8Array encode(optional USVString input = ""); + TextEncoderEncodeIntoResult encodeInto(USVString source, [AllowShared] Uint8Array destination); +}; +TextEncoder includes TextEncoderCommon; ++ +
A {{TextEncoder}} object offers no label argument as it only
+supports UTF-8. It also offers no stream
option as no encoder
+requires buffering of scalar values.
+
+
encoder = new TextEncoder()
+ Returns a new {{TextEncoder}} object. + +
encoder . encoding
+ Returns "utf-8
".
+
+
encoder . encode([input = ""])
+ encoder . encodeInto(source, destination)
+ Runs the UTF-8 encoder on source, stores the result of that operation into + destination, and returns the progress made as an object wherein + {{TextEncoderEncodeIntoResult/read}} is the number of converted code units of + source and {{TextEncoderEncodeIntoResult/written}} is the number of bytes modified in + destination. +
The
+new TextEncoder()
+constructor steps are to do nothing.
+
+
The encode(input)
method steps are:
+
+
Let output be the I/O queue of bytes « end-of-queue ». + +
While true: + +
Let item be the result of + reading from input. + +
Let result be the result of processing an item with item, an
+ instance of the UTF-8 encoder, input, output, and
+ "fatal
".
+
+
Assert: result is not an error. + +
The UTF-8 encoder cannot return error. + +
If result is finished, then convert + output into a byte sequence and return a {{Uint8Array}} object wrapping an + {{ArrayBuffer}} containing output. + +
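A short sketch of what callers can expect from `encode()`, including the consequence of the argument being a {{USVString}}: a lone surrogate is converted to U+FFFD before encoding. The variable names are chosen for this example.

```javascript
const utf8Encoder = new TextEncoder();

// U+20AC EURO SIGN takes three bytes in UTF-8.
const euro = utf8Encoder.encode("\u20AC");

// A lone surrogate becomes U+FFFD, which UTF-8 encodes as EF BF BD.
const lone = utf8Encoder.encode("\uD800");

// The input argument defaults to the empty string.
const empty = utf8Encoder.encode();
```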
The
+encodeInto(source, destination)
+method steps are:
+
+
Let read be 0. + +
Let written be 0. + +
Let encoder be an instance of the UTF-8 encoder. + +
Let unused be the I/O queue of scalar values « end-of-queue ». + +
The handler algorithm invoked below requires this argument, but it is not + used by the UTF-8 encoder. + +
While true: + +
Let item be the result of reading from source. + +
Let result be the result of running encoder's handler on + unused and item. + +
Otherwise: + +
If destination's byte length − + written is greater than or equal to the number of bytes in result, then: + +
If item is greater than U+FFFF, then increment read by 2. + +
Otherwise, increment read by 1. + +
Write the bytes in result into + destination, with startingOffset set to + written. + +
See the
+ warning for SharedArrayBuffer
objects
+ above.
+
+
Increment written by the number of bytes in result. +
Otherwise, break. +
Return «[ "{{TextEncoderEncodeIntoResult/read}}" → read, + "{{TextEncoderEncodeIntoResult/written}}" → written ]». +
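The bookkeeping of read and written in the steps above can be seen in a small sketch. The buffer sizes and variable names are arbitrary choices for this example.

```javascript
const intoEncoder = new TextEncoder();

// Three bytes of room: "a" fits (1 byte), but "€" needs 3 bytes and only 2
// remain, so the loop stops before writing any part of it.
const small = new Uint8Array(3);
const partial = intoEncoder.encodeInto("a\u20AC", small);

// U+1F600 is two UTF-16 code units in the source and four bytes in UTF-8.
const roomy = new Uint8Array(8);
const full = intoEncoder.encodeInto("\u{1F600}", roomy);
```

A caller that receives a `read` value smaller than the source length can retry with the remainder of the string and a larger destination.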
The encodeInto() method can + be used to encode a string into an existing {{ArrayBuffer}} object. Various details below are left + as an exercise for the reader, but this demonstrates an approach one could take to use this method: + +
+function convertString(buffer, input, callback) {
+ let bufferSize = 256,
+ bufferStart = malloc(buffer, bufferSize),
+ writeOffset = 0,
+ readOffset = 0;
+ while (true) {
+ const view = new Uint8Array(buffer, bufferStart + writeOffset, bufferSize - writeOffset),
+ {read, written} = cachedEncoder.encodeInto(input.substring(readOffset), view);
+ readOffset += read;
+ writeOffset += written;
+ if (readOffset === input.length) {
+ callback(bufferStart, writeOffset);
+ free(buffer, bufferStart);
+ return;
+ }
+ bufferSize *= 2;
+ bufferStart = realloc(buffer, bufferStart, bufferSize);
+ }
+}
+
++[Exposed=*] +interface TextDecoderStream { + constructor(optional DOMString label = "utf-8", optional TextDecoderOptions options = {}); +}; +TextDecoderStream includes TextDecoderCommon; +TextDecoderStream includes GenericTransformStream; ++ +
decoder = new
+ TextDecoderStream([label =
+ "utf-8" [, options]])
+ Returns a new {{TextDecoderStream}} object. +
If label is either not a label or is a label for + replacement, throws a {{RangeError}}. + +
decoder . encoding
+ decoder . fatal
+ Returns true if error mode is "fatal
", and
+ false otherwise.
+
+
decoder . ignoreBOM
+ Returns the value of ignore BOM. + +
decoder . readable
+ Returns a readable stream whose chunks are strings resulting from running + encoding's decoder on the chunks written to + {{GenericTransformStream/writable}}. + +
decoder . writable
+ Returns a writable stream which accepts
+ AllowSharedBufferSource
chunks and runs
+ them through encoding's decoder before making them
+ available to {{GenericTransformStream/readable}}.
+
+
Typically this will be used via the {{ReadableStream/pipeThrough()}} method on a + {{ReadableStream}} source. + +
+var decoder = new TextDecoderStream(encoding);
+byteReadable
+ .pipeThrough(decoder)
+ .pipeTo(textWritable);
+
+ If the error mode is "fatal
" and
+ encoding's decoder returns error, both
+ {{GenericTransformStream/readable}} and {{GenericTransformStream/writable}} will be errored with a
+ {{TypeError}}.
+
The
+new TextDecoderStream(label, options)
+constructor steps are:
+
+
Let encoding be the result of getting an encoding from label. + +
If encoding is failure or replacement, then throw a {{RangeError}}. + +
If options["{{TextDecoderOptions/fatal}}"] is true, then set this's
+ error mode to "fatal
".
+
+
Set this's ignore BOM to + options["{{TextDecoderOptions/ignoreBOM}}"]. + +
Set this's decoder to a new instance of this's + encoding's decoder, and set this's + I/O queue to a new I/O queue. + +
Let transformAlgorithm be an algorithm which takes a chunk argument + and runs the decode and enqueue a chunk algorithm with this and chunk. + +
Let flushAlgorithm be an algorithm which takes no arguments and runs the + flush and enqueue algorithm with this. + +
Let transformStream be a [=new=] {{TransformStream}}. + +
[=TransformStream/Set up=] transformStream with + transformAlgorithm set to + transformAlgorithm and + flushAlgorithm set to + flushAlgorithm. + +
The decode and enqueue a chunk algorithm, given a {{TextDecoderStream}} object +decoder and a chunk, runs these steps: + +
Let bufferSource be the result of
+ converting chunk to an
+ AllowSharedBufferSource
.
+
+
Push a copy of bufferSource to + decoder's I/O queue. + +
See the
+ warning for SharedArrayBuffer
objects above.
+
+
Let output be the I/O queue of scalar values + « end-of-queue ». + +
While true: + +
Let item be the result of reading from decoder's + I/O queue. + +
If item is end-of-queue, then: + +
Let outputChunk be the result of running serialize I/O queue with + decoder and output. + +
If outputChunk is non-empty, then enqueue + outputChunk in decoder's transform. + +
Return. +
Let result be the result of processing an item with item, + decoder's decoder, decoder's + I/O queue, output, and decoder's + error mode. + +
The flush and enqueue algorithm, which handles the end of data from the input +{{ReadableStream}} object, given a {{TextDecoderStream}} object decoder, runs these +steps: + +
Let output be the I/O queue of scalar values + « end-of-queue ». + +
While true: + +
Let item be the result of reading from decoder's + I/O queue. + +
Let result be the result of processing an item with item, + decoder's decoder, decoder's + I/O queue, output, and decoder's + error mode. + +
If result is finished, then: + +
Let outputChunk be the result of running serialize I/O queue with + decoder and output. + +
If outputChunk is non-empty, then enqueue + outputChunk in decoder's transform. + +
Return. +
+[Exposed=*] +interface TextEncoderStream { + constructor(); +}; +TextEncoderStream includes TextEncoderCommon; +TextEncoderStream includes GenericTransformStream; ++ +
A {{TextEncoderStream}} object has an associated: + +
A {{TextEncoderStream}} object offers no label argument as it +only supports UTF-8. + +
encoder = new TextEncoderStream()
+ Returns a new {{TextEncoderStream}} object. + +
encoder . encoding
+ Returns "utf-8
".
+
+
encoder . readable
+ Returns a readable stream whose chunks are {{Uint8Array}}s resulting from running + UTF-8's encoder on the chunks written to {{GenericTransformStream/writable}}. + +
encoder . writable
+ Returns a writable stream which accepts string chunks and runs them through + UTF-8's encoder before making them available to + {{GenericTransformStream/readable}}. + +
Typically this will be used via the {{ReadableStream/pipeThrough()}} method on a + {{ReadableStream}} source. + +
+textReadable
+ .pipeThrough(new TextEncoderStream())
+ .pipeTo(byteWritable);
+The
+new TextEncoderStream()
+constructor steps are:
+
+
Set this's encoder to an instance of the + UTF-8 encoder. + +
Let transformAlgorithm be an algorithm which takes a chunk argument + and runs the encode and enqueue a chunk algorithm with this and chunk. + +
Let flushAlgorithm be an algorithm which runs the encode and flush + algorithm with this. + +
Let transformStream be a [=new=] {{TransformStream}}. + +
[=TransformStream/Set up=] transformStream with + transformAlgorithm set to + transformAlgorithm and + flushAlgorithm set to + flushAlgorithm. + +
The encode and enqueue a chunk algorithm, given a {{TextEncoderStream}} object +encoder and chunk, runs these steps: + +
Let input be the result of converting + chunk to a {{DOMString}}. + +
Convert input to an I/O queue of + code units. + +
{{DOMString}}, as well as an I/O queue of code units rather than scalar + values, are used here so that a surrogate pair that is split between chunks can be reassembled into + the appropriate scalar value. The behavior is otherwise identical to {{USVString}}. In particular, + lone surrogates will be replaced with U+FFFD. + +
Let output be the I/O queue of bytes « end-of-queue ». + +
While true: + +
Let item be the result of reading from input. + +
If item is end-of-queue, then: + +
+ +Let result be the result of executing the convert code unit to scalar + value algorithm with encoder, item and input. + +
If result is not continue, then process an item with
+ result, encoder's encoder, input,
+ output, and "fatal
".
+
The convert code unit to scalar value algorithm, given a {{TextEncoderStream}} object +encoder, a code unit item, and an I/O queue of code units +input, runs these steps: + +
If encoder's leading surrogate is non-null, then: + +
Let leadingSurrogate be encoder's + leading surrogate. + +
Set encoder's leading surrogate to null. + +
If item is a trailing surrogate, then return a + scalar value from surrogates given leadingSurrogate and item. + +
Restore item to input. + +
Return U+FFFD. +
If item is a leading surrogate, then set encoder's + leading surrogate to item and return continue. + +
If item is a trailing surrogate, then return U+FFFD. + +
Return item. +
This is equivalent to the "convert a string into a +scalar value string" algorithm from the Infra Standard, but allows for surrogate pairs +that are split between strings. [[!INFRA]] + +
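The steps above can be sketched as a small stateful helper that carries a pending leading surrogate from one string chunk to the next. The names `createChunkConverter`, `convert`, and `flush` are hypothetical; in particular, `flush` emitting U+FFFD for a dangling leading surrogate is an assumption derived from the note that lone surrogates are replaced with U+FFFD.

```javascript
// Hypothetical helper; the closure variable stands in for the
// TextEncoderStream object's leading surrogate.
function createChunkConverter() {
  let leadingSurrogate = null;
  const isLead = (u) => u >= 0xD800 && u <= 0xDBFF;
  const isTrail = (u) => u >= 0xDC00 && u <= 0xDFFF;
  return {
    // Converts one chunk of code units into an array of scalar values.
    convert(chunk) {
      const out = [];
      for (let i = 0; i < chunk.length; i++) {
        const unit = chunk.charCodeAt(i);
        if (leadingSurrogate !== null) {
          const lead = leadingSurrogate;
          leadingSurrogate = null;
          if (isTrail(unit)) {
            // Scalar value from surrogates.
            out.push(0x10000 + ((lead - 0xD800) << 10) + (unit - 0xDC00));
            continue;
          }
          // Unpaired lead: emit U+FFFD, then fall through so that `unit`
          // is processed again (the "restore item to input" step).
          out.push(0xFFFD);
        }
        if (isLead(unit)) {
          leadingSurrogate = unit; // corresponds to returning continue
          continue;
        }
        out.push(isTrail(unit) ? 0xFFFD : unit);
      }
      return out;
    },
    // Assumed end-of-stream behavior: a dangling lead becomes U+FFFD.
    flush() {
      if (leadingSurrogate === null) return [];
      leadingSurrogate = null;
      return [0xFFFD];
    },
  };
}
```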
The encode and flush algorithm, given a {{TextEncoderStream}} object +encoder, runs these steps: + +
If encoder's leading surrogate is non-null, then: + +
+A byte order mark has priority over a label as it has been found to be more accurate +in deployed content. Therefore it is not part of the UTF-8 decoder algorithm, but rather the +decode and UTF-8 decode algorithms. + +
UTF-8's decoder has an associated +UTF-8 code point, UTF-8 bytes seen, and +UTF-8 bytes needed (all initially 0), a UTF-8 lower boundary +(initially 0x80), and a UTF-8 upper boundary (initially 0xBF). + +
UTF-8's decoder's handler, given +ioQueue and byte, runs these steps: + +
If byte is end-of-queue and + UTF-8 bytes needed is not 0, set + UTF-8 bytes needed to 0 and return error. + +
If byte is end-of-queue, return + finished. + +
If UTF-8 bytes needed is 0, based on byte: + +
Return a code point whose value is byte. + +
Set UTF-8 bytes needed to 1. + +
Set UTF-8 code point to byte & 0x1F. + +
The five least significant bits of byte. +
If byte is 0xE0, set + UTF-8 lower boundary to 0xA0. + +
If byte is 0xED, set + UTF-8 upper boundary to 0x9F. + +
Set UTF-8 bytes needed to 2. + +
Set UTF-8 code point to byte & 0xF. + +
The four least significant bits of byte. +
If byte is 0xF0, set + UTF-8 lower boundary to 0x90. + +
If byte is 0xF4, set + UTF-8 upper boundary to 0x8F. + +
Set UTF-8 bytes needed to 3. + +
Set UTF-8 code point to byte & 0x7. + +
The three least significant bits of byte. +
Return error. +
Return continue. + +
If byte is not in the range UTF-8 lower boundary to + UTF-8 upper boundary, inclusive, then: + +
Set UTF-8 code point, + UTF-8 bytes needed, and UTF-8 bytes seen to 0, + set UTF-8 lower boundary to 0x80, and set + UTF-8 upper boundary to 0xBF. + +
Restore byte to ioQueue. + +
Return error. +
Set UTF-8 lower boundary to 0x80 and + UTF-8 upper boundary to 0xBF. + +
Set UTF-8 code point to (UTF-8 code point << 6) | + (byte & 0x3F). + +
Shift the existing bits of UTF-8 code point left by six + places and set the newly-vacated six least significant bits to the six least significant bits of + byte. + +
Increase UTF-8 bytes seen by one. + +
If UTF-8 bytes seen is not equal to + UTF-8 bytes needed, return continue. + +
Let code point be UTF-8 code point. + +
Set UTF-8 code point, + UTF-8 bytes needed, and UTF-8 bytes seen to 0. + +
Return a code point whose value is code point. +
The constraints in the UTF-8 decoder above match +“Best Practices for Using U+FFFD” from the Unicode standard. No other +behavior is permitted per the Encoding Standard (other algorithms that +achieve the same result are fine, even encouraged). +[[!UNICODE]] + + +
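For readers who prefer code, the decoder steps above can be sketched as follows. This is an illustration, not a normative implementation: the function name `utf8DecodeSketch` is hypothetical, the input is assumed to be a complete array of bytes rather than an I/O queue, and every error is rendered as U+FFFD (that is, the "replacement" error mode is assumed).

```javascript
// Illustrative sketch of the UTF-8 decoder state machine.
function utf8DecodeSketch(bytes) {
  let codePoint = 0, bytesSeen = 0, bytesNeeded = 0;
  let lower = 0x80, upper = 0xBF;
  let out = "";
  let i = 0;
  while (i < bytes.length) {
    const byte = bytes[i++];
    if (bytesNeeded === 0) {
      if (byte <= 0x7F) {
        out += String.fromCodePoint(byte);
      } else if (byte >= 0xC2 && byte <= 0xDF) {
        bytesNeeded = 1;
        codePoint = byte & 0x1F;
      } else if (byte >= 0xE0 && byte <= 0xEF) {
        if (byte === 0xE0) lower = 0xA0; // rejects overlong forms
        if (byte === 0xED) upper = 0x9F; // rejects surrogates
        bytesNeeded = 2;
        codePoint = byte & 0xF;
      } else if (byte >= 0xF0 && byte <= 0xF4) {
        if (byte === 0xF0) lower = 0x90; // rejects overlong forms
        if (byte === 0xF4) upper = 0x8F; // rejects values above U+10FFFF
        bytesNeeded = 3;
        codePoint = byte & 0x7;
      } else {
        out += "\uFFFD"; // error
      }
      continue;
    }
    if (byte < lower || byte > upper) {
      codePoint = bytesNeeded = bytesSeen = 0;
      lower = 0x80;
      upper = 0xBF;
      i--; // restore byte so it is examined again as a potential lead
      out += "\uFFFD"; // error
      continue;
    }
    lower = 0x80;
    upper = 0xBF;
    codePoint = (codePoint << 6) | (byte & 0x3F);
    bytesSeen++;
    if (bytesSeen !== bytesNeeded) continue;
    out += String.fromCodePoint(codePoint);
    codePoint = bytesNeeded = bytesSeen = 0;
  }
  // End-of-queue in the middle of a sequence is an error.
  if (bytesNeeded !== 0) out += "\uFFFD";
  return out;
}
```

Restoring the offending byte is what lets a truncated sequence followed by ASCII, such as 0xE2 0x41, decode to U+FFFD followed by "A" rather than swallowing the "A".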
UTF-8's encoder's handler, given +ioQueue and code point, runs these steps: + +
If code point is end-of-queue, return + finished. + +
If code point is an ASCII code point, return + a byte whose value is code point. + +
Set count and offset based on the + range code point is in: + +
Let bytes be a byte sequence whose first byte is + (code point >> (6 × count)) + offset. + +
While count is greater than 0: + +
Set temp to + code point >> (6 × (count − 1)). + +
Append to bytes 0x80 | (temp & 0x3F). + +
Decrease count by one. +
Return bytes bytes, in order. +
This algorithm has identical results to the one described in the Unicode standard. It +is included here for completeness. [[!UNICODE]] + + + +
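The encoder steps above translate almost directly into code. The following sketch uses a hypothetical function name, `utf8EncodeCodePoint`, and assumes its argument is already a scalar value.

```javascript
// Illustrative sketch: returns the UTF-8 bytes for one scalar value.
function utf8EncodeCodePoint(codePoint) {
  if (codePoint <= 0x7F) return [codePoint]; // ASCII code point

  // count is the number of continuation bytes; offset marks the lead byte.
  let count, offset;
  if (codePoint <= 0x7FF) {
    count = 1;
    offset = 0xC0;
  } else if (codePoint <= 0xFFFF) {
    count = 2;
    offset = 0xE0;
  } else {
    count = 3;
    offset = 0xF0;
  }

  const bytes = [(codePoint >> (6 * count)) + offset];
  while (count > 0) {
    const temp = codePoint >> (6 * (count - 1));
    bytes.push(0x80 | (temp & 0x3F));
    count--;
  }
  return bytes;
}
```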
An encoding where each byte is either a single code point or +nothing, is a single-byte encoding. +Single-byte encodings share the +decoder and encoder. Index single-byte, +as referenced by the single-byte decoder and +single-byte encoder, is defined by the following table, and +depends on the single-byte encoding in use. All but two +single-byte encodings have a +unique index. + +
ISO-8859-8 and ISO-8859-8-I are +distinct encoding names, because +ISO-8859-8 has influence on the layout direction. And although +historically this might have been the case for ISO-8859-6 and +"ISO-8859-6-I" as well, that is no longer true. + + +
Single-byte encodings' +decoder's handler, given ioQueue and +byte, runs these steps: + +
If byte is end-of-queue, return + finished. + +
If byte is an ASCII byte, return a code point whose value + is byte. + +
Let code point be the index code point + for byte − 0x80 in index single-byte. + +
If code point is null, return error. + +
Return a code point whose value is code point. +
Single-byte encodings' +encoder's handler, given ioQueue and +code point, runs these steps: + +
If code point is end-of-queue, return + finished. + +
If code point is an ASCII code point, return + a byte whose value is code point. + +
Let pointer be the index pointer for + code point in index single-byte. + +
If pointer is null, return error with + code point. + +
Return a byte whose value is pointer + 0x80. +
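The two handlers above amount to an array lookup and its inverse. In the sketch below, `toyIndex` is a hypothetical three-entry index invented for illustration (actual single-byte indexes have 128 entries), and the string `"error"` stands in for the error result.

```javascript
// Hypothetical, deliberately tiny index: pointer 0 and 2 are mapped, 1 is not.
const toyIndex = [0x20AC, null, 0x201A];

function singleByteDecode(byte, index) {
  if (byte <= 0x7F) return byte; // ASCII byte maps to itself
  const codePoint = index[byte - 0x80] ?? null; // index code point
  return codePoint === null ? "error" : codePoint;
}

function singleByteEncode(codePoint, index) {
  if (codePoint <= 0x7F) return codePoint; // ASCII code point maps to itself
  const pointer = index.indexOf(codePoint); // index pointer (first match)
  return pointer === -1 ? "error" : pointer + 0x80;
}
```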
GBK's decoder is gb18030's decoder. + + +
GBK's encoder is gb18030's encoder +with its is GBK set to true. + +
Not fully aliasing GBK with gb18030 +is a conservative move to decrease the chances of breaking legacy servers and other +consumers of content generated with GBK's encoder. + + +
gb18030's decoder has an associated gb18030 first, +gb18030 second, and gb18030 third (all initially 0x00). + +
gb18030's decoder's handler, given +ioQueue and byte, runs these steps: + +
If byte is end-of-queue and + gb18030 first, gb18030 second, and gb18030 third + are 0x00, return finished. + +
If byte is end-of-queue, and + gb18030 first, gb18030 second, or gb18030 third + is not 0x00, set gb18030 first, gb18030 second, and + gb18030 third to 0x00, and return error. + +
If gb18030 third is not 0x00, then: + +
If byte is not in the range 0x30 to 0x39, inclusive, then: + +
Restore « gb18030 second, gb18030 third, byte » to + ioQueue. + +
Set gb18030 first, gb18030 second, and gb18030 third to 0x00. + +
Return error. +
Let code point be the index gb18030 ranges code point for + ((gb18030 first − 0x81) × (10 × 126 × 10)) + + ((gb18030 second − 0x30) × (10 × 126)) + + ((gb18030 third − 0x81) × 10) + byte − 0x30. + +
Set gb18030 first, gb18030 second, and gb18030 third to 0x00. + +
If code point is null, return error. + +
Return a code point whose value is code point. +
If gb18030 second is not 0x00, then: + +
If byte is in the range 0x81 to 0xFE, inclusive, set + gb18030 third to byte and return continue. + +
Restore « gb18030 second, byte » to ioQueue, set + gb18030 first and gb18030 second to 0x00, and return error. +
If gb18030 first is not 0x00, then: + +
If byte is in the range 0x30 to 0x39, inclusive, set + gb18030 second to byte and return continue. + +
Let lead be gb18030 first, let + pointer be null, and set gb18030 first to 0x00. + +
Let offset be 0x40 if byte is less than 0x7F, otherwise 0x41. + +
If byte is in the range 0x40 to 0x7E, inclusive, or + 0x80 to 0xFE, inclusive, set pointer to + (lead − 0x81) × 190 + (byte − offset). + +
Let code point be null if pointer is null, otherwise the + index code point for pointer in index gb18030. + +
If code point is non-null, return a code point whose value is + code point. + +
If byte is an ASCII byte, restore byte to + ioQueue. + +
Return error. +
If byte is an ASCII byte, return + a code point whose value is byte. + +
If byte is 0x80, return code point U+20AC. + +
If byte is in the range 0x81 to 0xFE, inclusive, set + gb18030 first to byte and return continue. + +
Return error. +
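The pointer arithmetic used by the decoder above can be isolated in two small functions. The function names are hypothetical, and the index lookups themselves are omitted; only the computation of the pointer from the bytes is shown.

```javascript
// Pointer into index gb18030 for a two-byte sequence, or null if the
// second byte is outside 0x40 to 0x7E and 0x80 to 0xFE.
function gb18030TwoBytePointer(lead, byte) {
  const inRange = (byte >= 0x40 && byte <= 0x7E) || (byte >= 0x80 && byte <= 0xFE);
  if (!inRange) return null;
  const offset = byte < 0x7F ? 0x40 : 0x41; // skips the unused trail byte 0x7F
  return (lead - 0x81) * 190 + (byte - offset);
}

// Pointer into index gb18030 ranges for a four-byte sequence.
function gb18030FourBytePointer(first, second, third, fourth) {
  return (first - 0x81) * (10 * 126 * 10)
    + (second - 0x30) * (10 * 126)
    + (third - 0x81) * 10
    + fourth - 0x30;
}
```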
gb18030's encoder has an associated is GBK +(initially false). + +
gb18030's encoder's handler, given +ioQueue and code point, runs these steps: + +
If code point is end-of-queue, return + finished. + +
If code point is an ASCII code point, return + a byte whose value is code point. + +
If code point is U+E5E5, return error with code point. + +
Index gb18030 maps 0xA3 0xA0 to U+3000 rather than U+E5E5 for + compatibility with deployed content. Therefore it cannot roundtrip. + +
If is GBK is true and code point is + U+20AC, return byte 0x80. + +
If there is a row in the table below whose first column is code point, then return + the two bytes on the same row listed in the second column: + +
Code point | Bytes |
---|---|
U+E78D | 0xA6 0xD9 |
U+E78E | 0xA6 0xDA |
U+E78F | 0xA6 0xDB |
U+E790 | 0xA6 0xDC |
U+E791 | 0xA6 0xDD |
U+E792 | 0xA6 0xDE |
U+E793 | 0xA6 0xDF |
U+E794 | 0xA6 0xEC |
U+E795 | 0xA6 0xED |
U+E796 | 0xA6 0xF3 |
U+E81E | 0xFE 0x59 |
U+E826 | 0xFE 0x61 |
U+E82B | 0xFE 0x66 |
U+E82C | 0xFE 0x67 |
U+E832 | 0xFE 0x6D |
U+E843 | 0xFE 0x7E |
U+E854 | 0xFE 0x90 |
U+E864 | 0xFE 0xA0 |
This asymmetric encoder table preserves compatibility with the GB18030-2005 + standard. See also the explanation at index gb18030 ranges. + +
Let pointer be the index pointer for + code point in index gb18030. + +
If pointer is non-null, then: + +
Let lead be pointer / 190 + 0x81. + +
Let trail be pointer % 190. + +
Let offset be 0x40 if trail is less than 0x3F, + otherwise 0x41. + +
Return two bytes whose values are lead and + trail + offset. +
Set pointer to the + index gb18030 ranges pointer for code point. + +
Let byte1 be pointer / (10 × 126 × 10). + +
Set pointer to pointer % (10 × 126 × 10). + +
Let byte2 be pointer / (10 × 126). + +
Set pointer to pointer % (10 × 126). + +
Let byte3 be pointer / 10. + +
Let byte4 be pointer % 10. + +
Return four bytes whose values are byte1 + 0x81, + byte2 + 0x30, byte3 + 0x81, + byte4 + 0x30. +
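The final steps of the encoder above turn a pointer back into bytes. The following sketch shows both forms; the function names are hypothetical, the pointer is assumed to have been obtained from the relevant index already, and "/" is taken to mean integer division.

```javascript
// Two-byte form, from a pointer into index gb18030.
function gb18030TwoBytes(pointer) {
  const lead = Math.floor(pointer / 190) + 0x81;
  const trail = pointer % 190;
  const offset = trail < 0x3F ? 0x40 : 0x41; // skips the unused trail byte 0x7F
  return [lead, trail + offset];
}

// Four-byte form, from a pointer into index gb18030 ranges.
function gb18030FourBytes(pointer) {
  const byte1 = Math.floor(pointer / (10 * 126 * 10));
  pointer = pointer % (10 * 126 * 10);
  const byte2 = Math.floor(pointer / (10 * 126));
  pointer = pointer % (10 * 126);
  const byte3 = Math.floor(pointer / 10);
  const byte4 = pointer % 10;
  return [byte1 + 0x81, byte2 + 0x30, byte3 + 0x81, byte4 + 0x30];
}
```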
Big5's decoder has an associated +Big5 lead (initially 0x00). + +Big5's decoder's handler, given ioQueue +and byte, runs these steps: + +
If byte is end-of-queue and Big5 lead + is not 0x00, set Big5 lead to 0x00 and return error. + +
If byte is end-of-queue and Big5 lead + is 0x00, return finished. + +
If Big5 lead is not 0x00, let lead be + Big5 lead, let pointer be null, set + Big5 lead to 0x00, and then: + +
Let offset be 0x40 if byte is less than 0x7F, otherwise 0x62. + + +
If byte is in the range 0x40 to 0x7E, inclusive, or + 0xA1 to 0xFE, inclusive, set pointer to + (lead − 0x81) × 157 + (byte − offset). + +
If there is a row in the table below whose first column is + pointer, return the two code points listed in + its second column (the third column is irrelevant): + +
Pointer | Code points | Notes |
---|---|---|
1133 | U+00CA U+0304 | Ê̄ (LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND MACRON) |
1135 | U+00CA U+030C | Ê̌ (LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND CARON) |
1164 | U+00EA U+0304 | ê̄ (LATIN SMALL LETTER E WITH CIRCUMFLEX AND MACRON) |
1166 | U+00EA U+030C | ê̌ (LATIN SMALL LETTER E WITH CIRCUMFLEX AND CARON) |
Since indexes are limited to + single code points this table is used for these pointers. + +
Let code point be null if pointer is null, otherwise the + index code point for pointer in index Big5. + +
If code point is non-null, return a code point whose value is + code point. + +
If byte is an ASCII byte, restore byte to + ioQueue. + +
Return error. +
If byte is an ASCII byte, return + a code point whose value is byte. + +
If byte is in the range 0x81 to 0xFE, inclusive, set + Big5 lead to byte and return continue. + +
Return error. +
Big5's encoder's handler, given ioQueue +and code point, runs these steps: + +
If code point is end-of-queue, return + finished. + +
If code point is an ASCII code point, return + a byte whose value is code point. + +
Let pointer be the index Big5 pointer for + code point. + +
If pointer is null, return error with + code point. + +
Let lead be pointer / 157 + 0x81. + +
Let trail be pointer % 157. + +
Let offset be 0x40 if trail is less than 0x3F, + otherwise 0x62. + +
Return two bytes whose values are lead and + trail + offset. +
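The Big5 decoder and encoder above are inverses of each other as far as pointer arithmetic is concerned. The sketch below shows both directions together with the decoder's table of two-code-point results. The names `big5Pairs`, `big5Pointer`, and `big5Bytes` are hypothetical, the lookup in index Big5 is omitted, and "/" is taken to mean integer division.

```javascript
// Pointers that decode to two code points, per the decoder's table.
const big5Pairs = new Map([
  [1133, [0x00CA, 0x0304]],
  [1135, [0x00CA, 0x030C]],
  [1164, [0x00EA, 0x0304]],
  [1166, [0x00EA, 0x030C]],
]);

// Decoder direction: lead and trail byte to pointer, or null if the
// trail byte is outside 0x40 to 0x7E and 0xA1 to 0xFE.
function big5Pointer(lead, byte) {
  const inRange = (byte >= 0x40 && byte <= 0x7E) || (byte >= 0xA1 && byte <= 0xFE);
  if (!inRange) return null;
  const offset = byte < 0x7F ? 0x40 : 0x62;
  return (lead - 0x81) * 157 + (byte - offset);
}

// Encoder direction: pointer to lead and trail byte.
function big5Bytes(pointer) {
  const lead = Math.floor(pointer / 157) + 0x81;
  const trail = pointer % 157;
  const offset = trail < 0x3F ? 0x40 : 0x62;
  return [lead, trail + offset];
}
```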
EUC-JP's decoder has an associated +EUC-JP jis0212 (initially false) and +EUC-JP lead (initially 0x00). + +
EUC-JP's decoder's handler, given +ioQueue and byte, runs these steps: + +
If byte is end-of-queue and + EUC-JP lead is not 0x00, set EUC-JP lead to 0x00, and return + error. + +
If byte is end-of-queue and + EUC-JP lead is 0x00, return finished. + +
If EUC-JP lead is 0x8E and byte is + in the range 0xA1 to 0xDF, inclusive, set EUC-JP lead to 0x00 and return + a code point whose value is 0xFF61 − 0xA1 + byte. + + +
If EUC-JP lead is 0x8F and byte is in the range + 0xA1 to 0xFE, inclusive, set EUC-JP jis0212 to true, set + EUC-JP lead to byte, and return continue. + +
If EUC-JP lead is not 0x00, let lead be EUC-JP lead, set + EUC-JP lead to 0x00, and then: + +
Let code point be null. + +
If lead and byte are both in the range 0xA1 to 0xFE, inclusive, then + set code point to the index code point for + (lead − 0xA1) × 94 + byte − 0xA1 + in index jis0208 if EUC-JP jis0212 is false and in + index jis0212 otherwise. + +
Set EUC-JP jis0212 to false. + +
If code point is non-null, return a code point whose value is + code point. + +
If byte is an ASCII byte, restore byte to + ioQueue. + +
Return error. +
If byte is an ASCII byte, return + a code point whose value is byte. + +
If byte is 0x8E, 0x8F, or in the range 0xA1 to + 0xFE, inclusive, set EUC-JP lead to byte and return + continue. + +
Return error. +
EUC-JP's encoder's handler, given +ioQueue and code point, runs these steps: + +
If code point is end-of-queue, return + finished. + +
If code point is an ASCII code point, return + a byte whose value is code point. + +
If code point is U+00A5, return byte 0x5C. + +
If code point is U+203E, return byte 0x7E. + +
If code point is in the range U+FF61 to U+FF9F, inclusive, return + two bytes whose values are 0x8E and code point − 0xFF61 + 0xA1. + +
If code point is U+2212, set it to U+FF0D. + +
Let pointer be the index pointer for code point in + index jis0208. + +
If pointer is non-null, it is less than 8836 due to the nature of + index jis0208 and the index pointer operation. + +
If pointer is null, return error with + code point. + +
Let lead be pointer / 94 + 0xA1. + +
Let trail be pointer % 94 + 0xA1. + +
Return two bytes whose values are lead and + trail. +
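As a rough illustration of the encoder steps above, the following Python sketch encodes a string one code point at a time. The name `euc_jp_encode` is hypothetical, the index pointer operation is assumed to be a dictionary from code point to pointer, and "return error with code point" is simplified to raising `ValueError`.

```python
def euc_jp_encode(text: str, jis0208_pointers: dict) -> bytes:
    """Sketch of the EUC-JP encoder handler applied to each code point."""
    out = bytearray()
    for ch in text:
        code_point = ord(ch)
        if code_point <= 0x7F:            # ASCII code point
            out.append(code_point)
            continue
        if code_point == 0x00A5:          # YEN SIGN
            out.append(0x5C)
            continue
        if code_point == 0x203E:          # OVERLINE
            out.append(0x7E)
            continue
        if 0xFF61 <= code_point <= 0xFF9F:  # halfwidth katakana
            out += bytes([0x8E, code_point - 0xFF61 + 0xA1])
            continue
        if code_point == 0x2212:          # MINUS SIGN maps to FULLWIDTH HYPHEN-MINUS
            code_point = 0xFF0D
        pointer = jis0208_pointers.get(code_point)
        if pointer is None:
            # Simplification: the standard returns error with code point.
            raise ValueError(f"U+{code_point:04X} is not encodable")
        lead = pointer // 94 + 0xA1
        trail = pointer % 94 + 0xA1
        out += bytes([lead, trail])
    return bytes(out)
```

For pointer 283 the arithmetic gives lead 0xA4 and trail 0xA2, the inverse of the decoder's `(lead − 0xA1) × 94 + byte − 0xA1`.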
ISO-2022-JP's decoder has an associated +ISO-2022-JP decoder state (initially +ASCII), +ISO-2022-JP decoder output state (initially +ASCII), +ISO-2022-JP lead (initially 0x00), and +ISO-2022-JP output (initially false). + +
ISO-2022-JP's decoder's handler, given +ioQueue and byte, runs these steps, switching on +ISO-2022-JP decoder state: + +
Based on byte: + +
Set ISO-2022-JP decoder state to + escape start and return + continue. + +
Set ISO-2022-JP output to false and return a code point whose + value is byte. + +
Return finished. + +
Set ISO-2022-JP output to false and return error. +
Based on byte: + +
Set ISO-2022-JP decoder state to + escape start and return + continue. + +
Set ISO-2022-JP output to false and return code point U+00A5. + +
Set ISO-2022-JP output to false and return code point U+203E. + +
Set ISO-2022-JP output to false and return a code point whose + value is byte. + +
Return finished. + +
Set ISO-2022-JP output to false and return error. +
Based on byte: +
Set ISO-2022-JP decoder state to + escape start and return + continue. + +
Set ISO-2022-JP output to false and return a code point whose + value is 0xFF61 − 0x21 + byte. + + +
Return finished. + +
Set ISO-2022-JP output to false and return error. +
Based on byte: +
Set ISO-2022-JP decoder state to + escape start and return + continue. + +
Set ISO-2022-JP output to false, + ISO-2022-JP lead to byte, + ISO-2022-JP decoder state to + trail byte, and return + continue. + +
Return finished. + +
Set ISO-2022-JP output to false and return error. +
Based on byte: +
Set ISO-2022-JP decoder state to + escape start and return + error. + + +
Set the ISO-2022-JP decoder state to + lead byte. + +
Let pointer be + (ISO-2022-JP lead − 0x21) × 94 + byte − 0x21. + +
Let code point be the index code point for + pointer in index jis0208. + +
If code point is null, return error. + +
Return a code point whose value is code point. +
Set the ISO-2022-JP decoder state to + lead byte and return error. + +
Set ISO-2022-JP decoder state to + lead byte and return + error. + +
If byte is either 0x24 or 0x28, set + ISO-2022-JP lead to byte, + ISO-2022-JP decoder state to + escape, and return + continue. + +
If byte is not end-of-queue, then restore + byte to ioQueue. + +
Set ISO-2022-JP output to false, + ISO-2022-JP decoder state to + ISO-2022-JP decoder output state, and return error. +
Let lead be ISO-2022-JP lead and set + ISO-2022-JP lead to 0x00. + +
Let state be null. + +
If lead is 0x28 and byte is 0x42, set + state to ASCII. + +
If lead is 0x28 and byte is 0x4A, set + state to Roman. + +
If lead is 0x28 and byte is 0x49, set + state to katakana. + +
If lead is 0x24 and byte is either + 0x40 or 0x42, set state to + lead byte. + +
If state is non-null, then: + +
Set ISO-2022-JP decoder state and + ISO-2022-JP decoder output state to state. + +
Let output be the value of ISO-2022-JP output. + +
Set ISO-2022-JP output to true. + +
Return continue, if output is false, and + error otherwise. +
If byte is end-of-queue, then restore lead to + ioQueue; otherwise, restore « lead, byte » to + ioQueue. + +
Set ISO-2022-JP output to false, + ISO-2022-JP decoder state to ISO-2022-JP decoder output state + and return error. +

The ISO-2022-JP encoder is the only encoder for which the concatenation of + multiple outputs can result in an error when run through the corresponding + decoder. + +
Encoding U+00A5 gives 0x1B 0x28 0x4A 0x5C + 0x1B 0x28 0x42. Doing that twice, concatenating the results, and then decoding yields U+00A5 U+FFFD + U+00A5. +
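The concatenation example above can be reproduced with a deliberately reduced decoder. The Python sketch below is an assumption-laden subset: it models only the ASCII, Roman, escape start, and escape states (the katakana and jis0208 escape sequences are treated as unrecognized), the name `iso2022jp_decode_subset` is hypothetical, and errors are assumed to surface as U+FFFD.

```python
from collections import deque


def iso2022jp_decode_subset(data: bytes) -> str:
    """ASCII/Roman-only sketch of the ISO-2022-JP decoder handler."""
    queue = deque(data)
    state = output_state = "ascii"
    esc_lead = 0x00          # ISO-2022-JP lead
    output = False           # ISO-2022-JP output
    out = []
    while True:
        byte = queue.popleft() if queue else None  # None is end-of-queue
        if state in ("ascii", "roman"):
            if byte is None:
                return "".join(out)
            if byte == 0x1B:
                state = "escape_start"
                continue
            output = False
            if byte > 0x7F or byte in (0x0E, 0x0F):
                out.append("\ufffd")
            elif state == "roman" and byte == 0x5C:
                out.append("\u00a5")
            elif state == "roman" and byte == 0x7E:
                out.append("\u203e")
            else:
                out.append(chr(byte))
        elif state == "escape_start":
            if byte in (0x24, 0x28):
                esc_lead, state = byte, "escape"
                continue
            if byte is not None:
                queue.appendleft(byte)       # restore byte
            output, state = False, output_state
            out.append("\ufffd")
        else:  # escape
            lead, esc_lead = esc_lead, 0x00
            new_state = None
            if lead == 0x28 and byte == 0x42:
                new_state = "ascii"
            elif lead == 0x28 and byte == 0x4A:
                new_state = "roman"
            if new_state is not None:
                state = output_state = new_state
                previous_output, output = output, True
                if previous_output:
                    # Two escape sequences in a row with no output between them.
                    out.append("\ufffd")
                continue
            if byte is not None:
                queue.appendleft(byte)
            queue.appendleft(lead)           # restore lead, then byte
            output, state = False, output_state
            out.append("\ufffd")


once = bytes([0x1B, 0x28, 0x4A, 0x5C, 0x1B, 0x28, 0x42])
print(iso2022jp_decode_subset(once + once) == "\u00a5\ufffd\u00a5")  # True
```

The U+FFFD appears because the first output ends with an escape sequence and the second begins with one, so ISO-2022-JP output is still true when the second escape sequence is processed.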
ISO-2022-JP's encoder has an associated +ISO-2022-JP encoder state which is ASCII, +Roman, or +jis0208 (initially +ASCII). + +
ISO-2022-JP's encoder's handler, given +ioQueue and code point, runs these steps: + +
If code point is end-of-queue and + ISO-2022-JP encoder state is not + ASCII, set + ISO-2022-JP encoder state to + ASCII, and return three bytes + 0x1B 0x28 0x42. + +
If code point is end-of-queue and + ISO-2022-JP encoder state is + ASCII, return finished. + +
If ISO-2022-JP encoder state is + ASCII or + Roman, and code point is U+000E, U+000F, + or U+001B, return error with U+FFFD. + +
This returns U+FFFD rather than code point to prevent attacks. + + +
If ISO-2022-JP encoder state is + ASCII and code point is an + ASCII code point, return a byte whose value is code point. + +
If ISO-2022-JP encoder state is Roman and + code point is an ASCII code point, excluding U+005C and U+007E, or is U+00A5 or + U+203E, then: + +
If code point is an ASCII code point, return a byte + whose value is code point. + +
If code point is U+00A5, return byte 0x5C. + +
If code point is U+203E, return byte 0x7E. +
If code point is an ASCII code point, and + ISO-2022-JP encoder state is not + ASCII, + restore code point to + ioQueue, set ISO-2022-JP encoder state to + ASCII, and return three bytes + 0x1B 0x28 0x42. + +
If code point is either U+00A5 or U+203E, and + ISO-2022-JP encoder state is not + Roman, + restore code point to + ioQueue, set ISO-2022-JP encoder state to + Roman, and return three bytes + 0x1B 0x28 0x4A. + +
If code point is U+2212, set it to U+FF0D. + +
If code point is in the range U+FF61 to U+FF9F, inclusive, set it to the + index code point for code point − 0xFF61 in + index ISO-2022-JP katakana. + +
Let pointer be the index pointer for code point in + index jis0208. + +
If pointer is non-null, it is less than 8836 due to the nature of + index jis0208 and the index pointer operation. + +
If pointer is null, then: + +
If ISO-2022-JP encoder state is jis0208, + then restore code point to ioQueue, set + ISO-2022-JP encoder state to ASCII, and return three + bytes 0x1B 0x28 0x42. + +
Return error with code point. +
If ISO-2022-JP encoder state is not + jis0208, + restore code point to + ioQueue, set ISO-2022-JP encoder state to + jis0208, and return three bytes + 0x1B 0x24 0x42. + +
Let lead be pointer / 94 + 0x21. + +
Let trail be pointer % 94 + 0x21. + +
Return two bytes whose values are lead and + trail. +
Shift_JIS's decoder has an associated +Shift_JIS lead (initially 0x00). + +
Shift_JIS's decoder's handler, given +ioQueue and byte, runs these steps: + +
If byte is end-of-queue and + Shift_JIS lead is not 0x00, set Shift_JIS lead to 0x00 and + return error. + +
If byte is end-of-queue and + Shift_JIS lead is 0x00, return finished. + +
If Shift_JIS lead is not 0x00, let lead be Shift_JIS lead, let + pointer be null, set Shift_JIS lead to 0x00, and then: + +
Let offset be 0x40 if byte is less than 0x7F, otherwise 0x41. + +
Let lead offset be 0x81 if lead is less than 0xA0, otherwise 0xC1. + +
If byte is in the range 0x40 to 0x7E, inclusive, or + 0x80 to 0xFC, inclusive, set pointer to + (lead − lead offset) × 188 + byte − offset. + +
If pointer is in the range 8836 to 10715, inclusive, return a code point whose + value is 0xE000 − 8836 + pointer. + + +
This is interoperable legacy from Windows known as EUDC. + + +
Let code point be null if pointer is null, otherwise the + index code point for pointer in index jis0208. + +
If code point is non-null, return a code point whose value is + code point. + +
If byte is an ASCII byte, restore byte to + ioQueue. + +
Return error. +
If byte is an ASCII byte or 0x80, return a code point + whose value is byte. + + +
If byte is in the range 0xA1 to 0xDF, inclusive, return + a code point whose value is 0xFF61 − 0xA1 + byte. + + +
If byte is in the range 0x81 to 0x9F, inclusive, or 0xE0 to 0xFC, + inclusive, set Shift_JIS lead to byte and return + continue. + +
Return error. +
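The pointer arithmetic in the two-byte branch above can be checked with a short Python sketch. The helper names `shift_jis_pointer` and `eudc_code_point` are hypothetical, and the lookup in index jis0208 is left out so that the block stays self-contained.

```python
from typing import Optional


def shift_jis_pointer(lead: int, byte: int) -> Optional[int]:
    """Compute the pointer for a Shift_JIS lead/trail pair, or None."""
    offset = 0x40 if byte < 0x7F else 0x41
    lead_offset = 0x81 if lead < 0xA0 else 0xC1
    if 0x40 <= byte <= 0x7E or 0x80 <= byte <= 0xFC:
        return (lead - lead_offset) * 188 + byte - offset
    return None  # trail byte out of range; the decoder returns error


def eudc_code_point(pointer: Optional[int]) -> Optional[int]:
    """Map pointers 8836 to 10715 onto the Private Use Area, else None."""
    if pointer is not None and 8836 <= pointer <= 10715:
        return 0xE000 - 8836 + pointer
    return None


print(shift_jis_pointer(0x82, 0xA0))                        # 283
print(hex(eudc_code_point(shift_jis_pointer(0xF0, 0x40))))  # 0xe000
```

The two offsets exist because 0x7F is skipped in the trail range and 0xA0 to 0xDF is skipped in the lead range, so each row holds 188 trail values.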
Shift_JIS's encoder's handler, given +ioQueue and code point, runs these steps: + +
If code point is end-of-queue, return + finished. + +
If code point is an ASCII code point or U+0080, return + a byte whose value is code point. + +
If code point is U+00A5, return byte 0x5C. + +
If code point is U+203E, return byte 0x7E. + +
If code point is in the range U+FF61 to U+FF9F, inclusive, return + a byte whose value is code point − 0xFF61 + 0xA1. + +
If code point is U+2212, set it to U+FF0D. + +
Let pointer be the index Shift_JIS pointer for + code point. + +
If pointer is null, return error with + code point. + +
Let lead be pointer / 188. + +
Let lead offset be 0x81 if lead is less than 0x1F, otherwise 0xC1. + + +
Let trail be pointer % 188. + +
Let offset be 0x40 if trail is less than 0x3F, otherwise 0x41. + +
Return two bytes whose values are + lead + lead offset and + trail + offset. +
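The final steps of the encoder, which turn a pointer into two bytes, can be sketched as follows. The function name `shift_jis_bytes` is hypothetical, and the pointer is assumed to have already been obtained from the index Shift_JIS pointer operation.

```python
def shift_jis_bytes(pointer: int) -> bytes:
    """Turn a non-null Shift_JIS pointer into its lead and trail bytes."""
    lead, trail = divmod(pointer, 188)
    lead_offset = 0x81 if lead < 0x1F else 0xC1   # skip 0xA0 to 0xDF
    offset = 0x40 if trail < 0x3F else 0x41       # skip 0x7F
    return bytes([lead + lead_offset, trail + offset])


print(shift_jis_bytes(283).hex())  # 82a0
```

Pointer 283 splits into lead 1 and trail 95, which become 0x82 and 0xA0 after the offsets are applied.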
EUC-KR's decoder has an associated +EUC-KR lead (initially 0x00). + +
EUC-KR's decoder's handler, given +ioQueue and byte, runs these steps: + +
If byte is end-of-queue and + EUC-KR lead is not 0x00, set EUC-KR lead to 0x00 + and return error. + +
If byte is end-of-queue and + EUC-KR lead is 0x00, return finished. + +
If EUC-KR lead is not 0x00, let lead be EUC-KR lead, let + pointer be null, set EUC-KR lead to 0x00, and then: + +
If byte is in the range 0x41 to 0xFE, inclusive, set + pointer to + (lead − 0x81) × 190 + (byte − 0x41). + +
Let code point be null if pointer is null, otherwise the + index code point for pointer in index EUC-KR. + +
If code point is non-null, return a code point whose value is + code point. + +
If byte is an ASCII byte, restore byte to + ioQueue. + +
Return error. +
If byte is an ASCII byte, return + a code point whose value is byte. + +
If byte is in the range 0x81 to 0xFE, inclusive, set + EUC-KR lead to byte and return continue. + +
Return error. +
EUC-KR's encoder's handler, given +ioQueue and code point, runs these steps: + +
If code point is end-of-queue, return + finished. + +
If code point is an ASCII code point, return + a byte whose value is code point. + +
Let pointer be the index pointer for + code point in index EUC-KR. + +
If pointer is null, return error with + code point. + +
Let lead be pointer / 190 + 0x81. + +
Let trail be pointer % 190 + 0x41. + +
Return two bytes whose values are lead and trail. +
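The EUC-KR decoder and encoder use inverse pointer arithmetic, which the Python sketch below makes explicit. Both helper names are hypothetical, and the lookups in index EUC-KR are omitted; only the byte-to-pointer and pointer-to-byte steps are shown.

```python
from typing import Optional


def euc_kr_pointer(lead: int, byte: int) -> Optional[int]:
    """Decoder side: pointer for a lead/trail pair, or None if out of range."""
    if 0x41 <= byte <= 0xFE:
        return (lead - 0x81) * 190 + (byte - 0x41)
    return None


def euc_kr_bytes(pointer: int) -> bytes:
    """Encoder side: lead and trail bytes for a non-null pointer."""
    lead = pointer // 190 + 0x81
    trail = pointer % 190 + 0x41
    return bytes([lead, trail])


pointer = euc_kr_pointer(0xB0, 0xA1)
print(pointer)                      # 9026
print(euc_kr_bytes(pointer).hex())  # b0a1
```

Each lead byte covers 190 trail values (0x41 to 0xFE), which is why both directions use 190 as the row width.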
The replacement encoding exists to prevent certain +attacks that abuse a mismatch between encodings supported on +the server and the client. + + +
replacement's decoder has an associated +replacement error returned (initially false). + +
replacement's decoder's handler, given +ioQueue and byte, runs these steps: + +
If byte is end-of-queue, return finished. + +
If replacement error returned is false, set + replacement error returned to true and return error. + +
Return finished. +
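The effect of these steps is that any non-empty input decodes to a single error. A minimal Python sketch, assuming errors are reported as U+FFFD and using the hypothetical name `replacement_decode`:

```python
def replacement_decode(data: bytes) -> str:
    """Sketch of the replacement decoder: at most one U+FFFD is produced."""
    error_returned = False   # replacement error returned
    out = []
    for _byte in data:
        if not error_returned:
            error_returned = True
            out.append("\ufffd")  # return error
            continue
        break                     # return finished; remaining bytes are ignored
    return "".join(out)


print(replacement_decode(b"") == "")                   # True
print(replacement_decode(b"any bytes") == "\ufffd")    # True
```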
UTF-16BE/LE is UTF-16BE or UTF-16LE. + + +
A byte order mark has priority over a label as it has been found to be more accurate +in deployed content. Therefore it is not part of the shared UTF-16 decoder algorithm, but +rather the decode algorithm. + +
shared UTF-16 decoder has an associated UTF-16 lead byte and +UTF-16 leading surrogate (both initially null), and +is UTF-16BE decoder (initially false). + +
shared UTF-16 decoder's handler, given ioQueue and +byte, runs these steps: + +
If byte is end-of-queue and either + UTF-16 lead byte or UTF-16 leading surrogate is non-null, set + UTF-16 lead byte and UTF-16 leading surrogate to null, and return + error. + +
If byte is end-of-queue and + UTF-16 lead byte and UTF-16 leading surrogate are null, return + finished. + +
If UTF-16 lead byte is null, set UTF-16 lead byte to + byte and return continue. + +
Let code unit be the result of: + +
If is UTF-16BE decoder is true: (UTF-16 lead byte << 8) + byte. +
Otherwise: (byte << 8) + UTF-16 lead byte. +
Then set UTF-16 lead byte to null. + +
If UTF-16 leading surrogate is non-null: + +
Let leadingSurrogate be UTF-16 leading surrogate. + +
Set UTF-16 leading surrogate to null. + +
If code unit is a trailing surrogate, then return a + scalar value from surrogates given leadingSurrogate and code unit. + +
Let byte1 be code unit >> 8. + +
Let byte2 be code unit & 0x00FF. + +
Let bytes be a list of two bytes whose values are byte1 + and byte2, if is UTF-16BE decoder is true; otherwise byte2 and + byte1. + +
If code unit is a leading surrogate, then set + UTF-16 leading surrogate to code unit and return continue. + +
If code unit is a trailing surrogate, then return error. + +
Return code point code unit. +
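The shared UTF-16 decoder can be sketched in Python as shown below. This is an illustrative reading of the steps above rather than a reference implementation: the function name `utf16_decode` and its `big_endian` parameter (standing in for is UTF-16BE decoder) are hypothetical, errors are assumed to be emitted as U+FFFD, and byte order mark handling is left out because it belongs to the decode algorithm.

```python
from collections import deque


def utf16_decode(data: bytes, big_endian: bool = False) -> str:
    """Sketch of the shared UTF-16 decoder handler over a whole input."""
    queue = deque(data)
    lead_byte = None          # UTF-16 lead byte
    lead_surrogate = None     # UTF-16 leading surrogate
    out = []
    while True:
        if not queue:         # end-of-queue
            if lead_byte is not None or lead_surrogate is not None:
                out.append("\ufffd")
            return "".join(out)
        byte = queue.popleft()
        if lead_byte is None:
            lead_byte = byte
            continue
        if big_endian:
            code_unit = (lead_byte << 8) + byte
        else:
            code_unit = (byte << 8) + lead_byte
        lead_byte = None
        if lead_surrogate is not None:
            leading, lead_surrogate = lead_surrogate, None
            if 0xDC00 <= code_unit <= 0xDFFF:
                # Scalar value from surrogates.
                scalar = 0x10000 + ((leading - 0xD800) << 10) + (code_unit - 0xDC00)
                out.append(chr(scalar))
                continue
            byte1, byte2 = code_unit >> 8, code_unit & 0x00FF
            pair = (byte1, byte2) if big_endian else (byte2, byte1)
            queue.extendleft(reversed(pair))  # restore both bytes in order
            out.append("\ufffd")
            continue
        if 0xD800 <= code_unit <= 0xDBFF:
            lead_surrogate = code_unit
            continue
        if 0xDC00 <= code_unit <= 0xDFFF:
            out.append("\ufffd")              # unpaired trailing surrogate
            continue
        out.append(chr(code_unit))
```

For example, the little-endian bytes 0x3D 0xD8 0x41 0x00 contain a leading surrogate followed by "A". The sketch emits U+FFFD for the unpaired surrogate, restores the two bytes of "A", and then decodes them normally.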
UTF-16BE's decoder is shared UTF-16 decoder with +its is UTF-16BE decoder set to true. + + +
"utf-16" is a label for UTF-16LE to deal with +deployed content. + + +
UTF-16LE's decoder is shared UTF-16 decoder. + + +
While technically this is a single-byte encoding, +it is defined separately as it can be implemented algorithmically. + + + +
x-user-defined's decoder's handler, given +ioQueue and byte, runs these steps: + +
If byte is end-of-queue, return + finished. + +
If byte is an ASCII byte, return + a code point whose value is byte. + +
Return a code point whose value is 0xF780 + byte − 0x80. +
x-user-defined's encoder's handler, given +ioQueue and code point, runs these steps: + +
If code point is end-of-queue, return + finished. + +
If code point is an ASCII code point, return + a byte whose value is code point. + +
If code point is in the range U+F780 to U+F7FF, inclusive, return + a byte whose value is code point − 0xF780 + 0x80. + +
Return error with code point. +
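Because x-user-defined is purely algorithmic, both handlers fit in a few lines. The Python sketch below uses hypothetical function names and simplifies "return error with code point" to raising `ValueError`.

```python
def x_user_defined_decode(data: bytes) -> str:
    """Map ASCII bytes to themselves and 0x80..0xFF to U+F780..U+F7FF."""
    out = []
    for byte in data:
        if byte <= 0x7F:
            out.append(chr(byte))
        else:
            out.append(chr(0xF780 + byte - 0x80))
    return "".join(out)


def x_user_defined_encode(text: str) -> bytes:
    """Inverse mapping; code points outside both ranges are errors."""
    out = bytearray()
    for ch in text:
        code_point = ord(ch)
        if code_point <= 0x7F:
            out.append(code_point)
        elif 0xF780 <= code_point <= 0xF7FF:
            out.append(code_point - 0xF780 + 0x80)
        else:
            # Simplification: the standard returns error with code point.
            raise ValueError(f"U+{code_point:04X} is not encodable")
    return bytes(out)


every_byte = bytes(range(256))
print(x_user_defined_encode(x_user_defined_decode(every_byte)) == every_byte)  # True
```

Every byte value has a mapping, so decoding never fails and any byte sequence survives a decode and encode round trip unchanged.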
Browsers are encouraged to not enable overriding the encoding of a resource. If such a feature is +nonetheless present, browsers should not offer UTF-16BE/LE as an option, due to the +aforementioned security issues. Browsers should also disable this feature if the resource was +decoded using UTF-16BE/LE. + + + +
Instead of supporting I/O queues with arbitrary restore, the +decoders for encodings in this standard could be implemented with: + +
The ability to unread the current byte. + +
A single-byte buffer for gb18030 (an ASCII byte) and ISO-2022-JP (0x24 or + 0x28). + +
For gb18030 when hitting a + bogus byte while gb18030 third is not 0x00, gb18030 second could be moved into the + single-byte buffer to be returned next, and gb18030 third would be the new + gb18030 first, checked for not being 0x00 after the single-byte buffer was returned and + emptied. This is possible as the range for the first and third byte in gb18030 is + identical. +
The ISO-2022-JP encoder needs ISO-2022-JP encoder state as additional state, but +other than that, none of the encoders for encodings in this standard +require additional state or buffers. + + + +
There have been a lot of people that have helped make encodings more +interoperable over the years and thereby furthered the goals of this +standard. Likewise, many people have helped make this standard what it is +today. + +
With that, many thanks to +Adam Rice, +Alan Chaney, +Alexander Shtuchkin, +Allen Wirfs-Brock, +Andreu Botella, +Aneesh Agrawal, +Arkadiusz Michalski, +Asmus Freytag, +Ben Noordhuis, +Bnaya Peretz, +Boris Zbarsky, +Bruno Haible, +Cameron McCormack, +Charles McCathieNeville, +Christopher Foo, +CodifierNL, +David Carlisle, +Domenic Denicola, +Dominique Hazaël-Massieux, +Doug Ewell, +Erik van der Poel, +譚永鋒 (Frank Yung-Fong Tang), +Glenn Maynard, +Gordon P. Hemsley, +Henri Sivonen, +Ian Hickson, +J. King, +James Graham, +Jeffrey Yasskin, +John Tamplin, +Joshua Bell, +村井純 (Jun Murai), +신정식 (Jungshik Shin), +Jxck, +강 성훈 (Kang Seonghoon), +川幡太一 (Kawabata Taichi), +Ken Lunde, +Ken Whistler, +Kenneth Russell, +田村健人 (Kent Tamura), +Leif Halvard Silli, +Luke Wagner, +Maciej Hirsz, +Makoto Kato, +Mark Callow, +Mark Crispin, +Mark Davis, +Martin Dürst, +Masatoshi Kimura, +Mattias Buelens, +Ms2ger, +Nigel Megitt, +Nigel Tao, +Norbert Lindenberg, +Øistein E. Andersen, +Peter Krefting, +Philip Jägenstedt, +Philip Taylor, +Richard Ishida, +Robbert Broersma, +Robert Mustacchi, +Ryan Dahl, +Sam Sneddon, +Shawn Steele, +Simon Montagu, +Simon Pieters, +Simon Sapin, +Stephen Checkoway, +寺田健 (Takeshi Terada), +Vyacheslav Matva, +Wolf Lammen, and +成瀬ゆい (Yui Naruse) +for being awesome. + +
This standard is written by Anne van Kesteren +(Apple, annevk@annevk.nl). +The API chapter was initially written by Joshua Bell +(Google).