UnicodeReader misdetects UTF-32LE as UTF-16LE #471

Open · tayloj opened this issue Jul 19, 2017 · 0 comments
tayloj commented Jul 19, 2017

UnicodeReader can never actually detect the UTF-32LE encoding. Its constructor contains a long if/else if chain that examines the first few bytes (the byte-order mark) of an input stream. The blocks for detecting UTF-16LE and UTF-32LE are:

/* ... */
else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
  encoding = "UTF-16LE";
  unread = n - 2;
}
/* ...code for UTF-32BE ... */
else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)
    && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
  encoding = "UTF-32LE";
  unread = n - 4;
} else /* ... */

The condition for the UTF-32LE case:

(bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)
  && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)

can't be true unless the earlier UTF-16LE condition was also true:

(bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)

Since the UTF-16LE branch is tested first, it always matches first, and the UTF-32LE branch is unreachable. A stream that begins with the UTF-32LE BOM (FF FE 00 00) is therefore detected as UTF-16LE, with only the first two BOM bytes consumed.
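A minimal sketch of one possible fix is to test the longer four-byte BOMs before the two-byte ones, so the more specific UTF-32LE pattern wins. This reuses the bom, n, encoding, and unread names from the snippet above, and elides the other branches the same way the original does:

/* ... code for UTF-32BE ... */
else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)
    && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
  // The four-byte UTF-32LE BOM (FF FE 00 00) must be checked before
  // the two-byte UTF-16LE BOM (FF FE) that it starts with.
  encoding = "UTF-32LE";
  unread = n - 4;
} else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
  // Reached only when bytes 2-3 are not 00 00.
  encoding = "UTF-16LE";
  unread = n - 2;
} else /* ... */

Note that BOM sniffing remains ambiguous even with this ordering: a UTF-16LE stream whose first character after the BOM is U+0000 also begins with FF FE 00 00. Preferring the longer match is nonetheless the conventional way detectors resolve it.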
