UnicodeReader misdetects UTF-32LE as UTF-16LE #471

Open · tayloj opened this issue Jul 19, 2017 · 0 comments
tayloj commented Jul 19, 2017

UnicodeReader can never actually detect the UTF-32LE encoding. Its constructor contains a long if/else if chain that examines the first few bytes (the byte-order mark) of an input stream. The blocks for detecting UTF-16LE and UTF-32LE are:

/* ... */
else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
  encoding = "UTF-16LE";
  unread = n - 2;
}
/* ...code for UTF-32BE ... */
else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)
    && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
  encoding = "UTF-32LE";
  unread = n - 4;
} else /* ... */

The condition for the UTF-32LE case:

(bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)
  && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)

can't be true unless the earlier UTF-16LE condition was also true:

(bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)

Since the UTF-16LE branch is tested first, it always matches first, and the UTF-32LE branch is unreachable. A stream that begins with the UTF-32LE BOM (FF FE 00 00) is therefore detected as UTF-16LE, with only the first two BOM bytes consumed.
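A minimal sketch of one possible fix is to test the longer four-byte BOMs before the two-byte ones, so the more specific UTF-32LE pattern wins. This reuses the bom, n, encoding, and unread names from the snippet above, and elides the other branches the same way the original does:

/* ... code for UTF-32BE ... */
else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)
    && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
  // The four-byte UTF-32LE BOM (FF FE 00 00) must be checked before
  // the two-byte UTF-16LE BOM (FF FE) that it starts with.
  encoding = "UTF-32LE";
  unread = n - 4;
} else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
  // Reached only when bytes 2-3 are not 00 00.
  encoding = "UTF-16LE";
  unread = n - 2;
} else /* ... */

Note that BOM sniffing remains ambiguous even with this ordering: a UTF-16LE stream whose first character after the BOM is U+0000 also begins with FF FE 00 00. Preferring the longer match is nonetheless the conventional way detectors resolve it.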
