core: Add support of non-UTF-8 encodings to the form loader #13346
The returned data should be accompanied by an HTTP response header that specifies the encoding. Do we know if Flash uses that header?
@kmeisthax The response header does not always include a charset. The example through which I found this bug was a .txt file that was loaded with the form loader. The file is encoded in Latin-1 but decoded (as everything currently is) as UTF-8. @adrian17 mentioned in the Discord server that this is likely not how Flash itself did it. Latin-1 was more widely used 10-20 years ago, so there might just have been a test to differentiate between Latin-1 and UTF-8 (or Flash might also have used a given charset if one existed). Flash might have decoded texts in other, little-used encodings wrongly.
My hunch is that Flash Player decodes non-UTF-8 text according to the system codepage. I tried converting the txt file in Ruffle's loadvariables test to Shift-JIS (and adding some Japanese characters), and Flash Player seems to try to interpret the text as Windows-1252 (which produces garbage output). I don't think smarter behavior is a bad thing though! We just need to make sure it's reliable.
The system codepage is basically guaranteed to create mojibake, and we don't have access to it in HTML, so I'm perfectly fine with sniffing the encoding in that case. Especially if we're using Firefox's sniffer, which should be well tested. In the case where the form response does have valid encoding information, is it possible to use that?
I looked into this for a while and tried to find out exactly how Flash itself determines the encoding. To do this, I created a test SWF file which loads several text files with different encodings and different HTTP headers. Surprisingly, when I tested it, only the UTF-8 file was decoded correctly (both with the Flash plugin and the desktop version). The HTTP header didn't change anything in my tests either (CharISO means that the HTTP header includes ISO Latin-1 as charset). As the ISO Latin-1 text file loaded by the SWF I found is correctly decoded by Flash, I'll assume that there is some other factor which determines the encoding Flash uses. @kmeisthax @n0samu Can you test this as well, since this might behave differently on different platforms? Independently of what Flash does, I think it might make sense to use this PR's approach and correctly determine the charset. I'll look into using the HTTP header if one exists instead of
This is the result I got opening the SWF via its URL in the Flash projector on Windows 10. Looks like the same result as yours, hope it helps. Do you have an example of a game that this PR helps with, btw? Just wondering.
Thank you! Yeah, it looks like the same result, and I got the same with the plugin version as well. That's quite interesting; I wonder what's causing Flash to only decode the UTF-8 files correctly in this test (as Flash is correctly decoding ISO Latin-1 files in other SWFs). And yeah, I have a game; I originally noticed this issue in Paraplüsch. If you choose the German flag, the "Vorgeschichte Überspringen" (Skip Intro in German) turns into "Vorgeschichte berspringen" as soon as the text file is loaded whose content overwrites that text field. Text from that file is seen in other places in the game as well, where no letters with diacritics are displayed.
[Video attachment: Bildschirmaufnahme.2023-11-22.um.01.45.27.mov]
I also implemented kmeisthax's idea to use the HTTP header encoding and only
This game is SWFv5, which stored strings with the system encoding instead of UTF-8, so it makes sense that external strings loaded from SWFv5 would behave the same way. See also: #8390
58700d2 to 244261d
I've committed the change to use the response header. With the current commits, the encoding specified in the response header is used if one exists, otherwise
@n0samu Thank you for the tip! I tested it again with different SWF versions and can now confirm that the behaviour is the same in SWF versions 6, 8, 10 and 43. I haven't tested version 5 yet, as my test SWF uses some functions that only exist in SWF version 6 and upwards. Why do you say that the game is SWFv5 though? I downloaded
I think an approach like this could also be part of the solution for #8390. At least if we don't have access to the system codepage and/or it's not reliable for some reason, detecting the encoding through
@Korne127 Oh, I'd only looked at |
244261d to ff46935
Hey :) I updated this pull request according to my findings. I also added some tests to make sure that there won't be regressions that cause Ruffle to load files incorrectly again.
ff46935 to cda78ae
f3e1a9e to 544bc79
Flash's form loader loads text files in the local system codepage if System#useCodepage has been set to true. Previously, Ruffle always (wrongly) used UTF-8, leading to incorrectly displayed characters. This has been fixed. Ruffle now supports loading files with an encoding other than UTF-8. As Ruffle doesn't always have access to the system codepage and as it's not reliably the correct encoding, the crate chardetng has been added. It's used instead of the system codepage to detect the encoding, and the data is converted into UTF-8.
If System#useCodepage has been set to true, the form loader now uses the encoding specified in the HTTP response content type field, if present, to decode remote text files. chardetng is now used only if the HTTP response doesn't specify any encoding or if the file is local.
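To make the header step above concrete, here is a minimal sketch of pulling the `charset` parameter out of a Content-Type value using only the standard library. The function name `charset_from_content_type` is hypothetical and this is not Ruffle's actual code; it also assumes a simple header without `;` inside quoted parameter values.

```rust
/// Extracts the `charset` parameter from a Content-Type header value, if any.
/// Hypothetical helper for illustration only; not taken from Ruffle.
fn charset_from_content_type(value: &str) -> Option<String> {
    value
        // Parameters follow the media type and are separated by ';'.
        .split(';')
        .skip(1)
        // Keep only well-formed `name=value` parameters.
        .filter_map(|param| param.split_once('='))
        // Parameter names are case-insensitive.
        .find(|(name, _)| name.trim().eq_ignore_ascii_case("charset"))
        // Strip whitespace and optional quotes, normalise the label to lowercase.
        .map(|(_, raw)| raw.trim().trim_matches('"').to_ascii_lowercase())
        // An empty label is treated the same as a missing one.
        .filter(|label| !label.is_empty())
}
```

A caller would fall back to encoding detection whenever this returns `None`, which matches the case described above where the response doesn't specify any encoding.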
The form loader now loads files using Windows-1252 if the SWF version is lower than 6. This roughly matches Flash's behaviour (Flash uses Windows-1252 on Windows; on macOS, a slightly different custom encoding is used). Previously, UTF-8 was (wrongly) used for all SWF files if System#useCodepage hadn't been set to true, leading to incorrectly displayed characters.
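For readers unfamiliar with Windows-1252, the following is a rough, dependency-free sketch of what decoding it into a UTF-8 `String` involves. It is not the code from this PR; the table follows the WHATWG mapping for bytes 0x80 to 0x9F, and `decode_windows_1252` is a hypothetical name.

```rust
/// Code points for bytes 0x80..=0x9F in Windows-1252 (WHATWG mapping).
/// The five bytes without a printable assignment map to the matching C1 control.
const CP1252_HIGH: [u16; 32] = [
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
];

/// Decodes Windows-1252 bytes into a Rust (UTF-8) string.
/// Hypothetical helper for illustration only.
fn decode_windows_1252(bytes: &[u8]) -> String {
    bytes
        .iter()
        .map(|&b| match b {
            // This range is where Windows-1252 differs from ISO Latin-1.
            0x80..=0x9F => {
                let code_point = CP1252_HIGH[usize::from(b - 0x80)];
                char::from_u32(u32::from(code_point)).unwrap_or(char::REPLACEMENT_CHARACTER)
            }
            // All other bytes map directly to U+0000..=U+00FF.
            _ => char::from(b),
        })
        .collect()
}
```

With this, the Latin-1 byte 0xDC from the Paraplüsch text file decodes to "Ü" instead of being dropped.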
Several form loader encoding tests have been added. They test whether the form loader uses the correct encoding to decode text files with different SWF versions and settings. One test has been marked as known failure as it tests how Flash decodes invalid UTF-8 characters (when decoding as UTF-8), which is not yet implemented in Ruffle.
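Taken together, the commit messages above describe a decision order for picking the encoding. The sketch below spells that order out. The names `EncodingSource` and `choose_encoding_source` are hypothetical, and checking the SWF version before System#useCodepage is an assumption, since the commit messages don't state which condition takes precedence.

```rust
/// Where the form loader would take its text encoding from.
/// Hypothetical type for illustration only.
#[derive(Debug, PartialEq)]
enum EncodingSource {
    /// SWF version lower than 6.
    Windows1252,
    /// SWF version 6 or higher without System#useCodepage.
    Utf8,
    /// Charset label taken from the HTTP response header.
    HttpCharset(String),
    /// No usable header: fall back to detection (chardetng in this PR).
    Detected,
}

fn choose_encoding_source(
    swf_version: u8,
    use_codepage: bool,
    http_charset: Option<&str>,
    is_local: bool,
) -> EncodingSource {
    // Assumption: the version check comes first.
    if swf_version < 6 {
        return EncodingSource::Windows1252;
    }
    if !use_codepage {
        return EncodingSource::Utf8;
    }
    match http_charset {
        // The header is only consulted for remote files.
        Some(label) if !is_local => EncodingSource::HttpCharset(label.to_string()),
        _ => EncodingSource::Detected,
    }
}
```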
544bc79 to a086621
Edit: To read the current summary of this pull request, click here (or scroll down).
The form loader is currently decoding all files as UTF-8.
Therefore, files that aren't encoded as UTF-8 are decoded wrongly if they're loaded with the form loader, and the content isn't saved correctly.
In this case, non-ASCII characters like European letters with diacritics aren't decoded correctly and therefore cannot be displayed. For non-Latin scripts, the impact is much more severe as almost everything is decoded wrongly and therefore unreadable.
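The failure mode described above can be reproduced in a few lines. This sketch assumes a lossy UTF-8 decode, which may not be exactly what the form loader did before this change; `decode_as_utf8_lossy` is a hypothetical name.

```rust
/// Decodes bytes as UTF-8, replacing invalid sequences with U+FFFD.
/// Returns the text and the number of replacement characters it contains.
/// Hypothetical helper for illustration only.
fn decode_as_utf8_lossy(bytes: &[u8]) -> (String, usize) {
    let text = String::from_utf8_lossy(bytes).into_owned();
    let replaced = text
        .chars()
        .filter(|&c| c == char::REPLACEMENT_CHARACTER)
        .count();
    (text, replaced)
}
```

In Latin-1, "Ü" is the single byte 0xDC. Read as UTF-8, that byte starts a two-byte sequence that the following "b" cannot complete, so the letter is lost while the ASCII rest of the word survives.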
This pull request fixes this. It adds the crate chardetng, which is used to detect the encoding. If necessary, the form data is converted into UTF-8.