core: Add support of non-UTF-8 encodings to the form loader #13346

Merged
merged 4 commits into from
Jun 26, 2024


Korne127
Contributor

@Korne127 Korne127 commented Sep 28, 2023

Edit: The current summary of this pull request is further down in this conversation.


The form loader is currently decoding all files as UTF-8.
Therefore, files that aren't encoded as UTF-8 are decoded wrongly if they're loaded with the form loader, and the content isn't saved correctly.
In this case, non-ASCII characters like European letters with diacritics aren't decoded correctly and therefore cannot be displayed. For non-Latin scripts, the impact is much more severe as almost everything is decoded wrongly and therefore unreadable.

This pull request fixes this. It adds the crate chardetng, which is used to detect the encoding. If necessary, the form data is converted into UTF-8.

@kmeisthax
Member

The returned data should be accompanied by an HTTP response header listing out the encoding. Do we know if Flash uses that header?

@Korne127
Contributor Author

Korne127 commented Nov 9, 2023

@kmeisthax The response header does not always include a charset. The example through which I found this bug was a .txt file that was loaded with the form loader. The file is encoded in Latin-1 but decoded (as everything currently is) as UTF-8.
There is no way to decode such a txt file correctly with a 100% guarantee, as no charset is given. UTF-8 texts can (but don't have to) contain a byte order mark (BOM) indicating that the text is encoded in UTF-8, but in general, detecting the encoding like this is the way to display all texts correctly. Firefox itself uses the same crate when opening the same txt file (and displays a warning in the console that the detected encoding is not guaranteed to be the right one).
This way, texts should almost always be decoded with the right encoding in Ruffle and displayed properly.
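
For illustration, the BOM check mentioned above can be done cheaply before any sniffing. This is a minimal std-only Rust sketch; the function name `strip_utf8_bom` is made up for this example and is not part of this PR:

```rust
/// Returns the payload without a leading UTF-8 byte order mark (EF BB BF),
/// plus a flag saying whether a BOM was present.
/// Hypothetical helper for illustration only.
fn strip_utf8_bom(bytes: &[u8]) -> (&[u8], bool) {
    const UTF8_BOM: &[u8] = &[0xEF, 0xBB, 0xBF];
    match bytes.strip_prefix(UTF8_BOM) {
        Some(rest) => (rest, true),
        None => (bytes, false),
    }
}
```

A missing BOM says nothing about the encoding, which is why a detector is still needed for files like the one above.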

@adrian17 mentioned in the Discord server that this is likely not how Flash itself did it. Latin-1 was more widely used 10-20 years ago, so there might have just been a test to differentiate between Latin-1 and UTF-8 (or Flash might also have used a given charset if one existed). Flash might have decoded texts in other little-used encodings wrongly.
Technically, Flash decoding such texts wrongly could have been used in a way (e.g. while storing byte data) although that's not very likely.

@n0samu
Member

n0samu commented Nov 11, 2023

My hunch is that Flash Player decodes non-UTF8 text according to the system codepage. I tried converting the txt file in Ruffle's loadvariables test to Shift-JIS (and adding some Japanese characters), and Flash Player seems to try to interpret the text as Windows-1252 (which produces garbage output). I don't think smarter behavior is a bad thing though! Just need to make sure it's reliable.

@kmeisthax
Member

System codepage is basically guaranteed to create mojibake, and we don't have access to it in HTML, so I'm perfectly fine with sniffing the encoding in that case. Especially if we're using Firefox's sniffer which should be well tested.

In the case where the form response does have valid encoding information, is it possible to use that?

@Korne127
Contributor Author

I looked into this for a while and tried to find out how Flash itself exactly determined it.

To do this, I created a test SWF file which loads several text files with different encodings and different HTTP headers.
There are twelve files, one for each combination of the file encoding (UTF-8, ISO Latin-1 and Shift JIS) and the charset in the HTTP header (None, UTF-8, ISO Latin-1 and Shift JIS). The loaded variables are printed onto a text box.
You can try it out here: https://korne127.de/flash_test/encoding_test/Test.swf

Surprisingly, when I tested it, only the UTF-8 file is decoded correctly (both with the Flash plugin and the desktop version):
[Image: Decoding results without HTTP header]

And the HTTP header didn't change anything in my tests (CharISO means that the HTTP header includes ISO Latin-1 as charset):
[Image: Decoding results with HTTP header]

As the ISO Latin-1 text file loaded by the SWF I found is correctly decoded by Flash, I'll assume that there is some other factor which determines the encoding Flash uses.

@kmeisthax @n0samu Can you test this as well since this might behave differently on different platforms?

Independently of what Flash does, I think it might make sense to use this PR's approach and correctly determine the charset.
Theoretically, someone could have stored binary data in a text file that Flash then reads as random characters which are used in a specific way. In that case, determining a charset would change the behaviour, but I find that rather unlikely.

I'll look into using the HTTP header if one exists instead of chardetng (this obviously only works for remote files though).

@n0samu
Member

n0samu commented Nov 22, 2023

This is the result I got opening the SWF via its URL in the Flash projector on Windows 10. Looks like the same result as yours, hope it helps. Do you have an example of a game that this PR helps with btw? Just wondering

UTF-8.txt
var2: ÄÖÜ ß
var1: Example Text in UTF8:

Iso Latin-1.txt
var2: Ŗڠ
var1: Example Text in Iso Latin-1:

Shift JIS.txt
var2: �Љ¼�¼
var1: Example Text in Shift JIS:

CharUTF UTF-8.txt
var2: ÄÖÜ ß
var1: Example Text in UTF8:

CharUTF Iso Latin-1.txt
var2: Ŗڠ
var1: Example Text in Iso Latin-1:

CharUTF Shift JIS.txt
var2: �Љ¼�¼
var1: Example Text in Shift JIS:

CharISO UTF-8.txt
var2: ÄÖÜ ß
var1: Example Text in UTF8:

CharISO Iso Latin-1.txt
var2: Ŗڠ
var1: Example Text in Iso Latin-1:

CharISO Shift JIS.txt
var2: �Љ¼�¼
var1: Example Text in Shift JIS:

CharShift UTF-8.txt
var2: ÄÖÜ ß
var1: Example Text in UTF8:

CharShift Iso Latin-1.txt
var2: Ŗڠ
var1: Example Text in Iso Latin-1:

CharShift Shift JIS.txt
var2: �Љ¼�¼
var1: Example Text in Shift JIS:

@Korne127
Contributor Author

Thank you! Yeah, it looks like the same result, and I got the same with the plugin version as well. That's quite interesting, I wonder what's causing Flash to only decode the UTF-8 files correctly in this test (as Flash is correctly decoding ISO Latin-1 files in other SWFs).

And yeah, I have a game; I noticed this issue originally in Paraplüsch. If you choose the German flag, the "Vorgeschichte Überspringen" (Skip Intro in German) turns into "Vorgeschichte berspringen" as soon as the text file is loaded whose content overwrites that text field. Text from that file is seen in other places in the game as well, and none of the letters with diacritics are displayed there either.
That happens because the text is encoded as ISO Latin-1 and read as UTF-8 by Ruffle. This is not the case in Flash which displays all texts correctly.
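
To make the mismatch concrete, here is a small std-only Rust sketch. The helper names are hypothetical, and `String::from_utf8_lossy` only stands in for "decode as UTF-8"; it is not a claim about which call Ruffle used. In Latin-1 (and Windows-1252), "Ü" is the single byte 0xDC, which is not a valid UTF-8 sequence on its own:

```rust
/// Decodes bytes as ISO-8859-1 (Latin-1): every byte maps to the Unicode
/// code point with the same value, so this can never fail.
fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Decodes bytes as UTF-8, replacing invalid sequences with U+FFFD.
fn decode_utf8_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

/// "Überspringen" as stored in a Latin-1 file: 'Ü' is the single byte 0xDC.
const UEBERSPRINGEN_LATIN1: &[u8] = b"\xDCberspringen";
```

With these helpers, `decode_latin1(UEBERSPRINGEN_LATIN1)` yields "Überspringen", while `decode_utf8_lossy(UEBERSPRINGEN_LATIN1)` yields a replacement character followed by "berspringen".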

Bildschirmaufnahme.2023-11-22.um.01.45.27.mov

I also implemented kmeisthax's idea to use the HTTP header encoding and to fall back to chardetng only if the HTTP header doesn't include one. This worked great on desktop; I'll test the web version tomorrow and then update this pull request.

@n0samu
Member

n0samu commented Nov 22, 2023

This game is SWFv5, which stored strings with the system encoding instead of UTF-8, so it makes sense that external strings loaded from SWFv5 would behave the same way.

See also: #8390

@Korne127
Contributor Author

I committed the idea to use the response header. With the current commits, the encoding specified in the response header is used if one exists; otherwise chardetng is used.
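
As a rough illustration of the header side, a charset can be pulled out of a `Content-Type` value with std alone. This is a simplified sketch, not the code in this PR: the name `charset_from_content_type` is made up, and it ignores edge cases such as semicolons inside quoted parameter values.

```rust
/// Extracts the `charset` parameter from a Content-Type header value,
/// e.g. "text/plain; charset=ISO-8859-1" -> Some("iso-8859-1").
/// Returns None when the header carries no (or an empty) charset.
/// Hypothetical helper for illustration only.
fn charset_from_content_type(content_type: &str) -> Option<String> {
    content_type
        .split(';')
        .skip(1) // the first segment is the media type itself
        .filter_map(|param| param.split_once('='))
        .find(|(name, _)| name.trim().eq_ignore_ascii_case("charset"))
        .map(|(_, value)| value.trim().trim_matches('"').to_ascii_lowercase())
        .filter(|value| !value.is_empty())
}
```

A `None` result is the case where the detector has to take over.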

@n0samu Thank you for the tip! I tested it again with different SWF versions and can now confirm that the behaviour is the same in SWF version 6, 8, 10 and 43. I haven't tested version 5 yet as my test SWF uses some functions that only exist in SWF versions 6 and upwards.

Why do you say that the game is SWFv5 though? I downloaded anstalt_kore.swf, and according to FFDec, the SWF version is 8.

I think an approach like this could also be part of the solution for #8390. At least if we don't have access to the system codepage and / or it's not reliable for some reason, detecting the encoding through chardetng seems like a reasonable alternative.

@n0samu
Member

n0samu commented Nov 25, 2023

@Korne127 Oh, I'd only looked at offtext_deut.swf, which I assumed was loading the text file. Seems like I was wrong though, since the loadVariables call indeed comes from anstalt_kore.swf, which is SWFv8. Sorry about that

@danielhjacobs danielhjacobs added the waiting-on-review Waiting on review from a Ruffle team member label Dec 12, 2023
@Korne127
Contributor Author

Hey :)
I spent some time figuring out what's different about Paraplüsch that it loads text files in Iso Latin-1 (or to be exact Windows-1252) while all my test SWF files used UTF-8.
And after a while, I found it: It contains this command in the first frame: System.useCodepage = true;
As @n0samu assumed, Flash loads the text files in the system codepage, but only if this variable has been set to true.
Otherwise, UTF-8 is used if the SWF version is 6 or higher. If the SWF version is 5 or lower, it uses Windows-1252 on Windows and a slightly different custom encoding on macOS (the letters with diacritics have the same encoding as in Windows-1252 but some other symbols don't).
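
For reference, Windows-1252 differs from ISO Latin-1 only in the byte range 0x80 to 0x9F, where it places printable characters such as the euro sign instead of control codes. A std-only Rust sketch of a decoder follows; the names are illustrative and this is not the decoder used in this PR. The table follows the WHATWG mapping, which keeps the five unassigned bytes as C1 control characters:

```rust
/// Code points for bytes 0x80..=0x9F in Windows-1252 (WHATWG mapping).
const CP1252_HIGH: [char; 32] = [
    '\u{20AC}', '\u{0081}', '\u{201A}', '\u{0192}', '\u{201E}', '\u{2026}', '\u{2020}', '\u{2021}',
    '\u{02C6}', '\u{2030}', '\u{0160}', '\u{2039}', '\u{0152}', '\u{008D}', '\u{017D}', '\u{008F}',
    '\u{0090}', '\u{2018}', '\u{2019}', '\u{201C}', '\u{201D}', '\u{2022}', '\u{2013}', '\u{2014}',
    '\u{02DC}', '\u{2122}', '\u{0161}', '\u{203A}', '\u{0153}', '\u{009D}', '\u{017E}', '\u{0178}',
];

/// Decodes bytes as Windows-1252. Outside 0x80..=0x9F the mapping is
/// identical to ISO Latin-1 (byte value == code point).
fn decode_windows_1252(bytes: &[u8]) -> String {
    bytes
        .iter()
        .map(|&b| match b {
            0x80..=0x9F => CP1252_HIGH[(b - 0x80) as usize],
            _ => b as char,
        })
        .collect()
}
```

The bytes `C4 D6 DC 20 DF` decode to "ÄÖÜ ß" under both encodings, which is why the difference is easy to miss with German text.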

I updated this pull request according to my findings.
Ruffle doesn't always have access to the system codepage and it's not reliably the correct encoding. Therefore, if System.useCodepage has been set to true, Ruffle now uses the encoding in the HTTP response header if one exists, and guesses it via chardetng otherwise.
If System.useCodepage is false, Ruffle uses UTF-8 if the SWF version is 6 or higher and Windows-1252 otherwise.
This way, Ruffle is able to load non-UTF-8 files and display their characters correctly.
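
The decision described above can be summarised in a few lines. This is a sketch with hypothetical names (`EncodingChoice`, `choose_encoding`), not the actual types or functions in Ruffle's code:

```rust
/// Where the decoder's encoding comes from.
/// Hypothetical type for illustration, not Ruffle's.
#[derive(Debug, PartialEq)]
enum EncodingChoice {
    /// Use the charset named in the HTTP response header.
    FromHeader(String),
    /// No charset available: sniff the bytes (this PR uses chardetng here).
    Sniff,
    Utf8,
    Windows1252,
}

/// Mirrors the rules from the comment above:
/// useCodepage set -> header charset if present, else sniffing;
/// otherwise UTF-8 for SWF version 6+, Windows-1252 below that.
fn choose_encoding(
    use_codepage: bool,
    swf_version: u8,
    header_charset: Option<&str>,
) -> EncodingChoice {
    if use_codepage {
        match header_charset {
            Some(charset) => EncodingChoice::FromHeader(charset.to_string()),
            None => EncodingChoice::Sniff,
        }
    } else if swf_version >= 6 {
        EncodingChoice::Utf8
    } else {
        EncodingChoice::Windows1252
    }
}
```

For a local file there is no response header, so `header_charset` would be `None` and the `Sniff` branch applies whenever `System.useCodepage` is true.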

I also added some tests to make sure that there won't be regressions that cause Ruffle to load files incorrectly again.

@Korne127 Korne127 force-pushed the form-loader-encoding branch 3 times, most recently from f3e1a9e to 544bc79 Compare June 22, 2024 14:08
Flash's form loader loads text files in the local system codepage if
System#useCodepage has been set to true. Previously, Ruffle always
(wrongly) used UTF-8, leading to incorrectly displayed characters.
This has been fixed. Ruffle now supports loading files with an encoding
other than UTF-8.
As Ruffle doesn't always have access to the system codepage and as it's
not reliably the correct encoding, the crate chardetng has been added.
It's used instead of the system codepage to detect the encoding, and the
data is converted into UTF-8.
If System#useCodepage has been set to true, the form loader now uses the
encoding specified in the HTTP response content type field, if existing,
to decode remote text files. chardetng is now (only) used if the HTTP
response doesn't specify any encoding or if the file is local.
The form loader now loads files using Windows-1252 if the SWF version is
smaller than 6. This roughly matches Flash's behaviour (Flash uses
Windows-1252 on Windows, on macOS a slightly different custom encoding
is used).
Previously, UTF-8 has been (wrongly) used for all SWF files if
System#useCodepage hasn't been set to true, leading to incorrectly
displayed characters.
Several form loader encoding tests have been added. They test whether
the form loader uses the correct encoding to decode text files with
different SWF versions and settings.
One test has been marked as known failure as it tests how Flash decodes
invalid UTF-8 characters (when decoding as UTF-8), which is not yet
implemented in Ruffle.
@adrian17 adrian17 merged commit 76d3cd9 into ruffle-rs:master Jun 26, 2024
17 checks passed
@danielhjacobs danielhjacobs removed the waiting-on-review Waiting on review from a Ruffle team member label Jun 27, 2024

6 participants