[php2cpg] Try multiple charsets when reading file content #5222

johannescoetzee · 2025-01-13T09:59:29Z

There's a customer issue where reading the file content was resulting in a MalformedInputException being thrown. With this PR, we just try all of the Java standard charsets before emitting a warning instead of crashing: https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/nio/charset/StandardCharsets.html

I would be very surprised if we ever need the specific endianness UTF-16 variants, but since this only tries encodings until something works, I also don't think adding them to the list has much of a downside.

max-leuthaeuser · 2025-01-13T10:05:29Z

I think thats why we have https://github.com/ShiftLeftSecurity/codepropertygraph/blob/master/codepropertygraph/src/main/scala/io/shiftleft/utils/IOUtils.scala

It does not try multiple encodings but does some replacing and BOM fixing. Also unpaired surrogates are a thing. Had all of this in JS customer code already.

johannescoetzee · 2025-01-13T10:49:17Z

joern-cli/frontends/php2cpg/src/test/scala/io/joern/php2cpg/querying/MethodTests.scala

@@ -146,7 +146,7 @@ class MethodTests extends PhpCode2CpgFixture {
        fooMethod.file.head.content.substring(offsetStart, offsetEnd) shouldBe
          """function foo() {
            |  // ⦝
-            |  $x = "🙂⨌🙂𐇐🙂🙂🙂🙂";
+            |  $x = "??⨌????????????";


Using IOUtils.readEntireFile introduces a regression in the characters we can actually parse, but I suspect it's unlikely we'll encounter many such characters in actual code and is outweighed by the benefit of using a uniform approach across frontends

Hm, stange. It works here: io.joern.jssrc2cpg.io.CodeDumperFromContentTests

I get the same results when pasting that string into one of the jssrc tests:

... x = 'foo[??⨌????????????]'; }" was not equal to "...ass Foo { x = 'foo[🙂⨌🙂𐇐🙂🙂🙂🙂]'; ...

I did a hex dump of a file containing only 🙂 and it matches the UTF-8 encoding from https://www.compart.com/en/unicode/U+1F642

❯ hexdump /tmp/smile 0000000 9ff0 8299 000a 0000005 ❯ cat /tmp/smile 🙂

nvim could've done some conversions on write or such though. If not, it looks like it should be a valid UTF-8 character

~~Maybe its time to move IOUtils to x2cpg (which wasnt a thing back then) and fix the utf8 handling. I'll do a draft PR.~~

Hm, can't do this. semanticcpg also uses that and can't depend on x2cpg (cyclic dependencies).

With this: ShiftLeftSecurity/codepropertygraph#1804 the test case works again.
I was not able to (re-) discover some code with actual unpaired surrogates (which was the reason I implemented it in first place). Not really sure on how to proceed there.

does this table here help with crafting such a file? https://simonsapin.github.io/wtf-8/#surrogates-byte-sequences

I'm pretty sure we only added the surrogate replacement because of issues we had with a customer file.

Interesting! We should add some tests for that. Just not sure how to craft it exactly and if it makes sense to feed such a thing into our parsers at the end.

johannescoetzee · 2025-01-13T15:52:43Z

Upgrading to ShiftLeftSecurity/codepropertygraph#1804 fixed the test. Thanks @max-leuthaeuser!

johannescoetzee requested review from ml86, maltek and max-leuthaeuser January 13, 2025 09:59

johannescoetzee force-pushed the johannes/php-file-encoding branch from 1b08f8c to 6ee5aa4 Compare January 13, 2025 10:47

johannescoetzee commented Jan 13, 2025

View reviewed changes

max-leuthaeuser mentioned this pull request Jan 13, 2025

Do not longer remove surrogates ShiftLeftSecurity/codepropertygraph#1804

Merged

johannescoetzee added 3 commits January 13, 2025 16:50

[php2cpg] Try multiple charsets when reading file content

f76033c

Use IOUtils to read file instead

c884516

Bump cpg and fix test expectations

3a57833

johannescoetzee force-pushed the johannes/php-file-encoding branch from cd2abfd to 3a57833 Compare January 13, 2025 15:51

max-leuthaeuser approved these changes Jan 13, 2025

View reviewed changes

max-leuthaeuser merged commit a922084 into master Jan 14, 2025
5 checks passed

max-leuthaeuser deleted the johannes/php-file-encoding branch January 14, 2025 07:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[php2cpg] Try multiple charsets when reading file content #5222

[php2cpg] Try multiple charsets when reading file content #5222

johannescoetzee commented Jan 13, 2025

max-leuthaeuser commented Jan 13, 2025 •

edited

Loading

johannescoetzee Jan 13, 2025

max-leuthaeuser Jan 13, 2025

johannescoetzee Jan 13, 2025

max-leuthaeuser Jan 13, 2025 •

edited

Loading

max-leuthaeuser Jan 13, 2025

maltek Jan 13, 2025 •

edited

Loading

max-leuthaeuser Jan 13, 2025 •

edited

Loading

johannescoetzee commented Jan 13, 2025

[php2cpg] Try multiple charsets when reading file content #5222

[php2cpg] Try multiple charsets when reading file content #5222

Conversation

johannescoetzee commented Jan 13, 2025

max-leuthaeuser commented Jan 13, 2025 • edited Loading

johannescoetzee Jan 13, 2025

Choose a reason for hiding this comment

max-leuthaeuser Jan 13, 2025

Choose a reason for hiding this comment

johannescoetzee Jan 13, 2025

Choose a reason for hiding this comment

max-leuthaeuser Jan 13, 2025 • edited Loading

Choose a reason for hiding this comment

max-leuthaeuser Jan 13, 2025

Choose a reason for hiding this comment

maltek Jan 13, 2025 • edited Loading

Choose a reason for hiding this comment

max-leuthaeuser Jan 13, 2025 • edited Loading

Choose a reason for hiding this comment

johannescoetzee commented Jan 13, 2025

max-leuthaeuser commented Jan 13, 2025 •

edited

Loading

max-leuthaeuser Jan 13, 2025 •

edited

Loading

maltek Jan 13, 2025 •

edited

Loading

max-leuthaeuser Jan 13, 2025 •

edited

Loading