Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[php2cpg] Try multiple charsets when reading file content #5222

Merged
merged 3 commits into from
Jan 14, 2025

Conversation

johannescoetzee
Copy link
Contributor

There's a customer issue where reading the file content was resulting in a MalformedInputException being thrown. With this PR, we just try all of the Java standard charsets before emitting a warning instead of crashing: https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/nio/charset/StandardCharsets.html

I would be very surprised if we ever need the specific endianness UTF-16 variants, but since this only tries encodings until something works, I also don't think adding them to the list has much of a downside.

@max-leuthaeuser
Copy link
Contributor

max-leuthaeuser commented Jan 13, 2025

I think thats why we have https://github.com/ShiftLeftSecurity/codepropertygraph/blob/master/codepropertygraph/src/main/scala/io/shiftleft/utils/IOUtils.scala

It does not try multiple encodings but does some replacing and BOM fixing. Also unpaired surrogates are a thing. Had all of this in JS customer code already.

@johannescoetzee johannescoetzee force-pushed the johannes/php-file-encoding branch from 1b08f8c to 6ee5aa4 Compare January 13, 2025 10:47
@@ -146,7 +146,7 @@ class MethodTests extends PhpCode2CpgFixture {
fooMethod.file.head.content.substring(offsetStart, offsetEnd) shouldBe
"""function foo() {
| // ⦝
| $x = "🙂⨌🙂𐇐🙂🙂🙂🙂";
| $x = "??⨌????????????";
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using IOUtils.readEntireFile introduces a regression in the characters we can actually parse, but I suspect it's unlikely we'll encounter many such characters in actual code and is outweighed by the benefit of using a uniform approach across frontends

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, stange. It works here: io.joern.jssrc2cpg.io.CodeDumperFromContentTests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get the same results when pasting that string into one of the jssrc tests:

...
  x = 'foo[??⨌????????????]';
}" was not equal to "...ass Foo
{
  x = 'foo[🙂⨌🙂𐇐🙂🙂🙂🙂]';
...

I did a hex dump of a file containing only 🙂 and it matches the UTF-8 encoding from https://www.compart.com/en/unicode/U+1F642

❯ hexdump /tmp/smile
0000000 9ff0 8299 000a                         
0000005
❯ cat /tmp/smile
🙂

nvim could've done some conversions on write or such though. If not, it looks like it should be a valid UTF-8 character

Copy link
Contributor

@max-leuthaeuser max-leuthaeuser Jan 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe its time to move IOUtils to x2cpg (which wasnt a thing back then) and fix the utf8 handling. I'll do a draft PR.

Hm, can't do this. semanticcpg also uses that and can't depend on x2cpg (cyclic dependencies).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this: ShiftLeftSecurity/codepropertygraph#1804 the test case works again.
I was not able to (re-) discover some code with actual unpaired surrogates (which was the reason I implemented it in first place). Not really sure on how to proceed there.

Copy link
Contributor

@maltek maltek Jan 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this table here help with crafting such a file? https://simonsapin.github.io/wtf-8/#surrogates-byte-sequences

I'm pretty sure we only added the surrogate replacement because of issues we had with a customer file.

Copy link
Contributor

@max-leuthaeuser max-leuthaeuser Jan 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting! We should add some tests for that. Just not sure how to craft it exactly and if it makes sense to feed such a thing into our parsers at the end.

@johannescoetzee johannescoetzee force-pushed the johannes/php-file-encoding branch from cd2abfd to 3a57833 Compare January 13, 2025 15:51
@johannescoetzee
Copy link
Contributor Author

Upgrading to ShiftLeftSecurity/codepropertygraph#1804 fixed the test. Thanks @max-leuthaeuser!

@max-leuthaeuser max-leuthaeuser merged commit a922084 into master Jan 14, 2025
5 checks passed
@max-leuthaeuser max-leuthaeuser deleted the johannes/php-file-encoding branch January 14, 2025 07:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants