-
Notifications
You must be signed in to change notification settings - Fork 297
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[php2cpg] Try multiple charsets when reading file content #5222
Conversation
I think thats why we have https://github.com/ShiftLeftSecurity/codepropertygraph/blob/master/codepropertygraph/src/main/scala/io/shiftleft/utils/IOUtils.scala It does not try multiple encodings but does some replacing and BOM fixing. Also unpaired surrogates are a thing. Had all of this in JS customer code already. |
1b08f8c
to
6ee5aa4
Compare
@@ -146,7 +146,7 @@ class MethodTests extends PhpCode2CpgFixture { | |||
fooMethod.file.head.content.substring(offsetStart, offsetEnd) shouldBe | |||
"""function foo() { | |||
| // ⦝ | |||
| $x = "🙂⨌🙂𐇐🙂🙂🙂🙂"; | |||
| $x = "??⨌????????????"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using IOUtils.readEntireFile
introduces a regression in the characters we can actually parse, but I suspect it's unlikely we'll encounter many such characters in actual code and is outweighed by the benefit of using a uniform approach across frontends
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, stange. It works here: io.joern.jssrc2cpg.io.CodeDumperFromContentTests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get the same results when pasting that string into one of the jssrc tests:
...
x = 'foo[??⨌????????????]';
}" was not equal to "...ass Foo
{
x = 'foo[🙂⨌🙂𐇐🙂🙂🙂🙂]';
...
I did a hex dump of a file containing only 🙂
and it matches the UTF-8 encoding from https://www.compart.com/en/unicode/U+1F642
❯ hexdump /tmp/smile
0000000 9ff0 8299 000a
0000005
❯ cat /tmp/smile
🙂
nvim could've done some conversions on write or such though. If not, it looks like it should be a valid UTF-8 character
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe its time to move IOUtils to x2cpg (which wasnt a thing back then) and fix the utf8 handling. I'll do a draft PR.
Hm, can't do this. semanticcpg also uses that and can't depend on x2cpg (cyclic dependencies).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this: ShiftLeftSecurity/codepropertygraph#1804 the test case works again.
I was not able to (re-) discover some code with actual unpaired surrogates (which was the reason I implemented it in first place). Not really sure on how to proceed there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this table here help with crafting such a file? https://simonsapin.github.io/wtf-8/#surrogates-byte-sequences
I'm pretty sure we only added the surrogate replacement because of issues we had with a customer file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting! We should add some tests for that. Just not sure how to craft it exactly and if it makes sense to feed such a thing into our parsers at the end.
cd2abfd
to
3a57833
Compare
Upgrading to ShiftLeftSecurity/codepropertygraph#1804 fixed the test. Thanks @max-leuthaeuser! |
There's a customer issue where reading the file content was resulting in a
MalformedInputException
being thrown. With this PR, we just try all of the Java standard charsets before emitting a warning instead of crashing: https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/nio/charset/StandardCharsets.htmlI would be very surprised if we ever need the specific endianness UTF-16 variants, but since this only tries encodings until something works, I also don't think adding them to the list has much of a downside.