FindRepeated and Unicode / may break OP_STAR/PLUS/... #371

Open
User4martin opened this issue Jan 7, 2024 · 2 comments
@User4martin

I have not analysed this any further...

FindRepeated (for Unicode) calls IncUnicode2, which may increment by 2 for a surrogate pair. This will be a problem for the OPs that can match a surrogate.

OP_STAR/... in MatchPrim will iterate over the returned range in steps of one ReChar (code unit): regInput := save + no;

Also, the result of FindRepeated may be counted in either of two units:

  • code units for OP_ANY (counting a surrogate pair as 2)
  • "Chars"/full code points for any of the OP_NOT... (counting a surrogate pair as 1)
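To illustrate the two ways of counting, here is a small Python model. It is not TRegExpr code; the helper names (utf16_units, count_code_units, count_code_points) are made up for this sketch, and it only assumes that a ReChar is one UTF-16 code unit.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def count_code_units(units):
    """Count the way a code-unit iterator would: a surrogate pair counts as 2."""
    return len(units)


def count_code_points(units):
    """Count full code points: a well-formed surrogate pair counts as 1."""
    count = 0
    i = 0
    while i < len(units):
        is_pair = (
            0xD800 <= units[i] <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        i += 2 if is_pair else 1
        count += 1
    return count


units = utf16_units("a\U00010000")  # 'a' followed by U+10000
print(count_code_units(units), count_code_points(units))
```

If the caller always steps by one code unit, only the first count is safe to hand back.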

One way I can think of to trigger this is (.+).

  • if the last char in the text is a surrogate pair, then the capture can match half of the pair
  • if the text is exactly one char, and that char is a surrogate pair, then it incorrectly matches. It needs 2 chars, and takes each half of the surrogate pair as a full char.

OP_STAR goes back by half the surrogate pair, and then OP_ANY does not check whether it is matching the 2nd part of a surrogate pair.
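A rough Python model of that backtracking, to make the failure concrete. This is only a sketch of the behaviour as I read it from the code, not the engine itself: naive_plus_then_any is a hypothetical name, the text is a list of UTF-16 code units, and the match is tried at position 0 only.

```python
def naive_plus_then_any(units):
    """Model '.+.' with a greedy '.+' that backs off one code unit at a time.

    Returns (units taken by '.+', end of the whole match), or None.
    """
    for taken in range(len(units), 0, -1):  # '.+' needs at least one unit
        pos = taken
        if pos < len(units):
            # '.' accepts whatever single code unit is here, even a lone
            # low surrogate, because nothing checks for a split pair.
            return (taken, pos + 1)
    return None


print(naive_plus_then_any([0xD800, 0xDC00]))  # U+10000 as one surrogate pair
print(naive_plus_then_any([0x61]))            # a single BMP char
```

In this model the single surrogate pair produces a match with '.+' covering only the high surrogate, while a single BMP char correctly produces no match.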


This may be fixable as follows (but I have not tested it):

  • OP_STAR... in MatchPrim must check whether regInput := save + no; points to the 2nd part of a surrogate pair
  • FindRepeated must always return the number of code units (ReChars), i.e. always count a surrogate pair as 2.
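The two points above could look roughly like this in a Python model. Again untested against the real engine and only a sketch: the repeat count is kept in code units, and backtrack positions that land on the 2nd half of a pair are skipped. The names splits_pair and checked_plus_then_any are invented for the example.

```python
def is_high(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low(unit):
    return 0xDC00 <= unit <= 0xDFFF


def splits_pair(units, pos):
    """True if pos points at the 2nd half of a surrogate pair."""
    return 0 < pos < len(units) and is_low(units[pos]) and is_high(units[pos - 1])


def checked_plus_then_any(units):
    """Model '.+.' at position 0, counting in code units but never splitting a pair."""
    for taken in range(len(units), 0, -1):
        if splits_pair(units, taken):
            continue  # do not resume matching in the middle of a pair
        if taken < len(units):
            # '.' consumes a whole pair when one starts here
            whole_pair = (
                is_high(units[taken])
                and taken + 1 < len(units)
                and is_low(units[taken + 1])
            )
            return (taken, taken + (2 if whole_pair else 1))
    return None


print(checked_plus_then_any([0xD800, 0xDC00]))        # single surrogate pair
print(checked_plus_then_any([0x61, 0xD800, 0xDC00]))  # 'a' + surrogate pair
```

With the check in place, the single surrogate pair no longer matches, and for 'a' followed by a pair the trailing '.' takes the whole pair.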
@Alexey-T
Collaborator

Alexey-T commented Jan 7, 2024

In which case (RE, text) does the engine currently fail?

@User4martin
Author

I only deduced this from code review.
But take U+10000 (https://www.compart.com/de/unicode/U+10000):

IsNotMatching('surrogat', '.+.', #$D800#$DC00); fails (it will match).

This is one char, so the .+ should consume it entirely and leave nothing for the extra "." at the end.

Btw, the same issue exists with combining code points.


On https://regex101.com/ not all regex flavors handle this either (Python, GoLang, and Java seem to).
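For comparison, the Python case can be checked locally with the standard re module. Python 3 strings are sequences of code points, so U+10000 is a single character there even though it needs two UTF-16 code units.

```python
import re

text = "\U00010000"  # same char as #$D800#$DC00 in UTF-16

utf16_unit_count = len(text.encode("utf-16-le")) // 2  # code units in UTF-16
code_point_count = len(text)                           # code points in Python
match = re.search(r".+.", text)                        # needs at least 2 chars

print(utf16_unit_count, code_point_count, match)
```

Here the search finds nothing, which is the result the IsNotMatching test expects.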

@Alexey-T Alexey-T added the bug label Jan 8, 2024