Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

$DEFINE UnicodeRE off-> tests fail #343

Open
Alexey-T opened this issue Nov 16, 2023 · 9 comments
Open

$DEFINE UnicodeRE off-> tests fail #343

Alexey-T opened this issue Nov 16, 2023 · 9 comments

Comments

@Alexey-T
Copy link
Collaborator

Alexey-T commented Nov 16, 2023

Martin, can you adjust tests project to not fail with UnicodeRE off?
You are really good in composing tests. I admit.
@User4martin

@Alexey-T Alexey-T reopened this Nov 16, 2023
@Alexey-T
Copy link
Collaborator Author

Alexey-T commented Nov 16, 2023

Test-proj has it's own UnicodeRE define copy. but even if i disable it, tests fail! @User4martin

@User4martin
Copy link

Well, haven't checked background yet, but

    // 69
    ( // empty str
    expression: '^ *$';
    inputText: '';
    substitutionText: '';
    expectedResult: '';
    matchStart: 1

fails for me => and that looks like a bug in the regex engine.


The others are down to the define.

Ideally the defines need to move into their own include files.

Then (and I can check that) the failing test may need to be disabled.

The [-] range of Russian chars seems not to be implemented for utf8 yet. Possible, but an issue of its own (and not necessary one that would have my time soon).

The #%85 line break => same thing. But maybe can be fixed easy for utf8.

@User4martin
Copy link

User4martin commented Nov 16, 2023

IsAnyLineBreak could be changed to take a pointer to ReChar.

Then it could return zero, or the length of any matched line break. That way it could handle utf-8 encoded line breaks of more than one byte.

The test case would then need to be changed to have #$C2#$85 in the string.

@Alexey-T
Copy link
Collaborator Author

Then it could return zero, or the length of any matched line break.

do code need this really, if it works good already? only more complex logic.

@User4martin
Copy link

Then it could return zero, or the length of any matched line break.

do code need this really, if it works good already? only more complex logic.

Well, is "not implemented" = works good?

At the moment, using the utf-8 version, Linebreaks like "'NEXT LINE (NEL)' (U+0085)" are simple not detected. utf-8 is unicode, so those codes do exist.

That is unless it is meant to be ASCII? Then a utf-8 version is really needed. (And afaik there is more to be fixed for proper utf8 support, but this would be a start)

@Alexey-T
Copy link
Collaborator Author

so it is needed, okay.

@Alexey-T
Copy link
Collaborator Author

Alexey-T commented Nov 17, 2023

but is it needed that in non-Unicode mode we must find pure Unicode linebreak? we can ignore chr(85) in non-Unicode mode, logical.

@User4martin
Copy link

but is it needed that in non-Unicode mode we must find pure Unicode linebreak? we can ignore chr(85) in non-Unicode mode, logical.

IMHO: Wrong Question.
Utf8 is also a Unicode mode.

The question is: Does the regex currently have an ASCII (non Unicode) or an Ut8 (Unicode) mode?

But, IMHO the answer does not matter. IMHO a Utf8 mode is what is needed.

So then the only question is:
Utf8 mode: Add or Fix?

@Alexey-T
Copy link
Collaborator Author

So then the only question is:
Utf8 mode: Add or Fix?

Add.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants