[bug] Whole-word filters only work correctly with ASCII text #3299

Open
VyrCossont opened this issue Sep 14, 2024 · 1 comment
Labels
bug Something isn't working

Comments

Contributor

VyrCossont commented Sep 14, 2024

Go's regexp package documents \b as an ASCII-only word boundary, which means our whole-word filters don't match correctly on non-ASCII text.
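
To make the failure concrete, here is a minimal sketch of what goes wrong. The helper `asciiWholeWord` is hypothetical and only assumes that a whole-word filter wraps the keyword in \b...\b; it is not the project's actual filter code.

```go
package main

import (
	"fmt"
	"regexp"
)

// asciiWholeWord wraps a keyword in \b...\b, the way a naive whole-word
// filter might. Hypothetical helper for illustration only.
func asciiWholeWord(keyword string) *regexp.Regexp {
	return regexp.MustCompile(`(?i)\b` + regexp.QuoteMeta(keyword) + `\b`)
}

func main() {
	// ASCII keyword in ASCII text: behaves as expected.
	fmt.Println(asciiWholeWord("hot").MatchString("a hot day")) // true

	// Missed match: "é" is not an ASCII word character, so there is no
	// \b on either side of "été".
	fmt.Println(asciiWholeWord("été").MatchString("un été chaud")) // false

	// False positive: "î" is not an ASCII word character, so \b fires
	// in the middle of "dîner" and the filter matches "ner".
	fmt.Println(asciiWholeWord("ner").MatchString("un dîner")) // true
}
```

So the problem cuts both ways: keywords with non-ASCII edges can fail to match, and ASCII keywords can match inside a longer non-ASCII word.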

UTS #18 (Unicode Regular Expressions) has some guidance for this. We might be able to achieve what it calls "Level 1" or "Level 2" word boundary support by replacing \b with comprehensive patterns built from the Unicode features that Go can match on. "Level 3" might be too much work:

"Semantic analysis may be required for correct word-break in languages that don't require spaces, such as Thai, Japanese, Chinese or Korean. This can require fairly sophisticated support if Level 3 word boundary detection is required, and usually requires drawing on platform OS services."
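
A rough sketch of what a \b replacement could look like, under some loud assumptions: `unicodeWholeWord` and `wordChar` are hypothetical names, and the set of "word characters" here (letters, marks, decimal digits, connector punctuation) is only an approximation in the spirit of a simple word boundary, not a verified implementation of any level of the spec. Since Go's regexp has no lookaround, the pattern consumes one neighbouring character on each side instead of asserting a zero-width boundary.

```go
package main

import (
	"fmt"
	"regexp"
)

// wordChar is an assumed approximation of a "word character" built only
// from Unicode general categories that Go's regexp can match.
const wordChar = `\p{L}\p{M}\p{Nd}\p{Pc}`

// unicodeWholeWord requires the keyword to be preceded by the start of the
// text or a non-word character, and followed by the end of the text or a
// non-word character. Hypothetical helper for illustration only.
func unicodeWholeWord(keyword string) *regexp.Regexp {
	return regexp.MustCompile(
		`(?i)(?:^|[^` + wordChar + `])` +
			regexp.QuoteMeta(keyword) +
			`(?:$|[^` + wordChar + `])`)
}

func main() {
	// Keyword with non-ASCII edges is now found as a whole word.
	fmt.Println(unicodeWholeWord("été").MatchString("un été chaud")) // true

	// "ner" no longer matches inside "dîner", because "î" is a letter.
	fmt.Println(unicodeWholeWord("ner").MatchString("un dîner")) // false

	// Text without spaces is still treated as one long word, which is
	// the limitation the quoted passage describes.
	fmt.Println(unicodeWholeWord("本").MatchString("日本語")) // false
}
```

Because the neighbouring characters are consumed, this shape suits a yes/no check like MatchString. Anything that needs exact match offsets or back-to-back matches would have to use a capture group around the keyword and handle the shared separator between adjacent matches.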

Discovered while investigating #3128.

VyrCossont added the bug label Sep 14, 2024
Contributor

tsmethurst commented Sep 15, 2024

Well that's unfortunate. Thanks for investigating + writing this up!
