Syntax Status and Roadmap #63

Closed
milseman opened this issue Dec 10, 2021 · 12 comments

Comments

@milseman
Member

milseman commented Dec 10, 2021

For the regex literal syntax, we're looking at supporting a syntactic superset of:

  • PCRE2, an "industry standard" of sorts, and a rough superset of Perl, Python, etc.

  • Oniguruma, an internationalization-oriented engine with some modern features

  • ICU, used by NSRegularExpression, a Unicode-focused engine

  • Our interpretation of UTS#18's guidance, which is about semantics, but we can infer syntactic feature sets.

  • TODO: .NET, which has delimiter-balancing and some interesting minor details on conditional patterns

These aren't all strictly compatible (e.g. a set operator in PCRE2 would just be a redundant statement of a set member). We can explore adding strict compatibility modes, but in general the syntactic superset is fairly straightforward.

Status

The below are (roughly) implemented. There may be bugs, but we have some support and some testing coverage:

  • Alternations a|b
  • Capture groups e.g. (x), (?:x), (?<name>x)
  • Escaped character sequences e.g. \n, \a
  • Unicode scalars e.g. \u{...}, \x{...}, \uHHHH
  • Builtin character classes e.g. ., \d, \w, \s
  • Custom character classes [...], including binary operators &&, ~~, --
  • Quantifiers x?, x+, x*, x{n,m}
  • Anchors e.g. \b, ^, $
  • Quoted sequences \Q ... \E
  • Comments (?#comment)
  • Character properties \p{...}, [:...:]
  • Named characters \N{...}, \N{U+hh}
  • Lookahead and lookbehind e.g. (?=), (?!), (*pla:), (?*...), (?<*...), (*napla:...)
  • Script runs e.g. (*script_run:...), (*sr:...), (*atomic_script_run:...), (*asr:...)
  • Octal sequences \ddd, \o{...}
  • Backreferences e.g. \1, \g2, \g{2}, \k<name>, \k'name', \g{name}, \k{name}, (?P=name)
  • Matching options e.g. (?m), (?-i), (?:si), (?^m)
  • Sub-patterns e.g. \g<n>, \g'n', (?R), (?1), (?&name), (?P>name)
  • Conditional patterns e.g. (?(R)...), (?(n)...), (?(<n>)...), (?('n')...), (?(condition)then|else)
  • PCRE callouts e.g. (?C2), (?C"text")
  • PCRE backtracking directives e.g. (*ACCEPT), (*SKIP:NAME)
  • [.NET] Balancing group definitions (?<name1-name2>...)
  • [Oniguruma] Recursion level for backreferences e.g. \k<n+level>, (?(n+level))
  • [Oniguruma] Extended callout syntax e.g. (?{...}), (*name)
    • NOTE: In Perl, (?{...}) has in-line code in it; we could consider the same (for now, we just parse an arbitrary string)
  • [Oniguruma] Absent functions e.g. (?~absent)
  • PCRE global matching options e.g. (*LIMIT_MATCH=d), (*LF)
  • Extended-mode (?x)/(?xx) syntax allowing for non-semantic whitespace and end-of-line comments abc # comment
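
As a rough illustration of the extended-mode item above, here is a minimal sketch using Python's `re` module as a stand-in engine (an assumption made for illustration; it is not the parser discussed in this issue, and Python has no `(?xx)`). It shows `(?x)` ignoring pattern whitespace and treating `# ...` as an end-of-line comment. The names `extended` and `plain` are hypothetical.

```python
import re

# Stand-in engine: Python's `re`, not the Swift regex parser.
# (?x) turns on extended mode: unescaped whitespace in the pattern is
# ignored and `#` starts a comment that runs to the end of the line.
extended = re.compile(r"""(?x)
    \d{4}   # year
    -       # separator
    \d{2}   # month
""")

# The same pattern written without extended mode.
plain = re.compile(r"\d{4}-\d{2}")

for text in ["2021-12", "2021 - 12"]:
    # Both patterns should agree on every input, since the whitespace
    # and comments in `extended` carry no meaning.
    print(text, bool(extended.fullmatch(text)), bool(plain.fullmatch(text)))
```

The point of the comparison is that whitespace in the pattern is non-semantic, while whitespace in the input still has to be matched explicitly.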

Experimental syntax

Additionally, we have (even more experimental) support for some syntactic conveniences, if specified. Note that each of these (except perhaps ranges) may introduce a syntactic incompatibility with existing traditional-syntax regexes. Thus, they are mostly illustrative, showing what happens and where we go as we slide down this "slippery slope".

  • Non-semantic whitespace: /a b c/ === /abc/
  • Modern quotes: /"a.b"/ === /\Qa.b\E/
  • Swift style ranges: /a{2..<10} b{...3}/ === /a{2,9}b{0,3}/
  • Non-captures: /a (_: b) c/ === /a(?:b)c/
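
The range equivalence above can be sketched as a small source-to-source rewrite. This is only an illustration in Python under stated assumptions: `desugar_ranges` is a hypothetical helper, not part of the project, and it ignores escapes, quoted sequences, and custom character classes, which a real parser would have to respect.

```python
import re

# Matches {n...m}, {n..<m}, {...m}, {n...} style quantifier bodies.
_RANGE = re.compile(r"\{(\d*)(\.\.\.|\.\.<)(\d*)\}")

def desugar_ranges(pattern: str) -> str:
    """Rewrite Swift-style range quantifiers into traditional {n,m} form."""
    def rewrite(match):
        lower, op, upper = match.groups()
        low = lower or "0"              # a missing lower bound means 0
        if upper and op == "..<":
            upper = str(int(upper) - 1) # half-open range excludes the upper bound
        return "{" + low + "," + upper + "}"
    return _RANGE.sub(rewrite, pattern)

# Whitespace is left alone here; removing it is the separate
# "non-semantic whitespace" feature.
print(desugar_ranges("a{2..<10} b{...3}"))
```

Keeping the rewrite purely textual makes the mapping easy to see, but it also shows why these conveniences can clash with traditional syntax: the helper has no way to tell a quantifier from literal braces.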

TBD:

  • Modern named captures: /a (name: b) c/ === /a(?<name>b)c/
  • Modern comments using /* comment */ or // comment instead of (?#. comment)
  • Multi-line expressions
    • Line-terminating comments as // comment
  • Full Swift-lexed comments, string literals as quotes (includes raw and interpolation), etc.
    • Makes sense to add as we suck actual literal lexing through our wormhole in the compiler

Swift's syntactic additions

  • Options for selecting a semantic level
    • X: grapheme cluster semantics
    • O: Unicode scalar semantics
    • b: byte semantics

Source location tracking

Implemented:

  • Location of | in alternation
  • Location of - in [a-f]

TBD:

Integration with the Swift compiler

Initial parser support landed in swiftlang/swift#40595, using the delimiters '/.../', which are lexed in-package.

@milseman
Member Author

Lots of source location tracking in #67. After that I'll probably start focusing more on other areas of the project.

@hamishknight
Contributor

Escaped backreferences including maybe-octal-sequences in #88

@hamishknight
Contributor

Option parsing in #91

@milseman
Member Author

Swift-specific options for switching between matching semantic levels: #112

@hamishknight
Contributor

Conditional patterns in #113

@hamishknight
Contributor

PCRE callouts, backtracking directives, and .NET balanced captures in #117

@hamishknight
Contributor

PCRE global options and Oniguruma recursion levels in #123

After that, it's just the extended syntax, and Oniguruma callouts and absent functions.

@hamishknight
Contributor

Remaining Oniguruma-specific syntax in #129

@milseman milseman mentioned this issue Jan 26, 2022
13 tasks
@hamishknight
Contributor

Extended syntax in #136

@milseman
Member Author

milseman commented Mar 4, 2022

Another one is Unicode scalar sequences, à la https://unicode.org/reports/tr18/#RL1.1

\u{3b1 3b3 3b5 3b9}
==
\u{3b1}\u{3b3}\u{3b5}\u{3b9}
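
A minimal sketch of that equivalence as a textual expansion, written in Python for illustration. `expand_scalar_sequences` is a hypothetical helper, not project code, and it assumes scalars are hex digits separated by whitespace inside a single `\u{...}`.

```python
import re

# Matches \u{...} only when it holds two or more whitespace-separated hex scalars.
_SEQ = re.compile(r"\\u\{([0-9A-Fa-f]+(?:\s+[0-9A-Fa-f]+)+)\}")

def expand_scalar_sequences(pattern: str) -> str:
    """Expand \\u{A B C} into \\u{A}\\u{B}\\u{C}, leaving single scalars untouched."""
    def rewrite(match):
        scalars = match.group(1).split()
        return "".join("\\u{" + scalar + "}" for scalar in scalars)
    return _SEQ.sub(rewrite, pattern)

print(expand_scalar_sequences(r"\u{3b1 3b3 3b5 3b9}"))
```

Treating the sequence form as pure shorthand keeps the two spellings interchangeable, which is the property the comment above describes.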

@milseman
Member Author

milseman commented May 6, 2022

@hamishknight can you go over this and see what needs to be tracked as an issue for this release and what needs to go into #370?

@hamishknight
Contributor

We've completed the syntax feature work here; future syntax work is being tracked in #370.
