Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add options support to the compiler #112

Merged
merged 19 commits into from
Jan 28, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
19 commits
Select commit Hold shift + click to select a range
29faff1
Add MatchingOptionSet
natecook1000 Jan 13, 2022
2086a06
Remove shortcut for `.` matching
natecook1000 Jan 13, 2022
db077d5
Track the current matching options within the compiler
natecook1000 Jan 13, 2022
bf0a8f5
Correctly scope matching options within groups
natecook1000 Jan 13, 2022
29d8f0d
Move MatchingOptionSet up to the compiler level
natecook1000 Jan 14, 2022
ffc1f5f
Simplify MatchingOptionSet down to what the compiler needs
natecook1000 Jan 14, 2022
6e9a5db
Don't allow removing semantic level options
natecook1000 Jan 15, 2022
f199910
Implement the 'U' reluctant default option
natecook1000 Jan 15, 2022
57a550c
Implement options resetting
natecook1000 Jan 15, 2022
4fd82dd
Collapse MatchingOptionSet/Stack interface
natecook1000 Jan 18, 2022
26c83fd
Merge branch 'main' into matching_options_support
natecook1000 Jan 18, 2022
a7a3917
Enforce exclusive options as a MatchingOptions invariant
natecook1000 Jan 24, 2022
265baac
Merge branch 'main' into matching_options_support
natecook1000 Jan 24, 2022
1345bfc
Handle exclusivity of `x` and `xx` option flags
natecook1000 Jan 25, 2022
7b390b7
Merge branch 'main' into matching_options_support
natecook1000 Jan 25, 2022
25345cc
Separate AST options from options in the compiler
natecook1000 Jan 26, 2022
550068f
Clarify Atom.any compilation
natecook1000 Jan 26, 2022
e50bad9
Merge branch 'main' into matching_options_support
natecook1000 Jan 27, 2022
15c470d
Update Sources/_StringProcessing/CharacterClass.swift
natecook1000 Jan 28, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 14 additions & 0 deletions Sources/_MatchingEngine/Regex/AST/MatchingOptions.swift
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,11 @@ extension AST {
// be unset, only flipped between)
case textSegmentGraphemeMode // y{g}
case textSegmentWordMode // y{w}

// Swift semantic matching level
case graphemeClusterSemantics // X
case unicodeScalarSemantics // u
case byteSemantics // b
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these existing options, or are you defining Swift-specific syntax?

We'll want to be able to lower this AST node (entirely) into a custom OptionSet-like model type using your merge semantics below.

But, we do need a semantics for each option (even if that's just unsupported("reasons")).

@hamishknight what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah these are Swift-specific, they don't conflict with the matching options from other engines tho AFAIK

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to keep these well-delineated in that case, and be very clear that this is something we're formally proposing as an addition to existing regex syntax. Should we prefix the names with swift to be clearer?

We should also make a note or sub-section in #63 for these kinds of syntactic additions. They're otherwise easy to lose track of.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added to #63.

}
public var kind: Kind
public var location: SourceLocation
Expand All @@ -53,6 +58,15 @@ extension AST {
return false
}
}

public var isSemanticMatchingLevel: Bool {
switch kind {
case .graphemeClusterSemantics, .unicodeScalarSemantics, .byteSemantics:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do other engines have ways of switching semantic models between byte and scalar? How should we lower those?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only existing engine I've found that has this semantic switching is Rust's regex::bytes, which lets you switch from the default byte-matching mode into a Unicode mode with (?u).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does .textSegmentGraphemeMode do?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Oniguruma, .textSegmentGraphemeMode and .textSegmentWordMode are spelled y{g} and y{w}, and change the behavior of \y and \Y between marking grapheme boundaries and marking word boundaries (as defined by https://unicode.org/reports/tr29/). That in turn changes the behavior of \X, since it consumes up to the next \y boundary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll probably want a DSL equivalent for everything we support here. Have you thought about that at all, something like a TextSegmentBoundary which will read in-coming context to determine what to do?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, we'll need some way to configure DSL-defined patterns, for these, case insensitivity, etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are your thoughts there? This seems relevant to whether we should be using an enum in the AST as the same type/representation for our model.

return true
default:
return false
}
}
}

/// A sequence of matching options written in source.
Expand Down
3 changes: 3 additions & 0 deletions Sources/_MatchingEngine/Regex/Parse/Diagnostics.swift
Original file line number Diff line number Diff line change
Expand Up @@ -65,6 +65,7 @@ enum ParseError: Error, Hashable {
case identifierCannotStartWithNumber(IdentifierKind)

case cannotRemoveTextSegmentOptions
case cannotRemoveSemanticsOptions
case expectedCalloutArgument
}

Expand Down Expand Up @@ -145,6 +146,8 @@ extension ParseError: CustomStringConvertible {
return "\(i.diagDescription) must not start with number"
case .cannotRemoveTextSegmentOptions:
return "text segment mode cannot be unset, only changed"
case .cannotRemoveSemanticsOptions:
return "semantic level cannot be unset, only changed"
case .expectedCalloutArgument:
return "expected argument to callout"
}
Expand Down
9 changes: 9 additions & 0 deletions Sources/_MatchingEngine/Regex/Parse/LexicalAnalysis.swift
Original file line number Diff line number Diff line change
Expand Up @@ -574,6 +574,11 @@ extension Source {
try src.expect("}")
return opt

// Swift semantic level options
case "X": return advanceAndReturn(.graphemeClusterSemantics)
case "u": return advanceAndReturn(.unicodeScalarSemantics)
case "b": return advanceAndReturn(.byteSemantics)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this novel syntax we're talking about adding, or does this exist?

Copy link
Member Author

@natecook1000 natecook1000 Jan 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These would be Swift-specific flags. The X flag is to match the \X metacharacter, which matches a grapheme cluster; given that, it might be better to go with O for Unicode scalar semantics to match \O. I go through in the flags write-up.

default:
return nil
}
Expand Down Expand Up @@ -618,6 +623,10 @@ extension Source {
if opt.isTextSegmentMode {
throw ParseError.cannotRemoveTextSegmentOptions
}
// Matching semantics options can only be added, not removed.
if opt.isSemanticMatchingLevel {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is text segment option and how is it related to these? What should we do with them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They're pretty much orthogonal — text-segment mode determines how \y and \Y behave, the semantic matching level determines what string view . and the character classes operate on.

throw ParseError.cannotRemoveSemanticsOptions
}
removing.append(opt)
}
return .init(caretLoc: nil, adding: adding, minusLoc: ateMinus.location,
Expand Down
11 changes: 9 additions & 2 deletions Sources/_StringProcessing/CharacterClass.swift
Original file line number Diff line number Diff line change
Expand Up @@ -338,12 +338,19 @@ extension AST.Atom {
switch kind {
case let .escaped(b): return b.characterClass

case .any: return .any
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's happening to CharacterClass.any?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a reasonable line for someone to add. Could you explicitly return nil with a comment explaining why we don't want to produce a character class?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(outside the scope of this PR)

Or for that matter, should we remove or redesign CharacteClass? It originally tried to fit 3 rather distinct purposes, and it still awkwardly conflates the latter 2:

  1. Syntactic-information carrying types for the AST, for source tools and compilation consumption
  2. Model types to assist in compilation and/or run-time call outs for the matching engine
  3. API types appearing as public API from the stdlib.

We got it off the AST a while back, but we still have this as being public in the _StringProcessing module.

We definitely will want some type that people can use as namespaces to look up built-in sets, and a type for registering their own via closures (either Character predicates or Character or Scalar consumer closures), etc. We could over time evolve this to have set-algebraic operations on it, etc.

We will separately want a model-layer type to assist in compilation. That might be the same type if it happens to align, but I definitely don't want to try to burden an API type with that need.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a reasonable line for someone to add. Could you explicitly return nil with a comment explaining why we don't want to produce a character class?

thoughts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a note about where .any is being handled.


Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing this means that the .any case will be handled using a consumer function, rather than the old CharacterClass type.

case .property:
// TODO: Would our model type for character classes include
// this? Or does grapheme-semantic mode complicate that?
return nil

case .any:
// `.any` is handled in the matching engine by Compiler.emitAny() and in
// the legacy compiler by the `.any` instruction, which can provide lower
// level instructions than the CharacterClass-generated consumer closure
//
// FIXME: We shouldn't be returning `nil` here, but instead fixing the call
// site to check for any before trying to construct a character class.
return nil

default: return nil

Expand Down
66 changes: 51 additions & 15 deletions Sources/_StringProcessing/Compiler.swift
Original file line number Diff line number Diff line change
Expand Up @@ -18,18 +18,13 @@ struct RegexProgram {

class Compiler {
let ast: AST
let matchLevel: CharacterClass.MatchLevel
let options: REOptions
private var options = MatchingOptions()
private var builder = RegexProgram.Program.Builder()

init(
ast: AST,
matchLevel: CharacterClass.MatchLevel = .graphemeCluster,
options: REOptions = []
ast: AST
) {
self.ast = ast
self.matchLevel = matchLevel
self.options = options
}

__consuming func emit() throws -> RegexProgram {
Expand All @@ -42,11 +37,9 @@ class Compiler {
func emit(_ node: AST) throws {

switch node {
// Any: .
// consume 1
case .atom(let a) where a.kind == .any && matchLevel == .graphemeCluster:
builder.buildAdvance(1)

milseman marked this conversation as resolved.
Show resolved Hide resolved
case .atom(let a) where a.kind == .any:
try emitAny()

// Single characters we just match
case .atom(let a) where a.singleCharacter != nil :
builder.buildMatch(a.singleCharacter!)
Expand Down Expand Up @@ -97,6 +90,9 @@ class Compiler {
throw unsupported(node.renderAsCanonical())

case .group(let g):
options.beginScope()
defer { options.endScope() }

if let lookaround = g.lookaroundKind {
try emitLookaround(lookaround, g.child)
return
Expand All @@ -113,6 +109,10 @@ class Compiler {
try emit(g.child)
builder.buildEndCapture(cap)

case .changeMatchingOptions(let optionSequence, _):
options.apply(optionSequence)
try emit(g.child)

default:
// FIXME: Other kinds...
try emit(g.child)
Expand All @@ -124,8 +124,8 @@ class Compiler {
// For now, we model sets and atoms as consumers.
// This lets us rapidly expand support, and we can better
// design the actual instruction set with real examples
case _ where try node.generateConsumer(matchLevel) != nil:
try builder.buildConsume(by: node.generateConsumer(matchLevel)!)
case _ where try node.generateConsumer(options) != nil:
try builder.buildConsume(by: node.generateConsumer(options)!)

case .quote(let q):
// We stick quoted content into read-only constant strings
Expand Down Expand Up @@ -158,6 +158,31 @@ class Compiler {
throw unsupported(node.renderAsCanonical())
}
}

func emitAny() throws {
switch (options.semanticLevel, options.dotMatchesNewline) {
case (.graphemeCluster, true):
builder.buildAdvance(1)
case (.graphemeCluster, false):
builder.buildConsume { input, bounds in
input[bounds.lowerBound].isNewline
? nil
: input.index(after: bounds.lowerBound)
}

case (.unicodeScalar, true):
// TODO: builder.buildAdvanceUnicodeScalar(1)
builder.buildConsume { input, bounds in
input.unicodeScalars.index(after: bounds.lowerBound)
}
case (.unicodeScalar, false):
builder.buildConsume { input, bounds in
input[bounds.lowerBound].isNewline
? nil
: input.unicodeScalars.index(after: bounds.lowerBound)
}
}
}
milseman marked this conversation as resolved.
Show resolved Hide resolved

func emitAssertion(_ kind: AST.Atom.AssertionKind) throws {
// FIXME: Depends on API model we have... We may want to
Expand Down Expand Up @@ -458,7 +483,18 @@ class Compiler {

func emitQuantification(_ quant: AST.Quantification) throws {
let child = quant.child
let kind = quant.kind.value

// If in reluctant-by-default mode, eager and reluctant need to be switched.
let kind: AST.Quantification.Kind
if options.isReluctantByDefault
&& quant.kind.value != .possessive
{
kind = quant.kind.value == .eager
? .reluctant
: .eager
} else {
kind = quant.kind.value
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curiously, I don't think there's a syntax for explicitly requesting eager. Do you know @hamishknight ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, it's either the default (/a+/) or it's flipped and you use the "reluctant" syntax for eager (/(?U)a+?/).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be more broadly useful for this to be a method on the options? Something like remap(_ kind: AST.Quantification.Kind) -> AST.Quantification.Kind?

I'm also thinking that we might want to pull some of these concepts out of the AST, especially if we want to surface them through the DSL. My DSLTree prototypes have these inside them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps, but I think the "toggle reluctance" flag only makes sense in the terse jumble of regex literal syntax. We should choose a default and just require always spelling alternatives for the DSL.


switch quant.amount.value.bounds {
case (_, atMost: 0):
Expand Down
35 changes: 18 additions & 17 deletions Sources/_StringProcessing/ConsumerInterface.swift
Original file line number Diff line number Diff line change
Expand Up @@ -28,12 +28,8 @@ func unsupported(
file: StaticString = #file,
line: UInt = #line
) -> Unsupported {
// TODO: how do we not have a public init for this?
let fStr = file.withUTF8Buffer {
String(decoding: $0, as: UTF8.self)
}
return Unsupported(
message: s, file: fStr, line: Int(line))
message: s, file: String(describing: file), line: Int(line))
}

extension AST {
Expand All @@ -42,8 +38,7 @@ extension AST {
/// A consumer is a Swift closure that matches against
/// the front of an input range
func generateConsumer(
// TODO: Better option modeling
_ opts: CharacterClass.MatchLevel
_ opts: MatchingOptions
) throws -> Program<String>.ConsumeFunction? {
switch self {
case .atom(let a):
Expand Down Expand Up @@ -77,10 +72,10 @@ extension AST.Atom {
}

func generateConsumer(
_ opts: CharacterClass.MatchLevel
_ opts: MatchingOptions
) throws -> Program<String>.ConsumeFunction? {
// TODO: Wean ourselves off of this type...
if let cc = self.characterClass?.withMatchLevel(opts) {
if let cc = self.characterClass?.withMatchLevel(opts.matchLevel) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(post this PR) definitely be thinking about how we should model character classes better, given that options can change their definitions

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we at least need to transition from CharacterClass.MatchLevel to something like the new MatchingOptions type.

return { input, bounds in
// FIXME: should we worry about out of bounds?
cc.matches(in: input, at: bounds.lowerBound)
Expand Down Expand Up @@ -109,10 +104,16 @@ extension AST.Atom {
// TODO: alias? casing?
$0.name == name || $0.nameAlias == name
}

case .any:
fatalError(".atom(.any) is handled in emitAny")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self: make a struct Unrecoverable: Error and throw them


case .startOfLine, .endOfLine:
// handled in emitAssertion
return nil

case .escaped, .keyboardControl, .keyboardMeta, .keyboardMetaControl,
.any, .startOfLine, .endOfLine,
.backreference, .subpattern, .callout, .backtrackingDirective:
.backreference, .subpattern, .callout, .backtrackingDirective:
// FIXME: implement
return nil
}
Expand All @@ -121,7 +122,7 @@ extension AST.Atom {

extension AST.CustomCharacterClass.Member {
func generateConsumer(
_ opts: CharacterClass.MatchLevel
_ opts: MatchingOptions
) throws -> Program<String>.ConsumeFunction {
switch self {
case .custom(let ccc):
Expand Down Expand Up @@ -212,7 +213,7 @@ extension AST.CustomCharacterClass.Member {

extension AST.CustomCharacterClass {
func generateConsumer(
_ opts: CharacterClass.MatchLevel
_ opts: MatchingOptions
) throws -> Program<String>.ConsumeFunction {
// NOTE: Easy way to implement, obviously not performant
let consumers = try members.map {
Expand Down Expand Up @@ -265,7 +266,7 @@ private func consumeScalar(

extension AST.Atom.CharacterProperty {
func generateConsumer(
_ opts: CharacterClass.MatchLevel
_ opts: MatchingOptions
) throws -> Program<String>.ConsumeFunction {
// Handle inversion for us, albeit not efficiently
func invert(
Expand Down Expand Up @@ -335,7 +336,7 @@ extension AST.Atom.CharacterProperty {
extension Unicode.BinaryProperty {
// FIXME: Semantic level, vet for precise defs
func generateConsumer(
_ opts: CharacterClass.MatchLevel
_ opts: MatchingOptions
) throws -> Program<String>.ConsumeFunction {
switch self {

Expand Down Expand Up @@ -499,7 +500,7 @@ extension Unicode.BinaryProperty {
extension Unicode.POSIXProperty {
// FIXME: Semantic level, vet for precise defs
func generateConsumer(
_ opts: CharacterClass.MatchLevel
_ opts: MatchingOptions
) -> Program<String>.ConsumeFunction {
// FIXME: semantic levels, modes, etc
switch self {
Expand Down Expand Up @@ -545,7 +546,7 @@ extension Unicode.POSIXProperty {
extension Unicode.ExtendedGeneralCategory {
// FIXME: Semantic level
func generateConsumer(
_ opts: CharacterClass.MatchLevel
_ opts: MatchingOptions
) throws -> Program<String>.ConsumeFunction {
switch self {
case .letter:
Expand Down
Loading