-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add options support to the compiler #112
Changes from 14 commits
29faff1
2086a06
db077d5
bf0a8f5
29d8f0d
ffc1f5f
6e9a5db
f199910
57a550c
4fd82dd
26c83fd
a7a3917
265baac
1345bfc
7b390b7
25345cc
550068f
e50bad9
15c470d
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -12,7 +12,7 @@ | |
extension AST { | ||
/// An option written in source that changes matching semantics. | ||
public struct MatchingOption: Hashable { | ||
public enum Kind { | ||
public enum Kind: Int { | ||
// PCRE options | ||
case caseInsensitive // i | ||
case allowDuplicateGroupNames // J | ||
|
@@ -36,6 +36,11 @@ extension AST { | |
// be unset, only flipped between) | ||
case textSegmentGraphemeMode // y{g} | ||
case textSegmentWordMode // y{w} | ||
|
||
// Swift semantic matching level | ||
case graphemeClusterSemantics // X | ||
case unicodeScalarSemantics // u | ||
case byteSemantics // b | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Are these existing options, or are you defining Swift-specific syntax? We'll want to be able to lower this AST node (entirely) into a custom OptionSet-like model type using your merge semantics below. But, we do need a semantics for each option (even if that's just @hamishknight what do you think? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah these are Swift-specific, they don't conflict with the matching options from other engines tho AFAIK There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would like to keep these well-delineated in that case, and be very clear that this is something we're formally proposing as an addition to existing regex syntax. Should we prefix the names with We should also make a note or sub-section in #63 for these kinds of syntactic additions. They're otherwise easy to lose track of. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added to #63. |
||
} | ||
public var kind: Kind | ||
public var location: SourceLocation | ||
|
@@ -53,6 +58,15 @@ extension AST { | |
return false | ||
} | ||
} | ||
|
||
public var isSemanticMatchingLevel: Bool { | ||
switch kind { | ||
case .graphemeClusterSemantics, .unicodeScalarSemantics, .byteSemantics: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do other engines have ways of switching semantic models between byte and scalar? How should we lower those? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The only existing engine I've found that has this semantic switching is Rust's There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What does There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In Oniguruma, There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We'll probably want a DSL equivalent for everything we support here. Have you thought about that at all, something like a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Right, we'll need some way to configure DSL-defined patterns, for these, case insensitivity, etc. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What are your thoughts there? This seems relevant to whether we should be using an enum in the AST as the same type/representation for our model. |
||
return true | ||
default: | ||
return false | ||
} | ||
} | ||
} | ||
|
||
/// A sequence of matching options written in source. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -563,6 +563,11 @@ extension Source { | |
try src.expect("}") | ||
return opt | ||
|
||
// Swift semantic level options | ||
case "X": return advanceAndReturn(.graphemeClusterSemantics) | ||
case "u": return advanceAndReturn(.unicodeScalarSemantics) | ||
case "b": return advanceAndReturn(.byteSemantics) | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this novel syntax we're talking about adding, or does this exist? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. These would be Swift-specific flags. The |
||
default: | ||
return nil | ||
} | ||
|
@@ -607,6 +612,10 @@ extension Source { | |
if opt.isTextSegmentMode { | ||
throw ParseError.cannotRemoveTextSegmentOptions | ||
} | ||
// Matching semantics options can only be added, not removed. | ||
if opt.isSemanticMatchingLevel { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What is text segment option and how is it related to these? What should we do with them? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. They're pretty much orthogonal — text-segment mode determines how |
||
throw ParseError.cannotRemoveSemanticsOptions | ||
} | ||
removing.append(opt) | ||
} | ||
return .init(caretLoc: nil, adding: adding, minusLoc: ateMinus.location, | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -338,12 +338,14 @@ extension AST.Atom { | |
switch kind { | ||
case let .escaped(b): return b.characterClass | ||
|
||
case .any: return .any | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What's happening to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This seems like a reasonable line for someone to add. Could you explicitly return nil with a comment explaining why we don't want to produce a character class? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (outside the scope of this PR) Or for that matter, should we remove or redesign
We got it off the AST a while back, but we still have this as being We definitely will want some type that people can use as namespaces to look up built-in sets, and a type for registering their own via closures (either We will separately want a model-layer type to assist in compilation. That might be the same type if it happens to align, but I definitely don't want to try to burden an API type with that need. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
thoughts? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added a note about where |
||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Removing this means that the |
||
case .property: | ||
// TODO: Would our model type for character classes include | ||
// this? Or does grapheme-semantic mode complicate that? | ||
return nil | ||
|
||
case .any: | ||
// 'any' is handled by Compiler.emitAny(), not `CharacterClass` | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Depending on the current options, There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should be a fatal error then? Unfortunately this is trying to make a CharacterClass, and there's an |
||
return nil | ||
|
||
default: return nil | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -18,18 +18,13 @@ struct RegexProgram { | |
|
||
class Compiler { | ||
let ast: AST | ||
let matchLevel: CharacterClass.MatchLevel | ||
let options: REOptions | ||
private var options = MatchingOptions() | ||
private var builder = RegexProgram.Program.Builder() | ||
|
||
init( | ||
ast: AST, | ||
matchLevel: CharacterClass.MatchLevel = .graphemeCluster, | ||
options: REOptions = [] | ||
ast: AST | ||
) { | ||
self.ast = ast | ||
self.matchLevel = matchLevel | ||
self.options = options | ||
} | ||
|
||
__consuming func emit() throws -> RegexProgram { | ||
|
@@ -42,11 +37,9 @@ class Compiler { | |
func emit(_ node: AST) throws { | ||
|
||
switch node { | ||
// Any: . | ||
// consume 1 | ||
case .atom(let a) where a.kind == .any && matchLevel == .graphemeCluster: | ||
builder.buildAdvance(1) | ||
|
||
milseman marked this conversation as resolved.
Show resolved
Hide resolved
|
||
case .atom(let a) where a.kind == .any: | ||
try emitAny() | ||
|
||
// Single characters we just match | ||
case .atom(let a) where a.singleCharacter != nil : | ||
builder.buildMatch(a.singleCharacter!) | ||
|
@@ -94,6 +87,9 @@ class Compiler { | |
break | ||
|
||
case .group(let g): | ||
options.beginScope() | ||
defer { options.endScope() } | ||
|
||
if let lookaround = g.lookaroundKind { | ||
try emitLookaround(lookaround, g.child) | ||
return | ||
|
@@ -110,6 +106,10 @@ class Compiler { | |
try emit(g.child) | ||
builder.buildEndCapture(cap) | ||
|
||
case .changeMatchingOptions(let optionSequence, _): | ||
options.apply(optionSequence) | ||
try emit(g.child) | ||
|
||
default: | ||
// FIXME: Other kinds... | ||
try emit(g.child) | ||
|
@@ -121,8 +121,8 @@ class Compiler { | |
// For now, we model sets and atoms as consumers. | ||
// This lets us rapidly expand support, and we can better | ||
// design the actual instruction set with real examples | ||
case _ where try node.generateConsumer(matchLevel) != nil: | ||
try builder.buildConsume(by: node.generateConsumer(matchLevel)!) | ||
case _ where try node.generateConsumer(options) != nil: | ||
try builder.buildConsume(by: node.generateConsumer(options)!) | ||
|
||
case .quote(let q): | ||
// We stick quoted content into read-only constant strings | ||
|
@@ -155,6 +155,31 @@ class Compiler { | |
throw unsupported(node.renderAsCanonical()) | ||
} | ||
} | ||
|
||
func emitAny() throws { | ||
switch (options.semanticLevel, options.dotMatchesNewline) { | ||
case (.graphemeCluster, true): | ||
builder.buildAdvance(1) | ||
case (.graphemeCluster, false): | ||
builder.buildConsume { input, bounds in | ||
input[bounds.lowerBound].isNewline | ||
? nil | ||
: input.index(after: bounds.lowerBound) | ||
} | ||
|
||
case (.unicodeScalar, true): | ||
// TODO: builder.buildAdvanceUnicodeScalar(1) | ||
builder.buildConsume { input, bounds in | ||
input.unicodeScalars.index(after: bounds.lowerBound) | ||
} | ||
case (.unicodeScalar, false): | ||
builder.buildConsume { input, bounds in | ||
input[bounds.lowerBound].isNewline | ||
? nil | ||
: input.unicodeScalars.index(after: bounds.lowerBound) | ||
} | ||
} | ||
} | ||
milseman marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
func emitAssertion(_ kind: AST.Atom.AssertionKind) throws { | ||
// FIXME: Depends on API model we have... We may want to | ||
|
@@ -455,7 +480,18 @@ class Compiler { | |
|
||
func emitQuantification(_ quant: AST.Quantification) throws { | ||
let child = quant.child | ||
let kind = quant.kind.value | ||
|
||
// If in reluctant-by-default mode, eager and reluctant need to be switched. | ||
let kind: AST.Quantification.Kind | ||
if options.isReluctantByDefault | ||
&& quant.kind.value != .possessive | ||
{ | ||
kind = quant.kind.value == .eager | ||
? .reluctant | ||
: .eager | ||
} else { | ||
kind = quant.kind.value | ||
} | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Curiously, I don't think there's a syntax for explicitly requesting eager. Do you know @hamishknight ? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Right, it's either the default ( There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Would it be more broadly useful for this to be a method on the options? Something like I'm also thinking that we might want to pull some of these concepts out of the AST, especially if we want to surface them through the DSL. My DSLTree prototypes have these inside them. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Perhaps, but I think the "toggle reluctance" flag only makes sense in the terse jumble of regex literal syntax. We should choose a default and just require always spelling alternatives for the DSL. |
||
|
||
switch quant.amount.value.bounds { | ||
case (_, atMost: 0): | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -28,12 +28,8 @@ func unsupported( | |
file: StaticString = #file, | ||
line: UInt = #line | ||
) -> Unsupported { | ||
// TODO: how do we not have a public init for this? | ||
let fStr = file.withUTF8Buffer { | ||
String(decoding: $0, as: UTF8.self) | ||
} | ||
return Unsupported( | ||
message: s, file: fStr, line: Int(line)) | ||
message: s, file: String(describing: file), line: Int(line)) | ||
} | ||
|
||
extension AST { | ||
|
@@ -42,8 +38,7 @@ extension AST { | |
/// A consumer is a Swift closure that matches against | ||
/// the front of an input range | ||
func generateConsumer( | ||
// TODO: Better option modeling | ||
_ opts: CharacterClass.MatchLevel | ||
_ opts: MatchingOptions | ||
) throws -> Program<String>.ConsumeFunction? { | ||
switch self { | ||
case .atom(let a): | ||
|
@@ -77,10 +72,10 @@ extension AST.Atom { | |
} | ||
|
||
func generateConsumer( | ||
_ opts: CharacterClass.MatchLevel | ||
_ opts: MatchingOptions | ||
) throws -> Program<String>.ConsumeFunction? { | ||
// TODO: Wean ourselves off of this type... | ||
if let cc = self.characterClass?.withMatchLevel(opts) { | ||
if let cc = self.characterClass?.withMatchLevel(opts.matchLevel) { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (post this PR) definitely be thinking about how we should model character classes better, given that options can change their definitions There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, we at least need to transition from |
||
return { input, bounds in | ||
// FIXME: should we worry about out of bounds? | ||
cc.matches(in: input, at: bounds.lowerBound) | ||
|
@@ -109,10 +104,16 @@ extension AST.Atom { | |
// TODO: alias? casing? | ||
$0.name == name || $0.nameAlias == name | ||
} | ||
|
||
case .any: | ||
fatalError(".atom(.any) is handled in emitAny") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Note to self: make a |
||
|
||
case .startOfLine, .endOfLine: | ||
// handled in emitAssertion | ||
return nil | ||
|
||
case .escaped, .keyboardControl, .keyboardMeta, .keyboardMetaControl, | ||
.any, .startOfLine, .endOfLine, | ||
.backreference, .subpattern, .callout, .backtrackingDirective: | ||
.backreference, .subpattern, .callout, .backtrackingDirective: | ||
// FIXME: implement | ||
return nil | ||
} | ||
|
@@ -121,7 +122,7 @@ extension AST.Atom { | |
|
||
extension AST.CustomCharacterClass.Member { | ||
func generateConsumer( | ||
_ opts: CharacterClass.MatchLevel | ||
_ opts: MatchingOptions | ||
) throws -> Program<String>.ConsumeFunction { | ||
switch self { | ||
case .custom(let ccc): | ||
|
@@ -212,7 +213,7 @@ extension AST.CustomCharacterClass.Member { | |
|
||
extension AST.CustomCharacterClass { | ||
func generateConsumer( | ||
_ opts: CharacterClass.MatchLevel | ||
_ opts: MatchingOptions | ||
) throws -> Program<String>.ConsumeFunction { | ||
// NOTE: Easy way to implement, obviously not performant | ||
let consumers = try members.map { | ||
|
@@ -265,7 +266,7 @@ private func consumeScalar( | |
|
||
extension AST.Atom.CharacterProperty { | ||
func generateConsumer( | ||
_ opts: CharacterClass.MatchLevel | ||
_ opts: MatchingOptions | ||
) throws -> Program<String>.ConsumeFunction { | ||
// Handle inversion for us, albeit not efficiently | ||
func invert( | ||
|
@@ -335,7 +336,7 @@ extension AST.Atom.CharacterProperty { | |
extension Unicode.BinaryProperty { | ||
// FIXME: Semantic level, vet for precise defs | ||
func generateConsumer( | ||
_ opts: CharacterClass.MatchLevel | ||
_ opts: MatchingOptions | ||
) throws -> Program<String>.ConsumeFunction { | ||
switch self { | ||
|
||
|
@@ -499,7 +500,7 @@ extension Unicode.BinaryProperty { | |
extension Unicode.POSIXProperty { | ||
// FIXME: Semantic level, vet for precise defs | ||
func generateConsumer( | ||
_ opts: CharacterClass.MatchLevel | ||
_ opts: MatchingOptions | ||
) -> Program<String>.ConsumeFunction { | ||
// FIXME: semantic levels, modes, etc | ||
switch self { | ||
|
@@ -545,7 +546,7 @@ extension Unicode.POSIXProperty { | |
extension Unicode.ExtendedGeneralCategory { | ||
// FIXME: Semantic level | ||
func generateConsumer( | ||
_ opts: CharacterClass.MatchLevel | ||
_ opts: MatchingOptions | ||
) throws -> Program<String>.ConsumeFunction { | ||
switch self { | ||
case .letter: | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When this is
Int
-backed, we can use the raw value as the bit offset forMatchingOptionSet.rawValue
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect we'll want a model-layer enum or something that represents the compiler's semantic understanding of options rather than the parser's syntactic representation. I ask because this can be (but isn't necessarily!) a code smell that we're still conflating the syntax layer with the semantic layer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is every option here something that would make sense in our DSL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Several of these don't really make sense for the DSL:
extra
,extraExtended
: about literal parsing rather than actual matchingallowDuplicateGroupNames
,noAutoCapture
: the DSL will have a different way of capturingreluctantByDefault
: irrelevant (and/or terrible) for the DSL APIThe rest affect how a pattern matches, but I don't know if we'll want to do them as "options" per se or as different function calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there options that make sense for the DSL but are not present in a literal? How divergent is our syntactic representation from our semantic model?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drop the
: Int
, make a model-layer enum with our option model, function to lower from AST to that model.