Greedy and Reluctant Quantifiers should not produce same match #1
Comments
Hi Luke, you're right about this behavior. It left me unsatisfied but it was the most
(defn strict-exec [re coll & opts]
  ;; Capture re under a fresh group name, let (* _) consume the remaining input,
  ;; then expose the captured group as :match.
  (let [k (gensym :match)]
    (when-some [m (apply exec (cat (as k re) (* _)) coll opts)]
      (-> m (dissoc k) (assoc :match (get m k))))))
I think that doing that in a lazy fashion is a more involved change
That would be great if you added
I didn't realize you'd just given me a wrapper function around your library :) (new to Clojure)
Upon doing that though, I find that it seems to break longest-match semantics that it had; this test now fails:
(deftest simple
  (are [se s m]
    (= (:match (strict-exec se s)) m)
    (se/| (se/cat 1 se/_) (se/cat 1 3 3)) [1 3 3 7 7 9] [1 3 3] ; longest match
    ))
Easy fix for that?
It's normal: (re-find #"1.|133" "133337") yields "13" too.
On Friday, 29 August 2014, Luke Nezda wrote:
On Clojure http://clj-me.cgrand.net/
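To make the point above concrete, here is a small sketch using only `re-find` from `clojure.core` (backed by `java.util.regex`). It shows that alternation in that engine is ordered: the first alternative that matches wins, so swapping the branches changes the result. The helper name `alternation-demo` is made up for this illustration.

```clojure
;; Ordered alternation in java.util.regex: the first branch that matches wins,
;; regardless of which branch would give the longer match.
(defn alternation-demo
  "Returns the matches for both branch orders of the same alternation."
  [s]
  {:short-first (re-find #"1.|133" s)   ; "1." is tried first
   :long-first  (re-find #"133|1." s)}) ; "133" is tried first

(alternation-demo "133337")
```

With the input from the comment above, `:short-first` is `"13"` and `:long-first` is `"133"`, which is why a rule set encoded as one big alternation depends on the order of its branches.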
I had immediately done a similar test leading to the same conclusion, and even found http://stackoverflow.com/a/4515435/689119 explaining this is how NFA engines work, and already thought this was an NFA engine based on the references you provided, but hoped since the following passed, I'd misunderstood 😄 -- bummer, I guess I need a DFA engine since I need longest match semantics
(deftest simple
  (are [se s m]
    (= (:match (exec se s)) m)
    (se/| (se/cat 1 se/_) (se/cat 1 3 3)) [1 3 3 7 7 9] [1 3 3] ; longest match
    ))
The DFA/NFA discussion is slightly wrong. Most NFA engines are backtracking engines. Mine is not. In a sense, the algorithm I use builds the DFA on demand.
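As a rough illustration of what "building the DFA on demand" can mean in general (this is not the library's actual implementation, only a sketch under that assumption), one can simulate an NFA by tracking the set of live states and caching each (state-set, symbol) transition the first time it is needed. Every cached state set plays the role of one DFA state. The names `make-stepper`, `longest-prefix-length`, and `example-nfa` are hypothetical, and the NFA has no epsilon transitions to keep the sketch short.

```clojure
;; NFA for the pattern (1 3) | (1 3 3), as state -> symbol -> set of next states.
;; States 2 and 5 are accepting.
(def example-nfa
  {0 {1 #{1 3}}
   1 {3 #{2}}
   3 {3 #{4}}
   4 {3 #{5}}})

(defn make-stepper
  "Returns a function (states, sym) -> next state set that caches its results.
  Each distinct state set corresponds to one lazily discovered DFA state."
  [nfa]
  (let [cache (atom {})]
    (fn [states sym]
      (let [k [states sym]]
        (if-let [hit (find @cache k)]
          (val hit)
          (let [nxt (into #{} (mapcat #(get-in nfa [% sym])) states)]
            (swap! cache assoc k nxt)
            nxt))))))

(defn longest-prefix-length
  "Length of the longest prefix of input accepted by the NFA, or nil if none.
  No backtracking: the input is scanned once, left to right."
  [nfa start accepting input]
  (let [step (make-stepper nfa)]
    (loop [states #{start}
           i      0
           best   (when (accepting start) 0)
           xs     (seq input)]
      (if (or (empty? states) (nil? xs))
        best
        (let [nxt (step states (first xs))
              i'  (inc i)]
          (recur nxt i' (if (some accepting nxt) i' best) (next xs)))))))

(longest-prefix-length example-nfa 0 #{2 5} [1 3 3 7])
```

Because all alternatives advance in lockstep, this style of engine can report the longest accepted prefix (3 elements here) without caring about the order of the branches.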
I'd like a generic "rule engine" supporting regular expression syntax over sequences with generic function token predicates where operators behave in as familiar/standard a way as possible. More importantly, I'd like to be able to define a collection of rules and have the engine match the leftmost longest among them, because otherwise maintenance of a rule set is complicated by a dependence on the order in which the rules are defined. I thought the collection could be encoded simply as an or ('|') of "top-level" rules whose matched content could be identified by named groups ('as'). Based on this description, longest-match (to support a maintainable rule collection) is clearly more important than shortest-match (so reluctant operator behavior is familiar/standard).
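One order-independent way to get "longest among the rules" is to run every top-level rule separately against the same input and keep the longest result. The sketch below uses plain `java.util.regex` patterns on strings as a stand-in for sequence expressions, so it only illustrates the selection logic; `longest-match` and the rule names are hypothetical and not part of any library.

```clojure
(defn longest-match
  "rules is a sequence of [rule-name pattern] pairs. Each pattern is matched
  anchored at the start of s (via Matcher.lookingAt); the longest match wins.
  Ties go to the rule listed first. Returns nil when no rule matches."
  [rules s]
  (reduce
    (fn [best [rule-name re]]
      (let [m (re-matcher re s)]
        (if (.lookingAt m)
          (let [cand {:rule rule-name :match (.group m)}]
            (if (or (nil? best)
                    (> (count (:match cand)) (count (:match best))))
              cand
              best))
          best)))
    nil
    rules))

;; Same winner regardless of rule order.
(longest-match [[:shorter #"1."] [:longer #"133"]] "133779")
(longest-match [[:longer #"133"] [:shorter #"1."]] "133779")
```

Both calls select `:longer` with the match `"133"`. The cost is one pass over the input per rule, which is the trade-off against a single combined alternation.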
About familiarity, to recap: strict-exec behaves like re-find (except the match is anchored at the start) which implies that it doesn't always yield the longest match.
=> (strict-exec (* 1) [1 1 1 1 2])
{:match (1 1 1 1), :rest nil}
=> (re-find #"1*" "11112")
"1111"
=> (strict-exec (*? 1) [1 1 1 1 2])
{:match (), :rest nil}
=> (re-find #"1*?" "11112")
""
However if you go for the standard
=> (exec (cat (as :1+? (+? 1)) (as :1+ (+ 1))) [1 1 1 1 2])
{:1+ (1 1 1), :1+? (1), :match (1 1 1 1), :rest (2)}
=> (strict-exec (cat (as :1+? (+? 1)) (as :1+ (+ 1))) [1 1 1 1 2])
{:1+ (1 1 1), :1+? (1), :match (1 1 1 1), :rest (2)}
=> (re-find #"(1+?)(1+)" "11112")
["1111" "1" "111"]
; it does not depend on their relative ordering: they just have to compete
=> (exec (cat (as :1+ (+ 1)) (as :1+? (+? 1))) [1 1 1 1 2])
{:1+? (1), :1+ (1 1 1), :match (1 1 1 1), :rest (2)}
=> (strict-exec (cat (as :1+ (+ 1)) (as :1+? (+? 1))) [1 1 1 1 2])
{:1+? (1), :1+ (1 1 1), :match (1 1 1 1), :rest (2)}
=> (re-find #"(1+)(1+?)" "11112")
["1111" "111" "1"]
By the way,
(defn strict-exec [re coll & opts]
  ;; Capture re and the remaining input under fresh group names,
  ;; then expose them as :match and :rest.
  (let [k (gensym :match)
        r (gensym :rest)]
    (when-some [m (apply exec (cat (as k re) (as r (* _))) coll opts)]
      (-> m (dissoc k r)
          (assoc :match (get m k) :rest (get m r))))))
I didn't mean relative rule ordering would cause problems with the reluctant operators, I meant I think encoding an evolving collection of "top-level" rules with a root alternation without leftmost longest behavior would be problematic to maintain (e.g., reordering rules could easily break things). After more thought, I think for my application, the reluctant operators within non-top-level rules are important and the leftmost longest match among top-level rules is important. Could this combination be implemented efficiently with an additional "leftmost longest match alternation operator"?
(disclaimer: caffeine hasn't kicked in yet ;-)) For example "leftmost longest" usually means "the longest match amongst the leftmost matches" (and it implies that the leftmost matches all start at the same position, otherwise they wouldn't be all leftmost).
I only mean leftmost longest in the standard among-the-matches sense. Here's an example of what I'm thinking of with a fictitious operator
(longest
  (as :shorter (cat (+? 1)))
  (as :longer (cat (as :chunk1 (+? 1)) (as :skipped (*? _)) (as :chunk2 (+ 2)))))
that given {:longer [1 1 1 3 3 2 2] :chunk1 [1] :skipped [1 1 3 3] :chunk2 [2 2]}
I realize I could just run each argument "top level rule" of
This is behaving to the defined spec ("Greediness affects only submatches and never changes the whole match..."), but it doesn't match the "standard" behavior:
Reluctant and non-reluctant return the same thing.
Maybe there is a small modification to support "traditional" reluctant quantifiers?