Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Greedy and Reluctant Quantifiers should not produce same match #1

Open
nezda opened this issue Aug 27, 2014 · 11 comments
Open

Greedy and Reluctant Quantifiers should not produce same match #1

nezda opened this issue Aug 27, 2014 · 11 comments

Comments

@nezda
Copy link

nezda commented Aug 27, 2014

This is behaving to the defined spec ("Greediness affects only submatches and never changes the whole match..."), but it doesn't match the "standard" behavior:

(= (re-find #".+" "AAA") "AAA") ;; same in seqexp
(= (re-find #".+?" "AAA") "A") ;; seqexp _+? would say "AAA"

Reluctant and non-reluctant return the same thing.

(se/exec (se/cat (se/+ 3)) [3 3 3 3]) ;; -> {:match (3 3 3 3), :rest nil}
(se/exec (se/cat (se/+? 3)) [3 3 3 3]) ;; -> {:match (3 3 3 3), :rest nil}
;; expected/wanted -> {:match (3), :rest (3 3 3)}

Maybe there is a small modification to support "traditional" reluctant quantifiers?

@cgrand
Copy link
Owner

cgrand commented Aug 27, 2014

Hi Luke,

You're right on this behavior. It left me unsatisfied but it was the most
expeditive way to have longest match and still be as lazy as possible.
Maybe I should rename the current exec to lazy-eager-exec and have a
non-lazy exec that behaves in a least surprising manner.

(defn strict-exec [re coll & opts]
  (let [k (gensym :match)]
    (when-some [m (apply exec (cat (as k re) (* _)) coll opts)]
      (-> m (dissoc m k) (assoc :match (get m k))))))

I think that doing that in a lazy fashion is a more involved change

@cgrand cgrand closed this as completed Aug 27, 2014
@cgrand cgrand reopened this Aug 27, 2014
@nezda
Copy link
Author

nezda commented Aug 27, 2014

That would be great if you added strict-exec (maybe eager-exec?) with the expected behavior to a released version of this package. I find the name lazy-eager-exec confusing because 'lazy' and 'eager' are opposites (though I understand they refer to 2 different things in this context, the way the rule treats its input and the nature of the match it produces).

@nezda
Copy link
Author

nezda commented Aug 29, 2014

I didn't realize you'd just given me a wrapper function around your library :) (new to Clojure) Upon doing that though, I find that it seems to break longest-match semantics that it had; this test now fails:

(deftest simple
  (are [se s m]
    (= (:match (strict-exec se s)) m)
    (se/| (se/cat 1 se/_) (se/cat 1 3 3)) [1 3 3 7 7 9] [1 3 3] ; longest match
    ))

Easy fix for that?

@cgrand
Copy link
Owner

cgrand commented Aug 29, 2014

It's normal: (re-find #"1.|133" "133337") yields "13" too.

Le vendredi 29 août 2014, Luke Nezda [email protected] a écrit :

I didn't realize you'd just given me a wrapper function around your
library :) (new to Clojure) Upon doing that though, I find that it seems to
break longest-match semantics that it had; this test now fails:

(deftest simple
(are [se s m](= %28:match %28strict-exec se s%29%29 m)
(se/| (se/cat 1 se/_) (se/cat 1 3 3)) [1 3 3 7 7 9] [1 3 3] ; longest match
))

Easy fix for that?


Reply to this email directly or view it on GitHub
#1 (comment).

On Clojure http://clj-me.cgrand.net/
Clojure Programming http://clojurebook.com
Training, Consulting & Contracting http://lambdanext.eu/

@nezda
Copy link
Author

nezda commented Aug 29, 2014

I had immediately done a similar test leading to the same conclusion, and even found http://stackoverflow.com/a/4515435/689119 explaining this is how NFA engines work, and already thought this was an NFA engine based on the references you provided, but hoped since the following passed, I'd misunderstood 😄 -- bummer, I guess I need a DFA engine since I need longest match semantics

(deftest simple
  (are [se s m]
    (= (:match (exec se s)) m)
    (se/| (se/cat 1 se/_) (se/cat 1 3 3)) [1 3 3 7 7 9] [1 3 3] ; longest match
    ))

@cgrand
Copy link
Owner

cgrand commented Sep 1, 2014

The DFA/NFA discussion is slightly wrong. Most NFA engines are backtracking engines. Mine is not. In some way, the algorithm I use builds the DFA on demand.
The problem is that you want longest-match except when the regex is reluctant (and the fact that a regex is reluctant is ill defined: what if one branch is reluctant and the other eager – plus this notion is lost when the regex is transformed into a [ND]FA).
Could we discuss why you need this behavior (longest or shortest match depending on the regex itself) and see how to deal with that?

@nezda
Copy link
Author

nezda commented Sep 1, 2014

I'd like a generic "rule engine" supporting regular expression syntax over sequences with generic function token predicates where operators behave in as familiar/standard a way as possible. More importantly, I'd like to be able to define a collection of rules and have the engine match the leftmost longest among them because otherwise maintenance of a rule set is complicated by a dependence on the order the rules are defined. I thought the collection could be encoded simply as an or ('|') of "top-level" rules whose matched content could be identified by named groups ('as'). Based on this description, clearly longest-match is more important (to support a maintainable rule collection) then shortest-match (so reluctant operator behavior is familiar/standard).

@cgrand
Copy link
Owner

cgrand commented Sep 1, 2014

About familiarity, to recap: strict-exec behaves like re-find (except the match is anchored at the start) which implies that it doesn't always yield the longest match.

=> (strict-exec (* 1) [1 1 1 1 2])
{:match (1 1 1 1), :rest nil}
=> (re-find #"1*" "11112")
"1111"
=> (strict-exec (*? 1) [1 1 1 1 2])
{:match (), :rest nil}
=> (re-find #"1*?" "11112")
""

However if you go for the standard exec, inner reluctant quantifiers (that is, those battling with greedy ones) still behaves as expected. Only unrestricted reluctant quantifiers go wild!

=> (exec (cat (as :1+? (+? 1)) (as :1+ (+ 1))) [1 1 1 1 2])
{:1+ (1 1 1), :1+? (1), :match (1 1 1 1), :rest (2)}
=> (strict-exec (cat (as :1+? (+? 1)) (as :1+ (+ 1))) [1 1 1 1 2])
{:1+ (1 1 1), :1+? (1), :match (1 1 1 1), :rest (2)}
=> (re-find #"(1+?)(1+)" "11112")
["1111" "1" "111"]
; it does not  depend on their relative ordering: they just have to compete
=> (exec (cat (as :1+ (+ 1)) (as :1+? (+? 1))) [1 1 1 1 2])
{:1+? (1), :1+ (1 1 1), :match (1 1 1 1), :rest (2)}
=> (strict-exec (cat (as :1+ (+ 1)) (as :1+? (+? 1))) [1 1 1 1 2])
{:1+? (1), :1+ (1 1 1), :match (1 1 1 1), :rest (2)}
=> (re-find #"(1+)(1+?)" "11112")
["1111" "111" "1"]

By the way, strict-exec I gave you earlier always reported nil for :rest, here is a more correct version:

(defn strict-exec [re coll & opts]
  (let [k (gensym :match)
        r (gensym :rest)]
    (when-some [m (apply exec (cat (as k re) (as r (* _))) coll opts)]
      (-> m (dissoc m k r) 
        (assoc :match (get m k) :rest (get m r))))))

@nezda
Copy link
Author

nezda commented Sep 3, 2014

I didn't mean relative rule ordering would cause problems with the reluctant operators, I meant I think encoding an evolving collection of "top-level" rules with a root alternation without leftmost longest behavior would be problematic to maintain (e.g., reordering rules could easily break things).

After more thought, I think for my application, the reluctant operators within non-top-level rules are important and the leftmost longest match among top-level rules is important. Could this combination be implemented efficiently with an additional "leftmost longest match alternation operator"?

@cgrand
Copy link
Owner

cgrand commented Sep 3, 2014

(disclaimer: caffeine hasn't kicked in yet ;-))
I think I need examples (inputs & expected output) to really get the behavior you're after.

For example "leftmost longest" usually means "the longest match amongst the leftmost matches" (and it implies that the leftmost matches all start at the same position, otherwise they wouldn't be all leftmost).
However reading you I'm starting to wonder whether you mean "leftmost" as in in "leftmost alternation in the pattern".

exec yields longest prefix match (if any) — so does adding a find function yielding the leftmost (instead of the prefix) longest match be ok? Or am I missing a case?

@nezda
Copy link
Author

nezda commented Sep 4, 2014

I am only meaning leftmost longest in the standard among-the-matches sense

Here's an example of what I'm thinking of with a fictitious operator longest:

(longest
   (as :shorter (cat (+? 1)))
   (as :longer  (cat (as :chunk1 (+? 1)) (as :skipped (_ *?)) (as :chunk2 (+ 2)))))

that given [1 1 1 3 3 2 2] would return

{:longer [1 1 1 3 3 2  2] :chunk1 [1] :skipped [1 1 3 3] chunk2: [2 2]}

I realize I could just run each argument "top level rule" of longest in turn with strict-exec and retain the one with the longest :match but I assume running them together as a single FSA will be more efficient.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants