Support split by eggex in Str.split #2051
pyext/libc.c
Outdated
@@ -264,8 +264,8 @@ func_regex_first_group_match(PyObject *self, PyObject *args) {
   }

   // Assume there is a match
-  regoff_t start = m[1].rm_so;
-  regoff_t end = m[1].rm_eo;
+  regoff_t start = m[0].rm_so;  // Why was this 1 before?
Any idea why this was m[1] before? I ran into an issue where m would always equal (-1, -1) because m[1] was always unset (default is (-1, -1), ref: "Any unused structure elements will contain the value -1.")
Hmm this function was designed to be passed a regex in (), and we do it here:
regex = '(%s)' % self.regex # make it a group
https://github.com/oils-for-unix/oils/blob/master/osh/string_ops.py#L495
However, it is unnecessary there! I guess that's why no tests failed!
I don't remember exactly why it was like that -- maybe there was another usage that we've since deleted, which requires ()
Or maybe I didn't realize you can get the positions of the entire match
Hmm
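For comparison, here is a small Python sketch of the group 0 vs. group 1 distinction discussed above. It uses Python's `re` module purely as an analogy for POSIX `regmatch_t`; it is not the code path in `libc.c`, and the sample string and patterns are made up for illustration.

```python
import re

text = "pre-X-post"

# Group 0 is always the entire match; no parentheses are needed.
m = re.search(r"-.-", text)
whole_span = m.span(0)

# Wrapping the pattern in () makes group 1 coincide with group 0.
wrapped = re.search(r"(-.-)", text)
wrapped_span = wrapped.span(1)

# A group that took no part in the match reports (-1, -1),
# much like an unused regmatch_t element.
optional = re.search(r"(z)?-.-", text)
unset_span = optional.span(1)

print(whole_span, wrapped_span, unset_span)
```

If the analogy holds, reading index 0 gives the whole-match positions directly, so the extra `()` wrapper is only needed when the caller really wants the first group.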
This is ready for a review. I'd like to get some feedback on the design before I write any docs.
Hmm yeah these empty string cases are very subtle, and probably why I put this off for SO long! The POSIX shSplit() is not exactly simple either -- splitting is hard. (Also the algorithm may still have some bugs!) But thanks for looking deeply into it, let me take a look
This looks very nice, thanks for doing this!
The rationale seems pretty well explained, let me think/poke a bit though ...
Also I'm not sure if we should change regex_first_group_match, or perhaps create another function regex_group0_match, or perhaps wrap the regex in () --- this is mostly an internal cosmetic issue, that doesn't affect the spec
builtin/method_str.py
Outdated
if eggex_sep is not None:
    if '\0' in string:
        raise error.Structured(3, "cannot split a string with a nul-byte",
OK this is an interesting case ... I would spell it "NUL byte" or "\0 byte"
I think that is what it is called in ASCII
https://www.cs.cmu.edu/~pattis/15-1XX/common/handouts/ascii.html
cpp/libc.cc
Outdated
-  regoff_t start = m[1].rm_so;
-  regoff_t end = m[1].rm_eo;
+  regoff_t start = m[0].rm_so;
+  regoff_t end = m[0].rm_eo;
Hm does this not break anything? Because pyext/libc.c also has the same behavior
It was designed to find the first thing in (), like pre(.)post, not the whole match
builtin/method_str.py
Outdated
anchor = end
cursor = end

# If we found a zero-width match, we need to "bump" our cursor
OK interesting semantic, let me think about this, maybe run the cases
I like that the function is pretty short ...
python 3.6 deprecated a behavior?
python 3.11 didn't follow through?
Weird! OK I got nerd sniped by this
More digging https://bugs.python.org/issue43222
I want to avoid semantics that we'll have to change later ... I agree JS semantics are hard to document, which means hard to use
Is it possible for us to throw an error on any split expression that matches an empty string? Is that better / more predictable?
On the other hand, I suppose if we are confident that the current semantic is stable and will never change -- what if we change regex engines? This actually can be a real issue ... there are significant differences between libc glob() on say OS X and Linux with respect to character classes ... But I am not sure if that analogy applies here
I suppose if 2 libc's disagree about which regexes match empty string, then that is ORTHOGONAL to our behavior ... we would have a difference either way. But I guess it might not be a silent difference
I guess you have looked into this issue more than me, at this point
We are supposed to be writing an "executable spec" -- so I guess the litmus test is if someone is implementing YSH split() in Rust or Zig, without libc regex, without our exact toolchain -- which semantic makes sense?
It's certainly easier to implement without the UTF-8 decoding, which might lean toward an error, but we already have UTF-8 decoding
More info - https://pypi.org/project/regex/
can we test if YSH Do we get an infinite loop in This does seem quite fiddly -- maybe we just throw an error in both cases if the match happens to be zero length?
OK actually I tested it, it seems like we enter an infinite loop with replace() on a zero-width match
I would lean toward making both cases an ERROR, for simplicity. Well, I guess we can test what Python replace() does too. But this follows the "one way door" vs. "two way door" philosophy
But given that Python wasn't defined/good up until Python 3.7, it seems safe to make it disallowed ...
Python doesn't enter an infinite loop in
But yeah I don't find this behavior interesting/useful -- I think it's better to disallow it, i.e. if there is a match, but the match position is the same, then error in both replace() and split() -- does that make sense?
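As a point of reference for the Python side of this comparison, the `re` documentation gives the following substitution example. The snippet below is only a sketch of Python 3.7+ behaviour; it says nothing about what YSH `replace()` does.

```python
import re

# On Python 3.7+, empty matches are replaced too, including one that is
# adjacent to a previous non-empty match. The call terminates because the
# engine advances past each empty match instead of retrying the same spot.
result = re.sub(r"x*", "-", "abxd")
print(result)  # '-a-b--d-' per the Python 3.7+ documentation
```

The output is well defined but arguably not very useful, which is the point being made in this thread.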
Another possibility is to return both the replace() or split(), so it returns a value -- i.e. do a partial splitting or replace. But neither Python nor JS do that, so we probably shouldn't
Oh, cool! Thanks for looking into this further :) I did not know that Python used to error here.
Agreed on the behavior not being valuable. But re: raising an error, can we do so before we even run the split/replace algorithm? I.e. is it sufficient to error if the regex accepts the empty string? The main reason why I ask is that I'd like to keep the error cases simple to explain and faster to be detected. If we only error when we encounter a zero-width match, would
Yeah I can see what you mean
We could be conservative and try to match However that may be slow -- it's a little weird to do it every time. We could try to cache that calculation, assuming the regex is constant. Though that feels complex
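A rough Python sketch of the up-front check being discussed: test once whether the pattern accepts the empty string, and cache the answer per pattern. This assumes Python's `re` in place of libc regex, and `accepts_empty` is a hypothetical helper name, not something in the Oils codebase.

```python
import functools
import re


@functools.lru_cache(maxsize=128)
def accepts_empty(pattern: str) -> bool:
    """Return True if `pattern` matches the empty string.

    The result is cached per pattern string, so a constant separator
    is only checked (and compiled by this helper) once.
    """
    return re.fullmatch(pattern, "") is not None


print(accepts_empty(r"a*"), accepts_empty(r"a+"))
```

One caveat with this kind of check: in engines that support zero-width assertions such as word boundaries, a pattern can reject the empty string and still produce zero-width matches inside a longer input, so the check alone may not rule out every zero-width match.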
BTW I think we should write the
Oh I see you added the rule for matching against That does seem pretty principled
I am also wondering about the rule of just stopping when there is an infinite loop. And then the docs will call this out as a special case that should be avoided
I guess our usual rule is to follow Python and JS. In this case I don't think they have very good guidance
strip() and replace() should be consistent
So I think the choices are:
I think (1) is the most principled but I worry about performance, especially because we have an existing perf issue where we repeatedly compile the regex EVERY time you call
Melvin had a regex cache PR to solve this problem, but the performance was a bit complex, and it hasn't landed yet
(2) seems OK -- it is well defined
(3) also seems OK but I get the point that it may cause "surprise" errors?? Though IMO a runtime error is probably better than doing something wrong/weird
(4) is a little weird as mentioned
The "split the difference" choice might be to make 2 the default, and opt into 3, or vice versa ... maybe overthinking it though
fwiw the regex cache PR was #1770
the issue is that with mycpp it's hard to pass a compiled So we just pass a string and compile it every time, which is a little lame
The cache is a good idea, but it's not 100% straightforward to implement
I've been thinking about this a bit and I still feel that (1) is the best choice for semantics. But yeah, the performance is a good point. And it's not like this is an unsafe API. Unlike e.g. array bounds checking, the impact of its misuse is at worst a runtime error.
Additionally, triggering that error case should be pretty easy with testing. You have to have some very specific forms of inputs to miss the infinite loop case.
Now (2) vs. (3), I agree that a runtime error is best here. If, somehow, someone finds that the behaviour like (2) or (4) or something else is actually desirable, then we leave ourselves the option of implementing it. It's a one-way door situation again.
So I'll go ahead and implement (3) then.
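To make the runtime-error behaviour concrete, here is a minimal Python sketch of a split loop that raises as soon as it meets a zero-width match. It is an illustration under assumptions, not the code in this PR: it uses Python's `re` rather than libc regex, and `split_strict` and the `ValueError` are hypothetical stand-ins for the real method and `error.Structured`.

```python
import re
from typing import List


def split_strict(pattern: str, string: str) -> List[str]:
    """Split `string` on `pattern`, raising on any zero-width match."""
    regex = re.compile(pattern)
    parts = []
    cursor = 0  # where the next search starts, and where the next part begins
    while True:
        m = regex.search(string, cursor)
        if m is None:
            break
        start, end = m.span(0)  # positions of the whole match
        if start == end:
            # A zero-width match would never advance the cursor.
            raise ValueError("separators should never match the empty string")
        parts.append(string[cursor:start])
        cursor = end
    parts.append(string[cursor:])  # the remainder after the last separator
    return parts


print(split_strict(r",+", "a,,b,c"))
```

With this shape, a separator such as `x*` fails on the first input where it matches nothing, which is the kind of case that ordinary testing is likely to hit early.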
Ah OK cool, yeah it was bubbling in the background for me too ... there's no perfect solution, but I'm happy with (3) :)
This is now ready for another review. Thanks for the help on the design :) It was a little trickier than it seemed. After this PR I'll do something similar for
This looks great, thank you! just minor comments
This is ready for another review.
Looks great, thank you!
if start == end:
    raise error.Structured(
        3,
        "eggex separators should never match the empty string",
This error reminds me that it would be nice if it were possible to take a STRING regex, in ERE syntax
The ~ operator in Oils can do that, e.g. if ('mystr' ~ 'myregex+')
But the signature doesn't really allow that
But I'm OK just leaving this out for now ... I think it is possible to just write your own ERE string split in pure YSH
It would be nice to handle this, but we can put it off until someone asks I think ...
(I want to get through Hay/Modules/pure functions before going too deep, and we definitely need this split(), and split() on whitespace)
TODO:
One area of inconsistency between nodejs and python is when the splitter regex accepts the empty string as a match. I also just find the behaviour confusing in general.
This is python's behaviour as documented at https://docs.python.org/3/library/re.html#re.split:
Meanwhile for JS https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/Symbol.split#description:
(lastIndex is a property referring to where the regex will start its next execution: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/lastIndex. It's like what I call cursor in this PR.)
The differences are subtle, but manifest in cases like:
For YSH I decided to go with Python's behaviour since it was both a) simpler and b) better fit our regex semantics. The JS design heavily relies on a "sticky mode" flag which, AFAIK, we don't have.
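For readers who want to see the Python behaviour referred to in this description, the two calls below are taken from the `re.split` documentation. They reflect Python 3.7 and later and are shown only as a reference point, not as the YSH semantics.

```python
import re

# Empty matches split the string, but only when they are not adjacent
# to a previous empty match (Python 3.7+).
word_boundaries = re.split(r"\b", "Words, words, words.")
non_word_runs = re.split(r"\W*", "...words...")

print(word_boundaries)  # ['', 'Words', ', ', 'words', ', ', 'words', '.']
print(non_word_runs)    # ['', '', 'w', 'o', 'r', 'd', 's', '', '']
```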