Support split by eggex in Str.split #2051

Merged: 17 commits from str-split-by-eggex into soil-staging on Aug 31, 2024

Conversation

@PossiblyAShrub (Collaborator) commented Aug 21, 2024

TODO:

  • Docs

One area of inconsistency between Node.js and Python is when the splitter regex accepts the empty string as a match. I also just find the behaviour confusing in general.

This is Python's behaviour as documented at https://docs.python.org/3/library/re.html#re.split:

Empty matches for the pattern split the string only when not adjacent to a previous empty match.

>>> re.split(r'\b', 'Words, words, words.')
['', 'Words', ', ', 'words', ', ', 'words', '.']

>>> re.split(r'\W*', '...words...')
['', '', 'w', 'o', 'r', 'd', 's', '', '']

>>> re.split(r'(\W*)', '...words...')
['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']

Meanwhile for JS https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/Symbol.split#description:

If the current match is an empty string, or if the regexp doesn't match at the current position (since it's sticky), the lastIndex would still be advanced — if the regex is Unicode-aware, it would advance by one Unicode code point; otherwise, it advances by one UTF-16 code unit.

(lastIndex is a property referring to where the regex will start its next execution: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/lastIndex. It's like what I call cursor in this PR.)

The differences are subtle, but manifest in cases like:

$ node -e 'console.log("a b  cd".split(/\s*/))'
[ 'a', 'b', 'c', 'd' ]
$ python3 -c 'import re; print(re.split("\\s*", "a b  cd"))'
['', 'a', '', 'b', '', 'c', 'd', '']

$ node -e 'console.log("aa".split(/.*/))'      
[ '', '' ]
$ python3 -c 'import re; print(re.split(".*", "aa"))'      
['', '', '']

For YSH I decided to go with Python's behaviour since it was both a) simpler and b) a better fit for our regex semantics. The JS design heavily relies on a "sticky mode" flag which, AFAIK, we don't have.

pyext/libc.c Outdated
@@ -264,8 +264,8 @@ func_regex_first_group_match(PyObject *self, PyObject *args) {
}

// Assume there is a match
-  regoff_t start = m[1].rm_so;
-  regoff_t end = m[1].rm_eo;
+  regoff_t start = m[0].rm_so; // Why was this 1 before?
@PossiblyAShrub (Collaborator, Author) commented:

Any idea why this was m[1] before? I ran into an issue where m would always equal (-1, -1) because m[1] was always unset (default is (-1, -1), ref: "Any unused structure elements will contain the value -1.")

@andychu (Contributor) replied:

Hmm this function was designed to be passed a regex in (), and we do it here:

    regex = '(%s)' % self.regex  # make it a group

https://github.com/oils-for-unix/oils/blob/master/osh/string_ops.py#L495


However, it is unnecessary there! I guess that's why no tests failed!

I don't remember exactly why it was like that -- maybe there was another usage that we've since deleted, which requires ()

Or maybe I didn't realize you can get the positions of the entire match

Hmm

@PossiblyAShrub (Collaborator, Author) commented:

This is ready for a review. I'd like to get some feedback on the design before I write any docs.

@andychu (Contributor) commented Aug 23, 2024

Hmm yeah these empty string cases are very subtle, and probably why I put this off for SO long!

The POSIX shSplit() is not exactly simple either -- splitting is hard. (Also the algorithm may still have some bugs!)

But thanks for looking deeply into it, let me take a look

@andychu (Contributor) left a review comment:

This looks very nice, thanks for doing this!

The rationale seems pretty well explained, let me think/poke a bit though ...

Also I'm not sure if we should change regex_first_group_match, or perhaps create another function regex_group0_match, or perhaps wrap the regex in () --- this is mostly an internal cosmetic issue, that doesn't affect the spec


    if eggex_sep is not None:
        if '\0' in string:
            raise error.Structured(3, "cannot split a string with a nul-byte",
@andychu (Contributor) commented:

OK this is an interesting case ... I would spell it "NUL byte" or "\0 byte"

I think that is what it is called in ASCII

https://www.cs.cmu.edu/~pattis/15-1XX/common/handouts/ascii.html

cpp/libc.cc Outdated
-  regoff_t start = m[1].rm_so;
-  regoff_t end = m[1].rm_eo;
+  regoff_t start = m[0].rm_so;
+  regoff_t end = m[0].rm_eo;
@andychu (Contributor) commented:

Hm does this not break anything? Because pyext/libc.c also has the same behavior

It was designed to find the first thing in (), like pre(.)post, not the whole match


anchor = end
cursor = end

# If we found a zero-width match, we need to "bump" our cursor
@andychu (Contributor) commented:

OK interesting semantic, let me think about this, maybe run the cases

I like that the function is pretty short ...
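
For reference, here is a minimal Python sketch of the cursor-bumping semantic under discussion -- not the PR's actual code; the names (split_with_bump, anchor, cursor) and the use of Python's re module are purely illustrative. It reproduces the Python examples quoted above, though it isn't claimed to match re.split in every corner case.

    import re

    def split_with_bump(pattern, s):
        # Illustrative only: split `s` on `pattern`, bumping the cursor past
        # zero-width matches so the scan always makes progress.
        regex = re.compile(pattern)
        parts = []
        anchor = 0   # start of the next piece to emit
        cursor = 0   # where the next match attempt begins
        while cursor <= len(s):
            m = regex.search(s, cursor)
            if not m:
                break
            start, end = m.span()
            parts.append(s[anchor:start])
            anchor = end
            cursor = end
            if start == end:
                cursor += 1  # zero-width match: bump so we don't loop forever
        parts.append(s[anchor:])
        return parts

    print(split_with_bump(r'\s*', 'a b  cd'))  # ['', 'a', '', 'b', '', 'c', 'd', '']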

@andychu (Contributor) commented Aug 23, 2024

Python 3.6 deprecated a behavior?

$ python3 -c 'import re; print(re.split("\\s*", "a b  cd"))'
/usr/lib/python3.6/re.py:212: FutureWarning: split() requires a non-empty pattern match.
  return _compile(pattern, flags).split(string, maxsplit)
['a', 'b', 'cd']

Python 3.11 didn't follow through?

$ python3 -V
Python 3.11.2

$ python3 -c 'import re; print(re.split("\\s*", "a b  cd"))'
['', 'a', '', 'b', '', 'c', 'd', '']

Weird! OK I got nerd sniped by this

@andychu (Contributor) commented Aug 23, 2024

More digging

https://bugs.python.org/issue43222

There was a bug in the regular expression engine which caused re.split() working incorrectly with zero-width patterns. Note that in your example _DIGIT_BOUNDARY_RE.split("10.0.0") returns ['10.0.0'] on Python 2.7 -- the result which you unlikely expected.

It was impossible to fix that bug without changing behavior of other functions in corner cases and breaking existing code. So we first made re.split() raising an exception instead of returning nonsensical result and added warnings for some other cases to help users to catch potential bugs in their code and avoid ambiguous patterns. You see this in 3.6. In 3.7 we fixed the underlying bug. It caused breakage of some user code, but it made regular expressions more consistent in long perspective and made zero-width patterns more usable.

I want to avoid semantics that we'll have to change later ...

I agree JS semantics are hard to document, which means hard to use


Is it possible for us to throw an error on any split expression that matches an empty string?

Is that better / more predictable?

On the other hand, I suppose if we are confident that the current semantic is stable and will never change -- what if we change regex engines?


This actually can be a real issue ... there are significant differences between libc glob() on say OS X and Linux with respect to character classes ...

But I am not sure if that analogy applies here

I suppose if 2 libc's disagree about which regexes match empty string, then that is ORTHOGONAL to our behavior ... we would have a difference either way

But I guess it might not be a silent difference


I guess you have looked into this issue more than me, at this point

We are supposed to be writing an "executable spec" -- so I guess the litmus test is: if someone is implementing YSH split() in Rust or Zig, without libc regex, without our exact toolchain -- which semantic makes sense?

It's certainly easier to implement without the UTF-8 decoding, which might lean toward an error, but we already have UTF-8 decoding

@andychu (Contributor) commented Aug 23, 2024

More info - https://pypi.org/project/regex/

Zero-width matches are not handled correctly in the re module before Python 3.7. The behaviour in those earlier versions is:

.split won’t split a string at a zero-width match.

.sub will advance by one character after a zero-width match.

can we test if YSH replace() and YSH split() are consistent ??

Do we get an infinite loop in replace() when there is a zero-width match?


This does seem quite fiddly -- maybe we just throw an error in both cases if the match happens to be zero length?

@andychu (Contributor) commented Aug 23, 2024

OK actually I tested it, and it seems like we enter an infinite loop with replace() on a zero-width match

ysh andy@hoover:~/git/oilshell/oil$ = 'a b   cd'.replace(/s+/, 'X')
(Str)   'aXbXcd'

ysh andy@hoover:~/git/oilshell/oil$ = 'a b   cd'.replace(/s*/, 'X')

I would lean toward making both cases an ERROR, for simplicity

well I guess we can test what Python replace() does too

But this follows the "one way door" vs. "two way door" philosophy

  • If we define the behavior, it's a one way door. we can never change it
  • If we make it an error, we can define the behavior later if someone WANTS it

But given that Python's behavior wasn't defined/good up until Python 3.7, it seems safe to make it disallowed ...

@andychu (Contributor) commented Aug 23, 2024

Python doesn't enter an infinite loop in re.sub() -- it seems to do something like split() --

>>> re.sub(r'\s+', 'X', 'a b   cd')
'aXbXcd'

>>> re.sub(r'\s*', 'X', 'a b   cd')
'XaXXbXXcXdX'
>>> 

But yeah I don't find this behavior interesting/useful -- I think it's better to disallow it

i.e. if there is a match, but the match position is the same, then error in both replace() and split() -- does that make sense?

@andychu (Contributor) commented Aug 23, 2024

Another possibility is for both replace() and split() to still return a value -- i.e. do a partial split or replace

But neither Python nor JS do that, so we probably shouldn't

@PossiblyAShrub (Collaborator, Author) commented:
Oh, cool! Thanks for looking into this further :) I did not know that Python used to error here.

But yeah I don't find this behavior interesting/useful -- I think it's better to disallow it

i.e. if there is a match, but the match position is the same, then error in both replace() and split() -- does that make sense?

Agreed on the behavior not being valuable. But re: raising an error, can we do so before we even run the split/replace algorithm? I.e. is it sufficient to error if the regex accepts the empty string?

The main reason I ask is that I'd like to keep the error cases simple to explain and faster to detect. If we only error when we encounter a zero-width match, would ' '.split(/ space* /) or ' foo'.split(/ space* /, count=1) "fly under the radar"?
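
One way to picture the up-front check (a sketch only -- Python's re module standing in for the libc regex calls, and the function name is hypothetical): try the separator against the empty string once, before splitting, and reject it if it matches.

    import re

    def check_eggex_separator(pattern):
        # Sketch of the proposed pre-flight rule: if the separator can match
        # the empty string, refuse to split with it at all.
        if re.match(pattern, '') is not None:
            raise ValueError("eggex separators should never match the empty string")

    check_eggex_separator(r'\s+')   # fine
    check_eggex_separator(r'\s*')   # raises, regardless of the input string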

@andychu (Contributor) commented Aug 23, 2024

Yeah I can see what you mean

  • It could be slightly weird if \s* works but ONLY if you don't hit that case at runtime
    • so the answer to "Is splitting by \s* valid?" would be "it depends on what the data is"

We could be conservative and try to match \s* against the empty string first. If it matches, then it's not valid.

However that may be slow -- it's a little weird to do it every time. We could try to cache that calculation, assuming the regex is constant. Though that feels complex

@andychu (Contributor) commented Aug 23, 2024

BTW I think we should write split() with the same libc.regex_search() function as we use in replace()

the regex_first_group_match() can be used for other things ... or arguably it could be deleted, since it seems like a special case of regex_search

@andychu (Contributor) commented Aug 25, 2024

Oh I see you added the rule for matching against '' first

That does seem pretty principled

I am also wondering about the rule of just stopping when there is an infinite loop. And then the docs will call this out as a special case that should be avoided

I guess our usual rule is to follow Python and JS. In this case I don't think they have very good guidance


strip() and replace() should be consistent

So I think the choices are:

  1. check that '' is not matched - we are kinda "taxing" the common case for an uncommon case - that's my main concern
  2. bail early when there is an infinite loop
  3. error when there is an infinite loop
  4. advance by one byte/rune - this behavior is a bit weird to me

I think (1) is the most principled but I worry about performance, especially because we have an existing perf issue where we repeatedly compile the regex EVERY time you call libc.regex_search()

Melvin had a regex cache PR to solve this problem, but the performance was a bit complex, and it hasn't landed yet


(2) seems OK -- it is well defined

(3) also seems OK but I get the point that it may cause "surprise" errors?? Though IMO a runtime error is probably better than doing something wrong/weird

(4) is a little weird as mentioned

The "split the difference" choice might be to make 2 the default, and opt into 3, or vice versa ... maybe overthinking it though

@andychu (Contributor) commented Aug 25, 2024

fwiw the regex cache PR was #1770

the issue is that with mycpp it's hard to pass a compiled regex_t object across the C++ <-> Python boundary

So we just pass a string and compile it every time, which is a little lame

The cache is a good idea, but it's not 100% straightforward to implement
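
For illustration only, the shape of such a cache might be a dict keyed by the pattern string (Python shown; the real cache would presumably hold regcomp()'d regex_t objects on the C++ side, and this is not necessarily how PR #1770 does it):

    import re

    _regex_cache = {}  # pattern string -> compiled regex

    def cached_compile(pattern):
        # Compile each distinct pattern once; later calls reuse the result.
        compiled = _regex_cache.get(pattern)
        if compiled is None:
            compiled = re.compile(pattern)
            _regex_cache[pattern] = compiled
        return compiled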

@PossiblyAShrub (Collaborator, Author) commented:
I've been thinking about this a bit and I still feel that (1) is the best choice for semantics. But yeah, the performance is a good point. And it's not like this is an unsafe API: unlike, e.g., array bounds checking, the impact of its misuse at worst results in a runtime error. Additionally, triggering that error case should be pretty easy with testing. You have to have some very specific forms of inputs to miss the infinite loop case.

Now (2) vs. (3): I agree that a runtime error is best here. If, somehow, someone finds that the behaviour of (2) or (4) or something else is actually desirable, then we leave ourselves the option of implementing it. It's the one-way vs. two-way door situation again.

So I'll go ahead and implement (3) then.
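
In outline, option (3) looks something like the following hedged Python sketch -- not the PR's code (the actual implementation uses libc regex and raises error.Structured; the names here are illustrative): split as usual, but raise as soon as a match fails to advance the position.

    import re

    def split_error_on_empty(pattern, s):
        # Option (3): error out the moment a match is zero-width, instead of
        # bumping the cursor or silently doing something surprising.
        regex = re.compile(pattern)
        parts = []
        pos = 0
        while True:
            m = regex.search(s, pos)
            if not m:
                break
            start, end = m.span()
            if start == end:
                raise RuntimeError("eggex separators should never match the empty string")
            parts.append(s[pos:start])
            pos = end
        parts.append(s[pos:])
        return parts

    print(split_error_on_empty(r'\s+', 'a b  cd'))  # ['a', 'b', 'cd']
    # split_error_on_empty(r'\s*', 'a b  cd') raises instead of looping forever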

@andychu (Contributor) commented Aug 27, 2024

Ah OK cool, yeah it was bubbling in the background for me too ... there's no perfect solution, but I'm happy with (3) :)

@PossiblyAShrub changed the title from "WIP: Support split by eggex in Str.split" to "Support split by eggex in Str.split" on Aug 27, 2024
@PossiblyAShrub (Collaborator, Author) commented:

This is now ready for another review. Thanks for the help on the design :) It was a little trickier than it seemed. After this PR I'll do something similar for replace().

@andychu (Contributor) left a review comment:

This looks great, thank you! just minor comments

(4 resolved review comments on builtin/method_str.py, now outdated)
@PossiblyAShrub (Collaborator, Author) commented:

This is ready for another review.

@andychu (Contributor) left a review comment:

Looks great, thank you!

    if start == end:
        raise error.Structured(
            3,
            "eggex separators should never match the empty string",
@andychu (Contributor) commented:

This error reminds me that it would be nice if it were possible to take a STRING regex, in ERE syntax

The ~ operator in Oils can do that, e.g. if ('mystr' ~ 'myregex+')

But the signature doesn't really allow that


But I'm OK just leaving this out for now ... I think it is possible to just write your own ERE string split in pure YSH

It would be nice to handle this, but we can put it off until someone asks I think ...

(I want to get through Hay/Modules/pure functions before going too deep, and we definitely need this split(), and split() on whitespace)

@andychu changed the base branch from master to soil-staging on August 31, 2024 at 15:42
@andychu merged commit 7da9c25 into soil-staging on Aug 31, 2024
18 checks passed
@PossiblyAShrub deleted the str-split-by-eggex branch on August 31, 2024 at 22:42