Support split by eggex in Str.split #2051

Merged: 17 commits from str-split-by-eggex into soil-staging on Aug 31, 2024

Conversation

@PossiblyAShrub (Collaborator) commented Aug 21, 2024

TODO:

  • Docs

One area of inconsistency between Node.js and Python is when the splitter regex accepts the empty string as a match. I also just find the behaviour confusing in general.

This is Python's behaviour as documented at https://docs.python.org/3/library/re.html#re.split:

Empty matches for the pattern split the string only when not adjacent to a previous empty match.

>>> re.split(r'\b', 'Words, words, words.')
['', 'Words', ', ', 'words', ', ', 'words', '.']

>>> re.split(r'\W*', '...words...')
['', '', 'w', 'o', 'r', 'd', 's', '', '']

>>> re.split(r'(\W*)', '...words...')
['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']

Meanwhile for JS https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/Symbol.split#description:

If the current match is an empty string, or if the regexp doesn't match at the current position (since it's sticky), the lastIndex would still be advanced — if the regex is Unicode-aware, it would advance by one Unicode code point; otherwise, it advances by one UTF-16 code unit.

(lastIndex is a property referring to where the regex will start its next execution: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/lastIndex. It's like what I call cursor in this PR.)

The differences are subtle, but manifest in cases like:

$ node -e 'console.log("a b  cd".split(/\s*/))'
[ 'a', 'b', 'c', 'd' ]
$ python3 -c 'import re; print(re.split("\\s*", "a b  cd"))'
['', 'a', '', 'b', '', 'c', 'd', '']

$ node -e 'console.log("aa".split(/.*/))'      
[ '', '' ]
$ python3 -c 'import re; print(re.split(".*", "aa"))'      
['', '', '']

For YSH I decided to go with Python's behaviour since it was both a) simpler and b) a better fit for our regex semantics. The JS design heavily relies on a "sticky mode" flag which, AFAIK, we don't have.

pyext/libc.c Outdated
@@ -264,8 +264,8 @@ func_regex_first_group_match(PyObject *self, PyObject *args) {
}

// Assume there is a match
-  regoff_t start = m[1].rm_so;
-  regoff_t end = m[1].rm_eo;
+  regoff_t start = m[0].rm_so; // Why was this 1 before?
@PossiblyAShrub (Collaborator, Author) commented:

Any idea why this was m[1] before? I ran into an issue where m would always equal (-1, -1) because m[1] was always unset (default is (-1, -1), ref: "Any unused structure elements will contain the value -1.")

@andychu (Contributor) replied:

Hmm this function was designed to be passed a regex in (), and we do it here:

    regex = '(%s)' % self.regex  # make it a group

https://github.com/oils-for-unix/oils/blob/master/osh/string_ops.py#L495


However, it is unnecessary there! I guess that's why no tests failed!

I don't remember exactly why it was like that -- maybe there was another usage that we've since deleted, which requires ()

Or maybe I didn't realize you can get the positions of the entire match

Hmm

@PossiblyAShrub (Collaborator, Author) commented:

This is ready for a review. I'd like to get some feedback on the design before I write any docs.

@andychu (Contributor) commented Aug 23, 2024

Hmm yeah these empty string cases are very subtle, and probably why I put this off for SO long!

The POSIX shSplit() is not exactly simple either -- splitting is hard. (Also the algorithm may still have some bugs!)

But thanks for looking deeply into it, let me take a look

@andychu (Contributor) left a review comment:

This looks very nice, thanks for doing this!

The rationale seems pretty well explained, let me think/poke a bit though ...

Also I'm not sure if we should change regex_first_group_match, or perhaps create another function regex_group0_match, or perhaps wrap the regex in () --- this is mostly an internal cosmetic issue, that doesn't affect the spec


    if eggex_sep is not None:
        if '\0' in string:
            raise error.Structured(3, "cannot split a string with a nul-byte",
@andychu (Contributor) commented:

OK this is an interesting case ... I would spell it "NUL byte" or "\0 byte"

I think that is what it is called in ASCII

https://www.cs.cmu.edu/~pattis/15-1XX/common/handouts/ascii.html

cpp/libc.cc Outdated
-  regoff_t start = m[1].rm_so;
-  regoff_t end = m[1].rm_eo;
+  regoff_t start = m[0].rm_so;
+  regoff_t end = m[0].rm_eo;
@andychu (Contributor) commented:

Hm does this not break anything? Because pyext/libc.c also has the same behavior

It was designed to find the first thing in (), like pre(.)post, not the whole match


anchor = end
cursor = end

# If we found a zero-width match, we need to "bump" our cursor
@andychu (Contributor) commented:

OK interesting semantic, let me think about this, maybe run the cases

I like that the function is pretty short ...
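
For reference, here is a minimal Python sketch of the cursor-bumping semantic under discussion -- not the PR's actual code; the names (split_with_bump, anchor, cursor) and the use of Python's re module are purely illustrative. It reproduces the Python examples quoted above, though it isn't claimed to match re.split in every corner case.

    import re

    def split_with_bump(pattern, s):
        # Illustrative only: split `s` on `pattern`, bumping the cursor past
        # zero-width matches so the scan always makes progress.
        regex = re.compile(pattern)
        parts = []
        anchor = 0   # start of the next piece to emit
        cursor = 0   # where the next match attempt begins
        while cursor <= len(s):
            m = regex.search(s, cursor)
            if not m:
                break
            start, end = m.span()
            parts.append(s[anchor:start])
            anchor = end
            cursor = end
            if start == end:
                cursor += 1  # zero-width match: bump so we don't loop forever
        parts.append(s[anchor:])
        return parts

    print(split_with_bump(r'\s*', 'a b  cd'))  # ['', 'a', '', 'b', '', 'c', 'd', '']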

@andychu (Contributor) commented Aug 23, 2024

Python 3.6 deprecated a behavior?

$ python3 -c 'import re; print(re.split("\\s*", "a b  cd"))'
/usr/lib/python3.6/re.py:212: FutureWarning: split() requires a non-empty pattern match.
  return _compile(pattern, flags).split(string, maxsplit)
['a', 'b', 'cd']

Python 3.11 didn't follow through?

$ python3 -V
Python 3.11.2

$ python3 -c 'import re; print(re.split("\\s*", "a b  cd"))'
['', 'a', '', 'b', '', 'c', 'd', '']

Weird! OK I got nerd sniped by this

@andychu (Contributor) commented Aug 23, 2024

More digging

https://bugs.python.org/issue43222

There was a bug in the regular expression engine which caused re.split() working incorrectly with zero-width patterns. Note that in your example _DIGIT_BOUNDARY_RE.split("10.0.0") returns ['10.0.0'] on Python 2.7 -- the result which you unlikely expected.

It was impossible to fix that bug without changing behavior of other functions in corner cases and breaking existing code. So we first made re.split() raising an exception instead of returning nonsensical result and added warnings for some other cases to help users to catch potential bugs in their code and avoid ambiguous patterns. You see this in 3.6. In 3.7 we fixed the underlying bug. It caused breakage of some user code, but it made regular expressions more consistent in long perspective and made zero-width patterns more usable.

I want to avoid semantics that we'll have to change later ...

I agree JS semantics are hard to document, which means hard to use


Is it possible for us to throw an error on any split expression that matches an empty string?

Is that better / more predictable?

On the other hand, I suppose if we are confident that the current semantic is stable and will never change -- what if we change regex engines?


This actually can be a real issue ... there are significant differences between libc glob() on say OS X and Linux with respect to character classes ...

But I am not sure if that analogy applies here

I suppose if 2 libc's disagree about which regexes match empty string, then that is ORTHOGONAL to our behavior ... we would have a difference either way

But I guess it might not be a silent difference


I guess you have looked into this issue more than me, at this point

We are supposed to be writing an "executable spec" -- so I guess the litmus test is: if someone is implementing YSH split() in Rust or Zig, without libc regex, without our exact toolchain -- which semantic makes sense?

It's certainly easier to implement without the UTF-8 decoding, which might lean toward an error, but we already have UTF-8 decoding

@andychu (Contributor) commented Aug 23, 2024

More info - https://pypi.org/project/regex/

Zero-width matches are not handled correctly in the re module before Python 3.7. The behaviour in those earlier versions is:

.split won’t split a string at a zero-width match.

.sub will advance by one character after a zero-width match.

can we test if YSH replace() and YSH split() are consistent ??

Do we get an infinite loop in replace() when there is a zero-width match?


This does seem quite fiddly -- maybe we just throw an error in both cases if the match happens to be zero length?

@andychu (Contributor) commented Aug 23, 2024

OK actually I tested it, and it seems like we enter an infinite loop with replace() on a zero-width match

ysh andy@hoover:~/git/oilshell/oil$ = 'a b   cd'.replace(/s+/, 'X')
(Str)   'aXbXcd'

ysh andy@hoover:~/git/oilshell/oil$ = 'a b   cd'.replace(/s*/, 'X')

I would lean toward making both cases an ERROR, for simplicity

well I guess we can test what Python replace() does too

But this follows the "one way door" vs. "two way door" philosophy

  • If we define the behavior, it's a one way door. we can never change it
  • If we make it an error, we can define the behavior later if someone WANTS it

But given that Python's behavior wasn't defined/good up until Python 3.7, it seems safe to make it disallowed ...

@andychu (Contributor) commented Aug 23, 2024

Python doesn't enter an infinite loop in re.sub() -- it seems to do something like split() --

>>> re.sub(r'\s+', 'X', 'a b   cd')
'aXbXcd'

>>> re.sub(r'\s*', 'X', 'a b   cd')
'XaXXbXXcXdX'
>>> 

But yeah I don't find this behavior interesting/useful -- I think it's better to disallow it

i.e. if there is a match, but the match position is the same, then error in both replace() and split() -- does that make sense?

@andychu (Contributor) commented Aug 23, 2024

Another possibility is for both replace() and split() to still return a value -- i.e. do a partial split or replace

But neither Python nor JS do that, so we probably shouldn't

@PossiblyAShrub (Collaborator, Author) commented:
Oh, cool! Thanks for looking into this further :) I did not know that Python used to error here.

But yeah I don't find this behavior interesting/useful -- I think it's better to disallow it

i.e. if there is a match, but the match position is the same, then error in both replace() and split() -- does that make sense?

Agreed on the behavior not being valuable. But re: raising an error, can we do so before we even run the split/replace algorithm? I.e. is it sufficient to error if the regex accepts the empty string?

The main reason I ask is that I'd like to keep the error cases simple to explain and faster to detect. If we only error when we encounter a zero-width match, would ' '.split(/ space* /) or ' foo'.split(/ space* /, count=1) "fly under the radar"?
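
One way to picture the up-front check (a sketch only -- Python's re module standing in for the libc regex calls, and the function name is hypothetical): try the separator against the empty string once, before splitting, and reject it if it matches.

    import re

    def check_eggex_separator(pattern):
        # Sketch of the proposed pre-flight rule: if the separator can match
        # the empty string, refuse to split with it at all.
        if re.match(pattern, '') is not None:
            raise ValueError("eggex separators should never match the empty string")

    check_eggex_separator(r'\s+')   # fine
    check_eggex_separator(r'\s*')   # raises, regardless of the input string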

@andychu (Contributor) commented Aug 23, 2024

Yeah I can see what you mean

  • It could be slightly weird if \s* works but ONLY if you don't hit that case at runtime
    • so the answer to "Is splitting by \s* valid?" would be "it depends on what the data is"

We could be conservative and try to match \s* against the empty string first. If it matches, then it's not valid.

However that may be slow -- it's a little weird to do it every time. We could try to cache that calculation, assuming the regex is constant. Though that feels complex

@andychu (Contributor) commented Aug 23, 2024

BTW I think we should write split() with the same libc.regex_search() function as we use in replace()

the regex_first_group_match() can be used for other things ... or arguably it could be deleted, since it seems like a special case of regex_search

@andychu (Contributor) commented Aug 25, 2024

Oh I see you added the rule for matching against '' first

That does seem pretty principled

I am also wondering about the rule of just stopping when there is an infinite loop. And then the docs will call this out as a special case that should be avoided

I guess our usual rule is to follow Python and JS. In this case I don't think they have very good guidance


strip() and replace() should be consistent

So I think the choices are:

  1. check that '' is not matched - we are kinda "taxing" the common case for an uncommon case - that's my main concern
  2. bail early when there is an infinite loop
  3. error when there is an infinite loop
  4. advance by one byte/rune - this behavior is a bit weird to me

I think (1) is the most principled but I worry about performance, especially because we have an existing perf issue where we repeatedly compile the regex EVERY time you call libc.regex_search()

Melvin had a regex cache PR to solve this problem, but the performance was a bit complex, and it hasn't landed yet


(2) seems OK -- it is well defined

(3) also seems OK but I get the point that it may cause "surprise" errors?? Though IMO a runtime error is probably better than doing something wrong/weird

(4) is a little weird as mentioned

The "split the difference" choice might be to make 2 the default, and opt into 3, or vice versa ... maybe overthinking it though

@andychu (Contributor) commented Aug 25, 2024

fwiw the regex cache PR was #1770

the issue is that with mycpp it's hard to pass a compiled regex_t object across the C++ <-> Python boundary

So we just pass a string and compile it every time, which is a little lame

The cache is a good idea, but it's not 100% straightforward to implement
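
For illustration only, the shape of such a cache might be a dict keyed by the pattern string (Python shown; the real cache would presumably hold regcomp()'d regex_t objects on the C++ side, and this is not necessarily how PR #1770 does it):

    import re

    _regex_cache = {}  # pattern string -> compiled regex

    def cached_compile(pattern):
        # Compile each distinct pattern once; later calls reuse the result.
        compiled = _regex_cache.get(pattern)
        if compiled is None:
            compiled = re.compile(pattern)
            _regex_cache[pattern] = compiled
        return compiled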

@PossiblyAShrub (Collaborator, Author) commented:
I've been thinking about this a bit and I still feel that (1) is the best choice for semantics. But yeah, the performance is a good point. And it's not like this is an unsafe API: unlike, e.g., array bounds checking, the impact of its misuse at worst results in a runtime error. Additionally, triggering that error case should be pretty easy with testing. You have to have some very specific forms of inputs to miss the infinite loop case.

Now (2) vs. (3): I agree that a runtime error is best here. If, somehow, someone finds that the behaviour of (2) or (4) or something else is actually desirable, then we leave ourselves the option of implementing it. It's the one-way vs. two-way door situation again.

So I'll go ahead and implement (3) then.
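
In outline, option (3) looks something like the following hedged Python sketch -- not the PR's code (the actual implementation uses libc regex and raises error.Structured; the names here are illustrative): split as usual, but raise as soon as a match fails to advance the position.

    import re

    def split_error_on_empty(pattern, s):
        # Option (3): error out the moment a match is zero-width, instead of
        # bumping the cursor or silently doing something surprising.
        regex = re.compile(pattern)
        parts = []
        pos = 0
        while True:
            m = regex.search(s, pos)
            if not m:
                break
            start, end = m.span()
            if start == end:
                raise RuntimeError("eggex separators should never match the empty string")
            parts.append(s[pos:start])
            pos = end
        parts.append(s[pos:])
        return parts

    print(split_error_on_empty(r'\s+', 'a b  cd'))  # ['a', 'b', 'cd']
    # split_error_on_empty(r'\s*', 'a b  cd') raises instead of looping forever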

@andychu (Contributor) commented Aug 27, 2024

Ah OK cool, yeah it was bubbling in the background for me too ... there's no perfect solution, but I'm happy with (3) :)

@PossiblyAShrub changed the title from "WIP: Support split by eggex in Str.split" to "Support split by eggex in Str.split" on Aug 27, 2024
@PossiblyAShrub (Collaborator, Author) commented:

This is now ready for another review. Thanks for the help on the design :) It was a little trickier than it seemed. After this PR I'll do something similar for replace().

@andychu (Contributor) left a review comment:

This looks great, thank you! just minor comments

(4 resolved review comments on builtin/method_str.py, now outdated)
@PossiblyAShrub (Collaborator, Author) commented:

This is ready for another review.

@andychu (Contributor) left a review comment:

Looks great, thank you!

    if start == end:
        raise error.Structured(
            3,
            "eggex separators should never match the empty string",
@andychu (Contributor) commented:

This error reminds me that it would be nice if it were possible to take a STRING regex, in ERE syntax

The ~ operator in Oils can do that, e.g. if ('mystr' ~ 'myregex+')

But the signature doesn't really allow that


But I'm OK just leaving this out for now ... I think it is possible to just write your own ERE string split in pure YSH

It would be nice to handle this, but we can put it off until someone asks I think ...

(I want to get through Hay/Modules/pure functions before going too deep, and we definitely need this split(), and split() on whitespace)

@andychu changed the base branch from master to soil-staging on August 31, 2024 at 15:42
@andychu merged commit 7da9c25 into soil-staging on Aug 31, 2024
18 checks passed
@PossiblyAShrub deleted the str-split-by-eggex branch on August 31, 2024 at 22:42