[ysh] Various Str start/end methods #1858

passionsocks · 2024-03-04T20:32:56Z

Consistency
- Naming should be "start" and "end" - startsWith() and endsWith() were the model. - trimLeft() -> trimStart() - trimRight() -> trimEnd() - removePrefix() -> trimStart(prefix_pattern) - removeSuffix() -> trimEnd(suffix_pattern)
- Whitespace characters
  - The set of characters that are considered whitespace has been extended slightly to be consistent with the set used by Javascript/ECMAscript.
- For methods that take a "pattern", that pattern can be a string or an Eggex. This makes the touched methods consistent with the existing Str => replace() and Str => search()` methods.
Now implemented
- Str => endsWith(pattern)
- Str => trimStart(pattern)
- Str => trimEnd(pattern)
- Str => matchStart(pattern)
- Str => matchEnd(pattern)
Updates unit tests and spec tests
Updates docs

passionsocks · 2024-03-04T20:34:00Z

Ready, except for the conflict. I think the CI pipeline will still run in case I missed something.

passionsocks · 2024-03-04T20:37:47Z

Some context for choices made: https://oilshell.zulipchat.com/#narrow/stream/121539-oil-dev/topic/flushing.20out.20YSH.20string.20methods/near/424450020

passionsocks · 2024-03-05T02:51:38Z

The conflict was from a reformat in d3477ac touching the WHITESPACE array (which I moved). I ran devtools/format.sh yapf-changed so it's already reformatted in the new location.

Ready.

andychu · 2024-03-05T16:19:24Z

Thank you, this looks great!

Comments about the "spec" before I look in detail at the code:

I don't think we need matchStart() and matchEnd(), because it's redundant with Eggex. On the other hand, I agree that trimStart() and trimEnd() can take an eggex (rather than inverting the pattern)

In eggex, we already have %start and %end as the names for ^ and $, although the latter are still accepted. (Thankfully they are called start/end, not begin/end or left/right !)

https://www.oilshell.org/release/latest/doc/eggex.html

And we have splicing. So I think users can just use the ~ operator

$ var pat = / dot 'o' /


ysh ysh-0.20.0$ if ('hello' ~ / @pat %end /) { echo 'hi' }
hi

ysh ysh-0.20.0$ if ('hello' ~ / %start @pat /) { echo 'hi' }

We decided to avoid any space characters specific to the Unicode version by default

e.g. see #1836 (comment)

However it's possible we got that wrong somehow.

The idea was to have a keyword argument like unicode_spaces='15.0' to opt into additional spaces

So that the default behavior will never change, and it also relieves us of a maintenance task. TBH I think it may be better to discuss this change in a separate PR

andychu · 2024-03-05T16:24:17Z

And of course you can just write it inline too, like

ysh ysh-0.20.0$ if ('hello' ~ / dot 'o' %end /) { echo 'hi' }
hi

and you can also use ERE

ysh ysh-0.20.0$ if ('hello' ~ '.o$' ) { echo 'hi' }

andychu · 2024-03-05T16:30:37Z

FWIW this was the conversation about the spaces too

Oh also I followed the link in core/shell.py, that doesn't quite agree with the one you added

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Lexical_grammar#white_space

I find things like this highly annoying

Note: Changes to the Unicode standard used by the JavaScript engine may affect programs' behavior. For example, ES2016 upgraded the reference Unicode standard from 5.1 to 8.0.0, which caused U+180E MONGOLIAN VOWEL SEPARATOR to be moved from the "Space_Separator" category to the "Format (Cf)" category, and made it a non-whitespace. Subsequently, the result of "\u180E".trim().length changed from 0 to 1.

basically if the Unicode database changes, then this function changes ...

which means Oils needs to reference a Unicode version technically, ugh :-(

for a very simple function.

I wonder if we an option for ASCII spaces? Or we could even have unicode_version as a param to trim()!

The initial goals with this function are to

make sure its behavior doesn't change over time
not burden ourselves with maintenance, especially if people don't really care about it
- I actually would like to make YSH powerful/fast enough to write such a string space function in YSH
- there's a thread on Zulip about ByteEncoder / RuneEncoder that is sort of about this. Python/JS are kind of weak at writing such string functions because they can be slow, but there might be a faster abstraction we can provide to users

OK cool, one semantic that could make sense is by default strip the 6 chars in the firs 6 rows of the table

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Lexical_grammar#white_space
U+0009 	Character tabulation 	<TAB> 	Horizontal tabulation 	\t
U+000B 	Line tabulation 	<VT> 	Vertical tabulation 	\v
U+000C 	Form feed 	<FF> 	Page breaking control character ([Wikipedia](https://en.wikipedia.org/wiki/Page_break#Form_feed)). 	\f
U+0020 	Space 	<SP> 	Normal space 	
U+00A0 	No-break space 	<NBSP> 	Normal space, but no point at which a line may break 	
U+FEFF
And then only if you pass unicode_version=something, "other unicode space characters" are included. This is an open-ended set that's version dependent

Basically I think the default of those 6 spaces is going to work "almost all the time". Most people will NOT need to pass unicode_version EVER, probably like 99.999% of programs

But then we have both stability, and a path to evolve that matches JavaScript and future Unicode standards

passionsocks · 2024-03-05T17:48:39Z

I don't think we need matchStart() and matchEnd(),

Sure, I will remove. I left them there to demonstrate how trivial it is to switch to returning a Match instead of a bool (and then do further operations like trim via the Match). But I get that it isn't as efficient.

In eggex, we already have %start and %end

You wrote this as a reason we don't need matchStart and matchEnd, but I'm calling it out for myself as I managed missed those with my too-fast look through doc/eggex.md, and wrote the spec tests with ^ and $ (which are deprecated) for some of the startsWith etc tests.

I'll adjust the spec tests for those cases.

We decided to avoid any space characters specific to the Unicode version by default

Ah, I wasn't aware of the prior discussion. I just saw a list of whitespace in the code with a link, checked the link, and realized the table was inconsistent with respect to the list, and corrected something that seemed wrong.

I get the desire to avoid tracking the Unicode standard.

I'll leave an a clear comment (the table, the unit test I added, the spec test I modified, and in doc/ref/chap-type-method.md#Trim), and otherwise revert back to the prior list.

The idea was to have a keyword argument like unicode_spaces='15.0' to opt into additional spaces

Feedback/bikeshedding

As you say, 99% of the time the basic set of spaces is sufficient, then the 1% can use the Eggex argument I add to specify the space characters they want to trim, and that is where you specify the standard

setvar line => trim(/ space ; unicode_spaces /)         # Best effort "latest"
setvar line => trim(/ space ; unicode_spaces='15.0' /)  # Specific

Or maybe you can make space take arguments and do...

setvar line => trim(/ space /)                  # Default/basic
setvar line => trim(/ space(unicode) /)         # Best effort "latest"
setvar line => trim(/ space(unicode='15.0') /)  # Specific

I'm out of time this morning to make those updates, but should be able to by tomorrow.

andychu · 2024-03-06T00:24:49Z

Hmm yeah if we it's possible to just use eggex to strip it quickly, then I might leave out unicode_version= altogether

Eggex has a good syntax for char literals

var pat = / [ \u{ff} \u{100} ] /

There are some unicode / eggex issues due to libc using mutable global variables like LANG=C

So this could be a good opportunity to polish eggex and unicode. Right now there is one failing spec test

http://travis-ci.oilshell.org/github-jobs/6412/cpp-spec.wwz/_tmp/spec/ysh-py/ysh-regex.html

But I bet we could expose some issues with a few more spec tests

Oils is unlike bash in that it's UTF-8 centric

This doc sorta sketches our Unicode strategy - https://github.com/oilshell/oil/blob/master/doc/unicode.md

which is perhaps 40-60% done (?)

passionsocks · 2024-03-06T04:26:30Z

Sure, I will remove them [matchStart and matchEnd]

I'll adjust the spec tests for those cases [to use %start and %end]

I'll leave an a clear comment (the table, the unit test I added, the spec test I modified, and in doc/ref/chap-type-method.md#Trim), and otherwise revert back to the prior list [of whitespace]

Done, ready again.

andychu

Thank you, this is great!

I think the whitespace stripping variants should be the only ones that decode UTF-8, since they must to identify the space characters.

Although since we are adding eggex arguments, which gets translated to ERE, we also need to think about how those behave with respect to Unicode

I pointed out that one failing test in spec/ysh-regex

I wonder if we actually have a flag on the eggex, like

= / %start dot %end ; ; utf8 /   # matches one utf-8 code unit, with LANG=utf8
= / %start dot %end ; ; bytes /   # matches one byte, with LANG=C

And this matters for trimStart() and trimEnd() now

I want to remove the string/bytes ambiguity, which Python has been playing "whac-a-mole" with for about a decade since Python 3 now

We are limited a bit by libc, but I think it is expressive enough to do what we want

doc/ref/toc-ysh.md

doc/ref/chap-type-method.md

builtin/method_str.py

andychu · 2024-03-06T20:03:59Z

builtin/method_str.py

+        try:
+            if pattern_str is not None:
+                if self.anchor & START:
+                    _, _, start = string_ops.StartsWithStrByteRange(


Same comment here -- I think we should skip decoding when trimming a constant prefix or suffix

It can just test for byte equality, and trim if so

osh/string_ops.py

andychu · 2024-03-06T20:13:52Z

I guess for this PR I would add separate tests for how much trim(/ dot /) and trim(/ ![z] /) matches -- does it match a byte or a "rune" ?

And we can mark them failing if necessary, with ## oils_allowed_failures: 1

We don't have to change anything about the code yet, because we have this confusion elsewhere I believe. So we can fix that all at once separately, hopefully

This change adds some more API surface area for eggex, so we should test that part

- Consistency - Naming should be "start" and "end" - startsWith() and endsWith() were the model. - trimLeft() -> trimStart() - trimRight() -> trimEnd() - removePrefix() -> trimStart(prefix_pattern) - removeSuffix() -> trimEnd(suffix_pattern) - Whitespace characters - The set of characters that are considered whitespace has been extended slightly to be consistent with the set used by Javascript/ECMAscript. - For methods that take a "pattern", that pattern can be a string or an Eggex. This makes the touched methods consistent with the existing `Str => replace()` and Str => search()` methods. - Now implemented - `Str => endsWith(pattern)` - `Str => trimStart(pattern)` - `Str => trimEnd(pattern)` - `Str => matchStart(pattern)` - `Str => matchEnd(pattern)` - Updates unit tests and spec tests - Updates docs

passionsocks · 2024-03-08T17:53:18Z

Feedback addressed, and ready for another review..

andychu · 2024-03-09T04:01:50Z

Thank you, merged! I like that we have 3 methods now, not 5!

FWIW I put a few more things on the #help-wanted Zulip stream if you have time / interest

https://oilshell.zulipchat.com/#narrow/stream/417617-help-wanted

There also some bugs tagged "help wanted" on Github as well

And some low hanging fruit with the red X's in doc/ref/toc-ysh.md

passionsocks · 2024-03-09T04:25:34Z

Woot, first commit.

I'll take a look at your posts to figure out what to do next, but I'm also happy to continue removing x's.

andychu · 2024-03-09T04:41:26Z

I appreciate the discussions about naming and API design -- there were some open issues there, and I like where we ended up

As one idea, I think there are some pretty obvious improvements we can do over Python's dict API here:

https://oilshell.zulipchat.com/#narrow/stream/384942-language-design/topic/Our.20Dict.20should.20subsume.20collections.2ECounter.28.29

In Python the classic "word count" problem involves something like:

counts[w] = counts.get(w, 0) + 1

which IMO is a little non-obvious for such a common operation.

There was also the old dict.setdefault() which people found hard to use, and collections.Counter()

I think YSH should just be

call counts->inc(w)   # increment by one, assuming zero if nonexistent

call counts->inc(w, by=5)  # optional arguments should be named arguments in YSH

and then then there's also

call words->accum('k', 'value-to-append')

which is a replacement for collections.defaultdict(list)

Basically I think these are very easy methods, but for some reason Python never added them

Also JavaScript -- there is a survey of different languages on that thread (which I did with ChatGPT, it is good at that kind of survey, where it doesn't really have to be 100% correct)

andychu · 2024-03-09T04:43:20Z

This also goes with "Awk" a bit, because in Awk you can do things like

$1 == "fruit" { fruit += 1 }
END { print "TOTAL" fruit }

without initializing vars ... so hopefully we get closer to the convenience of the classic Unix tools, vs Python

So any design input is welcome

passionsocks requested a review from PossiblyAShrub March 4, 2024 20:37

passionsocks force-pushed the ysh-string-affix-methods branch from af91d53 to cb7d877 Compare March 5, 2024 02:49

passionsocks force-pushed the ysh-string-affix-methods branch from cb7d877 to 96b2693 Compare March 6, 2024 04:26

andychu reviewed Mar 6, 2024

View reviewed changes

passionsocks force-pushed the ysh-string-affix-methods branch from 96b2693 to dc9e6d6 Compare March 8, 2024 17:50

andychu changed the base branch from master to soil-staging March 9, 2024 03:52

andychu merged commit a57f638 into soil-staging Mar 9, 2024
18 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ysh] Various Str start/end methods #1858

[ysh] Various Str start/end methods #1858

passionsocks commented Mar 4, 2024

passionsocks commented Mar 4, 2024

passionsocks commented Mar 4, 2024

passionsocks commented Mar 5, 2024

andychu commented Mar 5, 2024 •

edited

Loading

andychu commented Mar 5, 2024 •

edited

Loading

andychu commented Mar 5, 2024

passionsocks commented Mar 5, 2024

andychu commented Mar 6, 2024 •

edited

Loading

passionsocks commented Mar 6, 2024

andychu left a comment

andychu Mar 6, 2024

passionsocks Mar 7, 2024

andychu commented Mar 6, 2024 •

edited

Loading

passionsocks commented Mar 8, 2024

andychu commented Mar 9, 2024

passionsocks commented Mar 9, 2024

andychu commented Mar 9, 2024 •

edited

Loading

andychu commented Mar 9, 2024 •

edited

Loading

[ysh] Various Str start/end methods #1858

[ysh] Various Str start/end methods #1858

Conversation

passionsocks commented Mar 4, 2024

passionsocks commented Mar 4, 2024

passionsocks commented Mar 4, 2024

passionsocks commented Mar 5, 2024

andychu commented Mar 5, 2024 • edited Loading

andychu commented Mar 5, 2024 • edited Loading

andychu commented Mar 5, 2024

passionsocks commented Mar 5, 2024

andychu commented Mar 6, 2024 • edited Loading

passionsocks commented Mar 6, 2024

andychu left a comment

Choose a reason for hiding this comment

andychu Mar 6, 2024

Choose a reason for hiding this comment

passionsocks Mar 7, 2024

Choose a reason for hiding this comment

andychu commented Mar 6, 2024 • edited Loading

passionsocks commented Mar 8, 2024

andychu commented Mar 9, 2024

passionsocks commented Mar 9, 2024

andychu commented Mar 9, 2024 • edited Loading

andychu commented Mar 9, 2024 • edited Loading

andychu commented Mar 5, 2024 •

edited

Loading

andychu commented Mar 5, 2024 •

edited

Loading

andychu commented Mar 6, 2024 •

edited

Loading

andychu commented Mar 6, 2024 •

edited

Loading

andychu commented Mar 9, 2024 •

edited

Loading

andychu commented Mar 9, 2024 •

edited

Loading