Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ysh] Various Str start/end methods #1858

Merged
merged 1 commit into from
Mar 9, 2024
Merged

Conversation

passionsocks
Copy link
Collaborator

  • Consistency
    • Naming should be "start" and "end" - startsWith() and endsWith() were the model. - trimLeft() -> trimStart() - trimRight() -> trimEnd() - removePrefix() -> trimStart(prefix_pattern) - removeSuffix() -> trimEnd(suffix_pattern)
    • Whitespace characters
      • The set of characters that are considered whitespace has been extended slightly to be consistent with the set used by Javascript/ECMAscript.
    • For methods that take a "pattern", that pattern can be a string or an Eggex. This makes the touched methods consistent with the existing Str => replace() and Str => search()` methods.
  • Now implemented
    • Str => endsWith(pattern)
    • Str => trimStart(pattern)
    • Str => trimEnd(pattern)
    • Str => matchStart(pattern)
    • Str => matchEnd(pattern)
  • Updates unit tests and spec tests
  • Updates docs

@passionsocks
Copy link
Collaborator Author

Ready, except for the conflict. I think the CI pipeline will still run in case I missed something.

@passionsocks
Copy link
Collaborator Author

@passionsocks passionsocks force-pushed the ysh-string-affix-methods branch from af91d53 to cb7d877 Compare March 5, 2024 02:49
@passionsocks
Copy link
Collaborator Author

The conflict was from a reformat in d3477ac touching the WHITESPACE array (which I moved). I ran devtools/format.sh yapf-changed so it's already reformatted in the new location.

Ready.

@andychu
Copy link
Contributor

andychu commented Mar 5, 2024

Thank you, this looks great!


Comments about the "spec" before I look in detail at the code:

  1. I don't think we need matchStart() and matchEnd(), because it's redundant with Eggex. On the other hand, I agree that trimStart() and trimEnd() can take an eggex (rather than inverting the pattern)

In eggex, we already have %start and %end as the names for ^ and $, although the latter are still accepted. (Thankfully they are called start/end, not begin/end or left/right !)

https://www.oilshell.org/release/latest/doc/eggex.html

And we have splicing. So I think users can just use the ~ operator

$ var pat = / dot 'o' /


ysh ysh-0.20.0$ if ('hello' ~ / @pat %end /) { echo 'hi' }
hi

ysh ysh-0.20.0$ if ('hello' ~ / %start @pat /) { echo 'hi' }

  1. We decided to avoid any space characters specific to the Unicode version by default

e.g. see #1836 (comment)

However it's possible we got that wrong somehow.

The idea was to have a keyword argument like unicode_spaces='15.0' to opt into additional spaces

So that the default behavior will never change, and it also relieves us of a maintenance task. TBH I think it may be better to discuss this change in a separate PR

@andychu
Copy link
Contributor

andychu commented Mar 5, 2024

And of course you can just write it inline too, like

ysh ysh-0.20.0$ if ('hello' ~ / dot 'o' %end /) { echo 'hi' }
hi

and you can also use ERE

ysh ysh-0.20.0$ if ('hello' ~ '.o$' ) { echo 'hi' }

@andychu
Copy link
Contributor

andychu commented Mar 5, 2024

FWIW this was the conversation about the spaces too

Oh also I followed the link in core/shell.py, that doesn't quite agree with the one you added

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Lexical_grammar#white_space

I find things like this highly annoying

Note: Changes to the Unicode standard used by the JavaScript engine may affect programs' behavior. For example, ES2016 upgraded the reference Unicode standard from 5.1 to 8.0.0, which caused U+180E MONGOLIAN VOWEL SEPARATOR to be moved from the "Space_Separator" category to the "Format (Cf)" category, and made it a non-whitespace. Subsequently, the result of "\u180E".trim().length changed from 0 to 1.

basically if the Unicode database changes, then this function changes ...

which means Oils needs to reference a Unicode version technically, ugh :-(

for a very simple function.

I wonder if we an option for ASCII spaces? Or we could even have unicode_version as a param to trim()!

The initial goals with this function are to

  • make sure its behavior doesn't change over time
  • not burden ourselves with maintenance, especially if people don't really care about it
    • I actually would like to make YSH powerful/fast enough to write such a string space function in YSH
    • there's a thread on Zulip about ByteEncoder / RuneEncoder that is sort of about this. Python/JS are kind of weak at writing such string functions because they can be slow, but there might be a faster abstraction we can provide to users

OK cool, one semantic that could make sense is by default strip the 6 chars in the firs 6 rows of the table

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Lexical_grammar#white_space

U+0009 	Character tabulation 	<TAB> 	Horizontal tabulation 	\t
U+000B 	Line tabulation 	<VT> 	Vertical tabulation 	\v
U+000C 	Form feed 	<FF> 	Page breaking control character ([Wikipedia](https://en.wikipedia.org/wiki/Page_break#Form_feed)). 	\f
U+0020 	Space 	<SP> 	Normal space 	
U+00A0 	No-break space 	<NBSP> 	Normal space, but no point at which a line may break 	
U+FEFF

And then only if you pass unicode_version=something, "other unicode space characters" are included. This is an open-ended set that's version dependent

Basically I think the default of those 6 spaces is going to work "almost all the time". Most people will NOT need to pass unicode_version EVER, probably like 99.999% of programs

But then we have both stability, and a path to evolve that matches JavaScript and future Unicode standards

@passionsocks
Copy link
Collaborator Author

I don't think we need matchStart() and matchEnd(),

  • Sure, I will remove. I left them there to demonstrate how trivial it is to switch to returning a Match instead of a bool (and then do further operations like trim via the Match). But I get that it isn't as efficient.

In eggex, we already have %start and %end

You wrote this as a reason we don't need matchStart and matchEnd, but I'm calling it out for myself as I managed missed those with my too-fast look through doc/eggex.md, and wrote the spec tests with ^ and $ (which are deprecated) for some of the startsWith etc tests.

  • I'll adjust the spec tests for those cases.

We decided to avoid any space characters specific to the Unicode version by default

Ah, I wasn't aware of the prior discussion. I just saw a list of whitespace in the code with a link, checked the link, and realized the table was inconsistent with respect to the list, and corrected something that seemed wrong.

I get the desire to avoid tracking the Unicode standard.

  • I'll leave an a clear comment (the table, the unit test I added, the spec test I modified, and in doc/ref/chap-type-method.md#Trim), and otherwise revert back to the prior list.

The idea was to have a keyword argument like unicode_spaces='15.0' to opt into additional spaces

Feedback/bikeshedding

As you say, 99% of the time the basic set of spaces is sufficient, then the 1% can use the Eggex argument I add to specify the space characters they want to trim, and that is where you specify the standard

setvar line => trim(/ space ; unicode_spaces /)         # Best effort "latest"
setvar line => trim(/ space ; unicode_spaces='15.0' /)  # Specific

Or maybe you can make space take arguments and do...

setvar line => trim(/ space /)                  # Default/basic
setvar line => trim(/ space(unicode) /)         # Best effort "latest"
setvar line => trim(/ space(unicode='15.0') /)  # Specific

I'm out of time this morning to make those updates, but should be able to by tomorrow.

@andychu
Copy link
Contributor

andychu commented Mar 6, 2024

Hmm yeah if we it's possible to just use eggex to strip it quickly, then I might leave out unicode_version= altogether

Eggex has a good syntax for char literals

var pat = / [ \u{ff} \u{100} ] /

There are some unicode / eggex issues due to libc using mutable global variables like LANG=C

So this could be a good opportunity to polish eggex and unicode. Right now there is one failing spec test

http://travis-ci.oilshell.org/github-jobs/6412/cpp-spec.wwz/_tmp/spec/ysh-py/ysh-regex.html

But I bet we could expose some issues with a few more spec tests


Oils is unlike bash in that it's UTF-8 centric

This doc sorta sketches our Unicode strategy - https://github.com/oilshell/oil/blob/master/doc/unicode.md

which is perhaps 40-60% done (?)

@passionsocks passionsocks force-pushed the ysh-string-affix-methods branch from cb7d877 to 96b2693 Compare March 6, 2024 04:26
@passionsocks
Copy link
Collaborator Author

  • Sure, I will remove them [matchStart and matchEnd]
  • I'll adjust the spec tests for those cases [to use %start and %end]
  • I'll leave an a clear comment (the table, the unit test I added, the spec test I modified, and in doc/ref/chap-type-method.md#Trim), and otherwise revert back to the prior list [of whitespace]

Done, ready again.

Copy link
Contributor

@andychu andychu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, this is great!

I think the whitespace stripping variants should be the only ones that decode UTF-8, since they must to identify the space characters.

Although since we are adding eggex arguments, which gets translated to ERE, we also need to think about how those behave with respect to Unicode

I pointed out that one failing test in spec/ysh-regex

I wonder if we actually have a flag on the eggex, like

= / %start dot %end ; ; utf8 /   # matches one utf-8 code unit, with LANG=utf8
= / %start dot %end ; ; bytes /   # matches one byte, with LANG=C

And this matters for trimStart() and trimEnd() now

I want to remove the string/bytes ambiguity, which Python has been playing "whac-a-mole" with for about a decade since Python 3 now

We are limited a bit by libc, but I think it is expressive enough to do what we want

doc/ref/toc-ysh.md Outdated Show resolved Hide resolved
doc/ref/chap-type-method.md Outdated Show resolved Hide resolved
doc/ref/chap-type-method.md Outdated Show resolved Hide resolved
builtin/method_str.py Outdated Show resolved Hide resolved
builtin/method_str.py Outdated Show resolved Hide resolved
builtin/method_str.py Outdated Show resolved Hide resolved
builtin/method_str.py Outdated Show resolved Hide resolved
try:
if pattern_str is not None:
if self.anchor & START:
_, _, start = string_ops.StartsWithStrByteRange(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here -- I think we should skip decoding when trimming a constant prefix or suffix

It can just test for byte equality, and trim if so

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack. Done.

osh/string_ops.py Outdated Show resolved Hide resolved
@andychu
Copy link
Contributor

andychu commented Mar 6, 2024

I guess for this PR I would add separate tests for how much trim(/ dot /) and trim(/ ![z] /) matches -- does it match a byte or a "rune" ?

And we can mark them failing if necessary, with ## oils_allowed_failures: 1

We don't have to change anything about the code yet, because we have this confusion elsewhere I believe. So we can fix that all at once separately, hopefully

This change adds some more API surface area for eggex, so we should test that part

- Consistency
  - Naming should be "start" and "end"
    - startsWith() and endsWith() were the model.
    - trimLeft() -> trimStart()
    - trimRight() -> trimEnd()
    - removePrefix() -> trimStart(prefix_pattern)
    - removeSuffix() -> trimEnd(suffix_pattern)
  - Whitespace characters
    - The set of characters that are considered whitespace has been
      extended slightly to be consistent with the set used by
      Javascript/ECMAscript.
  - For methods that take a "pattern", that pattern can be a string or
    an Eggex. This makes the touched methods consistent with the
    existing `Str => replace()` and Str => search()` methods.
- Now implemented
  - `Str => endsWith(pattern)`
  - `Str => trimStart(pattern)`
  - `Str => trimEnd(pattern)`
  - `Str => matchStart(pattern)`
  - `Str => matchEnd(pattern)`
- Updates unit tests and spec tests
- Updates docs
@passionsocks passionsocks force-pushed the ysh-string-affix-methods branch from 96b2693 to dc9e6d6 Compare March 8, 2024 17:50
@passionsocks
Copy link
Collaborator Author

Feedback addressed, and ready for another review..

@andychu andychu changed the base branch from master to soil-staging March 9, 2024 03:52
@andychu andychu merged commit a57f638 into soil-staging Mar 9, 2024
18 checks passed
@andychu
Copy link
Contributor

andychu commented Mar 9, 2024

Thank you, merged! I like that we have 3 methods now, not 5!

FWIW I put a few more things on the #help-wanted Zulip stream if you have time / interest

https://oilshell.zulipchat.com/#narrow/stream/417617-help-wanted

There also some bugs tagged "help wanted" on Github as well

And some low hanging fruit with the red X's in doc/ref/toc-ysh.md

@passionsocks
Copy link
Collaborator Author

Woot, first commit.

I'll take a look at your posts to figure out what to do next, but I'm also happy to continue removing x's.

@andychu
Copy link
Contributor

andychu commented Mar 9, 2024

I appreciate the discussions about naming and API design -- there were some open issues there, and I like where we ended up

As one idea, I think there are some pretty obvious improvements we can do over Python's dict API here:

https://oilshell.zulipchat.com/#narrow/stream/384942-language-design/topic/Our.20Dict.20should.20subsume.20collections.2ECounter.28.29

In Python the classic "word count" problem involves something like:

counts[w] = counts.get(w, 0) + 1

which IMO is a little non-obvious for such a common operation.

There was also the old dict.setdefault() which people found hard to use, and collections.Counter()


I think YSH should just be

call counts->inc(w)   # increment by one, assuming zero if nonexistent

call counts->inc(w, by=5)  # optional arguments should be named arguments in YSH

and then then there's also

call words->accum('k', 'value-to-append')

which is a replacement for collections.defaultdict(list)


Basically I think these are very easy methods, but for some reason Python never added them

Also JavaScript -- there is a survey of different languages on that thread (which I did with ChatGPT, it is good at that kind of survey, where it doesn't really have to be 100% correct)

@andychu
Copy link
Contributor

andychu commented Mar 9, 2024

This also goes with "Awk" a bit, because in Awk you can do things like

$1 == "fruit" { fruit += 1 }
END { print "TOTAL" fruit }

without initializing vars ... so hopefully we get closer to the convenience of the classic Unix tools, vs Python


So any design input is welcome

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants