[ysh] Various Str start/end methods #1858
Ready, except for the conflict. I think the CI pipeline will still run in case I missed something.
Some context for choices made: https://oilshell.zulipchat.com/#narrow/stream/121539-oil-dev/topic/flushing.20out.20YSH.20string.20methods/near/424450020
af91d53 to cb7d877
The conflict was from a reformat in d3477ac touching the WHITESPACE array (which I moved). I ran Ready.
Thank you, this looks great! Comments about the "spec" before I look in detail at the code:
In eggex, we already have https://www.oilshell.org/release/latest/doc/eggex.html And we have splicing. So I think users can just use the
e.g. see #1836 (comment)
However it's possible we got that wrong somehow.
The idea was to have a keyword argument like
So that the default behavior will never change, and it also relieves us of a maintenance task.
TBH I think it may be better to discuss this change in a separate PR
And of course you can just write it inline too, like
and you can also use ERE
FWIW this was the conversation about the spaces too
The initial goals with this function are to
You wrote this as a reason we don't need matchStart and matchEnd, but I'm calling it out for myself, as I managed to miss those with my too-fast look through doc/eggex.md, and wrote the spec tests with ^ and $ (which are deprecated) for some of the
Ah, I wasn't aware of the prior discussion. I just saw a list of whitespace in the code with a link, checked the link, and realized the table was inconsistent with respect to the list, and corrected something that seemed wrong. I get the desire to avoid tracking the Unicode standard.
Feedback/bikeshedding As you say, 99% of the time the basic set of spaces is sufficient, then the 1% can use the Eggex argument I add to specify the space characters they want to trim, and that is where you specify the standard
Or maybe you can make
I'm out of time this morning to make those updates, but should be able to by tomorrow.
Hmm yeah if it's possible to just use eggex to strip it quickly, then I might leave out
Eggex has a good syntax for char literals
There are some unicode / eggex issues due to
So this could be a good opportunity to polish eggex and unicode.
Right now there is one failing spec test http://travis-ci.oilshell.org/github-jobs/6412/cpp-spec.wwz/_tmp/spec/ysh-py/ysh-regex.html
But I bet we could expose some issues with a few more spec tests.
Oils is unlike bash in that it's UTF-8 centric.
This doc sorta sketches our Unicode strategy - https://github.com/oilshell/oil/blob/master/doc/unicode.md which is perhaps 40-60% done (?)
cb7d877 to 96b2693
Done, ready again.
Thank you, this is great!
I think the whitespace stripping variants should be the only ones that decode UTF-8, since they must to identify the space characters.
Although since we are adding eggex arguments, which gets translated to ERE, we also need to think about how those behave with respect to Unicode
I pointed out that one failing test in spec/ysh-regex
I wonder if we actually have a flag on the eggex, like
= / %start dot %end ; ; utf8 / # matches one UTF-8 encoded code point, with LANG=utf8
= / %start dot %end ; ; bytes / # matches one byte, with LANG=C
And this matters for trimStart() and trimEnd() now
I want to remove the string/bytes ambiguity, which Python has been playing "whac-a-mole" with for about a decade now, ever since Python 3
We are limited a bit by libc, but I think it is expressive enough to do what we want
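To make the `utf8` vs `bytes` distinction concrete, here is a minimal sketch using Python's `re` module as a stand-in. It is only an analogy: Eggex is translated to ERE and run by libc, and the helper names below are made up for illustration. The point is just that "dot" can mean "one code point" or "one byte" depending on how the input is interpreted.

```python
import re


def dot_matches_whole_str(text: str) -> bool:
    # With a str pattern, '.' consumes one code point.
    return re.fullmatch(r".", text) is not None


def dot_matches_whole_bytes(data: bytes) -> bool:
    # With a bytes pattern, '.' consumes exactly one byte.
    return re.fullmatch(rb".", data) is not None


e_acute = "\u00e9"                 # one code point
encoded = e_acute.encode("utf-8")  # two bytes: 0xC3 0xA9

print(dot_matches_whole_str(e_acute))    # True: one code point
print(dot_matches_whole_bytes(encoded))  # False: two bytes, '.' covers only one
print(len(encoded))                      # 2
```

The same input gives different answers for "does a single dot match the whole string", which is the ambiguity an explicit flag on the eggex would pin down.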
builtin/method_str.py
try:
    if pattern_str is not None:
        if self.anchor & START:
            _, _, start = string_ops.StartsWithStrByteRange(
Same comment here -- I think we should skip decoding when trimming a constant prefix or suffix
It can just test for byte equality, and trim if so
Ack. Done.
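A rough sketch of the "byte equality, no decoding" idea from this thread, in plain Python. The function names are hypothetical and this is not the code in `builtin/method_str.py`; it only shows why no decoding is needed for a constant prefix or suffix.

```python
def trim_prefix_bytes(s: bytes, prefix: bytes) -> bytes:
    """Remove `prefix` from the start of `s` if the bytes are equal there."""
    if prefix and s.startswith(prefix):
        return s[len(prefix):]
    return s


def trim_suffix_bytes(s: bytes, suffix: bytes) -> bytes:
    """Remove `suffix` from the end of `s` if the bytes are equal there."""
    # The `suffix and` guard avoids s[:-0], which would return b"".
    if suffix and s.endswith(suffix):
        return s[:-len(suffix)]
    return s


word = "\u00e9t\u00e9".encode("utf-8")  # "été" as UTF-8 bytes
print(trim_prefix_bytes(word, "\u00e9".encode("utf-8")))  # b't\xc3\xa9'
print(trim_suffix_bytes(word, "\u00e9".encode("utf-8")))  # b'\xc3\xa9t'
```

When both the string and the constant are valid UTF-8, a byte-wise match at the start or end always falls on a code point boundary, so the result is still valid UTF-8 without any decoding.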
I guess for this PR I would add separate tests for how much
And we can mark them failing if necessary, with
We don't have to change anything about the code yet, because we have this confusion elsewhere I believe. So we can fix that all at once separately, hopefully.
This change adds some more API surface area for eggex, so we should test that part.
- Consistency
  - Naming should be "start" and "end"
    - startsWith() and endsWith() were the model.
    - trimLeft() -> trimStart()
    - trimRight() -> trimEnd()
    - removePrefix() -> trimStart(prefix_pattern)
    - removeSuffix() -> trimEnd(suffix_pattern)
  - Whitespace characters
    - The set of characters that are considered whitespace has been extended slightly to be consistent with the set used by JavaScript/ECMAScript.
  - For methods that take a "pattern", that pattern can be a string or an Eggex. This makes the touched methods consistent with the existing `Str => replace()` and `Str => search()` methods.
- Now implemented
  - `Str => endsWith(pattern)`
  - `Str => trimStart(pattern)`
  - `Str => trimEnd(pattern)`
  - `Str => matchStart(pattern)`
  - `Str => matchEnd(pattern)`
- Updates unit tests and spec tests
- Updates docs
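As a rough model of the "pattern can be a string or an Eggex" behavior described above, here is a small Python sketch. Everything in it is an assumption for illustration: `trim_start`, `trim_end`, and the `WHITESPACE` set are hypothetical, a compiled Python regex stands in for an Eggex, and none of it is the actual YSH implementation.

```python
import re
from typing import Optional, Pattern, Union

# Assumption: a small illustrative set, NOT the list used in Oils.
WHITESPACE = " \t\n\r\x0b\x0c\u00a0\ufeff\u2028\u2029"

PatternArg = Optional[Union[str, Pattern[str]]]


def trim_start(s: str, pattern: PatternArg = None) -> str:
    if pattern is None:
        # No argument: strip leading whitespace.
        return s.lstrip(WHITESPACE)
    if isinstance(pattern, str):
        # Constant string: remove it once if it is a prefix.
        return s[len(pattern):] if pattern and s.startswith(pattern) else s
    m = pattern.match(s)  # match() is anchored at the start
    return s[m.end():] if m else s


def trim_end(s: str, pattern: PatternArg = None) -> str:
    if pattern is None:
        return s.rstrip(WHITESPACE)
    if isinstance(pattern, str):
        return s[:-len(pattern)] if pattern and s.endswith(pattern) else s
    # Anchor the regex at the end by wrapping it; patterns with leading
    # inline flags such as (?i) are not handled by this sketch.
    anchored = re.compile("(?:" + pattern.pattern + r")\Z", pattern.flags)
    m = anchored.search(s)
    return s[:m.start()] if m else s


print(trim_start("  hi "))                        # 'hi '
print(trim_start("123abc", re.compile("[0-9]+")))  # 'abc'
print(trim_end("a.txt", ".txt"))                  # 'a'
```

One function per direction covers the old removePrefix()/removeSuffix() cases as well as whitespace trimming, which is the consolidation the renaming list describes.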
96b2693 to dc9e6d6
Feedback addressed, and ready for another review.
Thank you, merged! I like that we have 3 methods now, not 5!
FWIW I put a few more things on the https://oilshell.zulipchat.com/#narrow/stream/417617-help-wanted
There are also some bugs tagged "help wanted" on GitHub.
And some low hanging fruit with the red X's in
Woot, first commit. I'll take a look at your posts to figure out what to do next, but I'm also happy to continue removing x's.
I appreciate the discussions about naming and API design -- there were some open issues there, and I like where we ended up.
As one idea, I think there are some pretty obvious improvements we can make over Python's dict API here:
In Python the classic "word count" problem involves something like:
which IMO is a little non-obvious for such a common operation. There was also the old
I think YSH should just be
and then there's also
which is a replacement for
Basically I think these are very easy methods, but for some reason Python never added them.
Also JavaScript -- there is a survey of different languages on that thread (which I did with ChatGPT, it is good at that kind of survey, where it doesn't really have to be 100% correct)
This also goes with "Awk" a bit, because in Awk you can do things like
without initializing vars ... so hopefully we get closer to the convenience of the classic Unix tools, vs Python.
So any design input is welcome
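For readers who don't have the Python idioms in their head, here is a sketch of common ways the "word count" loop gets written today. These are illustrative only and not necessarily the exact snippets referred to above; the function names are made up.

```python
from collections import Counter, defaultdict


def count_with_get(words):
    counts = {}
    for w in words:
        # dict.get() supplies the default for a key that is not there yet.
        counts[w] = counts.get(w, 0) + 1
    return counts


def count_with_defaultdict(words):
    # Missing keys start at int() == 0, closest to Awk's auto-initialization.
    counts = defaultdict(int)
    for w in words:
        counts[w] += 1
    return dict(counts)


def count_with_counter(words):
    return dict(Counter(words))


words = "a b a c b a".split()
print(count_with_get(words))  # {'a': 3, 'b': 2, 'c': 1}
```

All three return the same mapping; the difference is how much ceremony is needed to say "start at zero", which is the convenience gap relative to Awk mentioned in the thread.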