Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Expressify str.strip_prefix & suffix #11197

Merged
merged 4 commits into from
Sep 21, 2023
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
9 changes: 6 additions & 3 deletions crates/polars-core/src/chunked_array/ops/arity.rs
Original file line number Diff line number Diff line change
Expand Up @@ -7,8 +7,11 @@ use crate::datatypes::{ArrayCollectIterExt, ArrayFromIter, StaticArray};
use crate::prelude::{ChunkedArray, PolarsDataType};
use crate::utils::{align_chunks_binary, align_chunks_ternary};

pub trait Helper<G1, G2, H>: FnMut(G1, G2) -> H {}
impl<T, G1, G2, H> Helper<G1, G2, H> for T where T: FnMut(G1, G2) -> H {}

#[inline]
pub fn binary_elementwise<T, U, V, F, K>(
pub fn binary_elementwise<T, U, V, F>(
lhs: &ChunkedArray<T>,
rhs: &ChunkedArray<U>,
mut op: F,
Expand All @@ -17,8 +20,8 @@ where
T: PolarsDataType,
U: PolarsDataType,
V: PolarsDataType,
F: for<'a> FnMut(Option<T::Physical<'a>>, Option<U::Physical<'a>>) -> Option<K>,
V::Array: ArrayFromIter<Option<K>>,
F: for<'a> Helper<Option<T::Physical<'a>>, Option<U::Physical<'a>>, Option<V::Physical<'a>>>,
Copy link
Collaborator Author

@reswqa reswqa Sep 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I feel uncomfortable with this approach(Helper is for HRTB of the output type of FnMut). But I didn't come up with a good solution to the lifetime issue here in a short period of time. I guess there may be a very simple way, perhaps I overestimated this 😞 Any input or suggestion about this will be helpful to me. :)

Next comment has specific error message.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We definition shouldn't merge it like this with this helper.

Copy link
Collaborator Author

@reswqa reswqa Sep 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We definition shouldn't merge it like this with this helper.

Yes, I also oppose this approach. But It's just for the convenience of bringing up the problem and seeing if there are any good ways to handle this. Or rather, it's throwing bricks to attract jade. :)

I know we have some way to bypass this issue, but I would rather make binary_elementwise happy with this, especially in the case of closures. If there is really no good solution, I will revert this part of the changes and change it to a way that does not require modifying this function.

V::Array: for<'a> ArrayFromIter<Option<V::Physical<'a>>>,
{
let (lhs, rhs) = align_chunks_binary(lhs, rhs);
let iter = lhs
Expand Down
24 changes: 12 additions & 12 deletions crates/polars-core/src/series/arithmetic/borrowed.rs
Original file line number Diff line number Diff line change
Expand Up @@ -177,10 +177,10 @@ pub mod checked {
// see check_div for chunkedarray<T>
let rhs = unsafe { lhs.unpack_series_matching_physical_type(rhs) };

Ok(arity::binary_elementwise::<_, _, Float32Type, _, _>(
lhs,
rhs,
|opt_l, opt_r| match (opt_l, opt_r) {
Ok(
arity::binary_elementwise::<_, _, Float32Type, _>(lhs, rhs, |opt_l, opt_r| match (
opt_l, opt_r,
) {
(Some(l), Some(r)) => {
if r.is_zero() {
None
Expand All @@ -189,9 +189,9 @@ pub mod checked {
}
},
_ => None,
},
})
.into_series(),
)
.into_series())
}
}

Expand All @@ -201,10 +201,10 @@ pub mod checked {
// see check_div
let rhs = unsafe { lhs.unpack_series_matching_physical_type(rhs) };

Ok(arity::binary_elementwise::<_, _, Float64Type, _, _>(
lhs,
rhs,
|opt_l, opt_r| match (opt_l, opt_r) {
Ok(
arity::binary_elementwise::<_, _, Float64Type, _>(lhs, rhs, |opt_l, opt_r| match (
opt_l, opt_r,
) {
(Some(l), Some(r)) => {
if r.is_zero() {
None
Expand All @@ -213,9 +213,9 @@ pub mod checked {
}
},
_ => None,
},
})
.into_series(),
)
.into_series())
}
}

Expand Down
40 changes: 40 additions & 0 deletions crates/polars-ops/src/chunked_array/strings/namespace.rs
Original file line number Diff line number Diff line change
Expand Up @@ -334,6 +334,46 @@ pub trait Utf8NameSpaceImpl: AsUtf8 {
Ok(builder.finish())
}

fn strip_prefix(&self, prefix: &Utf8Chunked) -> Utf8Chunked {
let ca = self.as_utf8();
match prefix.len() {
1 => match prefix.get(0) {
Some(prefix) => {
ca.apply_generic(|opt_s| opt_s.map(|s| s.strip_prefix(prefix).unwrap_or(s)))
},
_ => Utf8Chunked::full_null(ca.name(), ca.len()),
},
_ => binary_elementwise(
ca,
prefix,
|opt_s: Option<&str>, opt_prefix: Option<&str>| match (opt_s, opt_prefix) {
(Some(s), Some(prefix)) => Some(s.strip_prefix(prefix).unwrap_or(s)),
_ => None,
},
),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the log make the compiler unhappy.

error: lifetime may not live long enough
   --> /Users/reswqa/code/rust/polars/crates/polars-ops/src/chunked_array/strings/namespace.rs:350:48
    |
349 |                 |opt_s: Option<&str>, opt_prefix: Option<&str>| match (opt_s, opt_prefix) {
    |                                -                              - return type of closure is std::option::Option<&'2 str>
    |                                |
    |                                let's call the lifetime of this reference `'1`
350 |                     (Some(s), Some(prefix)) => Some(s.strip_prefix(prefix).unwrap_or(s)),
    |                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ returning this value requires that `'1` must outlive `'2`

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make this closure an actual function and annotate the lifetimes?

Copy link
Collaborator Author

@reswqa reswqa Sep 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you means some thing like this?

fn strip_prefix_op<'a>(opt_s: Option<&'a str>, opt_prefix: Option<&'a str>) -> Option<& 'a str> {
        match (opt_s, opt_prefix) {
            (Some(s), Some(prefix)) => Some(s.strip_prefix(prefix).unwrap_or(s)),
            _ => None,
        }
}

and rewrite to

binary_elementwise(ca, prefix,strip_prefix_op)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty much.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this given me:

error[E0308]: mismatched types
   --> /Users/reswqa/code/rust/polars/crates/polars-ops/src/chunked_array/strings/namespace.rs:359:18
    |
359 |               _ => binary_elementwise(
    |  __________________^
360 | |                 ca,
361 | |                 prefix,
362 | |                 strip_prefix_op
363 | |             ),
    | |_____________^ one type is more general than the other
    |
    = note: expected enum `std::option::Option<&'a str>`
               found enum `std::option::Option<&str>`
note: the lifetime requirement is introduced here
   --> /Users/reswqa/code/rust/polars/crates/polars-core/src/chunked_array/ops/arity.rs:20:75
    |
20  |     F: for<'a> FnMut(Option<T::Physical<'a>>, Option<U::Physical<'a>>) -> Option<K>,

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will take a look tomorrow.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

}
}

fn strip_suffix(&self, suffix: &Utf8Chunked) -> Utf8Chunked {
let ca = self.as_utf8();
match suffix.len() {
1 => match suffix.get(0) {
Some(suffix) => {
ca.apply_generic(|opt_s| opt_s.map(|s| s.strip_suffix(suffix).unwrap_or(s)))
},
_ => Utf8Chunked::full_null(ca.name(), ca.len()),
},
_ => binary_elementwise(
ca,
suffix,
|opt_s: Option<&str>, opt_suffix: Option<&str>| match (opt_s, opt_suffix) {
(Some(s), Some(suffix)) => Some(s.strip_suffix(suffix).unwrap_or(s)),
_ => None,
},
),
}
}

fn split(&self, by: &str) -> ListChunked {
let ca = self.as_utf8();
let mut builder = ListUtf8ChunkedBuilder::new(ca.name(), ca.len(), ca.get_values_size());
Expand Down
4 changes: 2 additions & 2 deletions crates/polars-plan/src/dsl/function_expr/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -738,8 +738,8 @@ impl From<StringFunction> for SpecialEq<Arc<dyn SeriesUdf>> {
StripChars(matches) => map!(strings::strip_chars, matches.as_deref()),
StripCharsStart(matches) => map!(strings::strip_chars_start, matches.as_deref()),
StripCharsEnd(matches) => map!(strings::strip_chars_end, matches.as_deref()),
StripPrefix(prefix) => map!(strings::strip_prefix, &prefix),
StripSuffix(suffix) => map!(strings::strip_suffix, &suffix),
StripPrefix => map_as_slice!(strings::strip_prefix),
StripSuffix => map_as_slice!(strings::strip_suffix),
#[cfg(feature = "string_from_radix")]
FromRadix(radix, strict) => map!(strings::from_radix, radix, strict),
Slice(start, length) => map!(strings::str_slice, start, length),
Expand Down
30 changes: 14 additions & 16 deletions crates/polars-plan/src/dsl/function_expr/strings.rs
Original file line number Diff line number Diff line change
Expand Up @@ -71,8 +71,8 @@ pub enum StringFunction {
StripChars(Option<String>),
StripCharsStart(Option<String>),
StripCharsEnd(Option<String>),
StripPrefix(String),
StripSuffix(String),
StripPrefix,
StripSuffix,
#[cfg(feature = "temporal")]
Strptime(DataType, StrptimeOptions),
Split,
Expand Down Expand Up @@ -121,8 +121,8 @@ impl StringFunction {
| StripChars(_)
| StripCharsStart(_)
| StripCharsEnd(_)
| StripPrefix(_)
| StripSuffix(_)
| StripPrefix
| StripSuffix
| Slice(_, _) => mapper.with_same_dtype(),
#[cfg(feature = "string_justify")]
Zfill { .. } | LJust { .. } | RJust { .. } => mapper.with_same_dtype(),
Expand Down Expand Up @@ -164,8 +164,8 @@ impl Display for StringFunction {
StringFunction::StripChars(_) => "strip_chars",
StringFunction::StripCharsStart(_) => "strip_chars_start",
StringFunction::StripCharsEnd(_) => "strip_chars_end",
StringFunction::StripPrefix(_) => "strip_prefix",
StringFunction::StripSuffix(_) => "strip_suffix",
StringFunction::StripPrefix => "strip_prefix",
StringFunction::StripSuffix => "strip_suffix",
#[cfg(feature = "temporal")]
StringFunction::Strptime(_, _) => "strptime",
StringFunction::Split => "split",
Expand Down Expand Up @@ -325,18 +325,16 @@ pub(super) fn strip_chars_end(s: &Series, matches: Option<&str>) -> PolarsResult
}
}

pub(super) fn strip_prefix(s: &Series, prefix: &str) -> PolarsResult<Series> {
let ca = s.utf8()?;
Ok(ca
.apply_values(|s| Cow::Borrowed(s.strip_prefix(prefix).unwrap_or(s)))
.into_series())
pub(super) fn strip_prefix(s: &[Series]) -> PolarsResult<Series> {
let ca = s[0].utf8()?;
let prefix = s[1].utf8()?;
Ok(ca.strip_prefix(prefix).into_series())
}

pub(super) fn strip_suffix(s: &Series, suffix: &str) -> PolarsResult<Series> {
let ca = s.utf8()?;
Ok(ca
.apply_values(|s| Cow::Borrowed(s.strip_suffix(suffix).unwrap_or(s)))
.into_series())
pub(super) fn strip_suffix(s: &[Series]) -> PolarsResult<Series> {
let ca = s[0].utf8()?;
let suffix = s[1].utf8()?;
Ok(ca.strip_suffix(suffix).into_series())
}

pub(super) fn extract_all(args: &[Series]) -> PolarsResult<Series> {
Expand Down
22 changes: 12 additions & 10 deletions crates/polars-plan/src/dsl/string.rs
Original file line number Diff line number Diff line change
Expand Up @@ -441,19 +441,21 @@ impl StringNameSpace {
}

/// Remove prefix.
pub fn strip_prefix(self, prefix: String) -> Expr {
self.0
.map_private(FunctionExpr::StringExpr(StringFunction::StripPrefix(
prefix,
)))
pub fn strip_prefix(self, prefix: Expr) -> Expr {
self.0.map_many_private(
FunctionExpr::StringExpr(StringFunction::StripPrefix),
&[prefix],
false,
)
}

/// Remove suffix.
pub fn strip_suffix(self, suffix: String) -> Expr {
self.0
.map_private(FunctionExpr::StringExpr(StringFunction::StripSuffix(
suffix,
)))
pub fn strip_suffix(self, suffix: Expr) -> Expr {
self.0.map_many_private(
FunctionExpr::StringExpr(StringFunction::StripSuffix),
&[suffix],
false,
)
}

/// Convert all characters to lowercase.
Expand Down
6 changes: 4 additions & 2 deletions py-polars/polars/expr/string.py
Original file line number Diff line number Diff line change
Expand Up @@ -680,7 +680,7 @@ def strip_chars_end(self, characters: str | None = None) -> Expr:
"""
return wrap_expr(self._pyexpr.str_strip_chars_end(characters))

def strip_prefix(self, prefix: str) -> Expr:
def strip_prefix(self, prefix: IntoExpr) -> Expr:
"""
Remove prefix.
Expand Down Expand Up @@ -708,9 +708,10 @@ def strip_prefix(self, prefix: str) -> Expr:
└───────────┴──────────┘
"""
prefix = parse_as_expression(prefix, str_as_lit=True)
return wrap_expr(self._pyexpr.str_strip_prefix(prefix))

def strip_suffix(self, suffix: str) -> Expr:
def strip_suffix(self, suffix: IntoExpr) -> Expr:
"""
Remove suffix.
Expand Down Expand Up @@ -738,6 +739,7 @@ def strip_suffix(self, suffix: str) -> Expr:
└───────────┴──────────┘
"""
suffix = parse_as_expression(suffix, str_as_lit=True)
return wrap_expr(self._pyexpr.str_strip_suffix(suffix))

def zfill(self, alignment: int) -> Expr:
Expand Down
4 changes: 2 additions & 2 deletions py-polars/polars/series/string.py
Original file line number Diff line number Diff line change
Expand Up @@ -1208,7 +1208,7 @@ def strip_chars_end(self, characters: str | None = None) -> Series:
"""

def strip_prefix(self, prefix: str) -> Series:
def strip_prefix(self, prefix: IntoExpr) -> Series:
"""
Remove prefix.
Expand All @@ -1234,7 +1234,7 @@ def strip_prefix(self, prefix: str) -> Series:
"""

def strip_suffix(self, suffix: str) -> Series:
def strip_suffix(self, suffix: IntoExpr) -> Series:
"""
Remove suffix.
Expand Down
8 changes: 4 additions & 4 deletions py-polars/src/expr/string.rs
Original file line number Diff line number Diff line change
Expand Up @@ -75,12 +75,12 @@ impl PyExpr {
self.inner.clone().str().strip_chars_end(matches).into()
}

fn str_strip_prefix(&self, prefix: String) -> Self {
self.inner.clone().str().strip_prefix(prefix).into()
fn str_strip_prefix(&self, prefix: Self) -> Self {
self.inner.clone().str().strip_prefix(prefix.inner).into()
}

fn str_strip_suffix(&self, suffix: String) -> Self {
self.inner.clone().str().strip_suffix(suffix).into()
fn str_strip_suffix(&self, suffix: Self) -> Self {
self.inner.clone().str().strip_suffix(suffix.inner).into()
}

fn str_slice(&self, start: i64, length: Option<u64>) -> Self {
Expand Down
34 changes: 29 additions & 5 deletions py-polars/tests/unit/namespaces/test_string.py
Original file line number Diff line number Diff line change
Expand Up @@ -260,16 +260,40 @@ def test_str_strip_deprecated() -> None:
pl.Series(["a", "b", "c"]).str.rstrip()


def test_str_strip_prefix() -> None:
s = pl.Series(["foo:bar", "foofoo:bar", "bar:bar", "foo", ""])
expected = pl.Series([":bar", "foo:bar", "bar:bar", "", ""])
def test_str_strip_prefix_literal() -> None:
s = pl.Series(["foo:bar", "foofoo:bar", "bar:bar", "foo", "", None])
expected = pl.Series([":bar", "foo:bar", "bar:bar", "", "", None])
assert_series_equal(s.str.strip_prefix("foo"), expected)
# test null literal
expected = pl.Series([None, None, None, None, None, None], dtype=pl.Utf8)
assert_series_equal(s.str.strip_prefix(pl.lit(None, dtype=pl.Utf8)), expected)


def test_str_strip_prefix_suffix_expr() -> None:
df = pl.DataFrame(
{
"s": ["foo-bar", "foobarbar", "barfoo", "", "anything", None],
"prefix": ["foo", "foobar", "foo", "", None, "bar"],
"suffix": ["bar", "barbar", "bar", "", None, "foo"],
}
)
out = df.select(
pl.col("s").str.strip_prefix(pl.col("prefix")).alias("strip_prefix"),
pl.col("s").str.strip_suffix(pl.col("suffix")).alias("strip_suffix"),
)
assert out.to_dict(False) == {
"strip_prefix": ["-bar", "bar", "barfoo", "", None, None],
"strip_suffix": ["foo-", "foo", "barfoo", "", None, None],
}


def test_str_strip_suffix() -> None:
s = pl.Series(["foo:bar", "foo:barbar", "foo:foo", "bar", ""])
expected = pl.Series(["foo:", "foo:bar", "foo:foo", "", ""])
s = pl.Series(["foo:bar", "foo:barbar", "foo:foo", "bar", "", None])
expected = pl.Series(["foo:", "foo:bar", "foo:foo", "", "", None])
assert_series_equal(s.str.strip_suffix("bar"), expected)
# test null literal
expected = pl.Series([None, None, None, None, None, None], dtype=pl.Utf8)
assert_series_equal(s.str.strip_suffix(pl.lit(None, dtype=pl.Utf8)), expected)


def test_str_split() -> None:
Expand Down