feat: add support for `median` #1212

AlessandroMiola · 2024-10-17T22:46:53Z

What type of PR is this? (check all applicable)

Related issues

Closes feat: Expr.median / Series.median #1190

Checklist

Code follows style guide (ruff)
Tests added
Documented the changes

If you have comments or can explain your changes, please do so below.

WIP: I still need to dig into the details for pyarrow and dask. I've already opened the PR to (also) get advice and/or guidance :)

My understanding for now (I might be missing bits, though) is the following:

Neither pyarrow nor dask implements a "proper" median; on the other hand, they implement approximate_median and median_approximate respectively (dask's median_approximate fails tests when npartitions=2, though).
Relying on the already implemented quantile(s) would come with the following issues: for pyarrow, an interpolation parameter that would not quite fit the nw.median() signature; for dask, having to hardcode a "linear" interpolation strategy and to xfail tests as well when encountering dask_lazy_p2_constructor.
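To make the interpolation point concrete, here is a minimal plain-Python sketch (no pyarrow or dask involved; `quantile_linear` and `quantile_lower` are hypothetical helpers written for illustration). It shows that the 0.5 quantile only coincides with the exact median when the interpolation is linear, which is why a quantile-based fallback would have to pin that strategy.

```python
import math
import statistics


def quantile_linear(values, q):
    """Quantile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("quantile of empty data")
    # Fractional position of the requested quantile in the sorted data.
    pos = q * (len(ordered) - 1)
    lower = math.floor(pos)
    upper = math.ceil(pos)
    weight = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight


def quantile_lower(values, q):
    """Quantile that always picks the lower of the two nearest ranks."""
    ordered = sorted(values)
    return ordered[math.floor(q * (len(ordered) - 1))]


odd = [3, 8, 2]
even = [1, 5, 6, 7]

median_odd = quantile_linear(odd, 0.5)
median_even = quantile_linear(even, 0.5)

# With linear interpolation the 0.5 quantile matches the exact median.
assert median_odd == statistics.median(odd)
assert median_even == statistics.median(even)

# With "lower" interpolation it does not, for an even number of values.
assert quantile_lower(even, 0.5) != statistics.median(even)
```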

Also, need to revise docstrings so as to comply with #1000 for median.

FBruzzesi · 2024-10-18T07:25:40Z

Hey @AlessandroMiola thanks for the effort! This already looks very promising.

neither pyarrow nor dask do implement a "proper" median; otoh, they respectively implement approximate_median and median_approximate

I think that's good enough as long as we document that results may differ slightly between backends because the underlying algorithms differ.

dask's median_approximate fails tests when npartitions=2, though

This may need some investigation

Also, need to revise docstrings so as to comply with #1000 for median.

This is much appreciated 😁

I have a couple of additional considerations:

On a string series, Polars returns None and pandas raises an error; I don't know about the others. Should we check that the input is numeric and limit the functionality to that? cc @MarcoGorelli
Can we check that all algorithms ignore nulls/nans?
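As a reference for what "ignores nulls/nans" would mean in a test, here is a small plain-Python sketch. It is not a claim about what any backend actually does; `median_ignoring_missing` is a hypothetical helper and the column values are illustrative.

```python
import math
import statistics


def median_ignoring_missing(values):
    """Median of the non-missing entries; None if nothing is left."""
    present = [
        v
        for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    return statistics.median(present) if present else None


# Small columns with one missing entry each (None or NaN).
data = {
    "a": [3, 8, 2, None],
    "b": [5, 5, None, 7],
    "z": [7.0, 8, 9, float("nan")],
}
result = {name: median_ignoring_missing(col) for name, col in data.items()}
```

A backend test could then compare each backend's output against `result`, allowing a tolerance for the approximate implementations.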

AlessandroMiola · 2024-10-18T08:11:33Z

Thanks for your help @FBruzzesi! :) I'll try to address all of your comments!

MarcoGorelli

nice, looks really good!

regarding

Polars on string series returns None,

Polars does indeed raise here for Expr:

In [10]: df
Out[10]:
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1   ┆ f   │
│ 2   ┆ a   │
│ 3   ┆ x   │
└─────┴─────┘

In [11]: df.select(pl.median('b'))
---------------------------------------------------------------------------
InvalidOperationError                     Traceback (most recent call last)
Cell In[11], line 1
----> 1 df.select(pl.median('b'))

File ~/scratch/.venv/lib/python3.12/site-packages/polars/dataframe/frame.py:9010, in DataFrame.select(self, *exprs, **named_exprs)
   8910 def select(
   8911     self, *exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr
   8912 ) -> DataFrame:
   8913     """
   8914     Select columns from this DataFrame.
   8915
   (...)
   9008     └──────────────┘
   9009     """
-> 9010     return self.lazy().select(*exprs, **named_exprs).collect(_eager=True)

File ~/scratch/.venv/lib/python3.12/site-packages/polars/lazyframe/frame.py:2050, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, collapse_joins, no_optimization, streaming, engine, background, _eager, **_kwargs)
   2048 # Only for testing purposes
   2049 callback = _kwargs.get("post_opt_callback", callback)
-> 2050 return wrap_df(ldf.collect(callback))

InvalidOperationError: `median` operation not supported for dtype `str`

I find it a bit odd that Series.median doesn't raise for Polars, and think it'd be ok for us to just always raise

narwhals/_polars/namespace.py

AlessandroMiola · 2024-10-20T13:48:17Z

I find it a bit odd that Series.median doesn't raise for Polars, and think it'd be ok for us to just always raise

A (possibly silly) question on my side for clarification. Should we make it so that it passes tests like the ones below (thus raising a custom error within the underlying backends) or should we just raise where it natively does not while keeping native behaviour as is?
I would have gone the first way, but I'm all ears! :D

# Imports assumed from the narwhals test-suite conventions.
from typing import Any

import pytest

import narwhals.stable.v1 as nw
from tests.utils import Constructor

data = {
    "a": [3, 8, 2, None],
    "b": [5, 5, None, 7],
    "z": [7.0, 8, 9, None],
    "s": ["f", "a", "x", "x"],
}

# Expr path: selecting the median of a string column should raise.
@pytest.mark.parametrize(
    "expr", [nw.col("s").median(), nw.median("s")]
)
def test_median_expr_raises_on_str(constructor: Constructor, expr: nw.Expr) -> None:
    df = nw.from_native(constructor(data))
    with pytest.raises(
        TypeError,
        match="`median` operation not supported for non-numeric input type."
    ):
        df.select(expr)

# Series path: calling median() on a string series should raise.
@pytest.mark.parametrize("col", ["s"])
def test_median_series_raises_on_str(
    constructor_eager: Any,
    col: str,
) -> None:
    series = nw.from_native(constructor_eager(data), eager_only=True)[col]
    with pytest.raises(
        TypeError,
        match="`median` operation not supported for non-numeric input type."
    ):
        series.median()

Please correct me if I'm completely off-road! Thanks

FBruzzesi · 2024-10-20T17:57:58Z

A (possibly silly) question on my side for clarification. Should we make it so that it passes tests like the ones below (thus raising a custom error within the underlying backends) or should we just raise where it natively does not while keeping native behaviour as is? I would have gone the first way, but I'm all ears! :D

Hey @AlessandroMiola, I would personally opt for the second option: you could create an InvalidOperationError class in _exceptions.py and raise that one. For series it would even be possible to do so in narwhals.Series, so that the type check and the raise do not get duplicated. For narwhals.Expr I would not know how to do it directly though, since the datatype is not known until computation in polars.

So maybe first option 😂
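A minimal sketch of the series-level check described above, assuming a toy wrapper rather than the real narwhals classes: `ToySeries` and this `InvalidOperationError` are hypothetical stand-ins, and the error message is borrowed from the tests proposed earlier in the thread.

```python
import statistics


class InvalidOperationError(Exception):
    """Hypothetical stand-in for the custom exception suggested above."""


class ToySeries:
    """Hypothetical minimal wrapper; not the narwhals Series API."""

    def __init__(self, values):
        self._values = list(values)

    def _is_numeric(self):
        # bool is excluded on purpose: it is an int subclass in Python.
        return all(
            v is None or (isinstance(v, (int, float)) and not isinstance(v, bool))
            for v in self._values
        )

    def median(self):
        # The dtype is known up front for an eager series, so the check can
        # live in one place instead of being repeated per backend.
        if not self._is_numeric():
            msg = "`median` operation not supported for non-numeric input type."
            raise InvalidOperationError(msg)
        present = [v for v in self._values if v is not None]
        return statistics.median(present) if present else None


numeric_median = ToySeries([3, 8, 2, None]).median()

try:
    ToySeries(["f", "a", "x", "x"]).median()
except InvalidOperationError as exc:
    error_message = str(exc)
else:
    error_message = None
```

The same trick does not carry over to a lazy expression, because there the dtype is only resolved when the query runs.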

FBruzzesi · 2024-10-24T07:16:37Z

Hey @AlessandroMiola, I think this is fairly ready for review; however, in #1224 compare_dicts was renamed to assert_equal_data, and that's where the CI error comes from.

AlessandroMiola · 2024-10-24T07:59:19Z

Hey @AlessandroMiola , I think this is fairly ready for review, however in #1224 , compare_dicts was renamed to assert_equal_data and that's where the error in CI comes from.

Hi @FBruzzesi, thanks for having a look. Besides what you mention, I still have uncommitted changes related to:

Can we check that all algorithms ignore nulls/nans?

This is indeed pretty straightforward (as all backends seem to natively do it) and potentially ready for being committed and pushed.

I think that's good enough as long as we document that results may slightly differ between backends because of the difference of underlying algorithms used

As above.

A (possibly silly) question on my side for clarification. Should we make it so that it passes tests like the ones below (thus raising a custom error within the underlying backends) or should we just raise where it natively does not while keeping native behaviour as is? I would have gone the first way, but I'm all ears! :D

Hey @AlessandroMiola, I would personally opt for the second option: you could create an InvalidOperationError class in _exceptions.py and raise that one. For series it would even be possible to do so in narwhals.Series, so that the type check and the raise do not get duplicated. For narwhals.Expr I would not know how to do it directly though, since the datatype is not known until computation in polars.

Here, I see your point, and I did indeed have some trouble figuring out how to deal with Expr at the "narwhals level" (if that is even possible), but I would need to revise everything.

dask's median_approximate fails tests when npartitions=2, though

This may need some investigation

On this, I haven't started yet :|

Later today and tomorrow I should have time, so I'll hopefully update you and/or push something.

AlessandroMiola · 2024-10-30T11:17:00Z

tests/expr_and_series/median_test.py

+    if "polars" in str(constructor):
+        request.applymarker(pytest.mark.xfail)


@FBruzzesi @MarcoGorelli I might need some help or suggestions on this (provided we want this to be handled as defined in this test case, or similarly) 🙈 I haven't found a way to consistently raise the custom exception for polars expr. My attempt at wrapping the polars exception in the one I've defined seems to fail without even reaching the median computation, which prevents the exception from being caught. Other attempts have failed because the datatype is not known until computation (as highlighted in the previous discussion), and I guess that relying on series (if even possible) might not be the best approach for polars (?).
Thanks!! :))

FBruzzesi · 2024-10-31

Disclaimer: I am in a bit of a rush, so I hope this comment doesn't turn out too sloppy.

You could use a similar approach as in drop_test, namely:

Import polars exception:
from polars.exceptions import InvalidOperationError as PlInvalidOperationError

Match both in raise:
with pytest.raises((InvalidOperationError, PlInvalidOperationError), match=...):
I think error message may differ a bit, but you can partially match it

Regarding polars lazy, you can add a lazy+collect step to trigger the computation, hence the exception:
df.select(expr).lazy().collect()

I hope this helps 😇
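The two-exception match suggested above can be sketched without either backend installed. This uses the standard library's `unittest` `assertRaisesRegex` in place of `pytest.raises` (a swap made only to keep the snippet dependency-free), and both exception classes and both `raise_*` helpers are hypothetical stand-ins for the custom and the native polars errors.

```python
import unittest


class InvalidOperationError(Exception):
    """Stand-in for the library's own custom exception."""


class PlInvalidOperationError(Exception):
    """Stand-in for polars.exceptions.InvalidOperationError."""


def raise_custom():
    # Message used for the custom error in the tests proposed in this thread.
    raise InvalidOperationError(
        "`median` operation not supported for non-numeric input type."
    )


def raise_native():
    # Message shown in the polars traceback earlier in this thread.
    raise PlInvalidOperationError(
        "`median` operation not supported for dtype `str`"
    )


class MedianRaises(unittest.TestCase):
    def test_either_exception_is_accepted(self):
        for compute in (raise_custom, raise_native):
            # A tuple accepts either type; the pattern covers only the part
            # of the message that the two errors share.
            with self.assertRaisesRegex(
                (InvalidOperationError, PlInvalidOperationError),
                "`median` operation not supported",
            ):
                compute()


suite = unittest.defaultTestLoader.loadTestsFromTestCase(MedianRaises)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

Matching only the shared prefix keeps the test independent of which backend produced the error.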

AlessandroMiola · 2024-11-06

Hi @FBruzzesi, thanks for your help and sorry for the late reply, I was on holiday. Just to wrap up and check my understanding: we would raise the custom exception where feasible (or easier to manage**), and otherwise accept raising the original exception. Is that correct?

** I mean, it might be me not having found an easy way to handle it.

FBruzzesi · 2024-11-06

sorry for the late reply

No need to be sorry! It's all well and good :)

we might then accept raising the custom exception where feasible

Yes, correct, that's exactly my point, and it's what we already do for the DataFrame.drop method and the ColumnNotFoundError exception.

FBruzzesi · 2024-11-06T21:51:50Z

Hey @AlessandroMiola I have one more request 🙈 As median returns a scalar, could you add a test verifying that it works in a group_by context? I would expect that a remapping is required at least in the pyarrow case (median to hash_approximate_median)
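The remapping mentioned above might look roughly like the sketch below. `AGG_REMAP` and `to_hash_aggregate` are hypothetical names, and only the `median` entry comes from this discussion; other names are simply passed through with the `hash_` prefix.

```python
# Illustrative mapping from a generic aggregation name to the name used by
# the backend; only "median" is taken from the discussion above.
AGG_REMAP = {
    "median": "approximate_median",
}


def to_hash_aggregate(name):
    """Return the grouped ("hash_") kernel name for a generic aggregation."""
    return "hash_" + AGG_REMAP.get(name, name)


kernels = [to_hash_aggregate(name) for name in ("median", "mean", "sum")]
```

A group_by test would then check that the remapped aggregation yields one scalar per group.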

FBruzzesi

Thanks a ton @AlessandroMiola 🙌🏼

py-shiny test seems unrelated, we will investigate but not a blocker 🚀

github-actions bot added the enhancement New feature or request label Oct 17, 2024

MarcoGorelli reviewed Oct 18, 2024

AlessandroMiola force-pushed the add-median branch from ca18cf7 to c49127e Compare October 23, 2024 20:18

AlessandroMiola force-pushed the add-median branch 3 times, most recently from 6da817c to a12fae2 Compare October 30, 2024 09:37

AlessandroMiola commented Oct 31, 2024

AlessandroMiola force-pushed the add-median branch from 277add4 to ac7cd3b Compare November 6, 2024 21:05

AlessandroMiola marked this pull request as ready for review November 6, 2024 21:15

AlessandroMiola added 10 commits November 9, 2024 16:41

feat: add support for median

a6aec66

docs: update docstrings with pyarrow examples

f501d5e

test: test median ignores null values

a138981

chore: restore inadvertently removed func and apply suggestion to avoid branching

5be9326

docs: highlight differences in median impl across backends

ce25442

WIP: raise on str and momentaneously xfail on polars expr

f8c1a76

fix: fix docstring inconsistency

d654124

chore: apply suggestions and handle polars exception separately

4bc49b9

test: add test for median in group_by context

5c841ae

chore: add backend_version arg to PolarsExpr call

bd14357

AlessandroMiola force-pushed the add-median branch from bd45a9b to bd14357 Compare November 9, 2024 15:55

FBruzzesi approved these changes Nov 9, 2024

FBruzzesi merged commit b96dc92 into narwhals-dev:main Nov 9, 2024
21 of 22 checks passed

AlessandroMiola deleted the add-median branch November 9, 2024 22:05
