-
Notifications
You must be signed in to change notification settings - Fork 89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add support for median
#1212
Conversation
Hey @AlessandroMiola thanks for the effort! This already looks very promising.
I think that's good enough as long as we document that results may slightly differ between backends because of the difference of underlying algorithms used
This may need some investigation
This is much appreciated 😁 I have a couple of additional considerations:
|
Thanks for your help @FBruzzesi! :) I'll try to address all of your comments! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, looks really good!
regarding
Polars on string series returns None,
Polars does indeed raise here for Expr:
In [10]: df
Out[10]:
shape: (3, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1 ┆ f │
│ 2 ┆ a │
│ 3 ┆ x │
└─────┴─────┘
In [11]: df.select(pl.median('b'))
---------------------------------------------------------------------------
InvalidOperationError Traceback (most recent call last)
Cell In[11], line 1
----> 1 df.select(pl.median('b'))
File ~/scratch/.venv/lib/python3.12/site-packages/polars/dataframe/frame.py:9010, in DataFrame.select(self, *exprs, **named_exprs)
8910 def select(
8911 self, *exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr
8912 ) -> DataFrame:
8913 """
8914 Select columns from this DataFrame.
8915
(...)
9008 └──────────────┘
9009 """
-> 9010 return self.lazy().select(*exprs, **named_exprs).collect(_eager=True)
File ~/scratch/.venv/lib/python3.12/site-packages/polars/lazyframe/frame.py:2050, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, collapse_joins, no_optimization, streaming, engine, background, _eager, **_kwargs)
2048 # Only for testing purposes
2049 callback = _kwargs.get("post_opt_callback", callback)
-> 2050 return wrap_df(ldf.collect(callback))
InvalidOperationError: `median` operation not supported for dtype `str`
I find it a bit odd that Series.median
doesn't raise for Polars, and think it'd be ok for us to just always raise
A (possibly silly) question on my side for clarification. Should we make it so that it passes tests like the ones below (thus raising a custom error within the underlying backends) or should we just raise where it natively does not while keeping native behaviour as is?
Please correct me if I'm completely off-road! Thanks |
Hey @AlessandroMiola , I would personally opt for the second option: you could create a So maybe first option 😂 |
ca18cf7
to
c49127e
Compare
Hey @AlessandroMiola , I think this is fairly ready for review, however in #1224 , |
Hi @FBruzzesi, thanks for having a look. Besides what you mention, I still have uncommitted changes related to:
This is indeed pretty straightforward (as all backends seem to natively do it) and potentially ready for being committed and pushed.
As above.
Here, I see your point and I indeed had some issues in figuring out how to deal with
On this, I haven't started yet :| Later today and tomorrow I should have time, so I'll hopefully update you and/or push something. |
6da817c
to
a12fae2
Compare
tests/expr_and_series/median_test.py
Outdated
if "polars" in str(constructor): | ||
request.applymarker(pytest.mark.xfail) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@FBruzzesi @MarcoGorelli I might need some help/suggestion on this (provided we want this to be handled as defined within this test case or similarly) 🙈 haven't found a way to consistently address raising the custom exception on polars expr; my attempt at wrapping the polars exception into the one I've defined (seems to) fail(s) into not even reaching the median computation, thus preventing it from being caught. Also, other attempts at trying to address the issue have failed because of the datatype not being known until computation (as highlighted in the previous discussion) and I guess that relying on series (if even possible) might not be the best approach for polars (?).
Thanks!! :))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Disclaimer: I am in a bit of a rush, I hope this doesn't get too sloppy of a comment.
You could use a similar approach as in drop_test
, namely:
- Import polars exception:
from polars.exceptions import InvalidOperationError as PlInvalidOperationError
- Match both in raise:
I think error message may differ a bit, but you can partially match it
with pytest.raises((InvalidOperationError, PlInvalidOperationError), match=...):
- Regarding polars lazy, you can add a lazy+collect step to trigger the computation, hence the exception:
df.select(expr).lazy().collect()
I hope this helps 😇
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @FBruzzesi, thanks for your help and sorry for the late reply, I was on holiday. Just to wrap up and to verify whether my understanding is correct: we might then accept raising the custom exception where feasible (or easier to manage**), else we might accept raising the original exception. Is my understanding correct?
** I mean, it might be me not having found an easy way to handle it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry for the late reply
No need to be sorry! It's all well and good :)
we might then accept raising the custom exception where feasible
Yes correct, that's exactly my point, and how we already do for DataFrame.drop
method and ColumnNotFoundError
exception
277add4
to
ac7cd3b
Compare
Hey @AlessandroMiola I have one more request 🙈 As |
bd45a9b
to
bd14357
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a ton @AlessandroMiola 🙌🏼
py-shiny test seems unrelated, we will investigate but not a blocker 🚀
What type of PR is this? (check all applicable)
Related issues
Checklist
If you have comments or can explain your changes, please do so below.
WIP: I would need to better dig into details for
pyarrow
anddask
. Already opened the PR to (also) get advices and/or guidance:)My understanding per now (might be missing bits, though) is the following:
pyarrow
nordask
do implement a "proper"median
; otoh, they respectively implementapproximate_median
andmedian_approximate
(dask
'smedian_approximate
fails tests whennpartitions=2
, though)quantile
(s) would come with the following issues: forpyarrow
, the issue of having aninterpolation
parameter which would not quite fit thenw.median()
signature; for dask, the issue of having to hardcode a"linear"
interpolation strategy and having toxfail
tests as well when encounteringdask_lazy_p2_constructor
.Also, need to revise docstrings so as to comply with #1000 for
median
.