feat: Implement `round` and `truncate` for Duration columns #12597
This is off to a good start, but we need to be careful about some subtleties
For example, is this expected?
In [3]: pl.Series([timedelta(days=29)]).dt.round('1mo')
Out[3]:
shape: (1,)
Series: '' [duration[μs]]
[
28d
]
I'd suggest raising if `offset` isn't a constant duration.
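To make the "constant duration" distinction concrete, here is a minimal stdlib-only sketch. The function name, the unit sets, and the parsing are assumptions for illustration and are not the polars implementation; in particular, treating `d` and `w` as fixed-length is only assumed to be reasonable for Duration values, which have no calendar or time zone attached.

```python
import re

# Assumed unit tables, for illustration only.
FIXED_UNITS = {"ns", "us", "ms", "s", "m", "h", "d", "w"}
# Calendar units: their length depends on the date they are anchored to.
CALENDAR_UNITS = {"mo", "q", "y"}

_TOKEN = re.compile(r"(\d+)([a-z]+)")


def is_constant_duration(every: str) -> bool:
    """Return True if every component of `every` has a fixed length."""
    tokens = _TOKEN.findall(every)
    # Reject strings that are not fully covered by <number><unit> pairs.
    if not tokens or "".join(n + u for n, u in tokens) != every:
        raise ValueError(f"cannot parse duration string: {every!r}")
    for _, unit in tokens:
        if unit in CALENDAR_UNITS:
            return False
        if unit not in FIXED_UNITS:
            raise ValueError(f"unknown unit: {unit!r}")
    return True
```

With a check like this, a call such as `round('1mo')` on a Duration column could be rejected up front instead of silently picking one particular month length.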
sorry for the delay, just got a question about negative durations
thanks for updating - CI is red, could you fixup please then I'll take another look?
@MarcoGorelli would you mind giving this PR another look? It looks to me like it's ready to be merged.
nice! yup, taking a look this afternoon UPDATE: spent some time trying to break this, even with |
Looks good!
Just some minor notes:
- if you round with a zero duration, then you get a mysterious panic exception:
In [7]: pl.Series([timedelta(seconds=30)]).dt.round('0m')
PanicException: attempt to calculate the remainder with a divisor of zero
Could an informative error be raised here instead?
- instead of `Err(polars_err!`, I think you can just use `polars_bail!`
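As a rough illustration of the first note, the sketch below validates the rounding interval before dividing, so a zero (or negative) interval produces a readable error rather than a divide-by-zero. It uses only `datetime.timedelta`; the helper name `round_duration`, the error message, and the tie-breaking rule (ties round toward positive infinity) are assumptions, not what the PR implements.

```python
from datetime import timedelta


def round_duration(value: timedelta, every: timedelta) -> timedelta:
    """Round `value` to the nearest multiple of `every` (hypothetical helper)."""
    # Guard first: dividing by a zero interval would otherwise fail deep
    # inside the arithmetic with an unhelpful message.
    if every <= timedelta(0):
        raise ValueError(f"`every` must be a positive duration, got {every}")
    # divmod on two timedeltas yields (int quotient, timedelta remainder),
    # with the remainder always in [0, every) for positive `every`.
    quotient, remainder = divmod(value, every)
    if remainder * 2 >= every:
        quotient += 1
    return quotient * every
```

For example, `round_duration(timedelta(seconds=30), timedelta(0))` raises a `ValueError` naming the offending argument.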
thanks for updating - doing a final pass today, but this might be it EDIT: just noticed some minor things, like that the sign of |
Wait sorry, there is an actual issue here
If you try running
import polars as pl
from datetime import date, datetime, timedelta
df = pl.DataFrame({
'a': [timedelta(1)]
}).with_columns(
b=pl.col('a').dt.cast_time_unit('ms'),
c=pl.col('a').dt.cast_time_unit('ns'),
)
print(df.select(pl.all().dt.truncate('1h', offset='1m')))
print(df.select(pl.all().dt.truncate('1h', offset='-1m')))
you'll see
shape: (1, 3)
┌──────────────┬──────────────┬──────────────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ duration[μs] ┆ duration[ms] ┆ duration[ns] │
╞══════════════╪══════════════╪══════════════╡
│ 1d 1m ┆ 1d 1m ┆ 1d 1m │
└──────────────┴──────────────┴──────────────┘
shape: (1, 3)
┌──────────────┬──────────────┬──────────────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ duration[μs] ┆ duration[ms] ┆ duration[ns] │
╞══════════════╪══════════════╪══════════════╡
│ 1d 1m ┆ 1d 1m ┆ 1d 1m │
└──────────────┴──────────────┴──────────────┘
The second one should be different?
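For comparison, here is a small stdlib-only sketch of what differing results would look like if the offset is simply added after truncation. That "truncate, then shift" order is an assumption taken from the PR description's note about `offset`; the helper name `truncate_with_offset` is hypothetical.

```python
from datetime import timedelta


def truncate_with_offset(
    value: timedelta, every: timedelta, offset: timedelta
) -> timedelta:
    """Floor `value` to a multiple of `every`, then shift by `offset`."""
    if every <= timedelta(0):
        raise ValueError(f"`every` must be a positive duration, got {every}")
    # timedelta // timedelta is an int; multiplying back gives the floor.
    truncated = (value // every) * every
    return truncated + offset


one_day = timedelta(days=1)
one_hour = timedelta(hours=1)
plus = truncate_with_offset(one_day, one_hour, timedelta(minutes=1))
minus = truncate_with_offset(one_day, one_hour, timedelta(minutes=-1))
```

Under this assumption `plus` is 1 day 1 minute and `minus` is 23 hours 59 minutes, so the two printed frames should not be identical.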
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #12597 +/- ##
==========================================
- Coverage 80.69% 80.68% -0.02%
==========================================
Files 1485 1485
Lines 195485 195546 +61
Branches 2782 2782
==========================================
+ Hits 157747 157769 +22
- Misses 37226 37265 +39
Partials 512 512
Hi, sorry for the delay! The negative offset error should be fixed now. Turns out Regarding the sign of
Also, @MarcoGorelli, you mentioned using |
Thanks @rob-sil for updating!
This doesn't sound well-defined. "round to the next month" I can understand what it means, but "round to the next minus one month"?
In [16]: pl.Series([datetime(2020, 2, 15)]).dt.round('-1mo')
Out[16]:
shape: (1,)
Series: '' [datetime[μs]]
[
2020-02-01 00:00:00
]
My guess is that it only currently works by accident, and that it's gone unnoticed because it's not something most users would write anyway.
Do you fancy making a separate precursor PR in which you disallow rounding by negative |
Looks solid, nice one @rob-sil
Thanks for all the help, @MarcoGorelli! Does this need anything else to merge?
crates/polars-time/src/round.rs
Outdated
if !offset.is_constant_duration() {
    polars_bail!(InvalidOperation: "Cannot offset a Duration series by a non-constant duration.");
}
`offset` has been deprecated for `round` in a separate PR - rather than adding complexity here for something which is now deprecated anyway, can we just raise if someone passes `offset` when rounding a duration?
This would be backwards-compatible as rounding durations wasn't supported yet.
CodSpeed Performance Report
Merging #12597 will not alter performance
Looks good to me
A couple of things that I think should be discussed as follow-ups:
- when should `ComputeError` vs `InvalidOperationError` be raised?
- the case "scalar vs expression" is not covered here - but it's also not covered for rounding / truncating dates or datetimes, so I think it's OK to address it separately. I've opened a separate issue for that: `truncate` does not correctly broadcast scalar vs expression #15743
EDIT: marking as draft for a little: from discussion today, some errors should be `InvalidOperationError`. For consistency with truncating date/datetime, I think it's OK for the "cannot truncate by negative duration" case to raise a `ComputeError`; maybe they can all be switched over at a later point, since it'd be a breaking change.
Hi @MarcoGorelli, I'm interested but I think this is pushing the limits of my polars/rust knowledge. For example, I'm a bit confused by
>>> import polars as pl
>>> from datetime import datetime, timedelta
>>>
>>> df = pl.DataFrame(
... {
... "datetime": datetime(2000, 2, 3, 12, 30),
... "every": ["1y", "1mo", "1d", "1h"],
... "duration": [timedelta(i) for i in range(4)],
... }
... )
>>> df
shape: (4, 3)
┌─────────────────────┬───────┬──────────────┐
│ datetime ┆ every ┆ duration │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ str ┆ duration[μs] │
╞═════════════════════╪═══════╪══════════════╡
│ 2000-02-03 12:30:00 ┆ 1y ┆ 0µs │
│ 2000-02-03 12:30:00 ┆ 1mo ┆ 1d │
│ 2000-02-03 12:30:00 ┆ 1d ┆ 2d │
│ 2000-02-03 12:30:00 ┆ 1h ┆ 3d │
└─────────────────────┴───────┴──────────────┘
If I filter "datetime" to a single value but leave all four values of "every", then truncation broadcasts and returns four values:
>>> df.select(
... pl.col("datetime").filter(pl.col("every") == "1mo")
... .dt.truncate(pl.col("every"))
... )
shape: (4, 1)
┌─────────────────────┐
│ datetime │
│ --- │
│ datetime[μs] │
╞═════════════════════╡
│ 2000-01-01 00:00:00 │
│ 2000-02-01 00:00:00 │
│ 2000-02-03 00:00:00 │
│ 2000-02-03 12:00:00 │
└─────────────────────┘
If I filter "datetime" to two values, then truncation picks the first two of "every" and gives me two values back:
>>> df.select(
... pl.col("datetime").filter(pl.col("every").is_in(["1mo", "1d"]))
... .dt.truncate(pl.col("every"))
... )
shape: (2, 1)
┌─────────────────────┐
│ datetime │
│ --- │
│ datetime[μs] │
╞═════════════════════╡
│ 2000-01-01 00:00:00 │
│ 2000-02-01 00:00:00 │
└─────────────────────┘
But if I try other operations that use two series, like addition, I get an error:
>>> df.select(
... pl.col("datetime").filter(pl.col("every").is_in(["1mo", "1d"]))
... + pl.col("duration")
... )
[...]
polars.exceptions.ComputeError: cannot evaluate two Series of different lengths (2 and 4)
[...]
Is this the intended behavior?
Hey
this looks correct, but
I think this isn't. It should raise - length 2 vs length 4 shouldn't silently truncate the RHS to length 2. Thanks for spotting this! I think this can be handled within |
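The rule described here (a length-1 operand broadcasts, any other length mismatch raises) can be sketched in plain Python. This is only an illustration of the intended semantics on lists; the helper name `broadcast_pair` and the error message are hypothetical and do not reflect how polars implements it.

```python
def broadcast_pair(lhs: list, rhs: list) -> tuple[list, list]:
    """Return both operands at a common length, or raise on a mismatch."""
    if len(lhs) == len(rhs):
        return lhs, rhs
    # Only a length-1 operand may be stretched to match the other side.
    if len(lhs) == 1:
        return lhs * len(rhs), rhs
    if len(rhs) == 1:
        return lhs, rhs * len(lhs)
    # Anything else (for example 2 vs 4) is an error, never a silent cut.
    raise ValueError(
        f"cannot combine operands of different lengths ({len(lhs)} and {len(rhs)})"
    )
```

Under this rule the single-value case above yields four rows, while the two-value case raises instead of quietly using only the first two entries of "every".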
Looks good to me, thanks @rob-sil
Closes #12417
I tried to match the behavior of `dt.round` and `dt.truncate` for dates and datetimes as closely as possible. Since `dt.round` is experimental, I wanted to highlight:
- `"1ns"` causes a divide-by-zero error, as one nanosecond is converted to zero microseconds.
- `offset` shifts the windows after rounding, not before. `dt.round(every="1h", offset="30m")` rounds 25 minutes to 30 minutes, but rounds 35 minutes to 90 minutes.
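Both points can be reproduced with a few lines of stdlib Python. This is a sketch of the arithmetic only: the helper names are hypothetical, integer division for the nanosecond-to-microsecond conversion is an assumption consistent with the description above, and ties are assumed to round up.

```python
from datetime import timedelta

NS_PER_US = 1_000


def every_in_microseconds(every_ns: int) -> int:
    """Convert an interval in nanoseconds to whole microseconds."""
    # Integer division discards the sub-microsecond part, so 1 ns becomes 0.
    return every_ns // NS_PER_US


def round_then_offset(
    value: timedelta, every: timedelta, offset: timedelta
) -> timedelta:
    """Round to the nearest multiple of `every`, then add `offset`."""
    if every <= timedelta(0):
        raise ValueError(f"`every` must be a positive duration, got {every}")
    quotient, remainder = divmod(value, every)
    if remainder * 2 >= every:
        quotient += 1
    return quotient * every + offset


hour = timedelta(hours=1)
half_hour = timedelta(minutes=30)
r25 = round_then_offset(timedelta(minutes=25), hour, half_hour)
r35 = round_then_offset(timedelta(minutes=35), hour, half_hour)
```

Here 25 minutes rounds to 0 hours and is shifted to 30 minutes, while 35 minutes rounds to 1 hour and is shifted to 90 minutes; `every_in_microseconds(1)` is 0, which is the divisor behind the divide-by-zero error.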