Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Implement round and truncate for Duration columns #12597

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

rob-sil
Copy link
Contributor

@rob-sil rob-sil commented Nov 20, 2023

Closes #12417

I tried to match the behavior of dt.round and dt.truncate for dates and datetimes as closely as possible. Since dt.round is experimental, I wanted to highlight:

  • Rounding happens in the time unit of the series. This means that rounding any microseconds series to "1ns" causes a divide-by-zero error, as one nanosecond is converted to zero microseconds.
  • The offset shifts the windows after rounding, not before. dt.round(every="1h", offset="30m") rounds 25 minutes to 30 minutes, but rounds 35 minutes to 90 minutes.

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Nov 20, 2023
@rob-sil rob-sil marked this pull request as ready for review November 21, 2023 14:18
Copy link
Collaborator

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is off to a good start, but we need to be careful about some subtleties

For example, is this expected?

In [3]: pl.Series([timedelta(days=29)]).dt.round('1mo')
Out[3]: 
shape: (1,)
Series: '' [duration[μs]]
[
        28d
]

?

I'd suggest raising if offset isn't a constant duration

crates/polars-time/src/round.rs Outdated Show resolved Hide resolved
Copy link
Collaborator

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for the delay, just got a question about negative durations

py-polars/tests/unit/namespaces/test_datetime.py Outdated Show resolved Hide resolved
@rob-sil rob-sil requested a review from c-peters as a code owner January 3, 2024 15:01
@MarcoGorelli
Copy link
Collaborator

thanks for updating - CI is red, could you fixup please then I'll take another look?

@stinodego stinodego changed the title feat: round and truncate for durations feat: Implement round and truncate for Duration columns Feb 14, 2024
@stinodego
Copy link
Member

@MarcoGorelli would you mind giving this PR another look? It looks to me like it's ready to be merged.

@MarcoGorelli
Copy link
Collaborator

MarcoGorelli commented Feb 14, 2024

nice! yup, taking a look this afternoon

UPDATE: spent some time trying to break this, even with hypothesis tests, and wasn't able to. Well done! Gonna try out some final things, then will approve

Copy link
Collaborator

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Just some minor notes:

  1. if you round with a zero-duration, then you can a mysterious panic exception:
In [7]: pl.Series([timedelta(seconds=30)]).dt.round('0m')

PanicException: attempt to calculate the remainder with a divisor of zero

Could an informative error be raised here instead?

  1. instead of Err(polars_err!, I think you can just use polars_bail!

@MarcoGorelli
Copy link
Collaborator

MarcoGorelli commented Feb 19, 2024

thanks for updating - doing a final pass today, but this might be it

EDIT: just noticed some minor things, like that the sign of every is ignored - I'll add a commit to fixup

Copy link
Collaborator

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait sorry, there is an actual issue here

If you try running

import polars as pl
from datetime import date, datetime, timedelta

df = pl.DataFrame({
    'a': [timedelta(1)]
}).with_columns(
    b=pl.col('a').dt.cast_time_unit('ms'),
    c=pl.col('a').dt.cast_time_unit('ns'),
)
print(df.select(pl.all().dt.truncate('1h', offset='1m')))
print(df.select(pl.all().dt.truncate('1h', offset='-1m')))

you'll see

shape: (1, 3)
┌──────────────┬──────────────┬──────────────┐
│ a            ┆ b            ┆ c            │
│ ---          ┆ ---          ┆ ---          │
│ duration[μs] ┆ duration[ms] ┆ duration[ns] │
╞══════════════╪══════════════╪══════════════╡
│ 1d 1m        ┆ 1d 1m        ┆ 1d 1m        │
└──────────────┴──────────────┴──────────────┘
shape: (1, 3)
┌──────────────┬──────────────┬──────────────┐
│ a            ┆ b            ┆ c            │
│ ---          ┆ ---          ┆ ---          │
│ duration[μs] ┆ duration[ms] ┆ duration[ns] │
╞══════════════╪══════════════╪══════════════╡
│ 1d 1m        ┆ 1d 1m        ┆ 1d 1m        │
└──────────────┴──────────────┴──────────────┘

The second one should be different?

Copy link

codecov bot commented Mar 17, 2024

Codecov Report

Attention: Patch coverage is 70.68966% with 34 lines in your changes missing coverage. Please review.

Project coverage is 80.68%. Comparing base (8bad2fd) to head (1554a57).

Files Patch % Lines
crates/polars-time/src/round.rs 55.76% 23 Missing ⚠️
...ates/polars-plan/src/dsl/function_expr/datetime.rs 50.00% 6 Missing ⚠️
crates/polars-time/src/truncate.rs 90.38% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #12597      +/-   ##
==========================================
- Coverage   80.69%   80.68%   -0.02%     
==========================================
  Files        1485     1485              
  Lines      195485   195546      +61     
  Branches     2782     2782              
==========================================
+ Hits       157747   157769      +22     
- Misses      37226    37265      +39     
  Partials      512      512              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@rob-sil
Copy link
Contributor Author

rob-sil commented Mar 17, 2024

Hi, sorry for the delay!

The negative offset error should be fixed now. Turns out Duration's duration_ns doesn't include its sign and just returns the absolute value of the duration in nanoseconds. I made a note about it, but should it return negative values for negative durations instead?

Regarding the sign of every: what's the expect outcome of rounding a duration to a negative duration?

  1. Round symmetrically, so rounding with every=-1h is the same as every=1h. I assumed this option, which is what the PR currently does.
  2. Raising an error, like if someone tries to round dates and datetimes to negative durations.
  3. Rounding in the opposite direction. I think this is what pandas does, but it might be a bug.

Also, @MarcoGorelli, you mentioned using hypothesis tests. Should I add some?

@MarcoGorelli
Copy link
Collaborator

Thanks @rob-sil for updating!

Regarding the sign of every: what's the expect outcome of rounding a duration to a negative duration?

This doesn't sound well-defined. "round to the next month" I can understand what it means, but "round to the next minus one month"?

In [16]: pl.Series([datetime(2020, 2, 15)]).dt.round('-1mo')
Out[16]:
shape: (1,)
Series: '' [datetime[μs]]
[
        2020-02-01 00:00:00
]

My guess is that it only currently works by accident, and that it's gone unnoticed because it's not something most users would write anyway

Do you fancy making a separate precursor PR in which you disallow rounding by negative every for date/datetime? Then we can do the same in this PR, and then finally approve it 💪

Copy link
Collaborator

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks solid, nice one @rob-sil

@rob-sil
Copy link
Contributor Author

rob-sil commented Mar 27, 2024

Thanks for all the help, @MarcoGorelli! Does this need anything else to merge?

Comment on lines 57 to 156
if !offset.is_constant_duration() {
polars_bail!(InvalidOperation: "Cannot offset a Duration series by a non-constant duration.");
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

offset has been deprecated for round in a separate PR - rather than adding complexity here for something which is now deprecated anyway, can we just raise if someone passes offset when rounding a duration?

This would be backwards-compatible as rounding durations wasn't supported yet

crates/polars-time/src/round.rs Outdated Show resolved Hide resolved
@rob-sil rob-sil requested a review from reswqa as a code owner April 14, 2024 18:54
Copy link

codspeed-hq bot commented Apr 14, 2024

CodSpeed Performance Report

Merging #12597 will not alter performance

Comparing rob-sil:round-durations (1554a57) with main (8bad2fd)

Summary

✅ 37 untouched benchmarks

Copy link
Collaborator

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

A couple of things that I think should be discussed as follow-ups:

  • when should ComputeError vs InvalidOperationError be raised?
  • the case "scalar vs expression" is not covered here - but it's also not covered for rounding / truncating dates or datetimes, so I think it's OK to address it separately. I've opened a separate issue for that: truncate does not correctly broadcast scalar vs expression #15743

EDIT: marking as draft for a little: from discussion today, the some errors should be InvalidOperationError. For consistency, with truncating date/datetime, I think it's OK for the "cannot truncate by negative duration" to raise a ComputeError, maybe they can all be switched over at a later point, it'd be a breaking change

@MarcoGorelli MarcoGorelli marked this pull request as draft April 19, 2024 14:20
@MarcoGorelli
Copy link
Collaborator

Hey @rob-sil - are you interested in updating this to use broadcast_try_binary_elementwise and a cache, as was done in #15768 ?

No worries if not, happy to add a commit if you like, I understand this PR has been taking some time

@rob-sil
Copy link
Contributor Author

rob-sil commented Apr 21, 2024

Hi @MarcoGorelli, I'm interested but I think this is pushing the limits of my polars/rust knowledge.

For example, I'm a bit confused by broadcast_try_binary_elementwise and how to handle series of different lengths.

>>> import polars as pl
>>> from datetime import datetime, timedelta
>>>
>>> df = pl.DataFrame(
...     {
...         "datetime": datetime(2000, 2, 3, 12, 30),
...         "every":  ["1y", "1mo", "1d", "1h"],
...         "duration": [timedelta(i) for i in range(4)],
...     }
... )
>>> df
shape: (4, 3)
┌─────────────────────┬───────┬──────────────┐
│ datetimeeveryduration     │
│ ---------          │
│ datetime[μs]        ┆ strduration[μs] │
╞═════════════════════╪═══════╪══════════════╡
│ 2000-02-03 12:30:001y0µs          │
│ 2000-02-03 12:30:001mo1d           │
│ 2000-02-03 12:30:001d2d           │
│ 2000-02-03 12:30:001h3d           │
└─────────────────────┴───────┴──────────────┘

If I filter "datetime" to a single value but leave all four values of "every", then truncation broadcasts and returns four values:

>>> df.select(
...     pl.col("datetime").filter(pl.col("every") == "1mo")
...         .dt.truncate(pl.col("every"))
... )
shape: (4, 1)
┌─────────────────────┐
│ datetime            │
│ ---                 │
│ datetime[μs]        │
╞═════════════════════╡
│ 2000-01-01 00:00:00 │
│ 2000-02-01 00:00:00 │
│ 2000-02-03 00:00:00 │
│ 2000-02-03 12:00:00 │
└─────────────────────┘

If I filter "datetime" to two values, then truncation picks the first two of "every" and gives me two values back:

>>> df.select(
...     pl.col("datetime").filter(pl.col("every").is_in(["1mo", "1d"]))
...         .dt.truncate(pl.col("every"))
... )
shape: (2, 1)
┌─────────────────────┐
│ datetime            │
│ ---                 │
│ datetime[μs]        │
╞═════════════════════╡
│ 2000-01-01 00:00:00 │
│ 2000-02-01 00:00:00 │
└─────────────────────┘

But if I try other operations that use two series, like addition, I would get an error.

>>> df.select(
...     pl.col("datetime").filter(pl.col("every").is_in(["1mo", "1d"]))
...         + pl.col("duration")
... )
[...]
polars.exceptions.ComputeError: cannot evaluate two Series of different lengths (2 and 4)
[...]

Is this the intended behavior?

@MarcoGorelli
Copy link
Collaborator

MarcoGorelli commented Apr 26, 2024

Hey

If I filter "datetime" to a single value but leave all four values of "every", then truncation broadcasts and returns four values:

this looks correct, but

If I filter "datetime" to two values, then truncation picks the first two of "every" and gives me two values back:

I think this isn't. It should raise - length 2 vs length 4 shouldn't silently truncate the RHS to length 2. Thanks for spotting this! I think this can be handled within broadcast_try_binary_elementwise - @reswqa is this right?

@MarcoGorelli MarcoGorelli marked this pull request as ready for review July 16, 2024 08:20
Copy link
Collaborator

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks @rob-sil

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rounding for a duration data type
4 participants