Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(python): Clean up conversion utils #11789

Merged

Conversation

rancomp
Copy link
Contributor

@rancomp rancomp commented Oct 17, 2023

This is not an urgent or significant PR.

I saw this PR: #11759, and noticed some code repetitions. A SECONDS_PER_DAY variable was introduced, and motivated me to clean up the repetitions.
Also, the changes here are in the spirit of #11693, but much smaller in scope, and only related to the python side.

Changes summary:

  • removed _fromtimestamp which was introduced 6MO ago, and is not used or linked to
  • use SECONDS_PER_DAY wherever possible
  • match the style of multipliers (e.g. 1e3 -> 1_000 and 1000 ->1_000)
  • match the style of _timedelta_to_pl_timedelta with _datetime_to_pl_timestamp. This leads to a marginal perf improvement for the former (10%).
  • align ValueError message

A couple of the computations can be encapsulated in a function, but the overhead is so tremendous it will lead to a ~20% perf decrease. There were 2 PRs in this file that led to 15% and 35% perf increase, so it looked not justifiable to encapsulate.

@github-actions github-actions bot added internal An internal refactor or improvement python Related to Python Polars labels Oct 17, 2023
@rancomp
Copy link
Contributor Author

rancomp commented Oct 17, 2023

Many of the functions accept time_unit: TimeUnit | None, but their behavior is inconsistent when accepting None.
For example, _datetime_to_pl_timestamp and _timedelta_to_pl_timedelta accepts time_unit=None and treat it like microesconds.
On the other hand, _to_python_datetime accepts time_unit=None, but will raise a ValueError when encountering that.

Note that the only place in the code base where None is passed explicitly to any of these functions is in the test_utils for _timedelta_to_pl_timedelta

@rancomp rancomp marked this pull request as ready for review October 18, 2023 09:12
Copy link
Member

@stinodego stinodego left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the attention to detail here. A few comments from me.

py-polars/polars/utils/convert.py Outdated Show resolved Hide resolved
py-polars/polars/utils/convert.py Outdated Show resolved Hide resolved
py-polars/polars/utils/convert.py Show resolved Hide resolved
py-polars/polars/utils/convert.py Outdated Show resolved Hide resolved
py-polars/polars/utils/convert.py Outdated Show resolved Hide resolved
py-polars/polars/utils/convert.py Outdated Show resolved Hide resolved
py-polars/polars/utils/convert.py Outdated Show resolved Hide resolved
@ritchie46
Copy link
Member

I do think it is important to run benchmarks with these before we merge. These are called a lot upon instantiation.

@rancomp
Copy link
Contributor Author

rancomp commented Oct 23, 2023

I do think it is important to run benchmarks with these before we merge. These are called a lot upon instantiation.

Of course. Are there standardized benchmarks you want to see?
I ran some tests on my machine with randomized inputs, and they agree with the %timeit magic.
Here is a summary:

_timedelta_to_pl_timedelta comparison
There's a slight improvement because I'm collecting the seconds together
current implementation:

%timeit old_timedelta_to_pl_timedelta(td, "ns")
196 ns ± 2.28 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

%timeit old_timedelta_to_pl_timedelta(td, "ms")
223 ns ± 2.16 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

%timeit old_timedelta_to_pl_timedelta(td, "us")
202 ns ± 3.89 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

%timeit old_timedelta_to_pl_timedelta(td, None)
196 ns ± 4.28 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

proposed implementation:

%timeit new_timedelta_to_pl_timedelta(td, "ns")
175 ns ± 6.81 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

%timeit new_timedelta_to_pl_timedelta(td, "ms")
207 ns ± 1.44 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

%timeit new_timedelta_to_pl_timedelta(td, "us")
178 ns ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

%timeit new_timedelta_to_pl_timedelta(td, None)
191 ns ± 1.53 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

datetime_to_pl_timestamp comparison
Not much of a difference
current implementation:

 %timeit old_datetime_to_pl_timestamp(dt, None)
1.1 µs ± 17.5 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

%timeit old_datetime_to_pl_timestamp(dt, "ms")
1.14 µs ± 12.1 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

%timeit old_datetime_to_pl_timestamp(dt, "us")
1.07 µs ± 16.2 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

%timeit old_datetime_to_pl_timestamp(dt, "ns")
1.12 µs ± 5.92 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

proposed implementation:

%timeit new_datetime_to_pl_timestamp(dt, None)
1.09 µs ± 24 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

%timeit new_datetime_to_pl_timestamp(dt, "ms")
1.12 µs ± 10 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

%timeit new_datetime_to_pl_timestamp(dt, "us")
1.09 µs ± 8.71 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

%timeit new_datetime_to_pl_timestamp(dt, "ns")
1.1 µs ± 8.35 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Copy link
Member

@stinodego stinodego left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I made a few more minor improvements.

Turns out that seconds * 1_000_000_000 + microseconds * 1_000 is faster than 1_000 * (seconds * 1_000_000 + microseconds).

This can be merged!

@stinodego stinodego merged commit e436e4f into pola-rs:main Nov 9, 2023
12 checks passed
@stinodego stinodego changed the title refactor(python): clean up convert.py refactor(python): Clean up conversion utils Nov 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
internal An internal refactor or improvement python Related to Python Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants