-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(rust): utf8 to temporal casting #10517
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! I have a few comments already - will look at implementation later.
@@ -1313,7 +1313,7 @@ def test_quadratic_behavior_4736() -> None: | |||
ldf.select(reduce(add, (pl.col(fld) for fld in ldf.columns))) | |||
|
|||
|
|||
@pytest.mark.parametrize("input_dtype", [pl.Utf8, pl.Int64, pl.Float64]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why was pl.Utf8
removed here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pl.utf8 is to test that the epoch string behavior works, but I think it should not be supported as it is a string not following date/datetime format. ie: pl.DataFrame({"x1":["1234"]}).with_columns(**{ "x1-date": pl.col("x1").cast(pl.Date)})
Should we keep the support for epoch string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added it back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we sure this is meant to work?
To be honest it feels a bit strange to have that pl.Series([<string>]).cast(pl.Date)
could be interpreted as both a format string, or as the number of days since 1970-01-01
I find it hard to believe that anyone would store a timestamp as a string - if they're trying to cast a string to Date, it's almost certainly a date string (and if not, I think it's OK to let the user cast to Int32
first):
# this looks odd to me
In [9]: print(pl.Series(['1234']).cast(pl.Date))
shape: (1,)
Series: '' [date]
[
1973-05-19
]
My suggestion would be to remove Utf8 from here, treat this as a bug fix, and not double-traverse the array with is_parsable_as_number
- @stinodego thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I completely agree, that's why I removed it in the first place but got no feedback on my answer so I put it back, also there is an open issue where this behavior is reported as a bug, #10478
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, thanks - ok maybe let's add that as a test case and check and closes #10478
to the description?
and could we wait for Stijn's response before marking this as resolved please? thanks 🙏
finally, sorry for having been slow on this PR, been on holiday recently
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies, haven't been able to get back to this one. Agree with what you guys are saying - casting "1234"
to a Date should fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries, I understand that there is a lot of work here to be done and also life happens. Will make the changes. Thanks.
# Conflicts: # py-polars/tests/unit/test_queries.py
Support for different timeunits added in nano-arrow
# Conflicts: # crates/polars-arrow/src/compute/cast/utf8_to.rs # crates/polars-arrow/src/temporal_conversions.rs
Co-authored-by: Weijie Guo <[email protected]>
…ame validation on unit tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, looks solid now, thanks for persevering with this! If you rebase onto the latest main, the tests should pass
I don't want to block this PR any further, but if I you wanted to do a follow-up (in a separate PR, after this one) to use time_unit
instead of tu
and time_zone
instead of tz
in these files, that would be welcome
@@ -1415,6 +1415,23 @@ def test_from_epoch(input_dtype: pl.PolarsDataType) -> None: | |||
_ = ldf.select(pl.from_epoch(ts_col, time_unit="s2")) # type: ignore[call-overload] | |||
|
|||
|
|||
def test_from_epoch_str() -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, good one for splitting this out into a separate test!
Sure! I will do the rebase as well as fixing the naming. |
Support for different timeunits added in nano-arrow
…ame validation on unit tests
…n matching on timestamp casting
… string to datetime.
Just for curiosity how can I run hypothesis test locally ? I tried using pytest to the specific file but they do not run. |
…-to-temporal-cast # Conflicts: # crates/polars-arrow/src/compute/cast/mod.rs # crates/polars-arrow/src/compute/cast/utf8_to.rs # crates/polars-core/src/chunked_array/cast.rs
you'll need to also pass |
Yup; I tend to run with a blanket |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
" Files changed 247"
🤔 something may have gone wrong when rebasing
Ohh not I did not saw that when pushed the changes. I am sorry. I am going to do a hard reset to the commit previous to the rebase, and I will try to make the rebase again after that. |
I am still trying to make this merge work out, I did a revert but just realized that it keeps history so I cannot redo the merge from main because previous changes will go unnoticed. I want to avoid doing hard reset and then push because I don't want to hurt changes. Any Idea on what to do at this point? I can do rever-reverted but that will only leave me with the rebase that went wrong. |
if I were you I'd open a new PR with just a single commit |
I just opened a new PR #12072, I will close this one. |
Closes #9994 , closes #10478
The implementation I did in this PR support converting strings to date or datetime, it only support specific formats which are the following:
Additionally:
I found that timestamp casting for time-unit milliseconds and microseconds is not implemented in arrow2 library, so I opened this issue.This is now implemented in nano-arrow crate inside polars."2021-01-01 12:54:58.432"
or"2021-01-01 12:54:58"
is not supported by nano-arrow cast function, it only supports RFC3339(or a variation of it), should the other formats be supported as well ? if so I think we should open a different issue for that.pl.DataFrame({"x1":["1234"]}).with_columns(**{ "x1-date": pl.col("x1").cast(pl.Date)})
, this is why I removed pl.Utf8 fromtest_from_epoch
unit test, should it be supported?Example