Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(rust): utf8 to temporal casting #10517

Closed
wants to merge 106 commits into from

Conversation

brayanjuls
Copy link
Contributor

@brayanjuls brayanjuls commented Aug 15, 2023

Closes #9994 , closes #10478

The implementation I did in this PR support converting strings to date or datetime, it only support specific formats which are the following:

  • %Y-%m-%d
  • %Y-%m-%dT%H:%M:%S
  • %Y-%m-%dT%H:%M:%S.3f
  • %Y-%m-%dT%H:%M:%S%.f%:z

Additionally:

  • I found that timestamp casting for time-unit milliseconds and microseconds is not implemented in arrow2 library, so I opened this issue. This is now implemented in nano-arrow crate inside polars.
  • Converting date times with the following format "2021-01-01 12:54:58.432" or "2021-01-01 12:54:58" is not supported by nano-arrow cast function, it only supports RFC3339(or a variation of it), should the other formats be supported as well ? if so I think we should open a different issue for that.
  • Converting a string number to date is not supported, ie: pl.DataFrame({"x1":["1234"]}).with_columns(**{ "x1-date": pl.col("x1").cast(pl.Date)}), this is why I removed pl.Utf8 from test_from_epoch unit test, should it be supported?

Example

import polars as pl

df = pl.DataFrame({
    "x1": ["2021-01-01"], 
    "x2": ["2021-01-01T12:54:58"],
    "x3": ["2021-01-01T12:54:58.432"],
    "x4": ["2021-01-01T12:54:58.432 +00:00"],
}).with_columns(**{
    "x1-date":pl.col("x1").cast(pl.Date),
    "x2-datetime":pl.col("x2").cast(pl.Datetime),
    "x3-datetime":pl.col("x3").cast(pl.Datetime),
    "x4-datetime-tz":pl.col("x3").cast(pl.Datetime("ms","America/Santiago")),
})

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature rust Related to Rust Polars labels Aug 15, 2023
Copy link
Member

@stinodego stinodego left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! I have a few comments already - will look at implementation later.

@@ -1313,7 +1313,7 @@ def test_quadratic_behavior_4736() -> None:
ldf.select(reduce(add, (pl.col(fld) for fld in ldf.columns)))


@pytest.mark.parametrize("input_dtype", [pl.Utf8, pl.Int64, pl.Float64])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was pl.Utf8 removed here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pl.utf8 is to test that the epoch string behavior works, but I think it should not be supported as it is a string not following date/datetime format. ie: pl.DataFrame({"x1":["1234"]}).with_columns(**{ "x1-date": pl.col("x1").cast(pl.Date)})

Should we keep the support for epoch string?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added it back.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure this is meant to work?

To be honest it feels a bit strange to have that pl.Series([<string>]).cast(pl.Date) could be interpreted as both a format string, or as the number of days since 1970-01-01

I find it hard to believe that anyone would store a timestamp as a string - if they're trying to cast a string to Date, it's almost certainly a date string (and if not, I think it's OK to let the user cast to Int32 first):

# this looks odd to me
In [9]: print(pl.Series(['1234']).cast(pl.Date))
shape: (1,)
Series: '' [date]
[
        1973-05-19
]

My suggestion would be to remove Utf8 from here, treat this as a bug fix, and not double-traverse the array with is_parsable_as_number - @stinodego thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I completely agree, that's why I removed it in the first place but got no feedback on my answer so I put it back, also there is an open issue where this behavior is reported as a bug, #10478

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, thanks - ok maybe let's add that as a test case and check and closes #10478 to the description?

and could we wait for Stijn's response before marking this as resolved please? thanks 🙏

finally, sorry for having been slow on this PR, been on holiday recently

Copy link
Member

@stinodego stinodego Oct 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies, haven't been able to get back to this one. Agree with what you guys are saying - casting "1234" to a Date should fail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries, I understand that there is a lot of work here to be done and also life happens. Will make the changes. Thanks.

py-polars/tests/unit/test_queries.py Show resolved Hide resolved
py-polars/tests/unit/test_queries.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, looks solid now, thanks for persevering with this! If you rebase onto the latest main, the tests should pass

I don't want to block this PR any further, but if I you wanted to do a follow-up (in a separate PR, after this one) to use time_unit instead of tu and time_zone instead of tz in these files, that would be welcome

@@ -1415,6 +1415,23 @@ def test_from_epoch(input_dtype: pl.PolarsDataType) -> None:
_ = ldf.select(pl.from_epoch(ts_col, time_unit="s2")) # type: ignore[call-overload]


def test_from_epoch_str() -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, good one for splitting this out into a separate test!

@brayanjuls
Copy link
Contributor Author

Right, looks solid now, thanks for persevering with this! If you rebase onto the latest main, the tests should pass

I don't want to block this PR any further, but if I you wanted to do a follow-up (in a separate PR, after this one) to use time_unit instead of tu and time_zone instead of tz in these files, that would be welcome

Sure! I will do the rebase as well as fixing the naming.

@brayanjuls
Copy link
Contributor Author

not sure what to make of the test failure
@alexander-beedie do you know?

I do! And a rebase will fix it, heh - not an issue with this code 😄 #11910 (comment)

Just for curiosity how can I run hypothesis test locally ? I tried using pytest to the specific file but they do not run.

…-to-temporal-cast

# Conflicts:
#	crates/polars-arrow/src/compute/cast/mod.rs
#	crates/polars-arrow/src/compute/cast/utf8_to.rs
#	crates/polars-core/src/chunked_array/cast.rs
@MarcoGorelli
Copy link
Collaborator

you'll need to also pass -m 'not slow' on the command line

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Oct 21, 2023

you'll need to also pass -m 'not slow' on the command line

Yup; I tend to run with a blanket -m "" to make sure I always run everything locally :)

Copy link
Collaborator

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

" Files changed 247"

🤔 something may have gone wrong when rebasing

@brayanjuls
Copy link
Contributor Author

" Files changed 247"

🤔 something may have gone wrong when rebasing

Ohh not I did not saw that when pushed the changes. I am sorry. I am going to do a hard reset to the commit previous to the rebase, and I will try to make the rebase again after that.

…nto utf8-to-temporal-cast"

This reverts commit 215990b, reversing
changes made to c5459f1.
@brayanjuls
Copy link
Contributor Author

I am still trying to make this merge work out, I did a revert but just realized that it keeps history so I cannot redo the merge from main because previous changes will go unnoticed. I want to avoid doing hard reset and then push because I don't want to hurt changes. Any Idea on what to do at this point?

I can do rever-reverted but that will only leave me with the rebase that went wrong.

@MarcoGorelli
Copy link
Collaborator

if I were you I'd open a new PR with just a single commit

@brayanjuls
Copy link
Contributor Author

I just opened a new PR #12072, I will close this one.

@brayanjuls brayanjuls closed this Oct 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature rust Related to Rust Polars
Projects
None yet