[Python] Add date32 support to __dataframe__ protocol #39539

Open
WillAyd opened this issue Jan 9, 2024 · 12 comments

@WillAyd
Contributor

WillAyd commented Jan 9, 2024

Describe the enhancement requested

>>> import datetime
>>> import pyarrow as pa
>>> pa.Table.from_pydict({"col": [datetime.date(2024, 1, 1)]}).__dataframe__().get_column(0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/willayd/mambaforge/envs/pantab-dev/lib/python3.12/site-packages/pyarrow/interchange/dataframe.py", line 139, in get_column
    return _PyArrowColumn(self._df.column(i),
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willayd/mambaforge/envs/pantab-dev/lib/python3.12/site-packages/pyarrow/interchange/column.py", line 239, in __init__
    self._dtype = self._dtype_from_arrowdtype(dtype, bit_width)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willayd/mambaforge/envs/pantab-dev/lib/python3.12/site-packages/pyarrow/interchange/column.py", line 322, in _dtype_from_arrowdtype
    raise ValueError(
ValueError: Data type date32[day] not supported by interchange protocol

Component(s)

Python

@kou changed the title from "Add date32 support to __dataframe__ protocol" to "[Python] Add date32 support to __dataframe__ protocol" on Jan 10, 2024

@AlenkaF
Member

AlenkaF commented Jan 10, 2024

It would also be good to add the 64-bit date type, the 32-bit and 64-bit time types, and the duration type.

Contributions are more than welcome as there is no immediate plan to work on this. I can guide anybody interested! ❤️

@jorisvandenbossche
Member

The interchange protocol currently doesn't define a date type AFAIK (https://data-apis.org/dataframe-protocol/latest/API.html#interface), so do you expect it to be written as DATETIME?

@AlenkaF
Member

AlenkaF commented Jan 10, 2024

Date and Duration data type classes are added to the staging branch of the protocol: https://github.com/data-apis/dataframe-api/blob/c5f08352e0a1d25387fe1737ffe9cccb36f554f7/spec/API_specification/dataframe_api/dtypes.py#L50

which I guess should be the draft docs page? https://data-apis.org/dataframe-api/draft/API_specification/index.html

But I am not sure if this will move forward soon.

@jorisvandenbossche
Member

> Date and Duration data type classes are added to the staging branch of the protocol: https://github.com/data-apis/dataframe-api/blob/c5f08352e0a1d25387fe1737ffe9cccb36f554f7/spec/API_specification/dataframe_api/dtypes.py#L50

That's for the standard API, though, not for the interchange protocol (I was confused as well, and so wrote a wrong comment on the PR adding it asking for clarification ;) -> data-apis/dataframe-api#197)

> Yes, that was my idea. Similar to what polars does: https://github.com/pola-rs/polars/blob/2b43fc1ac1af84ed118ff3f8840d328a12c35510/py-polars/polars/interchange/utils.py#L35-L54

Personally I think it would be better if this was first clarified or added in the interchange protocol. While for date it does make some sense (as you could just see it as a different resolution of datetime), duration is really different. And for example the pandas implementation also wouldn't support consuming duration. And pyarrow only supports consuming datetime as timestamp, not even date.
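
To make the "different resolution of datetime" point concrete, here is a minimal standard-library sketch. It assumes date32 values are days since 1970-01-01 (as in the Arrow format); the helper name `date32_to_timestamp_s` is hypothetical and not part of pyarrow or the protocol.

```python
from datetime import date, datetime, timezone

EPOCH_DATE = date(1970, 1, 1)
SECONDS_PER_DAY = 86_400


def date32_to_timestamp_s(days: int) -> int:
    """Hypothetical helper: widen a date32 value (days since epoch)
    to a seconds-resolution timestamp at midnight UTC."""
    return days * SECONDS_PER_DAY


# A date is a point on the same epoch-based axis as a timestamp,
# just counted in days instead of seconds.
days = (date(2024, 1, 1) - EPOCH_DATE).days
ts = date32_to_timestamp_s(days)
as_datetime = datetime.fromtimestamp(ts, tz=timezone.utc)
print(days, ts, as_datetime.isoformat())
```

A duration has no epoch to anchor it, so no equivalent widening to a timestamp exists, which is one way to read the distinction drawn above.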

@AlenkaF
Member

AlenkaF commented Jan 10, 2024

> That's for the standard API, though, not for the interchange protocol (I was confused as well, and so wrote a wrong comment on the PR adding it asking for clarification ;) -> data-apis/dataframe-api#197)

Oooh, sorry for pointing you in the wrong direction!
I didn't see it at the time.

> Personally I think it would be better if this was first clarified or added in the interchange protocol.

That does make sense 👍

@WillAyd
Contributor Author

WillAyd commented Jan 10, 2024

@jorisvandenbossche my expectation was that the buffer would contain 32-bit integers (date64 would be 64-bit). The consumer would be responsible for interpreting those integers as the appropriate dates, based on the precision defined in the format string.
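
A rough sketch of what that consumer-side interpretation could look like, using only the standard library. The function name `decode_date_buffer` is hypothetical, native byte order is an assumption, and the format strings `tdD` (date32, days) and `tdm` (date64, milliseconds) are the Arrow C Data Interface strings mentioned later in this thread.

```python
import struct
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)

# Arrow C format string -> (struct type code, item size in bytes, timedelta keyword)
_DATE_LAYOUTS = {
    "tdD": ("i", 4, "days"),          # date32: int32 days since epoch
    "tdm": ("q", 8, "milliseconds"),  # date64: int64 milliseconds since epoch
}


def decode_date_buffer(buf: bytes, format_str: str) -> list:
    """Hypothetical consumer helper: read raw integers from a data buffer
    and interpret them according to the format string."""
    if format_str not in _DATE_LAYOUTS:
        raise NotImplementedError(f"Unsupported format string: {format_str}")
    code, size, unit = _DATE_LAYOUTS[format_str]
    count = len(buf) // size
    # "=" means native byte order with standard sizes (an assumption here).
    values = struct.unpack(f"={count}{code}", buf)
    return [EPOCH + timedelta(**{unit: v}) for v in values]


days = (date(2024, 1, 1) - EPOCH).days
buf32 = struct.pack("=2i", days, 0)
buf64 = struct.pack("=1q", days * 86_400_000)
print(decode_date_buffer(buf32, "tdD"))
print(decode_date_buffer(buf64, "tdm"))
```

The same integer buffer yields different dates depending on the format string, which is why the consumer cannot ignore it.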

@jorisvandenbossche
Member

(existing upstream issue about duration/timedelta: data-apis/dataframe-api#329)

@jonmmease

Let me know if you think this is a distinct issue, but I ran into a different error message when converting a Date32 from Polars through the DataFrame interchange protocol.

import datetime
import polars as pl
from pyarrow.interchange import from_dataframe

from_dataframe(pl.DataFrame({"date": [datetime.date(2024, 3, 22)]}))
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
...
File [.../envs/default/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:563]
in validity_buffer_nan_sentinel(data_pa_buffer, data_type, describe_null, length, offset, allow_copy)
    537 """
    538 Build a PyArrow buffer from NaN or sentinel values.
    539 
   (...)
    560 pa.Buffer
    561 """
    562 kind, bit_width, _, _ = data_type
--> 563 data_dtype = map_date_type(data_type)
    564 null_kind, sentinel_val = describe_null
    566 # Check for float NaN values

File [...envs/default/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:332], in map_date_type(data_type)
    329 kind, bit_width, f_string, _ = data_type
    331 if kind == DtypeKind.DATETIME:
--> 332     unit, tz = parse_datetime_format_str(f_string)
    333     return pa.timestamp(unit, tz=tz)
    334 else:

File [.../envs/default/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:324], in parse_datetime_format_str(format_str)
    320         unit += "s"
    322     return unit, tz
--> 324 raise NotImplementedError(f"DateTime kind is not supported: {format_str}")

NotImplementedError: DateTime kind is not supported: tdD

What's the current thinking on the best way forward to support this?

@AlenkaF
Member

AlenkaF commented Mar 25, 2024

Thank you for contributing to the discussion @jonmmease. I see that libraries are working around this by exposing date and time types as the protocol's DATETIME data type with an Apache Arrow C Data Interface format string (for example, tdD for date32 and tdm for date64; see the Polars code and the pandas code).

I do not mind going about it in a similar way in PyArrow until date is added to the dataframe protocol spec, and also adding the option to consume this data type. It would be ideal, though, if this were clarified and settled in the protocol first.

@jorisvandenbossche, what do you think?
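
As an illustration of the workaround described above, a consumer could recognize the Arrow C Data Interface format strings for date and time types instead of only timestamps. This is a sketch under assumptions: `TemporalFormat` and `parse_temporal_format` are hypothetical names, not pyarrow API, and durations (`tD...`) are deliberately left unsupported.

```python
from typing import NamedTuple, Optional


class TemporalFormat(NamedTuple):
    arrow_type: str     # e.g. "date32", "time64", "timestamp"
    unit: str           # "day", "s", "ms", "us" or "ns"
    tz: Optional[str]   # only meaningful for timestamps


_UNITS = {"s": "s", "m": "ms", "u": "us", "n": "ns"}


def parse_temporal_format(format_str: str) -> TemporalFormat:
    """Hypothetical parser for Arrow C format strings of temporal types."""
    if format_str == "tdD":
        return TemporalFormat("date32", "day", None)
    if format_str == "tdm":
        return TemporalFormat("date64", "ms", None)
    # Timestamps look like "ts{unit}:{timezone}"; times look like "tt{unit}".
    head, _, tz = format_str.partition(":")
    if len(head) == 3 and head[2] in _UNITS:
        unit = _UNITS[head[2]]
        if head.startswith("ts"):
            return TemporalFormat("timestamp", unit, tz or None)
        if head.startswith("tt") and not tz:
            arrow_type = "time32" if unit in ("s", "ms") else "time64"
            return TemporalFormat(arrow_type, unit, None)
    raise NotImplementedError(f"DateTime kind is not supported: {format_str}")


for fmt in ("tdD", "tdm", "ttu", "tsn:UTC"):
    print(fmt, parse_temporal_format(fmt))
```

Whether a consumer should accept these strings under the DATETIME kind is exactly the question the spec would need to settle.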

@jorisvandenbossche
Member

I would still prefer someone to first do a PR to the spec to add this. If it is just clarifying that the existing DATETIME dtype kind can also be used for other Arrow date and time dtypes, that should be relatively easy.

> I see that libraries are working around this by defining date and time types as protocol DATETIME data type with Apache Arrow C Data Interface format string (example tdD for date32, tdm for date64 etc, see Polars code and pandas code).

AFAIK pandas doesn't actually support this for duration, at least not for the default timedelta dtype (from testing with pandas main):

In [7]: from pyarrow.interchange import from_dataframe

In [8]: from_dataframe(pd.DataFrame({'a': pd.timedelta_range(0, "1 days", freq='s')}))
...
File ~/scipy/repos/pandas/pandas/core/interchange/utils.py:147, in dtype_to_arrow_c_fmt(dtype)
    144 elif isinstance(dtype, DatetimeTZDtype):
    145     return ArrowCTypes.TIMESTAMP.format(resolution=dtype.unit[0], tz=dtype.tz)
--> 147 raise NotImplementedError(
    148     f"Conversion of {dtype} to Arrow C format string is not implemented."
    149 )

NotImplementedError: Conversion of timedelta64[ns] to Arrow C format string is not implemented.

FWIW, my proposal to add support for the Arrow PyCapsule protocol to the interchange standard (data-apis/dataframe-api#342) would also solve this for the case of polars and pyarrow, as both are based on Arrow memory and could easily interchange those data types.
(although that of course requires polars to implement it, and based on pola-rs/polars#12530 that is still WIP I think)

We could start checking for that protocol in pyarrow.interchange.from_dataframe, although that would also be an extension not covered by the official spec.
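
A minimal sketch of what such a check could look like, purely as an illustration. The function `choose_import_path` and the two demo classes are hypothetical; only the dunder names `__arrow_c_stream__`, `__arrow_c_array__` and `__dataframe__` come from the respective protocols.

```python
def choose_import_path(obj: object) -> str:
    """Hypothetical dispatch: prefer the Arrow PyCapsule protocol when the
    producer offers it, and fall back to the interchange protocol otherwise."""
    if hasattr(obj, "__arrow_c_stream__"):
        return "arrow_c_stream"
    if hasattr(obj, "__arrow_c_array__"):
        return "arrow_c_array"
    if hasattr(obj, "__dataframe__"):
        return "dataframe_interchange"
    raise TypeError(f"{type(obj).__name__} supports neither protocol")


class ArrowBacked:
    """Stand-in for an Arrow-memory producer exposing both protocols."""

    def __arrow_c_stream__(self, requested_schema=None):
        raise NotImplementedError("placeholder for the demo")

    def __dataframe__(self, nan_as_null=False, allow_copy=True):
        raise NotImplementedError("placeholder for the demo")


class InterchangeOnly:
    """Stand-in for a producer exposing only the interchange protocol."""

    def __dataframe__(self, nan_as_null=False, allow_copy=True):
        raise NotImplementedError("placeholder for the demo")


print(choose_import_path(ArrowBacked()))
print(choose_import_path(InterchangeOnly()))
```

With this ordering, Arrow-native types such as date32 would travel through the capsule path and never reach the interchange dtype mapping.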

@AlenkaF
Member

AlenkaF commented Mar 25, 2024

Thank you for the clarification, Joris!

I propose we start with a PR to the dataframe protocol specification stating that the existing DATETIME dtype kind can also be used for other Arrow date and time dtypes (not duration). I will do this today/tomorrow.

The proposal to add support for the Arrow PyCapsule protocol to the interchange standard would be great in my opinion. I hope it will move forward; otherwise the libs involved will start checking for the protocol by themselves, like you have suggested.
