feat: improving array casting #1865
- self: Self, inner: DType | type[DType], width: int | None = None
+ self: Self,
+ inner: DType | type[DType],
+ shape: int | tuple[int, ...] | None = None,
This is a breaking change. However, after checking how Narwhals is used on GitHub, I could not find any usages of Array.
@@ -220,7 +224,7 @@ def broadcast_and_extract_dataframe_comparand(

 if isinstance(other, ArrowSeries):
     len_other = len(other)
-    if len_other == 1:
+    if len_other == 1 and length != 1:
Otherwise, for list/array types we end up extracting the first element.
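To illustrate the guard, here is a minimal pure-Python sketch. The helper `extract_comparand` and its list-based inputs are hypothetical stand-ins, not the Narwhals implementation: a length-1 comparand is unwrapped to a scalar only when the target has a different length.

```python
from typing import Any, List


def extract_comparand(other: List[Any], length: int) -> Any:
    """Hypothetical stand-in for the broadcasting step discussed above."""
    if len(other) == 1 and length != 1:
        # Broadcast: a single value is reused for every row of the target.
        return other[0]
    # Same length (including 1 vs 1): keep the column as-is.
    return other


nested = [[1, 2, 3]]  # one row whose value is itself a list
kept = extract_comparand(nested, length=1)
broadcast = extract_comparand([7], length=4)
```

Without the `length != 1` check, `nested` would be unwrapped to its only element, so the one-row list column would lose its outer level.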
@@ -111,10 +111,13 @@ def native_to_narwhals_dtype(duckdb_dtype: str, version: Version) -> DType:
     )
 if match_ := re.match(r"(.*)\[\]$", duckdb_dtype):
     return dtypes.List(native_to_narwhals_dtype(match_.group(1), version))
-if match_ := re.match(r"(\w+)\[(\d+)\]", duckdb_dtype):
+if match_ := re.match(r"(\w+)((?:\[\d+\])+)", duckdb_dtype):
Array types in DuckDB can have multiple dimensions. The resulting type is written INNER[d1][d2][...].
With this new regex we can parse any number of dimension groups.
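As a rough sketch of how the new pattern behaves, the snippet below applies the same regex and then splits the captured dimension groups. The helper name `parse_fixed_size_array` is hypothetical, and it returns the dimensions in the order they are written; how those map onto outer and inner array levels is not addressed here.

```python
import re
from typing import Optional, Tuple

# Same pattern as in the diff: an inner type name followed by one or more [N] groups.
ARRAY_PATTERN = re.compile(r"(\w+)((?:\[\d+\])+)")


def parse_fixed_size_array(duckdb_dtype: str) -> Optional[Tuple[str, Tuple[int, ...]]]:
    """Hypothetical helper: split 'INNER[d1][d2]...' into (inner, dims)."""
    match_ = ARRAY_PATTERN.match(duckdb_dtype)
    if match_ is None:
        return None
    inner, dims = match_.group(1), match_.group(2)
    # Dimensions are returned in the order they appear in the type string.
    return inner, tuple(int(d) for d in re.findall(r"\[(\d+)\]", dims))


parsed = parse_fixed_size_array("INTEGER[3][2]")
```

A variable-length list type such as `INTEGER[]` does not match this pattern, since every bracket pair must contain digits.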
duckdb_shape_fmt = "".join(f"[{item}]" for item in dtype.shape)  # type: ignore[union-attr]
while isinstance(dtype.inner, dtypes.Array):  # type: ignore[union-attr]
    dtype = dtype.inner  # type: ignore[union-attr]
inner = narwhals_to_native_dtype(dtype.inner, version)  # type: ignore[union-attr]
return f"{inner}{duckdb_shape_fmt}"
First create the shape string [d1][d2]...,
then walk down the nested inner types until reaching the first one that is not an Array.
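The two steps can be sketched in isolation as follows. The `Array` dataclass here is a hypothetical stand-in that stores the full shape on the outermost level, and leaf types are plain strings rather than the result of `narwhals_to_native_dtype`.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass
class Array:
    """Hypothetical stand-in for a nested fixed-size array dtype."""

    inner: Union["Array", str]  # leaf types are plain strings in this sketch
    shape: Tuple[int, ...]  # full shape across all nesting levels (assumption)


def to_duckdb_type(dtype: Array) -> str:
    # Step 1: render the shape as [d1][d2]...
    shape_fmt = "".join(f"[{item}]" for item in dtype.shape)
    # Step 2: walk down until the inner type is no longer an Array.
    while isinstance(dtype.inner, Array):
        dtype = dtype.inner
    return f"{dtype.inner}{shape_fmt}"


nested = Array(Array("INTEGER", (2,)), (3, 2))
duckdb_type = to_duckdb_type(nested)
```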
@@ -160,7 +160,7 @@ def broadcast_and_extract_dataframe_comparand(index: Any, other: Any) -> Any:
 if isinstance(other, PandasLikeSeries):
     len_other = other.len()

-    if len_other == 1:
+    if len_other == 1 and len(index) != 1:
As with PyArrow: otherwise, for list/array types we end up extracting the first element.
     )
     return NotImplementedError(msg)
-if isinstance_or_issubclass(dtype, dtypes.Struct):
+if isinstance_or_issubclass(dtype, (dtypes.Struct, dtypes.Array, dtypes.List)):
This diff is quite nice
def __repr__(self) -> str:
    # Get leaf type
    dtype = self.inner
    while isinstance(dtype, Array):
        dtype = dtype.inner

def __repr__(self: Self) -> str:
    class_name = self.__class__.__name__
    return f"{class_name}({self.inner!r}, {self.width})"
    return f"{class_name}({dtype!r}, shape={self.shape})"
This diff is what's causing the Marimo CI to fail - we could provide a fix there
It also wouldn't be too bad to override __repr__ in v1, right?
Not saying we necessarily have to do that here, though.
Should we maintain width in v1 then? My understanding is that we are limited to 1d arrays with the v1 implementation.
Or should we just keep the __repr__ and show f"{class_name}({self.inner!r}, {width})" with width=self.shape[0] if shape has only one dimension, else the tuple self.shape?
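A minimal sketch of that second option, assuming a hypothetical free function `array_repr` rather than the actual `__repr__` method on the dtype class:

```python
from typing import Tuple


def array_repr(class_name: str, inner: object, shape: Tuple[int, ...]) -> str:
    """Hypothetical helper mirroring the proposal above."""
    # Show a bare width for 1-D arrays, otherwise the whole shape tuple.
    width = shape[0] if len(shape) == 1 else shape
    return f"{class_name}({inner!r}, {width})"


one_dim = array_repr("Array", "Int32", (3,))
two_dim = array_repr("Array", "Int32", (3, 2))
```

With this approach the 1-D output keeps the old width-style form, and only multi-dimensional arrays show a tuple.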
I'm coming back to this sporadically to avoid falling too far behind on conflicts and diffs to resolve. I would love to have as much support as possible for converting between native and Narwhals datatypes/schemas. The only issue is the Marimo CI failing. So far they have seemed pretty open to changes, so we could just provide a fix for them. Happy to hear feedback on this.
Thanks - sorry, I didn't get a chance to think about this too much. Is this mainly for ergonomics?
No worries @MarcoGorelli, we have other priorities and there is no rush! We will get everything sooner or later!
I sense that the PR title is completely misleading! Actually it is much more than ergonomics. There are a few improvements happening in this PR:
I will add this description at the top of the PR body.
just took a look at the failing test:
def test_complex_data_field_types(self) -> None:
    complex_data = self.get_complex_data()
    field_types = complex_data.get_field_types()
>   snapshot("narwhals.field_types.json", json.dumps(field_types))
I think that, if we provide a fix, this is probably fine to update; it wouldn't be a Marimo-user-facing change.
thanks, this is probably good to do sooner rather than later (while nobody is explicitly using the array type's shape/width)
just left some minor comments, but then I think that, if combined with a PR to Marimo, this can be OK - for Marimo users it really only affects the repr, makes it more aligned with Polars
@@ -765,7 +765,7 @@ class Array(NestedType):

 Arguments:
     inner: The datatype of the values within each array.
-    width: the length of each array.
+    shape: the length of each array.
the shape of each array?
while isinstance(dtype, Array):
    dtype = dtype.inner
can we put a limit on the depth of this (just to avoid some infinite loop in some unexpected scenario)?
same with the other place where there's the while loop
Disclaimer, from polars: https://github.com/pola-rs/polars/blob/faad12f7277751006e3faebf0fffb1f6bf9aa7e7/py-polars/polars/datatypes/classes.py#L905-L912
and the __init__ is also recursive:
Lines 808 to 810 in 981f87c
elif isinstance(shape, tuple) and isinstance(shape[0], int):
    if len(shape) > 1:
        inner = Array(inner, shape[1:])
In principle, max depth should be len(shape). Is that a good depth to check? WDYT?
i was thinking something like
I remember the trick, yet I don't feel super comfortable with introducing a fixed depth limit to check for.
Since we already know that the max depth should be len(shape), I think that would be the best limit to introduce, if any.
In practice, if someone generates a datatype that ends up in an infinite recursion, maybe it's good to raise with such an error instead of an AssertionError. I am thinking out loud here though.
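One way to picture the len(shape) bound is the sketch below. The `Array` class is a hypothetical minimal dtype that nests itself once per extra dimension, and the choice of `RecursionError` is an assumption made for illustration; the thread leaves the exact error type open.

```python
from typing import Tuple, Union


class Array:
    """Hypothetical minimal dtype: nests itself once per extra dimension."""

    def __init__(
        self, inner: Union["Array", str], shape: Union[int, Tuple[int, ...]]
    ) -> None:
        if isinstance(shape, int):
            shape = (shape,)
        if len(shape) > 1:
            # Recursive construction, one level per remaining dimension.
            inner = Array(inner, shape[1:])
        self.inner = inner
        self.shape = shape


def leaf_dtype(dtype: Array) -> str:
    current: Union[Array, str] = dtype
    # At most len(shape) levels of nesting are expected.
    for _ in range(len(dtype.shape)):
        if not isinstance(current, Array):
            break
        current = current.inner
    if isinstance(current, Array):
        # Error type chosen for this sketch only.
        raise RecursionError("Array nesting is deeper than its shape implies")
    return current


leaf = leaf_dtype(Array("Int32", (3, 2)))
```

The loop is bounded by the shape, so a malformed dtype (for example, one whose `inner` points back at itself) fails with an explicit error instead of looping forever.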
we'll already need to update the Marimo test suite for #1918 anyway, so we could bundle this with that
What type of PR is this? (check all applicable)
Checklist
If you have comments or can explain your changes, please do so below
There are a few improvements happening in this PR:
broadcast_and_extract_dataframe_comparand