feat: improving array casting #1865

FBruzzesi · 2025-01-25T15:24:34Z

What type of PR is this? (check all applicable)

Checklist

Code follows style guide (ruff)
Tests added
Documented the changes

If you have comments or can explain your changes, please do so below

There are a few improvements happening in this PR:

The current implementation of Array is not on par with the polars one, and support for multi-dimensional arrays is limited.
Introduces:
- Arrow support for converting to array dtypes
- PySpark support for converting to array dtypes
Fixes:
- polars: supports the newer implementation and backports it for older versions
- duckdb: supports multidimensional arrays
- pandas: mostly internal code clean-up; the only fix is in broadcast_and_extract_dataframe_comparand

FBruzzesi · 2025-01-25T15:25:47Z

narwhals/dtypes.py

-        self: Self, inner: DType | type[DType], width: int | None = None
+        self: Self,
+        inner: DType | type[DType],
+        shape: int | tuple[int, ...] | None = None,


This is a breaking change. However, after checking how narwhals is used on GitHub, I could not find any usage of Array.
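To make the signature change concrete, here is a minimal sketch of how an int or tuple `shape` could be normalised into nested array levels. `SketchArray` is a hypothetical stand-in, not the narwhals class; it only mirrors the `inner = Array(inner, shape[1:])` recursion from this PR, and the validation shown is an assumption.

```python
from __future__ import annotations


class SketchArray:
    """Hypothetical stand-in for a fixed-shape array dtype (not the narwhals class)."""

    def __init__(self, inner: object, shape: int | tuple[int, ...]) -> None:
        if isinstance(shape, int):
            # A bare int behaves like the old `width`: a 1-D array.
            shape = (shape,)
        if not shape or not all(isinstance(dim, int) for dim in shape):
            raise TypeError(f"invalid shape: {shape!r}")
        if len(shape) > 1:
            # Each extra dimension wraps the inner type in another array level.
            inner = SketchArray(inner, shape[1:])
        self.inner = inner
        self.shape = shape


one_d = SketchArray("Int32", 3)
two_d = SketchArray("Int32", (3, 2))
print(one_d.shape)        # (3,)
print(two_d.shape)        # (3, 2)
print(two_d.inner.shape)  # (2,)
```

Under this sketch, passing `3` and passing `(3,)` give the same result, which is roughly what a caller migrating from `width` would rely on.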

FBruzzesi · 2025-01-27T12:32:53Z

narwhals/_arrow/utils.py

@@ -220,7 +224,7 @@ def broadcast_and_extract_dataframe_comparand(

    if isinstance(other, ArrowSeries):
        len_other = len(other)
-        if len_other == 1:
+        if len_other == 1 and length != 1:


Otherwise, for list/array types, we end up getting the first element.
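A rough illustration of why the extra `length != 1` guard matters, using plain Python lists instead of Arrow arrays. `extract_comparand` is a hypothetical helper and is not the real `broadcast_and_extract_dataframe_comparand`; it only mimics the broadcast-or-pass-through decision.

```python
from typing import Any


def extract_comparand(values: list[Any], length: int) -> Any:
    """Hypothetical helper: broadcast a length-1 column, otherwise pass it through."""
    if len(values) == 1 and length != 1:
        # Safe to treat the single row as a scalar and repeat it.
        return [values[0]] * length
    # Lengths already match (including 1 vs 1): keep the column untouched.
    return values


# One row whose value is itself a list, compared against a one-row frame.
column = [[1, 2, 3]]
print(extract_comparand(column, length=1))  # [[1, 2, 3]]
print(extract_comparand(column, length=2))  # [[1, 2, 3], [1, 2, 3]]
```

Without the guard, the one-row case would go down the scalar path and hand back `[1, 2, 3]`, i.e. the contents of the first row rather than a one-row column.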

FBruzzesi · 2025-01-27T12:33:47Z

narwhals/_duckdb/utils.py

@@ -111,10 +111,13 @@ def native_to_narwhals_dtype(duckdb_dtype: str, version: Version) -> DType:
        )
    if match_ := re.match(r"(.*)\[\]$", duckdb_dtype):
        return dtypes.List(native_to_narwhals_dtype(match_.group(1), version))
-    if match_ := re.match(r"(\w+)\[(\d+)\]", duckdb_dtype):
+    if match_ := re.match(r"(\w+)((?:\[\d+\])+)", duckdb_dtype):


Array types in duckdb can have multiple dimensions. The resulting type is: INNER[d1][d2][...]

With this new regex we can parse any number of dimensions.
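A small sketch of what the new pattern captures. `parse_array_type` is a hypothetical helper for illustration only; it returns the dimensions in the order they appear in the type string and does not address how they map onto the narwhals `shape`.

```python
import re

# Same pattern as in the diff above.
ARRAY_PATTERN = re.compile(r"(\w+)((?:\[\d+\])+)")


def parse_array_type(duckdb_dtype: str) -> tuple[str, tuple[int, ...]]:
    """Split e.g. 'INTEGER[3][2]' into ('INTEGER', (3, 2)). Hypothetical helper."""
    match_ = ARRAY_PATTERN.match(duckdb_dtype)
    if match_ is None:
        raise ValueError(f"not a fixed-size array type: {duckdb_dtype!r}")
    inner, dims = match_.group(1), match_.group(2)
    # group(2) is the whole "[d1][d2]..." suffix; pull out each number.
    shape = tuple(int(dim) for dim in re.findall(r"\[(\d+)\]", dims))
    return inner, shape


print(parse_array_type("INTEGER[3]"))     # ('INTEGER', (3,))
print(parse_array_type("INTEGER[3][2]"))  # ('INTEGER', (3, 2))
```

The old pattern `(\w+)\[(\d+)\]` would only have captured the first `[3]` of `INTEGER[3][2]`.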

FBruzzesi · 2025-01-27T12:34:48Z

narwhals/_duckdb/utils.py

+        duckdb_shape_fmt = "".join(f"[{item}]" for item in dtype.shape)  # type: ignore[union-attr]
+        while isinstance(dtype.inner, dtypes.Array):  # type: ignore[union-attr]
+            dtype = dtype.inner  # type: ignore[union-attr]
+        inner = narwhals_to_native_dtype(dtype.inner, version)  # type: ignore[union-attr]
+        return f"{inner}{duckdb_shape_fmt}"


First create the shape suffix [d1][d2]..., then walk down the nested inner types until reaching the first one that is not an array.
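The same two steps on a toy type, as a sketch. `SketchArray` and `to_duckdb_type` are hypothetical names, and the leaf type is just a string here, whereas the real code calls `narwhals_to_native_dtype` on it.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SketchArray:
    """Hypothetical nested array dtype: `inner` is a leaf name or another SketchArray."""

    inner: SketchArray | str
    shape: tuple[int, ...]


def to_duckdb_type(dtype: SketchArray) -> str:
    # Build the "[d1][d2]..." suffix from the full shape of the outermost array.
    shape_fmt = "".join(f"[{dim}]" for dim in dtype.shape)
    # Walk down the nesting until `inner` is no longer an array (the leaf type).
    while isinstance(dtype.inner, SketchArray):
        dtype = dtype.inner
    return f"{dtype.inner}{shape_fmt}"


nested = SketchArray(SketchArray("INTEGER", (2,)), (3, 2))
print(to_duckdb_type(nested))  # INTEGER[3][2]
```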

FBruzzesi · 2025-01-27T12:35:01Z

narwhals/_pandas_like/utils.py

@@ -160,7 +160,7 @@ def broadcast_and_extract_dataframe_comparand(index: Any, other: Any) -> Any:
    if isinstance(other, PandasLikeSeries):
        len_other = other.len()

-        if len_other == 1:
+        if len_other == 1 and len(index) != 1:


Same as for pyarrow: otherwise, for list/array types, we end up getting the first element.

narwhals/_spark_like/utils.py

FBruzzesi · 2025-02-01T09:07:52Z

narwhals/_pandas_like/utils.py

-            )
-            return NotImplementedError(msg)
-    if isinstance_or_issubclass(dtype, dtypes.Struct):
+    if isinstance_or_issubclass(dtype, (dtypes.Struct, dtypes.Array, dtypes.List)):


This diff is quite nice 😉

FBruzzesi · 2025-02-01T09:09:57Z

narwhals/dtypes.py

+    def __repr__(self) -> str:
+        # Get leaf type
+        dtype = self.inner
+        while isinstance(dtype, Array):
+            dtype = dtype.inner

-    def __repr__(self: Self) -> str:
        class_name = self.__class__.__name__
-        return f"{class_name}({self.inner!r}, {self.width})"
+        return f"{class_name}({dtype!r}, shape={self.shape})"


This diff is what's causing marimo CI to fail - we could provide a fix there 🙈

It also wouldn't be too bad to override repr in v1 right?

Not saying we necessarily have to do that here though

Should we maintain width in v1 then? My understanding is that we are limited to 1d arrays with the v1 implementation.
Or should we just keep the __repr__ and show f"{class_name}({self.inner!r}, {width})", with width=self.shape[0] if the shape has only one dimension, else the tuple self.shape?

FBruzzesi · 2025-02-07T22:04:59Z

I keep coming back to this sporadically to avoid falling too far behind on conflicts and diffs to resolve.

I would love to have as much support as possible for converting between native and narwhals datatypes/schemas. The only issue is marimo CI failing. So far they have seemed pretty open to changes, so we could just provide a fix for them.

Happy to hear feedback on this

MarcoGorelli · 2025-02-09T19:30:00Z

Thanks - sorry didn't get a chance to think about this too much

Is this mainly for ergonomics?

FBruzzesi · 2025-02-09T19:38:18Z

> Thanks - sorry didn't get a chance to think about this too much

No worries @MarcoGorelli, we have other priorities and there is no rush! We will get everything sooner or later!

> Is this mainly for ergonomics?

I sense that the PR title is completely misleading! Actually it is much more than ergonomics. There are a few improvements happening in this PR:

The current implementation of Array is not on par with the polars one, and support for multi-dimensional arrays is limited.
Introduces:
- Arrow support for converting to array dtypes
- PySpark support for converting to array dtypes
Fixes:
- polars: supports the newer implementation and backports it for older versions
- duckdb: supports multidimensional arrays
- pandas: mostly internal code clean-up; the only fix is in broadcast_and_extract_dataframe_comparand

I will add this description at the top of the PR body

MarcoGorelli · 2025-02-09T19:54:04Z

just took a look at the failing test:

    def test_complex_data_field_types(self) -> None:
        complex_data = self.get_complex_data()
        field_types = complex_data.get_field_types()
>       snapshot("narwhals.field_types.json", json.dumps(field_types))

I think that, if we provide a fix, this is probably fine to update; it wouldn't be a Marimo-user-facing change

MarcoGorelli

thanks, this is probably good to do sooner rather than later (while nobody is explicitly using the array type's shape/width)

just left some minor comments, but then I think that, if combined with a PR to Marimo, this can be OK - for Marimo users it really only affects the repr, makes it more aligned with Polars

MarcoGorelli · 2025-02-09T20:02:31Z

narwhals/dtypes.py

@@ -765,7 +765,7 @@ class Array(NestedType):

    Arguments:
        inner: The datatype of the values within each array.
-        width: the length of each array.
+        shape: the length of each array.


the shape of each array?

MarcoGorelli · 2025-02-09T20:06:12Z

narwhals/dtypes.py

+        while isinstance(dtype, Array):
+            dtype = dtype.inner


can we put a limit on the depth of this (just to avoid some infinite loop in some unexpected scenario)?
same with the other place where there's the while loop

Disclaimer, from polars: https://github.com/pola-rs/polars/blob/faad12f7277751006e3faebf0fffb1f6bf9aa7e7/py-polars/polars/datatypes/classes.py#L905-L912

and the __init__ is also recursive:

narwhals/narwhals/dtypes.py

Lines 808 to 810 in 981f87c

    elif isinstance(shape, tuple) and isinstance(shape[0], int):
        if len(shape) > 1:
            inner = Array(inner, shape[1:])

In principle, max depth should be len(shape). Is that a good depth to check? WDYT?

i was thinking something like

https://github.com/psf/black/blob/5f2370170819d282ec14dcda70f963d7574271e2/src/black/handle_ipynb_magics.py#L217-L226

I remember the trick, yet I don't feel super comfortable with introducing a fixed depth limit to check for.

Since we already know that the max depth should be len(shape), I think that would be the best limit to introduce, if any.

In practice, if someone generates a datatype that ends up in an infinite recursion, maybe it's better to raise a proper error instead of an AssertionError. I am thinking out loud here though
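One way the len(shape) limit could look, as a sketch only. `SketchArray` and `leaf_dtype` are hypothetical names, and raising `ValueError` is an arbitrary choice for illustration, not something decided in this thread.

```python
from __future__ import annotations


class SketchArray:
    """Hypothetical stand-in for an array dtype with `inner` and `shape`."""

    def __init__(self, inner: object, shape: tuple[int, ...]) -> None:
        self.inner = inner
        self.shape = shape


def leaf_dtype(dtype: SketchArray) -> object:
    """Return the first non-array inner type, visiting at most len(shape) levels."""
    max_depth = len(dtype.shape)
    current: object = dtype
    for _ in range(max_depth):
        if not isinstance(current, SketchArray):
            return current
        current = current.inner
    if isinstance(current, SketchArray):
        # More nesting than the shape allows: fail loudly instead of looping forever.
        msg = f"array nesting exceeds expected depth {max_depth}"
        raise ValueError(msg)
    return current


well_formed = SketchArray(SketchArray("Int32", (2,)), (3, 2))
print(leaf_dtype(well_formed))  # Int32
```

A self-referencing dtype (for example one whose `inner` points back at itself) would hit the `ValueError` after len(shape) steps rather than spinning in a `while` loop.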

MarcoGorelli · 2025-02-09T20:32:54Z

we'll already need to update the Marimo test suite for #1918 anyway, we could bundle this with that

FBruzzesi added 2 commits January 25, 2025 15:48

feat: improving array casting

1b8fd7d

Merge branch 'main' into feat/improving-array-casting

569c204

FBruzzesi added 3 commits January 25, 2025 16:45

WIP

ab25f4d

update docstring

65287c9

merge main

7c84531

FBruzzesi marked this pull request as ready for review January 27, 2025 14:55

FBruzzesi added 3 commits January 27, 2025 16:10

rm imports

31dd90f

Merge branch 'main' into feat/improving-array-casting

044db09

merge main

35c2a2e

merge main

c79507d

merge main

981f87c

FBruzzesi added the enhancement New feature or request label Feb 9, 2025

MarcoGorelli approved these changes Feb 9, 2025
