Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: improving array casting #1865

Open
wants to merge 10 commits into
base: main
Choose a base branch
from
Open

Conversation

FBruzzesi
Copy link
Member

@FBruzzesi FBruzzesi commented Jan 25, 2025

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

There are a few improvements happening in this PR:

  • Current implementation of Array is not on par with polars one, and the support of multi dimension arrays is limited
  • Introduces:
    • Arrow support for converting to array dtypes
    • PySpark support for converting to array dtypes
  • Fixes
    • polars by supporting newer implementation and backporting for older version
    • duckdb by supporting multidimensional arrays
    • pandas: mostly internal code clean up, the only fix is in broadcast_and_extract_dataframe_comparand

Comment on lines -735 to +741
self: Self, inner: DType | type[DType], width: int | None = None
self: Self,
inner: DType | type[DType],
shape: int | tuple[int, ...] | None = None,
Copy link
Member Author

@FBruzzesi FBruzzesi Jan 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a breaking change. However by checking how narwhals is used on github, I could not find any instances of Array

@@ -220,7 +224,7 @@ def broadcast_and_extract_dataframe_comparand(

if isinstance(other, ArrowSeries):
len_other = len(other)
if len_other == 1:
if len_other == 1 and length != 1:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise for list/array types we end up getting the first element

@@ -111,10 +111,13 @@ def native_to_narwhals_dtype(duckdb_dtype: str, version: Version) -> DType:
)
if match_ := re.match(r"(.*)\[\]$", duckdb_dtype):
return dtypes.List(native_to_narwhals_dtype(match_.group(1), version))
if match_ := re.match(r"(\w+)\[(\d+)\]", duckdb_dtype):
if match_ := re.match(r"(\w+)((?:\[\d+\])+)", duckdb_dtype):
Copy link
Member Author

@FBruzzesi FBruzzesi Jan 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Array type in duckdb can have multiple dimensions. The resulting type is: INNER[d1][d2][...]

With this new regex we can parse multiple instances of the dimensions

Comment on lines +177 to +181
duckdb_shape_fmt = "".join(f"[{item}]" for item in dtype.shape) # type: ignore[union-attr]
while isinstance(dtype.inner, dtypes.Array): # type: ignore[union-attr]
dtype = dtype.inner # type: ignore[union-attr]
inner = narwhals_to_native_dtype(dtype.inner, version) # type: ignore[union-attr]
return f"{inner}{duckdb_shape_fmt}"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First creates the shape [d1][d2]... then find the inner type recursively (first being non array)

@@ -160,7 +160,7 @@ def broadcast_and_extract_dataframe_comparand(index: Any, other: Any) -> Any:
if isinstance(other, PandasLikeSeries):
len_other = other.len()

if len_other == 1:
if len_other == 1 and len(index) != 1:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly as for pyarrow, otherwise for list/array types we end up getting the first element

@FBruzzesi FBruzzesi marked this pull request as ready for review January 27, 2025 14:55
)
return NotImplementedError(msg)
if isinstance_or_issubclass(dtype, dtypes.Struct):
if isinstance_or_issubclass(dtype, (dtypes.Struct, dtypes.Array, dtypes.List)):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This diff is quite nice πŸ˜‰

Comment on lines +787 to +794
def __repr__(self) -> str:
# Get leaf type
dtype = self.inner
while isinstance(dtype, Array):
dtype = dtype.inner

def __repr__(self: Self) -> str:
class_name = self.__class__.__name__
return f"{class_name}({self.inner!r}, {self.width})"
return f"{class_name}({dtype!r}, shape={self.shape})"
Copy link
Member Author

@FBruzzesi FBruzzesi Feb 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This diff is that's causing marimo ci to fail - we could provide a fix there πŸ™ˆ

Copy link
Member

@MarcoGorelli MarcoGorelli Feb 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also wouldn't be too bad to override repr in v1 right?

Not saying we necessarily have to do that here though

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we maintain width in v1 then? My understanding is that we are limited to 1d arrays with the v1 implementation.
Or should we just keep the __repr__ and show f"{class_name}({self.inner!r}, {width})" with width=self.shape[0] if shape has only one dimension else the tuple self.shape?

@FBruzzesi
Copy link
Member Author

Sporadically coming back to this to avoid falling too far back in conflicts and diffs to solve.

I would love to have as much support as possible for converting between native and narwhals datatypes/schemas. The only issue is marimo CI failing. So far they seemed pretty open to the changes, so we could just provide a fix for them.

Happy to hear feedback on this

@FBruzzesi FBruzzesi added the enhancement New feature or request label Feb 9, 2025
@MarcoGorelli
Copy link
Member

Thanks - sorry didn't get a chance to think about this too much

Is this mainly for ergonomics?

@FBruzzesi
Copy link
Member Author

FBruzzesi commented Feb 9, 2025

Thanks - sorry didn't get a chance to think about this too much

No worries @MarcoGorelli, we have other priorities and there is no rush! We will get everything sooner or later!

Is this mainly for ergonomics?

I sense that the PR title is completely misleading! Actually it is much more than ergonomics. There are a few improvements happening in this PR:

  • Current implementation of Array is not on par with polars one, and the support of multi dimension arrays is limited
  • Introduces:
    • Arrow support for converting to array dtypes
    • PySpark support for converting to array dtypes
  • Fixes
    • polars by supporting newer implementation and backporting for older version
    • duckdb by supporting multidimensional arrays
    • pandas: mostly internal code clean up, the only fix is in broadcast_and_extract_dataframe_comparand

I will add this description in the PR body on top

@MarcoGorelli
Copy link
Member

just took a look at the failing test:

    def test_complex_data_field_types(self) -> None:
        complex_data = self.get_complex_data()
        field_types = complex_data.get_field_types()
>       snapshot("narwhals.field_types.json", json.dumps(field_types))

I think that, if we provide a fix, this is probably fine to update, it wouldn't a Marimo-user-facing change

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, this is probably good to do sooner rather than later (while nobody is explicitly using the array type's shape/width)

just left some minor comments, but then I think that, if combined with a PR to Marimo, this can be OK - for Marimo users it really only affects the repr, makes it more aligned with Polars

@@ -765,7 +765,7 @@ class Array(NestedType):

Arguments:
inner: The datatype of the values within each array.
width: the length of each array.
shape: the length of each array.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the shape of each array?

Comment on lines +844 to +845
while isinstance(dtype, Array):
dtype = dtype.inner
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we put a limit on the depth of this (just to avoid some infinite loop in some unexpected scenario)?
same with the other place where there's the while loop

Copy link
Member Author

@FBruzzesi FBruzzesi Feb 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disclaimer, from polars: https://github.com/pola-rs/polars/blob/faad12f7277751006e3faebf0fffb1f6bf9aa7e7/py-polars/polars/datatypes/classes.py#L905-L912

and the __init__ is also recursive:

narwhals/narwhals/dtypes.py

Lines 808 to 810 in 981f87c

elif isinstance(shape, tuple) and isinstance(shape[0], int):
if len(shape) > 1:
inner = Array(inner, shape[1:])

In principle, max depth should be len(shape). Is that a good depth to check? WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember the trick, yet I don't feel super comfortable with introducing a fixed number of depth to check for.

Since we know already that the max depth should be len(shape), I think that would be the best limit to introduce, if any.

In practice, if someone generate a datatype to pass which ends up in an infinite recursion, maybe it's good to raise with sure error instead of an AssertionError. I am thinking out loud here though

@MarcoGorelli
Copy link
Member

we'll already need to update the Marimo test suite for #1918 anyway, we could bundle this with that

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants