Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: pandas-wrapped narwhals .over() leads to inconsistent result #1296

Open
mscolnick opened this issue Nov 1, 2024 · 4 comments
Open

Comments

@mscolnick
Copy link
Contributor

Describe the bug

import marimo as mo
import narwhals as nw
import polars as pl
import pandas as pd

data = {"item": ["a", "a", "a", "b", "b", "b"], "start": [6, 7, 8, 14, 15, 16]}

df_pandas = pd.DataFrame(data)
df_polars = pl.DataFrame(data)
nw_df_pandas = nw.from_native(df_pandas)
nw_df_polars = nw.from_native(df_polars)

# Correct, creates a new column with the shifted values
nw_df_polars.with_columns(
    (nw.col("start") * 1 - nw.min("start")).over("item").alias("shifted")
).to_native()

# Incorrect, all starts at 0
nw_df_pandas.with_columns(
    (nw.col("start") * 1 - nw.min("start")).over("item").alias("shifted")
).to_native()
image

Steps or code to reproduce the bug

https://static.marimo.app/static/narwhals-bug-bo1z

Expected results

Expected the results of polars. Values are incrementing.

Actual results

Values are the same

Please run narwhals.show_version() and enter the output below.

System:
    python: 3.12.6 (main, Sep  6 2024, 19:03:47) [Clang 15.0.0 (clang-1500.3.9.4)]
executable: /Users/myles/.cache/uv/archive-v0/sNIzHlLTqqBXxH79AubYO/bin/python
   machine: macOS-14.6-arm64-arm-64bit

Python dependencies:
     narwhals: 1.12.1
       pandas: 2.2.3
       polars: 1.12.0
         cudf: 
        modin: 
      pyarrow: 17.0.0
        numpy: 2.1.2


### Relevant log output

_No response_
@mscolnick mscolnick changed the title [Bug]: pandas-wrapper narwhals .over() leads to inconsistent result [Bug]: pandas-wrapped narwhals .over() leads to inconsistent result Nov 1, 2024
@FBruzzesi
Copy link
Member

FBruzzesi commented Nov 1, 2024

Hey @mscolnick , thanks for reporting the issue.
I have some mixed feelings about this, as I would not be sure what to expect as output.

What I mean by that: over does end up applying only to the reduction nw.min/pl.min and not the entire expression (nor I would know what to expect by that). By changing the code to:

df.with_columns(
    (nw.col("start") - nw.min("start").over("item")).alias("shifted")
)

the result is consistent across backends.

I apologize if I am missing something 🙈

@mscolnick
Copy link
Contributor Author

Ah, I got it. I am probably using the API wrong. I was doing:

(
        pl.col("year") * 12
        + pl.col("month")
        - pl.min("year") * 12
        - pl.min("month")
        + 1
    ).over("project")

but since this worked (in polars), I didn't think twice about doing

(
        pl.col("year") * 12
        + pl.col("month")
        - pl.min("year").over("project") * 12
        - pl.min("month").over("project")
        + 1
    )

Makes sense, and fine to close!

@MarcoGorelli
Copy link
Member

thanks for the report!

if we only support reductions with over, then I think this should raise

@FBruzzesi
Copy link
Member

Our current Expr API for pandas-like and pyarrow does not distinguish between aggregation or not, while for Dask it does.
Standardizing it would definitly help in the long term, and yield more control in each operation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants