join fails because of an uninstructed cast from int to array[int, x] on 1.13.1 #19763

Closed
2 tasks done
TNieuwdorp opened this issue Nov 13, 2024 · 7 comments · Fixed by #19776
Labels
bug Something isn't working python Related to Python Polars regression Issue introduced by a new release

Comments

@TNieuwdorp
Contributor

TNieuwdorp commented Nov 13, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Haven't been able to construct one yet. Working on it.

Log output

No response

Issue description

When joining two LazyFrames, one large, one smaller, the datatype of an int column in the larger frame suddenly turns into an array, causing a data type mismatch in the join.

The state of the two LazyFrames before the join:
image

The error of the join:
image

This error also occurs when optimizations are turned off:
image

Explicitly casting the columns to int32 before the operation seems to work:
image

Expected behavior

I expect the data type to not be cast to an array, and the join to succeed.

Installed versions

>>> pl.show_versions()
--------Version info---------
Polars:              1.13.1
Index type:          UInt32
Platform:            macOS-15.0.1-x86_64-i386-64bit
Python:              3.12.7 (main, Oct  1 2024, 02:05:46) [Clang 15.0.0 (clang-1500.3.9.4)]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2023.12.2
gevent               <not installed>
great_tables         0.10.0
matplotlib           3.8.4
nest_asyncio         1.6.0
numpy                1.26.4
openpyxl             3.1.2
pandas               2.2.2
pyarrow              14.0.2
pydantic             1.10.15
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             0.8.2
xlsxwriter           <not installed>
</details>
@TNieuwdorp added the bug, needs triage, and python labels Nov 13, 2024
@alexander-beedie
Collaborator

alexander-beedie commented Nov 14, 2024

This should have been fixed by #19753, and we have made a new patch release available; take a look at 1.13.1? 👍

@TNieuwdorp changed the title from "join fails because of an uninstructed cast from int to array[int, x] on 1.13.0" to "join fails because of an uninstructed cast from int to array[int, x] on 1.13.1" Nov 14, 2024
@TNieuwdorp
Contributor Author

@alexander-beedie Unfortunately that patch doesn't fix the problem, and it still occurs exactly as described.

@alexander-beedie
Collaborator

Hmm, surprising! Are you able to create a small reproducible test-case that you can paste in to the Issue report?

@nameexhaustion
Collaborator

nameexhaustion commented Nov 14, 2024

Did you by any chance happen to have exploded the pgm_id column at some point?

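I think I can produce the same error message with this:

import polars as pl

# Build a one-column LazyFrame by exploding a fixed-width Array literal,
# so "k" should end up holding plain Int64 values.
q = pl.LazyFrame().select(
    pl.lit(pl.Series([[1, 1], [2, 2]], dtype=pl.Array(pl.Int64, 2)))
    .explode()
    .alias("k")
)

# Join on "k" against an ordinary integer column.
q = q.join(pl.LazyFrame({"k": [1, 2]}), on="k")

print(q.collect())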

@nameexhaustion self-assigned this Nov 14, 2024
@nameexhaustion added the regression label and removed the needs triage label Nov 14, 2024
@TNieuwdorp
Contributor Author

@nameexhaustion Let me know if you get stuck on this, I might be able to dig a bit deeper in our code to try and figure out the source of the data and order of operations that are applied that leads up to this.

@TNieuwdorp
Contributor Author

> Did you by any chance happen to have exploded the pgm_id column at some point?
>
> I think I can produce the same error message with this -
>
> q = pl.LazyFrame().select(
>     pl.lit(pl.Series([[1, 1], [2, 2]], dtype=pl.Array(pl.Int64, 2)))
>     .explode()
>     .alias("k")
> )
>
> q = q.join(pl.LazyFrame({"k": [1, 2]}), on="k")
>
> print(q.collect())

Checking...

@TNieuwdorp
Contributor Author

Yes, explode() is applied to the data!
image

The original data comes from a structured numpy array (although since you reproduced it without that, it might not be relevant)
image
