Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Allow chaining joins in SQL syntax #14220

Closed

Conversation

michiel-de-muynck
Copy link
Contributor

Fixes #14217

With this fix, any of the previously joined table names can be used to refer to any of the previously joined tables. For example, in the query

SELECT * FROM df1
INNER JOIN df2 ON ...
INNER JOIN df3 ON <join condition>

in the join condition marked <join condition> you can use either df1. or df2. to refer to any column on the left side of the join. For instance, you can use df1.x even if column x only exists in df2. You can also use df2.y even if column y only exists in df1.

That is more generous than should be the case of course, in the sense that this new implementation accepts some invalid queries where it should give an error. However, in the current implementation, you can already use df1. to refer to any column (of dataframe df1 or df2) but cannot use df2 at all (see issue #14217).

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Feb 2, 2024

Nice, I'll take a look; had almost finished fixing this (or something very similar) and then got wildly distracted by calamine bindings; I'll check this first thing tomorrow against some additional tests I added for verification - if it looks good then it's all yours 😉👌

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Feb 4, 2024

Hi @migi, thanks for the PR - I took a look and tested it out but currently it only addresses a subset of the wider issue and, in its current state, doesn't actually return the correct result. The added unit test result should include an additional two columns for this query (on top of "a", "b", "c", "d"); specifically, both "b" and "c" should be repeated - here's a pure PostgreSQL version of the unit test to demonstrate:

Screenshot 2024-02-04 at 20 58 47

I've also been working on fixing the current JOIN issues, accounting for the above (and some additional edge-cases connected with column resolution), and should have something ready in the near future (we have also been discussing how to handle duplicate columns in a way that is more consistent with SQL, rather than appending "_right", as we do for non-SQL joins, and that will be part of the coming update) 🤔

@michiel-de-muynck
Copy link
Contributor Author

Thank you Alexander for the PR. It indeed only addresses the problem in #14217, not all of the problems with joins, that is a much bigger refactoring than the PR here.

The problem you describe also occurs with just a simple join instead of chained joins. For example, the query

SELECT * FROM df1
INNER JOIN df2 ON df1.b = df2.b

for this query Polars currently returns 3 column ( a, b and c), not 4 (a, df1.b, df2b and c).

Feel free to close the PR if the work you were doing encompasses the changes in this PR.

@stinodego stinodego changed the title Fix: Allow chaining joins fix: Allow chaining joins in SQL syntax Feb 15, 2024
@stinodego stinodego added the A-sql Area: Polars SQL functionality label Feb 15, 2024
@github-actions github-actions bot added fix Bug fix python Related to Python Polars rust Related to Rust Polars labels Feb 15, 2024
@alexander-beedie
Copy link
Collaborator

alexander-beedie commented May 28, 2024

Thank you Alexander for the PR. It indeed only addresses the problem in #14217, not all of the problems with joins, that is a much bigger refactoring than the PR here.

Took me a while, but got there in the end... ;)

Feel free to close the PR if the work you were doing encompasses the changes in this PR.

Should all be dealt with by #16507, which can be merged after #16541 (and #16559).

Update: all merged now - fixes will be in 0.20.31 ✌️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql Area: Polars SQL functionality fix Bug fix python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SQL joins cannot be "chained"
3 participants