Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: FULL OUTER JOIN and LIMIT produces wrong results #14338

Merged
merged 3 commits into from
Jan 28, 2025

Conversation

zhuqi-lucas
Copy link
Contributor

@zhuqi-lucas zhuqi-lucas commented Jan 28, 2025

Which issue does this PR close?

Rationale for this change

LIMITs are incorrectly pushed through FULL OUTER Joins

What changes are included in this PR?

disable full outer join limit push down

Are these changes tested?

Added unit test, also the original unit test show right behaviour now.

Are there any user-facing changes?

Yes, FULL OUTER JOIN and LIMIT produces right results now.

@github-actions github-actions bot added optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Jan 28, 2025
@zhuqi-lucas
Copy link
Contributor Author

zhuqi-lucas commented Jan 28, 2025

cc @alamb @Dandandan
For the first step, i think we should disable full outer join push down limit, this PR done.
For further improvement, we can investigate if we have some cases which we can support push down limit for full outer join.

Thanks a lot!

And also added the test cases in #14336

It shows right result now.

@zhuqi-lucas
Copy link
Contributor Author

zhuqi-lucas commented Jan 28, 2025

I am confident now that we can remove the pushdown limit for full outer join, because i also checked the full outer join pushdown filter, we will not push down for full outer join for both side, code reference:

// Get pushable predicates from current optimizer state
let (left_preserved, right_preserved) = lr_is_preserved(join.join_type);
pub(crate) fn lr_is_preserved(join_type: JoinType) -> (bool, bool) {
    match join_type {
        JoinType::Inner => (true, true),
        JoinType::Left => (true, false),
        JoinType::Right => (false, true),
        JoinType::Full => (false, false),
        // No columns from the right side of the join can be referenced in output
        // predicates for semi/anti joins, so whether we specify t/f doesn't matter.
        JoinType::LeftSemi | JoinType::LeftAnti | JoinType::LeftMark => (true, false),
        // No columns from the left side of the join can be referenced in output
        // predicates for semi/anti joins, so whether we specify t/f doesn't matter.
        JoinType::RightSemi | JoinType::RightAnti => (false, true),
    }
}

Copy link
Member

@xudong963 xudong963 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, pushing down limit to full join will introduce the issue of correctness, we should avoid it.

# there should be no nulls
# Reproducer for https://github.com/apache/datafusion/issues/14335
query II
select * from t1 FULL JOIN t2 ON t1.a = t2.b LIMIT 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the test will be flaky. (Maybe explain is enough)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree in theory the results from this query is undefined (in the sense that any two rows would be valid)

However, given we are limiting to a single core via

set datafusion.execution.target_partitions = 1;

I think this won't be flaky in practice

Of course I am somewhat biased as I wrote these tests

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, make sense, I didn't notice the setting

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @zhuqi-lucas and @xudong963

# there should be no nulls
# Reproducer for https://github.com/apache/datafusion/issues/14335
query II
select * from t1 FULL JOIN t2 ON t1.a = t2.b LIMIT 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree in theory the results from this query is undefined (in the sense that any two rows would be valid)

However, given we are limiting to a single core via

set datafusion.execution.target_partitions = 1;

I think this won't be flaky in practice

Of course I am somewhat biased as I wrote these tests

@xudong963
Copy link
Member

Let's go

@xudong963 xudong963 merged commit ecc5694 into apache:main Jan 28, 2025
27 checks passed
@zhuqi-lucas
Copy link
Contributor Author

Thank you @alamb and @xudong963!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

FULL OUTER JOIN and LIMIT produces wrong results
3 participants