Feature: AggregateMonotonicity #14271

Open: wants to merge 32 commits into base: main

mertak-synnada
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

This PR defines set-monotonicity for aggregate expressions. Some aggregate functions produce ordered results by definition (for example count, min, and max). With this PR, we add this information to the output ordering, which lets the optimizer remove some SortExecs.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions bot added the logical-expr, physical-expr, optimizer, core, sqllogictest, and functions labels on Jan 24, 2025
@@ -4963,6 +4963,9 @@ false
true
NULL

statement ok
Contributor Author

These are related to #14231

Contributor

So that the tests better explain the implications of this change, can you please add a new test rather than updating the existing one (by setting this option)?

That would mean setting the flag and running the EXPLAIN again in a separate block.

That will let the tests better illustrate any change in behavior.

Contributor

done

/// function is monotonically increasing if its value increases as its argument grows
/// (as a set). Formally, `f` is a monotonically increasing set function if `f(S) >= f(T)`
/// whenever `S` is a superset of `T`.
fn monotonicity(&self, _data_type: &DataType) -> AggregateExprMonotonicity {
Contributor

I recommend adding a note at the beginning of the comment: this is used for a specific optimization (is it BoundedWindowAggExec?) and can be skipped by using the default implementation.
This interface seems quite difficult to understand for a general user who only wants to add a simple UDAF.

Contributor

Would it be possible to follow the existing model for ScalarUDFs here instead?

https://github.com/apache/datafusion/blob/27db82fe396f43077b5056bab4b20b084c8f6948/datafusion/expr/src/udf.rs#L753-L752

Something like this:

pub trait AggregateUDFImpl {
    // ...
    /// Returns the output order of this aggregate expression given the input properties
    fn output_ordering(&self, inputs: &[ExprProperties]) -> Result<SortProperties>;
    // ...
}

berkaysynnada commented Jan 28, 2025

I think this is not possible, because this property is purely a matter of the function's nature. It does not depend on the input order or anything else, only on the relation between element-wise growth (or shrinkage) of the grouping set and the resulting values of the aggregate function. I'm renaming monotonicity to set-monotonicity.
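
To make the set-monotonicity idea above concrete, here is a minimal sketch in plain Rust. It does not use the DataFusion API: the `SetMonotonicity` enum, the `classify` helper, and the four toy aggregates are hypothetical names made up for illustration. The helper evaluates an aggregate on a chain of growing inputs (each prefix contains the previous one) and reports whether the results only rise, only fall, or neither. A result on one chain is evidence about the function, not a proof.

```rust
/// Hypothetical classification, loosely mirroring the concept discussed here.
#[derive(Debug, PartialEq)]
enum SetMonotonicity {
    Increasing,
    Decreasing,
    NotMonotonic,
}

/// Evaluates `agg` on every non-empty prefix of `values` (each prefix is a
/// superset of the previous one) and classifies how the results move.
fn classify<F>(values: &[i64], agg: F) -> SetMonotonicity
where
    F: Fn(&[i64]) -> i64,
{
    let outputs: Vec<i64> = (1..=values.len()).map(|n| agg(&values[..n])).collect();
    let non_decreasing = outputs.windows(2).all(|w| w[0] <= w[1]);
    let non_increasing = outputs.windows(2).all(|w| w[0] >= w[1]);
    if non_decreasing {
        SetMonotonicity::Increasing
    } else if non_increasing {
        SetMonotonicity::Decreasing
    } else {
        SetMonotonicity::NotMonotonic
    }
}

// Toy aggregates over a non-empty slice (illustrative only).
fn count_of(s: &[i64]) -> i64 {
    s.len() as i64
}
fn max_of(s: &[i64]) -> i64 {
    *s.iter().max().expect("non-empty input")
}
fn min_of(s: &[i64]) -> i64 {
    *s.iter().min().expect("non-empty input")
}
fn sum_of(s: &[i64]) -> i64 {
    s.iter().sum()
}
```

With the input `[3, -1, 7, 2]`, `count_of` and `max_of` classify as increasing and `min_of` as decreasing regardless of how the elements are ordered, while `sum_of` is neither because the input contains a negative value.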

Contributor

> I recommend adding a note at the beginning of the comment: This is used for a specific (is it BoundedWindowAggExec?) optimization and can be skipped by using the default implementation. This interface seems quite difficult to understand for a general user who only wants to add a simple UDAF

We've tried to provide good documentation, and the API itself comes with a default implementation. Users who are not interested in these properties are not forced to deal with them. Do you have further suggestions at either the code or the documentation level?

alamb left a comment

Thanks @mertak-synnada -- I like where this is headed. I am not sure about some of the plan changes and I also have some questions about the API

Thanks @2010YOUY01 for the look as well

@github-actions github-actions bot removed the optimizer Optimizer rules label Jan 28, 2025
@berkaysynnada berkaysynnada force-pushed the feature/monotonic-sets branch from f1777ef to 1f02953 Compare January 29, 2025 13:04
ozankabak left a comment

I have some minor comments, almost ready to go

@berkaysynnada berkaysynnada force-pushed the feature/monotonic-sets branch from 6b90eba to 1875336 Compare January 29, 2025 14:29
ozankabak left a comment

This LGTM and is ready to go from my perspective. @alamb, it'd be great if you can take a look. It doesn't introduce any changes to existing plans/tests unless it is a strict improvement, but I'd still prefer if you could take a final quick look.

alamb left a comment

Thanks @mertak-synnada and @ozankabak

I think I am missing something here -- the code is very nicely structured and does what the PR says it should do. However, the optimization doesn't seem to compute the same answer

@alamb changed the title from "Feature: Monotonic Sets" to "Feature: AggregateMonotonicity" on Jan 29, 2025
alamb left a comment

Thanks again @ozankabak and @mertak-synnada

I am still confused about this PR -- I am sorry I am probably missing something silly

My understanding of this PR

As I understand this PR, it is optimizing queries like

select a, count(b) FROM ... GROUP BY a ORDER BY a, count(b)

By noticing that when

  1. the input is already sorted by a
  2. we use the ordering preserving grouping (ordering_mode=Sorted)

This implies that the output is already sorted by a, count(b) and thus no SortExec is needed

This makes total sense to me and is a great optimization ✅

My confusion -- doesn't this always hold?

What I don't understand is why this optimization relies on the specific aggregate function used (aka why is AggregateExprSetMonotonicity needed)?

It seems to me like any query like the following doesn't need an extra sort.

select a, agg(b) FROM ... GROUP BY a ORDER BY a, agg(b)

(where agg(b) is any aggregate)

My reasoning is that the GROUP BY ensures that there are no duplicates in the a column, so by definition the stream is sorted by a, <any other columns> as we know a is unique 😕
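
The reasoning above can be checked with a small sketch in plain Rust. It is not DataFusion code, and both function names are hypothetical. Rows are modeled as `(a, agg)` tuples, and Rust compares tuples lexicographically, so the first function tests the `ORDER BY a, agg(b)` requirement directly.

```rust
/// Returns true if `rows` is sorted lexicographically by (key, value),
/// which is what `ORDER BY a, agg(b)` requires.
fn is_sorted_by_key_then_value(rows: &[(i64, i64)]) -> bool {
    // Tuples compare field by field, left to right.
    rows.windows(2).all(|w| w[0] <= w[1])
}

/// Returns true if `rows` is sorted by key and no key repeats,
/// which is what a sorted GROUP BY on the key produces.
fn is_strictly_sorted_by_key(rows: &[(i64, i64)]) -> bool {
    rows.windows(2).all(|w| w[0].0 < w[1].0)
}
```

When the keys are strictly increasing, ties on the first field never occur, so the second field is never consulted and the lexicographic check passes for any aggregate values. With a repeated key, for example `(1, 90)` followed by `(1, -5)`, the second field decides and the check can fail.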

ozankabak commented Jan 30, 2025

Thanks for reviewing carefully, as always, much appreciated 🚀

select a, agg(b) FROM ... GROUP BY a ORDER BY a, agg(b)

You are right that all queries of this form can be optimized independent of what agg is. The unit tests involving such queries in this PR (in aggregate.slt) should work even in the absence of the AggregateExprSetMonotonicity concept. This feature should actually be already tested by other tests (ones exercising correct handling of uniqueness constraints in equivalence properties). We will double check this, and if there are no problems, remove the redundant tests. If we discover any bugs, we will add the fix into this PR.

Now, coming back to the original aim of the PR -- the main intent behind AggregateExprSetMonotonicity is the following:

  1. To make windowing queries ordering aware in case of set-monotonic window/aggregation functions. This is the immediate benefit, and you can find the tests exercising this in window.slt.
  2. To open the door to incremental computations involving filters containing comparisons between an accumulated value (e.g. a COUNT) and a fixed value (or a value with a bound). In such cases, you can do efficient calculations/pruning only when you have set-monotonicity information on the aggregate function computing the accumulated value. We plan to bring such functionality to DataFusion in the upcoming months.
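
Both points can be illustrated with a short sketch in plain Rust. This is an assumption-laden toy, not how DataFusion implements window functions or pruning, and the names `running_count` and `first_index_exceeding` are hypothetical. The first function models a cumulative `COUNT(b)` window that skips NULLs; the second finds where the accumulated value first exceeds a fixed threshold, using a binary search that is valid only because the running count never decreases.

```rust
/// Cumulative COUNT over a column: the i-th output is the number of
/// non-NULL values among rows 0..=i. NULL is modeled as `None`.
fn running_count(rows: &[Option<i64>]) -> Vec<u64> {
    let mut seen = 0u64;
    rows.iter()
        .map(|value| {
            if value.is_some() {
                seen += 1;
            }
            seen
        })
        .collect()
}

/// Index of the first row whose running count is greater than `threshold`,
/// or `None` if the count never exceeds it. The binary search relies on
/// `counts` being non-decreasing, which set-monotonicity guarantees here.
fn first_index_exceeding(counts: &[u64], threshold: u64) -> Option<usize> {
    let idx = counts.partition_point(|&c| c <= threshold);
    if idx < counts.len() {
        Some(idx)
    } else {
        None
    }
}
```

Because the output of `running_count` is already non-decreasing, a consumer that needs it ordered does not have to sort it, and a filter such as "count greater than 2" can skip every row before the returned index without evaluating it.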

Does that help?

berkaysynnada commented

@alamb could you take a final look?

@ozankabak ozankabak force-pushed the feature/monotonic-sets branch from d7e3135 to 5e9b2db Compare January 30, 2025 12:54
ozankabak commented

This now includes the optimization for single-row outputs, windowing operations with set-monotonic functions, and it lays the foundational machinery for more sophisticated optimizations based on expressions involving functions with set-monotonicity properties.

I am quite happy with the final state of this PR. Once @alamb confirms there are no concerns left, I will merge.

alamb commented Jan 30, 2025

I will try and give it a good look later today

Labels: core, functions, logical-expr, physical-expr, sqllogictest
5 participants