Feature: AggregateMonotonicity #14271

mertak-synnada · 2025-01-24T13:02:25Z

Which issue does this PR close?

Closes #.

Rationale for this change

This PR defines set-monotonicity for aggregate expressions. Some aggregate functions produce ordered results by definition (such as count, min, max). With this PR, we add this information to the output ordering, which lets us remove some SortExecs during optimization.
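To make the idea concrete, here is a minimal sketch in plain Rust (not DataFusion code) that evaluates a few aggregates over a chain of growing sets and checks whether the results move in one direction only. The helper names `over_growing_prefixes`, `count_of`, `max_of`, `min_of` and `sum_of` are hypothetical and exist only for this illustration. The signed `sum_of` is included purely as a contrast and says nothing about how the PR classifies sum.

```rust
/// Returns true if `values` never decreases from one element to the next.
fn is_non_decreasing<T: PartialOrd>(values: &[T]) -> bool {
    values.windows(2).all(|w| w[0] <= w[1])
}

/// Returns true if `values` never increases from one element to the next.
fn is_non_increasing<T: PartialOrd>(values: &[T]) -> bool {
    values.windows(2).all(|w| w[0] >= w[1])
}

/// Evaluates `agg` on every non-empty prefix of `input`, i.e. on a chain of
/// sets in which each set is a superset of the previous one.
fn over_growing_prefixes<F>(input: &[i64], agg: F) -> Vec<i64>
where
    F: Fn(&[i64]) -> i64,
{
    (1..=input.len()).map(|n| agg(&input[..n])).collect()
}

// Hypothetical stand-ins for aggregate functions (assumed non-empty input).
fn count_of(set: &[i64]) -> i64 {
    set.len() as i64
}

fn max_of(set: &[i64]) -> i64 {
    *set.iter().max().expect("non-empty set")
}

fn min_of(set: &[i64]) -> i64 {
    *set.iter().min().expect("non-empty set")
}

fn sum_of(set: &[i64]) -> i64 {
    set.iter().sum()
}
```

With the input `[3, -1, 7, 2]`, the count and max sequences never decrease and the min sequence never increases, while the signed sum goes both down and up, so it is not set-monotonic on that input.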

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?


mertak-synnada · 2025-01-24T13:04:01Z

datafusion/sqllogictest/test_files/aggregate.slt

@@ -4963,6 +4963,9 @@ false
 true
 NULL

+statement ok


These are related to #14231

alamb · 2025-01-26T11:13:33Z

So that the tests better explain the implications of this change, can you please add a new test rather than updating the existing test (by setting this option)?

That would mean setting the flag and running the EXPLAIN again in a separate block.

That will let the tests better illustrate any change in behavior.

2010YOUY01 · 2025-01-25T08:54:50Z

datafusion/expr/src/udaf.rs

+    /// function is monotonically increasing if its value increases as its argument grows
+    /// (as a set). Formally, `f` is a monotonically increasing set function if `f(S) >= f(T)`
+    /// whenever `S` is a superset of `T`.
+    fn monotonicity(&self, _data_type: &DataType) -> AggregateExprMonotonicity {


I recommend adding a note at the beginning of the comment: this is used for a specific optimization (is it BoundedWindowAggExec?) and can be skipped by using the default implementation.
This interface seems quite difficult to understand for a general user who only wants to add a simple UDAF.

alamb · 2025-01-26T11:10:07Z

Would it be possible to follow the existing model for ScalarUDFs here instead?

https://github.com/apache/datafusion/blob/27db82fe396f43077b5056bab4b20b084c8f6948/datafusion/expr/src/udf.rs#L753-L752

Something like this:

pub trait AggregateUDFImpl {
    ...
    /// returns the output order of this aggregate expression given the input properties
    fn output_ordering(&self, inputs: &[ExprProperties]) -> Result<SortProperties>;
    ...
}

berkaysynnada · 2025-01-28

I think this is not possible because this property is purely related to the function's nature. It does not depend on the input order or anything else, just on the relation between the element-wise increment (or decrement) in the grouping set and the resulting values of the aggregate function. I'm renaming monotonicity to set-monotonicity.

I recommend adding a note at the beginning of the comment: this is used for a specific optimization (is it BoundedWindowAggExec?) and can be skipped by using the default implementation. This interface seems quite difficult to understand for a general user who only wants to add a simple UDAF.

We've tried to provide good documentation, and the API itself comes with a default implementation. If general users are not interested in these properties, we are not forcing them to be. Do you have further suggestions at either the code or the documentation level?
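The point about the default implementation can be sketched as follows. This is a simplified, self-contained illustration and not the DataFusion API: the trait `AggregateUdf`, the enum `SetMonotonicity`, its variant names, and the three example structs are all hypothetical stand-ins chosen for this sketch. The real method in the PR also takes a data type argument, which is omitted here.

```rust
/// Hypothetical stand-in for the enum returned by the monotonicity method.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SetMonotonicity {
    /// The result never decreases as the input set grows.
    Increasing,
    /// The result never increases as the input set grows.
    Decreasing,
    /// No guarantee in either direction.
    NotMonotonic,
}

/// Hypothetical, heavily simplified aggregate UDF trait.
trait AggregateUdf {
    fn name(&self) -> &str;

    /// Default: claim nothing. Implementors that do not care about this
    /// optimization can simply leave the method out.
    fn set_monotonicity(&self) -> SetMonotonicity {
        SetMonotonicity::NotMonotonic
    }
}

/// A "simple UDAF" that relies on the default implementation.
struct MyMedian;

impl AggregateUdf for MyMedian {
    fn name(&self) -> &str {
        "my_median"
    }
}

/// A count-like aggregate that opts in.
struct MyCount;

impl AggregateUdf for MyCount {
    fn name(&self) -> &str {
        "my_count"
    }

    fn set_monotonicity(&self) -> SetMonotonicity {
        SetMonotonicity::Increasing
    }
}

/// A min-like aggregate that opts in with the opposite direction.
struct MyMin;

impl AggregateUdf for MyMin {
    fn name(&self) -> &str {
        "my_min"
    }

    fn set_monotonicity(&self) -> SetMonotonicity {
        SetMonotonicity::Decreasing
    }
}
```

In this sketch `MyMedian` never mentions monotonicity at all, which is the situation a user writing a simple UDAF would be in.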

alamb

Thanks @mertak-synnada -- I like where this is headed. I am not sure about some of the plan changes and I also have some questions about the API

Thanks @2010YOUY01 for the look as well

datafusion/expr/src/udaf.rs

alamb · 2025-01-26T11:10:07Z


datafusion/sqllogictest/test_files/aggregate.slt

alamb · 2025-01-26T11:13:33Z


datafusion/sqllogictest/test_files/aggregates_topk.slt

ozankabak

I have some minor comments, almost ready to go

datafusion/physical-expr/src/aggregate.rs

datafusion/physical-expr/src/window/aggregate.rs

datafusion/physical-expr/src/window/standard.rs

datafusion/physical-plan/src/aggregates/mod.rs

ozankabak

This LGTM and is ready to go from my perspective. @alamb, it'd be great if you can take a look. It doesn't introduce any changes to existing plans/tests unless it is a strict improvement, but I'd still prefer if you could take a final quick look.

alamb

Thanks @mertak-synnada and @ozankabak

I think I am missing something here -- the code is very nicely structured and does what the PR says it should do. However, the optimization doesn't seem to compute the same answer

datafusion/functions-aggregate/src/min_max.rs

datafusion/expr/src/udaf.rs

datafusion/functions-aggregate/src/sum.rs

datafusion/expr/src/udaf.rs

datafusion/sqllogictest/test_files/aggregate.slt

datafusion/core/tests/physical_optimizer/enforce_sorting.rs

alamb

Thanks again @ozankabak and @mertak-synnada

I am still confused about this PR -- I am sorry I am probably missing something silly

My understanding of this PR

As I understand this PR, it is optimizing queries like

select a, count(b) FROM ... GROUP BY a ORDER BY a, count(b)

By noticing that when

1. the input is already sorted by a, and
2. we use the ordering-preserving grouping (ordering_mode=Sorted),

the output is already sorted by a, count(b), and thus no SortExec is needed.

This makes total sense to me and is a great optimization ✅

My confusion -- doesn't this always hold?

What I don't understand is why this optimization relies on the specific aggregate function used (aka why is AggregateExprSetMonotonicity needed)?

It seems to me like any query like the following doesn't need an extra sort.

select a, agg(b) FROM ... GROUP BY a ORDER BY a, agg(b)

(where agg(b) is any aggregate)

My reasoning is that the GROUP BY ensures that there are no duplicates in the a column, so by definition the stream is sorted by a, <any other columns> as we know a is unique 😕
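The uniqueness argument can be checked with a small sketch in plain Rust. Rows are modeled as `(a, agg)` tuples, and both function names are hypothetical helpers for this illustration only. Rust compares tuples lexicographically, which matches `ORDER BY a, agg(b)` with ascending order on both columns.

```rust
/// Returns true if `rows` is sorted lexicographically by (a, agg).
fn is_sorted_by_a_then_agg(rows: &[(i64, i64)]) -> bool {
    // Tuple comparison is lexicographic: first by `.0`, then by `.1`.
    rows.windows(2).all(|w| w[0] <= w[1])
}

/// Returns true if the `a` column is strictly increasing, i.e. sorted and
/// free of duplicates (what a sorted GROUP BY on `a` would produce).
fn a_is_sorted_and_unique(rows: &[(i64, i64)]) -> bool {
    rows.windows(2).all(|w| w[0].0 < w[1].0)
}
```

When `a` is strictly increasing, the first tuple component alone decides every comparison, so the second column can hold arbitrary values without breaking the ordering. Once `a` contains duplicates, the second column starts to matter.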

datafusion/core/tests/physical_optimizer/enforce_sorting.rs

ozankabak · 2025-01-30T07:24:24Z

Thanks for reviewing carefully, as always, much appreciated 🚀

select a, agg(b) FROM ... GROUP BY a ORDER BY a, agg(b)

You are right that all queries of this form can be optimized independent of what agg is. The unit tests involving such queries in this PR (in aggregate.slt) should work even in the absence of the AggregateExprSetMonotonicity concept. This feature should actually be already tested by other tests (ones exercising correct handling of uniqueness constraints in equivalence properties). We will double check this, and if there are no problems, remove the redundant tests. If we discover any bugs, we will add the fix into this PR.

Now, coming back to the original aim of the PR -- the main intent behind AggregateExprSetMonotonicity is the following:

1. To make windowing queries ordering-aware when the window/aggregation functions are set-monotonic. This is the immediate benefit, and you can find the tests exercising this in window.slt.
2. To open the door to incremental computations involving filters that compare an accumulated value (e.g. a COUNT) with a fixed value (or a value with a bound). In such cases, you can do efficient calculations/pruning only when you have set-monotonicity information on the aggregate function computing the accumulated value. We plan to bring such functionality to DataFusion in the upcoming months.
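Both points can be illustrated with a small sketch in plain Rust. It is not DataFusion code and not the planned implementation: `running_count` and `first_row_reaching` are hypothetical helpers, and the frame is assumed to grow by one row at a time (as with a frame running from the start of the partition to the current row).

```rust
/// Running COUNT of non-null values over a frame that grows by one row at a
/// time. Because COUNT is set-monotonic, the output is non-decreasing, so an
/// ordering on this column comes for free.
fn running_count(input: &[Option<i64>]) -> Vec<u64> {
    let mut count = 0u64;
    input
        .iter()
        .map(|v| {
            if v.is_some() {
                count += 1;
            }
            count
        })
        .collect()
}

/// Finds the first row whose running count reaches `threshold`.
/// The binary search is only valid because `counts` is non-decreasing:
/// once the comparison holds for a row, it holds for every later row.
fn first_row_reaching(counts: &[u64], threshold: u64) -> Option<usize> {
    let idx = counts.partition_point(|&c| c < threshold);
    (idx < counts.len()).then_some(idx)
}
```

A filter such as "running count is at least 2" then splits the rows into a prefix that fails and a suffix that passes, so the boundary can be located without evaluating the predicate on every row.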

Does that help?

berkaysynnada · 2025-01-30T07:49:39Z

@alamb could you take a final look?

ozankabak · 2025-01-30T13:31:48Z

This now includes the optimization for single-row outputs, windowing operations with set-monotonic functions, and it lays the foundational machinery for more sophisticated optimizations based on expressions involving functions with set-monotonicity properties.

I am quite happy with the final state of this PR. Once @alamb confirms there are no concerns left, I will merge.

alamb · 2025-01-30T14:16:38Z

I will try and give it a good look later today

mertak-synnada added 16 commits January 16, 2025 15:21

add monotonic function definitions for aggregate expressions

a2919b6

fix benchmark results

14109e6

set prefer_existing_sort to true in sqllogictests

b3d75ba

set prefer_existing_sort to true in sqllogictests

549502e

fix typo

623e0c5

Merge branch 'refs/heads/apache_main' into feature/monotonic-sets

6a9d24e

# Conflicts: # datafusion/core/src/physical_optimizer/enforce_sorting.rs # datafusion/core/src/physical_optimizer/test_utils.rs

re-add test_utils.rs changes to the new file

53ee3de

clone input with Arc

97d8951

Merge branch 'refs/heads/apache_main' into feature/monotonic-sets

cc33031

Merge branch 'refs/heads/apache_main' into feature/monotonic-sets

41d9430

# Conflicts: # datafusion/core/src/physical_optimizer/enforce_sorting.rs # datafusion/physical-optimizer/src/test_utils.rs

inject aggr expr indices

e988dcf

separate stubs and count_udafs

remove redundant file

906245e

add Sum monotonicity

475fe2d

change monotonicity to return an Enum rather than Option<bool> fix indices re-add monotonicity tests

fix sql logic tests

57e000e

fix sql logic tests

ca57f46

Merge branch 'refs/heads/apache_main' into feature/monotonic-sets

6cf9644

# Conflicts: # datafusion/core/tests/physical_optimizer/enforce_sorting.rs

github-actions bot added the logical-expr (Logical plan and expressions), physical-expr (Physical Expressions), optimizer (Optimizer rules), core (Core DataFusion crate), sqllogictest (SQL Logic Tests (.slt)), and functions labels Jan 24, 2025

mertak-synnada commented Jan 24, 2025


update docs

072e6ef

2010YOUY01 reviewed Jan 25, 2025

alamb reviewed Jan 26, 2025

Merge branch 'apache_main' into feature/monotonic-sets

7d62cb0

github-actions bot removed the optimizer (Optimizer rules) label Jan 28, 2025

berkaysynnada added 2 commits January 28, 2025 16:20

review part 1

491aabe

fix the tests

972c56f

berkaysynnada added 4 commits January 29, 2025 15:19

revert slt's

4b946b3

simplify terms

481b5b4

Update mod.rs

29af731

remove unnecessary computations

1f02953

berkaysynnada force-pushed the feature/monotonic-sets branch from f1777ef to 1f02953 January 29, 2025 13:04

berkaysynnada added 2 commits January 29, 2025 16:29

remove index calc

79dd942

Update mod.rs

247d5fe

ozankabak reviewed Jan 29, 2025

ozankabak and others added 2 commits January 29, 2025 17:26

Apply suggestions from code review

16bdac4

add slt

1875336

berkaysynnada force-pushed the feature/monotonic-sets branch from 6b90eba to 1875336 January 29, 2025 14:29

ozankabak approved these changes Jan 29, 2025

alamb reviewed Jan 29, 2025

alamb changed the title ~~Feature: Monotonic Sets~~ Feature: AggregateMonotonicity Jan 29, 2025

alamb reviewed Jan 29, 2025

datafusion/core/tests/physical_optimizer/enforce_sorting.rs

alamb reviewed Jan 29, 2025

datafusion/core/tests/physical_optimizer/enforce_sorting.rs

berkaysynnada added 2 commits January 30, 2025 10:44

remove aggregate changes, tests already give expected results

ba7b94f

fix clippy

2152b7f

berkaysynnada and others added 2 commits January 30, 2025 14:56

remove one row sorts

7822613

Improve comments

5e9b2db

ozankabak force-pushed the feature/monotonic-sets branch from d7e3135 to 5e9b2db January 30, 2025 12:54
