[DISCUSSION] Aggregations on all-null / empty data not following SQL standard #19700

Julian-J-S · 2024-11-08T13:36:12Z

Description

I would like to start a discussion on this topic. This is definitely controversial and there are and have been many issues opened on this topic (some still "need decision").

#19686 (sum [] should be null)
#8500 (sum [] should be 0)
#9576 (sum [] should be 0)
#17527 (sum [a, b] != a + b) needs decision
...

Problems

Not folowing the SQL standard and others

The SQL standard specifies that aggregations first filter out null values and then apply the aggregation.
If there are no values left, the result is null.
Examples:

sum [] or sum [null, null]
- polars / pandas: 0
- sql (spark, pyarrow, postgres, duckdb, ...): null

Inconsistent aggregations

sum, prod, any, all have a default/identity value but all other aggregations do not.
it is weird that some aggregations over "no data" return a value

Unintuitive behavior

It is unintuitive and weird that

sum [a, b] != a + b

pl.DataFrame(data={"a": [None, None], "b": [None, None]}, schema={"a": pl.Int64, "b": pl.Int64}).with_columns(
    (pl.col("a") + pl.col("b")).alias("a + b"),
    pl.sum_horizontal("a", "b").alias("sum_horizontal(a, b)"),
)

┌──────┬──────┬───────┬──────────────────────┐
│ a    ┆ b    ┆ a + b ┆ sum_horizontal(a, b) │
│ ---  ┆ ---  ┆ ---   ┆ ---                  │
│ i64  ┆ i64  ┆ i64   ┆ i64                  │
╞══════╪══════╪═══════╪══════════════════════╡
│ null ┆ null ┆ null  ┆ 0                    │
│ null ┆ null ┆ null  ┆ 0                    │
└──────┴──────┴───────┴──────────────────────┘

Opinion

I propose to follow the SQL standard and return null for any aggregation on an empty or all-null column.

Reasoning

it is consistent with the SQL standard
consistent aggregation behavior
- no data -> no result -> null
- all aggregations return null when there are no values
- (mathematical set theory is not always ideal/applicable to databases)
more intuitive; sum [a, b] should be equal to a + b

Examples

Imagine receiving data from an api, sensor, csv, etc. and get the following data: [-3, 3], [null, null], []

with the current behavior
- sum: 0, 0, 0: no idea if there was data or not
with the proposed behavior
- sum: 0, null, null: clear that there was no data in the second and third case
- we have to be specific about how we handle missing data
- it is easy to fill_null after aggregation if required

Goal

Would love to discuss this topic and come to a conclusion on how to handle this in polars.
I can see the following options:

keep the current behavior (+ document it)
change to the proposed behavior (+ document it)
add a configuration option to the affected aggregations to allow both behaviors

In any way, I think we should document no matter what the behavior is!

The text was updated successfully, but these errors were encountered:

Julian-J-S added the enhancement New feature or an improvement of an existing feature label Nov 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[DISCUSSION] Aggregations on all-null / empty data not following SQL standard #19700

[DISCUSSION] Aggregations on all-null / empty data not following SQL standard #19700

Julian-J-S commented Nov 8, 2024

[DISCUSSION] Aggregations on all-null / empty data not following SQL standard #19700

[DISCUSSION] Aggregations on all-null / empty data not following SQL standard #19700

Comments

Julian-J-S commented Nov 8, 2024

Description

Problems

Not folowing the SQL standard and others

Inconsistent aggregations

Unintuitive behavior

Opinion

Reasoning

Examples

Goal