Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DISCUSSION] Aggregations on all-null / empty data not following SQL standard #19700

Open
Julian-J-S opened this issue Nov 8, 2024 · 0 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@Julian-J-S
Copy link
Contributor

Description

I would like to start a discussion on this topic. This is definitely controversial and there are and have been many issues opened on this topic (some still "need decision").

#19686 (sum [] should be null)
#8500 (sum [] should be 0)
#9576 (sum [] should be 0)
#17527 (sum [a, b] != a + b) needs decision
...

Problems

Not folowing the SQL standard and others

The SQL standard specifies that aggregations first filter out null values and then apply the aggregation.
If there are no values left, the result is null.
Examples:

  • sum [] or sum [null, null]
    • polars / pandas: 0
    • sql (spark, pyarrow, postgres, duckdb, ...): null

Inconsistent aggregations

  • sum, prod, any, all have a default/identity value but all other aggregations do not.
  • it is weird that some aggregations over "no data" return a value

Unintuitive behavior

It is unintuitive and weird that

  • sum [a, b] != a + b
pl.DataFrame(data={"a": [None, None], "b": [None, None]}, schema={"a": pl.Int64, "b": pl.Int64}).with_columns(
    (pl.col("a") + pl.col("b")).alias("a + b"),
    pl.sum_horizontal("a", "b").alias("sum_horizontal(a, b)"),
)

┌──────┬──────┬───────┬──────────────────────┐
│ aba + bsum_horizontal(a, b) │
│ ------------                  │
│ i64i64i64i64                  │
╞══════╪══════╪═══════╪══════════════════════╡
│ nullnullnull0                    │
│ nullnullnull0                    │
└──────┴──────┴───────┴──────────────────────┘

Opinion

I propose to follow the SQL standard and return null for any aggregation on an empty or all-null column.

Reasoning

  • it is consistent with the SQL standard
  • consistent aggregation behavior
    • no data -> no result -> null
    • all aggregations return null when there are no values
    • (mathematical set theory is not always ideal/applicable to databases)
  • more intuitive; sum [a, b] should be equal to a + b

Examples

Imagine receiving data from an api, sensor, csv, etc. and get the following data: [-3, 3], [null, null], []

  • with the current behavior
    • sum: 0, 0, 0: no idea if there was data or not
  • with the proposed behavior
    • sum: 0, null, null: clear that there was no data in the second and third case
    • we have to be specific about how we handle missing data
    • it is easy to fill_null after aggregation if required

Goal

Would love to discuss this topic and come to a conclusion on how to handle this in polars.
I can see the following options:

  • keep the current behavior (+ document it)
  • change to the proposed behavior (+ document it)
  • add a configuration option to the affected aggregations to allow both behaviors

In any way, I think we should document no matter what the behavior is!

@Julian-J-S Julian-J-S added the enhancement New feature or an improvement of an existing feature label Nov 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature
Projects
None yet
Development

No branches or pull requests

1 participant