I would like to start a discussion on this topic. It is definitely controversial, and many issues have been opened on it (some still "need decision").
#19686 (sum [] should be null)
#8500 (sum [] should be 0)
#9576 (sum [] should be 0)
#17527 (sum [a, b] != a + b), needs decision
...
Problems
Not following the SQL standard and others
The SQL standard specifies that aggregations first filter out null values and then apply the aggregation.
If there are no values left, the result is null.
Examples:
sum [] or sum [null, null]
polars / pandas: 0
sql (spark, pyarrow, postgres, duckdb, ...): null
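The two conventions can be contrasted in a few lines of plain Python. This is only a minimal sketch of the semantics, not how polars or any SQL engine implements aggregation; `sql_sum` and `identity_sum` are hypothetical helper names, and `None` stands in for null.

```python
from typing import Optional, Sequence


def sql_sum(values: Sequence[Optional[int]]) -> Optional[int]:
    """SQL-style sum: drop nulls first, return None if nothing is left."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return None
    return sum(non_null)


def identity_sum(values: Sequence[Optional[int]]) -> int:
    """Identity-style sum: drop nulls, fall back to the identity value 0."""
    return sum(v for v in values if v is not None)


# Compare both conventions on empty, all-null, and partially null input.
for data in ([], [None, None], [1, None, 2]):
    print(data, "sql:", sql_sum(data), "identity:", identity_sum(data))
```

The two helpers agree whenever at least one non-null value is present; they differ only on empty or all-null input.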
Inconsistent aggregations
sum, prod, any, and all have a default/identity value, but all other aggregations do not.
It is weird that some aggregations over "no data" return a value.
Unintuitive behavior
It is unintuitive and weird that sum [a, b] != a + b
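One way to read this inequality, sketched in plain Python under two assumptions that are not spelled out above: both a and b are null, and binary + propagates null. `null_add` and `identity_sum` are hypothetical helpers, with `None` standing in for null.

```python
from typing import Optional, Sequence


def null_add(a: Optional[int], b: Optional[int]) -> Optional[int]:
    """Binary + with null propagation: any null operand gives null."""
    if a is None or b is None:
        return None
    return a + b


def identity_sum(values: Sequence[Optional[int]]) -> int:
    """Identity-style sum: nulls are dropped, an empty sum is 0."""
    return sum(v for v in values if v is not None)


a, b = None, None
print(identity_sum([a, b]))  # 0
print(null_add(a, b))        # None
```

This sketch only covers the all-null case, where an aggregation that returns null for "no values" would agree with a + b.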
Opinion
I propose to follow the SQL standard and return null for any aggregation on an empty or all-null column.
Reasoning
null [...] null when there are no values
sum [a, b] should be equal to a + b
Examples
Imagine receiving the following data from an API, sensor, CSV file, etc.:
[-3, 3], [null, null], []
0, 0, 0: no idea if there was data or not
0, null, null: clear that there was no data in the second and third case
fill_null after aggregation if required
Goal
Would love to discuss this topic and come to a conclusion on how to handle this in polars.
I can see the following options:
Either way, I think we should document the behavior, whatever it ends up being!