fix: nullability semantics of predicates in JoinRel and FilterRel #52

mbrobbel · 2022-09-13T14:36:26Z

An attempt to fix part of #51.

Filter relations
According to "What are the semantics of the ANTI join?" (substrait#325, comment):

> For those boolean expressions, true is treated as "succeeds" and false and unknown (aka null) are treated as "fails".

I modified the summary of filter relations to reflect this behavior.
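
To make the quoted rule concrete, here is a minimal Rust sketch. It is not the validator's code: `PredicateResult`, `row_passes` and `filter_rows` are hypothetical names, and `Option<bool>` is assumed as a stand-in for a nullable boolean, with `None` playing the role of null.

```rust
/// Result of evaluating a boolean predicate for one row.
/// `None` stands in for SQL's unknown/null (an assumption of this sketch).
type PredicateResult = Option<bool>;

/// A row passes the filter only when the predicate is definitely true.
fn row_passes(pred: PredicateResult) -> bool {
    pred == Some(true)
}

/// Keeps the rows whose predicate yields true, dropping false and null alike.
fn filter_rows<T: Clone>(rows: &[T], pred: impl Fn(&T) -> PredicateResult) -> Vec<T> {
    rows.iter()
        .filter(|row| row_passes(pred(*row)))
        .cloned()
        .collect()
}
```

With rows `[Some(5), None, Some(-1)]` and the predicate "value > 0", only `Some(5)` survives: the null row is dropped for the same reason as the negative one.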

Join relations
- According to https://github.com/substrait-io/substrait/blob/e68fdc049c1219a1269225456072d7548c83f297/site/docs/relations/logical_relations.md?plain=1#L199, literal join expression values are allowed, but they must be true. Added a check and test.
- Added some notes to the summary of join relations about nullability behavior of the join and post join filter predicates.
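
As an illustration of the join behavior described in these notes, the following Rust sketch runs a nested-loop inner join over nullable keys. It is only a sketch under stated assumptions: `eq_nullable` and `inner_join` are hypothetical helpers, the join expression is assumed to be plain equality, and `None` stands in for null.

```rust
/// SQL-style equality: comparing with a null key yields null (`None`).
fn eq_nullable(left: Option<i32>, right: Option<i32>) -> Option<bool> {
    match (left, right) {
        (Some(l), Some(r)) => Some(l == r),
        _ => None,
    }
}

/// Inner join via nested loops: a pair is emitted only when the join
/// expression yields true; false and null both discard the pair.
fn inner_join(
    left: &[Option<i32>],
    right: &[Option<i32>],
) -> Vec<(Option<i32>, Option<i32>)> {
    let mut out = Vec::new();
    for &l in left {
        for &r in right {
            if eq_nullable(l, r) == Some(true) {
                out.push((l, r));
            }
        }
    }
    out
}
```

Note that two null keys do not match each other here, because `eq_nullable(None, None)` is null rather than true.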

I've added comment tests to the test runner to test these changes.

mbasmanova · 2022-09-13T14:41:35Z

rs/src/parse/relations/filter.rs

@@ -32,14 +32,10 @@ pub fn parse_filter_rel(
    describe!(y, Relation, "Filter by {}", &predicate);
    summary!(
        y,
-        "This relation discards all rows for which {} yields false.",
-        &predicate
+        "This relation discards all rows for which the expression {} yields {}.",


Thank you for the fix.

I'm confused by the "nullable" flag here. In SQL, all types are nullable, i.e. a value of any type can be null. I'm curious which systems are expected to specify nullable as false.

mbrobbel · 2022-09-13

In Substrait, types can be non-nullable (e.g. booleans). A scalar function from an extension could return a non-nullable boolean. This change reflects that.
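
A rough sketch of how the second `{}` in the new summary string could depend on the predicate's nullability. The helper names and the exact wording "false or null" are assumptions made for illustration, not necessarily what the validator emits.

```rust
/// Hypothetical helper: which predicate outcomes cause a row to be discarded,
/// given whether the predicate's boolean type is nullable.
fn discarded_outcomes(predicate_nullable: bool) -> &'static str {
    if predicate_nullable {
        "false or null"
    } else {
        "false"
    }
}

/// Builds a summary in the spirit of the new format string in this diff.
fn filter_summary(predicate: &str, predicate_nullable: bool) -> String {
    format!(
        "This relation discards all rows for which the expression {} yields {}.",
        predicate,
        discarded_outcomes(predicate_nullable)
    )
}
```

For a non-nullable predicate the summary only needs to mention false; for a nullable one it has to mention null as well.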

jvanstraten · 2022-09-13

Substrait's type system is strict and actually treats types as non-nullable by default; please review https://substrait.io/types/type_system/. As for why, that question is better suited to Slack or maybe the mailing list. It was built this way before @mbrobbel or I joined the project.

mbasmanova · 2022-09-13

@mbrobbel @jvanstraten Thank you for explaining. This is confusing to me because most SQL functions have so-called default null behavior, i.e. a null in any of the arguments automatically produces a null result. Hence, most SQL functions return nullable types, so it seems strange to use non-nullable types by default. CC: @jacques-n

I assume I should open an issue in https://github.com/substrait-io/substrait/issues to get more context on this design decision.

jvanstraten · 2022-09-13

The default behavior for Substrait functions is that the return type is nullable if and only if any of the arguments is nullable. This is covered by MIRROR nullability behavior. With the "by default" thing I just mean that if you write i32 you're actually talking about a non-nullable 32-bit integer, whereas you need to write i32? to talk about a nullable 32-bit integer. We could have used i32! for non-nullable and i32 for nullable, too (or just require either ! or ? if you want to consider nullability). It's just a notation convention thing.
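
A small Rust sketch of the two points above: the `i32` / `i32?` notation and the MIRROR rule. `SimpleType` and `mirror_return` are hypothetical names invented for this illustration; real Substrait types carry far more than a name and a flag.

```rust
/// A simplified type: just a name plus a nullability flag.
#[derive(Debug, Clone, PartialEq)]
struct SimpleType {
    name: &'static str,
    nullable: bool,
}

impl SimpleType {
    /// Renders the type in the notation described above: `i32` vs `i32?`.
    fn render(&self) -> String {
        if self.nullable {
            format!("{}?", self.name)
        } else {
            self.name.to_string()
        }
    }
}

/// MIRROR: the return type is nullable iff at least one argument is nullable.
fn mirror_return(return_name: &'static str, args: &[SimpleType]) -> SimpleType {
    SimpleType {
        name: return_name,
        nullable: args.iter().any(|arg| arg.nullable),
    }
}
```

Calling `mirror_return("i32", ...)` with two non-nullable arguments renders as `i32`; making either argument nullable turns the result into `i32?`.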

As for the "why" part by the way, I can't speak for the community because I don't know the reasons, but as a more back-end/hardware guy, I'd say that not allowing everything to fail silently by default is a good thing. null is hardly ever treated in a consistent manner, so being able to avoid it, or to assert that an expression can never fail that way, is also a good thing. It's certainly true though that for a practical plan using SQL-esque functions and relations, the vast majority of your types are going to be nullable.

I suppose part of it is also that we're representing a whole row/record as a single struct in some contexts. That struct is never nullable, even in SQL; there is no way to have a row that is "so null" that it doesn't even have any fields anymore. This generalization is, again, really nice when you're actually implementing these things and want to support nested types (which, AFAICT, SQL generally does not support, but Substrait does). It prevents you from having to implement column/field access and nested struct field access separately.

mbasmanova · 2022-09-13

> With the "by default" thing I just mean that if you write i32 you're actually talking about a non-nullable 32-bit integer, whereas you need to write i32? to talk about a nullable 32-bit integer.

@jvanstraten This is very helpful clarification. Thanks.

@chaojun-zhang We need to make sure we use i32? and similar when defining custom function signatures in facebookincubator/velox#2496

> I suppose part of it is also that we're representing a whole row/record as a single struct in some contexts. That struct is never nullable, even in SQL

I understand this point. In Velox, we also use RowVector to represent both a struct field and a top-level row, which, as you pointed out, cannot possibly be null.

> This generalization is, again, really nice when you're actually implementing these things and want to support nested types (which, AFAICT, SQL generally does not support, but Substrait does).

Modern SQL engines (Spark, Presto, Trino, Velox at least) do support complex types and also support higher order functions / lambdas. For example, transform in Presto: https://prestodb.io/docs/current/functions/array.html#transform

I'm curious if lambda functions are supported in Substrait as well and, if so, what is the type of the "function" argument in these?

jvanstraten · 2022-09-13

> We need to make sure we use i32? and similar when defining custom function signatures

Be aware that it won't do anything unless you also specify DISCRETE nullability, because the nullability flags end up getting "overridden" by the semantics of MIRROR or (for arguments) DECLARED_OUTPUT. This and the other logic behind type derivations are, honestly, super weird, and have been causing me headaches for over half a year now to try to get them into the validator, so if something doesn't make sense to you there, it might well be because it just doesn't. I'm working on it.
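
To illustrate why a declared `i32?` has no effect without DISCRETE, here is a hedged Rust sketch of the three modes as I read the Substrait extension docs. `NullabilityHandling` and `output_nullable` are hypothetical names, and types are reduced to bare nullability flags; the validator's real derivation logic is considerably more involved.

```rust
/// The three nullability-handling modes named in this thread.
#[derive(Debug, Clone, Copy, PartialEq)]
enum NullabilityHandling {
    Mirror,
    DeclaredOutput,
    Discrete,
}

/// Derives the output nullability, or returns `None` when the actual
/// arguments cannot be bound to the declared signature.
fn output_nullable(
    mode: NullabilityHandling,
    declared_args: &[bool],
    declared_output: bool,
    actual_args: &[bool],
) -> Option<bool> {
    if declared_args.len() != actual_args.len() {
        return None;
    }
    match mode {
        // Declared flags are ignored; the output mirrors the actual arguments.
        NullabilityHandling::Mirror => Some(actual_args.iter().any(|&n| n)),
        // Any mix of argument nullability is accepted; the declared output wins.
        NullabilityHandling::DeclaredOutput => Some(declared_output),
        // Arguments must match the declared nullability exactly.
        NullabilityHandling::Discrete => {
            if declared_args == actual_args {
                Some(declared_output)
            } else {
                None
            }
        }
    }
}
```

For a signature declared as `f(i32?) -> i32?` and called with a non-nullable argument, this sketch yields a non-nullable result under MIRROR, a nullable result under DECLARED_OUTPUT, and no match at all under DISCRETE.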

> I'm curious if lambda functions are supported in Substrait as well and, if so, what is the type of the "function" argument in these?

They're not, but only because no one has bothered to define them yet. I thought about them for a bit in substrait-io/substrait#320 because some functions naturally take a comparator lambda for sort-like semantics, which currently can't be done. I figured that, since we don't really have a concept of statements, a lambda function would just be written like a normal expression, but with argument references at the leaves rather than (only) field references and literals. The types of those would just be whatever the function you're passing the lambda to as an argument defines them to be.

This is not as powerful as generalized lambdas that you can pass around as values, though. A more general solution, with that lambda data type, would probably look something like lambda<struct<arg0, arg1, ...>, return>, in which case the argument types would need to be defined along with the argument references in the expression tree (or, better yet, in the special expression type that constructs the lambda), and then matched against the function prototype that will be calling them (rather than derived by the prototype).
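
Purely as a thought experiment, the `lambda<struct<arg0, arg1, ...>, return>` idea could be modeled like this in Rust. Nothing here exists in Substrait today: the `DataType` enum, its `Lambda` variant and `lambda_matches` are all hypothetical, and nullability is left out to keep the sketch short.

```rust
/// Hypothetical data types, including the lambda form floated above.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    I32,
    Boolean,
    Struct(Vec<DataType>),
    /// `lambda<struct<args...>, return>`
    Lambda { args: Box<DataType>, ret: Box<DataType> },
}

/// Checks a lambda value against what the calling function's prototype
/// expects: the types are matched, not derived from the prototype.
fn lambda_matches(
    value: &DataType,
    expected_args: &[DataType],
    expected_ret: &DataType,
) -> bool {
    match value {
        DataType::Lambda { args, ret } => match args.as_ref() {
            DataType::Struct(fields) => {
                fields.as_slice() == expected_args && ret.as_ref() == expected_ret
            }
            _ => false,
        },
        _ => false,
    }
}
```

A comparator for a sort-like function would then be a `Lambda` taking `struct<i32, i32>` and returning a boolean, and the function prototype would reject a lambda with a different argument list.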

rs/src/parse/relations/join.rs

mbrobbel · 2022-09-20T14:08:23Z

@jvanstraten do you prefer a single PR to fix #51, or should I split it over multiple PRs? If we're doing multiple PRs, this one is ready for review.

jvanstraten · 2022-09-20T15:22:35Z

rs/src/parse/relations/join.rs

+                "Returns rows combining the row from the left and right \
+                input for each pair where the join expression yields true. \
+                Discarding rows where the join expression yields {}.",


This isn't quite grammatical; the last sentence is a fragment:

Suggested change

-                "Returns rows combining the row from the left and right \
-                input for each pair where the join expression yields true. \
-                Discarding rows where the join expression yields {}.",
+                "Returns rows combining the row from the left and right \
+                input for each pair where the join expression yields true, \
+                discarding rows where the join expression yields {}.",

Likewise for the other options.

jvanstraten · 2022-09-20T15:23:49Z

> do you prefer a single PR to fix #51, or should I split over multiple PRs?

I'm fine with either, so I just went ahead and reviewed. LGTM aside from the grammar there.

fix: summary of filter relations (nullability of expression)

5e6bc3c

mbasmanova reviewed Sep 13, 2022

chore: python formatting

8d7d0f8

mbrobbel force-pushed the pred-null branch from 9319917 to 8d7d0f8 Compare September 13, 2022 14:51

mbrobbel marked this pull request as draft September 13, 2022 14:57

jvanstraten mentioned this pull request Sep 13, 2022

What are use cases for non-nullable types? substrait-io/substrait#332

Closed

fix: add literal join expression value check

467d8f9

mbrobbel force-pushed the pred-null branch from 416ca90 to 467d8f9 Compare September 14, 2022 08:18

fix: summary improvements for nullable join predicates

664e062

mbrobbel force-pushed the pred-null branch from b3861e5 to 664e062 Compare September 14, 2022 10:25

mbrobbel marked this pull request as ready for review September 14, 2022 10:26

jvanstraten requested changes Sep 14, 2022

rs/src/parse/relations/join.rs (outdated, resolved)

rs/src/parse/relations/join.rs (outdated, resolved)

jvanstraten changed the title ~~Fix nullability semantics of predicate expressions~~ fix: nullability semantics of predicates in JoinRel and FilterRel Sep 14, 2022

jvanstraten mentioned this pull request Sep 14, 2022

Double-check whether predicate expressions have correct nullability semantics #51

Open

mbrobbel added 3 commits September 20, 2022 15:23

fix: revert literal join expression check

05ece35

style: fix join type summary string style

c463c86

style: fix indentation of summary strings

46266f0

jvanstraten requested changes Sep 20, 2022

fix: grammar in join summary

f8122b3

jvanstraten approved these changes Sep 20, 2022

jvanstraten merged commit add22d9 into substrait-io:main Sep 20, 2022

mbrobbel deleted the pred-null branch September 21, 2022 11:54
