Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: nullability semantics of predicates in JoinRel and FilterRel #52

Merged
merged 8 commits into from
Sep 20, 2022

Conversation

mbrobbel
Copy link
Member

@mbrobbel mbrobbel commented Sep 13, 2022

An attempt to fix part of #51.

I've added comment tests to the test runner to test these changes.

@@ -32,14 +32,10 @@ pub fn parse_filter_rel(
describe!(y, Relation, "Filter by {}", &predicate);
summary!(
y,
"This relation discards all rows for which {} yields false.",
&predicate
"This relation discards all rows for which the expression {} yields {}.",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the fix.

I'm confused by "nullable" flag here. In SQL, all types are nullable, e.g. value of any type can be null. I'm curious which systems are expected to specify nullable as false.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In substrait types can be non-nullable (e.g. booleans). A scalar function from an extension could return a non-nullable boolean. This change reflects that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Substrait's type system is strictly typed and actually uses non-nullable by default; please review https://substrait.io/types/type_system/. As for why, that's a question more suited for slack or maybe the mailing list I suppose. It was built this way before @mbrobbel or I joined the project.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbrobbel @jvanstraten Thank you for explaining. This is confusing to me because most SQL functions have so-called default null behavior, e.g. null in any of the arguments automatically returns a null. Hence, most SQL functions return nullable types. Thus, it is strange to use non-nullable type by default. CC: @jacques-n

I assume I should open an issue in https://github.com/substrait-io/substrait/issues to get more context on this design decision.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default behavior for Substrait functions is that the return type is nullable if and only if any of the arguments is nullable. This is covered by MIRROR nullability behavior. With the "by default" thing I just mean that if you write i32 you're actually talking about a non-nullable 32-bit integer, whereas you need to write i32? to talk about a nullable 32-bit integer. We could have used i32! for non-nullable and i32 for nullable, too (or just require either ! or ? if you want to consider nullability). It's just a notation convention thing.

Copy link
Collaborator

@jvanstraten jvanstraten Sep 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for the "why" part by the way, I can't speak for the community because I don't know the reasons, but as a more back-end/hardware guy, I'd say that not just allowing everything to basically fail silently by default is a good thing. null is hardly treated in a consistent manner, so being able to avoid it or make assertions that an expression can never fail that way is also a good thing. It's certainly true though that for a practical plan using SQL-esque functions and relations, the vast majority of your types are going to be nullable.

I suppose part of it is also that we're representing a whole row/record as a single struct in some contexts. That struct is never nullable, even in SQL; there is no way to have a row that is "so null" that it doesn't even have any fields anymore. This generalization is, again, really nice when you're actually implementing these things and want to support nested types (which, AFAICT, SQL generally does not support, but Substrait does). It prevents you from having to implement column/field access and nested struct field access separately.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the "by default" thing I just mean that if you write i32 you're actually talking about a non-nullable 32-bit integer, whereas you need to write i32? to talk about a nullable 32-bit integer.

@jvanstraten This is very helpful clarification. Thanks.

@chaojun-zhang We need to make sure we use i32? and similar when defining custom function signatures in facebookincubator/velox#2496

Copy link

@mbasmanova mbasmanova Sep 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose part of it is also that we're representing a whole row/record as a single struct in some contexts. That struct is never nullable, even in SQL

I understand this point. In Velox, we also use RowVector to represent both a struct field and a top-level row, which like you pointed out cannot possibly be null.

This generalization is, again, really nice when you're actually implementing these things and want to support nested types (which, AFAICT, SQL generally does not support, but Substrait does).

Modern SQL engines (Spark, Presto, Trino, Velox at least) do support complex types and also support higher order functions / lambdas. For example, transform in Presto: https://prestodb.io/docs/current/functions/array.html#transform

I'm curious if lambda functions are supported in Substrait as well and, if so, what is the type of the "function" argument in these?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to make sure we use i32? and similar when defining custom function signatures

Be aware that it won't do anything unless you also specify DISCRETE nullability, because the nullability flags end up getting "overridden" by the semantics of MIRROR or (for arguments) DECLARED_OUTPUT. This and the other logic behind type derivations are, honestly, super weird, and have been causing me headaches for over half a year now to try to get them into the validator, so if something doesn't make sense to you there, it might well be because it just doesn't. I'm working on it.

I'm curious if lambda functions are supported in Substrait as well and, if so, what is the type of the "function" argument in these?

They're not, but only because no one has bothered to define them yet. I thought about them for a bit in substrait-io/substrait#320 because some functions naturally take a comparator lambda for sort-like semantics, which currently can't be done. I figured that, since we don't really have a concept of statements, a lambda function would just be written like a normal expression, but with argument references at the leaves rather than (only) field references and literals. The types of those would just be whatever the function you're passing the lambda to as an argument defines them to be.

This is not as powerful as generalized lambdas that you can pass around as values, though. A more general solution, with that lambda data type, would probably look something like lambda<struct<arg0, arg1, ...>, return>, in which case the argument types would need to be defined along with the argument references in the expression tree (or, better yet, in the special expression type that constructs the lambda), and then matched against the function prototype that will be calling them (rather than derived by the prototype).

@mbrobbel mbrobbel marked this pull request as ready for review September 14, 2022 10:26
rs/src/parse/relations/join.rs Outdated Show resolved Hide resolved
rs/src/parse/relations/join.rs Outdated Show resolved Hide resolved
@jvanstraten jvanstraten changed the title Fix nullability semantics of predicate expressions fix: nullability semantics of predicates in JoinRel and FilterRel Sep 14, 2022
@mbrobbel
Copy link
Member Author

@jvanstraten do you prefer a single PR to fix #51, or should I split over multiple PRs? If we're doing multiple PRs, this one is ready for review.

Comment on lines 117 to 119
"Returns rows combining the row from the left and right \
input for each pair where the join expression yields true. \
Discarding rows where the join expression yields {}.",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem very grammatically correct;

Suggested change
"Returns rows combining the row from the left and right \
input for each pair where the join expression yields true. \
Discarding rows where the join expression yields {}.",
"Returns rows combining the row from the left and right \
input for each pair where the join expression yields true, \
discarding rows where the join expression yields {}.",

Likewise for the other options.

@jvanstraten
Copy link
Collaborator

do you prefer a single PR to fix #51, or should I split over multiple PRs?

I'm fine with either, so I just went ahead and reviewed. LGTM aside from the grammar there.

@jvanstraten jvanstraten merged commit add22d9 into substrait-io:main Sep 20, 2022
@mbrobbel mbrobbel deleted the pred-null branch September 21, 2022 11:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants