-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: nullability semantics of predicates in JoinRel and FilterRel #52
Conversation
@@ -32,14 +32,10 @@ pub fn parse_filter_rel( | |||
describe!(y, Relation, "Filter by {}", &predicate); | |||
summary!( | |||
y, | |||
"This relation discards all rows for which {} yields false.", | |||
&predicate | |||
"This relation discards all rows for which the expression {} yields {}.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the fix.
I'm confused by "nullable" flag here. In SQL, all types are nullable, e.g. value of any type can be null. I'm curious which systems are expected to specify nullable as false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In substrait types can be non-nullable (e.g. booleans). A scalar function from an extension could return a non-nullable boolean. This change reflects that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Substrait's type system is strictly typed and actually uses non-nullable by default; please review https://substrait.io/types/type_system/. As for why, that's a question more suited for slack or maybe the mailing list I suppose. It was built this way before @mbrobbel or I joined the project.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mbrobbel @jvanstraten Thank you for explaining. This is confusing to me because most SQL functions have so-called default null behavior, e.g. null in any of the arguments automatically returns a null. Hence, most SQL functions return nullable types. Thus, it is strange to use non-nullable type by default. CC: @jacques-n
I assume I should open an issue in https://github.com/substrait-io/substrait/issues to get more context on this design decision.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default behavior for Substrait functions is that the return type is nullable if and only if any of the arguments is nullable. This is covered by MIRROR
nullability behavior. With the "by default" thing I just mean that if you write i32
you're actually talking about a non-nullable 32-bit integer, whereas you need to write i32?
to talk about a nullable 32-bit integer. We could have used i32!
for non-nullable and i32
for nullable, too (or just require either ! or ? if you want to consider nullability). It's just a notation convention thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for the "why" part by the way, I can't speak for the community because I don't know the reasons, but as a more back-end/hardware guy, I'd say that not just allowing everything to basically fail silently by default is a good thing. null
is hardly treated in a consistent manner, so being able to avoid it or make assertions that an expression can never fail that way is also a good thing. It's certainly true though that for a practical plan using SQL-esque functions and relations, the vast majority of your types are going to be nullable.
I suppose part of it is also that we're representing a whole row/record as a single struct in some contexts. That struct is never nullable, even in SQL; there is no way to have a row that is "so null" that it doesn't even have any fields anymore. This generalization is, again, really nice when you're actually implementing these things and want to support nested types (which, AFAICT, SQL generally does not support, but Substrait does). It prevents you from having to implement column/field access and nested struct field access separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the "by default" thing I just mean that if you write i32 you're actually talking about a non-nullable 32-bit integer, whereas you need to write i32? to talk about a nullable 32-bit integer.
@jvanstraten This is very helpful clarification. Thanks.
@chaojun-zhang We need to make sure we use i32? and similar when defining custom function signatures in facebookincubator/velox#2496
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose part of it is also that we're representing a whole row/record as a single struct in some contexts. That struct is never nullable, even in SQL
I understand this point. In Velox, we also use RowVector to represent both a struct field and a top-level row, which like you pointed out cannot possibly be null.
This generalization is, again, really nice when you're actually implementing these things and want to support nested types (which, AFAICT, SQL generally does not support, but Substrait does).
Modern SQL engines (Spark, Presto, Trino, Velox at least) do support complex types and also support higher order functions / lambdas. For example, transform in Presto: https://prestodb.io/docs/current/functions/array.html#transform
I'm curious if lambda functions are supported in Substrait as well and, if so, what is the type of the "function" argument in these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to make sure we use i32? and similar when defining custom function signatures
Be aware that it won't do anything unless you also specify DISCRETE
nullability, because the nullability flags end up getting "overridden" by the semantics of MIRROR
or (for arguments) DECLARED_OUTPUT
. This and the other logic behind type derivations are, honestly, super weird, and have been causing me headaches for over half a year now to try to get them into the validator, so if something doesn't make sense to you there, it might well be because it just doesn't. I'm working on it.
I'm curious if lambda functions are supported in Substrait as well and, if so, what is the type of the "function" argument in these?
They're not, but only because no one has bothered to define them yet. I thought about them for a bit in substrait-io/substrait#320 because some functions naturally take a comparator lambda for sort-like semantics, which currently can't be done. I figured that, since we don't really have a concept of statements, a lambda function would just be written like a normal expression, but with argument references at the leaves rather than (only) field references and literals. The types of those would just be whatever the function you're passing the lambda to as an argument defines them to be.
This is not as powerful as generalized lambdas that you can pass around as values, though. A more general solution, with that lambda data type, would probably look something like lambda<struct<arg0, arg1, ...>, return>
, in which case the argument types would need to be defined along with the argument references in the expression tree (or, better yet, in the special expression type that constructs the lambda), and then matched against the function prototype that will be calling them (rather than derived by the prototype).
@jvanstraten do you prefer a single PR to fix #51, or should I split over multiple PRs? If we're doing multiple PRs, this one is ready for review. |
rs/src/parse/relations/join.rs
Outdated
"Returns rows combining the row from the left and right \ | ||
input for each pair where the join expression yields true. \ | ||
Discarding rows where the join expression yields {}.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem very grammatically correct;
"Returns rows combining the row from the left and right \ | |
input for each pair where the join expression yields true. \ | |
Discarding rows where the join expression yields {}.", | |
"Returns rows combining the row from the left and right \ | |
input for each pair where the join expression yields true, \ | |
discarding rows where the join expression yields {}.", |
Likewise for the other options.
I'm fine with either, so I just went ahead and reviewed. LGTM aside from the grammar there. |
An attempt to fix part of #51.
Filter relations
According to What are the semantics of the ANTI join? substrait#325 (comment):
I modified the summary of filter relations to reflect this behavior.
Join relations
According to https://github.com/substrait-io/substrait/blob/e68fdc049c1219a1269225456072d7548c83f297/site/docs/relations/logical_relations.md?plain=1#L199, literal join expression values are allowed, but they must betrue
. Added a check and test.I've added comment tests to the test runner to test these changes.