-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(rust, python): correct struct null counts #10142
Conversation
let chunks_lens = self.fields()[0].chunks().len(); | ||
|
||
// fast path | ||
// we early return if a column doesn't have nulls | ||
for i in 0..chunks_lens { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we keep this fast path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- It was going chunk by chunk, and if it found one chunk of a multi-chunk series that had no nulls it would abandon early. Actually, when reading this I didn't understand going per-chunk. Are the chunks in every series in a struct always aligned in this way? Is it faster than just checking the null count of the whole series?
- If we're going to create a total null count we can't stop early. However, if it's not expensive to do a first pass where we iterate over every series and grab their null counts for the total, then we could set a fast path flag there.
- Even if we have to iterate over everything, this new version will do no more bit manipulation than it has to.
I think the question is whether a first pass over every field that doesn't get to stop early because it has to count all the nulls, but which will avoid bit operations later is faster in aggregate. I'm going to guess "yes" since structs are likely to be much "longer" than they are "wide" so I'll go ahead and do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I added a pre-check/count for total nulls and whether any null rows are possible. Should be fine and there's a test so the multi-chunk bug doesn't happen again.
} | ||
(_, None, _) => n_nulls = Some(0), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should break early here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't with the current algorithm, see above. But if you're ok with potentially two full passes over all the fields where the first one is very cheap, we can. After writing the above explanation it does sound like that will be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we count total nulls before, it will break.
ba840fd
to
2048a5c
Compare
(acc.0 & (s.null_count() != 0), acc.1 + s.null_count()) | ||
}); | ||
self.total_null_count = total_null_count; | ||
if !could_have_nulls { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we have an extra could_have_nulls
variable? Isn't that the same as total_null_count != 0
?
As I understand it, if total_null_count == 0
, we can return early, otherwise we have to check the rows.
Or do I misunderstand it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can have nulls only if the count in every field is > 0. So in addition to total_null_count == 0
you can return early when any one field has no nulls. For example consider a struct with three fields
f1 | f2 | f3 |
---|---|---|
1 | null | null |
2 | 'a' | null |
3 | null | 'b' |
The total null count is 4, but we can still return early because the first field has no nulls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, right. This is confusing as the null
name now collides with two meanings.
Shall we name one null_row
and the other null_value
to disambiguate and reflect that in the parameter names. That should make it clear which one we mean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean instead of null_count
and total_null_count
? Or just here to indicate the potential for a null row and separate out the count of all null values across all fields?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed could_have_nulls
to make it clearer we're talking about entire rows. Hopefully that's what you meant. I think that, combined with the comment about should make the distinction clear. Since after that section we're done with the total number of nulls across all fields.
validity_agg = | ||
validity_agg.map_or(Some(v.clone()), |agg| Some(v.bitor(&agg))); | ||
// n.b. This is "free" since any bitops trigger a count. | ||
n_nulls = Some(validity_agg.as_ref().unwrap().unset_bits()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you do Some(foo.unwrap())
where you unwrap the Some
in Option<foo>
, this would be better written as foo.map
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. Writing code at 4 in the morning is a high variance behavior. Fixed.
} | ||
match (arr.validity(), n_nulls, arr.null_count() == 0) { | ||
// The null count is to avoid touching chunks with a validity mask but no nulls | ||
(_, Some(0), _) => break, // No all-null rows, next chunk! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
I think this should be good to go now. |
Yes, it looks great. Thank you! |
resolves #10130
Structs now count nulls the way we say they should.
I also added a
total_null_count
field to be used internally for now. It can be exposed later.