fix(rust, python): correct struct null counts #10142

Merged 4 commits into pola-rs:main from struct-null-counts, Aug 1, 2023

Conversation

magarick (Contributor):

resolves #10130
Structs now count nulls the way we say they should. I also added a total_null_count field, used internally for now; it can be exposed later.

@github-actions bot added labels fix (Bug fix), python (Related to Python Polars), rust (Related to Rust Polars) on Jul 28, 2023
```rust
let chunks_lens = self.fields()[0].chunks().len();

// fast path
// we early return if a column doesn't have nulls
for i in 0..chunks_lens {
```
ritchie46 (Member):

Can we keep this fast path?

magarick (Contributor, author):

  1. It was going chunk by chunk, and if it found one chunk of a multi-chunk series with no nulls it would bail out early. When reading this I didn't understand going per-chunk. Are the chunks of every series in a struct always aligned this way? Is it faster than just checking the null count of the whole series?
  2. If we're going to compute a total null count, we can't stop early. However, if it's cheap to do a first pass that iterates over every series and sums their null counts, we could set a fast-path flag there.
  3. Even if we have to iterate over everything, this new version does no more bit manipulation than it has to.

I think the question is whether a first pass over every field, which can't stop early because it has to count all the nulls but lets us avoid bit operations later, is faster in aggregate. I'm going to guess "yes", since structs are likely to be much "longer" than they are "wide", so I'll go ahead and do that.

magarick (Contributor, author):

OK, I added a pre-check/count for total nulls and whether any null rows are possible. Should be fine and there's a test so the multi-chunk bug doesn't happen again.
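The pre-check pass could look roughly like this. This is a minimal sketch: `Field` with a `null_count` field is a hypothetical stand-in for a Polars Series (the real code reads `Series::null_count()`), and `precheck` is an illustrative name, not the PR's actual function.

```rust
// Minimal sketch of the cheap first pass described above.
// `Field` is a hypothetical stand-in for a Polars Series.
struct Field {
    null_count: usize,
}

/// Sum per-field null counts, and note whether an all-null row is even
/// possible: it is only if every field contains at least one null.
fn precheck(fields: &[Field]) -> (usize, bool) {
    fields.iter().fold((0, true), |(total, possible), f| {
        (total + f.null_count, possible && f.null_count != 0)
    })
}

fn main() {
    // The first field has no nulls, so no row can be entirely null,
    // even though there are 4 null values in total.
    let fields = [
        Field { null_count: 0 },
        Field { null_count: 2 },
        Field { null_count: 2 },
    ];
    let (total_null_count, could_have_nulls) = precheck(&fields);
    println!("{} {}", total_null_count, could_have_nulls); // 4 false
}
```

This pass never stops early (it needs the full total), but it is a pure sum of cached per-series counts, so it stays cheap regardless of length.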

```rust
}
(_, None, _) => n_nulls = Some(0),
```
ritchie46 (Member):

We should break early here.

magarick (Contributor, author):

We can't with the current algorithm; see above. But if you're OK with potentially two full passes over all the fields, where the first one is very cheap, we can. After writing the above explanation, it does sound like that will be better.

magarick (Contributor, author):

Now that we count total nulls beforehand, it will break early.

```rust
(acc.0 & (s.null_count() != 0), acc.1 + s.null_count())
});
self.total_null_count = total_null_count;
if !could_have_nulls {
```
ritchie46 (Member):

Why do we have an extra could_have_nulls variable? Isn't that the same as total_null_count != 0?

As I understand it, if total_null_count == 0, we can return early, otherwise we have to check the rows.

Or do I misunderstand it here?

magarick (Contributor, author) on Jul 30, 2023:

You can have null rows only if the null count in every field is > 0. So in addition to total_null_count == 0, you can return early when any one field has no nulls. For example, consider a struct with three fields:

f1   f2     f3
1    null   null
2    'a'    null
3    null   'b'

The total null count is 4, but we can still return early because the first field (f1) has no nulls.

ritchie46 (Member):
Ah, right. This is confusing, as the "null" name now collides with two meanings.

Shall we name one null_row and the other null_value to disambiguate, and reflect that in the parameter names? That should make it clear which one we mean.

magarick (Contributor, author):

You mean instead of null_count and total_null_count? Or just here to indicate the potential for a null row and separate out the count of all null values across all fields?

magarick (Contributor, author):

I renamed could_have_nulls to make it clearer that we're talking about entire rows; hopefully that's what you meant. I think that, combined with the comment, should make the distinction clear, since after that section we're done with the total number of nulls across all fields.

```rust
validity_agg =
    validity_agg.map_or(Some(v.clone()), |agg| Some(v.bitor(&agg)));
// n.b. This is "free" since any bitops trigger a count.
n_nulls = Some(validity_agg.as_ref().unwrap().unset_bits());
```
ritchie46 (Member):

Since you do Some(foo.unwrap()), unwrapping the Some in Option<foo> only to rewrap the result, this would be better written with foo.map.

magarick (Contributor, author):

Indeed. Writing code at 4 in the morning is a high-variance activity. Fixed.
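The shape of that cleanup can be shown with a toy type. Note this is only an illustration: `Mask` is a hypothetical stand-in for an Arrow validity bitmap, not the type used in the PR.

```rust
// Toy illustration of the unwrap-then-rewrap cleanup suggested above.
// `Mask` is a hypothetical stand-in for an Arrow validity bitmap.
struct Mask(Vec<bool>);

impl Mask {
    /// Number of unset (null) bits in the mask.
    fn unset_bits(&self) -> usize {
        self.0.iter().filter(|b| !**b).count()
    }
}

fn main() {
    let validity_agg: Option<Mask> = Some(Mask(vec![true, false, false, true]));

    // Before: unwrap the Option only to rewrap the result in Some.
    let n_before = Some(validity_agg.as_ref().unwrap().unset_bits());

    // After: map over the Option directly; this also handles None without panicking.
    let n_after = validity_agg.as_ref().map(|m| m.unset_bits());

    assert_eq!(n_before, n_after);
    println!("{:?}", n_after); // Some(2)
}
```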

```rust
}
match (arr.validity(), n_nulls, arr.null_count() == 0) {
    // The null count is to avoid touching chunks with a validity mask but no nulls
    (_, Some(0), _) => break, // No all-null rows, next chunk!
```
ritchie46 (Member):
Nice!
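Putting the pieces together, the per-chunk logic under review can be sketched as follows. This is a simplified model: `Vec<bool>` (true = valid) stands in for an Arrow validity bitmap, and `all_null_rows` is an illustrative name, not the PR's actual function.

```rust
// Simplified model of the per-chunk logic above: count rows where every
// field is null, bailing out as soon as that becomes impossible.
fn all_null_rows(field_validities: &[Option<Vec<bool>>]) -> usize {
    let mut agg: Option<Vec<bool>> = None;
    for v in field_validities {
        match v {
            // A field with no validity mask has no nulls, so no row in
            // this chunk can be entirely null: break early.
            None => return 0,
            Some(mask) => {
                agg = Some(match agg {
                    None => mask.clone(),
                    // Bitor the masks: a row stays a candidate only while
                    // every field seen so far is null at that position.
                    Some(a) => a.iter().zip(mask).map(|(x, y)| *x | *y).collect(),
                });
                // Counting here is cheap; stop once no all-null row remains.
                if agg.as_ref().unwrap().iter().all(|b| *b) {
                    return 0;
                }
            }
        }
    }
    // Positions still unset after aggregation are null in every field.
    agg.map_or(0, |a| a.iter().filter(|b| !**b).count())
}

fn main() {
    // Two fields over four rows; only row index 1 is null in both.
    let f1 = Some(vec![true, false, true, true]);
    let f2 = Some(vec![true, false, false, true]);
    println!("{}", all_null_rows(&[f1, f2])); // 1
}
```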

magarick (Contributor, author):

I think this should be good to go now.

ritchie46 (Member):

> I think this should be good to go now.

Yes, it looks great. Thank you!

@ritchie46 ritchie46 merged commit 78460fe into pola-rs:main Aug 1, 2023
23 checks passed
@magarick magarick deleted the struct-null-counts branch August 1, 2023 19:48
magarick added a commit to magarick/polars that referenced this pull request Aug 4, 2023
Development

Successfully merging this pull request may close this issue: null_count() for Structs is inconsistent.
2 participants