fix(rust, python): correct struct null counts #10142

Merged 4 commits into pola-rs:main from struct-null-counts, Aug 1, 2023

Conversation

magarick (Contributor):

resolves #10130
Structs now count nulls the way we say they should. I also added a total_null_count field, used internally for now; it can be exposed later.

@github-actions bot added labels fix (Bug fix), python (Related to Python Polars), rust (Related to Rust Polars) on Jul 28, 2023
```rust
let chunks_lens = self.fields()[0].chunks().len();

// fast path
// we early return if a column doesn't have nulls
for i in 0..chunks_lens {
```
ritchie46 (Member):

Can we keep this fast path?

magarick (Contributor, author):

  1. It was going chunk by chunk, and if it found one chunk of a multi-chunk series with no nulls it would bail out early. When reading this I didn't understand going per-chunk. Are the chunks of every series in a struct always aligned this way? Is it faster than just checking the null count of the whole series?
  2. If we're going to compute a total null count, we can't stop early. However, if it's cheap to do a first pass that iterates over every series and sums their null counts, we could set a fast-path flag there.
  3. Even if we have to iterate over everything, this new version does no more bit manipulation than it has to.

I think the question is whether a first pass over every field, which can't stop early because it has to count all the nulls but lets us avoid bit operations later, is faster in aggregate. I'm going to guess "yes", since structs are likely to be much "longer" than they are "wide", so I'll go ahead and do that.

magarick (Contributor, author):

OK, I added a pre-check/count for total nulls and whether any null rows are possible. Should be fine and there's a test so the multi-chunk bug doesn't happen again.
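The pre-check pass could look roughly like this. This is a minimal sketch: `Field` with a `null_count` field is a hypothetical stand-in for a Polars Series (the real code reads `Series::null_count()`), and `precheck` is an illustrative name, not the PR's actual function.

```rust
// Minimal sketch of the cheap first pass described above.
// `Field` is a hypothetical stand-in for a Polars Series.
struct Field {
    null_count: usize,
}

/// Sum per-field null counts, and note whether an all-null row is even
/// possible: it is only if every field contains at least one null.
fn precheck(fields: &[Field]) -> (usize, bool) {
    fields.iter().fold((0, true), |(total, possible), f| {
        (total + f.null_count, possible && f.null_count != 0)
    })
}

fn main() {
    // The first field has no nulls, so no row can be entirely null,
    // even though there are 4 null values in total.
    let fields = [
        Field { null_count: 0 },
        Field { null_count: 2 },
        Field { null_count: 2 },
    ];
    let (total_null_count, could_have_nulls) = precheck(&fields);
    println!("{} {}", total_null_count, could_have_nulls); // 4 false
}
```

This pass never stops early (it needs the full total), but it is a pure sum of cached per-series counts, so it stays cheap regardless of length.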

```rust
}
(_, None, _) => n_nulls = Some(0),
```
ritchie46 (Member):

We should break early here.

magarick (Contributor, author):

We can't with the current algorithm; see above. But if you're OK with potentially two full passes over all the fields, where the first one is very cheap, we can. After writing the above explanation, it does sound like that will be better.

magarick (Contributor, author):

Now that we count total nulls beforehand, it will break early.

```rust
(acc.0 & (s.null_count() != 0), acc.1 + s.null_count())
});
self.total_null_count = total_null_count;
if !could_have_nulls {
```
ritchie46 (Member):

Why do we have an extra could_have_nulls variable? Isn't that the same as total_null_count != 0?

As I understand it, if total_null_count == 0, we can return early, otherwise we have to check the rows.

Or do I misunderstand it here?

magarick (Contributor, author) on Jul 30, 2023:

You can have null rows only if the null count in every field is > 0. So in addition to total_null_count == 0, you can return early when any one field has no nulls. For example, consider a struct with three fields:

f1   f2     f3
1    null   null
2    'a'    null
3    null   'b'

The total null count is 4, but we can still return early because the first field (f1) has no nulls.

ritchie46 (Member):
Ah, right. This is confusing, as the "null" name now collides with two meanings.

Shall we name one null_row and the other null_value to disambiguate, and reflect that in the parameter names? That should make it clear which one we mean.

magarick (Contributor, author):

You mean instead of null_count and total_null_count? Or just here to indicate the potential for a null row and separate out the count of all null values across all fields?

magarick (Contributor, author):

I renamed could_have_nulls to make it clearer that we're talking about entire rows; hopefully that's what you meant. I think that, combined with the comment, should make the distinction clear, since after that section we're done with the total number of nulls across all fields.

```rust
validity_agg =
    validity_agg.map_or(Some(v.clone()), |agg| Some(v.bitor(&agg)));
// n.b. This is "free" since any bitops trigger a count.
n_nulls = Some(validity_agg.as_ref().unwrap().unset_bits());
```
ritchie46 (Member):

Since you do Some(foo.unwrap()), unwrapping the Some in Option<foo> only to rewrap the result, this would be better written with foo.map.

magarick (Contributor, author):

Indeed. Writing code at 4 in the morning is a high-variance activity. Fixed.
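The shape of that cleanup can be shown with a toy type. Note this is only an illustration: `Mask` is a hypothetical stand-in for an Arrow validity bitmap, not the type used in the PR.

```rust
// Toy illustration of the unwrap-then-rewrap cleanup suggested above.
// `Mask` is a hypothetical stand-in for an Arrow validity bitmap.
struct Mask(Vec<bool>);

impl Mask {
    /// Number of unset (null) bits in the mask.
    fn unset_bits(&self) -> usize {
        self.0.iter().filter(|b| !**b).count()
    }
}

fn main() {
    let validity_agg: Option<Mask> = Some(Mask(vec![true, false, false, true]));

    // Before: unwrap the Option only to rewrap the result in Some.
    let n_before = Some(validity_agg.as_ref().unwrap().unset_bits());

    // After: map over the Option directly; this also handles None without panicking.
    let n_after = validity_agg.as_ref().map(|m| m.unset_bits());

    assert_eq!(n_before, n_after);
    println!("{:?}", n_after); // Some(2)
}
```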

```rust
}
match (arr.validity(), n_nulls, arr.null_count() == 0) {
    // The null count is to avoid touching chunks with a validity mask but no nulls
    (_, Some(0), _) => break, // No all-null rows, next chunk!
```
ritchie46 (Member):
Nice!
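Putting the pieces together, the per-chunk logic under review can be sketched as follows. This is a simplified model: `Vec<bool>` (true = valid) stands in for an Arrow validity bitmap, and `all_null_rows` is an illustrative name, not the PR's actual function.

```rust
// Simplified model of the per-chunk logic above: count rows where every
// field is null, bailing out as soon as that becomes impossible.
fn all_null_rows(field_validities: &[Option<Vec<bool>>]) -> usize {
    let mut agg: Option<Vec<bool>> = None;
    for v in field_validities {
        match v {
            // A field with no validity mask has no nulls, so no row in
            // this chunk can be entirely null: break early.
            None => return 0,
            Some(mask) => {
                agg = Some(match agg {
                    None => mask.clone(),
                    // Bitor the masks: a row stays a candidate only while
                    // every field seen so far is null at that position.
                    Some(a) => a.iter().zip(mask).map(|(x, y)| *x | *y).collect(),
                });
                // Counting here is cheap; stop once no all-null row remains.
                if agg.as_ref().unwrap().iter().all(|b| *b) {
                    return 0;
                }
            }
        }
    }
    // Positions still unset after aggregation are null in every field.
    agg.map_or(0, |a| a.iter().filter(|b| !**b).count())
}

fn main() {
    // Two fields over four rows; only row index 1 is null in both.
    let f1 = Some(vec![true, false, true, true]);
    let f2 = Some(vec![true, false, false, true]);
    println!("{}", all_null_rows(&[f1, f2])); // 1
}
```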

magarick (Contributor, author):

I think this should be good to go now.

ritchie46 (Member):

> I think this should be good to go now.

Yes, it looks great. Thank you!

@ritchie46 ritchie46 merged commit 78460fe into pola-rs:main Aug 1, 2023
23 checks passed
@magarick magarick deleted the struct-null-counts branch August 1, 2023 19:48
magarick added a commit to magarick/polars that referenced this pull request Aug 4, 2023
Development

Successfully merging this pull request may close this issue: null_count() for Structs is inconsistent.
2 participants