
fix(rust, python): Keep min/max and arg_min/arg_max consistent. #10716

Merged
ritchie46 merged 2 commits into pola-rs:main on Aug 25, 2023

Conversation

reswqa
Collaborator

@reswqa reswqa commented Aug 24, 2023

This is based on #10708 and fixes #10707.

Some tests may fail because we changed the behavior of arg_min and arg_max. I will fix those cases one by one.

@github-actions bot added the fix (Bug fix), python (Related to Python Polars), and rust (Related to Rust Polars) labels Aug 24, 2023
@reswqa
Collaborator Author

reswqa commented Aug 24, 2023

You can take a look at the changed unit.series.test_series.test_arg_min_and_arg_max first to confirm that it matches our expected behavior, especially for the boolean case.

ca.into_iter()
.position(|opt_val| matches!(opt_val, Some(true)))
.enumerate()
.find_map(|(idx, val)| match val {
Collaborator Author

@reswqa reswqa Aug 24, 2023

The logic is: we first find true, and if it does not exist, we return the first false location. Is this exactly what we want?

Collaborator

@reswqa Yes, that's the correct behavior, although I think we could write a faster implementation that iterates over the bitmap more explicitly. But that doesn't have to be in this PR.
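
The rule confirmed here (return the first true; if there is none, return the first false) can be sketched as follows. This is only an illustration on a plain slice of optional booleans, not the code from this PR: `arg_max_bool` is a hypothetical name, and the real implementation works on a Polars boolean array rather than a slice.

```rust
/// Hypothetical helper (not the PR's code): index of the maximum of a
/// nullable boolean column. Nulls are skipped, `true` > `false`, and ties
/// resolve to the first occurrence.
fn arg_max_bool(values: &[Option<bool>]) -> Option<usize> {
    let mut first_false = None;
    for (idx, val) in values.iter().enumerate() {
        match val {
            // The first `true` is the maximum, so we can stop right away.
            Some(true) => return Some(idx),
            // Remember only the first `false` as a fallback.
            Some(false) if first_false.is_none() => first_false = Some(idx),
            // Later `false` values and nulls do not change the answer.
            _ => {}
        }
    }
    first_false
}
```

With this shape a single pass is enough: the loop exits early on the first true and only falls back to the remembered false position when no true exists.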

match ca.is_sorted_flag() {
IsSorted::Ascending => Some(0),
IsSorted::Descending => Some(ca.len() - 1),
IsSorted::Ascending => ca.first_non_null(),
Collaborator Author

Perhaps we can test ca.null_count() == 0: if there are nulls, we take this path; otherwise, we keep the previous logic?

Member

I don't think it would matter much for performance. I'll leave this up to you.
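
For readers following the thread, the sorted fast path with the suggested null check might look roughly like the sketch below. It models the column as a slice of optional integers; `Sorted` and `arg_min_sorted` are hypothetical stand-ins for the sortedness flag and the arg_min routine, and the null count is recomputed by scanning here only to keep the example dependency-free.

```rust
/// Hypothetical stand-in for the sortedness flag discussed above.
enum Sorted {
    Ascending,
    Descending,
    Not,
}

/// Sketch of arg_min with a sorted fast path (assumed shape, not the PR's code).
fn arg_min_sorted(values: &[Option<i32>], sorted: Sorted) -> Option<usize> {
    // Assumption: a real column would expose its null count directly;
    // this sketch counts nulls with a scan.
    let null_count = values.iter().filter(|v| v.is_none()).count();
    let no_nulls = null_count == 0 && !values.is_empty();
    match sorted {
        // No nulls: the minimum sits at one end of the sorted data.
        Sorted::Ascending if no_nulls => Some(0),
        Sorted::Descending if no_nulls => Some(values.len() - 1),
        // With nulls: skip them from the matching end.
        Sorted::Ascending => values.iter().position(|v| v.is_some()),
        Sorted::Descending => values.iter().rposition(|v| v.is_some()),
        // Unsorted: scan, ignoring nulls, keeping the first minimum.
        Sorted::Not => values
            .iter()
            .enumerate()
            .filter_map(|(i, &v)| v.map(|x| (i, x)))
            .min_by_key(|&(_, x)| x)
            .map(|(i, _)| i),
    }
}
```

The two guarded arms correspond to the previous logic, and the unguarded sorted arms correspond to the new first-non-null lookup.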

IsSorted::Not => ca
.into_iter()
.enumerate()
.flat_map(|(idx, val)| val.map(|val| (idx, val)))
Collaborator Author

@reswqa reswqa Aug 24, 2023

None will always be considered the minimum value as None < Some(_). But I'm not quite sure if it's better to do flat_map here or do match directly in reduce?
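
To make the `None < Some(_)` point concrete, here is a small sketch contrasting a naive arg-min that compares `Option` values directly with one that drops nulls first, in the `flat_map` shape from the diff. Both function names are hypothetical, and plain slices stand in for the chunked array.

```rust
/// Naive arg-min that compares `Option`s directly. Because `None` orders
/// before every `Some(_)`, a null wins whenever one is present.
fn arg_min_naive(values: &[Option<&str>]) -> Option<usize> {
    values
        .iter()
        .enumerate()
        .min_by_key(|&(_, v)| *v)
        .map(|(i, _)| i)
}

/// Null-skipping arg-min: nulls are removed before the comparison, so only
/// real values compete. Ties keep the earliest index.
fn arg_min_skip_nulls(values: &[Option<&str>]) -> Option<usize> {
    values
        .iter()
        .enumerate()
        .flat_map(|(idx, val)| val.map(|val| (idx, val)))
        .reduce(|acc, cur| if cur.1 < acc.1 { cur } else { acc })
        .map(|(idx, _)| idx)
}
```

For the input `[Some("b"), None, Some("a")]`, the naive version returns the index of the null, while the null-skipping version returns the index of `"a"`.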

Member

Match in reduce will be faster as we have less indirection. I think we should first loop over downcast_iter and then loop over the array.
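
A rough sketch of that suggestion, under stated assumptions: the chunked array is modelled as a slice of `Vec`s (a stand-in for what iterating `downcast_iter` would yield), the function name `arg_min_chunked` is hypothetical, and a plain loop with a running best value is used in place of `reduce`. The optional value is matched directly inside the loop instead of being flattened first.

```rust
/// Sketch (assumed shape): arg-min over a chunked nullable string column.
/// Outer loop over chunks, inner loop over each chunk's values.
fn arg_min_chunked(chunks: &[Vec<Option<&str>>]) -> Option<usize> {
    let mut best: Option<(usize, &str)> = None;
    // Global index of the first element of the current chunk.
    let mut offset = 0;
    for chunk in chunks {
        for (local_idx, val) in chunk.iter().enumerate() {
            // Match on the optional value directly; nulls fall through.
            match (*val, best) {
                // Strictly smaller value: replace the current best.
                (Some(v), Some((_, cur))) if v < cur => {
                    best = Some((offset + local_idx, v));
                }
                // First non-null value seen so far.
                (Some(v), None) => best = Some((offset + local_idx, v)),
                _ => {}
            }
        }
        offset += chunk.len();
    }
    best.map(|(idx, _)| idx)
}
```

Tracking `offset` is what turns a chunk-local position into an index into the whole column; the strict `<` comparison keeps the first occurrence on ties.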

Collaborator Author

@reswqa reswqa Aug 25, 2023

> Match in reduce will be faster as we have less indirection.

Makes sense, I will rewrite this.

> I think we should first loop over downcast_iter and then loop over the array.

Does this mean rewriting it to follow the same pattern as arg_max_numeric? 🤔

Member

They are similar. We could even share the same generic if you are up for that. ^^

But that is maybe a nice follow up PR. :)

Collaborator Author

@reswqa reswqa Aug 25, 2023

I am trying to do this refactoring: let them both share the same generic function, as they are almost identical. The only difference is that for numerical types, we have a fast path with the following bound:

for<'b> &'b [T::Native]: ArgMinMax,

But for Utf8ChunkedArray we do not have this bound, and we also don't have this branch:

} else {
    // When no nulls & array not empty => we can use fast argminmax
    let min_idx: usize = arr.values().as_slice().argmin();
    Some((min_idx, arr.value(min_idx)))
};

I can think of some hacky solutions (such as a magical macro), but I don't really like that approach. Is there a good solution for this that looks simpler and cleaner? 🤔

@reswqa reswqa marked this pull request as ready for review August 24, 2023 18:45

@reswqa
Collaborator Author

reswqa commented Aug 25, 2023

Thanks @ritchie46 and @orlp for the patient review!

> that is maybe a nice follow up PR

Agreed, we can merge this first to ensure that it works correctly. The optimization suggestions you have put forward are very helpful, and I will try opening a new perf PR on the weekend to do these things. :)

@ritchie46
Member

> The optimization suggestions you have put forward are very helpful, and I will try opening a new perf PR on the weekend to do these things. :)

🚀 Thanks!

@ritchie46 ritchie46 merged commit ecb819a into pola-rs:main Aug 25, 2023
25 checks passed
Labels
fix (Bug fix), python (Related to Python Polars), rust (Related to Rust Polars)
Development

Successfully merging this pull request may close these issues.

series arg_min & arg_max seems inconsistent with min & max
3 participants