Improve speed of median by implementing special GroupsAccumulator #13681
Conversation
I am back and resumed working on this PR yesterday. Very sorry for the long delay, for some personal reasons. Thanks @Dandandan for helping!
I think this PR is ready now, sorry again for the long delay. Q6 in h2o medium:
Great! I have benchmarked locally and it's much faster. (For benchmarks running on CSVs, I think most time is spent reading the CSV, so the results are closer.)
h2o Q6 on parquet (10k groups):
main: 1500ms
pr: 300ms
query with 4 groups (from tpch sf10 lineitem table):

```sql
select median(l_orderkey) from lineitem group by l_returnflag, l_linestatus;
```

main: 0.7s
pr: 0.35s
I have a suggestion for testing: I noticed the existing null tests for `median()` won't take this `GroupsAccumulator` path; those test cases don't have a GROUP BY, so they are executed with the regular `Accumulator` (see https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/aggregates/no_grouping.rs). Could you include tests for null handling?

# median with nulls
```rust
// Extend values to related groups
// TODO: avoid using the iterator of the `ListArray`; it will lead to
// many calls of `slice` on its inner array, and `slice` is not
// very efficient (due to the calculation of `null_count` for each `slice`).
```
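The TODO above worries about one `slice` call per list row. A minimal sketch (plain Rust, no arrow dependency; `offsets` and `values` model `ListArray::value_offsets()` and its flattened primitive child, and all names are illustrative) of the offsets-based alternative it hints at:

```rust
// Instead of slicing the child array once per list row, walk the offsets
// and copy directly out of the flattened child values.
fn extend_groups(
    offsets: &[i32],
    values: &[i64],
    group_indices: &[usize],
    group_values: &mut Vec<Vec<i64>>,
) {
    for (&group, win) in group_indices.iter().zip(offsets.windows(2)) {
        let (start, end) = (win[0] as usize, win[1] as usize);
        // One contiguous copy per row; no per-row `slice` allocation or
        // `null_count` recomputation.
        group_values[group].extend_from_slice(&values[start..end]);
    }
}

fn main() {
    // Two list rows, [1, 2] and [3], routed to groups 0 and 1.
    let offsets = [0i32, 2, 3];
    let values = [1i64, 2, 3];
    let mut groups = vec![Vec::new(), Vec::new()];
    extend_groups(&offsets, &values, &[0, 1], &mut groups);
    assert_eq!(groups, vec![vec![1, 2], vec![3]]);
}
```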
I think it's safe to directly use the value without checking null; null values should be ignored during accumulation.
> I think it's safe to directly use the value without checking null, null values should be ignored during accumulation

🤔 The input list can actually be null, because some lists are generated from `convert_to_state` (skip_partial).
And a batch like:

```
row0: 0
row1: 1
row2: null
...
rown: n
```

will be converted to a list like:

```
row0: [0]
row1: [1]
row2: null
...
rown: [n]
```
I think we can implement a simple version for correctness first.
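A minimal sketch of the null-aware behavior discussed here (plain Rust; `Option<Vec<i64>>` stands in for a nullable `ListArray` row, and all names are illustrative): a row whose list is null is skipped entirely, matching the `convert_to_state` example where a null input row becomes a null list row.

```rust
// Accumulate list rows into per-group buffers, ignoring null list rows.
fn accumulate(
    rows: &[Option<Vec<i64>>],
    group_indices: &[usize],
    group_values: &mut Vec<Vec<i64>>,
) {
    for (list, &group) in rows.iter().zip(group_indices) {
        if let Some(vals) = list {
            group_values[group].extend_from_slice(vals);
        }
        // `None` (a null list row from convert_to_state): nothing to add
    }
}

fn main() {
    // row2 is null, as in the batch example above.
    let rows = vec![Some(vec![0]), Some(vec![1]), None, Some(vec![4])];
    let mut groups = vec![Vec::new(); 2];
    accumulate(&rows, &[0, 1, 0, 1], &mut groups);
    assert_eq!(groups, vec![vec![0], vec![1, 4]]);
}
```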
@2010YOUY01 test cases with nulls have been added.
The implementation looks good to me, thank you
I plan to merge this one tomorrow; would you like to review this PR again before merging?
Thanks @Rachelint for this really nice PR and @2010YOUY01 for the review.
I think this PR looks really nice (easy to understand and read). I am running the end-to-end benchmarks on my GCP machine now to get final numbers, but I suspect it will be much faster 🚀
I left various suggestions on ways to potentially make this PR faster, but they could all be done as follow ons (or never)
Thanks again 🙏
```rust
let data_gen_config = baseline_config();

// Queries like SELECT median(a), median(distinct) FROM fuzz_table GROUP BY b
let query_builder = QueryBuilder::new()
```
❤️
```rust
/// For calculating the accurate medians of groups, we need to store all values
/// of groups before final evaluation.
/// So values in each group will be stored in a `Vec<T>`, and the total group values
/// will be actually organized as a `Vec<Vec<T>>`.
```
Given it is important to track the median values for each group separately, I don't really see a way around `Vec<Vec<T>>` -- I think it is the simplest version and will have pretty reasonable performance.
Yes, I tried to avoid `Vec<Vec<T>>` so as to skip the copy from `Vec<Vec<T>>` into the result `Vec<T>`, but it is hard to do.
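A sketch of the copy being discussed (plain Rust; names are illustrative): at emission time the `Vec<Vec<T>>` state is flattened into a single values buffer plus `i32` offsets, the two ingredients of a `ListArray`.

```rust
// Flatten per-group buffers into (values, offsets), the shape a
// `ListArray` state needs. The one unavoidable copy lives in `flatten()`.
fn flatten(groups: Vec<Vec<i64>>) -> (Vec<i64>, Vec<i32>) {
    let mut offsets = Vec::with_capacity(groups.len() + 1);
    offsets.push(0i32);
    let mut end = 0i32;
    for group in &groups {
        end += group.len() as i32;
        offsets.push(end);
    }
    let values: Vec<i64> = groups.into_iter().flatten().collect();
    (values, offsets)
}

fn main() {
    let (values, offsets) = flatten(vec![vec![1, 2], vec![3]]);
    assert_eq!(values, vec![1, 2, 3]);
    assert_eq!(offsets, vec![0, 2, 3]);
}
```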
```rust
// `offsets` in `ListArray`, each row as a list element
let offsets = (0..=input_array.len() as i32).collect::<Vec<_>>();
let offsets = OffsetBuffer::new(ScalarBuffer::from(offsets));
```
Likewise here, `OffsetBuffer::new_unchecked` could be used.
Done. It is easy to ensure all the checks in `OffsetBuffer::new` pass by adding:

```rust
assert!(input_array.len() <= i32::MAX as usize);
```
FWIW, for completeness I also ran the sqlite suite against this PR too:

```shell
andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2$ INCLUDE_SQLITE=true cargo test --profile release-nonlto --test sqllogictests
    Finished `release-nonlto` profile [optimized] target(s) in 0.34s
     Running bin/sqllogictests.rs (target/release-nonlto/deps/sqllogictests-6c6dc6221381c36b)
```

Completed in 8 minutes, and it all passed (thanks again to @Omega359 for making this happen).
I was trying to figure out how to run the extended test suite against this PR, but I couldn't figure out how to set up the workflow syntax. I filed this to track the idea:

BTW, I am still trying to benchmark this branch to show how awesome it is. I am having trouble as the h2o large benchmark is being OOM-killed on my machine.
```rust
#[derive(Debug)]
struct MedianGroupsAccumulator<T: ArrowNumericType + Send> {
    data_type: DataType,
    group_values: Vec<Vec<T::Native>>,
```
Just wondering -- using `Vec<Vec<T>>` as state storage doesn't seem to differ much from what the regular accumulator does, but this PR still introduces a noticeable performance improvement. Are there any other optimizations that could be used in the regular accumulator?

P.S. Asking just because when I was doing roughly the same for count distinct (PR), the performance of a GroupsAccumulator with `Vec<HashSet<>>` was not that significant compared to regular accumulators with `HashSet<>` states.
I think among other things, the intermediate state management (creating ListArrays directly rather than from ScalarValue) probably helps a lot:
There is also an extra allocation per group when using the groups accumulator adapter thingie
That being said, it is a fair question how much better the existing MedianAccumulator could be if it built the ListArrays directly, as this PR does 🤔
@korowa I think the point mentioned by @alamb is an important part of the improvement.

Here are some other points from me:

- In `GroupsAccumulatorAdapter::update_batch`, we need to reorder the input batch, and then use `slice` to split the reordered batch. These two operations may not be cheap.

  datafusion/functions-aggregate-common/src/aggregate/groups_accumulator.rs, lines 241 to 265 in 6c9355d:

  ```rust
  let values = take_arrays(values, &batch_indices, None)?;
  let opt_filter = get_filter_at_indices(opt_filter, &batch_indices)?;

  // invoke each accumulator with the appropriate rows, first
  // pulling the input arguments for this group into their own
  // RecordBatch(es)
  let iter = groups_with_rows.iter().zip(offsets.windows(2));

  let mut sizes_pre = 0;
  let mut sizes_post = 0;
  for (&group_idx, offsets) in iter {
      let state = &mut self.states[group_idx];
      sizes_pre += state.size();

      let values_to_accumulate = slice_and_maybe_filter(
          &values,
          opt_filter.as_ref().map(|f| f.as_boolean()),
          offsets,
      )?;
      f(state.accumulator.as_mut(), &values_to_accumulate)?;

      // clear out the state so they are empty for next
      // iteration
      state.indices.clear();
      sizes_post += state.size();
  ```

- In `GroupsAccumulatorAdapter::merge_batch`, the same problem as with the input batch may be even more serious, because we need to reorder a `ListArray`.
- And in `GroupsAccumulatorAdapter::state`, extra allocations exist, as mentioned by @alamb.
There were some improvements, but the overall results for clickbench q9 (I was mostly looking at this query) were about 2.63x for the GroupsAccumulator and 2.30x for the regular Accumulator -- so a 13-15% overall difference, which is not as massive as this PR's results.

However, maybe things have changed in the GroupsAccumulator implementation, and now even a plain `Vec<HashSet<>>` will be way faster.

UPD: and, yes, maybe producing the state, as pointed out by @alamb above, was (at least partially) the cause of the non-significant improvement -- in count distinct it was implemented via `ListArray::from_iter_primitive` (commit), instead of building it from a single flattened array and its offsets.
It seems really worth investigating the reason more deeply.
Sorry for the delay @Rachelint. I was having trouble with the benchmark queries. Here are my benchmark results -- not bad, almost 7x faster for our extended clickbench query.

And the actual h2o benchmark (which is dominated by CSV parsing) also shows a noticeable 1.6x improvement.
@korowa would you like to review this one again before merging?

I will plan to merge this one tomorrow if there is no one else who would like time to review.
```rust
.with_data_type(self.data_type.clone());

// `offsets` in `ListArray`, each row as a list element
assert!(input_array.len() <= i32::MAX as usize);
```
I wonder if we could use `i32::try_from` here instead of the assert plus the following cast in the range creation? I cannot imagine a real-life use case where this assertion would fail, but it can still be avoided.
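A sketch of the reviewer's suggestion (plain Rust; `make_offsets` is an illustrative stand-in for the offsets construction above): `i32::try_from` validates and converts in one step, replacing the assert-then-cast pair.

```rust
// Build list offsets 0..=len as i32, failing instead of asserting when
// the length exceeds i32::MAX.
fn make_offsets(len: usize) -> Result<Vec<i32>, std::num::TryFromIntError> {
    let end = i32::try_from(len)?; // rejects oversized lengths up front
    Ok((0..=end).collect())
}

fn main() {
    assert_eq!(make_offsets(3).unwrap(), vec![0, 1, 2, 3]);
    // A length past i32::MAX becomes an Err rather than a panic.
    assert!(make_offsets(i32::MAX as usize + 1).is_err());
}
```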
Yes, `i32::try_from` is indeed better, fixed.
@Rachelint, I've partially gone through it and haven't found any major or blocking issues, so it looks good to go.
Co-authored-by: Andrew Lamb <[email protected]>
```rust
.with_data_type(self.data_type.clone());

// `offsets` in `ListArray`, each row as a list element
let offset_end = i32::try_from(input_array.len()).unwrap();
```
And another one here -- why `unwrap()` and not `?`, since we can return an Error here?
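A sketch of the shape this suggestion leads to (plain Rust; `Error` and `Result` are hypothetical stand-ins for DataFusion's error types, and `offset_end` is illustrative): the conversion failure is propagated with `?` instead of panicking via `unwrap()`.

```rust
// Hypothetical minimal error type standing in for DataFusionError.
#[derive(Debug)]
struct Error(String);
type Result<T> = std::result::Result<T, Error>;

// Convert the array length into an i32 offset end, surfacing overflow
// as an Error the caller can handle.
fn offset_end(len: usize) -> Result<i32> {
    i32::try_from(len).map_err(|e| Error(format!("array too long: {e}")))
}

fn main() -> Result<()> {
    let end = offset_end(4)?; // propagates instead of panicking
    assert_eq!(end, 4);
    assert!(offset_end(i32::MAX as usize + 1).is_err());
    Ok(())
}
```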
Makes sense, it is better to just use it to replace the assert. Fixed. Thanks.
Which issue does this PR close?
Closes #13550
Rationale for this change
Support a specific GroupsAccumulator for median for better performance.
What changes are included in this PR?
MedianGroupsAccumulator
Are these changes tested?
Yes, by existing tests and new end-to-end and fuzz tests.
Are there any user-facing changes?
No.