Implement lazy `Distinct` operation #1558
Codecov Report
@@            Coverage Diff             @@
##           master    #1558      +/-   ##
==========================================
- Coverage   88.45%   88.44%   -0.02%
==========================================
  Files         362      362
  Lines       27455    27492      +37
  Branches     3705     3717      +12
==========================================
+ Hits        24285    24314      +29
- Misses       1938     1939       +1
- Partials     1232     1239       +7
Quality Gate passed
Initial review on everything but the tests.
// Removes all duplicates from input with regards to the columns
// in keepIndices. The input needs to be sorted on the keep indices,
// otherwise the result of this function is undefined.
Please also document the `previousRow` argument.
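To make the role of `previousRow` concrete, here is a minimal sketch of in-place duplicate removal on one sorted block. It is not the PR's implementation: `Row`, `Table`, `rowsMatch`, and `distinctInPlace` are hypothetical stand-ins for the engine's types and functions, and the assumption is that `previousRow` carries the last row emitted from the preceding block, so that duplicates spanning a block boundary are dropped as well.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Simplified stand-ins for the engine's row and table types (assumption).
using Row = std::vector<int>;
using Table = std::vector<Row>;

// True if `a` and `b` agree on every column listed in `keepIndices`.
// Hypothetical helper, named for this sketch only.
bool rowsMatch(const Row& a, const Row& b,
               const std::vector<std::size_t>& keepIndices) {
  return std::all_of(keepIndices.begin(), keepIndices.end(),
                     [&](std::size_t i) { return a[i] == b[i]; });
}

// Removes duplicates from `block` in place. `block` must be sorted on
// `keepIndices`. If `previousRow` is set, it is treated as the row that
// precedes `block[0]`, so a block that starts with a repeat of it loses
// that repeat too.
void distinctInPlace(Table& block, const std::vector<std::size_t>& keepIndices,
                     const std::optional<Row>& previousRow) {
  std::size_t dest = 0;
  for (std::size_t src = 0; src < block.size(); ++src) {
    // The row to compare against: the last row kept so far, or
    // `previousRow` if nothing from this block has been kept yet.
    const Row* last = dest > 0 ? &block[dest - 1]
                               : (previousRow ? &*previousRow : nullptr);
    if (last != nullptr && rowsMatch(*last, block[src], keepIndices)) {
      continue;  // duplicate, skip it
    }
    if (dest != src) {
      block[dest] = std::move(block[src]);
    }
    ++dest;
  }
  block.resize(dest);  // drop the now-unused tail
}
```

A caller processing several blocks would pass the last row of each deduplicated block as `previousRow` for the next one.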
VariableToColumnMap computeVariableToColumnMap() const override;

template <size_t WIDTH>
Short docstring, please.
@@ -36,17 +33,107 @@ VariableToColumnMap Distinct::computeVariableToColumnMap() const {
  return subtree_->getVariableColumns();
}

template <size_t WIDTH>
Suggested change:
- template <size_t WIDTH>
+ // ____________________________________________________________________
+ template <size_t WIDTH>
cppcoro::generator<IdTable> Distinct::lazyDistinct(
    cppcoro::generator<IdTable> originalGenerator,
    std::vector<ColumnIndex> keepIndices,
    std::optional<IdTable> aggregateTable) {
Is there a good reason not to use the `bool yieldOnce` pattern from other operations?
Mainly for simplicity. This function is currently static, just like the regular distinct function, and if a bool were passed instead, you'd have to pass the width and the allocator to construct the `IdTable` within the generator. Conceptually, though, it does the same thing.
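As a rough illustration of the trade-off discussed here, the sketch below passes an optional, pre-constructed aggregate table instead of a boolean flag, mirroring the shape of the `aggregateTable` parameter in the signature above. Everything else is an assumption made for the example: it uses a plain callback rather than a `cppcoro::generator`, compares whole rows rather than selected columns, and `streamDistinct`, `Row`, `Table`, and `sink` are hypothetical names.

```cpp
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// Simplified stand-ins for the engine's row and table types (assumption).
using Row = std::vector<int>;
using Table = std::vector<Row>;

// Deduplicates a sequence of sorted blocks. If `aggregateTable` holds a
// table, all distinct rows are appended to it and it is handed to `sink`
// exactly once at the end. Otherwise every non-empty deduplicated block is
// handed to `sink` as soon as it is ready.
void streamDistinct(const std::vector<Table>& blocks,
                    std::optional<Table> aggregateTable,
                    const std::function<void(Table)>& sink) {
  std::optional<Row> previousRow;  // last distinct row seen so far
  for (const Table& block : blocks) {
    Table out;
    for (const Row& row : block) {
      if (previousRow && *previousRow == row) {
        continue;  // duplicate, possibly across a block boundary
      }
      previousRow = row;
      out.push_back(row);
    }
    if (aggregateTable) {
      aggregateTable->insert(aggregateTable->end(), out.begin(), out.end());
    } else if (!out.empty()) {
      sink(std::move(out));
    }
  }
  if (aggregateTable) {
    sink(std::move(*aggregateTable));
  }
}
```

Because the caller constructs the aggregate table, the function itself needs no knowledge of how a table is created, which is the simplification the reply describes; a boolean flag would move that construction inside the function.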
auto last = result.end();

auto dest = result.begin();
if (first == dest) {
As I see it, there are two optimizations:
1. Do this columnwise (see Engine.cpp for a variation of this algorithm that only returns the count of unique elements).
2. Use `std::ranges::unique` (with your matches row you can build something).
3. What about cancellation? (See 1.)
Overall, I think a version with less code might be possible here.
1. This would work, but it would require `O(numRows())` bytes of additional memory. The current algorithm is in-place and needs no additional memory; we're just freeing memory that was allocated by the previous operation.
2. After implementing this, I'm fairly confident that `std::ranges::unique` (as in this exact function) doesn't help here whatsoever. I really tried using it, but it just doesn't work for this use case (with `previousRow` considered). Of course, there might be other algorithms that help make this shorter.
3. Yeah, cancellation would be nice, but that should be added once we have settled on a general approach. For the non-lazy approach, admittedly, it would be better not to clone the whole thing at all and to copy only the entries that are actually required, but this would require even more code.
Allow the `Distinct` operation to deal with lazy values.