-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(fuzzer): Add TopNRowNumberFuzzer #12103
base: main
Are you sure you want to change the base?
Conversation
✅ Deploy Preview for meta-velox canceled.
|
9a7590d
to
a9011fa
Compare
f7beb33
to
25ac51c
Compare
25ac51c
to
49a43dd
Compare
49a43dd
to
f458e56
Compare
8be77d4
to
7fbe198
Compare
Have run the fuzzer for an hour without failures. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @aditi-pandit, thank you for adding fuzzer coverage for TopNRowNumber node. I noticed that the member methods of TopNRowNumberFuzzer are very similar to RowNumberFuzzer's. Does it make sense to create a RowNumberFuzzerBase class with the common functions and let RowNumberFuzzer and TopNRowNumberFuzzer inherit that? Similar to how we organize AggregationFuzzer, WindowFuzzer, and AggregationFuzzerBase?
int32_t randInt(int32_t min, int32_t max) { | ||
return boost::random::uniform_int_distribution<int32_t>(min, max)(rng_); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use the T rand(FuzzerGenerator& rng, T min, T max)
in velox/common/fuzzer/Utils.h
|
||
std::pair<std::vector<std::string>, std::vector<TypePtr>> | ||
TopNRowNumberFuzzer::generateKeys(const std::string& prefix) { | ||
static const std::vector<TypePtr> kKeyTypes{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the key type be floating-point?
LOG(ERROR) << "Expected\n" | ||
<< expected->toString(0, expected->size()) << "\nActual\n" | ||
<< actual->toString(0, actual->size()); | ||
throw; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: assertEqualResults() should already print out the unmatched rows. Is that not enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i picked this up from some RowNumberFuzzer code. Though you're right, it might not be needed.
std::unique_ptr<exec::test::ReferenceQueryRunner> setupReferenceQueryRunner( | ||
memory::MemoryPool* aggregatePool, | ||
const std::string& prestoUrl, | ||
const std::string& runnerName, | ||
const uint32_t& reqTimeoutMs) { | ||
if (prestoUrl.empty()) { | ||
auto duckQueryRunner = | ||
std::make_unique<exec::test::DuckQueryRunner>(aggregatePool); | ||
LOG(INFO) << "Using DuckDB as the reference DB."; | ||
return duckQueryRunner; | ||
} | ||
|
||
LOG(INFO) << "Using Presto as the reference DB."; | ||
return std::make_unique<exec::test::PrestoQueryRunner>( | ||
aggregatePool, | ||
prestoUrl, | ||
runnerName, | ||
static_cast<std::chrono::milliseconds>(reqTimeoutMs)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use setupReferenceQueryRunner() defined in exec/fuzzer/FuzzerUtil.cpp.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. I moved that in a separate PR as it affected multiple fuzzers. Will change it here as well.
7fbe198
to
04b0749
Compare
@kagamiori : Most of the common logic in those classes could have even broader applicability so I moved them to FuzzerUtil class. What remains are the flags etc. I notice that AggregationFuzzerBase covers flags as well. We can try something like that. Will modify the code shortly. |
cafd0d1
to
cfa8421
Compare
@kagamiori : Have added a RowNumberFuzzerBase class now. PTAL. Thanks ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @aditi-pandit, the code looks good to me, except a few comments for small refactoring. Thanks!
velox/exec/fuzzer/FuzzerUtil.cpp
Outdated
std::vector<RowVectorPtr> flatten(const std::vector<RowVectorPtr>& vectors) { | ||
std::vector<RowVectorPtr> flatVectors; | ||
for (const auto& vector : vectors) { | ||
auto flat = BaseVector::create<RowVector>( | ||
vector->type(), vector->size(), vector->pool()); | ||
flat->copy(vector.get(), 0, 0, vector->size()); | ||
flatVectors.push_back(flat); | ||
} | ||
|
||
return flatVectors; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Can this method be replaced with BaseVector::flattenVector()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kagamiori : BaseVector::flattenVector flattens the original input vector, while this flatten method returns a separate output vector for the flattened list of input vectors. The original vector is retained and we want to run the fuzzer on the original input vectors only.
An option could be to copy the input vector and then use BaseVector::flatten(...) on the copied vector. But the rest of the stubbing around it makes it seem simpler to use the current flatten method. wdyt ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kagamiori : Nvm... I realize after using common logVectors method from AggregationFuzzerBase that this method isn't needed. see #12300
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated with #12300
// Disable testing with TableScan when input contains TIMESTAMP type, due to | ||
// the issue #8127. | ||
if (type->kind() == TypeKind::TIMESTAMP) { | ||
return false; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @aditi-pandit, I remember that for running fuzzers with PQR, the issue #8127 can be resolved by adding -Duser.timezone=America/Los_Angeles
to etc/jvm.config in the directory where you installed the Presto server (e.g., presto-server-0.284/etc/jvm.config). Could you please have a try and let me know if it works?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kagamiori : Have tried with the user.timezone config and I don't see the errors either.
But of course it fails in Velox CI here.
So we'll have to change that separately and let this PR remain as is.
wdyt ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @aditi-pandit, I believe we already set -Duser.timezone=America/Los_Angeles
in the CI jobs here, so if you still see an error when enabling Timestamp type, that needs to be looked into.
For now, could we move if (type->kind() == TypeKind::TIMESTAMP) { return false; }
to a local method only used in TopNRowNumberFuzzer.cpp, so that this won't affect other fuzzers that use this isTableScanSupported()
utility method? Then you can look into the error caused by Timestamp type after this PR is merged.
void RowNumberFuzzerBase::logInput(const std::vector<RowVectorPtr>& input) { | ||
if (VLOG_IS_ON(1)) { | ||
// Flatten inputs. | ||
const auto flatInput = test::flatten(input); | ||
VLOG(1) << "Input: " << input[0]->toString(); | ||
for (const auto& v : flatInput) { | ||
VLOG(1) << std::endl << v->toString(0, v->size()); | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: AggregationFuzzerBase also has a logVectors() method. We can move that to a common place to reuse the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check #12300. I'll update this PR post its merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated with #12300
2d7ec1a
to
6dbf273
Compare
@kagamiori : Have updated this PR after the refactoring PR is merged. PTAL. |
Timestamp issue
6dbf273
to
7de27c6
Compare
LGTM. Let's look into the error caused by Timestamp type and enable Timestamp in a separate PR. |
@kagamiori has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
topNRowNumber node is an optimized planNode for SQL with ranking window functions but which limits them to only the topN results. Add a TopNRowNumberFuzzer for plans with this planNode.
This fuzzer is closely modeled after the RowNumberFuzzer. So the common code is abstracted to a RowNumberFuzzerBase class which is used as the parent class for both RowNumberFuzzer and TopnRowNumberFuzzer.
The fuzzer generates plans only for row_number function right now. It will be enhanced to support rank and dense_rank functions after #11554