-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(iceberg): Date partition value parse issue #12126
base: main
Are you sure you want to change the base?
fix(iceberg): Date partition value parse issue #12126
Conversation
✅ Deploy Preview for meta-velox canceled.
|
return applyFilter(*filter, result.value()); | ||
int32_t result = 0; | ||
if (tableFormat == SplitReader::TableFormat::kIceberg) { | ||
result = boost::lexical_cast<int32_t>(partitionValue.c_str()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we not use std::stoi
here instead of including a boost header?
In other comments i see that this function is slow. This could be a problem?
Also we wouldn't need a new include and add a new dependency here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::stoi
is converting a string like "2022-04-05" to int value 2022, so its not safe and may lead to wrong results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::stoi
should work here. https://www.geeksforgeeks.org/convert-string-to-int-in-cpp/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::stoi
will work for converting int string to int. But it also doesn't throw error if we input a string like "2022-04-05" as it just converts it to 2022.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, 2022-04-05
is parsed by boost into what number? It is not a valid integer in the first place. So either function will not work.
Also you are forgetting that the string is actually the days since epoch like you have in your description. Which means it is not an actual date formatted string. If this was the case you need a date parser here and not a string to int parse function in the first place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to use std::stoi
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nmahadevuni : Thanks for this fix. Have bunch of review comments.
partitionKeys = {}, | ||
const std::vector<std::string> filters = {}, | ||
const std::string duckDbSql = "", | ||
const int32_t numPrefetchSplits = 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you have this parameter to the function ? Since it seems like we never test its use really.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed this.
partitionKeys["ds"] = "17627"; | ||
|
||
std::vector<RowVectorPtr> dataVectors; | ||
VectorPtr c0 = vectorMaker_.flatVector<int64_t>({1}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vectorMaker_ is deprecated. Use makeFlatVector API instead.
@@ -477,6 +514,15 @@ class HiveIcebergTest : public HiveConnectorTestBase { | |||
return PlanBuilder(pool_.get()).tableScan(rowType_).planNode(); | |||
} | |||
|
|||
core::PlanNodePtr tableScanNode( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think there is a need for this function as its only used in a single place and doesn't really represent anything.
const std::unordered_map<std::string, std::optional<std::string>> | ||
partitionKeys = {}, | ||
const std::vector<std::string> filters = {}, | ||
const std::string duckDbSql = "", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not reasonable to have an empty duckDbSql for this function, as that is the main sql string to verify the plan results with. Since we know the plan we are generating, it might be better to build the duckDBSql in this function itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this test case, we don't know how to generate the duckDbSql, when we add more test cases, we can enhance this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added an empty string check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this code a very specific plan with a TableScanNode and partitionfilters, columnfilters is setup. This maps to a quite precise Sql. We could just build it in the logic.
But I'm also fine with the sql passed as a parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should remove the default "" value for this parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this test case, we generate int vector for date values and create the data file which is ok for velox, but cannot create duckdb table with the same vectors, so using a sql statement to verify.
auto scanNodeId = plan->id(); | ||
auto it = planStats.find(scanNodeId); | ||
ASSERT_TRUE(it != planStats.end()); | ||
ASSERT_TRUE(it->second.peakMemoryBytes > 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are you testing this ? Isn't matching results sufficient ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not required, removed it.
@@ -225,6 +226,41 @@ class HiveIcebergTest : public HiveConnectorTestBase { | |||
ASSERT_TRUE(it->second.peakMemoryBytes > 0); | |||
} | |||
|
|||
void assertQuery( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe rename this function to assertPartitionKey.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Want to keep this name generic, since it could be used to test any case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is building a very specific plan with TableScanNode and IcebergSplits... The assertQuery function name is being used in all the TestBase classes for very generic usage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assertPartitionKey also doesn't seem to be the right name, since we can just pass empty partition key. It is just another overloaded assertQuery method.
if (tableFormat == SplitReader::TableFormat::kIceberg) { | ||
result = boost::lexical_cast<int32_t>(partitionValue.c_str()); | ||
} else { | ||
result = DATE()->toDays((folly::StringPiece)partitionValue); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please fix this to use C++ cast since we are touching this code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, didn't get it. toDays converts date into daysSinceEpoch, where to use C++ cast?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The C++ style cast:
static_cast<folly::StringPiece>(partitionValue)
See the example in a line you actually removed (below in the review).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added C++ cast in both places.
velox/connectors/hive/SplitReader.h
Outdated
@@ -154,6 +157,7 @@ class SplitReader { | |||
std::shared_ptr<HiveColumnHandle>>* const partitionKeys_; | |||
const ConnectorQueryCtx* connectorQueryCtx_; | |||
const std::shared_ptr<const HiveConfig> hiveConfig_; | |||
const TableFormat tableFormat_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't need a const
for enums and scalars.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made it a const reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we missing something? This is just a plain const and not a const reference. Also for enums you don't need a reference. And the argument to the constructor is not a const & either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I was looking at a different place altogether. You are right. I removed all const references.
@@ -634,12 +636,16 @@ namespace { | |||
bool applyPartitionFilter( | |||
const TypePtr& type, | |||
const std::string& partitionValue, | |||
common::Filter* filter) { | |||
common::Filter* filter, | |||
const SplitReader::TableFormat& tableFormat) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its better to move the read-only const& parameters before the writable * ones. So lets move tableFormat to the first parameter.
const std::unordered_map<std::string, std::optional<std::string>> | ||
partitionKeys = {}, | ||
const std::vector<std::string> filters = {}, | ||
const std::string duckDbSql = "", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this code a very specific plan with a TableScanNode and partitionfilters, columnfilters is setup. This maps to a quite precise Sql. We could just build it in the logic.
But I'm also fine with the sql passed as a parameter.
const std::unordered_map<std::string, std::optional<std::string>> | ||
partitionKeys = {}, | ||
const std::vector<std::string> filters = {}, | ||
const std::string duckDbSql = "", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should remove the default "" value for this parameter.
d3564ed
to
c7843f4
Compare
Thank you @aditi-pandit @majetideepak @czentgr . I have addressed your comments. Please review. |
c7843f4
to
67b025c
Compare
@majetideepak @aditi-pandit @czentgr Addressed your comments. Please have a look. |
The date values are set in daysSinceEpoch in the FileScanTask returned by the Iceberg API. If we want to change that, that conversion back to string format has to happen in Coordinator. But Presto Java workers do not have any issue with that, and will need to be changed if we change it back to date string. So instead of changing at two places in Java, I fixed it in Velox. testFilters and applyPartitionFilter are static functions, setPartitionValue is the only class member. We will override and do special handling in IcebergSplitReader::setPartitionValue? @majetideepak |
@nmahadevuni Let's add |
67b025c
to
ff840fc
Compare
As discussed on slack, added the bool to SplitReader class. Please review @majetideepak |
VectorPtr ds = makeFlatVector<int32_t>((std::vector<int32_t>){17627}); | ||
dataVectors.push_back(makeRowVector({"c0", "ds"}, {c0, ds})); | ||
|
||
assertQuery( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
writeToFile can be moved here since it is the same data.
I feel it is cleaner to create the plan and call HiveConnectorTestBase::assertQuery here.
This is inline with all the partition tests inside TableScanTest.cpp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree... The current assertQuery function can be a lambda here for reuse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. Inlining the partition tests here fits better with the rest of the tests in this file like assertPositionalDeletes.
@Yuhta can you take a look at this change? Thanks. |
ff840fc
to
2f5fa0a
Compare
velox/connectors/hive/TableHandle.h
Outdated
@@ -45,12 +47,14 @@ class HiveColumnHandle : public ColumnHandle { | |||
ColumnType columnType, | |||
TypePtr dataType, | |||
TypePtr hiveType, | |||
std::vector<common::Subfield> requiredSubfields = {}) | |||
std::vector<common::Subfield> requiredSubfields = {}, | |||
ValueType valueType = ValueType::kDefault) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should follow the HiveTableHandle and add a generic argument like
const std::unordered_map<std::string, std::string>& columnParameters = {}
We can then define struct ColumnParameter
similar to struct TableParameter
and add an entry for our need. Example: key=partition.date.value.format, value=daysSinceEpoch/ISODateFormat
as from our discussion offline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Yuhta Do you agree with this API approach?
2f5fa0a
to
30f5dad
Compare
@majetideepak @Yuhta Thanks for the review. Made the changes as suggested. Please have a look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nmahadevuni the Velox API looks good to me. Found some more comments.
partitionColumns.insert(partitionKey.first); | ||
} | ||
|
||
// auto plan = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove commented code.
@Yuhta can you comment on this API? Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @nmahadevuni. Have few questions.
assertQuery( | ||
rowType, dataVectors, "SELECT 1, '2018-04-06'", partitionKeys, {}); | ||
|
||
std::vector<std::string> filters = {"ds = date'2018-04-06'"}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is very simple. The partition keys, filters are the same and there is only one matching row that is selected. Would be good to add rows in the input that don't match the partition key and are filtered out. Also would be good to add a filter for another non-partition key value that matches an input row.
VectorPtr ds = makeFlatVector<int32_t>((std::vector<int32_t>){17627}); | ||
dataVectors.push_back(makeRowVector({"c0", "ds"}, {c0, ds})); | ||
|
||
assertQuery( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. Inlining the partition tests here fits better with the rest of the tests in this file like assertPositionalDeletes.
int32_t result = 0; | ||
// Iceberg partition values are already in daysSinceEpoch, no need to | ||
// convert. | ||
if (columnParameters.count( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check is repeated twice. Can we abstract this check as a method of ColumnHandle ?
const std::unordered_map<std::string, std::string>& columnParameters() const { | ||
return columnParameters_; | ||
} | ||
|
||
std::string toString() const; | ||
|
||
folly::dynamic serialize() const override; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The serialize and creata methods should be enhanced to also serialize/deserialize the columnParameters_ in the HiveColumnHandle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to change serde and toString methods? The columnParameters_ will be set in Prestissimo code. Even tableParameters_ from HiveTableHandle class is implemented the same way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nmahadevuni : Yes, serde and toString methods should always be updated when we update classes that appear in a Velox plan.
toString is used in printPlanWithStats debugging utilities https://facebookincubator.github.io/velox/develop/debugging/print-plan-with-stats.html and the serde methods have been used in non-Presto use-cases.
tableParameters_ should be handled this way as well. Its a bug if it is not.
const std::unordered_map<std::string, std::string>& columnParameters() const { | ||
return columnParameters_; | ||
} | ||
|
||
std::string toString() const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
toString() should also be enhanced to show the columnParameters_.
Please update unit tests as well.
velox/connectors/hive/TableHandle.h
Outdated
@@ -115,6 +121,7 @@ class HiveColumnHandle : public ColumnHandle { | |||
const TypePtr dataType_; | |||
const TypePtr hiveType_; | |||
const std::vector<common::Subfield> requiredSubfields_; | |||
const std::unordered_map<std::string, std::string> columnParameters_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment here indicating that these are used for metadata like column date value format.
If you have a link to how these are populated that will be great as well.
This change would be accompanied by presto-native code to populate these fields in IcebergPrestoToVeloxConnector::toVeloxColumnHandle method. So is there a change in the protocol classes as well to pass these from the Iceberg connector ? Do you the e2e Prestissimo PR ? Would be great to take a look at that to understand the scope of this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. We will need to change IcebergPrestoToVeloxConnector::toVeloxColumnHandle
. This change will need to be merged first for me to work on the Prestissimo PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be great to see the Prestissimo code at the same time to know this works e2e.
You can change the Velox submodule in the Presto project to pick up the changes from your branch https://git-scm.com/book/en/v2/Git-Tools-Submodules section "Pulling in Upstream Changes from the Submodule Remote"
and try out this code with presto-native-execution test.
See prestodb/presto#24138 for how a PR looks with Velox submodule changes for a branch with local commits.
30f5dad
to
9e5d6d2
Compare
@majetideepak @aditi-pandit Thank you. I have addressed your comments. Please review. |
a98d364
to
3804e10
Compare
velox/connectors/hive/TableHandle.h
Outdated
@@ -115,6 +129,9 @@ class HiveColumnHandle : public ColumnHandle { | |||
const TypePtr dataType_; | |||
const TypePtr hiveType_; | |||
const std::vector<common::Subfield> requiredSubfields_; | |||
// The column parameters are used for metadata like Iceberg date partition | |||
// value format. | |||
const std::unordered_map<std::string, std::string> columnParameters_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's just make it typed struct as there is no need for serialization for this parameter. Something like this:
struct ColumnParseParameters {
enum PartitionDateValueFormat {
kISO8601,
kDaysSinceEpoch,
} partitionDateValueFormat;
};
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Yuhta . I have made the changes. Also, the draft changes on Prestissimo side are at nmahadevuni/presto@32b22cb. Please review @majetideepak @aditi-pandit
3804e10
to
f6964f1
Compare
f6964f1
to
27baa3c
Compare
Have made the changes. Please review. @Yuhta @majetideepak @aditi-pandit |
27baa3c
to
52b81fa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nmahadevuni this looks nice! Few comments.
@aditi-pandit can you take another look?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @nmahadevuni
I will add the merge label once Aditi approves.
52b81fa
to
457e82a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @nmahadevuni
@pedroerp has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@nmahadevuni looks like this unit test failed internally. Could you take a look?
|
it's a memory leak detected by ASAN:
|
Thank you. Will have a look. |
std::move(requiredSubFields), | ||
columnParseParameters)}); | ||
|
||
auto planBuilder = new PlanBuilder(pool_.get()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a leak.
You can use auto plan = PlanBuilder(pool_.get()).startTableScan()....
|
||
// Test filter on non-partitioned non-date column | ||
std::vector<std::string> nonPartitionFilters = {"c0 = 1"}; | ||
plan = planBuilder->startTableScan() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PlanBuilder().startTableScan()
457e82a
to
b0596f0
Compare
Fixed it and ran the leaks tool on Mac to check. @pedroerp @majetideepak
|
@pedroerp has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
fixes prestodb/presto#24371
Iceberg partition values are already in daysSinceEpoch, but in velox we assume its in date form and try to convert as with Hive. Fixed this.