fix(iceberg): Date partition value parse issue #12126

Open

nmahadevuni
Collaborator

@nmahadevuni commented Jan 20, 2025

Fixes prestodb/presto#24371.

Iceberg date partition values are already expressed as days since epoch (daysSinceEpoch), but Velox assumes they are date strings and tries to convert them the same way it does for Hive. This PR fixes that.
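To make the mismatch concrete, here is a minimal, self-contained sketch of the two interpretations of a date partition value. It is not the Velox code: `daysFromCivil`, `isoDateToDays`, and `epochDaysStringToDays` are hypothetical names used only for illustration, and the date arithmetic is the widely used days-from-civil algorithm standing in for `DATE()->toDays`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <string>

// Days since 1970-01-01 for a proleptic Gregorian date.
// Illustration only: no range validation of month or day.
int32_t daysFromCivil(int32_t y, int32_t m, int32_t d) {
  y -= m <= 2;
  const int32_t era = (y >= 0 ? y : y - 399) / 400;
  const int32_t yoe = y - era * 400;
  const int32_t doy = (153 * (m + (m > 2 ? -3 : 9)) + 2) / 5 + d - 1;
  const int32_t doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
  return era * 146097 + doe - 719468;
}

// Hive-style value: an ISO date string such as "2018-04-06".
int32_t isoDateToDays(const std::string& value) {
  int y = 0, m = 0, d = 0;
  if (std::sscanf(value.c_str(), "%d-%d-%d", &y, &m, &d) != 3) {
    throw std::invalid_argument("not a yyyy-mm-dd date: " + value);
  }
  return daysFromCivil(y, m, d);
}

// Iceberg-style value: the day count itself, serialized as a string.
int32_t epochDaysStringToDays(const std::string& value) {
  size_t consumed = 0;
  const int32_t days = std::stoi(value, &consumed);
  if (consumed != value.size()) {
    throw std::invalid_argument("not an integer: " + value);
  }
  return days;
}
```

In this sketch, passing the Iceberg value "17627" to `isoDateToDays` throws, which mirrors the kind of failure the PR description refers to; each format needs its own parser.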

@facebook-github-bot added the CLA Signed label Jan 20, 2025

netlify bot commented Jan 20, 2025

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit b0596f0
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/67af889cc7f37300089a848a

return applyFilter(*filter, result.value());
int32_t result = 0;
if (tableFormat == SplitReader::TableFormat::kIceberg) {
result = boost::lexical_cast<int32_t>(partitionValue.c_str());
Collaborator

Can we use std::stoi here instead of including a boost header?
In other comments I see that this function is slow. Could this be a problem?

With std::stoi we also wouldn't need a new include or a new dependency here.

Collaborator Author

std::stoi converts a string like "2022-04-05" to the int value 2022, so it's not safe and may lead to wrong results.

Collaborator

Collaborator Author

std::stoi works for converting an integer string to an int. But it also doesn't throw an error if we pass a string like "2022-04-05"; it just converts it to 2022.

Collaborator

Well, what number does boost parse 2022-04-05 into? It is not a valid integer in the first place, so neither function will work.

Also, you are forgetting that the string is actually the days since epoch, as you say in your description, which means it is not a date-formatted string. If it were, you would need a date parser here, not a string-to-int parse function.
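The behaviour under discussion can be checked with a small standalone sketch. `std::stoi` stops at the first character that is not part of an integer, so "2022-04-05" yields 2022 without an error; `boost::lexical_cast`, as far as its documentation states, throws `bad_lexical_cast` when the whole input is not consumed. The strict helper below uses `std::from_chars` as a standard-library alternative; `parseInt32Strict` and `lenientStoi` are hypothetical names, and this is not what the PR ended up using.

```cpp
#include <cassert>
#include <charconv>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>
#include <system_error>

// Hypothetical helper: returns a value only if the whole input is a single
// base-10 integer that fits in int32_t.
std::optional<int32_t> parseInt32Strict(std::string_view text) {
  int32_t value = 0;
  const char* first = text.data();
  const char* last = first + text.size();
  const auto [ptr, ec] = std::from_chars(first, last, value);
  if (ec != std::errc() || ptr != last) {
    return std::nullopt;
  }
  return value;
}

// std::stoi reports how many characters it consumed through its optional
// second argument; ignoring it is what lets "2022-04-05" pass as 2022.
int lenientStoi(const std::string& text) {
  return std::stoi(text);
}
```

With these helpers, `lenientStoi("2022-04-05")` returns 2022, while `parseInt32Strict("2022-04-05")` returns an empty optional and `parseInt32Strict("17627")` returns 17627.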

Collaborator Author

Changed to use std::stoi.

Collaborator

@aditi-pandit left a comment

@nmahadevuni: Thanks for this fix. I have a bunch of review comments.

velox/connectors/hive/HiveConnectorUtil.cpp (outdated, resolved)
partitionKeys = {},
const std::vector<std::string> filters = {},
const std::string duckDbSql = "",
const int32_t numPrefetchSplits = 0) {
Collaborator

Why does the function have this parameter? It seems we never really test its use.

Collaborator Author

Removed this.

partitionKeys["ds"] = "17627";

std::vector<RowVectorPtr> dataVectors;
VectorPtr c0 = vectorMaker_.flatVector<int64_t>({1});
Collaborator

vectorMaker_ is deprecated. Use makeFlatVector API instead.

@@ -477,6 +514,15 @@ class HiveIcebergTest : public HiveConnectorTestBase {
return PlanBuilder(pool_.get()).tableScan(rowType_).planNode();
}

core::PlanNodePtr tableScanNode(
Collaborator

I don't think there is a need for this function, as it's only used in a single place and doesn't really represent anything.

velox/connectors/hive/SplitReader.cpp (resolved)
const std::unordered_map<std::string, std::optional<std::string>>
partitionKeys = {},
const std::vector<std::string> filters = {},
const std::string duckDbSql = "",
Collaborator

It is not reasonable to have an empty duckDbSql for this function, as that is the main sql string to verify the plan results with. Since we know the plan we are generating, it might be better to build the duckDBSql in this function itself.

Collaborator Author

For this test case, we don't know how to generate the duckDbSql. When we add more test cases, we can enhance this function.

Collaborator Author

I added an empty string check.

Collaborator

This code sets up a very specific plan with a TableScanNode, partition filters, and column filters. This maps to quite a precise SQL statement. We could just build it in the logic.

But I'm also fine with the SQL passed as a parameter.

Collaborator

We should remove the default "" value for this parameter.

Collaborator Author

In this test case, we generate an int vector for the date values and create the data file from it, which is fine for Velox, but we cannot create a DuckDB table from the same vectors, so we use a SQL statement to verify.

auto scanNodeId = plan->id();
auto it = planStats.find(scanNodeId);
ASSERT_TRUE(it != planStats.end());
ASSERT_TRUE(it->second.peakMemoryBytes > 0);
Collaborator

Why are you testing this? Isn't matching results sufficient?

Collaborator Author

Not required; removed it.

@@ -225,6 +226,41 @@ class HiveIcebergTest : public HiveConnectorTestBase {
ASSERT_TRUE(it->second.peakMemoryBytes > 0);
}

void assertQuery(
Collaborator

Maybe rename this function to assertPartitionKey.

Collaborator Author

I want to keep this name generic, since it could be used to test any case.

Collaborator

This method is building a very specific plan with TableScanNode and IcebergSplits... The assertQuery function name is being used in all the TestBase classes for very generic usage.

Collaborator Author

assertPartitionKey also doesn't seem to be the right name, since we can just pass an empty partition key. It is just another overloaded assertQuery method.

if (tableFormat == SplitReader::TableFormat::kIceberg) {
result = boost::lexical_cast<int32_t>(partitionValue.c_str());
} else {
result = DATE()->toDays((folly::StringPiece)partitionValue);
Collaborator

Please fix this to use a C++ cast, since we are touching this code.

Collaborator Author

Sorry, I didn't get it. toDays converts a date into daysSinceEpoch. Where should the C++ cast be used?

Collaborator

The C++ style cast:

static_cast<folly::StringPiece>(partitionValue)

See the example in a line you actually removed (below in the review).
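The request is about cast syntax, not about the conversion itself. A minimal sketch of the difference, using `std::string_view` as a stand-in for `folly::StringPiece` so that it compiles without folly (the function names are hypothetical):

```cpp
#include <cassert>
#include <string>
#include <string_view>

// C-style cast: compiles, but the syntax can silently fall back to
// reinterpret_cast or const_cast semantics for other type combinations.
std::string_view cStyleView(const std::string& value) {
  return (std::string_view)value;
}

// C++-style cast: only permits the well-defined conversion, so an
// unintended reinterpretation fails to compile instead.
std::string_view cppStyleView(const std::string& value) {
  return static_cast<std::string_view>(value);
}
```

Both functions produce the same view here; the `static_cast` form is preferred because the compiler rejects it when no safe conversion exists.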

Collaborator Author

Added C++ cast in both places.

@@ -154,6 +157,7 @@ class SplitReader {
std::shared_ptr<HiveColumnHandle>>* const partitionKeys_;
const ConnectorQueryCtx* connectorQueryCtx_;
const std::shared_ptr<const HiveConfig> hiveConfig_;
const TableFormat tableFormat_;
Collaborator

We don't need const for enums and scalars.

Collaborator Author

I made it a const reference.

Collaborator

Are we missing something? This is just a plain const and not a const reference. Also for enums you don't need a reference. And the argument to the constructor is not a const & either.

Collaborator Author

Sorry, I was looking at a different place altogether. You are right. I removed all const references.

@@ -634,12 +636,16 @@ namespace {
bool applyPartitionFilter(
const TypePtr& type,
const std::string& partitionValue,
-common::Filter* filter) {
+common::Filter* filter,
+const SplitReader::TableFormat& tableFormat) {
Collaborator

It's better to move the read-only const& parameters before the writable * ones. So let's move tableFormat to be the first parameter.


@nmahadevuni force-pushed the fix_ice_date_partition_value_parse branch 2 times, most recently from d3564ed to c7843f4 on January 24, 2025 06:55
@nmahadevuni
Collaborator Author

Thank you @aditi-pandit @majetideepak @czentgr . I have addressed your comments. Please review.

@nmahadevuni force-pushed the fix_ice_date_partition_value_parse branch from c7843f4 to 67b025c on January 27, 2025 06:58
@nmahadevuni
Collaborator Author

@majetideepak @aditi-pandit @czentgr Addressed your comments. Please have a look.

@nmahadevuni
Collaborator Author

nmahadevuni commented Jan 28, 2025

The date values are set as daysSinceEpoch in the FileScanTask returned by the Iceberg API. If we want to change that, the conversion back to string format has to happen in the coordinator. But Presto Java workers have no issue with the current format, and they would also need to change if we switched back to a date string. So instead of changing two places in Java, I fixed it in Velox. testFilters and applyPartitionFilter are static functions; setPartitionValue is the only class member. Should we override it and do the special handling in IcebergSplitReader::setPartitionValue? @majetideepak

@majetideepak
Collaborator

> The date values are set in daysSinceEpoch in the FileScanTask returned by the Iceberg API.

@nmahadevuni Let's add bool isPartitionValueDaysSinceEpoch_ to the ScanSpec.

@nmahadevuni force-pushed the fix_ice_date_partition_value_parse branch from 67b025c to ff840fc on January 29, 2025 03:58
@nmahadevuni
Collaborator Author

> The date values are set in daysSinceEpoch in the FileScanTask returned by the Iceberg API.
>
> @nmahadevuni Let's add bool isPartitionValueDaysSinceEpoch_ to the ScanSpec.

As discussed on Slack, I added the bool to the SplitReader class. Please review, @majetideepak.

velox/connectors/hive/HiveConnectorUtil.cpp (outdated, resolved)
velox/connectors/hive/HiveConnectorUtil.h (outdated, resolved)
velox/exec/tests/utils/PlanBuilder.h (outdated, resolved)
VectorPtr ds = makeFlatVector<int32_t>((std::vector<int32_t>){17627});
dataVectors.push_back(makeRowVector({"c0", "ds"}, {c0, ds}));

assertQuery(
Collaborator

writeToFile can be moved here since it is the same data.
I feel it is cleaner to create the plan and call HiveConnectorTestBase::assertQuery here.
This is in line with all the partition tests inside TableScanTest.cpp.

Collaborator

Agree... The current assertQuery function can be a lambda here for reuse.

Collaborator

+1. Inlining the partition tests here fits better with the rest of the tests in this file like assertPositionalDeletes.

@majetideepak
Collaborator

@Yuhta can you take a look at this change? Thanks.

@nmahadevuni force-pushed the fix_ice_date_partition_value_parse branch from ff840fc to 2f5fa0a on February 5, 2025 17:18
@@ -45,12 +47,14 @@ class HiveColumnHandle : public ColumnHandle {
ColumnType columnType,
TypePtr dataType,
TypePtr hiveType,
-std::vector<common::Subfield> requiredSubfields = {})
+std::vector<common::Subfield> requiredSubfields = {},
+ValueType valueType = ValueType::kDefault)
Collaborator

@majetideepak Feb 5, 2025

I think we should follow HiveTableHandle and add a generic argument like
const std::unordered_map<std::string, std::string>& columnParameters = {}

We can then define struct ColumnParameter, similar to struct TableParameter, and add an entry for our need. Example: key=partition.date.value.format, value=daysSinceEpoch/ISODateFormat, as per our offline discussion.
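A rough sketch of what this string-keyed proposal could look like, with the lookup wrapped in a single accessor so callers do not repeat the map check. Everything here is hypothetical: `ToyColumnHandle` is not `HiveColumnHandle`, and the key and value strings are simply the ones from the example in the comment.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical constants mirroring the key/value example in the comment.
struct ColumnParameter {
  static constexpr const char* kPartitionDateValueFormat =
      "partition.date.value.format";
  static constexpr const char* kDaysSinceEpoch = "daysSinceEpoch";
};

// Toy stand-in for a column handle that carries free-form parameters.
class ToyColumnHandle {
 public:
  explicit ToyColumnHandle(
      std::unordered_map<std::string, std::string> columnParameters = {})
      : columnParameters_(std::move(columnParameters)) {}

  // One place that knows how to interpret the parameter.
  bool isPartitionDateDaysSinceEpoch() const {
    const auto it =
        columnParameters_.find(ColumnParameter::kPartitionDateValueFormat);
    return it != columnParameters_.end() &&
        it->second == ColumnParameter::kDaysSinceEpoch;
  }

 private:
  const std::unordered_map<std::string, std::string> columnParameters_;
};
```

A handle built without parameters reports false, which keeps the Hive-style date-string behaviour as the default in this sketch.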

Collaborator

@Yuhta Do you agree with this API approach?

@nmahadevuni force-pushed the fix_ice_date_partition_value_parse branch from 2f5fa0a to 30f5dad on February 7, 2025 17:56
@nmahadevuni
Collaborator Author

@majetideepak @Yuhta Thanks for the review. Made the changes as suggested. Please have a look.

Collaborator

@majetideepak left a comment

@nmahadevuni The Velox API looks good to me. I have some more comments.

velox/connectors/hive/HiveConnectorUtil.cpp (outdated, resolved)
partitionColumns.insert(partitionKey.first);
}

// auto plan =
Collaborator

nit: remove commented code.

@majetideepak
Collaborator

@Yuhta can you comment on this API? Thanks.

Collaborator

@aditi-pandit left a comment

Thanks @nmahadevuni. I have a few questions.

assertQuery(
rowType, dataVectors, "SELECT 1, '2018-04-06'", partitionKeys, {});

std::vector<std::string> filters = {"ds = date'2018-04-06'"};
Collaborator

This test is very simple. The partition keys and filters are the same, and only one matching row is selected. It would be good to add input rows that don't match the partition key and are filtered out. It would also be good to add a filter on another, non-partition-key value that matches an input row.


int32_t result = 0;
// Iceberg partition values are already in daysSinceEpoch, no need to
// convert.
if (columnParameters.count(
Collaborator

This check is repeated twice. Can we abstract it as a method of ColumnHandle?

const std::unordered_map<std::string, std::string>& columnParameters() const {
return columnParameters_;
}

std::string toString() const;

folly::dynamic serialize() const override;
Collaborator

The serialize and create methods should be enhanced to also serialize/deserialize the columnParameters_ in the HiveColumnHandle.

Collaborator Author

Do we need to change serde and toString methods? The columnParameters_ will be set in Prestissimo code. Even tableParameters_ from HiveTableHandle class is implemented the same way.

Collaborator

@nmahadevuni: Yes, serde and toString methods should always be updated when we update classes that appear in a Velox plan.

toString is used in the printPlanWithStats debugging utilities (https://facebookincubator.github.io/velox/develop/debugging/print-plan-with-stats.html), and the serde methods have been used in non-Presto use cases.

tableParameters_ should be handled this way as well. It's a bug if it is not.

const std::unordered_map<std::string, std::string>& columnParameters() const {
return columnParameters_;
}

std::string toString() const;
Collaborator

toString() should also be enhanced to show the columnParameters_.

Please update unit tests as well.

@@ -115,6 +121,7 @@ class HiveColumnHandle : public ColumnHandle {
const TypePtr dataType_;
const TypePtr hiveType_;
const std::vector<common::Subfield> requiredSubfields_;
const std::unordered_map<std::string, std::string> columnParameters_;
Collaborator

Please add a comment here indicating that these are used for metadata like the column date value format.

If you have a link to how these are populated, that would be great as well.

This change would be accompanied by presto-native code to populate these fields in the IcebergPrestoToVeloxConnector::toVeloxColumnHandle method. So is there a change in the protocol classes as well to pass these from the Iceberg connector? Do you have the e2e Prestissimo PR? It would be great to take a look at that to understand the scope of this change.

Collaborator Author

Yes. We will need to change IcebergPrestoToVeloxConnector::toVeloxColumnHandle. This change will need to be merged first for me to work on the Prestissimo PR.

Collaborator

Would be great to see the Prestissimo code at the same time to know this works e2e.

You can change the Velox submodule in the Presto project to pick up the changes from your branch https://git-scm.com/book/en/v2/Git-Tools-Submodules section "Pulling in Upstream Changes from the Submodule Remote"
and try out this code with presto-native-execution test.

See prestodb/presto#24138 for how a PR looks with Velox submodule changes for a branch with local commits.

@nmahadevuni force-pushed the fix_ice_date_partition_value_parse branch from 30f5dad to 9e5d6d2 on February 11, 2025 16:39
@nmahadevuni
Collaborator Author

@majetideepak @aditi-pandit Thank you. I have addressed your comments. Please review.

@nmahadevuni force-pushed the fix_ice_date_partition_value_parse branch 2 times, most recently from a98d364 to 3804e10 on February 12, 2025 11:53
@@ -115,6 +129,9 @@ class HiveColumnHandle : public ColumnHandle {
const TypePtr dataType_;
const TypePtr hiveType_;
const std::vector<common::Subfield> requiredSubfields_;
// The column parameters are used for metadata like Iceberg date partition
// value format.
const std::unordered_map<std::string, std::string> columnParameters_;
Contributor

Let's just make it a typed struct, as there is no need for serialization for this parameter. Something like this:

struct ColumnParseParameters {
  enum PartitionDateValueFormat {
    kISO8601,
    kDaysSinceEpoch,
  } partitionDateValueFormat;
};
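One benefit of the typed struct over string key/value pairs is that call sites can switch on the enum and the compiler can warn about unhandled values. A small sketch of such a call site follows; the default member initializer and the `describeDateFormat` helper are assumptions added for illustration and are not part of the suggestion above.

```cpp
#include <cassert>
#include <string>

struct ColumnParseParameters {
  enum PartitionDateValueFormat {
    kISO8601,
    kDaysSinceEpoch,
  } partitionDateValueFormat = kISO8601;  // Assumed default: Hive-style dates.
};

// Hypothetical helper showing an exhaustive switch over the typed format.
const char* describeDateFormat(const ColumnParseParameters& params) {
  switch (params.partitionDateValueFormat) {
    case ColumnParseParameters::kISO8601:
      return "ISO 8601 date string";
    case ColumnParseParameters::kDaysSinceEpoch:
      return "days since epoch";
  }
  return "unknown";  // Unreachable for valid enum values.
}
```

A misspelled format name becomes a compile error here, whereas a misspelled string key in a parameter map would only show up at run time.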

Collaborator Author

Thanks @Yuhta. I have made the changes. Also, the draft changes on the Prestissimo side are at nmahadevuni/presto@32b22cb. Please review @majetideepak @aditi-pandit

@nmahadevuni force-pushed the fix_ice_date_partition_value_parse branch from 3804e10 to f6964f1 on February 12, 2025 16:55
velox/exec/tests/utils/PlanBuilder.h (outdated, resolved)
@nmahadevuni force-pushed the fix_ice_date_partition_value_parse branch from f6964f1 to 27baa3c on February 13, 2025 08:21
@nmahadevuni
Collaborator Author

Have made the changes. Please review. @Yuhta @majetideepak @aditi-pandit

@nmahadevuni force-pushed the fix_ice_date_partition_value_parse branch from 27baa3c to 52b81fa on February 13, 2025 13:26
Collaborator

@majetideepak left a comment

@nmahadevuni this looks nice! A few comments.
@aditi-pandit can you take another look?

velox/connectors/hive/TableHandle.h (outdated, resolved)
velox/connectors/hive/iceberg/tests/IcebergReadTest.cpp (outdated, resolved)
Collaborator

@majetideepak left a comment

Thanks, @nmahadevuni
I will add the merge label once Aditi approves.

@nmahadevuni force-pushed the fix_ice_date_partition_value_parse branch from 52b81fa to 457e82a on February 13, 2025 17:44
Collaborator

@aditi-pandit left a comment

Thanks @nmahadevuni

@majetideepak added the ready-to-merge label Feb 13, 2025
@facebook-github-bot
Contributor

@pedroerp has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@pedroerp
Contributor

@nmahadevuni looks like this unit test failed internally. Could you take a look?

'velox/connectors/hive/iceberg/tests:tests - HiveIcebergTest.testPartitionedRead'

@pedroerp
Contributor

it's a memory leak detected by ASAN:

=================================================================
==8199==ERROR: LeakSanitizer: detected memory leaks

Indirect leak of 272 byte(s) in 1 object(s) allocated from:
    #0 0x881d8d in operator new(unsigned long) (/data/sandcastle/boxes/eden-trunk-hg-full-fbsource/buck-out/v2/gen/fbcode/07838712e61ba1ef/velox/connectors/hive/iceberg/tests/__tests__/tests+0x881d8d)
    #1 0x5739a5 in facebook::velox::exec::test::PlanBuilder::startTableScan() fbcode/velox/exec/tests/utils/PlanBuilder.h:305
    #2 0x3708b1 in facebook::velox::connector::hive::iceberg::HiveIcebergTest_testPartitionedRead_Test::TestBody() fbcode/velox/connectors/hive/iceberg/tests/IcebergReadTest.cpp:741
    #3 0x7f62dcd2b1be in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) fbsource/src/gtest.cc:2675
    #4 0x7f62dcd2aa44 in testing::Test::Run() fbsource/src/gtest.cc:2692
    #5 0x7f62dcd3068f in testing::TestInfo::Run() fbsource/src/gtest.cc:2841
    #6 0x7f62dcd38646 in testing::TestSuite::Run() fbsource/src/gtest.cc:3020
    #7 0x7f62dcd73fab in testing::internal::UnitTestImpl::RunAllTests() fbsource/src/gtest.cc:5925
    #8 0x7f62dcd7300b in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) fbsource/src/gtest.cc:2675
    #9 0x7f62dcd72549 in testing::UnitTest::Run() fbsource/src/gtest.cc:5489
    #10 0x7f62dcc06980 in RUN_ALL_TESTS() fbsource/gtest/gtest.h:2317
    #11 0x7f62dcc0660f in main fbcode/common/gtest/LightMain.cpp:20
    #12 0x7f62d202c656 in __libc_start_call_main /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #13 0x7f62d202c717 in __libc_start_main@GLIBC_2.2.5 /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/csu/../csu/libc-start.c:409:3
    #14 0x360130 in _start /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/csu/../sysdeps/x86_64/start.S:116

Indirect leak of 256 byte(s) in 2 object(s) allocated from:
    #0 0x881d8d in operator new(unsigned long) (/data/sandcastle/boxes/eden-trunk-hg-full-fbsource/buck-out/v2/gen/fbcode/07838712e61ba1ef/velox/connectors/hive/iceberg/tests/__tests__/tests+0x881d8d)
    #1 0x7f62d624efe9 in __gnu_cxx::new_allocator<std::_Sp_counted_ptr_inplace<facebook::velox::connector::hive::HiveColumnHandle, std::allocator<facebook::velox::connector::hive::HiveColumnHandle>, (__gnu_cxx::_Lock_policy)2>>::allocate(unsigned long, void const*) fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/ext/new_allocator.h:128
    #2 0x7f62d624eea0 in std::allocator_traits<std::allocator<std::_Sp_counted_ptr_inplace<facebook::velox::connector::hive::HiveColumnHandle, std::allocator<facebook::velox::connector::hive::HiveColumnHandle>, (__gnu_cxx::_Lock_policy)2>>>::allocate(std::allocator<std::_Sp_counted_ptr_inplace<facebook::velox::connector::hive::HiveColumnHandle, std::allocator<facebook::velox::connector::hive::HiveColumnHandle>, (__gnu_cxx::_Lock_policy)2>>&, unsigned long) fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/allocator.h:197
    #3 0x7f62d624ea40 in std::__allocated_ptr<std::allocator<std::_Sp_counted_ptr_inplace<facebook::velox::connector::hive::HiveColumnHandle, std::allocator<facebook::velox::connector::hive::HiveColumnHandle>, (__gnu_cxx::_Lock_policy)2>>> std::__allocate_guarded<std::allocator<std::_Sp_counted_ptr_inplace<facebook::velox::connector::hive::HiveColumnHandle, std::allocator<facebook::velox::connector::hive::HiveColumnHandle>, (__gnu_cxx::_Lock_policy)2>>>(std::allocator<std::_Sp_counted_ptr_inplace<facebook::velox::connector::hive::HiveColumnHandle, std::allocator<facebook::velox::connector::hive::HiveColumnHandle>, (__gnu_cxx::_Lock_policy)2>>&) fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/allocated_ptr.h:97
    #4 0x7f62d62d91d8 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<facebook::velox::connector::hive::HiveColumnHandle, std::allocator<facebook::velox::connector::hive::HiveColumnHandle>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>&, facebook::velox::connector::hive::HiveColumnHandle::ColumnType, std::shared_ptr<facebook::velox::Type const> const&, std::shared_ptr<facebook::velox::Type const> const&>(facebook::velox::connector::hive::HiveColumnHandle*&, std::_Sp_alloc_shared_tag<std::allocator<facebook::velox::connector::hive::HiveColumnHandle>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>&, facebook::velox::connector::hive::HiveColumnHandle::ColumnType&&, std::shared_ptr<facebook::velox::Type const> const&, std::shared_ptr<facebook::velox::Type con
...
...

@nmahadevuni
Collaborator Author

> @nmahadevuni looks like this unit test failed internally. Could you take a look?
>
> 'velox/connectors/hive/iceberg/tests:tests - HiveIcebergTest.testPartitionedRead'

Thank you. Will have a look.

std::move(requiredSubFields),
columnParseParameters)});

auto planBuilder = new PlanBuilder(pool_.get());
Collaborator

@majetideepak Feb 14, 2025

This is a leak.
You can use auto plan = PlanBuilder(pool_.get()).startTableScan()....
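The leak and its fix can be illustrated with a toy fluent builder. `ToyPlanBuilder` below is a hypothetical stand-in and does not reproduce the Velox `PlanBuilder` API; it only shows why a builder created with `new` and never deleted leaks, while a temporary builder is destroyed at the end of the full expression.

```cpp
#include <cassert>
#include <string>

// Hypothetical fluent builder that accumulates a textual plan.
class ToyPlanBuilder {
 public:
  ToyPlanBuilder& tableScan(const std::string& table) {
    plan_ += "TableScan(" + table + ")";
    return *this;
  }

  ToyPlanBuilder& filter(const std::string& predicate) {
    plan_ += " -> Filter(" + predicate + ")";
    return *this;
  }

  // Returns the finished plan by value, so it outlives the builder.
  std::string planNode() const {
    return plan_;
  }

 private:
  std::string plan_;
};

// Leaky pattern (avoid):
//   auto* builder = new ToyPlanBuilder();
//   auto plan = builder->tableScan("t").planNode();  // builder never deleted
//
// Leak-free pattern: the temporary builder is destroyed automatically once
// the chained expression has produced the plan.
std::string buildPlan(const std::string& table, const std::string& predicate) {
  return ToyPlanBuilder().tableScan(table).filter(predicate).planNode();
}
```

Creating a fresh temporary for each plan also avoids carrying state over from one plan to the next, which a reused heap-allocated builder would do.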


// Test filter on non-partitioned non-date column
std::vector<std::string> nonPartitionFilters = {"c0 = 1"};
plan = planBuilder->startTableScan()
Collaborator

PlanBuilder().startTableScan()

@nmahadevuni force-pushed the fix_ice_date_partition_value_parse branch from 457e82a to b0596f0 on February 14, 2025 18:16
@nmahadevuni
Collaborator Author

Fixed it and ran the leaks tool on Mac to check. @pedroerp @majetideepak

Process:         velox_hive_iceberg_test [30630]
Path:            /Users/USER/*/velox_hive_iceberg_test
Load Address:    0x1007a4000
Identifier:      velox_hive_iceberg_test
Version:         0
Code Type:       ARM64
Platform:        macOS
Parent Process:  leaks [30629]

Date/Time:       2025-02-14 23:40:39.413 +0530
Launch Time:     2025-02-14 23:40:38.511 +0530
OS Version:      macOS 14.5 (23F79)
Report Version:  7
Analysis Tool:   /usr/bin/leaks

Physical footprint:         40.0M
Physical footprint (peak):  40.8M
Idle exit:                  untracked
----

leaks Report Version: 4.0, multi-line stacks
Process 30630: 1001 nodes malloced for 238 KB
Process 30630: 0 leaks for 0 total leaked bytes.

@facebook-github-bot
Contributor

@pedroerp has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Successfully merging this pull request may close these issues.

[native] Iceberg read from partitioned Date column fails
7 participants