Distributed request to tables with Object Storage Engines #615
base: antalya
This is an automated comment for commit 17c53f4 with description of existing statuses. It's updated for the latest CI running ❌ Click here to open a full report in a separate page
At first glance, it looks good. This is the first round of review; it includes just a few questions and minor suggestions.
Once these are addressed, I'll dive a bit deeper into the AST creation logic and it should be good to go :).
@@ -247,6 +247,10 @@ class StorageObjectStorage::Configuration

virtual void update(ObjectStoragePtr object_storage, ContextPtr local_context);

virtual void setFunctionArgs(ASTs & /* args */) const
IMO, this function needs to be either renamed or documented. Usually, when I read a function named void setSomething(Argument argument), I assume the members of the current instance will be modified.
This function does the opposite: it takes the members of the existing instance and populates the argument.
Maybe you could construct the arguments object and return it?
ASTPtr getTableFunctionArguments()
{
    auto arguments = std::make_shared<ASTExpressionList>();
    // populate it
    return arguments;
}
and then, at the call site:
auto function_arguments = configuration->getTableFunctionArguments();
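For readers outside the thread, the suggestion can be sketched in a self-contained form. This is only an illustration: Node, NodePtr, Configuration, its url and format members, and argumentValues are hypothetical stand-ins, not the real ClickHouse AST or configuration classes.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for an AST node; not the real ClickHouse class.
struct Node
{
    std::string value;
    std::vector<std::shared_ptr<Node>> children;
};
using NodePtr = std::shared_ptr<Node>;

// Hypothetical configuration with two members to export as arguments.
struct Configuration
{
    std::string url = "http://localhost/data.parquet";
    std::string format = "Parquet";

    // Builds a fresh argument list from the members and returns it,
    // instead of filling a caller-provided container.
    NodePtr getTableFunctionArguments() const
    {
        auto arguments = std::make_shared<Node>();
        arguments->children.push_back(std::make_shared<Node>(Node{url, {}}));
        if (format != "auto")
            arguments->children.push_back(std::make_shared<Node>(Node{format, {}}));
        return arguments;
    }
};

// Helper for inspection: collects the literal values of the children.
std::vector<std::string> argumentValues(const NodePtr & list)
{
    assert(list != nullptr);
    std::vector<std::string> values;
    for (const auto & child : list->children)
        values.push_back(child->value);
    return values;
}
```

With this shape the const method no longer mutates anything the caller passed in, so the name get... matches what the function does.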
Almost there; a couple more changes or discussions and it is good to be merged.
@@ -153,7 +160,7 @@ void StorageObjectStorageCluster::updateQueryForDistributedEngineIfNeeded(ASTPtr
queryToString(query));
}

configuration->setFunctionArgs(arguments->children);
configuration->getTableFunctionArguments(arguments->children);
In #615 (comment), I suggested that the arguments be created inside getTableFunctionArguments instead of populating a parameter. Any specific reason you did it this way?
If there is one, I am fine with this approach.
Just to avoid copying after the function call: children is a vector, not a pointer.
I am not sure I follow.
On line 155, you do: auto arguments = std::make_shared<ASTExpressionList>();
Then you pass children as a reference on line 167 and populate it inside the getTableFunctionArguments function.
On line 169, you read the values.
What I am suggesting is that you create the arguments list inside the getTableFunctionArguments function, populate it, and simply return it. It'll be a shared_ptr, so there is no deep copy in this process. Then, the rest is the same.
What am I missing?
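The "no deep copy" claim can be checked with a small self-contained sketch. The names here (ExpressionList, makeArguments, takeChildren, buffer_inside_factory) are hypothetical, and the children container is assumed to be a plain std::vector; the real ASTs type may behave differently when moved, although it would still only move pointers around.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an expression list owning its children.
struct ExpressionList
{
    std::vector<std::shared_ptr<std::string>> children;
};

// Address of the children buffer as seen inside the factory,
// recorded only so that callers can compare it afterwards.
const void * buffer_inside_factory = nullptr;

std::shared_ptr<ExpressionList> makeArguments()
{
    auto arguments = std::make_shared<ExpressionList>();
    arguments->children.push_back(std::make_shared<std::string>("url"));
    arguments->children.push_back(std::make_shared<std::string>("format"));
    buffer_inside_factory = arguments->children.data();
    return arguments; // hands over ownership of the same object
}

// Moving a std::vector out transfers its buffer instead of copying elements.
std::vector<std::shared_ptr<std::string>> takeChildren(const std::shared_ptr<ExpressionList> & list)
{
    assert(list != nullptr);
    return std::move(list->children);
}
```

After the call, the returned pointer is the sole owner and the children buffer is the one that was filled inside the function. If the caller needs the children in a different list, they can be moved rather than copied.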
The changes are not in arguments but in arguments->children. children has type ASTs:
https://github.com/ClickHouse/ClickHouse/blob/master/src/Parsers/IAST.h#L37
It's a kind of vector, not a single pointer:
https://github.com/ClickHouse/ClickHouse/blob/master/src/Parsers/IAST_fwd.h#L14
As discussed on Slack, let's make the function return it as suggested in the original comment.
void StorageAzureConfiguration::getTableFunctionArguments(ASTs & args) const
{
    if (!args.empty())
    { /// Just check
I would just remove this comment
Actually, if you implement this function in the way I suggested in #615 (comment), perhaps this whole if statement is no longer necessary?
args.push_back(std::make_shared<ASTLiteral>(auth_settings[S3AuthSetting::session_token].value));
if (format != "auto")
    args.push_back(std::make_shared<ASTLiteral>(format));
if (!compression_method.empty())
Is it ok not to add some other arguments like structure and headers, as the docs suggest? https://clickhouse.com/docs/en/sql-reference/table-functions/s3
Yes, these fields are already added in the addStructureAndFormatToArgsIfNeeded methods.
https://github.com/ClickHouse/ClickHouse/blob/master/src/Storages/ObjectStorage/S3/Configuration.cpp#L399
These fields are required for table functions too, whereas the access parameters need to be added only in the table engine case.
Nitpick: perhaps add a comment saying this is already added somewhere else?
I think I'll rename the method to addPathAndAccessKeysToArgs, for consistency with addStructureAndFormatToArgsIfNeeded.
I posted two more comments which are not mandatory, although I am curious about #615 (comment).
In any case, it can be merged as it is. LGTM
c7b1ad3 to 3fafe6f
src/Storages/IStorageCluster.cpp
Outdated
{
    auto cluster_name_ = getClusterName(context);
Why the underscore in the variable name?
Ah, I assume it's because IStorageCluster already has a cluster_name member that comes from engines like S3Cluster and not from the setting, right?
If that's so, what happens if the user has both of them set?
Plus, I would rename it to something like cluster_name_from_settings.
The compiler warns that the local variable shadows a class member, and warnings are treated as errors (-Werror).
Tomorrow I'll have another look at the pure_storage member and its auxiliary function getPureStorage.
ContextPtr context,
bool async_insert) override;

ClusterPtr getCluster(ContextPtr context) const { return getClusterImpl(context, cluster_name); }
Why move the implementation to the header file?
It's a single-line implementation. In my opinion, a single-line implementation in the header is more readable and simpler to understand than a separate implementation in the cpp file, where the declaration and implementation are in two different places.
@@ -136,4 +260,87 @@ RemoteQueryExecutor::Extension StorageObjectStorageCluster::getTaskIteratorExten
return RemoteQueryExecutor::Extension{ .task_iterator = std::move(callback) };
}

std::shared_ptr<StorageObjectStorage> StorageObjectStorageCluster::getPureStorage(ContextPtr context)
nitpick: nonClusteredStorage instead of pureStorage
Just remove the method: pure_storage is created in the constructor anyway, so it always exists.
String StorageObjectStorageCluster::getClusterName(ContextPtr context) const
{
    auto cluster_name_ = context->getSettingsRef()[Setting::object_storage_cluster].value;
This is confusing. The variable needs to be renamed, and comments should be added explaining why there are two sources of cluster names.
I feel like there is an easier way to handle this, perhaps at construction time you could decide which cluster name to use and stick to that one?
Variable name: rename the variable. One must be named cluster_name_from_settings, while the other remains intact.
Comment / Docs: This piece of code includes just one if statement checking two sources of cluster name; that is not hard to understand. The "tricky" part to understand is WHY this is being done. I would like to see something along these lines: "StorageObjectStorageCluster is always being instantiated, even for non-clustered table definitions. But the user can specify the cluster in the query settings, and that must be honored. The xyz setting takes precedence over the table definition one because of abc."
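As an illustration of the rename plus the requested comment, here is a minimal self-contained sketch. QuerySettings and ClusterAwareStorage are hypothetical stand-ins, and the precedence rule (a non-empty query setting wins over the table definition) is an assumption taken from the wording proposed above, not something confirmed by the PR code.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for per-query settings; only the field used here.
struct QuerySettings
{
    std::string object_storage_cluster; // empty means "not set for this query"
};

class ClusterAwareStorage
{
public:
    explicit ClusterAwareStorage(std::string cluster_name_)
        : cluster_name(std::move(cluster_name_))
    {
    }

    /// The storage is instantiated once, even for non-clustered table definitions,
    /// but the cluster may also be chosen per query through the settings.
    /// ASSUMPTION of this sketch: a non-empty setting takes precedence over
    /// the cluster name from the table definition.
    std::string getClusterName(const QuerySettings & settings) const
    {
        const std::string & cluster_name_from_settings = settings.object_storage_cluster;
        if (!cluster_name_from_settings.empty())
            return cluster_name_from_settings;
        return cluster_name;
    }

    bool isClustered(const QuerySettings & settings) const
    {
        return !getClusterName(settings).empty();
    }

private:
    std::string cluster_name; // from the table definition; may be empty
};

// Usage: the same table object answers differently depending on the query settings.
inline void exampleUsage()
{
    ClusterAwareStorage table("");
    assert(!table.isClustered(QuerySettings{}));
    assert(table.isClustered(QuerySettings{"acluster"}));
}
```

Naming the local cluster_name_from_settings also avoids shadowing the member, so the trailing-underscore workaround is not needed.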
const String engine_name;
const StorageObjectStorage::ConfigurationPtr configuration;
const ObjectStoragePtr object_storage;
bool cluster_name_in_settings;
Add a comment explaining why these new members are needed
if (pure_storage)
    pure_storage->setInMemoryMetadata(metadata_);
IStorageCluster::setInMemoryMetadata(metadata_);
I assume you are calling the method for both simply because it is more convenient and there are no side effects
942c5e1 to 5bc11ee
Correct me if I am wrong, but one thing you are trying to achieve here that wasn't in the original plans is to be able to switch cluster on and off using query settings. Is that correct? For example:
create table test ... engine=IcebergS3 -- no cluster settings here
select * from test -- single node call
select * from test settings object_storage_cluster="acluster" -- clustered call
If that is correct, could you please add a test that covers this use case?
Correct. Test is here.
Ok. So am I right that the reason you can't pick one cluster name during the instantiation of Initially, I wanted you to do something like the below:
This would remove the branching from inside the class, making it simpler and more idiomatic. But, if my above statement is correct, then this is not possible, since you must be able to switch between the clustered and non-clustered implementations after the class has been instantiated. If all of this is correct, then the only thing left for me to approve the PR is the comment / docs I asked for in: #615 (comment)
src/Core/SettingsChangesHistory.cpp
Outdated
@@ -73,6 +73,7 @@ static std::initializer_list<std::pair<ClickHouseVersion, SettingsChangesHistory
{"least_greatest_legacy_null_behavior", true, false, "New setting"},
{"object_storage_cluster", "", "", "New setting"},
{"object_storage_max_nodes", 0, 0, "New setting"},
{"input_format_parquet_use_metadata_cache", false, false, "New setting"},
Oh, one more thing. I believe you added this because of a CI/CD failure. @Enmk has merged a PR that fixes this. Perhaps you want to update your branch?
LGTM
@@ -75,19 +94,132 @@ StorageObjectStorageCluster::StorageObjectStorageCluster(

setVirtuals(VirtualColumnUtils::getVirtualsForFileLikeStorage(metadata.columns, context_, sample_path));
setInMemoryMetadata(metadata);

pure_storage = std::make_shared<StorageObjectStorage>(
setVirtuals() tries to access pure_storage, which is nullptr...
2025-02-28 20:35:49 [389fdafdb833] 2025.02.28 09:35:33.789714 [ 15427 ] <Fatal> BaseDaemon: ########################################
2025-02-28 20:35:49 [389fdafdb833] 2025.02.28 09:35:33.789754 [ 15427 ] <Fatal> BaseDaemon: (version 24.12.2.20024.altinitytest (altinity build), build id: 1C0EBCF3C00E43841FC84CD01D5EB270B4C8F268, git hash: 664518871e6afcfdfb63964c7c38aa0b14854d25) (from thread 1444) (query_id: 69ae0cff-14c5-4f44-a31a-0c4d86618c26) (query: SELECT *, column0 FROM s3Cluster(test_cluster_one_shard_three_replicas_localhost, 'http://localhost:11111/test/hive_partitioning/column0=Elizabeth/sample.parquet') LIMIT 10;) Received signal Segmentation fault (11)
2025-02-28 20:35:49 [389fdafdb833] 2025.02.28 09:35:33.789784 [ 15427 ] <Fatal> BaseDaemon: Address: NULL pointer. Access: read. Address not mapped to object.
2025-02-28 20:35:49 [389fdafdb833] 2025.02.28 09:35:33.789801 [ 15427 ] <Fatal> BaseDaemon: Stack trace: 0x000000000dae2ad5 0x00007fd65efc3520 0x0000000010afc85e 0x0000000010af9135 0x000000000f53f1c1 0x000000000f53e3dd 0x0000000010a0afd6 0x000000001199809f 0x00000000117a0abe 0x00000000117cd5bc 0x0000000011790236 0x000000001178e66c 0x000000001178df40 0x0000000011e3d036 0x000000001209961d 0x000000001209757c 0x000000001209b322 0x0000000012035bcc 0x000000001241541f 0x0000000012410a1d 0x00000000136ee19e 0x0000000013709418 0x000000001661cc07 0x000000001661d099 0x00000000165e99bc 0x00000000165e7f5d 0x00007fd65f015ac3 0x00007fd65f0a7850
2025-02-28 20:35:49 [389fdafdb833] 2025.02.28 09:35:33.820039 [ 15427 ] <Fatal> BaseDaemon: 0. ./build_docker/./src/Common/SignalHandlers.cpp:105: signalHandler(int, siginfo_t*, void*) @ 0x000000000dae2ad5
2025-02-28 20:35:49 [389fdafdb833] 2025.02.28 09:35:33.820109 [ 15427 ] <Fatal> BaseDaemon: 1. ? @ 0x00007fd65efc3520
2025-02-28 20:35:49 [389fdafdb833] 2025.02.28 09:35:33.851621 [ 15427 ] <Fatal> BaseDaemon: 2. ./src/Storages/ObjectStorage/StorageObjectStorageCluster.h:49: DB::StorageObjectStorageCluster::setVirtuals(DB::VirtualColumnsDescription) @ 0x0000000010afc85e
2025-02-28 20:35:49 [389fdafdb833] 2025.02.28 09:35:33.882793 [ 15427 ] <Fatal> BaseDaemon: 3. ./build_docker/./src/Storages/ObjectStorage/StorageObjectStorageCluster.cpp:95: DB::StorageObjectStorageCluster::StorageObjectStorageCluster(String const&, std::shared_ptr<DB::StorageObjectStorage::Configuration>, std::shared_ptr<DB::IObjectStorage>, std::shared_ptr<DB::Context const>, DB::StorageID const&, DB::ColumnsDescription const&, DB::ConstraintsDescription const&, String const&, std::optional<DB::FormatSettings>, DB::LoadingStrictnessLevel, std::shared_ptr<DB::IAST>) @ 0x0000000010af9135
2025-02-28 20:35:49 [389fdafdb833] 2025.02.28 09:35:33.907551 [ 15427 ] <Fatal> BaseDaemon: 4.0. inlined from ./contrib/llvm-project/libcxx/include/__memory/construct_at.h:35: DB::StorageObjectStorageCluster* std::construct_at[abi:v15007]<DB::StorageObjectStorageCluster, String const&, std::shared_ptr<DB::StorageObjectStorage::Configuration>&, std::shared_ptr<DB::IObjectStorage>&, std::shared_ptr<DB::Context const>&, DB::StorageID, DB::ColumnsDescription&, DB::ConstraintsDescription, String, std::nullopt_t const&, DB::LoadingStrictnessLevel, std::nullptr_t, DB::StorageObjectStorageCluster*>(DB::StorageObjectStorageCluster*, String const&, std::shared_ptr<DB::StorageObjectStorage::Configuration>&, std::shared_ptr<DB::IObjectStorage>&, std::shared_ptr<DB::Context const>&, DB::StorageID&&, DB::ColumnsDescription&, DB::ConstraintsDescription&&, String&&, std::nullopt_t const&, DB::LoadingStrictnessLevel&&, std::nullptr_t&&)
2025-02-28 20:35:49 [389fdafdb833] 2025.02.28 09:35:33.907617 [ 15427 ] <Fatal> BaseDaemon: 4.1. inlined from ./contrib/llvm-project/libcxx/include/__memory/allocator_traits.h:298: void std::allocator_traits<std::allocator<DB::StorageObjectStorageCluster>>::construct[abi:v15007]<DB::StorageObjectStorageCluster, String const&, std::shared_ptr<DB::StorageObjectStorage::Configuration>&, std::shared_ptr<DB::IObjectStorage>&, std::shared_ptr<DB::Context const>&, DB::StorageID, DB::ColumnsDescription&, DB::ConstraintsDescription, String, std::nullopt_t const&, DB::LoadingStrictnessLevel, std::nullptr_t, void, void>(std::allocator<DB::StorageObjectStorageCluster>&, DB::StorageObjectStorageCluster*, String const&, std::shared_ptr<DB::StorageObjectStorage::Configuration>&, std::shared_ptr<DB::IObjectStorage>&, std::shared_ptr<DB::Context const>&, DB::StorageID&&, DB::ColumnsDescription&, DB::ConstraintsDescription&&, String&&, std::nullopt_t const&, DB::LoadingStrictnessLevel&&, std::nullptr_t&&)
2025-02-28 20:35:49 [389fdafdb833] 2025.02.28 09:35:33.907645 [ 15427 ] <Fatal> BaseDaemon: 4.2. inlined from ./contrib/llvm-project/libcxx/include/__memory/shared_ptr.h:292: __shared_ptr_emplace<const std::basic_string<char, std::char_traits<char>, std::allocator<char> > &, std::shared_ptr<DB::StorageObjectStorage::Configuration> &, std::shared_ptr<DB::IObjectStorage> &, std::shared_ptr<const DB::Context> &, DB::StorageID, DB::ColumnsDescription &, DB::ConstraintsDescription, std::basic_string<char, std::char_traits<char>, std::allocator<char> >, const std::nullopt_t &, DB::LoadingStrictnessLevel, std::nullptr_t>
2025-02-28 20:35:49 [389fdafdb833] 2025.02.28 09:35:33.907663 [ 15427 ] <Fatal> BaseDaemon: 4. ./contrib/llvm-project/libcxx/include/__memory/shared_ptr.h:953: std::shared_ptr<DB::StorageObjectStorageCluster> std::allocate_shared[abi:v15007]<DB::StorageObjectStorageCluster, std::allocator<DB::StorageObjectStorageCluster>, String const&, std::shared_ptr<DB::StorageObjectStorage::Configuration>&, std::shared_ptr<DB::IObjectStorage>&, std::shared_ptr<DB::Context const>&, DB::StorageID, DB::ColumnsDescription&, DB::ConstraintsDescription, String, std::nullopt_t const&, DB::LoadingStrictnessLevel, std::nullptr_t, void>(std::allocator<DB::StorageObjectStorageCluster> const&, String const&, std::shared_ptr<DB::StorageObjectStorage::Configuration>&, std::shared_ptr<DB::IObjectStorage>&, std::shared_ptr<DB::Context const>&, DB::StorageID&&, DB::ColumnsDescription&, DB::ConstraintsDescription&&, String&&, std::nullopt_t const&, DB::LoadingStrictnessLevel&&, std::nullptr_t&&) @ 0x000000000f53f1c1
2025-02-28 20:35:49 [389fdafdb833] 2025.02.28 09:35:33.930592 [ 15427 ] <Fatal> BaseDaemon: 5.0. inlined from ./contrib/llvm-project/libcxx/include/__memory/shared_ptr.h:962: std::shared_ptr<DB::StorageObjectStorageCluster> std::make_shared[abi:v15007]<DB::StorageObjectStorageCluster, String const&, std::shared_ptr<DB::StorageObjectStorage::Configuration>&, std::shared_ptr<DB::IObjectStorage>&, std::shared_ptr<DB::Context const>&, DB::StorageID, DB::ColumnsDescription&, DB::ConstraintsDescription, String, std::nullopt_t const&, DB::LoadingStrictnessLevel, std::nullptr_t, void>(String const&, std::shared_ptr<DB::StorageObjectStorage::Configuration>&, std::shared_ptr<DB::IObjectStorage>&, std::shared_ptr<DB::Context const>&, DB::StorageID&&, DB::ColumnsDescription&, DB::ConstraintsDescription&&, String&&, std::nullopt_t const&, DB::LoadingStrictnessLevel&&, std::nullptr_t&&)
2025-02-28 20:35:49 [389fdafdb833] 2025.02.28 09:35:33.930655 [ 15427 ] <Fatal> BaseDaemon: 5. ./build_docker/./src/TableFunctions/TableFunctionObjectStorageCluster.cpp:53: DB::TableFunctionObjectStorageCluster<DB::S3ClusterDefinition, DB::StorageS3Configuration>::executeImpl(std::shared_ptr<DB::IAST> const&, std::shared_ptr<DB::Context const>, String const&, DB::ColumnsDescription, bool) const @ 0x000000000f53e3dd
The stateless test 03203_hive_style_partitioning crashes the server; please take a look and fix it.
https://s3.amazonaws.com/altinity-build-artifacts/615/17c53f4b34f3d1de0d46781d014825d061b24897/stateless_tests__release_.html
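The crash pattern described above (a setter called from the constructor forwards to a member that has not been created yet) can be reproduced and avoided in a small self-contained sketch. InnerStorage and ClusterStorage are hypothetical stand-ins, and both remedies shown (creating the member first and guarding the forwarding call) are assumptions about how such a bug is commonly addressed, not the fix actually applied in this PR.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for the non-clustered storage held as a member.
struct InnerStorage
{
    std::string virtuals;
};

class ClusterStorage
{
public:
    explicit ClusterStorage(std::string virtuals_)
    {
        // Create the inner storage before calling anything that forwards to it.
        // Calling setVirtuals() first would reach it while it is still nullptr.
        pure_storage = std::make_shared<InnerStorage>();
        setVirtuals(std::move(virtuals_));
    }

    void setVirtuals(std::string virtuals_)
    {
        // Guard: tolerate calls made before pure_storage exists.
        if (pure_storage)
            pure_storage->virtuals = virtuals_;
        virtuals = std::move(virtuals_);
    }

    const std::string & ownVirtuals() const { return virtuals; }

    const std::string & innerVirtuals() const
    {
        assert(pure_storage != nullptr);
        return pure_storage->virtuals;
    }

private:
    std::string virtuals;
    std::shared_ptr<InnerStorage> pure_storage;
};
```

The null guard mirrors the if (pure_storage) check that the PR already uses around setInMemoryMetadata; either remedy alone would prevent the null dereference in this sketch.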
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Distributed object storage table engines
Documentation entry for user-facing changes
Previously, ClickHouse could make distributed requests with the object_storage_cluster option only for table functions. This PR extends this to table engines.