Validate Table Replication During Create/Update #13951

Merged: 38 commits into apache:master on Feb 18, 2025

@suddendust suddendust commented Sep 7, 2024

Recently, we observed that real-time (RT) ingestion stopped simply because there were not enough servers to fulfil the configured replication, and no warning or error was given to the user. This PR adds a check for that during table update. For table create, we already do this check and don't let the user create the table if it fails.
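For readers following along, the kind of check described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `ReplicationCheck`, its methods, and the map of server names to tags are all hypothetical stand-ins for whatever the controller actually reads from cluster metadata.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for a controller-side check; not Pinot's actual API. */
final class ReplicationCheck {
  private ReplicationCheck() {
  }

  /** Counts the servers that carry the given tag. */
  static int countServersWithTag(Map<String, Set<String>> tagsByServer, String tag) {
    int count = 0;
    for (Set<String> tags : tagsByServer.values()) {
      if (tags.contains(tag)) {
        count++;
      }
    }
    return count;
  }

  /** Throws if fewer servers carry the tag than the table's replication requires. */
  static void validate(Map<String, Set<String>> tagsByServer, String tag, int replication) {
    int available = countServersWithTag(tagsByServer, tag);
    if (available < replication) {
      throw new IllegalStateException(
          "Replication " + replication + " exceeds the " + available + " server(s) tagged " + tag);
    }
  }
}
```

Failing the request up front, rather than accepting the config and letting ingestion stall later, is the user-visible change this PR is after.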

codecov-commenter commented Sep 7, 2024

Codecov Report

Attention: Patch coverage is 94.44444% with 1 line in your changes missing coverage. Please review.

Project coverage is 63.39%. Comparing base (59551e4) to head (5b71dca).
Report is 1732 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...not/common/assignment/InstancePartitionsUtils.java | 83.33% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #13951      +/-   ##
============================================
+ Coverage     61.75%   63.39%   +1.64%     
- Complexity      207     1483    +1276     
============================================
  Files          2436     2744     +308     
  Lines        133233   154069   +20836     
  Branches      20636    23779    +3143     
============================================
+ Hits          82274    97671   +15397     
- Misses        44911    49021    +4110     
- Partials       6048     7377    +1329     
| Flag | Coverage Δ |
| --- | --- |
| custom-integration1 | 100.00% <ø> (+99.99%) ⬆️ |
| integration | 100.00% <ø> (+99.99%) ⬆️ |
| integration1 | 100.00% <ø> (+99.99%) ⬆️ |
| integration2 | 0.00% <ø> (ø) |
| java-11 | 63.36% <94.44%> (+1.65%) ⬆️ |
| java-21 | 63.29% <94.44%> (+1.66%) ⬆️ |
| skip-bytebuffers-false | 63.39% <94.44%> (+1.64%) ⬆️ |
| skip-bytebuffers-true | 63.26% <94.44%> (+35.54%) ⬆️ |
| temurin | 63.39% <94.44%> (+1.64%) ⬆️ |
| unittests | 63.39% <94.44%> (+1.64%) ⬆️ |
| unittests1 | 56.06% <0.00%> (+9.17%) ⬆️ |
| unittests2 | 33.87% <94.44%> (+6.14%) ⬆️ |

Flags with carried forward coverage won't be shown.

Comment on lines 923 to 924
if (tagOverrideConfig != null && tagOverrideConfig.getRealtimeConsuming() != null
&& tagOverrideConfig.getRealtimeCompleted() != null) {
Contributor

Could you add a comment explaining the significance of these 3 references not being null?

Contributor

You might want to use TagNameUtils? I think we allow only one of CONSUMING/COMPLETED to be non-null.
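If the reviewer's reading is right, each override tag should be resolved on its own, falling back to the tenant's default tag when it is absent, instead of requiring both to be set. A minimal sketch of that fallback follows; `TagOverrideSketch` and the tag strings used with it are hypothetical and do not reproduce the `TagNameUtils` API.

```java
/** Hypothetical holder for the two optional override tags; not Pinot's TagOverrideConfig. */
final class TagOverrideSketch {
  private final String consumingOverride; // may be null
  private final String completedOverride; // may be null

  TagOverrideSketch(String consumingOverride, String completedOverride) {
    this.consumingOverride = consumingOverride;
    this.completedOverride = completedOverride;
  }

  /** Tag for consuming segments: the override if set, otherwise the default. */
  String consumingTag(String defaultRealtimeTag) {
    return consumingOverride != null ? consumingOverride : defaultRealtimeTag;
  }

  /** Tag for completed segments: the override if set, otherwise the default. */
  String completedTag(String defaultRealtimeTag) {
    return completedOverride != null ? completedOverride : defaultRealtimeTag;
  }
}
```

With this shape, a config that overrides only the consuming tag still yields a usable completed tag, which a combined `&&` null check would skip entirely.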

Comment on lines 942 to 943
if (consumingServersCnt < replication || completedServersCnt < replication) {
if (consumingServersCnt < replication && completedServersCnt < replication) {
Contributor

Out of curiosity, what do the Consuming and Completed tags mean, and how do they get assigned to servers?

@Jackie-Jiang Jackie-Jiang added the user-experience Related to user experience label Sep 9, 2024
@Jackie-Jiang Jackie-Jiang left a comment

It is more complicated when instanceAssignmentConfigMap is configured. We should probably do a dry run of instance assignment to verify that the config is valid.

@suddendust
Contributor Author

@Jackie-Jiang Right. But can we do a dry run when the table doesn't even exist? The idea is to capture this before the table is even created.

@Jackie-Jiang
Contributor

We can trigger the same piece of code. It should throw an exception when there are not enough servers to host the table.
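To illustrate why a dry run catches more than a plain server count, here is a rough sketch with replica groups. Everything in it is an assumption for illustration: `ReplicaGroupDryRun`, its `assign` and `isValid` methods, and the naive slicing strategy are hypothetical and much simpler than Pinot's instance assignment.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified replica-group assignment used only to show the dry-run idea. */
final class ReplicaGroupDryRun {
  private ReplicaGroupDryRun() {
  }

  /** Splits servers into replica groups, or throws when there are too few servers. */
  static List<List<String>> assign(List<String> servers, int numReplicaGroups, int instancesPerGroup) {
    int required = numReplicaGroups * instancesPerGroup;
    if (servers.size() < required) {
      throw new IllegalStateException(
          "Need " + required + " servers for " + numReplicaGroups + " replica group(s) of "
              + instancesPerGroup + " but only " + servers.size() + " are available");
    }
    List<List<String>> groups = new ArrayList<>();
    for (int g = 0; g < numReplicaGroups; g++) {
      int from = g * instancesPerGroup;
      groups.add(new ArrayList<>(servers.subList(from, from + instancesPerGroup)));
    }
    return groups;
  }

  /** Dry run: executes the same assignment code and discards the result. */
  static boolean isValid(List<String> servers, int numReplicaGroups, int instancesPerGroup) {
    try {
      assign(servers, numReplicaGroups, instancesPerGroup);
      return true;
    } catch (IllegalStateException e) {
      return false;
    }
  }
}
```

With three servers and two replica groups of two instances each, a check of "servers >= replication" would pass (3 >= 2), while the dry run fails because four servers are needed. The dry run also needs only the table config and the server list, so it does not depend on the table already existing.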

@suddendust suddendust changed the title Validate if Table Replication Factor Can be Honoured during Table Create/Update [WIP] Validate if Table Replication Factor Can be Honoured during Table Create/Update Nov 1, 2024
@suddendust suddendust changed the title [WIP] Validate if Table Replication Factor Can be Honoured during Table Create/Update [WIP] Validate if Table Replication Factor Can be Honoured during Table Update Feb 11, 2025
@Jackie-Jiang Jackie-Jiang left a comment

We should be able to validate the table config even when the table doesn't exist. I think validating only the instance assignment should be good enough. Do you see a requirement for validating table rebalance?

Comment on lines 1344 to 1346
LOGGER.error("Could not calculate target assignment for table: {} for the provided table config",
tableNameWithType);
throw e;
Contributor

Let's wrap the error message into a new exception and not add this error log.
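One way to read this suggestion is sketched below: the context goes into the message of a new exception, the original is kept as the cause, and nothing is logged at this level. `WrapAndRethrow` and its always-failing `computeTargetAssignment` are hypothetical placeholders, not the code under review.

```java
/** Hypothetical example of wrapping a failure instead of logging and rethrowing it. */
final class WrapAndRethrow {
  private WrapAndRethrow() {
  }

  /** Placeholder for the real computation; always fails here for demonstration. */
  static void computeTargetAssignment(String tableNameWithType) {
    throw new IllegalStateException("Not enough instances");
  }

  /** Adds table context to the failure and keeps the original exception as the cause. */
  static void validate(String tableNameWithType) {
    try {
      computeTargetAssignment(tableNameWithType);
    } catch (RuntimeException e) {
      throw new RuntimeException(
          "Could not calculate target assignment for table: " + tableNameWithType, e);
    }
  }
}
```

Because the caller receives both the message and the cause, logging at the throw site would only produce a duplicate entry once the exception is reported further up.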

@suddendust suddendust changed the title [WIP] Validate if Table Replication Factor Can be Honoured during Table Update Validate Table Replication During Create/Update Feb 14, 2025
@suddendust
Contributor Author

suddendust commented Feb 15, 2025

@Jackie-Jiang Made the changes. Summarising the existing and new behaviour here:

Existing Behaviour
New Table
OFFLINE: Invalid replication goes through
REALTIME: Invalid replication does not go through

Update Table
OFFLINE: Invalid replication goes through
REALTIME: Invalid replication goes through

New Behaviour
New Table
OFFLINE: Invalid replication does not go through.
REALTIME: Invalid replication does not go through.

Update Table
OFFLINE: Invalid replication does not go through.
REALTIME: Invalid replication does not go through.

Setting reassignInstances to true also catches invalid settings for replica-group-based assignments.
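The before/after summary above can also be written down as a small decision table. This is only a restatement of the comment in code form; `BehaviourMatrix` and its enums are hypothetical names introduced for illustration.

```java
/** Hypothetical encoding of the behaviour summary; true means the invalid config is rejected. */
final class BehaviourMatrix {
  enum Operation { CREATE, UPDATE }

  enum TableType { OFFLINE, REALTIME }

  private BehaviourMatrix() {
  }

  /** Existing behaviour: only creating a REALTIME table rejected invalid replication. */
  static boolean rejectedBefore(Operation operation, TableType tableType) {
    return operation == Operation.CREATE && tableType == TableType.REALTIME;
  }

  /** New behaviour: invalid replication is rejected for every combination. */
  static boolean rejectedAfter(Operation operation, TableType tableType) {
    return true;
  }
}
```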

@Jackie-Jiang Jackie-Jiang left a comment

Nice approach!

@@ -137,7 +137,7 @@ public static InstancePartitions computeDefaultInstancePartitions(HelixManager h
     throw new IllegalStateException();
   }
   return computeDefaultInstancePartitionsForTag(helixManager, tableConfig.getTableName(),
-      instancePartitionsType.toString(), serverTag);
+      instancePartitionsType.toString(), serverTag, tableConfig.getReplication());
Contributor

Let's directly pass in tableConfig

Contributor Author

Done!

@@ -161,6 +161,24 @@ public static InstancePartitions computeDefaultInstancePartitionsForTag(HelixMan
return instancePartitions;
}

public static InstancePartitions computeDefaultInstancePartitionsForTag(HelixManager helixManager,
Contributor

We can replace the old method. I checked all the usages, and it is safe to use the new method in TierConfigUtils.

@klsince TierConfigUtils.getTieredInstancePartitionsForSegment() shouldn't always use the default instance partitions; it should use the tier instance partitions if they exist. Can you please take a look?

Contributor Author

Done!

try {
tableRebalancer.getInstancePartitionsMap(tableConfig, true, true, true);
} catch (Exception e) {
LOGGER.error("Exception calculating instance partitions for table: {}", tableConfig.getTableName());
Contributor

Since we are throwing the exception out, no need to log the error here.

tableRebalancer.getInstancePartitionsMap(tableConfig, true, true, true);
} catch (Exception e) {
LOGGER.error("Exception calculating instance partitions for table: {}", tableConfig.getTableName());
throw new RuntimeException("Exception calculating instance partitions for table:" + tableConfig.getTableName());
Contributor

Add a space after :

throw new RuntimeException("Exception calculating instance partitions for table:" + tableConfig.getTableName());
}
}

Contributor

(nit) Remove empty line

instanceReplicaGroupPartitionConfig,
InstanceAssignmentConfig.PartitionSelector.FD_AWARE_INSTANCE_PARTITION_SELECTOR.name(), false);
}

Contributor

(nit) Remove empty line

Contributor Author

Done

@Jackie-Jiang Jackie-Jiang merged commit d76fe22 into apache:master Feb 18, 2025
21 of 22 checks passed
4 participants