-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Validate Table Replication During Create/Update #13951
Validate Table Replication During Create/Update #13951
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #13951 +/- ##
============================================
+ Coverage 61.75% 63.39% +1.64%
- Complexity 207 1483 +1276
============================================
Files 2436 2744 +308
Lines 133233 154069 +20836
Branches 20636 23779 +3143
============================================
+ Hits 82274 97671 +15397
- Misses 44911 49021 +4110
- Partials 6048 7377 +1329
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
if (tagOverrideConfig != null && tagOverrideConfig.getRealtimeConsuming() != null | ||
&& tagOverrideConfig.getRealtimeCompleted() != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a comment explaining the significance of these 3 references not being null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want to use TagNameUtils
? I think we allow having only one of CONSUMING/COMPLETED
not null
if (consumingServersCnt < replication || completedServersCnt < replication) { | ||
if (consumingServersCnt < replication && completedServersCnt < replication) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OOC, what's the Consuming tag and Completed tag mean and how do they get assigned to servers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is more complicated when instanceAssignmentConfigMap
is configured. We probably should do a dry run on instance assignment to verify if the config is valid
@Jackie-Jiang Right. But can we do a dry run when the table doesn't even exist? The idea is to capture this before the table is even created. |
We can trigger the same piece of code. It should throw exception when there are not enough servers to host the table |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be able to validate table config even when table doesn't exist. I think only validating instance assignment should be good enough. Do you see a requirement of validating table rebalance?
LOGGER.error("Could not calculate target assignment for table: {} for the provided table config", | ||
tableNameWithType); | ||
throw e; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's wrap the error message into a new exception and not adding this error log
@Jackie-Jiang Made the changes. Summarising the existing and new behaviour here: Existing Behaviour Update Table New Behaviour Update Table Setting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice approach!
@@ -137,7 +137,7 @@ public static InstancePartitions computeDefaultInstancePartitions(HelixManager h | |||
throw new IllegalStateException(); | |||
} | |||
return computeDefaultInstancePartitionsForTag(helixManager, tableConfig.getTableName(), | |||
instancePartitionsType.toString(), serverTag); | |||
instancePartitionsType.toString(), serverTag, tableConfig.getReplication()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's directly pass in tableConfig
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
@@ -161,6 +161,24 @@ public static InstancePartitions computeDefaultInstancePartitionsForTag(HelixMan | |||
return instancePartitions; | |||
} | |||
|
|||
public static InstancePartitions computeDefaultInstancePartitionsForTag(HelixManager helixManager, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can replace the old method. I checked all the usage, and it is safe to use the new method in TierConfigUtils
.
@klsince TierConfigUtils.getTieredInstancePartitionsForSegment()
shouldn't always use default instance partitions, it should use the tier instance partitions if exists. Can you please take a look?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
try { | ||
tableRebalancer.getInstancePartitionsMap(tableConfig, true, true, true); | ||
} catch (Exception e) { | ||
LOGGER.error("Exception calculating instance partitions for table: {}", tableConfig.getTableName()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are throwing the exception out, no need to log the error here.
tableRebalancer.getInstancePartitionsMap(tableConfig, true, true, true); | ||
} catch (Exception e) { | ||
LOGGER.error("Exception calculating instance partitions for table: {}", tableConfig.getTableName()); | ||
throw new RuntimeException("Exception calculating instance partitions for table:" + tableConfig.getTableName()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a space after :
throw new RuntimeException("Exception calculating instance partitions for table:" + tableConfig.getTableName()); | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) Remove empty line
instanceReplicaGroupPartitionConfig, | ||
InstanceAssignmentConfig.PartitionSelector.FD_AWARE_INSTANCE_PARTITION_SELECTOR.name(), false); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) Remove empty line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Recently, we've observed that RT ingestion stopped simply because of not enough servers to fulfil the replication configured and no warning / error given to user. This PR adds a check for that during table update. For table create, we already do this check and don't let the user create the table if it fails.