Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better automatic offset reset for Kinesis ingestion #15338

Merged

Conversation

AmatyaAvadhanula
Copy link
Contributor

@AmatyaAvadhanula AmatyaAvadhanula commented Nov 7, 2023

Kinesis automatic reset attempts to reset the offsets in the metadata store for all the shards in a task even if one of them has fallen behind.

This PR aims to improve this behaviour by resetting the checkpoints for only those partitions for which the sequence numbers are unavailable.

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@AmatyaAvadhanula AmatyaAvadhanula marked this pull request as ready for review November 8, 2023 03:51
Copy link
Contributor

@abhishekrb19 abhishekrb19 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @AmatyaAvadhanula! Looks good overall. Left some comments on the logging and exception messages. Also, CI is failing for code coverage reasons - it seems like we don't have existing unit test coverage for this functionality in general?

sendResetRequestAndWait(shardResetMap, toolbox);
}
catch (IOException e) {
throw new ISE(e, "Exception while attempting to automatically reset sequences");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same, perhaps include shardResetMap keys in the exception?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


if (!shardResetMap.isEmpty()) {
for (Map.Entry<StreamPartition<String>, String> partitionToReset : shardResetMap.entrySet()) {
log.warn("Starting sequence number [%s] is no longer available for partition [%s]",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log.warn("Starting sequence number [%s] is no longer available for partition [%s]",
log.warn("Starting sequenceNumber[%s] is no longer available for partition[%s]",

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

);
}
if (task.getTuningConfig().isResetOffsetAutomatically()) {
log.info("Attempting to reset offsets for [%d] partitions.", shardResetMap.size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partitions that repeatedly fall off the stream and get reset automatically would indicate an underlying issue, so I think logging the partition keys shardResetMap.keys() would be useful

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}
} else {
throw new ISE("Sequence numbers are unavailable but automatic offset reset is disabled.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Collect the unavailable partitions above and include them in the exception (instead of an end user having to dig up logs)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@AmatyaAvadhanula
Copy link
Contributor Author

@abhishekrb19, thanks for the review.
Yes, this functionality does not seem to have comprehensive tests and is hard to test in general.

Copy link
Contributor

@abhishekrb19 abhishekrb19 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Left a couple of suggestions on variable naming. Thanks!

@@ -125,28 +125,42 @@ protected void possiblyResetDataSourceMetadata(
{
if (!task.getTuningConfig().isSkipSequenceNumberAvailabilityCheck()) {
final ConcurrentMap<String, String> currOffsets = getCurrentOffsets();
final Map<StreamPartition<String>, String> shardResetMap = new HashMap<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: for clarity, maybe call this map partitionToSequenceResetMap since we refer to the keys and values of the map a few times?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

if (!shardResetMap.isEmpty()) {
for (Map.Entry<StreamPartition<String>, String> partitionToReset : shardResetMap.entrySet()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: partitionToReset -> partitionToSequence

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@AmatyaAvadhanula AmatyaAvadhanula merged commit 48a96f5 into apache:master Dec 13, 2023
48 of 53 checks passed
@LakshSingla LakshSingla added this to the 29.0.0 milestone Jan 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants