
feat(all): merge apache kafka 3.8.0 771b957 #1813

Merged: 129 commits merged into 1.2 on Aug 15, 2024
Conversation

@superhx (Collaborator) commented Aug 15, 2024

Main Review Files:

  1. Conflict files:
both modified:   core/src/main/scala/kafka/server/BrokerServer.scala
both modified:   core/src/main/scala/kafka/server/KafkaConfig.scala
both modified:   core/src/main/scala/kafka/server/KafkaRaftServer.scala
both modified:   core/src/main/scala/kafka/server/ReplicaManager.scala
both modified:   core/src/main/scala/kafka/server/checkpoints/OffsetCheckpointFile.scala
both modified:   core/src/main/scala/kafka/server/metadata/BrokerMetadataPublisher.scala
both modified:   core/src/main/scala/kafka/tools/StorageTool.scala
both modified:   core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala
both modified:   core/src/test/scala/unit/kafka/server/metadata/BrokerMetadataPublisherTest.scala
both modified:   licenses/APL
both modified:   tests/kafkatest/tests/connect/connect_distributed_test.py
  2. Core module changes and whether the changes need to be applied to its extensions

dajac and others added 30 commits May 31, 2024 21:50
This patch introduces the `group.version` feature flag with one version:
1) Version 1 enables the new consumer group rebalance protocol (KIP-848).

Reviewers: Justine Olshan <[email protected]>
This patch updates the system tests to correctly enable the new consumer protocol/coordinator in the tests requiring them.

I went with the simplest approach for now. Long term, I think that we should refactor the tests to better handle features and non-production features.

I got a successful run of the consumer system tests with this patch combined with apache/kafka#16120: https://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/system-test-kafka-branch-builder--1717155071--dajac--KAFKA-16860-2--29028ae0dd/2024-05-31--001./2024-05-31--001./report.html.

Reviewers: Justine Olshan <[email protected]>
This patch optimizes the uniform (homogeneous) assignor by avoiding creating a copy of all the assignments. Instead, the assignor creates a copy only if the assignment is updated. It is a sort of copy-on-write. This change reduces the overhead of the TargetAssignmentBuilder when run with the uniform (homogeneous) assignor.
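The copy-on-write idea is simple enough to sketch. Below is a minimal illustration with hypothetical names (not the actual TargetAssignmentBuilder code): reuse the existing assignment map unless this computation actually changes it, and copy only on change.

```
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public final class CopyOnWriteAssignment {

    // Returns the assignments with one member updated. If the newly computed
    // partitions equal the member's existing ones, the original map is
    // returned untouched; a copy is made only when something changed.
    static Map<String, Set<Integer>> withUpdatedMember(
            Map<String, Set<Integer>> assignments,  // memberId -> partitions
            String memberId,
            Set<Integer> newPartitions) {
        if (newPartitions.equals(assignments.get(memberId))) {
            return assignments;                     // hot path: no copy at all
        }
        Map<String, Set<Integer>> copy = new HashMap<>(assignments);
        copy.put(memberId, newPartitions);
        return copy;
    }
}
```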

Trunk:

```
Benchmark                                     (memberCount)  (partitionsToMemberRatio)  (topicCount)  Mode  Cnt   Score   Error  Units
TargetAssignmentBuilderBenchmark.build                10000                         10           100  avgt    5  24.535 ± 1.583  ms/op
TargetAssignmentBuilderBenchmark.build                10000                         10          1000  avgt    5  24.094 ± 0.223  ms/op
JMH benchmarks done
```

```
Benchmark                                       (assignmentType)  (assignorType)  (isRackAware)  (memberCount)  (partitionsToMemberRatio)  (subscriptionType)  (topicCount)  Mode  Cnt   Score   Error  Units
ServerSideAssignorBenchmark.doAssignment             INCREMENTAL         UNIFORM          false          10000                         10         HOMOGENEOUS           100  avgt    5  14.697 ± 0.133  ms/op
ServerSideAssignorBenchmark.doAssignment             INCREMENTAL         UNIFORM          false          10000                         10         HOMOGENEOUS          1000  avgt    5  15.073 ± 0.135  ms/op
JMH benchmarks done
```

Patch:

```
Benchmark                                     (memberCount)  (partitionsToMemberRatio)  (topicCount)  Mode  Cnt  Score   Error  Units
TargetAssignmentBuilderBenchmark.build                10000                         10           100  avgt    5  3.376 ± 0.577  ms/op
TargetAssignmentBuilderBenchmark.build                10000                         10          1000  avgt    5  3.731 ± 0.359  ms/op
JMH benchmarks done
```

```
Benchmark                                       (assignmentType)  (assignorType)  (isRackAware)  (memberCount)  (partitionsToMemberRatio)  (subscriptionType)  (topicCount)  Mode  Cnt  Score   Error  Units
ServerSideAssignorBenchmark.doAssignment             INCREMENTAL         UNIFORM          false          10000                         10         HOMOGENEOUS           100  avgt    5  1.975 ± 0.086  ms/op
ServerSideAssignorBenchmark.doAssignment             INCREMENTAL         UNIFORM          false          10000                         10         HOMOGENEOUS          1000  avgt    5  2.026 ± 0.190  ms/op
JMH benchmarks done
```

Reviewers: Ritika Reddy <[email protected]>, Jeff Kim <[email protected]>, Justine Olshan <[email protected]>
…gardless of in-flight heartbeats. (#16017)

Fix the bug where the heartbeat is not sent when a newly created consumer is immediately closed.

When there is a heartbeat request in flight and the consumer is then closed, the HeartbeatRequestManager does not correctly send the closing heartbeat in the current code, because a previous heartbeat request is still in flight. However, the closing heartbeat is only sent once, so in this situation the broker will not know that the consumer has left the consumer group until the consumer's heartbeat times out.
This situation causes the broker to wait until the consumer's heartbeat times out before triggering a consumer group rebalance, which in turn affects message consumption.
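The shape of the fix can be sketched as follows (hypothetical names, not the real HeartbeatRequestManager): the one-shot leave heartbeat on close must bypass the usual in-flight check.

```
// Sketch only: "send the closing heartbeat regardless of in-flight
// heartbeats". Names and structure are hypothetical.
final class HeartbeatSketch {
    private boolean heartbeatInFlight = false;
    private boolean closing = false;

    void onClose() {
        closing = true;
    }

    void poll() {
        if (closing) {
            // Before the fix, heartbeatInFlight suppressed this send and the
            // one-shot leave heartbeat was dropped, so the broker only noticed
            // the departure after the session timed out.
            sendLeaveGroupHeartbeat();
            return;
        }
        if (!heartbeatInFlight) {
            heartbeatInFlight = true;
            sendRegularHeartbeat();
        }
    }

    private void sendLeaveGroupHeartbeat() { /* tells the broker we left */ }
    private void sendRegularHeartbeat()    { /* normal liveness heartbeat */ }
}
```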

Reviewers: Lianet Magrans <[email protected]>, Chia-Ping Tsai <[email protected]>
…945)

When upgrading from a MetadataVersion older than 3.7-IV2, we need to resend the broker registration, so that the controller can record the storage directories. The current code for doing this has several problems, however. One is that it tends to trigger even in cases where we don't actually need it. Another is that when re-registering the broker, the broker is marked as fenced.

This PR moves the handling of the re-registration case out of BrokerMetadataPublisher and into BrokerRegistrationTracker. The re-registration code there will only trigger in the case where the broker sees an existing registration for itself with no directories set. This is much more targeted than the original code.

Additionally, in ClusterControlManager, when re-registering the same broker, we now preserve its fencing and shutdown state, rather than clearing those. (There isn't any good reason re-registering the same broker should clear these things... this was purely an oversight.) Note that we can tell the broker is "the same" because it has the same IncarnationId.
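A minimal sketch of that trigger condition, with hypothetical types (the real logic lives in BrokerRegistrationTracker):

```
import java.util.List;
import java.util.UUID;

// Hypothetical shape of the re-registration check described above.
record BrokerRegistration(UUID incarnationId, List<UUID> directories) {}

final class ReRegistrationCheck {
    // Re-register only when our own registration exists but lists no
    // storage directories (i.e. it predates 3.7-IV2 metadata).
    static boolean shouldReRegister(BrokerRegistration existing) {
        return existing != null && existing.directories().isEmpty();
    }
}
```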

Reviewers: Gaurav Narula <[email protected]>, Igor Soarez <[email protected]>
…d topics having empty partitions (#16042)

Reviewers: Chia-Ping Tsai <[email protected]>
This patch is a small refactoring which mainly aims at avoiding constructing a copy of the new target assignment in the TargetAssignmentBuilder, because the copy is not used by the caller. The change relies on the existing tests and does not really have an impact on performance (e.g. validated with TargetAssignmentBuilderBenchmark).

Reviewers: Chia-Ping Tsai <[email protected]>
…than group max size. (#16163)

Fix the bug where the group downgrades to a classic group when a member leaves, even though the consumer group size is still larger than `classicGroupMaxSize`.
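A hedged sketch of the corrected condition (hypothetical names): the downgrade must check the remaining group size, not just the protocol of the remaining members.

```
// Sketch only: a consumer group may fall back to a classic group when a
// member leaves only if the remaining size still fits under
// classicGroupMaxSize. Names are hypothetical.
final class DowngradeCheck {
    static boolean shouldDowngrade(int remainingMembers,
                                   boolean allRemainingUseClassicProtocol,
                                   int classicGroupMaxSize) {
        return allRemainingUseClassicProtocol
            && remainingMembers <= classicGroupMaxSize;
    }
}
```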

Reviewers: Chia-Ping Tsai <[email protected]>, David Jacot <[email protected]>
This patch fixes a typo in MetadataVersion.IBP_4_0_IV0. It should be 0 not O.

Reviewers: Justine Olshan <[email protected]>, Jun Rao <[email protected]>,  Chia-Ping Tsai <[email protected]>
We have revamped the thread idle ratio metric in apache/kafka#15835. apache/kafka#15835 (comment) describes a case where the metric loses accuracy; in order to set a lower bound on the accuracy, this patch re-adds a poll with a timeout that was removed as part of apache/kafka#15430.

Reviewers: David Jacot <[email protected]>
…ers (#16151)

Use REQUEST_TIMEOUT_MS_CONFIG in AdminClient.fenceProducers,
or options.timeoutMs if specified, as the transaction timeout.

No transaction will be started with this timeout, but
ReplicaManager.appendRecords uses this value as its timeout.
Use REQUEST_TIMEOUT_MS_CONFIG like a regular producer append
to allow for replication to take place.
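For reference, a minimal Admin client call that exercises this path; the broker address and transactional id below are placeholders:

```
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.FenceProducersOptions;

public class FenceProducersExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Per this change, used as the transaction timeout for the fencing
        // append unless options.timeoutMs is set.
        props.put(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, "30000");

        try (Admin admin = Admin.create(props)) {
            FenceProducersOptions options = new FenceProducersOptions()
                .timeoutMs(15000); // takes precedence when specified
            admin.fenceProducers(List.of("my-transactional-id"), options)
                 .all()
                 .get();
        }
    }
}
```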

Co-Authored-By: Adrian Preston <[email protected]>
… and compare raw task configs before publishing them (#16122)

Reviewers: Mickael Maison <[email protected]>
The time taken to compute a new assignment is critical, so this patch extends the existing logging to log it too. This is very useful information to have.

Reviewers: Luke Chen <[email protected]>
This patch reworks the `PartitionAssignor` interface to use interfaces instead of POJOs. It mainly introduces the `MemberSubscriptionSpec` interface that represents a member subscription and changes the `GroupSpec` interface to expose the subscriptions and the assignments via different methods.

The patch does not change the performance.
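A simplified sketch of the reworked shapes (not the exact signatures, which live in the group coordinator's assignor package):

```
import java.util.Collection;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.UUID;

// Simplified: a member's subscription is its own interface, and GroupSpec
// exposes subscriptions and assignments through separate methods.
interface MemberSubscriptionSpec {
    Optional<String> rackId();
    Set<UUID> subscribedTopicIds();
}

interface GroupSpec {
    Collection<String> memberIds();
    MemberSubscriptionSpec memberSubscription(String memberId);
    Map<UUID, Set<Integer>> memberAssignment(String memberId); // topicId -> partitions
}
```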

before:
```
Benchmark                                     (memberCount)  (partitionsToMemberRatio)  (topicCount)  Mode  Cnt  Score   Error  Units
TargetAssignmentBuilderBenchmark.build                10000                         10           100  avgt    5  3.462 ± 0.687  ms/op
TargetAssignmentBuilderBenchmark.build                10000                         10          1000  avgt    5  3.626 ± 0.412  ms/op
JMH benchmarks done
```

after:
```
Benchmark                                     (memberCount)  (partitionsToMemberRatio)  (topicCount)  Mode  Cnt  Score   Error  Units
TargetAssignmentBuilderBenchmark.build                10000                         10           100  avgt    5  3.677 ± 0.683  ms/op
TargetAssignmentBuilderBenchmark.build                10000                         10          1000  avgt    5  3.991 ± 0.065  ms/op
JMH benchmarks done
```

Reviewers: David Jacot <[email protected]>
When PartitionRegistration#merge() reads a PartitionChangeRecord
from an older MetadataVersion, with a replica assignment change
and without #directories() set, it produces a directory assignment
of DirectoryId.UNASSIGNED. This is problematic because the MetadataVersion
may not yet support directory assignments, leading to an
UnwritableMetadataException in PartitionRegistration#toRecord.

Since the Controller always sets directories on PartitionChangeRecord
if the MetadataVersion supports it, via PartitionChangeBuilder,
there's no need for PartitionRegistration#merge() to populate
directories upon a replica assignment change.
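In sketch form (hypothetical signature), the fix keeps merge() from inventing directory data the MetadataVersion cannot serialize:

```
import java.util.UUID;

// Sketch only: when the record carries no directories, keep the current
// ones instead of emitting DirectoryId.UNASSIGNED entries that an older
// MetadataVersion cannot write back out.
final class MergeDirectoriesSketch {
    static UUID[] mergedDirectories(UUID[] currentDirectories, UUID[] recordDirectories) {
        return recordDirectories != null ? recordDirectories : currentDirectories;
    }
}
```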

Reviewers: Luke Chen <[email protected]>
Allow KRaft replicas to send requests to any node (Node), not just the nodes configured in the
controller.quorum.voters property. This flexibility is needed so KRaft can implement the
controller.quorum.bootstrap.servers configuration, send requests to the dynamically changing set of
voters, and send requests to the leader endpoint (Node) discovered through the KRaft RPCs
(specifically the BeginQuorumEpoch request and Fetch response).

This was achieved by changing the RequestManager API to accept Node instead of just the replica ID.
Internally, the request manager tracks connection state using the Node.idString method to match the
connection management used by NetworkClient.

The API for RequestManager is also changed so that the ConnectState class is not exposed in the
API. This allows the request manager to reclaim heap memory for any connection that is ready.
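In sketch form (hypothetical shape, not the real RequestManager), keying connection state by Node#idString matches NetworkClient's bookkeeping:

```
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.common.Node;

// Sketch: per-destination state keyed by Node#idString(), the same key
// NetworkClient uses for connection management. Ready connections are
// dropped from the map so their state can be garbage-collected.
final class NodeKeyedRequestManager {
    enum State { CONNECTING, BACKING_OFF }

    private final Map<String, State> inflight = new HashMap<>();

    void onRequestSent(Node destination) {
        inflight.put(destination.idString(), State.CONNECTING);
    }

    void onResponseReceived(Node destination) {
        inflight.remove(destination.idString()); // connection is ready again
    }
}
```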

The NetworkChannel was updated to receive the endpoint information (Node) through the outbound raft
request (RaftRequest.Outbound). This makes the network channel more flexible as it doesn't need to
be configured with the list of all possible endpoints. RaftRequest.Outbound and
RaftResponse.Inbound were updated to include the remote node instead of just the remote id.

The follower state tracked by KRaft replicas was updated to include both the leader id and the
leader's endpoint (Node). In this commit the node value is computed from the set of voters. In a
future commit this will be updated so that it is sent through the KRaft RPCs, for example the
BeginQuorumEpoch request and Fetch response.

Support for configuring controller.quorum.bootstrap.servers was added. This includes changes to
KafkaConfig, QuorumConfig, etc. All of the tests using QuorumTestHarness were changed to use the
controller.quorum.bootstrap.servers instead of the controller.quorum.voters for the broker
configuration. Finally, the node id for the bootstrap server will be decreasing negative numbers
starting with -2.
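A small sketch of that last detail (hypothetical helper): each entry in controller.quorum.bootstrap.servers is turned into a Node with a synthetic id, counting down from -2.

```
import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.common.Node;

final class BootstrapNodesSketch {
    // "host1:9093,host2:9093" -> Node(-2, host1, 9093), Node(-3, host2, 9093), ...
    static List<Node> parse(String bootstrapServers) {
        List<Node> nodes = new ArrayList<>();
        int id = -2;
        for (String hostPort : bootstrapServers.split(",")) {
            int colon = hostPort.lastIndexOf(':');
            nodes.add(new Node(id--,
                               hostPort.substring(0, colon),
                               Integer.parseInt(hostPort.substring(colon + 1))));
        }
        return nodes;
    }
}
```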

Reviewers: Jason Gustafson <[email protected]>, Luke Chen <[email protected]>, Colin P. McCabe <[email protected]>
… task until the shutdown timer is expired. (#16156)

Reviewers: Lianet Magrans <[email protected]>, Chia-Ping Tsai <[email protected]>
…ompatibility with classic protocol members (#16145)

During online migration, there could be ConsumerGroups that have members using the classic protocol. In the current implementation, `STALE_MEMBER_EPOCH` can be thrown in the ConsumerGroup offset fetch/commit validation, but it is not supported by the classic protocol. Thus, this patch changes `ConsumerGroup#validateOffsetCommit` and `ConsumerGroup#validateOffsetFetch` to ensure compatibility.
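A hedged sketch of the validation change (hypothetical names): epoch validation that can surface STALE_MEMBER_EPOCH is applied only to members on the new consumer protocol.

```
// Sketch only: classic-protocol members cannot understand
// STALE_MEMBER_EPOCH, so they must not be validated against member epochs.
final class OffsetCommitValidationSketch {
    static void validateOffsetCommit(boolean usesClassicProtocol,
                                     int providedMemberEpoch,
                                     int currentMemberEpoch) {
        if (usesClassicProtocol) {
            return; // fall back to the classic generation checks (omitted)
        }
        if (providedMemberEpoch != currentMemberEpoch) {
            throw new IllegalStateException("STALE_MEMBER_EPOCH");
        }
    }
}
```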

Reviewers: Jeff Kim <[email protected]>, David Jacot <[email protected]>
…riesAreAvailable and StorageToolTest.testFormatEmptyDirectory (#16186)

Reviewers: Luke Chen <[email protected]>, Chia-Ping Tsai <[email protected]>
…mote storage (#16071)

Reviewers: Kamal Chandraprakash<[email protected]>, Luke Chen <[email protected]>, Satish Duggana <[email protected]>
…schemas without inner schemas (#16161)

Signed-off-by: Greg Harris <[email protected]>
Reviewers: Chris Egerton <[email protected]>
FK left-join was changed via KIP-962. This PR updates the docs accordingly.

Reviewers: Ayoub Omari <[email protected]>, Matthias J. Sax <[email protected]>
…ing (#16147)

This PR takes care of making the callback to TaskAssignor.onAssignmentComputed.

It also contains a change to the public AssignmentConfigs API, as well as some simplifications of the StickyTaskAssignor.

This PR also changes the rack information fetching to happen lazily in the case where the TaskAssignor makes its decisions without said rack information.

Reviewers: Anna Sophie Blee-Goldman <[email protected]>
Fixed the calculation of the store name list based on the subtopology being accessed.

Also added a new test to make sure this new functionality works as intended.

Reviewers: Anna Sophie Blee-Goldman <[email protected]>
…missing (#16165)

When starting up the Kafka LogManager, we check stray replicas to avoid some corner cases. But this check might leave the broker unable to start up if partition.metadata is missing, because on startup we load logs from files, and the topicId of a log comes from the partition.metadata file. So, if partition.metadata is missing, the topicId will be None, and LogManager#isStrayKraftReplica will fail with a "no topicId" error.

The missing partition.metadata could be due to a storage failure; another possible path is an unclean shutdown after the topic is created on the replica but before data is flushed into the partition.metadata file. This is possible because we do the flush asynchronously.

When finding a log without a topicId, we should treat it as a stray log and then delete it.
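In sketch form (hypothetical shape), the new behavior in LogManager#isStrayKraftReplica:

```
import java.util.Optional;
import java.util.UUID;

// Sketch only: a log whose partition.metadata is missing has no topicId,
// cannot be matched against cluster metadata, and is treated as stray
// (and therefore deleted) instead of failing broker startup.
final class StrayLogSketch {
    static boolean isStrayKraftReplica(Optional<UUID> logTopicId) {
        if (logTopicId.isEmpty()) {
            return true; // previously: threw an error and blocked startup
        }
        return false;    // otherwise compare against the metadata image (omitted)
    }
}
```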

Reviewers: Luke Chen <[email protected]>, Gaurav Narula <[email protected]>
…is in the voter set (#16079)

1. Changing log message from error to info - We may expect the HW calculation to give us a smaller result than the current HW in the case of quorum reconfiguration. We will continue to not allow the HW to actually decrease.
2. Logic for finding the updated LeaderEndOffset for updateReplicaState is changed as well. We do not assume the leader is in the voter set and check the observer states as well.
3. updateLocalState now accepts an additional "lastVoterSet" param which allows us to update the leader state with the last known voters. Any nodes in this set but not in voterStates will be added to voterStates and removed from observerStates; any nodes not in this set but in voterStates will be removed from voterStates and added to observerStates (see the sketch below).
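A minimal sketch of that reconciliation (hypothetical structure):

```
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Sketch of item 3: reconcile voterStates/observerStates against the
// last known voter set. Offsets carry over when a node switches sides.
final class VoterReconciliationSketch {
    final Map<Integer, Long> voterStates = new HashMap<>();    // id -> end offset
    final Map<Integer, Long> observerStates = new HashMap<>();

    void updateLocalState(Set<Integer> lastVoterSet) {
        // Promote: in lastVoterSet but currently tracked as an observer.
        for (Integer id : lastVoterSet) {
            if (!voterStates.containsKey(id)) {
                voterStates.put(id, observerStates.getOrDefault(id, 0L));
                observerStates.remove(id);
            }
        }
        // Demote: tracked as a voter but no longer in lastVoterSet.
        voterStates.entrySet().removeIf(e -> {
            if (!lastVoterSet.contains(e.getKey())) {
                observerStates.put(e.getKey(), e.getValue());
                return true;
            }
            return false;
        });
    }
}
```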

Reviewers: Luke Chen <[email protected]>, José Armando García Sancio <[email protected]>
jolshan and others added 20 commits June 21, 2024 09:04
…s group protocol config (#16409)

This reverts commit e95e91a.

With the change to include the group.version flag, these tests fail due to trying to set the feature for the old version.

It is unclear if these tests originally worked as intended, and given the upgrade is not expected for 3.8, we will just revert from 3.8.

Reviewers: David Jacot <[email protected]>
All public interface changes should be briefly mentioned in the
upgrade guide.

Reviewers: Matthias J. Sax <[email protected]>, Anna Sophie Blee-Goldman <[email protected]>, Nick Telford <[email protected]>
Reverting due to complications when trying to fix KAFKA-17011 in 3.8. Now there will be no production features, so we won't send any over the wire in ApiVersions or BrokerRegistration and cause issues when the receiver is on an old version.

I reverted the typo PR to make the reverts cleaner and minimize chances for errors. The only conflicts were due to imports and a modified test testConsumerGroupDescribe. The fix was to keep the modified parameters but remove the metadataCache code.

Some other minor changes for items we wanted to keep (not revert)

Reviewers: Colin P. McCabe <[email protected]>, David Jacot <[email protected]>, Chia-Ping Tsai <[email protected]>, Jun Rao <[email protected]>
…6400)

* Revert "KAFKA-16154: Broker returns offset for LATEST_TIERED_TIMESTAMP (#15213)"
* Set 3.8_IV0 as latest production version in 3.8
* Bring changes committed in KAFKA-16968
* Make ClusterTest annotation metadata default to 3.9

---------

Signed-off-by: Josep Prat <[email protected]>

Reviewers: Justine Olshan <[email protected]>, Jun Rao <[email protected]>
…16533)

LATEST_PRODUCTION version in MetadataVersion.java was updated in
both #16347 and #16400, but it was left unchanged in the system
tests.

Reviewers: Josep Prat <[email protected]>
…16545)

With the state updater enabled, tasks that are created but not
initialized are stored in a set. In each poll iteration the stream
thread drains that set, initializes the tasks, and adds them to the
state updater.

On partition lost, all active tasks are closed.

This commit ensures that active tasks pending initialization in
the set mentioned above are closed cleanly on partition lost.
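The shape of the fix, as a hedged sketch (hypothetical names):

```
import java.util.HashSet;
import java.util.Set;

// Sketch only: tasks still waiting for initialization must be closed
// cleanly on partition lost too, not only the tasks already handed to
// the state updater.
final class PartitionLostSketch {
    interface Task { void closeClean(); }

    private final Set<Task> pendingInitialization = new HashSet<>();

    void onPartitionsLost() {
        for (Task task : pendingInitialization) {
            task.closeClean(); // previously these were skipped and leaked
        }
        pendingInitialization.clear();
        // ... then close the active tasks owned by the state updater, as before.
    }
}
```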

Reviewer: Lucas Brutschy <[email protected]>
…try (#16555)

When active tasks are revoked they land as suspended tasks in the
task registry. If they are then reassigned, the tasks are resumed
and put into restoration. On assignment, we first handle the tasks
in the task registry and then the tasks in the state updater. That
means that if a task is re-assigned after a revocation, we remove
the suspended task from the task registry, resume it, add it
to the state updater, and then remove it from the list of tasks
to create. After that we iterate over the tasks in the state
updater and remove from there the tasks that are not in the list
of tasks to create. However, now the state updater contains the
resumed tasks that we removed from the task registry before but
are no longer in the list of tasks to create. In other words, we
remove the resumed tasks from the state updater and close them
although we just got them assigned.

This commit ensures that we first handle the tasks in the
state updater and then the tasks in the task registry.

Reviewer: Lucas Brutschy <[email protected]>
… and JsonDeserializer (#16565)

Reviewers: Mickael Maison <[email protected]>, Josep Prat <[email protected]>, Greg Harris <[email protected]>
… (#16570)

When Streams tries to transit a restored active task to RUNNING, the
first thing it does is get the committed offsets for this task.
If getting the offsets hits a timeout, Streams does not re-throw
the error initially, but tries to get the committed offsets later
until a Streams-specific timeout is hit.

Restored active tasks from the state updater are removed from the
output queue of the restored tasks in the state updater. If a
timeout occurs, the restored task is neither added to the
task registry nor re-added to the state updater. The task is lost
since it is not maintained anywhere. This means the task is also
not closed. When the same task is created again on the same
stream thread since the stream thread does not know about this
lost task, the state stores are opened again and RocksDB will
throw the "No locks available" error.

This commit re-adds the task to the state updater if the
committed-offsets request times out.
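In sketch form (hypothetical names), the restored task is handed back instead of being dropped:

```
import org.apache.kafka.common.errors.TimeoutException;

// Sketch only: on a committed-offsets timeout, re-add the restored task to
// the state updater so it is never orphaned with open state stores.
final class RestoredTaskSketch {
    interface Task {}
    interface StateUpdater { void add(Task task); }

    void transitToRunning(Task task, StateUpdater stateUpdater) {
        try {
            fetchCommittedOffsets(task); // may throw TimeoutException
            // ... add the task to the task registry as RUNNING (omitted)
        } catch (TimeoutException e) {
            stateUpdater.add(task); // retry the restoration path later
        }
    }

    private void fetchCommittedOffsets(Task task) { /* consumer.committed(...) */ }
}
```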

Reviewers: Lucas Brutschy <[email protected]>
…eTopics (#16217)

Reviewers: Andrew Schofield <[email protected]>, Lianet Magrans <[email protected]>, David Arthur <[email protected]>
Fix to complete a Future which was stuck because exception.getCause() throws an error.

The fix in #16217 unblocked the blocking thread, but the exception in the catch block from the blocking get call was wrapped in an ExecutionException, which is not the case when moved to the async workflow; hence getCause is not needed.
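The difference, as a minimal sketch (hypothetical names):

```
import java.util.concurrent.CompletableFuture;

// Sketch only: the blocking get() path wrapped failures in an
// ExecutionException, so getCause() was needed; in the async callback the
// throwable already is the failure, and calling getCause() breaks.
final class CompleteFutureSketch {
    static void onFailure(CompletableFuture<Void> result, Throwable error) {
        // Before: result.completeExceptionally(error.getCause());
        result.completeExceptionally(error); // use the throwable directly
    }
}
```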

Reviewers: Luke Chen <[email protected]>, Chia-Ping Tsai <[email protected]>
…aliases correctly (#16608)

Signed-off-by: Greg Harris <[email protected]>
Reviewers: Chris Egerton <[email protected]>, Chia-Ping Tsai <[email protected]>, Josep Prat <[email protected]>
…ool (#16607) (3.8) (#16617)

Reviewers: Chia-Ping Tsai <[email protected]>, Greg Harris <[email protected]>, Chris Egerton <[email protected]>

Co-authored-by: Dmitry Werner <[email protected]>
@CLAassistant commented Aug 15, 2024

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 18 committers have signed the CLA.

✅ superhx
❌ lianetm
❌ kamalcph
❌ showuon
❌ mimaison
❌ gaurav-narula
❌ rodesai
❌ jlprat
❌ lucasbru
❌ cadonna
❌ soarez
❌ jinyongchoi
❌ vbalani002
❌ apoorvmittal10
❌ gharris1727
❌ FrankYang0529
❌ jolshan
❌ vinnybod
You have signed the CLA already but the status is still pending? Let us recheck it.

@superhx superhx changed the base branch from main to 1.2 August 15, 2024 08:38
@superhx superhx merged commit 64bad21 into 1.2 Aug 15, 2024
5 of 6 checks passed
@superhx superhx deleted the merge_3.8.0 branch August 15, 2024 09:45