Do not close the socket when the broker failed due to MetadataStoreException #390
Merged: BewareMyPower merged 1 commit into apache:main from BewareMyPower:bewaremypower/service-not-ready-retry on Feb 2, 2024
BewareMyPower requested review from merlimat, RobertIndie, Demogorgon314, poorbarcode and shibd on January 30, 2024 14:50
BewareMyPower added a commit to BewareMyPower/pulsar that referenced this pull request on Jan 30, 2024
…kBusyException

### Motivation

When a broker restarts, the following case can occur in `NamespaceService#findBrokerServiceUrl`:

1. `ownershipCache.getOwnerAsync(bundle)` returns empty data, so `searchForCandidateBroker` is called.
2. The broker itself is elected as the candidate broker.
3. Meanwhile, another broker has acquired the distributed lock of the bundle, so `ownershipCache.tryAcquiringOwnership` fails with

```java
lookupFuture.completeExceptionally(new PulsarServerException(
        "Failed to acquire ownership for namespace bundle " + bundle, exception));
```

See apache/pulsar-client-cpp#390 for the real world case.

Then in `TopicLookupBase#handleLookupError`, this exception is wrapped into a `ServiceNotReady` error sent to the client. This case happens very frequently in our production environment when a broker restarts. If a `PulsarClient` has many producers or consumers, the connection is closed, which causes many reconnections and puts heavy pressure on the cluster.

### Modifications

In `handleLookupError`, check the `PulsarServerException` and unwrap the `CompletionException`. If the unwrapped exception is a `MetadataStoreException`, return `MetadataError` to avoid closing the connection on the client side.

Add `testLookupConnectionNotCloseIfFailedToAcquireOwnershipOfBundle` to simulate the case and verify the socket won't be closed.
Mark it as drafted for now, I'm trying to mock the |
BewareMyPower force-pushed the bewaremypower/service-not-ready-retry branch from cc5d8b0 to 58223e7 on January 31, 2024 13:39
BewareMyPower changed the title from "Do not close the socket when the broker failed to acquire ownership for namespace bundle" to "Do not close the socket when the broker failed due to MetadataStoreException" on Jan 31, 2024
…ception

### Motivation

When the broker fails to acquire the ownership of a namespace bundle because of a `LockBusyException`, it means another broker has acquired the metadata store path and has not released it. For example:

Broker 1:

```
2024-01-24T23:35:36,626+0000 [metadata-store-10-1] WARN org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup <role> for topic persistent://<tenant>/<ns>/<topic> with error org.apache.pulsar.broker.PulsarServerException: Failed to acquire ownership for namespace bundle <tenant>/<ns>/0x50000000_0x51000000
Caused by: java.util.concurrent.CompletionException: org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/<tenant>/<ns>/0x50000000_0x51000000 is already locked
```

Broker 2:

```
2024-01-24T23:35:36,650+0000 [broker-topic-workers-OrderedExecutor-3-0] INFO org.apache.pulsar.broker.PulsarService - Loaded 1 topics on <tenant>/<ns>/0x50000000_0x51000000 -- time taken: 0.044 seconds
```

After broker 2 released the lock at 23:35:36,650, the lookup request to broker 1 should tell the client that namespace bundle 0x50000000_0x51000000 is currently being unloaded, and on the next retry the client will connect to the new owner broker.

Here is another typical error:

```
2024-01-24T23:57:57,264+0000 [pulsar-io-4-5] INFO org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup <role> for topic persistent://<tenant>/<ns>/<topic> with error Namespace bundle <tenant>/<ns>/0x0d000000_0x0e000000 is being unloaded
```

After apache/pulsar#21211, the server error becomes `MetadataError` rather than `ServiceNotReady`. However, since the `ServerError` is `ServiceNotReady`, the client will close the connection. If there are many other producers or consumers on the same connection, they will all reestablish their connection to the broker, which is unnecessary and puts heavy pressure on the broker side.

### Modifications

In `checkServerError`, when the error code is `ServiceNotReady`, check the error message as well. If it hits the case in `handleLookupError`, do not close the connection.

Add `ConnectionTest` on a mocked `ClientConnection` object to verify `close()` will not be called.
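The decision described in the modifications could be sketched roughly as follows. This is only an illustration, not the code merged in this PR: the `ServerError` enum and the `shouldCloseConnection` helper are hypothetical names, and the matched substrings are assumptions taken from the broker log messages quoted in the motivation.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the protocol's server error codes.
enum class ServerError { UnknownError, MetadataError, ServiceNotReady };

// Hypothetical helper: returns true when a ServiceNotReady error should
// still cause the connection to be closed. The substrings are assumptions
// based on the broker logs quoted in the motivation.
inline bool shouldCloseConnection(ServerError error, const std::string& message) {
    if (error != ServerError::ServiceNotReady) {
        return false;  // other error codes are out of scope for this sketch
    }
    const bool transientOwnershipError =
        message.find("Failed to acquire ownership for namespace bundle") != std::string::npos ||
        message.find("is being unloaded") != std::string::npos;
    // Keep the connection open for the known transient lookup failures.
    return !transientOwnershipError;
}
```

Matching on message text is brittle, but it lets a client avoid needless disconnects against brokers that still report these failures as `ServiceNotReady`.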
BewareMyPower force-pushed the bewaremypower/service-not-ready-retry branch from 58223e7 to b6329e8 on January 31, 2024 16:47
RobertIndie approved these changes on Feb 2, 2024
### Motivation

When the broker fails to acquire the ownership of a namespace bundle because of a `LockBusyException`, it means another broker has acquired the metadata store path and has not released it. For example:

Broker 1:

Broker 2:

After broker 2 released the lock at 23:35:36,650, the lookup request to broker 1 should tell the client that namespace bundle 0x50000000_0x51000000 is currently being unloaded, and on the next retry the client will connect to the new owner broker.

Here is another typical error:

After apache/pulsar#21211, the server error becomes `MetadataError` rather than `ServiceNotReady`. However, since the `ServerError` is `ServiceNotReady`, the client will close the connection. If there are many other producers or consumers on the same connection, they will all reestablish their connection to the broker, which is unnecessary and puts heavy pressure on the broker side.

### Modifications

In `checkServerError`, when the error code is `ServiceNotReady`, check the error message as well. If it hits the case in `handleLookupError`, do not close the connection.

Add `ConnectionTest` on a mocked `ClientConnection` object to verify `close()` will not be called.
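The testing idea (verify on a mock that `close()` is not invoked) can be illustrated with a hand-rolled mock. This is a minimal sketch under stated assumptions: `Connection`, `onServiceNotReady` and `MockConnection` are hypothetical names, not the real `ClientConnection` API or the PR's actual `ConnectionTest`, and the matched substrings are assumed from the broker logs.

```cpp
#include <cassert>
#include <string>

// Hypothetical minimal interface; the real ClientConnection is much larger.
class Connection {
   public:
    virtual ~Connection() = default;
    virtual void close() = 0;

    // Simplified stand-in for the ServiceNotReady branch of checkServerError:
    // close unless the message matches a known transient lookup failure.
    void onServiceNotReady(const std::string& message) {
        if (message.find("Failed to acquire ownership") == std::string::npos &&
            message.find("is being unloaded") == std::string::npos) {
            close();
        }
    }
};

// Hand-rolled mock that only counts how often close() was called.
class MockConnection : public Connection {
   public:
    int closeCalls = 0;
    void close() override { ++closeCalls; }
};
```

A test would feed the mock one of the transient messages and assert that `closeCalls` stays at zero, then feed an unrelated message and assert that it becomes one.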