Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not close the socket when the broker failed due to MetadataStoreException #390

Conversation

BewareMyPower
Copy link
Contributor

@BewareMyPower BewareMyPower commented Jan 30, 2024

Motivation

When the broker failed to acquire the ownership of a namespace bundle by LockBusyException. It means there is another broker that has acquired the metadata store path and didn't release that path. For example:

Broker 1:

2024-01-24T23:35:36,626+0000 [metadata-store-10-1] WARN  org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup <role> for topic persistent://<tenant>/<ns>/<topic> with error org.apache.pulsar.broker.PulsarServerException: Failed to acquire ownership for namespace bundle <tenant>/<ns>/0x50000000_0x51000000
   Caused by: java.util.concurrent.CompletionException: org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/<tenant>/<ns>/0x50000000_0x51000000 is already locked

Broker 2:

2024-01-24T23:35:36,650+0000 [broker-topic-workers-OrderedExecutor-3-0] INFO  org.apache.pulsar.broker.PulsarService - Loaded 1 topics on <tenant>/<ns>/0x50000000_0x51000000 -- time taken: 0.044 seconds

After broker 2 released the lock at 23:35:36,650, the lookup request to broker 1 should tell the client that namespace bundle 0x50000000_0x51000000 is currently being unloaded and in the next retry the client will connect to the new owner broker.

Here is another typical error:

2024-01-24T23:57:57,264+0000 [pulsar-io-4-5] INFO  org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup <role> for topic persistent://<tenant>/<ns>/<topic> with error Namespace bundle <tenant>/<ns>/0x0d000000_0x0e000000 is being unloaded

Though after apache/pulsar#21211, the server error becomes MetadataError rather than ServiceNotReady.

However, since the ServerError is ServiceNotReady, the client will close the connection. If there are many other producers or consumers on the same connection, they will all reestablish connection to the broker, which is unnecessary and brings much pressure to broker side.

Modifications

In checkServerError, when the error code is ServiceNotReady, check
the error message as well, if it hit the case in handleLookupError, do
not close the connection.

Add ConnectionTest on a mocked ClientConnection object to verify
close() will not be called.

@BewareMyPower BewareMyPower self-assigned this Jan 30, 2024
@BewareMyPower BewareMyPower added this to the 3.5.0 milestone Jan 30, 2024
BewareMyPower added a commit to BewareMyPower/pulsar that referenced this pull request Jan 30, 2024
…kBusyException

### Motivation

When a broker restarted, there is a case in
`NamespaceService#findBrokerServiceUrl`:
1. `ownershipCache.getOwnerAsync(bundle)` got an empty data, then
   `searchForCandidateBroker` will be called
2. The broker itself was elected as the candidate broker.
3. Meanwhile, the other broker has acquired the distributed lock of the
   bundle, then `ownershipCache.tryAcquiringOwnership` will fail with

```java
lookupFuture.completeExceptionally(new PulsarServerException(
        "Failed to acquire ownership for namespace bundle " + bundle, exception));
```

See apache/pulsar-client-cpp#390 for the real
world case.

Then in `TopicLookupBase#handleLookupError`, this exception will be
wrapped into a `ServiceNotReady` error to client.

This case happens very frequently in our production environment when a
broker restarted. If there is a `PulsarClient` that has many producers
or consumers, the connection will be closed, which results in many
reconnections, which brings much pressure to the cluster.

### Modifications

In `handleLookupError`, check the `PulsarServerException` and unwrap the
`CompletionException`. If the unwrapped exception is
`MetadataStoreException`, return the `MetadataError` to avoid closing
the connection at client side.

Add `testLookupConnectionNotCloseIfFailedToAcquireOwnershipOfBundle` to
simulate the case and verify the socket won't be closed.
@BewareMyPower BewareMyPower marked this pull request as draft January 31, 2024 07:57
@BewareMyPower
Copy link
Contributor Author

Mark it as drafted for now, I'm trying to mock the ClientConnection so that this change can be protected by tests.

@BewareMyPower BewareMyPower force-pushed the bewaremypower/service-not-ready-retry branch from cc5d8b0 to 58223e7 Compare January 31, 2024 13:39
@BewareMyPower BewareMyPower changed the title Do not close the socket when the broker failed to acquire ownership for namespace bundle Do not close the socket when the broker failed due to MetadataStoreException Jan 31, 2024
…ception

### Motivation

When the broker failed to acquire the ownership of a namespace bundle by
`LockBusyException`. It means there is another broker that has acquired
the metadata store path and didn't release that path. For example:

Broker 1:

```
2024-01-24T23:35:36,626+0000 [metadata-store-10-1] WARN  org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup <role> for topic persistent://<tenant>/<ns>/<topic> with error org.apache.pulsar.broker.PulsarServerException: Failed to acquire ownership for namespace bundle <tenant>/<ns>/0x50000000_0x51000000
   Caused by: java.util.concurrent.CompletionException: org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/<tenant>/<ns>/0x50000000_0x51000000 is already locked
```

Broker 2:

```
2024-01-24T23:35:36,650+0000 [broker-topic-workers-OrderedExecutor-3-0] INFO  org.apache.pulsar.broker.PulsarService - Loaded 1 topics on <tenant>/<ns>/0x50000000_0x51000000 -- time taken: 0.044 seconds
```

After broker 2 released the lock at 23:35:36,650, the lookup request to
broker 1 should tell the client that namespace bundle
0x50000000_0x51000000 is currently being unloaded and in the next retry
the client will connect to the new owner broker.

Here is another typical error:

```
2024-01-24T23:57:57,264+0000 [pulsar-io-4-5] INFO  org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup <role> for topic persistent://<tenant>/<ns>/<topic> with error Namespace bundle <tenant>/<ns>/0x0d000000_0x0e000000 is being unloaded
```

Though after apache/pulsar#21211, the server
error becomes `MetadataError` rather than `ServiceNotReady`.

However, since the `ServerError` is `ServiceNotReady`, the client will
close the connection. If there are many other producers or consumers on
the same connection, they will all reestablish connection to the broker,
which is unnecessary and brings much pressure to broker side.

### Modifications

In `checkServerError`, when the error code is `ServiceNotReady`, check
the error message as well, if it hit the case in `handleLookupError`, do
not close the connection.

Add `ConnectionTest` on a mocked `ClientConnection` object to verify
`close()` will not be called.
@BewareMyPower BewareMyPower force-pushed the bewaremypower/service-not-ready-retry branch from 58223e7 to b6329e8 Compare January 31, 2024 16:47
@BewareMyPower BewareMyPower marked this pull request as ready for review January 31, 2024 17:00
@BewareMyPower BewareMyPower merged commit c7e53ac into apache:main Feb 2, 2024
12 checks passed
@BewareMyPower BewareMyPower deleted the bewaremypower/service-not-ready-retry branch February 2, 2024 11:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants