Tight loop while trying to manage consistencyTest topic #182

SamBarker · 2023-09-20T23:10:40Z


Describe the bug
The test suite sometimes gets itself into a state where it logs the following ad nauseam.

2023-09-20 23:06:10 WARN  io.kroxylicious.testing.kafka.common.Utils:214 - Failed to create topic: __org_kroxylicious_testing_consistencyTest due to org.apache.kafka.common.errors.TimeoutException: The AdminClient thread is not accepting new calls.

The following stack trace is also logged regularly and is probably more indicative of the underlying cause.

2023-09-20 23:08:33 WARN  io.kroxylicious.testing.kafka.common.Utils:146 - Unexpected failure describing topic: __org_kroxylicious_testing_consistencyTest due to org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: describeTopics
java.util.concurrent.CompletionException: org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: describeTopics
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:332) ~[?:?]
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:347) ~[?:?]
	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:636) ~[?:?]
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2194) ~[?:?]
	at org.apache.kafka.common.internals.KafkaCompletableFuture.kafkaCompleteExceptionally(KafkaCompletableFuture.java:49) ~[kafka-clients-3.5.1.jar:?]
	at org.apache.kafka.common.internals.KafkaFutureImpl.completeExceptionally(KafkaFutureImpl.java:130) ~[kafka-clients-3.5.1.jar:?]
	at org.apache.kafka.common.KafkaFuture.lambda$allOf$2(KafkaFuture.java:93) ~[kafka-clients-3.5.1.jar:?]
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863) ~[?:?]
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841) ~[?:?]
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2194) ~[?:?]
	at org.apache.kafka.common.internals.KafkaCompletableFuture.kafkaCompleteExceptionally(KafkaCompletableFuture.java:49) ~[kafka-clients-3.5.1.jar:?]
	at org.apache.kafka.common.internals.KafkaFutureImpl.completeExceptionally(KafkaFutureImpl.java:130) ~[kafka-clients-3.5.1.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient.lambda$completeAllExceptionally$1(KafkaAdminClient.java:420) ~[kafka-clients-3.5.1.jar:?]
	at java.util.HashMap$ValueSpliterator.forEachRemaining(HashMap.java:1787) ~[?:?]
	at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:762) ~[?:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient.completeAllExceptionally(KafkaAdminClient.java:420) ~[kafka-clients-3.5.1.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient.completeAllExceptionally(KafkaAdminClient.java:409) ~[kafka-clients-3.5.1.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient.access$3000(KafkaAdminClient.java:298) ~[kafka-clients-3.5.1.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$5.handleFailure(KafkaAdminClient.java:2005) ~[kafka-clients-3.5.1.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$Call.handleTimeoutFailure(KafkaAdminClient.java:851) ~[kafka-clients-3.5.1.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$Call.fail(KafkaAdminClient.java:817) ~[kafka-clients-3.5.1.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$TimeoutProcessor.handleTimeouts(KafkaAdminClient.java:947) ~[kafka-clients-3.5.1.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.timeoutPendingCalls(KafkaAdminClient.java:1026) ~[kafka-clients-3.5.1.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.processRequests(KafkaAdminClient.java:1380) ~[kafka-clients-3.5.1.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.run(KafkaAdminClient.java:1344) ~[kafka-clients-3.5.1.jar:?]
	at java.lang.Thread.run(Thread.java:1623) [?:?]

To Reproduce
Steps to reproduce the behavior:

Use kroxylicious-junit5-extension like this ... /shrug
Run command 'mvn verify'
See error


SamBarker · 2023-09-20T23:20:40Z

The test run eventually finished with:

[ERROR] Failures: 
[ERROR]   TemplateTest$Versions.afterAll:180 expected: <[latest-kafka-3.4.0, latest-kafka-3.1.2, latest, latest-kafka-3.2.3]> but was: <[]>
[ERROR] Errors: 
[ERROR]   StaticFieldSubclassExtensionTest » ConditionTimeout Condition with io.kroxylicious.testing.kafka.common.Utils was not fulfilled within 2 minutes.
[ERROR]   TemplateTest$Versions.testVersions(TestcontainersKafkaCluster)[1] » ParameterResolution Failed to resolve parameter [io.kroxylicious.testing.kafka.testcontainers.TestcontainersKafkaCluster cluster] in method [public void io.kroxylicious.testing.kafka.junit5ext.TemplateTest$Versions.testVersions(io.kroxylicious.testing.kafka.testcontainers.TestcontainersKafkaCluster)]: Condition with io.kroxylicious.testing.kafka.common.Utils was not fulfilled within 2 minutes.
[ERROR]   TemplateTest$Versions.testVersions(TestcontainersKafkaCluster)[2] » ParameterResolution Failed to resolve parameter [io.kroxylicious.testing.kafka.testcontainers.TestcontainersKafkaCluster cluster] in method [public void io.kroxylicious.testing.kafka.junit5ext.TemplateTest$Versions.testVersions(io.kroxylicious.testing.kafka.testcontainers.TestcontainersKafkaCluster)]: Condition with io.kroxylicious.testing.kafka.common.Utils was not fulfilled within 2 minutes.
[ERROR]   TemplateTest$Versions.testVersions(TestcontainersKafkaCluster)[3] » ParameterResolution Failed to resolve parameter [io.kroxylicious.testing.kafka.testcontainers.TestcontainersKafkaCluster cluster] in method [public void io.kroxylicious.testing.kafka.junit5ext.TemplateTest$Versions.testVersions(io.kroxylicious.testing.kafka.testcontainers.TestcontainersKafkaCluster)]: Condition with io.kroxylicious.testing.kafka.common.Utils was not fulfilled within 2 minutes.
[ERROR]   TemplateTest$Versions.testVersions(TestcontainersKafkaCluster)[4] » ParameterResolution Failed to resolve parameter [io.kroxylicious.testing.kafka.testcontainers.TestcontainersKafkaCluster cluster] in method [public void io.kroxylicious.testing.kafka.junit5ext.TemplateTest$Versions.testVersions(io.kroxylicious.testing.kafka.testcontainers.TestcontainersKafkaCluster)]: Condition with io.kroxylicious.testing.kafka.common.Utils was not fulfilled within 2 minutes.
[INFO] 
[ERROR] Tests run: 43, Failures: 1, Errors: 5, Skipped: 0

This suggests the Testcontainers integration was failing, but we aren't handling that effectively.

SamBarker · 2023-09-21T01:20:50Z

I also see

2023-09-20 23:12:11 ERROR tc.quay.io/ogunalp/kafka-native:latest-kafka-3.2.3:552 - Could not start container
java.lang.RuntimeException: java.io.IOException: Broken pipe
	at com.github.dockerjava.zerodep.ApacheDockerHttpClientImpl.execute(ApacheDockerHttpClientImpl.java:195) ~[docker-java-transport-zerodep-3.3.3.jar:?]
	at com.github.dockerjava.zerodep.ZerodepDockerHttpClient.execute(ZerodepDockerHttpClient.java:8) ~[docker-java-transport-zerodep-3.3.3.jar:?]
	at org.testcontainers.dockerclient.HeadersAddingDockerHttpClient.execute(HeadersAddingDockerHttpClient.java:23) ~[testcontainers-1.19.0.jar:1.19.0]
	at org.testcontainers.shaded.com.github.dockerjava.core.DefaultInvocationBuilder.execute(DefaultInvocationBuilder.java:228) ~[testcontainers-1.19.0.jar:1.19.0]
	at org.testcontainers.shaded.com.github.dockerjava.core.DefaultInvocationBuilder.post(DefaultInvocationBuilder.java:124) ~[testcontainers-1.19.0.jar:1.19.0]
	at org.testcontainers.shaded.com.github.dockerjava.core.exec.CreateNetworkCmdExec.execute(CreateNetworkCmdExec.java:27) ~[testcontainers-1.19.0.jar:1.19.0]
	at org.testcontainers.shaded.com.github.dockerjava.core.exec.CreateNetworkCmdExec.execute(CreateNetworkCmdExec.java:12) ~[testcontainers-1.19.0.jar:1.19.0]
	at org.testcontainers.shaded.com.github.dockerjava.core.exec.AbstrSyncDockerCmdExec.exec(AbstrSyncDockerCmdExec.java:21) ~[testcontainers-1.19.0.jar:1.19.0]
	at org.testcontainers.shaded.com.github.dockerjava.core.command.AbstrDockerCmd.exec(AbstrDockerCmd.java:33) ~[testcontainers-1.19.0.jar:1.19.0]
	at org.testcontainers.containers.Network$NetworkImpl.create(Network.java:100) ~[testcontainers-1.19.0.jar:1.19.0]
	at org.testcontainers.containers.Network$NetworkImpl.getId(Network.java:64) ~[testcontainers-1.19.0.jar:1.19.0]
	at org.testcontainers.containers.GenericContainer.applyConfiguration(GenericContainer.java:891) ~[testcontainers-1.19.0.jar:1.19.0]
	at org.testcontainers.containers.GenericContainer.tryStart(GenericContainer.java:390) ~[testcontainers-1.19.0.jar:1.19.0]
	at org.testcontainers.containers.GenericContainer.lambda$doStart$0(GenericContainer.java:356) ~[testcontainers-1.19.0.jar:1.19.0]
	at org.rnorth.ducttape.unreliables.Unreliables.retryUntilSuccess(Unreliables.java:81) ~[duct-tape-1.0.8.jar:?]
	at org.testcontainers.containers.GenericContainer.doStart(GenericContainer.java:346) ~[testcontainers-1.19.0.jar:1.19.0]
	at org.testcontainers.containers.GenericContainer.start(GenericContainer.java:334) ~[testcontainers-1.19.0.jar:1.19.0]
	at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:787) ~[?:?]
	at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:482) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.lang.Thread.run(Thread.java:1623) [?:?]
Caused by: java.io.IOException: Broken pipe

Which lends further weight to problems with the cluster not being properly propagated.

k-wall · 2023-09-21T07:14:33Z

I'll take a look. Anything interesting in the container logs (./junit5-extension/target/container-logs)?

k-wall · 2023-09-21T07:20:14Z

Can't reproduce it yet. Could this have been related to a slow pull of the new image in your environment?

SamBarker · 2023-09-21T07:35:43Z

No I didn't get the container logs.

It's not deterministic for me, but I have seen it a few times.

k-wall · 2023-09-21T13:00:10Z

I also see

2023-09-20 23:12:11 ERROR tc.quay.io/ogunalp/kafka-native:latest-kafka-3.2.3:552 - Could not start container
java.lang.RuntimeException: java.io.IOException: Broken pipe

Are you running under podman? Are you applying?

https://github.com/kroxylicious/kroxylicious-junit5-extension/blob/main/DEV_GUIDE.md#podmantestcontainers-incompatibility

k-wall · 2023-09-21T13:37:26Z

I finally got a failure. It took several hours to appear.

2023-09-21 13:53:30 WARN ForkJoinPool.commonPool-worker-3 io.kroxylicious.testing.kafka.common.Utils:214 - Failed to create topic: __org_kroxylicious_testing_consistencyTest due to org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: createTopics

TimeoutException is a RetriableException so that might explain the loop.

I wonder if we are leaking an admin client, or the loop within createTopics is continuing even though the client is closed?
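
The suspicion above can be illustrated with a minimal, dependency-free sketch. It is not the actual Utils code: `RetryLoopSketch`, `RetriableError` and `runWithRetries` are hypothetical names, and `RetriableError` is only a local stand-in for Kafka's RetriableException. It shows why a loop that retries on every retriable error copes with a transient failure, but only stops on a permanent one (such as a closed client) if something else caps it.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class RetryLoopSketch {

    /** Hypothetical stand-in for org.apache.kafka.common.errors.RetriableException. */
    static class RetriableError extends RuntimeException {
        RetriableError(String message) {
            super(message);
        }
    }

    /**
     * Runs the action, retrying while it throws RetriableError.
     * Returns the number of attempts used, or throws once maxAttempts is exhausted.
     */
    static int runWithRetries(Runnable action, int maxAttempts) {
        RetriableError last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                action.run();
                return attempt;
            } catch (RetriableError e) {
                last = e; // classified as retriable, so go around again
            }
        }
        throw new IllegalStateException("gave up after " + maxAttempts + " attempts", last);
    }

    public static void main(String[] args) {
        // Transient failure: fails twice, then succeeds on the third attempt.
        AtomicInteger calls = new AtomicInteger();
        int used = runWithRetries(() -> {
            if (calls.incrementAndGet() < 3) {
                throw new RetriableError("Timed out waiting for a node assignment.");
            }
        }, 10);
        System.out.println("transient failure succeeded after " + used + " attempts");

        // Permanent failure: a closed client rejects every call, so only the cap ends the loop.
        try {
            runWithRetries(() -> {
                throw new RetriableError("The AdminClient thread is not accepting new calls.");
            }, 10);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Without the `maxAttempts` cap (or an overall deadline), the second case would spin indefinitely, which matches the repeated WARN lines in the report.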

k-wall · 2023-09-21T13:52:55Z

Yeah, the admin client produces an exception with this message after the client is closed. It seems bizarre for it to use a RetriableException. The client isn't going to come back to life. This seems like a very odd choice for the Kafka Client (@showuon WDYT?).
Short term, I think we special case based on this exception message.

I think the looping is probably a secondary cause. I guess we are still looking for a root cause of your initial failure.

from: org.apache.kafka.clients.admin.KafkaAdminClient.AdminClientRunnable#call

        void call(Call call, long now) {
            if (hardShutdownTimeMs.get() != INVALID_SHUTDOWN_TIME) {
                log.debug("The AdminClient is not accepting new calls. Timing out {}.", call);
                call.handleTimeoutFailure(time.milliseconds(),
                    new TimeoutException("The AdminClient thread is not accepting new calls."));
            } else {
                enqueue(call, now);
            }
        }
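
A rough sketch of what special-casing on the exception message could look like, assuming the check runs inside the topic-creation retry loop. It is not necessarily what the eventual fix does. `ClosedClientCheck`, `TimeoutError` and `isAdminClientClosed` are hypothetical names, and `TimeoutError` is a local stand-in for org.apache.kafka.common.errors.TimeoutException so that the sketch needs no Kafka dependency. The message string is copied from the snippet above.

```java
import java.util.concurrent.CompletionException;

final class ClosedClientCheck {

    /** Message the admin client uses when it rejects calls after being closed. */
    static final String CLOSED_MESSAGE = "The AdminClient thread is not accepting new calls.";

    /** Hypothetical stand-in for org.apache.kafka.common.errors.TimeoutException. */
    static class TimeoutError extends RuntimeException {
        TimeoutError(String message) {
            super(message);
        }
    }

    /**
     * Returns true if the failure (or any of its causes) is a timeout carrying the
     * "not accepting new calls" message, meaning a retry cannot succeed.
     */
    static boolean isAdminClientClosed(Throwable failure) {
        // Failures usually arrive wrapped (e.g. in CompletionException), so walk the cause chain.
        for (Throwable cur = failure; cur != null; cur = cur.getCause()) {
            if (cur instanceof TimeoutError && CLOSED_MESSAGE.equals(cur.getMessage())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Throwable closed = new CompletionException(new TimeoutError(CLOSED_MESSAGE));
        Throwable transientTimeout = new CompletionException(
                new TimeoutError("Timed out waiting for a node assignment. Call: createTopics"));

        System.out.println("closed client, stop retrying: " + isAdminClientClosed(closed));
        System.out.println("transient timeout, stop retrying: " + isAdminClientClosed(transientTimeout));
    }
}
```

Matching on message text is brittle, since the wording could change between client versions, so this only makes sense as a short-term workaround until the client reports the closed state with a non-retriable exception.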

k-wall · 2023-09-21T21:49:40Z

@SamBarker I only saw one test failure that looks like yours today, despite running in a loop for most of the day. I did keep seeing #183, but haven't investigated that yet. I'm curious what you see with my PR. I suspect that in your case the topic creation loop will be a secondary issue, and there will be a root cause failure that is still to be understood/dealt with.

SamBarker · 2023-09-21T21:57:58Z

@SamBarker I only saw one test failure that looks like yours today, despite running in a loop for most of the day. I did keep seeing #183, but haven't investigated that yet. I'm curious what you see with my PR. I suspect that in your case the topic creation loop will be a secondary issue, and there will be a root cause failure that is still to be understood/dealt with.

The topic creation loop is definitely a symptom, yes.

Are you running under podman? Are you applying?

Yes, podman, and apparently no, it had been reverted :(

So that is probably the root cause.

k-wall · 2023-09-22T08:48:34Z

Ok, so I think the createTopic loop still deserves to be fixed.

showuon · 2023-09-26T08:08:16Z

Nice find @k-wall! Yes, it's definitely a bug in adminClient. I've filed KAFKA-15507 and will see if anyone is interested in picking it up. Otherwise, I'll fix that later when available.


SamBarker added the bug Something isn't working label Sep 20, 2023

k-wall mentioned this issue Sep 21, 2023

Sporadic test fail from kafkaClusterKraftModeWithMultipleControllers #183

Open

k-wall linked a pull request Sep 21, 2023 that will close this issue

fix: stop Utils#createTopic trying to create topic if admin client gets closed #184

Merged

k-wall added a commit to k-wall/kroxylicious-junit5-extension that referenced this issue Sep 26, 2023

fix kroxylicious#182: stop trying to create consistency test topic if…

7bfa219

… admin client gets closed workaround for KAFKA-15507

k-wall added a commit to k-wall/kroxylicious-junit5-extension that referenced this issue Sep 26, 2023

fix kroxylicious#182: stop trying to create consistency test topic if…

ba8beca

… admin client gets closed workaround for KAFKA-15507

k-wall closed this as completed in #184 Sep 26, 2023

k-wall added a commit that referenced this issue Sep 26, 2023

fix #182: stop trying to create consistency test topic if admin clien…

4686a48

…t gets closed (#184) workaround for KAFKA-15507
