Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tight loop while trying too manage consistencyTest topic #182

Closed
SamBarker opened this issue Sep 20, 2023 · 12 comments · Fixed by #184
Closed

Tight loop while trying too manage consistencyTest topic #182

SamBarker opened this issue Sep 20, 2023 · 12 comments · Fixed by #184
Labels
bug Something isn't working

Comments

@SamBarker
Copy link
Member

Please use this to only for bug reports. For questions or when you need help, you can use the GitHub Discussions or use the community Slack chat.

Describe the bug
The test suite sometimes gets itself into a state where it logs the following ad nauseam.

2023-09-20 23:06:10 WARN  io.kroxylicious.testing.kafka.common.Utils:214 - Failed to create topic: __org_kroxylicious_testing_consistencyTest due to org.apache.kafka.common.errors.TimeoutException: The AdminClient thread is not accepting new calls.

The following stack trace is also logged regularly and is probably more indicative of the underlying cause.

2023-09-20 23:08:33 WARN  io.kroxylicious.testing.kafka.common.Utils:146 - Unexpected failure describing topic: __org_kroxylicious_testing_consistencyTest due to org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: describeTopics
java.util.concurrent.CompletionException: org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: describeTopics
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:332) ~[?:?]
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:347) ~[?:?]
	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:636) ~[?:?]
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2194) ~[?:?]
	at org.apache.kafka.common.internals.KafkaCompletableFuture.kafkaCompleteExceptionally(KafkaCompletableFuture.java:49) ~[kafka-clients-3.5.1.jar:?]
	at org.apache.kafka.common.internals.KafkaFutureImpl.completeExceptionally(KafkaFutureImpl.java:130) ~[kafka-clients-3.5.1.jar:?]
	at org.apache.kafka.common.KafkaFuture.lambda$allOf$2(KafkaFuture.java:93) ~[kafka-clients-3.5.1.jar:?]
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863) ~[?:?]
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841) ~[?:?]
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2194) ~[?:?]
	at org.apache.kafka.common.internals.KafkaCompletableFuture.kafkaCompleteExceptionally(KafkaCompletableFuture.java:49) ~[kafka-clients-3.5.1.jar:?]
	at org.apache.kafka.common.internals.KafkaFutureImpl.completeExceptionally(KafkaFutureImpl.java:130) ~[kafka-clients-3.5.1.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient.lambda$completeAllExceptionally$1(KafkaAdminClient.java:420) ~[kafka-clients-3.5.1.jar:?]
	at java.util.HashMap$ValueSpliterator.forEachRemaining(HashMap.java:1787) ~[?:?]
	at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:762) ~[?:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient.completeAllExceptionally(KafkaAdminClient.java:420) ~[kafka-clients-3.5.1.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient.completeAllExceptionally(KafkaAdminClient.java:409) ~[kafka-clients-3.5.1.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient.access$3000(KafkaAdminClient.java:298) ~[kafka-clients-3.5.1.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$5.handleFailure(KafkaAdminClient.java:2005) ~[kafka-clients-3.5.1.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$Call.handleTimeoutFailure(KafkaAdminClient.java:851) ~[kafka-clients-3.5.1.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$Call.fail(KafkaAdminClient.java:817) ~[kafka-clients-3.5.1.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$TimeoutProcessor.handleTimeouts(KafkaAdminClient.java:947) ~[kafka-clients-3.5.1.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.timeoutPendingCalls(KafkaAdminClient.java:1026) ~[kafka-clients-3.5.1.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.processRequests(KafkaAdminClient.java:1380) ~[kafka-clients-3.5.1.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.run(KafkaAdminClient.java:1344) ~[kafka-clients-3.5.1.jar:?]
	at java.lang.Thread.run(Thread.java:1623) [?:?]

To Reproduce
Steps to reproduce the behavior:

  1. Use kroxylicious-junit5-extension like this ... /shrug
  2. Run command 'mvn verify'
  3. See error

Expected behavior
A clear and concise description of what you expected to happen.

Logs

Attach or copy and paste relevant logs.

Additional context
Add any other context about the problem here.

@SamBarker SamBarker added the bug Something isn't working label Sep 20, 2023
@SamBarker
Copy link
Member Author

The test run eventually finished with:

[ERROR] Failures: 
[ERROR]   TemplateTest$Versions.afterAll:180 expected: <[latest-kafka-3.4.0, latest-kafka-3.1.2, latest, latest-kafka-3.2.3]> but was: <[]>
[ERROR] Errors: 
[ERROR]   StaticFieldSubclassExtensionTest » ConditionTimeout Condition with io.kroxylicious.testing.kafka.common.Utils was not fulfilled within 2 minutes.
[ERROR]   TemplateTest$Versions.testVersions(TestcontainersKafkaCluster)[1] » ParameterResolution Failed to resolve parameter [io.kroxylicious.testing.kafka.testcontainers.TestcontainersKafkaCluster cluster] in method [public void io.kroxylicious.testing.kafka.junit5ext.TemplateTest$Versions.testVersions(io.kroxylicious.testing.kafka.testcontainers.TestcontainersKafkaCluster)]: Condition with io.kroxylicious.testing.kafka.common.Utils was not fulfilled within 2 minutes.
[ERROR]   TemplateTest$Versions.testVersions(TestcontainersKafkaCluster)[2] » ParameterResolution Failed to resolve parameter [io.kroxylicious.testing.kafka.testcontainers.TestcontainersKafkaCluster cluster] in method [public void io.kroxylicious.testing.kafka.junit5ext.TemplateTest$Versions.testVersions(io.kroxylicious.testing.kafka.testcontainers.TestcontainersKafkaCluster)]: Condition with io.kroxylicious.testing.kafka.common.Utils was not fulfilled within 2 minutes.
[ERROR]   TemplateTest$Versions.testVersions(TestcontainersKafkaCluster)[3] » ParameterResolution Failed to resolve parameter [io.kroxylicious.testing.kafka.testcontainers.TestcontainersKafkaCluster cluster] in method [public void io.kroxylicious.testing.kafka.junit5ext.TemplateTest$Versions.testVersions(io.kroxylicious.testing.kafka.testcontainers.TestcontainersKafkaCluster)]: Condition with io.kroxylicious.testing.kafka.common.Utils was not fulfilled within 2 minutes.
[ERROR]   TemplateTest$Versions.testVersions(TestcontainersKafkaCluster)[4] » ParameterResolution Failed to resolve parameter [io.kroxylicious.testing.kafka.testcontainers.TestcontainersKafkaCluster cluster] in method [public void io.kroxylicious.testing.kafka.junit5ext.TemplateTest$Versions.testVersions(io.kroxylicious.testing.kafka.testcontainers.TestcontainersKafkaCluster)]: Condition with io.kroxylicious.testing.kafka.common.Utils was not fulfilled within 2 minutes.
[INFO] 
[ERROR] Tests run: 43, Failures: 1, Errors: 5, Skipped: 0

Which suggests the test containers integration was failing but we aren't handling that effectively

@SamBarker
Copy link
Member Author

I also see

2023-09-20 23:12:11 ERROR tc.quay.io/ogunalp/kafka-native:latest-kafka-3.2.3:552 - Could not start container
java.lang.RuntimeException: java.io.IOException: Broken pipe
	at com.github.dockerjava.zerodep.ApacheDockerHttpClientImpl.execute(ApacheDockerHttpClientImpl.java:195) ~[docker-java-transport-zerodep-3.3.3.jar:?]
	at com.github.dockerjava.zerodep.ZerodepDockerHttpClient.execute(ZerodepDockerHttpClient.java:8) ~[docker-java-transport-zerodep-3.3.3.jar:?]
	at org.testcontainers.dockerclient.HeadersAddingDockerHttpClient.execute(HeadersAddingDockerHttpClient.java:23) ~[testcontainers-1.19.0.jar:1.19.0]
	at org.testcontainers.shaded.com.github.dockerjava.core.DefaultInvocationBuilder.execute(DefaultInvocationBuilder.java:228) ~[testcontainers-1.19.0.jar:1.19.0]
	at org.testcontainers.shaded.com.github.dockerjava.core.DefaultInvocationBuilder.post(DefaultInvocationBuilder.java:124) ~[testcontainers-1.19.0.jar:1.19.0]
	at org.testcontainers.shaded.com.github.dockerjava.core.exec.CreateNetworkCmdExec.execute(CreateNetworkCmdExec.java:27) ~[testcontainers-1.19.0.jar:1.19.0]
	at org.testcontainers.shaded.com.github.dockerjava.core.exec.CreateNetworkCmdExec.execute(CreateNetworkCmdExec.java:12) ~[testcontainers-1.19.0.jar:1.19.0]
	at org.testcontainers.shaded.com.github.dockerjava.core.exec.AbstrSyncDockerCmdExec.exec(AbstrSyncDockerCmdExec.java:21) ~[testcontainers-1.19.0.jar:1.19.0]
	at org.testcontainers.shaded.com.github.dockerjava.core.command.AbstrDockerCmd.exec(AbstrDockerCmd.java:33) ~[testcontainers-1.19.0.jar:1.19.0]
	at org.testcontainers.containers.Network$NetworkImpl.create(Network.java:100) ~[testcontainers-1.19.0.jar:1.19.0]
	at org.testcontainers.containers.Network$NetworkImpl.getId(Network.java:64) ~[testcontainers-1.19.0.jar:1.19.0]
	at org.testcontainers.containers.GenericContainer.applyConfiguration(GenericContainer.java:891) ~[testcontainers-1.19.0.jar:1.19.0]
	at org.testcontainers.containers.GenericContainer.tryStart(GenericContainer.java:390) ~[testcontainers-1.19.0.jar:1.19.0]
	at org.testcontainers.containers.GenericContainer.lambda$doStart$0(GenericContainer.java:356) ~[testcontainers-1.19.0.jar:1.19.0]
	at org.rnorth.ducttape.unreliables.Unreliables.retryUntilSuccess(Unreliables.java:81) ~[duct-tape-1.0.8.jar:?]
	at org.testcontainers.containers.GenericContainer.doStart(GenericContainer.java:346) ~[testcontainers-1.19.0.jar:1.19.0]
	at org.testcontainers.containers.GenericContainer.start(GenericContainer.java:334) ~[testcontainers-1.19.0.jar:1.19.0]
	at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:787) ~[?:?]
	at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:482) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.lang.Thread.run(Thread.java:1623) [?:?]
Caused by: java.io.IOException: Broken pipe

Which lends further wait to problems with the cluster not being properly propagated.

@k-wall
Copy link
Contributor

k-wall commented Sep 21, 2023

I'll take a look. Anything interesting in the container logs (./junit5-extension/target/container-logs)?

@k-wall
Copy link
Contributor

k-wall commented Sep 21, 2023

Can't reproduce it yet. Could this have been related to a slow pull of the new image in your environment?

@SamBarker
Copy link
Member Author

No I didn't get the container logs.

It not deterministic for me but I have seen it a few times.

@k-wall
Copy link
Contributor

k-wall commented Sep 21, 2023

I also see

2023-09-20 23:12:11 ERROR tc.quay.io/ogunalp/kafka-native:latest-kafka-3.2.3:552 - Could not start container
java.lang.RuntimeException: java.io.IOException: Broken pipe

Are you running under podman? Are you applying?

https://github.com/kroxylicious/kroxylicious-junit5-extension/blob/main/DEV_GUIDE.md#podmantestcontainers-incompatibility

@k-wall
Copy link
Contributor

k-wall commented Sep 21, 2023

I finally got a failure. If took several hours to appear.

2023-09-21 13:53:30 WARN ForkJoinPool.commonPool-worker-3 io.kroxylicious.testing.kafka.common.Utils:214 - Failed to create topic: __org_kroxylicious_testing_consistencyTest due to org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: createTopics

TimeoutException is a RetriableException so that might explain the loop.

I wonder if we are leaking an admin client, or the loop within createTopics is continuing even though the client is closed?

@k-wall
Copy link
Contributor

k-wall commented Sep 21, 2023

Yeah, the admin client produces an exception with this message after the client is closed. It seems bizarre for it to use a RetriableException. The client isn't going to come back to life. This seems like a very odd choice for the Kafka Client (@showuon WDYT?).
Short term, I think we special case based on this exception message.

I think the looping is probably a secondary cause. I guess we are still looking for a root cause your initial failure.

from: org.apache.kafka.clients.admin.KafkaAdminClient.AdminClientRunnable#call

        void call(Call call, long now) {
            if (hardShutdownTimeMs.get() != INVALID_SHUTDOWN_TIME) {
                log.debug("The AdminClient is not accepting new calls. Timing out {}.", call);
                call.handleTimeoutFailure(time.milliseconds(),
                    new TimeoutException("The AdminClient thread is not accepting new calls."));
            } else {
                enqueue(call, now);
            }
        }

@k-wall
Copy link
Contributor

k-wall commented Sep 21, 2023

@SamBarker I only saw one test failure that looks like yours today, despite running in a loop for most of the day. I did keep seeing #183, but haven't investigated that yet. I curious what you see with my PR. I suspect that in your case the topic creation loop will be a secondary issue, and there will be a root cause failure that is still to be understood/dealt with.

@SamBarker
Copy link
Member Author

SamBarker commented Sep 21, 2023

@SamBarker I only saw one test failure that looks like yours today, despite running in a loop for most of the day. I did keep seeing #183, but haven't investigated that yet. I curious what you see with my PR. I suspect that in your case the topic creation loop will be a secondary issue, and there will be a root cause failure that is still to be understood/dealt with.

The topic creation loop is definitely a symptom yes.

Are you running under podman? Are you applying?

Yes, podman and apparently no it had been reverted :(

So that is probably the root cause.

@k-wall
Copy link
Contributor

k-wall commented Sep 22, 2023

Ok, so I think the createTopic loop still deserves to be fixed.

@showuon
Copy link
Member

showuon commented Sep 26, 2023

Nice find @k-wall ! Yes, it's definitely a bug in adminClient. I've filed KAFKA-15507 and see if anyone is interested in picking it up. Otherwise, I'll fix that later when available.

k-wall added a commit to k-wall/kroxylicious-junit5-extension that referenced this issue Sep 26, 2023
… admin client gets closed

workaround for KAFKA-15507
k-wall added a commit to k-wall/kroxylicious-junit5-extension that referenced this issue Sep 26, 2023
… admin client gets closed

workaround for KAFKA-15507
k-wall added a commit that referenced this issue Sep 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
3 participants