MINOR Capture heap dump after OOM on CI #19031

Open
wants to merge 40 commits into
base: trunk

mumrah
Member

@mumrah mumrah commented Feb 25, 2025

We have seen a few OOM errors on trunk lately. This patch adds the ability to capture a heap dump when this happens so we can better determine if the error was due to something in Gradle or within our own tests (like a memory leak).

@github-actions github-actions bot added build Gradle build or GitHub Actions small Small PRs labels Feb 25, 2025
@mumrah
Member Author

mumrah commented Feb 26, 2025

UserQuotaTest looks like the culprit.

From https://github.com/apache/kafka/actions/runs/13523225366/job/37791507296

Gradle Test Run :core:test > Gradle Test Executor 38 > UserQuotaTest > testQuotaOverrideDelete(String, String) > testQuotaOverrideDelete(String, String).quorum=kraft.groupProtocol=consumer STARTED

> Task :storage:compileTestJava
Unexpected exception thrown.
org.gradle.internal.remote.internal.MessageIOException: Could not read message from '/127.0.0.1:50402'.

	at org.gradle.internal.remote.internal.inet.SocketConnection.receive(SocketConnection.java:99)
	at org.gradle.internal.remote.internal.hub.MessageHub$ConnectionReceive.run(MessageHub.java:270)
	at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:64)
	at org.gradle.internal.concurrent.AbstractManagedExecutor$1.run(AbstractManagedExecutor.java:48)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded

and from https://github.com/apache/kafka/actions/runs/13534291834/job/37823471305

Gradle Test Run :core:test > Gradle Test Executor 37 > UserQuotaTest > testQuotaOverrideDelete(String, String) > testQuotaOverrideDelete(String, String).quorum=kraft.groupProtocol=consumer STARTED

> Task :storage:checkstyleMain
> Task :shell:checkstyleMain
Unexpected exception thrown.

org.gradle.internal.remote.internal.MessageIOException: Could not read message from '/127.0.0.1:38838'.
	at org.gradle.internal.remote.internal.inet.SocketConnection.receive(SocketConnection.java:99)
	at org.gradle.internal.remote.internal.hub.MessageHub$ConnectionReceive.run(MessageHub.java:270)
	at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:64)
	at org.gradle.internal.concurrent.AbstractManagedExecutor$1.run(AbstractManagedExecutor.java:48)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded

It appears that the Gradle worker is trying to send results to the main process, which causes a long GC pause, which in turn triggers this "GC overhead limit exceeded" error.
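
To compare the two failures above with later runs, it can help to tally which kind of OutOfMemoryError a CI log contains. The following is a minimal sketch and not part of this patch: the function name `oom_kinds` is hypothetical, and it only assumes that the log has lines of the form `java.lang.OutOfMemoryError: <message>`.

```shell
#!/usr/bin/env bash
# Tally OutOfMemoryError messages read from stdin, most frequent first.
# Hypothetical helper for illustration; not part of the Kafka build.
oom_kinds() {
  grep -o 'java\.lang\.OutOfMemoryError: .*' \
    | sed 's/^java\.lang\.OutOfMemoryError: //' \
    | sort | uniq -c | sort -rn
}

# Sample input copied from the logs quoted in this conversation.
oom_kinds <<'EOF'
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
2025-03-03T21:58:43.8434407Z java.lang.OutOfMemoryError: Java heap space
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
EOF
```

Each output line is a count followed by the error message, so a run dominated by "GC overhead limit exceeded" is easy to tell apart from one that ran out of heap space outright.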

build.gradle Outdated
@@ -54,7 +54,7 @@ ext {
buildVersionFileName = "kafka-version.properties"

defaultMaxHeapSize = "2g"
- defaultJvmArgs = ["-Xss4m", "-XX:+UseParallelGC"]
+ defaultJvmArgs = ["-Xss4m", "-XX:+UseParallelGC", "-XX:-UseGCOverheadLimit"]
Member Author

@ijuma WDYT about disabling this feature? From what I can tell, this will prevent a long GC pause from triggering an OOM. Instead, the build would likely just time out (which it's doing anyway with the OOM happening in the Gradle worker).

Member

As you said, the build is unlikely to succeed in either case. The GC overhead thing at least gives a hint that there is a memory leak or the heap is too small. Isn't that better than a timeout with no information?

@mumrah
Member Author

mumrah commented Feb 26, 2025

Seems to have reproduced here: https://github.com/apache/kafka/actions/runs/13550598471/job/37873138268?pr=19031

No activity for a while (about 17 minutes) after the first task below:

Wed, 26 Feb 2025 18:35:41 GMT > Task :streams:test-utils:copyDependantLibs
Wed, 26 Feb 2025 18:52:51 GMT > Task :streams:test-utils:jar
Wed, 26 Feb 2025 18:55:06 GMT > Task :connect:runtime:compileJava

This suggests that -XX:-UseGCOverheadLimit is working as expected. However, it also suggests that there is a real memory leak or a similar problem. This run used a larger heap of 3 GB.
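
The stall is easier to spot by computing the gap between consecutive log timestamps. Below is a rough sketch, assuming log lines shaped like the ones quoted above (the time is the fifth whitespace-separated field) and assuming all lines fall on the same calendar day. The function name `log_gaps` is hypothetical.

```shell
#!/usr/bin/env bash
# Print the gap, in seconds, between consecutive log lines whose fifth
# whitespace-separated field is an HH:MM:SS timestamp.
# Assumption: all lines are from the same calendar day.
log_gaps() {
  awk '{
    split($5, t, ":")
    now = t[1] * 3600 + t[2] * 60 + t[3]
    if (NR > 1) print now - prev
    prev = now
  }'
}

log_gaps <<'EOF'
Wed, 26 Feb 2025 18:35:41 GMT > Task :streams:test-utils:copyDependantLibs
Wed, 26 Feb 2025 18:52:51 GMT > Task :streams:test-utils:jar
Wed, 26 Feb 2025 18:55:06 GMT > Task :connect:runtime:compileJava
EOF
# prints 1030, then 135
```

The first gap (1030 seconds, a little over 17 minutes) is the quiet period described above; the second is about two minutes.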

@github-actions github-actions bot added the kraft label Feb 27, 2025
@github-actions github-actions bot added the core Kafka Broker label Feb 28, 2025
@mumrah
Member Author

mumrah commented Mar 3, 2025

I've not been able to reproduce the Gradle OOM that we're seeing on trunk; however, I saw a different OOM over on my fork.

https://github.com/mumrah/kafka/actions/runs/13639578283/job/38126853058

2025-03-03T21:58:41.1436187Z Gradle Test Run :connect:runtime:test > Gradle Test Executor 71 > ConnectWorkerIntegrationTest > testPauseStopResume() STARTED
2025-03-03T21:58:43.8433931Z 
2025-03-03T21:58:43.8434407Z java.lang.OutOfMemoryError: Java heap space
2025-03-03T21:58:43.8434975Z Dumping heap to /home/runner/work/kafka/kafka/heap-dumps/java_pid471589.hprof ...
2025-03-03T21:58:46.0433855Z 
2025-03-03T21:58:46.0434629Z > Task :connect:runtime:test
2025-03-03T21:58:46.0435151Z 

(Heap dump was uploaded here https://github.com/mumrah/kafka/actions/runs/13639578283/artifacts/2685005476)

This at least shows that the OOM arguments and heap dump archiving are working.
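
As an illustration of the check that could precede archiving, here is a minimal sketch that lists any `.hprof` files in the dump directory. It is not taken from this patch: the function name `list_heap_dumps` is hypothetical, and the `heap-dumps` directory name simply mirrors the path in the log above.

```shell
#!/usr/bin/env bash
# List heap dumps (*.hprof) under a directory.
# Returns success only if at least one dump exists.
# Hypothetical helper for illustration; not part of the Kafka build.
list_heap_dumps() {
  local dir="$1" found
  found=$(find "$dir" -type f -name '*.hprof' 2>/dev/null)
  [ -n "$found" ] && printf '%s\n' "$found"
}

if list_heap_dumps heap-dumps; then
  echo "heap dump(s) found; archive the directory as a build artifact"
else
  echo "no heap dumps found"
fi
```

A check like this keeps the archive step quiet on the common path where no OOM occurred.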

@mumrah mumrah changed the title from "WIP Investigate OOM" to "MINOR Capture heap dump after OOM on CI" Mar 3, 2025
@mumrah mumrah requested review from ijuma and chia7712 March 3, 2025 23:07
mkdir -p heap-dumps
HEAP_DUMP_DIR=$(readlink -f heap-dumps)
timeout ${TIMEOUT_MINUTES}m ./gradlew --continue --no-scan \
-Dorg.gradle.jvmargs="-Xmx4g -Xss4m -XX:+UseParallelGC -XX:+UseGCOverheadLimit -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=$HEAP_DUMP_DIR" \
Member Author

We need to pass the heap dump path directly to Gradle as well as to JUnit (inside build.gradle). That's why we have this apparent duplication.
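
One way to keep the two copies from drifting apart is to compose the OOM-related flags in a single place. The sketch below covers only the string composition on the shell side; the build.gradle side is not shown in this excerpt, so it is left out. The function name `oom_jvm_args` is hypothetical, and the flags are the ones used in the command above.

```shell
#!/usr/bin/env bash
# Compose the OOM-related JVM flags for a given heap dump directory.
# Hypothetical helper for illustration; not part of the Kafka build.
oom_jvm_args() {
  local dump_dir="$1"
  printf '%s' "-XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=$dump_dir"
}

# Assumed location; the real workflow resolves it with readlink -f.
HEAP_DUMP_DIR="$PWD/heap-dumps"
echo "-Dorg.gradle.jvmargs=-Xmx4g -Xss4m -XX:+UseParallelGC $(oom_jvm_args "$HEAP_DUMP_DIR")"
```

Note that a dump directory containing spaces would split the argument string, so this sketch assumes a path without whitespace.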

Labels
build Gradle build or GitHub Actions core Kafka Broker kraft small Small PRs

2 participants