MINOR Capture heap dump after OOM on CI #19031
base: trunk
Conversation
It looks like the UserQuotaTest is the likely culprit. From https://github.com/apache/kafka/actions/runs/13523225366/job/37791507296
and from https://github.com/apache/kafka/actions/runs/13534291834/job/37823471305
It appears that the Gradle worker is trying to send results to the main process, which causes a long GC pause that triggers this "GC overhead limit exceeded" error.
build.gradle
Outdated

```diff
@@ -54,7 +54,7 @@ ext {
   buildVersionFileName = "kafka-version.properties"

   defaultMaxHeapSize = "2g"
-  defaultJvmArgs = ["-Xss4m", "-XX:+UseParallelGC"]
+  defaultJvmArgs = ["-Xss4m", "-XX:+UseParallelGC", "-XX:-UseGCOverheadLimit"]
```
@ijuma WDYT about disabling this feature? From what I can tell, this will prevent a long GC pause from triggering an OOM. Instead, the build would likely just time out (which it's doing anyway, with the OOM happening in the Gradle worker).
As you said, the build is unlikely to succeed in either case. The GC overhead thing at least gives a hint that there is a memory leak or the heap is too small. Isn't that better than a timeout with no information?
Seems to have reproduced here: https://github.com/apache/kafka/actions/runs/13550598471/job/37873138268?pr=19031 No activity for a while after
This suggests that
I've not been able to reproduce the Gradle OOM that we're seeing on trunk; however, I saw a different OOM over on my fork: https://github.com/mumrah/kafka/actions/runs/13639578283/job/38126853058
(Heap dump was uploaded here https://github.com/mumrah/kafka/actions/runs/13639578283/artifacts/2685005476) This at least shows that the OOM arguments and heap dump archiving are working.
```shell
mkdir -p heap-dumps
HEAP_DUMP_DIR=$(readlink -f heap-dumps)
timeout ${TIMEOUT_MINUTES}m ./gradlew --continue --no-scan \
  -Dorg.gradle.jvmargs="-Xmx4g -Xss4m -XX:+UseParallelGC -XX:+UseGCOverheadLimit -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=$HEAP_DUMP_DIR" \
```
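The capture flow in the step above can be sketched end to end as follows. This is a minimal illustration, not the actual workflow file: the Gradle invocation is elided into a comment, and the final `.hprof` check is a hypothetical addition showing how a later upload-artifact step could decide whether there is anything to archive.

```shell
#!/usr/bin/env bash
# Sketch of the heap dump capture flow from the CI step above.
# The .hprof check at the end is a hypothetical addition, not part of the PR.

mkdir -p heap-dumps
HEAP_DUMP_DIR=$(readlink -f heap-dumps)   # -XX:HeapDumpPath wants an absolute path
echo "Heap dumps will be written to: $HEAP_DUMP_DIR"

# (The real step runs `timeout ... ./gradlew ...` here, passing
#  -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=$HEAP_DUMP_DIR
#  via -Dorg.gradle.jvmargs.)

# After the build, count any dumps that were produced so an
# upload-artifact step knows whether there is anything to archive.
shopt -s nullglob
dumps=("$HEAP_DUMP_DIR"/*.hprof)
echo "Found ${#dumps[@]} heap dump(s)"
```

Resolving the directory to an absolute path up front matters because the Gradle daemon and its workers may have different working directories.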
We need to pass the heap dump path directly to Gradle as well as to JUnit (inside build.gradle). That's why we have this apparent duplication.
We have seen a few OOM errors on trunk lately. This patch adds the ability to capture a heap dump when this happens so we can better determine if the error was due to something in Gradle or within our own tests (like a memory leak).
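For reference, the OOM-related HotSpot flags involved in this change can be collected in one place. A hedged sketch: the `JVM_OOM_ARGS` array name is illustrative and not from the workflow, but the flags and the behavior described in the comments are standard HotSpot options.

```shell
# Illustrative summary of the OOM-related HotSpot flags used in this PR's CI step.
# JVM_OOM_ARGS is a hypothetical variable name introduced for this sketch.
HEAP_DUMP_DIR="$PWD/heap-dumps"

JVM_OOM_ARGS=(
  "-XX:+UseGCOverheadLimit"          # throw OOM when almost all time goes to GC with little heap reclaimed
  "-XX:+ExitOnOutOfMemoryError"      # terminate the JVM on the first OOM instead of limping along
  "-XX:+HeapDumpOnOutOfMemoryError"  # write an .hprof heap dump when an OOM is thrown
  "-XX:HeapDumpPath=$HEAP_DUMP_DIR"  # directory (must already exist) where the dump lands
)

# These are appended to -Dorg.gradle.jvmargs so they reach the Gradle JVMs:
printf '%s\n' "${JVM_OOM_ARGS[@]}"
```

Note that the same flags also need to reach the JUnit test JVMs via build.gradle, which is the apparent duplication discussed in the review comments above.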