/home/jenkins/workspace/Grinder: No space left on device #2251

Closed
sophia-guo opened this issue Jul 5, 2021 · 31 comments

@sophia-guo

sophia-guo commented Jul 5, 2021

/home/jenkins/workspace/Grinder: No space left on device. The error was seen on the following Docker nodes:
test-docker-fedora33-x64-1
test-docker-fedora33-x64-2

https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/1027/
https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/1028/console

@sxa sxa added the systemdown label Jul 5, 2021
@sxa sxa added this to the July 2021 milestone Jul 5, 2021
@sxa
Member

sxa commented Jul 5, 2021

From docker system df -v it appears that the f33l.2229, alp311.2231 and alp312.2230 containers are using around 150Gb of space each which is likely causing us some problems.

@sxa sxa self-assigned this Jul 5, 2021
@sxa
Member

sxa commented Jul 5, 2021

Test_openjdk18_hs_extended.openjdk_x86-64_alpine-linux_testList_0 was using 43Gb on the Alpine 3.12 container. Similarly, the _1 variant of the same job was chewing up a comparable amount on Alpine 3.11, so I suspect they had been aborted partway through.

Both have been cleared and the host now has about 131Gb available, which should resolve the problem. Therefore closing.

@sxa sxa closed this as completed Jul 5, 2021
@sophia-guo
Author

sophia-guo commented Oct 18, 2021

I can see it has happened again.
test-docker-fedora33-x64-2
https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox_new/354/console

@sophia-guo sophia-guo reopened this Oct 18, 2021
@sophia-guo
Author

test-docker-ubuntu1604-x64 similar issue:
https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox_new/352/console

@sophia-guo
Author

docker-packet-ubuntu2004-amd-1 similar issue:
https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox_new/356/console

@llxia

llxia commented Oct 19, 2021

@llxia

llxia commented Dec 29, 2021

test-docker-fedora33-x64-1 Exception: java.nio.file.FileSystemException: /home/jenkins/workspace/Grinder: No space left on device:

https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/2875/console

@sxa
Member

sxa commented Dec 31, 2021

Not sure why the host is using so much space, but a docker system prune -a has recovered 30GB so that should keep it going for a while.

The biggest uses of space on the Fedora box appear to have been these, so I've also cleared them out:

639908	Test_openjdk19_hs_extended.openjdk_x86-64_linux_testList_1
779588	Test_openjdk17_hs_extended.system_x86-64_linux
1339544	Test_openjdk11_hs_extended.functional_x86-64_linux
1986392	Test_openjdk11_bisheng_sanity.openjdk_x86-64_linux
2164604	Test_openjdk8_bisheng_extended.openjdk_x86-64_linux_testList_2
2728576	Test_openjdk8_j9_sanity.openjdk_x86-64_linux
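
A listing like the one above can be produced with `du`. The following is a minimal sketch, assuming the numbers are `du -ks` totals in KiB (the thread does not say which command produced them); `largest_dirs` is a hypothetical helper name, not something from the infrastructure scripts.

```shell
# Sketch: list the N largest entries directly under a workspace root.
# largest_dirs is a hypothetical helper; sizes are KiB as reported by `du -ks`.
largest_dirs() (
    root=$1
    count=${2:-6}          # how many entries to show (assumed default: 6)
    cd "$root" || exit 1
    # -k: 1 KiB units, -s: one summary line per argument.
    # sort -n orders ascending, so the biggest consumers are printed last.
    du -ks -- * 2>/dev/null | sort -n | tail -n "$count"
)
```

Inside a container this might be called as `largest_dirs /home/jenkins/workspace 6`, which prints one `SIZE NAME` line per entry with the largest at the bottom.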

@sophia-guo
Author

@sophia-guo
Author

test-docker-fedora33-x64-1 Exception: java.nio.file.FileSystemException: /home/jenkins/workspace/Grinder: No space left on device:

https://ci.adoptopenjdk.net/view/Test_grinder/job/Test_Job_Auto_Gen/277/

@sxa
Member

sxa commented Jan 14, 2022

@Haroon-Khel As the new expert in the DockerStatic stuff, can you take a look and see what we can do with this please? We probably need some sort of automation (jenkins job or otherwise) that goes over the dockerhost machines and checks and if necessary reports any problems with:

  • total disk space on the host
  • total disk space in use by docker
  • whether any particular container is chewing up more space than it ought to be (probably in the workspace directory)

Doing something with the output of something like these commands may be a good place to start: df -k; docker system df; for CONTAINER in $(docker ps -q); do echo CONTAINER $CONTAINER = $(docker ps | awk "/^$CONTAINER/{print\$NF}"); docker exec $CONTAINER du -ks /home/jenkins/workspace / 2>/dev/null; done
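
As a starting point for the "total disk space on the host" check, here is a minimal sketch that turns `df` output into warnings. It assumes POSIX-format output (`df -Pk`) and a 90% threshold; both the threshold and the function name `report_full_filesystems` are illustrative assumptions, not an existing job.

```shell
# Sketch: read `df -Pk` output on stdin and warn about filesystems whose
# usage is at or above a threshold (default 90%, an assumed value).
# report_full_filesystems is a hypothetical helper name.
report_full_filesystems() {
    threshold=${1:-90}
    # Line 1 is the header. With -P, column 4 is available KiB,
    # column 5 is capacity (e.g. "87%") and column 6 is the mount point.
    # Mount points containing spaces are not handled by this sketch.
    awk -v limit="$threshold" 'NR > 1 {
        use = $5
        sub(/%$/, "", use)
        if (use + 0 >= limit + 0)
            printf "WARNING: %s is %s%% full (%s KiB free)\n", $6, use, $4
    }'
}
```

On a dockerhost this could be run as `df -Pk | report_full_filesystems 90`, with the per-container `du` loop from the comment above run afterwards only when a warning is printed.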

@sxa sxa modified the milestones: July 2021, 2022-01 (January) Jan 14, 2022
@Haroon-Khel Haroon-Khel self-assigned this Jan 14, 2022
@sxa sxa pinned this issue Jan 18, 2022
@sxa
Member

sxa commented Feb 11, 2022

@Haroon-Khel The latest JDK11 release didn't appear to cause a filling up of the file system. I think you asserted that adoptium/aqa-tests#3326 hadn't taken effect, although that may be a result of using the v0.8.0-release branch which won't have had the change merged. Can you try and check:

  • If it still ran the tests, why we didn't see the file system filling up
  • If it didn't run the tests, whether it was due to running from the alternate branch that didn't have them disabled

@smlambert
Contributor

smlambert commented Mar 5, 2022

Fresh issue on test-docker-ubuntu2010-x64-2:

https://ci.adoptopenjdk.net/view/work-in-progress/job/WIP_Test_Job_Auto_Gen/72/console

Building remotely on [test-docker-ubuntu2010-x64-2](https://ci.adoptopenjdk.net/computer/test-docker-ubuntu2010-x64-2) (ci.role.test sw.os.linux hw.arch.x86) in workspace /home/jenkins/workspace/WIP_Test_Job_Auto_Gen
Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to test-docker-ubuntu2010-x64-2
		at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1800)
		at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:357)
		at hudson.remoting.Channel.call(Channel.java:1001)

@smlambert
Contributor

Adding to the May 2022 plan (it looks partly worked on, and it still affects releases)

@sxa
Member

sxa commented May 18, 2022

No current issues so removing from the May milestone. I'll keep it open for another month or so and then we can close if no more occurrences (Can always be reopened if required)

@sxa sxa modified the milestones: 2022-04 (April), 2022-06 (June) May 24, 2022
@llxia

llxia commented Jun 20, 2022

test-docker-fedora34-x64-1:

Exception: java.nio.file.FileSystemException: /home/jenkins/workspace/Grinder: No space left on device

https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/5012/console

@sxa
Member

sxa commented Jun 20, 2022

Hmmm, cleared up some old volumes, although I thought I'd done a clear-up on this host earlier today, so we'll see if it fills up again. If so, we'll need to investigate what's using it up. I've only been able to reclaim 25% of the 400Gb volume, and it shouldn't be using anywhere near that amount.

@sxa
Member

sxa commented Jun 21, 2022

An extra 50Gb seems to have been used up overnight on the file system. That's not normal.

@llxia

llxia commented Jun 21, 2022

Could you list the top files/folders that use the most space? Maybe we can get some clues.

@sxa
Member

sxa commented Jun 21, 2022

> Could you list the top files/folders that use the most space? Maybe we can get some clues.

It's not quite that simple when it's a load of docker containers on the host system unfortunately.

@sxa
Member

sxa commented Jun 21, 2022

Looks like this process might have been keeping a lot of space in use, probably through deleted files that it still had open file handles to: jenkins 2266542 9825 99 Jun17 ? 11-17:17:13 /home/jenkins/workspace/Test_openjdk8_dragonwell_sanity.openjdk_x86-64_linux/openjdkbinary/j2sdk-image/bin/java -cp . -XX:+UseG1GC -XX:+MultiTenant -XX:+TenantHeapIsolation -XX:NativeMemoryTracking=detail -XX:+PrintGCDetails -Xloggc:gc.log -Xmx1g -Xmn32m TestLeak - I've killed it now and there's 320Gb free.
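
Space held by deleted files does not show up in `du`, because the directory entries are gone while the blocks stay allocated until the last file handle closes. A minimal, Linux-only sketch for spotting this from `/proc` follows; `deleted_open_files` is a hypothetical helper name, and `lsof +L1` (where `lsof` is installed) reports similar information.

```shell
# Sketch (Linux only): print the deleted files a process still holds open.
# deleted_open_files is a hypothetical helper; it relies on /proc/<pid>/fd,
# where the link target of an unlinked file ends in " (deleted)".
deleted_open_files() {
    pid=$1
    for fd in /proc/"$pid"/fd/*; do
        # Descriptors can disappear while we iterate, so ignore read errors.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            *' (deleted)') printf '%s\t%s\n' "$pid" "$target" ;;
        esac
    done
}
```

For the process quoted above this would have been run as `deleted_open_files 2266542`; reading another user's `/proc/<pid>/fd` needs matching privileges.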

@llxia

llxia commented Jun 22, 2022

https://ci.adoptopenjdk.net/job/SXA-processCheck/label=test-docker-fedora34-x64-1/295/console cannot complete on this machine due to the space issue.

In the test Jenkins script, it detects the leftover processes. I think we should enforce the logic to kill the leftover processes before and after the test job. The ideal place for this logic should be in TKG. If that cannot be completed soon, maybe we should do it in the Jenkins script for now.
FYI @smlambert @renfeiw
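
The kill-before-and-after idea could be sketched as below. This is not the existing Jenkins or TKG logic; `kill_leftover_processes` is a hypothetical function, and it assumes leftover processes can be recognised by the workspace path appearing in their command line, as it does for the `java` process quoted earlier in this thread.

```shell
# Sketch: terminate processes whose command line contains the workspace path.
# kill_leftover_processes is a hypothetical helper, not the TKG/Jenkins code.
kill_leftover_processes() {
    workspace=$1
    # List "PID ARGS" for every process and keep the PIDs whose command line
    # contains the path as a literal substring, excluding this shell itself.
    # The path is passed to awk via the environment so that awk's own
    # command line does not match.
    pids=$(ps -e -o pid= -o args= |
        WS="$workspace" SELF="$$" awk \
            'index($0, ENVIRON["WS"]) && $1 != ENVIRON["SELF"] { print $1 }')
    for pid in $pids; do
        echo "Killing leftover process $pid"
        # The process may already have exited; that is not an error here.
        kill "$pid" 2>/dev/null || true
    done
}
```

A job would call it with its own workspace, for example `kill_leftover_processes "$WORKSPACE"`, once before the tests start and once after they finish. It sends the default TERM signal only; escalating to KILL for processes that ignore it is left out of this sketch.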

@sophia-guo
Author

@llxia

llxia commented Jun 30, 2022

re #2251 (comment), the above code only lists the processes.

@sxa sxa modified the milestones: 2022-06 (June), 2022-07 (July) Jun 30, 2022
@sophia-guo
Author

test-sxa-armv7l-ubuntu2004-odroid-2 got No space left on device.

https://ci.adoptopenjdk.net/job/Grinder/6196/console

@sxa
Member

sxa commented Nov 21, 2022

> test-sxa-armv7l-ubuntu2004-odroid-2 got No space left on device.
> https://ci.adoptopenjdk.net/job/Grinder/6196/console

Will cover this under #2829

@sxa
Member

sxa commented Nov 21, 2022

I believe all of the problems related to the DockerHost systems have now been resolved since we allocated more dedicated space to /var/lib/docker a few months ago, so I'm going to close this issue now. If we have any further problems they can be opened in separate issues.

@sxa sxa closed this as completed Nov 21, 2022