
bpm/1.4.2 fails on a bosh-lite #176

Open
abg opened this issue Oct 29, 2024 · 5 comments

abg commented Oct 29, 2024

Yesterday our pipelines picked up bpm/1.4.2, which bumped to runc/1.2.0, and environments using a bosh-lite configuration started failing.

The initial deployment is successful but cleaning up jobs later fails.

# bpm stop test-server
Error: failed to cleanup job-process: exit status 1

bpm seems to get into a bad state if I have multiple deployments and restart a couple of times. Here's a reproduction using the bpm-release bosh-lite.yml test manifest.

$ bosh -n -d bpm deploy manifests/bosh-lite.yml
...success...
$ export BOSH_DEPLOYMENT=bpm-$(uuidgen)
$ bosh -n deploy manifests/bosh-lite.yml -o <(echo '[{"type":"replace","path":"/name","value":"((deployment_name))"}]') -v deployment_name=$BOSH_DEPLOYMENT
...success...
$ bosh -n restart
...success...
$ bosh -n restart
...
Task 20 | 14:31:59 | L starting jobs: bpm/33f58def-3dac-467e-bc7d-715e4a890b54 (0) (canary) (00:02:33)
                   L Error: 'bpm/33f58def-3dac-467e-bc7d-715e4a890b54 (0)' is not running after update. Review logs for failed jobs: test-server, alt-test-server
...

$ bosh ssh 
# bpm list
Name                        Pid Status
test-errand                 -   stopped
test-server                 -   failed
test-server.alt-test-server -   failed
# bpm start test-server
Error: failed to clean up stale job-process: exit status 1
# bpm stop test-server
Error: failed to clean up stale job-process: exit status 1

This may be related:

# /var/vcap/packages/bpm/bin/runc --root /var/vcap/sys/run/bpm-runc delete --force bpm-test-server
ERRO[0002] unable to destroy container: unable to remove container's cgroup: Failed to remove paths: map[net_cls:/sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-test-server net_prio:/sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-test-server]
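
As a diagnostic sketch (not something from the original report), one way to see what is keeping that cgroup busy is to inspect the v1 cgroup directory named in the error; the paths below are copied from the error output above and will differ per job:

# cat /sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-test-server/cgroup.procs
# ls /sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-test-server/

A non-empty cgroup.procs means processes are still attached to the cgroup; an empty one combined with a failing rmdir usually points to a nested child cgroup still present underneath.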

I couldn't reproduce this on a bbl environment, nor with bpm/1.4.1.

Rolling back to bpm/1.4.1 (and runc/1.1.15) seems to resolve this issue for us.

abg commented Oct 31, 2024

Poking at this a little this morning, I see runc-1.1.15 ran into the same container/cgroup teardown issue but seemingly ignored it. runc-1.2.0 seems to hard-stop when it cannot clean up a cgroup.

# /var/vcap/packages/bpm/bin/runc --version
runc version 1.2.0
commit: unknown
spec: 1.2.0
go: go1.23.2
libseccomp: 2.5.1
# /var/vcap/packages/bpm/bin/runc --root /var/vcap/sys/run/bpm-runc/ delete bpm-loggr-forwarder-agent
ERRO[0002] unable to destroy container: unable to remove container's cgroup: Failed to remove paths: map[net_cls:/sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-loggr-forwarder-agent net_prio:/sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-loggr-forwarder-agent]
# echo $?
1

# /var/vcap/packages/bpm/bin/runc-1.1.15 --version
runc version 1.1.15
commit: unknown
spec: 1.0.2-dev
go: go1.23.2
libseccomp: 2.5.3

# /var/vcap/packages/bpm/bin/runc-1.1.15 --root /var/vcap/sys/run/bpm-runc/ delete bpm-loggr-forwarder-agent
WARN[0000] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-loggr-forwarder-agent: device or resource busy"
WARN[0000] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-loggr-forwarder-agent: device or resource busy"
ERRO[0000] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-loggr-forwarder-agent: device or resource busy"
ERRO[0000] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-loggr-forwarder-agent: device or resource busy"
ERRO[0000] Failed to remove paths: map[net_cls:/sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-loggr-forwarder-agent net_prio:/sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-loggr-forwarder-agent]
# echo $?
0
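
For anyone needing to unwedge a job by hand, a hedged workaround sketch (paths and the job name are copied from the errors above; this is not an official bpm procedure): move any leftover processes out of the stale cgroup, remove it, and retry the delete.

# cg=/sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-loggr-forwarder-agent
# for pid in $(cat "$cg/cgroup.procs"); do echo "$pid" > "$cg/../cgroup.procs"; done
# rmdir "$cg"
# /var/vcap/packages/bpm/bin/runc --root /var/vcap/sys/run/bpm-runc/ delete bpm-loggr-forwarder-agent

If rmdir still reports the directory as busy, a nested child cgroup under that path is usually the culprit and has to be removed first.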

selzoc commented Oct 31, 2024

We found the changes in runc that led to this behavioral change: opencontainers/runc@a6f4081 and opencontainers/runc@7396ca9

Essentially, errors from removing cgroups were being ignored. Now runc takes a "fast fail" approach and exits non-zero when cgroup removal fails.

It's unclear at the moment what leads to the errors removing cgroups on bosh lites.

selzoc commented Oct 31, 2024

Note that this only appears to be a problem with bosh-lites using the warden CPI, not the docker CPI.

selzoc added a commit to cloudfoundry/bosh-deployment that referenced this issue Nov 19, 2024
We've had reports such as cloudfoundry/bpm-release#176 where running bosh-lite with the latest runc will fail to deploy or restart. While we haven't been able to find an absolute root cause, garden-runc-release switched the default of the containerd_mode property a few months back: cloudfoundry/garden-runc-release#315.

Garden has some issues with running itself under bpm (see https://github.com/cloudfoundry/garden-runc-release/blob/develop/docs/BPM_support.md), so we postulate that doing the reverse (running bpm under Garden) has similar issues.

We have not been able to reproduce the issue in cloudfoundry/bpm-release#176 with containerd_mode set to false.
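
For reference, a rough sketch of how one could pin the property ahead of that commit; the ops-file path is an assumption about where the garden job sits in bosh-deployment's bosh-lite.yml and may need adjusting:

$ cat > pin-containerd-mode.yml <<'EOF'
- type: replace
  path: /instance_groups/name=bosh/jobs/name=garden/properties/containerd_mode?
  value: false
EOF

The new file would then be added with an extra -o flag on the usual bosh create-env command line for the bosh-lite director.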

selzoc commented Nov 19, 2024

FYI I've opened a PR for bosh-deployment to resolve this issue: cloudfoundry/bosh-deployment#479

abg commented Dec 11, 2024

Swinging back around to this, I unpinned bpm in one of our pipelines, since we are now pulling in the cloudfoundry/bosh-deployment#479 change.

My pipeline failed: random jobs failed to start on redeploys or monit stop/start operations.

It seems like the containerd_mode: false property is set, but in some configurations jobs still don't restart cleanly.

$ bosh env
Name               bosh-lite
UUID               45dc22bd-4972-459d-93ac-93a048e71e1b
Version            280.1.13 (00000000)
Director Stemcell  -/1.651
CPI                warden_cpi
Features           config_server: enabled
                   local_dns: enabled
                   snapshots: disabled
User               admin

$ bosh -n restart
...
Task 149 | 19:06:13 | L starting jobs: bpm/7792fd64-c12f-4034-b129-b10eed0a3946 (0) (canary) (00:02:48)
                    L Error: 'bpm/7792fd64-c12f-4034-b129-b10eed0a3946 (0)' is not running after update. Review logs for failed jobs: test-server, alt-test-server

$ bosh ssh
$ sudo -i
# bpm version
1.4.6
# bpm list
Name                        Pid Status
test-errand                 -   stopped
test-server                 -   failed
test-server.alt-test-server -   failed

# /var/vcap/packages/bpm/bin/runc --root /var/vcap/sys/run/bpm-runc/ delete bpm-test-server
ERRO[0002] unable to destroy container: unable to remove container's cgroup: Failed to remove paths: map[net_cls:/sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-test-server net_prio:/sys/fs/cgroup/net_cls,net_prio/monit-api-access/bpm-test-server]

$ ssh -i /tmp/director.key jumpbox@${director_ip}
bosh/0:~$ sudo -i
bosh/0:~# grep -ri containerd /var/vcap/jobs/garden/monit
bosh/0:~# 
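
As an additional hedged check (the config directory path is an assumption about the garden job's layout on the director VM), one could also look at the rendered garden config and the running processes rather than only the monit file:

bosh/0:~# grep -ri containerd /var/vcap/jobs/garden/config/ 2>/dev/null
bosh/0:~# ps aux | grep -i containerd | grep -v grep

No hits in either place would suggest garden is genuinely running without containerd mode, which would point back at runc/cgroup cleanup rather than at the containerd_mode setting.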
