
Recovering from DOKS volumes multi-attach error


After a general disruption or failure (network blips; node, DOKS, control plane, or volume failures...), a workload may end up stuck.

Upon manual or automatic re-deployment, the workload may remain stuck, and you may observe a Multi-Attach/MountVolume.SetUp error in its events (by running kubectl describe).
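For instance, the events can be inspected like this (pod and namespace names are placeholders):

```
# describe the stuck pod; the Events section is at the bottom
kubectl describe pod <pod-name> -n <namespace>

# or list recent events for the whole namespace
kubectl get events -n <namespace> --sort-by=.lastTimestamp
```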
Make sure the error is actually about a volume though, otherwise none of this applies. For example, the following is an error reading a ConfigMap, caused by a control plane failure: [screenshot of the ConfigMap read error]

(in such a case, we can only get DigitalOcean to fix their shit.)

The PVC should be referenced directly in the error message. That means the volume is still attached to a previous node.
In that case, make sure no other pod references that volume. If one does, the new pod cannot run until the old pod gets terminated. If the old pod is not gone after ~30 seconds, a force kill may be attempted (see below). If the kill doesn't succeed, we have an ongoing node/control plane failure.
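A minimal sketch of both steps, assuming jq is available and using placeholder names:

```
# list pods that still reference the PVC
kubectl get pods -n <namespace> -o json \
  | jq -r '.items[]
           | select(.spec.volumes[]?.persistentVolumeClaim.claimName == "<pvc-name>")
           | .metadata.name'

# if the old pod hangs in Terminating, attempt a force kill
kubectl delete pod <old-pod> -n <namespace> --grace-period=0 --force
```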

While this should quickly resolve itself as the underlying systems recover, in the case of s3-nginx-proxy a workaround may be attempted, since the volume is only used for cache and can be discarded:

  • Get the PVC manifest: kubectl get pvc -n s3-nginx-proxy s3-nginx-proxy-s3-nginx-proxy-chart -o yaml > pvc.yaml
  • Rename the PVC in pvc.yaml (change .metadata.name, and drop server-populated fields such as status, uid, resourceVersion, and spec.volumeName so a fresh volume gets provisioned), then apply using kubectl apply -n s3-nginx-proxy -f pvc.yaml
  • Edit the Deployment: kubectl edit -n s3-nginx-proxy deploy/s3-nginx-proxy-s3-nginx-proxy-chart
    Look for the claimName under .spec.template.spec.volumes and update it to the new name (a consolidated sketch of these steps follows below)
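
The whole sequence in one place, assuming a hypothetical new PVC name s3-nginx-proxy-cache-2 (any unused name works):

```
# 1. export the stuck PVC
kubectl get pvc -n s3-nginx-proxy s3-nginx-proxy-s3-nginx-proxy-chart -o yaml > pvc.yaml

# 2. edit pvc.yaml: set .metadata.name to s3-nginx-proxy-cache-2 and remove
#    status, .metadata.uid, .metadata.resourceVersion and .spec.volumeName,
#    so the new claim provisions a fresh volume instead of rebinding the stuck one

# 3. create the new PVC
kubectl apply -n s3-nginx-proxy -f pvc.yaml

# 4. point the Deployment's volume at the new claimName
kubectl edit -n s3-nginx-proxy deploy/s3-nginx-proxy-s3-nginx-proxy-chart
```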

A new pod should replace the older one with a fresh cache volume.
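To confirm, watch the rollout and check that the replacement pod comes up:

```
kubectl rollout status -n s3-nginx-proxy deploy/s3-nginx-proxy-s3-nginx-proxy-chart
kubectl get pods -n s3-nginx-proxy
```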