Velero Panic on backup - Observed a panic: sync: negative WaitGroup counter #8708

Gui13 opened this issue Feb 20, 2025 · 2 comments

Gui13 commented Feb 20, 2025

What steps did you take and what happened:

During a Velero backup, we hit a panic (this is not the same as #8657):


time="2025-02-20T09:35:08Z" level=info msg="plugin process exited" backup-storage-location=velero/default cmd=/plugins/velero-plugin-for-microsoft-azure controller=backup-storage-location id=2068 logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:80" plugin=/plugins/velero-plugin-for-microsoft-azure
time="2025-02-20T09:35:15Z" level=info msg="The request has status 'InProgress', skip." controller=backup-deletion deletebackuprequest=velero/velero-braincube-20250120000012-gt6xm logSource="pkg/controller/backup_deletion_controller.go:145"
time="2025-02-20T09:35:27Z" level=error msg="pod volume backup failed: data path backup canceled: PVB is canceled" backup=velero/manual-velero-braincube-20250220 logSource="pkg/podvolume/backupper.go:382"
time="2025-02-20T09:35:27Z" level=error msg="pod volume backup failed: data path backup canceled: PVB is canceled" backup=velero/manual-velero-braincube-20250220 logSource="pkg/podvolume/backupper.go:382"
E0220 09:35:27.132115       1 runtime.go:77] Observed a panic: sync: negative WaitGroup counter
goroutine 757 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x26c8ce0, 0x314da30})
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:75 +0x85
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc00467f6c0?})
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:49 +0x6b
panic({0x26c8ce0?, 0x314da30?})
	/usr/local/go/src/runtime/panic.go:770 +0x132
sync.(*WaitGroup).Add(0xc007cec270?, 0x2c4e600?)
	/usr/local/go/src/sync/waitgroup.go:62 +0xd8
sync.(*WaitGroup).Done(...)
	/usr/local/go/src/sync/waitgroup.go:87
github.com/vmware-tanzu/velero/pkg/podvolume.newBackupper.func1({0x40a3d2?, 0xc007cf6060?}, {0x2c4e600?, 0xc00da55208?})
	/go/src/github.com/vmware-tanzu/velero/pkg/podvolume/backupper.go:167 +0x1eb
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(...)
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:246
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:970 +0xea
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0030d9f70, {0x3156de0, 0xc008688c60}, 0x1, 0xc0037a0ba0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0006b7770, 0x3b9aca00, 0x0, 0x1, 0xc0037a0ba0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(...)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:161
k8s.io/client-go/tools/cache.(*processorListener).run(0xc007ce63f0)
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:966 +0x69
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:72 +0x52
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 379
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:70 +0x73
panic: sync: negative WaitGroup counter [recovered]
	panic: sync: negative WaitGroup counter

What did you expect to happen:

No crash :-)

The following information will help us better understand what's going on:

I have collected a bundle, but I would like to send it privately so as not to disclose too much information.

Anything else you would like to add:

Environment:

  • Velero version (use velero version): 1.15.2
  • Velero features (use velero client config get features): FSB is used extensively
  • Kubernetes version (use kubectl version): 1.31
  • Kubernetes installer & version: AKS (Azure)
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release): Azure Linux

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" at the top right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
Gui13 changed the title from "Velero Panic on backup" to "Velero Panic on backup - Observed a panic: sync: negative WaitGroup counter" on Feb 20, 2025
ywk253100 (Contributor) commented Feb 21, 2025

This is likely caused by the same underlying issue as #8657:
The PVBs are handled very quickly, and WaitGroup.Done() is called multiple times before the corresponding PVBs are added to the WaitGroup, which drives the counter negative.
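
For illustration, here is a minimal Go sketch of that race pattern. This is not Velero's actual backupper code; the channel, goroutine, and timing below are invented purely to show how a Done() that runs before its matching Add() drives the counter negative and produces this exact panic message:

```go
package main

import (
	"sync"
	"time"
)

func main() {
	var wg sync.WaitGroup
	completed := make(chan struct{}) // stands in for a PVB reaching a terminal state

	// "Event handler": marks the PVB done as soon as it observes completion.
	// If this runs while the counter is still zero, Done() panics with
	// "sync: negative WaitGroup counter".
	go func() {
		<-completed
		wg.Done()
	}()

	// "Creator": lets the PVB complete first and only registers it in the
	// WaitGroup afterwards. The sleep exaggerates the window in which the
	// handler can win the race.
	close(completed) // the PVB "completes" immediately
	time.Sleep(10 * time.Millisecond)
	wg.Add(1) // too late: Done() has most likely already run

	wg.Wait()
}
```

The fix direction is to make sure Add(1) happens before the completion event can possibly be observed, so Done() can never outrun it.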

Is this issue reproducible in your environment? It would be appreciated if you could try the dev build velero/velero:release-1.15-dev, which contains the fix, and verify whether it resolves the problem.

Gui13 (Author) commented Feb 26, 2025

The issue is not easily reproducible (only one occurrence in the last 15 days), and since this is a client environment, we can't easily push a dev build.
We'll wait until Velero 1.16 lands and see whether the problem disappears.
