Velero Panic on backup - Observed a panic: sync: negative WaitGroup counter #8708

Gui13 opened this issue Feb 20, 2025 · 2 comments

Gui13 commented Feb 20, 2025

What steps did you take and what happened:

During a Velero backup, we hit a panic (this is not the same as #8657):


time="2025-02-20T09:35:08Z" level=info msg="plugin process exited" backup-storage-location=velero/default cmd=/plugins/velero-plugin-for-microsoft-azure controller=backup-storage-location id=2068 logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:80" plugin=/plugins/velero-plugin-for-microsoft-azure
time="2025-02-20T09:35:15Z" level=info msg="The request has status 'InProgress', skip." controller=backup-deletion deletebackuprequest=velero/velero-braincube-20250120000012-gt6xm logSource="pkg/controller/backup_deletion_controller.go:145"
time="2025-02-20T09:35:27Z" level=error msg="pod volume backup failed: data path backup canceled: PVB is canceled" backup=velero/manual-velero-braincube-20250220 logSource="pkg/podvolume/backupper.go:382"
time="2025-02-20T09:35:27Z" level=error msg="pod volume backup failed: data path backup canceled: PVB is canceled" backup=velero/manual-velero-braincube-20250220 logSource="pkg/podvolume/backupper.go:382"
E0220 09:35:27.132115       1 runtime.go:77] Observed a panic: sync: negative WaitGroup counter
goroutine 757 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x26c8ce0, 0x314da30})
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:75 +0x85
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc00467f6c0?})
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:49 +0x6b
panic({0x26c8ce0?, 0x314da30?})
	/usr/local/go/src/runtime/panic.go:770 +0x132
sync.(*WaitGroup).Add(0xc007cec270?, 0x2c4e600?)
	/usr/local/go/src/sync/waitgroup.go:62 +0xd8
sync.(*WaitGroup).Done(...)
	/usr/local/go/src/sync/waitgroup.go:87
github.com/vmware-tanzu/velero/pkg/podvolume.newBackupper.func1({0x40a3d2?, 0xc007cf6060?}, {0x2c4e600?, 0xc00da55208?})
	/go/src/github.com/vmware-tanzu/velero/pkg/podvolume/backupper.go:167 +0x1eb
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(...)
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:246
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:970 +0xea
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0030d9f70, {0x3156de0, 0xc008688c60}, 0x1, 0xc0037a0ba0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0006b7770, 0x3b9aca00, 0x0, 0x1, 0xc0037a0ba0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(...)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:161
k8s.io/client-go/tools/cache.(*processorListener).run(0xc007ce63f0)
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:966 +0x69
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:72 +0x52
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 379
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:70 +0x73
panic: sync: negative WaitGroup counter [recovered]
	panic: sync: negative WaitGroup counter

What did you expect to happen:

No crash :-)

The following information will help us better understand what's going on:

I have collected a bundle, but I would like to send it privately so as not to disclose too much information.

Anything else you would like to add:

Environment:

  • Velero version (use velero version): 1.15.2
  • Velero features (use velero client config get features): FSB is used extensively
  • Kubernetes version (use kubectl version): 1.31
  • Kubernetes installer & version: AKS (Azure)
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release): Azure Linux

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" at the top right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
Gui13 changed the title from "Velero Panic on backup" to "Velero Panic on backup - Observed a panic: sync: negative WaitGroup counter" on Feb 20, 2025
ywk253100 (Contributor) commented Feb 21, 2025

This is likely caused by the same underlying issue as #8657:
The PVBs are handled very quickly, and WaitGroup.Done() is called multiple times before the corresponding PVBs are added to the WaitGroup, which drives the counter negative.
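
For illustration, here is a minimal Go sketch of that race pattern. This is not Velero's actual backupper code; the channel, goroutine, and timing below are invented purely to show how a Done() that runs before its matching Add() drives the counter negative and produces this exact panic message:

```go
package main

import (
	"sync"
	"time"
)

func main() {
	var wg sync.WaitGroup
	completed := make(chan struct{}) // stands in for a PVB reaching a terminal state

	// "Event handler": marks the PVB done as soon as it observes completion.
	// If this runs while the counter is still zero, Done() panics with
	// "sync: negative WaitGroup counter".
	go func() {
		<-completed
		wg.Done()
	}()

	// "Creator": lets the PVB complete first and only registers it in the
	// WaitGroup afterwards. The sleep exaggerates the window in which the
	// handler can win the race.
	close(completed) // the PVB "completes" immediately
	time.Sleep(10 * time.Millisecond)
	wg.Add(1) // too late: Done() has most likely already run

	wg.Wait()
}
```

The fix direction is to make sure Add(1) happens before the completion event can possibly be observed, so Done() can never outrun it.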

Is this issue reproducible in your environment? It would be appreciated if you could try the dev build velero/velero:release-1.15-dev, which contains the fix, and verify whether it resolves the problem.

Gui13 (Author) commented Feb 26, 2025

The issue is not easily reproducible (only one occurrence in the last 15 days), and since this is a client environment, we can't easily push a dev build.
We'll wait until Velero 1.16 lands and see whether the problem disappears.
