
filesystem backup doesn't work when multiple controllers select the same pods #8735

Open
blackpiglet opened this issue Feb 28, 2025 · 0 comments


blackpiglet commented Feb 28, 2025

What steps did you take and what happened:

Create two Deployments that have the same selector, as shown in the manifests below.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-app
  namespace: upgrade
spec:
  selector:
    matchLabels:
      app: hello-app
  template:
    metadata:
      labels:
        app: hello-app
    spec:
      containers:
      - name: hello-app
        image: gcr.io/velero-gcp/nginx:1.17.6
        args: [ "sleep", "3600" ]
        volumeMounts:
        - name: standard
          mountPath: /usr/share/standard/
      volumes:
      - name: standard
        persistentVolumeClaim:
          claimName: standard
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-app-2
  namespace: upgrade
spec:
  selector:
    matchLabels:
      app: hello-app
  template:
    metadata:
      labels:
        app: hello-app
    spec:
      containers:
      - name: hello-app
        image: gcr.io/velero-gcp/nginx:1.17.6
        args: [ "sleep", "3600" ]
        volumeMounts:
        - name: standard
          mountPath: /usr/share/standard/
      volumes:
      - name: standard
        persistentVolumeClaim:
          claimName: standard
---
apiVersion: snapshot.storage.k8s.io/v1
deletionPolicy: Delete
driver: csi.vsphere.vmware.com
kind: VolumeSnapshotClass
metadata:
  labels:
    velero.io/csi-volumesnapshot-class: "true"
  name: velero
---
apiVersion: v1
kind: Namespace
metadata:
  name: upgrade
  labels:
    pod-security.kubernetes.io/enforce: privileged
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: standard
  namespace: upgrade
spec:
  storageClassName: worker-storagepolicy-latebinding
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
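
Assuming the manifests above are saved to a file (workload.yaml is just a placeholder name), they can be applied with:

kubectl apply -f workload.yaml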

Create a backup that includes both Deployments, with the volume data backed up by fs-backup.

velero backup create --include-namespaces=upgrade multiple-deployments --default-volumes-to-fs-backup
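
To confirm that the volume data went through fs-backup, the PodVolumeBackup objects can be listed once the backup completes. This is only a sketch; it assumes the usual velero.io/backup-name label that Velero sets on PodVolumeBackup objects:

velero backup describe multiple-deployments --details
kubectl -n velero get podvolumebackups -l velero.io/backup-name=multiple-deployments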

Create a restore from the backup.

velero restore create --from-backup multiple-deployments --namespace-mappings upgrade:restore

There is a possibility that the restore ends up PartiallyFailed because a PodVolumeRestore fails.
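
The failure can be confirmed from the restore details and the PodVolumeRestore objects. This is a sketch; <restorename> stands for the generated restore name, and the velero.io/restore-name label on PodVolumeRestore objects is an assumption:

velero restore describe <restorename> --details
kubectl -n velero get podvolumerestores -l velero.io/restore-name=<restorename>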

What did you expect to happen:
The restore should complete successfully.

The following information will help us better understand what's going on:
This is a limitation of Velero fs-backup.
Velero applies some special handling when backing up and restoring Deployments.
The order of resources in a Velero backup is determined by the Kubernetes resource type: Velero backs up all resources of type A, then all resources of type B, then all resources of type C.
The Velero restore follows similar logic.

To restore Deployments successfully, Velero restores the Pods first, then the ReplicaSets, and finally the Deployments.
To make this work, the Velero restore removes some metadata (for example, the OwnerReference) and the entire status from the restored resources.

In this scenario, the two Deployments create two ReplicaSets and two Pods.
During restore, because the OwnerReferences were removed, both ReplicaSets can be adopted by either Deployment.

It is possible that both ReplicaSets are adopted by Deployment A; Deployment A then scales down one of them and keeps only the other.
When Deployment B is created, it also tries to adopt the ReplicaSets.
Because one of the ReplicaSets already has its OwnerReference set to Deployment A, only the other ReplicaSet can be adopted by Deployment B, but that ReplicaSet has already been scaled down.
As a result, Deployment B has to create a new ReplicaSet.
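
As an illustration, the adoption result can be checked by listing the ownerReferences of the restored ReplicaSets; the command below is only a sketch:

kubectl -n restore get replicasets -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.ownerReferences[0].name}{"\n"}{end}'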

If a PodVolumeRestore is already in progress before the ReplicaSet is scaled down, the PodVolumeRestore fails, because its target Pod is deleted and the mount directory changes. The kubectl describe output below shows the state of the two Deployments after the restore.

root@corgi-jumper:~/workload# kubectl -n restore describe deploy hello-app
Name:                   hello-app
Namespace:              restore
CreationTimestamp:      Fri, 28 Feb 2025 07:46:16 +0000
Labels:                 velero.io/backup-name=multiple-1
                        velero.io/restore-name=multiple-1-20250228074613
Annotations:            deployment.kubernetes.io/revision: 1
Selector:               app=hello-app
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:  app=hello-app
  Containers:
   hello-app:
    Image:      gcr.io/velero-gcp/nginx:1.17.6
    Port:       <none>
    Host Port:  <none>
    Args:
      sleep
      3600
    Environment:  <none>
    Mounts:
      /usr/share/standard/ from standard (rw)
  Volumes:
   standard:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  standard
    ReadOnly:   false
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  <none>
NewReplicaSet:   hello-app-57d44fbbd7 (1/1 replicas created)
Events:
  Type    Reason             Age   From                   Message
  ----    ------             ----  ----                   -------
  Normal  ScalingReplicaSet  6m4s  deployment-controller  Scaled up replica set hello-app-57d44fbbd7 from 0 to 1
root@corgi-jumper:~/workload# kubectl -n restore describe deploy hello-app-2
Name:                   hello-app-2
Namespace:              restore
CreationTimestamp:      Fri, 28 Feb 2025 07:46:16 +0000
Labels:                 velero.io/backup-name=multiple-1
                        velero.io/restore-name=multiple-1-20250228074613
Annotations:            deployment.kubernetes.io/revision: 2
Selector:               app=hello-app
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:  app=hello-app
  Containers:
   hello-app:
    Image:      gcr.io/velero-gcp/nginx:1.17.6
    Port:       <none>
    Host Port:  <none>
    Args:
      sleep
      3600
    Environment:  <none>
    Mounts:
      /usr/share/standard/ from standard (rw)
  Volumes:
   standard:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  standard
    ReadOnly:   false
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  hello-app-776f688678 (0/0 replicas created)
NewReplicaSet:   hello-app-2-776f688678 (1/1 replicas created)
Events:
  Type    Reason             Age    From                   Message
  ----    ------             ----   ----                   -------
  Normal  ScalingReplicaSet  6m22s  deployment-controller  Scaled down replica set hello-app-776f688678 from 1 to 0

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue. For more options, please refer to velero debug --help.

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

  • kubectl logs deployment/velero -n velero
  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
  • velero backup logs <backupname>
  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
  • velero restore logs <restorename>

Anything else you would like to add:

Environment:

  • Velero version (use velero version):
  • Velero features (use velero client config get features):
  • Kubernetes version (use kubectl version):
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"