After a restore, cStorVolume remains in Init, CVRs remain in Error state #127

Closed · sgielen opened this issue Oct 12, 2020 · 10 comments · Fixed by #131

Labels: Bug (issue/pr is a bug/fix to existing feature)
sgielen (Contributor) commented Oct 12, 2020

What steps did you take and what happened:

I wanted to test a full restore of a namespace (a single application). I tested this on a namespace called "vikunja", which contains two Deployments, two Services, two Pods, one Ingress, and one PVC backed by a cStorVolume and three cStorVolumeReplicas that are all Healthy. The restore succeeded; however, the application in the new namespace never came up, because its cStorVolume remains in the Init state and its three cStorVolumeReplicas remain in the Error state.

This is the Restore command I ran:

velero restore create vikunja-test-restore --from-backup volumes-full-20201011-043725 --include-namespaces 'vikunja' --namespace-mappings 'vikunja:vikunja-restore' --restore-volumes=true

Where volumes-full is a (non-Schedule) backup that contains a full copy of most of the volumes in my cluster.

The restore finished successfully in approximately one minute. Indeed, after this, the vikunja-restore namespace contains two Deployments, two Services, two Pods, one Ingress, and one PVC. I modified the Ingress so that the endpoint URI differs from the one in the original namespace, and visited the application at the new URI, expecting to find the same application but in the state from the backup rather than the most recent state. However, the application does not load, because the PVC cannot be mounted: the iSCSI target is not ready.

$ kubectl get cstorvolume -n openebs | grep 9e222
pvc-9e222d37-984f-45ad-a42b-5fac7892b51f   Init      7m48s   50Gi
$ kubectl describe cstorvolume -n openebs pvc-9e222d37-984f-45ad-a42b-5fac7892b51f 
[...]
Events:
  Type    Reason   Age    From                                                             Message
  ----    ------   ----   ----                                                             -------
  Normal  Updated  6m32s  pvc-9e222d37-984f-45ad-a42b-5fac7892b51f-target-5cf9bbd767288st  Updated resize conditions
  Normal  Updated  6m32s  pvc-9e222d37-984f-45ad-a42b-5fac7892b51f-target-5cf9bbd767288st  successfully resized volume from 0 to 50Gi
$ kubectl get cvr -n openebs | grep 9e222
pvc-9e222d37-984f-45ad-a42b-5fac7892b51f-cstor-disk-pool-eiq0   32.6M   1.76M       Error     7m55s
pvc-9e222d37-984f-45ad-a42b-5fac7892b51f-cstor-disk-pool-una0   32.6M   1.76M       Error     7m55s
pvc-9e222d37-984f-45ad-a42b-5fac7892b51f-cstor-disk-pool-7it6   32.6M   1.75M       Error     7m55s
$ kubectl describe cvr -n openebs pvc-9e222d37-984f-45ad-a42b-5fac7892b51f-cstor-disk-pool-eiq0
[...]
Events:
  Type     Reason      Age                    From                Message
  ----     ------      ----                   ----                -------
  Normal   Synced      8m55s                  CStorVolumeReplica  Received Resource create event
  Normal   Created     8m51s                  CStorVolumeReplica  Resource created successfully
  Normal   Synced      8m51s (x2 over 8m55s)  CStorVolumeReplica  Received Resource modify event
  Warning  SyncFailed  26s (x18 over 8m51s)   CStorVolumeReplica  failed to sync CVR error: unable to update snapshot list details in CVR: failed to get the list of snapshots: Output: failed listsnap command for cstor-67623196-d6e5-4956-9eb4-db81259c9089/pvc-9e222d37-984f-45ad-a42b-5fac7892b51f with err 2

I tried to let the situation resolve overnight but to no avail.

What did you expect to happen:

I expected the volume and its replicas to eventually become Healthy with the same contents as the volume had during the restored backup.

The output of the following commands will help us better understand what's going on:

  • maya-apiserver logs (restore is around 21:24):
I1011 06:18:29.198928       7 backup_endpoint_v1alpha1.go:501] Deleting backup snapshot volumes-full-20201011-043725 for volume "pvc-1c329d55-f024-4306-8489-0efbd65f58c5"
I1011 06:18:29.531233       7 backup_endpoint_v1alpha1.go:507] Snapshot:'volumes-full-20201011-043725' deleted successfully for volume:pvc-1c329d55-f024-4306-8489-0efbd65f58c5
I1011 21:24:35.125935       7 volume_endpoint_v1alpha1.go:78] received cas volume request: http method {GET}
I1011 21:24:35.126216       7 volume_endpoint_v1alpha1.go:171] received volume read request: pvc-9e222d37-984f-45ad-a42b-5fac7892b51f
W1011 21:24:35.557971       7 task.go:433] notfound error at runtask {readlistsvc}: error {target service not found}
W1011 21:24:35.560904       7 runner.go:166] nothing to rollback: no rollback tasks were found
2020/10/11 21:24:35.561160 [ERR] http: Request GET /latest/volumes/pvc-9e222d37-984f-45ad-a42b-5fac7892b51f
failed to read volume: volume {pvc-9e222d37-984f-45ad-a42b-5fac7892b51f} not found in namespace {vikunja-restore}
I1011 21:24:35.574866       7 volume_endpoint_v1alpha1.go:78] received cas volume request: http method {POST}
I1011 21:24:35.575543       7 volume_endpoint_v1alpha1.go:135] received volume create request
I1011 21:24:38.119247       7 select.go:154] Overprovisioning restriction policy not added as overprovisioning is enabled on spc cstor-disk-pool
I1011 21:24:39.041828       7 volume_endpoint_v1alpha1.go:165] volume 'pvc-9e222d37-984f-45ad-a42b-5fac7892b51f' created successfully
I1011 21:24:39.058360       7 volume_endpoint_v1alpha1.go:78] received cas volume request: http method {GET}
I1011 21:24:39.058460       7 volume_endpoint_v1alpha1.go:171] received volume read request: pvc-9e222d37-984f-45ad-a42b-5fac7892b51f
I1011 21:24:42.053882       7 volume_endpoint_v1alpha1.go:226] volume 'pvc-9e222d37-984f-45ad-a42b-5fac7892b51f' read successfully
I1011 21:24:47.994475       7 restore_endpoint_v1alpha1.go:105] Restore volume 'pvc-9e222d37-984f-45ad-a42b-5fac7892b51f' created successfully 
I1011 21:24:48.937246       7 restore_endpoint_v1alpha1.go:143] Restore:volumes-full-20201011-043725-61650654-92c8-4fa3-ad5d-326c3f5a553c created for volume "pvc-9e222d37-984f-45ad-a42b-5fac7892b51f" poolUUID:223b4546-5ed3-4e1e-b8b7-7f42c4502dbb
I1011 21:24:49.503239       7 restore_endpoint_v1alpha1.go:143] Restore:volumes-full-20201011-043725-0359f610-7763-428c-b0fc-a92c2f5463c6 created for volume "pvc-9e222d37-984f-45ad-a42b-5fac7892b51f" poolUUID:30381e2c-0392-4b82-b552-ca843fdaf6d7
I1011 21:24:50.294236       7 restore_endpoint_v1alpha1.go:143] Restore:volumes-full-20201011-043725-44813e28-d02f-454e-8891-dd2b65d31a66 created for volume "pvc-9e222d37-984f-45ad-a42b-5fac7892b51f" poolUUID:67623196-d6e5-4956-9eb4-db81259c9089
I1011 21:24:55.476539       7 restore_endpoint_v1alpha1.go:246] Restore:volumes-full-20201011-043725-61650654-92c8-4fa3-ad5d-326c3f5a553c status is InProgress
I1011 21:25:00.685293       7 restore_endpoint_v1alpha1.go:246] Restore:volumes-full-20201011-043725-61650654-92c8-4fa3-ad5d-326c3f5a553c status is InProgress
I1011 21:25:06.459184       7 restore_endpoint_v1alpha1.go:246] Restore:volumes-full-20201011-043725-44813e28-d02f-454e-8891-dd2b65d31a66 status is InProgress
I1011 21:25:12.062433       7 restore_endpoint_v1alpha1.go:246] Restore:volumes-full-20201011-043725-0359f610-7763-428c-b0fc-a92c2f5463c6 status is InProgress
I1011 21:25:17.099452       7 restore_endpoint_v1alpha1.go:246] Restore:volumes-full-20201011-043725-61650654-92c8-4fa3-ad5d-326c3f5a553c status is Done
I1011 21:25:17.099542       7 restore_endpoint_v1alpha1.go:246] Restore:volumes-full-20201011-043725-44813e28-d02f-454e-8891-dd2b65d31a66 status is Done
I1011 21:25:17.099581       7 restore_endpoint_v1alpha1.go:246] Restore:volumes-full-20201011-043725-0359f610-7763-428c-b0fc-a92c2f5463c6 status is Done
I1012 04:38:52.017035       7 backup_endpoint_v1alpha1.go:160] Creating backup snapshot volumes-full-20201012-043837 for volume "pvc-318b2f41-40db-4ee7-8564-4c8677665bbf"
I1012 04:38:53.195149       7 backup_endpoint_v1alpha1.go:167] Backup snapshot:'volumes-full-20201012-043837' created successfully for volume:pvc-318b2f41-40db-4ee7-8564-4c8677665bbf
  • velero restore logs vikunja-test-restore | grep -v 'Skipping namespace': https://pastebin.com/vh9ACfAn
  • velero restore describe vikunja-test-restore:
velero restore describe vikunja-test-restore
Name:         vikunja-test-restore
Namespace:    velero
Labels:       <none>
Annotations:  <none>

Phase:  Completed

Started:    2020-10-11 23:24:29 +0200 CEST
Completed:  2020-10-11 23:25:28 +0200 CEST

Warnings:
  Velero:     <none>
  Cluster:  could not restore, persistentvolumes "pvc-9e222d37-984f-45ad-a42b-5fac7892b51f" already exists. Warning: the in-cluster version is different than the backed-up version.
  Namespaces:
    vikunja-restore:  could not restore, persistentvolumeclaims "storage" already exists. Warning: the in-cluster version is different than the backed-up version.

Backup:  volumes-full-20201011-043725

Namespaces:
  Included:  vikunja
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io
  Cluster-scoped:  auto

Namespace mappings:  vikunja=vikunja-restore

Label selector:  <none>

Restore PVs:  true
  • Logs of the cStor disk pool pod:
$ kubectl logs cstor-disk-pool-7it6-657f7f5897-2tg4b -n openebs cstor-pool | grep -C3 10-11/21:24 
2020-10-11/06:17:47.588 [tgt 10.43.208.76:6060:17]: Destroy snapshot command for cstor-30381e2c-0392-4b82-b552-ca843fdaf6d7/pvc-2ef2eddc-c59f-4222-a112-6fb7ebfccdb3@volumes-full-20201011-043725
2020-10-11/06:17:50.336 [tgt 10.43.42.69:6060:16]: Create snapshot command for cstor-30381e2c-0392-4b82-b552-ca843fdaf6d7/pvc-1c329d55-f024-4306-8489-0efbd65f58c5@volumes-full-20201011-043725
2020-10-11/06:18:29.476 [tgt 10.43.42.69:6060:16]: Destroy snapshot command for cstor-30381e2c-0392-4b82-b552-ca843fdaf6d7/pvc-1c329d55-f024-4306-8489-0efbd65f58c5@volumes-full-20201011-043725
2020-10-11/21:24:42.125 zvol cstor-30381e2c-0392-4b82-b552-ca843fdaf6d7/pvc-9e222d37-984f-45ad-a42b-5fac7892b51f status change: DEGRADED -> DEGRADED
2020-10-11/21:24:42.125 zvol cstor-30381e2c-0392-4b82-b552-ca843fdaf6d7/pvc-9e222d37-984f-45ad-a42b-5fac7892b51f rebuild status change: INIT -> INIT
2020-10-11/21:24:42.126 ERROR target IP address is empty for cstor-30381e2c-0392-4b82-b552-ca843fdaf6d7/pvc-9e222d37-984f-45ad-a42b-5fac7892b51f
2020-10-12/04:38:52.176 [tgt 10.43.120.72:6060:28]: Create snapshot command for cstor-30381e2c-0392-4b82-b552-ca843fdaf6d7/pvc-318b2f41-40db-4ee7-8564-4c8677665bbf@volumes-full-20201012-043837
2020-10-12/04:41:13.316 [tgt 10.43.120.72:6060:28]: Destroy snapshot command for cstor-30381e2c-0392-4b82-b552-ca843fdaf6d7/pvc-318b2f41-40db-4ee7-8564-4c8677665bbf@volumes-full-20201012-043837
2020-10-12/04:41:17.452 [tgt 10.43.80.233:6060:18]: Create snapshot command for cstor-30381e2c-0392-4b82-b552-ca843fdaf6d7/pvc-20bffc04-703d-44b0-8820-e124a9c0bc51@volumes-full-20201012-043837

Anything else you would like to add:

Environment:

  • Velero version (use velero version): v1.5.1
  • Velero features (use velero client config get features):
  • Velero-plugin version: openebs/velero-plugin-arm64:2.1.0
  • OpenEBS version: 2.1.0
  • Kubernetes version (use kubectl version): client 1.19.2, server 1.18.3
  • Kubernetes installer & version: k3os
  • Cloud provider or hardware configuration: arm64
  • OS (e.g. from /etc/os-release): k3os
sgielen added the Bug label on Oct 12, 2020
mittachaitu commented Oct 12, 2020

Hi @sgielen, it looks like the targetIP was not set for the volume replicas (ZFS datasets). Can you set the targetIP by exec'ing into the pool pod, as mentioned in the Note here?

mynktl (Member) commented Oct 12, 2020

@sgielen
A CVR uses the targetip (the target pod's service IP) to connect to the target pod. During a restore, the targetip is intentionally left unset, so that incremental backups can still be restored on top of the full backup restore.

Once the restore completes (and any incremental backups have also been restored, so all the data is retrieved), the targetip needs to be set on each replica.
Steps:

  1. Fetch the targetip:
    kubectl get svc -n openebs <PV_NAME> -o jsonpath='{.spec.clusterIP}'

In your case, the command will be:
target_ip=$(kubectl get svc -n openebs pvc-9e222d37-984f-45ad-a42b-5fac7892b51f -o jsonpath='{.spec.clusterIP}')

  2. Once you have the targetip, set it on every replica of the restored PV using the following commands:
    kubectl exec -it <POOL_POD> -c cstor-pool -n openebs -- bash
    zfs set io.openebs:targetip=<TARGET_IP> <POOL_NAME/VOLUME_NAME>

In your case, it will be (a consolidated sketch follows these steps):

  1. For CVR pvc-9e222d37-984f-45ad-a42b-5fac7892b51f-cstor-disk-pool-eiq0:
    pool_pod=$(kubectl get pods -n openebs | grep cstor-disk-pool-eiq0 | awk '{print $1}')
    kubectl exec -it $pool_pod -c cstor-pool -n openebs -- bash
    POOL_VOLUME_NAME=$(zfs list | grep pvc-9e222d37-984f-45ad-a42b-5fac7892b51f | awk '{print $1}')
    zfs set io.openebs:targetip=$target_ip $POOL_VOLUME_NAME

  2. For CVR pvc-9e222d37-984f-45ad-a42b-5fac7892b51f-cstor-disk-pool-una0:
    pool_pod=$(kubectl get pods -n openebs | grep cstor-disk-pool-una0 | awk '{print $1}')
    kubectl exec -it $pool_pod -c cstor-pool -n openebs -- bash
    POOL_VOLUME_NAME=$(zfs list | grep pvc-9e222d37-984f-45ad-a42b-5fac7892b51f | awk '{print $1}')
    zfs set io.openebs:targetip=$target_ip $POOL_VOLUME_NAME

  3. For CVR pvc-9e222d37-984f-45ad-a42b-5fac7892b51f-cstor-disk-pool-7it6:
    pool_pod=$(kubectl get pods -n openebs | grep cstor-disk-pool-7it6 | awk '{print $1}')
    kubectl exec -it $pool_pod -c cstor-pool -n openebs -- bash
    POOL_VOLUME_NAME=$(zfs list | grep pvc-9e222d37-984f-45ad-a42b-5fac7892b51f | awk '{print $1}')
    zfs set io.openebs:targetip=$target_ip $POOL_VOLUME_NAME
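For reference, the three per-pool steps above can be combined into a single loop. This is a minimal sketch based only on the names appearing in this issue (the restored PV pvc-9e222d37-984f-45ad-a42b-5fac7892b51f, the three cstor-disk-pool-* pools, and the convention that the target Service carries the PV name); it substitutes the target IP on the host and runs the zfs commands non-interactively inside each pool pod:

PV_NAME=pvc-9e222d37-984f-45ad-a42b-5fac7892b51f
# Target service IP of the restored volume (the Service is named after the PV)
target_ip=$(kubectl get svc -n openebs "$PV_NAME" -o jsonpath='{.spec.clusterIP}')

for pool in cstor-disk-pool-eiq0 cstor-disk-pool-una0 cstor-disk-pool-7it6; do
  pool_pod=$(kubectl get pods -n openebs | grep "$pool" | awk '{print $1}')
  # $target_ip and $PV_NAME are expanded on the host; the escaped $(zfs list ...) runs inside the pod
  kubectl exec -n openebs "$pool_pod" -c cstor-pool -- sh -c \
    "zfs set io.openebs:targetip=$target_ip \$(zfs list -H -o name | grep $PV_NAME)"
done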

mynktl self-assigned this on Oct 12, 2020
zlymeda (Contributor) commented Oct 17, 2020

As mentioned on Slack, I'm working on the fix for this issue.

zlymeda (Contributor) commented Oct 17, 2020

I made good progress on this. Just testing restores now.

I will create a PR soon for this project as well as for cstor-operators, and I am also thinking about backporting the fix to the maya repo.

sgielen (Contributor, Author) commented Oct 21, 2020

@zlymeda Let me know when you have a PR, and I will try to retest my restore with it then :)

sgielen (Contributor, Author) commented Oct 25, 2020

> Hi @sgielen, it looks like the targetIP was not set for the volume replicas (ZFS datasets). Can you set the targetIP by exec'ing into the pool pod, as mentioned in the Note here?

For completeness and future reference, this was indeed the problem; after I set the targetip correctly according to the instructions in this thread, the CVRs became Healthy, the Pods started automatically, and the restored application was completely functional, exactly as I had originally expected.

As soon as a PR exists for this issue, I will update my version of the velero-plugin and retry the restore steps without setting target IP to confirm it is no longer necessary to do this manually :-)
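For anyone hitting the same state, a quick way to confirm the property actually took effect on a replica is to read it back from inside the pool pod before expecting the CVR to turn Healthy. A small sketch using the pool and PV names from this issue; reading a user property with zfs get is standard ZFS behaviour:

pool_pod=$(kubectl get pods -n openebs | grep cstor-disk-pool-eiq0 | awk '{print $1}')
# Print the dataset name and the currently configured targetip
kubectl exec -n openebs "$pool_pod" -c cstor-pool -- sh -c \
  "zfs get -H -o name,value io.openebs:targetip \$(zfs list -H -o name | grep pvc-9e222d37)"

Once the value matches the target Service's cluster IP, the CVRs should move from Error to Healthy, as described above.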

zlymeda (Contributor) commented Oct 26, 2020

Single backup / restore is working. I need to make a small change to support restoring from a scheduled backup when multiple backups have been created.
The first restore works and the targetip is set, but the second restore then fails with "device is busy". I tested manually setting the targetip back to empty and retrying the restore, and it worked, so I need to fix this in the code now.

@sgielen the fix will be in three repos: one here, one in maya, and one in cstor-operators.
If you are using cstor-operators, you will need that one. Otherwise, you will just need maya.

If you want to test it now, you can use the images I've created for testing:

  • zlymeda/velero-plugin:ci
  • zlymeda/cstor-volume-mgmt:2.2.0 (for cstor-operators env var: OPENEBS_IO_CSPI_MGMT_IMAGE)
  • zlymeda/cstor-pool-mgmt:2.2.0 (for maya env var: OPENEBS_IO_CSTOR_POOL_MGMT_IMAGE)
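If you do try these test images, swapping them in usually just means pointing the consuming operator at them via the env vars named above. A hedged sketch for a maya-based install, assuming maya-apiserver is the deployment that reads OPENEBS_IO_CSTOR_POOL_MGMT_IMAGE (verify against your own manifests before applying):

# Assumption: on a maya install, maya-apiserver consumes OPENEBS_IO_CSTOR_POOL_MGMT_IMAGE
kubectl set env deployment/maya-apiserver -n openebs \
  OPENEBS_IO_CSTOR_POOL_MGMT_IMAGE=zlymeda/cstor-pool-mgmt:2.2.0
# Existing cstor pool deployments may need to be updated or recreated to pick up the new image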

mynktl (Member) commented Oct 27, 2020

@zlymeda Once the targetip is set, we won't be able to restore again for that PV. #99 is about restoring all the scheduled backups using a single restore; with that, we won't have the "device is busy" issue.

sgielen (Contributor, Author) commented Nov 2, 2020

> Single backup / restore is working. I need to make a small change to support restoring from a scheduled backup when multiple backups have been created. [...]
> If you want to test it now, you can use the images I've created for testing: [...]

@zlymeda, unfortunately I have an arm64-based cluster, and in any case I prefer to build my own images unless they come from upstream. Do you already have a patch to share somewhere that I could test? Thanks!

zlymeda (Contributor) commented Nov 4, 2020

@sgielen I mentioned this issue in the PRs; please check.
