After a restore, cStorVolume remains in Init, CVRs remain in Error state #127

Closed · sgielen opened this issue Oct 12, 2020 · 10 comments · Fixed by #131

Labels: Bug (issue/pr is a bug/fix to existing feature)
sgielen (Contributor) commented Oct 12, 2020

What steps did you take and what happened:

I wanted to test a full restore of a namespace (a single application). I tested this on a namespace called "vikunja", which contains two Deployments, two Services, two Pods, one Ingress, and one PVC backed by a cStorVolume and three cStorVolumeReplicas that are all Healthy. The restore succeeded; however, the application in the new namespace never came up, because its cStorVolume remains in the Init state and its three cStorVolumeReplicas remain in the Error state.

This is the Restore command I ran:

velero restore create vikunja-test-restore --from-backup volumes-full-20201011-043725 --include-namespaces 'vikunja' --namespace-mappings 'vikunja:vikunja-restore' --restore-volumes=true

Where volumes-full is a (non-Schedule) backup that contains a full copy of most of the volumes in my cluster.

The restore finished successfully in approximately one minute. Indeed, after this, the vikunja-restore namespace contains two Deployments, two Services, two Pods, one Ingress, and one PVC. I modified the Ingress so that the endpoint URI differs from the one in the original namespace, and visited the application at the new URI, expecting to find the same application but in the state from the backup rather than the most recent state. However, the application does not load, because the PVC cannot be mounted: the iSCSI target is not ready.

$ kubectl get cstorvolume -n openebs | grep 9e222
pvc-9e222d37-984f-45ad-a42b-5fac7892b51f   Init      7m48s   50Gi
$ kubectl describe cstorvolume -n openebs pvc-9e222d37-984f-45ad-a42b-5fac7892b51f 
[...]
Events:
  Type    Reason   Age    From                                                             Message
  ----    ------   ----   ----                                                             -------
  Normal  Updated  6m32s  pvc-9e222d37-984f-45ad-a42b-5fac7892b51f-target-5cf9bbd767288st  Updated resize conditions
  Normal  Updated  6m32s  pvc-9e222d37-984f-45ad-a42b-5fac7892b51f-target-5cf9bbd767288st  successfully resized volume from 0 to 50Gi
$ kubectl get cvr -n openebs | grep 9e222
pvc-9e222d37-984f-45ad-a42b-5fac7892b51f-cstor-disk-pool-eiq0   32.6M   1.76M       Error     7m55s
pvc-9e222d37-984f-45ad-a42b-5fac7892b51f-cstor-disk-pool-una0   32.6M   1.76M       Error     7m55s
pvc-9e222d37-984f-45ad-a42b-5fac7892b51f-cstor-disk-pool-7it6   32.6M   1.75M       Error     7m55s
$ kubectl describe cvr -n openebs pvc-9e222d37-984f-45ad-a42b-5fac7892b51f-cstor-disk-pool-eiq0
[...]
Events:
  Type     Reason      Age                    From                Message
  ----     ------      ----                   ----                -------
  Normal   Synced      8m55s                  CStorVolumeReplica  Received Resource create event
  Normal   Created     8m51s                  CStorVolumeReplica  Resource created successfully
  Normal   Synced      8m51s (x2 over 8m55s)  CStorVolumeReplica  Received Resource modify event
  Warning  SyncFailed  26s (x18 over 8m51s)   CStorVolumeReplica  failed to sync CVR error: unable to update snapshot list details in CVR: failed to get the list of snapshots: Output: failed listsnap command for cstor-67623196-d6e5-4956-9eb4-db81259c9089/pvc-9e222d37-984f-45ad-a42b-5fac7892b51f with err 2

I tried to let the situation resolve overnight but to no avail.

What did you expect to happen:

I expected the volume and its replicas to eventually become Healthy with the same contents as the volume had during the restored backup.

The output of the following commands will help us better understand what's going on:

  • maya-apiserver logs (restore is around 21:24):
I1011 06:18:29.198928       7 backup_endpoint_v1alpha1.go:501] Deleting backup snapshot volumes-full-20201011-043725 for volume "pvc-1c329d55-f024-4306-8489-0efbd65f58c5"
I1011 06:18:29.531233       7 backup_endpoint_v1alpha1.go:507] Snapshot:'volumes-full-20201011-043725' deleted successfully for volume:pvc-1c329d55-f024-4306-8489-0efbd65f58c5
I1011 21:24:35.125935       7 volume_endpoint_v1alpha1.go:78] received cas volume request: http method {GET}
I1011 21:24:35.126216       7 volume_endpoint_v1alpha1.go:171] received volume read request: pvc-9e222d37-984f-45ad-a42b-5fac7892b51f
W1011 21:24:35.557971       7 task.go:433] notfound error at runtask {readlistsvc}: error {target service not found}
W1011 21:24:35.560904       7 runner.go:166] nothing to rollback: no rollback tasks were found
2020/10/11 21:24:35.561160 [ERR] http: Request GET /latest/volumes/pvc-9e222d37-984f-45ad-a42b-5fac7892b51f
failed to read volume: volume {pvc-9e222d37-984f-45ad-a42b-5fac7892b51f} not found in namespace {vikunja-restore}
I1011 21:24:35.574866       7 volume_endpoint_v1alpha1.go:78] received cas volume request: http method {POST}
I1011 21:24:35.575543       7 volume_endpoint_v1alpha1.go:135] received volume create request
I1011 21:24:38.119247       7 select.go:154] Overprovisioning restriction policy not added as overprovisioning is enabled on spc cstor-disk-pool
I1011 21:24:39.041828       7 volume_endpoint_v1alpha1.go:165] volume 'pvc-9e222d37-984f-45ad-a42b-5fac7892b51f' created successfully
I1011 21:24:39.058360       7 volume_endpoint_v1alpha1.go:78] received cas volume request: http method {GET}
I1011 21:24:39.058460       7 volume_endpoint_v1alpha1.go:171] received volume read request: pvc-9e222d37-984f-45ad-a42b-5fac7892b51f
I1011 21:24:42.053882       7 volume_endpoint_v1alpha1.go:226] volume 'pvc-9e222d37-984f-45ad-a42b-5fac7892b51f' read successfully
I1011 21:24:47.994475       7 restore_endpoint_v1alpha1.go:105] Restore volume 'pvc-9e222d37-984f-45ad-a42b-5fac7892b51f' created successfully 
I1011 21:24:48.937246       7 restore_endpoint_v1alpha1.go:143] Restore:volumes-full-20201011-043725-61650654-92c8-4fa3-ad5d-326c3f5a553c created for volume "pvc-9e222d37-984f-45ad-a42b-5fac7892b51f" poolUUID:223b4546-5ed3-4e1e-b8b7-7f42c4502dbb
I1011 21:24:49.503239       7 restore_endpoint_v1alpha1.go:143] Restore:volumes-full-20201011-043725-0359f610-7763-428c-b0fc-a92c2f5463c6 created for volume "pvc-9e222d37-984f-45ad-a42b-5fac7892b51f" poolUUID:30381e2c-0392-4b82-b552-ca843fdaf6d7
I1011 21:24:50.294236       7 restore_endpoint_v1alpha1.go:143] Restore:volumes-full-20201011-043725-44813e28-d02f-454e-8891-dd2b65d31a66 created for volume "pvc-9e222d37-984f-45ad-a42b-5fac7892b51f" poolUUID:67623196-d6e5-4956-9eb4-db81259c9089
I1011 21:24:55.476539       7 restore_endpoint_v1alpha1.go:246] Restore:volumes-full-20201011-043725-61650654-92c8-4fa3-ad5d-326c3f5a553c status is InProgress
I1011 21:25:00.685293       7 restore_endpoint_v1alpha1.go:246] Restore:volumes-full-20201011-043725-61650654-92c8-4fa3-ad5d-326c3f5a553c status is InProgress
I1011 21:25:06.459184       7 restore_endpoint_v1alpha1.go:246] Restore:volumes-full-20201011-043725-44813e28-d02f-454e-8891-dd2b65d31a66 status is InProgress
I1011 21:25:12.062433       7 restore_endpoint_v1alpha1.go:246] Restore:volumes-full-20201011-043725-0359f610-7763-428c-b0fc-a92c2f5463c6 status is InProgress
I1011 21:25:17.099452       7 restore_endpoint_v1alpha1.go:246] Restore:volumes-full-20201011-043725-61650654-92c8-4fa3-ad5d-326c3f5a553c status is Done
I1011 21:25:17.099542       7 restore_endpoint_v1alpha1.go:246] Restore:volumes-full-20201011-043725-44813e28-d02f-454e-8891-dd2b65d31a66 status is Done
I1011 21:25:17.099581       7 restore_endpoint_v1alpha1.go:246] Restore:volumes-full-20201011-043725-0359f610-7763-428c-b0fc-a92c2f5463c6 status is Done
I1012 04:38:52.017035       7 backup_endpoint_v1alpha1.go:160] Creating backup snapshot volumes-full-20201012-043837 for volume "pvc-318b2f41-40db-4ee7-8564-4c8677665bbf"
I1012 04:38:53.195149       7 backup_endpoint_v1alpha1.go:167] Backup snapshot:'volumes-full-20201012-043837' created successfully for volume:pvc-318b2f41-40db-4ee7-8564-4c8677665bbf
  • velero restore logs vikunja-test-restore | grep -v 'Skipping namespace': https://pastebin.com/vh9ACfAn
  • velero restore describe vikunja-test-restore:
velero restore describe vikunja-test-restore
Name:         vikunja-test-restore
Namespace:    velero
Labels:       <none>
Annotations:  <none>

Phase:  Completed

Started:    2020-10-11 23:24:29 +0200 CEST
Completed:  2020-10-11 23:25:28 +0200 CEST

Warnings:
  Velero:     <none>
  Cluster:  could not restore, persistentvolumes "pvc-9e222d37-984f-45ad-a42b-5fac7892b51f" already exists. Warning: the in-cluster version is different than the backed-up version.
  Namespaces:
    vikunja-restore:  could not restore, persistentvolumeclaims "storage" already exists. Warning: the in-cluster version is different than the backed-up version.

Backup:  volumes-full-20201011-043725

Namespaces:
  Included:  vikunja
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io
  Cluster-scoped:  auto

Namespace mappings:  vikunja=vikunja-restore

Label selector:  <none>

Restore PVs:  true
  • Logs of the cStor disk pool pod:
$ kubectl logs cstor-disk-pool-7it6-657f7f5897-2tg4b -n openebs cstor-pool | grep -C3 10-11/21:24 
2020-10-11/06:17:47.588 [tgt 10.43.208.76:6060:17]: Destroy snapshot command for cstor-30381e2c-0392-4b82-b552-ca843fdaf6d7/pvc-2ef2eddc-c59f-4222-a112-6fb7ebfccdb3@volumes-full-20201011-043725
2020-10-11/06:17:50.336 [tgt 10.43.42.69:6060:16]: Create snapshot command for cstor-30381e2c-0392-4b82-b552-ca843fdaf6d7/pvc-1c329d55-f024-4306-8489-0efbd65f58c5@volumes-full-20201011-043725
2020-10-11/06:18:29.476 [tgt 10.43.42.69:6060:16]: Destroy snapshot command for cstor-30381e2c-0392-4b82-b552-ca843fdaf6d7/pvc-1c329d55-f024-4306-8489-0efbd65f58c5@volumes-full-20201011-043725
2020-10-11/21:24:42.125 zvol cstor-30381e2c-0392-4b82-b552-ca843fdaf6d7/pvc-9e222d37-984f-45ad-a42b-5fac7892b51f status change: DEGRADED -> DEGRADED
2020-10-11/21:24:42.125 zvol cstor-30381e2c-0392-4b82-b552-ca843fdaf6d7/pvc-9e222d37-984f-45ad-a42b-5fac7892b51f rebuild status change: INIT -> INIT
2020-10-11/21:24:42.126 ERROR target IP address is empty for cstor-30381e2c-0392-4b82-b552-ca843fdaf6d7/pvc-9e222d37-984f-45ad-a42b-5fac7892b51f
2020-10-12/04:38:52.176 [tgt 10.43.120.72:6060:28]: Create snapshot command for cstor-30381e2c-0392-4b82-b552-ca843fdaf6d7/pvc-318b2f41-40db-4ee7-8564-4c8677665bbf@volumes-full-20201012-043837
2020-10-12/04:41:13.316 [tgt 10.43.120.72:6060:28]: Destroy snapshot command for cstor-30381e2c-0392-4b82-b552-ca843fdaf6d7/pvc-318b2f41-40db-4ee7-8564-4c8677665bbf@volumes-full-20201012-043837
2020-10-12/04:41:17.452 [tgt 10.43.80.233:6060:18]: Create snapshot command for cstor-30381e2c-0392-4b82-b552-ca843fdaf6d7/pvc-20bffc04-703d-44b0-8820-e124a9c0bc51@volumes-full-20201012-043837

Anything else you would like to add:

Environment:

  • Velero version (use velero version): v1.5.1
  • Velero features (use velero client config get features):
  • Velero-plugin version: openebs/velero-plugin-arm64:2.1.0
  • OpenEBS version: 2.1.0
  • Kubernetes version (use kubectl version): client 1.19.2, server 1.18.3
  • Kubernetes installer & version: k3os
  • Cloud provider or hardware configuration: arm64
  • OS (e.g. from /etc/os-release): k3os
sgielen added the Bug label on Oct 12, 2020
mittachaitu commented Oct 12, 2020

Hi @sgielen, it looks like the targetIP was not set for the volume replicas (ZFS datasets). Can you set the targetIP by exec'ing into the pool pod, as mentioned in the Note here?

mynktl (Member) commented Oct 12, 2020

@sgielen
A CVR uses the targetip (the target pod's service IP) to connect to the target pod. During a restore, the targetip is intentionally left unset, so that incremental backups can still be restored on top of the full backup restore.

Once the restore completes (and any incremental backups have also been restored, so all the data is retrieved), the targetip needs to be set on each replica.
Steps:

  1. Fetch the targetip:
    kubectl get svc -n openebs <PV_NAME> -o jsonpath='{.spec.clusterIP}'

In your case, the command will be:
target_ip=$(kubectl get svc -n openebs pvc-9e222d37-984f-45ad-a42b-5fac7892b51f -o jsonpath='{.spec.clusterIP}')

  2. Once you have the targetip, set it on every replica of the restored PV using the following commands:
    kubectl exec -it <POOL_POD> -c cstor-pool -n openebs -- bash
    zfs set io.openebs:targetip=<TARGET_IP> <POOL_NAME/VOLUME_NAME>

In your case, it will be (a consolidated sketch follows these steps):

  1. For CVR pvc-9e222d37-984f-45ad-a42b-5fac7892b51f-cstor-disk-pool-eiq0:
    pool_pod=$(kubectl get pods -n openebs | grep cstor-disk-pool-eiq0 | awk '{print $1}')
    kubectl exec -it $pool_pod -c cstor-pool -n openebs -- bash
    POOL_VOLUME_NAME=$(zfs list | grep pvc-9e222d37-984f-45ad-a42b-5fac7892b51f | awk '{print $1}')
    zfs set io.openebs:targetip=$target_ip $POOL_VOLUME_NAME

  2. For CVR pvc-9e222d37-984f-45ad-a42b-5fac7892b51f-cstor-disk-pool-una0:
    pool_pod=$(kubectl get pods -n openebs | grep cstor-disk-pool-una0 | awk '{print $1}')
    kubectl exec -it $pool_pod -c cstor-pool -n openebs -- bash
    POOL_VOLUME_NAME=$(zfs list | grep pvc-9e222d37-984f-45ad-a42b-5fac7892b51f | awk '{print $1}')
    zfs set io.openebs:targetip=$target_ip $POOL_VOLUME_NAME

  3. For CVR pvc-9e222d37-984f-45ad-a42b-5fac7892b51f-cstor-disk-pool-7it6:
    pool_pod=$(kubectl get pods -n openebs | grep cstor-disk-pool-7it6 | awk '{print $1}')
    kubectl exec -it $pool_pod -c cstor-pool -n openebs -- bash
    POOL_VOLUME_NAME=$(zfs list | grep pvc-9e222d37-984f-45ad-a42b-5fac7892b51f | awk '{print $1}')
    zfs set io.openebs:targetip=$target_ip $POOL_VOLUME_NAME
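For reference, the three per-pool steps above can be combined into a single loop. This is a minimal sketch based only on the names appearing in this issue (the restored PV pvc-9e222d37-984f-45ad-a42b-5fac7892b51f, the three cstor-disk-pool-* pools, and the convention that the target Service carries the PV name); it substitutes the target IP on the host and runs the zfs commands non-interactively inside each pool pod:

PV_NAME=pvc-9e222d37-984f-45ad-a42b-5fac7892b51f
# Target service IP of the restored volume (the Service is named after the PV)
target_ip=$(kubectl get svc -n openebs "$PV_NAME" -o jsonpath='{.spec.clusterIP}')

for pool in cstor-disk-pool-eiq0 cstor-disk-pool-una0 cstor-disk-pool-7it6; do
  pool_pod=$(kubectl get pods -n openebs | grep "$pool" | awk '{print $1}')
  # $target_ip and $PV_NAME are expanded on the host; the escaped $(zfs list ...) runs inside the pod
  kubectl exec -n openebs "$pool_pod" -c cstor-pool -- sh -c \
    "zfs set io.openebs:targetip=$target_ip \$(zfs list -H -o name | grep $PV_NAME)"
done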

mynktl self-assigned this on Oct 12, 2020
zlymeda (Contributor) commented Oct 17, 2020

As mentioned on Slack, I'm working on the fix for this issue.

zlymeda (Contributor) commented Oct 17, 2020

I made good progress on this. Just testing restores now.

I will create a PR soon for this project as well as for cstor-operators, and I am also thinking about backporting the fix to the maya repo.

sgielen (Contributor, Author) commented Oct 21, 2020

@zlymeda Let me know when you have a PR, and I will try to retest my restore with it then :)

sgielen (Contributor, Author) commented Oct 25, 2020

> Hi @sgielen, it looks like the targetIP was not set for the volume replicas (ZFS datasets). Can you set the targetIP by exec'ing into the pool pod, as mentioned in the Note here?

For completeness and future reference, this was indeed the problem; after I set the targetip correctly according to the instructions in this thread, the CVRs became Healthy, the Pods started automatically, and the restored application was completely functional, exactly as I had originally expected.

As soon as a PR exists for this issue, I will update my version of the velero-plugin and retry the restore steps without setting target IP to confirm it is no longer necessary to do this manually :-)
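For anyone hitting the same state, a quick way to confirm the property actually took effect on a replica is to read it back from inside the pool pod before expecting the CVR to turn Healthy. A small sketch using the pool and PV names from this issue; reading a user property with zfs get is standard ZFS behaviour:

pool_pod=$(kubectl get pods -n openebs | grep cstor-disk-pool-eiq0 | awk '{print $1}')
# Print the dataset name and the currently configured targetip
kubectl exec -n openebs "$pool_pod" -c cstor-pool -- sh -c \
  "zfs get -H -o name,value io.openebs:targetip \$(zfs list -H -o name | grep pvc-9e222d37)"

Once the value matches the target Service's cluster IP, the CVRs should move from Error to Healthy, as described above.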

zlymeda (Contributor) commented Oct 26, 2020

Single backup / restore is working. I need to make a small change to support restoring from a scheduled backup when multiple backups have been created.
The first restore works and the targetip is set, but the second restore then fails with "device is busy". I tested manually setting the targetip back to empty and retrying the restore, and it worked, so I need to fix this in the code now.

@sgielen the fix will be in three repos: one here, one in maya, and one in cstor-operators.
If you are using cstor-operators, you will need that one. Otherwise, you will just need maya.

If you want to test it now, you can use the images I've created for testing:

  • zlymeda/velero-plugin:ci
  • zlymeda/cstor-volume-mgmt:2.2.0 (for cstor-operators env var: OPENEBS_IO_CSPI_MGMT_IMAGE)
  • zlymeda/cstor-pool-mgmt:2.2.0 (for maya env var: OPENEBS_IO_CSTOR_POOL_MGMT_IMAGE)
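If you do try these test images, swapping them in usually just means pointing the consuming operator at them via the env vars named above. A hedged sketch for a maya-based install, assuming maya-apiserver is the deployment that reads OPENEBS_IO_CSTOR_POOL_MGMT_IMAGE (verify against your own manifests before applying):

# Assumption: on a maya install, maya-apiserver consumes OPENEBS_IO_CSTOR_POOL_MGMT_IMAGE
kubectl set env deployment/maya-apiserver -n openebs \
  OPENEBS_IO_CSTOR_POOL_MGMT_IMAGE=zlymeda/cstor-pool-mgmt:2.2.0
# Existing cstor pool deployments may need to be updated or recreated to pick up the new image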

mynktl (Member) commented Oct 27, 2020

@zlymeda Once the targetip is set, we won't be able to restore again for that PV. #99 is about restoring all the scheduled backups using a single restore; with that, we won't have the "device is busy" issue.

sgielen (Contributor, Author) commented Nov 2, 2020

> Single backup / restore is working. I need to make a small change to support restoring from a scheduled backup when multiple backups have been created. [...]
> If you want to test it now, you can use the images I've created for testing: [...]

@zlymeda, unfortunately I have an arm64-based cluster, and in any case I prefer to build my own images unless they come from upstream. Do you already have a patch to share somewhere that I could test? Thanks!

zlymeda (Contributor) commented Nov 4, 2020

@sgielen I mentioned this issue in the PRs; please check.
