[e2e][backend] Add test cases for restore backup with snapshot created #1384

Open

TachunLin wants to merge 8 commits into main from 1045-restore-backup-with-snapshot

Conversation

TachunLin
Contributor

Which issue(s) this PR fixes:

Issue #1045

What this PR does / why we need it:

According to issue harvester/harvester#4954

We need to add a backend e2e test to the vm_backup_restore integration tests
to cover the case where a VM has both a backup and a snapshot created on it and we restore that VM from the backup.
The restore should complete successfully.

Added the following test cases (a rough sketch of the flow follows the list):

  1. test_with_snapshot_restore_with_new_vm
    • Restore a VM that also has a snapshot created to a new VM
  2. test_with_snapshot_restore_replace_retain_vols
    • Restore a VM that also has a snapshot created to replace the existing VM, retaining volumes
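
A rough sketch of the flow both tests follow (fixture and checker names mirror the existing suite; the vm_snapshots.create call and RestoreSpec.for_new are assumptions here, not the final implementation):

    import pytest

    class TestBackupRestoreWithSnapshot:

        @pytest.mark.dependency(depends=["TestBackupRestoreWithSnapshot::tests_backup_vm"], param=True)
        def test_with_snapshot_restore_with_new_vm(self, api_client, vm_checker, base_vm_with_data):
            unique_vm_name = base_vm_with_data['name']
            vm_snapshot_name = f"{unique_vm_name}-snapshot"

            # Take a snapshot on the VM that already has a backup (signature assumed).
            code, data = api_client.vm_snapshots.create(unique_vm_name, vm_snapshot_name)
            assert 201 == code, (code, data)
            # ...poll api_client.vm_snapshots.get() until status.readyToUse is True...

            # Restore the backup to a new VM while the snapshot exists.
            restored_vm_name = f"{unique_vm_name}-restored"
            spec = api_client.backups.RestoreSpec.for_new(restored_vm_name)
            code, data = api_client.backups.restore(unique_vm_name, spec)
            assert 201 == code, f'Failed to restore backup with new VM: {data}'

            # Wait until the restored VM exists and starts successfully.
            vm_getable, (code, data) = vm_checker.wait_getable(restored_vm_name)
            assert vm_getable, (code, data)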

Special notes for your reviewer:

Test result (triggered locally, executed against a remote ECM lab machine):

  1. All test cases in the TestBackupRestoreWithSnapshot class PASS.

  2. Most of the test cases in the test_4_vm_backup_restore.py file PASS.

  • To avoid affecting the existing test cases in TestBackupRestore, TestBackupRestoreOnMigration, and TestMultipleBackupRestore, and considering stability and future test scalability, I created a separate class for the new test scenario.

api_client.volumes.delete(vol_name)

@pytest.mark.dependency(depends=["TestBackupRestoreWithSnapshot::tests_backup_vm"], param=True)
def test_with_snapshot_restore_replace_retain_vols(
Contributor

Can we parameterize delete_volumes so we can test "Delete Previous Volumes" for both Delete and Retain?

Contributor Author

Using parametrization to cover both the Delete and Retain test cases is a good idea.

However, according to test issue #1045,

we only select "Restore backup to replace existing (retain volume)".

The reason is that when the VM also has a snapshot created, even if we shut down the VM, the backend checks and prevents restoring to replace the existing VM with volumes deleted.

Thus here we test replace-existing with retained volumes only (for reference, a sketch of the suggested parametrization follows).
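
The suggested parametrization would look roughly like this (a hypothetical sketch inside the same test class, not adopted in this PR because of the backend check described above):

    @pytest.mark.parametrize("delete_volumes", [False, True], ids=["retain-vols", "delete-vols"])
    def test_with_snapshot_restore_replace(self, api_client, base_vm_with_data, delete_volumes):
        unique_vm_name = base_vm_with_data['name']
        spec = api_client.backups.RestoreSpec.for_existing(delete_volumes=delete_volumes)
        code, data = api_client.backups.restore(unique_vm_name, spec)
        # With a snapshot present, the delete-volumes variant is expected to be
        # rejected by the backend, so only retain-vols is added in this PR.
        assert 201 == code, f'Failed to restore backup with current VM replaced: {data}'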

Member

The snapshot function is based on the volume, so deleting the volume not working while a snapshot exists should be the expected behavior. Maybe we can discuss the case in the sync-up meeting to double-confirm it.

Contributor Author

Thanks for the suggestion.
After moving the snapshot restore test cases into the original TestBackupRestore class, all the test-generated volumes are automatically cleaned up once the tests complete successfully.

@albinsun
Contributor

albinsun commented Jul 22, 2024

I had a quick try using ECM raven (v1.2.2), but it fails on:

  1. test_with_snapshot_restore_with_new_vm[NFS]
  2. test_with_snapshot_restore_replace_retain_vols[NFS]

Please help to check, thx.

harvester-runtests/43

@bk201
Member

bk201 commented Jul 23, 2024

@TachunLin Please help check the error and make sure the tests run successfully in Jenkins.

@TachunLin
Contributor Author

Thanks for the reminder. I checked the test report of harvester-runtests/43.
Most of the S3-related tests failed at test_connection[S3]::setup.

The reason is that we did not yet have the backup bucket created on our ECM lab MinIO artifact endpoint.
I have created two more buckets, ravens and falcons, for future testing requirements on these machines.

Then I used the same config.yml as the harvester_run_test pipeline to trigger the same tests from my local machine against the remote ravens cluster.

The result: whether I execute TestBackupRestoreWithSnapshot alone or the entire test_4_vm_backup_restore.py,
both runs pass most of the test cases.

  • The TestBackupRestoreWithSnapshot class

  • The test_4_vm_backup_restore file

Next I triggered a new test run, harvester-runtests/49, where I got some failures like the following:

  • TestBackupRestore::test_restore_with_new_vm[S3] and [NFS], TestBackupRestoreWithSnapshot::test_with_snapshot_restore_with_new_vm[S3] and [NFS]

    E       AssertionError: Failed to Start VM(s3-restore-0735071682-09h31m16s956301-07-24) with errors:
    E         Status: 404
    E         API Status(404): {'type': 'error', 'links': {}, 'code': 'NotFound', 'message': 'virtualmachines.kubevirt.io "s3-restore-0735071682-09h31m16s956301-07-24" not found', 'status': 404}
    E       assert False
    


  • test_restore_replace_with_delete_vols[S3] and [NFS], TestBackupRestoreWithSnapshot::test_with_snapshot_restore_replace_retain_vols[S3] and [NFS]

    E       AssertionError: cloud-init writefile failed
    E         Executed stdout: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAAAgQC2J8e7zDDo/Mfgg4cmvt4OJYXOuY+LMfNnl6lQzdVhXJTNnnf2ulA+GMnqDsw2o5QCZ/bYkfXIvhnIHYh9PChucUujFMKhz2F3+q8fXQZqt+p6koAj7toMdmpd66rS8+x9Krmk7rS/0iZn13jqyjSIIsZ0/5fEM13jpVpWIUFC2w==
    E         
    E         Executed stderr: 
    E       assert '0708196929-09h31m16s956301-07-24' in 'ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAAAgQC2J8e7zDDo/Mfgg4cmvt4OJYXOuY+LMfNnl6lQzdVhXJTNnnf2ulA+GMnqDsw2o5QCZ/bYkfXIvhnIHYh9PChucUujFMKhz2F3+q8fXQZqt+p6koAj7toMdmpd66rS8+x9Krmk7rS/0iZn13jqyjSIIsZ0/5fEM13jpVpWIUFC2w==\n'
    


I will continue to investigate why these tests fail on the Jenkins run-test jobs while working fine when triggered from local.

@albinsun
Contributor

...

I will continue to investigate why these tests fail on the Jenkins run-test jobs while working fine when triggered from local.

Hi @TachunLin,
I checked our daily tests and found that 2 basic restore test cases fail too.

  1. TestBackupRestore::test_restore_with_new_vm[S3,NFS]
  2. TestBackupRestore::test_restore_replace_with_delete_vols[S3,NFS]

If they PASS when triggered from local as you mentioned, then we should suspect an environment issue and investigate the ECM lab.

CC @lanfon72

harvester-install-and-test-e2e-daily#328 (v1.3-head)

harvester-install-and-test-e2e-daily#329 (v1.2-head)

@lanfon72
Member

...

  1. TestBackupRestore::test_restore_with_new_vm[S3,NFS]
  2. TestBackupRestore::test_restore_replace_with_delete_vols[S3,NFS]

Those test cases are a known issue, harvester/harvester#4640; in a KVM environment, they might not always reproduce.

@albinsun
Contributor

albinsun commented Aug 2, 2024

Hi @TachunLin,
As discussed, I checked the flaky test case test_restore_with_new_vm and found it's because the VM is not created immediately after triggering api_client.backups.restore, so vm_checker.wait_ip_addresses asserts the VM is not there and fails the test.

Just sent a quick fix PR #1419, please refer to it.
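
The gist of that fix, as a sketch (restored_vm_name and the RestoreSpec.for_new call are assumptions based on the surrounding tests):

    # Restore the backup to a new VM...
    spec = api_client.backups.RestoreSpec.for_new(restored_vm_name)
    code, data = api_client.backups.restore(unique_vm_name, spec)
    assert 201 == code, f'Failed to restore backup with new VM: {data}'

    # ...then wait until the restored VM object is actually gettable before
    # asserting on its IP addresses; otherwise wait_ip_addresses races the VM
    # creation and fails with a 404 NotFound.
    vm_getable, (code, data) = vm_checker.wait_getable(restored_vm_name)
    assert vm_getable, (code, data)

    vm_got_ips, (code, data) = vm_checker.wait_ip_addresses(restored_vm_name, ['default'])
    assert vm_got_ips, f"Failed to Start VM({restored_vm_name}) with errors:\nStatus: {data.get('status')}"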

@TachunLin
Contributor Author

Thank you @albinsun for finding the root cause and creating PR #1419 to fix the flaky case of test_restore_with_new_vm. I really appreciate your help.

I also added the vm_checker.wait_getable check to test_with_snapshot_restore_with_new_vm and test_with_snapshot_restore_replace_retain_vols,

and triggered a new test on the main Jenkins VM against the raven cluster to run the TestBackupRestoreWithSnapshot class only.

It fixes the test_with_snapshot_restore_with_new_vm test cases.

I also triggered the entire test_4_vm_backup_restore; it passes most of the backup and restore cases, except those flaky cases that failed before for other reasons.

Member

@lanfon72 left a comment

Please move those test cases into TestBackupRestore (at the end) to reduce test time; as both depend on the same test_backup_vm, I feel it is not necessary to create another class for them.

class TestBackupRestoreWithSnapshot:

    @pytest.mark.dependency()
    def test_connection(self, api_client, backup_config, config_backup_target):
Member

We already tested the connection in the previous class; we don't need to duplicate the test case.

Contributor Author

Thanks for the suggestion.
The reason for running the connection test again is to keep the backup and restore test cases inside the new test class TestBackupRestoreWithSnapshot fully functional on their own.

According to the test result in the follow-up comment
#1384 (comment)

and considering the trade-off of pros and cons, may I keep these tests in the new test class? Many thanks.

assert 200 == code, f'Failed to test backup target connection: {data}'

@pytest.mark.dependency(depends=["TestBackupRestoreWithSnapshot::test_connection"], param=True)
def tests_backup_vm(self, api_client, wait_timeout, backup_config, base_vm_with_data):
Member

Same as above; it looks like you are going to reuse the test case?

Contributor Author

Thanks for the suggestion.
The reason for running this backup test again is to provide a new backup for all the test cases inside the new test class TestBackupRestoreWithSnapshot.

According to the test result in the follow-up comment
#1384 (comment)

and considering the trade-off of pros and cons, may I keep these tests in the new test class? Many thanks.

Member

If you want to decouple the new test cases from the existing ones, you can just add another fixture to spin up a VM and do the backup (as that is the prerequisite of the new test cases).

The problem here is that if you add those two duplicated test cases (exactly the same as in the previous class), the total number of test cases in the report no longer reflects what we actually implemented and covered.

Adding them after the existing test cases would be easier than creating another fixture; that's why I suggested it.

Contributor

@albinsun Aug 26, 2024

@TachunLin, any update or comment?
The current point is that we should not have two identical tests:

  • TestBackupRestore:tests_backup_vm
  • TestBackupRestoreWithSnapshot:tests_backup_vm

I think one way is to make TestBackupRestoreWithSnapshot:tests_backup_vm a class fixture.
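
A sketch of that idea (the backup call in the fixture mirrors tests_backup_vm, and its exact signature and status code are assumptions here, not the final refactor):

    import pytest

    class TestBackupRestoreWithSnapshot:

        @pytest.fixture(scope="class")
        def vm_backup(self, api_client, wait_timeout, backup_config, base_vm_with_data):
            # Create the backup once for the whole class instead of duplicating tests_backup_vm.
            unique_vm_name = base_vm_with_data['name']
            code, data = api_client.vms.backup(unique_vm_name, unique_vm_name)  # call assumed
            assert code < 300, (code, data)
            # ...wait here until the backup reports ready, as tests_backup_vm does today...
            return unique_vm_name

        def test_with_snapshot_restore_with_new_vm(self, api_client, vm_checker, vm_backup):
            # The restore tests then depend on the class-scoped fixture rather than
            # on a duplicated tests_backup_vm test case.
            ...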

Contributor

@TachunLin,
As discussed, letting you know that /issues/1462 has been fixed, so there should be no more connection failures.
You can continue looking into the refactor for this task item, thx.

Contributor Author

Thanks @albinsun for fixing the connection failure issue.
Based on the discussion and the review comments,

I think it would be better to move the new snapshot restore test cases right after the existing test cases in the same TestBackupRestore class. This addresses the following concerns:

  1. No reuse or duplication of test_connection and tests_backup_vm
  2. Saves time by not running the take-backup action twice

I already made the fix and verified it works when run from local against the remote ECM lab machine.

I will try again on the main Jenkins jobs to double-confirm.

Contributor Author

Tested and PASSED on the main Jenkins Harvester run-test job #102.

All the new snapshot restore tests execute well inside the TestBackupRestore test class.
http://172.19.99.74/job/harvester-runtests/102/TestReport/

The S3-related tests are blocked from proceeding because the environment was broken by a VPN disconnection yesterday.

@TachunLin
Contributor Author

Thanks for the check and suggestion.
When I began adding the restore-backup-with-snapshot related test cases, I also planned to add all of them under the existing TestBackupRestore class.

But when I actually added the new test_with_snapshot_restore_with_new_vm and test_with_snapshot_restore_replace_retain_vols to the end of the TestBackupRestore class,

after triggering the tests we could see that, while the two new test cases were executing,
the existing virtual machine could not perform any restore action, either to a new VM or to replace the existing one.

  • Execution of test_with_snapshot_restore_with_new_vm

    • We can see the VM did not respond to the restore-to-new-VM action
    vokoscreenNG-2024-08-08_15-14-23.mp4
  • Execution of test_with_snapshot_restore_replace_retain_vols

    • The VM also had no response to the replace-existing action
    vokoscreenNG-2024-08-08_15-29-44.mp4
  • On the test report, we can see the two new test cases failed

  • Both of them got the connection failed error

I think the failure may be related to the fact that the entire TestBackupRestore class uses one and the same virtual machine,
and that virtual machine has already gone through the restore-to-new-VM and replace-existing-VM flows in the earlier tests test_restore_with_new_vm and test_restore_replace_with_delete_vols.

Thus I plan to create a separate class, TestBackupRestoreWithSnapshot, so that the restore-with-snapshot related tests run on a separate, clean VM without affecting the existing test cases.

Indeed, it will increase the execution time to rerun the connection test and the take-backup test.

As a trade-off, a separate class has the following benefits:

  1. The new restore-backup-with-snapshot cases run reliably
  2. The stability of the existing backup and restore tests is maintained without interference
  3. Future scalability, in case we need to add more restore-backup-with-snapshot related tests in the future

Given this, may I continue to use the separate class TestBackupRestoreWithSnapshot for the new restore-backup-with-snapshot test cases? Many thanks.

@TachunLin TachunLin force-pushed the 1045-restore-backup-with-snapshot branch from 71c5a97 to 8c0aef0 on August 9, 2024 07:37
@TachunLin TachunLin force-pushed the 1045-restore-backup-with-snapshot branch 2 times, most recently from a77925b to cfdde7e on September 2, 2024 07:10
@TachunLin TachunLin force-pushed the 1045-restore-backup-with-snapshot branch from cfdde7e to 2d28bc4 on September 5, 2024 16:46
@@ -605,9 +605,227 @@ def test_restore_replace_with_vm_shutdown_command(
spec = api_client.backups.RestoreSpec.for_existing(delete_volumes=True)
code, data = api_client.backups.restore(unique_vm_name, spec)
assert 201 == code, f'Failed to restore backup with current VM replaced, {data}'
vm_getable, (code, data) = vm_checker.wait_getable(unique_vm_name)

Contributor

Why remove wait_getable?
Wouldn't it be possible to hit #1419 again?

Contributor Author

Thanks for the check.
That was probably because I missed adding this check back during the rebase.
I have just added the wait_getable action back in the original places.

    code, data = api_client.vm_snapshots.get(vm_snapshot_name)
    if data.get("status", {}).get("readyToUse"):
        break
    print(f"waiting for {vm_snapshot_name} to be ready")
Contributor

Drop temporary code.

Contributor Author

Thanks for the check.
The purpose of the line checking that the vm_snapshot is ready was to ensure the snapshot is actually created before proceeding to the next step.

print(f"waiting for {vm_snapshot_name} to be ready")
sleep(3)
else:
raise AssertionError(f"timed out waiting for {vm_snapshot_name} to be ready")
Contributor

Please include code and data in the error message, like the other asserts.

Contributor Author

Thanks for the suggestion.
Added the code and data to the error message, like the other asserts in the code.

raise AssertionError(f"timed out waiting for {vm_snapshot_name} to be ready")

assert 200 == code
assert data.get("status", {}).get("readyToUse") is True
Contributor

Should these 2 lines be in the while loop?

Contributor Author

Thanks for the suggestion.

Indeed, these 2 lines can be moved into the previous while loop, since otherwise there is a redundant check on
data.get("status", {}).get("readyToUse"). Updated the code accordingly; see the sketch below.

# Check VM Started then get IPs (vm and host)
vm_got_ips, (code, data) = vm_checker.wait_ip_addresses(unique_vm_name, ['default'])
assert vm_got_ips, (
    f"Failed to Start VM({unique_vm_name}) with errors:\n"
Contributor

Do we have any restore or restart here?
If we want to check the VM is still running and accessible, then perhaps alter the error message.

Contributor Author

Thanks for the check.
The VM was not restarted before; this just makes sure the VM is running after taking the snapshot.
I have updated the assert error message accordingly.

    code, data = api_client.vm_snapshots.get(vm_snapshot_name)
    if data.get("status", {}).get("readyToUse"):
        break
    print(f"waiting for {vm_snapshot_name} to be ready")
Contributor

same as previous.

Contributor Author

Thanks for the suggestion.
Added the code and data to the error message, like the other asserts in the code.

raise AssertionError(f"timed out waiting for {vm_snapshot_name} to be ready")

assert 200 == code
assert data.get("status", {}).get("readyToUse") is True
Member

same as previous.

Contributor Author

Thanks for the suggestion.

Indeed, these 2 lines can be moved into the previous while loop, since otherwise there is a redundant check on data.get("status", {}).get("readyToUse").
Updated the code accordingly.

vm_got_ips, (code, data) = vm_checker.wait_ip_addresses(unique_vm_name, ['default'])
assert vm_got_ips, (
    f"Failed to Start VM({unique_vm_name}) with errors:\n"
    f"Status: {data.get('status')}\n"
Contributor

same as previous.

Contributor Author

Thanks for the check.
The VM was not restarted before; this just makes sure the VM is running after taking the snapshot.
I have updated the assert error message accordingly.

) as sh:
    out, err = sh.exec_command(f"echo {pub_key!r} > {base_vm_with_data['data']['path']}")
    assert not err, (out, err)
    sh.exec_command('sync')
Contributor

Is the data-messing operation also needed for restore-to-new?
Since it restores to a new VM, the messed data on the base VM will not be reverted anyway.

@TachunLin TachunLin force-pushed the 1045-restore-backup-with-snapshot branch from 2d28bc4 to 435a78b on September 12, 2024 16:00
@TachunLin TachunLin force-pushed the 1045-restore-backup-with-snapshot branch from 435a78b to 534f329 on September 13, 2024 06:44
@TachunLin TachunLin force-pushed the 1045-restore-backup-with-snapshot branch 2 times, most recently from faa5615 to 534f329 on October 4, 2024 09:42