[e2e][backend] Add test cases for restore backup with snapshot created #1384

Open

TachunLin wants to merge 8 commits into main from 1045-restore-backup-with-snapshot

Conversation

TachunLin
Contributor

Which issue(s) this PR fixes:

Issue #1045

What this PR does / why we need it:

According to issue harvester/harvester#4954

We need to add a backend e2e test to the vm_backup_restore integration tests
to cover the case where a VM has both a backup and a snapshot created on it and we restore that VM from the backup.
The restore should complete successfully.

Added the following test cases (a rough sketch of the flow follows the list):

  1. test_with_snapshot_restore_with_new_vm
    • Restore a VM that also has a snapshot created to a new VM
  2. test_with_snapshot_restore_replace_retain_vols
    • Restore a VM that also has a snapshot created to replace the existing VM, retaining volumes
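
A rough sketch of the flow both tests follow (fixture and checker names mirror the existing suite; the vm_snapshots.create call and RestoreSpec.for_new are assumptions here, not the final implementation):

    import pytest

    class TestBackupRestoreWithSnapshot:

        @pytest.mark.dependency(depends=["TestBackupRestoreWithSnapshot::tests_backup_vm"], param=True)
        def test_with_snapshot_restore_with_new_vm(self, api_client, vm_checker, base_vm_with_data):
            unique_vm_name = base_vm_with_data['name']
            vm_snapshot_name = f"{unique_vm_name}-snapshot"

            # Take a snapshot on the VM that already has a backup (signature assumed).
            code, data = api_client.vm_snapshots.create(unique_vm_name, vm_snapshot_name)
            assert 201 == code, (code, data)
            # ...poll api_client.vm_snapshots.get() until status.readyToUse is True...

            # Restore the backup to a new VM while the snapshot exists.
            restored_vm_name = f"{unique_vm_name}-restored"
            spec = api_client.backups.RestoreSpec.for_new(restored_vm_name)
            code, data = api_client.backups.restore(unique_vm_name, spec)
            assert 201 == code, f'Failed to restore backup with new VM: {data}'

            # Wait until the restored VM exists and starts successfully.
            vm_getable, (code, data) = vm_checker.wait_getable(restored_vm_name)
            assert vm_getable, (code, data)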

Special notes for your reviewer:

Test result (triggered locally, executed against a remote ECM lab machine):

  1. All test cases in the TestBackupRestoreWithSnapshot class PASS.

  2. Most of the test cases in the test_4_vm_backup_restore.py file PASS.

  • To avoid affecting the existing test cases in TestBackupRestore, TestBackupRestoreOnMigration, and TestMultipleBackupRestore, and considering stability and future test scalability, I created a separate class for the new test scenario.

api_client.volumes.delete(vol_name)

@pytest.mark.dependency(depends=["TestBackupRestoreWithSnapshot::tests_backup_vm"], param=True)
def test_with_snapshot_restore_replace_retain_vols(
Contributor

Can we parameterize delete_volumes so we can test "Delete Previous Volumes" for both Delete and Retain?

Contributor Author

Using parametrization to cover both the Delete and Retain test cases is a good idea.

However, according to test issue #1045,

we only select "Restore backup to replace existing (retain volume)".

The reason is that when the VM also has a snapshot created, even if we shut down the VM, the backend checks and prevents restoring to replace the existing VM with volumes deleted.

Thus here we test replace-existing with retained volumes only (for reference, a sketch of the suggested parametrization follows).
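
The suggested parametrization would look roughly like this (a hypothetical sketch inside the same test class, not adopted in this PR because of the backend check described above):

    @pytest.mark.parametrize("delete_volumes", [False, True], ids=["retain-vols", "delete-vols"])
    def test_with_snapshot_restore_replace(self, api_client, base_vm_with_data, delete_volumes):
        unique_vm_name = base_vm_with_data['name']
        spec = api_client.backups.RestoreSpec.for_existing(delete_volumes=delete_volumes)
        code, data = api_client.backups.restore(unique_vm_name, spec)
        # With a snapshot present, the delete-volumes variant is expected to be
        # rejected by the backend, so only retain-vols is added in this PR.
        assert 201 == code, f'Failed to restore backup with current VM replaced: {data}'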

Member

The snapshot function is based on the volume, so deleting the volume not working while a snapshot exists should be the expected behavior. Maybe we can discuss the case in the sync-up meeting to double-confirm it.

Contributor Author

Thanks for the suggestion.
After moving the snapshot restore test cases into the original TestBackupRestore class, all the test-generated volumes are automatically cleaned up once the tests complete successfully.

@albinsun
Contributor

albinsun commented Jul 22, 2024

I had a quick try using ECM raven (v1.2.2), but it fails on:

  1. test_with_snapshot_restore_with_new_vm[NFS]
  2. test_with_snapshot_restore_replace_retain_vols[NFS]

Please help to check, thx.

harvester-runtests/43

@bk201
Member

bk201 commented Jul 23, 2024

@TachunLin Please help check the error and make sure the tests run successfully in Jenkins.

@TachunLin
Contributor Author

Thanks for the reminder. I checked the test report of harvester-runtests/43.
Most of the S3-related tests failed at test_connection[S3]::setup.

The reason is that we did not yet have the backup bucket created on our ECM lab MinIO artifact endpoint.
I have created two more buckets, ravens and falcons, for future testing requirements on these machines.

Then I used the same config.yml as the harvester_run_test pipeline to trigger the same tests from my local machine against the remote ravens cluster.

The result: whether I execute TestBackupRestoreWithSnapshot alone or the entire test_4_vm_backup_restore.py,
both runs pass most of the test cases.

  • The TestBackupRestoreWithSnapshot class

  • The test_4_vm_backup_restore file

Next I triggered a new test run, harvester-runtests/49, where I got some failures like the following:

  • TestBackupRestore::test_restore_with_new_vm[S3] and [NFS], TestBackupRestoreWithSnapshot::test_with_snapshot_restore_with_new_vm[S3] and [NFS]

    E       AssertionError: Failed to Start VM(s3-restore-0735071682-09h31m16s956301-07-24) with errors:
    E         Status: 404
    E         API Status(404): {'type': 'error', 'links': {}, 'code': 'NotFound', 'message': 'virtualmachines.kubevirt.io "s3-restore-0735071682-09h31m16s956301-07-24" not found', 'status': 404}
    E       assert False
    


  • test_restore_replace_with_delete_vols[S3] and [NFS], TestBackupRestoreWithSnapshot::test_with_snapshot_restore_replace_retain_vols[S3] and [NFS]

    E       AssertionError: cloud-init writefile failed
    E         Executed stdout: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAAAgQC2J8e7zDDo/Mfgg4cmvt4OJYXOuY+LMfNnl6lQzdVhXJTNnnf2ulA+GMnqDsw2o5QCZ/bYkfXIvhnIHYh9PChucUujFMKhz2F3+q8fXQZqt+p6koAj7toMdmpd66rS8+x9Krmk7rS/0iZn13jqyjSIIsZ0/5fEM13jpVpWIUFC2w==
    E         
    E         Executed stderr: 
    E       assert '0708196929-09h31m16s956301-07-24' in 'ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAAAgQC2J8e7zDDo/Mfgg4cmvt4OJYXOuY+LMfNnl6lQzdVhXJTNnnf2ulA+GMnqDsw2o5QCZ/bYkfXIvhnIHYh9PChucUujFMKhz2F3+q8fXQZqt+p6koAj7toMdmpd66rS8+x9Krmk7rS/0iZn13jqyjSIIsZ0/5fEM13jpVpWIUFC2w==\n'
    


I will continue to investigate why these tests fail on the Jenkins run-test jobs while working fine when triggered from local.

@albinsun
Contributor

...

I will continue to investigate why these tests fail on the Jenkins run-test jobs while working fine when triggered from local.

Hi @TachunLin,
I checked our daily tests and found that 2 basic restore test cases fail too.

  1. TestBackupRestore::test_restore_with_new_vm[S3,NFS]
  2. TestBackupRestore::test_restore_replace_with_delete_vols[S3,NFS]

If they PASS when triggered from local as you mentioned, then we should suspect an environment issue and investigate the ECM lab.

CC @lanfon72

harvester-install-and-test-e2e-daily#328 (v1.3-head)

harvester-install-and-test-e2e-daily#329 (v1.2-head)

@lanfon72
Member

...

  1. TestBackupRestore::test_restore_with_new_vm[S3,NFS]
  2. TestBackupRestore::test_restore_replace_with_delete_vols[S3,NFS]

Those test cases are a known issue, harvester/harvester#4640; in a KVM environment, they might not always reproduce.

@albinsun
Contributor

albinsun commented Aug 2, 2024

Hi @TachunLin,
As discussed, I checked the flaky test case test_restore_with_new_vm and found it's because the VM is not created immediately after triggering api_client.backups.restore, so vm_checker.wait_ip_addresses asserts the VM is not there and fails the test.

Just sent a quick fix PR #1419, please refer to it.
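
The gist of that fix, as a sketch (restored_vm_name and the RestoreSpec.for_new call are assumptions based on the surrounding tests):

    # Restore the backup to a new VM...
    spec = api_client.backups.RestoreSpec.for_new(restored_vm_name)
    code, data = api_client.backups.restore(unique_vm_name, spec)
    assert 201 == code, f'Failed to restore backup with new VM: {data}'

    # ...then wait until the restored VM object is actually gettable before
    # asserting on its IP addresses; otherwise wait_ip_addresses races the VM
    # creation and fails with a 404 NotFound.
    vm_getable, (code, data) = vm_checker.wait_getable(restored_vm_name)
    assert vm_getable, (code, data)

    vm_got_ips, (code, data) = vm_checker.wait_ip_addresses(restored_vm_name, ['default'])
    assert vm_got_ips, f"Failed to Start VM({restored_vm_name}) with errors:\nStatus: {data.get('status')}"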

@TachunLin
Contributor Author

Thank you @albinsun for finding the root cause and creating PR #1419 to fix the flaky case of test_restore_with_new_vm. I really appreciate your help.

I also added the vm_checker.wait_getable check to test_with_snapshot_restore_with_new_vm and test_with_snapshot_restore_replace_retain_vols,

and triggered a new test on the main Jenkins VM against the raven cluster to run the TestBackupRestoreWithSnapshot class only.

It fixes the test_with_snapshot_restore_with_new_vm test cases.

I also triggered the entire test_4_vm_backup_restore; it passes most of the backup and restore cases, except those flaky cases that failed before for other reasons.

Member

@lanfon72 left a comment

Please move those test cases into TestBackupRestore (at the end) to reduce test time; as both depend on the same test_backup_vm, I feel it is not necessary to create another class for them.

class TestBackupRestoreWithSnapshot:

    @pytest.mark.dependency()
    def test_connection(self, api_client, backup_config, config_backup_target):
Member

We already tested the connection in the previous class; we don't need to duplicate the test case.

Contributor Author

Thanks for the suggestion.
The reason for running the connection test again is to keep the backup and restore test cases inside the new test class TestBackupRestoreWithSnapshot fully functional on their own.

According to the test result in the follow-up comment
#1384 (comment)

and considering the trade-off of pros and cons, may I keep these tests in the new test class? Many thanks.

assert 200 == code, f'Failed to test backup target connection: {data}'

@pytest.mark.dependency(depends=["TestBackupRestoreWithSnapshot::test_connection"], param=True)
def tests_backup_vm(self, api_client, wait_timeout, backup_config, base_vm_with_data):
Member

Same as above; it looks like you are going to reuse the test case?

Contributor Author

Thanks for the suggestion.
The reason for running this backup test again is to provide a new backup for all the test cases inside the new test class TestBackupRestoreWithSnapshot.

According to the test result in the follow-up comment
#1384 (comment)

and considering the trade-off of pros and cons, may I keep these tests in the new test class? Many thanks.

Member

If you want to decouple the new test cases from the existing ones, you can just add another fixture to spin up a VM and do the backup (as that is the prerequisite of the new test cases).

The problem here is that if you add those two duplicated test cases (exactly the same as in the previous class), the total number of test cases in the report no longer reflects what we actually implemented and covered.

Adding them after the existing test cases would be easier than creating another fixture; that's why I suggested it.

Contributor

@albinsun Aug 26, 2024

@TachunLin, any update or comment?
The current point is that we should not have two identical tests:

  • TestBackupRestore:tests_backup_vm
  • TestBackupRestoreWithSnapshot:tests_backup_vm

I think one way is to make TestBackupRestoreWithSnapshot:tests_backup_vm a class fixture.
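
A sketch of that idea (the backup call in the fixture mirrors tests_backup_vm, and its exact signature and status code are assumptions here, not the final refactor):

    import pytest

    class TestBackupRestoreWithSnapshot:

        @pytest.fixture(scope="class")
        def vm_backup(self, api_client, wait_timeout, backup_config, base_vm_with_data):
            # Create the backup once for the whole class instead of duplicating tests_backup_vm.
            unique_vm_name = base_vm_with_data['name']
            code, data = api_client.vms.backup(unique_vm_name, unique_vm_name)  # call assumed
            assert code < 300, (code, data)
            # ...wait here until the backup reports ready, as tests_backup_vm does today...
            return unique_vm_name

        def test_with_snapshot_restore_with_new_vm(self, api_client, vm_checker, vm_backup):
            # The restore tests then depend on the class-scoped fixture rather than
            # on a duplicated tests_backup_vm test case.
            ...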

Contributor

@TachunLin,
As discussed, letting you know that /issues/1462 has been fixed, so there should be no more connection failures.
You can continue looking into the refactor for this task item, thx.

Contributor Author

Thanks @albinsun for fixing the connection failure issue.
Based on the discussion and the review comments,

I think it would be better to move the new snapshot restore test cases right after the existing test cases in the same TestBackupRestore class. This addresses the following concerns:

  1. No reuse or duplication of test_connection and tests_backup_vm
  2. Saves time by not running the take-backup action twice

I already made the fix and verified it works when run from local against the remote ECM lab machine.

I will try again on the main Jenkins jobs to double-confirm.

Contributor Author

Tested and PASSED on the main Jenkins Harvester run-test job #102.

All the new snapshot restore tests execute well inside the TestBackupRestore test class.
http://172.19.99.74/job/harvester-runtests/102/TestReport/

The S3-related tests are blocked from proceeding because the environment was broken by a VPN disconnection yesterday.

@TachunLin
Contributor Author

Thanks for the check and suggestion.
When I began adding the restore-backup-with-snapshot related test cases, I also planned to add all of them under the existing TestBackupRestore class.

But when I actually added the new test_with_snapshot_restore_with_new_vm and test_with_snapshot_restore_replace_retain_vols to the end of the TestBackupRestore class,

after triggering the tests we could see that, while the two new test cases were executing,
the existing virtual machine could not perform any restore action, either to a new VM or to replace the existing one.

  • Execution of test_with_snapshot_restore_with_new_vm

    • We can see the VM did not respond to the restore-to-new-VM action
    vokoscreenNG-2024-08-08_15-14-23.mp4
  • Execution of test_with_snapshot_restore_replace_retain_vols

    • The VM also had no response to the replace-existing action
    vokoscreenNG-2024-08-08_15-29-44.mp4
  • On the test report, we can see the two new test cases failed

  • Both of them got the connection failed error

I think the failure may be related to the fact that the entire TestBackupRestore class uses one and the same virtual machine,
and that virtual machine has already gone through the restore-to-new-VM and replace-existing-VM flows in the earlier tests test_restore_with_new_vm and test_restore_replace_with_delete_vols.

Thus I plan to create a separate class, TestBackupRestoreWithSnapshot, so that the restore-with-snapshot related tests run on a separate, clean VM without affecting the existing test cases.

Indeed, it will increase the execution time to rerun the connection test and the take-backup test.

As a trade-off, a separate class has the following benefits:

  1. The new restore-backup-with-snapshot cases run reliably
  2. The stability of the existing backup and restore tests is maintained without interference
  3. Future scalability, in case we need to add more restore-backup-with-snapshot related tests in the future

Given this, may I continue to use the separate class TestBackupRestoreWithSnapshot for the new restore-backup-with-snapshot test cases? Many thanks.

@TachunLin TachunLin force-pushed the 1045-restore-backup-with-snapshot branch from 71c5a97 to 8c0aef0 on August 9, 2024 07:37
@TachunLin TachunLin force-pushed the 1045-restore-backup-with-snapshot branch 2 times, most recently from a77925b to cfdde7e on September 2, 2024 07:10
@TachunLin TachunLin force-pushed the 1045-restore-backup-with-snapshot branch from cfdde7e to 2d28bc4 on September 5, 2024 16:46
@@ -605,9 +605,227 @@ def test_restore_replace_with_vm_shutdown_command(
spec = api_client.backups.RestoreSpec.for_existing(delete_volumes=True)
code, data = api_client.backups.restore(unique_vm_name, spec)
assert 201 == code, f'Failed to restore backup with current VM replaced, {data}'
vm_getable, (code, data) = vm_checker.wait_getable(unique_vm_name)

Contributor

Why remove wait_getable?
Wouldn't it be possible to hit #1419 again?

Contributor Author

Thanks for the check.
That was probably because I missed adding this check back during the rebase.
I have just added the wait_getable action back in the original places.

    code, data = api_client.vm_snapshots.get(vm_snapshot_name)
    if data.get("status", {}).get("readyToUse"):
        break
    print(f"waiting for {vm_snapshot_name} to be ready")
Contributor

Drop temporary code.

Contributor Author

Thanks for the check.
The purpose of the line checking that the vm_snapshot is ready was to ensure the snapshot is actually created before proceeding to the next step.

print(f"waiting for {vm_snapshot_name} to be ready")
sleep(3)
else:
raise AssertionError(f"timed out waiting for {vm_snapshot_name} to be ready")
Contributor

Please include code and data in the error message, like the other asserts.

Contributor Author

Thanks for the suggestion.
Added the code and data to the error message, like the other asserts in the code.

raise AssertionError(f"timed out waiting for {vm_snapshot_name} to be ready")

assert 200 == code
assert data.get("status", {}).get("readyToUse") is True
Contributor

Should these 2 lines be in the while loop?

Contributor Author

Thanks for the suggestion.

Indeed, these 2 lines can be moved into the previous while loop, since otherwise there is a redundant check on
data.get("status", {}).get("readyToUse"). Updated the code accordingly; see the sketch below.

# Check VM Started then get IPs (vm and host)
vm_got_ips, (code, data) = vm_checker.wait_ip_addresses(unique_vm_name, ['default'])
assert vm_got_ips, (
    f"Failed to Start VM({unique_vm_name}) with errors:\n"
Contributor

Do we have any restore or restart here?
If we want to check the VM is still running and accessible, then perhaps alter the error message.

Contributor Author

Thanks for the check.
The VM was not restarted before; this just makes sure the VM is running after taking the snapshot.
I have updated the assert error message accordingly.

    code, data = api_client.vm_snapshots.get(vm_snapshot_name)
    if data.get("status", {}).get("readyToUse"):
        break
    print(f"waiting for {vm_snapshot_name} to be ready")
Contributor

same as previous.

Contributor Author

Thanks for the suggestion.
Added the code and data to the error message, like the other asserts in the code.

raise AssertionError(f"timed out waiting for {vm_snapshot_name} to be ready")

assert 200 == code
assert data.get("status", {}).get("readyToUse") is True
Member

same as previous.

Contributor Author

Thanks for the suggestion.

Indeed, these 2 lines can be moved into the previous while loop, since otherwise there is a redundant check on data.get("status", {}).get("readyToUse").
Updated the code accordingly.

vm_got_ips, (code, data) = vm_checker.wait_ip_addresses(unique_vm_name, ['default'])
assert vm_got_ips, (
    f"Failed to Start VM({unique_vm_name}) with errors:\n"
    f"Status: {data.get('status')}\n"
Contributor

same as previous.

Contributor Author

Thanks for the check.
The VM was not restarted before; this just makes sure the VM is running after taking the snapshot.
I have updated the assert error message accordingly.

) as sh:
    out, err = sh.exec_command(f"echo {pub_key!r} > {base_vm_with_data['data']['path']}")
    assert not err, (out, err)
    sh.exec_command('sync')
Contributor

Is the data-messing operation also needed for restore-to-new?
Since it restores to a new VM, the messed data on the base VM will not be reverted anyway.

@TachunLin TachunLin force-pushed the 1045-restore-backup-with-snapshot branch from 2d28bc4 to 435a78b on September 12, 2024 16:00
@TachunLin TachunLin force-pushed the 1045-restore-backup-with-snapshot branch from 435a78b to 534f329 on September 13, 2024 06:44
@TachunLin TachunLin force-pushed the 1045-restore-backup-with-snapshot branch 2 times, most recently from faa5615 to 534f329 on October 4, 2024 09:42