Failure to report reinstall by metal-hammer results in fresh install with wiped disks #207
Comments
Here are related metal-api error logs:
{"level":"error","time":"2021-08-04T12:48:48.296Z","caller":"rest/middleware.go:88","message":"Rest Call","rqid":"5019de598aab8c4f96ea782160836959","remoteaddr":"10.20.4.3","method":"GET","uri":"/metal/v1/machine/abort-reinstall","route":"/metal/v1/machine/{id}","rqid":"5019de598aab8c4f96ea782160836959","remoteaddr":"10.20.4.3","method":"GET","uri":"/metal/v1/machine/abort-reinstall","route":"/metal/v1/machine/{id}","status":403,"content-length":55,"duration":0.000260831}
{"level":"error","time":"2021-08-04T12:48:48.296Z","caller":"rest/middleware.go:100","message":"cannot get user from request","rqid":"5019de598aab8c4f96ea782160836959","remoteaddr":"10.20.4.3","method":"GET","uri":"/metal/v1/machine/abort-reinstall","route":"/metal/v1/machine/{id}","error":"Wrong HMAC found","got":"667c82a5692d7e1013a88b022af0f367e3d349860818ff8714b685bed3ae91e3","want":"4a2e464619017e291fd0113c5a884b89d825a5e8ea8fbd6b1a83eb5f1eab86c8"}
{"level":"error","time":"2021-08-04T12:48:48.274Z","caller":"rest/middleware.go:88","message":"Rest Call","rqid":"bc0c64ff4f0ae73854a35da29d30d735","remoteaddr":"10.20.4.3","method":"POST","uri":"/metal/v1/machine/eff49e00-6ff4-11e9-8000-efbeaddeefbe/finalize-allocation","route":"/metal/v1/machine/{id}/finalize-allocation","useremail":"[email protected]","rqid":"bc0c64ff4f0ae73854a35da29d30d735","remoteaddr":"10.20.4.3","method":"POST","uri":"/metal/v1/machine/eff49e00-6ff4-11e9-8000-efbeaddeefbe/finalize-allocation","route":"/metal/v1/machine/{id}/finalize-allocation","status":422,"content-length":137,"duration":0.087437639}
{"level":"error","time":"2021-08-04T12:48:48.274Z","caller":"service/service.go:96","message":"service error","rqid":"bc0c64ff4f0ae73854a35da29d30d735","remoteaddr":"10.20.4.3","method":"POST","uri":"/metal/v1/machine/eff49e00-6ff4-11e9-8000-efbeaddeefbe/finalize-allocation","route":"/metal/v1/machine/{id}/finalize-allocation","useremail":"[email protected]","operation":"finalizeAllocation","status":422,"error":"the machine \"eff49e00-6ff4-11e9-8000-efbeaddeefbe\" could not be enslaved into the vrf vrf731","service-caller":"machine-service.go:1635","resp":"the machine \"eff49e00-6ff4-11e9-8000-efbeaddeefbe\" could not be enslaved into the vrf vrf731 (422)"}
And these are from metal-core:
2021-08-04T12:48:48.301245+00:00 nbg-w8101-r02leaf01 docker[16471]: 2021-08-04T12:48:48.298Z#011error#011endpoint/abortReinstall.go:38#011Failed to abort reinstall#011{"app": "metal-core", "statusCode": 500, "machineID": "eff49e00-6ff4-11e9-8000-efbeaddeefbe", "primary disk already wiped": true, "boot information": null}
2021-08-04T12:48:48.299985+00:00 nbg-w8101-r02leaf01 docker[16471]: 2021-08-04T12:48:48.298Z#011error#011api/abortReinstall.go:22#011Failed to abort reinstall#011{"app": "metal-core", "machineID": "eff49e00-6ff4-11e9-8000-efbeaddeefbe", "primary disk already wiped": true, "error": "[POST /v1/machine/{id}/abort-reinstall][403] abortReinstallMachine default Wrong HMAC found (403)"}
2021-08-04T12:48:48.279567+00:00 nbg-w8101-r02leaf01 docker[16471]: 2021-08-04T12:48:48.276Z#011error#011endpoint/report.go:57#011Unable to report machine back to api.#011{"app": "metal-core", "machineID": "eff49e00-6ff4-11e9-8000-efbeaddeefbe", "error": "[POST /v1/machine/{id}/finalize-allocation][422] finalizeAllocation default the machine \"eff49e00-6ff4-11e9-8000-efbeaddeefbe\" could not be enslaved into the vrf vrf731 (422)"}
2021-08-04T12:48:48.278363+00:00 nbg-w8101-r02leaf01 docker[16471]: 2021-08-04T12:48:48.275Z#011error#011api/finalizeAllocation.go:27#011Finalize failed#011{"app": "metal-core", "machineID": "eff49e00-6ff4-11e9-8000-efbeaddeefbe", "error": "[POST /v1/machine/{id}/finalize-allocation][422] finalizeAllocation default the machine \"eff49e00-6ff4-11e9-8000-efbeaddeefbe\" could not be enslaved into the vrf vrf731 (422)"}
I tried to reconstruct what happened; here is what I found:
This certainly is a cluster of very unhappy events. My question would be: why does the metal-hammer wipe disks when there is an allocation on the machine? I think this should be the first thing to rule out before starting to wipe disks. Maybe this condition is too specific: https://github.com/metal-stack/metal-hammer/blob/v0.9.1/cmd/root.go#L129. I'll investigate a little more to see what we can do.
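To make the idea concrete, here is a minimal sketch of such a guard. It is not the actual metal-hammer code: `Machine`, `Allocation`, `ErrAllocated` and `CheckFullWipeAllowed` are hypothetical names made up for illustration, and it assumes the allocation state is known before any disk is touched.

```go
package main

import (
	"errors"
	"fmt"
)

// Allocation is a hypothetical, simplified stand-in for the allocation
// data the API returns for a machine.
type Allocation struct {
	Reinstall bool // true when a reinstall was explicitly requested
}

// Machine is a hypothetical, simplified machine record.
type Machine struct {
	ID         string
	Allocation *Allocation // nil when the machine is not allocated
}

// ErrAllocated signals that a full wipe was refused.
var ErrAllocated = errors.New("machine is allocated, refusing to wipe all disks")

// CheckFullWipeAllowed returns nil only when the machine has no allocation
// at all. Any allocation, whatever its flags, blocks a full wipe.
func CheckFullWipeAllowed(m Machine) error {
	if m.Allocation != nil {
		return fmt.Errorf("machine %s: %w", m.ID, ErrAllocated)
	}
	return nil
}

func main() {
	free := Machine{ID: "free-machine"}
	allocated := Machine{ID: "allocated-machine", Allocation: &Allocation{Reinstall: true}}

	fmt.Println(CheckFullWipeAllowed(free) == nil)                       // unallocated: wipe allowed
	fmt.Println(errors.Is(CheckFullWipeAllowed(allocated), ErrAllocated)) // allocated: wipe refused
}
```

The point of the sketch is that the check depends only on whether an allocation exists, not on any additional flag, so a lost or failed report cannot turn an allocated machine into a candidate for a fresh install.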
Apparently, an attempt by metal-hammer to report a successful reinstall of a machine can sometimes fail with error code 500. In one such instance the machine rebooted, and when the metal-hammer next contacted the API it was told to do a fresh install, which wiped all local disks.
This is obviously undesired behaviour; how can this issue be fixed?
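One possible mitigation, sketched here only as an assumption and not as how metal-hammer behaves today, would be to retry the report a few times before giving up and rebooting. `ReportWithRetry` and `ErrSimulated` are hypothetical names; the `report` callback stands in for whatever call sends the result to the API. Retrying only helps if the failure is transient.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrSimulated is a placeholder error used only by the demo below.
var ErrSimulated = errors.New("simulated transient API error")

// ReportWithRetry calls report up to attempts times. It sleeps for delay
// after each failed try except the last, doubling the delay every time.
// It returns nil on the first success, otherwise the last error, wrapped.
func ReportWithRetry(report func() error, attempts int, delay time.Duration) error {
	if attempts < 1 {
		attempts = 1 // always try at least once
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = report(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("report failed after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	// flaky fails twice, then succeeds.
	flaky := func() error {
		calls++
		if calls < 3 {
			return ErrSimulated
		}
		return nil
	}

	err := ReportWithRetry(flaky, 5, 10*time.Millisecond)
	fmt.Println(calls, err) // prints "3 <nil>"
}
```

If all attempts fail, the caller still has to decide what to do; staying up and not rebooting into a state where a fresh install can be triggered would be the conservative choice.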
Error message from metal-hammer ipmi console log when failing to report:
Complete log files of failed reinstall followed by fresh install, and for comparison a successful reinstall, are attached.
successful-reinstall.txt
failed-reinstall-and-subsequent-install-with-wipe.txt