-
Notifications
You must be signed in to change notification settings - Fork 10
add retry when vcekcertchain is empty #45
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Original thought was to use exponential backoffs on retrying. No need to retry for more than one time because if retry for once does not work, something must be badly configured - then the failure ought to surface. We also do not want to risk Azure being blacklisted by AMD.
The attestation container PR is introducing exponential backoff (See fetchWithRetry
). So even if you want to have it, you don't have to write it (And feedback is very welcome :) ).
We need to be careful with what we retry and how many times etc. Some of the services, for sure the AMD certificate endpoint, will throttle repeat requests. This may lead to failures and if those failures lead to more requests then we might trigger more stringent throttling. All this should be on reliable networks and unlikely to fail. If it does fail then it may be because of bad config that will never succeed and so need to be surfaced to the user, or due to some intermittent failure where a retry is appropriate. I suspect we cannot easily tell the difference and so we should only retry a few times before giving in. Even if we knew were were properly connected then we might still get some intermittent failures due to throttling. For MAA and mHSM we don't expect to fail due to throttling, just due to bad config, system load or some intermittent network failure. In that case I am less worried about triggering throttling, but we could make a high load situation worse. My conclusion is we should probably only retry over a period of maybe 10s of seconds and be prepared to fail sooner rather than later. Perhaps we should allow the user to specify a limit, subject to some upper bound of ours, like a minute. |
I took a look at the fetchWithRetry, it does the retry only when RefreshCertChain is called. The issue now aasp PR facing is that when there is a bad initial state such as when initialCert was not even configured, the RefreshCertChain is not even called. So I think we do need this PR. Once this function is called, it can take advantage of the retry in attestation container PR. I put a comment on the attestation container PR :) |
If you mean URL parameter by config, it returns 400 BAD_REQUEST. So we can tell the difference between bad config and other failure. https://www.amd.com/system/files/TechDocs/57230.pdf But I'll agree to go with safer option. I'll make the change for attestation container PR. I understand this PR is necessary. I should have written it clear, apologies :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Please do not merge until the CI is up and running so we can do automated checks that SKR still behaves as expected. That is nearly done. |
The title explains the PR. Original thought was to use exponential backoffs on retrying. No need to retry for more than one time because if retry for once does not work, something must be badly configured - then the failure ought to surface. We also do not want to risk Azure being blacklisted by AMD.