This repository has been archived by the owner on Jul 19, 2023. It is now read-only.

Add retry when vcekCertChain is empty #45

Open · wants to merge 2 commits into main

Conversation

@stevendongatmsft (Collaborator) commented Jun 19, 2023

The title explains the PR. The original thought was to use exponential backoff when retrying, but there is no need to retry more than once: if a single retry does not work, something must be badly configured and the failure ought to surface. We also do not want to risk Azure being blacklisted by AMD.
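For illustration, a minimal sketch of that single-retry idea (all names here are hypothetical, not the actual code in this PR):

```go
package attest

import (
	"errors"
	"fmt"
)

// getVCEKCertChainWithOneRetry is a hypothetical wrapper: it retries exactly
// once when the fetched chain is empty or the fetch fails, then surfaces the
// error instead of hammering the AMD endpoint.
func getVCEKCertChainWithOneRetry(fetch func() ([]byte, error)) ([]byte, error) {
	chain, err := fetch()
	if err == nil && len(chain) > 0 {
		return chain, nil
	}
	chain, err = fetch() // single retry
	if err != nil {
		return nil, fmt.Errorf("fetching VCEK cert chain failed after one retry: %w", err)
	}
	if len(chain) == 0 {
		return nil, errors.New("VCEK cert chain still empty after one retry; likely a configuration problem")
	}
	return chain, nil
}
```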

@takuro-sato (Contributor) left a comment


> The original thought was to use exponential backoff when retrying, but there is no need to retry more than once: if a single retry does not work, something must be badly configured and the failure ought to surface. We also do not want to risk Azure being blacklisted by AMD.

The attestation container PR introduces exponential backoff (see fetchWithRetry). So even if you want it, you don't have to write it yourself (and feedback is very welcome :) ).

Review threads on pkg/attest/attest.go were resolved (one marked outdated).
@KenGordon (Contributor) commented

> The original thought was to use exponential backoff when retrying, but there is no need to retry more than once: if a single retry does not work, something must be badly configured and the failure ought to surface. We also do not want to risk Azure being blacklisted by AMD.

> The attestation container PR introduces exponential backoff (see fetchWithRetry). So even if you want it, you don't have to write it yourself (and feedback is very welcome :) ).

We need to be careful about what we retry and how many times. Some of the services, certainly the AMD certificate endpoint, will throttle repeated requests. That may lead to failures, and if those failures lead to more requests, we might trigger even more stringent throttling.

All of this should run on reliable networks and is unlikely to fail. If it does fail, it may be because of bad config that will never succeed and so needs to be surfaced to the user, or because of some intermittent failure where a retry is appropriate. I suspect we cannot easily tell the difference, so we should only retry a few times before giving up. Even if we knew we were properly connected, we might still get intermittent failures due to throttling.

For MAA and mHSM we don't expect to fail due to throttling, just due to bad config, system load, or some intermittent network failure. In that case I am less worried about triggering throttling, but we could make a high-load situation worse.

My conclusion is that we should probably only retry over a period of maybe tens of seconds and be prepared to fail sooner rather than later.

Perhaps we should allow the user to specify a limit, subject to some upper bound of ours, such as a minute.
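As a rough sketch of that policy (names, bounds, and backoff values here are illustrative assumptions, not part of this PR): a caller-supplied retry window clamped to a hard upper bound of about a minute, with exponential backoff so repeated attempts don't aggravate throttling.

```go
package attest

import (
	"fmt"
	"time"
)

// maxRetryWindow is an assumed hard upper bound on how long we keep retrying.
const maxRetryWindow = time.Minute

// retryFor runs op with exponential backoff until it succeeds or the
// (clamped) caller-supplied window expires, then gives up and surfaces the error.
func retryFor(userLimit time.Duration, op func() error) error {
	limit := userLimit
	if limit <= 0 || limit > maxRetryWindow {
		limit = maxRetryWindow
	}
	deadline := time.Now().Add(limit)
	delay := time.Second
	for {
		err := op()
		if err == nil {
			return nil
		}
		if time.Now().Add(delay).After(deadline) {
			return fmt.Errorf("giving up after %v: %w", limit, err)
		}
		time.Sleep(delay)
		delay *= 2 // exponential backoff
	}
}
```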

@stevendongatmsft (Collaborator, author) commented

> The original thought was to use exponential backoff when retrying, but there is no need to retry more than once: if a single retry does not work, something must be badly configured and the failure ought to surface. We also do not want to risk Azure being blacklisted by AMD.

> The attestation container PR introduces exponential backoff (see fetchWithRetry). So even if you want it, you don't have to write it yourself (and feedback is very welcome :) ).

I took a look at fetchWithRetry; it only retries when RefreshCertChain is called. The issue the aasp PR is facing now is that with a bad initial state, such as when initialCert was not configured at all, RefreshCertChain is never called. So I think we do need this PR. Once this function is called, it can take advantage of the retry in the attestation container PR. I put a comment on the attestation container PR :)
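Roughly the control flow being described, as a sketch only (the real RefreshCertChain signature may differ):

```go
package attest

import "fmt"

// ensureCertChain falls back to refreshing the chain when the initially
// configured one is empty; that refresh path is where the attestation
// container's fetchWithRetry backoff would then apply.
func ensureCertChain(initial []byte, refresh func() ([]byte, error)) ([]byte, error) {
	if len(initial) > 0 {
		return initial, nil
	}
	chain, err := refresh()
	if err != nil {
		return nil, fmt.Errorf("initial cert chain empty and refresh failed: %w", err)
	}
	return chain, nil
}
```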

@takuro-sato (Contributor) commented

@KenGordon

> All of this should run on reliable networks and is unlikely to fail. If it does fail, it may be because of bad config that will never succeed and so needs to be surfaced to the user, or because of some intermittent failure where a retry is appropriate. I suspect we cannot easily tell the difference, so we should only retry a few times before giving up.

If by config you mean the URL parameters, the endpoint returns 400 BAD_REQUEST for those, so we can tell the difference between bad config and other failures: https://www.amd.com/system/files/TechDocs/57230.pdf
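In other words, something like the following (the status handling is my reading of the KDS spec linked above, not code from either PR): fail fast on 400, retry on throttling or server errors.

```go
package attest

import "net/http"

// shouldRetry reports whether a KDS response status is worth retrying.
func shouldRetry(statusCode int) bool {
	switch {
	case statusCode == http.StatusBadRequest:
		return false // bad config (e.g. bad URL parameters): surface to the user
	case statusCode == http.StatusTooManyRequests:
		return true // throttled: back off and retry
	case statusCode >= 500:
		return true // transient server-side failure
	default:
		return false
	}
}
```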

But I agree we should go with the safer option. I'll make the change in the attestation container PR.

@stevendongatmsft

I understand this PR is necessary. I should have written that more clearly, apologies :)

@takuro-sato (Contributor) left a comment


LGTM

@KenGordon (Contributor) commented

Please do not merge until the CI is up and running so we can do automated checks that SKR still behaves as expected. That is nearly done.

4 participants