This repository has been archived by the owner on Jul 19, 2023. It is now read-only.

Add retry when vcekCertChain is empty #45

Open · wants to merge 2 commits into main

Conversation

@stevendongatmsft (Collaborator) commented Jun 19, 2023

The title explains the PR. The original thought was to use exponential backoff when retrying, but there is no need to retry more than once: if a single retry does not work, something must be badly configured and the failure ought to surface. We also do not want to risk Azure being blacklisted by AMD.
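For illustration, a minimal sketch of that single-retry idea (all names here are hypothetical, not the actual code in this PR):

```go
package attest

import (
	"errors"
	"fmt"
)

// getVCEKCertChainWithOneRetry is a hypothetical wrapper: it retries exactly
// once when the fetched chain is empty or the fetch fails, then surfaces the
// error instead of hammering the AMD endpoint.
func getVCEKCertChainWithOneRetry(fetch func() ([]byte, error)) ([]byte, error) {
	chain, err := fetch()
	if err == nil && len(chain) > 0 {
		return chain, nil
	}
	chain, err = fetch() // single retry
	if err != nil {
		return nil, fmt.Errorf("fetching VCEK cert chain failed after one retry: %w", err)
	}
	if len(chain) == 0 {
		return nil, errors.New("VCEK cert chain still empty after one retry; likely a configuration problem")
	}
	return chain, nil
}
```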

@takuro-sato (Contributor) left a comment


> The original thought was to use exponential backoff when retrying, but there is no need to retry more than once: if a single retry does not work, something must be badly configured and the failure ought to surface. We also do not want to risk Azure being blacklisted by AMD.

The attestation container PR introduces exponential backoff (see fetchWithRetry). So even if you want it, you don't have to write it yourself (and feedback is very welcome :) ).

Review threads on pkg/attest/attest.go were resolved (one marked outdated).
@KenGordon (Contributor) commented

> The original thought was to use exponential backoff when retrying, but there is no need to retry more than once: if a single retry does not work, something must be badly configured and the failure ought to surface. We also do not want to risk Azure being blacklisted by AMD.

> The attestation container PR introduces exponential backoff (see fetchWithRetry). So even if you want it, you don't have to write it yourself (and feedback is very welcome :) ).

We need to be careful about what we retry and how many times. Some of the services, certainly the AMD certificate endpoint, will throttle repeated requests. That may lead to failures, and if those failures lead to more requests, we might trigger even more stringent throttling.

All of this should run on reliable networks and is unlikely to fail. If it does fail, it may be because of bad config that will never succeed and so needs to be surfaced to the user, or because of some intermittent failure where a retry is appropriate. I suspect we cannot easily tell the difference, so we should only retry a few times before giving up. Even if we knew we were properly connected, we might still get intermittent failures due to throttling.

For MAA and mHSM we don't expect to fail due to throttling, just due to bad config, system load, or some intermittent network failure. In that case I am less worried about triggering throttling, but we could make a high-load situation worse.

My conclusion is that we should probably only retry over a period of maybe tens of seconds and be prepared to fail sooner rather than later.

Perhaps we should allow the user to specify a limit, subject to some upper bound of ours, such as a minute.
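As a rough sketch of that policy (names, bounds, and backoff values here are illustrative assumptions, not part of this PR): a caller-supplied retry window clamped to a hard upper bound of about a minute, with exponential backoff so repeated attempts don't aggravate throttling.

```go
package attest

import (
	"fmt"
	"time"
)

// maxRetryWindow is an assumed hard upper bound on how long we keep retrying.
const maxRetryWindow = time.Minute

// retryFor runs op with exponential backoff until it succeeds or the
// (clamped) caller-supplied window expires, then gives up and surfaces the error.
func retryFor(userLimit time.Duration, op func() error) error {
	limit := userLimit
	if limit <= 0 || limit > maxRetryWindow {
		limit = maxRetryWindow
	}
	deadline := time.Now().Add(limit)
	delay := time.Second
	for {
		err := op()
		if err == nil {
			return nil
		}
		if time.Now().Add(delay).After(deadline) {
			return fmt.Errorf("giving up after %v: %w", limit, err)
		}
		time.Sleep(delay)
		delay *= 2 // exponential backoff
	}
}
```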

@stevendongatmsft (Collaborator, author) commented

> The original thought was to use exponential backoff when retrying, but there is no need to retry more than once: if a single retry does not work, something must be badly configured and the failure ought to surface. We also do not want to risk Azure being blacklisted by AMD.

> The attestation container PR introduces exponential backoff (see fetchWithRetry). So even if you want it, you don't have to write it yourself (and feedback is very welcome :) ).

I took a look at fetchWithRetry; it only retries when RefreshCertChain is called. The issue the aasp PR is facing now is that with a bad initial state, such as when initialCert was not configured at all, RefreshCertChain is never called. So I think we do need this PR. Once this function is called, it can take advantage of the retry in the attestation container PR. I put a comment on the attestation container PR :)
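Roughly the control flow being described, as a sketch only (the real RefreshCertChain signature may differ):

```go
package attest

import "fmt"

// ensureCertChain falls back to refreshing the chain when the initially
// configured one is empty; that refresh path is where the attestation
// container's fetchWithRetry backoff would then apply.
func ensureCertChain(initial []byte, refresh func() ([]byte, error)) ([]byte, error) {
	if len(initial) > 0 {
		return initial, nil
	}
	chain, err := refresh()
	if err != nil {
		return nil, fmt.Errorf("initial cert chain empty and refresh failed: %w", err)
	}
	return chain, nil
}
```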

@takuro-sato (Contributor) commented

@KenGordon

> All of this should run on reliable networks and is unlikely to fail. If it does fail, it may be because of bad config that will never succeed and so needs to be surfaced to the user, or because of some intermittent failure where a retry is appropriate. I suspect we cannot easily tell the difference, so we should only retry a few times before giving up.

If by config you mean the URL parameters, the endpoint returns 400 BAD_REQUEST for those, so we can tell the difference between bad config and other failures: https://www.amd.com/system/files/TechDocs/57230.pdf
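In other words, something like the following (the status handling is my reading of the KDS spec linked above, not code from either PR): fail fast on 400, retry on throttling or server errors.

```go
package attest

import "net/http"

// shouldRetry reports whether a KDS response status is worth retrying.
func shouldRetry(statusCode int) bool {
	switch {
	case statusCode == http.StatusBadRequest:
		return false // bad config (e.g. bad URL parameters): surface to the user
	case statusCode == http.StatusTooManyRequests:
		return true // throttled: back off and retry
	case statusCode >= 500:
		return true // transient server-side failure
	default:
		return false
	}
}
```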

But I agree we should go with the safer option. I'll make the change in the attestation container PR.

@stevendongatmsft

I understand this PR is necessary. I should have written that more clearly, apologies :)

@takuro-sato (Contributor) left a comment


LGTM

@KenGordon (Contributor) commented

Please do not merge until the CI is up and running so we can do automated checks that SKR still behaves as expected. That is nearly done.

4 participants