-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update the progressive policy rollout design #118
Update the progressive policy rollout design #118
Conversation
/cc @imiller0 in case you are interested! |
e8a7a60
to
c1d8c13
Compare
version of the policy, and GitOps is the recommended method for this. | ||
- **Rollbacks** - Reverting a policy does not necessarily rollback the change. It's up to the user to determine how to | ||
rollback in the event of a rollout failure. | ||
- **Group Rollouts** - Rollouts of a group of policies is not a goal at this moment due to technical and user experience |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Group of policy means policy set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Group of policy means policy set?
Correct
|
||
If the user wants the approval to be conditional on a policy version, the user must set `spec.approvalsForVersion` to a | ||
value that matches the `policy.open-cluster-management.io/version` annotation on the root policy. By default, the | ||
approvals apply to all policy versions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
apply to lastest?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all versions little confusing me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. I'll update that.
|
||
| Status | Description | | ||
| --------------- | ----------------------------------------------------------------------------------------------------- | | ||
| `ToApply` | The cluster is waiting for the new version of the policy but the old version is still applied. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opt: hmm to me, await , ready are better to apply
not clear to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This enum has been out there for a little bit, so I suspect there may be opposition to updating this particular one.
namespace: <same as root policy> | ||
spec: | ||
approvalsForVersion: "2.0" <-- optional if the user wants to tie the approval to a specific policy "version". | ||
decisionGroups: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is the order of groups?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The order here doesn't actually matter. I prefer using a map in these cases but most Kubernetes APIs use lists with objects with name
keys, so this copies that. The order in the placement matters in terms of priority but are not dependencies like in progressivePerGroup
.
progressivePerGroup: | ||
minSuccessTime: 5m | ||
progressDeadline: 10m | ||
maxFailures: 2% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have some like fail-fast? wherever is fail, stop progressing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have some like fail-fast? wherever is fail, stop progressing
The default is to stop rollouts on failures. These tuning points allow it to continue within customizable thresholds.
1. The updated policy applies to all clusters in the `stage` group. These clusters have the `rolloutStatus` of | ||
`InProgress`. | ||
1. All the clusters in the `stage` group become compliant. Their `rolloutStatus` is set to `Succeeded`. | ||
1. The updated policy applies to all clusters in the `prod` group. These clusters have the `rolloutStatus` of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Individual status of group only appears on kind: rollout?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not proposing statuses per group but only rollout statuses per cluster and a summary rollout status to reflect all clusters.
#### Updated Policy With manualPerGroup in Order | ||
|
||
1. The root policy is updated. | ||
1. All clusters have the `rolloutStatus` of `ToApply` and keep the previous policy version. The root policy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if user set both dev and prod to true at the same time. The order would be dev -> prod?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's correct.
/hold for reviews |
new cluster would receive the last successful policy version, if any, until the rollout progressed to the `stage` group. | ||
|
||
If the cluster is added to an earlier group than the group that is `InProgress`, the rollout switches back to that group | ||
and waits for the cluster to rollout before resuming progress at the point previous to the new cluster being added. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is. a newbie cluster is in progress as soon as it comes. And then the previously progressed cluster restarts its progress.. is this correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not quite. The new cluster always gets a policy. Either the last successful version or the latest version. Which one is dependent on if the cluster is in a group that is being rolled out to or has been rolled out to.
For example, if you have dev, stage, and prod groups and the rollout is on the stage group.
If the new cluster is a dev cluster, the rollout goes back to dev and applies the latest policy to the new cluster before going back to stage.
If the new cluster is a prod cluster, the new cluster gets the last successful policy until the rollout is on the prod group. Then the new cluster gets thte latest policy.
In this situation, the first option is to update the policy spec in some way which would completly restart the rollout. | ||
|
||
The second option is to use the `spec.retryRollout.rolloutUID` field on the `Rollout` object. When set to a UID that | ||
matches `status.rolloutUID`, the Governance Policy Propagator will restart the rollout and create a new rollout UID. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we going to save snaps of policy + rollout?
I expected second option retry failed group
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please rephrase the question?
remediationAction: enforce | ||
rolloutStrategy: | ||
ignoreClusterRolloutStatus: | ||
matchExpressions: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use this instead of simple group name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might have prod clusters that are flaky but you still want to apply the policy to them without ignoring the status of all prod clusters.
|
||
## Design Details | ||
|
||
### Open Questions | ||
|
||
1. Should the per-cluster status on the root policy be grouped similar to how they're grouped in the | ||
`PlacementDecisions`? | ||
1. Should the `spec.rolloutStrategy.ignoreClusterRolloutStatus` field be contributed to the rollout strategy API? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does ignoreClusterRolloutStatus ignore cluster group failure? Is it excluded from maxfailure
calculation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Essentially, the cluster gets the new policy version when its group is being rolled out to, but its status is ignored and not counted towards failure or success.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks really good! Thanks for the updates! I have a couple questions/comments.
|
||
| Status | Description | | ||
| --------------- | ----------------------------------------------------------------------------------------------------- | | ||
| `ToApply` | The cluster is waiting for the new version of the policy but the old version is still applied. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This enum has been out there for a little bit, so I suspect there may be opposition to updating this particular one.
version of the policy, and GitOps is the recommended method for this. | ||
- **Rollbacks** - Reverting a policy does not necessarily rollback the change. It's up to the user to determine how to | ||
rollback in the event of a rollout failure. | ||
- **Group Rollouts** - Rollouts of a group of policies is not a goal at this moment due to technical and user experience |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Group of policy means policy set?
Correct
progressivePerGroup: | ||
minSuccessTime: 5m | ||
progressDeadline: 10m | ||
maxFailures: 2% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have some like fail-fast? wherever is fail, stop progressing
The default is to stop rollouts on failures. These tuning points allow it to continue within customizable thresholds.
c1d8c13
to
5bdaf3c
Compare
5bdaf3c
to
119b634
Compare
Signed-off-by: mprahl <[email protected]>
119b634
to
c5d65cc
Compare
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mprahl, yiraeChristineKim The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/unhold |
6580ef8
into
open-cluster-management-io:main
Note that since there's so much change, it's best to treat this like a new document when reviewing the changes (e.g. look at the markdown file directly instead of the diff).
Part of the proposal applies to all of Open Cluster Management.