
# Update the progressive policy rollout design #118

Conversation

**mprahl** (Member) commented Apr 20, 2024:

Note that since there's so much change, it's best to treat this like a new document when reviewing the changes (e.g. look at the markdown file directly instead of the diff).

Part of the proposal applies to all of Open Cluster Management.

**mprahl** (Member Author) commented Apr 20, 2024:

/cc @imiller0 in case you are interested!

@mprahl force-pushed the progressive-rollout branch 3 times, most recently from e8a7a60 to c1d8c13 on April 26, 2024.
**mprahl** (Member Author) commented Apr 26, 2024:

/cc @serngawy @qiujian16 @jnpacker

> version of the policy, and GitOps is the recommended method for this.
> - **Rollbacks** - Reverting a policy does not necessarily roll back the change. It's up to the user to determine how to roll back in the event of a rollout failure.
> - **Group Rollouts** - Rollouts of a group of policies are not a goal at this moment due to technical and user experience


Does a group of policies mean a policy set?

Reply:

> Does a group of policies mean a policy set?

Correct.


> If the user wants the approval to be conditional on a policy version, the user must set `spec.approvalsForVersion` to a value that matches the `policy.open-cluster-management.io/version` annotation on the root policy. By default, the approvals apply to all policy versions.


Does it apply to the latest?


"All versions" is a little confusing to me.

**mprahl** (Member Author) replied:

You're right. I'll update that.
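To make the version-conditional approval from this thread concrete, here is a hypothetical sketch; only `spec.approvalsForVersion` and the version annotation come from the design text, while the kind, API version, and other fields are assumptions:

```yaml
# Hypothetical approval object; only spec.approvalsForVersion and the
# policy.open-cluster-management.io/version annotation are from the design.
apiVersion: policy.open-cluster-management.io/v1beta1 # assumed
kind: RolloutApproval                                 # assumed kind name
metadata:
  name: example-approval
  namespace: policies # same namespace as the root policy
spec:
  # Must match the policy.open-cluster-management.io/version annotation
  # on the root policy for the approval to apply to that version.
  approvalsForVersion: "2.0"
```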


> | Status    | Description                                                                                    |
> | --------- | ---------------------------------------------------------------------------------------------- |
> | `ToApply` | The cluster is waiting for the new version of the policy but the old version is still applied. |
**yiraeChristineKim** commented Apr 26, 2024:


Opt: to me, `Await` or `Ready` would be better; `ToApply` is not clear to me.

Reply from a project member:

This enum has been out there for a little bit, so I suspect there may be opposition to updating this particular one.

```yaml
namespace: <same as root policy>
spec:
  approvalsForVersion: "2.0" # optional: ties the approval to a specific policy "version"
  decisionGroups:
```


So this is the order of groups?

**mprahl** (Member Author) replied:

The order here doesn't actually matter. I prefer using a map in these cases, but most Kubernetes APIs use lists of objects with `name` keys, so this copies that. The order in the placement matters in terms of priority, but the groups are not dependencies like in `progressivePerGroup`.
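As an illustration of the list-of-named-objects convention mentioned above, a `decisionGroups` list might look like the following (group names and labels are illustrative, not from the design document):

```yaml
# Illustrative decisionGroups list; names and labels are examples. Order
# only affects priority, not dependencies (unlike progressivePerGroup).
decisionGroups:
  - groupName: stage
    groupClusterSelector:
      matchLabels:
        environment: stage
  - groupName: prod
    groupClusterSelector:
      matchLabels:
        environment: prod
```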

```yaml
progressivePerGroup:
  minSuccessTime: 5m
  progressDeadline: 10m
  maxFailures: 2%
```


Do we have something like fail-fast? Wherever there is a failure, stop progressing?

Reply from a project member:

> Do we have something like fail-fast? Wherever there is a failure, stop progressing?

The default is to stop rollouts on failures. These tuning points allow it to continue within customizable thresholds.
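In other words, a rollout stops on failure by default, and the fields from the excerpt relax that. A hedged sketch of how they might sit inside a rollout strategy (the enclosing structure is an assumption):

```yaml
# Sketch only; the rolloutStrategy/type nesting is assumed, the three
# threshold fields come from the excerpt above.
rolloutStrategy:
  type: ProgressivePerGroup
  progressivePerGroup:
    minSuccessTime: 5m    # a cluster must stay compliant this long to count as succeeded
    progressDeadline: 10m # a cluster not succeeding within this window counts as failed
    maxFailures: 2%       # the rollout continues until failures exceed this threshold
```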

> 1. The updated policy applies to all clusters in the `stage` group. These clusters have the `rolloutStatus` of `InProgress`.
> 1. All the clusters in the `stage` group become compliant. Their `rolloutStatus` is set to `Succeeded`.
> 1. The updated policy applies to all clusters in the `prod` group. These clusters have the `rolloutStatus` of


Does the individual status of a group only appear on `kind: Rollout`?

**mprahl** (Member Author) replied May 1, 2024:

I'm not proposing statuses per group but only rollout statuses per cluster and a summary rollout status to reflect all clusters.
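A hypothetical sketch of what that could look like on the root policy status; the field names here are assumptions, not from the design text:

```yaml
# Hypothetical root policy status shape: per-cluster rollout statuses
# plus a single summary value. All field names are assumed.
status:
  rolloutStatus: InProgress # summary rollout status across all clusters
  status:
    - clustername: cluster-stage-1
      rolloutStatus: Succeeded
    - clustername: cluster-prod-1
      rolloutStatus: ToApply
```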

> #### Updated Policy With manualPerGroup in Order
>
> 1. The root policy is updated.
> 1. All clusters have the `rolloutStatus` of `ToApply` and keep the previous policy version. The root policy


So if the user sets both dev and prod to true at the same time, the order would be dev -> prod?

**mprahl** (Member Author) replied:

Yes, that's correct.

**mprahl** (Member Author) commented Apr 26, 2024:

/hold for reviews

> new cluster would receive the last successful policy version, if any, until the rollout progressed to the `stage` group.
>
> If the cluster is added to an earlier group than the group that is `InProgress`, the rollout switches back to that group and waits for the cluster to roll out before resuming progress at the point previous to the new cluster being added.


My understanding is: a new cluster is in progress as soon as it joins, and then the previously progressed clusters restart their progress. Is this correct?

**mprahl** (Member Author) replied:

Not quite. The new cluster always gets a policy: either the last successful version or the latest version. Which one depends on whether the cluster is in a group that is being rolled out to or has already been rolled out to.

For example, say you have dev, stage, and prod groups and the rollout is on the stage group.

If the new cluster is a dev cluster, the rollout goes back to dev and applies the latest policy to the new cluster before going back to stage.

If the new cluster is a prod cluster, the new cluster gets the last successful policy until the rollout is on the prod group. Then the new cluster gets the latest policy.

> In this situation, the first option is to update the policy spec in some way, which would completely restart the rollout.
>
> The second option is to use the `spec.retryRollout.rolloutUID` field on the `Rollout` object. When set to a UID that matches `status.rolloutUID`, the Governance Policy Propagator will restart the rollout and create a new rollout UID.
**yiraeChristineKim** commented Apr 26, 2024:


Are we going to save snapshots of the policy + rollout? I expected the second option to retry the failed group.

**mprahl** (Member Author) replied:

Could you please rephrase the question?
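For reference, the retry flow from the quoted excerpt could be sketched as follows; the API version, metadata, and UID value are illustrative assumptions, while the two `rolloutUID` field paths come from the design text:

```yaml
# Hypothetical Rollout object illustrating the retry flow. Only the
# spec.retryRollout.rolloutUID and status.rolloutUID paths are from the design.
apiVersion: policy.open-cluster-management.io/v1beta1 # assumed
kind: Rollout
metadata:
  name: example-rollout
spec:
  retryRollout:
    # When this matches status.rolloutUID, the Governance Policy Propagator
    # restarts the rollout and generates a new rollout UID.
    rolloutUID: d290f1ee-6c54-4b01-90e6-d701748f0851
status:
  rolloutUID: d290f1ee-6c54-4b01-90e6-d701748f0851 # replaced on restart
```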

```yaml
remediationAction: enforce
rolloutStrategy:
  ignoreClusterRolloutStatus:
    matchExpressions:
```


Why use this instead of a simple group name?

**mprahl** (Member Author) replied:

You might have prod clusters that are flaky, but you still want to apply the policy to them without ignoring the status of all prod clusters.
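For example, a label selector like the following (the `flaky` label is hypothetical) would ignore only the known-flaky clusters instead of every prod cluster:

```yaml
# Sketch of ignoreClusterRolloutStatus targeting a hypothetical label.
rolloutStrategy:
  ignoreClusterRolloutStatus:
    matchExpressions:
      - key: flaky # hypothetical label set on the unreliable clusters
        operator: In
        values: ["true"]
```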


> ## Design Details
>
> ### Open Questions
>
> 1. Should the per-cluster status on the root policy be grouped similar to how they're grouped in the `PlacementDecisions`?
> 1. Should the `spec.rolloutStrategy.ignoreClusterRolloutStatus` field be contributed to the rollout strategy API?
**yiraeChristineKim** commented Apr 26, 2024:


Does `ignoreClusterRolloutStatus` ignore a cluster group failure? Is it excluded from the `maxFailures` calculation?

**mprahl** (Member Author) replied:

Essentially, the cluster gets the new policy version when its group is being rolled out to, but its status is ignored and not counted towards failure or success.

**dhaiducek** (Member) left a review comment:

Looks really good! Thanks for the updates! I have a couple of questions/comments.



**openshift-ci** bot commented May 1, 2024:

[APPROVALNOTIFIER] This PR is **APPROVED**

This pull request has been approved by: mprahl, yiraeChristineKim

Approvers can indicate their approval by writing `/approve` in a comment, and can cancel approval by writing `/approve cancel` in a comment.

**mprahl** (Member Author) commented Nov 1, 2024:

/unhold

@openshift-merge-bot merged commit 6580ef8 into open-cluster-management-io:main on Nov 1, 2024. 2 checks passed.