
ROX-16430: Add option in os-4 flavor to keep failed clusters #1144

Merged: 6 commits merged into master from blugo/ROX-16430-os4-keep-failed-cluster on Mar 7, 2024

Conversation

@BradLugo (Contributor) commented Jan 16, 2024

Requires https://github.com/stackrox/automation-flavors/pull/184.

Testing

Forcing the failure mode via https://github.com/stackrox/automation-flavors/pull/184/commits/10747416e14221e171dfc50fdaa8e60943397c2a:

❯ infractl -k -e localhost:8443 create openshift-4 bl-os4-keep --arg "keep-failed-cluster=true"
NOTE: infractl no longer requires a NAME parameter when creating a cluster.
      If ommitted a name will be generated using your infra user initials, a
      short date and a counter for uniqueness. e.g. jb-10-31-1
ID: bl-os4-keep
❯ infractl -k -e localhost:8443 create openshift-4 bl-os4-no-keep
NOTE: infractl no longer requires a NAME parameter when creating a cluster.
      If ommitted a name will be generated using your infra user initials, a
      short date and a counter for uniqueness. e.g. jb-10-31-1
ID: bl-os4-no-keep
❯ infractl -k -e localhost:8443 list
bl-os4-keep
  Flavor:      openshift-4
  Owner:       [email protected]
  Description:
  Status:      READY
  Created:     40m ago
  Lifespan:    2h19m remaining
bl-os4-no-keep
  Flavor:      openshift-4
  Owner:       [email protected]
  Description:
  Status:      FAILED
  Created:     41m ago
  Destroyed:   2m39s ago
  Lifespan:    2h18m remaining

❯ infractl -k -e localhost:8443 delete bl-os4-keep
ID: bl-os4-keep

❯ argo logs bl-os4-keep-5z82m | bat # Cluster delete Argo Workflow
# The log output is long, so I won't paste it here, but it grabs the data and tears down the cluster properly

Ensuring the openshift-4 flavor works after the revert:

❯ infractl -k -e localhost:8443 create openshift-4 bl-os4-good-no-keep
NOTE: infractl no longer requires a NAME parameter when creating a cluster.
      If ommitted a name will be generated using your infra user initials, a
      short date and a counter for uniqueness. e.g. jb-10-31-1
ID: bl-os4-good-no-keep
❯ infractl -k -e localhost:8443 create openshift-4 bl-os4-good-keep --arg "keep-failed-cluster=true"
NOTE: infractl no longer requires a NAME parameter when creating a cluster.
      If ommitted a name will be generated using your infra user initials, a
      short date and a counter for uniqueness. e.g. jb-10-31-1
ID: bl-os4-good-keep
❯ infractl -k -e localhost:8443 list
bl-os4-good-no-keep
  Flavor:      openshift-4
  Owner:       [email protected]
  Description:
  Status:      READY
  Created:     38m ago
  Lifespan:    2h21m remaining
bl-os4-good-keep
  Flavor:      openshift-4
  Owner:       [email protected]
  Description:
  Status:      READY
  Created:     38m ago
  Lifespan:    2h21m remaining
❯ infractl -k -e localhost:8443 delete bl-os4-good-no-keep
ID: bl-os4-good-no-keep
❯ infractl -k -e localhost:8443 delete bl-os4-good-keep
ID: bl-os4-good-keep

@BradLugo requested a review from a team as a code owner on January 16, 2024 19:31

@rhacs-bot (Contributor) commented Jan 16, 2024

A single node development cluster (infra-pr-1144) was allocated in production infra for this PR.

CI will attempt to deploy quay.io/rhacs-eng/infra-server:0.9.3-6-g1081a2c863 to it.

🔌 You can connect to this cluster with:

gcloud container clusters get-credentials infra-pr-1144 --zone us-central1-a --project acs-team-temp-dev

🛠️ And pull infractl from the deployed dev infra-server with:

nohup kubectl -n infra port-forward svc/infra-server-service 8443:8443 &
make pull-infractl-from-dev-server

🚲 You can then use the dev infra instance e.g.:

bin/infractl -k -e localhost:8443 whoami

⚠️ Any clusters that you start using your dev infra instance should have a lifespan shorter than the development cluster's. Otherwise they will not be destroyed, because the dev infra instance ceases to exist when the development cluster is deleted. ⚠️

Further Development

☕ If you make changes, you can commit and push and CI will take care of updating the development cluster.

🚀 If you only modify configuration (chart/infra-server/configuration) or templates (chart/infra-server/{static,templates}), you can get a faster update with:

make install-local

Logs

Logs for the development infra depend on your @redhat.com authuser. Alternatively:

kubectl -n infra logs -l app=infra-server --tail=1 -f

@davdhacs (Contributor) left a comment

+1
@@ -344,6 +344,13 @@
value: false
kind: optional

- name: keep-failed-cluster
@tommartensen (Contributor) commented Jan 17, 2024

I am asking myself whether individual users need to care about this setting, or whether it would be better to have it as a global control (as in: users can't select it; we switch it on globally in the Helm values and deploy a test environment).
My fear is that we will just keep a lot of failed clusters around in the event of an incident like Monday's.
Also, it is just another setting that users need to worry about when spinning up a new cluster.

@BradLugo (Contributor, Author) replied:

> I am asking myself whether individual users need to care about this setting, or whether it would be better to have it as a global control (as in: users can't select it; we switch it on globally in the Helm values and deploy a test environment).

I don't have a strong opinion on this; I'm down for either. In an ideal world, I think I'd like to have some admin-specific settings so we could decide which individual cluster we'd want to keep. But the option currently implemented in this PR seemed like the most straightforward approach.

> My fear is that we will just keep a lot of failed clusters around in the event of an incident like Monday's.

If I correctly understand how the workflow and the automation-flavor work, clusters marked with keep-failed-cluster would still be destroyed at the end of their lifespan: the destroy step in the create script would be skipped if the create step fails, but the destroy would still run when the lifespan expires. I'm not 100% sure of that, though, so please correct me if I'm wrong.
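
Roughly, I'd expect the wiring to look something like the sketch below (purely illustrative; I haven't verified the actual step or parameter names, or whether this lives in the Argo workflow or in the flavor's create script):

# Illustrative sketch only, not the actual infra/automation-flavors definition:
# a failure-cleanup step that tears the cluster down only when provisioning
# failed AND keep-failed-cluster was not requested. The scheduled lifespan-end
# delete workflow is separate and unaffected.
- name: destroy-on-create-failure
  template: destroy
  when: '{{workflow.status}} != Succeeded && {{workflow.parameters.keep-failed-cluster}} != "true"'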

That being said, it wouldn't stop folks from hitting the error once and then reproducing it with the keep-failed-cluster flag enabled, thinking they're helping us by handing us a failed cluster. That would indeed be helpful if only one or two people did it, but if everyone started doing it, it would cause even more problems. Thinking out loud: maybe adding some explanation to the help message would be sufficient for that case? Or we could build in a limit on the number of currently provisioned clusters?

> Also, it is just another setting that users need to worry about when spinning up a new cluster.

Agreed. I'm hoping the help message is descriptive enough to indicate that they don't need to worry about it, but yeah, it adds another option in the UI that they'd need to parse.


Anyway, I'm totally down to scrap this approach if we want to go in a different direction.

@tommartensen (Contributor) replied:

> I think I'd like to have some admin-specific settings so we could decide which individual cluster we'd want to keep.

You're right. A global approach (my suggestion) would keep all clusters of a flavor around, while a flag (your approach) would allow for a more specific investigation.

> If I correctly understand how the workflow and the automation-flavor work, clusters marked with keep-failed-cluster would still be destroyed at the end of their lifespan: the destroy step in the create script would be skipped if the create step fails, but the destroy would still run when the lifespan expires. I'm not 100% sure of that, though, so please correct me if I'm wrong.

From what I observed in our debugging of the OCP issues this week, that is correct.

> That would indeed be helpful if only one or two people did it, but if everyone started doing it, it would cause even more problems. Thinking out loud: maybe adding some explanation to the help message would be sufficient for that case? Or we could build in a limit on the number of currently provisioned clusters?

I think this might have become an issue on Monday, when people started trying to provision clusters on their own just to "check if it works for them". It didn't cause any problems, because failed clusters were immediately destroyed, but IMHO it could have been prevented by better communication (on my side; I chose not to intervene because we were already too deep in the investigation).

Contributor:

@davdhacs @gavin-stackrox what do you think about this?

Contributor:

+1. I'd like to get this implementation in as-is, and we can iterate on or change it after trying this one.

Detail: since we use ocp-4 the most, this may be helpful soonest, for example if GCP breaks things again next week! 💥

I think it is okay to risk excess kept clusters: the clusters will still expire, because the workflow runs are tagged with their expiration and infra checks all of them on a schedule. If we see it abused, with users leaving many clusters around when a flavor is broken, we can figure out a solution after that happens (I think the risk of deferring that is low, since there is no customer or business impact).

And +1 to adding more parameters (noise): I think Infra users already deal with the UI noise, and they will either ignore the field or quickly learn what it does.

Future: +1 for ocp-4-demo, and I'd like the option for all flavors (or some other control in Infra over the workflow steps).

Contributor:

Let's do it. @BradLugo could you document this discussion in the Jira issue?

A follow-up to enable "keeping failed clusters globally" could be to set the default value with Helm.
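
For example (a minimal sketch; the value keys and layout below are assumptions, not the current chart structure):

# chart/infra-server values sketch; key names are assumptions, not the real chart layout
flavors:
  openshift-4:
    parameters:
      keep-failed-cluster:
        default: "true"   # keep failed openshift-4 clusters globally while debugging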

@BradLugo (Contributor, Author) commented Jan 19, 2024

One future improvement would be to indicate in the cluster's infra status that the cluster failed but is being kept around. With these changes, if we use --arg "keep-failed-cluster=true", the cluster will almost certainly be marked READY even though it is unhealthy (or perhaps no cluster exists at all).

@BradLugo (Contributor, Author) commented Jan 19, 2024

Note to self: if we decide to go with the current implementation in this PR, it's probably worth adding the parameter to the openshift-4-demo flavor as well (that likely wouldn't require any stackrox/automation-flavors changes, just an update to the workflow config).
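
Roughly, that would mean adding another optional parameter entry like the one in this PR (the description text and default below are placeholders, not the exact wording):

# Sketch of a workflow-config entry for the openshift-4-demo flavor;
# description and default are placeholders, not copied from the diff.
- name: keep-failed-cluster
  description: do not tear down the cluster if provisioning fails (for debugging)
  value: false
  kind: optional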

@BradLugo force-pushed the blugo/ROX-16430-os4-keep-failed-cluster branch from fc035a7 to a8cf7c3 on January 19, 2024 21:27
@davdhacs (Contributor) left a comment

+1, but circling back to @tommartensen: are your concerns resolved?

@tommartensen enabled auto-merge (squash) on March 7, 2024 18:52
@tommartensen merged commit 597e7df into master on Mar 7, 2024
9 checks passed
@tommartensen deleted the blugo/ROX-16430-os4-keep-failed-cluster branch on March 7, 2024 19:14