
ROX-16430: Add option in os-4 flavor to keep failed clusters #1144

Merged: 6 commits merged into master from blugo/ROX-16430-os4-keep-failed-cluster on Mar 7, 2024

Conversation

@BradLugo (Contributor) commented Jan 16, 2024

Requires https://github.com/stackrox/automation-flavors/pull/184.

Testing

Forcing the failure mode via https://github.com/stackrox/automation-flavors/pull/184/commits/10747416e14221e171dfc50fdaa8e60943397c2a:

❯ infractl -k -e localhost:8443 create openshift-4 bl-os4-keep --arg "keep-failed-cluster=true"
NOTE: infractl no longer requires a NAME parameter when creating a cluster.
      If ommitted a name will be generated using your infra user initials, a
      short date and a counter for uniqueness. e.g. jb-10-31-1
ID: bl-os4-keep
❯ infractl -k -e localhost:8443 create openshift-4 bl-os4-no-keep
NOTE: infractl no longer requires a NAME parameter when creating a cluster.
      If ommitted a name will be generated using your infra user initials, a
      short date and a counter for uniqueness. e.g. jb-10-31-1
ID: bl-os4-no-keep
❯ infractl -k -e localhost:8443 list
bl-os4-keep
  Flavor:      openshift-4
  Owner:       [email protected]
  Description:
  Status:      READY
  Created:     40m ago
  Lifespan:    2h19m remaining
bl-os4-no-keep
  Flavor:      openshift-4
  Owner:       [email protected]
  Description:
  Status:      FAILED
  Created:     41m ago
  Destroyed:   2m39s ago
  Lifespan:    2h18m remaining

❯ infractl -k -e localhost:8443 delete bl-os4-keep
ID: bl-os4-keep

❯ argo logs bl-os4-keep-5z82m | bat # Cluster delete Argo Workflow
# The log output is long, so I won't paste it here, but it grabs the data and tears down the cluster properly

Ensuring the openshift-4 flavor works after the revert:

❯ infractl -k -e localhost:8443 create openshift-4 bl-os4-good-no-keep
NOTE: infractl no longer requires a NAME parameter when creating a cluster.
      If ommitted a name will be generated using your infra user initials, a
      short date and a counter for uniqueness. e.g. jb-10-31-1
ID: bl-os4-good-no-keep
❯ infractl -k -e localhost:8443 create openshift-4 bl-os4-good-keep --arg "keep-failed-cluster=true"
NOTE: infractl no longer requires a NAME parameter when creating a cluster.
      If ommitted a name will be generated using your infra user initials, a
      short date and a counter for uniqueness. e.g. jb-10-31-1
ID: bl-os4-good-keep
❯ infractl -k -e localhost:8443 list
bl-os4-good-no-keep
  Flavor:      openshift-4
  Owner:       [email protected]
  Description:
  Status:      READY
  Created:     38m ago
  Lifespan:    2h21m remaining
bl-os4-good-keep
  Flavor:      openshift-4
  Owner:       [email protected]
  Description:
  Status:      READY
  Created:     38m ago
  Lifespan:    2h21m remaining
❯ infractl -k -e localhost:8443 delete bl-os4-good-no-keep
ID: bl-os4-good-no-keep
❯ infractl -k -e localhost:8443 delete bl-os4-good-keep
ID: bl-os4-good-keep

@BradLugo requested a review from a team as a code owner on January 16, 2024 19:31

@rhacs-bot (Contributor) commented Jan 16, 2024

A single node development cluster (infra-pr-1144) was allocated in production infra for this PR.

CI will attempt to deploy quay.io/rhacs-eng/infra-server:0.9.3-6-g1081a2c863 to it.

🔌 You can connect to this cluster with:

gcloud container clusters get-credentials infra-pr-1144 --zone us-central1-a --project acs-team-temp-dev

🛠️ And pull infractl from the deployed dev infra-server with:

nohup kubectl -n infra port-forward svc/infra-server-service 8443:8443 &
make pull-infractl-from-dev-server

🚲 You can then use the dev infra instance e.g.:

bin/infractl -k -e localhost:8443 whoami

⚠️ Any clusters that you start using your dev infra instance should have a lifespan shorter than the development cluster's. Otherwise they will not be destroyed, because the dev infra instance ceases to exist when the development cluster is deleted. ⚠️

Further Development

☕ If you make changes, you can commit and push and CI will take care of updating the development cluster.

🚀 If you only modify configuration (chart/infra-server/configuration) or templates (chart/infra-server/{static,templates}), you can get a faster update with:

make install-local

Logs

Logs for the development infra depend on your @redhat.com authuser. Alternatively:

kubectl -n infra logs -l app=infra-server --tail=1 -f

@davdhacs (Contributor) left a comment

+1
@@ -344,6 +344,13 @@
value: false
kind: optional

- name: keep-failed-cluster
@tommartensen (Contributor) commented Jan 17, 2024

I am asking myself whether individual users need to care about this setting, or whether it would be better to have it as a global control (as in: users can't select it; we switch it on globally in the Helm values and deploy a test environment).
My fear is that we will just keep a lot of failed clusters around in the event of an incident like Monday's.
Also, it is just another setting that users need to worry about when spinning up a new cluster.

@BradLugo (Contributor, Author) replied:

> I am asking myself whether individual users need to care about this setting, or whether it would be better to have it as a global control (as in: users can't select it; we switch it on globally in the Helm values and deploy a test environment).

I don't have a strong opinion on this; I'm down for either. In an ideal world, I think I'd like to have some admin-specific settings so we could decide which individual cluster we'd want to keep. But the option currently implemented in this PR seemed like the most straightforward approach.

> My fear is that we will just keep a lot of failed clusters around in the event of an incident like Monday's.

If I correctly understand how the workflow and the automation-flavor work, clusters marked with keep-failed-cluster would still be destroyed at the end of their lifespan: the destroy step in the create script would be skipped if the create step fails, but the destroy would still run when the lifespan expires. I'm not 100% sure of that, though, so please correct me if I'm wrong.
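
Roughly, I'd expect the wiring to look something like the sketch below (purely illustrative; I haven't verified the actual step or parameter names, or whether this lives in the Argo workflow or in the flavor's create script):

# Illustrative sketch only, not the actual infra/automation-flavors definition:
# a failure-cleanup step that tears the cluster down only when provisioning
# failed AND keep-failed-cluster was not requested. The scheduled lifespan-end
# delete workflow is separate and unaffected.
- name: destroy-on-create-failure
  template: destroy
  when: '{{workflow.status}} != Succeeded && {{workflow.parameters.keep-failed-cluster}} != "true"'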

That being said, it wouldn't stop folks from hitting the error once and then reproducing it with the keep-failed-cluster flag enabled, thinking they're helping us by handing us a failed cluster. That would indeed be helpful if only one or two people did it, but if everyone started doing it, it would cause even more problems. Thinking out loud: maybe adding some explanation to the help message would be sufficient for that case? Or we could build in a limit on the number of currently provisioned clusters?

> Also, it is just another setting that users need to worry about when spinning up a new cluster.

Agreed. I'm hoping the help message is descriptive enough to indicate that they don't need to worry about it, but yeah, it adds another option in the UI that they'd need to parse.


Anyway, I'm totally down to scrap this approach if we want to go in a different direction.

@tommartensen (Contributor) replied:

> I think I'd like to have some admin-specific settings so we could decide which individual cluster we'd want to keep.

You're right. A global approach (my suggestion) would keep all clusters of a flavor around, while a flag (your approach) would allow for a more specific investigation.

> If I correctly understand how the workflow and the automation-flavor work, clusters marked with keep-failed-cluster would still be destroyed at the end of their lifespan: the destroy step in the create script would be skipped if the create step fails, but the destroy would still run when the lifespan expires. I'm not 100% sure of that, though, so please correct me if I'm wrong.

From what I observed in our debugging of the OCP issues this week, that is correct.

> That would indeed be helpful if only one or two people did it, but if everyone started doing it, it would cause even more problems. Thinking out loud: maybe adding some explanation to the help message would be sufficient for that case? Or we could build in a limit on the number of currently provisioned clusters?

I think this might have become an issue on Monday, when people started trying to provision clusters on their own just to "check if it works for them". It didn't cause any problems, because failed clusters were immediately destroyed, but IMHO it could have been prevented by better communication (on my side; I chose not to intervene because we were already too deep in the investigation).

Contributor:

@davdhacs @gavin-stackrox what do you think about this?

Contributor:

+1. I'd like to get this implementation in as-is, and we can iterate on or change it after trying this one.

Detail: since we use ocp-4 the most, this may be helpful soonest, for example if GCP breaks things again next week! 💥

I think it is okay to risk excess kept clusters: the clusters will still expire, because the workflow runs are tagged with their expiration and infra checks all of them on a schedule. If we see it abused, with users leaving many clusters around when a flavor is broken, we can figure out a solution after that happens (I think the risk of deferring that is low, since there is no customer or business impact).

And +1 to adding more parameters (noise): I think Infra users already deal with the UI noise, and they will either ignore the field or quickly learn what it does.

Future: +1 for ocp-4-demo, and I'd like the option for all flavors (or some other control in Infra over the workflow steps).

Contributor:

Let's do it. @BradLugo could you document this discussion in the Jira issue?

A follow-up to enable "keeping failed clusters globally" could be to set the default value with Helm.
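
For example (a minimal sketch; the value keys and layout below are assumptions, not the current chart structure):

# chart/infra-server values sketch; key names are assumptions, not the real chart layout
flavors:
  openshift-4:
    parameters:
      keep-failed-cluster:
        default: "true"   # keep failed openshift-4 clusters globally while debugging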

@BradLugo (Contributor, Author) commented Jan 19, 2024

One future improvement would be to indicate in the cluster's infra status that the cluster failed but is being kept around. With these changes, if we use --arg "keep-failed-cluster=true", the cluster will almost certainly be marked READY even though it is unhealthy (or perhaps no cluster exists at all).

@BradLugo (Contributor, Author) commented Jan 19, 2024

Note to self: if we decide to go with the current implementation in this PR, it's probably worth adding the parameter to the openshift-4-demo flavor as well (that likely wouldn't require any stackrox/automation-flavors changes, just an update to the workflow config).
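
Roughly, that would mean adding another optional parameter entry like the one in this PR (the description text and default below are placeholders, not the exact wording):

# Sketch of a workflow-config entry for the openshift-4-demo flavor;
# description and default are placeholders, not copied from the diff.
- name: keep-failed-cluster
  description: do not tear down the cluster if provisioning fails (for debugging)
  value: false
  kind: optional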

@BradLugo force-pushed the blugo/ROX-16430-os4-keep-failed-cluster branch from fc035a7 to a8cf7c3 on January 19, 2024 21:27
@davdhacs (Contributor) left a comment

+1, but circling back to @tommartensen: are your concerns resolved?

@tommartensen enabled auto-merge (squash) on March 7, 2024 18:52
@tommartensen merged commit 597e7df into master on Mar 7, 2024
9 checks passed
@tommartensen deleted the blugo/ROX-16430-os4-keep-failed-cluster branch on March 7, 2024 19:14