ROX-16430: Add option in os-4 flavor to keep failed clusters #1144
Conversation
A single node development cluster (infra-pr-1144) was allocated in production infra for this PR. CI will attempt to deploy to it.

🔌 You can connect to this cluster with:

🛠️ And pull infractl from the deployed dev infra-server with:

🚲 You can then use the dev infra instance, e.g.:

Further Development

☕ If you make changes, you can commit and push, and CI will take care of updating the development cluster.

🚀 If you only modify configuration (chart/infra-server/configuration) or templates (chart/infra-server/{static,templates}), you can get a faster update with:

Logs

Logs for the development infra, depending on your @redhat.com authuser:

Or:
+1
@@ -344,6 +344,13 @@
      value: false
      kind: optional

  - name: keep-failed-cluster
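For context, a hypothetical sketch of what the full parameter entry might look like in the flavor configuration, following the pattern of the surrounding entries (only the name, value, and kind fields are visible in the diff above; the description text is an assumption, not copied from the PR):

```yaml
# Hypothetical sketch; the description text is assumed, not taken from the PR.
- name: keep-failed-cluster
  description: skip the automatic teardown when cluster creation fails
  value: false
  kind: optional
```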
I am asking myself if the individual users need to care about this setting or if it would be better to have this as a global control (as in: the users can't select it, we need to switch on this globally in the Helm values and deploy a test environment).
My fear is that we will just keep a lot of failed clusters around in case of an incident like on Monday.
Also it is just another setting that they need to worry about when spinning up a new cluster.
> I am asking myself if the individual users need to care about this setting or if it would be better to have this as a global control (as in: the users can't select it, we need to switch on this globally in the Helm values and deploy a test environment).
I don't have a strong opinion on this - I'm down for either. In an ideal world, I think I'd like to have some admin-specific settings so we could decide which individual cluster we'd want to keep. But the option currently implemented in this PR seemed like the most straightforward.
> My fear is that we will just keep a lot of failed clusters around in case of an incident like on Monday.
If I correctly understand how the workflow and automation-flavor work, I think the clusters marked with the keep-failed-cluster flag would still be destroyed at the end of their lifespan, i.e., the destroy step in the create script would be skipped if the create step fails, but the destroy step would still run at the end of the lifespan. I'm not 100% sure on that though, so please correct me if I'm wrong.
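For illustration only, here is a simplified sketch of that intended behavior, assuming the flavor lifecycle is driven by an Argo-style workflow. The step layout and image names below are placeholders, not the actual workflow from this PR or automation-flavors:

```yaml
# Hypothetical sketch: skip the immediate teardown after a failed create when
# keep-failed-cluster is set; the end-of-lifespan expiration handled by infra
# still destroys the cluster later.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: openshift-4-
spec:
  entrypoint: lifecycle
  arguments:
    parameters:
      - name: keep-failed-cluster
        value: "false"
  templates:
    - name: lifecycle
      steps:
        - - name: create
            template: create
            continueOn:
              failed: true   # keep going even if create fails, so cleanup can run
        - - name: destroy-after-failed-create
            template: destroy
            # Run the immediate cleanup only when create failed AND the user did
            # not ask to keep the failed cluster for debugging.
            when: "{{steps.create.status}} == Failed && {{workflow.parameters.keep-failed-cluster}} == false"
    # End-of-lifespan teardown is handled separately: infra checks expirations
    # on a schedule, so a kept failed cluster is still destroyed eventually.
    - name: create
      container: {image: example.invalid/automation-flavor, args: [create]}    # placeholder image
    - name: destroy
      container: {image: example.invalid/automation-flavor, args: [destroy]}   # placeholder image
```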
That being said, it wouldn't stop folks from seeing the error the first time, then reproducing it with the keep-failed-cluster flag enabled, thinking they're helping us by giving us a cluster that failed. That would indeed be helpful if only one or two people did that, but if everyone started doing that, it would cause even more problems. Thinking out loud - maybe adding some explanation in the help message would be sufficient for that case? Or we could build in a limit to the number of currently provisioned clusters?
> Also it is just another setting that they need to worry about when spinning up a new cluster.
Agreed. I'm hoping the help message is descriptive enough to indicate they don't need to worry about it, but yeah, it adds another option in the UI that they'd need to parse.
Anyway, I'm totally down to scrap this approach if we want to go in a different direction.
> I think I'd like to have some admin-specific settings so we could decide which individual cluster we'd want to keep.
You're right. A global approach (my suggestion) would keep all clusters of a flavor around, while a flag (your approach) would allow for a more specific investigation.
> If I correctly understand how the workflow and automation-flavor work, I think the clusters marked with the keep-failed-cluster flag would still be destroyed at the end of their lifespan, i.e., the destroy step in the create script would be skipped if the create step fails, but the destroy step would still run at the end of the lifespan. I'm not 100% sure on that though, so please correct me if I'm wrong.
From what I observed in our debugging of the OCP issues this week, that is correct.
> That would indeed be helpful if only one or two people did that, but if everyone started doing that, it would cause even more problems. Thinking out loud - maybe adding some explanation in the help message would be sufficient for that case? Or we could build in a limit to the number of currently provisioned clusters?
I think this might have become an issue on Monday, when people started to try to provision clusters on their own, just to "check if it works for them". It didn't cause any issues, because failed clusters were immediately destroyed - but IMHO this could have been prevented by better communication (on my side - I chose not to intervene, because we were already too deep in the investigation).
@davdhacs @gavin-stackrox what do you think about this?
+1 I'd like to get this implementation in as-is, and we can iterate or change it after trying this one.
Details:
Since we use ocp-4 the most, it may be helpful there soonest, for example if GCP breaks things again next week! 💥
I think it is okay to risk keeping a few extra "helpful" clusters: they will still expire, because the workflow runs are tagged with their expiration and infra checks all of them on a schedule. If we see it abused, with users leaving many clusters around when a flavor is broken, we can figure out a solution after that happens (deferring that is low risk, since there is no customer or business impact).
And +1 on the point about adding more parameters (noise): Infra users already deal with the UI noise, and I think they will either ignore the field or quickly learn what it does.
Future: +1 for ocp-4-demo, and I'd like the option for all flavors (or some other control in Infra over the workflow steps).
Let's do it. @BradLugo could you document this discussion in the Jira issue?
A follow-up to enable "keeping failed clusters globally" could be to set the default value with Helm.
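For example, a hypothetical Helm values override (the key layout here is an assumption; it is not defined by this PR):

```yaml
# Hypothetical values.yaml snippet: flip the flavor-wide default so failed
# clusters are kept without each user selecting the parameter.
flavorDefaults:
  openshift-4:
    keep-failed-cluster: "true"
```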
Documented here: https://issues.redhat.com/browse/ROX-16430?focusedId=23856033&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-23856033
Please feel free to add anything I missed.
One future improvement would be to indicate in the infra status of the cluster whether the cluster failed but we're keeping it around. With these changes, if we use the …
Note to self: if we decide to go with the current implementation in this PR, it's probably worth adding the parameter to the openshift-4-demo flavor as well (likely wouldn't require any …)
Force-pushed from fc035a7 to a8cf7c3.
+1, but circling back to @tommartensen: are your concerns resolved?
Requires https://github.com/stackrox/automation-flavors/pull/184.
Testing
Forcing the failure mode via https://github.com/stackrox/automation-flavors/pull/184/commits/10747416e14221e171dfc50fdaa8e60943397c2a:
Ensuring the openshift-4 flavor works after the revert: