
Should we instantiate a k8s cluster within CI for rhg_compute_tools.kubernetes unit tests? #88

Open
bolliger32 opened this issue Sep 14, 2020 · 3 comments


@bolliger32
Collaborator

In theory, you could do this with https://github.com/rancher/k3s. It would allow better testing of the get_cluster-related commands. For example, right now the _get_cluster_dask_gateway unit test creates a dask-gateway server with a local backend. The local backend doesn't support the same config options as a kubernetes backend, so not all of our config options are testable. If we instead created a local k8s cluster and spun up the gateway server on it, they would be. I'd imagine some other existing or planned functions could also make use of this for unit testing.
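For reference, a minimal sketch of what this might look like in CI (assuming GitHub Actions; the install one-liner is the standard k3s installer, but the job layout and test module path are hypothetical):

```yaml
jobs:
  k8s-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Install k3s (single-node Kubernetes)
        run: |
          curl -sfL https://get.k3s.io | sh -
          # k3s bundles kubectl; wait for the node to become schedulable
          sudo k3s kubectl wait --for=condition=Ready node --all --timeout=120s
      - name: Run kubernetes-backend tests
        run: pytest tests/test_kubernetes.py  # hypothetical test module
```

A gateway server with a kubernetes backend could then be deployed onto that node before the test step runs.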

Might be overkill... but could be useful

@brews
Member

brews commented Sep 14, 2020

Interesting! Have you by chance looked at coverage stats for what we have vs. what we're missing because of this limitation?

@bolliger32
Collaborator Author

I haven't - I'm not sure it would show up. TBH I don't really know how coverage works, but it's not that we're missing tests for some functions; it's that we're not able to test everything some functions can do. So my guess (again, without much knowledge on the subject) is that this would show up as fully covered. As an example, I haven't found a way to let a memory specification actually adjust the memory allocated to a worker on the local dask-gateway cluster. I think this is down to how the local backend manages memory given whatever is available locally (so it wouldn't be an issue on kubernetes), but it's also possible that it's a bug in my code and we actually can't control the memory allocated to a worker even with the kubernetes backend. I was able to test this manually on adrastea, but couldn't test it within CI.
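With a real kubernetes backend available in CI, a test could assert that the memory spec we pass actually lands on the worker pod. A hypothetical helper (not part of rhg_compute_tools) for that kind of assertion, converting Kubernetes-style memory strings to bytes for comparison:

```python
import re

# Multipliers for Kubernetes memory suffixes (binary and decimal).
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
          "K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}

def parse_memory(spec):
    """Convert a Kubernetes-style memory string (e.g. '8Gi') to bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KMGT]i?)?", spec)
    if match is None:
        raise ValueError(f"unrecognized memory spec: {spec!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS.get(unit, 1))

print(parse_memory("8Gi"))   # 8589934592
print(parse_memory("512M"))  # 512000000
```

A test could then compare parse_memory(requested_spec) against the memory limit reported on the worker pod's container.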

@bolliger32
Collaborator Author

Oh! One other change users will see is that it sometimes takes a while (on the order of seconds to minutes) for the get_*_cluster command to return the client and cluster objects. This is because the scheduler is now always its own remote pod. At a minimum, a new container has to be started from an image that already exists on a node; at worst, a new node has to be triggered, the image pulled, and then loaded. There is potentially a way to do this asynchronously, but for now it's implemented synchronously, so the response time of this function can be a lot slower.
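As a rough sketch of what an asynchronous version could look like (names are illustrative only, not the current rhg_compute_tools API): kick off scheduler-pod startup immediately and hand the caller a task to await, so their code keeps running while the image pulls and the container starts.

```python
import asyncio

async def _start_scheduler():
    # Stand-in for waiting on the remote scheduler pod to become ready
    # (image pull, container start, scheduler handshake).
    await asyncio.sleep(0.01)
    return "scheduler-ready"

async def get_cluster_async():
    # Return immediately with a task the caller can await later,
    # instead of blocking until the scheduler pod is up.
    return asyncio.create_task(_start_scheduler())

async def main():
    task = await get_cluster_async()
    # ... user code can do other setup here while the scheduler starts ...
    return await task

result = asyncio.run(main())
print(result)  # scheduler-ready
```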

Relatedly, I believe schedulers are currently going into the core pool. I think this is fine, since that pool autoscales and has what seems like a more appropriate machine size. But if we wanted to keep them separate, we could create a scheduler-specific node pool and add some tolerations to the scheduler pods by default.
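If we went that route, the default scheduler pod spec could gain something like the following (the pool name and taint key are hypothetical; this assumes the node pool is tainted dedicated=scheduler-pool:NoSchedule):

```yaml
spec:
  nodeSelector:
    dedicated: scheduler-pool
  tolerations:
    - key: dedicated
      operator: Equal
      value: scheduler-pool
      effect: NoSchedule
```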
