autoscale / allocate jobs based on metrics use case #118

Open
gedw99 opened this issue Jul 11, 2023 · 3 comments

@gedw99

gedw99 commented Jul 11, 2023

I have a use case where I need to run long-running jobs on Hetzner, where the Hetzner robot lets me add and remove VMs.

That lets me autoscale on cheap hardware.

To do this I need to detect RAM and CPU usage on each server where the asyncjobs agent runs. Detection is pretty easy.

Logic is:

  • If a server is below 10% utilisation, put it into "blocked" mode: block new job allocations and move any long-running jobs to servers with 50 to 80% utilisation. It's essentially rebalancing, so that we can kill servers.
  • If a server is above 80% utilisation, start a new server and let it take jobs off the queue.
  • On a new job, find the server with the lowest utilisation that is not "blocked".
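
A rough sketch of the three rules above in Go. The `Server` type, the `Action` names and both functions are made up for illustration; nothing here is part of asyncjobs.

```go
package main

import "fmt"

// Server is a hypothetical view of one agent host and its latest utilisation (0-100).
type Server struct {
	Name        string
	Utilisation float64
	Blocked     bool
}

// Action is what the (hypothetical) scaler should do about a server.
type Action string

const (
	Drain   Action = "drain"    // block new jobs and move work elsewhere
	ScaleUp Action = "scale-up" // start another server
	Keep    Action = "keep"     // leave the server as it is
)

// Decide applies the 10% and 80% thresholds to a single server.
func Decide(s Server) Action {
	switch {
	case s.Utilisation < 10:
		return Drain
	case s.Utilisation > 80:
		return ScaleUp
	default:
		return Keep
	}
}

// PickServer returns the unblocked server with the lowest utilisation,
// or false when every server is blocked.
func PickServer(servers []Server) (Server, bool) {
	var best Server
	found := false
	for _, s := range servers {
		if s.Blocked {
			continue
		}
		if !found || s.Utilisation < best.Utilisation {
			best, found = s, true
		}
	}
	return best, found
}

func main() {
	servers := []Server{
		{Name: "a", Utilisation: 5, Blocked: true},
		{Name: "b", Utilisation: 62},
		{Name: "c", Utilisation: 91},
	}
	for _, s := range servers {
		fmt.Println(s.Name, Decide(s))
	}
	if s, ok := PickServer(servers); ok {
		fmt.Println("next job goes to", s.Name)
	}
}
```

Moving long-running jobs off a drained server is left out, since that depends on whether a job can be interrupted and resumed.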

To implement the logic of these use cases, NATS KV would be an easy win.
If each agent sends metrics every minute and the NATS KV bucket expires entries on a TTL of 1 hour, it will run itself.

Then core can issue jobs based on these metrics.

@ripienaar
Member

Sounds like you are issuing jobs for a specific worker?

If not, you can start as many workers as you want and they will scale up and down automatically. The only question is how to handle in-flight jobs, as today I don't support a signal that says "complete current job and exit" rather than "complete current job and get next one".

How you handle autoscaling per se is out of the control of this tool though, afaik?
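
For illustration, a minimal sketch of what such a "complete current job and exit" signal could look like. The `Worker` type and its `Drain` and `Run` methods are hypothetical and are not the asyncjobs client API.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Worker is a hypothetical job runner; it is not the asyncjobs client API.
type Worker struct {
	draining atomic.Bool
}

// Drain asks the worker to exit once the job it is running has finished.
func (w *Worker) Drain() { w.draining.Store(true) }

// Run processes jobs one at a time and returns the number completed.
// The drain flag is only checked between jobs, so an in-flight job is
// never interrupted.
func (w *Worker) Run(jobs <-chan func()) int {
	done := 0
	for {
		if w.draining.Load() {
			return done
		}
		job, ok := <-jobs
		if !ok {
			return done
		}
		job()
		done++
	}
}

func main() {
	w := &Worker{}
	jobs := make(chan func(), 3)
	jobs <- func() { fmt.Println("job 1") }
	jobs <- func() { fmt.Println("job 2"); w.Drain() } // scale-down requested mid-job
	jobs <- func() { fmt.Println("job 3") }
	close(jobs)
	fmt.Println("completed:", w.Run(jobs))
}
```

Here the drain request arrives while job 2 is running, so job 2 still completes and job 3 stays on the queue for another worker.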

@gedw99
Author

gedw99 commented Jul 12, 2023

Thanks for asking. I'm sure you're pretty busy. I would like to get some experience with asyncjobs, and this seems like a really useful use case too.

> Sounds like you are issuing jobs for a specific worker?

Yes.

The autoscaler is this: https://github.com/woodpecker-ci/autoscaler
It's a self-hosted Git, build and CI system.

A Hetzner robot is included that talks to the Hetzner IaaS to control VMs:
https://github.com/woodpecker-ci/autoscaler/blob/main/provider/hetzner.go which deploys the Woodpecker agent to a VM.

> If not you can start many workers as you want and they will scale up and down automatically. The only question is how to handle in flight jobs as today I don’t support a signal that says complete current job and exit rather than complete current job and get next one?

Yes, I agree. In this case it's a single-run, job-style task: run and, when done, please die, basically.

That would mean I don't have to do any rebalancing, because we are basically doing the serverless pattern, which is a smarter pattern I think :)

It makes me think of Google Cloud Run, where an instance only dies if there have not been any HTTP calls for the last 5 minutes.

I think these 2 logic patterns are the way to do it. It's simpler than what I proposed originally.

> how you handle auto scaling per se is out of the control of this tool though afaik?

To calculate the autoscaling, https://github.com/woodpecker-ci/autoscaler/blob/main/main.go#L23 holds a reference to all the agents.
The logic is in the next function below it, called "getLoad".
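
I have not copied the real function here. As a rough idea of the kind of calculation involved, this sketch works out how many agents are needed for a given number of pending and running tasks. The function name, parameters and formula are my own assumptions, not the autoscaler's getLoad implementation.

```go
package main

import "fmt"

// AgentsNeeded returns how many agents are required to run all pending and
// running tasks when each agent handles perAgent tasks at once, clamped to
// the range [minAgents, maxAgents]. Illustrative formula only.
func AgentsNeeded(pending, running, perAgent, minAgents, maxAgents int) int {
	if perAgent <= 0 {
		perAgent = 1
	}
	total := pending + running
	need := (total + perAgent - 1) / perAgent // ceiling division
	if need < minAgents {
		need = minAgents
	}
	if need > maxAgents {
		need = maxAgents
	}
	return need
}

func main() {
	// 5 pending + 3 running tasks, 2 tasks per agent, between 1 and 10 agents.
	fmt.Println(AgentsNeeded(5, 3, 2, 1, 10))
}
```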

I am realising that the way to build this is to have a dummy VM pool locally to simulate a Hetzner VM.
https://github.com/woodpecker-ci/autoscaler/blob/main/client.go is the client, so I need a dummy version of it.
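
Something like the following could serve as the dummy pool. The `VMProvider` interface is hypothetical and the real autoscaler's provider and client interfaces may look different, so this would need adapting to whatever they actually expose.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"sync"
)

// VMProvider is a hypothetical interface for adding and removing VMs.
type VMProvider interface {
	Create(name string) error
	Delete(name string) error
	List() []string
}

// FakePool keeps "VMs" in memory so scaling logic can be exercised locally
// without talking to Hetzner.
type FakePool struct {
	mu  sync.Mutex
	vms map[string]struct{}
}

func NewFakePool() *FakePool {
	return &FakePool{vms: make(map[string]struct{})}
}

// Create registers a new fake VM and fails if the name is already taken.
func (p *FakePool) Create(name string) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	if _, ok := p.vms[name]; ok {
		return errors.New("vm already exists: " + name)
	}
	p.vms[name] = struct{}{}
	return nil
}

// Delete removes a fake VM and fails if it does not exist.
func (p *FakePool) Delete(name string) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	if _, ok := p.vms[name]; !ok {
		return errors.New("vm not found: " + name)
	}
	delete(p.vms, name)
	return nil
}

// List returns the names of all fake VMs in sorted order.
func (p *FakePool) List() []string {
	p.mu.Lock()
	defer p.mu.Unlock()
	names := make([]string, 0, len(p.vms))
	for n := range p.vms {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

func main() {
	var pool VMProvider = NewFakePool()
	_ = pool.Create("agent-1")
	_ = pool.Create("agent-2")
	_ = pool.Delete("agent-1")
	fmt.Println(pool.List())
}
```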

@gedw99
Author

gedw99 commented Jul 12, 2023

https://github.com/windsource/picus does the same logic and uses Woodpecker CI.
