[EC2 Specific] Need automated worker monitoring/restart #4

ChristopherWilks · 2020-03-24T17:09:17Z

We currently use Vagrant to manage VMs in EC2 for recount-pump runs.

However, node management is ad hoc: each node has its own Vagrant process and subdirectory.

Node failures, whether caused by workflow-specific problems (e.g. running out of disk space) or by pre-emption on the spot market, currently have to be detected manually.

For a few tranches this is fine. For longer-term runs, we may want to either use existing orchestration tools or roll our own to support the mix of inside/outside container code we're using.
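
As a rough starting point for the "roll our own" option, here is a minimal sketch of a per-node health check covering the two failure modes mentioned above. It is not existing recount-pump code. The function names, the status labels, and the 5% free-space threshold are all hypothetical. The spot check polls the EC2 instance metadata path `spot/instance-action`, which returns 404 until an interruption is scheduled. It assumes IMDSv1-style access, so instances that enforce IMDSv2 would also need a session token.

```python
import shutil
import urllib.error
import urllib.request

# EC2 instance metadata path; returns 404 until a spot interruption is scheduled.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def disk_nearly_full(path="/", min_free_fraction=0.05):
    """Return True when free space on `path` falls below the (hypothetical) threshold."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total < min_free_fraction


def spot_interruption_pending(url=SPOT_ACTION_URL, timeout=2.0):
    """Return True when the metadata service reports a scheduled interruption.

    Assumes IMDSv1-style access; IMDSv2-only instances need a token header as well.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # 404 (nothing scheduled) or metadata service unreachable.
        return False


def node_status(disk_full, interrupted):
    """Map the two checks onto a hypothetical status label."""
    if interrupted:
        return "preempted"
    if disk_full:
        return "disk_full"
    return "ok"


if __name__ == "__main__":
    # Intended to run on the worker node itself, e.g. from cron.
    print(node_status(disk_nearly_full("/"), spot_interruption_pending()))
```

A check like this only reports status from inside the node. Something outside the node would still have to collect the results and restart or replace failed workers, which is the part an existing orchestration tool might already cover.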
