[Core feature] Vertical Pod scaling to handle OOMs #2234
Comments
Thank you for opening your first issue here! 🛠
Hello 👋, this issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! 🙏
Commenting to keep open.
👋 We're very much interested in this feature as well! What would it take to implement this? I assume we can leverage the k8s-native Vertical Pod Autoscaler. If I'm not mistaken, on Flyte's side we'd need to add configuration (on the task or on propeller), and in propeller we'd set up a VPA object for the task's pods. One question is whether the VPA's auto update mode would suffice, or whether more flytepropeller changes are required to persist any of the VPA's resource adjustments. I might have a naive understanding here, so let me know if I'm missing anything!
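For illustration, here is a minimal sketch of the kind of VerticalPodAutoscaler object that could be created per task, assuming the VPA CRD (autoscaling.k8s.io/v1) is installed. It uses the Python kubernetes client purely for readability (propeller itself is Go); the target controller, names, and limits are placeholders, and whether a VPA can target the controller that owns Flyte task pods is itself one of the open questions here.

```python
# Illustrative sketch only: creates a VerticalPodAutoscaler custom resource
# aimed at the pods of a single Flyte task. Assumes the VPA CRD
# (autoscaling.k8s.io/v1) is installed; all names below are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster

vpa_body = {
    "apiVersion": "autoscaling.k8s.io/v1",
    "kind": "VerticalPodAutoscaler",
    "metadata": {"name": "flyte-task-example-vpa"},
    "spec": {
        # Hypothetical targetRef: whether VPA can target whatever owns
        # Flyte task pods (a Job here) is an open design question.
        "targetRef": {
            "apiVersion": "batch/v1",
            "kind": "Job",
            "name": "example-flyte-task-job",
        },
        # "Auto" lets the VPA updater evict and recreate pods with new requests.
        "updatePolicy": {"updateMode": "Auto"},
        "resourcePolicy": {
            "containerPolicies": [
                {
                    "containerName": "*",
                    "maxAllowed": {"memory": "8Gi"},  # system-level ceiling
                }
            ]
        },
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="autoscaling.k8s.io",
    version="v1",
    namespace="flytesnacks-development",
    plural="verticalpodautoscalers",
    body=vpa_body,
)
```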
Contributors' meetup notes: looks great. Suggested next step: create an RFC Incubator post to discuss further.
Hi! I would like to work on this issue. Have there been any updates recently?
Motivation: Why do you think this is important?
It would be clean and useful to allow Flyte to handle vertical pod scaling when a task fails with an OOMKill (or, more generally, a defined set of resource-related, recoverable errors). The feature could be exposed through the existing task resource or pod_spec parameters. It would be especially effective for use cases where the workflow author's users modify and override an existing set of workflows: experimental compute generally requires continuous monitoring of running workflows, which creates unnecessary overhead.
Goal: What should the final outcome look like, ideally?
Ideally, this would be a simple field on the Task definition (similar to pod specs) that defines which exceptions the task should be rerun on and with what monotonic back-off in resources; a sketch of what this could look like follows below. Configuration could also live on flytepropeller to describe limits and other system-level constraints.
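As a rough illustration only: the `oom_retry_policy` argument and `OOMRetryPolicy` class below are hypothetical and do not exist in flytekit today; only `@task`, `Resources`, `requests`, and `limits` are real. The hypothetical argument is left commented out so the snippet runs as-is.

```python
# Hypothetical API sketch for the proposed field. OOMRetryPolicy and the
# oom_retry_policy argument are NOT part of flytekit; they only illustrate
# the shape of the configuration described above.
from flytekit import task, Resources


class OOMRetryPolicy:
    """Hypothetical policy: which errors trigger a rerun and how resources grow per attempt."""

    def __init__(self, on_errors=("OOMKilled",), max_retries=3,
                 backoff_factor=2.0, max_memory="16Gi"):
        self.on_errors = on_errors          # recoverable, resource-related errors
        self.max_retries = max_retries      # cap on reruns
        self.backoff_factor = backoff_factor  # e.g. double memory each attempt
        self.max_memory = max_memory        # ceiling, possibly enforced by propeller config


@task(
    requests=Resources(cpu="1", mem="2Gi"),
    limits=Resources(cpu="2", mem="4Gi"),
    # Hypothetical field -- rerun on OOMKilled with memory doubled each attempt,
    # bounded by system-level limits configured on flytepropeller:
    # oom_retry_policy=OOMRetryPolicy(backoff_factor=2.0, max_retries=3),
)
def train(n: int) -> float:
    return float(n) * 0.5
```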
Describe alternatives you've considered
A naive but complex alternative is a server that acts as a long-polling listener built on FlyteRemote; a sketch follows below. The listener would monitor the workflow that needs to be relaunched on OOM, wait for the running nodes to either succeed or error, and then rerun the workflow from the start. This method has a few drawbacks. First, long-polling listeners built on FlyteRemote are not efficient and can become an anti-pattern when many workflows with heterogeneous inputs are expected to run in parallel. Second, relaunching workflows can be costly, especially for workflows that are intentionally not cached.
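For context, a minimal sketch of that long-polling listener, assuming placeholder project, domain, workflow, and execution names; matching on the error message is a naive stand-in for real OOM detection and is not how Flyte reports OOMs authoritatively.

```python
# Naive long-polling relauncher sketch using FlyteRemote. Project, domain,
# workflow, and execution names are placeholders; treating "OOM" in the error
# message as an OOMKill signal is a simplification for illustration.
import time

from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

remote = FlyteRemote(Config.auto(), default_project="flytesnacks", default_domain="development")

# Placeholder execution name of the workflow being watched.
execution = remote.fetch_execution(name="f8a7b9c0d1e2f3a4b5c6")

# Long-poll until the execution reaches a terminal phase.
while True:
    execution = remote.sync_execution(execution)
    if execution.is_done:
        break
    time.sleep(60)

error = execution.error  # None when the execution succeeded
if error is not None and "OOM" in (error.message or ""):
    # Relaunch the whole workflow from scratch -- the costly part of this approach.
    wf = remote.fetch_workflow(name="my_project.workflows.example_wf", version="v1")
    remote.execute(wf, inputs={})
```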
Propose: Link/Inline OR Additional context
No response
Are you sure this issue hasn't been raised already?
Have you read the Code of Conduct?