[Core feature] Vertical Pod scaling to handle OOMs #2234
Comments
Thank you for opening your first issue here! 🛠
Hello 👋, this issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! 🙏
Commenting to keep open.
👋 We're very much interested in this feature as well! What would it take to implement this? I assume we can leverage the k8s-native Vertical Pod Autoscaler. If I'm not mistaken, on Flyte's side we'd need to add configuration (on the task or on propeller), and in propeller we'd set up a VPA object for the task's pods. One question is whether the VPA's auto update mode would suffice, or whether more flytepropeller changes are required to persist any of the VPA's resource adjustments. I might have a naive understanding here, so let me know if I'm missing anything!
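For illustration, here is a minimal sketch of the kind of VerticalPodAutoscaler object that could be created per task, assuming the VPA CRD (autoscaling.k8s.io/v1) is installed. It uses the Python kubernetes client purely for readability (propeller itself is Go); the target controller, names, and limits are placeholders, and whether a VPA can target the controller that owns Flyte task pods is itself one of the open questions here.

```python
# Illustrative sketch only: creates a VerticalPodAutoscaler custom resource
# aimed at the pods of a single Flyte task. Assumes the VPA CRD
# (autoscaling.k8s.io/v1) is installed; all names below are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster

vpa_body = {
    "apiVersion": "autoscaling.k8s.io/v1",
    "kind": "VerticalPodAutoscaler",
    "metadata": {"name": "flyte-task-example-vpa"},
    "spec": {
        # Hypothetical targetRef: whether VPA can target whatever owns
        # Flyte task pods (a Job here) is an open design question.
        "targetRef": {
            "apiVersion": "batch/v1",
            "kind": "Job",
            "name": "example-flyte-task-job",
        },
        # "Auto" lets the VPA updater evict and recreate pods with new requests.
        "updatePolicy": {"updateMode": "Auto"},
        "resourcePolicy": {
            "containerPolicies": [
                {
                    "containerName": "*",
                    "maxAllowed": {"memory": "8Gi"},  # system-level ceiling
                }
            ]
        },
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="autoscaling.k8s.io",
    version="v1",
    namespace="flytesnacks-development",
    plural="verticalpodautoscalers",
    body=vpa_body,
)
```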
Contributors' meetup notes: looks great. Suggested next step: create an RFC Incubator post to discuss further.
Hi! I would like to work on this issue. Have there been any updates recently?
Motivation: Why do you think this is important?
It would be clean and useful to allow Flyte to handle vertical pod scaling when a task fails with an OOMKill (or, more generally, a defined set of resource-related, recoverable errors). The feature could be exposed through the existing task resource or pod_spec parameters. It would be especially effective for use cases where the workflow author's users modify and override an existing set of workflows: experimental compute generally requires continuous monitoring of running workflows, which creates unnecessary overhead.
Goal: What should the final outcome look like, ideally?
Ideally, this would be a simple field on the Task definition (similar to pod specs) that defines which exceptions the task should be rerun on and with what monotonic back-off in resources; a sketch of what this could look like follows below. Configuration could also live on flytepropeller to describe limits and other system-level constraints.
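As a rough illustration only: the `oom_retry_policy` argument and `OOMRetryPolicy` class below are hypothetical and do not exist in flytekit today; only `@task`, `Resources`, `requests`, and `limits` are real. The hypothetical argument is left commented out so the snippet runs as-is.

```python
# Hypothetical API sketch for the proposed field. OOMRetryPolicy and the
# oom_retry_policy argument are NOT part of flytekit; they only illustrate
# the shape of the configuration described above.
from flytekit import task, Resources


class OOMRetryPolicy:
    """Hypothetical policy: which errors trigger a rerun and how resources grow per attempt."""

    def __init__(self, on_errors=("OOMKilled",), max_retries=3,
                 backoff_factor=2.0, max_memory="16Gi"):
        self.on_errors = on_errors          # recoverable, resource-related errors
        self.max_retries = max_retries      # cap on reruns
        self.backoff_factor = backoff_factor  # e.g. double memory each attempt
        self.max_memory = max_memory        # ceiling, possibly enforced by propeller config


@task(
    requests=Resources(cpu="1", mem="2Gi"),
    limits=Resources(cpu="2", mem="4Gi"),
    # Hypothetical field -- rerun on OOMKilled with memory doubled each attempt,
    # bounded by system-level limits configured on flytepropeller:
    # oom_retry_policy=OOMRetryPolicy(backoff_factor=2.0, max_retries=3),
)
def train(n: int) -> float:
    return float(n) * 0.5
```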
Describe alternatives you've considered
A naive but complex alternative is a server that acts as a long-polling listener built on FlyteRemote; a sketch follows below. The listener would monitor the workflow that needs to be relaunched on OOM, wait for the running nodes to either succeed or error, and then rerun the workflow from the start. This method has a few drawbacks. First, long-polling listeners built on FlyteRemote are not efficient and can become an anti-pattern when many workflows with heterogeneous inputs are expected to run in parallel. Second, relaunching workflows can be costly, especially for workflows that are intentionally not cached.
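For context, a minimal sketch of that long-polling listener, assuming placeholder project, domain, workflow, and execution names; matching on the error message is a naive stand-in for real OOM detection and is not how Flyte reports OOMs authoritatively.

```python
# Naive long-polling relauncher sketch using FlyteRemote. Project, domain,
# workflow, and execution names are placeholders; treating "OOM" in the error
# message as an OOMKill signal is a simplification for illustration.
import time

from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

remote = FlyteRemote(Config.auto(), default_project="flytesnacks", default_domain="development")

# Placeholder execution name of the workflow being watched.
execution = remote.fetch_execution(name="f8a7b9c0d1e2f3a4b5c6")

# Long-poll until the execution reaches a terminal phase.
while True:
    execution = remote.sync_execution(execution)
    if execution.is_done:
        break
    time.sleep(60)

error = execution.error  # None when the execution succeeded
if error is not None and "OOM" in (error.message or ""):
    # Relaunch the whole workflow from scratch -- the costly part of this approach.
    wf = remote.fetch_workflow(name="my_project.workflows.example_wf", version="v1")
    remote.execute(wf, inputs={})
```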
Propose: Link/Inline OR Additional context
No response
Are you sure this issue hasn't been raised already?
Have you read the Code of Conduct?