
[Core feature] Vertical Pod scaling to handle OOMs #2234

Open
apatel-fn opened this issue Mar 8, 2022 · 6 comments
Labels
enhancement New feature or request

Comments

@apatel-fn

Motivation: Why do you think this is important?

It would be clean and useful for Flyte to handle vertical pod scaling when tasks fail with an OOMKill (or a configurable set of resource-related, recoverable errors). This feature could be exposed via the simple task resource or pod_spec parameters. It would be especially effective for use cases where the workflow writer's users are modifying and overriding an existing set of workflows. Experimental compute generally requires ongoing monitoring of running workflows, which creates unnecessary overhead.

Goal: What should the final outcome look like, ideally?

Ideally, this would be a simple field on the Task definition (similar to pod specs) that defines which exceptions the task should be rerun on and with what monotonic back-off, as sketched below. There could also be configuration on flytepropeller describing limits and other system-level constraints.
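
A minimal sketch of what such a task-level field might look like. `requests`, `limits`, and `retries` are existing flytekit arguments; the `resource_retry_policy` argument is purely hypothetical and only illustrates the shape of the proposed API:

```python
from flytekit import Resources, task


@task(
    requests=Resources(mem="2Gi"),
    limits=Resources(mem="4Gi"),
    retries=3,
    # Hypothetical (does not exist in flytekit today): on OOMKilled, rerun the
    # task with memory scaled 1.5x per attempt, capped by a propeller-enforced
    # system limit.
    # resource_retry_policy=ResourceRetryPolicy(
    #     on=["OOMKilled"], memory_multiplier=1.5, max_memory="16Gi"
    # ),
)
def train(dataset: str) -> float:
    # ... memory-hungry work here ...
    return 0.0
```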

Describe alternatives you've considered

A naive but complex alternative is a server running FlyteRemote that acts as a long-polling listener (a rough sketch follows). This listener would monitor an existing workflow that needs to be relaunched on OOM, wait for the running nodes to either succeed or error, and then rerun the workflow from the start. This method has a few drawbacks. First, long-polling listeners built on FlyteRemote do not seem efficient and can be an anti-pattern when many workflows with heterogeneous inputs are expected to run in parallel. Second, relaunching workflows can be costly, especially for workflows that are intentionally not cached.
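
A rough sketch of that listener, assuming flytekit's FlyteRemote client. Project, domain, workflow names, and inputs are placeholders, and exact method behavior can vary across flytekit versions, so treat this as illustrative rather than a working implementation:

```python
import time

from flytekit.configuration import Config
from flytekit.models.core.execution import WorkflowExecutionPhase
from flytekit.remote import FlyteRemote

remote = FlyteRemote(
    config=Config.auto(),
    default_project="flytesnacks",
    default_domain="development",
)

TERMINAL_PHASES = {
    WorkflowExecutionPhase.SUCCEEDED,
    WorkflowExecutionPhase.FAILED,
    WorkflowExecutionPhase.ABORTED,
    WorkflowExecutionPhase.TIMED_OUT,
}


def watch_and_relaunch(execution_name: str, inputs: dict) -> None:
    """Poll one execution; if it fails (e.g. an OOMKilled task), relaunch the workflow."""
    execution = remote.fetch_execution(name=execution_name)
    while True:
        execution = remote.sync_execution(execution)
        if execution.closure.phase in TERMINAL_PHASES:
            break
        time.sleep(60)  # long-polling interval; one loop per watched workflow

    if execution.closure.phase == WorkflowExecutionPhase.FAILED:
        # Relaunch from the start -- costly when the workflow is intentionally uncached.
        wf = remote.fetch_workflow(name="my_project.workflows.training_wf")
        remote.execute(wf, inputs=inputs)
```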

Propose: Link/Inline OR Additional context

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@apatel-fn apatel-fn added enhancement New feature or request untriaged This issues has not yet been looked at by the Maintainers labels Mar 8, 2022
@welcome

welcome bot commented Mar 8, 2022

Thank you for opening your first issue here! 🛠

@kumare3 kumare3 removed the untriaged This issues has not yet been looked at by the Maintainers label Mar 8, 2022
@kumare3 kumare3 added this to the 1.1.0 - Hawk milestone Mar 8, 2022
@wild-endeavor wild-endeavor modified the milestones: 1.1.0 - Hawk, 1.2.0 Jun 28, 2022
@github-actions

Hello 👋, This issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! 🙏

@github-actions github-actions bot added the stale label Aug 28, 2023
@hamersaw
Contributor

Commenting to keep open.

@github-actions github-actions bot removed the stale label Aug 31, 2023
@swarup-stripe

👋 We're very much interested in this feature as well! What would it take to implement this?

I assume we can leverage the Kubernetes-native Vertical Pod Autoscaler (VPA). If I'm not mistaken, on Flyte's side we'd need to add configuration (on the task or on propeller), and propeller would create a VerticalPodAutoscaler resource for the task pod if VPA is enabled.

I guess one question is whether the VPA's auto update mode would suffice, or whether more flytepropeller changes are required to persist any of VPA's resource adjustments.

I might have a naive understanding here so let me know if I'm missing anything!
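
To make that concrete, here's an illustrative sketch (via the Kubernetes Python client) of the kind of VerticalPodAutoscaler object propeller might create. Names, namespace, and the targetRef are placeholders; note that VPA's "Auto" mode works by evicting pods and relying on a controller to recreate them, while Flyte task pods are created directly by propeller, which is exactly the open question above:

```python
from kubernetes import client, config

config.load_kube_config()

# Placeholder VPA object; bare task pods may not be a supported VPA target,
# and a real integration would likely need propeller-side handling instead.
vpa_body = {
    "apiVersion": "autoscaling.k8s.io/v1",
    "kind": "VerticalPodAutoscaler",
    "metadata": {"name": "my-task-vpa", "namespace": "flytesnacks-development"},
    "spec": {
        "targetRef": {"apiVersion": "v1", "kind": "Pod", "name": "my-task-pod"},
        "updatePolicy": {"updateMode": "Auto"},
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="autoscaling.k8s.io",
    version="v1",
    namespace="flytesnacks-development",
    plural="verticalpodautoscalers",
    body=vpa_body,
)
```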

@davidmirror-ops
Contributor

Notes from the contributors' meetup: looks great. Suggestion: create an RFC Incubator post to discuss further.

@Mecoli1219
Contributor

Hi! I would like to work on this issue. Are there any recent updates?

7 participants