Adapt LeaderWorkerSet to implement multi-node distributed inference #4001

Open
3 tasks
JesseStutler opened this issue Feb 7, 2025 · 1 comment
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@JesseStutler
Member

JesseStutler commented Feb 7, 2025

What is the problem you're trying to solve

Background

Large language models are being developed and adopted at an explosive pace, and open-source models such as DeepSeek-R1 keep emerging, which drives demand from developers to deploy large models in local environments. However, as model parameter counts continue to grow, the memory capacity of a single device is no longer sufficient to hold the complete model. Some inference frameworks have begun actively exploring multi-node distributed inference solutions.

New API for multi-node distributed inference

LeaderWorkerSet

A Kubernetes SIG has designed a new API for the multi-node distributed inference scenario, called LeaderWorkerSet:
https://github.com/kubernetes-sigs/lws

KServe ServingRuntime/ClusterServingRuntime WorkerSpec

KServe has also modified its serving API, adding a new field called WorkerSpec to implement multi-node distributed inference.

After discussing with @Monokaix and @hwdef, we agreed that we should adapt LeaderWorkerSet first and then gather end users' feedback.

Describe the solution you'd like

By design, LeaderWorkerSet has the concept of a logical PodGroup, corresponding to 1 Leader + n Workers. Volcano needs to keep this logical PodGroup concept consistent with Volcano's own PodGroup. The replicas in a LeaderWorkerSet represent the number of Volcano PodGroups to be created. Within each PodGroup, one task is the Leader Pod, with a replica count of 1, and the other task is the Workers. The following tasks therefore need to be adapted:

  • Add a LeaderWorkerSet controller that reconciles to create PodGroups for each LeaderWorkerSet
  • Implement network-topology-aware scheduling for worker pods
  • Adapt the LeaderWorkerSet RestartPolicy
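
To make the proposed mapping concrete, here is a minimal sketch of how a controller might plan PodGroups from a LeaderWorkerSet. It is not Volcano or LWS code: `PodGroupPlan`, `plan_pod_groups`, the `<name>-<index>` naming scheme, and the choice to set the gang size to the whole group are all assumptions made for illustration. It also assumes the group size counts the leader plus its workers (1 + n).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PodGroupPlan:
    """Hypothetical, simplified stand-in for a Volcano PodGroup."""
    name: str
    min_member: int               # assumed: the whole group is scheduled together
    task_members: Dict[str, int]  # task name -> number of pods in that task


def plan_pod_groups(lws_name: str, replicas: int, group_size: int) -> List[PodGroupPlan]:
    """Plan one PodGroup per LeaderWorkerSet replica.

    group_size is assumed to include the leader, i.e. 1 leader + n workers.
    """
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    if group_size < 1:
        raise ValueError("group_size must include at least the leader")

    plans = []
    for index in range(replicas):
        plans.append(
            PodGroupPlan(
                name=f"{lws_name}-{index}",  # naming scheme is an assumption
                min_member=group_size,
                # One leader task with a single pod; the rest are workers.
                task_members={"leader": 1, "worker": group_size - 1},
            )
        )
    return plans
```

For example, `plan_pod_groups("demo", 2, 4)` yields two planned PodGroups, each with one leader and three workers. Whether the real controller should require the full group before scheduling, or only a subset, is a design decision this sketch does not settle.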

Additional context

No response

@JesseStutler JesseStutler added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 7, 2025
@JesseStutler
Member Author

Milestone v1.12; we may need to start implementing this soon.
