Proposal of dynamic GPU slice plugin #3820

sailorvii · 2024-11-14T07:16:45Z

NVIDIA's official GPU sharing mechanisms are time-slicing, MPS, and MIG. Dynamic MPS and MIG partitioning is not currently supported, so we want to add it as a Volcano scheduler plugin.

volcano-sh-bot · 2024-11-14T07:16:48Z

Welcome @sailorvii!

It looks like this is your first PR to volcano-sh/volcano.

Thank you, and welcome to Volcano. 😃

volcano-sh-bot · 2024-11-14T07:16:51Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign lowang-bh
You can assign the PR to them by writing /assign @lowang-bh in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Monokaix · 2024-11-14T09:21:26Z

Hi, please squash to one commit and sign off.

Signed-off-by: sailorvii <[email protected]> Signed-off-by: chenw66 <[email protected]>

JesseStutler

I have reviewed it, please take a look~

JesseStutler · 2024-11-18T02:27:19Z

docs/design/dynamic-gpu-slice.md

+    - plugins:
+      - name: cugpushare
+        arguments:
+          cugpushare.schedulePolicy: spread


What does cu mean in cugpushare?

Maybe we can rename the plugin? There is already a plugin called deviceshare, so the new name should make it clear that this plugin dynamically splits the GPU.

Sure, let's rename it.

JesseStutler · 2024-11-18T07:29:17Z

docs/design/images/dynamicGPUSliceSlice.png

Where is the logic of this AddPod? Is it in the mig-agent of nos? I'm wondering whether our dynamic GPU slice plugin depends strongly on the nos project. The annotations are visibly nos-branded, and the nos project is not updated frequently.

The AddPod logic is in addResource, in volcano/pkg/scheduler/api/node_info.go.

Three components can be reused from the nos project: the MIG agent, the MPS agent, and the MPS device plugin. They are not the most important part, and we could rewrite them if needed.

JesseStutler · 2024-11-18T07:41:49Z

docs/design/images/dynamicGPUSliceSlice.png

I also find the design of nos strange. Every MIG profile, whether free or used, is written as a separate annotation entry on the node, and the MIG profiles requested by a Pod are written as annotation entries too. That is not how annotations are meant to be used. Can we define a CRD to manage these specs and statuses, and have the node refer to it? Alternatively, we could aggregate the specs and statuses into one JSON struct and store it as a single annotation entry.

Sure, it's good to bind those annotations.
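To illustrate the single-annotation option discussed above, here is a minimal Go sketch that packs per-GPU profile counts into one JSON value. The annotation key `example.volcano.sh/gpu-slice-state` and all type and field names are hypothetical; they are not taken from the proposal, Volcano, or nos.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sliceStateAnnotation is a hypothetical key used only for this sketch.
const sliceStateAnnotation = "example.volcano.sh/gpu-slice-state"

// ProfileCount records how many slices of one profile are free or used on a GPU.
type ProfileCount struct {
	Profile string `json:"profile"` // for example "1g.10gb"
	Free    int    `json:"free"`
	Used    int    `json:"used"`
}

// GPUState groups the profile counts of one physical GPU.
type GPUState struct {
	Index    int            `json:"index"`
	Profiles []ProfileCount `json:"profiles"`
}

// NodeSliceState is the whole per-node payload stored under a single annotation.
type NodeSliceState struct {
	GPUs []GPUState `json:"gpus"`
}

// EncodeSliceState serializes the state and stores it under one annotation key.
func EncodeSliceState(annotations map[string]string, state NodeSliceState) error {
	raw, err := json.Marshal(state)
	if err != nil {
		return err
	}
	annotations[sliceStateAnnotation] = string(raw)
	return nil
}

// DecodeSliceState reads the state back; a missing key yields an empty state.
func DecodeSliceState(annotations map[string]string) (NodeSliceState, error) {
	var state NodeSliceState
	raw, ok := annotations[sliceStateAnnotation]
	if !ok {
		return state, nil
	}
	err := json.Unmarshal([]byte(raw), &state)
	return state, err
}

func main() {
	annotations := map[string]string{}
	state := NodeSliceState{GPUs: []GPUState{{
		Index:    0,
		Profiles: []ProfileCount{{Profile: "1g.10gb", Free: 2, Used: 1}},
	}}}
	if err := EncodeSliceState(annotations, state); err != nil {
		panic(err)
	}
	fmt.Println(annotations[sliceStateAnnotation])
}
```

A CRD would give the same data a typed schema and a status subresource; the single annotation is the lighter option because it needs no new API type.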

JesseStutler · 2024-11-18T08:19:58Z

docs/design/images/dynamicGPUSliceScore.png

How is the capacity for binpack or spread calculated here? Because we divide the GPU into MIG profiles dynamically, we don't know in advance how many MIG profiles a GPU can be divided into.

Oh, here we only use the memory as the resource capacity calculation. Each profile has a dedicated memory size.
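As a rough sketch of memory-only scoring, the Go example below converts a requested profile to its dedicated memory size and scores a GPU by its memory utilization after placement. The function name `Score`, the 0 to 100 scale, and the profile table are assumptions made for illustration; the proposal's actual formula is in the scoring diagram and may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// profileMemoryGB maps a slice profile to its dedicated memory size in GB.
// The entries are illustrative; a real plugin would load its own table.
var profileMemoryGB = map[string]int{
	"1g.10gb": 10,
	"2g.20gb": 20,
	"3g.40gb": 40,
	"7g.80gb": 80,
}

// ErrNoFit is returned when the requested profile does not fit in the free memory.
var ErrNoFit = errors.New("profile does not fit in free GPU memory")

// Score returns a value in [0, 100] based only on GPU memory.
// "binpack" prefers GPUs that end up fuller; "spread" prefers GPUs that end up emptier.
func Score(policy string, totalGB, usedGB int, profile string) (float64, error) {
	need, ok := profileMemoryGB[profile]
	if !ok {
		return 0, fmt.Errorf("unknown profile %q", profile)
	}
	if totalGB <= 0 || usedGB < 0 || usedGB+need > totalGB {
		return 0, ErrNoFit
	}
	// Utilization after the requested profile has been placed.
	utilization := float64(usedGB+need) / float64(totalGB)
	switch policy {
	case "binpack":
		return 100 * utilization, nil
	case "spread":
		return 100 * (1 - utilization), nil
	default:
		return 0, fmt.Errorf("unknown policy %q", policy)
	}
}

func main() {
	for _, policy := range []string{"binpack", "spread"} {
		s, err := Score(policy, 80, 20, "2g.20gb")
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s: %.1f\n", policy, s)
	}
}
```

Note that memory alone does not capture MIG placement rules, so a profile that fits by memory may still be impossible to create on a given GPU; this sketch ignores that case.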

JesseStutler · 2024-11-18T09:11:41Z

docs/design/dynamic-gpu-slice.md

+          cugpushare.weight.cpu: 1
+          cugpushare.weight.memory: 2
+          cugpushare.weight.gpu: 5
+          cugpushare.DevicePluginCMName: mps-configmap


Why do we need to specify the name and ns of the configmap here, isn't this fixed?

This ConfigMap is shared between the planner and the device plugin, and its name is configurable. The planner (in the scheduler) and the device plugin must both be configured with the same ConfigMap name.
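One way the scheduler side could resolve the shared ConfigMap reference from the plugin arguments is sketched below in Go. Only the key `cugpushare.DevicePluginCMName` and the value `mps-configmap` come from the proposal; the namespace key, the default namespace, and the helper names are hypothetical, and the arguments are modeled as a plain map rather than a Volcano type.

```go
package main

import "fmt"

const (
	cmNameKey          = "cugpushare.DevicePluginCMName"      // key shown in the proposal
	cmNamespaceKey     = "cugpushare.DevicePluginCMNamespace" // hypothetical key for this sketch
	defaultCMName      = "mps-configmap"                      // value shown in the proposal
	defaultCMNamespace = "default"                            // assumed fallback namespace
)

// ConfigMapRef identifies the ConfigMap shared by the planner and the device plugin.
type ConfigMapRef struct {
	Namespace string
	Name      string
}

// stringArg returns the non-empty string stored under key, or fallback otherwise.
func stringArg(args map[string]interface{}, key, fallback string) string {
	if v, ok := args[key].(string); ok && v != "" {
		return v
	}
	return fallback
}

// ResolveConfigMapRef reads the ConfigMap name and namespace from plugin arguments.
func ResolveConfigMapRef(args map[string]interface{}) ConfigMapRef {
	return ConfigMapRef{
		Namespace: stringArg(args, cmNamespaceKey, defaultCMNamespace),
		Name:      stringArg(args, cmNameKey, defaultCMName),
	}
}

func main() {
	args := map[string]interface{}{cmNameKey: "shared-mps-config"}
	ref := ResolveConfigMapRef(args)
	fmt.Printf("%s/%s\n", ref.Namespace, ref.Name)
}
```

The device plugin would need to resolve the same namespace and name on its side; falling back to a fixed default keeps the two components aligned when neither is configured explicitly.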

volcano-sh-bot added the retest-not-required-docs-only label Nov 14, 2024

volcano-sh-bot requested review from Monokaix and Thor-wl November 14, 2024 07:16

volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 14, 2024

sailorvii force-pushed the master branch from 596744d to e3ffd7e Compare November 15, 2024 01:46

Add dynamic GPU slice proposal

0000b26

Signed-off-by: sailorvii <[email protected]> Signed-off-by: chenw66 <[email protected]>

sailorvii force-pushed the master branch from e3ffd7e to 0000b26 Compare November 15, 2024 01:57

JesseStutler reviewed Nov 18, 2024

JesseStutler Nov 18, 2024

JesseStutler Nov 18, 2024

sailorvii Nov 19, 2024

JesseStutler Nov 18, 2024

sailorvii Nov 19, 2024

JesseStutler Nov 18, 2024

sailorvii Nov 19, 2024

JesseStutler Nov 18, 2024

sailorvii Nov 19, 2024

JesseStutler Nov 18, 2024

sailorvii Nov 19, 2024
