NVIDIA Mig support #1238

Merged: 5 commits from mig-support into sustainable-computing-io:main on Feb 21, 2024
Conversation

@rootfs (Contributor) commented Feb 16, 2024

fix #1198

Test status

I am running a text-generation-inference pod and measuring its power; the result is shown under Test result below.

Load generation

kubectl exec into the text-generation-inference pod and run the following command, which generates text and pauses 5 seconds after each run:

# while true; do curl localhost:8080/generate -X POST -H "Content-Type: application/json" -d '{"inputs": "<s>[INST] Write a K8s YAML file to create a pod that deploys nginx[/INST]", "parameters": {"max_new_tokens": 400}}';sleep 5;done

Test result

# nvidia-smi
+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG|
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  1    2   0   0  |           18816MiB / 19968MiB  | 42      0 |  3   0    2    0    0 |
|                  |               2MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    1    2    0     494327      C   /opt/conda/bin/python3.9                  18770MiB |
+---------------------------------------------------------------------------------------+

[screenshot: measured power during the test]

Method

The current way of estimating per-process energy on a MIG device is as follows:

  • Get the MIG instance's multiprocessor count ratio with respect to the whole GPU, noted as multiprocessorCountRatio below. This is done by parsing the nvidia-smi -q -x output.
  • Get the DCGM_FI_PROF_PIPE_TENSOR_ACTIVE counter from the MIG instance (thanks to @yuezhu1).
  • Calculate the process's smUtil = multiprocessorCountRatio * DCGM_FI_PROF_PIPE_TENSOR_ACTIVE (see the sketch after this list).
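A minimal sketch of this calculation in Go, under stated assumptions: migSMUtil and its inputs are hypothetical names standing in for Kepler's actual parsing of nvidia-smi -q -x and the DCGM field query, and the values in main are for illustration only.

package main

import "fmt"

// migSMUtil estimates a MIG process's utilization relative to the whole GPU.
// multiprocessorCountRatio is the MIG instance's share of the GPU's
// multiprocessors (parsed from nvidia-smi -q -x), and tensorActive is the raw
// DCGM_FI_PROF_PIPE_TENSOR_ACTIVE value for that MIG instance (0.0 to 1.0).
func migSMUtil(multiprocessorCountRatio, tensorActive float64) uint32 {
    // Scale the MIG-local tensor activity by the instance's share of the GPU
    // and express it as a percentage, mirroring smUtil in the PR diff.
    return uint32(tensorActive * 100 * multiprocessorCountRatio)
}

func main() {
    // Illustrative values: a MIG slice owning 3/7 of the GPU's multiprocessors,
    // with the tensor pipe 42% active inside the slice.
    fmt.Println(migSMUtil(3.0/7.0, 0.42)) // prints 18
}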

TODO

  • This PR mixes NVML (to get running processes) and DCGM, pulling information from two different libraries. We should use just one library (issue opened against go-dcgm).
  • This PR doesn't address cases where the MIG devices are dynamically reconfigured.
  • We need to test more cases: no MIG, one MIG, multiple MIGs, processes running on more than one MIG, processes running on both a MIG and the full GPU, etc.

label := deviceFieldsString[i]
value := ToString(val)
klog.Infof("Device %v Label %v Val: %v", entityName, label, value)
if val.FieldId == dcgm.DCGM_FI_PROF_SM_ACTIVE {
Review comment:
we can use DCGM_FI_PROF_PIPE_TENSOR_ACTIVE instead of DCGM_FI_PROF_SM_ACTIVE here, since DCGM_FI_PROF_PIPE_TENSOR_ACTIVE tracks the actual resource utilization rate more closely
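A minimal sketch of the suggested swap, reusing names from the diff above (val, value, entityName and the klog call come from the surrounding Kepler code):

// Match on the tensor-pipe activity field rather than generic SM activity,
// since it tracks real compute utilization more closely.
if val.FieldId == dcgm.DCGM_FI_PROF_PIPE_TENSOR_ACTIVE {
    klog.Infof("Device %v tensor active: %v", entityName, value)
}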

@rootfs (Contributor, Author) replied:
updated

@rootfs force-pushed the mig-support branch 2 times, most recently from c1f21ee to 3a06f00 on February 16, 2024 19:43
@rootfs requested a review from wangchen615 on February 16, 2024 19:57
@rootfs changed the title from "[WIP] NVIDIA Mig support" to "NVIDIA Mig support" on Feb 16, 2024
@rootfs force-pushed the mig-support branch 2 times, most recently from f9b0085 to 0a19ff4 on February 17, 2024 00:09
@rootfs marked this pull request as draft on February 17, 2024 02:16
@rootfs marked this pull request as ready for review on February 17, 2024 15:16
if val.FieldId == ratioFields {
floatVal, _ := strconv.ParseFloat(value, 32)
// ratio of active multiprocessors to total multiprocessors
smUtil := uint32(floatVal * 100 * multiprocessorCountRatio)
@marceloamaral (Collaborator) commented Feb 18, 2024:
This smUtil variable represents tensor core utilization, not SM utilization.

It might be better to just use "util" since the variable will hold the utilization of different components in the future.

processAcceleratorMetrics[p.Pid] = ProcessUtilizationSample{
Pid: p.Pid,
TimeStamp: uint64(time.Now().UnixNano()),
SmUtil: smUtil,
Collaborator:
Same here.

@@ -75,7 +75,7 @@ func NewStats() *Stats {
m.ResourceUsage[metricName] = types.NewUInt64StatCollection()
}

if gpu.IsGPUCollectionSupported() {
if config.EnabledGPU {
Collaborator:
Why not use gpu.IsGPUCollectionSupported()?
A user could enable GPU in the Kepler configuration, but the system might not have any GPUs.
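A minimal sketch of the combined guard this comment suggests (both identifiers appear in the diff above; the body is elided):

// Register GPU metrics only when GPU collection is enabled in the Kepler
// config and the node actually has a supported GPU.
if config.EnabledGPU && gpu.IsGPUCollectionSupported() {
    // ... register GPU resource usage metrics here ...
}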

@@ -78,8 +78,9 @@ func UpdateNodeGPUUtilizationMetrics(processStats map[uint64]*stats.ProcessStats
}
processStats[uintPid] = stats.NewProcessStats(uintPid, uint64(0), containerID, vmID, command)
}
processStats[uintPid].ResourceUsage[config.GPUSMUtilization].AddDeltaStat(utils.GenericSocketID, uint64(processUtilization.SmUtil))
processStats[uintPid].ResourceUsage[config.GPUMemUtilization].AddDeltaStat(utils.GenericSocketID, uint64(processUtilization.MemUtil))
gpuName := fmt.Sprintf("%s%v", utils.GenericGPUID, gpuID)
Collaborator:
@rootfs how did this solve the GPU MIG naming problem?
You are also using the GPU ID as the key, as in PR #1236.
Note that utils.GenericGPUID will be the same for all MIG instances, so in the end this is doing the same thing as PR #1236.

Also, note that PR #1236 is fixing other things :)

klog.V(debugLevel).Infof("failed to get latest values for fields: %v", err)
return processAcceleratorMetrics, err
}
gpuSMActive := uint32(0)
Collaborator:
Same here: this is tensor utilization, not SM utilization.
It might be better to just use "utilization".

@rootfs force-pushed the mig-support branch 2 times, most recently from 973f761 to 21f7a2e on February 20, 2024 14:19

@rootfs (Contributor, Author) commented Feb 20, 2024:
@marceloamaral review addressed

d.entities = make(map[string]dcgm.GroupEntityPair)

// cleanup, err := dcgm.Init(dcgm.Embedded) // embeded mode is not recommended for production per https://github.com/NVIDIA/dcgm-exporter/issues/22#issuecomment-1321521995
cleanup, err := dcgm.Init(dcgm.Standalone, config.DCGMHostEngineEndpoint, "0")
@rootfs (Contributor, Author):
This requires:

  • the host must enable the DCGM service and run nv-hostengine
  • the pod spec must enable hostNetwork so Kepler can reach the NVIDIA DCGM host engine (a sketch of the setup follows this list)
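A minimal sketch of the standalone setup under those assumptions; the error handling and the d.cleanup field are illustrative, and config.DCGMHostEngineEndpoint is expected to hold the nv-hostengine address (e.g. localhost:5555) reachable over hostNetwork:

// Connect to an already-running nv-hostengine rather than embedding DCGM
// in-process. The trailing "0" tells go-dcgm the endpoint is a TCP address,
// not a Unix socket path.
cleanup, err := dcgm.Init(dcgm.Standalone, config.DCGMHostEngineEndpoint, "0")
if err != nil {
    return fmt.Errorf("failed to connect to nv-hostengine at %s: %w", config.DCGMHostEngineEndpoint, err)
}
// d.cleanup is a hypothetical field used here to hold the teardown function;
// call it when the collector shuts down to close the host engine connection.
d.cleanup = cleanup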

@rootfs (Contributor, Author):
To investigate how to use embedded mode after this PR is merged.

Collaborator:
I forgot to mention that standalone mode connecting to localhost works in K8s environments.

@rootfs (Contributor, Author):
OK, let's consider it an enhancement then.

@rootfs (Contributor, Author):
Here is the issue: #1243

@rootfs (Contributor, Author) commented Feb 20, 2024:

A test image is built: quay.io/sustainable_computing_io/kepler:latest-dcgm

@marceloamaral (Collaborator) left a review:

/lgtm

@rootfs merged commit 87228e6 into sustainable-computing-io:main on Feb 21, 2024
19 checks passed
Successfully merging this pull request may close these issues.

GPU NVIDIA H100 PCIe not supported