NVIDIA Mig support #1238

Merged: 5 commits from mig-support into sustainable-computing-io:main on Feb 21, 2024
Conversation

@rootfs (Contributor) commented Feb 16, 2024

fix #1198

Test status

I am running a text-generation-inference pod and measuring its power; the result is shown under Test result below.

Load generation

kubectl exec into the text-generation-inference pod and run the following command, which generates text and pauses 5 seconds after each run:

# while true; do curl localhost:8080/generate -X POST -H "Content-Type: application/json" -d '{"inputs": "<s>[INST] Write a K8s YAML file to create a pod that deploys nginx[/INST]", "parameters": {"max_new_tokens": 400}}';sleep 5;done

Test result

# nvidia-smi
+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG|
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  1    2   0   0  |           18816MiB / 19968MiB  | 42      0 |  3   0    2    0    0 |
|                  |               2MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    1    2    0     494327      C   /opt/conda/bin/python3.9                  18770MiB |
+---------------------------------------------------------------------------------------+

[screenshot: measured power during the test]

Method

The current way of estimating per-process energy on a MIG device is as follows:

  • Get the MIG instance's multiprocessor count ratio with respect to the whole GPU, noted as multiprocessorCountRatio below. This is done by parsing the nvidia-smi -q -x output.
  • Get the DCGM_FI_PROF_PIPE_TENSOR_ACTIVE counter from the MIG instance (thanks to @yuezhu1).
  • Calculate the process's smUtil = multiprocessorCountRatio * DCGM_FI_PROF_PIPE_TENSOR_ACTIVE (see the sketch after this list).
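A minimal sketch of this calculation in Go, under stated assumptions: migSMUtil and its inputs are hypothetical names standing in for Kepler's actual parsing of nvidia-smi -q -x and the DCGM field query, and the values in main are for illustration only.

package main

import "fmt"

// migSMUtil estimates a MIG process's utilization relative to the whole GPU.
// multiprocessorCountRatio is the MIG instance's share of the GPU's
// multiprocessors (parsed from nvidia-smi -q -x), and tensorActive is the raw
// DCGM_FI_PROF_PIPE_TENSOR_ACTIVE value for that MIG instance (0.0 to 1.0).
func migSMUtil(multiprocessorCountRatio, tensorActive float64) uint32 {
    // Scale the MIG-local tensor activity by the instance's share of the GPU
    // and express it as a percentage, mirroring smUtil in the PR diff.
    return uint32(tensorActive * 100 * multiprocessorCountRatio)
}

func main() {
    // Illustrative values: a MIG slice owning 3/7 of the GPU's multiprocessors,
    // with the tensor pipe 42% active inside the slice.
    fmt.Println(migSMUtil(3.0/7.0, 0.42)) // prints 18
}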

TODO

  • This PR mixes NVML (to get running processes) and DCGM, pulling information from two different libraries. We should use just one library (issue opened against go-dcgm).
  • This PR doesn't address cases where the MIG devices are dynamically reconfigured.
  • We need to test more cases: no MIG, one MIG, multiple MIGs, processes running on more than one MIG, processes running on both a MIG and the full GPU, etc.

label := deviceFieldsString[i]
value := ToString(val)
klog.Infof("Device %v Label %v Val: %v", entityName, label, value)
if val.FieldId == dcgm.DCGM_FI_PROF_SM_ACTIVE {
Review comment:
we can use DCGM_FI_PROF_PIPE_TENSOR_ACTIVE instead of DCGM_FI_PROF_SM_ACTIVE here, since DCGM_FI_PROF_PIPE_TENSOR_ACTIVE tracks the actual resource utilization rate more closely
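A minimal sketch of the suggested swap, reusing names from the diff above (val, value, entityName and the klog call come from the surrounding Kepler code):

// Match on the tensor-pipe activity field rather than generic SM activity,
// since it tracks real compute utilization more closely.
if val.FieldId == dcgm.DCGM_FI_PROF_PIPE_TENSOR_ACTIVE {
    klog.Infof("Device %v tensor active: %v", entityName, value)
}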

@rootfs (Contributor, Author) replied:
updated

@rootfs force-pushed the mig-support branch 2 times, most recently from c1f21ee to 3a06f00 on February 16, 2024 19:43
@rootfs requested a review from wangchen615 on February 16, 2024 19:57
@rootfs changed the title from "[WIP] NVIDIA Mig support" to "NVIDIA Mig support" on Feb 16, 2024
@rootfs force-pushed the mig-support branch 2 times, most recently from f9b0085 to 0a19ff4 on February 17, 2024 00:09
@rootfs marked this pull request as draft on February 17, 2024 02:16
@rootfs marked this pull request as ready for review on February 17, 2024 15:16
if val.FieldId == ratioFields {
floatVal, _ := strconv.ParseFloat(value, 32)
// ratio of active multiprocessors to total multiprocessors
smUtil := uint32(floatVal * 100 * multiprocessorCountRatio)
@marceloamaral (Collaborator) commented Feb 18, 2024:
This smUtil variable represents tensor core utilization, not SM utilization.

It might be better to just use "util" since the variable will hold the utilization of different components in the future.

processAcceleratorMetrics[p.Pid] = ProcessUtilizationSample{
Pid: p.Pid,
TimeStamp: uint64(time.Now().UnixNano()),
SmUtil: smUtil,
Collaborator:
Same here.

@@ -75,7 +75,7 @@ func NewStats() *Stats {
m.ResourceUsage[metricName] = types.NewUInt64StatCollection()
}

if gpu.IsGPUCollectionSupported() {
if config.EnabledGPU {
Collaborator:
Why not use gpu.IsGPUCollectionSupported()?
A user could enable GPU in the Kepler configuration, but the system might not have any GPUs.
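A minimal sketch of the combined guard this comment suggests (both identifiers appear in the diff above; the body is elided):

// Register GPU metrics only when GPU collection is enabled in the Kepler
// config and the node actually has a supported GPU.
if config.EnabledGPU && gpu.IsGPUCollectionSupported() {
    // ... register GPU resource usage metrics here ...
}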

@@ -78,8 +78,9 @@ func UpdateNodeGPUUtilizationMetrics(processStats map[uint64]*stats.ProcessStats
}
processStats[uintPid] = stats.NewProcessStats(uintPid, uint64(0), containerID, vmID, command)
}
processStats[uintPid].ResourceUsage[config.GPUSMUtilization].AddDeltaStat(utils.GenericSocketID, uint64(processUtilization.SmUtil))
processStats[uintPid].ResourceUsage[config.GPUMemUtilization].AddDeltaStat(utils.GenericSocketID, uint64(processUtilization.MemUtil))
gpuName := fmt.Sprintf("%s%v", utils.GenericGPUID, gpuID)
Collaborator:
@rootfs how did this solve the GPU MIG naming problem?
You are also using the GPU ID as the key, as in PR #1236.
Note that utils.GenericGPUID will be the same for all MIG instances, so in the end this is doing the same thing as PR #1236.

Also, note that PR #1236 is fixing other things :)

klog.V(debugLevel).Infof("failed to get latest values for fields: %v", err)
return processAcceleratorMetrics, err
}
gpuSMActive := uint32(0)
Collaborator:
Same here: this is tensor utilization, not SM utilization.
It might be better to just use "utilization".

@rootfs force-pushed the mig-support branch 2 times, most recently from 973f761 to 21f7a2e on February 20, 2024 14:19

@rootfs (Contributor, Author) commented Feb 20, 2024:
@marceloamaral review addressed

d.entities = make(map[string]dcgm.GroupEntityPair)

// cleanup, err := dcgm.Init(dcgm.Embedded) // embeded mode is not recommended for production per https://github.com/NVIDIA/dcgm-exporter/issues/22#issuecomment-1321521995
cleanup, err := dcgm.Init(dcgm.Standalone, config.DCGMHostEngineEndpoint, "0")
@rootfs (Contributor, Author):
This requires:

  • the host must enable the DCGM service and run nv-hostengine
  • the pod spec must enable hostNetwork so Kepler can reach the NVIDIA DCGM host engine (a sketch of the setup follows this list)
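A minimal sketch of the standalone setup under those assumptions; the error handling and the d.cleanup field are illustrative, and config.DCGMHostEngineEndpoint is expected to hold the nv-hostengine address (e.g. localhost:5555) reachable over hostNetwork:

// Connect to an already-running nv-hostengine rather than embedding DCGM
// in-process. The trailing "0" tells go-dcgm the endpoint is a TCP address,
// not a Unix socket path.
cleanup, err := dcgm.Init(dcgm.Standalone, config.DCGMHostEngineEndpoint, "0")
if err != nil {
    return fmt.Errorf("failed to connect to nv-hostengine at %s: %w", config.DCGMHostEngineEndpoint, err)
}
// d.cleanup is a hypothetical field used here to hold the teardown function;
// call it when the collector shuts down to close the host engine connection.
d.cleanup = cleanup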

@rootfs (Contributor, Author):
To investigate how to use embedded mode after this PR is merged.

Collaborator:
I forgot to mention that standalone mode connecting to localhost works in K8s environments.

@rootfs (Contributor, Author):
OK, let's consider it an enhancement then.

@rootfs (Contributor, Author):
Here is the issue: #1243

@rootfs (Contributor, Author) commented Feb 20, 2024:

A test image is built: quay.io/sustainable_computing_io/kepler:latest-dcgm

@marceloamaral (Collaborator) left a review:

/lgtm

@rootfs merged commit 87228e6 into sustainable-computing-io:main on Feb 21, 2024
19 checks passed
Successfully merging this pull request may close these issues.

GPU NVIDIA H100 PCIe not supported