NVIDIA Mig support #1238
Conversation
label := deviceFieldsString[i]
value := ToString(val)
klog.Infof("Device %v Label %v Val: %v", entityName, label, value)
if val.FieldId == dcgm.DCGM_FI_PROF_SM_ACTIVE {
We can use DCGM_FI_PROF_PIPE_TENSOR_ACTIVE instead of DCGM_FI_PROF_SM_ACTIVE here, since DCGM_FI_PROF_PIPE_TENSOR_ACTIVE tracks the actual resource utilization rate more closely.
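For reference, a standalone sketch of that handling (the numeric field IDs are DCGM's public values, copied here only so the snippet compiles on its own; the helper name is made up):

```go
package main

import (
	"fmt"
	"strconv"
)

// DCGM profiling field IDs; in the PR these come from the go-dcgm package
// as dcgm.DCGM_FI_PROF_SM_ACTIVE / dcgm.DCGM_FI_PROF_PIPE_TENSOR_ACTIVE.
const (
	dcgmFiProfSMActive         = 1002
	dcgmFiProfPipeTensorActive = 1004
)

// utilizationPercent is a hypothetical helper: it converts the raw DCGM
// profiling ratio (0.0-1.0, delivered as a string) into a 0-100 percentage,
// keyed on the tensor-pipe activity field suggested above.
func utilizationPercent(fieldID int, value string) (uint32, bool) {
	if fieldID != dcgmFiProfPipeTensorActive {
		return 0, false
	}
	ratio, err := strconv.ParseFloat(value, 32)
	if err != nil {
		return 0, false
	}
	return uint32(ratio * 100), true
}

func main() {
	if util, ok := utilizationPercent(dcgmFiProfPipeTensorActive, "0.42"); ok {
		fmt.Printf("tensor-pipe activity: %d%%\n", util) // prints 42%
	}
}
```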
updated
Force-pushed from c1f21ee to 3a06f00
Force-pushed from f9b0085 to 0a19ff4
if val.FieldId == ratioFields {
	floatVal, _ := strconv.ParseFloat(value, 32)
	// ratio of active multiprocessors to total multiprocessors
	smUtil := uint32(floatVal * 100 * multiprocessorCountRatio)
This smUtil variable represents tensor core utilization, not SM utilization. It might be better to just use "util", since the variable will hold the utilization of different components in the future.
processAcceleratorMetrics[p.Pid] = ProcessUtilizationSample{
	Pid:       p.Pid,
	TimeStamp: uint64(time.Now().UnixNano()),
	SmUtil:    smUtil,
Same here.
pkg/collector/stats/stats.go
@@ -75,7 +75,7 @@ func NewStats() *Stats {
 m.ResourceUsage[metricName] = types.NewUInt64StatCollection()
 }

-if gpu.IsGPUCollectionSupported() {
+if config.EnabledGPU {
Why not use gpu.IsGPUCollectionSupported()? A user could enable GPU in the Kepler configuration, but the system might not have GPUs.
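One way to address both concerns might be to gate on the config flag and the runtime capability together. A minimal sketch, with local stand-ins for Kepler's config.EnabledGPU and gpu.IsGPUCollectionSupported():

```go
package main

import "fmt"

// Stand-ins for Kepler's config.EnabledGPU flag and
// gpu.IsGPUCollectionSupported(); the real implementations live in Kepler,
// these placeholders only make the sketch runnable.
var enabledGPU = true

func isGPUCollectionSupported() bool {
	// In Kepler this probes the GPU driver/library; hard-coded here.
	return false
}

func main() {
	// Only set up GPU stat collections when GPU collection is enabled in the
	// configuration AND the node actually exposes a usable GPU, so enabling
	// the flag on a GPU-less node does not allocate empty collections.
	if enabledGPU && isGPUCollectionSupported() {
		fmt.Println("initializing GPU utilization stat collections")
	} else {
		fmt.Println("skipping GPU stat collections")
	}
}
```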
@@ -78,8 +78,9 @@ func UpdateNodeGPUUtilizationMetrics(processStats map[uint64]*stats.ProcessStats
 }
 processStats[uintPid] = stats.NewProcessStats(uintPid, uint64(0), containerID, vmID, command)
 }
-processStats[uintPid].ResourceUsage[config.GPUSMUtilization].AddDeltaStat(utils.GenericSocketID, uint64(processUtilization.SmUtil))
-processStats[uintPid].ResourceUsage[config.GPUMemUtilization].AddDeltaStat(utils.GenericSocketID, uint64(processUtilization.MemUtil))
+gpuName := fmt.Sprintf("%s%v", utils.GenericGPUID, gpuID)
@rootfs how did this solve the GPU MIG naming problem? You are also using the GPU ID as the key, as in PR #1236.
Note that utils.GenericGPUID will be the same for all MIG instances, so in the end this does the same thing as PR #1236.
Also, note that PR #1236 is fixing other things :)
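For illustration only, one way to make the key unique per MIG slice would be to fold a MIG instance identifier into the generated name; the migInstanceID parameter and name format below are hypothetical and not part of this PR:

```go
package main

import "fmt"

// genericGPUID mirrors the utils.GenericGPUID prefix referenced above;
// the value here is a placeholder.
const genericGPUID = "gpu"

// gpuSourceName is a hypothetical helper: it combines the parent GPU ID with
// a MIG instance ID so two MIG slices on the same GPU get distinct keys.
func gpuSourceName(gpuID, migInstanceID int) string {
	return fmt.Sprintf("%s%d-mig%d", genericGPUID, gpuID, migInstanceID)
}

func main() {
	fmt.Println(gpuSourceName(0, 1)) // gpu0-mig1
	fmt.Println(gpuSourceName(0, 2)) // gpu0-mig2
}
```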
	klog.V(debugLevel).Infof("failed to get latest values for fields: %v", err)
	return processAcceleratorMetrics, err
}
gpuSMActive := uint32(0)
Same here, this is tensor utilization, not SM. It might be better to just use utilization.
Force-pushed from 973f761 to 21f7a2e
@marceloamaral review addressed
d.entities = make(map[string]dcgm.GroupEntityPair)

// cleanup, err := dcgm.Init(dcgm.Embedded) // embedded mode is not recommended for production per https://github.com/NVIDIA/dcgm-exporter/issues/22#issuecomment-1321521995
cleanup, err := dcgm.Init(dcgm.Standalone, config.DCGMHostEngineEndpoint, "0")
this requires:
- the host must enable the DCGM service and run nv-hostengine
- the pod spec must enable hostNetwork so it can access the NVIDIA DCGM host engine
To investigate how to use embedded mode after this PR is merged.
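A possible shape for that follow-up, sketched with the same go-dcgm calls the diff already uses; the try-standalone-then-fall-back-to-embedded ordering is an assumption, not what this PR does:

```go
package main

import (
	"fmt"

	"github.com/NVIDIA/go-dcgm/pkg/dcgm"
)

// initDCGM tries the standalone nv-hostengine endpoint first and, if it is
// unreachable, falls back to embedded mode. The endpoint argument stands in
// for config.DCGMHostEngineEndpoint from the diff above.
func initDCGM(endpoint string) (func(), error) {
	cleanup, err := dcgm.Init(dcgm.Standalone, endpoint, "0")
	if err == nil {
		return cleanup, nil
	}
	// Embedded mode needs no external nv-hostengine or hostNetwork, but see
	// https://github.com/NVIDIA/dcgm-exporter/issues/22 for its caveats.
	return dcgm.Init(dcgm.Embedded)
}

func main() {
	cleanup, err := initDCGM("localhost:5555")
	if err != nil {
		fmt.Println("DCGM init failed:", err)
		return
	}
	defer cleanup()
	fmt.Println("DCGM initialized")
}
```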
I forgot to mention that standalone mode connecting to localhost works in K8s environments.
OK, let's consider it an enhancement then.
here is the issue #1243
A test image is built
…on is unavailable Signed-off-by: Huamin Chen <[email protected]>
/lgtm
fix #1198
Test status
I am running a text generation inference pod; the measured power is shown under Test result below.
Load generation
kubectl exec into the text-generation-inference pod and run the command below to generate text, pausing 5 seconds after each run:
# while true; do curl localhost:8080/generate -X POST -H "Content-Type: application/json" -d '{"inputs": "<s>[INST] Write a K8s YAML file to create a pod that deploys nginx[/INST]", "parameters": {"max_new_tokens": 400}}';sleep 5;done
Test result
Method
The current way of computing process energy on a MIG device is the following:
- Compute multiprocessorCountRatio (used in the formula below) by parsing nvidia-smi -q -x output.
- Read the DCGM_FI_PROF_PIPE_TENSOR_ACTIVE counter from the MIG instance (thanks to @yuezhu1).
- The process utilization is multiprocessorCountRatio * DCGM_FI_PROF_PIPE_TENSOR_ACTIVE (see the sketch below).
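A worked sketch of that math (the 0.60 ratio and 0.42 counter reading are made-up inputs; the formula mirrors the smUtil computation quoted earlier in this conversation):

```go
package main

import "fmt"

func main() {
	// multiprocessorCountRatio: share of the physical GPU's multiprocessors
	// owned by this MIG instance, obtained in the PR by parsing
	// `nvidia-smi -q -x` output. Example value.
	multiprocessorCountRatio := 0.60

	// DCGM_FI_PROF_PIPE_TENSOR_ACTIVE: fraction of cycles (0.0-1.0) the
	// tensor pipes were active within the MIG instance. Example value.
	tensorActive := 0.42

	// Utilization attributed to the process, scaled to the whole GPU.
	util := uint32(tensorActive * 100 * multiprocessorCountRatio)

	fmt.Printf("attributed utilization: %d%%\n", util) // prints 25%
}
```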
TODO