
Working with dcgm-exporter #166

Open
ritazh opened this issue Sep 18, 2024 · 5 comments

ritazh commented Sep 18, 2024

I was trying to get dcgm-exporter working after installing this driver, but the helm install errored with:

Not collecting GPU metrics; Error getting devices count: Cannot perform the requested operation because NVML doesn't exist on this system.

Running ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so* on the host lists the libraries, but the same command inside the kind worker node shows nothing.

Installing the GPU Operator fixed this, but should we avoid running the GPU Operator and this DRA plugin together? Is there a way to get NVML into the nodes without having to install the operator?
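
For reference, the only workaround I can think of without the operator is to bind-mount the host's NVML library into the kind node myself. This is just a sketch of a kind cluster config using extraMounts; the exact file list and paths are my assumption, not something documented by this driver:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: worker
  extraMounts:
  # Assumed host path: bind-mount the NVML library into the node container
  - hostPath: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
    containerPath: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1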

klueska commented Sep 18, 2024

Ignoring the error you are facing for a moment -- even if you got dcgm-exporter running, it would not show any GPU metrics. dcgm-exporter relies on the PodResources API to gather and report its GPU metrics, and it has not yet been updated to consume information about GPUs allocated via DRA.
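
For context, this is roughly how an exporter attributes devices to pods today. A minimal Go sketch (not dcgm-exporter's actual code) that lists allocations via the kubelet PodResources gRPC API; GPUs handed out through DRA claims do not appear in this Devices list, which is why they cannot be attributed to pods:

package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresourcesapi "k8s.io/kubelet/pkg/apis/podresources/v1"
)

func main() {
	// Kubelet's PodResources socket (standard location on most distros).
	conn, err := grpc.Dial("unix:///var/lib/kubelet/pod-resources/kubelet.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	client := podresourcesapi.NewPodResourcesListerClient(conn)
	resp, err := client.List(ctx, &podresourcesapi.ListPodResourcesRequest{})
	if err != nil {
		panic(err)
	}

	// Walk every pod/container and print the devices the device plugin handed out.
	// GPUs allocated through DRA claims are not reported in this list today.
	for _, pod := range resp.GetPodResources() {
		for _, c := range pod.GetContainers() {
			for _, dev := range c.GetDevices() {
				fmt.Printf("%s/%s container=%s resource=%s ids=%v\n",
					pod.GetNamespace(), pod.GetName(), c.GetName(),
					dev.GetResourceName(), dev.GetDeviceIds())
			}
		}
	}
}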

ritazh commented Sep 18, 2024

I see. FWIW, after installing the GPU Operator in the same cluster where I have the DRA plugin, the dcgm-exporter that comes with the GPU Operator was reporting GPU metrics from the distributed inference model running on the MIG devices in the cluster.

Example output:

# HELP DCGM_FI_PROF_GR_ENGINE_ACTIVE Ratio of time the graphics engine is active.
# TYPE DCGM_FI_PROF_GR_ENGINE_ACTIVE gauge
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",pci_bus_id="00000001:00:00.0",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="8",Hostname="k8s-dra-driver-cluster-worker",DCGM_FI_DRIVER_VERSION="560.35.03"} 0.029399
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",pci_bus_id="00000001:00:00.0",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="9",Hostname="k8s-dra-driver-cluster-worker",DCGM_FI_DRIVER_VERSION="560.35.03"} 0.033893
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",pci_bus_id="00000001:00:00.0",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="10",Hostname="k8s-dra-driver-cluster-worker",DCGM_FI_DRIVER_VERSION="560.35.03"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",pci_bus_id="00000001:00:00.0",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="11",Hostname="k8s-dra-driver-cluster-worker",DCGM_FI_DRIVER_VERSION="560.35.03"} 0.034816
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",pci_bus_id="00000001:00:00.0",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="12",Hostname="k8s-dra-driver-cluster-worker",DCGM_FI_DRIVER_VERSION="560.35.03"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",pci_bus_id="00000001:00:00.0",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="13",Hostname="k8s-dra-driver-cluster-worker",DCGM_FI_DRIVER_VERSION="560.35.03"} 0.000000
# HELP DCGM_FI_PROF_PIPE_TENSOR_ACTIVE Ratio of cycles the tensor (HMMA) pipe is active.
# TYPE DCGM_FI_PROF_PIPE_TENSOR_ACTIVE gauge
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",pci_bus_id="00000001:00:00.0",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="8",Hostname="k8s-dra-driver-cluster-worker",DCGM_FI_DRIVER_VERSION="560.35.03"} 0.002098
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",pci_bus_id="00000001:00:00.0",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="9",Hostname="k8s-dra-driver-cluster-worker",DCGM_FI_DRIVER_VERSION="560.35.03"} 0.002359
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",pci_bus_id="00000001:00:00.0",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="10",Hostname="k8s-dra-driver-cluster-worker",DCGM_FI_DRIVER_VERSION="560.35.03"} 0.002094
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",pci_bus_id="00000001:00:00.0",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="11",Hostname="k8s-dra-driver-cluster-worker",DCGM_FI_DRIVER_VERSION="560.35.03"} 0.001672
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",pci_bus_id="00000001:00:00.0",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="12",Hostname="k8s-dra-driver-cluster-worker",DCGM_FI_DRIVER_VERSION="560.35.03"} 0.000000
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",pci_bus_id="00000001:00:00.0",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="13",Hostname="k8s-dra-driver-cluster-worker",DCGM_FI_DRIVER_VERSION="560.35.03"} 0.000000
# HELP DCGM_FI_PROF_DRAM_ACTIVE Ratio of cycles the device memory interface is active sending or receiving data.
# TYPE DCGM_FI_PROF_DRAM_ACTIVE gauge
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",pci_bus_id="00000001:00:00.0",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="8",Hostname="k8s-dra-driver-cluster-worker",DCGM_FI_DRIVER_VERSION="560.35.03"} 0.015358
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",pci_bus_id="00000001:00:00.0",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="9",Hostname="k8s-dra-driver-cluster-worker",DCGM_FI_DRIVER_VERSION="560.35.03"} 0.017687
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",pci_bus_id="00000001:00:00.0",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="10",Hostname="k8s-dra-driver-cluster-worker",DCGM_FI_DRIVER_VERSION="560.35.03"} 0.015245
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",pci_bus_id="00000001:00:00.0",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="11",Hostname="k8s-dra-driver-cluster-worker",DCGM_FI_DRIVER_VERSION="560.35.03"} 0.019403
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",pci_bus_id="00000001:00:00.0",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="12",Hostname="k8s-dra-driver-cluster-worker",DCGM_FI_DRIVER_VERSION="560.35.03"} 0.000000
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",pci_bus_id="00000001:00:00.0",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",GPU_I_PROFILE="1g.10gb",GPU_I_ID="13",Hostname="k8s-dra-driver-cluster-worker",DCGM_FI_DRIVER_VERSION="560.35.03"} 0.000000

Are you saying these metrics may not be accurate?

If they are not accurate and we want to get some GPU metrics from this cluster running the DRA driver, what would you recommend we try?

klueska commented Sep 18, 2024

These metrics are accurate, but you won't get any of the per-pod GPU metrics that you normally get with GPUs allocated via the standard device plugin.
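
To make the difference concrete: with the standard device plugin and the exporter's Kubernetes mapping enabled, each metric line would normally also carry pod attribution labels, something like this (the pod, namespace, and container values here are illustrative):

DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-2c2f45e8-0886-e3cc-02be-a4181a13fc9b",device="nvidia0",modelName="NVIDIA A100 80GB PCIe",namespace="default",pod="inference-worker-0",container="worker"} 0.029399

With GPUs allocated via DRA, those attribution labels are what you lose.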

ritazh commented Oct 2, 2024

Should we avoid running the GPU Operator and this DRA plugin together? What is the roadmap for this plugin and the operator?

klueska commented Oct 2, 2024

Nothing has been integrated with the GPU Operator yet. We have plans to do that soon, but we won't make any commitments until it is confirmed when DRA is going beta upstream.
