Is your feature request related to a problem? Please describe.
I am running the gpu-exporter DaemonSet in my GKE cluster alongside the Google-managed nvidia-driver pods. When a new GPU node is provisioned, there is a race condition between the nvidia-driver pods and the gpu-exporter pod: the exporter can start before the driver installation has finished.
Describe the solution you'd like
A way to delay the initialization of the gpu-exporter pod until the NVIDIA driver is available, for example via an init container or a startup hook, as sketched below.
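A minimal sketch of what such an init container could look like, assuming the GKE managed driver installer places the driver binaries under /home/kubernetes/bin/nvidia on the host (container name, image, and mount paths are illustrative, not part of the current chart):

```yaml
# Hypothetical initContainer for the gpu-exporter DaemonSet pod template:
# block startup until nvidia-smi is present on the host.
initContainers:
  - name: wait-for-nvidia-driver
    image: busybox:1.36
    command:
      - sh
      - -c
      - |
        until [ -x /usr/local/nvidia/bin/nvidia-smi ]; do
          echo "waiting for the NVIDIA driver installation to finish..."
          sleep 5
        done
    volumeMounts:
      - name: nvidia-install-dir-host
        mountPath: /usr/local/nvidia
        readOnly: true
volumes:
  - name: nvidia-install-dir-host
    hostPath:
      path: /home/kubernetes/bin/nvidia
```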
Describe alternatives you've considered
Manually patching the DaemonSet (and the pods it creates) after deployment, for example as sketched below.
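For reference, the manual workaround amounts to applying a strategic-merge patch by hand after each install/upgrade (namespace, resource name, and file name are assumptions), e.g. with `kubectl -n monitoring patch daemonset gpu-exporter --patch-file wait-for-driver.yaml`, where the patch file injects the same init container shown above:

```yaml
# wait-for-driver.yaml (illustrative): strategic-merge patch adding the
# wait-for-nvidia-driver init container to the gpu-exporter pod template.
spec:
  template:
    spec:
      initContainers:
        - name: wait-for-nvidia-driver
          image: busybox:1.36
          command: ["sh", "-c", "until [ -x /usr/local/nvidia/bin/nvidia-smi ]; do sleep 5; done"]
          volumeMounts:
            - name: nvidia-install-dir-host
              mountPath: /usr/local/nvidia
              readOnly: true
```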
Additional context
Error from the gpu-exporter pod when the race condition is triggered:
time=2025-01-21T09:20:33.175Z level=WARN source=exporter.go:132 msg="failed to auto-determine query field names, falling back to the built-in list" err="command failed: code: -1 | command: \"nvidia-smi --help-query-gpu\" | stdout: \"\" | stderr: \"\": error running command: exec: \"nvidia-smi\": executable file not found in $PATH"
time=2025-01-21T09:20:33.177Z level=INFO source=tls_config.go:347 msg="Listening on" address=[::]:9835
time=2025-01-21T09:20:33.177Z level=INFO source=tls_config.go:350 msg="TLS is disabled." http2=false address=[::]:9835
time=2025-01-21T09:21:31.938Z level=ERROR source=exporter.go:188 msg="failed to collect metrics" err="command failed: code: -1 | command: nvidia-smi --query-gpu=driver_model.pending,ecc.errors.corrected.volatile.total,clocks.default_applications.graphics,retired_pages.pending,power.management,clocks_throttle_reasons.active,clocks_throttle_reasons.sync_boost,ecc.errors.corrected.volatile.device_memory,ecc.errors.uncorrected.volatile.l1_cache,ecc.errors.uncorrected.aggregate.register_file,pcie.link.width.max,inforom.pwr,gom.pending,clocks.max.graphics,clocks.current.sm,clocks.current.video,vbios_version,fan.speed,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_power_brake_slowdown,ecc.mode.current,driver_version,ecc.errors.uncorrected.volatile.device_memory,temperature.gpu,enforced.power.limit,memory.used,ecc.errors.corrected.volatile.dram,ecc.errors.corrected.volatile.register_file,power.max_limit,ecc.errors.corrected.aggregate.device_memory,ecc.errors.corrected.aggregate.dram,ecc.errors.corrected.aggregate.l2_cache,ecc.errors.uncorrected.volatile.cbu,memory.free,ecc.errors.uncorrected.aggregate.sram,power.draw,clocks.applications.graphics,pci.sub_device_id,clocks_throttle_reasons.supported,ecc.errors.uncorrected.volatile.dram,ecc.errors.uncorrected.volatile.sram,ecc.errors.uncorrected.aggregate.texture_memory,ecc.errors.uncorrected.aggregate.l2_cache,pci.device,pcie.link.gen.max,pstate,clocks_throttle_reasons.sw_thermal_slowdown,encoder.stats.averageLatency,accounting.buffer_size,utilization.gpu,pci.domain,mig.mode.pending,display_mode,inforom.ecc,gom.current,encoder.stats.sessionCount,ecc.errors.corrected.volatile.l1_cache,pci.bus_id,ecc.errors.uncorrected.aggregate.device_memory,pcie.link.width.current,ecc.errors.corrected.aggregate.l1_cache,persistence_mode,ecc.mode.pending,ecc.errors.corrected.volatile.l2_cache,clocks.max.sm,name,uuid,clocks_throttle_reasons.hw_slowdown,ecc.errors.corrected.aggregate.total,clocks.applications.memory,accounting.mode,clocks_throttle_reasons.hw_thermal_slowdown,ecc.errors.corrected.volatile.texture_memory,ecc.errors.uncorrected.aggregate.dram,pci.bus,driver_model.current,ecc.errors.corrected.aggregate.sram,clocks.max.memory,count,power.min_limit,compute_mode,ecc.errors.corrected.aggregate.texture_memory,ecc.errors.uncorrected.volatile.texture_memory,power.default_limit,inforom.oem,utilization.memory,retired_pages.double_bit.count,power.limit,timestamp,index,ecc.errors.uncorrected.aggregate.l1_cache,ecc.errors.uncorrected.aggregate.cbu,clocks_throttle_reasons.gpu_idle,ecc.errors.uncorrected.volatile.total,temperature.memory,clocks.current.graphics,ecc.errors.corrected.volatile.sram,ecc.errors.corrected.aggregate.register_file,display_active,ecc.errors.uncorrected.aggregate.total,pcie.link.gen.current,inforom.img,ecc.errors.corrected.volatile.cbu,ecc.errors.corrected.aggregate.cbu,clocks.current.memory,serial,pci.device_id,memory.total,ecc.errors.uncorrected.volatile.register_file,ecc.errors.uncorrected.volatile.l2_cache,mig.mode.current,clocks_throttle_reasons.applications_clocks_setting,encoder.stats.averageFps,retired_pages.single_bit_ecc.count,clocks.default_applications.memory --format=csv | stdout: | stderr: : error running command: exec: \"nvidia-smi\": executable file not found in $PATH"