Is your feature request related to a problem? Please describe.
I am running the gpu-exporter DaemonSet in my GKE cluster alongside the Google-managed nvidia-driver pods. When a new GPU node is provisioned, there is a race condition between the nvidia-driver pods and the gpu-exporter pod: the exporter can start before the driver installation has finished.
Describe the solution you'd like
A way to delay the initialization of the gpu-exporter pod until the NVIDIA driver is available, for example via an init container or a startup hook, as sketched below.
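A minimal sketch of what such an init container could look like, assuming the GKE managed driver installer places the driver binaries under /home/kubernetes/bin/nvidia on the host (container name, image, and mount paths are illustrative, not part of the current chart):

```yaml
# Hypothetical initContainer for the gpu-exporter DaemonSet pod template:
# block startup until nvidia-smi is present on the host.
initContainers:
  - name: wait-for-nvidia-driver
    image: busybox:1.36
    command:
      - sh
      - -c
      - |
        until [ -x /usr/local/nvidia/bin/nvidia-smi ]; do
          echo "waiting for the NVIDIA driver installation to finish..."
          sleep 5
        done
    volumeMounts:
      - name: nvidia-install-dir-host
        mountPath: /usr/local/nvidia
        readOnly: true
volumes:
  - name: nvidia-install-dir-host
    hostPath:
      path: /home/kubernetes/bin/nvidia
```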
Describe alternatives you've considered
Manually patching the DaemonSet (and the pods it creates) after deployment, for example as sketched below.
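For reference, the manual workaround amounts to applying a strategic-merge patch by hand after each install/upgrade (namespace, resource name, and file name are assumptions), e.g. with `kubectl -n monitoring patch daemonset gpu-exporter --patch-file wait-for-driver.yaml`, where the patch file injects the same init container shown above:

```yaml
# wait-for-driver.yaml (illustrative): strategic-merge patch adding the
# wait-for-nvidia-driver init container to the gpu-exporter pod template.
spec:
  template:
    spec:
      initContainers:
        - name: wait-for-nvidia-driver
          image: busybox:1.36
          command: ["sh", "-c", "until [ -x /usr/local/nvidia/bin/nvidia-smi ]; do sleep 5; done"]
          volumeMounts:
            - name: nvidia-install-dir-host
              mountPath: /usr/local/nvidia
              readOnly: true
```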
Additional context
Error from the gpu-exporter pod when the race condition is triggered:
time=2025-01-21T09:20:33.175Z level=WARN source=exporter.go:132 msg="failed to auto-determine query field names, falling back to the built-in list" err="command failed: code: -1 | command: \"nvidia-smi --help-query-gpu\" | stdout: \"\" | stderr: \"\": error running command: exec: \"nvidia-smi\": executable file not found in $PATH"
time=2025-01-21T09:20:33.177Z level=INFO source=tls_config.go:347 msg="Listening on" address=[::]:9835
time=2025-01-21T09:20:33.177Z level=INFO source=tls_config.go:350 msg="TLS is disabled." http2=false address=[::]:9835
time=2025-01-21T09:21:31.938Z level=ERROR source=exporter.go:188 msg="failed to collect metrics" err="command failed: code: -1 | command: nvidia-smi --query-gpu=driver_model.pending,ecc.errors.corrected.volatile.total,clocks.default_applications.graphics,retired_pages.pending,power.management,clocks_throttle_reasons.active,clocks_throttle_reasons.sync_boost,ecc.errors.corrected.volatile.device_memory,ecc.errors.uncorrected.volatile.l1_cache,ecc.errors.uncorrected.aggregate.register_file,pcie.link.width.max,inforom.pwr,gom.pending,clocks.max.graphics,clocks.current.sm,clocks.current.video,vbios_version,fan.speed,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_power_brake_slowdown,ecc.mode.current,driver_version,ecc.errors.uncorrected.volatile.device_memory,temperature.gpu,enforced.power.limit,memory.used,ecc.errors.corrected.volatile.dram,ecc.errors.corrected.volatile.register_file,power.max_limit,ecc.errors.corrected.aggregate.device_memory,ecc.errors.corrected.aggregate.dram,ecc.errors.corrected.aggregate.l2_cache,ecc.errors.uncorrected.volatile.cbu,memory.free,ecc.errors.uncorrected.aggregate.sram,power.draw,clocks.applications.graphics,pci.sub_device_id,clocks_throttle_reasons.supported,ecc.errors.uncorrected.volatile.dram,ecc.errors.uncorrected.volatile.sram,ecc.errors.uncorrected.aggregate.texture_memory,ecc.errors.uncorrected.aggregate.l2_cache,pci.device,pcie.link.gen.max,pstate,clocks_throttle_reasons.sw_thermal_slowdown,encoder.stats.averageLatency,accounting.buffer_size,utilization.gpu,pci.domain,mig.mode.pending,display_mode,inforom.ecc,gom.current,encoder.stats.sessionCount,ecc.errors.corrected.volatile.l1_cache,pci.bus_id,ecc.errors.uncorrected.aggregate.device_memory,pcie.link.width.current,ecc.errors.corrected.aggregate.l1_cache,persistence_mode,ecc.mode.pending,ecc.errors.corrected.volatile.l2_cache,clocks.max.sm,name,uuid,clocks_throttle_reasons.hw_slowdown,ecc.errors.corrected.aggregate.total,clocks.applications.memory,accounting.mode,clocks_throttle_reasons.hw_thermal_slowdown,ecc.errors.corrected.volatile.texture_memory,ecc.errors.uncorrected.aggregate.dram,pci.bus,driver_model.current,ecc.errors.corrected.aggregate.sram,clocks.max.memory,count,power.min_limit,compute_mode,ecc.errors.corrected.aggregate.texture_memory,ecc.errors.uncorrected.volatile.texture_memory,power.default_limit,inforom.oem,utilization.memory,retired_pages.double_bit.count,power.limit,timestamp,index,ecc.errors.uncorrected.aggregate.l1_cache,ecc.errors.uncorrected.aggregate.cbu,clocks_throttle_reasons.gpu_idle,ecc.errors.uncorrected.volatile.total,temperature.memory,clocks.current.graphics,ecc.errors.corrected.volatile.sram,ecc.errors.corrected.aggregate.register_file,display_active,ecc.errors.uncorrected.aggregate.total,pcie.link.gen.current,inforom.img,ecc.errors.corrected.volatile.cbu,ecc.errors.corrected.aggregate.cbu,clocks.current.memory,serial,pci.device_id,memory.total,ecc.errors.uncorrected.volatile.register_file,ecc.errors.uncorrected.volatile.l2_cache,mig.mode.current,clocks_throttle_reasons.applications_clocks_setting,encoder.stats.averageFps,retired_pages.single_bit_ecc.count,clocks.default_applications.memory --format=csv | stdout: | stderr: : error running command: exec: \"nvidia-smi\": executable file not found in $PATH"