Ability to add init container #281

Open
shauryagoel opened this issue Jan 21, 2025 · 0 comments
Is your feature request related to a problem? Please describe.
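I am running the gpu-exporter DaemonSet in my GKE cluster alongside the Google-managed NVIDIA driver pods. When a new GPU node is provisioned, there is a race condition between the nvidia-driver pods and the gpu-exporter pod: the exporter can start before the driver installation has finished, so nvidia-smi is not yet on its $PATH.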

Describe the solution you'd like
A way to delay the startup of the gpu-exporter pod until the driver is ready, for example via a configurable init container or a startup hook.
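
To illustrate what such an init container might do, here is a minimal sketch of a polling loop that waits for an executable to appear. The function name `wait_for_executable`, its parameters, and the default timings are assumptions made for illustration only; nothing like this exists in the exporter today.

```python
import shutil
import time


def wait_for_executable(name, timeout=300.0, interval=5.0, search_path=None):
    """Poll until `name` is found on the search path or `timeout` seconds elapse.

    Hypothetical helper: `search_path` defaults to the process's own PATH.
    Returns True if the executable was found, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        # shutil.which returns None until an executable file with this name exists.
        if shutil.which(name, path=search_path) is not None:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

An init container built around this could exit with status 0 when `wait_for_executable("nvidia-smi")` returns True and non-zero otherwise, so the exporter container only starts once the driver tooling is visible.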

Describe alternatives you've considered
Manually patching the pod created by the gpu-exporter daemon set.
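
For reference, the manual workaround could look roughly like the sketch below, which builds a patch that adds a waiting init container to the DaemonSet's pod template. The helper name, the container and volume names, the `/driver` mount point, and the assumption that `nvidia-smi` lives under `bin/` in the driver directory are all hypothetical; the image and host path are placeholders to be replaced with values that match the cluster.

```python
import json


def build_init_container_patch(wait_image, driver_host_path, timeout_seconds=300):
    """Return a patch dict that adds a driver-waiting init container.

    Hypothetical helper: the names and paths are illustrative, not taken
    from the exporter's manifests.
    """
    # Shell loop run inside the init container: poll every 5 seconds and
    # give up with a non-zero exit code once the timeout is exceeded.
    wait_script = (
        "i=0; until [ -x /driver/bin/nvidia-smi ]; do "
        f"i=$((i+5)); [ $i -gt {timeout_seconds} ] && exit 1; sleep 5; done"
    )
    return {
        "spec": {
            "template": {
                "spec": {
                    "initContainers": [
                        {
                            "name": "wait-for-nvidia-driver",
                            "image": wait_image,
                            "command": ["sh", "-c", wait_script],
                            "volumeMounts": [
                                {"name": "nvidia-driver", "mountPath": "/driver", "readOnly": True}
                            ],
                        }
                    ],
                    "volumes": [
                        {"name": "nvidia-driver", "hostPath": {"path": driver_host_path}}
                    ],
                }
            }
        }
    }


# Example: serialize the patch so it can be saved to a file.
patch = build_init_container_patch("busybox", "/path/to/nvidia/driver")
patch_json = json.dumps(patch, indent=2)
```

The resulting JSON could then be applied to the DaemonSet with `kubectl patch`. Having to maintain this outside the chart or manifest is what makes a built-in option preferable.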

Additional context
Error logged by the gpu-exporter pod when the race condition is triggered:

time=2025-01-21T09:20:33.175Z level=WARN source=exporter.go:132 msg="failed to auto-determine query field names, falling back to the built-in list" err="command failed: code: -1 | command: \"nvidia-smi --help-query-gpu\" | stdout: \"\" | stderr: \"\": error running command: exec: \"nvidia-smi\": executable file not found in $PATH"
time=2025-01-21T09:20:33.177Z level=INFO source=tls_config.go:347 msg="Listening on" address=[::]:9835
time=2025-01-21T09:20:33.177Z level=INFO source=tls_config.go:350 msg="TLS is disabled." http2=false address=[::]:9835
time=2025-01-21T09:21:31.938Z level=ERROR source=exporter.go:188 msg="failed to collect metrics" err="command failed: code: -1 | command: nvidia-smi --query-gpu=driver_model.pending,ecc.errors.corrected.volatile.total,clocks.default_applications.graphics,retired_pages.pending,power.management,clocks_throttle_reasons.active,clocks_throttle_reasons.sync_boost,ecc.errors.corrected.volatile.device_memory,ecc.errors.uncorrected.volatile.l1_cache,ecc.errors.uncorrected.aggregate.register_file,pcie.link.width.max,inforom.pwr,gom.pending,clocks.max.graphics,clocks.current.sm,clocks.current.video,vbios_version,fan.speed,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_power_brake_slowdown,ecc.mode.current,driver_version,ecc.errors.uncorrected.volatile.device_memory,temperature.gpu,enforced.power.limit,memory.used,ecc.errors.corrected.volatile.dram,ecc.errors.corrected.volatile.register_file,power.max_limit,ecc.errors.corrected.aggregate.device_memory,ecc.errors.corrected.aggregate.dram,ecc.errors.corrected.aggregate.l2_cache,ecc.errors.uncorrected.volatile.cbu,memory.free,ecc.errors.uncorrected.aggregate.sram,power.draw,clocks.applications.graphics,pci.sub_device_id,clocks_throttle_reasons.supported,ecc.errors.uncorrected.volatile.dram,ecc.errors.uncorrected.volatile.sram,ecc.errors.uncorrected.aggregate.texture_memory,ecc.errors.uncorrected.aggregate.l2_cache,pci.device,pcie.link.gen.max,pstate,clocks_throttle_reasons.sw_thermal_slowdown,encoder.stats.averageLatency,accounting.buffer_size,utilization.gpu,pci.domain,mig.mode.pending,display_mode,inforom.ecc,gom.current,encoder.stats.sessionCount,ecc.errors.corrected.volatile.l1_cache,pci.bus_id,ecc.errors.uncorrected.aggregate.device_memory,pcie.link.width.current,ecc.errors.corrected.aggregate.l1_cache,persistence_mode,ecc.mode.pending,ecc.errors.corrected.volatile.l2_cache,clocks.max.sm,name,uuid,clocks_throttle_reasons.hw_slowdown,ecc.errors.corrected.aggregate.total,clocks.applications.memory,accounting.mode,clocks_throttle_reasons.hw_thermal_slowdown,ecc.errors.corrected.volatile.texture_memory,ecc.errors.uncorrected.aggregate.dram,pci.bus,driver_model.current,ecc.errors.corrected.aggregate.sram,clocks.max.memory,count,power.min_limit,compute_mode,ecc.errors.corrected.aggregate.texture_memory,ecc.errors.uncorrected.volatile.texture_memory,power.default_limit,inforom.oem,utilization.memory,retired_pages.double_bit.count,power.limit,timestamp,index,ecc.errors.uncorrected.aggregate.l1_cache,ecc.errors.uncorrected.aggregate.cbu,clocks_throttle_reasons.gpu_idle,ecc.errors.uncorrected.volatile.total,temperature.memory,clocks.current.graphics,ecc.errors.corrected.volatile.sram,ecc.errors.corrected.aggregate.register_file,display_active,ecc.errors.uncorrected.aggregate.total,pcie.link.gen.current,inforom.img,ecc.errors.corrected.volatile.cbu,ecc.errors.corrected.aggregate.cbu,clocks.current.memory,serial,pci.device_id,memory.total,ecc.errors.uncorrected.volatile.register_file,ecc.errors.uncorrected.volatile.l2_cache,mig.mode.current,clocks_throttle_reasons.applications_clocks_setting,encoder.stats.averageFps,retired_pages.single_bit_ecc.count,clocks.default_applications.memory --format=csv | stdout:  | stderr: : error running command: exec: \"nvidia-smi\": executable file not found in $PATH"