Nvidia settings API for container runtime #3994

Closed: monirul wants to merge 1 commit

@monirul monirul commented May 28, 2024

Issue number:

Closes #

Description of changes:
This PR exposes two new API settings that let customers configure the accept-nvidia-visible-devices-as-volume-mounts and accept-nvidia-visible-devices-envvar-when-unprivileged options of the NVIDIA container runtime.

| Bottlerocket setting | Impact | Value | What it means |
| --- | --- | --- | --- |
| settings.kubernetes.nvidia.container-runtime.visible-devices-as-volume-mounts | Changes the accept-nvidia-visible-devices-as-volume-mounts value for the k8s container toolkit | true or false (default: true) | Adjusting this setting alters the method of GPU detection and integration within container environments. Setting it to true enables the NVIDIA runtime to recognize GPU devices listed in the NVIDIA_VISIBLE_DEVICES environment variable and mount them as volumes, which lets applications within the container interact with and use the GPUs as if they were local resources. |
| settings.kubernetes.nvidia.container-runtime.visible-devices-envvar-when-unprivileged | Sets the accept-nvidia-visible-devices-envvar-when-unprivileged option of the NVIDIA container runtime for k8s variants | true or false (default: false) | When set to false, this prevents unprivileged containers from accessing all GPU devices on the host by default. If NVIDIA_VISIBLE_DEVICES is set to all within the container image and this setting is true, all GPUs on the host are accessible to the container, regardless of the limits set via nvidia.com/gpu. This can allocate more GPUs to a pod than intended, which affects resource scheduling and isolation. |
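
For illustration, the two settings might be supplied in user data roughly as follows. This is a minimal sketch that assumes the key names proposed in this PR and uses the defaults stated in the description above; it is not taken from a merged release.

```toml
# Sketch only: key names are the ones proposed in this PR and may change.
[settings.kubernetes.nvidia.container-runtime]
# Default per the PR description.
visible-devices-as-volume-mounts = true
# Default per the PR description; false keeps unprivileged containers
# from seeing all host GPUs via NVIDIA_VISIBLE_DEVICES.
visible-devices-envvar-when-unprivileged = false
```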

Testing done:
Yes.

$ apiclient set settings.kubernetes.nvidia.container-runtime.visible-devices-as-volume-mounts = true
$ apiclient set settings.kubernetes.nvidia.container-runtime.visible-devices-envvar-when-unprivileged = false
$ apiclient get settings.kubernetes.nvidia.container-runtime
{
  "settings": {
    "kubernetes": {
      "nvidia": {
        "container-runtime": {
          "visible-devices-as-volume-mounts": true,
          "visible-devices-envvar-when-unprivileged": false
        }
      }
    }
  }
}

$ cat /etc/nvidia-container-runtime/config.toml
accept-nvidia-visible-devices-as-volume-mounts = true
accept-nvidia-visible-devices-envvar-when-unprivileged = false

[nvidia-container-cli]
root = "/"
path = "/usr/bin/nvidia-container-cli"
environment = []
ldconfig = "@/sbin/ldconfig"

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

visible-devices-as-volume-mounts = false
visible-devices-envvar-when-unprivileged = true

[metadata.settings.kubernetes.nvidia.container-runtime]

This PR is missing a migration to remove the affected services on a downgrade.

Comment on lines +5 to +6
accept-nvidia-visible-devices-as-volume-mounts = {{settings.kubernetes.nvidia.container-runtime.visible-devices-as-volume-mounts}}
accept-nvidia-visible-devices-envvar-when-unprivileged = {{settings.kubernetes.nvidia.container-runtime.visible-devices-envvar-when-unprivileged}}

Let's be safe and use the {{default}} helper; otherwise, if settings.kubernetes.nvidia.container-runtime.visible-devices-as-volume-mounts isn't present, the render will fail.
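
A sketch of what the two template lines could look like with the helper applied. It assumes the `default` helper takes the fallback value as its first argument and the settings key as its second, and it uses the defaults from the PR description as fallbacks; treat both as assumptions rather than the final template.

```toml
# Sketch only: assumes `default` takes (fallback value, settings key), in that order.
# Fallback values mirror the defaults stated in the PR description.
accept-nvidia-visible-devices-as-volume-mounts = {{default true settings.kubernetes.nvidia.container-runtime.visible-devices-as-volume-mounts}}
accept-nvidia-visible-devices-envvar-when-unprivileged = {{default false settings.kubernetes.nvidia.container-runtime.visible-devices-envvar-when-unprivileged}}
```

With fallbacks in place, the config file would still render on hosts where the settings have never been set.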

@@ -0,0 +1,14 @@
[settings.kubernetes.nvidia.container-runtime]
visible-devices-as-volume-mounts = false
visible-devices-envvar-when-unprivileged = true

Let's use the default values to prevent unprivileged pods from accessing all the devices:

accept-nvidia-visible-devices-envvar-when-unprivileged = false

@@ -0,0 +1 @@
../../../shared-defaults/nvidia-k8s-container-toolkit.toml

The PR is missing symlinks for other variants. We need symlinks for:

  • aws-k8s-1.24-nvidia
  • aws-k8s-1.25-nvidia
  • aws-k8s-1.26-nvidia


ytsssun commented Jun 12, 2024

Opened #4052 instead.

@ytsssun ytsssun closed this Jun 12, 2024