
Nvidia container-runtime API for GPU allocation #4052

Closed

Conversation

@ytsssun (Contributor) commented on Jun 12, 2024

Co-authored-by: Monirul Islam
Revives: #3994

Description of changes:
This PR exposes two new APIs that allow customers to configure the values of accept-nvidia-visible-devices-as-volume-mounts and accept-nvidia-visible-devices-envvar-when-unprivileged for the NVIDIA container runtime.

We introduced the default behavior of injecting NVIDIA GPUs using volume mounts in #3718. This PR lets users opt in to the previous behavior, which allows unprivileged pods to access all GPUs when NVIDIA_VISIBLE_DEVICES=all is set, and makes both behaviors configurable.

Bottlerocket Settings:

settings.kubernetes.nvidia.container-runtime.visible-devices-as-volume-mounts
  Impact: sets the accept-nvidia-visible-devices-as-volume-mounts value for the k8s container-toolkit
  Value: true | false (default: true)
  What it means: Adjusting visible-devices-as-volume-mounts changes how GPUs are detected and exposed inside container environments. Setting it to true makes the NVIDIA runtime recognize the GPU devices listed in the NVIDIA_VISIBLE_DEVICES environment variable and mount them as volumes, which lets applications inside the container interact with the GPUs as if they were local resources.

settings.kubernetes.nvidia.container-runtime.visible-devices-envvar-when-unprivileged
  Impact: sets the accept-nvidia-visible-devices-envvar-when-unprivileged value of the NVIDIA container runtime for the k8s variant
  Value: true | false (default: false)
  What it means: When this setting is false, unprivileged containers are prevented from accessing all GPU devices on the host by default. If NVIDIA_VISIBLE_DEVICES is set to all in a container image and visible-devices-envvar-when-unprivileged is true, all GPUs on the host become accessible to the container regardless of the limits set via nvidia.com/gpu. This can lead to more GPUs being allocated to a pod than intended, which affects resource scheduling and isolation.
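To configure these defaults at instance launch rather than through apiclient, a minimal user-data sketch is shown below. It assumes the standard Bottlerocket TOML user-data format; the setting names and default values come from this PR, and the surrounding layout is only illustrative.

# Bottlerocket user data (TOML) -- illustrative sketch; setting names and
# defaults are taken from this PR, everything else is assumed.
[settings.kubernetes.nvidia.container-runtime]
# Default: expose the GPUs listed in NVIDIA_VISIBLE_DEVICES as volume mounts.
visible-devices-as-volume-mounts = true
# Default: do not let unprivileged containers use NVIDIA_VISIBLE_DEVICES=all
# to gain access to every GPU on the host.
visible-devices-envvar-when-unprivileged = false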

Testing done:

  • Functional Test
  1. Built an AMI for the nvidia variant and verified the settings get picked up with default values.
$ apiclient get settings.kubernetes.nvidia.container-runtime
{
  "settings": {
    "kubernetes": {
      "nvidia": {
        "container-runtime": {
          "visible-devices-as-volume-mounts": true,
          "visible-devices-envvar-when-unprivileged": false
        }
      }
    }
  }
}
  2. Opted in to the previous behavior to allow unprivileged NVIDIA device access.
$ apiclient set settings.kubernetes.nvidia.container-runtime.visible-devices-as-volume-mounts=false
$ apiclient set settings.kubernetes.nvidia.container-runtime.visible-devices-envvar-when-unprivileged=true
$ apiclient get settings.kubernetes.nvidia.container-runtime
{
  "settings": {
    "kubernetes": {
      "nvidia": {
        "container-runtime": {
          "visible-devices-as-volume-mounts": false,
          "visible-devices-envvar-when-unprivileged": true
        }
      }
    }
  }
}
  3. Verified the nvidia-container-runtime config exists; the expected values after the opt-in in step 2 are sketched in the note following this list.
$ cat /etc/nvidia-container-runtime/config.toml
accept-nvidia-visible-devices-as-volume-mounts = true
accept-nvidia-visible-devices-envvar-when-unprivileged = false

[nvidia-container-cli]
root = "/"
path = "/usr/bin/nvidia-container-cli"
environment = []
ldconfig = "@/sbin/ldconfig"
  • Migration Test
    Tested migration from 1.20.1 to the new version.
    Tested migration back to 1.20.1.
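
The config shown in step 3 matches the default settings. After the opt-in in step 2, the rendered config would be expected to flip accordingly. Below is a sketch of the expected excerpt, assuming a direct one-to-one mapping from the Bottlerocket settings to the nvidia-container-runtime keys as described in the settings table; it is not captured output.

# /etc/nvidia-container-runtime/config.toml -- expected excerpt after step 2
# (sketch under the one-to-one mapping assumption, not captured output)
accept-nvidia-visible-devices-as-volume-mounts = false
accept-nvidia-visible-devices-envvar-when-unprivileged = true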

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@ytsssun force-pushed the nvidia-api branch 2 times, most recently from 95b8251 to 9198f2c, on June 12, 2024 06:48
@ytsssun marked this pull request as ready for review on June 12, 2024 06:48
Signed-off-by: Monirul Islam <[email protected]>

Signed-off-by: Yutong Sun <[email protected]>
@ytsssun (Contributor, Author) commented on Jul 2, 2024

Closing this PR since it conflicts with the core-kit migration. We will submit new PRs soon to accommodate the new core-kit setup.

@ytsssun closed this on Jul 2, 2024