
Nvidia container-runtime API for GPU allocation #4052

Closed

Conversation

@ytsssun (Contributor) commented on Jun 12, 2024

Co-authored-by: Monirul Islam
Revives: #3994

Description of changes:
This PR exposes two new APIs that allow customers to configure the values of accept-nvidia-visible-devices-as-volume-mounts and accept-nvidia-visible-devices-envvar-when-unprivileged for the NVIDIA container runtime.

We introduced the default behavior of injecting NVIDIA GPUs using volume mounts in #3718. This PR lets users opt in to the previous behavior, which allows unprivileged pods to access all GPUs when NVIDIA_VISIBLE_DEVICES=all is set, and makes both behaviors configurable.

Bottlerocket Settings:

settings.kubernetes.nvidia.container-runtime.visible-devices-as-volume-mounts
  Impact: sets the accept-nvidia-visible-devices-as-volume-mounts value for the k8s container-toolkit
  Value: true | false (default: true)
  What it means: Adjusting visible-devices-as-volume-mounts changes how GPUs are detected and exposed inside container environments. Setting it to true makes the NVIDIA runtime recognize the GPU devices listed in the NVIDIA_VISIBLE_DEVICES environment variable and mount them as volumes, which lets applications inside the container interact with the GPUs as if they were local resources.

settings.kubernetes.nvidia.container-runtime.visible-devices-envvar-when-unprivileged
  Impact: sets the accept-nvidia-visible-devices-envvar-when-unprivileged value of the NVIDIA container runtime for the k8s variant
  Value: true | false (default: false)
  What it means: When this setting is false, unprivileged containers are prevented from accessing all GPU devices on the host by default. If NVIDIA_VISIBLE_DEVICES is set to all in a container image and visible-devices-envvar-when-unprivileged is true, all GPUs on the host become accessible to the container regardless of the limits set via nvidia.com/gpu. This can lead to more GPUs being allocated to a pod than intended, which affects resource scheduling and isolation.
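To configure these defaults at instance launch rather than through apiclient, a minimal user-data sketch is shown below. It assumes the standard Bottlerocket TOML user-data format; the setting names and default values come from this PR, and the surrounding layout is only illustrative.

# Bottlerocket user data (TOML) -- illustrative sketch; setting names and
# defaults are taken from this PR, everything else is assumed.
[settings.kubernetes.nvidia.container-runtime]
# Default: expose the GPUs listed in NVIDIA_VISIBLE_DEVICES as volume mounts.
visible-devices-as-volume-mounts = true
# Default: do not let unprivileged containers use NVIDIA_VISIBLE_DEVICES=all
# to gain access to every GPU on the host.
visible-devices-envvar-when-unprivileged = false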

Testing done:

  • Functional Test
  1. Built an AMI for the nvidia variant and verified the settings get picked up with default values.
$ apiclient get settings.kubernetes.nvidia.container-runtime
{
  "settings": {
    "kubernetes": {
      "nvidia": {
        "container-runtime": {
          "visible-devices-as-volume-mounts": true,
          "visible-devices-envvar-when-unprivileged": false
        }
      }
    }
  }
}
  2. Opted in to the previous behavior to allow unprivileged NVIDIA device access.
$ apiclient set settings.kubernetes.nvidia.container-runtime.visible-devices-as-volume-mounts=false
$ apiclient set settings.kubernetes.nvidia.container-runtime.visible-devices-envvar-when-unprivileged=true
$ apiclient get settings.kubernetes.nvidia.container-runtime
{
  "settings": {
    "kubernetes": {
      "nvidia": {
        "container-runtime": {
          "visible-devices-as-volume-mounts": false,
          "visible-devices-envvar-when-unprivileged": true
        }
      }
    }
  }
}
  3. Verified the nvidia-container-runtime config exists; the expected values after the opt-in in step 2 are sketched in the note following this list.
$ cat /etc/nvidia-container-runtime/config.toml
accept-nvidia-visible-devices-as-volume-mounts = true
accept-nvidia-visible-devices-envvar-when-unprivileged = false

[nvidia-container-cli]
root = "/"
path = "/usr/bin/nvidia-container-cli"
environment = []
ldconfig = "@/sbin/ldconfig"
  • Migration Test
    Tested migration from 1.20.1 to the new version.
    Tested migration back to 1.20.1.
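
The config shown in step 3 matches the default settings. After the opt-in in step 2, the rendered config would be expected to flip accordingly. Below is a sketch of the expected excerpt, assuming a direct one-to-one mapping from the Bottlerocket settings to the nvidia-container-runtime keys as described in the settings table; it is not captured output.

# /etc/nvidia-container-runtime/config.toml -- expected excerpt after step 2
# (sketch under the one-to-one mapping assumption, not captured output)
accept-nvidia-visible-devices-as-volume-mounts = false
accept-nvidia-visible-devices-envvar-when-unprivileged = true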

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@ytsssun force-pushed the nvidia-api branch 2 times, most recently from 95b8251 to 9198f2c, on June 12, 2024 06:48
@ytsssun marked this pull request as ready for review on June 12, 2024 06:48
Signed-off-by: Monirul Islam <[email protected]>

Signed-off-by: Yutong Sun <[email protected]>
@ytsssun (Contributor, Author) commented on Jul 2, 2024

Closing this PR since it conflicts with the core-kit migration. We will submit new PRs soon to accommodate the new core-kit setup.

@ytsssun closed this on Jul 2, 2024