Add Nvidia Time-Slicing #4230

Merged on Oct 10, 2024 (2 commits)

@KCSesh KCSesh commented Oct 2, 2024

Issue number:

Related:

Description of changes:
Adds the NVIDIA time-slicing API settings and the migrations for them.

Added:

  • settings.kubelet-device-plugins.nvidia.device-sharing-strategy
  • settings.kubelet-device-plugins.nvidia.time-slicing.replicas
  • settings.kubelet-device-plugins.nvidia.time-slicing.rename-by-default
  • settings.kubelet-device-plugins.nvidia.time-slicing.fail-requests-greater-than-one
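The four settings can be applied in one `apiclient set` call, which is how the testing section below does it. As a minimal sketch, the following Python helper assembles that command line from a dictionary. The name `build_apiclient_set` is hypothetical, and the only thing assumed about `apiclient` is the `key=value` argument form that appears in this PR.

```python
PREFIX = "settings.kubelet-device-plugins.nvidia"


def build_apiclient_set(values):
    """Return an argv list for `apiclient set` from {relative-key: value}."""
    args = ["apiclient", "set"]
    for key, value in values.items():
        # Booleans are written lowercase (true/false), as in the PR's commands.
        if isinstance(value, bool):
            rendered = "true" if value else "false"
        else:
            rendered = str(value)
        args.append(f"{PREFIX}.{key}={rendered}")
    return args


argv = build_apiclient_set({
    "device-sharing-strategy": "time-slicing",
    "time-slicing.replicas": 2,
    "time-slicing.rename-by-default": True,
    "time-slicing.fail-requests-greater-than-one": True,
})
print(" ".join(argv))
```

Returning a list rather than a joined string keeps the sketch usable with `subprocess.run(argv)` without shell quoting concerns.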

Testing done:

  1. Migration:
bash-5.1# cat /etc/os-release
NAME=Bottlerocket
ID=bottlerocket
VERSION="1.24.1 (aws-k8s-1.26-nvidia)"
PRETTY_NAME="Bottlerocket OS 1.24.1 (aws-k8s-1.26-nvidia)"
VARIANT_ID=aws-k8s-1.26-nvidia
VERSION_ID=1.24.1
BUILD_ID=acfd0abe
HOME_URL="https://github.com/bottlerocket-os/bottlerocket"
SUPPORT_URL="https://github.com/bottlerocket-os/bottlerocket/discussions"
BUG_REPORT_URL="https://github.com/bottlerocket-os/bottlerocket/issues"
DOCUMENTATION_URL="https://bottlerocket.dev"
...
bash-5.1# cat /etc/os-release
NAME=Bottlerocket
ID=bottlerocket
VERSION="1.25.0 (aws-k8s-1.26-nvidia)"
PRETTY_NAME="Bottlerocket OS 1.25.0 (aws-k8s-1.26-nvidia)"
VARIANT_ID=aws-k8s-1.26-nvidia
VERSION_ID=1.25.0
BUILD_ID=5c12e860
HOME_URL="https://github.com/bottlerocket-os/bottlerocket"
SUPPORT_URL="https://github.com/bottlerocket-os/bottlerocket/discussions"
BUG_REPORT_URL="https://github.com/bottlerocket-os/bottlerocket/issues"
DOCUMENTATION_URL="https://bottlerocket.dev"
bash-5.1# apiclient get settings.kubelet-device-plugin
{
  "settings": {
    "kubelet-device-plugins": {
      "nvidia": {
        "device-id-strategy": "index",
        "device-list-strategy": "volume-mounts",
        "device-sharing-strategy": "time-slicing",
        "pass-device-specs": true,
        "time-slicing": {
          "fail-requests-greater-than-one": true,
          "rename-by-default": true,
          "replicas": 2
        }
      }
    }
  }
}
...
bash-5.1# cat /etc/os-release
NAME=Bottlerocket
ID=bottlerocket
VERSION="1.24.1 (aws-k8s-1.26-nvidia)"
PRETTY_NAME="Bottlerocket OS 1.24.1 (aws-k8s-1.26-nvidia)"
VARIANT_ID=aws-k8s-1.26-nvidia
VERSION_ID=1.24.1
BUILD_ID=acfd0abe
HOME_URL="https://github.com/bottlerocket-os/bottlerocket"
SUPPORT_URL="https://github.com/bottlerocket-os/bottlerocket/discussions"
BUG_REPORT_URL="https://github.com/bottlerocket-os/bottlerocket/issues"
DOCUMENTATION_URL="https://bottlerocket.dev"
bash-5.1# apiclient get settings.kubelet-device-plugin
{
  "settings": {
    "kubelet-device-plugins": {
      "nvidia": {
        "device-id-strategy": "index",
        "device-list-strategy": "volume-mounts",
        "pass-device-specs": true
      }
    }
  }
}
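The transcript above shows the observable effect of the migration: on 1.25.0 the new keys are present, and after going back to 1.24.1 they are gone while the older keys survive. The sketch below is a rough Python model of that downgrade behaviour only. It is not Bottlerocket's migration code, which this page does not show, and `downgrade` and `NEW_KEYS` are illustrative names.

```python
import copy

# Keys introduced by this PR, relative to settings.kubelet-device-plugins.nvidia.
NEW_KEYS = ("device-sharing-strategy", "time-slicing")


def downgrade(settings):
    """Return a copy of the settings tree without the keys the older version lacks."""
    result = copy.deepcopy(settings)
    nvidia = result["settings"]["kubelet-device-plugins"]["nvidia"]
    for key in NEW_KEYS:
        nvidia.pop(key, None)  # absent keys are ignored
    return result


after_upgrade = {
    "settings": {
        "kubelet-device-plugins": {
            "nvidia": {
                "device-id-strategy": "index",
                "device-list-strategy": "volume-mounts",
                "device-sharing-strategy": "time-slicing",
                "pass-device-specs": True,
                "time-slicing": {
                    "fail-requests-greater-than-one": True,
                    "rename-by-default": True,
                    "replicas": 2,
                },
            }
        }
    }
}

after_downgrade = downgrade(after_upgrade)
print(sorted(after_downgrade["settings"]["kubelet-device-plugins"]["nvidia"]))
```

Working on a deep copy keeps the input untouched, so the before and after trees can be compared side by side as in the transcript.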

  1. Instance joined the cluster:
NAME                                           STATUS   ROLES    AGE   VERSION
ip-XXXX.us-west-2.compute.internal   Ready    <none>   15h   v1.29.5-eks-1109419
  1. Model Default:
bash-5.1# apiclient get settings.kubelet-device-plugin
{
  "settings": {
    "kubelet-device-plugins": {
      "nvidia": {
        "device-id-strategy": "index",
        "device-list-strategy": "volume-mounts",
        "device-sharing-strategy": "none",
        "pass-device-specs": true
      }
    }
  }
}
  1. Model Updates:
bash-5.1# apiclient set settings.kubelet-device-plugins.nvidia.device-sharing-strategy=time-slicing settings.kubelet-device-plugins.nvidia.time-slicing.replicas=2 settings.kubelet-device-plugins.nvidia.time-slicing.rename-by-default=true settings.kubelet-device-plugins.nvidia.time-slicing.fail-requests-greater-than-one=true
bash-5.1# apiclient get settings.kubelet-device-plugin
{
  "settings": {
    "kubelet-device-plugins": {
      "nvidia": {
        "device-id-strategy": "index",
        "device-list-strategy": "volume-mounts",
        "device-sharing-strategy": "time-slicing",
        "pass-device-specs": true,
        "time-slicing": {
          "fail-requests-greater-than-one": true,
          "rename-by-default": true,
          "replicas": 2
        }
      }
    }
  }
}
  1. Bounds check:
bash-5.1# apiclient set settings.kubelet-device-plugins.nvidia.time-slicing.replicas=1
Failed to change settings: Failed PATCH request to '/settings/keypair?tx=apiclient-set-Ne2SavnsTF7Fweq0': Status 400 when PATCHing /settings/keypair?tx=apiclient-set-Ne2SavnsTF7Fweq0: Unable to match your input to the data model.  We may not have enough type information.  Please try the --json input form.  Cause: Error during deserialization: integer out of range, expected it to be between 2 and 2147483647 at line 1 column 107
bash-5.1#
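The error above reports the accepted range for `replicas` as 2 to 2147483647 inclusive. The same check can be sketched in Python as follows. The real validation happens during deserialization in the Bottlerocket settings model, which is not shown here, so `validate_replicas` is a hypothetical stand-in that only mirrors the bounds from the error message.

```python
MIN_REPLICAS = 2
MAX_REPLICAS = 2147483647  # upper bound quoted in the error message (2**31 - 1)


def validate_replicas(value):
    """Return value if it is an integer within the accepted range, else raise."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError("replicas must be an integer")
    if not MIN_REPLICAS <= value <= MAX_REPLICAS:
        raise ValueError(
            "integer out of range, expected it to be between "
            f"{MIN_REPLICAS} and {MAX_REPLICAS}"
        )
    return value


for candidate in (1, 2):
    try:
        print(candidate, "accepted:", validate_replicas(candidate))
    except ValueError as err:
        print(candidate, "rejected:", err)
```

A lower bound of 2 fits the feature: a single replica would mean the GPU is not shared at all.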
  1. Files generated:
bash-5.1# cat /etc/nvidia-k8s-device-plugin/settings.yaml
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  plugin:
    passDeviceSpecs: true
    deviceListStrategy: volume-mounts
    deviceIDStrategy: index
sharing:
  timeSlicing:
    renameByDefault: true
    failRequestsGreaterThanOne: true
    resources:
    - name: "nvidia.com/gpu"
      replicas: 2
nvidia-device-plugin[360007]:   "resources": {
nvidia-device-plugin[360007]:     "gpus": [
nvidia-device-plugin[360007]:       {
nvidia-device-plugin[360007]:         "pattern": "*",
nvidia-device-plugin[360007]:         "name": "nvidia.com/gpu"
nvidia-device-plugin[360007]:       }
nvidia-device-plugin[360007]:     ]
nvidia-device-plugin[360007]:   },
nvidia-device-plugin[360007]:   "sharing": {
nvidia-device-plugin[360007]:     "timeSlicing": {
nvidia-device-plugin[360007]:       "renameByDefault": true,
nvidia-device-plugin[360007]:       "failRequestsGreaterThanOne": true,
nvidia-device-plugin[360007]:       "resources": [
nvidia-device-plugin[360007]:         {
nvidia-device-plugin[360007]:           "name": "nvidia.com/gpu",
nvidia-device-plugin[360007]:           "rename": "nvidia.com/gpu.shared",
nvidia-device-plugin[360007]:           "devices": "all",
nvidia-device-plugin[360007]:           "replicas": 2
nvidia-device-plugin[360007]:         }
nvidia-device-plugin[360007]:       ]
nvidia-device-plugin[360007]:     }
nvidia-device-plugin[360007]:   }
nvidia-device-plugin[360007]: }
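The generated file maps the kebab-case API settings onto the camelCase fields of the device plugin configuration (`rename-by-default` becomes `renameByDefault`, and so on). As a sketch of that mapping, the Python function below reproduces the `sharing` section shown above from three values. Bottlerocket generates the real file itself, by means this page does not show, so `render_sharing` is purely illustrative.

```python
def render_sharing(replicas, rename_by_default, fail_requests_greater_than_one):
    """Render the `sharing` section in the layout shown in settings.yaml above."""

    def yaml_bool(flag):
        return "true" if flag else "false"

    lines = [
        "sharing:",
        "  timeSlicing:",
        f"    renameByDefault: {yaml_bool(rename_by_default)}",
        f"    failRequestsGreaterThanOne: {yaml_bool(fail_requests_greater_than_one)}",
        "    resources:",
        '    - name: "nvidia.com/gpu"',
        f"      replicas: {replicas}",
    ]
    return "\n".join(lines) + "\n"


rendered = render_sharing(2, True, True)
print(rendered, end="")
```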
  1. Time-slicing on one instance with one GPU and two containers:
bash-5.1# apiclient set settings.kubelet-device-plugins.nvidia.device-sharing-strategy=time-slicing settings.kubelet-device-plugins.nvidia.time-slicing.replicas=2 
bash-5.1# apiclient get settings.kubelet-device-plugin
{
  "settings": {
    "kubelet-device-plugins": {
      "nvidia": {
        "device-id-strategy": "index",
        "device-list-strategy": "volume-mounts",
        "device-sharing-strategy": "time-slicing",
        "pass-device-specs": true,
        "time-slicing": {
          "replicas": 2
        }
      }
    }
  }
}

bash-5.1# cat /etc/nvidia-k8s-device-plugin/settings.yaml
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  plugin:
    passDeviceSpecs: true
    deviceListStrategy: volume-mounts
    deviceIDStrategy: index
sharing:
  timeSlicing:
    renameByDefault: true
    failRequestsGreaterThanOne: true
    resources:
    - name: "nvidia.com/gpu"
      replicas: 2
bash-5.1#

➜  kubectl apply -f node-1-gpu-2.yaml
pod/gpu-test-2-pod-node-one created

 ➜  kubectl get pods
NAME                      READY   STATUS    RESTARTS   AGE
gpu-test-1-pod-node-one   1/1     Running   0          13m
gpu-test-2-pod-node-one   1/1     Running   0          3s
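Both pods can run on the single GPU because, with time-slicing, the device plugin advertises each physical GPU `replicas` times. A small Python model of that arithmetic is sketched below. The `.shared` suffix follows the `rename` value in the device plugin log earlier in this description; the function name `advertised_resource` is hypothetical.

```python
def advertised_resource(gpus, replicas, rename_by_default):
    """Return (resource name, schedulable count) for a time-sliced node."""
    name = "nvidia.com/gpu"
    if rename_by_default:
        # Matches the "rename" field in the device plugin log above.
        name += ".shared"
    return name, gpus * replicas


name, count = advertised_resource(gpus=1, replicas=2, rename_by_default=True)
print(f"{name}: {count}")
```

With one GPU and `replicas=2` the node offers two schedulable units, one for each of the two test pods.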

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@KCSesh KCSesh force-pushed the api-time-slice branch 2 times, most recently from 17f8249 to 9ffec57 Compare October 10, 2024 02:55
@KCSesh KCSesh marked this pull request as ready for review October 10, 2024 03:43
@KCSesh KCSesh merged commit 742f343 into bottlerocket-os:develop Oct 10, 2024
2 checks passed
@KCSesh KCSesh deleted the api-time-slice branch October 23, 2024 19:03