Add build scripts for building Nvidia and Neuron AMIs based on AL2023 #1924
Conversation
templates/al2023/runtime/rootfs/etc/systemd/system/configure-multicard-interfaces.service
Remove multicard config script
Remove CUDA packages
Remove OCI config for Neuron
nvidiaOptions := `
[plugins.'io.containerd.grpc.v1.cri'.containerd]
default_runtime_name = 'nvidia'
discard_unpacked_layers = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
base_runtime_spec = "/etc/containerd/base-runtime-spec.json"
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
BinaryName = "/usr/bin/nvidia-container-runtime"
SystemdCgroup = true
`
- If this is a constant, let's move it out of this func.
- If you're having to include keys like SystemdCgroup = true, etc. so they don't get lost during the merge, you should be able to just swap the order of the args you're passing to util.Merge.
- Moved it out of the function and declared it as a constant.
- I'm not sure how swapping will help. IIUC, [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia] and [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options] are not present in config.template.toml, so if we want them to be present in the resulting config, we'll have to pass everything we need as is (SystemdCgroup, BinaryName, etc.).
I see what you mean. That's really not ideal; we don't want to have to keep this blob in sync with what the rest of the config flow does to these nested structs. Does nvidia-ctk clone your existing runtime config and just change the name? How does this work there?
nvidia-ctk uses /etc/containerd/config.toml as the base and does an in-place change (adding whatever it needs to the base). This is why I was initially writing the config file first, calling nvidia-ctk, and passing the resulting config back. Let me see what happens if we call it on an empty file and merge the result with our template.
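(For context, a rough sketch of that experiment. The nvidia-ctk flags shown are the ones I'm aware of and worth double-checking on the AMI; the scratch path is purely illustrative.)

```bash
#!/usr/bin/env bash
set -euo pipefail

# Work on a scratch file instead of the live /etc/containerd/config.toml,
# so nvidia-ctk only emits the nvidia runtime pieces we want to merge later.
scratch="$(mktemp)"

# Inject the nvidia runtime into the scratch config and make it the default,
# mirroring default_runtime_name = 'nvidia' in the snippet above.
nvidia-ctk runtime configure \
  --runtime=containerd \
  --config="${scratch}" \
  --set-as-default

# Inspect what nvidia-ctk generated before merging it with config.template.toml.
cat "${scratch}"
```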
I'm wondering how nvidia-ctk handles options like SystemdCgroup, base_runtime_spec, etc. It must be copying our existing runc runtime config and just plugging in nvidia and BinaryName.
Could you just update our existing containerd config template such that you can swap in nvidia? https://github.com/awslabs/amazon-eks-ami/blob/main/nodeadm/internal/containerd/config.template.toml
I think I addressed that, check the latest code.
Co-authored-by: Carter <[email protected]>
elif [[ $GPUNAME == *"M60"* ]]; then
  nvidia-smi -ac 2505,1177
elif [[ $GPUNAME == *"H100"* ]]; then
  nvidia-smi -ac 2619,1980
if we aren't going to dynamically pull these at boot, we should proactively add all known cards - L40S (g6e) was just launched - B200, GB200, etc.?
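(If we did go the dynamic route, it could look roughly like this. A sketch only, using nvidia-smi query fields rather than anything that's in this PR.)

```bash
#!/usr/bin/env bash
set -euo pipefail

# Query the maximum supported memory/graphics clocks for GPU 0 instead of
# hardcoding per-card values, then apply them as application clocks.
read -r MAX_MEM MAX_GFX < <(nvidia-smi -i 0 \
  --query-gpu=clocks.max.memory,clocks.max.graphics \
  --format=csv,noheader,nounits | tr -d ',')

nvidia-smi -ac "${MAX_MEM},${MAX_GFX}"
```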
also, do we need to set nvidia-smi -lgc (--lock-gpu-clocks) xx and nvidia-smi -lmc (--lock-memory-clocks) yy? do we test and validate that the clock speeds are maintained across reboots?
> if we aren't going to dynamically pull these at boot, we should proactively add all known cards - L40S (g6e) was just launched - B200, GB200, etc.?
I'll go over the existing instances and make sure we have all cards.
> also, do we need to set nvidia-smi -lgc (--lock-gpu-clocks) xx and nvidia-smi -lmc (--lock-memory-clocks) yy? do we test and validate that the clock speeds are maintained across reboots?
IIUC from the nvidia-smi help, the flag that we currently set (--applications-clocks) specifies the clocks used while something is running on the GPU, whereas the flags you mention lock the GPU and memory clocks at all times (with or without load). We don't use lock-gpu and lock-mem for the other GPU AMI that we support, so if you think it's beneficial, we can set them.
I'll check the reboot persistence; since this is executed as a systemd unit, it should persist across reboots.
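(For anyone skimming, a quick sketch of the distinction and of how persistence can be checked. Clock values are copied from the script above; -q output formatting may vary by driver version.)

```bash
# Application clocks: apply while a workload is running on the GPU
# (this is what the script sets today).
nvidia-smi -ac 2619,1980            # <memory>,<graphics> in MHz

# Locked clocks: pin the clocks at all times, with or without load.
nvidia-smi -lmc 2619,2619           # --lock-memory-clocks <min>,<max>
nvidia-smi -lgc 1980,1980           # --lock-gpu-clocks <min>,<max>

# Verify what is actually in effect, e.g. after a reboot.
nvidia-smi -q -d CLOCK
```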
Clock settings persist across reboots; I left out the -lgc and -lmc flags.
sounds good to me!
Some final nits 👍
please fix the lint, should be easy via
/ci
@Issacwww the workflow that you requested has completed. 🎉
done
# load the nvidia kernel module appropriate for the attached devices
MODULE_NAME=""
if devices-support-open; then
since the older cards that require the proprietary driver are a known quantity, and new cards that will be released in the future are unknown to this script (and therefore the PCI class codes are unknown here) - should we flip this logic to default to using the open GPU kernel module unless we know the proprietary driver is absolutely required?
this may require some changes above as well - but I think the appropriate end goal here is to default to the open GPU kernel module, unless we know for certain the proprietary driver is required
Let's check with @Issacwww about that. I'm not the original writer of this.
I think this is a fair call out; I will follow up in a subsequent PR as the current one is big enough :)
Sounds good, I'm also in favor of changing this in a different PR.
> new cards that will be released in the future are unknown to this script (and therefore the PCI class codes are unknown here)
The PCI class codes don't change per device, these have been the same for a few decades now. :)
> should we flip this logic to default to using the open GPU kernel module unless we know the proprietary driver is absolutely required
I didn't go this route because it's way messier. Compiling (and maintaining) an exhaustive list of every PCI device ID used by those EC2 instance types is quite a chore -- p3 uses V100's, but there are many revisions and variants of V100 in use. With the current approach, we generate the supported device file from the driver archive and do a simple lookup.
Long term, yes we want to default to the open source kmod. NVIDIA is going to do that in the 560 driver branch, and we can revisit/refactor accordingly then, IMO.
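(To make that concrete, roughly what the lookup amounts to. The file path and helper shape here are illustrative, not the exact script in this PR; NVIDIA's driver installer ships a supported-gpus.json with a "kernelopen" feature flag that such a list can be generated from.)

```bash
#!/usr/bin/env bash

# Illustrative path: the AMI build would generate this list of device IDs
# from the driver archive at build time.
OPEN_SUPPORTED_DEVICES="/etc/eks/nvidia-open-supported-devices.txt"

devices-support-open() {
  # Enumerate attached NVIDIA PCI devices (vendor 10de) as vendor:device IDs
  # and require every one of them to appear in the generated list.
  local id
  while read -r id; do
    grep -qi "${id#10de:}" "${OPEN_SUPPORTED_DEVICES}" || return 1
  done < <(lspci -n -d 10de: | awk '{print $3}')
}

if devices-support-open; then
  echo "all attached GPUs support the open kernel module"
fi
```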
# gpu boost clock
sudo nvidia-smi -pm 1 # set persistence mode
sudo nvidia-smi --auto-boost-default=0
in the final else case below, we should not disable auto boost if the card is unknown to this script
if a new instance is launched, we don't want to hamper its performance in any form if we skip updating this script - let it fall back to the default behavior of the card
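(i.e. something shaped like this. A sketch only, reusing the GPUNAME variable and clock values from the snippets above rather than the exact script.)

```bash
#!/usr/bin/env bash

GPUNAME="$(nvidia-smi --query-gpu=name --format=csv,noheader -i 0)"

# Persistence mode is safe to enable regardless of the card.
nvidia-smi -pm 1

if [[ $GPUNAME == *"M60"* ]]; then
  nvidia-smi --auto-boost-default=0
  nvidia-smi -ac 2505,1177
elif [[ $GPUNAME == *"H100"* ]]; then
  nvidia-smi --auto-boost-default=0
  nvidia-smi -ac 2619,1980
else
  # Unknown/new card: don't disable auto boost or pin clocks;
  # fall back to the card's default behavior.
  echo "Unrecognized GPU '${GPUNAME}', keeping default boost/clock settings"
fi
```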
@bryantbiggs do you mind if we address this in another PR?
absolutely - as stated below, the two added comments here are great candidates for follow-up PRs if we think the changes are worthy. But I think this turkey is cooked as is, so -
looks great to me - two comments that can be followed up in later PRs to ensure we fail open to the appropriate path
Issue #, if available:
Description of changes:
Adding build scripts for building Nvidia and Neuron AMIs based on AL2023 OS.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Testing Done
Neuron:
Nvidia:
Ran nvidia-smi from it.
Nodeadm E2E tests are also passing.
See this guide for recommended testing for PRs. Some tests may not apply. Completing tests and providing additional validation steps are not required, but it is recommended and may reduce review time and time to merge.