
feat(al2023): NVIDIA variant in isolated partitions - Install NVIDIA Container toolkit/deps #2032

Closed (wants to merge 4 commits)
34 changes: 33 additions & 1 deletion templates/al2023/provisioners/install-nvidia-driver.sh
@@ -20,6 +20,27 @@ function is-isolated-partition() {
return 0
}

function rpm_install() {
  local RPMS=("$@")
  echo "pulling and installing rpms: (${RPMS[*]}) from s3 bucket: (${BINARY_BUCKET_NAME}) in region: (${BINARY_BUCKET_REGION})"
  for RPM in "${RPMS[@]}"; do
    aws s3 cp --region "${BINARY_BUCKET_REGION}" "s3://${BINARY_BUCKET_NAME}/rpms/${RPM}" "${WORKING_DIR}/${RPM}"
    sudo dnf localinstall -y "${WORKING_DIR}/${RPM}"
  done
}

function patch-nvidia-container-toolkit() {
  # The order of these RPMs is important, as they have dependencies on each other
  RPMS=("libnvidia-container1-1.16.2-1.x86_64.rpm" "nvidia-container-toolkit-base-1.16.2-1.x86_64.rpm" "libnvidia-container-tools-1.16.2-1.x86_64.rpm" "nvidia-container-toolkit-1.16.2-1.x86_64.rpm")
  for RPM in "${RPMS[@]}"; do
    echo "pulling and installing rpm: (${RPM}) from s3 bucket: (${BINARY_BUCKET_NAME}) in region: (${BINARY_BUCKET_REGION})"
    aws s3 cp --region "${BINARY_BUCKET_REGION}" "s3://${BINARY_BUCKET_NAME}/rpms/${RPM}" "${WORKING_DIR}/${RPM}"
[Review thread on this line]

@Issacwww (Member), Oct 31, 2024:
My main concern is that using this approach would require us to vendor these RPMs into a public S3 bucket for isolated regions; otherwise, customers in those regions cannot build the AMI.

Reply from the PR author (Contributor):
The RPMs necessary for the NVIDIA container toolkit should already be in the S3 bucket used for building the NVIDIA-variant AMIs in isolated regions. This is the same approach we use for AL2 GPU AMI building. I agree it's not the best approach, but I don't really see any other option for installing the NVIDIA container toolkit (isolated partitions have no internet connectivity).

If nvidia-container-toolkit and its necessary dependencies were vended from the Amazon Linux repo, then yes, you would be correct that this would no longer be needed. It is probably worth a discussion with the Amazon Linux team to see whether they could vend it.

Eventually there will be a project to directly copy and patch EKS node-variant AMIs from commercial to isolated partitions, but that is still under research.

    echo "installing rpm: ${WORKING_DIR}/${RPM}"
    sudo rpm -ivh "${WORKING_DIR}/${RPM}"
  done
}
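The review thread above hinges on the required RPMs already being present in the build bucket. One way to surface a missing artifact early, rather than failing mid-provisioning, would be a pre-flight existence check. The sketch below is illustrative only and not part of the PR; `rpm_s3_uri` and `preflight_check_rpms` are hypothetical helper names, assuming the same `s3://<bucket>/rpms/<rpm>` layout the script pulls from.

```shell
# Hypothetical helper: compose the S3 URI for an RPM, mirroring the
# s3://<bucket>/rpms/<rpm> layout used by rpm_install above.
rpm_s3_uri() {
  local bucket="$1" rpm="$2"
  echo "s3://${bucket}/rpms/${rpm}"
}

# Sketch of a pre-flight check: verify every required RPM exists in the
# bucket before installing anything, so a missing artifact fails the
# build immediately with a clear message.
preflight_check_rpms() {
  local bucket="$1" region="$2"
  shift 2
  local rpm
  for rpm in "$@"; do
    # `aws s3 ls` on a full key exits nonzero when no object matches.
    if ! aws s3 ls --region "${region}" "$(rpm_s3_uri "${bucket}" "${rpm}")" > /dev/null; then
      echo "required RPM not found in bucket ${bucket}: ${rpm}" >&2
      return 1
    fi
  done
}
```

Such a check could run once at the top of the isolated-partition branch, before any `aws s3 cp` calls.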


echo "Installing NVIDIA ${NVIDIA_DRIVER_MAJOR_VERSION} drivers..."

################################################################################
@@ -37,6 +58,10 @@ if is-isolated-partition; then
gpgcheck=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-amazon-linux-2023' | sudo tee /etc/yum.repos.d/amzn2023-nvidia.repo

  # these are required in order to build kmod-nvidia-open-dkms;
  # they would normally be available from EPEL, but EPEL isn't reachable in isolated partitions
rpm_install "opencl-filesystem-1.0-5.el7.noarch.rpm" "ocl-icd-2.2.12-1.el7.x86_64.rpm"

else
if [[ $AWS_REGION == cn-* ]]; then
DOMAIN="nvidia.cn"
@@ -102,7 +127,14 @@ sudo systemctl enable set-nvidia-clocks.service
################################################################################
### Install other dependencies #################################################
################################################################################
-sudo dnf -y install nvidia-fabric-manager nvidia-container-toolkit
+sudo dnf -y install nvidia-fabric-manager

# NVIDIA Container toolkit needs to be locally installed for isolated partitions
if is-isolated-partition; then
patch-nvidia-container-toolkit
else
  sudo dnf -y install nvidia-container-toolkit
fi

sudo systemctl enable nvidia-fabricmanager
sudo systemctl enable nvidia-persistenced
2 changes: 2 additions & 0 deletions templates/al2023/template.json
@@ -237,6 +237,8 @@
"environment_vars": [
"AWS_REGION={{user `aws_region`}}",
"ENABLE_ACCELERATOR={{user `enable_accelerator`}}",
"BINARY_BUCKET_NAME={{user `binary_bucket_name`}}",
"BINARY_BUCKET_REGION={{user `binary_bucket_region`}}",
"NVIDIA_DRIVER_MAJOR_VERSION={{user `nvidia_driver_major_version`}}",
"WORKING_DIR={{user `working_dir`}}"
]
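The two new `environment_vars` entries are only meaningful if Packer actually supplies them; run outside Packer, the provisioner would silently build empty S3 paths. A small guard could make that failure explicit. This is a sketch, not part of the PR; `require_env_vars` is a hypothetical helper, and `${!var}` is bash indirect expansion.

```shell
# Hypothetical guard: fail with a clear message if any named environment
# variable is unset or empty, instead of producing "s3:///rpms/..." paths.
require_env_vars() {
  local var
  for var in "$@"; do
    # ${!var} expands to the value of the variable whose name is in $var
    if [ -z "${!var:-}" ]; then
      echo "required environment variable ${var} is not set" >&2
      return 1
    fi
  done
}

# Example use at the top of the provisioner:
# require_env_vars BINARY_BUCKET_NAME BINARY_BUCKET_REGION
```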