From c9e0f5e3db7a87a117ed9477f498f311d4a761d5 Mon Sep 17 00:00:00 2001
From: Maxime Hugues
Date: Tue, 30 Apr 2024 12:07:59 -0500
Subject: [PATCH] Add draft gpu troubles

---
 troubleshooting/GPU-Troubleshooting.md | 70 ++++++++++++++++++++++++++
 1 file changed, 70 insertions(+)
 create mode 100644 troubleshooting/GPU-Troubleshooting.md

diff --git a/troubleshooting/GPU-Troubleshooting.md b/troubleshooting/GPU-Troubleshooting.md
new file mode 100644
index 00000000..6a986b15
--- /dev/null
+++ b/troubleshooting/GPU-Troubleshooting.md
@@ -0,0 +1,70 @@
+# GPU Troubleshooting Guide
+
+This guide describes how to identify and resolve issues on Amazon EC2 instances with NVIDIA GPUs.
+
+While running High-Performance Computing or Machine Learning workloads, GPUs may fail for various reasons, which the NVIDIA driver reports as Xid messages.
+These messages are written to `/var/log/messages` on Amazon Linux, and to `/var/log/syslog` and `/var/log/kern.log` on Ubuntu.
+
+| Xid | Failure               | Resolution          | Orchestrator                                            |
+| --- | --------------------- | ------------------- | ------------------------------------------------------- |
+| 48  | Double-bit ECC error  | Terminate instances | [AWS ParallelCluster](#terminate-and-replace-instances) |
+| 94  | Contained ECC error   | Reset GPUs          | [AWS ParallelCluster](#reset-gpus)                      |
+| 95  | Uncontained ECC error | Reset GPUs          | [AWS ParallelCluster](#reset-gpus)                      |
+
+# AWS ParallelCluster
+
+## Terminate and replace instances
+
+1. Create a reservation to prevent any new jobs from being scheduled on the node.
+   ```bash
+   sudo /opt/slurm/bin/scontrol create res starttime=now duration=infinite flags=ignore_jobs user=root nodes=[NODE_TO_TERMINATE]
+   ```
+
+1. Cancel the jobs running on the node.
+   ```bash
+   scancel [JOB_ID]
+   ```
+
+1. Place the node in the **DRAIN** state.
+   ```bash
+   sudo /opt/slurm/bin/scontrol update node=[NODE_TO_TERMINATE] state=drain reason=gpus-fail
+   ```
+
+   The node will show a **DRAIN** status. The instance will then be terminated and replaced.
+
+
+1. Delete the reservation.
+   ```bash
+   sudo /opt/slurm/bin/scontrol delete res root_[RES_NUMBER]
+   ```
+
+## Reset GPUs
+
+1. Create a reservation to prevent any new jobs from being scheduled on the node.
+   ```bash
+   sudo /opt/slurm/bin/scontrol create res starttime=now duration=infinite flags=ignore_jobs user=root nodes=[NODE_TO_RESET]
+   ```
+
+1. Cancel the jobs running on the node.
+   ```bash
+   scancel [JOB_ID]
+   ```
+
+1. Reset the GPUs.
+   ```bash
+   sudo /opt/slurm/bin/srun -w [NODE_TO_RESET] nvidia-smi -r
+   ```
+
+1. Delete the reservation.
+   ```bash
+   sudo /opt/slurm/bin/scontrol delete res root_[RES_NUMBER]
+   ```
+
+
+# Amazon SageMaker HyperPod
+
+TBD
+
+# Amazon EKS
+
+TBD
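
The Xid lookup described at the top of the guide can be sketched as a small shell helper. This is only a minimal sketch: `xid_codes` is a hypothetical function name, and the pattern assumes the driver's usual `NVRM: Xid (PCI:...): <code>, ...` kernel log line format. It prints each distinct Xid code found in a log file so the result can be compared against the table.

```shell
# Hypothetical helper: list the distinct Xid codes present in a log file.
# Assumes kernel log lines of the form "NVRM: Xid (PCI:0000:10:1c): 94, ...".
xid_codes() {
    # Keep only the "NVRM: Xid (...): <code>" part of each matching line,
    # print the trailing code, then sort numerically and drop duplicates.
    grep -oE 'NVRM: Xid \([^)]*\): [0-9]+' "$1" | awk '{print $NF}' | sort -un
}
```

For example, `xid_codes /var/log/messages` on Amazon Linux or `xid_codes /var/log/syslog` on Ubuntu; reading these files may require `sudo`. An empty result means no Xid lines matched the assumed format.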