From 7a102a327b8f53156fc914a069d4dbdaa7c249b2 Mon Sep 17 00:00:00 2001
From: Maxime Hugues
Date: Tue, 30 Apr 2024 12:07:59 -0500
Subject: [PATCH] Add draft gpu troubles

---
 troubleshooting/GPU-Troubleshooting.md | 92 ++++++++++++++++++++++++++
 1 file changed, 92 insertions(+)
 create mode 100644 troubleshooting/GPU-Troubleshooting.md

diff --git a/troubleshooting/GPU-Troubleshooting.md b/troubleshooting/GPU-Troubleshooting.md
new file mode 100644
index 00000000..1c71d409
--- /dev/null
+++ b/troubleshooting/GPU-Troubleshooting.md
@@ -0,0 +1,92 @@
+# GPU Troubleshooting Guide
+
+This guide describes how to identify and resolve issues on Amazon EC2 instances with NVIDIA GPUs.
+
+While running High-Performance Computing or Machine Learning workloads, GPUs may fail for various reasons, which the NVIDIA driver reports as Xid messages.
+These messages are written to `/var/log/messages` on Amazon Linux, or to `/var/log/syslog` and `/var/log/kern.log` on Ubuntu.
+
+| Xid | Failure                                                                                          | Resolution          | Orchestrator                                            |
+| --- | ------------------------------------------------------------------------------------------------ | ------------------- | ------------------------------------------------------- |
+| 48  | Double Bit ECC                                                                                   | Terminate instances | [AWS ParallelCluster](#terminate-and-replace-instances) |
+| 64  | ECC page retirement or row remapper recording failure<br>All reserved rows for bank are remapped | Terminate instances | [AWS ParallelCluster](#terminate-and-replace-instances) |
+|     |                                                                                                  |                     |                                                         |
+| 79  | GPU has fallen off the bus                                                                       | Reboot              |                                                         |
+| 94  | Contained ECC error                                                                              | Reset GPUs          | [AWS ParallelCluster](#reset-gpus)                      |
+| 95  | Uncontained ECC error                                                                            | Reset GPUs          | [AWS ParallelCluster](#reset-gpus)                      |
+| 120 | GSP Error                                                                                        | Terminate instances | [AWS ParallelCluster](#terminate-and-replace-instances) |
+
+# AWS ParallelCluster
+
+## Terminate and replace instances
+
+1. Create a reservation on the node so that no new jobs are scheduled on it.
+
+    ```bash
+    sudo /opt/slurm/bin/scontrol create res starttime=now duration=infinite flags=ignore_jobs user=root nodes=[NODE_TO_TERMINATE]
+    ```
+
+1. Identify the jobs running on the node to terminate.
+
+    ```bash
+    squeue -w [NODE_TO_TERMINATE] -o %A -h
+    ```
+
+1. Cancel each of those jobs.
+
+    ```bash
+    scancel [JOB_ID]
+    ```
+
+1. Place the node in **DRAIN**.
+
+    ```bash
+    sudo /opt/slurm/bin/scontrol update node=[NODE_TO_TERMINATE] state=drain reason=gpus-fail
+    ```
+
+    The node will show a **DRAIN** status. The instance will then be terminated and replaced.
+
+1. Delete the reservation.
+
+    ```bash
+    sudo /opt/slurm/bin/scontrol delete res root_[RES_NUMBER]
+    ```
+
+## Reset GPUs
+
+1. Create a reservation on the node so that no new jobs are scheduled on it.
+
+    ```bash
+    sudo /opt/slurm/bin/scontrol create res starttime=now duration=infinite flags=ignore_jobs user=root nodes=[NODE_TO_RESET]
+    ```
+
+1. Identify the jobs running on the node to reset.
+
+    ```bash
+    squeue -w [NODE_TO_RESET] -o %A -h
+    ```
+
+1. Cancel each of those jobs.
+
+    ```bash
+    scancel [JOB_ID]
+    ```
+
+1. Reset the GPUs.
+
+    ```bash
+    sudo /opt/slurm/bin/srun -w [NODE_TO_RESET] nvidia-smi -r
+    ```
+
+1. Delete the reservation.
+
+    ```bash
+    sudo /opt/slurm/bin/scontrol delete res root_[RES_NUMBER]
+    ```
+
+# Amazon SageMaker HyperPod
+
+TBD
+
+# Amazon EKS
+
+TBD