Commit c9e0f5e (1 parent: b1e89bb): 1 changed file with 70 additions and 0 deletions.
# GPU Troubleshooting Guide

This guide describes how to identify and resolve issues on Amazon EC2 instances with NVIDIA GPUs.

While running High-Performance Computing or Machine Learning workloads, GPUs may fail for various reasons, which the NVIDIA driver reports as Xid messages.
These messages are written to `/var/log/messages` on Amazon Linux and to `/var/log/syslog` and `/var/log/kern.log` on Ubuntu.
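
To check which Xid codes a node has logged, the log file can be scanned for driver messages. The following is a minimal sketch, not part of the original guide: the function name `list_xid_codes` is hypothetical, and it assumes the driver writes lines of the form `NVRM: Xid (PCI:0000:10:1c): 48, pid=...`, which may differ between driver versions.

```bash
# Print the distinct Xid codes found in a log file, one per line, sorted numerically.
# Assumption: Xid lines look like "NVRM: Xid (<PCI address>): <code>, ...".
list_xid_codes() {
    local log_file="$1"
    # Keep only the numeric code that follows "NVRM: Xid (...):" on matching lines.
    sed -nE 's/.*NVRM: Xid \([^)]*\): *([0-9]+).*/\1/p' "$log_file" | sort -nu
}
```

For example, `list_xid_codes /var/log/messages` on Amazon Linux or `list_xid_codes /var/log/syslog` on Ubuntu. Reading these files may require `sudo`.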

| Xid | Failure               | Resolution          | Orchestrator                                            |
| --- | --------------------- | ------------------- | ------------------------------------------------------- |
| 48  | Double Bit ECC error  | Terminate instances | [AWS ParallelCluster](#terminate-and-replace-instances) |
| 94  | Contained ECC error   | Reset GPUs          | [AWS ParallelCluster](#reset-gpus)                      |
| 95  | Uncontained ECC error | Reset GPUs          | [AWS ParallelCluster](#reset-gpus)                      |
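
The table can also be expressed as a small lookup for use in scripts. This is an illustrative sketch only: the function name `xid_resolution` is hypothetical, and the fallback message for other codes is an assumption, since the guide covers only Xid 48, 94, and 95.

```bash
# Map an Xid code to the resolution listed in the table above.
xid_resolution() {
    case "$1" in
        48)    echo "Terminate instances" ;;
        94|95) echo "Reset GPUs" ;;
        # Any other code is outside the scope of this guide.
        *)     echo "Not covered by this guide" ;;
    esac
}
```

For example, `xid_resolution 94` prints `Reset GPUs`.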

# AWS ParallelCluster

## Terminate and replace instances

1. Create a reservation to prevent any jobs from using the node.
    ```bash
    sudo /opt/slurm/bin/scontrol create res starttime=now duration=infinite flags=ignore_jobs user=root nodes=[NODE_TO_TERMINATE]
    ```

1. Cancel any jobs running on the node.
    ```bash
    scancel [JOB_ID]
    ```

1. Place the node in **DRAIN**.
    ```bash
    sudo /opt/slurm/bin/scontrol update node=[NODE_TO_TERMINATE] state=drain reason=gpus-fail
    ```

    The node will show a **DRAIN** status, after which the instance is terminated and replaced.

1. Delete the reservation.
    ```bash
    sudo /opt/slurm/bin/scontrol delete res root_[RES_NUMBER]
    ```
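
The last step needs the reservation name (`root_[RES_NUMBER]`). One way to look it up is to filter the reservation list by node. The sketch below is an assumption-laden helper, not part of the original guide: `reservations_for_node` is a hypothetical name, and it assumes that `scontrol -o show res` prints one reservation per line with `ReservationName=` and `Nodes=` fields.

```bash
# Print the names of reservations whose node list is exactly the given node.
# Reads one-line-per-reservation output (as from `scontrol -o show res`) on stdin.
reservations_for_node() {
    local node="$1"
    awk -v node="$node" '{
        name = ""; nodes = ""
        for (i = 1; i <= NF; i++) {
            # Strip the "Key=" prefix to keep only the value.
            if ($i ~ /^ReservationName=/) name = substr($i, length("ReservationName=") + 1)
            if ($i ~ /^Nodes=/) nodes = substr($i, length("Nodes=") + 1)
        }
        if (name != "" && nodes == node) print name
    }'
}
```

A possible invocation is `sudo /opt/slurm/bin/scontrol -o show res | reservations_for_node [NODE_TO_TERMINATE]`. The exact-match comparison is deliberate: a reservation covering several nodes is not reported, so it is not deleted by mistake.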

## Reset GPUs

1. Create a reservation to prevent any jobs from using the node.
    ```bash
    sudo /opt/slurm/bin/scontrol create res starttime=now duration=infinite flags=ignore_jobs user=root nodes=[NODE_TO_RESET]
    ```

1. Cancel any jobs running on the node.
    ```bash
    scancel [JOB_ID]
    ```

1. Reset the GPUs.
    ```bash
    sudo /opt/slurm/bin/srun -w [NODE_TO_RESET] nvidia-smi -r
    ```

1. Delete the reservation.
    ```bash
    sudo /opt/slurm/bin/scontrol delete res root_[RES_NUMBER]
    ```
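
Before touching a production node, it can help to preview the reset sequence with the placeholders filled in. The following sketch only prints the commands from the steps above and runs none of them. The function name `print_gpu_reset_plan` and its two arguments are hypothetical, and the reservation name in the final step is left as a placeholder because it is only known after the reservation is created.

```bash
# Print, without executing, the commands of the GPU reset procedure.
# Arguments: node name, then the ID of the job to cancel.
print_gpu_reset_plan() {
    local node="$1" job_id="$2"
    local scontrol=/opt/slurm/bin/scontrol
    echo "sudo $scontrol create res starttime=now duration=infinite flags=ignore_jobs user=root nodes=$node"
    echo "scancel $job_id"
    echo "sudo /opt/slurm/bin/srun -w $node nvidia-smi -r"
    # The reservation name is assigned by Slurm, so it stays a placeholder here.
    echo "sudo $scontrol delete res root_[RES_NUMBER]"
}
```

For example, `print_gpu_reset_plan queue1-dy-gpu-1 1234` prints the four commands for review, where both argument values are made up for illustration.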

# Amazon SageMaker HyperPod

TBD

# Amazon EKS

TBD