diff --git a/1.architectures/5.sagemaker-hyperpod/README.md b/1.architectures/5.sagemaker-hyperpod/README.md
index bdd59c5b..1dfe366a 100644
--- a/1.architectures/5.sagemaker-hyperpod/README.md
+++ b/1.architectures/5.sagemaker-hyperpod/README.md
@@ -325,16 +325,16 @@ srun -N 8 python3 hyperpod-precheck.py
 
 Follow the mitigations listed in this table if one of the checks fails:
 
-| Test | Description | Failure mitigation |
-|------|-------------|--------------------|
-| check_if_docker_installed | Life-cycle scripts ensure that docker is installed on all nodes
This checks if docker is available on all compute nodes | Run life-cycle scripts manually
`cd /tmp/sagemaker-lifecycle-* && cd src/utils/ && srun -N bash install_docker.sh` |
-| check_enroot_runtime_path | Make sure the `ENROOT_RUNTIME_PATH` is pointed to the right directory | Follow [these steps](https://catalog.workshops.aws/sagemaker-hyperpod/en-US/01-cluster/08-docker-setup#enroot) in the workshop (Cluster Setup > Docker Setup > Enroot) |
-| check_docker_data_root | Docker data root should be at `/opt/dlami/nvme/data-root` | Run life-cycle scripts manually```cd /tmp/sagemaker-lifecycle-* && cd src/utils/ && srun -N bash install_docker.sh``` |
-| check_if_fsx_mounted | `df -h` should show /fsx as mounted | Speak to AWS; We have ensured provisioning parameters include this. So if it's not mounted, we need to investigate this issue. |
-| check_if_pyxis_installed | Pyxis is a container plugin for Slurm. Should be installed by default through life-cycle scripts when provisioning cluster | Run life-cycle scripts manually ```cd /tmp/sagemaker-lifecycle-* && cd src/utils/ && srun -N bash install_enroot_pyxis.sh``` |
-| check_slurmd_service_status | Check if `slrumd` is running across all compute instances | Sometimes slurm can fail due to an underlying error. If this check fails, ssh into the specific host and run `sudo systemctl status slurmd` and find the reason. Then restart it using `sudo systemctl start slurmd`. If it fails again check `sudo journalctl -xe` to see what has gone wrong |
-| check_if_user_directory_on_fsx | This checks if users are sharing /fsx file system mount | Multi user setup will create /fsx/ mounts. Follow [those steps here](https://catalog.workshops.aws/sagemaker-hyperpod/en-US/04-advanced/01-multi-user)
If the user directory doesn't exist for nodes that have been replaced
Run a variant of this command for your nodes
`srun -N 2 usermod -d /fsx/ubuntu ubuntu`
(Replace ubuntu with username) |
-| nvidia_cli_installed | Nvidia Container CLI is installed via docker life cycle scripts. It's unlikely this will be an issue. | Go to [this page](https://catalog.workshops.aws/sagemaker-hyperpod/en-US/03-megatron-lm/01-pre-process) and look for the command that runs the nvidia-container-cli installation.
Create a script from those steps and either use sbatch or srun to execute across all compute nodes
You can also use this same script to check for unsupported operations in your training launch script |
+| Test | Description | Failure mitigation |
+|------|-------------|--------------------|
+| check_if_docker_installed | The life-cycle scripts ensure that Docker is installed on all nodes.
This checks that Docker is available on every compute node. | Run the life-cycle scripts manually:
`cd /tmp/sagemaker-lifecycle-* && cd src/utils/ && srun -N <number-of-nodes> bash install_docker.sh`
(a sketch for running this across all nodes follows the table) |
+| check_enroot_runtime_path | Make sure `ENROOT_RUNTIME_PATH` points to the right directory. | Follow [these steps](https://catalog.workshops.aws/sagemaker-hyperpod/en-US/01-cluster/08-docker-setup#enroot) in the workshop (Cluster Setup > Docker Setup > Enroot). |
+| check_docker_data_root | The Docker data root should be at `/opt/dlami/nvme/data-root`. | Run the life-cycle scripts manually:
`cd /tmp/sagemaker-lifecycle-* && cd src/utils/ && srun -N <number-of-nodes> bash install_docker.sh` |
+| check_if_fsx_mounted | `df -h` should show /fsx as mounted. | Contact AWS. The provisioning parameters ensure /fsx is mounted, so if it is not, the root cause needs to be investigated. |
+| check_if_pyxis_installed | Pyxis is a container plugin for Slurm. It is installed by default by the life-cycle scripts when the cluster is provisioned. | Run the life-cycle scripts manually:
`cd /tmp/sagemaker-lifecycle-* && cd src/utils/ && srun -N <number-of-nodes> bash install_enroot_pyxis.sh` |
+| check_slurmd_service_status | Check if `slurmd` is running on all compute instances. | Slurm can fail because of an underlying error. If this check fails, SSH into the affected host and run `sudo systemctl status slurmd` to find the reason, then restart the service with `sudo systemctl start slurmd`. If it fails again, check `sudo journalctl -xe` to see what went wrong (see the slurmd sketch below the table). |
+| check_if_user_directory_on_fsx | This checks that user home directories are on the shared /fsx file system. | The multi-user setup creates user directories on /fsx; follow [these steps](https://catalog.workshops.aws/sagemaker-hyperpod/en-US/04-advanced/01-multi-user).
If the user directory is missing on nodes that have been replaced, run a variant of this command for your nodes:
`srun -N 2 usermod -d /fsx/ubuntu ubuntu`
(replace `ubuntu` with the username; see the home-directory sketch below the table). |
+| nvidia_cli_installed | The NVIDIA Container CLI is installed by the Docker life-cycle scripts, so it is unlikely this will be an issue. | Go to [this page](https://catalog.workshops.aws/sagemaker-hyperpod/en-US/03-megatron-lm/01-pre-process) and look for the command that runs the nvidia-container-cli installation.
Create a script from those steps and use either sbatch or srun to execute it across all compute nodes. |
+
+You can also run validation on the scripts you wish to run. This ensures you’re not using unsupported operations in the script.
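+
+If several of the Docker/Enroot checks fail at once, a minimal sketch like the one below can re-run the relevant life-cycle script on every node in one go. It assumes the life-cycle bundle was unpacked under `/tmp/sagemaker-lifecycle-*` on each node (as in the table above) and that `sinfo` reports your compute nodes in its first partition line; adjust the script name and node count for your cluster.
+
+```bash
+#!/bin/bash
+# Sketch: re-run a life-cycle script (for example install_docker.sh) on all compute nodes.
+set -euo pipefail
+
+SCRIPT=${1:-install_docker.sh}              # or install_enroot_pyxis.sh
+NUM_NODES=$(sinfo -h -o "%D" | head -n 1)   # node count reported by Slurm for the first partition
+
+srun -N "${NUM_NODES}" bash -c "cd /tmp/sagemaker-lifecycle-*/src/utils && bash ${SCRIPT}"
+```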
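+
+For `check_slurmd_service_status`, the sketch below walks through the same steps as the table for a single affected node: inspect the service, restart it, and pull recent logs if it keeps failing. It assumes you can SSH from the head node to the compute node by the hostname reported by the failed check.
+
+```bash
+#!/bin/bash
+# Sketch: inspect and restart slurmd on one affected compute node.
+set -euo pipefail
+
+HOST=$1   # hostname of the node reported by the failed check
+
+ssh "${HOST}" "sudo systemctl status slurmd --no-pager" || true        # find the reason first
+ssh "${HOST}" "sudo systemctl start slurmd"                            # then restart the service
+ssh "${HOST}" "sudo journalctl -u slurmd --no-pager | tail -n 50"      # recent slurmd logs if it fails again
+```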
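+
+For `check_if_user_directory_on_fsx` on replaced nodes, the `usermod` command from the table can be targeted at just those nodes. The user name and node list below are placeholder assumptions, and `sudo` is added because `usermod` requires root; substitute your own values.
+
+```bash
+#!/bin/bash
+# Sketch: point a user's home directory back at the shared /fsx file system
+# on replaced nodes only (targeted with --nodelist).
+set -euo pipefail
+
+USERNAME=${1:-ubuntu}                      # replace with the affected user
+NODELIST=${2:-ip-10-1-1-1,ip-10-1-1-2}     # replace with the replaced nodes
+
+NUM_NODES=$(awk -F',' '{print NF}' <<< "${NODELIST}")   # number of nodes in the list
+srun -N "${NUM_NODES}" --nodelist="${NODELIST}" sudo usermod -d "/fsx/${USERNAME}" "${USERNAME}"
+```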