
Commit

Removed a sentence that referred to the next command
DarkSector authored and sean-smith committed Mar 1, 2024
1 parent 1a181af commit 39ca357
Showing 1 changed file with 10 additions and 10 deletions.
1.architectures/5.sagemaker-hyperpod/README.md (20 changes: 10 additions & 10 deletions)
@@ -325,16 +325,16 @@ srun -N 8 python3 hyperpod-precheck.py

Follow the mitigations listed in this table if one of the checks fails:

| Test | Description | Failure mitigation |
|--------------------------------|------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| check_if_docker_installed | The life-cycle scripts install Docker on all nodes.<br/>This check verifies that Docker is available on every compute node. | Run the life-cycle scripts manually:<br/>`cd /tmp/sagemaker-lifecycle-* && cd src/utils/ && srun -N <no of nodes> bash install_docker.sh`<br/>(see the helper sketch below the table) |
| check_enroot_runtime_path | Verifies that `ENROOT_RUNTIME_PATH` points to the right directory. | Follow [these steps](https://catalog.workshops.aws/sagemaker-hyperpod/en-US/01-cluster/08-docker-setup#enroot) in the workshop (Cluster Setup > Docker Setup > Enroot). |
| check_docker_data_root | Verifies that the Docker data root is `/opt/dlami/nvme/data-root`. | Run the life-cycle scripts manually:<br/>`cd /tmp/sagemaker-lifecycle-* && cd src/utils/ && srun -N <no of nodes> bash install_docker.sh` |
| check_if_fsx_mounted | `df -h` should show `/fsx` as mounted. | Speak to AWS; the provisioning parameters are set up to include this mount, so if `/fsx` is not mounted the issue needs to be investigated. |
| check_if_pyxis_installed | Pyxis is a container plugin for Slurm. It should be installed by default by the life-cycle scripts when the cluster is provisioned. | Run the life-cycle scripts manually:<br/>`cd /tmp/sagemaker-lifecycle-* && cd src/utils/ && srun -N <no of nodes> bash install_enroot_pyxis.sh` |
| check_slurmd_service_status | Checks that `slurmd` is running on all compute instances. | Slurm can fail due to an underlying error. If this check fails, SSH into the affected host and run `sudo systemctl status slurmd` to find the reason, then restart the service with `sudo systemctl start slurmd`. If it fails again, check `sudo journalctl -xe` to see what went wrong (see the troubleshooting sketch below the table). |
| check_if_user_directory_on_fsx | Checks that user directories are on the shared `/fsx` file system mount. | The multi-user setup creates `/fsx/<user>` directories; follow [the steps here](https://catalog.workshops.aws/sagemaker-hyperpod/en-US/04-advanced/01-multi-user).<br/>If the user directory is missing on nodes that have been replaced, run a variant of this command for your nodes:<br/>`srun -N 2 usermod -d /fsx/ubuntu ubuntu`<br/>(replace `ubuntu` with the username) |
| nvidia_cli_installed | The NVIDIA Container CLI is installed by the Docker life-cycle scripts, so it is unlikely this will be an issue. | Go to [this page](https://catalog.workshops.aws/sagemaker-hyperpod/en-US/03-megatron-lm/01-pre-process) and look for the command that runs the `nvidia-container-cli` installation.<br/>Create a script from those steps and use `sbatch` or `srun` to execute it across all compute nodes. |
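
Several of the mitigations above amount to re-running the Docker and Enroot/Pyxis life-cycle installers on every compute node. The helper below is a minimal sketch of how those steps could be wrapped; it is not part of the repository, the node-count argument and the single `/tmp/sagemaker-lifecycle-*` directory are assumptions, and you may need to adjust the path to `hyperpod-precheck.py`.

```bash
#!/bin/bash
# Hypothetical wrapper (not from the repository): re-run the Docker and
# Enroot/Pyxis life-cycle installers across all compute nodes after a
# failed precheck, then re-run the precheck itself.
set -euo pipefail

NUM_NODES=${1:?usage: $0 <number of compute nodes>}

# Assumes a single unpacked life-cycle directory; pick the first match.
LIFECYCLE_DIR=$(ls -d /tmp/sagemaker-lifecycle-* | head -n 1)
cd "${LIFECYCLE_DIR}/src/utils"

# Addresses check_if_docker_installed and check_docker_data_root.
srun -N "${NUM_NODES}" bash install_docker.sh

# Addresses check_if_pyxis_installed.
srun -N "${NUM_NODES}" bash install_enroot_pyxis.sh

# Confirm the fixes; adjust the path to hyperpod-precheck.py if needed.
srun -N "${NUM_NODES}" python3 hyperpod-precheck.py
```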
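
If `check_slurmd_service_status` fails, the per-host commands from the table can be collected into one pass. This is a minimal sketch, assuming you can SSH to the unhealthy node; the hostname argument is a placeholder.

```bash
#!/bin/bash
# Hypothetical troubleshooting sketch (not from the repository): inspect and
# start slurmd on one unhealthy compute node, then dump recent logs if the
# service still is not running.
set -uo pipefail

NODE=${1:?usage: $0 <compute-node-hostname>}   # placeholder hostname argument

ssh "${NODE}" '
  sudo systemctl status slurmd --no-pager || true
  sudo systemctl start slurmd
  # If slurmd still is not active, show the recent journal for clues.
  if ! systemctl is-active --quiet slurmd; then
    sudo journalctl -xe --no-pager | tail -n 100
  fi
'
```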


You can also run validation on the scripts you wish to run; this ensures you're not using unsupported operations in them.
