
Commit

Removed a sentence that referred to the next command
DarkSector authored and sean-smith committed Mar 1, 2024
1 parent 1a181af commit 39ca357
Showing 1 changed file with 10 additions and 10 deletions.
1.architectures/5.sagemaker-hyperpod/README.md (20 changes: 10 additions & 10 deletions)
@@ -325,16 +325,16 @@ srun -N 8 python3 hyperpod-precheck.py

Follow the mitigations listed in this table if one of the checks fails:

| Test | Description | Failure mitigation |
|--------------------------------|------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| check_if_docker_installed | The life-cycle scripts install Docker on all nodes.<br/>This check verifies that Docker is available on every compute node. | Run the life-cycle scripts manually:<br/>`cd /tmp/sagemaker-lifecycle-* && cd src/utils/ && srun -N <no of nodes> bash install_docker.sh`<br/>(see the helper sketch below the table) |
| check_enroot_runtime_path | Verifies that `ENROOT_RUNTIME_PATH` points to the right directory. | Follow [these steps](https://catalog.workshops.aws/sagemaker-hyperpod/en-US/01-cluster/08-docker-setup#enroot) in the workshop (Cluster Setup > Docker Setup > Enroot). |
| check_docker_data_root | Verifies that the Docker data root is `/opt/dlami/nvme/data-root`. | Run the life-cycle scripts manually:<br/>`cd /tmp/sagemaker-lifecycle-* && cd src/utils/ && srun -N <no of nodes> bash install_docker.sh` |
| check_if_fsx_mounted | `df -h` should show `/fsx` as mounted. | Speak to AWS; the provisioning parameters are set up to include this mount, so if `/fsx` is not mounted the issue needs to be investigated. |
| check_if_pyxis_installed | Pyxis is a container plugin for Slurm. It should be installed by default by the life-cycle scripts when the cluster is provisioned. | Run the life-cycle scripts manually:<br/>`cd /tmp/sagemaker-lifecycle-* && cd src/utils/ && srun -N <no of nodes> bash install_enroot_pyxis.sh` |
| check_slurmd_service_status | Checks that `slurmd` is running on all compute instances. | Slurm can fail due to an underlying error. If this check fails, SSH into the affected host and run `sudo systemctl status slurmd` to find the reason, then restart the service with `sudo systemctl start slurmd`. If it fails again, check `sudo journalctl -xe` to see what went wrong (see the troubleshooting sketch below the table). |
| check_if_user_directory_on_fsx | Checks that user directories are on the shared `/fsx` file system mount. | The multi-user setup creates `/fsx/<user>` directories; follow [the steps here](https://catalog.workshops.aws/sagemaker-hyperpod/en-US/04-advanced/01-multi-user).<br/>If the user directory is missing on nodes that have been replaced, run a variant of this command for your nodes:<br/>`srun -N 2 usermod -d /fsx/ubuntu ubuntu`<br/>(replace `ubuntu` with the username) |
| nvidia_cli_installed | The NVIDIA Container CLI is installed by the Docker life-cycle scripts, so it is unlikely this will be an issue. | Go to [this page](https://catalog.workshops.aws/sagemaker-hyperpod/en-US/03-megatron-lm/01-pre-process) and look for the command that runs the `nvidia-container-cli` installation.<br/>Create a script from those steps and use `sbatch` or `srun` to execute it across all compute nodes. |
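
Several of the mitigations above amount to re-running the Docker and Enroot/Pyxis life-cycle installers on every compute node. The helper below is a minimal sketch of how those steps could be wrapped; it is not part of the repository, the node-count argument and the single `/tmp/sagemaker-lifecycle-*` directory are assumptions, and you may need to adjust the path to `hyperpod-precheck.py`.

```bash
#!/bin/bash
# Hypothetical wrapper (not from the repository): re-run the Docker and
# Enroot/Pyxis life-cycle installers across all compute nodes after a
# failed precheck, then re-run the precheck itself.
set -euo pipefail

NUM_NODES=${1:?usage: $0 <number of compute nodes>}

# Assumes a single unpacked life-cycle directory; pick the first match.
LIFECYCLE_DIR=$(ls -d /tmp/sagemaker-lifecycle-* | head -n 1)
cd "${LIFECYCLE_DIR}/src/utils"

# Addresses check_if_docker_installed and check_docker_data_root.
srun -N "${NUM_NODES}" bash install_docker.sh

# Addresses check_if_pyxis_installed.
srun -N "${NUM_NODES}" bash install_enroot_pyxis.sh

# Confirm the fixes; adjust the path to hyperpod-precheck.py if needed.
srun -N "${NUM_NODES}" python3 hyperpod-precheck.py
```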
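
If `check_slurmd_service_status` fails, the per-host commands from the table can be collected into one pass. This is a minimal sketch, assuming you can SSH to the unhealthy node; the hostname argument is a placeholder.

```bash
#!/bin/bash
# Hypothetical troubleshooting sketch (not from the repository): inspect and
# start slurmd on one unhealthy compute node, then dump recent logs if the
# service still is not running.
set -uo pipefail

NODE=${1:?usage: $0 <compute-node-hostname>}   # placeholder hostname argument

ssh "${NODE}" '
  sudo systemctl status slurmd --no-pager || true
  sudo systemctl start slurmd
  # If slurmd still is not active, show the recent journal for clues.
  if ! systemctl is-active --quiet slurmd; then
    sudo journalctl -xe --no-pager | tail -n 100
  fi
'
```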


You can also run validation on the scripts you wish to run; this ensures you're not using unsupported operations in them.
