Commit

Documentation updates: Windows instructions, troubleshooting steps, moving to CloudShell from Cloud9, IDE integration steps
ivan-khvostishkov committed Sep 12, 2024
1 parent 61bd4af commit 7c0cf12
Showing 5 changed files with 91 additions and 20 deletions.
38 changes: 34 additions & 4 deletions FAQ.md
@@ -45,10 +45,10 @@ We often see a lot of questions that surface repeatedly. This repository is an a
The solution was primarily designed for developers who are using Linux and macOS.

Basic scenarios, which require only SSM without SSH, work on Windows without
any additional configuration.
any additional configuration, i.e., you only need to install the library with pip.

To be able to connect from your local machine with SSH and start port forwarding with the script `sm-ssh`, please consider that
you need Bash interpreter and Python to execute them. They don't work in PowerShell or in the default Command Prompt.
you need a Bash interpreter and Python to execute it. It doesn't work in PowerShell or in the default Command Prompt, which have no Bash.

However, it's also possible to make it work on Windows, with some limitations on use from IDEs that use the Command Prompt.

@@ -107,6 +107,18 @@ export AWS_DEFAULT_REGION=eu-west-1
sm-ssh list
```

9. When configuring the remote interpreter in your IDE on Windows, you cannot use `ssh fqdn` directly, because SSH needs to call Bash somehow.

But there's a trick (A). Inside GitBash, run `sm-ssh connect` and it will forward the remote SSH port to port `10022` on `localhost`.

Alternatively (B), configure [~/.ssh/config](README.md#sshconfig) inside GitBash and forward the port manually:

```bash
ssh -L localhost:10022:localhost:22 fqdn
```

Now use `localhost:10022` in your IDE to connect to the remote interpreter. When the IDE asks for the private key, use `~/.ssh/fqdn` for option (A) or `~/.ssh/sagemaker-ssh-gw` for option (B).
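
Before pointing the IDE at `localhost:10022`, it can help to confirm that the tunnel is actually listening. The following is a minimal sketch, not part of SSH Helper: the helper name `is_port_open` is hypothetical, and it only checks that something accepts TCP connections on the port, not that it is the SageMaker SSH endpoint.

```python
import socket


def is_port_open(host: str = "localhost", port: int = 10022, timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper for illustration; the defaults mirror the
    `localhost:10022` forwarding described above.
    """
    try:
        # create_connection raises OSError on refusal, timeout or DNS failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if is_port_open():
        print("Port 10022 is open, the IDE can connect to localhost:10022")
    else:
        print("Nothing is listening on localhost:10022, start the tunnel first")
```

If the check fails, the `sm-ssh connect` or `ssh -L` session in GitBash has probably exited or has not finished connecting yet.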

### Are SageMaker notebook instances supported?

Yes, the setup is similar to SageMaker Studio. Run [SageMaker_SSH_Notebook.ipynb](SageMaker_SSH_Notebook.ipynb) on the notebook instance and then run `sm-ssh connect <<notebook-instance-name>>.notebook.sagemaker` from your local machine.
@@ -187,7 +199,7 @@ During the container build, execute `sm-setup-ssh configure` and `sm-ssh-ide con

See the examples of such containers [byoc/Dockerfile.internet_free](https://github.com/aws-samples/sagemaker-ssh-helper/blob/main/tests/byoc/Dockerfile.internet_free) and [byoi_studio/Dockerfile.internet_free](https://github.com/aws-samples/sagemaker-ssh-helper/blob/main/tests/byoi_studio/Dockerfile.internet_free) in the tests.

You will also need to configure AWS PrivateLink for [Session Manager endpoints](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-getting-started-privatelink.html) and for [STS endpoints](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_sts_vpce.html).
You will also need to configure AWS PrivateLink for [Session Manager endpoints](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-getting-started-privatelink.html) and for [STS endpoints](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_sts_vpce.html), in addition to your already existing endpoints for SageMaker and S3.

*Note:* If you are using the [Network Isolation](https://docs.aws.amazon.com/sagemaker/latest/dg/mkt-algo-model-internet-free.html) mode, i.e., set the `enable_network_isolation` parameter of the `Estimator` to `True`, you won't be able to connect to your containers, because they will have no access to AWS Systems Manager.

@@ -577,6 +589,11 @@ Below are the generic tips to start with:
* **Important:** Make sure you fully read and understood the "Getting started" section and didn't skip the steps from [Setting up your AWS account with IAM and SSM configuration](IAM_SSM_Setup.md).
* Find all instances of SSH Helper installation. They might conflict with each other if both are in the system `PATH`. Switch into each Python environment and uninstall old versions with `pip uninstall`:
```bash
find / -name 'sagemaker_ssh_helper' 2>/dev/null
```
* Check that the managed instance in AWS Console in Systems Manager -> Fleet Manager section appears as "Online". Check that you're able to connect to the node from the Console by selecting Node actions -> Start terminal session.
If instance is "Offline", you might see this error message when calling an `sm-ssh connect` command:
@@ -595,7 +612,7 @@ An error occurred (InvalidInstanceId) when calling the SendCommand operation: In
* Turn on Session Manager [logging](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-logging.html) and inspect the session logs.
* Try `sm-ssh list` to see if instance is `Online` or offline (will be marked with `-`). Pay attention to what the output says about the AWS region that you connect to.
* Try `sm-ssh list` to see if instance is `Online` or offline (will be marked with `ConnectionLost` or `ssh:NotFound`). Pay attention to what the output says about the AWS region that you connect to.
* If you have issues with SSH, but you can connect successfully from AWS Console, make sure you can run both of the SSM commands below successfully on your local machine:
@@ -605,6 +622,8 @@ aws ssm start-session --target mi-01234567890abcdef \
--document-name AWS-StartSSHSession --parameters portNumber=22
```
* Use `ssh -v` for additional log output
* (SageMaker Studio) Check SSM agent logs. From the image terminal run:
```text
tail /var/log/amazon/ssm/*.log && date
@@ -626,6 +645,17 @@ Check carefully the notebook output in SageMaker Studio to see if there are any
* (SageMaker Studio) Try to re-initialize the instance by restarting the notebook: Kernel -> Restart Kernel and Run All Cells.
* (PyCharm) Check the IDE log:
```bash
tail -f ~/Library/Logs/JetBrains/PyCharm2024.1/idea.log
```
* Enable Session Manager session logs as described [in the AWS Systems Manager documentation](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-logging.html). You might need to create a new CloudWatch log group, e.g. `/ssm/logs`, and/or an S3 bucket like `ssm-logs-555555555555`. Note that according to the documentation, *"Logging isn't available for Session Manager sessions that connect through port forwarding or SSH"*, so it will only help you when you connect directly to the `mi-*` instance with [AWS CLI or AWS Console](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-sessions-start.html).
* Set locally the environment variable `SM_SSH_DEBUG=true` and check the file `/tmp/sm-ssh-debug.log`
* Check that the remote host is not overloaded with tasks and has enough memory to execute SSM and SSH commands, e.g., by running `top` from SageMaker Studio image terminal.
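
The tip above about conflicting installations on the system `PATH` can be checked with a short script. This is a sketch under assumptions: `find_on_path` is a hypothetical helper (not part of SSH Helper), and it only looks for an executable file named `sm-ssh` in each `PATH` directory, in lookup order.

```python
import os
from pathlib import Path
from typing import List, Optional


def find_on_path(name: str = "sm-ssh", path: Optional[str] = None) -> List[str]:
    """List every executable called `name` found in the PATH directories.

    The first entry is the one the shell would run; any further entries
    are candidates for a conflicting installation.
    """
    if path is None:
        path = os.environ.get("PATH", "")
    hits: List[str] = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        candidate = Path(directory) / name
        if candidate.is_file() and os.access(candidate, os.X_OK):
            resolved = str(candidate.resolve())
            if resolved not in hits:  # the same file can be reachable twice
                hits.append(resolved)
    return hits


if __name__ == "__main__":
    for index, location in enumerate(find_on_path()):
        marker = "active" if index == 0 else "shadowed"
        print(f"{marker}: {location}")
```

More than one line in the output suggests that an older copy may still be installed in another Python environment.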
### I’m getting an API throttling error in the logs
7 changes: 3 additions & 4 deletions IAM_SSM_Setup.md
@@ -12,7 +12,7 @@ SageMaker SSH Helper relies on the AWS Systems Manager service to create SSH tun

### Automated setup with CDK and Cloud9

a. Create the [Cloud9](https://docs.aws.amazon.com/cloud9/latest/user-guide/create-environment-main.html) environment. Alternatively, you can run the commands in your local terminal. In this case, make sure you've installed Node.js and CDK and fulfilled [all other CDK prerequisites](https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html#getting_started_prerequisites). In both cases you need to have an admin role.
a. From the AWS Console, open the [CloudShell](https://aws.amazon.com/cloudshell/) environment. Alternatively, you can run the commands in your local terminal. In this case, make sure you've installed Node.js and CDK and fulfilled [all other CDK prerequisites](https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html#getting_started_prerequisites). In both cases you need to have an admin role.

b. Define your SageMaker role, local user role, AWS account ID and AWS Region as variables by executing the following commands in the terminal line by line:

@@ -52,6 +52,7 @@ Local variables `SAGEMAKER_ROLE_ARN` and `USER_ROLE_ARN` are passed as parameter
c. To enable SageMaker SSH Helper in additional AWS Regions, run these commands per region (adjust `REGION` variable each time):

```shell
ACCOUNT_ID=
REGION=
```

@@ -60,9 +61,7 @@ cdk bootstrap aws://"$ACCOUNT_ID"/"$REGION"

APP="python -m sagemaker_ssh_helper.cdk.advanced_tier_app"

AWS_REGION="$REGION" cdk -a "$APP" deploy SSM-Advanced-Tier-Stack \
-c sagemaker_role="$SAGEMAKER_ROLE_ARN" \
-c user_role="$USER_ROLE_ARN"
AWS_REGION="$REGION" cdk -a "$APP" deploy SSM-Advanced-Tier-Stack
```

*Note:* If you will run the jobs from SageMaker Studio instead of your local machine, specify `USER_ROLE_ARN` the same as `SAGEMAKER_ROLE_ARN`.
58 changes: 47 additions & 11 deletions README.md
@@ -10,7 +10,7 @@ remote debugging, and advanced troubleshooting.

The three most common tasks that motivated the creation of the library, sometimes referred to as "SSH into SageMaker", are:
1. A terminal session into a container running in SageMaker to diagnose a stuck training job, use CLI commands
like nvidia-smi, or iteratively fix and re-execute your training script within seconds.
like nvidia-smi and neuron-ls, or iteratively fix and re-execute your training script within seconds.
2. Remote debugging of a code running in SageMaker from your local favorite IDE like
PyCharm Professional Edition or Visual Studio Code.
3. Port forwarding to access auxiliary tools running inside SageMaker, e.g., Dask dashboard, Streamlit apps, TensorBoard or Spark Web UI.
@@ -90,7 +90,9 @@ Install the latest stable version of library from the [PyPI repository](https://
```shell
pip install sagemaker-ssh-helper
```
**Caution:** It's always recommended to install the library into a Python venv, not into the system env.
**Caution:** It's always recommended to install the library into a Python venv, not into the system env. If you later want to use the SSH plugins of your IDE, which use the system env and system Python, you should add the venv to the system PATH, as described in the section [Remote code execution with PyCharm / VSCode over SSH](#remote-interpreter).

If you're working on Windows, see [FAQ](FAQ.md#is-windows-supported).

### Step 2: Modify your start training job code
1. Add import for `SSHEstimatorWrapper`
@@ -499,7 +501,7 @@ This low-level script takes the managed instance ID as a parameter. Next section
The syntax for the SSH Helper CLI command `sm-ssh` is the following:

```bash
sm-ssh [-h] [-v] {list,start-proxy,connect} [fqdn]
sm-ssh [-h] [-v] {list,start-proxy,connect} [fqdn] [extra-connect-args]*
```

where `fqdn` is the resource name with `.sagemaker` suffix, respectively:
@@ -529,14 +531,21 @@ sm-ssh list sagemaker

– will list all resources of all types.

The instances with SSH Helper will be marked `Online` while other instances will be marked with `-`.
The instances with SSH Helper will be marked `Online` or `ConnectionLost`, while the instances not registered with SSM will be marked with `ssh:NotFound`.

The `connect` command starts interactive SSH session into container, e.g.:

```bash
sm-ssh connect ssh-training-example-2023-07-25-03-18-04-490.training.sagemaker
```

It's possible to pass additional arguments and forward ports together with the `connect` command, e.g., to forward [SSH Agent](https://linux.die.net/man/1/ssh-agent) and Streamlit web app port:

```bash
ssh-add
sm-ssh connect ssh-training-example-2023-07-25-03-18-04-490.training.sagemaker -A -L 8501:localhost:8501
```
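
When the same connection is started from a script, the command line above can be assembled programmatically. The sketch below is illustrative only: `build_connect_command` is a hypothetical helper, and the suffix check simply reflects the statement that `fqdn` ends in `.sagemaker`. It follows the `sm-ssh connect [fqdn] [extra-connect-args]*` syntax shown earlier and does not run anything by itself.

```python
import shlex
from typing import List, Sequence


def build_connect_command(fqdn: str, extra_args: Sequence[str] = ()) -> List[str]:
    """Build the argument list for `sm-ssh connect` with extra SSH options."""
    if not fqdn.endswith(".sagemaker"):
        raise ValueError(f"expected a name with the .sagemaker suffix, got {fqdn!r}")
    # Extra arguments (e.g. -A, -L) are appended after the fqdn unchanged
    return ["sm-ssh", "connect", fqdn, *extra_args]


command = build_connect_command(
    "ssh-training-example-2023-07-25-03-18-04-490.training.sagemaker",
    ["-A", "-L", "8501:localhost:8501"],
)
# prints: sm-ssh connect ssh-training-example-2023-07-25-03-18-04-490.training.sagemaker -A -L 8501:localhost:8501
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command)` on a machine where `sm-ssh` is on the `PATH`; keeping it as a list avoids shell quoting issues.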

#### ~/.ssh/config

Alternatively, instead of using `sm-ssh connect` command, you can use the native `ssh` command, but it will require you to update your [ssh config](https://linux.die.net/man/5/ssh_config), typically `~/.ssh/config`, with `sm-ssh start-proxy` command as follows:
@@ -577,7 +586,14 @@ Follow the steps in the next section for the IDE configuration, to prepare the `
sm-local-configure
```

**Caution**: If you plan to use `sm-ssh` tool from the IDE, which you run inside your system Python env, you should install SSH Helper into your system Python env, too.
**Caution**: You will use SSH plugins from the IDE running inside your system env with system Python, therefore you should add SSH Helper into your system PATH, e.g., on macOS:
```bash
sudo bash -c "echo '/Users/janedoe/PycharmProjects/sagemaker-ssh-helper-dev-venv/bin' > /etc/paths.d/42-sm-ssh"
```

You might need to restart the Terminal and the IDE for the changes to take effect.

Alternatively, use the trick with port forwarding - start the `sm-ssh` or `ssh` with `-L` option inside venv, and then use `localhost` as the host to connect to from IDE. This trick is used to make SSH Helper work on Windows, and it's described in [FAQ - Is Windows Supported?](FAQ.md#is-windows-supported).

2. Submit your code to SageMaker with SSH Helper as described in previous sections, e.g. as a [training job](#step-1-install-the-library).

@@ -589,13 +605,19 @@ Instead of using SSM to connect to the container from command line, proceed to t

Make sure you've configured your ssh config as mentioned in the [~/.ssh/config](#sshconfig) section and your IDE can access `sm-ssh` command from the system env.

If you connect to your host for the first time, check that `ssh` command is working from CLI:

```bash
ssh ssh-training-manual-2023-10-02-14-38-56-744.training.sagemaker
```

A. Follow the [instructions in the PyCharm docs](https://www.jetbrains.com/help/pycharm/remote-debugging-with-product.html#remote-interpreter), to configure the remote interpreter in PyCharm.

In the field for host name, put the same value as for `fqdn` in the [`sm-ssh` command](#sm-ssh), e.g., `ssh-training-manual-2023-10-02-14-38-56-744.training.sagemaker`, and use `root` as the username.

![](images/pycharm_training.png)

When PyCharm asks for the SSH key, point to the `~/.ssh/<fqdn>` private key file that was automatically generated for you by SSH Helper:
If PyCharm asks for the SSH key, point to the `~/.ssh/<fqdn>` private key file that was automatically generated for you by SSH Helper:

![](images/pycharm_training_ssh.png)

@@ -614,11 +636,15 @@ Put the `root@fqdn` as the hostname to connect to, e.g., `root@ssh-training-exam

![](images/vscode_training.png)

> **NOTE:** The **Remote SSH** extension described in the above instructions is only for the [Visual Studio Code native app](https://code.visualstudio.com/). Code Editor in SageMaker Studio and web apps based on [Code Server](https://github.com/coder/code-server) that use extensions from [Open VSX Registry](https://open-vsx.org/) might look and work differently. SageMaker SSH Helper **DOES NOT** support browser-based implementations and haven't been tested with any of Open VSX extensions. If you prefer to use the browser for development, take a look at the [Web VNC](#web-vnc) option.
> **NOTE:** The **Remote SSH** extension described in the above instructions is only for the [Visual Studio Code native app](https://code.visualstudio.com/). Code Editor in SageMaker Studio and other web apps based on [Code - OSS](https://github.com/microsoft/vscode#visual-studio-code---open-source-code---oss) such as [Code Server](https://github.com/coder/code-server) that use extensions from [Open VSX Registry](https://open-vsx.org/) might look and work differently from the native app that has Microsoft-specific customizations. SageMaker SSH Helper **DOES NOT** support browser-based implementations of VS Code and hasn't been tested with any Open VSX extensions. If you prefer to use the browser for development, take a look at the [Web VNC](#web-vnc) option.
There are a few extension options that you might want to change for VS Code to work properly with SageMaker containers:

You might also need to increase "Remote.SSH: Connect Timeout" option to `90` in VS Code. See [the StackOverflow post](https://stackoverflow.com/questions/59978826/why-ssh-connection-timed-out-in-vscode) for details.
* You might need to increase "Remote.SSH: Connect Timeout" option to `120` in VS Code. See [the StackOverflow post](https://stackoverflow.com/questions/59978826/why-ssh-connection-timed-out-in-vscode) for details.

If you see the error `tar: code: Cannot change ownership to uid 1000, gid 1000: Operation not permitted` when connecting, then try to set "Remote.SSH: Use Exec server" to `false`, as mentioned in [#58 - vscode connect fails](https://github.com/aws-samples/sagemaker-ssh-helper/issues/58).
* If you see the error `tar: code: Cannot change ownership to uid 1000, gid 1000: Operation not permitted` when connecting, then try to set "Remote.SSH: Use Exec server" to `false`, as mentioned in [#58 - vscode connect fails](https://github.com/aws-samples/sagemaker-ssh-helper/issues/58).

* You might also need to set "Remote.SSH: Use Local Server" to `false` and "Remote.SSH: Lockfiles In Tmp" to `true`, if you still have connection problems.

4. Connect to the instance and stop the waiting loop

@@ -735,15 +761,23 @@ For your local IDE integration with SageMaker Studio, follow the same steps as f

1. Copy [SageMaker_SSH_IDE.ipynb](SageMaker_SSH_IDE.ipynb) into SageMaker Studio and run it.

Note that the `main` branch of this repo can contain changes that are not compatible with the version of `sagemaker-ssh-helper` that you installed from pip.

To be completely sure that you're using the version of the notebook that corresponds to the installed library, take a copy of the notebook from your filesystem after you install SSH Helper package, e.g.:

```bash
cp /opt/conda/sm_ssh/SageMaker_SSH_IDE.ipynb /root/
```

You can also check the version with `pip freeze | grep sagemaker-ssh-helper` and take the notebook from [the corresponding release tag](https://github.com/aws-samples/sagemaker-ssh-helper/tags).
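
The same version check can be done from Python, for example from a notebook cell. This is a minimal sketch using the standard `importlib.metadata` module; the helper name `installed_version` is hypothetical, and how the returned version maps onto a release tag name is left for you to verify on the tags page.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "sagemaker-ssh-helper") -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("sagemaker-ssh-helper is not installed in this environment")
    else:
        print(f"sagemaker-ssh-helper {found} is installed, use the notebook from the matching tag")
```

Run it with the same interpreter (kernel or venv) that you installed the package into, otherwise it reports on a different environment.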

Alternatively, [attach](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-lcc-create.html) to a domain the KernelGateway lifecycle config script [kernel-lc-config.sh](kernel-lc-config.sh)
(you may need to ask your administrator to do this).
Once configured, from the Launcher choose the environment, pick up the lifecycle script and choose
'Open image terminal' (so, you don't even need to create a notebook).

You might want to change the `LOCAL_USER_ID` variable upon the first run, to prevent users from impersonating each other. For more details see the FAQ on [How SageMaker SSH Helper protects users from impersonating each other?](FAQ.md#how-sagemaker-ssh-helper-protects-users-from-impersonating-each-other).

> Note that the `main` branch of this repo can contain changes that are not compatible with the version of `sagemaker-ssh-helper` that you installed from pip. To ensure the stable performance, check the version with `pip freeze | grep sagemaker-ssh-helper` and take the notebook and the lifecycle script from [the corresponding tag](https://github.com/aws-samples/sagemaker-ssh-helper/tags).
2. Configure remote interpreter in PyCharm / VS Code to connect to SageMaker Studio

Use `app_name.user_profile_name.domain_id.studio.sagemaker` or `app_name.studio.sagemaker` as the `fqdn` to connect.
@@ -754,6 +788,8 @@ To see available apps to connect to, you may run the `list` command:
sm-ssh list studio.sagemaker
```
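
The two `fqdn` forms for SageMaker Studio can be told apart with a small parser. This is only a sketch under assumptions: `parse_studio_fqdn` and `StudioTarget` are hypothetical names, and it assumes that the app name, user profile name and domain ID contain no dots. The actual parsing inside `sm-ssh` may differ.

```python
from typing import NamedTuple, Optional

STUDIO_SUFFIX = ".studio.sagemaker"


class StudioTarget(NamedTuple):
    app_name: str
    user_profile_name: Optional[str]
    domain_id: Optional[str]


def parse_studio_fqdn(fqdn: str) -> StudioTarget:
    """Split a Studio fqdn into its components (assumes no dots inside names)."""
    if not fqdn.endswith(STUDIO_SUFFIX):
        raise ValueError(f"not a Studio fqdn: {fqdn!r}")
    parts = fqdn[: -len(STUDIO_SUFFIX)].split(".")
    if len(parts) == 1 and parts[0]:
        # Short form: app_name.studio.sagemaker
        return StudioTarget(parts[0], None, None)
    if len(parts) == 3 and all(parts):
        # Long form: app_name.user_profile_name.domain_id.studio.sagemaker
        return StudioTarget(parts[0], parts[1], parts[2])
    raise ValueError(f"unexpected number of components in {fqdn!r}")


print(parse_studio_fqdn("my-app.studio.sagemaker"))
print(parse_studio_fqdn("my-app.jane-doe.d-abc123example.studio.sagemaker"))
```

The example names `my-app`, `jane-doe` and `d-abc123example` are placeholders, not values taken from a real domain.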

*Note:* If you're using Windows, see [the FAQ](FAQ.md#is-windows-supported).

3. Using the remote Jupyter Notebook

In recent versions of PyCharm, Jupyter Notebook is tunnelled automatically through remote interpreter connection. You might need to add `--allow-root` argument to the command line, when your remote interpreter runs under root:
5 changes: 4 additions & 1 deletion sagemaker_ssh_helper/cdk/iam_ssm/iam_ssm_stack.py
@@ -62,7 +62,10 @@ def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
actions=[
"ssm:StartSession",
],
resources=[f"arn:{Aws.PARTITION}:ssm:*::document/AWS-StartSSHSession"]
resources=[
f"arn:{Aws.PARTITION}:ssm:*::document/AWS-StartSSHSession",
f"arn:{Aws.PARTITION}:ssm:*:{Aws.ACCOUNT_ID}:document/SSM-SessionManagerRunShell"
]
),
PolicyStatement(
effect=Effect.ALLOW,