Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Byo vpc mlstudio #894

Merged
merged 39 commits into from
Dec 8, 2023
Merged

Byo vpc mlstudio #894

merged 39 commits into from
Dec 8, 2023

Conversation

noah-paige
Copy link
Contributor

Feature or Bugfix

  • Feature

Detail

  • Enable SageMaker Studio Domain to be deployed in a already provisioned VPC

Relates

Security

Please answer the questions below briefly where applicable, or write N/A. Based on
OWASP 10.

  • Does this PR introduce or modify any input fields or queries - this includes
    fetching data from storage outside the application (e.g. a database, an S3 bucket)?
    • Is the input sanitized?
    • What precautions are you taking before deserializing the data you consume?
    • Is injection prevented by parametrizing queries?
    • Have you ensured no eval or similar functions are used?
  • Does this PR introduce any functionality or component that requires authorization?
    • How have you ensured it respects the existing AuthN/AuthZ mechanisms?
    • Are you logging failed auth attempts?
  • Are you using or adding any cryptographic features?
    • Do you use a standard proven implementations?
    • Are the used keys controlled by the customer? Where are they stored?
  • Are you introducing any new policies/roles/users?
    • Have you used the least-privilege principle? How?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@noah-paige noah-paige linked an issue Nov 29, 2023 that may be closed by this pull request
@noah-paige
Copy link
Contributor Author

Tested locally:
Backwards Compatibility

  • Migration Script Upgrade (populates sagemaker_studio_domain table)
  • Migration Script Downgrade and Re-Upgrade
  • Env with MLStudio Enabled can view Studio Domain Details in Environment-MLStudio Tab
  • Update Env Stack - Domain is not recreated
  • ML Studio User —> Still access Studio from UI

Functionality

  • Create Env and specify VPC and Subnets For ML Studio
    • VPC Not Found Error if VPC does not exist
    • Subnet Not Found Error if Subnet(s) do not exist
    • Subnet not in VPC Error if Subnet(s) a part of a different VPC
    • Create Successfully with appropriate VPC and Subnet specified
  • Env Admin Can View Studio Domain Details in Environment-MLStudio Tab (if ML Studio Enabled)
  • Create ML Studio User, Access Studio from UI, and Delete ML Studio User
  • Edit Environment and Disable ML Studio —> Env Updates and Domain Deleted
  • Edit Environment and Re-Enable ML Studio with no VPC specified —> Env uses default (if present) or creates its own VPC

To test the same as above in AWS Deployment...

@noah-paige
Copy link
Contributor Author

AWS TESTING
Backwards Compatibility

  • Migration Script Upgrade (populates sagemaker_studio_domain table)
  • Migration Script Downgrade and Re-Upgrade
  • Env with MLStudio Enabled can view Studio Domain Details in Environment-MLStudio Tab
  • Update Env Stack - Domain is not recreated (only KMS Alias is recreated to match name of domain)
  • ML Studio User —> Still access Studio from UI
  • Create ML Studio User, Access Studio from UI, and Delete ML Studio User
  • Edit Environment and Disable ML Studio —> Env Updates and Domain Deleted
  • Edit Environment and Re-Enable ML Studio with no VPC specified —> Env creates new vpc (default not present)

Functionality

  • Create Env and specify VPC and Subnets For ML Studio
    • VPC Not Found Error if VPC does not exist
    • Subnet Not Found Error if Subnet(s) do not exist
    • Subnet not in VPC Error if Subnet(s) a part of a different VPC
    • Specify VPC No Subnets —> Error - forced to specify Subnets if vpc id provided
    • Create Successfully with appropriate VPC and Subnet specified
  • Env Admin Can View Studio Domain Details in Environment-MLStudio Tab (if ML Studio Enabled)
  • Create ML Studio User, Access Studio from UI, and Delete ML Studio User
  • Edit Environment and Disable ML Studio —> Env Updates and Domain Deleted
  • Edit Environment and Re-Enable ML Studio with no VPC specified —> Env uses default (if present)

@noah-paige
Copy link
Contributor Author

Opening up this PR for review with the following questions/enhancements in mind:

Questions:

  1. Should only the Env Admin be able to see ML Studio Tab in Environment or should any group in environment be able to see it? (Currently, only Env Admin can see details on ML Studio Tab since they are the only group who can enable/disable mlstudio env feature)
  2. What is the best location to store ML Studio Domain APIs (create, get, delete)? Should this be a new MLStudio folder in services/graphql/...? (Currently ML Studio Domain APIs are in Environment folder since domains only created/deleted by enabling env feature flag and domain information only shown on env tab)

Enhancements:

  • How best to update the sagemaker studio domain record for the environment with the domain id that is auto-generated by CDK/Cloudformation after the env stack is done creating/updating for the domain?

I think the enhancement could be implemented at a later date so can think on the best approach but open to ideas

@noah-paige noah-paige marked this pull request as ready for review December 5, 2023 17:18
@noah-paige
Copy link
Contributor Author

I cleaned up a bit of the design for this PR so we are not calling 2 different APIs for create, update, and delete env

Using EnvironmentResource to register SagemakerStudioEnvironmentResource in ml studio module and call all the required operations on ml studio from there

Re-tested the same as above in AWS:
Functionality

  • Create Env and specify VPC and Subnets For ML Studio
    • VPC Not Found Error if VPC does not exist
    • Subnet Not Found Error if Subnet(s) do not exist
    • Subnet not in VPC Error if Subnet(s) a part of a different VPC
    • Specify VPC No Subnets —> Error - forced to specify Subnets if vpc id provided
    • Create Successfully with appropriate VPC and Subnet specified
  • Env Admin Can View Studio Domain Details in Environment-MLStudio Tab (if ML Studio Enabled)
  • Create ML Studio User, Access Studio from UI, and Delete ML Studio User
  • Edit Environment and Disable ML Studio —> Env Updates and Domain Deleted
  • Edit Environment and Re-Enable ML Studio with no VPC specified —> Env uses default (if present)
  • Delete Environment —> Env and ML Studio deleted (from RDS + env stack)

Also tested with mlStudio turned off (i.e. active: false) in local deployment

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love this change, it was my main concern with the PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The final implementation is way cleaner and more elegant :) :)

@dlpzx
Copy link
Contributor

dlpzx commented Dec 7, 2023

Testing locally. I changed some tests to cover more scenarios @noah-paige

  • Pre-existing environment with MLStudio updates successfully. SageMaker domain does not get deleted/re-created
  • Edit environment, disable ML Studio and update stack - SageMaker domain is deleted
  • Env Admin Can View Studio Domain Details in Environment-MLStudio Tab - if ML Studio NOT Enabled it sees an empty table and the correct instructions
  • Edit parameter, re-enable ML Studio and specify VPC and Subnets For ML Studio
    • VPC Not Found Error if VPC does not exist
    • Subnet Not Found Error if Subnet(s) do not exist
    • Subnet not in VPC Error if Subnet(s) a part of a different VPC
    • Specify VPC No Subnets —> Error - forced to specify Subnets if vpc id provided
    • Create Successfully with appropriate VPC and Subnet specified
  • Env Admin Can View Studio Domain Details in Environment-MLStudio Tab (if ML Studio Enabled) ---> Yes, the functionality works but there is something weird with the frontend view (commented in the code)
  • Create ML Studio User, Access Studio from UI, and Delete ML Studio User
  • Edit Environment and Re-Enable ML Studio with no VPC specified in ENVIRONMENT with default VPC —> Env uses default VPC
  • Edit Environment and Re-Enable ML Studio with no VPC specified in ENVIRONMENT without default VPC —> Env creates new VPC
  • Delete Environment —> If environment count_resources functions fail (the environment is not cleaned-up), the mlstudio is not deleted

created={mlStudioDomain.created}
/>
</Grid>
</Grid>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ObjectMetadata component looks a bit weird in the frontend, was it intentional?

Screenshot 2023-12-07 at 10 31 43

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just wanting to add all relevant information we could about domain - however the ObjectMetadata is esentially the same as the ObjectMetadata of the Environment in the Overview Tab - so I can remove and just keep ML Studio Information if you think that is cleaner from UX side?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove it, first because it might lead to errors on users thinking that the domain was created at the environment creation time. And also because we need to do some tweaking for it to look good on all screen sizes

@dlpzx
Copy link
Contributor

dlpzx commented Dec 7, 2023

Testing AWS:

  • CICD runs successfully
  • Pre-existing environment with MLStudio (DEFAULT VPC) updates successfully. SageMaker domain does not get deleted/re-created
  • Pre-existing environment with MLStudio (CREATED VPC) updates successfully. SageMaker domain does not get deleted/re-created, VPC does not get deleted/re-created
  • Edit environment, disable ML Studio and update stack - SageMaker domain is deleted
  • Edit parameter, re-enable ML Studio and specify VPC and Subnets For ML Studio
    • VPC Not Found Error if VPC does not exist
    • Subnet Not Found Error if Subnet(s) do not exist
    • Subnet not in VPC Error if Subnet(s) a part of a different VPC
    • Specify VPC No Subnets —> Error - forced to specify Subnets if vpc id provided
    • Create Successfully with appropriate VPC and Subnet specified
  • Env Admin Can View Studio Domain Details in Environment-MLStudio Tab (if ML Studio Enabled) --> yes with the weird view
  • Create ML Studio User, Access Studio from UI, and Delete ML Studio User
  • Edit Environment and Re-Enable ML Studio with no VPC specified in ENVIRONMENT with default VPC —> Env uses default VPC

@noah-paige
Copy link
Contributor Author

noah-paige commented Dec 7, 2023

Finishing up these final 3 points and additional testing:

  • Add SamlGroupName column to Studio Domain

  • Remove ObjectMetaData from EnvironmentMLStudioView

  • Edit update_env() in sagemaker studio domain to set vpcType, vpcId (if possible) and subnetIds (if possible)

    • I edited update_env() in sagemaker studio domain to set vpc information but I kept the extension stack as is because of the chance where a user calls update_stack() without first editing the environment... update_env() only runs on edits to the environment
    • Then in the case above an environment with mlstudio domain vpcType of unknown would try to create a new VPC and new Domain and would fail unless they first edit the stack to have the domain vpc information populated (if imported or default)
    • Also after far too much trial and error I discovered that specifying the subnets from the RDS table (i.e. domain.subnetIds) verse specifying the subnets from cdk VPC lookup (i.e. subnet.subnet_id for subnet in vpc.private_subnets) will cause Domain to recreate EVEN IF the subnets are the exact same 😞
    • Meaning we will likely still have to make a VPC boto3 call in extension stack to check if default vpc exists and go from there

@dlpzx
Copy link
Contributor

dlpzx commented Dec 8, 2023

Finishing up these final 3 points and additional testing:

* [x]  Add `SamlGroupName` column to Studio Domain

* [x]  Remove `ObjectMetaData` from `EnvironmentMLStudioView`

* [x]  Edit `update_env()` in sagemaker studio domain to set `vpcType`, `vpcId` (if possible) and `subnetIds` (if possible)
  
  * I edited  `update_env()` in sagemaker studio domain to set vpc information but I kept the extension stack as is because of the chance where a user calls `update_stack()` without first editing the environment... `update_env()` only runs on edits to the environment
  * Then in the case above an environment with mlstudio domain vpcType of `unknown` would try to create a new VPC and new Domain and would fail unless they first edit the stack to have the domain vpc information populated (if imported or default)
  * Also after far too much trial and error I discovered that specifying the subnets from the RDS table (i.e. `domain.subnetIds`) verse specifying the subnets from cdk VPC lookup (i.e. `subnet.subnet_id for subnet in vpc.private_subnets`) will cause Domain to recreate EVEN IF the subnets are the exact same 😞
  * Meaning we will likely still have to make a VPC boto3 call in extension stack to check if default vpc exists and go from there

Thank you! I will do a final testing but I think we are good! And regarding the last point, it might be related with some id that we are not aware of, strange. Great testing though, this is something that we would have not spotted by reading the documentation. So it seems like we will keep the tests in all places :) but at least we will update metadata as users use the MLStudio feature more (when they do update_env)

@dlpzx
Copy link
Contributor

dlpzx commented Dec 8, 2023

Tested the latest features:

  • CICD pipeline runs successfully
  • downgraded and re-upgraded migration scripts
  • frontend changes
  • update environment updates ML Studio info
  • enable/disable ML Studio works as expected

Approving!

Copy link
Contributor

@dlpzx dlpzx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great!

@noah-paige noah-paige merged commit 94c93d9 into main Dec 8, 2023
8 checks passed
@noah-paige noah-paige deleted the byo-vpc-mlstudio branch December 8, 2023 13:43
dlpzx pushed a commit that referenced this pull request Dec 13, 2023
### Feature or Bugfix
<!-- please choose -->
- Bugfix

### Detail
- When deploying from scratch - the migration script for byo vpc for
mlstudio had a line of code `if has_table('sagemaker_studio_domain')`
which was evaluating to `False` and skipping necessary migration code
- Removed the `if has_table()` and just throw Exception in try/catch
block for upgrade migration script

- Also as part of this PR I made sure the security group gets created in
the mlstudio_extension stack no matter what type of VPC we use for ML
Studio Domain
- Without the Security Group VpcOnly Deployments of ML Studio Domain
with `security_group=[]` will never be able to connect to internet
without manual changes
- Tested with Imported VPC and provisioned Studio User Profile - can
access internet via NAT Gateway as prescribed with VPCOnly deployments
if internet access is required
- Tested changes to ML Studio Domain Security Group does NOT cause new
creation of Domain (only UpdateDomain no replacement) --> Environment
Stack Updates Successfully


### Relates
- #894

### Security
N/A
Please answer the questions below briefly where applicable, or write
`N/A`. Based on
[OWASP 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this
includes
fetching data from storage outside the application (e.g. a database, an
S3 bucket)?
  - Is the input sanitized?
- What precautions are you taking before deserializing the data you
consume?
  - Is injection prevented by parametrizing queries?
  - Have you ensured no `eval` or similar functions are used?
- Does this PR introduce any functionality or component that requires
authorization?
- How have you ensured it respects the existing AuthN/AuthZ mechanisms?
  - Are you logging failed auth attempts?
- Are you using or adding any cryptographic features?
  - Do you use a standard proven implementations?
  - Are the used keys controlled by the customer? Where are they stored?
- Are you introducing any new policies/roles/users?
  - Have you used the least-privilege principle? How?


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow for bring-your-own VPC for SageMaker domains
2 participants