
Undeploy models with no WorkerNodes #3380

Merged

Merged 10 commits into opensearch-project:main on Jan 28, 2025

Conversation

brianf-aws
Contributor

@brianf-aws brianf-aws commented Jan 11, 2025

This PR aims to undeploy modelIds that have no nodes associated with them, so that the intent of undeploy stays truthful.

Description

When performing undeploy, if the model has no nodes associated with it, the action resets the model index entry to the UNDEPLOYED status.

Here is an example of why this code change is needed.

This scenario covers the PARTIALLY_DEPLOYED issue:

  1. Have nodes a, b, c, d in the cluster associated with a model ID, i.e. perform deploy on it.
  2. Bring a and b down while the sync-up job is running.
  3. By now sync-up will mark the model PARTIALLY_UNDEPLOYED.
  4. Stop sync-up.
  5. Bring the other two nodes c and d down, and bring up two nodes (these now have new IDs 1 and 2).
  6. Bring the other two nodes back, which also come up with different IDs, so now the cluster has (1, 2, 3, 4).
    But the model index still says PARTIALLY_DEPLOYED and no nodes are servicing the model.

The fix: if no nodes are servicing a model, set its index entry to UNDEPLOYED, regardless of whether it is already UNDEPLOYED.
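As a sketch of that decision rule (illustrative names only; the real change lives in `TransportUndeployModelsAction` in ml-commons, and this stand-in class is hypothetical):

```java
import java.util.Set;

// Illustrative sketch of the fix's decision rule; class and method
// names are hypothetical, not the actual ml-commons code.
class UndeployStateSketch {
    // Given the worker nodes still servicing a model and the state
    // currently recorded in the model index, decide what to persist.
    static String resolveState(Set<String> servicingNodes, String recordedState) {
        if (servicingNodes.isEmpty()) {
            // No node services the model: force UNDEPLOYED, even if the
            // index already says UNDEPLOYED or PARTIALLY_DEPLOYED.
            return "UNDEPLOYED";
        }
        // Otherwise leave the recorded state for the normal path to update.
        return recordedState;
    }
}
```

For example, `resolveState(Set.of(), "PARTIALLY_DEPLOYED")` yields `"UNDEPLOYED"`, which is exactly the stale-index case in the scenario above.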

Related Issues

Resolves #3285

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

bulkUpdateRequest.setRefreshPolicy(WriteRequest.RefreshPolicy.IMMEDIATE);
log.info("No models service: {}", modelIds.toString());
client.bulk(bulkUpdateRequest, ActionListener.wrap(br -> { log.debug("Successfully set modelIds to UNDEPLOY in index"); }, e -> {
log.error("Failed to set modelIds to UNDEPLOY in index", e);
Collaborator

Should we send this exception back to the client side? If yes, we should pass the listener to this method and add this line here:

listener.onFailure(e);

Collaborator

+1

Contributor Author

@brianf-aws brianf-aws Jan 11, 2025

I'm not sure the user is concerned with the failure. The only way I see this becoming a problem is if the model index does not exist and the user performs undeploy; that might cause a write issue.

Added your suggestion to report the failure if it can't write back to the index.
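The pattern the reviewer suggests can be sketched with a minimal stand-in for OpenSearch's ActionListener (a simplified interface with hypothetical names, not the real `org.opensearch.core.action.ActionListener`):

```java
import java.util.function.Consumer;

// Minimal stand-in for OpenSearch's ActionListener, to illustrate
// propagating a bulk-write failure back to the caller's listener
// instead of only logging it.
interface SimpleListener<T> {
    void onResponse(T response);
    void onFailure(Exception e);

    static <T> SimpleListener<T> wrap(Consumer<T> onResp, Consumer<Exception> onFail) {
        return new SimpleListener<T>() {
            public void onResponse(T r) { onResp.accept(r); }
            public void onFailure(Exception e) { onFail.accept(e); }
        };
    }
}

class BulkWriteSketch {
    // The outer listener belongs to the undeploy caller; the returned
    // listener handles the bulk index write. On write failure we
    // surface the exception to the caller rather than swallowing it.
    static <T> SimpleListener<T> bulkListener(SimpleListener<?> outer, Consumer<T> onSuccess) {
        return SimpleListener.wrap(onSuccess, e -> outer.onFailure(e));
    }
}
```

The design choice debated here is exactly this: whether the inner failure branch should call `outer.onFailure(e)` or only log.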

@@ -157,10 +163,36 @@ private void undeployModels(String[] targetNodeIds, String[] modelIds, ActionLis
MLUndeployModelNodesRequest mlUndeployModelNodesRequest = new MLUndeployModelNodesRequest(targetNodeIds, modelIds);

client.execute(MLUndeployModelAction.INSTANCE, mlUndeployModelNodesRequest, ActionListener.wrap(r -> {
if (r.getNodes().isEmpty()) {
Collaborator

It makes sense that when the undeploy model call executes and the response returns no worker nodes, we then set the model to undeployed.

But this doesn't fix the partially undeployed issue, right?

Contributor Author

PARTIALLY_UNDEPLOYED covers a mixture of different scenarios; one of them goes like this:

  1. Have nodes a, b, c, d in the cluster associated with a model ID, i.e. perform deploy on it.
  2. Bring a and b down while the sync-up job is running.
  3. By now sync-up will mark the model PARTIALLY_UNDEPLOYED.
  4. Stop sync-up.
  5. Bring the other two nodes c and d down, and bring up two nodes (these now have new IDs 1 and 2).
  6. Bring the other two nodes back, which also come up with different IDs, so now the cluster has (1, 2, 3, 4).
    But the model index still says PARTIALLY_DEPLOYED and no nodes are servicing the model.

The fix: if no nodes are servicing a model, set its index entry to UNDEPLOYED, regardless of whether it is already UNDEPLOYED.

@dhrubo-os
Collaborator

Can we add unit test?

}
bulkUpdateRequest.setRefreshPolicy(WriteRequest.RefreshPolicy.IMMEDIATE);
log.info("No models service: {}", modelIds.toString());
client.bulk(bulkUpdateRequest, ActionListener.wrap(br -> { log.debug("Successfully set modelIds to UNDEPLOY in index"); }, e -> {
Collaborator

I think you can return MLUndeployModelsResponse including the original nodes rather than empty. You can find some examples in the tests how to create a new MLUndeployModelsResponse.

Collaborator

If we don't return inside br -> { log.debug("Successfully set modelIds to UNDEPLOY in index"); }, it's possible that by the time the client side receives the undeploy response, the model is still in the DEPLOYED state.

Contributor Author

I'm not sure how I feel about creating a new one based on the failures; I think it would be misleading. I will pass the original response r instead.

Collaborator

The failures should just return a message. But for the success case we should return something rather than {}.

Collaborator

@dhrubo-os dhrubo-os Jan 11, 2025

If the success case shows (to keep consistency with the current output for the partially deployed case):

{
  "node id 1": "Not Found",
  "node id 2": "Not Found"
}

I don't think this will add much value from the customer's POV.

But maybe we can send a success message to the customer?

Collaborator

This will still give an empty response, which is not accurate. Since we cannot send nodes in the response in this case, let's send something to show the model(s) undeployed successfully. Something like:

{
  <model_id_1>: "UNDEPLOYED SUCCESSFULLY",
  <model_id_2>: "UNDEPLOYED SUCCESSFULLY"
}

Contributor Author

We discussed internally that we would not want to send back this information, as sending an updated response would break bwc.

Also, I'm thinking that if we write "UNDEPLOYED Successfully", it may sound like undeployment was performed, when in reality we are just updating the index and not performing any update on the nodes carrying the "model".

@brianf-aws
Contributor Author

Can we add unit test?

Added 2 UTs for the code fix:

  1. Check that the bulk write occurred when undeploy returned {}. This is the sign that the stale model index entry was set to UNDEPLOYED.
  2. Check that the bulk write did not occur when some nodes responded for the model; undeploy occurred and changed the index.

client.bulk(bulkRequest, ActionListener.runAfter(actionListener, () -> {
syncUpUndeployedModels(syncUpRequest);
listener.onResponse(undeployModelNodesResponse);
}));
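The `runAfter` wrapper in the snippet above makes the follow-up work (sync-up, responding to the caller) run only once the bulk write has finished, whichever way it finishes. A minimal sketch of that semantics with a stand-in listener (hypothetical names, not the real OpenSearch `ActionListener`):

```java
// Stand-in listener mimicking ActionListener.runAfter: the Runnable
// fires after either outcome, so follow-up work only happens once
// the wrapped operation has completed, success or failure.
interface Cb<T> {
    void onResponse(T r);
    void onFailure(Exception e);

    static <T> Cb<T> runAfter(Cb<T> delegate, Runnable after) {
        return new Cb<T>() {
            public void onResponse(T r) {
                try { delegate.onResponse(r); } finally { after.run(); }
            }
            public void onFailure(Exception e) {
                try { delegate.onFailure(e); } finally { after.run(); }
            }
        };
    }
}
```

With this wrapper, the sync-up and `listener.onResponse(...)` calls in the snippet cannot race ahead of the bulk index update, which addresses the earlier review concern about the client seeing a stale DEPLOYED state.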

@mingshl mingshl added the bug Something isn't working label Jan 28, 2025
@brianf-aws
Contributor Author

LGTM. I added 2.x backport label, did you figure out what other versions you need to backport to? I will add the labels for you

Thank you! Yes I think we can do OS.2.11 - 2.18

@brianf-aws
Contributor Author

Sorry, can we make this 2.x only? Since this builds on the multi-tenancy work, it would break every other backport.

@brianf-aws
Contributor Author

Looks like an IT is causing an issue

RestBedRockInferenceIT > test_bedrock_multimodal_model_empty_imageInput_null_textInput STANDARD_OUT
    [2025-01-28T13:06:27,997][INFO ][o.o.m.r.RestBedRockInferenceIT] [test_bedrock_multimodal_model_empty_imageInput_null_textInput] before test
Error: 25-01-28T13:06:48,987][ERROR][o.o.m.r.MLCommonsRestTestCase] [test_bedrock_multimodal_model_empty_imageInput_null_textInput] method [POST], host [http://[::1]:38167], URI [/_plugins/_ml/_predict/TEXT_EMBEDDING/UFOGrpQB2p_oggmIDsWU], status line [HTTP/1.1 400 Bad Request]
    {"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"No input text or image provided"}],"type":"illegal_argument_exception","reason":"No input text or image provided"},"status":400}
    org.opensearch.client.ResponseException: method [POST], host [http://[::1]:38167], URI [/_plugins/_ml/_predict/TEXT_EMBEDDING/UFOGrpQB2p_oggmIDsWU], status line [HTTP/1.1 400 Bad Request]
    {"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"No input text or image provided"}],"type":"illegal_argument_exception","reason":"No input text or image provided"},"status":400}
    	at org.opensearch.client.RestClient.convertResponse(RestClient.java:501) ~[opensearch-rest-client-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    	at org.opensearch.client.RestClient.performRequest(RestClient.java:384) ~[opensearch-rest-client-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    	at org.opensearch.client.RestClient.performRequest(RestClient.java:359) ~[opensearch-rest-client-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    	at org.opensearch.ml.utils.TestHelper.makeRequest(TestHelper.java:183) ~[test/:?]
    	at org.opensearch.ml.utils.TestHelper.makeRequest(TestHelper.java:156) ~[test/:?]
    	at org.opensearch.ml.utils.TestHelper.makeRequest(TestHelper.java:145) ~[test/:?]
    	at org.opensearch.ml.rest.MLCommonsRestTestCase.predictTextEmbeddingModel(MLCommonsRestTestCase.java:925) ~[test/:?]
    	at org.opensearch.ml.rest.RestBedRockInferenceIT.test_bedrock_multimodal_model_empty_imageInput_null_textInput(RestBedRockInferenceIT.java:233) ~[test/:?]

Is this a known flaky issue? If not, can we merge? cc: @jngz-es, @mingshl

@dhrubo-os dhrubo-os merged commit 18bcaae into opensearch-project:main Jan 28, 2025
18 of 19 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jan 28, 2025
* undeploy models with no WorkerNodes

This commit aims to undeploy modelIds that have no nodes associated to them so as to keep the intention of undeploy truthful.

Signed-off-by: Brian Flores <[email protected]>

# Conflicts:
#	plugin/src/main/java/org/opensearch/ml/action/undeploy/TransportUndeployModelsAction.java

* Exit early when no nodes service the model

Now when entering this method its guaranteed to write to index first before sending back the MLUndeploy response. And will also send back a exception if the write back fails

Signed-off-by: Brian Flores <[email protected]>

* add UTs for undeploy stale model index fix

Added UTs for the 2 scenarios 1. Check that the bulk operation occured when no nodes are returned from the Undeploy response is , 2. Check that the bulk operation did not occur when there are nodes that have found the model within their cache.

Signed-off-by: Brian Flores <[email protected]>

* update code change with comment explaining the change

Signed-off-by: Brian Flores <[email protected]>

* add context stash/restore to write operation

Signed-off-by: Brian Flores <[email protected]>

* Apply spotless

Signed-off-by: Brian Flores <[email protected]>

* Add better logging to write request

Signed-off-by: Brian Flores <[email protected]>

* wrap exception into 5xx

Signed-off-by: Brian Flores <[email protected]>

* adapts undeploy code change to multi-tenancy feature

Signed-off-by: Brian Flores <[email protected]>

* applies spotless

Signed-off-by: Brian Flores <[email protected]>

---------

Signed-off-by: Brian Flores <[email protected]>
(cherry picked from commit 18bcaae)
Zhangxunmt pushed a commit that referenced this pull request Jan 28, 2025
(same commits as above, cherry picked from commit 18bcaae)
Co-authored-by: Brian Flores <[email protected]>
@opensearch-trigger-bot
Contributor

The backport to feature/multi_tenancy failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-feature/multi_tenancy feature/multi_tenancy
# Navigate to the new working tree
cd .worktrees/backport-feature/multi_tenancy
# Create a new branch
git switch --create backport/backport-3380-to-feature/multi_tenancy
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 18bcaaeff9294372801e63f376bc7920143fa3ad
# Push it to GitHub
git push --set-upstream origin backport/backport-3380-to-feature/multi_tenancy
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-feature/multi_tenancy

Then, create a pull request where the base branch is feature/multi_tenancy and the compare/head branch is backport/backport-3380-to-feature/multi_tenancy.


Successfully merging this pull request may close these issues.

[BUG] Model undeploying giving empty response
8 participants