-
Notifications
You must be signed in to change notification settings - Fork 148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Undeploy models with no WorkerNodes #3380
Undeploy models with no WorkerNodes #3380
Conversation
bulkUpdateRequest.setRefreshPolicy(WriteRequest.RefreshPolicy.IMMEDIATE); | ||
log.info("No models service: {}", modelIds.toString()); | ||
client.bulk(bulkUpdateRequest, ActionListener.wrap(br -> { log.debug("Successfully set modelIds to UNDEPLOY in index"); }, e -> { | ||
log.error("Failed to set modelIds to UNDEPLOY in index", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we send this exception back to client side ? If yes, we should pass listener to this method and add this line here
listener.onFailure(e);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure the user is concerned with the failure? The only way I see this as a problem if the model index does not exist and user does undeploy, this might cause a write issue.
Added your sugesstion to report the failure if it cant write back to the index.
@@ -157,10 +163,36 @@ private void undeployModels(String[] targetNodeIds, String[] modelIds, ActionLis | |||
MLUndeployModelNodesRequest mlUndeployModelNodesRequest = new MLUndeployModelNodesRequest(targetNodeIds, modelIds); | |||
|
|||
client.execute(MLUndeployModelAction.INSTANCE, mlUndeployModelNodesRequest, ActionListener.wrap(r -> { | |||
if (r.getNodes().isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it make sense that when execute undeploy model, the response return no worder nodes, then set the model to undeploy.
but this doesn't fix the partially undeployed issue, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PARTIALLY_UNDEPLOYED
is a bit of a mixture of different scenarios one of them like so
- Have nodes a,b,c,d in cluster associated with
modelID:@
i.e. peform deploy on it - Bring a,b down while having the syncup job running
- By Now sync up will make this PARTIALLY_UNDEPLOYED
- stop sync up
- Bring other 2 c,d down and bring 2 nodes (these now have new ids 1,2)
- bring the other two nodes back which have different ids so now the cluster has (1,2,3,4)
But the model index saysPARTIALLY_DEPLOYED
and no nodes are servicing
This code fix says. If no nodes are servicing this model then I need to set the index to UNDEPLOYED no matter if its already UNDEPLOYED or not.
Can we add unit test? |
} | ||
bulkUpdateRequest.setRefreshPolicy(WriteRequest.RefreshPolicy.IMMEDIATE); | ||
log.info("No models service: {}", modelIds.toString()); | ||
client.bulk(bulkUpdateRequest, ActionListener.wrap(br -> { log.debug("Successfully set modelIds to UNDEPLOY in index"); }, e -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can return MLUndeployModelsResponse including the original nodes rather than empty. You can find some examples in the tests how to create a new MLUndeployModelsResponse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't return in br -> { log.debug("Successfully set modelIds to UNDEPLOY in index"); }
, it's possible that when client side receive the undeploy response, the model still on DEPLOYED state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Im not sure how I feel about creating a new one based on the failures,I think it will be misleading. I will pass the original response r instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The failures should just return a message. But for the success case we should return something rather than {}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if success case shows (to keep consistency with current output for partially deployed case):
{
"node id 1" : Not Found,
"node id 2" : Not Found
}
I don't think this will add much value from customer's POV.
But may be we can send a Success message to customer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will still give empty response which is not accurate. Since we cannot send nodes as response in this case, lets send something to show model/models undeployed successfully. Something like
{
<model_id_1>: "UNDEPLOYED SUCCESSFULLY",
<model_id_2>: "UNDEPLOYED SUCCESSFULLY"
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed internally that we would not want to send back this information as we don't want to break bwc if we send back a updated response.
Also I'm thinking if we write UNDEPLOYED Successfully, this may sound like it performed undeployement but the reality is that it is just updating the index and not performing any update on the nodes carrying the "model".
plugin/src/main/java/org/opensearch/ml/action/undeploy/TransportUndeployModelsAction.java
Outdated
Show resolved
Hide resolved
plugin/src/main/java/org/opensearch/ml/action/undeploy/TransportUndeployModelsAction.java
Outdated
Show resolved
Hide resolved
Added 2 UTs for code fix
ml-commons/plugin/src/main/java/org/opensearch/ml/action/undeploy/TransportUndeployModelAction.java Lines 192 to 195 in 22b558d
|
plugin/src/main/java/org/opensearch/ml/action/undeploy/TransportUndeployModelsAction.java
Show resolved
Hide resolved
plugin/src/main/java/org/opensearch/ml/action/undeploy/TransportUndeployModelsAction.java
Outdated
Show resolved
Hide resolved
Thank you! Yes I think we can do OS.2.11 - 2.18 |
Sorry can we make this to 2.x ? since this works of multi-tenancy and would break every other backport |
Looks like an IT is causing an issue
Is this a flaky known issue if not can we merge? cc: @jngz-es, @mingshl |
* undeploy models with no WorkerNodes This commit aims to undeploy modelIds that have no nodes associated to them so as to keep the intention of undeploy truthful. Signed-off-by: Brian Flores <[email protected]> # Conflicts: # plugin/src/main/java/org/opensearch/ml/action/undeploy/TransportUndeployModelsAction.java * Exit early when no nodes service the model Now when entering this method its guaranteed to write to index first before sending back the MLUndeploy response. And will also send back a exception if the write back fails Signed-off-by: Brian Flores <[email protected]> * add UTs for undeploy stale model index fix Added UTs for the 2 scenarios 1. Check that the bulk operation occured when no nodes are returned from the Undeploy response is , 2. Check that the bulk operation did not occur when there are nodes that have found the model within their cache. Signed-off-by: Brian Flores <[email protected]> * update code change with comment explaining the change Signed-off-by: Brian Flores <[email protected]> * add context stash/restore to write operation Signed-off-by: Brian Flores <[email protected]> * Apply spotless Signed-off-by: Brian Flores <[email protected]> * Add better logging to write request Signed-off-by: Brian Flores <[email protected]> * wrap exception into 5xx Signed-off-by: Brian Flores <[email protected]> * adapts undeploy code change to multi-tenancy feature Signed-off-by: Brian Flores <[email protected]> * applies spotless Signed-off-by: Brian Flores <[email protected]> --------- Signed-off-by: Brian Flores <[email protected]> (cherry picked from commit 18bcaae)
* undeploy models with no WorkerNodes This commit aims to undeploy modelIds that have no nodes associated to them so as to keep the intention of undeploy truthful. Signed-off-by: Brian Flores <[email protected]> # Conflicts: # plugin/src/main/java/org/opensearch/ml/action/undeploy/TransportUndeployModelsAction.java * Exit early when no nodes service the model Now when entering this method its guaranteed to write to index first before sending back the MLUndeploy response. And will also send back a exception if the write back fails Signed-off-by: Brian Flores <[email protected]> * add UTs for undeploy stale model index fix Added UTs for the 2 scenarios 1. Check that the bulk operation occured when no nodes are returned from the Undeploy response is , 2. Check that the bulk operation did not occur when there are nodes that have found the model within their cache. Signed-off-by: Brian Flores <[email protected]> * update code change with comment explaining the change Signed-off-by: Brian Flores <[email protected]> * add context stash/restore to write operation Signed-off-by: Brian Flores <[email protected]> * Apply spotless Signed-off-by: Brian Flores <[email protected]> * Add better logging to write request Signed-off-by: Brian Flores <[email protected]> * wrap exception into 5xx Signed-off-by: Brian Flores <[email protected]> * adapts undeploy code change to multi-tenancy feature Signed-off-by: Brian Flores <[email protected]> * applies spotless Signed-off-by: Brian Flores <[email protected]> --------- Signed-off-by: Brian Flores <[email protected]> (cherry picked from commit 18bcaae) Co-authored-by: Brian Flores <[email protected]>
The backport to
To backport manually, run these commands in your terminal: # Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-feature/multi_tenancy feature/multi_tenancy
# Navigate to the new working tree
cd .worktrees/backport-feature/multi_tenancy
# Create a new branch
git switch --create backport/backport-3380-to-feature/multi_tenancy
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 18bcaaeff9294372801e63f376bc7920143fa3ad
# Push it to GitHub
git push --set-upstream origin backport/backport-3380-to-feature/multi_tenancy
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-feature/multi_tenancy Then, create a pull request where the |
This PR aims to undeploy modelIds that have no nodes associated to them so as to keep the intention of undeploy truthful.
Description
When performing undeploy if the model has no nodes associated to it then it will reset the index to UNDEPLOY status
Here is an example of why this code change is needed
This secnario is for the
PARTIALLY_DEPLOYED
issue.modelID:@
i.e. peform deploy on itBut the model index says
PARTIALLY_DEPLOYED
and no nodes are servicingThis code fix says. If no nodes are servicing this model then I need to set the index to UNDEPLOYED no matter if its already UNDEPLOYED or not.
Related Issues
Resolves #3285
Check List
--signoff
.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.