fix: Delete all versions of a deleted model #6193

Draft
wants to merge 1 commit into
base: v2
from

Member

@abhimanyu003 abhimanyu003 commented Jan 15, 2025

What this PR does / why we need it:

How to replicate the issue

  • Apply the model:
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: add10
spec:
  storageUri: "gs://seldon-models/scv2/samples/triton_23-03/add10"
  requirements:
  - triton
  - python
  • Update the manifest by applying an update with a faulty storageUri:
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: add10
spec:
  storageUri: "gs://seldon-models/scv2/samples/triton_23-03/add10-faulty-url"
  requirements:
  - triton
  - python
  • Run seldon model unload add10
  • The model remains in memory and is never cleaned up
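
The steps above can be sketched in Go as a toy model of the leak. This is a minimal illustration and not the Seldon Core agent code. It assumes the agent tracks resident versions per model and that the unload only targets the latest version, which never loaded. The `versionStore` type and its methods are hypothetical names introduced here.

```go
package main

import (
	"fmt"
	"sort"
)

// versionStore is a hypothetical stand-in for the agent's per-model version
// bookkeeping. It is NOT the Seldon Core implementation.
type versionStore struct {
	loaded map[string]map[uint32]bool // model name -> set of resident versions
}

func newVersionStore() *versionStore {
	return &versionStore{loaded: make(map[string]map[uint32]bool)}
}

// add records that a version of a model is resident in memory.
func (s *versionStore) add(model string, version uint32) {
	if s.loaded[model] == nil {
		s.loaded[model] = make(map[uint32]bool)
	}
	s.loaded[model][version] = true
}

// unloadVersion removes one version only (assumed pre-fix behaviour).
func (s *versionStore) unloadVersion(model string, version uint32) {
	delete(s.loaded[model], version) // no-op if the version was never loaded
	if len(s.loaded[model]) == 0 {
		delete(s.loaded, model)
	}
}

// unloadAll removes every resident version of a model and returns the
// removed version numbers in ascending order.
func (s *versionStore) unloadAll(model string) []uint32 {
	removed := make([]uint32, 0, len(s.loaded[model]))
	for v := range s.loaded[model] {
		removed = append(removed, v)
	}
	sort.Slice(removed, func(i, j int) bool { return removed[i] < removed[j] })
	delete(s.loaded, model)
	return removed
}

// resident reports how many versions of a model are still held in memory.
func (s *versionStore) resident(model string) int {
	return len(s.loaded[model])
}

func main() {
	s := newVersionStore()
	s.add("add10", 1) // first apply: version 1 loads successfully
	// Version 2 (the faulty storageUri) fails to load, so it is never added.

	s.unloadVersion("add10", 2) // unload only targets the latest version
	fmt.Println("resident after single-version unload:", s.resident("add10"))

	fmt.Println("removed by unloadAll:", s.unloadAll("add10"))
	fmt.Println("resident after unloadAll:", s.resident("add10"))
}
```

Under these assumptions, version 1 stays resident after the single-version unload, while removing all versions of the deleted model clears it.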

Which issue(s) this PR fixes:
Fixes #INFRA-1230

Member

@sakoush sakoush left a comment


This change could cause downtime and will need to be revisited.

}
c.logger.WithField("func", c.stateManager.modelVersions.getVersionsForAllModels()).Infof("----")

err := c.ModelRepository.RemoveModelVersion(util.GetVersionedModelName(modelName, v.GetVersion()))

This is probably not the correct flow, as at this point we are not yet sure that the old version of this model should be removed.

In a progressive rollout, we have to make sure that the new version of the model is up before removing an old version; otherwise, we risk inference requests not being served during the transition.

I also don't think this is the correct place to do the cleanup, as one agent (i.e. a server replica loading one model replica) doesn't have visibility into the actual state of the new version of the model (e.g. whether the Envoy routes have been updated). Therefore, I think the cleanup should be done on the scheduler side and not on the agent.
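
To make the ordering constraint concrete, here is a rough sketch of what a scheduler-side cleanup decision could look like. It is illustrative only and does not reflect the actual scheduler code: `modelState`, `versionState`, and `versionsToRemove` are hypothetical names, and the `Available` and `RoutesActive` flags are assumed to be known to the scheduler.

```go
package main

import "fmt"

// versionState is a hypothetical scheduler-side view of one model version.
type versionState struct {
	Version      uint32
	Available    bool // at least one replica of this version is loaded and ready
	RoutesActive bool // the routing layer (e.g. Envoy) already points at this version
}

// modelState is a hypothetical scheduler-side view of a model.
type modelState struct {
	Name     string
	Deleted  bool           // the user has unloaded the model
	Versions []versionState // ordered oldest to newest
}

// versionsToRemove returns the versions that can be removed without
// dropping inference traffic, under the assumptions stated above.
func versionsToRemove(m modelState) []uint32 {
	out := []uint32{}
	if len(m.Versions) == 0 {
		return out
	}
	if m.Deleted {
		// A deleted model should serve no traffic: remove every version.
		for _, v := range m.Versions {
			out = append(out, v.Version)
		}
		return out
	}
	latest := m.Versions[len(m.Versions)-1]
	if !latest.Available || !latest.RoutesActive {
		// The new version is not serving yet, so keep the old ones.
		return out
	}
	// The new version is serving: all older versions can go.
	for _, v := range m.Versions[:len(m.Versions)-1] {
		out = append(out, v.Version)
	}
	return out
}

func main() {
	rollingOut := modelState{Name: "add10", Versions: []versionState{
		{Version: 1, Available: true, RoutesActive: true},
		{Version: 2}, // faulty update: never becomes available
	}}
	fmt.Println("during failed rollout:", versionsToRemove(rollingOut))

	rollingOut.Deleted = true
	fmt.Println("after model deletion:", versionsToRemove(rollingOut))
}
```

The point of the sketch is that the decision depends on state that a single agent does not have, which is why it sits with the component that knows about availability and routing.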

@sakoush sakoush marked this pull request as draft January 16, 2025 11:48

2 participants