feat(scheduler): account for number of model instances when scheduling #6183
Conversation
LGTM. I left a couple of comments for a potential follow-up PR.
@@ -683,6 +684,7 @@ func (c *Client) UnloadModel(request *agent.ModelOperationMessage, timestamp int
 	defer c.modelTimestamps.Store(modelWithVersion, timestamp)
 
 	// we do not care about model versions here
+	// model runtime info is retrieved from the existing version, so nil is passed here
Is this comment relevant?
It is... the model runtime info param in getModifiedModelVersion is nil.
scheduler/pkg/agent/client_utils.go (Outdated)
mv := proto.Clone(originalModelVersion)
mv.(*agent.ModelVersion).Model.Meta.Name = modelId
if mv.(*agent.ModelVersion).Model.ModelSpec != nil && modelRuntimeInfo != nil {
Can ModelSpec actually be nil, or is it just a safeguard? Maybe add a note, as readers could think that runtime info can be set while the model spec is nil.
I don't think it can be in practice, but there are some tests where it's not set. I can change this so it's always set.
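For illustration, the guard could carry the suggested note along these lines; this is only a sketch, and the field the runtime info is assigned to is an assumption, not necessarily the final code:

```go
mv := proto.Clone(originalModelVersion)
mv.(*agent.ModelVersion).Model.Meta.Name = modelId
// ModelSpec is expected to always be set in practice; the nil check only
// guards against test fixtures that build a ModelVersion without a spec.
// Runtime info is never attached when the spec is missing.
if mv.(*agent.ModelVersion).Model.ModelSpec != nil && modelRuntimeInfo != nil {
	// assumed field name for where the runtime info is stored
	mv.(*agent.ModelVersion).Model.ModelSpec.ModelRuntimeInfo = modelRuntimeInfo
}
```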
@@ -324,12 +323,12 @@ func (m *MLServerRepositoryHandler) findHighestVersionInPath(modelPath string) (
 	return "", nil
 }
 
-func (m *MLServerRepositoryHandler) GetModelRuntimeInfo(_ string) (*agent.ModelRuntimeInfo, error) {
+func (m *MLServerRepositoryHandler) GetModelRuntimeInfo(_ string) (*scheduler.ModelRuntimeInfo, error) {
This is perhaps a design question as opposed to something related to this PR: if for MLServer the parallelWorkersEnvVar is set at the server level, why was the choice made to expose it at the model level? All models will have the same value for a given server, so it could be exposed at the point of the server connecting to the scheduler. Is this because Triton does not follow the same principle?
It's at the model level for consistency. In the design doc, there's a concept of "server attributes", which has not been added here... it could be added in future.
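As a concrete reference for the env-var point above, here is a minimal self-contained sketch of deriving an instance count from MLServer's server-level parallel workers setting. The MLSERVER_PARALLEL_WORKERS name and the default of a single worker are assumptions for illustration, not taken from this PR:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// parallelWorkers reads MLServer's server-level parallel workers setting.
// Every model on a given server shares this value, which is why it could,
// in principle, be reported once when the server connects to the scheduler
// rather than once per model.
func parallelWorkers() uint32 {
	if v := os.Getenv("MLSERVER_PARALLEL_WORKERS"); v != "" {
		if n, err := strconv.ParseUint(v, 10, 32); err == nil && n > 0 {
			return uint32(n)
		}
	}
	return 1 // assume a single instance when the setting is absent
}

func main() {
	fmt.Println("model instances:", parallelWorkers())
}
```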
@@ -982,7 +987,7 @@ func TestUpdateModelState(t *testing.T) {
 	"server": {
 		name: "server",
 		replicas: map[int]*ServerReplica{
-			0: {loadedModels: map[ModelVersionID]bool{{Name: "foo", Version: 2}: true, {Name: "foo", Version: 1}: true}, reservedMemory: memBytes, uniqueLoadedModels: map[string]bool{"foo": true}},
+			0: {loadedModels: map[ModelVersionID]bool{{Name: "foo", Version: 2}: true, {Name: "foo", Version: 1}: true}, reservedMemory: memBytes * 2, uniqueLoadedModels: map[string]bool{"foo": true}},
This is not related to this incremental change specifically, but I am not sure we are considering instance count for reservedMemory in the actual logic. reservedMemory matters specifically in the case of parallel loads of models on a given server, before the actual memory is used, so that we can guard against overloading beyond the memory limit. We should also consider (perhaps as a follow-up PR) how we can expose the instance count "before" loading the model; for MLServer this seems simple, as per my comment about the env var, but I am not sure about Triton yet.
Yeah, that's a good point. I was thinking that getting the instance count before loading the model was merely an optimisation, but it would serve some purpose. Triton would take some effort to solve and might never work before loading the model, as GPUs will be considered when scheduling at some point.
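To make the concern above concrete, here is a small self-contained sketch (with hypothetical names, not the scheduler's actual types) of how reserved memory would have to scale with instance count for in-flight loads:

```go
package main

import "fmt"

// serverReplica sketches the reservation concern: if a model is loaded as N
// instances, the replica must reserve N times the model's memory while the
// load is in flight, or parallel loads can overshoot the memory limit.
// All names here are hypothetical.
type serverReplica struct {
	capacityBytes  uint64
	reservedMemory uint64
}

func (r *serverReplica) reserveForLoad(modelBytes uint64, instances uint32) bool {
	needed := modelBytes * uint64(instances)
	if r.reservedMemory+needed > r.capacityBytes {
		return false // would overshoot the replica's memory limit
	}
	r.reservedMemory += needed
	return true
}

func main() {
	r := &serverReplica{capacityBytes: 4 << 30} // 4 GiB replica
	fmt.Println(r.reserveForLoad(1<<30, 2))     // true: reserves 2 GiB
	fmt.Println(r.reserveForLoad(1<<30, 4))     // false: 4 GiB more would exceed capacity
}
```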
What this PR does / why we need it:
As a follow-up to #6054, the number of times a model will be loaded into memory is now taken into account when scheduling.
When the control plane initially loads a model, this information is not available until the first model event message is sent from the agent, which contains the model-specific settings. Model runtime info should only be set once per model version.
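A hedged sketch of that "set once" behaviour (the types and method below are illustrative, not the store's actual API):

```go
package main

import "fmt"

// modelVersion sketches the "set once" rule described above: runtime info
// arrives with the first model event from the agent and is never overwritten
// for the same model version. Names are illustrative.
type runtimeInfo struct {
	instanceCount uint32
}

type modelVersion struct {
	name    string
	version uint32
	runtime *runtimeInfo
}

func (mv *modelVersion) setRuntimeInfo(ri *runtimeInfo) {
	if mv.runtime != nil || ri == nil {
		return // runtime info is only recorded once per model version
	}
	mv.runtime = ri
}

func main() {
	mv := &modelVersion{name: "foo", version: 1}
	mv.setRuntimeInfo(&runtimeInfo{instanceCount: 2})
	mv.setRuntimeInfo(&runtimeInfo{instanceCount: 8}) // ignored: already set
	fmt.Println(mv.runtime.instanceCount)             // prints 2
}
```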
Which issue(s) this PR fixes:
Fixes INFRA-1146
Special notes for your reviewer:
Moved the model runtime info field to the scheduler API, as this is what's stored in the in-memory model store.