
feat(scheduler): account for number of model instances when scheduling #6183

Conversation

@driev driev commented Jan 2, 2025

What this PR does / why we need it:

As a follow-up from #6054, the number of times a model will be loaded into memory is now taken into account when scheduling.

When the control plane initially loads a model, this information is not available until the first model event message is sent from the agent, which contains the model-specific settings. Model runtime info should only be set once per model version.
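Conceptually, the new rule amounts to multiplying a model's memory requirement by its reported instance count. A minimal Go sketch (the type and function names below are illustrative assumptions, not the actual scheduler API):

```go
package main

import "fmt"

// ModelRuntimeInfo mirrors the idea in this PR: the agent reports how many
// copies (instances) of a model the inference server will load. The field
// name here is illustrative, not the actual scheduler API.
type ModelRuntimeInfo struct {
	InstanceCount uint32
}

// effectiveMemoryBytes sketches the new scheduling rule: a model loaded N
// times needs N times its memory. Until the agent sends its first model
// event, the instance count is unknown, so it falls back to 1.
func effectiveMemoryBytes(modelBytes uint64, rt *ModelRuntimeInfo) uint64 {
	instances := uint64(1)
	if rt != nil && rt.InstanceCount > 0 {
		instances = uint64(rt.InstanceCount)
	}
	return modelBytes * instances
}

func main() {
	fmt.Println(effectiveMemoryBytes(512, nil))                                 // 512
	fmt.Println(effectiveMemoryBytes(512, &ModelRuntimeInfo{InstanceCount: 4})) // 2048
}
```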

Which issue(s) this PR fixes:

Fixes # INFRA-1146

Special notes for your reviewer:

Moved the model runtime info field to the scheduler API, as this is what's stored in the in-memory model store.

@driev driev requested review from sakoush and lc525 as code owners January 2, 2025 14:28
@driev driev added the v2 label Jan 2, 2025
@sakoush sakoush left a comment


LGTM. I left a couple of comments for a potential follow-up PR.

@@ -683,6 +684,7 @@ func (c *Client) UnloadModel(request *agent.ModelOperationMessage, timestamp int
defer c.modelTimestamps.Store(modelWithVersion, timestamp)

// we do not care about model versions here
// model runtime info is retrieved from the existing version, so nil is passed here
@sakoush (Member) commented:

Is this comment relevant?

@driev (Author) replied:

It is... the model runtime info param in getModifiedModelVersion is nil.

mv := proto.Clone(originalModelVersion)
mv.(*agent.ModelVersion).Model.Meta.Name = modelId
if mv.(*agent.ModelVersion).Model.ModelSpec != nil && modelRuntimeInfo != nil {
@sakoush (Member) commented:

Can ModelSpec actually be nil, or is it just a safeguard?

Maybe add a note, as readers could otherwise think that runtime info can be set while the model spec is nil.

@driev (Author) replied:

I don't think it can be in practice, but there are some tests where it's not set. I can change this so it's always set.

@@ -324,12 +323,12 @@ func (m *MLServerRepositoryHandler) findHighestVersionInPath(modelPath string) (
return "", nil
}

- func (m *MLServerRepositoryHandler) GetModelRuntimeInfo(_ string) (*agent.ModelRuntimeInfo, error) {
+ func (m *MLServerRepositoryHandler) GetModelRuntimeInfo(_ string) (*scheduler.ModelRuntimeInfo, error) {
@sakoush (Member) commented:

This is perhaps a design question as opposed to something related to this PR:

If for MLServer we get the parallelWorkersEnvVar set at the server level, why was the choice made to handle this at the model level (given that all models will have the same value for a given server, and it could be exposed at the point of the server connecting to the scheduler)?

Is this because Triton does not follow the same principle?

@driev (Author) replied:

It's at the model level for consistency. In the design doc, there's a concept of "server attributes", which has not been added here... it could be added in the future.
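For context, the MLServer path amounts to reading a server-level env var and reporting it per model. A minimal sketch, assuming an env var along the lines of the parallelWorkersEnvVar mentioned above (the exact name and fallback behaviour are assumptions, not the actual handler code):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// getInstanceCount is a rough sketch of how a repository handler could derive
// a model's instance count from MLServer's server-level parallel workers
// setting. The env var name and the fallback to 1 are assumptions drawn from
// the review discussion, not the actual implementation.
func getInstanceCount() uint32 {
	const parallelWorkersEnvVar = "MLSERVER_PARALLEL_WORKERS" // assumed name
	if v, ok := os.LookupEnv(parallelWorkersEnvVar); ok {
		if n, err := strconv.ParseUint(v, 10, 32); err == nil && n > 0 {
			return uint32(n)
		}
	}
	return 1 // default: a single copy of the model
}

func main() {
	fmt.Println(getInstanceCount())
}
```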

@@ -982,7 +987,7 @@ func TestUpdateModelState(t *testing.T) {
"server": {
name: "server",
replicas: map[int]*ServerReplica{
- 0: {loadedModels: map[ModelVersionID]bool{{Name: "foo", Version: 2}: true, {Name: "foo", Version: 1}: true}, reservedMemory: memBytes, uniqueLoadedModels: map[string]bool{"foo": true}},
+ 0: {loadedModels: map[ModelVersionID]bool{{Name: "foo", Version: 2}: true, {Name: "foo", Version: 1}: true}, reservedMemory: memBytes * 2, uniqueLoadedModels: map[string]bool{"foo": true}},
@sakoush (Member) commented:

This is not related to this incremental change specifically, but I am not sure we are considering the instance count for reservedMemory in the actual logic.

Specifically, reservedMemory guards the case of parallel loads of models on a given server before the actual memory is used, so that we don't overload a replica beyond its memory limit. We should also consider (perhaps as a follow-up PR) how we can expose the instance count before loading the model. For MLServer this seems simple, as per my comment about the env var, but I am not sure about Triton yet.

@driev (Author) replied:

Yeah, that's a good point. I was thinking that getting the instance count before loading the model was merely an optimisation, but it would serve a real purpose. Triton would take some effort to solve, and it might never work before loading the model, as GPUs will at some point be considered when scheduling.
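To make the reservedMemory concern concrete, here is a minimal sketch (hypothetical types and names, not the scheduler's actual ones) of reserving memory scaled by instance count before a load completes, so that parallel loads cannot oversubscribe a replica:

```go
package main

import (
	"fmt"
	"sync"
)

// serverReplica sketches the reservedMemory guard discussed above: memory is
// reserved up front, scaled by instance count, so concurrent scheduling
// decisions cannot oversubscribe a replica before loads complete. Names and
// structure are illustrative, not the scheduler's actual types.
type serverReplica struct {
	mu             sync.Mutex
	memoryBytes    uint64 // total capacity
	reservedMemory uint64 // held for in-flight loads
}

// tryReserve reserves modelBytes * instances, failing if it would exceed capacity.
func (r *serverReplica) tryReserve(modelBytes uint64, instances uint32) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	needed := modelBytes * uint64(instances)
	if r.reservedMemory+needed > r.memoryBytes {
		return false // would overload the replica
	}
	r.reservedMemory += needed
	return true
}

func main() {
	r := &serverReplica{memoryBytes: 1024}
	fmt.Println(r.tryReserve(512, 2)) // true: reserves 1024 (512 bytes x 2 instances)
	fmt.Println(r.tryReserve(1, 1))   // false: replica is fully reserved
}
```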

@driev driev merged commit 616cddc into SeldonIO:v2 Jan 8, 2025
3 checks passed
@driev driev deleted the INFRA-1146/filter-by-number-of-workers-when-scheduling-pt2 branch January 8, 2025 16:10