
feat(scheduler): account for number of model instances when scheduling #6183

Conversation

@driev driev commented Jan 2, 2025

What this PR does / why we need it:

As a follow-up from #6054, the number of times a model will be loaded into memory is now taken into account when scheduling.

When the control plane initially loads a model, this information is not available until the first model event message is sent from the agent, which contains the model-specific settings. Model runtime info should only be set once per model version.
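Conceptually, the new rule amounts to multiplying a model's memory requirement by its reported instance count. A minimal Go sketch (the type and function names below are illustrative assumptions, not the actual scheduler API):

```go
package main

import "fmt"

// ModelRuntimeInfo mirrors the idea in this PR: the agent reports how many
// copies (instances) of a model the inference server will load. The field
// name here is illustrative, not the actual scheduler API.
type ModelRuntimeInfo struct {
	InstanceCount uint32
}

// effectiveMemoryBytes sketches the new scheduling rule: a model loaded N
// times needs N times its memory. Until the agent sends its first model
// event, the instance count is unknown, so it falls back to 1.
func effectiveMemoryBytes(modelBytes uint64, rt *ModelRuntimeInfo) uint64 {
	instances := uint64(1)
	if rt != nil && rt.InstanceCount > 0 {
		instances = uint64(rt.InstanceCount)
	}
	return modelBytes * instances
}

func main() {
	fmt.Println(effectiveMemoryBytes(512, nil))                                 // 512
	fmt.Println(effectiveMemoryBytes(512, &ModelRuntimeInfo{InstanceCount: 4})) // 2048
}
```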

Which issue(s) this PR fixes:

Fixes # INFRA-1146

Special notes for your reviewer:

Moved the model runtime info field to the scheduler API, as this is what's stored in the in-memory model store.

@driev driev requested review from sakoush and lc525 as code owners January 2, 2025 14:28
@driev driev added the v2 label Jan 2, 2025
@sakoush sakoush left a comment


LGTM. I left a couple of comments for a potential follow-up PR.

@@ -683,6 +684,7 @@ func (c *Client) UnloadModel(request *agent.ModelOperationMessage, timestamp int
defer c.modelTimestamps.Store(modelWithVersion, timestamp)

// we do not care about model versions here
// model runtime info is retrieved from the existing version, so nil is passed here
@sakoush (Member) commented:

Is this comment relevant?

@driev (Author) replied:

It is... the model runtime info param in getModifiedModelVersion is nil.

mv := proto.Clone(originalModelVersion)
mv.(*agent.ModelVersion).Model.Meta.Name = modelId
if mv.(*agent.ModelVersion).Model.ModelSpec != nil && modelRuntimeInfo != nil {
@sakoush (Member) commented:

Can ModelSpec actually be nil, or is it just a safeguard?

Maybe add a note, as readers could otherwise think that runtime info can be set while the model spec is nil.

@driev (Author) replied:

I don't think it can be in practice, but there are some tests where it's not set. I can change this so it's always set.

@@ -324,12 +323,12 @@ func (m *MLServerRepositoryHandler) findHighestVersionInPath(modelPath string) (
return "", nil
}

- func (m *MLServerRepositoryHandler) GetModelRuntimeInfo(_ string) (*agent.ModelRuntimeInfo, error) {
+ func (m *MLServerRepositoryHandler) GetModelRuntimeInfo(_ string) (*scheduler.ModelRuntimeInfo, error) {
@sakoush (Member) commented:

This is perhaps a design question as opposed to something related to this PR:

If for MLServer we get the parallelWorkersEnvVar set at the server level, why was the choice made to handle this at the model level (given that all models will have the same value for a given server, and it could be exposed at the point of the server connecting to the scheduler)?

Is this because Triton does not follow the same principle?

@driev (Author) replied:

It's at the model level for consistency. In the design doc, there's a concept of "server attributes", which has not been added here... it could be added in the future.
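For context, the MLServer path amounts to reading a server-level env var and reporting it per model. A minimal sketch, assuming an env var along the lines of the parallelWorkersEnvVar mentioned above (the exact name and fallback behaviour are assumptions, not the actual handler code):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// getInstanceCount is a rough sketch of how a repository handler could derive
// a model's instance count from MLServer's server-level parallel workers
// setting. The env var name and the fallback to 1 are assumptions drawn from
// the review discussion, not the actual implementation.
func getInstanceCount() uint32 {
	const parallelWorkersEnvVar = "MLSERVER_PARALLEL_WORKERS" // assumed name
	if v, ok := os.LookupEnv(parallelWorkersEnvVar); ok {
		if n, err := strconv.ParseUint(v, 10, 32); err == nil && n > 0 {
			return uint32(n)
		}
	}
	return 1 // default: a single copy of the model
}

func main() {
	fmt.Println(getInstanceCount())
}
```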

@@ -982,7 +987,7 @@ func TestUpdateModelState(t *testing.T) {
"server": {
name: "server",
replicas: map[int]*ServerReplica{
- 0: {loadedModels: map[ModelVersionID]bool{{Name: "foo", Version: 2}: true, {Name: "foo", Version: 1}: true}, reservedMemory: memBytes, uniqueLoadedModels: map[string]bool{"foo": true}},
+ 0: {loadedModels: map[ModelVersionID]bool{{Name: "foo", Version: 2}: true, {Name: "foo", Version: 1}: true}, reservedMemory: memBytes * 2, uniqueLoadedModels: map[string]bool{"foo": true}},
@sakoush (Member) commented:

This is not related to this incremental change specifically, but I am not sure we are considering the instance count for reservedMemory in the actual logic.

Specifically, reservedMemory guards the case of parallel loads of models on a given server before the actual memory is used, so that we don't overload a replica beyond its memory limit. We should also consider (perhaps as a follow-up PR) how we can expose the instance count before loading the model. For MLServer this seems simple, as per my comment about the env var, but I am not sure about Triton yet.

@driev (Author) replied:

Yeah, that's a good point. I was thinking that getting the instance count before loading the model was merely an optimisation, but it would serve a real purpose. Triton would take some effort to solve, and it might never work before loading the model, as GPUs will at some point be considered when scheduling.
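To make the reservedMemory concern concrete, here is a minimal sketch (hypothetical types and names, not the scheduler's actual ones) of reserving memory scaled by instance count before a load completes, so that parallel loads cannot oversubscribe a replica:

```go
package main

import (
	"fmt"
	"sync"
)

// serverReplica sketches the reservedMemory guard discussed above: memory is
// reserved up front, scaled by instance count, so concurrent scheduling
// decisions cannot oversubscribe a replica before loads complete. Names and
// structure are illustrative, not the scheduler's actual types.
type serverReplica struct {
	mu             sync.Mutex
	memoryBytes    uint64 // total capacity
	reservedMemory uint64 // held for in-flight loads
}

// tryReserve reserves modelBytes * instances, failing if it would exceed capacity.
func (r *serverReplica) tryReserve(modelBytes uint64, instances uint32) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	needed := modelBytes * uint64(instances)
	if r.reservedMemory+needed > r.memoryBytes {
		return false // would overload the replica
	}
	r.reservedMemory += needed
	return true
}

func main() {
	r := &serverReplica{memoryBytes: 1024}
	fmt.Println(r.tryReserve(512, 2)) // true: reserves 1024 (512 bytes x 2 instances)
	fmt.Println(r.tryReserve(1, 1))   // false: replica is fully reserved
}
```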

@driev driev merged commit 616cddc into SeldonIO:v2 Jan 8, 2025
3 checks passed
@driev driev deleted the INFRA-1146/filter-by-number-of-workers-when-scheduling-pt2 branch January 8, 2025 16:10