
SLA Max: Config to apply to Accepted + Launched jobs. #713

Merged
8 commits merged on Sep 14, 2024

Conversation

crioux-stripe
Collaborator

@crioux-stripe crioux-stripe commented Sep 12, 2024

Context

Stripe has a very autodeploy-focused approach to code. As a result, our Mantis jobs are forcibly submitted each time their code is updated. This becomes very painful for us when users are developing new jobs, as they may merge, dozens of times, code that will never transition from Accepted to Launched. These submissions eventually amount to a denial of service on the platform by consuming all available resources: nothing can start, on-call gets activated, and manual cleanup is necessary.

Our users have generally wondered why the SLA max isn't stopping this from happening. They expect that if they set a max of 1, the platform will not allow an unbounded number of jobs to pile up in Accepted. That is a reasonable assumption, but it is not how things work today.

A solution is to enforce the SLA on Accepted jobs as well. This pull request adds a configuration option where, if specified, the SLAMax will also apply to jobs in the Accepted state. In short, if this integer parameter is > 0, we perform a second loop in the enforceSLAMax function in which we identify Accepted jobs for removal. The test cases illustrate how this works pretty clearly.
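A minimal sketch of that second enforcement pass, assuming hypothetical `Job`/`JobState` types and a stand-in method name (the actual Mantis types in the PR differ):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only: Job, JobState, and the method name are
// hypothetical stand-ins for the actual Mantis types touched by this PR.
class SlaMaxSketch {
    enum JobState { ACCEPTED, LAUNCHED }

    record Job(int id, JobState state) {}

    // Second enforcement pass: if acceptedSlaMax > 0, mark all but the
    // newest `acceptedSlaMax` Accepted jobs for removal.
    static List<Integer> enforceAcceptedMax(List<Job> jobs, int acceptedSlaMax) {
        List<Integer> toRemove = new ArrayList<>();
        if (acceptedSlaMax <= 0) {
            return toRemove; // feature disabled by default
        }
        List<Job> accepted = jobs.stream()
            .filter(j -> j.state() == JobState.ACCEPTED)
            .sorted(Comparator.comparingInt(Job::id).reversed()) // newest first
            .toList();
        for (int i = acceptedSlaMax; i < accepted.size(); i++) {
            toRemove.add(accepted.get(i).id());
        }
        return toRemove;
    }
}
```

With a max of 1 and three Accepted jobs queued, only the newest Accepted job survives; with the parameter unset (or 0), behavior is unchanged.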

There probably should be an open discussion about how best to configure this:

  1. As I have here, adding a configuration option and passing it through the constructors. I'm open to changing the name for clarity, I just don't have a perfect one yet.
  2. By passing the entire configuration object down to the actor itself.
  3. Making this a setting on a per-job basis. This is less favorable for us, as we'd like it to be a defensive configuration from the platform perspective. An uninformed user can still accidentally cause the issue, and in fact will cause it, because this feature will be disabled by default.

Checklist

  • ./gradlew build compiles code correctly
  • Added new tests where applicable
  • ./gradlew test passes all tests
  • Extended README or added javadocs where applicable


github-actions bot commented Sep 12, 2024

Test Results

536 tests (−3): 530 ✅ (−3), 6 💤 (±0), 0 ❌ (±0)
138 suites (−1), 138 files (−1), 7 m 39 s ⏱️ (−8 s)

Results for commit 3806688. ± Comparison against base commit d54548a.

This pull request removes 5 and adds 2 tests. Note that renamed tests count towards both.
io.mantisrx.server.master.store.DynamoStoreTest ‑ testInsertAndDelete
io.mantisrx.server.master.store.DynamoStoreTest ‑ testInsertAndDeleteMoreThan25
io.mantisrx.server.master.store.DynamoStoreTest ‑ testInsertAndGetAllMoreThan25
io.mantisrx.server.master.store.DynamoStoreTest ‑ testUpsertMoreThan25andGetAllPk
io.mantisrx.server.master.store.DynamoStoreTest ‑ testUpsertOrdered
io.mantisrx.master.jobcluster.SLAEnforcerTest ‑ sla11ShouldNotLetJobsPileUpIfEnabled
io.mantisrx.master.jobcluster.SLAEnforcerTest ‑ slaMaxShouldLimitAcceptedJobsToOneIfEnabled


@crioux-stripe
Collaborator Author

crioux-stripe commented Sep 13, 2024

I was thinking about this last night and a simpler, easier to understand approach might be to rename the flag to mantis.sla.includeAcceptedInEnforcement and implement a behavior where for a max of N it will:

  • Keep up to N (newest first) jobs in Launched (call this M)
  • Keep up to Max(0, N-M) (newest first) Accepted jobs
  • Keep the latest Accepted job, regardless of max, to accommodate running jobs.
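A sketch of this alternative rule, under the same assumption of hypothetical `Job`/`JobState` stand-in types:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the alternative enforcement rule described above;
// the type and method names are illustrative, not the real Mantis code.
class IncludeAcceptedSketch {
    enum JobState { ACCEPTED, LAUNCHED }

    record Job(int id, JobState state) {}

    static List<Job> newestFirst(List<Job> jobs, JobState state) {
        return jobs.stream()
            .filter(j -> j.state() == state)
            .sorted(Comparator.comparingInt(Job::id).reversed())
            .toList();
    }

    // Returns the ids of the jobs one enforcement cycle would keep for a max of N.
    static List<Integer> keptJobs(List<Job> jobs, int n) {
        List<Job> launched = newestFirst(jobs, JobState.LAUNCHED);
        List<Job> accepted = newestFirst(jobs, JobState.ACCEPTED);
        List<Integer> kept = new ArrayList<>();
        // Keep up to N (newest first) Launched jobs; call this count M.
        int m = Math.min(n, launched.size());
        launched.subList(0, m).forEach(j -> kept.add(j.id()));
        // Keep up to Max(0, N - M) (newest first) Accepted jobs.
        int quota = Math.max(0, n - m);
        accepted.subList(0, Math.min(quota, accepted.size()))
            .forEach(j -> kept.add(j.id()));
        // Always keep the latest Accepted job, regardless of max.
        if (!accepted.isEmpty() && !kept.contains(accepted.get(0).id())) {
            kept.add(accepted.get(0).id());
        }
        return kept;
    }
}
```

Running this on the Job-1..Job-4 sequence discussed in the edit below shows the harsh edge case: with a max of 2, the two Launched jobs exhaust the quota, so Job-2 is removed even though it would actually have started.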

If this is easier to accept as a global Mantis configuration option, I'm happy to update the PR.

Edit: Hrmm. Actually, the more I think about it, the more this alternative has a lot of harsh edge cases. Imagine, under a fairly quick sequence of submissions:

Job-1 Launched
Job-2 Accepted (But will actually start)
Job-3 Accepted (But will actually start)
Job-4 Launched

This would end an SLA enforcement cycle with Job-4 and Job-1 running despite the intended effect being Job-3 and Job-4 running. Perhaps the PR is best as originally proposed.

@Andyz26
Collaborator

Andyz26 commented Sep 13, 2024

@crioux-stripe This is awesome, and I like this behavior better (it's always bad to have piled-up Accepted jobs). Shall we make this the default behavior (maybe with something less restrictive, e.g. a default allowance of 3)?

@crioux-stripe
Collaborator Author

@Andyz26 From a Stripe perspective we'd be really keen on making it the default behavior given that we:

  1. Don't have a lot of things running right now anyway.
  2. Find that people generally expect this to be the behavior anyway.

My main concern was changing the default behavior potentially impacting Netflix in unexpected ways.

@crioux-stripe
Collaborator Author

@Andyz26 Changed the default to 3 in the latest commit.

@crioux-stripe crioux-stripe merged commit a59f5f6 into Netflix:master Sep 14, 2024
4 of 5 checks passed