
SLA Max: Config to apply to Accepted + Launched jobs. #713

Merged
8 commits merged on Sep 14, 2024

Conversation

crioux-stripe
Collaborator

@crioux-stripe crioux-stripe commented Sep 12, 2024

Context

Stripe has a very autodeploy-focused approach to code. As a result, our Mantis jobs are forcibly submitted each time their code is updated. This becomes very painful for us when users are developing new jobs, as they may merge, dozens of times, code that will never transition from Accepted to Launched. These submissions eventually amount to a denial of service on the platform by consuming all available resources: nothing can start, on-call gets activated, and manual cleanup is necessary.

Our users have generally wondered why the SLA max isn't stopping this from happening. They expect that if they set a max of 1, the platform will not allow an unbounded number of jobs to pile up in Accepted. That is a reasonable assumption, but it is not how things work today.

A solution is to enforce the SLA on Accepted jobs as well. This pull request adds a configuration option where, if specified, the SLAMax will also apply to jobs in the Accepted state. In short, if this integer parameter is > 0, we perform a second loop in the enforceSLAMax function in which we identify Accepted jobs for removal. The test cases illustrate how this works pretty clearly.
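A minimal sketch of that second enforcement pass, assuming hypothetical `Job`/`JobState` types and a stand-in method name (the actual Mantis types in the PR differ):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only: Job, JobState, and the method name are
// hypothetical stand-ins for the actual Mantis types touched by this PR.
class SlaMaxSketch {
    enum JobState { ACCEPTED, LAUNCHED }

    record Job(int id, JobState state) {}

    // Second enforcement pass: if acceptedSlaMax > 0, mark all but the
    // newest `acceptedSlaMax` Accepted jobs for removal.
    static List<Integer> enforceAcceptedMax(List<Job> jobs, int acceptedSlaMax) {
        List<Integer> toRemove = new ArrayList<>();
        if (acceptedSlaMax <= 0) {
            return toRemove; // feature disabled by default
        }
        List<Job> accepted = jobs.stream()
            .filter(j -> j.state() == JobState.ACCEPTED)
            .sorted(Comparator.comparingInt(Job::id).reversed()) // newest first
            .toList();
        for (int i = acceptedSlaMax; i < accepted.size(); i++) {
            toRemove.add(accepted.get(i).id());
        }
        return toRemove;
    }
}
```

With a max of 1 and three Accepted jobs queued, only the newest Accepted job survives; with the parameter unset (or 0), behavior is unchanged.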

There probably should be an open discussion about how best to configure this:

  1. As I have here, adding a configuration option and passing it through the constructors. I'm open to changing the name for clarity, I just don't have a perfect one yet.
  2. By passing the entire configuration object down to the actor itself.
  3. Making this a setting on a per-job basis. This is less favorable for us, as we'd like it to be a defensive configuration from the platform perspective. An uninformed user can still accidentally cause the issue, and in fact will cause it, because this feature will be disabled by default.

Checklist

  • ./gradlew build compiles code correctly
  • Added new tests where applicable
  • ./gradlew test passes all tests
  • Extended README or added javadocs where applicable


github-actions bot commented Sep 12, 2024

Test Results

536 tests (−3): 530 ✅ (−3), 6 💤 (±0), 0 ❌ (±0)
138 suites (−1), 138 files (−1), 7 m 39 s ⏱️ (−8 s)

Results for commit 3806688. ± Comparison against base commit d54548a.

This pull request removes 5 and adds 2 tests. Note that renamed tests count towards both.
io.mantisrx.server.master.store.DynamoStoreTest ‑ testInsertAndDelete
io.mantisrx.server.master.store.DynamoStoreTest ‑ testInsertAndDeleteMoreThan25
io.mantisrx.server.master.store.DynamoStoreTest ‑ testInsertAndGetAllMoreThan25
io.mantisrx.server.master.store.DynamoStoreTest ‑ testUpsertMoreThan25andGetAllPk
io.mantisrx.server.master.store.DynamoStoreTest ‑ testUpsertOrdered
io.mantisrx.master.jobcluster.SLAEnforcerTest ‑ sla11ShouldNotLetJobsPileUpIfEnabled
io.mantisrx.master.jobcluster.SLAEnforcerTest ‑ slaMaxShouldLimitAcceptedJobsToOneIfEnabled


@crioux-stripe
Collaborator Author

crioux-stripe commented Sep 13, 2024

I was thinking about this last night and a simpler, easier to understand approach might be to rename the flag to mantis.sla.includeAcceptedInEnforcement and implement a behavior where for a max of N it will:

  • Keep up to N (newest first) jobs in Launched (call this M)
  • Keep up to Max(0, N-M) (newest first) Accepted jobs
  • Keep the latest Accepted job, regardless of max, to accommodate running jobs.
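A sketch of this alternative rule, under the same assumption of hypothetical `Job`/`JobState` stand-in types:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the alternative enforcement rule described above;
// the type and method names are illustrative, not the real Mantis code.
class IncludeAcceptedSketch {
    enum JobState { ACCEPTED, LAUNCHED }

    record Job(int id, JobState state) {}

    static List<Job> newestFirst(List<Job> jobs, JobState state) {
        return jobs.stream()
            .filter(j -> j.state() == state)
            .sorted(Comparator.comparingInt(Job::id).reversed())
            .toList();
    }

    // Returns the ids of the jobs one enforcement cycle would keep for a max of N.
    static List<Integer> keptJobs(List<Job> jobs, int n) {
        List<Job> launched = newestFirst(jobs, JobState.LAUNCHED);
        List<Job> accepted = newestFirst(jobs, JobState.ACCEPTED);
        List<Integer> kept = new ArrayList<>();
        // Keep up to N (newest first) Launched jobs; call this count M.
        int m = Math.min(n, launched.size());
        launched.subList(0, m).forEach(j -> kept.add(j.id()));
        // Keep up to Max(0, N - M) (newest first) Accepted jobs.
        int quota = Math.max(0, n - m);
        accepted.subList(0, Math.min(quota, accepted.size()))
            .forEach(j -> kept.add(j.id()));
        // Always keep the latest Accepted job, regardless of max.
        if (!accepted.isEmpty() && !kept.contains(accepted.get(0).id())) {
            kept.add(accepted.get(0).id());
        }
        return kept;
    }
}
```

Running this on the Job-1..Job-4 sequence discussed in the edit below shows the harsh edge case: with a max of 2, the two Launched jobs exhaust the quota, so Job-2 is removed even though it would actually have started.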

If this is easier to accept as a global Mantis configuration option, I'm happy to update the PR.

Edit: Hrmm. Actually, the more I think about it, the more this alternative has a lot of harsh edge cases. Imagine, under a fairly quick sequence of submissions:

Job-1 Launched
Job-2 Accepted (But will actually start)
Job-3 Accepted (But will actually start)
Job-4 Launched

This would end an SLA enforcement cycle with Job-4 and Job-1 running despite the intended effect being Job-3 and Job-4 running. Perhaps the PR is best as originally proposed.

@Andyz26
Collaborator

Andyz26 commented Sep 13, 2024

@crioux-stripe This is awesome, and I like this behavior better (it's always bad to have piled-up Accepted jobs). Shall we make this the default behavior (maybe with something less restrictive, e.g. a default allowance of 3)?

@crioux-stripe
Collaborator Author

@Andyz26 From a Stripe perspective we'd be really keen on making it the default behavior given that we:

  1. Don't have a lot of things running right now anyway.
  2. Find that people generally expect this to be the behavior anyway.

My main concern was changing the default behavior potentially impacting Netflix in unexpected ways.

@crioux-stripe
Collaborator Author

@Andyz26 Changed the default to 3 in the latest commit.

@crioux-stripe crioux-stripe merged commit a59f5f6 into Netflix:master Sep 14, 2024
4 of 5 checks passed