SLA Max: Config to apply to Accepted + Launched jobs. #713
Test Results: 536 tests (-3), 530 ✅ (-3), 7m 39s ⏱️ (-8s)
Results for commit 3806688. ± Comparison against base commit d54548a.
This pull request removes 5 and adds 2 tests. Note that renamed tests count towards both.
♻️ This comment has been updated with latest results.
I was thinking about this last night and a simpler, easier to understand approach might be to rename the flag to
If this is easier to accept as a global Mantis configuration option, I'm happy to update the PR.
Edit: Hrmm. Actually, the more I think about it, the more harsh edge cases this alternative has. Imagine under a fairly quick sequence of submissions: Job-1 Launched
This would end an SLA enforcement cycle with Job-4 and Job-1 running, despite the intended effect being Job-3 and Job-4 running. Perhaps the PR is best as originally proposed.
@crioux-stripe This is awesome and I like this behavior better (it's always bad to have piled-up accepted jobs). Shall we make this the default behavior (maybe with something less restrictive, e.g. a default allowance of 3)?
@Andyz26 From a Stripe perspective we'd be really keen on making it the default behavior given that we:
My main concern was changing the default behavior potentially impacting Netflix in unexpected ways.
@Andyz26 Changed the default to 3 in the latest commit.
Context
Stripe has a very autodeploy-focused approach to code. As a result, our Mantis jobs are forcibly submitted each time their code is updated. This becomes very painful for us when users are developing new jobs, as they may merge code that will never transition from Accepted to Launched dozens of times. Eventually they effectively denial-of-service the platform by consuming all available resources. Nothing can start, on-call gets activated, and manual cleanup is necessary.
Our users have generally wondered why the SLA max isn't stopping this from happening. They expect that if they set a max of 1, the platform will not allow an unbounded number of jobs to pile up in Accepted. It is a reasonable assumption, but not how things work today.
A solution is to enforce the SLA on Accepted jobs as well. This pull request adds a configuration option which, if specified, makes the SLA max also apply to jobs in the Accepted state. In short, if this integer parameter is > 0, we perform a second loop in the enforceSLAMax function in which we identify Accepted jobs for removal. The test cases illustrate how this works pretty clearly.
There probably should be an open discussion about how best to configure this:
Checklist
./gradlew build compiles code correctly
./gradlew test passes all tests
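As a rough illustration of the second loop described under Context, here is a minimal, self-contained sketch. It is not the code from this PR or from Mantis: the class `SlaMaxSketch`, the `Job` type, the `jobsToRemove` method, and the `acceptedAllowance` parameter name are all hypothetical, and it assumes that job ids increase with submission order and that the newest jobs are the ones to keep.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only: every name here is hypothetical, not a Mantis API.
final class SlaMaxSketch {
    enum State { ACCEPTED, LAUNCHED }

    static final class Job {
        final long id;      // assumption: larger id means submitted later
        final State state;
        Job(long id, State state) { this.id = id; this.state = state; }
    }

    /**
     * Returns the ids of jobs to remove, in ascending order.
     * slaMax caps the number of Launched jobs kept; acceptedAllowance caps
     * the number of Accepted jobs kept and is ignored when it is <= 0.
     */
    static List<Long> jobsToRemove(List<Job> jobs, int slaMax, int acceptedAllowance) {
        List<Job> newestFirst = new ArrayList<>(jobs);
        newestFirst.sort(Comparator.comparingLong((Job j) -> j.id).reversed());

        List<Long> toRemove = new ArrayList<>();

        // First loop: keep the newest slaMax Launched jobs, mark the rest.
        int launchedKept = 0;
        for (Job job : newestFirst) {
            if (job.state != State.LAUNCHED) continue;
            if (launchedKept < slaMax) launchedKept++;
            else toRemove.add(job.id);
        }

        // Second loop: runs only when the (hypothetical) allowance is > 0.
        // Keep the newest acceptedAllowance Accepted jobs, mark the rest.
        if (acceptedAllowance > 0) {
            int acceptedKept = 0;
            for (Job job : newestFirst) {
                if (job.state != State.ACCEPTED) continue;
                if (acceptedKept < acceptedAllowance) acceptedKept++;
                else toRemove.add(job.id);
            }
        }

        toRemove.sort(Comparator.naturalOrder());
        return toRemove;
    }
}
```

Under these assumptions, one Launched job (id 1) followed by three Accepted jobs (ids 2 to 4), with `slaMax = 1` and an allowance of 1, marks jobs 2 and 3 for removal. With an allowance of 0 the second loop is skipped and nothing is removed, which matches the pile-up behavior described in Context.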