quicker unhappy jobs #233

Open
cortlandstarrett opened this issue Jun 20, 2024 · 2 comments

cortlandstarrett commented Jun 20, 2024

At present, an unhappy job finishes after a timeout of the hanging job timer.
The rationale is to allow sufficient time for a critical event to arrive, which triggers an alarm condition. This makes sense, but it has several costs:

  1. It reuses a timer that is intended for other purposes.
  2. It is not configurable separately from the hanging job.
  3. It is a long timer, so unhappy jobs are held in memory for a long time. This increases the number of concurrent jobs and could become a memory risk. At 50 jobs per second and a 30-second hanging job timer, up to 1500 jobs (50 × 30) could be waiting to end. This impacts our max jobs per worker setting.
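
The arithmetic in item 3 can be checked with a small sketch. The function name and the `unhappy_fraction` parameter are hypothetical and not part of the project. The 1500 figure corresponds to the worst case where every job is unhappy.

```python
def pending_unhappy_jobs(
    jobs_per_second: float,
    hold_seconds: float,
    unhappy_fraction: float = 1.0,
) -> float:
    """Estimate how many unhappy jobs are held in memory at steady state.

    Each unhappy job is held for hold_seconds before it ends, so the number
    waiting is (arrival rate of unhappy jobs) * (hold time).
    unhappy_fraction is a hypothetical knob; 1.0 is the worst case.
    """
    if jobs_per_second < 0 or hold_seconds < 0:
        raise ValueError("rate and hold time must be non-negative")
    if not 0.0 <= unhappy_fraction <= 1.0:
        raise ValueError("unhappy_fraction must be between 0 and 1")
    return jobs_per_second * unhappy_fraction * hold_seconds


# Worst case from the issue: 50 jobs/s held for a 30 s hanging job timer.
worst_case = pending_unhappy_jobs(50, 30)

# The same load with an assumed, much shorter 2 s timer.
short_timer = pending_unhappy_jobs(50, 2)
```

Under these assumptions the backlog scales linearly with the timer length, which is why a shorter, separately configurable timer reduces the memory risk.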

It might be good to add a configuration value for this timer.
Another option is to use the intra-event timer which can be very short.
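
A minimal sketch of the first option, assuming a dedicated setting that falls back to the hanging job timer when unset. The class and field names here are hypothetical and do not reflect the project's actual configuration schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TimerConfig:
    """Hypothetical timer settings, in seconds."""

    hanging_job_seconds: float
    # Assumed new, optional setting for ending unhappy jobs.
    unhappy_job_seconds: Optional[float] = None

    def effective_unhappy_job_seconds(self) -> float:
        """Return the timeout used to end an unhappy job.

        Falls back to the hanging job timer when no dedicated value is
        configured, which matches the behavior described in this issue.
        """
        if self.unhappy_job_seconds is None:
            return self.hanging_job_seconds
        return self.unhappy_job_seconds


legacy = TimerConfig(hanging_job_seconds=30.0)
tuned = TimerConfig(hanging_job_seconds=30.0, unhappy_job_seconds=2.0)
```

The fallback keeps existing deployments unchanged while letting the unhappy job timeout be tuned independently.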

One idea is to let the unhappy job finish quickly and instead detect critical events in the Job Gone Horribly Wrong state, which is entered if a "stray event" from a previous job arrives.

@cortlandstarrett

This implies that a transition from JobFailed to HorriblyWrong is needed, so that a failed job can be upgraded to alarm when a critical event arrives after failure.
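
The proposed transition could be sketched as below. Only the state names JobFailed and HorriblyWrong come from this thread. The enum, the `IN_PROGRESS` state, and the `next_state` function are hypothetical simplifications, not the actual state model.

```python
from enum import Enum, auto


class JobState(Enum):
    IN_PROGRESS = auto()      # hypothetical placeholder for a running job
    JOB_FAILED = auto()       # "JobFailed" in the issue
    HORRIBLY_WRONG = auto()   # "HorriblyWrong" in the issue (alarm)


def next_state(state: JobState, event_is_critical: bool) -> JobState:
    """Return the state after an event arrives for a job.

    Only the proposed transition is modeled: a critical event arriving
    after failure upgrades JobFailed to HorriblyWrong. Every other
    combination leaves the state unchanged in this sketch.
    """
    if state is JobState.JOB_FAILED and event_is_critical:
        return JobState.HORRIBLY_WRONG
    return state


# A failed job that later receives a critical event is upgraded to alarm.
upgraded = next_state(JobState.JOB_FAILED, event_is_critical=True)
```

With such a transition, the job can be failed quickly and the alarm decision deferred until a critical event actually shows up.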

@cortlandstarrett

For 1.4.0, the timer that ends an unhappy job was moved from the JobHanging timer to the InvariantLoad timer.

In the future, a separate, purpose-specific timer should be added.
