At present, an unhappy job finishes only when the hanging job timer expires.
The rationale is to allow sufficient time for a critical event to arrive, which triggers an alarm condition. This makes sense, but it has a few costs:
It reuses a timer that is intended for other purposes.
It is not configurable separately from the hanging job timer.
It is a long timer, so unhappy jobs are held in memory for a long time. This increases the number of concurrent jobs and could become a memory risk. At 50 jobs per second and a 30 second hanging job timer, up to 1500 jobs could be waiting to end. This impacts our max jobs per worker setting.
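The 1500 figure follows from multiplying the arrival rate by the hold time (Little's law). A minimal sketch of that arithmetic, assuming the worst case where every job is unhappy; the function name is hypothetical and the 2 second value is only an illustrative shorter timer:

```python
def jobs_waiting_to_end(jobs_per_second: float, hold_seconds: float) -> float:
    """Steady-state count of unhappy jobs held in memory.

    Little's law: jobs in the system = arrival rate * time each job is held.
    Assumes (worst case) that every arriving job is unhappy.
    """
    return jobs_per_second * hold_seconds


# Figures from the issue: 50 jobs per second, 30 second hanging job timer.
print(jobs_waiting_to_end(50, 30))  # 1500

# Illustrative only: the same load with a hypothetical 2 second timer.
print(jobs_waiting_to_end(50, 2))   # 100
```

The backlog scales linearly with the timer, so any shorter timer for unhappy jobs reduces the held jobs in the same proportion.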
It might be good to add a configuration value for this timer.
Another option is to use the intra-event timer, which can be very short.
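A separate configuration value could look roughly like the sketch below. All names and default values here (`TimerConfig`, `unhappy_job_timer_s`, and so on) are hypothetical and not taken from the actual configuration. The fallback to the hanging job timer is an assumption that keeps the present behaviour when the new value is not set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TimerConfig:
    # Hypothetical field names and defaults, for illustration only.
    hanging_job_timer_s: float = 30.0
    unhappy_job_timer_s: Optional[float] = None  # proposed new setting

    def unhappy_timeout(self) -> float:
        """Return how long an unhappy job is held before it finishes."""
        if self.unhappy_job_timer_s is not None:
            return self.unhappy_job_timer_s
        # Not configured: fall back to the hanging job timer, as happens today.
        return self.hanging_job_timer_s


# Existing deployments keep the current behaviour.
default_cfg = TimerConfig()
# A deployment that opts in to a shorter hold for unhappy jobs.
short_cfg = TimerConfig(unhappy_job_timer_s=2.0)
```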
One idea is to let the unhappy job finish quickly, but detect critical events in the Job Gone Horribly Wrong state, which is entered if a "stray event" from a previous job arrives.
This implies that a transition from JobFailed to HorriblyWrong is needed, so that a failed job can be upgraded to alarm when a critical event arrives after failure.
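The transition could be sketched as follows. This is only an illustration under assumptions: the class, method, and enum names are hypothetical, and it models only the proposed JobFailed to HorriblyWrong upgrade. How non-critical stray events are handled is left out, so in this sketch they leave the state unchanged.

```python
from enum import Enum, auto
from typing import Dict, List, Optional


class JobState(Enum):
    JOB_FAILED = auto()
    HORRIBLY_WRONG = auto()


class FinishedJobs:
    """Tracks jobs that finished unhappy so late events can still upgrade them."""

    def __init__(self) -> None:
        self.states: Dict[str, JobState] = {}
        self.alarms: List[str] = []

    def finish_unhappy(self, job_id: str) -> None:
        # The unhappy job ends quickly instead of waiting for the long timer.
        self.states[job_id] = JobState.JOB_FAILED

    def on_stray_event(self, job_id: str, critical: bool) -> Optional[JobState]:
        """Handle an event that arrives after its job has already finished."""
        state = self.states.get(job_id)
        if state is None:
            return None  # unknown job: out of scope for this sketch
        if state is JobState.JOB_FAILED and critical:
            # Proposed transition: upgrade the failed job to an alarm.
            self.states[job_id] = JobState.HORRIBLY_WRONG
            self.alarms.append(job_id)
        return self.states[job_id]
```

With this shape, the memory held per finished job is a single state entry rather than the whole job, although how long such entries would be retained is still an open question.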