Fix limiter token tracking (again) #432
Merged
What
Fixes a regression in limiter job counting.
Uses the k8s Informer more effectively to count jobs-in-flight accurately.
Related to #302 (but probably not a fix for it).
Why
In the pre-~v0.16 paradigm, the limiter had two roles: deduplicate jobs and enforce the max-in-flight limit. This made sense because deduplicating jobs required tracking the running jobs in a map, and the same map could be used to count them. Later, I changed the mechanism for waking goroutines waiting for jobs to finish from a sync.Cond to a channel (a "token bucket"), so that waiting for capacity could be cancelled via context. In v0.19 I split the old limiter into the new limiter and deduper to make it easier to understand. But I didn't entirely finish the job: the deduping map had the side effect of preventing tokens from being counted multiple times, and because the new limiter has no deduping map, it can easily double-count jobs as they start and finish.
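For reference, the token-bucket mechanism works roughly like the minimal sketch below. The Limiter type and method names here are illustrative, not the actual limiter API:

```go
// Minimal sketch of the token bucket: a buffered channel holds one token per
// available slot, so waiting for capacity is a channel receive that a context
// can cancel (which sync.Cond could not do).
package limiter

import "context"

type Limiter struct {
	tokens chan struct{} // capacity == MaxInFlight
}

func New(maxInFlight int) *Limiter {
	l := &Limiter{tokens: make(chan struct{}, maxInFlight)}
	for i := 0; i < maxInFlight; i++ {
		l.tokens <- struct{}{} // start with a full bucket
	}
	return l
}

// take blocks until a token is available or ctx is cancelled.
func (l *Limiter) take(ctx context.Context) error {
	select {
	case <-l.tokens:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// release returns a token without blocking; if the bucket is somehow already
// full, the extra token is dropped rather than miscounted.
func (l *Limiter) release() {
	select {
	case l.tokens <- struct{}{}:
	default:
	}
}
```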
To count jobs correctly we need to observe the following (a sketch follows the list):
- Taking tokens in OnAdd after the cache sync has finished is double-counting, because we are already taking tokens in Handle.
- Returning tokens in OnDelete is probably wrong, except in the case where the Informer has missed an update from not-finished to finished.
- Taking or returning tokens in OnUpdate should only happen when a job changes state between not-finished and finished, not every time the handler is called.
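Continuing the Limiter sketch above, the handler rules look roughly like this, shaped like client-go's cache.ResourceEventHandler (whose OnAdd gained an isInInitialList parameter in recent releases). The jobFinished helper and the wiring are illustrative assumptions, not the actual agent-stack-k8s code:

```go
// Sketch of the event handler rules, in the same illustrative limiter package.
package limiter

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// jobFinished reports whether the Job has reached a terminal condition.
func jobFinished(job *batchv1.Job) bool {
	for _, c := range job.Status.Conditions {
		if (c.Type == batchv1.JobComplete || c.Type == batchv1.JobFailed) &&
			c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}

// OnAdd only takes a token for unfinished jobs seen during the initial cache
// sync; after the sync, Handle has already taken a token for each job it starts.
func (l *Limiter) OnAdd(obj any, isInInitialList bool) {
	job, ok := obj.(*batchv1.Job)
	if !ok || !isInInitialList || jobFinished(job) {
		return
	}
	select {
	case <-l.tokens: // take without blocking
	default: // more running jobs than MaxInFlight: nothing left to take
	}
}

// OnUpdate only returns a token when the job crosses the not-finished ->
// finished boundary, not on every resync or unrelated status change.
func (l *Limiter) OnUpdate(oldObj, newObj any) {
	oldJob, okOld := oldObj.(*batchv1.Job)
	newJob, okNew := newObj.(*batchv1.Job)
	if !okOld || !okNew {
		return
	}
	if !jobFinished(oldJob) && jobFinished(newJob) {
		l.release()
	}
}

// OnDelete only returns a token if the job disappeared while still
// not-finished, i.e. the Informer missed the update to a finished state.
// (Real code would also unwrap cache.DeletedFinalStateUnknown here.)
func (l *Limiter) OnDelete(obj any) {
	job, ok := obj.(*batchv1.Job)
	if !ok || jobFinished(job) {
		return
	}
	l.release()
}
```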
Testing
I manually tested locally on Orbstack with a pipeline that generates 100 trivial jobs, configured with checkout skipped and default MaxInFlight (25).
Time spent running jobs (not including waiting for agent):