Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix limiter token tracking (again) #432

Merged
merged 1 commit into from
Nov 24, 2024

Conversation

DrJosh9000
Copy link
Contributor

@DrJosh9000 DrJosh9000 commented Nov 22, 2024

What

Fixes a regression in limiter job counting.

Use k8s Informer more effectively to count jobs-in-flight accurately.

Related to #302 (but probably not a fix for it).

Why

In the pre ~v0.16 paradigm, the limiter had two roles: deduplicate jobs, and enforce the max-in-flight limit. This made sense because to deduplicate jobs required tracking the running jobs in a map, and the map could count running jobs. Later, I changed the mechanism for waking goroutines waiting for jobs to finish from a sync.Cond to a channel ("token bucket"), so that waiting for capacity could be cancelled via context.

In v0.19 I split the old limiter into the new limiter and deduper to make it easier to understand. But I didn't entirely finish the job: the deduping map had the effect of preventing multiple-counting tokens, and because the new limiter didn't have a deduping map, it can easily double-count starting and finishing jobs.

To count jobs correctly we need to:

  • take a token only when a job is started
  • return a token only when a job is finished

Taking tokens in OnAdd after the cache sync is finished is double-counting, because we're already taking tokens in Handle. Similarly, returning tokens in OnDelete is probably wrong except in the case where the Informer has missed some updates from not-finished to finished. Finally, taking or returning tokens in OnUpdate should only happen when a job changes state between not-finished and finished, not whenever it gets called.

Testing

I manually tested locally on Orbstack with a pipeline that generates 100 trivial jobs, configured with checkout skipped and default MaxInFlight (25).

Time spent running jobs (not including waiting for agent):

  • v0.18.0: 46s
  • v0.19.2: 4m57s
  • v0.19.2-4-g878ddf3: 52s

@DrJosh9000 DrJosh9000 enabled auto-merge November 22, 2024 06:50
@DrJosh9000 DrJosh9000 force-pushed the fix-limiter-token-tracking-again branch from 8c5b7f2 to 878ddf3 Compare November 22, 2024 07:10
Copy link
Contributor

@wolfeidau wolfeidau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻 🚀

@DrJosh9000 DrJosh9000 merged commit eb355cc into main Nov 24, 2024
1 check passed
@DrJosh9000 DrJosh9000 deleted the fix-limiter-token-tracking-again branch November 24, 2024 22:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants