Policy monitor load issues #3254
Comments
The unified throttle seems to solve the 10k perf test for the main branch. The underlying (policy) index monitor is successfully detecting all the updated docs in the ES index. We can try to solve this with interruptible dispatches in the policy monitor, as outlined in the issue description. Alternatively, we can alter how the underlying index monitors function. That said, I don't think it's worth addressing for the time being, as errors like the perf-test failure for the "multiple" branch have never been reported.
We're going to push this issue out for a later time, when we have more capacity, as it does not affect users and the workaround is pretty simple.
Overview
The fleet-server performance tests have been failing for a while.
The specific test case that fails is multiple policy changes (submitted before they are fully acked).
This is reproducible on an 8GB ESS instance with 5k agents.
Currently we do not have reports from customers or the community of this being an issue.
There is some discussion and investigation of this issue in the policy monitor debounce PR, but I believe we should move the discussion here.
Our current proposed workaround is to increase the default policy-monitor debounce time and extend/remove the policy dispatch timeout (currently 5s).
Issues
I believe that the issue with this scale test is caused by the following components:
The limiter introduces a delay (default 5ms) between dispatches to subscriptions. This is done to avoid a policy change that affects a lot of agents triggering all checkins to write at once (and allocating expensive gzip writers).
However, this delay increases the time it takes for the `dispatchPending` method to complete, and since there is a single mutex, other operations on the policy monitor are blocked in the meantime, including processing new subscriptions, unsubscribes, and policy updates.
There is also no way to cancel a dispatch early; specifically, if we are dispatching 10k subscriptions on rev:2 and rev:3 is detected, we should stop dispatching rev:2, update the policy reference to rev:3, and then dispatch the newest revision to the remaining subscriptions.
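To make the blocking concrete, here is a simplified sketch of the pattern described above. The names (`monitor`, `pendingQ`, `dispatchPending`) mirror the discussion, and the hard-coded 5ms sleep stands in for the limiter; this is illustrative, not the actual fleet-server code.

```go
package policy

import (
	"container/list"
	"sync"
	"time"
)

type monitor struct {
	mu       sync.Mutex
	pendingQ *list.List // subscriptions awaiting a policy dispatch
}

// dispatchPending drains the queue while holding the single mutex.
// With a 5ms delay per item, 10k pending subscriptions hold the lock
// for ~50s, during which subscribes, unsubscribes, and policy updates
// are all blocked.
func (m *monitor) dispatchPending() {
	m.mu.Lock()
	defer m.mu.Unlock()
	for e := m.pendingQ.Front(); e != nil; e = m.pendingQ.Front() {
		m.pendingQ.Remove(e)
		_ = e.Value // send the new policy to this subscription's channel...
		time.Sleep(5 * time.Millisecond) // limiter delay between dispatches
	}
}
```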
Proposed solutions
I think we can use a combination of solutions to fix the policyMonitor issues under load.
Unify action+policy limits in checkinT
The action limit (introduced in #2994) and the policy limit are both used to stop the checkin endpoint from writing every response to awaiting agents at once (so that expensive gzip writer allocations can be avoided).
I think that we should remove the limits from the actionDispatcher and policyMonitor and use a unified limiter in checkinT (to control access to the gzip pool in writeResponse).
This would allow the policyMonitor to dispatch pending changes much faster.
This is the easiest to implement and likely has the most impact.
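As a rough sketch of what the unified limiter could look like, assuming a token-bucket limiter from golang.org/x/time/rate; the `CheckinT` fields and `writeResponse` body here are hypothetical stand-ins, not the actual fleet-server implementation:

```go
package checkin

import (
	"compress/gzip"
	"context"
	"io"
	"sync"

	"golang.org/x/time/rate"
)

type CheckinT struct {
	limiter  *rate.Limiter // unified action+policy write limiter
	gzipPool sync.Pool     // pool of *gzip.Writer
}

func NewCheckinT(rps rate.Limit, burst int) *CheckinT {
	return &CheckinT{
		limiter: rate.NewLimiter(rps, burst),
		gzipPool: sync.Pool{New: func() any {
			return gzip.NewWriter(io.Discard)
		}},
	}
}

// writeResponse gates every compressed response behind the single
// limiter, so the policy monitor and action dispatcher no longer need
// their own delays and can drain their queues at full speed.
func (ct *CheckinT) writeResponse(ctx context.Context, w io.Writer, body []byte) error {
	if err := ct.limiter.Wait(ctx); err != nil { // blocks until a token is free
		return err
	}
	zw := ct.gzipPool.Get().(*gzip.Writer)
	defer ct.gzipPool.Put(zw)
	zw.Reset(w)
	if _, err := zw.Write(body); err != nil {
		return err
	}
	return zw.Close()
}
```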
Interruptible dispatches
If we refactor the policy monitor so that `dispatchPending` may be canceled early when an updated policy is received by the underlying monitor, we can avoid dispatching an outdated policy to agents.
This solution requires a lot more implementation and testing work, and can be done alongside any other solution. A sketch of one possible approach follows.
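One way to make the dispatch interruptible is context cancellation: the monitor keeps a cancel func for the in-flight dispatch and aborts it when a newer revision arrives. All names here are hypothetical, not the current fleet-server API.

```go
package policy

import (
	"context"
	"sync"
)

type monitor struct {
	mu     sync.Mutex
	cancel context.CancelFunc // cancels the in-flight dispatch, if any
}

// onNewRevision is called when the index monitor detects rev N+1 while
// rev N may still be dispatching: interrupt the old dispatch, then start
// dispatching the newest revision to the remaining subscriptions.
func (m *monitor) onNewRevision(dispatch func(ctx context.Context)) {
	m.mu.Lock()
	if m.cancel != nil {
		m.cancel() // stop sending the outdated revision
	}
	ctx, cancel := context.WithCancel(context.Background())
	m.cancel = cancel
	m.mu.Unlock()

	go func() {
		defer cancel()
		dispatch(ctx) // checks ctx.Err() between subscriptions
	}()
}

// dispatchPending drains subscriptions but returns early on cancel.
func dispatchPending(ctx context.Context, subs []chan struct{}) {
	for _, s := range subs {
		select {
		case <-ctx.Done():
			return // a newer revision arrived; remaining subs get it instead
		case s <- struct{}{}:
		}
	}
}
```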
Change subscription tracking
Currently the policy subscriptions are tracked as a linked list associated with a policy ID, as well as in the pendingQ of the policy monitor (another linked list) that tracks subscriptions with pending changes.
When a new change is detected, the policyMonitor updates the policy's subscriptions by unlinking the list one item at a time and adding the subscriptions to the pendingQ, then dispatches the queue linearly.
Changing how subscriptions are tracked may have to be done as part of implementing interruptible dispatches.
If we change how the subscriptions are transferred to the pendingQ, we should be able to splice the entire list at once (O(1) instead of O(n)), as sketched below.
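A sketch of the O(1) transfer, assuming an intrusive doubly-linked list: rather than unlinking subscriptions one at a time, splice the whole per-policy list onto the pendingQ by relinking head/tail pointers. The `subList` type is hypothetical; note that the standard library's `list.PushBackList` copies elements and is O(n), so a custom list is needed here.

```go
package policy

type sub struct {
	next, prev *sub
	// ... subscription state
}

type subList struct {
	head, tail *sub
}

// spliceAll moves every element of other onto the back of l in O(1)
// by relinking four pointers, then empties other.
func (l *subList) spliceAll(other *subList) {
	if other.head == nil {
		return // nothing to move
	}
	if l.tail == nil {
		l.head, l.tail = other.head, other.tail
	} else {
		l.tail.next = other.head
		other.head.prev = l.tail
		l.tail = other.tail
	}
	other.head, other.tail = nil, nil
}
```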