Policy monitor load issues #3254
Comments
The unified throttle seems to solve the 10k perf test for the main branch. The underlying (policy) index monitor is successfully detecting all the updated docs in the ES index. We can try to solve this with interruptible dispatches in the policy monitor, as outlined in the issue description. Alternatively, we can alter how the underlying index monitors function. That said, I don't think it's worth addressing for the time being, as errors like the perf-test failure for the "multiple" branch have never been reported.
We're going to push this issue out for a later time, when we have more capacity, as it does not affect users and the workaround is pretty simple.
Overview
The fleet-server performance tests have been failing for a while.
The specific test case that fails is multiple policy changes (submitted before they are fully acked).
This is reproducible on an 8GB ESS instance with 5k agents.
Currently we do not have reports from customers or the community of this being an issue.
There is some discussion and investigation of this issue in the policy monitor debounce PR, but I believe we should move the discussion here.
Our current proposed workaround is to increase the default policy-monitor debounce time and extend/remove the policy dispatch timeout (currently 5s).
Issues
I believe that the issue with this scale test is caused by the following components:
The limiter introduces a delay (default 5ms) between dispatches to subscriptions. This is done to avoid a policy change that affects a lot of agents triggering all checkins to write at once (and allocating expensive gzip writers).
However, this delay increases the time it takes for the `dispatchPending` method to complete, and since there is a single mutex, other operations on the policy monitor are blocked in the meantime, including processing new subscriptions, unsubscribes, and policy updates.
There is also no way to cancel a dispatch early; specifically, if we are dispatching 10k subscriptions on rev:2 and rev:3 is detected, we should stop dispatching rev:2, update the policy reference to rev:3, and then dispatch the newest revision to the remaining subscriptions.
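To make the blocking concrete, here is a simplified sketch of the pattern described above. The names (`monitor`, `pendingQ`, `dispatchPending`) mirror the discussion, and the hard-coded 5ms sleep stands in for the limiter; this is illustrative, not the actual fleet-server code.

```go
package policy

import (
	"container/list"
	"sync"
	"time"
)

type monitor struct {
	mu       sync.Mutex
	pendingQ *list.List // subscriptions awaiting a policy dispatch
}

// dispatchPending drains the queue while holding the single mutex.
// With a 5ms delay per item, 10k pending subscriptions hold the lock
// for ~50s, during which subscribes, unsubscribes, and policy updates
// are all blocked.
func (m *monitor) dispatchPending() {
	m.mu.Lock()
	defer m.mu.Unlock()
	for e := m.pendingQ.Front(); e != nil; e = m.pendingQ.Front() {
		m.pendingQ.Remove(e)
		_ = e.Value // send the new policy to this subscription's channel...
		time.Sleep(5 * time.Millisecond) // limiter delay between dispatches
	}
}
```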
Proposed solutions
I think we can use a combination of solutions to fix the policyMonitor issues under load.
Unify action+policy limits in checkinT
The action limit (introduced in #2994) and the policy limit are both used to stop the checkin endpoint from writing every response to awaiting agents at once (so that expensive gzip writer allocations can be avoided).
I think that we should remove the limits from the actionDispatcher and policyMonitor and use a unified limiter in checkinT (to control access to the gzip pool in writeResponse).
This would allow the policyMonitor to dispatch pending changes much faster.
This is the easiest to implement and likely has the most impact.
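As a rough sketch of what the unified limiter could look like, assuming a token-bucket limiter from golang.org/x/time/rate; the `CheckinT` fields and `writeResponse` body here are hypothetical stand-ins, not the actual fleet-server implementation:

```go
package checkin

import (
	"compress/gzip"
	"context"
	"io"
	"sync"

	"golang.org/x/time/rate"
)

type CheckinT struct {
	limiter  *rate.Limiter // unified action+policy write limiter
	gzipPool sync.Pool     // pool of *gzip.Writer
}

func NewCheckinT(rps rate.Limit, burst int) *CheckinT {
	return &CheckinT{
		limiter: rate.NewLimiter(rps, burst),
		gzipPool: sync.Pool{New: func() any {
			return gzip.NewWriter(io.Discard)
		}},
	}
}

// writeResponse gates every compressed response behind the single
// limiter, so the policy monitor and action dispatcher no longer need
// their own delays and can drain their queues at full speed.
func (ct *CheckinT) writeResponse(ctx context.Context, w io.Writer, body []byte) error {
	if err := ct.limiter.Wait(ctx); err != nil { // blocks until a token is free
		return err
	}
	zw := ct.gzipPool.Get().(*gzip.Writer)
	defer ct.gzipPool.Put(zw)
	zw.Reset(w)
	if _, err := zw.Write(body); err != nil {
		return err
	}
	return zw.Close()
}
```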
Interruptible dispatches
If we refactor the policy monitor so that `dispatchPending` may be canceled early when an updated policy is received by the underlying monitor, we can avoid dispatching an outdated policy to agents.
This solution requires a lot more implementation and testing work, and can be done alongside any other solution. A sketch of one possible approach follows.
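One way to make the dispatch interruptible is context cancellation: the monitor keeps a cancel func for the in-flight dispatch and aborts it when a newer revision arrives. All names here are hypothetical, not the current fleet-server API.

```go
package policy

import (
	"context"
	"sync"
)

type monitor struct {
	mu     sync.Mutex
	cancel context.CancelFunc // cancels the in-flight dispatch, if any
}

// onNewRevision is called when the index monitor detects rev N+1 while
// rev N may still be dispatching: interrupt the old dispatch, then start
// dispatching the newest revision to the remaining subscriptions.
func (m *monitor) onNewRevision(dispatch func(ctx context.Context)) {
	m.mu.Lock()
	if m.cancel != nil {
		m.cancel() // stop sending the outdated revision
	}
	ctx, cancel := context.WithCancel(context.Background())
	m.cancel = cancel
	m.mu.Unlock()

	go func() {
		defer cancel()
		dispatch(ctx) // checks ctx.Err() between subscriptions
	}()
}

// dispatchPending drains subscriptions but returns early on cancel.
func dispatchPending(ctx context.Context, subs []chan struct{}) {
	for _, s := range subs {
		select {
		case <-ctx.Done():
			return // a newer revision arrived; remaining subs get it instead
		case s <- struct{}{}:
		}
	}
}
```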
Change subscription tracking
Currently the policy subscriptions are tracked as a linked list associated with a policy ID, as well as in the pendingQ of the policy monitor (another linked list) that tracks subscriptions with pending changes.
When a new change is detected, the policyMonitor updates the policy's subscriptions by unlinking the list one item at a time and adding the subscriptions to the pendingQ, then dispatches the queue linearly.
Changing how subscriptions are tracked may have to be done as part of implementing interruptible dispatches.
If we change how the subscriptions are transferred to the pendingQ, we should be able to splice the entire list at once (O(1) instead of O(n)), as sketched below.
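A sketch of the O(1) transfer, assuming an intrusive doubly-linked list: rather than unlinking subscriptions one at a time, splice the whole per-policy list onto the pendingQ by relinking head/tail pointers. The `subList` type is hypothetical; note that the standard library's `list.PushBackList` copies elements and is O(n), so a custom list is needed here.

```go
package policy

type sub struct {
	next, prev *sub
	// ... subscription state
}

type subList struct {
	head, tail *sub
}

// spliceAll moves every element of other onto the back of l in O(1)
// by relinking four pointers, then empties other.
func (l *subList) spliceAll(other *subList) {
	if other.head == nil {
		return // nothing to move
	}
	if l.tail == nil {
		l.head, l.tail = other.head, other.tail
	} else {
		l.tail.next = other.head
		other.head.prev = l.tail
		l.tail = other.tail
	}
	other.head, other.tail = nil, nil
}
```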