AAP-38230 Fix metrics pipe vs send buffering race condition #15731
SUMMARY
I believe this is a fix for test failures we've been seeing; I'll dump my notes here.

The failing test is `tests.api.external_data.test_metrics.TestMetrics.test_subsystem_metrics_counts_incremented_accurately`, and it fails (flakily) because the `/api/v2/metrics/` count of the metric `callback_receiver_events_insert_db` is the same before running an ad hoc command as after it runs. This should not be the case! I have managed to manually confirm what the test is saying: yes, the count should update, because we ran a command and it produced events. But the count fails to update, and this is a legitimate bug.

With the test ruled out, I dug deeper into what was going on, and I would start with observing these settings:
These are essentially the buffering time frames for two different buffering loops. Let's look at that failure again:
gives
Notice anything interesting about the pass/fail ratio? It seems to fail about... 1 out of 3 times? Let's look at those buffering loops. Here is the send / don't-send logic for what I'll call the "pipe" loop:
awx/awx/main/analytics/subsystem_metrics.py
Lines 265 to 271 in 9a5ed20
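The embedded snippet doesn't render in this copy; the gate in question is roughly of this shape (a minimal sketch with assumed names, not the actual `subsystem_metrics.py` source):

```python
import time

# Minimal sketch of a time-gated "pipe" flush (names assumed for
# illustration; not the actual AWX implementation).
class PipeGate:
    def __init__(self, save_interval):
        self.save_interval = save_interval      # seconds between flushes
        self.last_pipe_execute = time.time()    # breadcrumb of the last run

    def should_pipe_execute(self):
        # Only flush the locally buffered values if enough time has
        # passed since the last flush.
        return time.time() - self.last_pipe_execute > self.save_interval
```

The "send" loop gates the same way, against its own breadcrumb and its own interval setting.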
And the same send / don't-send logic for what I'll call the "send" loop:
awx/awx/main/analytics/subsystem_metrics.py
Line 298 in 9a5ed20
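To make the interaction concrete, here is a deterministic toy model of the two nested gates (all names and intervals are assumed for illustration; this is not the real AWX code). The outer "pipe" gate sets its breadcrumb, but the actual flush to redis only happens when the inner "send" gate also passes:

```python
# Toy model of the buggy interaction between the two buffering loops
# (assumed names/intervals, not the actual AWX implementation).
class BuggyMetrics:
    def __init__(self, save_interval, send_interval):
        self.save_interval = save_interval
        self.send_interval = send_interval
        self.now = 0.0                # fake clock, for determinism
        self.last_pipe_execute = 0.0
        self.last_send = 0.0
        self.local = 0                # locally buffered increments
        self.redis = 0                # values flushed to a fake "redis"

    def record_event(self, t):
        self.now = t
        self.local += 1
        self.pipe_execute()

    def pipe_execute(self):           # outer ("pipe") loop
        if self.now - self.last_pipe_execute > self.save_interval:
            self.last_pipe_execute = self.now   # breadcrumb set here...
            self.send_metrics()

    def send_metrics(self):           # inner ("send") loop
        if self.now - self.last_send > self.send_interval:
            self.last_send = self.now
            self.redis += self.local  # ...but the flush only happens here
            self.local = 0

bm = BuggyMetrics(save_interval=1, send_interval=3)
for t in (2, 4, 6):
    bm.record_event(t)
# Three events were recorded, but the flush at t=6 was skipped because
# the send gate didn't pass: bm.redis == 2 and bm.local == 1. The last
# increment is stranded locally, which matches what the flaky test sees.
```

The fix direction implied by the analysis below is to flush the local buffer whenever `last_pipe_execute` is set, independent of the send gate.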
You'll see that both of these gate based on the current time compared to a breadcrumb of the last time they ran. The bug here is in when that breadcrumb gets set, like `self.last_pipe_execute = time.time()`: it sets that timestamp, but it doesn't apply the buffered values to redis.

The callback receiver calls `pipe_execute`, which calls `send_metrics`. These represent the outer and inner buffering loops. Yet the code inside the inner loop takes both actions: (1) executing the redis pipeline to flush the locally buffered values, and (2) sending the metrics. But (1) needs to be done whenever the `last_pipe_execute` time is set, and if you don't, then you'll get a bug where the local data fails to be purged on the last event of something. This is how we get the 1-out-of-3 probability: there is a 2/3 chance that the inner and outer loops refresh at the same point in time. In the remaining 1/3 of cases, `pipe_execute` calls `send_metrics`, which then evaluates that we shouldn't send metrics, and records nothing into redis.

ISSUE TYPE
COMPONENT NAME