Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AAP-38230 Fix metrics pipe vs send buffering race condition #15731

Draft
wants to merge 1 commit into
base: devel
Choose a base branch
from

Conversation

AlanCoding
Copy link
Member

SUMMARY

I believe this is a fix for test failures we've been seeing. I'll do a dump of notes here.

The failing test is tests.api.external_data.test_metrics.TestMetrics.test_subsystem_metrics_counts_incremented_accurately, and it fails because the /api/v2/metrics/ count of metric callback_receiver_events_insert_db is the same before running an ad hoc command, as after it runs (flaky). This should not be the case! I have managed to manually confirm what it is saying - yes, the count should update because we ran a command, and it produces events. But the count fails to update, and this is a legitimate bug.

With the test ruled out, I dug deeper into what was going on, and I would start with observing these settings:

# Interval in seconds for sending local metrics to other nodes
SUBSYSTEM_METRICS_INTERVAL_SEND_METRICS = 3

# Interval in seconds for saving local metrics to redis
SUBSYSTEM_METRICS_INTERVAL_SAVE_TO_REDIS = 2

These are essentially the buffering time frames for 2 different buffering loops. Let's look at that failure again:

py.test --inventory=playbooks/inventory-docker.py tests/api/external_data/test_metrics.py -k test_subsystem_metrics_counts_incremented_accurately -rsx --repeat=20

gives

7 failed, 13 passed, 300 deselected, 425 warnings in 262.38s (0:04:22)

Notice anything interesting about the pass/fail ratio? It seems to fail about... 1 out of 3 times? Let's look at those buffering loops, here is the logic with the send / don't-send logic for what I'll call the "pipe" loop:

def should_pipe_execute(self):
if self.metrics_have_changed is False:
return False
if time.time() - self.last_pipe_execute > self.pipe_execute_interval:
return True
else:
return False

And the same send / don't-send logic for what I'll call the "send" loop:

if current_time - self.previous_send_metrics.decode(self.conn) > self.send_metrics_interval:

You'll see that both of these gate based on the current time compared to a breadcrumb of the last time it ran. The bug here is when that breadcrumb gets set, like self.last_pipe_execute = time.time(). It sets that timestamp, but it doesn't apply the settings to redis.

The callback receiver calls pipe_execute which calls send_metrics. These represent the outer and inner buffering loops. Yet code inside of the inner loop takes both actions

  1. apply local settings to local redis
  2. send a websocket message to apply those to redis in other nodes

But (1) needs to be done whenever the last_pipe_execute time is set, and if you don't, then you'll get a bug where the local data fails to be purged on the last event of something. This is how we get the 1-out-of-3 probability. There is a 2/3rd chance that the inner and outer loops refresh at the same point-in-time. The 1/3rd chance has pipe_execute call send_metrics and then it evaluates that we shouldn't send metrics, and records nothing into redis.

ISSUE TYPE
  • Bug, Docs Fix or other nominal change
COMPONENT NAME
  • API

Copy link

sonarqubecloud bot commented Jan 6, 2025

@AlanCoding AlanCoding changed the title Fix metrics pipe vs send buffering race condition AAP-38230 Fix metrics pipe vs send buffering race condition Jan 6, 2025
@AlanCoding
Copy link
Member Author

There may be a valid argument against this, in that, these 2 loops should not exist independently of each other. Just make it 1 loop. 1 buffer timeout.

@AlanCoding
Copy link
Member Author

@fosterseth you originated what this adjusts. I'm increasingly leaning towards these not being 2 settings, but just a single condition for sending metrics to both local redis and websockets. Would there be any problem with that? I mean, 2 vs 3 seconds isn't a big difference, right?

@AlanCoding AlanCoding requested a review from fosterseth January 8, 2025 15:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant