AAP-38230 Fix metrics pipe vs send buffering race condition #15731

AlanCoding · 2025-01-06T19:10:17Z

SUMMARY

I believe this is a fix for test failures we've been seeing. I'll do a dump of notes here.

The failing test is tests.api.external_data.test_metrics.TestMetrics.test_subsystem_metrics_counts_incremented_accurately, and it fails because the /api/v2/metrics/ count of metric callback_receiver_events_insert_db is the same before running an ad hoc command, as after it runs (flaky). This should not be the case! I have managed to manually confirm what it is saying - yes, the count should update because we ran a command, and it produces events. But the count fails to update, and this is a legitimate bug.

With the test ruled out, I dug deeper into what was going on, and I would start with observing these settings:

# Interval in seconds for sending local metrics to other nodes
SUBSYSTEM_METRICS_INTERVAL_SEND_METRICS = 3

# Interval in seconds for saving local metrics to redis
SUBSYSTEM_METRICS_INTERVAL_SAVE_TO_REDIS = 2

These are essentially the buffering time frames for 2 different buffering loops. Let's look at that failure again:

py.test --inventory=playbooks/inventory-docker.py tests/api/external_data/test_metrics.py -k test_subsystem_metrics_counts_incremented_accurately -rsx --repeat=20

gives

7 failed, 13 passed, 300 deselected, 425 warnings in 262.38s (0:04:22)

Notice anything interesting about the pass/fail ratio? It seems to fail about... 1 out of 3 times? Let's look at those buffering loops, here is the logic with the send / don't-send logic for what I'll call the "pipe" loop:

awx/awx/main/analytics/subsystem_metrics.py

Lines 265 to 271 in 9a5ed20

    
           def should_pipe_execute(self): 
        
               if self.metrics_have_changed is False: 
        
                   return False 
        
               if time.time() - self.last_pipe_execute > self.pipe_execute_interval: 
        
                   return True 
        
               else: 
        
                   return False

And the same send / don't-send logic for what I'll call the "send" loop:

awx/awx/main/analytics/subsystem_metrics.py

Line 298 in 9a5ed20

    
           if current_time - self.previous_send_metrics.decode(self.conn) > self.send_metrics_interval:

You'll see that both of these gate based on the current time compared to a breadcrumb of the last time it ran. The bug here is when that breadcrumb gets set, like self.last_pipe_execute = time.time(). It sets that timestamp, but it doesn't apply the settings to redis.

The callback receiver calls pipe_execute which calls send_metrics. These represent the outer and inner buffering loops. Yet code inside of the inner loop takes both actions

apply local settings to local redis
send a websocket message to apply those to redis in other nodes

But (1) needs to be done whenever the last_pipe_execute time is set, and if you don't, then you'll get a bug where the local data fails to be purged on the last event of something. This is how we get the 1-out-of-3 probability. There is a 2/3rd chance that the inner and outer loops refresh at the same point-in-time. The 1/3rd chance has pipe_execute call send_metrics and then it evaluates that we shouldn't send metrics, and records nothing into redis.

ISSUE TYPE

Bug, Docs Fix or other nominal change

COMPONENT NAME

API

sonarqubecloud · 2025-01-06T19:11:41Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

AlanCoding · 2025-01-06T21:58:01Z

There may be a valid argument against this, in that, these 2 loops should not exist independently of each other. Just make it 1 loop. 1 buffer timeout.

AlanCoding · 2025-01-07T20:06:52Z

@fosterseth you originated what this adjusts. I'm increasingly leaning towards these not being 2 settings, but just a single condition for sending metrics to both local redis and websockets. Would there be any problem with that? I mean, 2 vs 3 seconds isn't a big difference, right?

Fix metrics pipe vs send buffering race condition

2677bda

github-actions bot added the component:api label Jan 6, 2025

AlanCoding changed the title ~~Fix metrics pipe vs send buffering race condition~~ AAP-38230 Fix metrics pipe vs send buffering race condition Jan 6, 2025

AlanCoding requested review from kdelee and TheRealHaoLiu January 6, 2025 19:32

AlanCoding requested a review from fosterseth January 8, 2025 15:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AAP-38230 Fix metrics pipe vs send buffering race condition #15731

AAP-38230 Fix metrics pipe vs send buffering race condition #15731

AlanCoding commented Jan 6, 2025

sonarqubecloud bot commented Jan 6, 2025

AlanCoding commented Jan 6, 2025

AlanCoding commented Jan 7, 2025

	def should_pipe_execute(self):
	if self.metrics_have_changed is False:
	return False
	if time.time() - self.last_pipe_execute > self.pipe_execute_interval:
	return True
	else:
	return False

AAP-38230 Fix metrics pipe vs send buffering race condition #15731

Are you sure you want to change the base?

AAP-38230 Fix metrics pipe vs send buffering race condition #15731

Conversation

AlanCoding commented Jan 6, 2025

SUMMARY

ISSUE TYPE

COMPONENT NAME

sonarqubecloud bot commented Jan 6, 2025

Quality Gate passed

AlanCoding commented Jan 6, 2025

AlanCoding commented Jan 7, 2025