Receptor execution oversubscription causes unhandled exception in awx-task #15013

Closed

whitej6 opened this issue Mar 19, 2024 · 5 comments
whitej6 commented Mar 19, 2024

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.
  • I am NOT reporting a (potential) security vulnerability. (These should be emailed to [email protected] instead.)

Bug Summary

The awx-ee container inside the awx-task pod has either changed to a more aggressive approach to receptor mesh network route advertisements OR has a more aggressive timeout when attempting to release work items from a remote receptor execution node. This causes an exception in the run_callback_receiver process inside the awx-task container; shortly afterward the pod becomes unhealthy and disrupts routing within the receptor mesh network. When this occurs the ONLY fix is to have Kubernetes kill the affected task container. Services should then restore on their own, BUT any job that was hashed to that task node will either be stuck in a hung state or be marked as errored. The issue does not appear to occur when the remote execution node is running on a more performant CPU.

Last good awx-ee image tag: quay.io/ansible/awx-ee:23.6.0
Last good build of receptor: 1.4.3+g4ca9363

This bug can affect several versions of AWX, because AWX-Operator <2.13.0 defaults to using quay.io/ansible/awx-ee:latest and the issue is still present in that tag as of a build from 2 days ago (understanding that this image is rebuilt regularly).

This issue is related to an open Receptor issue BUT spans Receptor, Operator, & AWX: ansible/receptor#934 (comment)

I am in the process of finalizing a full write-up on the issue and will share it once it is done.

AWX version

24.0.0

Select the relevant components

  • UI
  • UI (tech preview)
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

kubernetes

Modifications

no

Ansible version

No response

Operating system

No response

Web browser

No response

Steps to reproduce

  1. Deploy AWX with a remote execution node that has a lower clock speed
  2. Fire off several jobs against that instance, enough to overload its CPU

Expected results

AWX will keep jobs in the waiting state until capacity frees up, and only then push a new work item to an instance.

Actual results

AWX will oversubscribe the receptor execution node, leading to the remote receptor not replying in time when releasing a work unit and triggering an exception in run_callback_receiver. The task node is then no longer able to process work items and must be deleted from Kubernetes, which causes any job hashed to that receptor control node to become either hung OR errored.
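
For anyone debugging this, one way to see which work units are stuck is to query the receptor control socket from inside the awx-ee container. This is a minimal sketch of my own, not AWX code; it assumes the control socket is at /var/run/receptor/receptor.sock, which may differ in your deployment:

# List receptor work units and their states over the control socket.
# Assumption: the socket path below matches your awx-ee container.
from receptorctl.socket_interface import ReceptorControl

rc = ReceptorControl("/var/run/receptor/receptor.sock")

# "work list" returns a dict keyed by work unit ID with state details.
work = rc.simple_command("work list")
for unit_id, detail in work.items():
    print(unit_id, detail.get("StateName"), detail.get("Detail"))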

Additional information

Here is a traceback from when the event occurs. After this traceback appears, you will no longer find any log messages from run_callback_receiver in that pod:

2024-03-14 19:09:02,503 INFO     [9de81ac3ca6341678db4ceabea06d08a] awx.main.commands.run_callback_receiver Starting EOF event processing for Job 5373
2024-03-14 19:09:03,164 ERROR    [9de81ac3ca6341678db4ceabea06d08a] awx.main.tasks.receptor Error releasing work unit A1ImfyMW.
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 324, in run
    receptor_ctl.simple_command(f"work release {self.unit_id}")
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/receptorctl/socket_interface.py", line 83, in simple_command
    return self.read_and_parse_json()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/receptorctl/socket_interface.py", line 60, in read_and_parse_json
    raise RuntimeError(text[7:])
RuntimeError: read error reading from ec2-3-128-120-98.us-east-2.compute.amazonaws.com: timeout: no recent network activity
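
For context, the release in awx/main/tasks/receptor.py is a single simple_command call against the control socket, which is why a slow remote node surfaces as this RuntimeError. The sketch below shows what a retry around that call could look like; the wrapper, attempt count, and delay are my own illustration, not AWX's actual behavior:

import time
from receptorctl.socket_interface import ReceptorControl

def release_with_retry(socket_path, unit_id, attempts=3, delay=5):
    # Illustrative only: AWX itself makes a single attempt, which is
    # what raises the RuntimeError shown in the traceback above.
    for attempt in range(1, attempts + 1):
        try:
            # Open a fresh control connection per attempt; a timed-out
            # socket is not safe to reuse.
            rc = ReceptorControl(socket_path)
            return rc.simple_command(f"work release {unit_id}")
        except RuntimeError as exc:
            if attempt == attempts:
                raise
            print(f"release attempt {attempt} failed: {exc}; retrying")
            time.sleep(delay)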
whitej6 commented Mar 19, 2024

It's worth mentioning that the playbook being executed has little effect. I created a workflow template that spawns 10 jobs running this playbook to simulate load, and by continually triggering the workflow it's very easy to reproduce the issue: https://github.com/whitej6/ansible-test/blob/main/test.yml
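
To keep re-triggering the workflow without clicking through the UI, you can hit the workflow launch endpoint of the AWX REST API. A minimal sketch; the URL, token, and template ID are placeholders for your own environment:

import time
import requests

AWX_URL = "https://awx.example.com"  # placeholder
TOKEN = "REDACTED"                   # an OAuth2 token for your AWX user
WORKFLOW_TEMPLATE_ID = 42            # the workflow that fans out the 10 jobs

headers = {"Authorization": f"Bearer {TOKEN}"}

# Launch the workflow repeatedly to keep the execution node saturated.
for i in range(20):
    resp = requests.post(
        f"{AWX_URL}/api/v2/workflow_job_templates/{WORKFLOW_TEMPLATE_ID}/launch/",
        headers=headers,
    )
    resp.raise_for_status()
    print(f"launch {i}: workflow job {resp.json()['id']}")
    time.sleep(10)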

whitej6 commented Mar 23, 2024

It appears the issue was solved in receptor v1.4.5! ansible/receptor#934 (comment)

As a user that depends on AWX for a LOT of day-to-day work, it would be nice to see the receptor package come in as part of tagged releases and STOP using nightly builds from the devel branch in the receptor container.

@shanemcd / @AlanCoding I am happy to provide as many of my findings as I can, including how I replicated the full environment with a receptor hop node (much appreciated that mesh ingress is now part of the operator).

shanemcd (Member) commented:

> As a user that depends on AWX for a LOT of day-to-day work, it would be nice to see the receptor package come in as part of tagged releases and STOP using nightly builds from the devel branch in the receptor container.

I understand why you might want this, but unfortunately it would have not helped in this case. This slipped past our testing both upstream and downstream due to the nature of the problem. We will need to look into some kind of chaos testing that might help us with this kind of thing in the future.

whitej6 commented Mar 25, 2024

Totally understand and agree it wouldn't have helped here. 🙂

Just an ask to hopefully prevent in-development items from breaking a previous stable release.

TheRealHaoLiu (Member) commented:

Closing this issue since it's resolved.
