Receptor Execution over subscribed causes unhandled exception in awx-task #15013
Comments
It's worth mentioning that the playbook being executed has little effect. I created a workflow template that spawns 10 jobs running this playbook to simulate load. By continually triggering the workflow, it's very easy to reproduce the issue: https://github.com/whitej6/ansible-test/blob/main/test.yml
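Not the reporter's exact script, but a minimal sketch of one way to continually trigger such a workflow through the AWX REST API. The URL, token, and workflow job template ID below are placeholders, and the launch cadence is arbitrary:

```python
import time
import requests

# Assumptions: AWX is reachable at AWX_URL, TOKEN is a valid OAuth2 token,
# and WORKFLOW_ID points at a workflow job template that fans out to several
# jobs running the test playbook. All three values are placeholders.
AWX_URL = "https://awx.example.com"
TOKEN = "REPLACE_ME"
WORKFLOW_ID = 42

headers = {"Authorization": f"Bearer {TOKEN}"}

# Launch the workflow repeatedly to keep the execution node saturated.
for i in range(50):
    resp = requests.post(
        f"{AWX_URL}/api/v2/workflow_job_templates/{WORKFLOW_ID}/launch/",
        headers=headers,
        verify=False,  # only for lab environments with self-signed certs
    )
    resp.raise_for_status()
    print(f"launch {i}: workflow job {resp.json().get('id')}")
    time.sleep(5)
```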
It appears the issue was solved in receptor v1.4.5! ansible/receptor#934 (comment) As a user that depends on AWX for a LOT of day-to-day work, it would be nice to see the receptor package come in as part of tagged releases and STOP using nightly builds from the devel branch on the receptor container. @shanemcd / @AlanCoding I am happy to provide as much of my findings as I can, including how I replicated the full environment including the receptor hop (much appreciated that mesh ingress is now part of the operator).
I understand why you might want this, but unfortunately it would not have helped in this case. This slipped past our testing both upstream and downstream due to the nature of the problem. We will need to look into some kind of chaos testing that might help us with this kind of thing in the future.
Totally understand and agree it wouldn't have helped here. 🙂 Just an ask to hopefully prevent in-development items from breaking a previous stable release.
Closing this issue since it's resolved.
Please confirm the following
(Potential security issues should be emailed to [email protected] instead.)

Bug Summary
The awx-ee container inside of the awx-task pod has either changed to a more aggressive approach to receptor mesh network route advertisements, OR it has a more aggressive timeout when attempting to release work items from a remote receptor execution node. This causes an exception in the run_callback_receiver process inside the awx-task pod, and shortly after, the pod becomes unhealthy and disrupts routing within the receptor mesh network. When this occurs the ONLY fix is to have Kubernetes kill the affected task container, and upon doing so services should restore on their own, BUT any job that was hashed to that task node will either be stuck in a hung state or be marked as errored. The issue appears to be prevented if the remote execution node is running on a more performant CPU.

Last good awx-ee image tag:
quay.io/ansible/awx-ee:23.6.0
Last good build of receptor:
1.4.3+g4ca9363
This bug can affect several versions of AWX because AWX-Operator <2.13.0 defaults to using quay.io/ansible/awx-ee:latest, and the issue is still present in that tag as of a build from 2 days ago (understanding this image has regular rebuilds).

This issue is related to one open on Receptor BUT spans Receptor, Operator, & AWX: ansible/receptor#934 (comment)
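Since the regression tracks the receptor build baked into the awx-ee image, one quick way to compare receptor versions across image tags is to run the binary inside each image. A minimal sketch, assuming podman (or docker) is available locally and that the awx-ee image exposes a `receptor` binary supporting `--version`; the tags come from the report above:

```python
import subprocess

# Assumption: a local container runtime (podman; swap in "docker" if needed)
# and an awx-ee image that ships `receptor` on PATH. 23.6.0 was the last
# known-good tag per this report; `latest` is the AWX-Operator <2.13.0 default.
IMAGES = [
    "quay.io/ansible/awx-ee:23.6.0",
    "quay.io/ansible/awx-ee:latest",
]

for image in IMAGES:
    result = subprocess.run(
        ["podman", "run", "--rm", image, "receptor", "--version"],
        capture_output=True,
        text=True,
        check=False,
    )
    print(f"{image}: {result.stdout.strip() or result.stderr.strip()}")
```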
I am in the process of finalizing a full write-up on the issue and will share it once it is done.
AWX version
24.0.0
Select the relevant components
Installation method
kubernetes
Modifications
no
Ansible version
No response
Operating system
No response
Web browser
No response
Steps to reproduce
Expected results
AWX will keep jobs in the waiting state until capacity frees up and only then push a new work item to an instance
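One way to check whether this expectation holds is to poll the instances endpoint and compare consumed capacity against total capacity while the workflow load is running. A sketch only; the field names match the /api/v2/instances/ endpoint in recent AWX releases, while the URL and token are placeholders:

```python
import time
import requests

# Placeholders: AWX_URL and TOKEN must be supplied. The fields used
# (hostname, capacity, consumed_capacity) come from /api/v2/instances/.
AWX_URL = "https://awx.example.com"
TOKEN = "REPLACE_ME"
headers = {"Authorization": f"Bearer {TOKEN}"}

# Poll instance capacity; consumed_capacity climbing past capacity suggests
# the execution node is being oversubscribed rather than jobs waiting.
while True:
    instances = requests.get(
        f"{AWX_URL}/api/v2/instances/", headers=headers, verify=False
    ).json()["results"]
    for inst in instances:
        print(
            f"{inst['hostname']}: "
            f"{inst['consumed_capacity']}/{inst['capacity']} consumed"
        )
    time.sleep(10)
```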
Actual results
AWX will oversubscribe the receptor execution node, leading to the remote receptor not replying in time to release a work item and triggering an exception in run_callback_receiver. The task node is then no longer able to process work items and must be deleted from Kubernetes. This will then cause any job hashed to that receptor control node to either become hung OR errored.

Additional information
Here is a traceback when the event occurs. After this traceback occurs you will no longer be able to find any log messages in that pod for run_callback_receiver.
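Not part of the original report: a small sketch of how one might enumerate the jobs left hung or errored after the task pod is recycled, using the jobs endpoint and its standard status filters. URL and token are placeholders:

```python
import requests

# Placeholders: AWX_URL and TOKEN. The status values (running, waiting,
# error) match AWX's unified job states.
AWX_URL = "https://awx.example.com"
TOKEN = "REPLACE_ME"
headers = {"Authorization": f"Bearer {TOKEN}"}

# List likely casualties of the task pod recycle: jobs still "running" or
# "waiting" long after launch, plus those already marked as errored.
for status in ("running", "waiting", "error"):
    resp = requests.get(
        f"{AWX_URL}/api/v2/jobs/",
        headers=headers,
        params={"status": status, "order_by": "-created", "page_size": 25},
        verify=False,
    )
    for job in resp.json()["results"]:
        print(f"[{status}] job {job['id']}: {job['name']} (created {job['created']})")
```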