Receptor execution oversubscription causes unhandled exception in awx-task #15013

Closed

whitej6 opened this issue Mar 19, 2024 · 5 comments
whitej6 commented Mar 19, 2024

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.
  • I am NOT reporting a (potential) security vulnerability. (These should be emailed to [email protected] instead.)

Bug Summary

The awx-ee container inside the awx-task pod has either changed to a more aggressive approach to receptor mesh network route advertisements OR has a more aggressive timeout when attempting to release work items from a remote receptor execution node. This causes an exception in the run_callback_receiver process inside the awx-task container; shortly afterward the pod becomes unhealthy and disrupts routing within the receptor mesh network. When this occurs the ONLY fix is to have Kubernetes kill the affected task container. Services should then restore on their own, BUT any job that was hashed to that task node will either be stuck in a hung state or be marked as errored. The issue does not appear to occur when the remote execution node is running on a more performant CPU.

Last good awx-ee image tag: quay.io/ansible/awx-ee:23.6.0
Last good build of receptor: 1.4.3+g4ca9363

This bug can affect several versions of AWX, because AWX-Operator <2.13.0 defaults to using quay.io/ansible/awx-ee:latest and the issue is still present in that tag as of a build from 2 days ago (understanding that this image is rebuilt regularly).

This issue is related to an open Receptor issue BUT spans Receptor, Operator, & AWX: ansible/receptor#934 (comment)

I am in the process of finalizing a full write-up on the issue and will share it once it is done.

AWX version

24.0.0

Select the relevant components

  • UI
  • UI (tech preview)
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

kubernetes

Modifications

no

Ansible version

No response

Operating system

No response

Web browser

No response

Steps to reproduce

  1. Deploy AWX with a remote execution node that has a lower clock speed
  2. Fire off several jobs against that instance, enough to overload its CPU

Expected results

AWX will keep jobs in the waiting state until capacity frees up, and only then push a new work item to an instance.

Actual results

AWX will oversubscribe the receptor execution node, leading to the remote receptor not replying in time when releasing a work unit and triggering an exception in run_callback_receiver. The task node is then no longer able to process work items and must be deleted from Kubernetes, which causes any job hashed to that receptor control node to become either hung OR errored.
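
For anyone debugging this, one way to see which work units are stuck is to query the receptor control socket from inside the awx-ee container. This is a minimal sketch of my own, not AWX code; it assumes the control socket is at /var/run/receptor/receptor.sock, which may differ in your deployment:

# List receptor work units and their states over the control socket.
# Assumption: the socket path below matches your awx-ee container.
from receptorctl.socket_interface import ReceptorControl

rc = ReceptorControl("/var/run/receptor/receptor.sock")

# "work list" returns a dict keyed by work unit ID with state details.
work = rc.simple_command("work list")
for unit_id, detail in work.items():
    print(unit_id, detail.get("StateName"), detail.get("Detail"))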

Additional information

Here is a traceback from when the event occurs. After this traceback appears, you will no longer find any log messages from run_callback_receiver in that pod:

2024-03-14 19:09:02,503 INFO     [9de81ac3ca6341678db4ceabea06d08a] awx.main.commands.run_callback_receiver Starting EOF event processing for Job 5373
2024-03-14 19:09:03,164 ERROR    [9de81ac3ca6341678db4ceabea06d08a] awx.main.tasks.receptor Error releasing work unit A1ImfyMW.
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 324, in run
    receptor_ctl.simple_command(f"work release {self.unit_id}")
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/receptorctl/socket_interface.py", line 83, in simple_command
    return self.read_and_parse_json()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/receptorctl/socket_interface.py", line 60, in read_and_parse_json
    raise RuntimeError(text[7:])
RuntimeError: read error reading from ec2-3-128-120-98.us-east-2.compute.amazonaws.com: timeout: no recent network activity
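
For context, the release in awx/main/tasks/receptor.py is a single simple_command call against the control socket, which is why a slow remote node surfaces as this RuntimeError. The sketch below shows what a retry around that call could look like; the wrapper, attempt count, and delay are my own illustration, not AWX's actual behavior:

import time
from receptorctl.socket_interface import ReceptorControl

def release_with_retry(socket_path, unit_id, attempts=3, delay=5):
    # Illustrative only: AWX itself makes a single attempt, which is
    # what raises the RuntimeError shown in the traceback above.
    for attempt in range(1, attempts + 1):
        try:
            # Open a fresh control connection per attempt; a timed-out
            # socket is not safe to reuse.
            rc = ReceptorControl(socket_path)
            return rc.simple_command(f"work release {unit_id}")
        except RuntimeError as exc:
            if attempt == attempts:
                raise
            print(f"release attempt {attempt} failed: {exc}; retrying")
            time.sleep(delay)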
whitej6 commented Mar 19, 2024

It's worth mentioning that the playbook being executed has little effect. I created a workflow template that spawns 10 jobs running this playbook to simulate load, and by continually triggering the workflow it's very easy to reproduce the issue: https://github.com/whitej6/ansible-test/blob/main/test.yml
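
To keep re-triggering the workflow without clicking through the UI, you can hit the workflow launch endpoint of the AWX REST API. A minimal sketch; the URL, token, and template ID are placeholders for your own environment:

import time
import requests

AWX_URL = "https://awx.example.com"  # placeholder
TOKEN = "REDACTED"                   # an OAuth2 token for your AWX user
WORKFLOW_TEMPLATE_ID = 42            # the workflow that fans out the 10 jobs

headers = {"Authorization": f"Bearer {TOKEN}"}

# Launch the workflow repeatedly to keep the execution node saturated.
for i in range(20):
    resp = requests.post(
        f"{AWX_URL}/api/v2/workflow_job_templates/{WORKFLOW_TEMPLATE_ID}/launch/",
        headers=headers,
    )
    resp.raise_for_status()
    print(f"launch {i}: workflow job {resp.json()['id']}")
    time.sleep(10)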

whitej6 commented Mar 23, 2024

It appears the issue was solved in receptor v1.4.5! ansible/receptor#934 (comment)

As a user that depends on AWX for a LOT of day-to-day work, it would be nice to see the receptor package come in as part of tagged releases and STOP using nightly builds from the devel branch in the receptor container.

@shanemcd / @AlanCoding I am happy to provide as many of my findings as I can, including how I replicated the full environment with a receptor hop node (much appreciated that mesh ingress is now part of the operator).

shanemcd (Member) commented:

> As a user that depends on AWX for a LOT of day-to-day work, it would be nice to see the receptor package come in as part of tagged releases and STOP using nightly builds from the devel branch in the receptor container.

I understand why you might want this, but unfortunately it would have not helped in this case. This slipped past our testing both upstream and downstream due to the nature of the problem. We will need to look into some kind of chaos testing that might help us with this kind of thing in the future.

whitej6 commented Mar 25, 2024

Totally understand and agree it wouldn't have helped here. 🙂

Just an ask to hopefully prevent in-development items from breaking a previous stable release.

TheRealHaoLiu (Member) commented:

Closing this issue since it's resolved.
