
Error: Socket path does not exist: /var/run/receptor/receptor.sock #15421

Closed
rabindangol opened this issue Aug 3, 2024 · 9 comments

rabindangol commented Aug 3, 2024

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.
  • I am NOT reporting a (potential) security vulnerability. (These should be emailed to [email protected] instead.)

Bug Summary

After I updated Kubernetes from v1.27 to v1.28, the Ansible AWX pods are failing. However, the awx-operator-controller-manager pod is running and healthy.

Upon checking the logs for each container in the awx pod, two containers, awx-task and awx-ee, are failing.
Container awx-ee image = quay.io/ansible/awx-ee:latest
Error:
empty receptor config, skipping...

Container awx-task image = quay.io/ansible/awx:21.4.0
Errors:

[wait-for-migrations] Waiting for database migrations...
[wait-for-migrations] Attempt 1 of 30
Instance Group already registered controlplane
Instance Group already registered default
Instance already registered awx-6f5455c47b-dtlp9
2024-08-03 13:53:03,875 INFO RPC interface 'supervisor' initialized
2024-08-03 13:53:03,875 INFO RPC interface 'supervisor' initialized
2024-08-03 13:53:03,876 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2024-08-03 13:53:03,876 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2024-08-03 13:53:03,876 INFO supervisord started with pid 7
2024-08-03 13:53:03,876 INFO supervisord started with pid 7
2024-08-03 13:53:04,878 INFO spawned: 'superwatcher' with pid 28
2024-08-03 13:53:04,878 INFO spawned: 'superwatcher' with pid 28
2024-08-03 13:53:04,880 INFO spawned: 'dispatcher' with pid 29
2024-08-03 13:53:04,880 INFO spawned: 'dispatcher' with pid 29
2024-08-03 13:53:04,882 INFO spawned: 'callback-receiver' with pid 30
2024-08-03 13:53:04,882 INFO spawned: 'callback-receiver' with pid 30
READY
2024-08-03 13:53:05,884 INFO success: superwatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-08-03 13:53:05,884 INFO success: superwatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-08-03 13:53:06,826 WARNING  [-] awx.main.dispatch.periodic periodic beat started
2024-08-03 13:53:06,914 DEBUG    [-] awx.main.commands.run_callback_receiver scaling up worker pid:36
2024-08-03 13:53:06,914 DEBUG    [-] awx.main.commands.run_callback_receiver scaling up worker pid:36
2024-08-03 13:53:06,920 DEBUG    [-] awx.main.commands.run_callback_receiver scaling up worker pid:37
2024-08-03 13:53:06,920 DEBUG    [-] awx.main.commands.run_callback_receiver scaling up worker pid:37
2024-08-03 13:53:06,924 DEBUG    [-] awx.main.commands.run_callback_receiver scaling up worker pid:38
2024-08-03 13:53:06,924 DEBUG    [-] awx.main.commands.run_callback_receiver scaling up worker pid:38
2024-08-03 13:53:06,929 DEBUG    [-] awx.main.commands.run_callback_receiver scaling up worker pid:39
2024-08-03 13:53:06,929 DEBUG    [-] awx.main.commands.run_callback_receiver scaling up worker pid:39
2024-08-03 13:53:06,931 DEBUG    [-] awx.main.commands.run_callback_receiver 30 is alive
2024-08-03 13:53:06,931 DEBUG    [-] awx.main.commands.run_callback_receiver 30 is alive
2024-08-03 13:53:06,932 DEBUG    [-] awx.main.dispatch scaling up worker pid:40
2024-08-03 13:53:06,937 DEBUG    [-] awx.main.dispatch scaling up worker pid:41
2024-08-03 13:53:06,941 DEBUG    [-] awx.main.dispatch scaling up worker pid:42
2024-08-03 13:53:06,945 DEBUG    [-] awx.main.dispatch scaling up worker pid:43
2024-08-03 13:53:06,947 INFO     [-] awx.main.dispatch Running worker dispatcher listening to queues ['tower_broadcast_all', 'awx-00-6f5455c47b-dtlp9']
2024-08-03 13:53:07,024 DEBUG    [-] awx.main.tasks Syncing Schedules
2024-08-03 13:53:07,378 DEBUG    [-] awx.main.tasks.system Waited 0.0013823509216308594 seconds to obtain lock name: cluster_policy_lock
2024-08-03 13:53:07,388 DEBUG    [-] awx.main.tasks.system Total instances: 3, available for policy: 3
2024-08-03 13:53:07,389 DEBUG    [-] awx.main.tasks.system Policy percentage, adding Instances [141, 142, 143] to Group controlplane
2024-08-03 13:53:07,389 DEBUG    [-] awx.main.tasks.system Cluster policy no-op finished in 0.010658025741577148 seconds
2024-08-03 13:53:07,390 DEBUG    [-] awx.main.tasks.system Cluster node heartbeat task.
2024-08-03 13:53:07,401 ERROR    [-] awx.main.dispatch Encountered unhandled error in dispatcher main loop
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/base.py", line 156, in run
    self.worker.on_start()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/task.py", line 128, in on_start
    dispatch_startup()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/system.py", line 105, in dispatch_startup
    cluster_node_heartbeat()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/system.py", line 491, in cluster_node_heartbeat
    inspect_execution_nodes(instance_list)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/system.py", line 431, in inspect_execution_nodes
    mesh_status = ctl.simple_command('status')
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/receptorctl/socket_interface.py", line 81, in simple_command
    self.connect()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/receptorctl/socket_interface.py", line 99, in connect
    raise ValueError(f"Socket path does not exist: {path}")
ValueError: Socket path does not exist: /var/run/receptor/receptor.sock
Traceback (most recent call last):
  File "/usr/bin/awx-manage", line 8, in <module>
    sys.exit(manage())
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/__init__.py", line 201, in manage
    execute_from_command_line(sys.argv)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/__init__.py", line 419, in execute_from_command_line
    utility.execute()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/__init__.py", line 413, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/base.py", line 354, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/base.py", line 398, in execute
    output = self.handle(*args, **options)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/management/commands/run_dispatcher.py", line 62, in handle
    consumer.run()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/base.py", line 156, in run
    self.worker.on_start()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/task.py", line 128, in on_start
    dispatch_startup()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/system.py", line 105, in dispatch_startup
    cluster_node_heartbeat()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/system.py", line 491, in cluster_node_heartbeat
    inspect_execution_nodes(instance_list)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/system.py", line 431, in inspect_execution_nodes
    mesh_status = ctl.simple_command('status')
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/receptorctl/socket_interface.py", line 81, in simple_command
    self.connect()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/receptorctl/socket_interface.py", line 99, in connect
    raise ValueError(f"Socket path does not exist: {path}")
ValueError: Socket path does not exist: /var/run/receptor/receptor.sock
2024-08-03 13:53:07,802 INFO exited: dispatcher (exit status 1; not expected)
2024-08-03 13:53:07,802 INFO exited: dispatcher (exit status 1; not expected)
2024-08-03 13:53:08,805 INFO spawned: 'dispatcher' with pid 44
2024-08-03 13:53:08,805 INFO spawned: 'dispatcher' with pid 44

...
then the same error repeats in a loop...

AWX version

21.4.0

Select the relevant components

  • UI
  • UI (tech preview)
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

kubernetes

Modifications

no

Ansible version

21.4.0

Operating system

EKS

Web browser

Edge

Steps to reproduce

It was working until I updated EKS from v1.27 to v1.28. When the workloads (Ansible) were moved to new nodes as part of the EKS update, the pods started showing errors.

Expected results

AWX runs without errors.

Actual results

Pods have failed, and AWX can no longer be accessed from the browser.

Additional information

These are the containers running in the awx pod:

  • redis
    image: docker.io/redis:7

  • awx-web
    image: quay.io/ansible/awx:21.4.0

  • awx-task
    image: quay.io/ansible/awx:21.4.0

  • awx-ee
    image: quay.io/ansible/awx-ee:latest



rabindangol commented Aug 5, 2024

The issue was due to the container image quay.io/ansible/awx-ee:latest.
Switching to an older one fixed my issue.

Fix: in the AWX custom resource (CRD), set:

spec:
  ...
  control_plane_ee_image: quay.io/ansible/awx-ee:24.4.0
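If it helps anyone applying the same change, here is a rough sketch of doing it with kubectl patch; the namespace and resource name are placeholders for your own deployment:

# pin the control plane EE image on the AWX custom resource
kubectl -n <awx-namespace> patch awx <awx-instance-name> --type merge \
  -p '{"spec": {"control_plane_ee_image": "quay.io/ansible/awx-ee:24.4.0"}}'
# the awx-operator then reconciles and restarts the awx pod with the pinned image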


fs30000 commented Aug 5, 2024

@rabindangol Where should that be put? CRD? What is that?

EDIT: I'm using standalone Docker, not K8s.


fs30000 commented Aug 16, 2024

@rabindangol ?

@MaxTeiger

@fs30000 did you find a way to solve this problem? I am experiencing the same issue.


fs30000 commented Aug 20, 2024

@fs30000 did you find a way to solve this problem? I am experiencing the same issue.

Yeah, change the receptor version:

export RECEPTOR_IMAGE=quay.io/ansible/receptor:v1.4.8

Like:

git clone -b WHATEVERVERSION https://github.com/ansible/awx.git
cd awx/
export RECEPTOR_IMAGE=quay.io/ansible/receptor:v1.4.8 COMPOSE_TAG=release_4.3 COMPOSE_UP_OPTS=-d
# optional: make docker-compose-build
make docker-compose

The COMPOSE_TAG is for the awx_devel version. Still, it will work right from the start, but after a system reboot, it will fail with this:
https://forum.ansible.com/t/error-current-system-boot-id-differs-from-cached-boot-id/7898
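To sanity-check the pinned receptor afterwards, something like this should work (tools_awx_1 is the usual docker-compose dev container name; adjust it if yours differs):

# confirm the receptor control socket now exists and answers a status query
docker exec tools_awx_1 receptorctl --socket /var/run/receptor/receptor.sock status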

AWX is so screwed right now.

@MaxTeiger

Thank you!
For those interested, here are the steps that solved the issue on my machine:

I just stopped my containers, removed the awx image (docker rmi <image_id>) I had on my machine, and ran again:

make docker-compose COMPOSE_UP_OPTS=-d

This updated my old image (from two weeks ago) to a newer one and solved my problem! 😄
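Roughly, as commands (the container and image IDs are placeholders; exact names depend on your setup):

docker ps                                 # note the running awx dev containers
docker stop <container_id>                # stop them
docker images | grep awx                  # find the stale awx image
docker rmi <image_id>                     # remove it so it gets rebuilt/pulled fresh
make docker-compose COMPOSE_UP_OPTS=-d    # bring the environment back up detached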

@bobot4258

I also had this same issue, running two AWX instances in a 3-node K8s cluster. After a restart of all the nodes, only one AWX instance was working; the other had the same issue as this ticket. The two AWX instances were running on different Kubernetes nodes. I did a docker image ls and noticed that on the working node the awx-ee image was a year old, while on the one that was not working it was 4 weeks old.
[screenshot: docker image ls output]
I deleted the image from that node; the next most recent awx-ee image was from 7 months ago.
[screenshot: docker image ls output]
After that, the awx-task and awx-ee containers in the pod successfully started up and that AWX instance was working again.
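For anyone repeating this, roughly the steps on the affected node (this assumes a Docker-based node runtime and that the pod's imagePullPolicy keeps using the older local image; IDs and names are placeholders):

docker image ls | grep awx-ee                      # list local awx-ee images and their ages
docker rmi <image_id>                              # remove the recently pulled, incompatible one
kubectl -n <awx-namespace> delete pod <awx-pod>    # recreate the pod so it uses the remaining image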


fs30000 commented Sep 6, 2024


How do you pick a version of the awx-ee image in the docker dev environment?

@rabindangol

@fs30000, CRD stands for Custom Resource Definition in Kubernetes, where we define which version of AWX-EE to use. My setting was quay.io/ansible/awx-ee:latest, which means it always pulls the latest version of AWX-EE.

I had AWX working with no problems for the last few months. However, during a Kubernetes version update, the pods were restarted. Since we were using the latest tag, it pulled the newest image of AWX-EE, which was not compatible, resulting in an error.

My fix was to replace the latest tag with a tag that is a few months old, in this case awx-ee:24.4.0.
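To confirm which image the pod actually picked up after the operator reconciled, something like this works (pod and namespace names are placeholders):

# print each container in the AWX pod together with the image it runs
kubectl -n <awx-namespace> get pod <awx-pod-name> \
  -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.image}{"\n"}{end}'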
