
AWX Jobs Failing with "Task was canceled due to receiving a shutdown signal." #14948

Closed
6 of 11 tasks
mmacdo02-tufts opened this issue Mar 4, 2024 · 2 comments

Comments

@mmacdo02-tufts

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.
  • I am NOT reporting a (potential) security vulnerability. (These should be emailed to [email protected] instead.)

Bug Summary

Long-running Ansible jobs are failing with no other information. We have AWX 23.8.0 installed on OpenShift 4.11.57 using the AWX Operator. I checked the current issues for duplicates, so I apologize if this is a duplicate.

I am able to replicate this problem in both my Lab and Production environments, which run on different OpenShift clusters. Both run the same version of AWX (23.8.0), the same AWX Operator (awx-operator.v2.12.0), and the same version of Red Hat OpenShift (4.11.57). All long-running jobs fail the same way.

kubectl -n tts-lab-awx exec -it automation-job-1152-mvg7d -- env | grep ANSIBLE_RUNNER_KEEPALIVE_SECOND
ANSIBLE_RUNNER_KEEPALIVE_SECONDS=30

kubectl -n tts-lab-awx exec -it automation-job-1152-mvg7d -- receptor --version
1.4.4+gc75b1f6

kubectl -n tts-lab-awx exec -it automation-job-1152-mvg7d -- ansible-runner --version
2.3.5
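
For anyone gathering the same details from several job pods, the three checks above can be scripted. This is only a rough sketch: the helper names (`build_exec_cmd`, `parse_env_value`, `collect_job_pod_info`) are made up for illustration, and it assumes `kubectl` is on the PATH and already authenticated against the cluster. The namespace and pod name are the ones from this report.

```python
import subprocess
from typing import Dict, List, Optional


def build_exec_cmd(namespace: str, pod: str, *container_cmd: str) -> List[str]:
    """Build a `kubectl exec` argv.

    The `--` separator (two ASCII hyphens) marks the end of kubectl's own
    flags, so things like `--version` are passed to the command in the pod.
    """
    return ["kubectl", "-n", namespace, "exec", pod, "--", *container_cmd]


def parse_env_value(env_output: str, name: str) -> Optional[str]:
    """Return the value of `name` from `env` output, or None if it is unset."""
    for line in env_output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key == name:
            return value
    return None


def run_in_pod(namespace: str, pod: str, *container_cmd: str) -> str:
    """Run a command inside the pod and return its stdout (hypothetical helper)."""
    result = subprocess.run(
        build_exec_cmd(namespace, pod, *container_cmd),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def collect_job_pod_info(namespace: str, pod: str) -> Dict[str, Optional[str]]:
    """Collect the keep-alive setting and tool versions reported in this issue."""
    env_output = run_in_pod(namespace, pod, "env")
    return {
        "keepalive_seconds": parse_env_value(
            env_output, "ANSIBLE_RUNNER_KEEPALIVE_SECONDS"
        ),
        "receptor": run_in_pod(namespace, pod, "receptor", "--version"),
        "ansible_runner": run_in_pod(namespace, pod, "ansible-runner", "--version"),
    }


if __name__ == "__main__":
    # Only works while the automation job pod is still running.
    print(collect_job_pod_info("tts-lab-awx", "automation-job-1152-mvg7d"))
```

The command builder is kept separate from the part that calls `kubectl`, so the argument list can be checked without a cluster.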

I’m happy to provide more information but I am pretty new to AWX. I did increase our containerLogMaxSize to 200mb for better visibility. I also set K8S Ansible Runner Keep-Alive Message Interval to 30.

Right now, for troubleshooting, I am just running a simple Ansible playbook that pauses for 120 minutes. This job always fails.

AWX version

23.8.0

Select the relevant components

  • UI
  • UI (tech preview)
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

openshift

Modifications

no

Ansible version

No response

Operating system

OpenShift 4.11.57

Web browser

Chrome

Steps to reproduce

Within AWX, the task shows "Failed: Task was canceled due to receiving a shutdown signal." To replicate the issue, I am just running a simple Ansible playbook that pauses for 120 minutes. I cannot figure out what is sending a shutdown signal to the automation-job pod.

- name: Test long running job in AWX
  hosts: localhost
  connection: local
  gather_facts: no
  become: no
  tasks:
    - name: Pause for 120 minutes to allow testing of the executor pod
      pause:
        minutes: 120

Screenshot 2024-03-04 153340

awx-lab-task-845bbc4f89-w6wkz-awx-lab-task.log

Expected results

I expect the Ansible job to run successfully without timing out.

Actual results

Every job fails with "Task was canceled due to receiving a shutdown signal."

I can see the automation-job pod terminate, but I cannot figure out what is causing it to terminate before the Ansible job has completed.

Additional information

No response

@mmacdo02-tufts
Author

I've also attached logs from the awx-task pod:
awx-lab-task-845bbc4f89-w6wkz-awx-lab-task.log

@mmacdo02-tufts
Author

This appears to be a duplicate of #14876.

According to that issue, the problem is resolved in AWX 23.8.1 and Operator 2.12.1.
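
To check whether a given install already includes the fix, the installed versions can be compared against the releases named above. A minimal sketch, assuming plain dotted numeric version strings; `parse_version` and `includes_fix` are hypothetical helpers, not part of AWX or the operator.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '23.8.0' or 'v2.12.0' into a tuple of ints."""
    return tuple(int(part) for part in version.strip().lstrip("v").split("."))


def includes_fix(installed: str, fixed_in: str) -> bool:
    """True if `installed` is at or above the release reported to contain the fix."""
    return parse_version(installed) >= parse_version(fixed_in)


# Versions from this report, compared with the releases named in the comment above.
print("AWX 23.8.0 includes fix:", includes_fix("23.8.0", "23.8.1"))
print("Operator 2.12.0 includes fix:", includes_fix("2.12.0", "2.12.1"))
```

Comparing integer tuples rather than raw strings keeps the ordering numeric, so 23.10.0 sorts after 23.8.1.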
