High concurrency causes delay in AWX Job starts in a workflow #15419

Open
7 of 11 tasks
chinna44 opened this issue Aug 2, 2024 · 0 comments

chinna44 commented Aug 2, 2024

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.
  • I am NOT reporting a (potential) security vulnerability. (These should be emailed to [email protected] instead.)

Bug Summary

When running more than 450 concurrent workflows against the same workflow job template on different inventories, there is a notable delay in starting all the AWX jobs within the workflow. Initially, jobs remain in a "Pending" state and transition to a "Running" state after an average delay of 2 minutes. This issue does not occur when the concurrency is limited to 150 workflows.

AWX version: recently upgraded from 22.5.0 to 23.9.0.
Environment: AWX is hosted on EKS (Elastic Kubernetes Service) version 1.28.
Resource allocation: replica count is set to 10 each for the awx-web and awx-task pods.
awx-web requests: cpu: 1500m and memory: 2Gi
awx-task requests: cpu: 4000m and memory: 8Gi
Database performance: 50% CPU utilization.
Control plane: 20 control plane nodes running (each an EC2 instance with cpu: 8000m and memory: 32Gi).

When the issue occurred, I captured the Datadog logs for the automation-job-id and attached them:
automation-job-id.logs.txt

The ~2 minute delay occurs for every AWX job that runs as part of the bulk launch. In the logs, the delay falls between the "job-10243331 created" and "job-10243331 work unit id assigned" events; the entries in that window relate to inventory sync and some other commands executing on the control plane nodes.
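To quantify the delay across many jobs, a minimal sketch along these lines could help. It assumes job records shaped like the AWX API job objects, which carry `created` and `started` timestamps, and uses the gap between them as a proxy for time spent in "Pending" (this is not exactly the same interval as "created" to "work unit id assigned" in the logs). The sample records are made up for illustration and are not taken from the attached logs; `parse_ts` and `pending_delays` are hypothetical helper names.

```python
from datetime import datetime
from statistics import mean


def parse_ts(value):
    """Parse an ISO 8601 timestamp with a trailing 'Z' (UTC)."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def pending_delays(jobs):
    """Map job id -> seconds between 'created' and 'started'.

    Jobs that have not started yet (started is None) are skipped.
    """
    delays = {}
    for job in jobs:
        if job.get("started") is None:
            continue
        delta = parse_ts(job["started"]) - parse_ts(job["created"])
        delays[job["id"]] = delta.total_seconds()
    return delays


# Illustrative records only (assumed shape, invented values).
sample_jobs = [
    {"id": 1, "created": "2024-08-02T10:00:00.000000Z",
     "started": "2024-08-02T10:02:05.500000Z"},
    {"id": 2, "created": "2024-08-02T10:00:01.000000Z",
     "started": "2024-08-02T10:01:56.000000Z"},
    {"id": 3, "created": "2024-08-02T10:00:02.000000Z", "started": None},
]

delays = pending_delays(sample_jobs)
print(delays)
print("mean delay (s):", mean(delays.values()))
```

Comparing the mean from a 150-workflow run with the mean from a 450-workflow run would show how the pending time scales with concurrency.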

What could be the reason for this delay, and what can be done to avoid it?

AWX version

23.9.0

Select the relevant components

  • UI
  • UI (tech preview)
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

kubernetes

Modifications

no

Ansible version

No response

Operating system

No response

Web browser

Chrome

Steps to reproduce

Run more than 400 workflows simultaneously
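A rough sketch of how such a bulk launch could be scripted against the AWX REST API is shown here. The host, token, template id, and helper names (`build_launch_request`, `launch`, `launch_all`) are placeholders and not values from my environment. It assumes the workflow job template allows concurrent runs and prompts for inventory on launch, so that each launch can target a different inventory.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

AWX_HOST = "https://awx.example.com"  # hypothetical host
TOKEN = "REPLACE_ME"                  # hypothetical OAuth2 token
TEMPLATE_ID = 42                      # hypothetical workflow job template id


def build_launch_request(inventory_id):
    """Build (but do not send) a launch request for one inventory."""
    url = f"{AWX_HOST}/api/v2/workflow_job_templates/{TEMPLATE_ID}/launch/"
    body = json.dumps({"inventory": inventory_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json",
        },
    )


def launch(inventory_id):
    """Send one launch request and return the parsed JSON response."""
    request = build_launch_request(inventory_id)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def launch_all(inventory_ids, max_workers=50):
    """Launch one workflow per inventory using a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(launch, inventory_ids))
```

Calling `launch_all` with a list of more than 400 inventory ids would approximate the load described above; `max_workers` only bounds how many HTTP requests are in flight at once, not how many workflows AWX runs.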

Expected results

AWX jobs within high-concurrency workflows start with minimal delay.

Actual results

AWX jobs within the workflows stay in "Pending" for about 2 minutes before they start running.

Additional information

No response
