Task manager refactor #12629
Conversation
- DependencyManager spawns dependencies if necessary
- WorkflowManager processes running workflows to see if a new job is ready to spawn
- TaskManager starts tasks if unblocked and has execution capacity
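A rough sketch of the split described above; the class names match the description, but the method signatures and bodies here are illustrative, not the actual AWX implementation:

```python
# Illustrative sketch only -- not the actual AWX classes or signatures.
class DependencyManager:
    """Spawns any dependencies (e.g. project updates, inventory syncs) a pending job needs."""

    def schedule(self, pending_tasks):
        for task in pending_tasks:
            # decide whether the task needs a dependency spawned before it can run
            pass


class WorkflowManager:
    """Processes running workflows to see if a new workflow job node is ready to spawn."""

    def schedule(self, running_workflows):
        for workflow in running_workflows:
            # walk the workflow DAG and create jobs for nodes whose parents finished
            pass


class TaskManager:
    """Starts tasks that are unblocked, once a control/execution instance has capacity."""

    def schedule(self, pending_tasks):
        for task in pending_tasks:
            # check blockers and capacity, then hand the task to the dispatcher
            pass
```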
Implement ansible#12446: in the development environment, enable a set of views that run the task manager(s). Also introduce a setting that disables any calls to schedule() that do not originate from the debug views when in the development environment. With guards around both the environment check and the setting, I think we're pretty safe this won't get triggered unintentionally. Use MODE to determine if we are in the devel environment. Also, move the test for skipping task managers to the tasks file.
More than saving the loop, we save building the WorkflowDag twice, which makes LOTS of queries!!! Also, do a bulk update on the WorkflowJobNodes instead of saving in a loop :fear:
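The bulk update relies on Django's `QuerySet.bulk_update`. A minimal sketch of the pattern (the `do_not_run` field and `nodes_to_process` iterable are just examples, not necessarily what this commit touches):

```python
from awx.main.models import WorkflowJobNode

changed_nodes = []
for node in nodes_to_process:        # nodes_to_process: some iterable of WorkflowJobNode
    node.do_not_run = True           # mutate in memory only
    changed_nodes.append(node)

# One UPDATE statement for all changed nodes, instead of one query per node.save()
WorkflowJobNode.objects.bulk_update(changed_nodes, ['do_not_run'])
```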
get_tasks uses UnifiedJob. Additionally, make local overrides run after development settings.
Add settings to define the task manager timeout and grace period. This still gives us TASK_MANAGER_TIMEOUT_GRACE_PERIOD amount of time to get out of the task manager. Also, apply the start task limit in the WorkflowManager to starting pending workflows.
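A minimal sketch of how such a time budget can be enforced inside the scheduling loop, assuming settings named TASK_MANAGER_TIMEOUT and TASK_MANAGER_TIMEOUT_GRACE_PERIOD (the loop and helper names are placeholders, not the AWX code):

```python
import time

from django.conf import settings

start = time.monotonic()


def out_of_time():
    # Stop TASK_MANAGER_TIMEOUT_GRACE_PERIOD seconds early, so there is still time
    # to commit progress and exit cleanly before the dispatcher's hard limit hits.
    budget = settings.TASK_MANAGER_TIMEOUT - settings.TASK_MANAGER_TIMEOUT_GRACE_PERIOD
    return (time.monotonic() - start) > budget


for task in pending_tasks:           # pending_tasks / process_task are placeholders
    if out_of_time():
        break                        # the next scheduler run picks up whatever is left
    process_task(task)
```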
Each call to schedule_task_manager becomes one of ScheduleTaskManager, ScheduleDependencyManager, or ScheduleWorkflowManager.
We have to look at the sliced job's unified_job_template_id. Now, task_blocked_by works for sliced jobs too.
We had tried doing this in the WorkflowManager, but we decided that we want to handle ALL pending jobs and "soft blockers" to jobs with the TaskManager/DependencyGraph, and not duplicate that logic in the WorkflowManager.
We always add the job to the graph right before calling start_task. Reduce complexity by just doing this inside start_task, because if you call start_task, you need to add the job to the dependency graph anyway.
…yset (ansible#12502) Instead of loading all pending Workflow Approvals in the task manager, run a query that only returns the expired approvals, and directly expire all that are returned by that query. Cache the expires time as a new field in order to simplify the WorkflowApproval filter.
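A sketch of the resulting query, assuming the cached field is a datetime named `expires` on WorkflowApproval (the `expire()` call is a stand-in for however the approval actually gets expired):

```python
from django.utils.timezone import now

from awx.main.models import WorkflowApproval

# Only the rows that are actually expired come back from the database,
# instead of iterating every pending approval in Python.
for approval in WorkflowApproval.objects.filter(status='pending', expires__lte=now()):
    approval.expire()   # stand-in for the real expiration handling
```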
When creating a unified job, stash the list of pk values from the instance groups returned by preferred_instance_groups, so that the task management system does not need to call out to this method repeatedly. .preferred_instance_groups_cache is the new field.
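A sketch of the caching step (names follow the description above; this is not the literal implementation):

```python
def stash_preferred_instance_groups(unified_job):
    """Run once when the unified job is created, so later scheduling passes can read
    unified_job.preferred_instance_groups_cache instead of re-running the
    preferred_instance_groups property and its queries."""
    unified_job.preferred_instance_groups_cache = [
        ig.pk for ig in unified_job.preferred_instance_groups
    ]
    unified_job.save(update_fields=['preferred_instance_groups_cache'])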
task_impact is now a field in the database. It is calculated and set during create_unified_job; set task_impact on .save for ad hoc commands.
awx/main/scheduler/task_manager.py
Outdated
job.job_explanation = gettext_noop(
    "Workflow Job spawned from workflow could not start because it "
    "would result in recursion (spawn order, most recent first: {})"
).format(', '.join(['<{}>'.format(tmp) for tmp in display_list]))
The `.join()` call also does not need the parameter to be a list comprehension. An implicit generator expression (i.e. just remove the `[` and `]`) will work just fine, like with the `any()` call I mentioned previously.
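In isolation, the suggested change looks like this:

```python
display_list = ['job 1', 'job 2']

# Current: a list comprehension builds an intermediate list just to join it
', '.join(['<{}>'.format(tmp) for tmp in display_list])

# Suggested: drop the brackets; str.join consumes the generator expression directly
', '.join('<{}>'.format(tmp) for tmp in display_list)
```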
awx/main/models/unified_jobs.py
Outdated
)
task_impact = models.PositiveIntegerField(
    default=0,
    editable=False,
I don't see why we don't add help_text to this (consistent with the other new field). If you are not sure whether it's taking effect, check the GET fields listed in the OPTIONS data. It's mundane work, but those who work with our API will appreciate it.
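A sketch of what that would look like on the field from the diff above (the help_text wording here is illustrative, not from the PR):

```python
from django.db import models
from django.utils.translation import gettext_lazy as _

task_impact = models.PositiveIntegerField(
    default=0,
    editable=False,
    help_text=_('Amount of capacity this job consumes while it is running.'),
)
# DRF's ModelSerializer carries the model help_text through to the serializer field,
# so it shows up in the OPTIONS metadata for the endpoint.
```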
self.generate_dependencies(deps)
self.subsystem_metrics.inc(f"{self.prefix}_pending_processed", len(self.all_tasks) + len(deps))

@timeit
I'm still really bothered that we are putting timings on `_schedule` and not `schedule`, because this misses timings of on_commit actions, which we know are a constant point of failure. Sure, this could be handled by filing a followup issue, but I really don't want it to get lost or slip into the next release, because it undermines the very point of the task manager metrics. We can still hit the dispatcher timeout limits with no forewarning from the metrics. Until we fix this, the metrics are not really to be believed.
Just the other day, I was able to see this via a new mechanism:
tools_awx_1 | 2022-08-08 17:40:42,465 ERROR [252b9dcc] awx.main.dispatch Worker failed to run task awx.main.scheduler.tasks.run_task_manager(*[], **{}
tools_awx_1 | Traceback (most recent call last):
tools_awx_1 | File "/awx_devel/awx/main/dispatch/worker/task.py", line 102, in perform_work
tools_awx_1 | result = self.run_callable(body)
tools_awx_1 | File "/awx_devel/awx/main/dispatch/worker/task.py", line 77, in run_callable
tools_awx_1 | return _call(*args, **kwargs)
tools_awx_1 | File "/awx_devel/awx/main/scheduler/tasks.py", line 15, in run_task_manager
tools_awx_1 | TaskManager().schedule()
tools_awx_1 | File "/awx_devel/awx/main/scheduler/task_manager.py", line 657, in schedule
tools_awx_1 | start_time = time.time()
tools_awx_1 | File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/transaction.py", line 290, in __exit__
tools_awx_1 | connection.set_autocommit(True)
tools_awx_1 | File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 420, in set_autocommit
tools_awx_1 | self.run_and_clear_commit_hooks()
tools_awx_1 | File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 653, in run_and_clear_commit_hooks
tools_awx_1 | func()
tools_awx_1 | File "/awx_devel/awx/main/scheduler/task_manager.py", line 294, in post_commit
tools_awx_1 | task_cls.apply_async(
tools_awx_1 | File "/awx_devel/awx/main/dispatch/publish.py", line 87, in apply_async
tools_awx_1 | with pg_bus_conn() as conn:
tools_awx_1 | File "/usr/lib64/python3.9/contextlib.py", line 119, in __enter__
tools_awx_1 | return next(self.gen)
tools_awx_1 | File "/awx_devel/awx/main/dispatch/__init__.py", line 50, in pg_bus_conn
tools_awx_1 | conn = psycopg2.connect(dbname=conf['NAME'], host=conf['HOST'], user=conf['USER'], password=conf['PASSWORD'], port=conf['PORT'], **conf.get("OPTIONS", {}))
tools_awx_1 | File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/psycopg2/__init__.py", line 126, in connect
tools_awx_1 | conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
tools_awx_1 | psycopg2.OperationalError: FATAL: sorry, too many clients already
This shows that the `on_commit` logic is even more vulnerable than thought, because it requires establishing a new database connection. When approaching connection limits, this guarantees we leave jobs stuck in waiting. That may or may not be exacerbated by failing to close unused connections, not sure yet.
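A minimal, framework-free illustration of the measurement gap (the `timeit` decorator and `fake_transaction` context manager below are stand-ins, not the AWX code): timing only the inner function misses work that Django defers to on_commit hooks, which run only when the outer transaction block exits.

```python
import time
from contextlib import contextmanager

_on_commit_hooks = []


@contextmanager
def fake_transaction():
    yield
    # Django runs transaction.on_commit callbacks here, after the wrapped body
    # returns -- outside anything that only times the body itself.
    for hook in _on_commit_hooks:
        hook()


def timeit(func):
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return func(*args, **kwargs)
        finally:
            print(f"{func.__name__} took {time.monotonic() - start:.3f}s")
    return wrapper


@timeit
def _schedule():
    # e.g. queueing messages to the dispatcher happens via a deferred hook
    _on_commit_hooks.append(lambda: time.sleep(0.5))


def schedule():
    with fake_transaction():
        _schedule()


schedule()  # reports ~0s for _schedule even though the deferred hook burned 0.5s
```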
Inherit from our own APIView, not rest_framework's.
Major changes
- Split TaskManager into 3 separate managers that can run in parallel
  - The DependencyManager sets `dependencies_processed` to True after resolving any dependencies
  - The TaskManager only considers tasks with `dependencies_processed` True. It is responsible for fitting the task to a control instance and execution instance, and submitting the task to the dispatcher.
- Save progress before timing out the manager
- Cache preferred_instance_groups and task_impact
  - Both are set during `.create_unified_job`, thus no extra DB hits accrue when accessing these fields.

The parent issue tracker can be found here: https://github.com/ansible/tower/issues/5943
ISSUE TYPE
COMPONENT NAME
AWX VERSION