
Task manager refactor #12629

Merged: 21 commits into ansible:devel from task_manager_refactor_squashed on Aug 10, 2022

Conversation

@fosterseth (Member) commented Aug 5, 2022

Major changes

Split TaskManager into 3 separate managers that can run in parallel

  • DependencyManager processes tasks that might need dependencies spawned, and flips dependencies_processed to True after resolving any dependencies
  • TaskManager looks at pending tasks with dependencies_processed set to True. It is responsible for fitting the task to a control instance and an execution instance, and for submitting the task to the dispatcher.
  • WorkflowManager looks only at running workflows and decides whether it can spawn new jobs (a rough sketch of the three-way split follows this list)
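A minimal sketch of how the split can be thought of in terms of queries (illustrative only; the dependencies_processed field comes from this PR, everything else is simplified and not the actual AWX code):

from awx.main.models import UnifiedJob, WorkflowJob  # assumed import path

def dependency_manager_candidates():
    # DependencyManager: pending tasks whose dependencies have not been processed yet
    return UnifiedJob.objects.filter(status='pending', dependencies_processed=False)

def task_manager_candidates():
    # TaskManager: pending tasks ready to be fitted to a control/execution instance
    # and submitted to the dispatcher
    return UnifiedJob.objects.filter(status='pending', dependencies_processed=True)

def workflow_manager_candidates():
    # WorkflowManager: running workflows whose nodes may be ready to spawn new jobs
    return WorkflowJob.objects.filter(status='running')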

Save progress before timing out manager

  • After 5 minutes, the task manager PID will be killed. However, there is now a grace period that allows each manager to bail out and save whatever progress it has made before it is killed (rough sketch below).
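A minimal sketch of the bail-out idea (the two setting names come from this PR; the values, function names, and loop structure here are only illustrative):

import time

TASK_MANAGER_TIMEOUT = 300              # hard limit (5 minutes) after which the PID is killed
TASK_MANAGER_TIMEOUT_GRACE_PERIOD = 60  # illustrative value

def run_manager(pending_tasks, process_one, save_progress):
    start = time.monotonic()
    deadline = TASK_MANAGER_TIMEOUT - TASK_MANAGER_TIMEOUT_GRACE_PERIOD
    for task in pending_tasks:
        if time.monotonic() - start > deadline:
            # bail out early so there is still time to persist decisions made so far
            break
        process_one(task)
    save_progress()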

Cache preferred_instance_groups and task_impact

  • These two fields on UnifiedJob are fairly expensive, and were being calculated repeatedly across task manager runs. Now these fields are calculated and set in .create_unified_job, so no extra DB hits accrue when accessing them (rough sketch below).
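Roughly, the caching amounts to something like the following at job-creation time (a sketch, not the actual code; compute_task_impact is a hypothetical helper standing in for the real calculation):

def cache_expensive_fields(unified_job):
    # Stash the instance group pk values so later task manager runs read a plain
    # column instead of re-running the expensive preferred_instance_groups logic.
    unified_job.preferred_instance_groups_cache = [ig.pk for ig in unified_job.preferred_instance_groups]
    # task_impact is now persisted as a column rather than recomputed per run.
    unified_job.task_impact = compute_task_impact(unified_job)  # hypothetical helper
    unified_job.save(update_fields=['preferred_instance_groups_cache', 'task_impact'])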

The parent issue tracker can be found here: https://github.com/ansible/tower/issues/5943

ISSUE TYPE
  • New or Enhanced Feature
COMPONENT NAME
  • API
AWX VERSION
awx: 21.3.1.dev72+g9c74678c61

fosterseth and others added 18 commits August 5, 2022 14:29
- DependencyManager spawns dependencies if necessary
- WorkflowManager processes running workflows to see if a new job is
  ready to spawn
- TaskManager starts tasks if unblocked and has execution capacity
implement ansible#12446
in the development environment, enable a set of views that run
the task manager(s).

Also introduce a setting that disables any calls to schedule()
that do not originate from the debug views when in the development
environment. With guards around both the check that we are in the development
environment and the setting, I think we're pretty safe that this won't get
triggered unintentionally.

use MODE to determine if we are in devel env

Also, move test for skipping task managers to the tasks file
More than saving the loop, we save building the WorkflowDag twice, which
makes LOTS of queries!!!

Also, do a bulk update on the WorkflowJobNodes instead of saving in a
loop :fear:
get_tasks uses UnifiedJob
Additionally, make local overrides run after development settings
add settings to define task manager timeout and grace period

This still gives us TASK_MANAGER_TIMEOUT_GRACE_PERIOD worth of time to
get out of the task manager.

Also, apply start task limit in WorkflowManager to starting pending
workflows
each call to schedule_task_manager becomes one of:

ScheduleTaskManager
ScheduleDependencyManager
ScheduleWorkflowManager
We have to look at the sliced job's unified_job_template_id
Now, task_blocked_by works for sliced jobs too.
we had tried doing this in the WorkflowManager, but we decided that
we want to handle ALL pending jobs and "soft blockers" to jobs with the
TaskManager/DependencyGraph and not duplicate that logic in the
WorkflowManager.
We always add the job to the graph right before calling start task.
Reduce complexity of proper operation by just doing this in start_task,
because if you call start_task, you need to add it to the dependency
graph
…yset (ansible#12502)

Instead of loading all pending Workflow Approvals in the task manager,
  run a query that returns only the expired approvals and
  directly expire everything returned by that query

Cache expires time as a new field in order to simplify WorkflowApproval filter
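With the expires time cached as a column, the lookup described above can be expressed roughly as follows (a sketch; expire() is a hypothetical stand-in for whatever actually marks the approval as timed out):

from django.utils import timezone
from awx.main.models import WorkflowApproval  # assumed import path

def expire_timed_out_approvals():
    # Only already-expired approvals come back from the database, instead of
    # loading every pending approval and checking each one in Python.
    for approval in WorkflowApproval.objects.filter(status='pending', expires__lte=timezone.now()):
        approval.expire()  # hypothetical method name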
When creating unified job, stash the list of pk values from the
instance groups returned from preferred_instance_groups so that the
task management system does not need to call out to this method
repeatedly.

.preferred_instance_groups_cache is the new field
task_impact is now a database field
It is calculated and set during create_unified_job

set task_impact on .save for adhoc commands
job.job_explanation = gettext_noop(
"Workflow Job spawned from workflow could not start because it "
"would result in recursion (spawn order, most recent first: {})"
).format(', '.join(['<{}>'.format(tmp) for tmp in display_list]))
Contributor

the .join() call also does not need the parameter to be a list comprehension. An implicit generator expression (i.e. just remove the [ and ]) will work just fine, like with the any() call I mentioned previously.
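The suggested change would look roughly like this:

job.job_explanation = gettext_noop(
    "Workflow Job spawned from workflow could not start because it "
    "would result in recursion (spawn order, most recent first: {})"
).format(', '.join('<{}>'.format(tmp) for tmp in display_list))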

)
task_impact = models.PositiveIntegerField(
default=0,
editable=False,
Member

I don't see why we don't add help_text to this (consistent with the other new field). If you are not sure whether it's taking effect, check the GET section of the OPTIONS data. It's mundane work, but those who work with our API will appreciate it.
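For example (the help_text wording is only illustrative, and this assumes the module already imports gettext_lazy as _):

task_impact = models.PositiveIntegerField(
    default=0,
    editable=False,
    help_text=_('Amount of capacity this job consumes while running.'),  # illustrative wording
)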

@fosterseth fosterseth force-pushed the task_manager_refactor_squashed branch from 9c74678 to eeaa383 Compare August 9, 2022 18:36
self.generate_dependencies(deps)
self.subsystem_metrics.inc(f"{self.prefix}_pending_processed", len(self.all_tasks) + len(deps))

@timeit
Member

I'm still really bothered that we are putting timings on _schedule and not schedule, because this misses timings of on_commit actions, which we know are a constant point of failure. Sure, this could be handled by filing a followup issue, but I really don't want it to get lost or slip into the next release, because it undermines the very point of the task manager metrics. We can still hit the dispatcher timeout limits, with no forewarning by the metrics. Until we fix this, the metrics are not really to be believed.

just the other day, I was able to see this by a new mechanism:

tools_awx_1 | 2022-08-08 17:40:42,465 ERROR    [252b9dcc] awx.main.dispatch Worker failed to run task awx.main.scheduler.tasks.run_task_manager(*[], **{}
tools_awx_1 | Traceback (most recent call last):
tools_awx_1 |   File "/awx_devel/awx/main/dispatch/worker/task.py", line 102, in perform_work
tools_awx_1 |     result = self.run_callable(body)
tools_awx_1 |   File "/awx_devel/awx/main/dispatch/worker/task.py", line 77, in run_callable
tools_awx_1 |     return _call(*args, **kwargs)
tools_awx_1 |   File "/awx_devel/awx/main/scheduler/tasks.py", line 15, in run_task_manager
tools_awx_1 |     TaskManager().schedule()
tools_awx_1 |   File "/awx_devel/awx/main/scheduler/task_manager.py", line 657, in schedule
tools_awx_1 |     start_time = time.time()
tools_awx_1 |   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/transaction.py", line 290, in __exit__
tools_awx_1 |     connection.set_autocommit(True)
tools_awx_1 |   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 420, in set_autocommit
tools_awx_1 |     self.run_and_clear_commit_hooks()
tools_awx_1 |   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 653, in run_and_clear_commit_hooks
tools_awx_1 |     func()
tools_awx_1 |   File "/awx_devel/awx/main/scheduler/task_manager.py", line 294, in post_commit
tools_awx_1 |     task_cls.apply_async(
tools_awx_1 |   File "/awx_devel/awx/main/dispatch/publish.py", line 87, in apply_async
tools_awx_1 |     with pg_bus_conn() as conn:
tools_awx_1 |   File "/usr/lib64/python3.9/contextlib.py", line 119, in __enter__
tools_awx_1 |     return next(self.gen)
tools_awx_1 |   File "/awx_devel/awx/main/dispatch/__init__.py", line 50, in pg_bus_conn
tools_awx_1 |     conn = psycopg2.connect(dbname=conf['NAME'], host=conf['HOST'], user=conf['USER'], password=conf['PASSWORD'], port=conf['PORT'], **conf.get("OPTIONS", {}))
tools_awx_1 |   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/psycopg2/__init__.py", line 126, in connect
tools_awx_1 |     conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
tools_awx_1 | psycopg2.OperationalError: FATAL:  sorry, too many clients already

This shows that the on_commit logic is even more vulnerable than thought, because it requires establishing a new database connection. When approaching connection limits, this guarantees we leave jobs stuck in waiting. That may or may not be exacerbated by failing to close unused connections, not sure yet.
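A minimal sketch of what timing the outer call could look like, so that the on_commit work is covered by the measurement (illustrative only; this is not the PR's @timeit decorator):

import logging
import time

logger = logging.getLogger('awx.main.scheduler')

def timed_schedule(manager):
    # Wrap the full schedule() call, including transaction exit, so on_commit
    # handlers (dispatcher submissions, etc.) are included in the reported time.
    start = time.monotonic()
    try:
        manager.schedule()
    finally:
        logger.debug('schedule() took %.3f s including on_commit hooks', time.monotonic() - start)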

@fosterseth fosterseth force-pushed the task_manager_refactor_squashed branch from eeaa383 to 028a558 Compare August 10, 2022 13:15
@fosterseth fosterseth force-pushed the task_manager_refactor_squashed branch from 028a558 to f7e6a32 Compare August 10, 2022 14:05
@fosterseth fosterseth merged commit 85a5b58 into ansible:devel Aug 10, 2022