
Task manager refactor #12629

Merged: 21 commits into ansible:devel from task_manager_refactor_squashed on Aug 10, 2022

Conversation

@fosterseth (Member) commented Aug 5, 2022

Major changes

Split TaskManager into 3 separate managers that can run in parallel

  • DependencyManager processes tasks that might need dependencies spawned, and flips dependencies_processed to True after resolving any dependencies
  • TaskManager looks at pending tasks with dependencies_processed set to True. It is responsible for fitting the task to a control instance and an execution instance, and for submitting the task to the dispatcher.
  • WorkflowManager looks only at running workflows and decides whether it can spawn new jobs (a rough sketch of the three-way split follows this list)
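A minimal sketch of how the split can be thought of in terms of queries (illustrative only; the dependencies_processed field comes from this PR, everything else is simplified and not the actual AWX code):

from awx.main.models import UnifiedJob, WorkflowJob  # assumed import path

def dependency_manager_candidates():
    # DependencyManager: pending tasks whose dependencies have not been processed yet
    return UnifiedJob.objects.filter(status='pending', dependencies_processed=False)

def task_manager_candidates():
    # TaskManager: pending tasks ready to be fitted to a control/execution instance
    # and submitted to the dispatcher
    return UnifiedJob.objects.filter(status='pending', dependencies_processed=True)

def workflow_manager_candidates():
    # WorkflowManager: running workflows whose nodes may be ready to spawn new jobs
    return WorkflowJob.objects.filter(status='running')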

Save progress before timing out manager

  • After 5 minutes, the task manager PID will be killed. However, there is now a grace period that allows each manager to bail out and save whatever progress it has made before it is killed (rough sketch below).
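A minimal sketch of the bail-out idea (the two setting names come from this PR; the values, function names, and loop structure here are only illustrative):

import time

TASK_MANAGER_TIMEOUT = 300              # hard limit (5 minutes) after which the PID is killed
TASK_MANAGER_TIMEOUT_GRACE_PERIOD = 60  # illustrative value

def run_manager(pending_tasks, process_one, save_progress):
    start = time.monotonic()
    deadline = TASK_MANAGER_TIMEOUT - TASK_MANAGER_TIMEOUT_GRACE_PERIOD
    for task in pending_tasks:
        if time.monotonic() - start > deadline:
            # bail out early so there is still time to persist decisions made so far
            break
        process_one(task)
    save_progress()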

Cache preferred_instance_groups and task_impact

  • These two fields on UnifiedJob are fairly expensive, and were being calculated repeatedly across task manager runs. Now these fields are calculated and set in .create_unified_job, so no extra DB hits accrue when accessing them (rough sketch below).
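Roughly, the caching amounts to something like the following at job-creation time (a sketch, not the actual code; compute_task_impact is a hypothetical helper standing in for the real calculation):

def cache_expensive_fields(unified_job):
    # Stash the instance group pk values so later task manager runs read a plain
    # column instead of re-running the expensive preferred_instance_groups logic.
    unified_job.preferred_instance_groups_cache = [ig.pk for ig in unified_job.preferred_instance_groups]
    # task_impact is now persisted as a column rather than recomputed per run.
    unified_job.task_impact = compute_task_impact(unified_job)  # hypothetical helper
    unified_job.save(update_fields=['preferred_instance_groups_cache', 'task_impact'])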

The parent issue tracker can be found here: https://github.com/ansible/tower/issues/5943

ISSUE TYPE
  • New or Enhanced Feature
COMPONENT NAME
  • API
AWX VERSION
awx: 21.3.1.dev72+g9c74678c61

fosterseth and others added 18 commits August 5, 2022 14:29
- DependencyManager spawns dependencies if necessary
- WorkflowManager processes running workflows to see if a new job is
  ready to spawn
- TaskManager starts tasks if unblocked and has execution capacity
implement ansible#12446
in the development environment, enable a set of views that run
the task manager(s).

Also introduce a setting that disables any calls to schedule()
that do not originate from the debug views when in the development
environment. With guards around both the check that we are in the development
environment and the setting, I think we're pretty safe that this won't get
triggered unintentionally.

use MODE to determine if we are in devel env

Also, move test for skipping task managers to the tasks file
More than saving the loop, we save building the WorkflowDag twice, which
makes LOTS of queries!!!

Also, do a bulk update on the WorkflowJobNodes instead of saving in a
loop :fear:
get_tasks uses UnifiedJob
Additionally, make local overrides run after development settings
add settings to define task manager timeout and grace period

This still gives us TASK_MANAGER_TIMEOUT_GRACE_PERIOD worth of time to
get out of the task manager.

Also, apply start task limit in WorkflowManager to starting pending
workflows
each call to schedule_task_manager becomes one of:

ScheduleTaskManager
ScheduleDependencyManager
ScheduleWorkflowManager
We have to look at the sliced job's unified_job_template_id
Now, task_blocked_by works for sliced jobs too.
we had tried doing this in the WorkflowManager, but we decided that
we want to handle ALL pending jobs and "soft blockers" to jobs with the
TaskManager/DependencyGraph and not duplicate that logic in the
WorkflowManager.
We always add the job to the graph right before calling start task.
Reduce complexity of proper operation by just doing this in start_task,
because if you call start_task, you need to add it to the dependency
graph
…yset (ansible#12502)

Instead of loading all pending Workflow Approvals in the task manager,
  run a query that returns only the expired approvals and
  directly expire everything returned by that query

Cache expires time as a new field in order to simplify WorkflowApproval filter
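With the expires time cached as a column, the lookup described above can be expressed roughly as follows (a sketch; expire() is a hypothetical stand-in for whatever actually marks the approval as timed out):

from django.utils import timezone
from awx.main.models import WorkflowApproval  # assumed import path

def expire_timed_out_approvals():
    # Only already-expired approvals come back from the database, instead of
    # loading every pending approval and checking each one in Python.
    for approval in WorkflowApproval.objects.filter(status='pending', expires__lte=timezone.now()):
        approval.expire()  # hypothetical method name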
When creating unified job, stash the list of pk values from the
instance groups returned from preferred_instance_groups so that the
task management system does not need to call out to this method
repeatedly.

.preferred_instance_groups_cache is the new field
task_impact is now a database field
It is calculated and set during create_unified_job

set task_impact on .save for adhoc commands
job.job_explanation = gettext_noop(
"Workflow Job spawned from workflow could not start because it "
"would result in recursion (spawn order, most recent first: {})"
).format(', '.join(['<{}>'.format(tmp) for tmp in display_list]))
Contributor

the .join() call also does not need the parameter to be a list comprehension. An implicit generator expression (i.e. just remove the [ and ]) will work just fine, like with the any() call I mentioned previously.
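The suggested change would look roughly like this:

job.job_explanation = gettext_noop(
    "Workflow Job spawned from workflow could not start because it "
    "would result in recursion (spawn order, most recent first: {})"
).format(', '.join('<{}>'.format(tmp) for tmp in display_list))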

)
task_impact = models.PositiveIntegerField(
default=0,
editable=False,
Member

I don't see why we don't add help_text to this (consistent with the other new field). If you are not sure whether it's taking effect, check the GET section of the OPTIONS data. It's mundane work, but those who work with our API will appreciate it.
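For example (the help_text wording is only illustrative, and this assumes the module already imports gettext_lazy as _):

task_impact = models.PositiveIntegerField(
    default=0,
    editable=False,
    help_text=_('Amount of capacity this job consumes while running.'),  # illustrative wording
)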

@fosterseth fosterseth force-pushed the task_manager_refactor_squashed branch from 9c74678 to eeaa383 Compare August 9, 2022 18:36
self.generate_dependencies(deps)
self.subsystem_metrics.inc(f"{self.prefix}_pending_processed", len(self.all_tasks) + len(deps))

@timeit
Member

I'm still really bothered that we are putting timings on _schedule and not schedule, because this misses timings of on_commit actions, which we know are a constant point of failure. Sure, this could be handled by filing a followup issue, but I really don't want it to get lost or slip into the next release, because it undermines the very point of the task manager metrics. We can still hit the dispatcher timeout limits, with no forewarning by the metrics. Until we fix this, the metrics are not really to be believed.

just the other day, I was able to see this by a new mechanism:

tools_awx_1 | 2022-08-08 17:40:42,465 ERROR    [252b9dcc] awx.main.dispatch Worker failed to run task awx.main.scheduler.tasks.run_task_manager(*[], **{}
tools_awx_1 | Traceback (most recent call last):
tools_awx_1 |   File "/awx_devel/awx/main/dispatch/worker/task.py", line 102, in perform_work
tools_awx_1 |     result = self.run_callable(body)
tools_awx_1 |   File "/awx_devel/awx/main/dispatch/worker/task.py", line 77, in run_callable
tools_awx_1 |     return _call(*args, **kwargs)
tools_awx_1 |   File "/awx_devel/awx/main/scheduler/tasks.py", line 15, in run_task_manager
tools_awx_1 |     TaskManager().schedule()
tools_awx_1 |   File "/awx_devel/awx/main/scheduler/task_manager.py", line 657, in schedule
tools_awx_1 |     start_time = time.time()
tools_awx_1 |   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/transaction.py", line 290, in __exit__
tools_awx_1 |     connection.set_autocommit(True)
tools_awx_1 |   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 420, in set_autocommit
tools_awx_1 |     self.run_and_clear_commit_hooks()
tools_awx_1 |   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 653, in run_and_clear_commit_hooks
tools_awx_1 |     func()
tools_awx_1 |   File "/awx_devel/awx/main/scheduler/task_manager.py", line 294, in post_commit
tools_awx_1 |     task_cls.apply_async(
tools_awx_1 |   File "/awx_devel/awx/main/dispatch/publish.py", line 87, in apply_async
tools_awx_1 |     with pg_bus_conn() as conn:
tools_awx_1 |   File "/usr/lib64/python3.9/contextlib.py", line 119, in __enter__
tools_awx_1 |     return next(self.gen)
tools_awx_1 |   File "/awx_devel/awx/main/dispatch/__init__.py", line 50, in pg_bus_conn
tools_awx_1 |     conn = psycopg2.connect(dbname=conf['NAME'], host=conf['HOST'], user=conf['USER'], password=conf['PASSWORD'], port=conf['PORT'], **conf.get("OPTIONS", {}))
tools_awx_1 |   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/psycopg2/__init__.py", line 126, in connect
tools_awx_1 |     conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
tools_awx_1 | psycopg2.OperationalError: FATAL:  sorry, too many clients already

This shows that the on_commit logic is even more vulnerable than thought, because it requires establishing a new database connection. When approaching connection limits, this guarantees we leave jobs stuck in waiting. That may or may not be exacerbated by failing to close unused connections, not sure yet.
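A minimal sketch of what timing the outer call could look like, so that the on_commit work is covered by the measurement (illustrative only; this is not the PR's @timeit decorator):

import logging
import time

logger = logging.getLogger('awx.main.scheduler')

def timed_schedule(manager):
    # Wrap the full schedule() call, including transaction exit, so on_commit
    # handlers (dispatcher submissions, etc.) are included in the reported time.
    start = time.monotonic()
    try:
        manager.schedule()
    finally:
        logger.debug('schedule() took %.3f s including on_commit hooks', time.monotonic() - start)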

@fosterseth fosterseth force-pushed the task_manager_refactor_squashed branch from eeaa383 to 028a558 Compare August 10, 2022 13:15
@fosterseth fosterseth force-pushed the task_manager_refactor_squashed branch from 028a558 to f7e6a32 Compare August 10, 2022 14:05
@fosterseth fosterseth merged commit 85a5b58 into ansible:devel Aug 10, 2022