
Multi instance Task Manager issues after 8.15 #197145

Closed

tttttx2 opened this issue Oct 21, 2024 · 5 comments
Labels
bug (Fixes for quality problems that affect the customer experience)
Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

tttttx2 commented Oct 21, 2024

Kibana version:
8.15+

Elasticsearch version:
8.15+

Server OS version:
Docker on Debian

Browser version:
N/A

Browser OS version:
N/A

Original install method (e.g. download page, yum, from source, etc.):
Docker

Describe the bug:
I have seen multiple clusters throwing Task Manager errors (Degraded, even though they are not overloaded at all, and HealthStatus.Error because of expired hot timestamps). Furthermore, they only show a single observed_kibana_instances value on the api/task_manager/_health API endpoint. However, Stack Monitoring shows all Kibana instances.

I guess something is regularly killing my task managers on multiple instances, and somehow they don't appear to 'talk' to each other.

I haven't observed this before 8.15, and a cluster on 8.14 is still working fine (with pretty much identical config).
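For reference, this is roughly how I checked each instance; a minimal sketch, where the hostnames and credentials are placeholders and the field path under stats.capacity_estimation is my reading of the health payload, so adjust as needed for your deployment:

```python
# Query the Task Manager health API on each Kibana instance and print how
# many instances the capacity estimation has observed.
import requests

KIBANA_URLS = ["http://kibana-1:5601", "http://kibana-2:5601"]  # placeholder hosts
AUTH = ("elastic", "changeme")  # placeholder credentials

for base in KIBANA_URLS:
    resp = requests.get(f"{base}/api/task_manager/_health", auth=AUTH, timeout=10)
    resp.raise_for_status()
    health = resp.json()
    # Assumed field path: stats.capacity_estimation.value.observed_kibana_instances
    observed = (
        health.get("stats", {})
        .get("capacity_estimation", {})
        .get("value", {})
        .get("observed_kibana_instances")
    )
    print(f"{base}: status={health.get('status')} "
          f"observed_kibana_instances={observed}")
```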

Steps to reproduce:

  1. Upgrade past 8.15.
  2. Run multiple Kibana instances.
  3. Set up Kibana for load balancing; still only one instance is observed in the health API.

Expected behavior:

Multiple Kibana instances should be shown in the health API, and Task Manager should not regularly become degraded (once every 1-2 minutes or so).

Screenshots (if relevant):

Errors in browser console (if relevant):

Provide logs and/or server output (if relevant):

Any additional context:

tttttx2 added the bug label Oct 21, 2024
botelastic bot added the needs-team label Oct 21, 2024
tsullivan added the Team:ResponseOps label Oct 24, 2024
@elasticmachine (Contributor)

Pinging @elastic/response-ops (Team:ResponseOps)

botelastic bot removed the needs-team label Oct 24, 2024
@mikecote (Contributor)

Regarding the observed_kibana_instances issue, this will be fixed when 8.16 goes out via #192568. The issue goes back to 8.8; you'll mainly observe "Task Manager is unhealthy" errors on clusters that run a good volume of background tasks, where the capacity estimation thinks there's only one instance running when there are actually more.

I don't believe the fix mentioned above will solve the HealthStatus.Error because of expired hot timestamps issue. That one mainly occurs when the Task Manager health report contains a last_update or stats.runtime.value.polling.last_successful_poll that is older than (I believe) 4 seconds by default. This is usually caused by scenarios such as: errors returned by Elasticsearch when the Kibana Task Manager polls for tasks to run, high Kibana CPU / a blocked event loop, etc.
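To make the freshness rule above concrete, here is a rough sketch of the check; this is a simplification, not the exact Task Manager logic, and the 4-second default depends on your polling configuration:

```python
from datetime import datetime, timezone

def is_stale(timestamp_iso: str, max_age_seconds: float = 4.0) -> bool:
    """Return True if an ISO-8601 timestamp is older than max_age_seconds."""
    # Kibana emits timestamps like "2024-10-24T12:00:00.000Z"; normalize the
    # trailing "Z" so datetime.fromisoformat accepts it.
    ts = datetime.fromisoformat(timestamp_iso.replace("Z", "+00:00"))
    return (datetime.now(timezone.utc) - ts).total_seconds() > max_age_seconds

# Applied to the health payload, using the field paths described above:
# health = requests.get(f"{kibana}/api/task_manager/_health", ...).json()
# is_stale(health["last_update"])
# is_stale(health["stats"]["runtime"]["value"]["polling"]["last_successful_poll"])
```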

tttttx2 (Author) commented Oct 24, 2024

Thanks @mikecote for your reply. So if I understand this correctly, it's a mostly cosmetic issue that'll be fixed soon, but the Task Manager is actually working fine in the meantime, and I can just ignore it if I don't need proper capacity estimation / health status reporting.

If the expired warnings are unrelated and therefore something I need to investigate further myself, then this issue can be closed again.

Thanks a lot for the help :)

@mikecote (Contributor)

So if I understand this correctly, it's a mostly cosmetic issue that'll be fixed soon, but the Task Manager is actually working fine in the meantime, and I can just ignore it if I don't need proper capacity estimation / health status reporting.

That is correct; the calculations are based on the wrong number of observed Kibana instances, so it's producing false warnings.

If the expired warnings are unrelated and therefore something I need to investigate further myself, then this issue can be closed again.

That is my thinking. Look for "failed to poll for work" logs, or others coming from the task manager plugin; that should help find the underlying cause. I'll leave the issue open a bit longer just in case they end up being related in your case.
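If it helps with that investigation, here is a rough sketch of filtering Kibana's JSON logs for those messages; the log file path and the field names (log.logger, message, @timestamp) are assumptions based on Kibana's default JSON log layout, so adjust them for your deployment:

```python
# Scan Kibana's JSON log output for Task Manager messages such as
# "failed to poll for work".
import json

LOG_FILE = "/var/log/kibana/kibana.json"  # placeholder path

with open(LOG_FILE, encoding="utf-8") as f:
    for line in f:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines
        logger = entry.get("log", {}).get("logger", "")
        message = entry.get("message", "")
        if "taskManager" in logger or "failed to poll for work" in message:
            print(entry.get("@timestamp", "?"), logger, message)
```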

@mikecote (Contributor)

I'll go ahead and close the issue now. I hope the above helped!
