
Multi instance Task Manager issues after 8.15 #197145

Closed

tttttx2 opened this issue Oct 21, 2024 · 5 comments
Labels
bug (Fixes for quality problems that affect the customer experience)
Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

tttttx2 commented Oct 21, 2024

Kibana version:
8.15+

Elasticsearch version:
8.15+

Server OS version:
Docker on Debian

Browser version:
N/A

Browser OS version:
N/A

Original install method (e.g. download page, yum, from source, etc.):
Docker

Describe the bug:
I have seen multiple clusters throwing Task Manager errors (Degraded, even though they are not overloaded at all, and HealthStatus.Error because of expired hot timestamps). Furthermore, they only show a single observed_kibana_instances value on the api/task_manager/_health API endpoint. However, Stack Monitoring shows all Kibana instances.

I guess something is regularly killing my task managers on multiple instances, and somehow they don't appear to 'talk' to each other.

I haven't observed this before 8.15, and a cluster on 8.14 is still working fine (with pretty much identical config).
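For reference, this is roughly how I checked each instance; a minimal sketch, where the hostnames and credentials are placeholders and the field path under stats.capacity_estimation is my reading of the health payload, so adjust as needed for your deployment:

```python
# Query the Task Manager health API on each Kibana instance and print how
# many instances the capacity estimation has observed.
import requests

KIBANA_URLS = ["http://kibana-1:5601", "http://kibana-2:5601"]  # placeholder hosts
AUTH = ("elastic", "changeme")  # placeholder credentials

for base in KIBANA_URLS:
    resp = requests.get(f"{base}/api/task_manager/_health", auth=AUTH, timeout=10)
    resp.raise_for_status()
    health = resp.json()
    # Assumed field path: stats.capacity_estimation.value.observed_kibana_instances
    observed = (
        health.get("stats", {})
        .get("capacity_estimation", {})
        .get("value", {})
        .get("observed_kibana_instances")
    )
    print(f"{base}: status={health.get('status')} "
          f"observed_kibana_instances={observed}")
```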

Steps to reproduce:

  1. Upgrade past 8.15.
  2. Run multiple Kibana instances.
  3. Set up Kibana for load balancing; still only one instance is observed in the health API.

Expected behavior:

Multiple Kibana instances should be shown in the health API, and Task Manager should not regularly become degraded (once every 1-2 minutes or so).

Screenshots (if relevant):

Errors in browser console (if relevant):

Provide logs and/or server output (if relevant):

Any additional context:

tttttx2 added the bug label Oct 21, 2024
botelastic bot added the needs-team label Oct 21, 2024
tsullivan added the Team:ResponseOps label Oct 24, 2024
@elasticmachine (Contributor)

Pinging @elastic/response-ops (Team:ResponseOps)

botelastic bot removed the needs-team label Oct 24, 2024
@mikecote (Contributor)

Regarding the observed_kibana_instances issue, this will be fixed when 8.16 goes out via #192568. The issue goes back to 8.8; you'll mainly observe "Task Manager is unhealthy" errors on clusters that run a good volume of background tasks, where the capacity estimation thinks there's only one instance running when there are actually more.

I don't believe the fix mentioned above will solve the HealthStatus.Error because of expired hot timestamps issue. That one mainly occurs when the Task Manager health report contains a last_update or stats.runtime.value.polling.last_successful_poll that is older than (I believe) 4 seconds by default. This is usually caused by scenarios such as: errors returned by Elasticsearch when the Kibana Task Manager polls for tasks to run, high Kibana CPU / a blocked event loop, etc.
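To make the freshness rule above concrete, here is a rough sketch of the check; this is a simplification, not the exact Task Manager logic, and the 4-second default depends on your polling configuration:

```python
from datetime import datetime, timezone

def is_stale(timestamp_iso: str, max_age_seconds: float = 4.0) -> bool:
    """Return True if an ISO-8601 timestamp is older than max_age_seconds."""
    # Kibana emits timestamps like "2024-10-24T12:00:00.000Z"; normalize the
    # trailing "Z" so datetime.fromisoformat accepts it.
    ts = datetime.fromisoformat(timestamp_iso.replace("Z", "+00:00"))
    return (datetime.now(timezone.utc) - ts).total_seconds() > max_age_seconds

# Applied to the health payload, using the field paths described above:
# health = requests.get(f"{kibana}/api/task_manager/_health", ...).json()
# is_stale(health["last_update"])
# is_stale(health["stats"]["runtime"]["value"]["polling"]["last_successful_poll"])
```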

tttttx2 (Author) commented Oct 24, 2024

Thanks @mikecote for your reply. So if I understand this correctly, it's a mostly cosmetic issue that'll be fixed soon, but the Task Manager is actually working fine in the meantime, and I can just ignore it if I don't need proper capacity estimation / health status reporting.

If the expired warnings are unrelated and therefore something I need to investigate further myself, then this issue can be closed again.

Thanks a lot for the help :)

@mikecote (Contributor)

So if I understand this correctly, it's a mostly cosmetic issue that'll be fixed soon, but the Task Manager is actually working fine in the meantime, and I can just ignore it if I don't need proper capacity estimation / health status reporting.

That is correct; the calculations are based on the wrong number of observed Kibana instances, so it's producing false warnings.

If the expired warnings are unrelated and therefore something I need to investigate further myself, then this issue can be closed again.

That is my thinking. Look for "failed to poll for work" logs, or others coming from the task manager plugin; that should help find the underlying cause. I'll leave the issue open a bit longer just in case they end up being related in your case.
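If it helps with that investigation, here is a rough sketch of filtering Kibana's JSON logs for those messages; the log file path and the field names (log.logger, message, @timestamp) are assumptions based on Kibana's default JSON log layout, so adjust them for your deployment:

```python
# Scan Kibana's JSON log output for Task Manager messages such as
# "failed to poll for work".
import json

LOG_FILE = "/var/log/kibana/kibana.json"  # placeholder path

with open(LOG_FILE, encoding="utf-8") as f:
    for line in f:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines
        logger = entry.get("log", {}).get("logger", "")
        message = entry.get("message", "")
        if "taskManager" in logger or "failed to poll for work" in message:
            print(entry.get("@timestamp", "?"), logger, message)
```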

@mikecote (Contributor)

I'll go ahead and close the issue now. I hope the above helped!
