Zombie jobs on Terrascope #958
Comments
Note: those "Doing 'client_credentials' token ..." logs appear because you (or the web editor) are sending status poll requests for that job. So that's not zombie behavior, that's you poking the zombie :)
And I think the other logs are caused by the Zookeeper-ElasticSearch switchover.
There is also the fact that this job still has status "running", even though the job no longer exists in YARN at this point. At the moment, this wrong "running" status is sticky for that job: the job tracker will not update it anymore, because the job falls outside the time window the tracker currently looks at. What we could consider is some kind of cron job that "fixes" these kinds of jobs (very old, status "running", YARN/K8s app not available anymore) and sets their status to "error", to avoid the impression that they are still running.
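A minimal sketch of what such a fix-up pass could look like, just to make the selection criteria concrete. Everything here is an assumption for illustration: `JobRecord`, `find_zombie_jobs`, `fix_zombie_jobs`, the `app_exists` and `set_status` callbacks and the 30-day threshold are hypothetical and not part of the existing job tracker or job registry API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Iterable, List, Optional


@dataclass
class JobRecord:
    # Hypothetical, simplified view of a job registry entry.
    job_id: str
    status: str
    updated: datetime
    app_id: Optional[str] = None  # YARN/K8s application id, if known


def find_zombie_jobs(
    jobs: Iterable[JobRecord],
    app_exists: Callable[[str], bool],
    now: datetime,
    min_age: timedelta = timedelta(days=30),  # assumed threshold
) -> List[JobRecord]:
    """Select jobs that are very old, still "running", and have no live app."""
    zombies = []
    for job in jobs:
        if job.status != "running":
            continue  # only jobs with a sticky "running" status qualify
        if now - job.updated < min_age:
            continue  # recent jobs are left to the regular job tracker
        if job.app_id is not None and app_exists(job.app_id):
            continue  # the YARN/K8s app is still there: not a zombie
        zombies.append(job)
    return zombies


def fix_zombie_jobs(
    jobs: Iterable[JobRecord],
    app_exists: Callable[[str], bool],
    set_status: Callable[[str, str], None],
    now: datetime,
    min_age: timedelta = timedelta(days=30),
) -> List[str]:
    """Set every zombie job to "error" and return the ids that were fixed."""
    fixed = []
    for job in find_zombie_jobs(jobs, app_exists, now, min_age):
        set_status(job.job_id, "error")
        fixed.append(job.job_id)
    return fixed
```

The cluster lookup and the status update are passed in as callbacks, so the selection logic can be tested without a YARN/K8s cluster or a job registry. A real cron job would plug in the actual lookups there.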
I'm going to unassign myself from this ticket. Initial analysis is done. A possible follow-up (cron job to fix up old "running" jobs) is to be planned later.
I had around 10 jobs that got stuck in 'running' status for months. I deleted most of them but kept 2 for debugging:
vito-j-2310315a994c4b97b25869c1d0659270
agg-pj-20240530-105651
A sample of the logs:
The logs don't seem that frequent, but if there are many users with zombie jobs like this, it might put a strain on EJR too.