
Zombie jobs on Terrascope #958

Open
EmileSonneveld opened this issue Dec 3, 2024 · 5 comments

Comments

@EmileSonneveld
Contributor

I had around 10 jobs that got stuck in 'running' status for months. I deleted all of them except 2, which I kept for debugging:
vito-j-2310315a994c4b97b25869c1d0659270
agg-pj-20240530-105651

A sample of the logs:

Dec 3, 2024 @ 10:01:19.377	INFO	Doing 'client_credentials' token request 'https://sso.terrascope.be/auth/realms/terrascope/protocol/openid-connect/token' with post data fields ['grant_type', 'client_id', 'client_secret', 'scope'] (client_id 'openeo-elastic-job-registry')	oidc.py
Dec 2, 2024 @ 22:58:03.096	INFO	Doing 'client_credentials' token request 'https://sso.terrascope.be/auth/realms/terrascope/protocol/openid-connect/token' with post data fields ['grant_type', 'client_id', 'client_secret', 'scope'] (client_id 'openeo-elastic-job-registry')	oidc.py
Dec 2, 2024 @ 22:47:03.756	INFO	Doing 'client_credentials' token request 'https://sso.terrascope.be/auth/realms/terrascope/protocol/openid-connect/token' with post data fields ['grant_type', 'client_id', 'client_secret', 'scope'] (client_id 'openeo-elastic-job-registry')	oidc.py
Dec 2, 2024 @ 22:45:04.180	INFO	Doing 'client_credentials' token request 'https://sso.terrascope.be/auth/realms/terrascope/protocol/openid-connect/token' with post data fields ['grant_type', 'client_id', 'client_secret', 'scope'] (client_id 'openeo-elastic-job-registry')	oidc.py
Dec 2, 2024 @ 22:40:03.075	INFO	Doing 'client_credentials' token request 'https://sso.terrascope.be/auth/realms/terrascope/protocol/openid-connect/token' with post data fields ['grant_type', 'client_id', 'client_secret', 'scope'] (client_id 'openeo-elastic-job-registry')	oidc.py
Dec 2, 2024 @ 21:40:00.103	INFO	Doing 'client_credentials' token request 'https://sso.terrascope.be/auth/realms/terrascope/protocol/openid-connect/token' with post data fields ['grant_type', 'client_id', 'client_secret', 'scope'] (client_id 'openeo-elastic-job-registry')	oidc.py
Dec 2, 2024 @ 21:38:05.637	INFO	Doing 'client_credentials' token request 'https://sso.terrascope.be/auth/realms/terrascope/protocol/openid-connect/token' with post data fields ['grant_type', 'client_id', 'client_secret', 'scope'] (client_id 'openeo-elastic-job-registry')	oidc.py
Oct 10, 2024 @ 10:06:10.448	ERROR	App not found: job_id='j-2310315a994c4b97b25869c1d0659270' application_id='application_1696843816575_99634'	job_tracker_v2.py
Oct 10, 2024 @ 10:06:10.377	DEBUG	About to sync status for job_id='j-2310315a994c4b97b25869c1d0659270' user_id='c907428632b46cb406ef02a01a4a5dc34aedad370369d7b9d5c314a3b666fd03@egi.eu' application_id='application_1696843816575_99634' previous_status='running'	job_tracker_v2.py
...

The logs are not that frequent, but if there are many users with zombie jobs like this, it might put a strain on EJR too.

@soxofaan
Member

soxofaan commented Dec 3, 2024

note: those "Doing 'client_credentials' token ..." logs appear because you (or the web editor) are doing status poll requests for that job, so that's not zombie behavior, that's you poking the zombie :)

@soxofaan
Member

soxofaan commented Dec 3, 2024

And I think the other logs are caused by the Zookeeper-ElasticSearch switchover

@soxofaan
Member

soxofaan commented Dec 3, 2024

There is also the fact that this job still has status "running", even though the job does not exist anymore at this point (in YARN).
There are probably multiple possible reasons to get in that situation, hard to guess now what actually happened.

At the moment, this wrong "running" status is sticky for that job, as the job tracker will no longer consider updating it (the job falls outside the time window the tracker currently looks at).

What we could consider is some kind of cron job that "fixes" these kinds of jobs (very old, status "running", YARN/K8s app not available anymore) by setting their status to "error", to avoid the confusion that they would still appear to be running.
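A rough sketch of what such a cleanup could do. This is an illustration only, not the actual job tracker code: job records are plain dicts here, and `app_exists` / `set_status` are hypothetical stand-ins for the real YARN/K8s application lookup and EJR status update.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=30)  # assumed staleness threshold

def find_zombie_jobs(jobs, app_exists, now=None):
    """Return ids of jobs stuck in 'running' whose app no longer exists."""
    now = now or datetime.now(timezone.utc)
    return [
        job["job_id"]
        for job in jobs
        if job["status"] == "running"
        and now - job["created"] >= STALE_AFTER  # outside the tracker's window
        and not app_exists(job["application_id"])
    ]

def fixup_zombie_jobs(jobs, app_exists, set_status):
    """Set detected zombie jobs to 'error' so they stop looking alive."""
    for job_id in find_zombie_jobs(jobs, app_exists):
        set_status(job_id, "error")

# Example with the job from this issue plus a recent one that must be left alone:
jobs = [
    {"job_id": "j-2310315a994c4b97b25869c1d0659270", "status": "running",
     "created": datetime(2023, 10, 31, tzinfo=timezone.utc),
     "application_id": "application_1696843816575_99634"},
    {"job_id": "j-recent", "status": "running",
     "created": datetime(2024, 12, 1, tzinfo=timezone.utc),
     "application_id": "application_recent"},
]
now = datetime(2024, 12, 3, tzinfo=timezone.utc)
print(find_zombie_jobs(jobs, app_exists=lambda app_id: False, now=now))
# only the months-old job is flagged
```

The staleness threshold keeps the cron job from racing the regular tracker: anything the tracker still polls is left untouched, and only jobs it has stopped looking at get fixed up.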

@soxofaan
Member

soxofaan commented Dec 3, 2024

I did a quick lookup in EJR and it's apparently not that uncommon: I found 665 jobs created between Jan 1 and Sep 1 that still have status "running".

85% on mep-prod, 10% on CDSE prod (so it's not a YARN-specific issue)
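For reference, that lookup could be expressed as an Elasticsearch bool query along these lines. The index and field names are assumptions here, not the real EJR schema:

```python
# Hypothetical EJR query: jobs created between Jan 1 and Sep 1
# that still report status "running".
query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"status": "running"}},
                {"range": {"created": {"gte": "2024-01-01", "lt": "2024-09-01"}}},
            ]
        }
    }
}
print(query["query"]["bool"]["filter"])
```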


@soxofaan
Member

soxofaan commented Dec 3, 2024

I'm going to unassign myself from this ticket. Initial analysis is done. A possible follow-up (a cron job to fix up old "running" jobs) is to be planned later.

@soxofaan soxofaan removed their assignment Dec 3, 2024