How many times do we retry a single job, on average #141

Open
jenimal opened this issue May 9, 2022 · 7 comments

Comments

@jenimal

jenimal commented May 9, 2022

WMCore retries jobs multiple times to handle transient issues within the system.
I can't find any monitoring that tells me how many times, on average, a job gets resubmitted. The vast majority of jobs obviously succeed in the end, but are they succeeding on the 1st try? The 3rd try?
The ability to break this down by site and exit code might also be useful, so we can monitor which sites are causing more retries than others and for what reasons. This would help us better understand the effectiveness of each site; if a particular site stands out as always needing many retries, it may have an undetected issue that is affecting efficiency.

Jen

@vkuznet
Collaborator

vkuznet commented May 9, 2022

Jen, monitoring does not appear by magic; somebody first needs to write such information somewhere. To answer your request, please provide more information about where this information is stored and who is feeding it to which system. Once we understand how the information is fed to some system, it will be clearer how to answer your question. I also doubt this repo is the right place to request this. It is the repository of tools that the CMS Monitoring group develops, and your request is also very vague. Do you need a CLI tool, a plot, or a dashboard? What would be the input parameters for such a tool/plot/dashboard? How will you interact with it, via terminal or web GUI? I suggest that you restructure your request as follows:

  • provide an example of the tool/plot/dashboard you would like to have, e.g.
jobMonitoring -jobs=MytFavoritJob -show=tries -orderBy=Site
# and it will show
MytFavoritJob  5 T1_X
MytFavoritJob  3 T2_Y

or maybe you want to pass a pattern, etc.

  • next, provide information on where we should read this information from, e.g.
    • all data are stored in WMArchive
    • it can be accessed from X, Y, Z, ...
  • or maybe you need a dashboard that lists some jobs and provides a table with jobName, retries, and Sites, plus filters to select what you need.

Once we have a clear picture then we can discuss how to implement it.
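
To make the hypothetical `jobMonitoring` output above more concrete, here is a minimal sketch of the computation behind it. It assumes the input is a list of per-attempt records with `job` and `site` fields; those field names, the helper names, and the sample records are all made up for illustration and do not correspond to any existing CMS Monitoring tool or data source.

```python
from collections import Counter

# Hypothetical input: one record per job attempt. The field names and the
# values are illustrative assumptions, not data from WMArchive or WMStats.
SAMPLE_ATTEMPTS = (
    [{"job": "MytFavoritJob", "site": "T1_X"}] * 5
    + [{"job": "MytFavoritJob", "site": "T2_Y"}] * 3
    + [{"job": "OtherJob", "site": "T1_X"}]
)


def tries_by_site(attempts, job_name):
    """Count attempts of one job per site, ordered by site name."""
    counts = Counter(a["site"] for a in attempts if a["job"] == job_name)
    return sorted(counts.items())


def format_rows(job_name, rows):
    """Render rows in the 'job  tries site' layout of the example above."""
    return [f"{job_name}  {tries} {site}" for site, tries in rows]


if __name__ == "__main__":
    rows = tries_by_site(SAMPLE_ATTEMPTS, "MytFavoritJob")
    print("\n".join(format_rows("MytFavoritJob", rows)))
```

Whether such a view ends up as a CLI, a plot, or a dashboard panel, the open question stays the same: which system can supply the per-attempt records.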

FYI: @mrceyhun , @leggerf , @brij01

@jenimal
Author

jenimal commented May 9, 2022

Of course it doesn't appear by magic, otherwise I would have made it appear.

Within WMStats I can click on "State Transition" and see this:

State Transition

jobcooloff: 2022/4/19 (Tue) 17:42:23 UTC, T3_US_NERSC
jobcooloff: 2022/4/26 (Tue) 20:31:05 UTC, T3_US_TACC
jobcooloff: 2022/5/2 (Mon) 18:40:07 UTC, T1_DE_KIT
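
As a sketch only, the three lines above could be turned into a retry count like this. It assumes that each `jobcooloff` entry marks one failed attempt that was later retried, and that the text format is exactly the one shown; the function names are hypothetical and this does not use any WMStats API.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Lines copied from the "State Transition" view quoted above.
TRANSITIONS = """\
jobcooloff: 2022/4/19 (Tue) 17:42:23 UTC, T3_US_NERSC
jobcooloff: 2022/4/26 (Tue) 20:31:05 UTC, T3_US_TACC
jobcooloff: 2022/5/2 (Mon) 18:40:07 UTC, T1_DE_KIT
"""

# Pattern for "<state>: <Y/M/D> (<weekday>) <H:M:S> UTC, <site>".
LINE_RE = re.compile(
    r"^(?P<state>\w+): (?P<date>\d{4}/\d{1,2}/\d{1,2}) \(\w+\) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) UTC, (?P<site>\S+)$"
)


def parse_transitions(text):
    """Return (state, utc_datetime, site) tuples for every matching line."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip lines that do not follow the expected layout
        stamp = f"{m['date']} {m['time']}"
        when = datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S")
        rows.append((m["state"], when.replace(tzinfo=timezone.utc), m["site"]))
    return rows


def cooloffs_per_site(rows):
    """Count jobcooloff entries per site (assumed: one per retried attempt)."""
    return Counter(site for state, _, site in rows if state == "jobcooloff")


if __name__ == "__main__":
    rows = parse_transitions(TRANSITIONS)
    print(len(rows), dict(cooloffs_per_site(rows)))
```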

Depending on the exit code, the maximum number of retries is hard-coded.

I believe the number of retries is stored in WMArchive.

What we would like to be able to cross-reference (plotted) is:

  1. The simple number of retries for each job. I suspect this number is 0 or 1, but Christoph wants to know this, as the only jobs we ever look at in the end are the failed ones.
  2. Number of retries per exit code and site, as this could tell us if there is a site issue that we are missing.
  3. Exit code/retry/workflow and/or campaign or dataset: with a file read issue, if we can spot a spike of retries that isn't settling down as a workflow progresses, we may need to make more replicas.
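
The three breakdowns above amount to one grouped average over different keys. A minimal sketch, assuming per-job records with `workflow`, `site`, `exit_code` and `retries` fields; the field names, the helper name, and every value below are placeholders for illustration, not data from WMArchive or any other real source.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-job records. All field names and values are made up;
# exit_code 0 stands for a job that succeeded on its first try.
JOBS = [
    {"workflow": "wf_A", "site": "T1_X", "exit_code": 8021, "retries": 2},
    {"workflow": "wf_A", "site": "T1_X", "exit_code": 8021, "retries": 4},
    {"workflow": "wf_A", "site": "T2_Y", "exit_code": 0, "retries": 0},
    {"workflow": "wf_B", "site": "T2_Y", "exit_code": 50660, "retries": 1},
    {"workflow": "wf_B", "site": "T2_Y", "exit_code": 0, "retries": 0},
]


def mean_retries(jobs, *keys):
    """Average the retries field, grouped by the given record keys."""
    groups = defaultdict(list)
    for job in jobs:
        groups[tuple(job[k] for k in keys)].append(job["retries"])
    return {group: mean(values) for group, values in groups.items()}


if __name__ == "__main__":
    # 1. overall average number of retries per job
    print(mean([job["retries"] for job in JOBS]))
    # 2. average retries per (site, exit code)
    print(mean_retries(JOBS, "site", "exit_code"))
    # 3. average retries per workflow
    print(mean_retries(JOBS, "workflow"))
```

The same grouping would work with campaign or dataset as the key, provided the source records carry those fields.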

@vkuznet
Collaborator

vkuznet commented May 9, 2022

Jen, WMStats is not part of CMS Monitoring and uses an internal CouchDB backend which is not integrated with CMS Monitoring. Therefore, in order to use this information, someone needs to either propagate it to MONIT or provide APIs which we can use to fetch it. In both cases, @amaltaro can answer whether it is possible.

I am also not sure that WMArchive really contains this info. Looking at the WMArchive schema, I don't see any retry info stored in WMArchive documents. Again, @amaltaro can tell us more.

Therefore, even if you clarify a few bits, it is far from clear that we have any information about this in the CMS Monitoring landscape. The information is kept within DMWM internal tools, and in order to use it we need a broader discussion on how to access it (queries, periodicity, etc.) and how to aggregate and store it before visualizing it.

I think the information belongs to DMWM, and DMWM should push it to MONIT, e.g. we have a proxy server for that. Once the information is in MONIT, e.g. in ES, it will be easier to visualize. If instead we keep the information within WMStats, we need to adopt some API to extract/query it from this system, and I'm not aware that such APIs exist.

@mrceyhun
Contributor

mrceyhun commented May 9, 2022

Hi @jenimal ,
Thanks to @vkuznet, who explained all the details and how we can get this monitoring.
As a monitoring operator, sorry, I don't know much about the internals of DMWM. We have WMArchive data in ElasticSearch, but I'm not sure we can get job retries from the ES data. However, if that kind of information exists in DMWM or anywhere else, we can push the data to the monitoring stack, as @vkuznet said. To do that, we need help from the data owner, so that we can help implement the data pipeline and prepare its monitoring.

@mrceyhun
Contributor

mrceyhun commented May 10, 2022

@jenimal I also found CMS_JobRetryCount in ClassAds, here is an example. I'm not sure if this is the info you're looking for, though, because it shows the same retry count for all WMAgent_JobID values of a completed task.

@vkuznet
Collaborator

vkuznet commented May 11, 2022

I created a separate issue, dmwm/WMCore#11140, where we should decide on the approach and on which information we need in CMS Monitoring. Once the information is there, we can discuss here how to deliver it back to users and which tools/plots/dashboards we can provide.

@mrceyhun
Contributor

Thanks @vkuznet
