Recovery of history data after downtimes #60

Open
stiegerb opened this issue Oct 8, 2018 · 3 comments

stiegerb (Collaborator) commented Oct 8, 2018

I don't fully trust how the code recovers history data after it hasn't run for a few hours (e.g. when the VM is down). For example, the VM feeding es-cms was down for a few hours after a reboot on Sunday afternoon (October 7th), and the script was only restarted on Monday afternoon. It recovers some, but not all, of the data:
https://es-cms.cern.ch/kibana/goto/14b8189cfdd5119db8dc25405fa4a9f7

Looking at the code, I suspect this:
https://github.com/dmwm/cms-htcondor-es/blob/master/src/htcondor_es/history.py#L54

where we specify a limit of 10'000 jobs per query (per schedd). Depending on which 10'000 jobs this retrieves, the `last_completion` time will be set such that older jobs are never recovered.

@bbockelm can you clarify which jobs are returned when a limit is passed to `schedd.history`? Should we increase that number?
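
For reference, the pattern I'm worried about looks roughly like this (a minimal sketch, not the actual code in history.py; the constraint, attribute names, and checkpoint handling here are simplifying assumptions):

```python
import htcondor

def query_schedd_history(schedd_ad, last_completion, match_limit=10000):
    """Sketch of a history query with a hard match limit.

    If more than `match_limit` jobs completed since `last_completion`,
    the checkpoint advanced below can skip past jobs that were never
    returned, so they are never recovered.
    """
    schedd = htcondor.Schedd(schedd_ad)
    constraint = "EnteredCurrentStatus >= %d" % int(last_completion)

    new_checkpoint = last_completion
    for ad in schedd.history(constraint, [], match_limit):
        # ... convert and ship the ad to Elasticsearch ...
        new_checkpoint = max(new_checkpoint, ad.get("EnteredCurrentStatus", 0))

    # Persisting this checkpoint when the schedd had more than
    # `match_limit` matching jobs silently drops the older ones.
    return new_checkpoint
```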

bbockelm (Collaborator) commented Oct 8, 2018

So, it used to be that HTCondor didn't index history records in a way that let it stop early, and it would continue to scan through old history files even when it was impossible to find additional matching records.

That is, if you said "return all records from the last 5 minutes" without providing a limit on how many jobs to return, it would search through the entire history database on the schedd side.

Now, based on CMS's complaints, upstream updated the schedd and Python bindings to avoid this situation. IIRC, you provide a "last processed time" (or maybe job ID?) and it'll stop scanning history files once it reaches that point.

Could you do some digging and figure out the minimum HTCondor version that supports this? Once we can confirm all our schedds are patched, we can switch to it and increase the number of records we can recover post-failure.

bbockelm (Collaborator) commented Oct 8, 2018

Probably worth noting that we get the remote HTCondor version as part of the schedd ad in the collector. If a significant number of schedds haven't upgraded, we can parse the version to determine the remote capabilities and take the old or new code path accordingly.
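
Roughly like this, as a sketch (the exact `CondorVersion` string format and the 8.7 cutoff are assumptions to double-check):

```python
import re
import htcondor

def schedd_supports_since(schedd_ad, min_version=(8, 7, 0)):
    """Parse CondorVersion from a schedd ad and compare it to a cutoff.

    CondorVersion strings look like "$CondorVersion: 8.7.9 Jun 28 2018 ... $".
    """
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", schedd_ad.get("CondorVersion", ""))
    if not match:
        return False  # be conservative if the version can't be parsed
    return tuple(int(x) for x in match.groups()) >= min_version

# Split the schedds we query into the old and new code paths.
collector = htcondor.Collector()
schedd_ads = collector.query(htcondor.AdTypes.Schedd, "true",
                             ["Name", "MyAddress", "CondorVersion"])
new_path = [ad for ad in schedd_ads if schedd_supports_since(ad)]
old_path = [ad for ad in schedd_ads if not schedd_supports_since(ad)]
```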

stiegerb (Collaborator, Author) commented Oct 8, 2018

OK, digging a bit, what we want is the `since` option in `schedd.history`. Indeed, you pass a job ID and it will iterate until it encounters it. As far as I can tell, it was introduced in version 8.7 (at least it's not in the docs for older versions):
http://research.cs.wisc.edu/htcondor/manual/v8.7/PythonBindings.html

Of the 63 schedds we're currently querying, only 14 are running version 8.7. All the others are still at version 8.6.

I'll see if I can come up with some code that would work for 8.7.
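
Something like this, roughly (a sketch only; I'm assuming `since` takes a "cluster.proc" job ID as described in the 8.7 docs, and the job-ID checkpointing below is made up for illustration):

```python
import htcondor

def recover_history(schedd_ad, last_seen_job_id, match_limit=10000):
    """Walk history newest-to-oldest, stopping at the last job already processed.

    `last_seen_job_id` is a "cluster.proc" string checkpointed by the
    previous run; with `since`, the schedd stops scanning its history
    files once it reaches that job, so a match limit no longer leaves
    an unrecoverable gap behind.
    """
    schedd = htcondor.Schedd(schedd_ad)
    newest_job_id = last_seen_job_id
    for ad in schedd.history("true", [], match_limit, since=last_seen_job_id):
        if newest_job_id == last_seen_job_id:
            # history is returned newest-first, so the first ad we see
            # becomes the checkpoint for the next run
            newest_job_id = "%d.%d" % (ad["ClusterId"], ad["ProcId"])
        # ... convert and ship the ad to Elasticsearch ...
    return newest_job_id
```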
