Recovery of history data after downtimes #60

Open
stiegerb opened this issue Oct 8, 2018 · 3 comments

stiegerb (Collaborator) commented Oct 8, 2018

I don't fully trust how the code recovers history data after it hasn't run for a few hours (e.g. when the VM is down). For example, the VM feeding es-cms was down for a few hours after a reboot on Sunday afternoon (October 7th), and the script was only restarted on Monday afternoon. It recovers some, but not all, of the data:
https://es-cms.cern.ch/kibana/goto/14b8189cfdd5119db8dc25405fa4a9f7

Looking at the code, I suspect this:
https://github.com/dmwm/cms-htcondor-es/blob/master/src/htcondor_es/history.py#L54

where we specify a limit of 10'000 jobs per query (per schedd). Depending on which 10'000 jobs this retrieves, the `last_completion` time will be set such that older jobs are never recovered.

@bbockelm can you clarify which jobs are returned when a limit is passed to `schedd.history`? Should we increase that number?
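
For reference, the pattern I'm worried about looks roughly like this (a minimal sketch, not the actual code in history.py; the constraint, attribute names, and checkpoint handling here are simplifying assumptions):

```python
import htcondor

def query_schedd_history(schedd_ad, last_completion, match_limit=10000):
    """Sketch of a history query with a hard match limit.

    If more than `match_limit` jobs completed since `last_completion`,
    the checkpoint advanced below can skip past jobs that were never
    returned, so they are never recovered.
    """
    schedd = htcondor.Schedd(schedd_ad)
    constraint = "EnteredCurrentStatus >= %d" % int(last_completion)

    new_checkpoint = last_completion
    for ad in schedd.history(constraint, [], match_limit):
        # ... convert and ship the ad to Elasticsearch ...
        new_checkpoint = max(new_checkpoint, ad.get("EnteredCurrentStatus", 0))

    # Persisting this checkpoint when the schedd had more than
    # `match_limit` matching jobs silently drops the older ones.
    return new_checkpoint
```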

bbockelm (Collaborator) commented Oct 8, 2018

So, it used to be that HTCondor didn't index history records in a way that let it stop early, and it would continue to scan through old history files even when it was impossible to find additional matching records.

That is, if you said "return all records from the last 5 minutes" without providing a limit on how many jobs to return, it would search through the entire history database on the schedd side.

Now, based on CMS's complaints, upstream updated the schedd and Python bindings to avoid this situation. IIRC, you provide a "last processed time" (or maybe job ID?) and it'll stop scanning history files once it reaches that point.

Could you do some digging and figure out the minimum HTCondor version that supports this? Once we can confirm all our schedds are patched, we can switch to it and increase the number of records we can recover post-failure.

bbockelm (Collaborator) commented Oct 8, 2018

Probably worth noting that we get the remote HTCondor version as part of the schedd ad in the collector. If a significant number of schedds haven't upgraded, we can parse the version to determine the remote capabilities and take the old or new code path accordingly.
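
Roughly like this, as a sketch (the exact `CondorVersion` string format and the 8.7 cutoff are assumptions to double-check):

```python
import re
import htcondor

def schedd_supports_since(schedd_ad, min_version=(8, 7, 0)):
    """Parse CondorVersion from a schedd ad and compare it to a cutoff.

    CondorVersion strings look like "$CondorVersion: 8.7.9 Jun 28 2018 ... $".
    """
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", schedd_ad.get("CondorVersion", ""))
    if not match:
        return False  # be conservative if the version can't be parsed
    return tuple(int(x) for x in match.groups()) >= min_version

# Split the schedds we query into the old and new code paths.
collector = htcondor.Collector()
schedd_ads = collector.query(htcondor.AdTypes.Schedd, "true",
                             ["Name", "MyAddress", "CondorVersion"])
new_path = [ad for ad in schedd_ads if schedd_supports_since(ad)]
old_path = [ad for ad in schedd_ads if not schedd_supports_since(ad)]
```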

stiegerb (Collaborator, Author) commented Oct 8, 2018

OK, digging a bit, what we want is the `since` option in `schedd.history`. Indeed, you pass a job ID and it will iterate until it encounters it. As far as I can tell, it was introduced in version 8.7 (at least it's not in the docs for older versions):
http://research.cs.wisc.edu/htcondor/manual/v8.7/PythonBindings.html

Of the 63 schedds we're currently querying, only 14 are running version 8.7. All the others are still at version 8.6.

I'll see if I can come up with some code that would work for 8.7.
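
Something like this, roughly (a sketch only; I'm assuming `since` takes a "cluster.proc" job ID as described in the 8.7 docs, and the job-ID checkpointing below is made up for illustration):

```python
import htcondor

def recover_history(schedd_ad, last_seen_job_id, match_limit=10000):
    """Walk history newest-to-oldest, stopping at the last job already processed.

    `last_seen_job_id` is a "cluster.proc" string checkpointed by the
    previous run; with `since`, the schedd stops scanning its history
    files once it reaches that job, so a match limit no longer leaves
    an unrecoverable gap behind.
    """
    schedd = htcondor.Schedd(schedd_ad)
    newest_job_id = last_seen_job_id
    for ad in schedd.history("true", [], match_limit, since=last_seen_job_id):
        if newest_job_id == last_seen_job_id:
            # history is returned newest-first, so the first ad we see
            # becomes the checkpoint for the next run
            newest_job_id = "%d.%d" % (ad["ClusterId"], ad["ProcId"])
        # ... convert and ship the ad to Elasticsearch ...
    return newest_job_id
```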
