
CRAB vs HammerCloud


The HammerCloud tools use CRAB to submit jobs to CMS sites for continuous site monitoring.

This page describes the CRAB features that are meant explicitly for HC use and are not part of the general user documentation, and how HammerCloud submissions are recognized and handled in CRAB.

This information is kept here rather than in the user-facing documentation, since we do not want to encourage users to play with the activity flag.

The user (i.e. you, i.e. HC) sets the configuration parameter General.activity.

If that string contains "hc" (case insensitive), CRAB flags the task as a HammerCloud task and sets these classAds [1] for reporting to MONIT, so that they become keys in ES/Grafana/Kibana searches:

CMS_WMTool   = 'HammerCloud'
CMS_TaskType = <the same string as found in General.activity above>
CMS_Type     = 'Test'

Be aware that CMS_Type = 'Test' is also used by WMA.
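
For illustration, here is a minimal sketch of that recognition step (the function name is made up; the real code is in DagmanCreator.py, see [1]):

def hc_classads(activity):
    # CRAB treats the task as HammerCloud if General.activity contains
    # 'hc', case insensitive, anywhere in the string
    if activity and 'hc' in activity.lower():
        return {'CMS_WMTool': 'HammerCloud',
                'CMS_TaskType': activity,   # the exact General.activity string
                'CMS_Type': 'Test'}
    return {}

# e.g. hc_classads('hctest') and hc_classads('myHCdev') both flag the task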

Besides what is reported, there is the matter of what runs where. CRAB uses a parameter in the TaskWorker configuration [2]:

config.TaskWorker.ActivitiesToRunEverywhere = ['hctest', 'hcdev']

to disable blacklists [3] and the stageout check [4].

So if you want to use e.g. 'hctestNew' and still want it to run at blacklisted sites, you need to tell the CRAB operators in advance so that we change the configuration.
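
In pseudo-form, the effect of that list is just a membership test (a sketch with made-up names, not the actual TaskWorker code; see [3] and [4] for the real thing):

ActivitiesToRunEverywhere = ['hctest', 'hcdev']

def usable_sites(activity, sites, global_blacklist):
    # blacklists (and the stageout check) are skipped for exempted activities
    if activity in ActivitiesToRunEverywhere:
        return list(sites)
    return [s for s in sites if s not in global_blacklist]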

Alternatively, you can explicitly put in crabConfig.py:

config.Site.ignoreGlobalBlacklist = True
config.General.transferOutputs = False
config.General.transferLogs = False

(no transfers, hence no need for the stageout check [5])
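
Putting it together, a crabConfig.py for such a case could look like this (a sketch; 'hctestNew' is just the example activity name from above):

from CRABClient.UserUtilities import config
config = config()

config.General.activity = 'hctestNew'     # contains 'hc' -> flagged as HammerCloud
config.General.transferOutputs = False    # no stageout...
config.General.transferLogs = False       # ...so the stageout check is moot [5]
config.Site.ignoreGlobalBlacklist = True  # run at blacklisted sites anyway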

[1] https://github.com/dmwm/CRABServer/blob/32066a9248142e7851ebf9ebe0dd12f95679bef4/src/python/TaskWorker/Actions/DagmanCreator.py#L424-L453

[2] https://gitlab.cern.ch/ai/it-puppet-hostgroup-vocmsglidein/-/blob/master/code/templates/crab/crabtaskworker/TaskWorkerConfig.py.erb#L94

[3] https://github.com/dmwm/CRABServer/blob/32066a9248142e7851ebf9ebe0dd12f95679bef4/src/python/TaskWorker/Actions/DagmanCreator.py#L797-L805

[4] https://github.com/dmwm/CRABServer/blob/32066a9248142e7851ebf9ebe0dd12f95679bef4/src/python/TaskWorker/Actions/StageoutCheck.py#L14-L21 https://github.com/dmwm/CRABServer/blob/32066a9248142e7851ebf9ebe0dd12f95679bef4/src/python/TaskWorker/Actions/StageoutCheck.py#L96-L100

[5] https://github.com/dmwm/CRABServer/blob/32066a9248142e7851ebf9ebe0dd12f95679bef4/src/python/TaskWorker/Actions/StageoutCheck.py#L102-L105

Slow job release

For HammerCloud, CRAB can release the jobs in a task slowly, so that they are hopefully executed in a steady flow at the sites rather than all at the same time in O(100)-job bunches.

In a nutshell:

  • standard operation: the user submits a 100-job task, and the 100 jobs are queued in HTCondor "asap" via a quick succession of condor_submit calls (this is done by DAGMAN)
  • slow release: the user adds this line to crabConfig.py: config.Debug.extraJDL=['+CRAB_JobReleaseTimeout=Nsec'], where Nsec is an integer number of seconds (see the timing sketch after this list)
    • Then (still via DAGMAN, inserting a delay in each DAG node):
      • task starts in schedd at time T0
      • job #1 is submitted to HTCondor at T0 + Nsec
      • job #2 is submitted to HTCondor at T0 + 2*Nsec
      • ...
      • job #N is submitted to HTCondor at T0 + N*Nsec
    • there is no guarantee and no way to predict when jobs will start running; new submissions do not wait for previous jobs to complete
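
To make the timing concrete, here is the schedule implied by CRAB_JobReleaseTimeout (illustrative numbers; Nsec=300 matches the RunJobs.dag example further down):

Nsec, njobs = 300, 100
release = {i: i * Nsec for i in range(1, njobs + 1)}   # seconds after T0
print(release[1], release[100])   # 300 30000, i.e. the last job ~8.3 hours after T0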

And here is the code, which is all in all clear enough:

https://github.com/dmwm/CRABServer/blob/32066a9248142e7851ebf9ebe0dd12f95679bef4/src/python/TaskWorker/Actions/DagmanCreator.py#L581-L599

https://github.com/dmwm/CRABServer/blob/32066a9248142e7851ebf9ebe0dd12f95679bef4/src/python/TaskWorker/Actions/PreJob.py#L489-L506
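
The net effect is that DagmanCreator writes one DEFER line per job into the DAG. A sketch of what it emits (reconstructed from the real example below, not the actual code):

def defer_lines(njobs, nsec):
    for i in range(1, njobs + 1):
        # "DEFER 4 <delay>": if the PRE script exits with status 4,
        # DAGMAN re-runs it after <delay> seconds
        yield 'SCRIPT DEFER 4 %d PRE Job%d dag_bootstrap.sh' % (i * nsec, i)

for line in defer_lines(3, 300):
    print(line)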

Important details:

  • all the work happens in PreJob context
  • at task start time, when DAGMAN starts, all PreJobs are executed quickly. The code in PreJob.py simply returns from all but the first of them with status=4, which asks DAGMAN to defer them. The printout from PreJob.py is only informational, and the deferral time computed inside it is irrelevant
  • deferred PreJobs are executed again by DAGMAN after the delay indicated in the DAG configuration (SPOOL_DIR/RunJobs.dag)
  • relevant (trimmed) lines from RunJobs.dag in a real example:
SCRIPT DEFER 4 300 PRE  Job1 dag_bootstrap.sh
SCRIPT DEFER 4 600 PRE  Job2 dag_bootstrap.sh
SCRIPT DEFER 4 900 PRE  Job3 dag_bootstrap.sh
SCRIPT DEFER 4 1200 PRE  Job4 dag_bootstrap.sh
SCRIPT DEFER 4 1500 PRE  Job5 dag_bootstrap.sh
SCRIPT DEFER 4 1800 PRE  Job6 dag_bootstrap.sh
SCRIPT DEFER 4 2100 PRE  Job7 dag_bootstrap.sh
...
SCRIPT DEFER 4 30000 PRE  Job100 dag_bootstrap.sh
SCRIPT DEFER 4 30300 PRE  Job101 dag_bootstrap.sh
SCRIPT DEFER 4 30600 PRE  Job102 dag_bootstrap.sh
...
  • so the time when each PreJob runs, and hence when the actual job is submitted to the global pool, is predefined at task start.
  • if DAGMAN stops for any reason (machine reboot, schedd restart, etc.), it is restarted and all PreJobs for non-completed jobs are executed again. Those "past due" are then submitted immediately (the code in the PreJob finds out that there is no need to defer), while the ones still to be submitted are deferred again, by the amount initially specified but now counted from the current (latest) DAGMAN start (see the sketch below)
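
A hypothetical sketch of the decision each PreJob makes (names are mine; the real logic is in PreJob.py, linked above):

import time

DEFER_STATUS = 4   # exit code that tells DAGMAN "run this PreJob again later"

def prejob_exit_code(task_start, job_index, nsec):
    release_time = task_start + job_index * nsec
    if time.time() >= release_time:
        return 0            # past due (e.g. after a DAGMAN restart): submit now
    return DEFER_STATUS     # DAGMAN re-runs this PreJob after the DEFER delay in the DAG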