
AUTO TUNING of jobs time limit


Problem: most jobs have a very long time limit, because users either do not specify anything (and so use the default of 24h) or set a conservative limit that lets all jobs in the task succeed. But gWms uses the time limit both to kill jobs (so it is good to be conservative) and to schedule. Using a long time for scheduling makes it impossible to fit jobs in the tail of existing pilots, leading to pilot churn and/or underutilization of partially used multicore pilots.

Solution (proposed by Brian) in two steps:

  1. Introduce two ClassAds attributes (see https://github.com/dmwm/CRABServer/pull/5463 for implementation):

    • EstimatedWallTimeMins: Used for matchmaking of jobs within HTCondor. This is initially set to the wall time requested by the user.
    • MaxWallTimeMins: If the job is idle (to be matched), this evaluates to the value of EstimatedWallTimeMins. Otherwise, it is set to the user-requested limit (in CRAB this defaults to 20 hours) and is used by the condor_schedd to kill jobs that have gone over the runtime limit.
  2. Introduce a mechanism (based on the existing work for WMAgent) to automatically tune EstimatedWallTimeMins based on the time it actually takes for jobs to run:

    • gwmsmon provides running time percentiles for a task.
    • a python script calculates the new EstimatedWallTimeMins as follows (see the sketch after this list):
      • If less than 20 jobs have finished - or the gwmsmon query results in errors - do nothing!
      • If at least 20 jobs have finished, take the 95th percentile of the runtime for completed jobs; set estimated run time as min(95th percentile, user-provided runtime).
    • This python script will provide a new configuration for the JobRouter running on the CRAB3 schedd. The route will update the ClassAds for idle jobs
      • JobRouter scales much better than a cronjob performing condor_qedit for CRAB3 jobs.
    • In order to preserve a single autocluster per task, all jobs in a CRAB3 task will get the same value of EstimatedWallTimeMins.
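
A minimal Python sketch of the tuning rule described above (the gwmsmon query is abstracted behind a hypothetical get_runtime_percentiles callable; the real script and its error handling will differ):

    def tune_estimated_wall_time_mins(task_name, user_requested_mins,
                                      finished_jobs, get_runtime_percentiles):
        """Return the new EstimatedWallTimeMins for a task, or None to leave it alone.

        get_runtime_percentiles(task_name) is a hypothetical callable that queries
        gwmsmon and returns the runtime percentiles in hours, e.g. {"95.0": 19.8, ...}.
        """
        if finished_jobs < 20:
            return None          # not enough statistics: do nothing
        try:
            percentiles = get_runtime_percentiles(task_name)
        except Exception:
            return None          # gwmsmon query failed: do nothing
        p95_mins = percentiles["95.0"] * 60   # gwmsmon reports hours
        # never ask for more than the user requested
        return int(min(p95_mins, user_requested_mins))
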

As of April 19, 2017, Justas has done the work in gwmsmon [1]

Work to do is tracked in:

The links to the source:


QUESTIONS:

  1. how do we deal with jobs which run into the time limit? Do we resubmit in the post-job with limit *= 1.5 until we hit 48h? see https://hypernews.cern.ch/HyperNews/CMS/get/crabDevelopment/2617/1/1.html
  2. what happens to jobs killed by the pilot reaching its end of life before the payload does?

REFERENCES:

https://cms-gwmsmon.cern.ch/analysisview/json/historynew/percentileruntime720/smitra/170411_132805:smitra_crab_DYJets

which returns a JSON document like this:
{"hits": {"hits": [], "total": 263, "max_score": 0.0}, "_shards": {"successful": 31, "failed": 0, "total": 31}, "took": 40, "aggregations": {"2": {"values": {"5.0": 6.5436111111111126, "25.0": 11.444305555555555, "1.0": 3.5115222222222222, "95.0": 19.811305555555556, "75.0": 16.773194444444446, "99.0": 20.513038888888889, "50.0": 13.365277777777777}}}, "timed_out": false}

in a format that is hopefully fixed forever, so that the "values" dictionary can be extracted
and one would e.g. pick the "95.0" entry (i.e. 19.8 hours).
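
A short Python sketch of how the 95th percentile could be pulled out of such a response, assuming the format stays as in the example above:

    import json

    def extract_runtime_percentile(gwmsmon_response_text, percentile="95.0"):
        """Extract one runtime percentile (in hours) from a gwmsmon
        percentileruntime720 response like the example above."""
        doc = json.loads(gwmsmon_response_text)
        return doc["aggregations"]["2"]["values"][percentile]   # e.g. 19.81 for the example

For reference, an example JobRouter route (here from the existing WMAgent timing tuning that this work is modeled on) looks like:
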
    [
        Name = "Set timing requirement to 105";
        set_HasBeenRouted = false;
        set_HasBeenTimingTuned = true;
        GridResource = "condor localhost localhost";
        Requirements = member(target.WMAgent_SubTaskName,
           {
              "/pdmvserv_task_TSG-PhaseIISpring17GS-00002__v1_T_170419_164853_2948/TSG-PhaseIISpring17GS-00002_0"
           }) && ( target.HasBeenTimingTuned isnt true ) && ( target.MaxWallTimeMins <= 105 );
        set_OriginalMaxWallTimeMins = 105;
        TimeTasknames =
           {
              "/pdmvserv_task_TSG-PhaseIISpring17GS-00002__v1_T_170419_164853_2948/TSG-PhaseIISpring17GS-00002_0"
           };
        TargetUniverse = 5
    ] 
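
A sketch of how the python script could emit an analogous per-task route for CRAB jobs, following the plan above (the task-identifying classAd, assumed here to be CRAB_ReqName, and the exact route fields are illustrative, modeled on the WMAgent example; the refined solution below edits MaxWallTimeMins instead of EstimatedWallTimeMins):

    def crab_timing_route(task_name, new_estimate_mins):
        """Build a JobRouter route (new-ClassAd syntax) that sets the tuned
        estimate on all not-yet-tuned jobs of one CRAB task."""
        return "\n".join([
            '[',
            '    Name = "Set timing estimate for %s";' % task_name,
            '    set_HasBeenRouted = false;',
            '    set_HasBeenTimingTuned = true;',
            '    GridResource = "condor localhost localhost";',
            '    Requirements = ( target.CRAB_ReqName == "%s" ) &&' % task_name,
            '                   ( target.HasBeenTimingTuned isnt true );',
            '    set_EstimatedWallTimeMins = %d;' % new_estimate_mins,
            '    TargetUniverse = 5',
            ']',
        ])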

Brian's solution above did not work

It was tried in Feb 2018, but we had to roll it back since it resulted in lots of early job kills and restarts, putting load on the schedd's and wasting resources.

Refined Solution

The problem is that we do not simply run on a vanilla condor pool, where that would have been fine. Our startd's are managed by glideinWms pilots, which have both a MATCH and a START expression.

Some documentation about this can be found in https://twiki.cern.ch/twiki/bin/view/CMS/GlideinWMSFrontendOpsGWMS in particular: https://twiki.cern.ch/twiki/bin/view/CMS/GlideinWMSFrontendOpsGWMS#Writing_expressions_match_expr_s

Summarizing what's relevant for our use case:

  1. pilots are requested based on the MATCH expression
  2. jobs are matched to pilots with the START expression
  3. the start expression is also evaluated a second time when the job starts to run on the pilot (*) and, if it becomes false, it kicks the job out (which happened to CRAB jobs)
    • (*) From Diego Davila: "Using a more verbose debug setup at the startd, you can see that the START expression is evaluated twice, the second time JobStatus==2 in the jobClassAd, and jobStatus==1 in the first one"

There is a global match+start expression, and there are additional ones for each frontend group; these are ANDed together. But currently the groups' match/start expressions do not involve time, so we only worry about the global one: https://gitlab.cern.ch/CMSSI/cmsgwms-frontend-configurations/blob/cern/global/frontend.xml#L20

The relevant parts are, translated a bit into English and leaving out the pieces that convert all classAds to the same unit (seconds):

match_expr : MaxWallTimeMins +10 min < ( GLIDEIN_Max_Walltime-GLIDEIN_Retire_Time_Spread )
start_expr : MaxWallTimeMins < GLIDEIN_ToDie-MyCurrentTime

Where:

NAME                       | TYPICAL VALUE | MEANING
MaxWallTimeMins            | few hours     | what CRAB jobs request in their JDL
GLIDEIN_Max_Walltime       | 2~3 days      | max allowed time for the glidein, see http://glideinwms.fnal.gov/doc.prd/factory/custom_vars.html#lifetime
GLIDEIN_Retire_Time_Spread | 2 hours       | a random spread to smooth out glideins all ending simultaneously, see http://glideinwms.fnal.gov/doc.prd/factory/custom_vars.html#lifetime

The value of those GLIDEIN_* classAds is dynamically set in the factories and can be queried via commands like this (vocms0805 being the CERN factory):

    condor_status -pool vocms0805 -any -const 'MyType=="glidefactory" && regexp("CMSHTPC_T1_ES_PIC_ce07-multicore", Name)' -af GLIDEIN_Max_Walltime
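
A small Python sketch of the two checks above, under the assumption that MaxWallTimeMins is in minutes while the GLIDEIN_* values are in seconds (GLIDEIN_ToDie being a Unix timestamp), which is what the stripped-out unit conversions take care of:

    import time

    def pilot_match_ok(max_wall_time_mins, glidein_max_walltime_s, retire_time_spread_s):
        """match_expr: is this kind of pilot long-lived enough to be requested for the job?"""
        return (max_wall_time_mins + 10) * 60 < (glidein_max_walltime_s - retire_time_spread_s)

    def pilot_start_ok(max_wall_time_mins, glidein_to_die_unixtime):
        """start_expr: does the job still fit in the remaining lifetime of this pilot?"""
        return max_wall_time_mins * 60 < (glidein_to_die_unixtime - time.time())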

The above means that when we want to start a job in a slot which may expire before the job does, we cannot change MaxWallTimeMins once the job is running, or the job will be immediately killed. Hence we need to change the Periodic_Remove expression in the JDL so that it does not depend on MaxWallTimeMins.

The easiest way seems to be to keep MaxWallTimeMins as the indicator of the pilot slot that we want the job to fit in, but use a different classAd for the Periodic_Remove, allowing the job to run longer: either until the glidein dies (and the job is killed and automatically restarted) or until it hits the time limit defined for it in the new classAd. We can start by using the user-specified max time (or the default), keeping the spirit of the initial proposal. So we will edit https://github.com/dmwm/CRABServer/blob/master/src/python/TaskWorker/Actions/DagmanCreator.py to:

  1. define
    • MaxWallTimeMinsRun : the maximum time the job is allowed to run
    • MaxWallTimeMins : the time request used for matching (sticking to the name gWMS wants)
  2. use MaxWallTimeMinsRun in place of MaxWallTimeMins in periodic_remove and +PeriodicRemoveReason
  3. in the JobRouter, time tuning will edit MaxWallTimeMins instead of (or in addition to) defining the new EstimatedWallTimeMins (a sketch of the resulting JDL fragment follows below)
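
A minimal sketch of the kind of JDL fragment these changes imply (illustrative only; the real periodic_remove expression produced by DagmanCreator.py is more elaborate):

    def walltime_jdl_fragment(match_minutes, run_minutes):
        """Illustrative JDL lines for the two-classAd scheme: MaxWallTimeMins drives
        matching, MaxWallTimeMinsRun drives removal of over-running jobs."""
        return "\n".join([
            '+MaxWallTimeMins = %d' % match_minutes,
            '+MaxWallTimeMinsRun = %d' % run_minutes,
            # simplified removal policy: remove running jobs over MaxWallTimeMinsRun
            'periodic_remove = (JobStatus == 2) && '
            '((time() - EnteredCurrentStatus) > (MaxWallTimeMinsRun * 60))',
            '+PeriodicRemoveReason = ifThenElse('
            '(time() - EnteredCurrentStatus) > (MaxWallTimeMinsRun * 60), '
            '"exceeded MaxWallTimeMinsRun", "unknown")',
        ])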

Updates from May 2018

The solution to the problem above is to have two classAds:

  1. MaxWallTimeMins used by gWms for matching and starting (name set in gWms FrontEnd config)
  2. MaxWallTimeMinsRun used by the CRAB schedd's to set how long the job can run before PeriodicRemove kicks in

Note that since the introduction of Automatic Splitting we also have two more similar classAds, MaxWallTimeMinsProbe and MaxWallTimeMinsTail, which are used to set MaxWallTimeMins for probe and tail jobs. Time Tuning is not allowed (currently) on Automatic Splitting tasks, but generally speaking the bulk of jobs in those tasks may still benefit from being run in slots with less time to live than the (reasonably conservative) estimate obtained from the probe jobs.

now tracked as: https://github.com/dmwm/CRABServer/issues/5683 and deployed in pre-prod on June 4th, 2018

Now we need to get back to the questions above:

  1. how do we deal with jobs which run into the time limit?
  2. what happens to jobs killed by the pilot reaching its end of life before the payload does?

The answer to the second one is simple: HTCondor will restart those jobs automatically (this was tested). We are left with the real one:

how do we deal with jobs which run into the time limit?

  1. STEP 0: try to minimize them by making the time estimate large enough that on average at least 99% of the jobs will fit. Beware that the first jobs to complete may be the ones which fail for random reasons: start very conservatively in tuning. Refinements:

    • as done in Unified: do not time-tune tasks running >24h; there is little to gain, and it is likely risky
  2. STEP 1: become more aggressive in time tuning (slowly) and watch things. This requires a good monitoring setup. But we currently expect that the above will be sufficient to cut idle slots below the attention threshold.

  3. STEP 2: use maxwalltime more aggressively to detect doomed jobs early (malfunctioning hardware, data-reading problems) and restart them elsewhere; basically reset MaxWallTimeMinsRun from the very conservative default (or the value set by users) to something sensible. This is very likely not needed, but if we find that we cannot live with the associated inefficiency, here is a possible plan: look into a way to increase the match time for resubmissions. We need to verify which status sequence jobs go through, so that the JobRouter leaves them alone if we change MaxWallTimeMins in the classAd according to JobStarts. This requires some investigation, but it is possible that we can use MaxWallTimeMins=EstimatedWallTimeMins*(1+JobStarts) in the submit JDL and keep editing EstimatedWallTimeMins for Idle jobs only via the JobRouter. Also cap MaxWallTimeMins at 46h: MaxWallTimeMins=min(EstimatedWallTimeMins*(1+JobStarts),2760)
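
As a small illustration of the STEP 2 formula (2760 minutes being the 46h cap):

    def escalated_max_walltime_mins(estimated_wall_time_mins, job_starts, cap_mins=2760):
        """STEP 2 idea: grow the matching time request with each restart, capped at 46h."""
        return min(estimated_wall_time_mins * (1 + job_starts), cap_mins)

    # e.g. a job estimated at 20h (1200 min) that already started once would next
    # be matched asking for min(2400, 2760) = 2400 minutes
    print(escalated_max_walltime_mins(1200, 1))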