Add basic state recovery support #663

dongahn · 2020-06-05T08:00:55Z

This PR implements the basic state recovery semantics.

Recover the state of both the running job queue and the resource graph data store for those running jobs. Meet the state-recovery requirement for a key near-term use case in which both resource and qmanager modules are shut down and reloaded in a coordinated fashion. Assume two conditions for full recovery:

The RV1 JSON object for each and every currently running job includes the scheduling key, therefore the JGF section.
The resource module is reloaded before qmanager.

Lay some additional work to prepare for more comprehensive failure recovery in the future. If only the qmanager module is shut down and reloaded, still fully recover the state. If both qmanager and resource are reloaded, but the RV1 JSON of a running job does not include the scheduling key, qmanager will fail to load because it can't recover the resource state.

[Table 1] Current flux-sched state recovery semantics

resource	qmanager	JGF	no-JGF
reload	reload	yes	no
running	reload	yes	yes
reload	running	no	no.

Resolve #470

dongahn · 2020-06-05T08:04:34Z

I added the [WIP] tag as I still need to add the queue name of each job into RV1. The queue name in R will allow qmanager's hello callback to add the running job to the right queue.

dongahn · 2020-06-05T16:43:45Z

I just found a new sharpness script (./t1009-recovery-none.t) is hung. I have the following test copied from a another test and even w/ that, it appears the whole test is hung on exit.

# Reload the core scheduler so that rc3 won't hang waiting for
# queue to become idle after jobs are canceled.
test_expect_success 'load sched-simple module' '
    flux module load sched-simple
'

ok 6 - removing resource and qmanager modules
2020-06-05T16:39:51.357186Z sched-simple.err[0]: hello: rlist_remove (rank0/core[12-15]): No such file or directory
ok 7 - load sched-simple module
# passed all 7 test(s)
1..7
2020-06-05T16:39:51.473915Z broker.err[0]: rc3: flux-job: Canceled 1 jobs (0 errors)
2020-06-05T16:39:51.499430Z sched-simple.err[0]: free: rank0/core[12-15]: No such file or directory

@grondo or @garlick:
This is a case where one job is still running by the time sched-simple is being loaded. Do you have any clue why this is hung and is the err message is related to the hang at all?

dongahn · 2020-06-05T16:49:25Z

BTW,

# Reload the core scheduler so that rc3 won't hang waiting for
# queue to become idle after jobs are canceled.

I don't fully understand why sched-simple is required, but in production I will assume rc3's requirement can be satisfied by fluxion as well. IOW, sched-simple is there only for the purpose of testing.

dongahn · 2020-06-05T16:51:16Z

@grondo or @garlick:
This is a case where one job is still running by the time sched-simple is being loaded. Do you have any clue why this is hung and is the err message is related to the hang at all?

Another piece of info: I also didn't see this hang with my old PR (i.e., before this merging and new flux-core).

garlick · 2020-06-05T17:01:28Z

See @grondo's comment in 41e955d

Does a similar thing need to happen here? (Haven't looked yet in detail)

dongahn · 2020-06-05T17:05:19Z

See @grondo's comment in 41e955d

Does a similar thing need to happen here? (Haven't looked yet in detail)

Ah... Yes, this does seem like the cause. Let me test this out.

dongahn · 2020-06-05T17:13:16Z

41e955d

flux module unload sched-simple &&

I never used unload myself and it doesn't show up in the main help message:

Usage: flux-module COMMAND [OPTIONS]
  -h, --help             Display this message.

flux module subcommands:
   list            List loaded modules
   remove          Unload module
   load            Load module
   reload          Reload module
   info            Display module info
   stats           Display stats on module
   debug           Get/set module debug flags

Is there a difference between remove and unload or is this just an alias?

dongahn · 2020-06-05T17:19:10Z

OK. This fixes the hang.

 # Reload the core scheduler so that rc3 won't hang waiting for
 # queue to become idle after jobs are canceled.
 test_expect_success 'load sched-simple module' '
+    flux module reload -f resource &&
     flux module load sched-simple
 '

garlick · 2020-06-05T17:29:27Z

Unload is just an alias for remove.

codecov-commenter · 2020-06-05T17:50:32Z

Codecov Report

Merging #663 into master will decrease coverage by 0.07%.
The diff coverage is 71.67%.

@@            Coverage Diff             @@
##           master     #663      +/-   ##
==========================================
- Coverage   75.15%   75.07%   -0.08%     
==========================================
  Files          78       78              
  Lines        7894     8166     +272     
==========================================
+ Hits         5933     6131     +198     
- Misses       1961     2035      +74

Impacted Files	Coverage Δ
qmanager/policies/base/queue_policy_base.hpp	`100.00% <ø> (ø)`
qmanager/policies/queue_policy_bf_base.hpp	`100.00% <ø> (ø)`
qmanager/policies/queue_policy_fcfs.hpp	`100.00% <ø> (ø)`
resource/hlapi/bindings/c++/reapi_cli_impl.hpp	`0.00% <0.00%> (ø)`
qmanager/policies/queue_policy_bf_base_impl.hpp	`87.09% <33.33%> (-6.01%)`	⬇️
resource/traversers/dfu_impl_update.cpp	`78.62% <53.33%> (-1.46%)`	⬇️
resource/writers/match_writers.cpp	`62.00% <54.28%> (-0.58%)`	⬇️
qmanager/policies/base/queue_policy_base_impl.hpp	`75.00% <65.00%> (-1.79%)`	⬇️
qmanager/modules/qmanager_callbacks.cpp	`73.23% <75.00%> (-0.37%)`	⬇️
resource/hlapi/bindings/c++/reapi_module_impl.hpp	`53.68% <75.75%> (+9.93%)`	⬆️
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cd3c95c...8ef2d7a. Read the comment docs.

dongahn · 2020-06-05T17:52:37Z

Travis CI is now happy! Let me now try to quickly add queue name support within RV1: flux-framework/rfc#240.

dongahn · 2020-06-08T19:39:41Z

Just added multi queue support for state recovery. Once Travis turns green, I will take off WIP tag.

dongahn · 2020-06-08T20:59:08Z

Two Travis configurations failed:

./t1010-recovery-multiqueue.t: line 18: syntax error near unexpected token return' 1847./t1010-recovery-multiqueue.t: line 18: [[ $? -ne 0 ]] return $?'

Too much "bash"sm. Will fix.

dongahn · 2020-06-08T21:35:09Z

Travis CI is green now. I nominated a few people as reviewers :-). Thank you in advance.

garlick · 2020-06-08T21:51:39Z

But the resources that those running jobs are using can be scheduled to other jobs (i.e., overscheduling) because the resource module won't be able to reconstruct the state.

This would seem like a fundamental failure. Should state recovery by always enabled? What is the use case for disabling it?

garlick · 2020-06-08T21:55:36Z

This is a rather big PR. Would it make sense to pull out all the cleanup not related to state reconstruction to a separate PR?

dongahn · 2020-06-08T22:01:17Z

@garlick: thank you for the quick response.

This would seem like a fundamental failure. Should state recovery by always enabled?

To enable this always, we need to make the scheduling key in RV1 mandatory. This would be fine for the system instance, but this would add rather high overheads for nested instances in particular for high throughput loads.

For throughput-oriented scheduling specialization, users may trade an ability to recover the state on reload with performance for their nested instances.

As part of RV2, if the graph can be sufficiently reduced and compressed and thus won't be on the critical path, this can be turned on by default. But until then, I think we should still make this trade off possible.

In general, how important is it to recover the states for single user instance though?

dongahn · 2020-06-08T22:01:49Z

This is a rather big PR. Would it make sense to pull out all the cleanup not related to state reconstruction to a separate PR?

Sure. I can split this into two.

garlick · 2020-06-08T22:13:47Z

For throughput-oriented scheduling specialization, users may trade an ability to recover the state on reload with performance for their nested instances.

That makes sense to me, however

but the resources that those running jobs are using can be scheduled to other jobs (i.e., overscheduling) because the resource module won't be able to reconstruct the state.

should the scheduler continue in this case, or give up since allowing resources to be exclusively allocated to multiple jobs is a pretty bad situation?

dongahn · 2020-06-08T22:24:49Z

should the scheduler continue in this case, or give up since allowing resources to be exclusively allocated to multiple jobs is a pretty bad situation?

This is a good point. I actually thought about this and realized either has a pros and cons.

If the scheduler cannot be loaded, users cannot utilize idling resources until all running jobs complete.
If the scheduler is loaded, users can over-schedule jobs.

For high throughput loads, I slightly preferred overscheduling to underutilization, but perhaps this was an incorrect line of thoughts?

garlick · 2020-06-08T22:27:00Z

Personally, I think it's game over if the scheduler doesn't know what's allocated. Hopefully not a condition that would arise from normal use?

dongahn · 2020-06-08T22:31:17Z

Personally, I think it's game over if the scheduler doesn't know what's allocated. Hopefully not a condition that would arise from normal use?

I also don't have a strong preference one way or the other because like you said this wouldn't arise often for normal cases.

I can very easily change that semantics to discontinue. Maybe if such a case occurs, users can load sched-simple and continue as well.

I do hope we can advance RV2 and compression/condensing logic so that this becomes a moot point.

SteVwonder

Thanks @dongahn for putting this together. Generally LGTM! Only minor comments below, nothing major.

Thanks for spending the time to make such great commit messages and atomic commits. That effort made review this PR enjoyable and relatively easy :)

One test case that I don't think is covered, but I think would be good to get in the CI: jobs finishing/getting canceled while the qmanager is unloaded and then reloading it. I tested it manually, and it worked!! It does rely on the job-manager replaying the failed 'free' message which I am not sure if we test anywhere currently.

herbein1@quartz1916 ~/.../toss3/flux/663-state-reconstruction/bin
❯ flux mini submit -N1 -n1 sleep inf                                                                                                                                                                                                                                 18:38:59 ()
889125339136

herbein1@quartz1916 ~/.../toss3/flux/663-state-reconstruction/bin
❯ flux mini submit -N1 -n1 sleep inf                                                                                                                                                                                                                                 18:39:16 ()
913200644096

herbein1@quartz1916 ~/.../toss3/flux/663-state-reconstruction/bin
❯ flux jobs                                                                                                                                                                                                                                                          18:39:17 ()
             JOBID USER     NAME       STATUS NTASKS NNODES  RUNTIME RANKS
      913200644096 herbein1 sleep           R      1      1   2.116s 0
      889125339136 herbein1 sleep           R      1      1   3.547s 0

herbein1@quartz1916 ~/.../toss3/flux/663-state-reconstruction/bin
❯ flux module unload sched-fluxion-qmanager                                                                                                                                                                                                                          18:39:50 ()

herbein1@quartz1916 ~/.../toss3/flux/663-state-reconstruction/bin
❯ flux job cancel 889125339136                                                                                                                                                                                                                                       18:40:08 ()

❯ flux jobs                                                                                                                                                                                                                                                          18:40:10 ()
             JOBID USER     NAME       STATUS NTASKS NNODES  RUNTIME RANKS
      913200644096 herbein1 sleep           R      1      1   55.34s 0
      889125339136 herbein1 sleep           R      1      1   54.53s 0

herbein1@quartz1916 ~/.../toss3/flux/663-state-reconstruction/bin
❯ flux module load sched-fluxion-qmanager                                                                                                                                                                                                                            18:40:28 ()

herbein1@quartz1916 ~/.../toss3/flux/663-state-reconstruction/bin
❯ flux jobs                                                                                                                                                                                                                                                          18:40:35 ()
             JOBID USER     NAME       STATUS NTASKS NNODES  RUNTIME RANKS
      913200644096 herbein1 sleep           R      1      1   1.334m 0

herbein1@quartz1916 ~/.../toss3/flux/663-state-reconstruction/bin
❯ flux dmesg
<snip>
2020-06-10T01:39:57.913316Z broker.debug[0]: rmmod sched-fluxion-qmanager
2020-06-10T01:39:57.913752Z sched-fluxion-qmanager.debug[0]: service_unregister
2020-06-10T01:39:57.913967Z broker.debug[0]: module sched-fluxion-qmanager exited
2020-06-10T01:40:10.814045Z job-exec.debug[0]: exec aborted: id=889125339136
2020-06-10T01:40:10.814152Z job-exec.debug[0]: exec_kill: 889125339136: signal 15
2020-06-10T01:40:10.821070Z connector-local.debug[0]: unsubscribe shell-889125339136.
2020-06-10T01:40:10.837037Z job-manager.debug[0]: alloc: stop due to free response error: Function not implemented
2020-06-10T01:40:35.615413Z broker.debug[0]: insmod sched-fluxion-qmanager
2020-06-10T01:40:35.619013Z sched-fluxion-qmanager.debug[0]: enforced policy (queue=default): fcfs
2020-06-10T01:40:35.619046Z sched-fluxion-qmanager.debug[0]: effective queue params (queue=default): default
2020-06-10T01:40:35.619056Z sched-fluxion-qmanager.debug[0]: effective policy params (queue=default): default
2020-06-10T01:40:35.619395Z sched-fluxion-qmanager.debug[0]: service_register
2020-06-10T01:40:35.619626Z job-manager.debug[0]: scheduler: hello
2020-06-10T01:40:35.622273Z sched-fluxion-resource.debug[0]: update_request_cb: jobid (889125339136) with matching R exists
2020-06-10T01:40:35.622565Z sched-fluxion-qmanager.debug[0]: requeue success (queue=default id=889125339136)
2020-06-10T01:40:35.624226Z sched-fluxion-resource.debug[0]: update_request_cb: jobid (913200644096) with matching R exists
2020-06-10T01:40:35.624438Z sched-fluxion-qmanager.debug[0]: requeue success (queue=default id=913200644096)
2020-06-10T01:40:35.624650Z job-manager.debug[0]: scheduler: ready unlimited
2020-06-10T01:40:35.626107Z sched-fluxion-qmanager.debug[0]: free succeeded (queue=default id=889125339136)

resource/modules/resource_match.cpp

t/scripts/flux-ion-resource.py

SteVwonder · 2020-06-10T01:04:44Z

qmanager/modules/qmanager.cpp

+        if (json_typeof (v) == JSON_STRING)
+            value = value.substr (1, value.length () - 2);


I assume this is here to strip the quotes off the ends of the string? For some reason this feels like it could be fragile, I'm not sure though. Any opposition to just grabbing the string value via the jansson API?

Suggested change

if (json_typeof (v) == JSON_STRING)

value = value.substr (1, value.length () - 2);

if (json_typeof (v) == JSON_STRING)

value = json_string_value (v);

If you want to keep the substr, can you leave a short comment as to the purpose?

That was my initial code but I changed it so that one can add non string values within sched-fluxion-qmanager.toml like our dropped support for resource-recovery-on-load

resource-recovery-on-load = true

As opposed to

resource-recovery-on-load = "true"

can be supported.

garlick

Thanks @dongahn. I second @SteVwonder 's comment that the detailed commit messages are helpful!

My comments are fairly superficial although I did find a couple issues on error paths. At minimum unchecked error codes and memory leaks should be addressed.

Thanks!

garlick · 2020-06-10T18:23:13Z

resource/modules/resource_match.cpp

+    duration = static_cast<uint64_t> (et) - static_cast<uint64_t> (st);
+
+freemem_out:
+    json_decref (o);


might be good idea to save/restore errno around json_decref().

Ah... good call. This is an error path and we want to preserved errno!

Just out of curiosity: do you know if json_decref sets errno? Apparently, I am not a big fan of using a global variable to propagate errors, but we have no other alternatives... :-)

In flux-core, we've been operating under the defensive assumption that free(), and hence destructors that internally call free(), and don't preserve errno, could set errno.

https://stackoverflow.com/questions/30569981/does-free-set-errno/30571921#30571921

Great info. Thanks @garlick.

Just to make sure I am on the same page on the general errno propagation semantics. Looking at flux-core code, it seems like the semantics is best effort:

Capture errno on the first failure detected

Don't reset errno on the cleanup/exit paths of this failure where it makes sense

W/ respect to 2) from flux-core, I see two styles: A) only preserve errno possibly set by non-Flux function calls (e.g., jansson calls, free...); B) preserve the errno even for all calls including Flux API calls themselves.

I can see an argument for A vs. B., but it feels like having a general semantics written down for consistency could be beneficial. If you folks already discussed this, my apology. -- Seems like the semantics can go in a RFC -- I don't know if something like this can go into the coding style RFC though.

I'm not sure I understand your A vs B.

A function that purports to set errno on failure may set it to any value it likes. Usually you want that to reflect the root (first) cause of a problem but not always.

Jansson makes no claims about setting errno on failure. But it probably does sometimes given that it must make calls to malloc() for example.

The main thing is (a) if your contract with users is that errno will be set on error, you must make sure it is not set to zero on error, and (b) once you have errno set the way you want it you must avoid clobbering it since it is a stupid global variable.

Destructors are often called after errno is set, and destructors are usually void-returning functions that ignore errors. Thus if they call any functions that set errno and then ignore the failures of those functions, they may clobber your carefully set errno value.

(Sorry if I'm just stating obvious stuff because I didn't quite get your question!)

BTW I don't think it's a forgone conclusion that fluxion internals must use this same error handling style as flux-core. Obviously they have to deal with flux-core's error semantics but they could adopt something different for how those errors are propagated.

Destructors are often called after errno is set, and destructors are usually void-returning functions that ignore errors. Thus if they call any functions that set errno and then ignore the failures of those functions, they may clobber your carefully set errno value.

Yes, this was essentially my question -- I have seen a number of examples where a destructor in the Flux API set is called on error paths without no errno wrap. My guess the Flux destructors themselves are written to preserve errno within itself so this clobbering won't happen.

garlick · 2020-06-10T18:24:24Z

resource/modules/resource_match.cpp

+    rc = (json_equal (rlite1, rlite2) == 1)? 0 : 1;
+
+out:
+    json_decref (o1);


same comment here about saving errno across json_decref

garlick · 2020-06-10T18:46:37Z

resource/modules/resource_match.cpp

+    if ( (rc = gettimeofday (&start, NULL)) < 0) {
+        flux_log_error (ctx->h, "%s: gettimeofday", __FUNCTION__);
+        goto done;
+    }
+    if ( (rc = parse_R (ctx, R, jgf, at, duration)) < 0) {
+        flux_log_error (ctx->h, "%s: parsing R", __FUNCTION__);
+        goto done;
+    }
+    if ( (rc = run (ctx, jobid, jgf, at, duration)) < 0) {
+        flux_log_error (ctx->h, "%s: run", __FUNCTION__);
+        goto done;
+    }
+    if ( (rc = ctx->writers->emit (o)) < 0) {
+        flux_log_error (ctx->h, "%s: writers->emit", __FUNCTION__);
+        goto done;
+    }
+    if ( (rc = gettimeofday (&end, NULL)) < 0) {
+        flux_log_error (ctx->h, "%s: gettimeofday", __FUNCTION__);
+        goto done;
+    }
+    ov = get_elapse_time (start, end);


there is an assumption that each of these functions sets errno on failure, but I'm not completely sure if it's true?
I got lost digging into the traversers. Possibly someone with more knowledge of the code should walk through and make sure errno is set when it should be.

Yes, these should all set errno including gettimeofday().

I also have #618 to do a PR only on this topic to tighten up any loosen end. Finally, merging @trws's exception PR should make sched-fluxion-resource more robust when a C++ exception arises from within a request handlers.

garlick · 2020-06-10T19:12:22Z

resource/hlapi/bindings/c++/reapi_module_impl.hpp

+                                           "overhead", &overhead,
+                                           "R", &rset,
+                                           "at", &scheduled_at)) < 0)
+        goto out;


error path leaks future

Doh! Good catch.

garlick · 2020-06-10T19:13:43Z

resource/hlapi/bindings/c/reapi_cli.cpp

+                                             jobid, R, *at, *ov, R_buf)) < 0) {
+        goto out;
+    }
+    (*R_out) = strdup (R_buf.c_str ());


check for strdup failure

garlick · 2020-06-10T19:14:08Z

resource/hlapi/bindings/c/reapi_module.cpp

+                                                *at, *ov, R_buf)) < 0) {
+        goto out;
+    }
+    (*R_out) = strdup (R_buf.c_str ());


check for strdup failure

garlick · 2020-06-10T19:20:23Z

resource/writers/match_writers.cpp

+            goto ret;
+        }
+        if ((rc = json_object_set_new (attrs, kv.first.c_str (), stro)) == -1) {
+            json_decref (attrs);


is stro leaked here?

I thought we determined free memory in this case is unsafe.

flux-framework/flux-core#2803 (comment)

If we have other places where stro equivalents are freed, we need to change it; double free can lead to a broker crash.

This may have changed in a more recent jansson library. @chu11?

Oof! Thanks for the reminder! I'd forgotten this affects functions other than json_pack(). It's unpleasant to have to choose between a double free and memory leak.

One way around it is to avoid json_pack() with o or any of the *_new() "object stealing" interfaces in jansson. It means a bit of extra reference count management in the non-error case.

garlick · 2020-06-10T19:21:59Z

resource/writers/match_writers.cpp

+            json_decref (o);
+            goto ret;
+        }
+        json_object_set_new (o, "attributes", attrs_o);


check return value

garlick · 2020-06-10T19:27:26Z

t/t1007-recovery-full.t

+test_expect_success 'recovery: loading flux-sched modules works (rv1)' '
+    flux module remove sched-simple &&
+    flux module reload -f resource &&
+    load_resource load-whitelist=node,core,gpu match-format=rv1 &&


https://9to5google.com/2020/06/07/google-chrome-blacklist-blocklist-more-inclusive/

Will do this as part of a future PR.

dongahn · 2020-06-11T16:55:20Z

@garlick and @SteVwonder: I added fixup commits and I believe I addressed the issues captured in your review comments. Let me know if there is anything else, you want me to do before this is merged.

SteVwonder

Thanks @dongahn! Changes LGTM!

garlick

LGTM

SteVwonder · 2020-06-11T19:25:49Z

@dongahn: feel free to squash the commit and set MWP whenever you are ready

parse_R: parse R(V1) C-string and return the JGF section in std::string along with starttime/expiration. R_equal: check the equality of two R(V1)s and return 0 if equal. Note that casting st into starttime of int64_t type later is safe because there is a guarantee that a starting scheduled point will not exceed MAX_INT64.

Problem: to allow qmanager to reconstruct the resource states of running jobs on reload in the future, the resource module needs a way to update its resource graph using the R(V1) of each running job. Introduce a new RPC (resource.update) that can update the resource graph using the R(V1) string sent to it. Call the dfu_traverser_t class' overloaded run() method to perform this update. To satisfy it's inputs, parse out the JGF section as well as the starttime and expiration keys from the R. This is different from the existing resource.match RPC that finds the best matching resources according to the jobspec and update the resource graph with those matching resources. Return the previously recorded R data if the requested jobid already exists and the requested R is equal to the recorded R of that jobid. This will be the case if only qmanager was reloaded, not resource. Return a corresponding errno if the requested jobid already exists and the requested R is not equal to the recorded R or any other error condition arises.

Add update_allocate interface into the high level API of our resource infrastructure.

Add error checking on strdup within the existing match_allocate functions in the high level. The omission was found during the code review for the newly added update_allocate functions.

Problem: The semantics of reconstruct within the queue policy implementation class is to reconstruct only the running job queue. The method name, reconstruct(), hence is misleading as it may suggest the method would reconstruct both the queue *and* resource state. Change the name to be "reconstruct_queue" to clarify this. Add more error checking code as part of this as well.

Problem: we have an API to reconstruct the state of running-job queue state for a job, but we don’t have an interface to reconstruct the state of resource data store for the job. Add reconstruct_resource as a pure virtual method to the base queue policy class (queue_policy_base_t). Because the base queue policy classes (queue_policy_base_t and queue_policy_base_impl_t) must manage various job queues while being agnostic to the resources of jobs, this pure virtual method must be implemented by the derived classes, the queue policy layer that is aware of the resources. This virtual interface allows queue_policy_base_t to make a up-call to the methods overridden by all of the derived classes. It then makes this scheme future proof with respect to future queueing-policy implementation classes.

Introduce the update command into flux-resource, which is required for future tests. Usage: flux-resource.py update [-h] RV1 Jobid Update the resource database positional arguments: RV1 RV1 file name Jobid Jobid optional arguments: -h, --help show this help message and exit

Add an option to tolerate a nonexistent jobid for cancel within resource high level API (hlapi). Update how the queue policy layer uses this API call. Need this in the future to implement the resource recovery semantics for flux-sched module reloads.

Track RFC20 to support the new attributes key in RV1. If one or more attributes are present, emit them under "attributes.system.scheduler".

Add support for queue name attribute emission. If a queue name is specified in the jobspec, emit it as an attribute.

Implement the basic state recovery semantics using the recent changes made to the resource module, resource high-level API, queue policy layer. Recover the state of both the running job queue and the resource graph data store for running jobs. Meet the state-recovery requirements for a key near-term use case in which both resource and qmanager modules are shut down and reloaded in a coordinated fashion. Assume two conditions for full recovery: 1. The RV1 JSON object for each and every currently running job includes the "scheduling" key, therefore the JGF section. 2. The resource module is reloaded before qmanager. Lay some foundational work to prepare for more comprehensive failure recovery in the future. If only the qmanager module is shut down and reloaded, we still fully recover the state. If both qmanager and resource are reloaded, but the RV1 JSON of a running job does not include the "scheduling" key, qmanager will fail to load because it can't recover the resource state. To summarize the current state recovery semantics: fluxion modules | RV1 of running jobs resource qmanager | JGF no-JGF -------------------------------------------------- reload reload | yes no running reload | yes yes reload running | no no

Use flux-resource's update sub-command to test the basic functionalities of resource.update, a newly updated RPC within resource. Generate 4 allocations in the RV1 format using flux resource match allocate with a unit jobspec that requests 1 node with 1 core. Reload the resource module and use flux resource update with these generated allocations in RV1 to reconstruct the resource graph store state. Also add a few corner case tests: 1. Updating with an existing jobid with an identical RV1 string succeeds. 2. Updating with an existing jobid with a different RV1 string must fail. 3. Updating with an existing RV1 string under a new jobid must fail.

Rename t4009-update-basic.t and t4010-update-multi.t to better match with the grouping structure of our test cases.

Add cases to test flux-sched modules can fully recover the state of running jobs on reload for the cases where they must: 1. The format of running jobs at recovery point is rv1 and we restart both resource and qmanager modules. 2. We restart only qmanager, not the resource module. Note that we also added test coverage for a condition where a running job completes after fluxion modules are unloaded, but before they are reloaded. In test 5, job1 is cancelled after both of the fluxion modules are unloaded. Then, the following test is augmented to verify that job1 is not enqueued on reload.

Add cases to test flux-sched modules doesn't recover the state of running jobs on reload for the cases where they currently can't. The format of running jobs is rv1_nosched, and both resource and qmanager restart.

Add some cases to test flux-sched modules can fully recover the state of running jobs on reload when sched-fluxion-qmanager is configured to have multiple queues.

dongahn · 2020-06-11T21:39:36Z

Thanks!

dongahn force-pushed the reconstruct2 branch from 18ef0c8 to c138cf7 Compare June 5, 2020 17:38

dongahn mentioned this pull request Jun 5, 2020

RFC20: Need queue key under scheduling. flux-framework/rfc#240

Open

dongahn force-pushed the reconstruct2 branch from c138cf7 to f76ef45 Compare June 8, 2020 19:38

dongahn force-pushed the reconstruct2 branch from f76ef45 to fbbb1ee Compare June 8, 2020 21:02

dongahn requested review from garlick, SteVwonder and milroy June 8, 2020 21:33

dongahn changed the title ~~[WIP] Add basic state recovery support~~ Add basic state recovery support Jun 8, 2020

dongahn mentioned this pull request Jun 9, 2020

Sync between sched-fluxion-qmanager and -resource module #668

Closed

SteVwonder reviewed Jun 10, 2020

View reviewed changes

garlick requested changes Jun 10, 2020

View reviewed changes

dongahn mentioned this pull request Jun 11, 2020

Use allowlist instead of whitelist #669

Closed

SteVwonder approved these changes Jun 11, 2020

View reviewed changes

garlick approved these changes Jun 11, 2020

View reviewed changes

dongahn added 18 commits June 11, 2020 14:12

resource: update high level API for update_allocate

03a4d21

Add update_allocate interface into the high level API of our resource infrastructure.

hlapi: add error checks to strdup

08ed4d3

Add error checking on strdup within the existing match_allocate functions in the high level. The omission was found during the code review for the newly added update_allocate functions.

writer: add attributes support

742a1f9

Track RFC20 to support the new attributes key in RV1. If one or more attributes are present, emit them under "attributes.system.scheduler".

resource: emit queue name attributes

e0986e7

Add support for queue name attribute emission. If a queue name is specified in the jobspec, emit it as an attribute.

testsuite: rename resource-query-based update tests

854cc18

Rename t4009-update-basic.t and t4010-update-multi.t to better match with the grouping structure of our test cases.

sharness: add reload fluxion module wrappers

b5685d8

testsuite: add coverage for no recovery case

7fc5fcd

Add cases to test flux-sched modules doesn't recover the state of running jobs on reload for the cases where they currently can't. The format of running jobs is rv1_nosched, and both resource and qmanager restart.

testsuite: add coverage for recovery on multiple queues

7fa735f

Add some cases to test flux-sched modules can fully recover the state of running jobs on reload when sched-fluxion-qmanager is configured to have multiple queues.

build: integrate recovery tests into Make

8ef2d7a

dongahn force-pushed the reconstruct2 branch from 5e7dffe to 8ef2d7a Compare June 11, 2020 21:19

dongahn added the merge-when-passing mergify.io - merge PR automatically once CI passes label Jun 11, 2020

mergify bot merged commit 577d821 into flux-framework:master Jun 11, 2020

		if (json_typeof (v) == JSON_STRING)
		value = value.substr (1, value.length () - 2);

Add basic state recovery support #663

Add basic state recovery support #663

Conversation

dongahn commented Jun 5, 2020 • edited Loading

dongahn commented Jun 5, 2020

dongahn commented Jun 5, 2020

dongahn commented Jun 5, 2020 • edited Loading

dongahn commented Jun 5, 2020

garlick commented Jun 5, 2020

dongahn commented Jun 5, 2020

dongahn commented Jun 5, 2020

dongahn commented Jun 5, 2020

garlick commented Jun 5, 2020

codecov-commenter commented Jun 5, 2020 • edited Loading

Codecov Report

dongahn commented Jun 5, 2020 • edited Loading

dongahn commented Jun 8, 2020

dongahn commented Jun 8, 2020

dongahn commented Jun 8, 2020

garlick commented Jun 8, 2020

garlick commented Jun 8, 2020

dongahn commented Jun 8, 2020

dongahn commented Jun 8, 2020

garlick commented Jun 8, 2020

dongahn commented Jun 8, 2020

garlick commented Jun 8, 2020

dongahn commented Jun 8, 2020

SteVwonder left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

garlick left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dongahn commented Jun 11, 2020

SteVwonder left a comment

Choose a reason for hiding this comment

garlick left a comment

Choose a reason for hiding this comment

SteVwonder commented Jun 11, 2020

dongahn commented Jun 11, 2020

dongahn commented Jun 5, 2020 •

edited

Loading

dongahn commented Jun 5, 2020 •

edited

Loading

codecov-commenter commented Jun 5, 2020 •

edited

Loading

dongahn commented Jun 5, 2020 •

edited

Loading