
Add basic state recovery support #663

Merged: 20 commits into flux-framework:master on Jun 11, 2020

Conversation


@dongahn dongahn commented Jun 5, 2020

This PR implements the basic state recovery semantics.

Recover the state of both the running job queue and the resource graph data store for those running jobs. Meet the state-recovery requirement for a key near-term use case in which both resource and qmanager modules are shut down and reloaded in a coordinated fashion. Assume two conditions for full recovery:

  • The RV1 JSON object for each and every currently running job includes the scheduling key, and therefore the JGF section.
  • The resource module is reloaded before qmanager.

Lay some additional groundwork to prepare for more comprehensive failure recovery in the future. If only the qmanager module is shut down and reloaded, the state is still fully recovered. If both qmanager and resource are reloaded but the RV1 JSON of a running job does not include the scheduling key, qmanager will fail to load because it cannot recover the resource state.
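
For reference, a minimal sketch of the RV1 shape that full recovery relies on (built with jansson here for illustration): the optional scheduling key carries the JGF graph, and starttime/expiration are the execution keys this PR parses out. The exact key layout follows RFC 20 from memory rather than this PR, so treat it as an approximation.

    #include <jansson.h>

    /* Skeleton RV1 object: "scheduling" holds the JGF graph (nodes/edges
     * elided here).  Full recovery requires this key for every running job.
     */
    static json_t *rv1_skeleton (void)
    {
        return json_pack ("{s:i s:{s:[] s:I s:I} s:{s:{s:[] s:[]}}}",
                          "version", 1,
                          "execution",
                              "R_lite",
                              "starttime", (json_int_t)0,
                              "expiration", (json_int_t)0,
                          "scheduling",
                              "graph",
                                  "nodes",
                                  "edges");
    }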

[Table 1] Current flux-sched state recovery semantics

  fluxion modules  |   RV1 of running jobs
resource  qmanager | JGF   no-JGF
-----------------------------------------
reload    reload   | yes   no
running   reload   | yes   yes
reload    running  | no    no

Resolve #470


dongahn commented Jun 5, 2020

I added the [WIP] tag as I still need to add the queue name of each job into RV1. The queue name in R will allow qmanager's hello callback to add the running job to the right queue.


dongahn commented Jun 5, 2020

I just found that a new sharness script (./t1009-recovery-none.t) hangs. I have the following test copied from another test, and even with that, the whole test appears to hang on exit.

# Reload the core scheduler so that rc3 won't hang waiting for
# queue to become idle after jobs are canceled.
test_expect_success 'load sched-simple module' '
    flux module load sched-simple
'
ok 6 - removing resource and qmanager modules
2020-06-05T16:39:51.357186Z sched-simple.err[0]: hello: rlist_remove (rank0/core[12-15]): No such file or directory
ok 7 - load sched-simple module
# passed all 7 test(s)
1..7
2020-06-05T16:39:51.473915Z broker.err[0]: rc3: flux-job: Canceled 1 jobs (0 errors)
2020-06-05T16:39:51.499430Z sched-simple.err[0]: free: rank0/core[12-15]: No such file or directory

@grondo or @garlick:
This is a case where one job is still running by the time sched-simple is being loaded. Do you have any clue why this hangs, and is the error message related to the hang at all?


dongahn commented Jun 5, 2020

BTW,

# Reload the core scheduler so that rc3 won't hang waiting for
# queue to become idle after jobs are canceled.

I don't fully understand why sched-simple is required, but in production I will assume rc3's requirement can be satisfied by fluxion as well. In other words, sched-simple is there only for testing purposes.


dongahn commented Jun 5, 2020

@grondo or @garlick:
This is a case where one job is still running by the time sched-simple is being loaded. Do you have any clue why this hangs, and is the error message related to the hang at all?

Another piece of info: I also didn't see this hang with my old PR (i.e., before this merge and the new flux-core).


garlick commented Jun 5, 2020

See @grondo's comment in 41e955d

Does a similar thing need to happen here? (Haven't looked yet in detail)


dongahn commented Jun 5, 2020

See @grondo's comment in 41e955d

Does a similar thing need to happen here? (Haven't looked yet in detail)

Ah... Yes, this does seem like the cause. Let me test this out.


dongahn commented Jun 5, 2020

41e955d

flux module unload sched-simple &&

I have never used unload myself, and it doesn't show up in the main help message:

Usage: flux-module COMMAND [OPTIONS]
  -h, --help             Display this message.

flux module subcommands:
   list            List loaded modules
   remove          Unload module
   load            Load module
   reload          Reload module
   info            Display module info
   stats           Display stats on module
   debug           Get/set module debug flags

Is there a difference between remove and unload or is this just an alias?


dongahn commented Jun 5, 2020

OK. This fixes the hang.

 # Reload the core scheduler so that rc3 won't hang waiting for
 # queue to become idle after jobs are canceled.
 test_expect_success 'load sched-simple module' '
+    flux module reload -f resource &&
     flux module load sched-simple
 '


garlick commented Jun 5, 2020

Unload is just an alias for remove.


codecov-commenter commented Jun 5, 2020

Codecov Report

Merging #663 into master will decrease coverage by 0.07%.
The diff coverage is 71.67%.


@@            Coverage Diff             @@
##           master     #663      +/-   ##
==========================================
- Coverage   75.15%   75.07%   -0.08%     
==========================================
  Files          78       78              
  Lines        7894     8166     +272     
==========================================
+ Hits         5933     6131     +198     
- Misses       1961     2035      +74     
Impacted Files Coverage Δ
qmanager/policies/base/queue_policy_base.hpp 100.00% <ø> (ø)
qmanager/policies/queue_policy_bf_base.hpp 100.00% <ø> (ø)
qmanager/policies/queue_policy_fcfs.hpp 100.00% <ø> (ø)
resource/hlapi/bindings/c++/reapi_cli_impl.hpp 0.00% <0.00%> (ø)
qmanager/policies/queue_policy_bf_base_impl.hpp 87.09% <33.33%> (-6.01%) ⬇️
resource/traversers/dfu_impl_update.cpp 78.62% <53.33%> (-1.46%) ⬇️
resource/writers/match_writers.cpp 62.00% <54.28%> (-0.58%) ⬇️
qmanager/policies/base/queue_policy_base_impl.hpp 75.00% <65.00%> (-1.79%) ⬇️
qmanager/modules/qmanager_callbacks.cpp 73.23% <75.00%> (-0.37%) ⬇️
resource/hlapi/bindings/c++/reapi_module_impl.hpp 53.68% <75.75%> (+9.93%) ⬆️
... and 8 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update cd3c95c...8ef2d7a


dongahn commented Jun 5, 2020

Travis CI is now happy! Let me now try to quickly add queue name support within RV1: flux-framework/rfc#240.


dongahn commented Jun 8, 2020

Just added multi-queue support for state recovery. Once Travis turns green, I will take the WIP tag off.


dongahn commented Jun 8, 2020

Two Travis configurations failed:

./t1010-recovery-multiqueue.t: line 18: syntax error near unexpected token `return'
./t1010-recovery-multiqueue.t: line 18: `[[ $? -ne 0 ]] return $?'

Too much "bash"sm. Will fix.


dongahn commented Jun 8, 2020

Travis CI is green now. I nominated a few people as reviewers :-). Thank you in advance.

@dongahn dongahn changed the title from "[WIP] Add basic state recovery support" to "Add basic state recovery support" on Jun 8, 2020

garlick commented Jun 8, 2020

But the resources that those running jobs are using can be scheduled to other jobs (i.e., overscheduling) because the resource module won't be able to reconstruct the state.

This would seem like a fundamental failure. Should state recovery always be enabled? What is the use case for disabling it?


garlick commented Jun 8, 2020

This is a rather big PR. Would it make sense to pull out all the cleanup not related to state reconstruction to a separate PR?


dongahn commented Jun 8, 2020

@garlick: thank you for the quick response.

This would seem like a fundamental failure. Should state recovery by always enabled?

To enable this always, we would need to make the scheduling key in RV1 mandatory. This would be fine for the system instance, but it would add rather high overhead for nested instances, in particular for high-throughput workloads.

For throughput-oriented scheduling specialization, users may trade the ability to recover state on reload for performance in their nested instances.

As part of RV2, if the graph can be sufficiently reduced and compressed and thus won't be on the critical path, this can be turned on by default. But until then, I think we should still make this trade-off possible.

In general, though, how important is it to recover the state for a single-user instance?


dongahn commented Jun 8, 2020

This is a rather big PR. Would it make sense to pull out all the cleanup not related to state reconstruction to a separate PR?

Sure. I can split this into two.


garlick commented Jun 8, 2020

For throughput-oriented scheduling specialization, users may trade the ability to recover state on reload for performance in their nested instances.

That makes sense to me, however

but the resources that those running jobs are using can be scheduled to other jobs (i.e., overscheduling) because the resource module won't be able to reconstruct the state.

should the scheduler continue in this case, or give up since allowing resources to be exclusively allocated to multiple jobs is a pretty bad situation?


dongahn commented Jun 8, 2020

should the scheduler continue in this case, or give up since allowing resources to be exclusively allocated to multiple jobs is a pretty bad situation?

This is a good point. I actually thought about this and realized either approach has pros and cons.

  • If the scheduler cannot be loaded, users cannot utilize idling resources until all running jobs complete.

  • If the scheduler is loaded, users can over-schedule jobs.

For high-throughput workloads, I slightly preferred overscheduling to underutilization, but perhaps this was an incorrect line of thought?


garlick commented Jun 8, 2020

Personally, I think it's game over if the scheduler doesn't know what's allocated. Hopefully not a condition that would arise from normal use?


dongahn commented Jun 8, 2020

Personally, I think it's game over if the scheduler doesn't know what's allocated. Hopefully not a condition that would arise from normal use?

I also don't have a strong preference one way or the other because like you said this wouldn't arise often for normal cases.

I can very easily change the semantics to discontinue instead. Maybe if such a case occurs, users can load sched-simple and continue as well.

I do hope we can advance RV2 and compression/condensing logic so that this becomes a moot point.

@SteVwonder SteVwonder left a comment

Thanks @dongahn for putting this together. Generally LGTM! Only minor comments below, nothing major.

Thanks for spending the time to make such great commit messages and atomic commits. That effort made reviewing this PR enjoyable and relatively easy :)

One test case that I don't think is covered, but would be good to get into CI: jobs finishing/getting canceled while the qmanager is unloaded, and then reloading it. I tested it manually, and it worked!! It does rely on the job-manager replaying the failed 'free' message, which I am not sure we test anywhere currently.

herbein1@quartz1916 ~/.../toss3/flux/663-state-reconstruction/bin
❯ flux mini submit -N1 -n1 sleep inf                                                                                                                                                                                                                                 18:38:59 ()
889125339136

herbein1@quartz1916 ~/.../toss3/flux/663-state-reconstruction/bin
❯ flux mini submit -N1 -n1 sleep inf                                                                                                                                                                                                                                 18:39:16 ()
913200644096

herbein1@quartz1916 ~/.../toss3/flux/663-state-reconstruction/bin
❯ flux jobs                                                                                                                                                                                                                                                          18:39:17 ()
             JOBID USER     NAME       STATUS NTASKS NNODES  RUNTIME RANKS
      913200644096 herbein1 sleep           R      1      1   2.116s 0
      889125339136 herbein1 sleep           R      1      1   3.547s 0

herbein1@quartz1916 ~/.../toss3/flux/663-state-reconstruction/bin
❯ flux module unload sched-fluxion-qmanager                                                                                                                                                                                                                          18:39:50 ()

herbein1@quartz1916 ~/.../toss3/flux/663-state-reconstruction/bin
❯ flux job cancel 889125339136                                                                                                                                                                                                                                       18:40:08 ()

❯ flux jobs                                                                                                                                                                                                                                                          18:40:10 ()
             JOBID USER     NAME       STATUS NTASKS NNODES  RUNTIME RANKS
      913200644096 herbein1 sleep           R      1      1   55.34s 0
      889125339136 herbein1 sleep           R      1      1   54.53s 0

herbein1@quartz1916 ~/.../toss3/flux/663-state-reconstruction/bin
❯ flux module load sched-fluxion-qmanager                                                                                                                                                                                                                            18:40:28 ()

herbein1@quartz1916 ~/.../toss3/flux/663-state-reconstruction/bin
❯ flux jobs                                                                                                                                                                                                                                                          18:40:35 ()
             JOBID USER     NAME       STATUS NTASKS NNODES  RUNTIME RANKS
      913200644096 herbein1 sleep           R      1      1   1.334m 0

herbein1@quartz1916 ~/.../toss3/flux/663-state-reconstruction/bin
❯ flux dmesg
<snip>
2020-06-10T01:39:57.913316Z broker.debug[0]: rmmod sched-fluxion-qmanager
2020-06-10T01:39:57.913752Z sched-fluxion-qmanager.debug[0]: service_unregister
2020-06-10T01:39:57.913967Z broker.debug[0]: module sched-fluxion-qmanager exited
2020-06-10T01:40:10.814045Z job-exec.debug[0]: exec aborted: id=889125339136
2020-06-10T01:40:10.814152Z job-exec.debug[0]: exec_kill: 889125339136: signal 15
2020-06-10T01:40:10.821070Z connector-local.debug[0]: unsubscribe shell-889125339136.
2020-06-10T01:40:10.837037Z job-manager.debug[0]: alloc: stop due to free response error: Function not implemented
2020-06-10T01:40:35.615413Z broker.debug[0]: insmod sched-fluxion-qmanager
2020-06-10T01:40:35.619013Z sched-fluxion-qmanager.debug[0]: enforced policy (queue=default): fcfs
2020-06-10T01:40:35.619046Z sched-fluxion-qmanager.debug[0]: effective queue params (queue=default): default
2020-06-10T01:40:35.619056Z sched-fluxion-qmanager.debug[0]: effective policy params (queue=default): default
2020-06-10T01:40:35.619395Z sched-fluxion-qmanager.debug[0]: service_register
2020-06-10T01:40:35.619626Z job-manager.debug[0]: scheduler: hello
2020-06-10T01:40:35.622273Z sched-fluxion-resource.debug[0]: update_request_cb: jobid (889125339136) with matching R exists
2020-06-10T01:40:35.622565Z sched-fluxion-qmanager.debug[0]: requeue success (queue=default id=889125339136)
2020-06-10T01:40:35.624226Z sched-fluxion-resource.debug[0]: update_request_cb: jobid (913200644096) with matching R exists
2020-06-10T01:40:35.624438Z sched-fluxion-qmanager.debug[0]: requeue success (queue=default id=913200644096)
2020-06-10T01:40:35.624650Z job-manager.debug[0]: scheduler: ready unlimited
2020-06-10T01:40:35.626107Z sched-fluxion-qmanager.debug[0]: free succeeded (queue=default id=889125339136)

Resolved review threads (outdated): resource/modules/resource_match.cpp (3 threads), t/scripts/flux-ion-resource.py (1 thread)
Comment on lines +106 to +107
if (json_typeof (v) == JSON_STRING)
value = value.substr (1, value.length () - 2);
Member

I assume this is here to strip the quotes off the ends of the string? For some reason this feels like it could be fragile, I'm not sure though. Any opposition to just grabbing the string value via the jansson API?

Suggested change:
-if (json_typeof (v) == JSON_STRING)
-    value = value.substr (1, value.length () - 2);
+if (json_typeof (v) == JSON_STRING)
+    value = json_string_value (v);

If you want to keep the substr, can you leave a short comment as to the purpose?

Member Author

That was my initial code, but I changed it so that one can set non-string values within sched-fluxion-qmanager.toml. For example, with our (since dropped) resource-recovery-on-load support,

resource-recovery-on-load = true

as opposed to

resource-recovery-on-load = "true"

can be supported.
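
To illustrate the quoting behavior behind that substr, here is a small standalone sketch. It assumes the value was serialized generically (e.g. with json_dumps() and JSON_ENCODE_ANY), which is an inference about the surrounding code rather than a quote of it:

    #include <cstdio>
    #include <cstdlib>
    #include <jansson.h>

    int main (void)
    {
        json_t *str = json_string ("true");
        json_t *boolean = json_true ();
        char *s = json_dumps (str, JSON_ENCODE_ANY);      /* "\"true\"" with quotes */
        char *b = json_dumps (boolean, JSON_ENCODE_ANY);  /* bare "true", no quotes */
        printf ("%s vs %s\n", s, b);
        free (s);
        free (b);
        json_decref (str);
        json_decref (boolean);
        return 0;
    }

So stripping the first and last character only for JSON_STRING values keeps both the quoted and unquoted TOML spellings usable.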

@garlick garlick left a comment

Thanks @dongahn. I second @SteVwonder 's comment that the detailed commit messages are helpful!

My comments are fairly superficial, although I did find a couple of issues on error paths. At minimum, unchecked error codes and memory leaks should be addressed.

Thanks!

duration = static_cast<uint64_t> (et) - static_cast<uint64_t> (st);

freemem_out:
json_decref (o);
Member

might be a good idea to save/restore errno around json_decref().

Member Author

Ah... good call. This is an error path and we want to preserve errno!

Just out of curiosity: do you know if json_decref sets errno? Admittedly, I am not a big fan of using a global variable to propagate errors, but we have no other alternative... :-)

Member

In flux-core, we've been operating under the defensive assumption that free(), and hence destructors that internally call free() and don't preserve errno, could set errno.

https://stackoverflow.com/questions/30569981/does-free-set-errno/30571921#30571921

Member Author

Great info. Thanks @garlick.

Just to make sure I am on the same page about the general errno propagation semantics: looking at the flux-core code, the semantics seem to be best effort:

  1. Capture errno on the first failure detected
  2. Don't reset errno on the cleanup/exit paths of this failure where it makes sense

With respect to 2), in flux-core I see two styles: A) only preserve errno around non-Flux function calls (e.g., jansson calls, free...); B) preserve errno around all calls, including Flux API calls themselves.

I can see arguments for both A and B, but it feels like having the general semantics written down for consistency could be beneficial. If you folks already discussed this, my apologies. It seems like these semantics could go into an RFC, though I don't know whether something like this fits in the coding style RFC.

Member

I'm not sure I understand your A vs B.

A function that purports to set errno on failure may set it to any value it likes. Usually you want that to reflect the root (first) cause of a problem but not always.

Jansson makes no claims about setting errno on failure. But it probably does sometimes given that it must make calls to malloc() for example.

The main thing is (a) if your contract with users is that errno will be set on error, you must make sure it is not set to zero on error, and (b) once you have errno set the way you want it you must avoid clobbering it since it is a stupid global variable.

Destructors are often called after errno is set, and destructors are usually void-returning functions that ignore errors. Thus if they call any functions that set errno and then ignore the failures of those functions, they may clobber your carefully set errno value.

(Sorry if I'm just stating obvious stuff because I didn't quite get your question!)

BTW I don't think it's a foregone conclusion that fluxion internals must use the same error handling style as flux-core. Obviously they have to deal with flux-core's error semantics, but they could adopt something different for how those errors are propagated.

Member Author

Destructors are often called after errno is set, and destructors are usually void-returning functions that ignore errors. Thus if they call any functions that set errno and then ignore the failures of those functions, they may clobber your carefully set errno value.

Yes, this was essentially my question -- I have seen a number of examples where a destructor in the Flux API set is called on error paths without an errno wrap. My guess is that the Flux destructors themselves are written to preserve errno internally, so this clobbering won't happen.

Member

Correct!
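
A minimal sketch of the save/restore idiom discussed in this thread, assuming the defensive flux-core convention that free() (and hence destructors that call it) may clobber errno:

    #include <cerrno>
    #include <jansson.h>

    static void decref_preserving_errno (json_t *o)
    {
        int saved_errno = errno;   /* capture errno from the original failure */
        json_decref (o);           /* may call free(), which could set errno */
        errno = saved_errno;       /* restore it for the caller's error report */
    }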

rc = (json_equal (rlite1, rlite2) == 1)? 0 : 1;

out:
json_decref (o1);
Member

same comment here about saving errno across json_decref

Comment on lines +868 to +894
if ( (rc = gettimeofday (&start, NULL)) < 0) {
flux_log_error (ctx->h, "%s: gettimeofday", __FUNCTION__);
goto done;
}
if ( (rc = parse_R (ctx, R, jgf, at, duration)) < 0) {
flux_log_error (ctx->h, "%s: parsing R", __FUNCTION__);
goto done;
}
if ( (rc = run (ctx, jobid, jgf, at, duration)) < 0) {
flux_log_error (ctx->h, "%s: run", __FUNCTION__);
goto done;
}
if ( (rc = ctx->writers->emit (o)) < 0) {
flux_log_error (ctx->h, "%s: writers->emit", __FUNCTION__);
goto done;
}
if ( (rc = gettimeofday (&end, NULL)) < 0) {
flux_log_error (ctx->h, "%s: gettimeofday", __FUNCTION__);
goto done;
}
ov = get_elapse_time (start, end);
Member

there is an assumption that each of these functions sets errno on failure, but I'm not completely sure if it's true?
I got lost digging into the traversers. Possibly someone with more knowledge of the code should walk through and make sure errno is set when it should be.

Member Author

Yes, these should all set errno, including gettimeofday().

I also have #618 open to do a PR only on this topic to tighten up any loose ends. Finally, merging @trws's exception PR should make sched-fluxion-resource more robust when a C++ exception arises from within a request handler.

"overhead", &overhead,
"R", &rset,
"at", &scheduled_at)) < 0)
goto out;
Member

error path leaks future

Member Author

Doh! Good catch.
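
A sketch of the general shape of the fix; the topic string and names here are hypothetical, and the point is only that destroying the future at a single exit label covers the early-error gotos too, since flux_future_destroy() tolerates NULL:

    #include <flux/core.h>

    static int query_example (flux_t *h)
    {
        int rc = -1;
        flux_future_t *f = NULL;

        if (!(f = flux_rpc (h, "example.topic", NULL, FLUX_NODEID_ANY, 0)))
            goto out;                 /* f is NULL; destroy below is a no-op */
        if (flux_rpc_get (f, NULL) < 0)
            goto out;                 /* previously a path like this leaked f */
        rc = 0;
    out:
        flux_future_destroy (f);      /* single cleanup point for every path */
        return rc;
    }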

jobid, R, *at, *ov, R_buf)) < 0) {
goto out;
}
(*R_out) = strdup (R_buf.c_str ());
Member

check for strdup failure

*at, *ov, R_buf)) < 0) {
goto out;
}
(*R_out) = strdup (R_buf.c_str ());
Member

check for strdup failure
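
A minimal sketch of the check requested in both spots, with hypothetical names mirroring the snippet (R_out, R_buf); strdup() returns NULL on allocation failure:

    #include <cerrno>
    #include <cstring>
    #include <string>

    static int copy_R_out (char **R_out, const std::string &R_buf)
    {
        if (!(*R_out = strdup (R_buf.c_str ()))) {
            errno = ENOMEM;   /* strdup normally sets this already */
            return -1;
        }
        return 0;
    }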

goto ret;
}
if ((rc = json_object_set_new (attrs, kv.first.c_str (), stro)) == -1) {
json_decref (attrs);
Member

is stro leaked here?

Member Author

I thought we determined that freeing the memory in this case is unsafe.

flux-framework/flux-core#2803 (comment)

If we have other places where stro equivalents are freed, we need to change it; double free can lead to a broker crash.

This may have changed in a more recent jansson library. @chu11?

Member

Oof! Thanks for the reminder! I'd forgotten this affects functions other than json_pack(). It's unpleasant to have to choose between a double free and memory leak.

One way around it is to avoid json_pack() with o or any of the *_new() "object stealing" interfaces in jansson. It means a bit of extra reference count management in the non-error case.
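
A sketch of the non-stealing pattern described above: json_object_set() keeps the caller's reference, so the value can be decref'd on every path, avoiding both the leak (on set failure) and the double free. The helper and its names are illustrative, not the PR's code:

    #include <map>
    #include <string>
    #include <jansson.h>

    static json_t *attrs_from_map (const std::map<std::string, std::string> &kvs)
    {
        json_t *attrs = json_object ();
        if (!attrs)
            return NULL;
        for (const auto &kv : kvs) {
            json_t *stro = json_string (kv.second.c_str ());
            if (!stro || json_object_set (attrs, kv.first.c_str (), stro) < 0) {
                json_decref (stro);   /* NULL-safe; drop our reference on error */
                json_decref (attrs);
                return NULL;
            }
            json_decref (stro);       /* attrs now holds its own reference */
        }
        return attrs;
    }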

json_decref (o);
goto ret;
}
json_object_set_new (o, "attributes", attrs_o);
Member

check return value

test_expect_success 'recovery: loading flux-sched modules works (rv1)' '
flux module remove sched-simple &&
flux module reload -f resource &&
load_resource load-whitelist=node,core,gpu match-format=rv1 &&
Member

Member Author

Will do this as part of a future PR.


dongahn commented Jun 11, 2020

@garlick and @SteVwonder: I added fixup commits and I believe I addressed the issues captured in your review comments. Let me know if there is anything else you want me to do before this is merged.

@SteVwonder SteVwonder left a comment

Thanks @dongahn! Changes LGTM!

@garlick garlick left a comment

LGTM

@SteVwonder

@dongahn: feel free to squash the commit and set MWP whenever you are ready

dongahn added 18 commits June 11, 2020 14:12
parse_R: parse R(V1) C-string and return the JGF section
in std::string along with starttime/expiration.

R_equal: check the equality of two R(V1)s and return 0
if equal.

Note that casting st into starttime of int64_t type
later is safe because there is a guarantee that
a starting scheduled point will not exceed MAX_INT64.
Problem: to allow qmanager to reconstruct
the resource states of running jobs on
reload in the future, the resource module
needs a way to update its resource graph
using the R(V1) of each running job.

Introduce a new RPC (resource.update) that
can update the resource graph using the
R(V1) string sent to it.

Call the dfu_traverser_t class' overloaded
run() method to perform this update.
To satisfy its inputs, parse out the JGF
section as well as the starttime and
expiration keys from the R.

This is different from the existing
resource.match RPC that finds the best
matching resources according to the jobspec
and updates the resource graph with those
matching resources.

Return the previously recorded R data
if the requested jobid already exists
and the requested R is equal to the
recorded R of that jobid. This will be
the case if only qmanager was reloaded,
not resource.

Return a corresponding errno
if the requested jobid already exists
and the requested R is not equal to
the recorded R or any other error
condition arises.
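
For illustration, a sketch of how a client might drive this new RPC. The payload key names ("jobid", "R") are inferred from the commit message and the debug log earlier in this thread, not verified against the module code:

    #include <flux/core.h>
    #include <jansson.h>

    static int resource_update (flux_t *h, uint64_t jobid, const char *R)
    {
        int rc = -1;
        flux_future_t *f = flux_rpc_pack (h, "resource.update",
                                          FLUX_NODEID_ANY, 0,
                                          "{s:I s:s}",
                                          "jobid", (json_int_t)jobid,
                                          "R", R);
        if (!f)
            return -1;
        if (flux_rpc_get (f, NULL) < 0)   /* reply payload ignored in this sketch */
            goto out;
        rc = 0;
    out:
        flux_future_destroy (f);
        return rc;
    }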
Add update_allocate interface into the high level
API of our resource infrastructure.
Add error checking on strdup
within the existing match_allocate
functions in the high-level API.
The omission was found
during the code review for the newly added
update_allocate functions.
Problem: The semantics of reconstruct within the queue policy
implementation class is to reconstruct only the running
job queue.  The method name, reconstruct(), hence is
misleading as it may suggest the method would reconstruct
both the queue *and* resource state.

Change the name to be "reconstruct_queue" to clarify this.
Add more error checking code as part of this as well.
Problem: we have an API to reconstruct the
running-job queue state for a job, but we don't have
an interface to reconstruct the state of the resource
data store for the job.

Add reconstruct_resource as a pure virtual method
to the base queue policy class (queue_policy_base_t).
Because the base queue policy classes (queue_policy_base_t
and queue_policy_base_impl_t) must manage various job
queues while being agnostic to the resources of jobs, this
pure virtual method must be implemented by the
derived classes, the queue policy layer that is aware
of the resources. This virtual interface allows
queue_policy_base_t to make an up-call to the methods
overridden by all of the derived classes. It also makes
this scheme future-proof with respect to future
queueing-policy implementation classes.
Introduce the update command into flux-resource, which
is required for future tests.

    Usage: flux-resource.py update [-h] RV1 Jobid

    Update the resource database

    positional arguments:
      RV1         RV1 file name
      Jobid       Jobid

    optional arguments:
      -h, --help  show this help message and exit
Add an option to tolerate a nonexistent jobid for cancel
within resource high level API (hlapi).

Update how the queue policy layer uses this API call.

Need this in the future to implement the resource recovery
semantics for flux-sched module reloads.
Track RFC20 to support the new attributes key in RV1.
If one or more attributes are present, emit
them under "attributes.system.scheduler".
Add support for queue name attribute emission.
If a queue name is specified in the jobspec,
emit it as an attribute.
Implement the basic state recovery semantics
using the recent changes made to the resource
module, resource high-level API, queue
policy layer.

Recover the state of both the running job
queue and the resource graph data store
for running jobs.

Meet the state-recovery requirements for a
key near-term use case in which both
resource and qmanager modules are shut down
and reloaded in a coordinated fashion.

Assume two conditions for full recovery:

1. The RV1 JSON object for each and every
   currently running job includes the
   "scheduling" key, therefore the JGF
   section.
2. The resource module is reloaded before
   qmanager.

Lay some foundational work to prepare for
more comprehensive failure recovery
in the future.

If only the qmanager module is shut down
and reloaded, we still fully recover
the state.

If both qmanager and resource
are reloaded, but the RV1 JSON
of a running job does not include
the "scheduling" key, qmanager
will fail to load because
it can't recover the resource state.

To summarize the current state recovery semantics:

  fluxion modules  |   RV1 of running jobs
resource  qmanager | JGF   no-JGF
--------------------------------------------------
reload    reload   | yes   no
running   reload   | yes   yes
reload    running  | no    no
Use flux-resource's update sub-command to test
the basic functionalities of resource.update, a
newly added RPC within resource.

Generate 4 allocations in the RV1 format using
flux resource match allocate with a unit jobspec
that requests 1 node with 1 core.

Reload the resource module and use flux resource
update with these generated allocations in RV1
to reconstruct the resource graph store state.

Also add a few corner case tests:
1. Updating with an existing jobid with
       an identical RV1 string succeeds.
2. Updating with an existing jobid with
       a different RV1 string must fail.
3. Updating with an existing RV1 string
       under a new jobid must fail.
Rename t4009-update-basic.t and t4010-update-multi.t
to better match with the grouping structure of
our test cases.
Add cases to test flux-sched modules can fully
recover the state of running jobs on reload
for the cases where they must:

1. The format of running jobs at recovery point
   is rv1 and we restart both resource and
   qmanager modules.

2. We restart only qmanager, not
   the resource module.

Note that we also added test coverage for a condition
where a running job completes
after fluxion modules are unloaded, but before they are
reloaded. In test 5, job1 is cancelled after
both of the fluxion modules are unloaded.
Then, the following test is augmented to verify that job1
is not enqueued on reload.
Add cases to test that flux-sched modules don't
recover the state of running jobs on reload
for the cases where they currently can't.

The format of running jobs is rv1_nosched, and
both resource and qmanager restart.
Add some cases to test flux-sched modules can fully
recover the state of running jobs on reload
when sched-fluxion-qmanager is configured to
have multiple queues.
@dongahn dongahn added the merge-when-passing label (mergify.io - merge PR automatically once CI passes) on Jun 11, 2020
@mergify mergify bot merged commit 577d821 into flux-framework:master Jun 11, 2020

dongahn commented Jun 11, 2020

Thanks!


Successfully merging this pull request may close these issues.

Add state reconstruction support to resource module