Add basic state recovery support #663
Conversation
I added the [WIP] tag as I still need to add the queue name of each job into RV1. |
I just found a new sharness script hang. The test:

```sh
# Reload the core scheduler so that rc3 won't hang waiting for
# queue to become idle after jobs are canceled.
test_expect_success 'load sched-simple module' '
	flux module load sched-simple
'
```

The output:

```
ok 6 - removing resource and qmanager modules
2020-06-05T16:39:51.357186Z sched-simple.err[0]: hello: rlist_remove (rank0/core[12-15]): No such file or directory
ok 7 - load sched-simple module
# passed all 7 test(s)
1..7
2020-06-05T16:39:51.473915Z broker.err[0]: rc3: flux-job: Canceled 1 jobs (0 errors)
2020-06-05T16:39:51.499430Z sched-simple.err[0]: free: rank0/core[12-15]: No such file or directory
```

@grondo or @garlick: any ideas? |
BTW, I don't fully understand why this hang happens. Another piece of info: I also didn't see this hang with my old PR (i.e., before this merge and the new flux-core). |
I never used `unload`. Is there a difference between `unload` and `remove`? |
OK. This fixes the hang:

```diff
 # Reload the core scheduler so that rc3 won't hang waiting for
 # queue to become idle after jobs are canceled.
 test_expect_success 'load sched-simple module' '
+	flux module reload -f resource &&
 	flux module load sched-simple
 '
``` |
Unload is just an alias for remove. |
Codecov Report

```diff
@@            Coverage Diff             @@
##           master     #663      +/-   ##
==========================================
- Coverage   75.15%   75.07%   -0.08%
==========================================
  Files          78       78
  Lines        7894     8166     +272
==========================================
+ Hits         5933     6131     +198
- Misses       1961     2035      +74
```

Continue to review full report at Codecov. |
Travis CI is now happy! Let me now try to quickly add queue name support within RV1: flux-framework/rfc#240. |
Just added multi-queue support for state recovery. Once Travis turns green, I will take off the WIP tag. |
Two Travis configurations failed: too many bash-isms. Will fix. |
Travis CI is green now. I nominated a few people as reviewers :-). Thank you in advance. |
This would seem like a fundamental failure. Should state recovery always be enabled? What is the use case for disabling it? |
This is a rather big PR. Would it make sense to pull out all the cleanup not related to state reconstruction into a separate PR? |
@garlick: thank you for the quick response.

To enable this always, we need to make emission of the `scheduling` key (the JGF section) the default. For throughput-oriented scheduling specialization, users may trade the ability to recover state on reload for performance in their nested instances. As part of RV2, if the graph can be sufficiently reduced and compressed and thus won't be on the critical path, this can be turned on by default. But until then, I think we should still make this trade-off possible. In general, how important is it to recover the state for a single-user instance, though? |
Sure. I can split this into two. |
That makes sense to me. However, should the scheduler continue in this case, or give up, since allowing resources to be exclusively allocated to multiple jobs is a pretty bad situation? |
This is a good point. I actually thought about this and realized either way has pros and cons. For high-throughput loads, I slightly preferred overscheduling to underutilization, but perhaps this was an incorrect line of thought? |
Personally, I think it's game over if the scheduler doesn't know what's allocated. Hopefully not a condition that would arise from normal use? |
I also don't have a strong preference one way or the other because, like you said, this wouldn't arise often for normal cases. I can very easily change that semantics to discontinue. Maybe if such a case occurs, users can load the modules with recovery disabled. I do hope we can advance RV2 and the compression/condensing logic so that this becomes a moot point. |
Thanks @dongahn for putting this together. Generally LGTM! Only minor comments below, nothing major.
Thanks for spending the time to make such great commit messages and atomic commits. That effort made reviewing this PR enjoyable and relatively easy :)
One test case that I don't think is covered, but I think would be good to get in the CI: jobs finishing/getting canceled while the qmanager is unloaded and then reloading it. I tested it manually, and it worked!! It does rely on the job-manager replaying the failed 'free' message, which I am not sure we test anywhere currently.
```
herbein1@quartz1916 ~/.../toss3/flux/663-state-reconstruction/bin
❯ flux mini submit -N1 -n1 sleep inf
889125339136
❯ flux mini submit -N1 -n1 sleep inf
913200644096
❯ flux jobs
       JOBID     USER     NAME  STATUS NTASKS NNODES  RUNTIME RANKS
913200644096 herbein1    sleep       R      1      1   2.116s     0
889125339136 herbein1    sleep       R      1      1   3.547s     0
❯ flux module unload sched-fluxion-qmanager
❯ flux job cancel 889125339136
❯ flux jobs
       JOBID     USER     NAME  STATUS NTASKS NNODES  RUNTIME RANKS
913200644096 herbein1    sleep       R      1      1   55.34s     0
889125339136 herbein1    sleep       R      1      1   54.53s     0
❯ flux module load sched-fluxion-qmanager
❯ flux jobs
       JOBID     USER     NAME  STATUS NTASKS NNODES  RUNTIME RANKS
913200644096 herbein1    sleep       R      1      1   1.334m     0
❯ flux dmesg
<snip>
2020-06-10T01:39:57.913316Z broker.debug[0]: rmmod sched-fluxion-qmanager
2020-06-10T01:39:57.913752Z sched-fluxion-qmanager.debug[0]: service_unregister
2020-06-10T01:39:57.913967Z broker.debug[0]: module sched-fluxion-qmanager exited
2020-06-10T01:40:10.814045Z job-exec.debug[0]: exec aborted: id=889125339136
2020-06-10T01:40:10.814152Z job-exec.debug[0]: exec_kill: 889125339136: signal 15
2020-06-10T01:40:10.821070Z connector-local.debug[0]: unsubscribe shell-889125339136.
2020-06-10T01:40:10.837037Z job-manager.debug[0]: alloc: stop due to free response error: Function not implemented
2020-06-10T01:40:35.615413Z broker.debug[0]: insmod sched-fluxion-qmanager
2020-06-10T01:40:35.619013Z sched-fluxion-qmanager.debug[0]: enforced policy (queue=default): fcfs
2020-06-10T01:40:35.619046Z sched-fluxion-qmanager.debug[0]: effective queue params (queue=default): default
2020-06-10T01:40:35.619056Z sched-fluxion-qmanager.debug[0]: effective policy params (queue=default): default
2020-06-10T01:40:35.619395Z sched-fluxion-qmanager.debug[0]: service_register
2020-06-10T01:40:35.619626Z job-manager.debug[0]: scheduler: hello
2020-06-10T01:40:35.622273Z sched-fluxion-resource.debug[0]: update_request_cb: jobid (889125339136) with matching R exists
2020-06-10T01:40:35.622565Z sched-fluxion-qmanager.debug[0]: requeue success (queue=default id=889125339136)
2020-06-10T01:40:35.624226Z sched-fluxion-resource.debug[0]: update_request_cb: jobid (913200644096) with matching R exists
2020-06-10T01:40:35.624438Z sched-fluxion-qmanager.debug[0]: requeue success (queue=default id=913200644096)
2020-06-10T01:40:35.624650Z job-manager.debug[0]: scheduler: ready unlimited
2020-06-10T01:40:35.626107Z sched-fluxion-qmanager.debug[0]: free succeeded (queue=default id=889125339136)
```
```cpp
if (json_typeof (v) == JSON_STRING)
    value = value.substr (1, value.length () - 2);
```
I assume this is here to strip the quotes off the ends of the string? For some reason this feels like it could be fragile; I'm not sure though. Any opposition to just grabbing the string value via the jansson API?
Suggested change:

```diff
 if (json_typeof (v) == JSON_STRING)
-    value = value.substr (1, value.length () - 2);
+    value = json_string_value (v);
```
If you want to keep the `substr`, can you leave a short comment as to the purpose?
That was my initial code, but I changed it so that one can add non-string values within `sched-fluxion-qmanager.toml`. For example, with our dropped support for `resource-recovery-on-load`,

```toml
resource-recovery-on-load = true
```

as opposed to

```toml
resource-recovery-on-load = "true"
```

can be supported.
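For context on why the `substr` works at all: if the value is stringified with jansson's `json_dumps()`, a JSON string serializes with surrounding double quotes while scalars like `true` or `42` do not, so only the string case needs trimming. A minimal sketch of that assumption (the helper name and surrounding code are hypothetical, not the PR's actual code):

```cpp
#include <cstdlib>
#include <string>
#include <jansson.h>

// Stringify any JSON scalar; JSON_ENCODE_ANY permits encoding values
// that are not arrays or objects.
static std::string stringify (json_t *v)
{
    char *s = json_dumps (v, JSON_ENCODE_ANY);
    std::string value = s ? s : "";
    free (s);
    if (json_typeof (v) == JSON_STRING)
        value = value.substr (1, value.length () - 2); // drop the quotes
    return value; // true -> "true", "x" -> x
}
```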
Thanks @dongahn. I second @SteVwonder's comment that the detailed commit messages are helpful!
My comments are fairly superficial, although I did find a couple of issues on error paths. At minimum, unchecked error codes and memory leaks should be addressed.
Thanks!
```cpp
duration = static_cast<uint64_t> (et) - static_cast<uint64_t> (st);
...
freemem_out:
    json_decref (o);
```
might be a good idea to save/restore errno around json_decref().
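For illustration, a minimal sketch of the save/restore pattern being suggested (this helper is not from the PR):

```cpp
#include <errno.h>
#include <jansson.h>

// json_decref() may end up calling free(), which is permitted to
// clobber errno, so wrap it on error paths.
static void decref_preserve_errno (json_t *o)
{
    int saved_errno = errno;
    json_decref (o);
    errno = saved_errno;
}
```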
Ah... good call. This is an error path and we want to preserve `errno`!

Just out of curiosity: do you know if `json_decref` sets `errno`? I am not a big fan of using a global variable to propagate errors, but we have no other alternatives... :-)
In flux-core, we've been operating under the defensive assumption that free(), and hence destructors that internally call free() and don't preserve errno, could set errno.
https://stackoverflow.com/questions/30569981/does-free-set-errno/30571921#30571921
Great info. Thanks @garlick.
Just to make sure I am on the same page on the general errno propagation semantics. Looking at flux-core code, it seems like the semantics is best effort:

1. Capture `errno` on the first failure detected.
2. Don't reset `errno` on the cleanup/exit paths of this failure, where it makes sense.

W/ respect to 2), from flux-core I see two styles: A) only preserve `errno` possibly set by non-Flux function calls (e.g., jansson calls, free, ...); B) preserve `errno` across all calls, including Flux API calls themselves.
I can see an argument for A vs. B, but it feels like having general semantics written down for consistency could be beneficial. If you folks already discussed this, my apologies. Seems like the semantics could go in an RFC -- I don't know if something like this can go into the coding style RFC though.
I'm not sure I understand your A vs B.
A function that purports to set errno on failure may set it to any value it likes. Usually you want that to reflect the root (first) cause of a problem but not always.
Jansson makes no claims about setting errno on failure. But it probably does sometimes, given that it must make calls to malloc(), for example.
The main thing is (a) if your contract with users is that errno will be set on error, you must make sure it is not set to zero on error, and (b) once you have errno set the way you want it you must avoid clobbering it since it is a stupid global variable.
Destructors are often called after errno is set, and destructors are usually void-returning functions that ignore errors. Thus if they call any functions that set errno and then ignore the failures of those functions, they may clobber your carefully set errno value.
(Sorry if I'm just stating obvious stuff because I didn't quite get your question!)
BTW I don't think it's a foregone conclusion that fluxion internals must use this same error handling style as flux-core. Obviously they have to deal with flux-core's error semantics, but they could adopt something different for how those errors are propagated.
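To make point (a) concrete, a sketch of the convention (`do_work()` is a stand-in for any fallible call, not a real API):

```cpp
#include <errno.h>

extern int do_work (void);

// Contract: return -1 with errno set on failure; never report failure
// while errno is still zero.
int my_operation (void)
{
    if (do_work () < 0) {
        if (errno == 0)
            errno = EINVAL; // defend against callees that forgot to set it
        return -1;
    }
    return 0;
}
```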
> Destructors are often called after errno is set, and destructors are usually void-returning functions that ignore errors. Thus if they call any functions that set errno and then ignore the failures of those functions, they may clobber your carefully set errno value.

Yes, this was essentially my question -- I have seen a number of examples where a destructor in the Flux API set is called on error paths without an errno wrap. My guess is that the Flux destructors themselves are written to preserve errno internally, so this clobbering won't happen.
Correct!
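For reference, a hedged sketch of that destructor convention (not actual flux-core code): a void-returning destroy function that preserves errno across its internal free() calls, so it is safe to call on error paths.

```cpp
#include <errno.h>
#include <stdlib.h>

struct thing { char *name; };

void thing_destroy (struct thing *t)
{
    if (t) {
        int saved_errno = errno; // free() may clobber errno
        free (t->name);
        free (t);
        errno = saved_errno;     // the caller's errno survives cleanup
    }
}
```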
```cpp
rc = (json_equal (rlite1, rlite2) == 1)? 0 : 1;
...
out:
    json_decref (o1);
```
same comment here about saving errno across json_decref
```cpp
if ( (rc = gettimeofday (&start, NULL)) < 0) {
    flux_log_error (ctx->h, "%s: gettimeofday", __FUNCTION__);
    goto done;
}
if ( (rc = parse_R (ctx, R, jgf, at, duration)) < 0) {
    flux_log_error (ctx->h, "%s: parsing R", __FUNCTION__);
    goto done;
}
if ( (rc = run (ctx, jobid, jgf, at, duration)) < 0) {
    flux_log_error (ctx->h, "%s: run", __FUNCTION__);
    goto done;
}
if ( (rc = ctx->writers->emit (o)) < 0) {
    flux_log_error (ctx->h, "%s: writers->emit", __FUNCTION__);
    goto done;
}
if ( (rc = gettimeofday (&end, NULL)) < 0) {
    flux_log_error (ctx->h, "%s: gettimeofday", __FUNCTION__);
    goto done;
}
ov = get_elapse_time (start, end);
```
there is an assumption that each of these functions sets errno on failure, but I'm not completely sure that's true.
I got lost digging into the traversers. Possibly someone with more knowledge of the code should walk through and make sure errno is set when it should be.
"overhead", &overhead, | ||
"R", &rset, | ||
"at", &scheduled_at)) < 0) | ||
goto out; |
error path leaks future
Doh! Good catch.
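A sketch of the fixed shape (the function name and payload handling here are assumptions modeled on the snippet above, not the PR's exact code): route every error through a label that destroys the future, relying on flux-core destructors being NULL-safe and errno-preserving as discussed earlier in this thread.

```cpp
#include <stdint.h>
#include <flux/core.h>

static int fetch_update_response (flux_future_t *f, double *overhead,
                                  const char **rset, int64_t *scheduled_at)
{
    int rc = -1;

    if (flux_rpc_get_unpack (f, "{s:f s:s s:I}",
                             "overhead", overhead,
                             "R", rset,
                             "at", scheduled_at) < 0)
        goto out;
    rc = 0;
out:
    flux_future_destroy (f); // no longer leaked on the error path
    return rc;
}
```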
```cpp
              jobid, R, *at, *ov, R_buf)) < 0) {
    goto out;
}
(*R_out) = strdup (R_buf.c_str ());
```
check for strdup failure
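A sketch of the requested check; POSIX strdup() sets errno to ENOMEM on failure, so propagating -1 is enough (surrounding variables follow the snippet above):

```cpp
if (!((*R_out) = strdup (R_buf.c_str ()))) {
    rc = -1;   // errno == ENOMEM per POSIX
    goto out;
}
```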
```cpp
              *at, *ov, R_buf)) < 0) {
    goto out;
}
(*R_out) = strdup (R_buf.c_str ());
```
check for strdup failure
```cpp
    goto ret;
}
if ((rc = json_object_set_new (attrs, kv.first.c_str (), stro)) == -1) {
    json_decref (attrs);
```
is stro leaked here?
I thought we determined that freeing memory in this case is unsafe:
flux-framework/flux-core#2803 (comment)
If we have other places where `stro` equivalents are freed, we need to change it; a double free can lead to a broker crash.
This may have changed in a more recent jansson library. @chu11?
Oof! Thanks for the reminder! I'd forgotten this affects functions other than `json_pack()`. It's unpleasant to have to choose between a double free and a memory leak.
One way around it is to avoid `json_pack()` with `o` or any of the `*_new()` "object stealing" interfaces in jansson. It means a bit of extra reference count management in the non-error case.
In `resource/writers/match_writers.cpp`:
```cpp
    json_decref (o);
    goto ret;
}
json_object_set_new (o, "attributes", attrs_o);
```
check return value
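One hedged way to add the check while sidestepping the `*_new()` ownership ambiguity discussed earlier is the non-stealing setter:

```cpp
if (json_object_set (o, "attributes", attrs_o) < 0) {
    json_decref (attrs_o); // ownership is unambiguous with json_object_set()
    json_decref (o);
    goto ret;
}
json_decref (attrs_o);     // o now holds its own reference
```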
```sh
test_expect_success 'recovery: loading flux-sched modules works (rv1)' '
    flux module remove sched-simple &&
    flux module reload -f resource &&
    load_resource load-whitelist=node,core,gpu match-format=rv1 &&
```
Will do this as part of a future PR.
@garlick and @SteVwonder: I added fixup commits and I believe I addressed the issues captured in your review comments. Let me know if there is anything else you want me to do before this is merged. |
Thanks @dongahn! Changes LGTM!
LGTM
@dongahn: feel free to squash the commits and set MWP whenever you are ready |
parse_R: parse an R(V1) C-string and return the JGF section in a std::string along with starttime/expiration. R_equal: check the equality of two R(V1)s and return 0 if equal. Note that casting st into starttime of int64_t type later is safe because there is a guarantee that a starting scheduled point will not exceed MAX_INT64.
Problem: to allow qmanager to reconstruct the resource states of running jobs on reload in the future, the resource module needs a way to update its resource graph using the R(V1) of each running job. Introduce a new RPC (resource.update) that can update the resource graph using the R(V1) string sent to it. Call the dfu_traverser_t class' overloaded run() method to perform this update. To satisfy its inputs, parse out the JGF section as well as the starttime and expiration keys from the R. This is different from the existing resource.match RPC, which finds the best matching resources according to the jobspec and updates the resource graph with those matching resources. Return the previously recorded R data if the requested jobid already exists and the requested R is equal to the recorded R of that jobid. This will be the case if only qmanager was reloaded, not resource. Return a corresponding errno if the requested jobid already exists and the requested R is not equal to the recorded R, or if any other error condition arises.
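For illustration, a hypothetical client-side call to the new resource.update RPC; the payload keys (a jobid plus the stored R string) are assumptions modeled on the update sub-command, not the PR's exact wire format:

```cpp
#include <stdint.h>
#include <flux/core.h>

static int resource_update (flux_t *h, int64_t jobid, const char *R)
{
    flux_future_t *f;
    int rc = -1;

    if (!(f = flux_rpc_pack (h, "resource.update", FLUX_NODEID_ANY, 0,
                             "{s:I s:s}",
                             "jobid", (json_int_t)jobid, "R", R)))
        return -1;
    if (flux_rpc_get (f, NULL) < 0) // fails e.g. on a jobid/R mismatch
        goto out;
    rc = 0;
out:
    flux_future_destroy (f);
    return rc;
}
```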
Add the update_allocate interface into the high-level API of our resource infrastructure.
Add error checking on strdup within the existing match_allocate functions in the high-level API. The omission was found during the code review for the newly added update_allocate functions.
Problem: the semantics of reconstruct within the queue policy implementation class is to reconstruct only the running-job queue. The method name, reconstruct(), is hence misleading, as it may suggest the method would reconstruct both the queue *and* the resource state. Change the name to "reconstruct_queue" to clarify this. Add more error-checking code as part of this as well.
Problem: we have an API to reconstruct the running-job queue state for a job, but we don't have an interface to reconstruct the state of the resource data store for the job. Add reconstruct_resource as a pure virtual method to the base queue policy class (queue_policy_base_t). Because the base queue policy classes (queue_policy_base_t and queue_policy_base_impl_t) must manage various job queues while remaining agnostic to the resources of jobs, this pure virtual method must be implemented by the derived classes: the queue policy layer that is aware of the resources. This virtual interface allows queue_policy_base_t to make an up-call to the methods overridden by all of the derived classes. It also makes this scheme future-proof with respect to future queueing-policy implementation classes.
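A stripped-down sketch of that up-call scheme (class and member names simplified for illustration; not the actual qmanager code):

```cpp
#include <cstdint>
#include <string>

struct job_info { int64_t id; std::string R; };

class queue_policy_base_t {
public:
    virtual ~queue_policy_base_t () = default;
    // Pure virtual: only the resource-aware derived classes know how
    // to rebuild the resource state for a running job.
    virtual int reconstruct_resource (const job_info &job) = 0;
    // The base class rebuilds the running-job queue, making an up-call
    // into the derived class for the resource side first.
    int reconstruct (const job_info &job)
    {
        if (reconstruct_resource (job) < 0)
            return -1;
        return reconstruct_queue (job);
    }
protected:
    int reconstruct_queue (const job_info &job)
    {
        (void)job; // enqueue into the running-job queue in the real code
        return 0;
    }
};
```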
Introduce the update command into flux-resource, which is required for future tests.

```
usage: flux-resource.py update [-h] RV1 Jobid

Update the resource database

positional arguments:
  RV1         RV1 file name
  Jobid       Jobid

optional arguments:
  -h, --help  show this help message and exit
```
Add an option to tolerate a nonexistent jobid for cancel within the resource high-level API (hlapi). Update how the queue policy layer uses this API call. This is needed in the future to implement the resource recovery semantics for flux-sched module reloads.
Track RFC20 to support the new attributes key in RV1. If one or more attributes are present, emit them under "attributes.system.scheduler".
Add support for queue name attribute emission. If a queue name is specified in the jobspec, emit it as an attribute.
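For a sense of the emitted shape (the scheduler-specific key name "queue" is an assumption; the commits above only fix the attributes.system.scheduler location per RFC 20):

```json
{
  "attributes": {
    "system": {
      "scheduler": {
        "queue": "batch"
      }
    }
  }
}
```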
Implement the basic state recovery semantics using the recent changes made to the resource module, the resource high-level API, and the queue policy layer. Recover the state of both the running-job queue and the resource graph data store for running jobs. Meet the state-recovery requirements for a key near-term use case in which both the resource and qmanager modules are shut down and reloaded in a coordinated fashion. Assume two conditions for full recovery: 1. The RV1 JSON object for each and every currently running job includes the "scheduling" key, therefore the JGF section. 2. The resource module is reloaded before qmanager. Lay some foundational work to prepare for more comprehensive failure recovery in the future. If only the qmanager module is shut down and reloaded, we still fully recover the state. If both qmanager and resource are reloaded, but the RV1 JSON of a running job does not include the "scheduling" key, qmanager will fail to load because it can't recover the resource state. To summarize the current state recovery semantics:

```
fluxion modules       |  RV1 of running jobs
resource   qmanager   |  JGF      no-JGF
--------------------------------------------------
reload     reload     |  yes      no
running    reload     |  yes      yes
reload     running    |  no       no
```
Use flux-resource's update sub-command to test the basic functionality of resource.update, a newly added RPC within resource. Generate 4 allocations in the RV1 format using flux resource match allocate with a unit jobspec that requests 1 node with 1 core. Reload the resource module and use flux resource update with these generated RV1 allocations to reconstruct the resource graph store state. Also add a few corner-case tests: 1. Updating with an existing jobid and an identical RV1 string succeeds. 2. Updating with an existing jobid and a different RV1 string must fail. 3. Updating with an existing RV1 string under a new jobid must fail.
Rename t4009-update-basic.t and t4010-update-multi.t to better match the grouping structure of our test cases.
Add cases to test that flux-sched modules can fully recover the state of running jobs on reload in the cases where they must: 1. The format of running jobs at the recovery point is rv1 and we restart both the resource and qmanager modules. 2. We restart only qmanager, not the resource module. Note that we also added test coverage for a condition where a running job completes after the fluxion modules are unloaded, but before they are reloaded. In test 5, job1 is cancelled after both of the fluxion modules are unloaded. Then, the following test is augmented to verify that job1 is not enqueued on reload.
Add cases to test that flux-sched modules don't recover the state of running jobs on reload in the cases where they currently can't: the format of running jobs is rv1_nosched, and both resource and qmanager restart.
Add some cases to test that flux-sched modules can fully recover the state of running jobs on reload when sched-fluxion-qmanager is configured with multiple queues.
Thanks! |
This PR implements the basic state recovery semantics.

Recover the state of both the running-job queue and the resource graph data store for those running jobs. Meet the state-recovery requirement for a key near-term use case in which both `resource` and `qmanager` modules are shut down and reloaded in a coordinated fashion. Assume two conditions for full recovery:

1. The RV1 JSON object for each and every currently running job includes the `scheduling` key, therefore the JGF section.
2. The `resource` module is reloaded before `qmanager`.

Lay some additional work to prepare for more comprehensive failure recovery in the future. If only the `qmanager` module is shut down and reloaded, still fully recover the state. If both `qmanager` and `resource` are reloaded, but the RV1 JSON of a running job does not include the `scheduling` key, `qmanager` will fail to load because it can't recover the resource state.

[Table 1] Current flux-sched state recovery semantics

| resource | qmanager | RV1 w/ JGF | RV1 w/o JGF |
|----------|----------|------------|-------------|
| reload   | reload   | yes        | no          |
| running  | reload   | yes        | yes         |
| reload   | running  | no         | no          |

Resolve #470