This repository has been archived by the owner on Oct 22, 2021. It is now read-only.
Signed-off-by: Jan Dubois <[email protected]>
…ss fails To avoid terminating early and killing a still running drain script it is necessary to keep the container-run process running even if the main process has already exited. Once container-run receives the SIGTERM signal it will pass it on to the child processes and then wait until all direct children have exited before exiting itself. This should allow the processes to terminate cleanly as long as the grace period has not yet expired. container-run will also terminate immediately if the main process terminates with an error. This allows the container to fail and be restarted by k8s. Signed-off-by: Jan Dubois <[email protected]>
This is used to trigger graceful shutdown of nginx. Signed-off-by: Jan Dubois <[email protected]>
…ning Can be used to wait for a process to stop after sending the STOP command. Signed-off-by: Jan Dubois <[email protected]>
It only supports the "start" and "stop" commands from the real bpm, but has additional "quit" and "running" commands to help reimplementing drain scripts that make use of the pid files. Signed-off-by: Jan Dubois <[email protected]>
Signed-off-by: Jan Dubois <[email protected]>
It is important to process errors synchronously because their handling depends on the value of the 'active' flag. To avoid deadlocks any code sending errors to the channel must run in a blockable goroutine. Signed-off-by: Jan Dubois <[email protected]>
(and give them a chance to notify their child processes). If the processes haven't terminated in 20s, send SIGKILL. Signed-off-by: Jan Dubois <[email protected]>
Signed-off-by: Jan Dubois <[email protected]>
Signed-off-by: Jan Dubois <[email protected]>
That way it also starts the timeout trigger if the process doesn't stop within 20s. Signed-off-by: Jan Dubois <[email protected]>
The rest are just confusing the debug output, and passing on e.g. SIGCHLD to the child processes is neither correct nor useful. Signed-off-by: Jan Dubois <[email protected]>
container-run should send a SIGKILL if the process is still alive after 20s, so if it isn't gone after 30, then it probably will not stop. The `bpm running` subcommand will also now print "yes" or "no" in addition to setting the exit code. Signed-off-by: Jan Dubois <[email protected]>
(but they no longer pass because they have not been updated with the new application logic). Signed-off-by: Jan Dubois <[email protected]>
Also modify `bpm running` to only print yes/no when stdout is a tty. Signed-off-by: Jan Dubois <[email protected]>
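The `bpm running` convention from the commits above (exit code `0` or `1`, plus "yes"/"no" only when stdout is a tty) can be illustrated with a small Go sketch. This is not the code from this PR: the `bpm` emulation is a script, and the names `isTerminal` and `runningReport` as well as the hard-coded `running` value are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
)

// isTerminal reports whether f is a character device. This is a common
// stdlib-only approximation of "is a tty"; note that it is also true for
// /dev/null.
func isTerminal(f *os.File) bool {
	info, err := f.Stat()
	if err != nil {
		return false
	}
	return info.Mode()&os.ModeCharDevice != 0
}

// runningReport maps the process state to the text to print and the exit
// code: 0 if the process is running, 1 if it is not. Text is only produced
// for interactive use, so scripts can rely on the exit code alone.
func runningReport(running, interactive bool) (string, int) {
	msg, code := "no\n", 1
	if running {
		msg, code = "yes\n", 0
	}
	if !interactive {
		msg = ""
	}
	return msg, code
}

func main() {
	running := true // placeholder: a real check would query the process state
	msg, code := runningReport(running, isTerminal(os.Stdout))
	fmt.Print(msg)
	os.Exit(code)
}
```

Keeping the decision in a pure function separates the exit-code contract from the tty detection, which makes the former easy to test.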
This new nginx drain logic was added in capi-release v1.110.0 via cloudfoundry/capi-release@30690f3. kubecf is currently using capi-release v1.107.0, so this does not yet apply.
mook-as
suggested changes
May 11, 2021
Not really sure my review is comprehensive, but it's something at least I guess?
Signed-off-by: Jan Dubois <[email protected]>
The errors channel is unbuffered, so stopProcesses cannot deliver errors if it is running from the same goroutine as the receiver. Signed-off-by: Jan Dubois <[email protected]>
Signed-off-by: Jan Dubois <[email protected]>
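The deadlock described in the commit above follows from how unbuffered Go channels work: a send blocks until some other goroutine receives. A minimal sketch, assuming a hypothetical signature for `stopProcesses` (the real function in this PR may look different) and a hypothetical helper `collectOne`:

```go
package main

import (
	"errors"
	"fmt"
)

// stopProcesses is a hypothetical stand-in for the function named in the
// commit message; here it only reports a single failure on errs.
func stopProcesses(errs chan<- error) {
	errs <- errors.New("process did not stop")
}

// collectOne receives one error from an unbuffered channel. The sender runs
// in its own goroutine so that its send can block until this goroutine is
// ready to receive.
func collectOne() error {
	errs := make(chan error) // unbuffered: every send needs a waiting receiver

	// Calling stopProcesses(errs) directly here would block forever, because
	// the only receiver is this same goroutine.
	go stopProcesses(errs)

	return <-errs
}

func main() {
	fmt.Println(collectOne())
}
```

Handing the send to a separate goroutine keeps error handling synchronous on the receiving side while still letting the sender block safely.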
mook-as
approved these changes
May 11, 2021
That looks great. One thing I can think of: if any of the goroutines in the tests gives a problem (flakiness, etc.), maybe …
This PR tries to address the remaining issues in the Quarks operator regarding drain script support.
Many thanks to @univ0298 for testing the changes and @andrew-edgar for updating the tests!
History
The first identified issue was: POD draining does not work because individual containers terminate prematurely instead of waiting for the completion of all drain/preStops on all containers.
It was supposed to be fixed by: Wait for all drain scripts to finish
That fix was incomplete because containers whose main process exits will get killed immediately and not participate in the drain script waiting process. The first idea for fixing this did not work out: Make sure terminated containers are considered as already drained
Further problems are identified in: Many Cloud Foundry drain scripts do not work in KubeCF due to no monit and no bpm
Fixes
This pull request attempts to address the remaining issues by:
- keeping `container-run` running until it receives a `SIGTERM` signal. That way `PreStop` hooks should not get killed prematurely.
- providing an emulation `bpm` script that implements the common `bpm stop` command used by many of the drain scripts in the CF releases. In addition it provides a mechanism to implement patches for those drain scripts that directly use the pid files to signal other processes.
In addition this PR fixes the way subprocesses are stopped. They were stopped via `SIGKILL`, orphaning their child processes and generally not allowing things to shut down gracefully. This PR mirrors the `bpm stop` functionality and sends a `SIGTERM`, followed 20s later by a `SIGKILL` if the child processes are still running at that time.
BPM commands
The `bpm` script implements the following commands from the bpm-release: `bpm start` and `bpm stop`.
The `bpm stop` command will wait up to 30s for the process to actually stop (it normally should be killed via `SIGKILL` within 20s if it doesn't respond to `SIGTERM`).
The following extensions are also provided:
- `bpm term` is the same as `bpm stop`, except it doesn't wait for the process to terminate.
- `bpm quit` will send a `SIGQUIT` signal to the process (e.g. to tell nginx to shut down gracefully).
- `bpm running` will return with a `0` exit code if the process is still running, or a `1` if it isn't. It will also print "yes" or "no" to stdout, if that is connected to a tty.
Manual testing
I compiled the program on Linux (using `nc` with UDP to Unix sockets doesn't seem to work on macOS):
I use the following test program for all tests:
It runs for 30s and then exits on its own, printing a progress message every 3 seconds. It acknowledges `SIGQUIT` and `SIGTERM` signals, but only exits early if the `SIGTERM` has been preceded by a `SIGQUIT`. It is always run via:
The `main.sh` process runs until completion, but `container-run` remains running. From a second shell `main.sh` can be restarted:
and the process runs again from start to finish. Running `pkill container-run` once the script has finished will now terminate `container-run` immediately.
Starting `container-run` again and running `bpm stop main` from the second terminal shows that the signal is received (once the current sleep returns), but the process doesn't abort right away; it runs for another 20s before being killed:
Output:
Using `bpm term main` works the same as `bpm stop main`, but returns immediately and doesn't wait for the process to stop.
Sending a `SIGQUIT` before the `stop` command lets the script terminate right away, but `container-run` will keep running as expected:
# bpm start main; sleep 3; bpm quit main; bpm stop main
Output:
Sending `SIGTERM` to `container-run` while `main.sh` is running will send `SIGTERM` to the script as well. It will continue running because it hasn't received a `SIGQUIT` first. After 20s the script will be killed via `SIGKILL` and `container-run` exits:
# bpm start main; sleep 3; pkill container-run
Output:
The `bpm running` command can be used to check if the process is still running or not:
Open issues
In addition to a new operator release using the updated `container-run` version we also need at least 2 patches at the kubecf level:
- The cloud controller drain.sh script calls shutdown_drain.rb using cloud_controller_ng/drain.rb. It should be possible to replace the drain script with something like:
- The garden process invokes `garden_start` via tini. Since we don't run `tini` as pid 1, we need to pass the `-s` option to enable the `TINI_SUBREAPER`.