Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deadlock on shutdown #5309

Open
mpghf opened this issue Feb 24, 2025 · 13 comments
Open

Deadlock on shutdown #5309

mpghf opened this issue Feb 24, 2025 · 13 comments
Assignees

Comments

@mpghf
Copy link

mpghf commented Feb 24, 2025

At least version 3.8.4 of cyrus imapd sometimes (quite often) hangs indefinitely on shutdown.

My original mail regarding this is here:

https://cyrus.topicbox.com/groups/info/Te092838a4fe380d4/version-3-8-4-hangup-on-shutdown

The problem seems to have started somewhere between version 3.4 and version 3.8.4, but I note that there was at least one previous report on an earlier version: https://cyrus.topicbox.com/groups/devel/Tddff7ee5049ffbc4 It could just be that timing of the shutdown has changed subtly.

My debugging attempts show that it's waiting here (below) for at least one of the child processes to stop:

   Process 57237 stopped
* thread #1, name = 'master', stop reason = signal SIGSTOP
    frame #0: 0x000000000020e90e master`child_janitor(now=(tv_sec = 1738587528, tv_usec = 269269)) at master.c:1541:26
   1538
   1539     while (i-- > 0) {
   1540         p = &ctable[janitor_position++];
-> 1541         janitor_position = janitor_position % child_table_size;
   1542         while (*p) {
   1543             c = *p;
   1544             if (c->service_state == SERVICE_STATE_DEAD) { 

It looks like one of the child processes is not getting correctly waited on:

Feb 22 15:30:45 mail imapsext[11736]: graceful shutdown

cyrus    11736   0.0  0.0       0       0  -  Z    15:30        0:00.05 <defunct> 

I'm looking into this to try and get more details.

Note that this is on FreeBSD 13.4, but the previous reporter was apparently using some Linux variety.

@mpghf
Copy link
Author

mpghf commented Feb 25, 2025

Further debugging shows that the child process is not getting reaped by reap_child(). That's not called because a SIGCHLD is missed (for unknown reasons).

This only seems to happen on shutdown, but perhaps during normal operation the missing SIGCHLD doesn't matter because subsequent SIGCHLD signals are detected and cause reap_child() to be called, which in turn waits on all defunct children.

To fix this, I've added an explicit call to reap_child() if in_shutdown is set.

        }
        if (in_shutdown)
          reap_child();

        gettimeofday(&now, 0);
        child_janitor(now);
        do_prom_report(now);
    }

finished: 

@elliefm
Copy link
Contributor

elliefm commented Feb 25, 2025

Shot in the dark, but does this make any difference:

diff --git a/master/master.c b/master/master.c
index e38ff7f66..9219b1445 100644
--- a/master/master.c
+++ b/master/master.c
@@ -1672,7 +1672,7 @@ static volatile sig_atomic_t gotsigchld = 0;
 
 static void sigchld_handler(int sig __attribute__((unused)))
 {
-    gotsigchld = 1;
+    gotsigchld ++;
 }
 
 static volatile int gotsighup = 0;
@@ -3267,9 +3267,9 @@ int main(int argc, char **argv)
             spawn_schedule(now);
 
         /* reap first, that way if we need to babysit we will */
-        if (gotsigchld) {
+        while (gotsigchld) {
             /* order matters here */
-            gotsigchld = 0;
+            gotsigchld --;
             reap_child();
         }
 

I think it shouldn't make a difference -- reap_child itself calls waitpid in a loop until all pending children have been reaped, so in this new loop the first call should process all the pending children, and then the remaining calls should find nothing to do. If it does make a difference, that will be interesting.

One thing I worry a little about is that /* order matters here */ comment. That would seem to imply reap_child expects gotsigchild == 0 on entry. I couldn't see an explicit dependency like that, but there might be something I've missed that'll cause new problems, so please tread carefully.

@elliefm
Copy link
Contributor

elliefm commented Feb 25, 2025

That patch is against the master branch, but I think it should apply cleanly to most versions, this stuff hasn't changed in a while.

@elliefm elliefm self-assigned this Feb 25, 2025
@elliefm
Copy link
Contributor

elliefm commented Feb 26, 2025

I think the earlier (Linux) report of this was fixed by #3449. It detects when waitpid() was interrupted by a signal, and retries. But it does this only when waitpid returns 0. If I'm reading the FreeBSD waitpid manpage correctly, waitpid on FreeBSD returns -1 in this situation, so the check for EINTR and retry wouldn't happen, and instead reap_child would finish its loop without having reaped everything that was pending.

So it might simply be that, during shutdown and the flurry of processes exiting, some of those SIGCHLD signals wind up interrupting the waitpid, and that's how they get missed. If this is what's happening, it should be an easy fix, I'll have a new patch for you to try shortly.

@elliefm
Copy link
Contributor

elliefm commented Feb 26, 2025

Try this on for size:

diff --git a/master/master.c b/master/master.c
index e38ff7f66..3a324dc0e 100644
--- a/master/master.c
+++ b/master/master.c
@@ -1373,13 +1373,19 @@ static void reap_child(void)
     struct service *wd;
     int failed;
 
-    while ((pid = waitpid((pid_t) -1, &status, WNOHANG)) > -1) {
-        if (pid == 0) {
-            if (errno == EINTR) {
-                errno = 0;
+    while (1) {
+        errno = 0;
+        pid = waitpid((pid_t) -1, &status, WNOHANG);
+        if (pid <= 0) {
+            switch (errno) {
+            case EINTR:     /* waitpid was interrupted, try again */
                 continue;
+            case ECHILD:    /* no more children */
+            case 0:         /* (WNOHANG) no children have exited */
+            default:        /* some kind of error */
+                errno = 0;
+                return;
             }
-            break;
         }
 
         /* account for the child */

@mpghf
Copy link
Author

mpghf commented Feb 26, 2025

Shot in the dark, but does this make any difference:

 
         /* reap first, that way if we need to babysit we will */
-        if (gotsigchld) {
+        while (gotsigchld) {

One thing I worry a little about is that /* order matters here */ comment. That would seem to imply reap_child expects gotsigchild == 0 on entry. I couldn't see an explicit dependency like that, but there might be something I've missed that'll cause new problems, so please tread carefully.

Thanks. I've tried this one, and it didn't help.

Regarding the comment about the order mattering: if the ordering were the other way around, it would be possible for:

  • gotsigchld to be set by the signal handler immediately after calling reap_child() (it would already be 1), and
  • then immediately reset before it is noticed.

This would not be an issue when pselect is being used (since SIGCHLD is only enabled during pselect), but if select is used, it would be a potential issue. Note that FreeBSD uses pselect.

I'll try the other patch next.

@mpghf
Copy link
Author

mpghf commented Feb 26, 2025

Try this on for size:

diff --git a/master/master.c b/master/master.c
index e38ff7f66..3a324dc0e 100644
--- a/master/master.c
+++ b/master/master.c
@@ -1373,13 +1373,19 @@ static void reap_child(void)
     struct service *wd;
     int failed;
 
-    while ((pid = waitpid((pid_t) -1, &status, WNOHANG)) > -1) {

Thanks. I've tried this one too, and it didn't work unfortunately.

I also (separately) tried the fix in #3449, which is not in 3.8.5, but it also didn't help.

Note also Dilyan's comment on my post on the info mailing list recently:
https://cyrus.topicbox.com/groups/info/Te092838a4fe380d4-M963487fdab4291ba58f58129/version-3-8-4-hangup-on-shutdown
He still seems to be having some variation of this issue too, and presumably he's running Linux still.

Let me see what other information I can gather.

@mpghf
Copy link
Author

mpghf commented Feb 26, 2025

Calling sigpending() immediately before and after myselect() returns tells me that SIGCHLD is pending (and remains pending) when master ends up in this state, but for some reason the signal handler doesn't get called when myselect() is called.

Will investigate further.

@elliefm
Copy link
Contributor

elliefm commented Feb 27, 2025

Interesting. I wonder if there's a race in our signal block/unblock stuff around the pselect.

The waitpid documentation talks about a bunch of slightly different behaviours depending on the status of the SIGCHLD handler, which I don't completely understand at a first read. But if there's a gap around the pselect that the signal can surprise us in, then maybe we're hitting one of those slightly different behaviours when it does.

@mpghf
Copy link
Author

mpghf commented Feb 28, 2025

Another data point is that there is that pselect is returning 1 when master is stuck in this situation, so something is trying to connect. This is not unexpected, since it's a live system with a bunch of users.

Digging into the FreeBSD kernel code, I see the code below. Note the comment in particular. I'm not an expert in the kernel code, but it seems to imply that only signals that occur during the pselect will result in the signal handlers being called when pselect returns. That looks like the behaviour that I'm seeing.

int
kern_pselect(struct thread *td, int nd, fd_set *in, fd_set *ou, fd_set *ex,
    struct timeval *tvp, sigset_t *uset, int abi_nfdbits)
{
	int error;

	if (uset != NULL) {
		error = kern_sigprocmask(td, SIG_SETMASK, uset,
		    &td->td_oldsigmask, 0);
		if (error != 0)
			return (error);
		td->td_pflags |= TDP_OLDMASK;
	}
	error = kern_select(td, nd, in, ou, ex, tvp, abi_nfdbits);
	if (uset != NULL) {
		/*
		 * Make sure that ast() is called on return to
		 * usermode and TDP_OLDMASK is cleared, restoring old
		 * sigmask.  If we didn't get interrupted, then the caller is
		 * likely not expecting a signal to hit that should normally be
		 * blocked by its signal mask, so we restore the mask before
		 * any signals could be delivered.
		 */
		if (error == EINTR) {
			ast_sched(td, TDA_SIGSUSPEND);
		} else {
			/* *select(2) should never restart. */
			MPASS(error != ERESTART);
			ast_sched(td, TDA_PSELECT);
		}
	}

	return (error);
}

Looking at the FreeBSD commit history, it looks like this changes was made about 3 months ago, which is when I started seeing the problem, so the change in Cyrus version is probably just coincidental.

The change is discussed here: https://reviews.freebsd.org/D47738 and here: https://reviews.freebsd.org/D47741. The particular issue that they were seeing was slightly different - in their case the signal is received after the select part of the pselect returns, but before the signal mask is restored (still in pselect).

One of the comments says:

It looks on inspection like Linux provides the stronger guarantee that you won't see a signal handled if it didn't interrupt the sleep, but OpenBSD does not.

This thread, from StackOverflow suggests that this (the new behaviour on FreeBSD, as of 3 months ago) is the correct behaviour, and that it matches the Linux behaviour: https://stackoverflow.com/questions/62315082/pselect-on-linux-does-not-deliver-signals-if-events-are-pending.

That thread quotes the Posix specification on this: https://pubs.opengroup.org/onlinepubs/9699919799/functions/pselect.html

If none of the selected descriptors are ready for the requested operation, the pselect() or select() function shall block until at least one of the requested operations becomes ready, until the timeout occurs, or until interrupted by a signal.

In my case, one of the selected file descriptors is ready.

So, I think the new behaviour of FreeBSD's pselect is correct, and that it matches Linux's behaviour, although I haven't tested that out explicitly.

I'll write some test code for this scenario.

Potential fixes:

  1. Call sigprocmask() after the pselect() to enable SIGCHLD and others, and then call sigprocmask() again immediately to disable them.
  2. Call reap_child() explicitly in the main loop. In that case we could maybe take out the SIGCHLD handling altogether?
  3. Use a shorter delay in pselect. The use of pselect was introduced to (try and) make sure that SIGCHLD was called during pselect as a way of stopping the long delay (up to 10 seconds) during shutdown, as I understand it.

The first two will introduce 2 (for the first option) or 1 (for the second option) additional system calls, which could perhaps be a performance issue, but I don't know how significant that will be.

@mpghf
Copy link
Author

mpghf commented Mar 1, 2025

Testing with separate test code on Linux (kernel 5.15.88) and FreeBSD (13.4-STABLE):

  • In all cases a signal handler (I use SIGUSR1 for testing) is set up.
  • sigprocmask() is used to block SIGUSR1.
  • I then call raise(SIGUSR1) to make sure that a SIGUSR1 signal is pending.
  • I call connect to connect a different socket to the listening socket.
  • I do not call accept on the listening socket. This is to keep it ready for the later pselect call
  • pselect() is called on a listening socket.

I consider 4 scenarios for completeness. Scenario 1 below is the one I'm seeing in master.

  1. The connection to the listening socket is established before pselect is called, and the pselect timeout is 0
  2. The connection to the listening socket is established before pselect is called, and the pselect timeout is 10
  3. The connection to the listening socket is not established before pselect is called, and the pselect timeout is 0
  4. The connection to the listening socket is not established before pselect is called, and the pselect timeout is 10

The results for the above scenarios are:

  1. Neither Linux nor FreeBSD call the signal handler, and the signal remains pending after the pselect.
  2. Neither Linux nor FreeBSD call the signal handler, and the signal remains pending after the pselect.
  3. Linux does call the signal handler, but FreeBSD does not, and the signal remains pending after the pselect (on FreeBSD).
  4. Both Linux and FreeBSD call the signal handler.

@elliefm
Copy link
Contributor

elliefm commented Mar 3, 2025

That's some good research!

  1. sigprocmask after the pselect to enable SIGCHLD and others, and then again immediately to disable them.

This feels the right track to me. Like, particular kernel implementations/fixes aside, it sounds like we can't portably rely on pselect to deliver signals which were already pending before the pselect call, and so we should open a little window of our own for them to arrive in. In fact, here's what the Linux man page for pselect says (emphasis mine):

The reason that pselect() is needed is that if one wants to wait for either a signal or for a file descriptor to become ready, then an atomic test is needed to prevent race conditions. (Suppose the signal handler sets a global flag and returns. Then a test of this global flag followed by a call of select() could hang indefinitely if the signal arrived just after the test but just before the call. By contrast, pselect() allows one to first block signals, handle the signals that have come in, then call pselect() with the desired sigmask, avoiding the race.)

So it looks like opening a little window is exactly what we were meant to be doing all along. It's interesting that it suggests putting that window before the pselect call, rather than after. I think I would put it in myselect().

I'm not sure how the two extra syscalls might affect performance, but if we were supposed to be doing that all along, it can't hurt that bad?

@mpghf
Copy link
Author

mpghf commented Mar 3, 2025

Sorry @elliefm, this was meant to be a reply to your comment #5309 (comment).

I suspect we need to put the additional calls to sigprocmask() before we check the gotsigchld flag, in order to avoid the race mentioned here: d288a9a

I don't know if Greg Banks, who originally made that change, is still available to comment.

So this (in main()):

        /* run any scheduled processes */
        if (!in_shutdown)
            spawn_schedule(now);

#if HAVE_PSELECT
        sigset_t old_sigmask;
        sigprocmask(SIG_SETMASK, &pselect_sigmask, &old_sigmask);
        sigprocmask(SIG_SETMASK, &old_sigmask, &pselect_sigmask);
#endif

        /* reap first, that way if we need to babysit we will */
        if (gotsigchld) {
            /* order matters here */
            gotsigchld = 0;
            reap_child();
        }

rather than this:

static int myselect(int nfds, fd_set *rfds, fd_set *wfds,
                    fd_set *efds, struct timeval *tout)
{
#if HAVE_PSELECT
    sigset_t old_sigmask;
    sigprocmask(SIG_SETMASK, &pselect_sigmask, &old_sigmask);
    sigprocmask(SIG_SETMASK, &old_sigmask, &pselect_sigmask);

    /* pselect() closes the race between SIGCHLD arriving
    * and select() sleeping for up to 10 seconds. */
    struct timespec ts, *tsptr = NULL;

    if (tout) {
        ts.tv_sec = tout->tv_sec;
        ts.tv_nsec = tout->tv_usec * 1000;
        tsptr = &ts;
    }
    return pselect(nfds, rfds, wfds, efds, tsptr, &pselect_sigmask);
#else
    return select(nfds, rfds, wfds, efds, tout);
#endif
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants