
Really annoying apparent deadlock during nix-store -r in Nix 1.12 #1573

Open
copumpkin opened this issue Sep 21, 2017 · 11 comments
Labels: bug, cli (The old and/or new command line interface)


@copumpkin
Member

copumpkin commented Sep 21, 2017

I'm on a Linux box running Nix unstable at 1.12pre5511_c94f3d55 and am seeing a hard-to-reproduce/minimize freeze during nix-store -r. The multi-thread stack trace is at this gist and I'm trying to figure out what's confusing it.

Nix on that box is still frozen, so if you have ideas for other things to try while it's in that state, let me know before I kill it.

Were there any recent changes in the concurrency code? I was on a nixUnstable a couple months ago that I never observed this problem on, but perhaps I was just lucky?

Edit: I accidentally sent it a signal which caused it to fail with "error: download of ‘https://nix-cache.s3.amazonaws.com/ar3x3yk22khfrsk88nqicm11rdkzc018.narinfo’ was interrupted", which presumably shows (based on the curl_multi_wait in the 4th thread's stack trace) that it was somehow stuck waiting on that download. So unfortunately no more debugging on this particular instance of the failure. Back to trying to get it to happen again...

cc @edolstra

@copumpkin
Member Author

What I've found out so far:

I have a thread calling download() which waits on a future to signal completion/failure from the downloader. That future never signals, which suggests that

    try {
        if (request.decompress)
            result.data = decodeContent(encoding, ref<std::string>(result.data));
        callSuccess(success, failure, const_cast<const DownloadResult &>(result));
        act.progress(result.data->size(), result.data->size());
    } catch (...) {
        done = true;
        callFailure(failure, std::current_exception());
    }
is never happening.

The other thing I've found by poking around the stack trace and memory is that the download thread appears to be stuck on an infinite poll of our signalling FD:

    struct curl_waitfd extraFDs[1];
    extraFDs[0].fd = wakeupPipe.readSide.get();
    extraFDs[0].events = CURL_WAIT_POLLIN;
    extraFDs[0].revents = 0;
    auto sleepTimeMs =
        nextWakeup != std::chrono::steady_clock::time_point()
        ? std::max(0, (int) std::chrono::duration_cast<std::chrono::milliseconds>(nextWakeup - std::chrono::steady_clock::now()).count())
        : 10000;
    vomit("download thread waiting for %d ms", sleepTimeMs);
    mc = curl_multi_wait(curlm, extraFDs, 1, sleepTimeMs, &numfds);

This means the CURLM multi handle has no FDs of its own and is just polling on our extra wakeupPipe FD. I'm guessing we never signal it. Perhaps some sort of deadlock around that?

@dezgeg
Contributor

dezgeg commented Sep 22, 2017

The 'never signaling' theory should be verifiable by checking if state->incoming is non-empty.

@copumpkin
Member Author

I actually realized that I was stuck on 1fe1976 with this symptom, and when I updated to c94f3d5, it appears (hard to be sure given how nondeterministic this is) that all the hangs turned into timeout errors from CURL:

error: unable to download ‘https://some-cache.s3.amazonaws.com/nix-cache-info’: Timeout was reached (28)

@copumpkin
Member Author

copumpkin commented Sep 22, 2017

Since switching to c94f3d5 I indeed haven't had any deadlocks, but am getting a lot more of those timeout errors. I'm wondering if something got restructured in the threading code, and maybe in the past the timeouts from curl led to a deadlock, whereas they now cause a normal failure?

Anyway, here's a bunch of Nix logs of the deadlock with increased verbosity from back in 1fe1976:

Sep 22 17:05:37 hostname amazon-init-start[846]: derivation is /nix/store/92gplcbzz7zid3hp5y6cb09kjs387z0g-nixpkgs.drv
Sep 22 17:05:37 hostname amazon-init-start[846]: starting pool of 4 threads
Sep 22 17:05:37 hostname amazon-init-start[846]: checking substituter ‘https://bucket.s3.amazonaws.com’ for path ‘/nix/store/9qpl5mf3fh32l8yr2hf6r8z89b4p6s0r-nixpkgs’
Sep 22 17:05:37 hostname amazon-init-start[846]: download thread waiting for 1000000000 ms
Sep 22 17:05:37 hostname amazon-init-start[846]: starting download of https://bucket.s3.amazonaws.com/9qpl5mf3fh32l8yr2hf6r8z89b4p6s0r.narinfo
Sep 22 17:05:37 hostname amazon-init-start[846]: curl: Couldn't find host bucket.s3.amazonaws.com in the .netrc file; using defaults
Sep 22 17:05:37 hostname amazon-init-start[846]: download thread waiting for 1000000000 ms
Sep 22 17:05:37 hostname amazon-init-start[846]: curl:   Trying 52.216.65.16...
Sep 22 17:05:37 hostname amazon-init-start[846]: curl: TCP_NODELAY set
Sep 22 17:05:37 hostname amazon-init-start[846]: download thread waiting for 1000000000 ms
Sep 22 17:05:37 hostname amazon-init-start[846]: download thread waiting for 1000000000 ms
Sep 22 17:05:37 hostname amazon-init-start[846]: download thread waiting for 1000000000 ms
Sep 22 17:05:44 hostname amazon-init-start[846]: curl: Connected to bucket.s3.amazonaws.com (52.216.65.16) port 443 (#0)
Sep 22 17:05:44 hostname amazon-init-start[846]: download thread waiting for 1000000000 ms

[frozen]
Sep 22 17:05:33 hostname amazon-init-start[856]: derivation is /nix/store/rsxnzrh34xq4wx8dkccpfgl5vzbm5xww-nixos.drv
Sep 22 17:05:33 hostname amazon-init-start[856]: starting pool of 4 threads
Sep 22 17:05:33 hostname amazon-init-start[856]: deleted 0 entries from the NAR info disk cache
Sep 22 17:05:33 hostname amazon-init-start[856]: download thread waiting for 1000000000 ms
Sep 22 17:05:33 hostname amazon-init-start[856]: starting download of https://bucket.s3.amazonaws.com/nix-cache-info
Sep 22 17:05:33 hostname amazon-init-start[856]: curl: Couldn't find host bucket.s3.amazonaws.com in the .netrc file; using defaults
Sep 22 17:05:33 hostname amazon-init-start[856]: download thread waiting for 1000000000 ms
Sep 22 17:05:33 hostname amazon-init-start[856]: download thread waiting for 1000000000 ms
Sep 22 17:05:33 hostname amazon-init-start[856]: download thread waiting for 1000000000 ms
Sep 22 17:05:33 hostname amazon-init-start[856]: download thread waiting for 1000000000 ms
Sep 22 17:05:33 hostname amazon-init-start[856]: download thread waiting for 1000000000 ms
Sep 22 17:05:33 hostname amazon-init-start[856]: download thread waiting for 1000000000 ms
Sep 22 17:05:33 hostname amazon-init-start[856]: download thread waiting for 1000000000 ms
Sep 22 17:05:33 hostname amazon-init-start[856]: download thread waiting for 1000000000 ms
Sep 22 17:05:33 hostname amazon-init-start[856]: download thread waiting for 1000000000 ms
Sep 22 17:05:33 hostname amazon-init-start[856]: download thread waiting for 1000000000 ms
Sep 22 17:05:33 hostname amazon-init-start[856]: download thread waiting for 1000000000 ms
Sep 22 17:05:33 hostname amazon-init-start[856]: download thread waiting for 1000000000 ms
Sep 22 17:05:33 hostname amazon-init-start[856]: download thread waiting for 1000000000 ms
Sep 22 17:05:33 hostname amazon-init-start[856]: download thread waiting for 1000000000 ms
Sep 22 17:05:33 hostname amazon-init-start[856]: download thread waiting for 1000000000 ms
Sep 22 17:05:33 hostname amazon-init-start[856]: download thread waiting for 1000000000 ms
Sep 22 17:05:33 hostname amazon-init-start[856]: curl:   Trying 52.216.161.35...
Sep 22 17:05:33 hostname amazon-init-start[856]: curl: TCP_NODELAY set
Sep 22 17:05:33 hostname amazon-init-start[856]: download thread waiting for 1000000000 ms
Sep 22 17:05:33 hostname amazon-init-start[856]: download thread waiting for 1000000000 ms
Sep 22 17:05:34 hostname amazon-init-start[856]: download thread waiting for 1000000000 ms
Sep 22 17:05:34 hostname amazon-init-start[856]: download thread waiting for 1000000000 ms
Sep 22 17:06:05 hostname amazon-init-start[856]: curl: Connected to bucket.s3.amazonaws.com (52.216.161.35) port 443 (#0)
Sep 22 17:06:05 hostname amazon-init-start[856]: download thread waiting for 1000000000 ms

[frozen]
Sep 22 17:05:41 hostname amazon-init-start[842]: download thread waiting for 1000000000 ms
Sep 22 17:05:41 hostname amazon-init-start[842]: starting download of http://s3.amazonaws.com/bucket/channel.tar.gz
Sep 22 17:05:41 hostname amazon-init-start[842]: downloading ‘http://s3.amazonaws.com/bucket/channel.tar.gz’... curl: Couldn't find host s3.amazonaws.com in the .netrc file; using defaults
Sep 22 17:05:41 hostname amazon-init-start[842]: [66B blob data]
Sep 22 17:05:41 hostname amazon-init-start[842]: download thread waiting for 1000000000 ms
Sep 22 17:05:41 hostname amazon-init-start[842]: download thread waiting for 1000000000 ms
Sep 22 17:05:41 hostname amazon-init-start[842]: download thread waiting for 1000000000 ms
Sep 22 17:05:41 hostname amazon-init-start[842]: curl:   Trying 54.231.88.26...
Sep 22 17:05:41 hostname amazon-init-start[842]: curl: TCP_NODELAY set
Sep 22 17:05:41 hostname amazon-init-start[842]: download thread waiting for 1000000000 ms
Sep 22 17:05:42 hostname amazon-init-start[842]: download thread waiting for 1000000000 ms
Sep 22 17:05:42 hostname amazon-init-start[842]: download thread waiting for 1000000000 ms
Sep 22 17:05:49 hostname amazon-init-start[842]: curl: Connected to s3.amazonaws.com (54.231.88.26) port 80 (#0)
Sep 22 17:05:49 hostname amazon-init-start[842]: [67B blob data]

[frozen]

As you can see in the gdb logs from the original gist, this can happen in nix-store -r and nix-instantiate, as well as nix-channel. Basically it seems to affect anything that goes through download.cc.

All of the frozen processes whose logs I include above seem to be in the same situation as the ones in the gist: a call to download() in one thread waiting on a future, with a separate downloader thread seemingly waiting forever on a poll under curl_multi_wait.

I'm including these logs because I don't think the issue was actually fixed, and although it manifests itself differently now in c94f3d5, the underlying cause is the same. Will add more information as I learn it.

@copumpkin
Member Author

Confirmed that I'm still getting this even on a much more recent nixUnstable, but much less often. Rather frustrating 😦

@stale

stale bot commented Feb 15, 2021

I marked this as stale due to inactivity.

@stale stale bot added the stale label Feb 15, 2021
@fricklerhandwerk fricklerhandwerk added bug cli The old and/or new command line interface labels Sep 13, 2022
@stale stale bot removed stale labels Sep 13, 2022
@domenkozar
Member

I'm starting to see this more often since Nix 2.12

@domenkozar
Member

Managed to capture a bit of strace output before it hung:

read(21, "", 4096)                      = 0
close(21)                               = 0
pipe2([13, 20], O_CLOEXEC)              = 0
rt_sigprocmask(SIG_BLOCK, ~[], [HUP INT PIPE TERM WINCH], 8) = 0
clone3({flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, child_tid=0x7fbbf6ffd910, parent_tid=0x7fbbf6ffd910, exit_signal=0, stack=0x7fbbf67fd000, stack_size=0x7ffbc0, tls=0x7fbbf6ffd640} => {parent_tid=[628554]}, 88) = 628554
rt_sigprocmask(SIG_SETMASK, [HUP INT PIPE TERM WINCH], NULL, 8) = 0
fcntl(7, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=124, l_len=1}) = 0
fcntl(7, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=124, l_len=1}) = 0
fcntl(7, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=124, l_len=1}) = 0
fcntl(7, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=124, l_len=1}) = 0
futex(0x55a8b935eb50, FUTEX_WAKE_PRIVATE, 1) = 1
fcntl(7, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=124, l_len=1}) = 0
fcntl(7, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=124, l_len=1}) = 0
futex(0x55a8b935eb50, FUTEX_WAKE_PRIVATE, 1) = 1
fcntl(7, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=124, l_len=1}) = 0
fcntl(7, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=124, l_len=1}) = 0
futex(0x55a8b935eb50, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55a8b935eb50, FUTEX_WAIT_PRIVATE, 2, NULL) = 0
futex(0x55a8b935eb50, FUTEX_WAIT_PRIVATE, 2, NULL) = 0
fcntl(7, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=124, l_len=1}) = 0
fcntl(7, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=124, l_len=1}) = 0
futex(0x55a8b935eb50, FUTEX_WAKE_PRIVATE, 1) = 0
fcntl(7, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=124, l_len=1}) = 0
fcntl(7, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=124, l_len=1}) = 0
fcntl(7, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=124, l_len=1}) = 0
fcntl(7, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=124, l_len=1}) = 0
fcntl(7, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=124, l_len=1}) = 0
fcntl(7, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=124, l_len=1}) = 0
poll([{fd=22, events=POLLIN}, {fd=26, events=POLLIN}, {fd=17, events=POLLIN}, {fd=13, events=POLLIN}], 4, -1) = 1 ([{fd=13, revents=POLLHUP}])
read(13, "", 4096)                      = 0
close(13)                               = 0
pipe2([13, 20], O_CLOEXEC)              = 0
rt_sigprocmask(SIG_BLOCK, ~[], [HUP INT PIPE TERM WINCH], 8) = 0
clone3({flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, child_tid=0x7fbbf6ffd910, parent_tid=0x7fbbf6ffd910, exit_signal=0, stack=0x7fbbf67fd000, stack_size=0x7ffbc0, tls=0x7fbbf6ffd640} => {parent_tid=[628556]}, 88) = 628556
rt_sigprocmask(SIG_SETMASK, [HUP INT PIPE TERM WINCH], NULL, 8) = 0
poll([{fd=22, events=POLLIN}, {fd=26, events=POLLIN}, {fd=17, events=POLLIN}, {fd=13, events=POLLIN}], 4, -1) = 1 ([{fd=13, revents=POLLHUP}])
read(13, "", 4096)                      = 0
close(13)                               = 0
pipe2([13, 20], O_CLOEXEC)              = 0
rt_sigprocmask(SIG_BLOCK, ~[], [HUP INT PIPE TERM WINCH], 8) = 0
clone3({flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, child_tid=0x7fbbf6ffd910, parent_tid=0x7fbbf6ffd910, exit_signal=0, stack=0x7fbbf67fd000, stack_size=0x7ffbc0, tls=0x7fbbf6ffd640} => {parent_tid=[628557]}, 88) = 628557
rt_sigprocmask(SIG_SETMASK, [HUP INT PIPE TERM WINCH], NULL, 8) = 0
poll([{fd=22, events=POLLIN}, {fd=26, events=POLLIN}, {fd=17, events=POLLIN}, {fd=13, events=POLLIN}], 4, -1) = 1 ([{fd=13, revents=POLLHUP}])
read(13, "", 4096)                      = 0
close(13)                               = 0
pipe2([13, 20], O_CLOEXEC)              = 0
rt_sigprocmask(SIG_BLOCK, ~[], [HUP INT PIPE TERM WINCH], 8) = 0
clone3({flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, child_tid=0x7fbbf6ffd910, parent_tid=0x7fbbf6ffd910, exit_signal=0, stack=0x7fbbf67fd000, stack_size=0x7ffbc0, tls=0x7fbbf6ffd640} => {parent_tid=[628558]}, 88) = 628558
rt_sigprocmask(SIG_SETMASK, [HUP INT PIPE TERM WINCH], NULL, 8) = 0
poll([{fd=22, events=POLLIN}, {fd=26, events=POLLIN}, {fd=17, events=POLLIN}, {fd=13, events=POLLIN}], 4, -1) = 1 ([{fd=13, revents=POLLHUP}])
read(13, "", 4096)                      = 0
close(13)                               = 0
pipe2([13, 20], O_CLOEXEC)              = 0
rt_sigprocmask(SIG_BLOCK, ~[], [HUP INT PIPE TERM WINCH], 8) = 0
clone3({flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, child_tid=0x7fbbf6ffd910, parent_tid=0x7fbbf6ffd910, exit_signal=0, stack=0x7fbbf67fd000, stack_size=0x7ffbc0, tls=0x7fbbf6ffd640} => {parent_tid=[628559]}, 88) = 628559
rt_sigprocmask(SIG_SETMASK, [HUP INT PIPE TERM WINCH], NULL, 8) = 0
poll([{fd=22, events=POLLIN}, {fd=26, events=POLLIN}, {fd=17, events=POLLIN}, {fd=13, events=POLLIN}], 4, -1) = 1 ([{fd=13, revents=POLLHUP}])
read(13, "", 4096)                      = 0
close(13)                               = 0
pipe2([13, 20], O_CLOEXEC)              = 0
rt_sigprocmask(SIG_BLOCK, ~[], [HUP INT PIPE TERM WINCH], 8) = 0
clone3({flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, child_tid=0x7fbbf6ffd910, parent_tid=0x7fbbf6ffd910, exit_signal=0, stack=0x7fbbf67fd000, stack_size=0x7ffbc0, tls=0x7fbbf6ffd640} => {parent_tid=[628560]}, 88) = 628560
rt_sigprocmask(SIG_SETMASK, [HUP INT PIPE TERM WINCH], NULL, 8) = 0
poll([{fd=22, events=POLLIN}, {fd=26, events=POLLIN}, {fd=17, events=POLLIN}, {fd=13, events=POLLIN}], 4, -1

@roberth
Member

roberth commented Jan 7, 2023

So the last action was creating a pipe, spawning a thread (the clone3 calls with CLONE_THREAD), and then polling and reading from that pipe.

@domenkozar any idea what that process might be?

It'd also be good to know what the other fds 22, 26 and 17 are. That will be earlier in the trace.
strace isn't the full picture; we may need per-thread stack traces, obtained by attaching gdb to the blocked process.

@domenkozar
Member

Possibly fixed by #6469

@gbpdt
Contributor

gbpdt commented Feb 1, 2023

Possibly fixed by #6469

I would look for "skipping build of derivation <...> someone beat us to it" in the logs; if you see those messages, then it's likely that #6469 is the culprit.

8 participants