wait4 syscall should return pid (tgid) and not the tid to userspace #3886
This is what the kernel does. Fix emulation to do the same.

In glibc, `waitpid()` (i.e. `__waitpid()`) is implemented in terms of `wait4()` (i.e. `__wait4()`), which is a thin wrapper over the `wait4` syscall. The `wait4` syscall returns the pid (tgid in kernel parlance) and not the thread id (tid). The tid and pid will, of course, often differ in multithreaded situations. See also `man wait4` and `man waitpid`.

This should fix the unexpected behavior of `waitpid()` in programs that check the returned `pid_t` against an expected value or use it for further bookkeeping or comparisons. In such programs the failure can be rare: depending on thread sequencing, the tid sometimes happens to be numerically equal to the expected pid and sometimes it does not.
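The native kernel behavior described above can be checked with a small standalone program. The following is a minimal sketch, not code from rr or from this PR; the function names `exit_from_thread` and `waitpid_reports_tgid` are hypothetical, and it assumes Linux with glibc (build with `-pthread` on older toolchains). The child exits from a secondary thread, whose tid differs from the pid, and the parent checks that `waitpid()` still reports the pid (tgid).

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: runs in a secondary thread of the child process. */
static void *exit_from_thread(void *arg) {
    (void)arg;
    /* For a non-main thread, the tid is expected to differ from the pid (tgid). */
    int tid_differs = (pid_t)syscall(SYS_gettid) != getpid();
    /* _exit() terminates the whole thread group; status 0 means tid != pid held. */
    _exit(tid_differs ? 0 : 1);
}

/* Hypothetical helper: returns 1 if waitpid() reported the child's pid (tgid)
 * even though the child exited from a thread with a different tid,
 * 0 if not, and -1 if fork() failed. */
int waitpid_reports_tgid(void) {
    pid_t child = fork();
    if (child < 0) return -1;
    if (child == 0) {
        pthread_t t;
        if (pthread_create(&t, NULL, exit_from_thread, NULL) != 0) _exit(2);
        pthread_join(t, NULL); /* not expected to return: the thread exits the process */
        _exit(3);
    }
    int status = 0;
    pid_t reaped = waitpid(child, &status, 0);
    return reaped == child && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Calling `waitpid_reports_tgid()` from a `main()` outside rr should give 1; running the same program under rr is one way to compare the emulated result with the native one.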
BTW, the trigger for my investigation and this PR was an assertion failure that I sometimes hit at line 39 of rr/src/test/sigcont_threaded.c (lines 37 to 39 in f7067f1).
Now the |
I am reasonably sure my fix is correct. However, the assumptions behind the code currently on the master branch have probably led to some bugs in the emulated ptrace implementation. This has resulted in the ptrace-related test failures listed below. I have not investigated further, as ptrace emulation is the hairiest code in the code base! Any possible fixes could be considered by @rocallahan as necessary. If the PR looks good, it could be merged and the ptrace emulation code fixed later.
(I'm checking to see if I may be wrong about this PR. I'll report back if I revise my conclusions or agree with them.)
Something is currently wrong with rr; however, I am no longer sure that the PR I posted is the correct fix. I am trying to debug things again. Perhaps @KJTsanaktsidis who made Setup:
Here Process B is 4059195, Process C is 4059197, and Process A is 4058931. Two things surprised me here:

(1) Process C is a thread, so it should not be sending signals on state changes. In fact, when the clone syscall is issued to create a thread, the signal to send on termination is usually set to 0. No SIGCHLD is ever issued on termination of Process C, so it seems strange that a SIGCHLD is sent on a SIGSTOP.

(2) The "real_parent" and "parent" (in kernel parlance) of Process C is Process B. Any synthetic signal issued by rr due to state changes in C should go to B! Here both SIGCHLDs are being sent to A. The SIGCHLD sent to A due to the SIGSTOP in B is correct, but I don't agree with the synthetic SIGCHLD being sent to A due to a state change in C!
Hmmm. I added a bit more logging into that test and pushed it up here: master...KJTsanaktsidis:rr:ktsanaktsidis/more_logs. Some observations: If you run this test just by itself (i.e. not under rr at all)...
The parent process receives one SIGCHLD for each state change in the child (once when it's stopped, once when it's resumed, and once when it's exited). Both the

If you run it under rr, something strange happens:
Like before, the parent sees only the tgid, not the tids; BUT somehow it got two SIGCHLDs when the child was stopped, and zero when it was resumed! Actually, the stop signal is sent to the parent once per thread (you can see this if you add a second

I think your patch is not needed, because Line 1051 in f7f4c29
maybe_emulate_wait (Line 2305 in f7f4c29
But nonetheless you did observe

So there are at least three bugs in rr, I think:
I have a feeling that fixing no. 1 would also fix no. 3, but I don't really know. I probably won't be able to have a deeper look at this until the weekend, but if you investigate further please keep me updated!
Thank you for this most useful investigation! It has given me some ideas for future debugging and root cause analysis.
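For reference, the native (non-rr) behavior discussed above, where the parent is notified once each for stop, resume, and exit and only ever sees the child's pid, can be reproduced with a sketch like the one below. This is an illustration under stated assumptions, not the `sigcont_threaded.c` test: the function name `count_child_state_changes` is hypothetical, the child is single-threaded for simplicity, and the state changes are observed with `waitpid()` flags rather than by counting SIGCHLD deliveries.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: drives a child through stop, continue and kill, and
 * counts how many of those state changes waitpid() reports with the child's
 * pid. Returns -1 if fork() failed. */
int count_child_state_changes(void) {
    pid_t child = fork();
    if (child < 0) return -1;
    if (child == 0) {
        for (;;) pause(); /* the child just waits for signals */
    }

    int seen = 0;
    int status = 0;

    /* Stop: WUNTRACED makes waitpid() report stopped children. */
    kill(child, SIGSTOP);
    if (waitpid(child, &status, WUNTRACED) == child && WIFSTOPPED(status)) seen++;

    /* Resume: WCONTINUED makes waitpid() report continued children. */
    kill(child, SIGCONT);
    if (waitpid(child, &status, WCONTINUED) == child && WIFCONTINUED(status)) seen++;

    /* Exit: the child is terminated by SIGKILL and then reaped. */
    kill(child, SIGKILL);
    if (waitpid(child, &status, 0) == child && WIFSIGNALED(status)) seen++;

    return seen;
}
```

Outside rr this should count three state changes. A variant with a handler that counts SIGCHLD, and with extra threads in the child, would be closer to the scenario in this thread, but it is more timing-sensitive.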