Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

asan + rr + libGLX __glDispatchInit prevent applications from starting #2746

Closed
milianw opened this issue Nov 19, 2020 · 27 comments
Closed

asan + rr + libGLX __glDispatchInit prevent applications from starting #2746

milianw opened this issue Nov 19, 2020 · 27 comments

Comments

@milianw
Copy link

milianw commented Nov 19, 2020

To reproduce:

$ cat test.cpp
int main() { return 0; }
$ g++ -g -O0 -l GLX -fsanitize=address,undefined test.cpp
ldd a.out
        linux-vdso.so.1 (0x00007ffc9d5d7000)
        libasan.so.6 => /usr/lib/libasan.so.6 (0x00007f8da32df000)
        libGLX.so.0 => /usr/lib/libGLX.so.0 (0x00007f8da32ac000)
        libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007f8da30cf000)
        libm.so.6 => /usr/lib/libm.so.6 (0x00007f8da2f89000)
        libubsan.so.1 => /usr/lib/libubsan.so.1 (0x00007f8da261b000)
        libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x00007f8da2601000)
        libc.so.6 => /usr/lib/libc.so.6 (0x00007f8da2436000)
        libdl.so.2 => /usr/lib/../lib/libdl.so.2 (0x00007f8da2430000)
        librt.so.1 => /usr/lib/../lib/librt.so.1 (0x00007f8da2425000)
        libpthread.so.0 => /usr/lib/../lib/libpthread.so.0 (0x00007f8da2403000)
        libGLdispatch.so.0 => /usr/lib/libGLdispatch.so.0 (0x00007f8da234c000)
        libX11.so.6 => /usr/lib/libX11.so.6 (0x00007f8da220b000)
        /lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007f8da3cfa000)
        libxcb.so.1 => /usr/lib/libxcb.so.1 (0x00007f8da21df000)
        libXau.so.6 => /usr/lib/libXau.so.6 (0x00007f8da21da000)
        libXdmcp.so.6 => /usr/lib/libXdmcp.so.6 (0x00007f8da21d2000)
$ ./a.out
# works fine
$ rr record ./a.out
# never finishes

according to perf, this is what happens:
rr-asan-mutexlock

I guess there's something fishy going on with asan and rr both trying to intercept pthread_mutex_lock here. Oddly enough, a simpler application with a normal mutex seems to work. So I guess it's also important that the mutex is locked from a global static that's called from dl_init in a library?

@khuey
Copy link
Collaborator

khuey commented Nov 19, 2020

Can you run with RR_LOG=all:debug and post the output somewhere?

@milianw
Copy link
Author

milianw commented Nov 19, 2020

rr.log

here's the log file

@milianw
Copy link
Author

milianw commented Nov 19, 2020

the message [Scheduler] Using unlimited ticks mode is the last I see, and then it just hangs there.

If I then ctrl + c to quit, the following is logged:

^C[Task]   Task 187741 changed status to 0x27f (STOP-SIGINT)
[Task]   (refreshing register cache)
[Task] Requesting registers from tracee 187741
[Scheduler]   new status is 0x27f (STOP-SIGINT)
[RecordSession] trace time 987: Active task is 187741. Events:
[INFO log_pending_events()] (no pending events)
[record_signal] 187741: handling signal SIGINT (pevent: PTRACE_EVENT(0), event: (none)
[record_signal] Safe to deliver signal at 0x7f14b4f8e310 because not in syscallbuf
[RecordSession] SIGINT handled
[Scheduler] Scheduling next task (PREVENT_SWITCH)
[Scheduler]   (187741 is un-switchable at SIGNAL: SIGINT(async))
[RecordSession] trace time 987: Active task is 187741. Events:
[INFO log()] SIGNAL: SIGINT(async)
[INFO log()] (none)
[Task]   (refreshing extra-register cache using XSAVE)
[RecordTask] Wrote event SIGNAL: SIGINT(async) for time 987
[RecordSession]   187741: no user handler for SIGINT
[Scheduler] Scheduling next task (PREVENT_SWITCH)
[Scheduler]   (187741 is un-switchable at SIGNAL_DELIVERY: SIGINT(async))
[RecordSession] trace time 988: Active task is 187741. Events:
[INFO log()] SIGNAL_DELIVERY: SIGINT(async)
[INFO log()] (none)
[RecordTask] Wrote event SIGNAL_DELIVERY: SIGINT(async) for time 988
[RecordSession]     in signal-stop for SIGINT
[RecordSession]     injecting signal SIGINT
[RecordTask] Refreshed sigmask, now 0
[Task] resuming execution of 187741 with PTRACE_CONT, signal SIGINT tick_period -2 wait 1
[Scheduler] Scheduling next task (PREVENT_SWITCH)
[Scheduler]   (187741 is un-switchable at (none))
[Scheduler]   and running; waiting for state change
[Task] going into blocking waitid(187741) ...
[Task]   Task 187741 changed status to 0x6057f (PTRACE_EVENT_EXIT)
[Task]   (refreshing register cache)
[Task] Requesting registers from tracee 187741
[Scheduler]   new status is 0x6057f (PTRACE_EVENT_EXIT)
[RecordSession] trace time 989: Active task is 187741. Events:
[INFO log_pending_events()] (no pending events)
[Task]   (refreshing extra-register cache using XSAVE)
[RecordTask] Wrote event SCHED for time 989
[Task] Advancing tid 187741 to exit
[Task] Waiting for exit of 187741
[Task] ptrace_if_alive tid 187741 was not alive
[RecordTask] Wrote event EXIT for time 990
[Task] Reaping 187741
[HasTaskSet] removing 187741 from task set 0x560e4fdcea70
[HasTaskSet] removing 187741 from task set 0x560e4fdc7970
[HasTaskSet] removing 187741 from task set 0x560e4fdc4b40
[Task]   dead
[INFO terminate_recording()] Processing termination request ...
[RecordSession] Killing all tasks ...
[Session] Killing all tasks ...
[Session] Session 0x560e4fdca2a0 destroyed

@khuey
Copy link
Collaborator

khuey commented Nov 19, 2020

Are you sure the tracee isn't busy waiting or something? The rr log doesn't show a futex call or anything just before the hang. What does /proc//status show for the tracee when it hangs

@milianw
Copy link
Author

milianw commented Nov 19, 2020

well yes, it is busy within the futex, that's what the CPU profile shows - it's apparently repeatedly getting woken up and then tries to lock it again - it's on-cpu and burning one core. But that only happens when I try to use rr on a binary compiled with asan - not when I just run it as-is, or when I leave out either asan or rr or both :)

Here's the /proc/PID/status contents:

Name:   a.out
Umask:  0022
State:  R (running)
Tgid:   190781
Ngid:   0
Pid:    190781
PPid:   190769
TracerPid:      190769
Uid:    1000    1000    1000    1000
Gid:    1002    1002    1002    1002
FDSize: 1024
Groups: 3 5 108 150 973 985 986 987 991 998 1000 1001 1002 
NStgid: 190781
NSpid:  190781
NSpgid: 190769
NSsid:  155801
VmPeak: 21474918368 kB
VmSize: 21474917424 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:      6188 kB
VmRSS:      6188 kB
RssAnon:            2244 kB
RssFile:            3936 kB
RssShmem:              8 kB
VmData: 15032449468 kB
VmStk:         0 kB
VmExe:         4 kB
VmLib:      5304 kB
VmPTE:       312 kB
VmSwap:        0 kB
HugetlbPages:          0 kB
CoreDumping:    0
THP_enabled:    1
Threads:        1
SigQ:   0/128022
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 00000001800004c0
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
NoNewPrivs:     1
Seccomp:        2
Seccomp_filters:        1
Speculation_Store_Bypass:       thread force mitigated
Cpus_allowed:   00800000
Cpus_allowed_list:      23
Mems_allowed:   00000001
Mems_allowed_list:      0
voluntary_ctxt_switches:        1009
nonvoluntary_ctxt_switches:     227

@milianw
Copy link
Author

milianw commented Nov 19, 2020

To make this more clear:

$ perf stat ./a.out

 Performance counter stats for './a.out':

             13,42 msec task-clock                #    0,978 CPUs utilized          
                 4      context-switches          #    0,298 K/sec                  
                 0      cpu-migrations            #    0,000 K/sec                  
             5.281      page-faults               #    0,393 M/sec                  
        28.691.583      cycles                    #    2,138 GHz                      (77,78%)
         2.021.682      stalled-cycles-frontend   #    7,05% frontend cycles idle     (71,28%)
         5.266.126      stalled-cycles-backend    #   18,35% backend cycles idle      (76,81%)
        49.666.516      instructions              #    1,73  insn per cycle         
                                                  #    0,11  stalled cycles per insn  (82,97%)
        13.181.268      branches                  #  982,009 M/sec                  
            64.991      branch-misses             #    0,49% of all branches          (91,16%)

       0,013724489 seconds time elapsed

       0,006967000 seconds user
       0,006959000 seconds sys


$ perf stat rr record ./a.out
rr: Saving execution to trace directory `/home/milian/.local/share/rr/a.out-6'.
^Crr: Interrupt

 Performance counter stats for 'rr record ./a.out':

          4.180,32 msec task-clock                #    1,001 CPUs utilized          
             2.098      context-switches          #    0,502 K/sec                  
                14      cpu-migrations            #    0,003 K/sec                  
             8.705      page-faults               #    0,002 M/sec                  
    18.128.098.993      cycles                    #    4,337 GHz                      (67,02%)
        28.131.115      stalled-cycles-frontend   #    0,16% frontend cycles idle     (66,87%)
        75.343.070      stalled-cycles-backend    #    0,42% backend cycles idle      (66,89%)
    38.127.266.394      instructions              #    2,10  insn per cycle         
                                                  #    0,00  stalled cycles per insn  (66,92%)
    15.637.465.989      branches                  # 3740,731 M/sec                    (66,94%)
         2.198.353      branch-misses             #    0,01% of all branches          (66,91%)

       4,175625769 seconds time elapsed

       4,118887000 seconds user
       0,055010000 seconds sys

@khuey
Copy link
Collaborator

khuey commented Nov 19, 2020

Ok, I misunderstood and thought this was a deadlock.

@khuey
Copy link
Collaborator

khuey commented Nov 19, 2020

If you rr record -n, does anything change?

Is it possible to rr pack a trace (that you terminated with Ctrl-C) and post that for me to download?

@milianw
Copy link
Author

milianw commented Nov 19, 2020

rr record -n does not change anything. Here are two packed traces, -0 is with rr record -n, and -1 is without the -n:
rr.asan.tar.gz

@khuey
Copy link
Collaborator

khuey commented Nov 19, 2020

Thanks. It's definitely the dynamic linker getting confused somehow (and then looping forever)

@khuey
Copy link
Collaborator

khuey commented Nov 19, 2020

Does running rr record --asan help?

If not, try rr record --env LD_DEBUG=all and lets see what it has to say.

@milianw
Copy link
Author

milianw commented Nov 19, 2020

--asan does not help. the LD_DEBUG debug output is here:
rr.asan.ld_debug.log.gz

@khuey
Copy link
Collaborator

khuey commented Nov 19, 2020

So we get

    197083:	symbol=__pthread_mutex_lock;  lookup in file=/home/milian/projects/compiled/other/bin/../lib/rr/librrpreload.so [0]
    197083:	symbol=__pthread_mutex_lock;  lookup in file=/usr/lib/libGLX.so.0 [0]
    197083:	symbol=__pthread_mutex_lock;  lookup in file=/usr/lib/libstdc++.so.6 [0]
    197083:	symbol=__pthread_mutex_lock;  lookup in file=/usr/lib/libm.so.6 [0]
    197083:	symbol=__pthread_mutex_lock;  lookup in file=/usr/lib/libubsan.so.1 [0]
    197083:	symbol=__pthread_mutex_lock;  lookup in file=/usr/lib/libgcc_s.so.1 [0]
    197083:	symbol=__pthread_mutex_lock;  lookup in file=/usr/lib/libc.so.6 [0]
    197083:	symbol=__pthread_mutex_lock;  lookup in file=/usr/lib/../lib/libdl.so.2 [0]
    197083:	symbol=__pthread_mutex_lock;  lookup in file=/usr/lib/../lib/librt.so.1 [0]
    197083:	symbol=__pthread_mutex_lock;  lookup in file=/usr/lib/../lib/libpthread.so.0 [0]
    197083:	binding file /usr/lib/libasan.so.6 [0] to /usr/lib/../lib/libpthread.so.0 [0]: normal symbol `__pthread_mutex_lock'

and then later

197083:	symbol=__pthread_mutex_lock;  lookup in file=./a.out [0]
    197083:	symbol=__pthread_mutex_lock;  lookup in file=/usr/lib/libasan.so.6 [0]
    197083:	binding file /home/milian/projects/compiled/other/bin/../lib/rr/librrpreload.so [0] to /usr/lib/libasan.so.6 [0]: normal symbol `__pthread_mutex_lock' [GLIBC_2.2.5]

Which seems like a guaranteed recipe for sadness.

This might be related to #2329 (comment). What version of ASAN are you using?

@milianw
Copy link
Author

milianw commented Nov 19, 2020

I'm using the one provided by GCC v10.2.0, the SO version of libasan is 6.0.0

@GitMensch
Copy link
Contributor

Any update here? What version of rr was that, and is it still reproducible with master?

@milianw
Copy link
Author

milianw commented Jul 19, 2021

Hm I just updated rr to the latest release, and now it seems it cannot record an asan binary at all anymore:

$ rr --version
rr version 5.4.0 # patch 779040e4
$ cat test.cpp
int main() { return 0; }
$ g++ -g -O0 -l GLX -fsanitize=address,undefined test.cpp
$ ./a.out
# works fine
$ rr record ./a.out
rr: Saving execution to trace directory `/home/milian/.local/share/rr/a.out-6'.
==44279==ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.
$ rr record --asan ./a.out
rr: Saving execution to trace directory `/home/milian/.local/share/rr/a.out-7'.
==44302==ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.

So... it doesn't hang anymore, but it doesn't work either. I'm now on gcc 11.1.0 with the sanitizers that ship with that.

@GitMensch
Copy link
Contributor

That seems like an explicit check, what happens if you LD_PRELOAD the asan library before recording?

LD_PRELOAD=`gcc -print-file-name=libasan.so` rr record ./a.out

?

@milianw
Copy link
Author

milianw commented Jul 19, 2021

Same problem:

$ gcc -print-file-name=libasan.so
/usr/lib/gcc/x86_64-pc-linux-gnu/11.1.0/../../../../lib/libasan.so
$ LD_PRELOAD=`gcc -print-file-name=libasan.so` rr record ./a.out
rr: Saving execution to trace directory `/home/milian/.local/share/rr/a.out-8'.
==64993==ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.

@GitMensch
Copy link
Contributor

Just to check: what does ldd ./a.out and ldd /usr/lib/gcc/x86_64-pc-linux-gnu/11.1.0/../../../../lib/libasan.so say?

@milianw
Copy link
Author

milianw commented Jul 19, 2021

$ ldd ./a.out    
        linux-vdso.so.1 (0x00007ffcf4d56000)
        libasan.so.6 => /usr/lib/libasan.so.6 (0x00007f76463ad000)
        libGLX.so.0 => /usr/lib/libGLX.so.0 (0x00007f764637a000)
        libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007f7646164000)
        libm.so.6 => /usr/lib/libm.so.6 (0x00007f7646020000)
        libubsan.so.1 => /usr/lib/libubsan.so.1 (0x00007f76456ae000)
        libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x00007f7645693000)
        libc.so.6 => /usr/lib/libc.so.6 (0x00007f76454c5000)
        libdl.so.2 => /usr/lib/../lib/libdl.so.2 (0x00007f76454be000)
        librt.so.1 => /usr/lib/../lib/librt.so.1 (0x00007f76454b3000)
        libpthread.so.0 => /usr/lib/../lib/libpthread.so.0 (0x00007f7645492000)
        libGLdispatch.so.0 => /usr/lib/libGLdispatch.so.0 (0x00007f76453da000)
        libX11.so.6 => /usr/lib/libX11.so.6 (0x00007f7645299000)
        /lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007f7646dc2000)
        libxcb.so.1 => /usr/lib/libxcb.so.1 (0x00007f764526d000)
        libXau.so.6 => /usr/lib/libXau.so.6 (0x00007f7645268000)
        libXdmcp.so.6 => /usr/lib/libXdmcp.so.6 (0x00007f7645260000)
$ ldd /usr/lib/gcc/x86_64-pc-linux-gnu/11.1.0/../../../../lib/libasan.so
        linux-vdso.so.1 (0x00007ffcaee6d000)
        libdl.so.2 => /usr/lib/../lib/libdl.so.2 (0x00007f50656d5000)
        librt.so.1 => /usr/lib/../lib/librt.so.1 (0x00007f50656ca000)
        libpthread.so.0 => /usr/lib/../lib/libpthread.so.0 (0x00007f50656a9000)
        libstdc++.so.6 => /usr/lib/../lib/libstdc++.so.6 (0x00007f5065493000)
        libm.so.6 => /usr/lib/../lib/libm.so.6 (0x00007f506534f000)
        libc.so.6 => /usr/lib/../lib/libc.so.6 (0x00007f5065183000)
        libgcc_s.so.1 => /usr/lib/../lib/libgcc_s.so.1 (0x00007f5065166000)
        /usr/lib64/ld-linux-x86-64.so.2 (0x00007f50660bd000)

I also tested this:

LD_PRELOAD=/usr/lib/libasan.so.6 rr record ./a.out
rr: Saving execution to trace directory `/home/milian/.local/share/rr/a.out-9'.
==65689==ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.

@GitMensch
Copy link
Contributor

Last check:

bash # clean environment that can be closed afterwards
export LD_PRELOAD=/usr/lib/libasan.so.6
rr record ./a.out
exit

@milianw
Copy link
Author

milianw commented Jul 19, 2021

Still no dice:

$ export LD_PRELOAD=/usr/lib/libasan.so.6.6
$ echo $LD_PRELOAD 
/usr/lib/libasan.so.6
$ rr record ./a.out
rr: Saving execution to trace directory `/home/milian/.local/share/rr/a.out-10'.
==79487==ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.

@GitMensch
Copy link
Contributor

OK, but ./a.out runs, I assume... (note: this message obviously comes from libasan).

It looks like rr prepends another library and the try to do this after libasan at

rr/src/RecordSession.cc

Lines 2067 to 2083 in 779040e

static ExeInfo read_exe_info(const string& exe_file) {
ExeInfo ret;
ScopedFd fd(exe_file.c_str(), O_RDONLY);
if (!fd.is_open()) {
return ret;
}
ElfFileReader reader(fd);
DynamicSection dynamic = reader.read_dynamic();
for (auto& entry : dynamic.entries) {
if (entry.tag == DT_NEEDED && entry.val < dynamic.strtab.size()) {
const char* name = &dynamic.strtab[entry.val];
if (!strncmp(name, "libasan", 7)) {
ret.libasan_path = string(name);
}
}
}
and/or

rr/src/RecordSession.cc

Lines 2192 to 2204 in 779040e

// LD_PRELOAD the syscall interception lib
string syscall_buffer_lib_path = find_helper_library(SYSCALLBUF_LIB_FILENAME);
if (!syscall_buffer_lib_path.empty()) {
string ld_preload = "";
if (!exe_info.libasan_path.empty()) {
LOG(debug) << "Prepending " << exe_info.libasan_path << " to LD_PRELOAD";
// Put an LD_PRELOAD entry for it before our preload library, because
// it checks that it's loaded first
ld_preload += exe_info.libasan_path + ":";
}
ld_preload += syscall_buffer_lib_path + SYSCALLBUF_LIB_FILENAME_PADDED;
inject_ld_helper_library(env, "LD_PRELOAD", ld_preload);
}
does not work.

If you feel lucky you could run this through gdb and check what the issue is:

LD_PRELOAD=/usr/lib/libasan.so.6 gdb --args rr record ./a.out
b RecordSession.cc:2069
b RecordSession.cc:2194
c
# and now stepping and inspecting the conditional stepping

@GitMensch
Copy link
Contributor

... actually it is obviously intended to work without LD_PRELOAD as the !exe_info.libasan_path.empty() should be false and the path to libasan should be seen from the link symbol, then added before the rr helper lib.
The question is: is this the case? @milianw It would be nice if you could check that nexting through read_exe_info and then nexting through the code for "LD_PRELOAD the syscall interception lib". I'd do this myself but don't have a system that matches yours.

@bernhardu
Copy link
Contributor

Hello, this ASan runtime does not come first got introduced by 4be0255.

PR #2892 contains patch 5a7d3dd that fixes it, if applied on top of current tip.
(Unfortunately got not yet merged and does not cleanly apply anymore.)

A workaround might be to record with ASAN_OPTIONS=verify_asan_link_order=0 rr record ./a.out.

@bernhardu
Copy link
Contributor

@milianw: That PR got merged now. So I guess this issue should now be again back to the pthread_mutex hang with current rr tip?

@milianw
Copy link
Author

milianw commented Aug 20, 2021

with current rr master and GCC 11.1 and accompanying sanitizers, everything works as expected now

many thanks!

@milianw milianw closed this as completed Aug 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants