Segfaults in get_nprocs #68

Open
insertinterestingnamehere opened this issue May 10, 2021 · 8 comments
Labels
bug (Something isn't working), VECs (Related to Virtual Execution Contexts)

Comments

@insertinterestingnamehere
Member

We've been seeing mysterious segfaults in get_nprocs when threads are used together with VECs. The exact conditions that trigger this aren't known since lots of things still appear to work fine.

This does show up with the ARPACK demo, but only if many copies are used (e.g. one ARPACK copy per core, so increase the limit and then run 24 copies or so). I most recently saw it there while massively oversubscribed, though, since I wasn't setting OMP_NUM_THREADS yet. I wasn't able to get an informative backtrace beyond seeing get_nprocs at the bottom of it.

@hfingler saw segfaults like this several times when debugging the Galois/VECs demo. Here are two backtraces that we saw:

                (killpg+0x40)
0x7f1db871385f: (get_nprocs+0x11f)
0x7f1db869defb: (arena_get2.part.4+0x19b)
0x7f1db86a0dc9: (tcache_init.part.6+0xb9)
0x7f1db86a1b9e: (__libc_malloc+0xde)

Another one:

----- Galois setting # threads to 24
Galois: load_file:304 0x7ff880002680
Reading from file: inputs/r4-2e26.gr

                (handler+0x28)
0x7ff896bc5850: (killpg+0x40)
0x7ff896c8685f: (get_nprocs+0x11f)
0x7ff896c10efb: (arena_get2.part.4+0x19b)
0x7ff896c13dc9: (tcache_init.part.6+0xb9)
0x7ff896c14b9e: (__libc_malloc+0xde)
0x7ff897e952f5: (tls_get_addr_tail+0x165)
0x7ff897e9ae08: (__tls_get_addr+0x38)
0x7ff88b30a422: (_ZTHN6galois9substrate10ThreadPool6my_boxE+0x14)
0x7ff88b2db545: (_ZTWN6galois9substrate10ThreadPool6my_boxE+0x9)

@sestephens73 at one point saw this as well while working on the matmul demo (I'm not sure what the workaround there was):

0x7fa9598e2188: (handler+0x28)
0x7fa95c966400: (killpg+0x40)
0x7fa95bf7837f: (get_nprocs+0x11f)
0x7fa95bf02aab: (arena_get2.part.4+0x19b)
@insertinterestingnamehere
Member Author

Here's my current best theory for what might be causing this: our overrides in libparla_context may not be getting preloaded correctly. The resulting shared object lists libc as a dependency in its ELF header, and since we "preload" it into each VEC just by dlmopen'ing libparla_context, libc's definitions probably end up being loaded ahead of our overrides. The only overrides we've actually observed being called successfully from within a VEC are the ones wrapping pthreads routines. I think the fix is to build libparla_context with undefined symbols so that it doesn't explicitly list libc as a dependency. That would let us do the equivalent of LD_PRELOAD, but within a linker namespace.
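As a rough sketch of what that fix might look like (illustrative file name and flags, not the actual libparla_context build), the override object could be built so that libc never shows up in its NEEDED entries:

/* nprocs_override.c -- illustrative sketch only, not the actual
 * libparla_context source.
 *
 * One possible way to build it so that libc.so.6 is not recorded as a
 * DT_NEEDED dependency (any libc references are left as undefined symbols
 * to be resolved at load time, inside the target namespace):
 *
 *     gcc -shared -fPIC -nostdlib -o libnprocs_override.so nprocs_override.c
 *     readelf -d libnprocs_override.so | grep NEEDED   # libc.so.6 should be absent
 */

/* Overrides for glibc's CPU-count queries. A real override would report
 * the VEC's CPU count; the constant here is just a placeholder. */
int get_nprocs(void) {
    return 4;
}

int get_nprocs_conf(void) {
    return 4;
}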

@insertinterestingnamehere
Member Author

All that said, that theory doesn't rule out there also being something wrong with our thread affinity wrappers.

@hfingler
Contributor

The error happens on most runs; eventually a run works. I think it is a per-thread issue, since with fewer cores the error happens less frequently, while with more cores I might see it one or more times per run.
This also shows up through the functions __tls_get_addr and tls_get_addr_tail, which lead into __libc_malloc.

This seems really close to what we're seeing https://sourceware.org/legacy-ml/libc-help/2019-06/msg00026.html
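To make that chain concrete, here is a small sketch (hypothetical file names, not Parla or Galois code) of the same path: a new thread's first access to a dlopen'd module's TLS goes through __tls_get_addr / tls_get_addr_tail, that allocation can land in __libc_malloc, and malloc's per-thread setup (tcache_init / arena_get2) is what reaches get_nprocs, matching the second backtrace above.

/* tls_mod.c -- a shared object with a TLS variable:
 *
 *     __thread int counter;
 *     int bump(void) { return ++counter; }
 *
 *     gcc -shared -fPIC -o libtls_mod.so tls_mod.c
 *
 * main.c (below) dlopens the module and touches its TLS from a new thread.
 *
 *     gcc -o tls_demo main.c -ldl -lpthread
 */
#include <dlfcn.h>
#include <pthread.h>
#include <stdio.h>

static int (*bump)(void);

static void *worker(void *arg) {
    (void)arg;
    printf("bump() = %d\n", bump());  /* first TLS access (and first malloc) in this thread */
    return NULL;
}

int main(void) {
    void *h = dlopen("./libtls_mod.so", RTLD_NOW);
    if (!h) { fprintf(stderr, "%s\n", dlerror()); return 1; }
    bump = (int (*)(void))dlsym(h, "bump");

    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);
    dlclose(h);
    return 0;
}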

@insertinterestingnamehere
Member Author

Probably related: #12

@insertinterestingnamehere added the VECs label on May 11, 2021
@insertinterestingnamehere
Member Author

@sestephens73 mentioned on Slack that this showed up in the matmul demo as well. The backtrace there was:

0x7fa9598e2188: (handler+0x28)
0x7fa95c966400: (killpg+0x40)
0x7fa95bf7837f: (get_nprocs+0x11f)
0x7fa95bf02aab: (arena_get2.part.4+0x19b)

I don't remember the exact conditions needed to reproduce it in that app. @sestephens73, feel free to add more details if you have them.

@sestephens73
Contributor

Gist reproducing the above trace: https://gist.github.com/sestephens73/9f8c744d5c56bc81283cf8f6d88046cd

@insertinterestingnamehere
Member Author

Here's an alternate theory for what could cause this: the current VEC is stored in a thread-local variable, and spawned threads don't automatically inherit the thread-local values of the thread that spawned them. A newly created thread may therefore end up resolving thread-affinity-related calls against VEC 0, since its thread-local data is zero-initialized. That could result in some kind of weird failure when shuttling affinity information back and forth.
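A minimal illustration of the thread-local point (hypothetical variable name, not the actual Parla code):

/* tls_inherit.c -- a spawned thread does not inherit the spawning thread's
 * value of a thread-local; it starts from the zero-initialized state.
 *
 *     gcc -o tls_inherit tls_inherit.c -lpthread
 */
#include <pthread.h>
#include <stdio.h>

static _Thread_local int current_vec;   /* stand-in for the "current VEC" */

static void *child(void *arg) {
    (void)arg;
    /* Prints 0: the child gets a fresh, zero-initialized thread-local,
     * not the parent's value. */
    printf("child sees current_vec = %d\n", current_vec);
    return NULL;
}

int main(void) {
    current_vec = 3;   /* the spawning thread has switched to some VEC */
    pthread_t t;
    pthread_create(&t, NULL, child, NULL);
    pthread_join(t, NULL);
    printf("parent sees current_vec = %d\n", current_vec);  /* prints 3 */
    return 0;
}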

@arthurp
Member

arthurp commented May 11, 2021

I tried to handle this by hooking into thread creation. But I might have done it wrong, or not hooked in deeply enough.
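For reference, a hook along those lines usually has the shape sketched below: interpose pthread_create and copy the spawning thread's current-VEC thread-local into the child before its start routine runs. The names here (parla_current_vec, libvec_hook) are hypothetical stand-ins, not the actual implementation:

/* vec_hook.c -- sketch of a pthread_create interposition.
 *
 *     gcc -shared -fPIC -o libvec_hook.so vec_hook.c -ldl -lpthread
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for however the current VEC is tracked. */
_Thread_local void *parla_current_vec;

typedef int (*create_fn)(pthread_t *, const pthread_attr_t *,
                         void *(*)(void *), void *);

struct trampoline_arg {
    void *(*start)(void *);
    void *arg;
    void *vec;                 /* parent's current VEC, captured at create time */
};

static void *trampoline(void *p) {
    struct trampoline_arg a = *(struct trampoline_arg *)p;
    free(p);
    parla_current_vec = a.vec; /* make the child inherit the parent's VEC */
    return a.start(a.arg);
}

int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*start)(void *), void *arg) {
    static create_fn real_create;
    if (!real_create)
        real_create = (create_fn)dlsym(RTLD_NEXT, "pthread_create");

    struct trampoline_arg *a = malloc(sizeof *a);
    if (!a)
        return EAGAIN;
    a->start = start;
    a->arg = arg;
    a->vec = parla_current_vec;
    return real_create(thread, attr, trampoline, a);
}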
