Segfaults in get_nprocs #68

Open
insertinterestingnamehere opened this issue May 10, 2021 · 8 comments
Labels
bug (Something isn't working), VECs (Related to Virtual Execution Contexts)

Comments

@insertinterestingnamehere
Member

We've been seeing mysterious segfaults in get_nprocs when threads are used together with VECs. The exact conditions that trigger this aren't known since lots of things still appear to work fine.

This does show up with the ARPACK demo, but only if many copies are used (e.g. one ARPACK copy per core, so increase the limit and then run 24 copies or so). I most recently saw it there while massively oversubscribed, though, since I wasn't setting OMP_NUM_THREADS yet. I wasn't able to get an informative backtrace beyond seeing get_nprocs at the bottom of it.

@hfingler saw segfaults like this several times when debugging the Galois/VECs demo. Here are two backtraces that we saw:

                (killpg+0x40)
0x7f1db871385f: (get_nprocs+0x11f)
0x7f1db869defb: (arena_get2.part.4+0x19b)
0x7f1db86a0dc9: (tcache_init.part.6+0xb9)
0x7f1db86a1b9e: (__libc_malloc+0xde)

Another one:

----- Galois setting # threads to 24
Galois: load_file:304 0x7ff880002680
Reading from file: inputs/r4-2e26.gr

                (handler+0x28)
0x7ff896bc5850: (killpg+0x40)
0x7ff896c8685f: (get_nprocs+0x11f)
0x7ff896c10efb: (arena_get2.part.4+0x19b)
0x7ff896c13dc9: (tcache_init.part.6+0xb9)
0x7ff896c14b9e: (__libc_malloc+0xde)
0x7ff897e952f5: (tls_get_addr_tail+0x165)
0x7ff897e9ae08: (__tls_get_addr+0x38)
0x7ff88b30a422: (_ZTHN6galois9substrate10ThreadPool6my_boxE+0x14)
0x7ff88b2db545: (_ZTWN6galois9substrate10ThreadPool6my_boxE+0x9)

@sestephens73 at one point saw this as well while working on the matmul demo (I'm not sure what the workaround there was):

0x7fa9598e2188: (handler+0x28)
0x7fa95c966400: (killpg+0x40)
0x7fa95bf7837f: (get_nprocs+0x11f)
0x7fa95bf02aab: (arena_get2.part.4+0x19b)
@insertinterestingnamehere
Member Author

Here's my current best theory for what might be causing this: our overrides in libparla_context may not be getting preloaded correctly. The resulting shared object lists libc as a dependency in its ELF header, and since we "preload" it into each VEC just by dlmopen'ing libparla_context, libc's definitions probably end up being loaded ahead of our overrides. The only overrides we've actually observed being called successfully from within a VEC are the ones wrapping pthreads routines. I think the fix is to build libparla_context with undefined symbols so that it doesn't explicitly list libc as a dependency. That would let us do the equivalent of LD_PRELOAD, but within a linker namespace.
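As a rough sketch of what that fix might look like (illustrative file name and flags, not the actual libparla_context build), the override object could be built so that libc never shows up in its NEEDED entries:

/* nprocs_override.c -- illustrative sketch only, not the actual
 * libparla_context source.
 *
 * One possible way to build it so that libc.so.6 is not recorded as a
 * DT_NEEDED dependency (any libc references are left as undefined symbols
 * to be resolved at load time, inside the target namespace):
 *
 *     gcc -shared -fPIC -nostdlib -o libnprocs_override.so nprocs_override.c
 *     readelf -d libnprocs_override.so | grep NEEDED   # libc.so.6 should be absent
 */

/* Overrides for glibc's CPU-count queries. A real override would report
 * the VEC's CPU count; the constant here is just a placeholder. */
int get_nprocs(void) {
    return 4;
}

int get_nprocs_conf(void) {
    return 4;
}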

@insertinterestingnamehere
Member Author

All that said, that theory doesn't rule out there also being something wrong with our thread affinity wrappers.

@hfingler
Contributor

The error happens on most runs; eventually a run works. I think it is a per-thread issue, since with fewer cores the error happens less frequently, while with more cores I might see it one or more times per run.
This also shows up through the functions __tls_get_addr and tls_get_addr_tail, which lead into __libc_malloc.

This seems really close to what we're seeing https://sourceware.org/legacy-ml/libc-help/2019-06/msg00026.html
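To make that chain concrete, here is a small sketch (hypothetical file names, not Parla or Galois code) of the same path: a new thread's first access to a dlopen'd module's TLS goes through __tls_get_addr / tls_get_addr_tail, that allocation can land in __libc_malloc, and malloc's per-thread setup (tcache_init / arena_get2) is what reaches get_nprocs, matching the second backtrace above.

/* tls_mod.c -- a shared object with a TLS variable:
 *
 *     __thread int counter;
 *     int bump(void) { return ++counter; }
 *
 *     gcc -shared -fPIC -o libtls_mod.so tls_mod.c
 *
 * main.c (below) dlopens the module and touches its TLS from a new thread.
 *
 *     gcc -o tls_demo main.c -ldl -lpthread
 */
#include <dlfcn.h>
#include <pthread.h>
#include <stdio.h>

static int (*bump)(void);

static void *worker(void *arg) {
    (void)arg;
    printf("bump() = %d\n", bump());  /* first TLS access (and first malloc) in this thread */
    return NULL;
}

int main(void) {
    void *h = dlopen("./libtls_mod.so", RTLD_NOW);
    if (!h) { fprintf(stderr, "%s\n", dlerror()); return 1; }
    bump = (int (*)(void))dlsym(h, "bump");

    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);
    dlclose(h);
    return 0;
}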

@insertinterestingnamehere
Member Author

Probably related: #12

@insertinterestingnamehere added the VECs label on May 11, 2021
@insertinterestingnamehere
Member Author

@sestephens73 mentioned on Slack that this showed up in the matmul demo as well. The backtrace there was:

0x7fa9598e2188: (handler+0x28)
0x7fa95c966400: (killpg+0x40)
0x7fa95bf7837f: (get_nprocs+0x11f)
0x7fa95bf02aab: (arena_get2.part.4+0x19b)

I don't remember the exact conditions needed to reproduce it in that app. @sestephens73, feel free to add more details if you have them.

@sestephens73
Contributor

Gist reproducing the above trace: https://gist.github.com/sestephens73/9f8c744d5c56bc81283cf8f6d88046cd

@insertinterestingnamehere
Member Author

Here's an alternate theory for what could cause this: the current VEC is stored in a thread-local variable, and spawned threads don't automatically inherit the thread-local values of the thread that spawned them. A newly created thread may therefore end up resolving thread-affinity-related calls against VEC 0, since its thread-local data is zero-initialized. That could result in some kind of weird failure when shuttling affinity information back and forth.
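A minimal illustration of the thread-local point (hypothetical variable name, not the actual Parla code):

/* tls_inherit.c -- a spawned thread does not inherit the spawning thread's
 * value of a thread-local; it starts from the zero-initialized state.
 *
 *     gcc -o tls_inherit tls_inherit.c -lpthread
 */
#include <pthread.h>
#include <stdio.h>

static _Thread_local int current_vec;   /* stand-in for the "current VEC" */

static void *child(void *arg) {
    (void)arg;
    /* Prints 0: the child gets a fresh, zero-initialized thread-local,
     * not the parent's value. */
    printf("child sees current_vec = %d\n", current_vec);
    return NULL;
}

int main(void) {
    current_vec = 3;   /* the spawning thread has switched to some VEC */
    pthread_t t;
    pthread_create(&t, NULL, child, NULL);
    pthread_join(t, NULL);
    printf("parent sees current_vec = %d\n", current_vec);  /* prints 3 */
    return 0;
}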

@arthurp
Member

arthurp commented May 11, 2021

I tried to handle this by hooking into thread creation. But I might have done it wrong, or not hooked in deeply enough.
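For reference, a hook along those lines usually has the shape sketched below: interpose pthread_create and copy the spawning thread's current-VEC thread-local into the child before its start routine runs. The names here (parla_current_vec, libvec_hook) are hypothetical stand-ins, not the actual implementation:

/* vec_hook.c -- sketch of a pthread_create interposition.
 *
 *     gcc -shared -fPIC -o libvec_hook.so vec_hook.c -ldl -lpthread
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for however the current VEC is tracked. */
_Thread_local void *parla_current_vec;

typedef int (*create_fn)(pthread_t *, const pthread_attr_t *,
                         void *(*)(void *), void *);

struct trampoline_arg {
    void *(*start)(void *);
    void *arg;
    void *vec;                 /* parent's current VEC, captured at create time */
};

static void *trampoline(void *p) {
    struct trampoline_arg a = *(struct trampoline_arg *)p;
    free(p);
    parla_current_vec = a.vec; /* make the child inherit the parent's VEC */
    return a.start(a.arg);
}

int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*start)(void *), void *arg) {
    static create_fn real_create;
    if (!real_create)
        real_create = (create_fn)dlsym(RTLD_NEXT, "pthread_create");

    struct trampoline_arg *a = malloc(sizeof *a);
    if (!a)
        return EAGAIN;
    a->start = start;
    a->arg = arg;
    a->vec = parla_current_vec;
    return real_create(thread, attr, trampoline, a);
}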
