Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

control over non-api symbol declarations #923

Open
q66 opened this issue Aug 9, 2024 · 15 comments
Open

control over non-api symbol declarations #923

q66 opened this issue Aug 9, 2024 · 15 comments

Comments

@q66
Copy link

q66 commented Aug 9, 2024

I'm currently evaluating using mimalloc as libc allocator in Chimera Linux (https://chimera-linux.org). Currently, we have LLVM's Scudo integrated into our libc but testing shows that mimalloc may be a better fit.

However, the API of mimalloc is extensive, while I only require to expose a handful of symbols. I'd like to make the other symbols local in order to avoid polluting the static libc.a (in shared libc.so, symbol visibility takes care of it already). In order to integrate the allocator I am using the object build (i.e. everything is a single translation unit) so I can easily mark every function (except my specific allocator entrypoints) static and get what I want.

For public API symbols, handling this is easy; I just do #define mi_decl_export static and that takes care of it. For internal symbols which do not have a tag this is harder; currently I have to patch every declaration the headers. Would it be possible to have a (by default empty) macro tagging these internal symbols so I get to reduce the amount of integration patching I have to do downstream? This should apply to any functions as well as extern variables; e.g. my current mimalloc.o has

q66@chimera: /home/q66/mimalloc/build$ nm mimalloc.o|grep ' [TBR] '                             
0000000000000760 T __libc_calloc
0000000000000800 T __libc_free
0000000000000960 T __libc_malloc_impl
00000000000009b0 T __libc_realloc
0000000000000220 T __malloc_atfork
0000000000000750 T __malloc_donate
0000000000000010 T __malloc_init
0000000000000210 T __malloc_thread_init
0000000000000230 T __malloc_tls_teardown
0000000000000ba0 T aligned_alloc
0000000000000c40 T malloc_usable_size

(all of which are my symbols that hook into the libc internally, except the last two which become libc public API)

@q66
Copy link
Author

q66 commented Aug 9, 2024

to show what kind of thing i currently have, e.g.

--- a/include/mimalloc/internal.h
+++ b/include/mimalloc/internal.h
@@ -14,6 +14,12 @@ terms of the MIT license. A copy of the license can be found in the file
 // functions and macros.
 // --------------------------------------------------------------------------
 
+#ifdef MI_LIBC_BUILD
+#define mi_decl_internal static
+#else
+#define mi_decl_internal extern
+#endif
+
 #include "types.h"
 #include "track.h"
 
@@ -55,80 +61,80 @@ terms of the MIT license. A copy of the license can be found in the file
 
 
 // "options.c"
-void       _mi_fputs(mi_output_fun* out, void* arg, const char* prefix, const char* message);
-void       _mi_fprintf(mi_output_fun* out, void* arg, const char* fmt, ...);
-void       _mi_warning_message(const char* fmt, ...);
-void       _mi_verbose_message(const char* fmt, ...);
-void       _mi_trace_message(const char* fmt, ...);
-void       _mi_options_init(void);
-void       _mi_error_message(int err, const char* fmt, ...);
+mi_decl_internal void       _mi_fputs(mi_output_fun* out, void* arg, const char* prefix, const char* message);
+mi_decl_internal void       _mi_fprintf(mi_output_fun* out, void* arg, const char* fmt, ...);
+mi_decl_internal void       _mi_warning_message(const char* fmt, ...);
+mi_decl_internal void       _mi_verbose_message(const char* fmt, ...);
+mi_decl_internal void       _mi_trace_message(const char* fmt, ...);
+mi_decl_internal void       _mi_options_init(void);
+mi_decl_internal void       _mi_error_message(int err, const char* fmt, ...);
 ...

@q66
Copy link
Author

q66 commented Aug 10, 2024

I submitted my proof of concept integration: chimera-linux/cports#2673

@daanx
Copy link
Collaborator

daanx commented Aug 12, 2024

Very cool! I see what you mean -- I thought -fvisibility=hidden would take care of it but it seems that only works for shared libraries. A static object file still contains all symbols?

Already, this is a bit of trouble -- I have all these declarations in mimalloc/internal.h that need to be visible if we compile each source individually. If we would like to make all internal functions static that can only work with the single-source static build in static.c -- is that would you use now for the libc build?

I think you are asking to upstream all the internal declarations with a mi_decl_internal right? Maybe that is nice anyway; we might be able to also remove the -fvisibility=hidden flag in that case as well. Let me ponder a bit on this. Thanks!

@q66
Copy link
Author

q66 commented Aug 12, 2024

that is what i am doing, the single static.c build and then mark every function decl static outside of my few entrypoints; everything then becomes local symbols, and they get stripped and do not conflict with anything

i'm just about ready to land the PR soon (i'm mostly tweaking option values now, but i'm already running it on my machines, desktop and all)

-fvisibility=hidden is still good to have, at least for actual library builds (primarily shared but also static)

@q66
Copy link
Author

q66 commented Aug 12, 2024

or rather, instead of using static.c directly, i have my own custom wrapper source for it: https://github.com/chimera-linux/cports/blob/58230387a299567022c6d89a42e636542816fb47/main/musl/files/mimalloc.c

this lets me expose the glue cleanly as well as tweak some stuff (i found that fully hardened build is actually broken and significantly slower in some scenarios, but i still want things like freelist encoding in place etc.)

@q66
Copy link
Author

q66 commented Aug 13, 2024

i have landed this by the way; i think the declarations patching makes the bigger bulk of the changes, everything else is some minimal plumbing changes (some in the libc, some in the allocator, pretty much just for process init + thread teardown + since we cannot do thread-local storage in libc code due to ldso shenanigans, i just store the thread-local heap pointer inside pthread non-abi section and that works good enough)

@daanx
Copy link
Collaborator

daanx commented Aug 13, 2024

Thank you for the info; (and if you make a PR please base it on the dev branch).

  • I think MI_SECURE=3 should not be too slow? Level 4 is slow since it checks for double free's which is (currently) expensive.
  • It should not be the case that you would need to disable MI_PADDING -- that seems a bug? If you can repro somehow I would like to learn more so we can fix this.
  • Maybe we should also expose a way to get mi_heap_empty .. on the other hand, maybe it is fine as it is since you really hook up at a very low level with the TLS setup. Hope it is going to be efficient in the end to get the thread local heap pointer :-)

@q66
Copy link
Author

q66 commented Aug 13, 2024

the double free checking was pretty cheap in most things i tested (either benchmarks or real world things) so i left it at 4, however enabling padding made realloc-heavy things about 2.5x slower (notably our sort(1) went from 0.8s to 2s-ish for some things)

padding checks result in firefox refusing to play videos with sound; some debugging reveals this is due to padding value checks (only those, the canary check is ok) failing in a free() in libnuma global constructor function when it's dlopened as a dependency of ffmpeg - this happens with my integration as well as when using mimalloc as a preload library with otherwise untouched libc - forcing a preload for libnuma too makes it stop happening, however i have not been able to figure out why this happens yet (i suspect the dlopen constructor stuff is a red herring and it happens either due to a bug in mimalloc or in firefox, as firefox does some mozalloc override shenanigans) - i was going to look into it later, since it happens with preload as well it's a lot safer like this

@q66
Copy link
Author

q66 commented Aug 14, 2024

* Maybe we should also expose a way to get `mi_heap_empty` .. on the other hand, maybe it is fine as it is since you really hook up at a very low level with the TLS setup. Hope it is going to be efficient in the end to get the thread local heap pointer :-)

i think what i have is pretty much as efficient as it can get; since it has access to the internal structure of the pthread implementation, all it has to do is 1) get the thread pointer (this is done via per-arch asm and is inlined) 2) from it retrieve the pthread struct (either it's that directly or it's at an offset depending on architecture) 3) get the right field (everything here is inlined)

i ran mimalloc-bench and some other things and with MI_SECURE=0 it performs pretty much exactly like a regular standalone build, with my particular hardening setup it's somewhere inbetween standard MI_SECURE=0 and MI_SECURE=4, still performs very well

i'll try to think of some clean abstraction in case we want to upstream this (my current thing is pretty specific to my particular musl integration, though otoh besides the declarations patching it's very slim so it's trivial to maintain downstream)

@q66
Copy link
Author

q66 commented Aug 14, 2024

to provide more detail for the firefox thing, this is what triggers the failure:

https://github.com/numactl/numactl/blob/master/libnuma.c#L435

it's strange because the mallocs/reallocs of that happen in the same function

it's triggered from numa_init:

https://github.com/numactl/numactl/blob/master/libnuma.c#L95

the backtrace we get is something like so:

https://img.ayaya.dev/9hNSGioGScjh

what actually fails in free.c is this:

https://github.com/microsoft/mimalloc/blob/dev/src/free.c#L465

(the padding is decoded correctly)

this only happens when triggered from the constructor called by dlopen like http://git.musl-libc.org/cgit/musl/tree/ldso/dynlink.c#n2217

i initially thought this was related to the tls being grown which could be perhaps corrupting my heap pointer somehow, but further investigation ruled that out; everything seems to stay intact (and we don't use tls at all here technically, since we store the heap pointer inside pthread instead, and there is no other tls-allocated storage in mimalloc, and it happens identically with a preloaded mimalloc and otherwise untouched libc, which should be fine and supported)

@daanx
Copy link
Collaborator

daanx commented Aug 21, 2024

I don't know anything about libnuma or get_mempolicy, but looking at the code that triggers the error, it seems wrong?

if (nodemask_sz == 0) {/* fall back on error */
		int pol;
		unsigned long *mask = NULL;
		nodemask_sz = 16;
		do {
			nodemask_sz <<= 1;
			mask = realloc(mask, nodemask_sz / 8);
			if (!mask)
				return;
		} while (get_mempolicy(&pol, mask, nodemask_sz + 1, 0, 0) < 0 && errno == EINVAL &&
				nodemask_sz < 4096*8);
		free(mask);
	}

https://github.com/numactl/numactl/blob/master/libnuma.c#L435

  • I would think the realloc allocates unsigned long entries, so it should be (nodemask_sz + 1)*sizeof(unsigned long) ? Or maybe it is the number of bits in the mask -- then it is correct I guess?
  • It looks like one can actually pass NULL for the mask argument to get_mempolicy so we don't need the allocation in the first place? It might be that this is required to be non-NULL though to trigger a EINVAL response from get_mempolicy ?

The docs say If the mode argument is not NULL, then get_mempolicy() will store the policy mode and any
optional mode flags of the requested NUMA policy in the location pointed to by this argu‐
ment. If nodemask is not NULL, then the nodemask associated with the policy will be
stored in the location pointed to by this argument. maxnode specifies the number of node
IDs that can be stored into nodemask—that is, the maximum node ID plus one. The value
specified by maxnode is always rounded to a multiple of sizeof(unsigned long)*8.

but that makes no sense (as sizeof(unsigned long)*8 is 64?). Maybe I am missing something here -- but it would be good to rule out that there is no actual buffer overflow. Also, since this is a "backup" code path in case the other method fails, maybe that is why it only fails on pre-load and not otherwise.

Anyway, just my 2c.

@q66
Copy link
Author

q66 commented Aug 21, 2024

oh, you might actually be right - i was too preoccupied with "free() fails on stuff that was obviously reallocated in the same code block" that i did not consider that get_mempolicy itself may be corrupting the memory and therefore breaking the allocator

i will test this (that said, the padding checking is still too expensive in my testing to enable in prod, unfortunately)

@q66
Copy link
Author

q66 commented Aug 22, 2024

fixed it: chimera-linux/cports@9cc6ce3

the correct value is nodemask_sz, not nodemask_sz / 8, which can be derived from the prototype of the function in the documentation

@q66
Copy link
Author

q66 commented Aug 22, 2024

fixed it properly... i think: chimera-linux/cports@c9bd817

(why is arithmetic so hard :P)

@daanx
Copy link
Collaborator

daanx commented Aug 22, 2024

Ha, great that the mimalloc padding was working as intended and caught the bug :-) Bugs in "fail paths" are usually tricky to catch. Not sure how connected you are in the Linux community but perhaps the same issue exists in other distributions or libnuma and worth fixing upstream?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants