Colon (":") in naming of anonymous functions breaks emscripten #536

Open
PyryM opened this issue Apr 20, 2022 · 4 comments

Comments

@PyryM
Contributor

PyryM commented Apr 20, 2022

TLDR: Terra gives anonymous functions names like $anon (junk/wasm_helloworld.t:7) containing special characters (specifically the :) that break the way emscripten expects to parse symbol names.

To reproduce:
First (with Terra on llvm10+ and Emscripten installed) compile to wasm32 bitcode:

terralib.includepath = "" -- no default includes from system libc
local eminclude = os.getenv("EMSDK") .. "/upstream/emscripten/cache/sysroot/include"
local target = terralib.newtarget{Triple = "wasm32"}
local cio = terralib.includec("stdio.h", {"-I", eminclude}, target)

local foo = terra(i: int32)
  cio.printf("helloworld %d!\n", i)
end

terra helloworld_main(): int32
  for i = 0, 10 do foo(i) end
  return 0
end

terralib.saveobj("helloworld.bc", {main=helloworld_main}, nil, target, false)

Now try to link with emscripten:

emcc helloworld.bc -sALLOW_MEMORY_GROWTH=1 -o helloworld.html

Traceback (most recent call last):
  File "/home/anon/emsdk/upstream/emscripten/emcc.py", line 3982, in <module>
    sys.exit(main(sys.argv))
  [...traceback skipped]
  File "/home/anon/emsdk/upstream/emscripten/tools/building.py", line 574, in parse_llvm_nm_symbols
    status = line[entry_pos + 11] # Skip address, which is always fixed-length 8 chars.
IndexError: string index out of range

Why? Emscripten gets symbol names by calling llvm-nm --print-file-name helloworld.bc and parsing each line using colons as delimiters:

# Line format: "[archive filename:]object filename: address status name"
entry_pos = line.rfind(':') # finds *last* colon

But terra has produced this:

llvm-nm --print-file-name helloworld.bc 

helloworld.bc: -------- t $anon (junk/wasm_helloworld.t:7)
helloworld.bc: -------- T main
helloworld.bc:          U printf

Emscripten incorrectly splits the line helloworld.bc: -------- t $anon (junk/wasm_helloworld.t:7) because line.rfind(':') finds the colon inside the symbol name rather than the one that follows the file name.
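
The failure can be reproduced in isolation. The following is a minimal sketch, assuming the parser does nothing more than the two lines quoted from building.py; the function name `status_char` is hypothetical, and the real `parse_llvm_nm_symbols` does more than this.

```python
def status_char(line: str) -> str:
    """Simplified stand-in for the quoted Emscripten logic (assumption)."""
    # Line format: "[archive filename:]object filename: address status name"
    entry_pos = line.rfind(':')  # finds the *last* colon in the line
    # Skip the space after the colon, the 8-character address, and one more space.
    return line[entry_pos + 11]


good = "helloworld.bc: -------- T main"
bad = "helloworld.bc: -------- t $anon (junk/wasm_helloworld.t:7)"

print(status_char(good))  # status column of a well-behaved line

try:
    status_char(bad)
except IndexError as exc:
    # The last colon is the one before "7)", so entry_pos + 11 runs off the end.
    print("IndexError:", exc)
```

On the anonymous-function line only two characters follow the last colon, which is why the index lookup fails instead of returning a wrong status.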

Workaround:
It's possible to avoid the issue by making sure every terra function is named, using func:setname(...) as needed.

Fix?:
Arguably this is Emscripten's fault for trying to parse human-readable tool output rather than using actual structured APIs, and for not even robustly parsing that output.
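
To illustrate what more robust parsing of the same text output could look like, here is a sketch that anchors on the fixed-width address and status columns instead of on the last colon. It assumes 8-character addresses as in the wasm32 output above; `NM_LINE` and `parse_nm_line` are hypothetical names, and this is not Emscripten's code.

```python
import re

# Assumed line shape: "<file(s)>: <8-char address or blanks> <status> <name>"
NM_LINE = re.compile(
    r"^(?P<file>.*?): (?P<addr>[0-9A-Fa-f-]{8}| {8}) (?P<status>\S) (?P<name>.*)$"
)


def parse_nm_line(line: str):
    """Return (file, status, name), or None if the line does not match."""
    m = NM_LINE.match(line)
    if m is None:
        return None
    return m.group("file"), m.group("status"), m.group("name")


print(parse_nm_line("helloworld.bc: -------- t $anon (junk/wasm_helloworld.t:7)"))
```

This still guesses at a textual format, so a file name that itself contains something shaped like an address and status column would confuse it. It only shows that colons in symbol names need not be fatal.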

It might make sense, though, on the Terra side to give anonymous functions more sanitized names (i.e., without spaces, colons, or parentheses) because there are likely a number of tools that expect symbol names in bitcode to be limited to C/C++ naming rules.
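
As a rough sketch of what a "sanitized" name could mean, the snippet below maps every character outside `[A-Za-z0-9_]` to an underscore. It is written in Python purely for illustration; `sanitize_symbol` is a hypothetical helper, not something Terra provides.

```python
import re


def sanitize_symbol(name: str) -> str:
    """Map a name onto the C identifier alphabet (illustrative only)."""
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", name)
    # C identifiers cannot start with a digit.
    if cleaned and cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


print(sanitize_symbol("$anon (junk/wasm_helloworld.t:7)"))
# prints: _anon__junk_wasm_helloworld_t_7_
```

Distinct names can collapse to the same string under this mapping, so a real implementation would likely need a uniquifying suffix as well.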

@velartrill

this is definitely an Emscripten bug (note that file names with colons, which are perfectly legal on linux, would trigger this bug as well!), and if terra is to be tweaked to add a workaround, it should be optional imo. i would suggest either a terralib.saveobj flag/environment variable to use hashes in the generated anonymous names, or a way to customize the format (e.g. you pass a function that takes a terra function object and returns an appropriate name)
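
The two options suggested here can be sketched together: a hash-based default name plus an optional user-supplied formatter. This is a Python illustration only; `anon_symbol_name`, its `formatter` parameter, and the `anon_` prefix are hypothetical and do not correspond to any existing terralib option. For simplicity the formatter receives a file name and line number rather than a Terra function object.

```python
import hashlib
from typing import Callable, Optional


def anon_symbol_name(
    filename: str,
    line: int,
    formatter: Optional[Callable[[str, int], str]] = None,
) -> str:
    """Name an anonymous function from its source location."""
    if formatter is not None:
        # A caller-supplied naming scheme takes precedence.
        return formatter(filename, line)
    # Default: short hex digest of the location, using only [a-z0-9_].
    digest = hashlib.sha256(f"{filename}:{line}".encode("utf-8")).hexdigest()
    return "anon_" + digest[:12]


print(anon_symbol_name("junk/wasm_helloworld.t", 7))
print(anon_symbol_name("junk/wasm_helloworld.t", 7, lambda f, n: f"anon_line{n}"))
```

The hashed form loses the human-readable source location, which is one reason to keep it opt-in as proposed.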

@elliottslaughter
Member

Can we at least check with the Emscripten developers to see what their outlook is on this one? Since a workaround is available on our end, I don't think we need to rush the fix.

@PyryM
Contributor Author

PyryM commented Apr 20, 2022

Yes, there's an existing issue: emscripten-core/emscripten#15325

@sbc100

sbc100 commented Apr 20, 2022

If you compile the .bc file to an object file (emcc -c hello_world.bc -o hello_world.o) does it still contain those non-standard symbols?

Are these non-standard symbols only ever internal/local symbols? (i.e. they always have lower case tags when output by nm)?
