
symbolizer: shell out to addr2line #5299

Merged on Jan 28, 2025 (2 commits)

danipozo
Contributor

Use addr2line instead of the custom addr2line implementation, which doesn't symbolize addresses properly for some binaries (see #5291). Even with a fix for go-delve/delve#3861, which is the root cause of the failure in #5291, the current implementation can't symbolize ~50% of the addresses in the stacks of the ClickHouse binary in my tests. Using gimli-rs/addr2line as the addr2line implementation is fast and works well:
image

Please take this PR as a rough proposal; there are of course details to be fleshed out:

  • You probably want to keep the current implementation behind a setting
  • If not, some code would probably need to be deleted
  • This introduces another dependency; at least the docs and Docker images would have to be updated
  • There are also likely code issues, as I'm far from a Go expert

@danipozo danipozo requested a review from a team as a code owner November 19, 2024 18:35
@brancz
Member

brancz commented Dec 6, 2024

I'm open to this idea. I would much rather actually fix the symbolizer, but I'd be OK with this as an escape hatch. It's definitely possible to write a correct symbolizer even with the bug in the delve code (we did in Polar Signals Cloud, but what we did doesn't work in the single-process context that we're constrained to in Parca).

Since this is an escape hatch, I would prefer that we only implement support for it in Parca and do not add llvm-addr2line or something like that to the container image. We would expect users to add that themselves if they want to use this escape hatch until the symbolizer itself is fixed.

@danipozo
Contributor Author

danipozo commented Dec 9, 2024

Nice to read. I'll add a setting for this then, and maybe some docs on how to use it if you face broken symbolization for your binary.

@danipozo
Contributor Author

I've added a setting to enable this functionality by specifying a path to the addr2line binary, and a new liner that uses this binary. This liner can also be wrapped by the cachedLiner and therefore use the cache.
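
To illustrate the wrapping described above, here is a minimal sketch of a caching decorator around a liner. The `liner` interface, the `PCToLines` signature, and `countingLiner` are hypothetical names chosen for this example; only the name `cachedLiner` comes from the comment, and its real API in Parca may look different.

```go
package main

import (
	"fmt"
	"sync"
)

// liner resolves an address to source lines.
// Hypothetical interface for illustration; not Parca's actual API.
type liner interface {
	PCToLines(addr uint64) ([]string, error)
}

// cachedLiner memoizes the results of the wrapped liner per address.
type cachedLiner struct {
	mu    sync.Mutex
	inner liner
	cache map[uint64][]string
}

func newCachedLiner(inner liner) *cachedLiner {
	return &cachedLiner{inner: inner, cache: map[uint64][]string{}}
}

func (c *cachedLiner) PCToLines(addr uint64) ([]string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if lines, ok := c.cache[addr]; ok {
		return lines, nil
	}
	lines, err := c.inner.PCToLines(addr)
	if err != nil {
		return nil, err // errors are not cached
	}
	c.cache[addr] = lines
	return lines, nil
}

// countingLiner stands in for the liner that shells out to addr2line
// and records how often it is actually invoked.
type countingLiner struct{ calls int }

func (l *countingLiner) PCToLines(addr uint64) ([]string, error) {
	l.calls++
	return []string{fmt.Sprintf("func_at_%#x", addr)}, nil
}

func main() {
	slow := &countingLiner{}
	l := newCachedLiner(slow)
	l.PCToLines(0xce9db)
	l.PCToLines(0xce9db) // served from the cache
	fmt.Println("calls to wrapped liner:", slow.calls)
}
```

Because the cache sits in front of the wrapped liner, an external process would only be consulted once per distinct address under this design.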

@danipozo
Contributor Author

Hi, kindly asking for another round of review here @brancz

Member

@brancz brancz left a comment

Very nice! I think we're really close, sorry for missing this before. Happy new year!

Review comment on pkg/symbolizer/symbolizer.go (outdated, resolved)
@danipozo danipozo force-pushed the symbolizer-use-addr2line branch 3 times, most recently from 8875c66 to c0b5f8d Compare January 17, 2025 12:34
@danipozo danipozo requested a review from a team as a code owner January 17, 2025 12:34
@danipozo danipozo force-pushed the symbolizer-use-addr2line branch 2 times, most recently from 127dd5e to 3325769 Compare January 17, 2025 12:49
@chaochaoxiaochao

Hi, I tested this MR locally and found that there are still many issues: ?? The output of addr2line -e libc.so.6 0xe7f9b is:
??:?
But the output of addr2line -e libc.so.6 -f 0xe7f9b is:
__clone
??:?
I think this is what we want. It displays the function name directly and is also the fastest way to view and fix the problem.

@chaochaoxiaochao

@danipozo Do you have any constructive suggestions? Thank you

@chaochaoxiaochao

I also followed the method in your MR:
addr2line --exe libstdc++.so.6.0.30 -afiC 0xce9db
The output is:
0x00000000000ce9db
std::error_code::default_error_condition() const
??:?
So there is output here too, but the UI does not display it.

@danipozo
Contributor Author

@chaochaoxiaochao can you add some more specific details about how you're using this and what issues you are finding? Along with proper formatting, that would go a long way towards letting me understand your concerns.

The output of addr2line -e libc.so.6 0xe7f9b is:

But the output of addr2line -e libc.so.6 -f 0xe7f9b is:

In particular I don't follow what you're trying to say here since these two command invocations are the same.

Thanks!

@danipozo
Contributor Author

since these two command invocations are the same.

Ah, now I see: in one you're using the -f flag while in the other you aren't. So yes, in one case addr2line will try to show you the function name while in the other it won't, but I still don't understand your concern. In this PR I'm using -afiC to get function names for all inlined functions at an address, demangling them if needed. The -a flag is used to be able to read from the program output without blocking, as explained in the thread.
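
The thread referenced above is not reproduced here, so the following is only one plausible reading of how -a helps, not necessarily what the PR does. With -i, the number of output lines per address varies, so a reader cannot know when a response ends. If every response starts with the echoed address, a caller can send a second, sentinel address after the real one and read until the sentinel's echo shows up. The names `frame`, `readFrames` and `parseOutput`, the sentinel value, and the assumption that the sentinel yields exactly one function/location pair are all hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// frame is one (possibly inlined) function reported for an address.
type frame struct {
	Function string
	Location string // "file:line", or "??:?" when unknown
}

// readFrames consumes the response for one address from output produced
// with -afiC. It stops at the echoed sentinel address instead of waiting
// for more lines that may never arrive.
func readFrames(sc *bufio.Scanner, sentinel string) ([]frame, error) {
	// The first line is the echoed address of the real request.
	if !sc.Scan() {
		return nil, scanErr(sc)
	}
	var frames []frame
	for sc.Scan() {
		line := sc.Text()
		if line == sentinel {
			// Assumption: the sentinel resolves to exactly one
			// function/location pair, which is drained here.
			sc.Scan()
			sc.Scan()
			return frames, nil
		}
		if !sc.Scan() {
			break
		}
		frames = append(frames, frame{Function: line, Location: sc.Text()})
	}
	return nil, scanErr(sc)
}

// scanErr reports why scanning stopped before the sentinel was seen.
func scanErr(sc *bufio.Scanner) error {
	if err := sc.Err(); err != nil {
		return err
	}
	return io.ErrUnexpectedEOF
}

// parseOutput is a convenience for trying readFrames on captured output.
func parseOutput(out, sentinel string) ([]frame, error) {
	return readFrames(bufio.NewScanner(strings.NewReader(out)), sentinel)
}

func main() {
	// Illustrative output: the real address, then the sentinel address.
	out := "0x00000000000ce9db\n" +
		"std::error_code::default_error_condition() const\n" +
		"??:?\n" +
		"0x0000000000000000\n??\n??:?\n"
	frames, err := parseOutput(out, "0x0000000000000000")
	fmt.Println(frames, err)
}
```

With a long-lived process, the same scanner would be reused across requests so that buffered bytes are not lost between calls.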

@chaochaoxiaochao

As I mentioned earlier, I tried the -fiC 0xce9db invocation locally and it printed the function name, but the UI did not display it. The UI displayed ????

@chaochaoxiaochao

image

I checked the terminal but there were no error messages, even with logging at debug level. Unfortunately I couldn't compile Parca locally; otherwise I could investigate the reason.

@chaochaoxiaochao

chaochaoxiaochao commented Jan 23, 2025

@danipozo The specific issue you asked about is that I can display the function name locally, but the UI shows ??

@chaochaoxiaochao

I also did a rough check and found that it is mostly libc.so.6 that has issues: ??

@danipozo
Contributor Author

@chaochaoxiaochao OK, I think I understand better now.

To debug this in your specific case, you would need to:

  • Get the information shown in the tooltip when hovering over one of the problematic addresses, specifically the binary name, build ID and address
  • Make sure you are running addr2line against the same binary; you can do so by checking the build ID using the file utility
  • Make sure the addr2line version you're running and the one Parca is using are the same. Different addr2line versions might have trouble with different binaries and addresses, although llvm-addr2line and gimli's should mostly work

Also please, can you try to stick to condensing the information in 1-2 messages in a row? This way it is much easier for me and probably for reviewers to follow, thanks!
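
As an alternative to the file utility for the build ID check above, the ID can be read from the binary's .note.gnu.build-id section. This is a minimal sketch using Go's standard debug/elf package; the function name `buildIDFromNote` and the command-line handling are made up for this example and are not part of Parca.

```go
package main

import (
	"debug/elf"
	"encoding/binary"
	"encoding/hex"
	"errors"
	"fmt"
	"os"
)

// buildIDFromNote extracts the GNU build ID from the raw contents of a
// .note.gnu.build-id section. Each note consists of namesz, descsz and
// type (4 bytes each), followed by name and desc, each padded to 4 bytes.
func buildIDFromNote(note []byte, bigEndian bool) (string, error) {
	var order binary.ByteOrder = binary.LittleEndian
	if bigEndian {
		order = binary.BigEndian
	}
	const ntGNUBuildID = 3
	for len(note) >= 12 {
		namesz := int(order.Uint32(note[0:4]))
		descsz := int(order.Uint32(note[4:8]))
		typ := order.Uint32(note[8:12])
		if namesz < 0 || descsz < 0 || namesz > len(note) || descsz > len(note) {
			return "", errors.New("malformed note header")
		}
		descStart := 12 + (namesz+3)&^3
		descEnd := descStart + descsz
		if descEnd > len(note) {
			return "", errors.New("truncated note")
		}
		if typ == ntGNUBuildID && string(note[12:12+namesz]) == "GNU\x00" {
			return hex.EncodeToString(note[descStart:descEnd]), nil
		}
		// Skip to the next note, honoring the padding after desc.
		next := descStart + (descsz+3)&^3
		if next > len(note) {
			next = len(note)
		}
		note = note[next:]
	}
	return "", errors.New("no GNU build ID note found")
}

func main() {
	if len(os.Args) != 2 {
		fmt.Println("usage: buildid <elf-file>")
		return
	}
	f, err := elf.Open(os.Args[1])
	if err != nil {
		fmt.Println("open:", err)
		return
	}
	defer f.Close()
	sec := f.Section(".note.gnu.build-id")
	if sec == nil {
		fmt.Println("no .note.gnu.build-id section")
		return
	}
	data, err := sec.Data()
	if err != nil {
		fmt.Println("read section:", err)
		return
	}
	id, err := buildIDFromNote(data, f.Data == elf.ELFDATA2MSB)
	if err != nil {
		fmt.Println("parse note:", err)
		return
	}
	fmt.Println(id)
}
```

The printed hex string can then be compared with the build ID shown in the tooltip.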

@chaochaoxiaochao

chaochaoxiaochao commented Jan 23, 2025

@danipozo Sorry, I'll streamline my response:
1. (screenshot: 企业微信截图_17376308021938)
2. Yes:
libc.so.6: ELF 64-bit LSB shared object, ARM aarch64, version 1 (GNU/Linux), dynamically linked, interpreter /lib/ld-linux-aarch64.so.1, BuildID[sha1]=09928b270aa19314161b21f565d1a9732c2c5332, for GNU/Linux 3.7.0, stripped
3. The gimli I am using is version 0.16.0.
4. Please also consider the following setup: the server runs on x86, the agent runs on Arm, and they are deployed separately on different machines.
5. I checked that libc is 1.6M.
6. Server start command:
./parca --http-read-timeout=10m --http-write-timeout=10m --symbolizer-external-addr-2-line-path=./addr2line --symbolizer-demangle-mode="full"
7. Client start command:
./parca-agent --remote-store-address=10.151.176.114:7070 --remote-store-insecure

Thank you for your support and for taking a look.

@chaochaoxiaochao

@danipozo I tried compiling libc locally with debug information and then uploading the debuginfo, but the result is still the same: ??

@danipozo
Contributor Author

danipozo commented Jan 23, 2025

@chaochaoxiaochao I find it hard to keep debugging your use case, sorry. Let's please try to focus on code review here. I suggest that you open an issue to collect the details of your case; the cause may be found at some point if/when this is used by more people or we find bugs that could be related.

@brancz the latest commit (5775526) adds some error handling and a larger buffer size to fix an edge case (~80 KiB of legitimate output for a single address) that I found with my binary.
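
For context on why a buffer size can matter here, the sketch below assumes the output is read line by line with Go's bufio.Scanner, which the comment does not confirm. By default a Scanner rejects any single token longer than bufio.MaxScanTokenSize (64 KiB), so an unusually long line stops the scan with an error unless the limit is raised via Scanner.Buffer. The helper name `scanLines` and the 1 MiB limit are arbitrary choices for this example.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// scanLines reads all lines from input, allowing lines of up to maxLine
// bytes. With maxLine <= 0 the Scanner's default 64 KiB limit applies.
func scanLines(input string, maxLine int) ([]string, error) {
	sc := bufio.NewScanner(strings.NewReader(input))
	if maxLine > 0 {
		// Start with a 64 KiB buffer and let it grow up to maxLine.
		sc.Buffer(make([]byte, 0, 64*1024), maxLine)
	}
	var lines []string
	for sc.Scan() {
		lines = append(lines, sc.Text())
	}
	return lines, sc.Err()
}

func main() {
	// One 80 KiB line followed by a short line.
	long := strings.Repeat("x", 80*1024) + "\n??:?\n"

	_, err := scanLines(long, 0)
	fmt.Println("default limit:", err)

	lines, err := scanLines(long, 1024*1024)
	fmt.Println("raised limit:", len(lines), err)
}
```

If the output is instead read in fixed-size chunks, the equivalent fix would be a larger read buffer or a loop that accumulates chunks until the response is complete.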

Adds symbolizer flag to specify addr2line binary to shell out to. Create
a new liner type that uses said binary.
@danipozo danipozo force-pushed the symbolizer-use-addr2line branch from 5775526 to 96a3933 Compare January 23, 2025 15:48
@brancz
Member

brancz commented Jan 28, 2025

If I understand correctly, we're saying that we're ok with not getting inlined frames? I think I'm generally ok with this since we'd like to ultimately just fix the symbolizer in the first place.

@brancz brancz closed this Jan 28, 2025
@brancz brancz reopened this Jan 28, 2025
@brancz
Member

brancz commented Jan 28, 2025

Apologies (I accidentally misclicked and closed the PR but reopened immediately).

@danipozo
Contributor Author

If I understand correctly, we're saying that we're ok with not getting inlined frames? I think I'm generally ok with this since we'd like to ultimately just fix the symbolizer in the first place.

No, actually the latest solution described in this comment lets us get inlined frames from addr2line output!

Apologies (I accidentally misclicked and closed the PR but reopened immediately).

Misclicks happen :)

Member

@brancz brancz left a comment

Ohhh awesome, I'm sorry I didn't understand before why that would work, but now it makes sense!

Let's get this merged!

@brancz brancz merged commit 03cff40 into parca-dev:main Jan 28, 2025
63 of 64 checks passed
@danipozo
Contributor Author

Great, thanks for taking the time to review!

@brancz
Member

brancz commented Jan 28, 2025

Of course, apologies it took so long, thank you very much for bearing with me!
