Implement basic eBPF-based context switch monitoring #36
Do you know which ebpf, ebpf-go, or perhaps iovisor/bcc tool(s) we want to use for context switch tracing? I believe the closest I have found and tried so far may be bcc/tools/cpudist. See: https://github.com/iovisor/bcc/blob/master/tools/cpudist_example.txt |
I think we should try the 'sched:sched_switch' tracepoint with cilium/ebpf. I haven't read these but they might point you in the right direction: |
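For orientation, here is a minimal sketch of that approach with cilium/ebpf on the Go side, assuming a pre-compiled object file sched_switch.o containing a tracepoint program named handle_sched_switch (both names are hypothetical placeholders, not anything from this thread):

```go
package main

import (
	"log"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
	"github.com/cilium/ebpf/rlimit"
)

func main() {
	// Lift the memlock rlimit so eBPF maps can be created on older kernels.
	if err := rlimit.RemoveMemlock(); err != nil {
		log.Fatal(err)
	}

	// Load a pre-compiled eBPF object file; the file name and program
	// name below are hypothetical.
	coll, err := ebpf.LoadCollection("sched_switch.o")
	if err != nil {
		log.Fatal(err)
	}
	defer coll.Close()

	// Attach the program to the sched:sched_switch tracepoint so it runs
	// on every context switch.
	tp, err := link.Tracepoint("sched", "sched_switch", coll.Programs["handle_sched_switch"], nil)
	if err != nil {
		log.Fatal(err)
	}
	defer tp.Close()

	// Block forever; the kernel-side program can aggregate into maps meanwhile.
	select {}
}
```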
Aha, both of those do look useful. The output of runqlat looks almost identical to that of cpudist. As I understand so far, cilium/ebpf is "ebpf-go", meaning a set of Go bindings for running eBPF-based utilities such as the tools in bcc (BPF Compiler Collection). I'll read through those pages, dig into |
I have made progress on three fronts:
Here is a truncated overview of the manuals and sections therein which looked relevant to me:
https://github.com/sysstat/sysstat
If this sounds viable, I think I can skip steps 1-3, do 4, and start work on aggregating filtered/condensed |
Upon further research and some successful testing, it seems that
I will continue researching eBPF. Once any eBPF-based tool is working, cilium/ebpf's bpf2go should be able to generate most of the necessary code to call it from a Go program. |
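A minimal sketch of that bpf2go workflow, assuming a hypothetical C source file switch.c and generator identifier switchmon (bpf2go derives the switchmonObjects type and loadSwitchmonObjects function from that identifier when go generate is run):

```go
package main

// The go:generate directive compiles switch.c and emits Go scaffolding
// for its programs and maps; switch.c and "switchmon" are hypothetical.
//go:generate go run github.com/cilium/ebpf/cmd/bpf2go switchmon switch.c

import "log"

func main() {
	// switchmonObjects and loadSwitchmonObjects are produced by the
	// bpf2go directive above; this won't compile until go generate runs.
	var objs switchmonObjects
	if err := loadSwitchmonObjects(&objs, nil); err != nil {
		log.Fatal(err)
	}
	defer objs.Close()
	// objs now holds typed handles to the programs and maps defined in switch.c.
}
```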
Progress! I believe I have a working bpftrace script which aggregates IO-blocked context switches by process name as well as memory allocations, reclamations, and page faults, at a sustainable data rate. It should be relatively easy to adjust or add to. Sample code:
Sample output:
Status of Implementation Tasks:
Questions:
|
Nice progress! Do you know if bpftrace supports reading perf counters? I see "hardware" sampling points, where you can get code to run every, e.g., 100k cache misses, but can you read the number of cache misses on a context switch? |
Excellent questions. Preliminary research indicates that bpftrace does not support reading perf counters directly, since both are front-end implementations of (the same?) eBPF framework, but there are other ways to correlate or integrate the functionality of both.
From https://www.brendangregg.com/ebpf.html section 4.2, here is an overview of some available tools:
This video may be a helpful resource/reference for how things are related: https://www.youtube.com/watch?v=_5Z2AU7QTH4
As I currently understand: the bpftrace script above passively hooks into tracepoints which 'fire' on each related event, so to break down this function...
When a context switch occurs, the
I believe more tracepoints can be added, including at least some which perf uses, and that perf can be used alongside bpftrace or a custom eBPF program. If we want to find the precise number of cache misses associated with an individual context switch event, that may be both difficult and very high-overhead. I'll keep researching. |
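As a hedged illustration of using perf alongside an eBPF tool (one possible approach, not something settled in this thread): user space can open a hardware counter directly via perf_event_open and read it periodically next to the eBPF-side event counts. The Go sketch below uses golang.org/x/sys/unix to count cache misses across all tasks on CPU 0; correlating such totals with individual context switches is exactly the hard part noted above.

```go
package main

import (
	"encoding/binary"
	"log"
	"time"
	"unsafe"

	"golang.org/x/sys/unix"
)

func main() {
	// Describe a hardware cache-miss counter.
	attr := unix.PerfEventAttr{
		Type:   unix.PERF_TYPE_HARDWARE,
		Config: unix.PERF_COUNT_HW_CACHE_MISSES,
	}
	attr.Size = uint32(unsafe.Sizeof(attr))

	// pid=-1, cpu=0: count cache misses from all tasks on CPU 0.
	fd, err := unix.PerfEventOpen(&attr, -1, 0, -1, unix.PERF_FLAG_FD_CLOEXEC)
	if err != nil {
		log.Fatal(err)
	}
	defer unix.Close(fd)

	// Read the 64-bit counter value once per second.
	buf := make([]byte, 8)
	for i := 0; i < 5; i++ {
		time.Sleep(time.Second)
		if _, err := unix.Read(fd, buf); err != nil {
			log.Fatal(err)
		}
		log.Printf("cache misses on CPU 0: %d", binary.LittleEndian.Uint64(buf))
	}
}
```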
Update: I got a bpftrace script working through Go |
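One plausible shape for that integration (the script path switches.bt and the line-by-line handling are assumptions): launch bpftrace as a subprocess from Go and stream its output as it arrives.

```go
package main

import (
	"bufio"
	"log"
	"os/exec"
)

func main() {
	// Run a bpftrace script as a subprocess; "switches.bt" is a
	// hypothetical path to the aggregation script discussed above.
	cmd := exec.Command("bpftrace", "switches.bt")
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		log.Fatal(err)
	}
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}

	// Consume bpftrace's line-oriented output; a real collector would
	// parse these lines into structured metrics instead of logging them.
	scanner := bufio.NewScanner(stdout)
	for scanner.Scan() {
		log.Println(scanner.Text())
	}
	if err := cmd.Wait(); err != nil {
		log.Fatal(err)
	}
}
```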
@yonch As you recommended, I looked into eBPF resources by Liz Rice from Isovalent (Cilium) and a few other speakers/writers, and I think I have a much better understanding of the overall eBPF landscape. So, as I currently understand and think is relevant for this issue, from the ground up:
Also, I figured out how to write a much more elegant bpftrace script which:
I will try to get the script nicely commented and uploaded in a pull request soon. |
That's great research! Looking forward to seeing what you end up with! |
We need to establish initial eBPF instrumentation to track process context switches. This first implementation will use the cilium/ebpf package (github.com/cilium/ebpf) to trace when processes enter and exit CPU cores, allowing us to attribute resource usage to specific processes and containers.
Implementation Tasks
Technical Details
Future iterations will add timer-based sampling and more sophisticated data collection.
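As a sketch of how the user-space side might drain per-process data under one possible design (the pinned map path /sys/fs/bpf/switch_counts, its per-PID key/value layout, and the pinning itself are all assumptions, not part of this issue):

```go
package main

import (
	"log"
	"time"

	"github.com/cilium/ebpf"
)

func main() {
	// Open a hypothetical pinned hash map, keyed by PID, holding
	// per-process context-switch counts written by the kernel program.
	m, err := ebpf.LoadPinnedMap("/sys/fs/bpf/switch_counts", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer m.Close()

	// Periodically walk the map and report per-process counts.
	for {
		var pid uint32
		var count uint64
		iter := m.Iterate()
		for iter.Next(&pid, &count) {
			log.Printf("pid %d: %d context switches", pid, count)
		}
		if err := iter.Err(); err != nil {
			log.Fatal(err)
		}
		time.Sleep(5 * time.Second)
	}
}
```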
cc @atimeofday -- this is what we discussed earlier today