
Implement basic eBPF-based context switch monitoring #36

Open
yonch opened this issue Jan 17, 2025 · 11 comments
@yonch
Contributor

yonch commented Jan 17, 2025

We need to establish initial eBPF instrumentation to track process context switches. This first implementation will use the cilium/ebpf package (github.com/cilium/ebpf) to trace when processes enter and exit CPU cores, allowing us to attribute resource usage to specific processes and containers.

Implementation Tasks

  1. Create Go binary using cilium/ebpf package
  2. Implement basic context switch tracing
  3. Add simple stdout logging of traced events
  4. Set up GitHub Actions workflow to build container image

Technical Details

  • Package: github.com/cilium/ebpf
  • Events: Process context switches
  • Output: Simple logging to stdout
  • Build: Container image via GitHub Actions

Future iterations will add timer-based sampling and more sophisticated data collection.

cc @atimeofday -- this is what we discussed earlier today

@atimeofday
Contributor

Do you know which ebpf, ebpf-go, or perhaps iovisor/bcc tool(s) we want to use for context switch tracing? I believe the closest I have found and tried so far may be bcc/tools/cpudist. See:

https://github.com/iovisor/bcc/blob/master/tools/cpudist_example.txt

@yonch
Contributor Author

yonch commented Jan 20, 2025

I think we should try the 'sched:sched_switch' tracepoint with cilium/ebpf.

I haven't read these but they might point you in the right direction:

@atimeofday
Contributor

Aha, sched_switch, that was the name! I remember you mentioning it.

Both of those do look useful. The output of runqlat looks almost identical to that of cpudist. As I understand so far, cilium/ebpf is "ebpf-go", meaning a set of Go bindings for running eBPF-based utilities such as the tools in bcc (BPF Compiler Collection).

I'll read through those pages, dig into sched:sched_switch in particular, and report back when I make notable progress and/or hit a wall. Thanks!

@atimeofday
Contributor

atimeofday commented Jan 24, 2025

I have made progress on three fronts:

  1. I got the ebpf-go Getting Started example with XDP monitoring to work.
  2. I found a few examples of bpf programs/functions using sched_switch to reference.
    • I have run into a bit of a wall actually finding any common format I should be using to hook into it...
  3. Very tentatively, I may have found how the Linux kernel developers and Red Hat already directly solved this problem/context. That being said, I currently do not have a very informed idea of what the viability, relevance, or overhead are for these tools.

Here is a truncated overview of the manuals and sections therein which looked relevant to me:

>man proc

NAME
       proc - process information, system information, and sysctl pseudo-filesystem

DESCRIPTION
       The proc filesystem is a pseudo-filesystem which provides an interface to kernel data structures.

https://serverfault.com/questions/190049/find-out-which-task-is-generating-a-lot-of-context-switches-on-linux

https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/4/html/introduction_to_system_administration/s2-resource-tools-sar

https://github.com/sysstat/sysstat

>sudo dnf install sysstat
>man pidstat

NAME
       pidstat - Report statistics for Linux tasks.

-d     Report I/O statistics
-r     Report page faults and memory utilization
-u     Report CPU utilization
-w     Report task switching activity

-h     Display  all  activities  horizontally on a single line, with no average statistics at the end of the report. This is intended to make it easier to be parsed by other programs.

/proc filesystem must be mounted for the pidstat command to work.

/proc/ and sysstat/pidstat seem to provide everything we could possibly need, already summarized in user-space whether we access it or not, so the overhead of using them could be near-zero.

If this sounds viable, I think I can skip steps 1-3, do 4, and start work on aggregating filtered/condensed /proc/ data in a usable CSV matrix format. If not, I can continue digging into manual implementation of eBPF sched_switch tracing.

@atimeofday
Contributor

Upon further research and some successful testing, /proc/-monitoring tools such as sysstat appear usable for basic, rough correlation of the necessary metrics. However, they are not as optimized as eBPF, and they likely lack the fidelity and precision needed for production-grade monitoring and millisecond-level response times. In case sysstat is of interest for a "v0.1" skeleton / proof of concept, here is one of the commands I developed to experiment with basic memory / sched_switch correlation:

pidstat -r | awk '{s+=$9} END {printf "Estimated Memory \t%.0f", s}' && echo '' && pidstat -wt | awk '{s+=$6} END {printf "Estimated Switches \t%.0f", s}'

I will continue researching eBPF sched_switch tracing. The most promising reference I have found so far is here: https://www.brendangregg.com/ebpf.html

Once any eBPF-based tool is working, cilium/ebpf-go/bpf2go should be able to generate most of the necessary code to call it from a Go program.

@atimeofday
Contributor

Progress!

I believe I have a working bpftrace script which aggregates IO-blocked context switches by process name as well as memory allocations, reclamations, and page faults, at a sustainable data rate. It should be relatively easy to adjust or add to.

Sample code:

#!/usr/bin/env bpftrace

tracepoint:exceptions:page_fault_user {
    @fault[comm] = count();  // Count page faults by process name
}

tracepoint:kmem:mm_page_alloc {
    @alloc[comm] = count();  // Count memory allocations by process name
}

tracepoint:vmscan:mm_vmscan_direct_reclaim_begin {
    @reclaim[comm] = count();  // Count memory reclaim events by process name
}

tracepoint:sched:sched_switch {
    if (@fault[args->prev_comm] || @alloc[args->prev_comm] || @reclaim[args->prev_comm]) {
        @switch[args->prev_comm] = count();  // Count context switches for processes with memory-related events
    }
}

Sample output:

@alloc[sudo]: 1
@alloc[ptyxis-agent]: 1
@alloc[kworker/6:1]: 1
@alloc[llvmpipe-7]: 2
@alloc[llvmpipe-10]: 2
@alloc[bpftrace]: 9
@alloc[gnome-shell]: 91
@fault[abrt-dump-journ]: 1
@fault[timeout]: 1
@fault[systemd-journal]: 10
@fault[llvmpipe-10]: 20
@fault[llvmpipe-5]: 44
@fault[llvmpipe-2]: 44
@fault[llvmpipe-9]: 44
@fault[llvmpipe-7]: 64
@fault[llvmpipe-8]: 88
@fault[llvmpipe-3]: 88
@fault[llvmpipe-6]: 88
@fault[llvmpipe-11]: 88
@fault[llvmpipe-1]: 132

@switch[timeout]: 1
@switch[ptyxis-agent]: 1
@switch[sudo]: 2
@switch[systemd-journal]: 3
@switch[abrt-dump-journ]: 3
@switch[kworker/6:1]: 4
@switch[llvmpipe-9]: 8
@switch[bpftrace]: 15
@switch[llvmpipe-2]: 112
@switch[llvmpipe-6]: 121
@switch[llvmpipe-8]: 128
@switch[llvmpipe-5]: 170
@switch[llvmpipe-10]: 201
@switch[llvmpipe-3]: 208
@switch[llvmpipe-11]: 214
@switch[llvmpipe-7]: 247
@switch[gnome-shell]: 256
@switch[llvmpipe-1]: 283

Status of Implementation Tasks:

  1. Proof of concept done; actual product pending further information
  2. Done?
  3. Done?
  4. Pending further information

Questions:

  • Where should this eBPF instrumentation be placed within the project structure?
  • Using cilium/ebpf may be redundant or overkill, and multiplies the code complexity by an order of magnitude. Should I focus on it for the tightest possible integration, or aim for the simplest possible implementation (for now)?
  • What other metrics should be collected, if any, if we know yet?
  • Does this look like it is on approximately the right track?

@yonch
Contributor Author

yonch commented Jan 26, 2025

nice progress!

Do you know if bpftrace supports reading perf counters?

I see "hardware" sampling points, where you can get code to run every e.g., 100k cache misses, but can you read the number of cache misses on a context switch?

@atimeofday
Copy link
Contributor

Excellent questions. Preliminary research indicates that bpftrace does not support reading perf counters directly since both are front-end implementations of (the same?) eBPF framework, but that there are other ways to correlate or integrate the functionality of both. From https://www.brendangregg.com/ebpf.html section 4.2, here is an overview of some available tools:

[Image: table of eBPF tracing front-ends from brendangregg.com, section 4.2]

This video may be a helpful resource/reference for how things are related: https://www.youtube.com/watch?v=_5Z2AU7QTH4

[Image: diagram from the linked talk]

As I currently understand:

The bpftrace script above passively hooks into tracepoints which 'fire' on each related event, so to break down this function...

tracepoint:sched:sched_switch {
    if (@fault[args->prev_comm] || @alloc[args->prev_comm] || @reclaim[args->prev_comm]) {
        @switch[args->prev_comm] = count();  // Count context switches for processes with memory-related events
    }
}

When a context switch occurs, the sched:sched_switch tracepoint fires, running the attached eBPF code. This particular version filters the number of events recorded to processes which were recorded when the page fault, page allocation, or memory reclamation tracepoints fired, then counts each time the sched_switch tracepoint fires for that process. This reduces the data rate I observed from about 65,000 events and 2.5MB per second to the above output of around 20 stat reports and 1KB per second.
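The filtering logic described above can be mirrored in plain Go to make the data-rate reduction concrete. This is a sketch of the technique only (all names are mine, not from the bpftrace script): switches are counted only for comms that have already triggered a memory-related event.

```go
package main

import "fmt"

// switchCounter mirrors the bpftrace filter: context switches are
// counted only for comms that already triggered a memory-related event.
type switchCounter struct {
	memEvents map[string]int // fault/alloc/reclaim counts by comm
	switches  map[string]int // filtered sched_switch counts by comm
}

func newSwitchCounter() *switchCounter {
	return &switchCounter{
		memEvents: map[string]int{},
		switches:  map[string]int{},
	}
}

// memEvent records a page fault, allocation, or reclaim for a comm.
func (c *switchCounter) memEvent(comm string) { c.memEvents[comm]++ }

// schedSwitch plays the role of the sched:sched_switch handler: the
// map lookup is the same check as @fault[...] || @alloc[...] || @reclaim[...].
func (c *switchCounter) schedSwitch(prevComm string) {
	if c.memEvents[prevComm] > 0 {
		c.switches[prevComm]++
	}
}

func main() {
	c := newSwitchCounter()
	c.memEvent("gnome-shell")
	c.schedSwitch("gnome-shell") // counted: has a prior memory event
	c.schedSwitch("idle-task")   // dropped: no memory events recorded
	fmt.Println(c.switches["gnome-shell"], c.switches["idle-task"]) // 1 0
}
```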

I believe more tracepoints can be added, including at least some which perf uses, and that perf can be used alongside bpftrace or a custom eBPF program. If we want to find the precise number of cache misses associated with an individual context switch event, that may be both difficult and very high-overhead.

I'll keep researching.

@atimeofday
Contributor

Update: I got a bpftrace script working through Go os/exec, recording per-process perf_event hardware counters for CPU cycles, instructions, and cache misses at set intervals. I am still working out whether and how conditional logic can split or filter the data on specific events as well as at set intervals.
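One possible shape for the os/exec wrapper, as a hedged sketch (the script name and function names are placeholders of mine, not from the working code): run bpftrace as a child process and parse its printed map output, which has the `@name[comm]: count` form shown earlier in this thread.

```go
package main

import (
	"bufio"
	"fmt"
	"os/exec"
	"regexp"
	"strconv"
	"strings"
)

// mapLine matches bpftrace map output such as `@switch[gnome-shell]: 256`.
var mapLine = regexp.MustCompile(`^@(\w+)\[(.+)\]: (\d+)$`)

// parseMaps turns bpftrace's printed maps into mapName -> comm -> count.
func parseMaps(out string) map[string]map[string]int64 {
	maps := map[string]map[string]int64{}
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		m := mapLine.FindStringSubmatch(strings.TrimSpace(sc.Text()))
		if m == nil {
			continue
		}
		if maps[m[1]] == nil {
			maps[m[1]] = map[string]int64{}
		}
		n, _ := strconv.ParseInt(m[3], 10, 64)
		maps[m[1]][m[2]] = n
	}
	return maps
}

func main() {
	// Hypothetical invocation; "switches.bt" is a placeholder script
	// name, and bpftrace needs root privileges to attach tracepoints.
	out, err := exec.Command("bpftrace", "switches.bt").Output()
	if err != nil {
		// Fall back to sample output when bpftrace is unavailable.
		out = []byte("@switch[gnome-shell]: 256\n@alloc[bpftrace]: 9\n")
	}
	for name, comms := range parseMaps(string(out)) {
		fmt.Println(name, comms)
	}
}
```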

@atimeofday
Contributor

atimeofday commented Feb 5, 2025

@yonch As you recommended, I looked into eBPF resources by Liz Rice from Isovalent (Cilium) and a few other speakers/writers, and I think I have a much better understanding of the overall eBPF landscape.

So, as I currently understand and think is relevant for this issue, from the ground up:


  • An ebpf-based program is divided into two sections: kernel-space ebpf code, and user-space data handling.
  • With cilium/ebpf, the kernel-space eBPF program is written in C and compiled to eBPF bytecode, and bpf2go generates a Go "API" for interacting with the compiled kernel-space program. Then, the user-space program is written in Go.
  • As of a year or two ago and potentially today, C and Rust are the only regular languages which can 'natively' compile to ebpf bytecode.

  • bpftrace is an ebpf-front-end language for writing both advanced kernel-space and basic user-space data handling in one file, with a library of premade tools which are sufficient for some use cases and offer formatting examples to work from.
  • bpftrace scripts primarily function in kernel-space, but use certain (well-documented, mostly I/O) functions/keywords to invoke a transfer to user-space.
  • bpftrace uses eBPF maps and perf ring buffers to transfer data to user-space in the fastest and most efficient manner for most cases.

  • bpftrace functions are well documented as either synchronous (kernel-space) or asynchronous (kernel+user-space), where for example print defaults to asynchronous, but it is possible to "coerce (and thus force a more expensive synchronous read) the type to an integer using a cast or by doing a comparison." (Quoted from docs)
  • bpftrace defaults to thread-safe, per-CPU maps and sync writes for performance and precision in aggregated statistics, and async reads because synchronously iterating over and collecting maps from each thread/CPU is expensive.
  • Sync and async reads and writes all have unavoidable tradeoffs and require different optimizations to mitigate.

  • Liz Rice strongly recommended starting with ebpf front-end languages/toolkits such as bpftrace, bpftool, and bcc, and then using bcc Python/Go/etc bindings to build tooling which they cannot handle natively.
  • iovisor/bcc pairs with iovisor/gobpf to write new bcc-framework tools in Go, and iovisor/gobpf points to Cilium docs for guidance on how to compile other languages (Go?) to ebpf bytecode with llvm.

Also, I figured out how to write a much more elegant bpftrace script which:

  1. aggregates per-process memory events, cache misses, cpu cycles, and cpu instructions
  2. fires when a context switch occurs
  3. checks if memory events or cache misses occurred during execution of the previous process
  4. counts context switches and records cycles per instruction if at least one has occurred
  5. then outputs this format at consistent millisecond intervals for each entry in the map of context switches:
Metrics at 739 ms:
Process comm: gnome-shell
PID: 2407
Context switches: 1
Cache misses: 3
Memory events: 1
Cycles per instruction: 2
Operation completed at 739 ms
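Step 4's cycles-per-instruction figure is just a ratio of two counters. A small Go helper (names are mine, not from the script) showing the integer-division behavior that would produce the whole-number "Cycles per instruction: 2" in the sample output:

```go
package main

import "fmt"

// cyclesPerInstruction computes the per-interval CPI from raw
// perf_event counter values. Integer division yields the whole-number
// CPI shown in the sample output; a zero instruction count returns 0
// to avoid dividing by zero when a process retired no counted
// instructions in the interval.
func cyclesPerInstruction(cycles, instructions uint64) uint64 {
	if instructions == 0 {
		return 0
	}
	return cycles / instructions
}

func main() {
	fmt.Println(cyclesPerInstruction(2_000_000, 1_000_000)) // 2
	fmt.Println(cyclesPerInstruction(123, 0))               // 0
}
```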

I will try to get the script nicely commented and uploaded in a pull request soon.

@yonch
Contributor Author

yonch commented Feb 5, 2025

That's great research! Looking forward to seeing what you end up with!
