
Implement basic eBPF-based context switch monitoring #36

Open
yonch opened this issue Jan 17, 2025 · 11 comments
@yonch
Contributor

yonch commented Jan 17, 2025

We need to establish initial eBPF instrumentation to track process context switches. This first implementation will use the cilium/ebpf package (github.com/cilium/ebpf) to trace when processes enter and exit CPU cores, allowing us to attribute resource usage to specific processes and containers.

Implementation Tasks

  1. Create Go binary using cilium/ebpf package
  2. Implement basic context switch tracing
  3. Add simple stdout logging of traced events
  4. Set up GitHub Actions workflow to build container image

Technical Details

  • Package: github.com/cilium/ebpf
  • Events: Process context switches
  • Output: Simple logging to stdout
  • Build: Container image via GitHub Actions

Future iterations will add timer-based sampling and more sophisticated data collection.

cc @atimeofday -- this is what we discussed earlier today

@atimeofday
Contributor

Do you know which ebpf, ebpf-go, or perhaps iovisor/bcc tool(s) we want to use for context switch tracing? I believe the closest I have found and tried so far may be bcc/tools/cpudist. See:

https://github.com/iovisor/bcc/blob/master/tools/cpudist_example.txt

@yonch
Contributor Author

yonch commented Jan 20, 2025

I think we should try the 'sched:sched_switch' tracepoint with cilium/ebpf.

I haven't read these but they might point you in the right direction:

@atimeofday
Contributor

Aha, sched_switch, that was the name! I remember you mentioning it.

Both of those do look useful. The output of runqlat looks almost identical to that of cpudist. As I understand so far, cilium/ebpf is "ebpf-go", meaning a set of Go bindings for running eBPF-based utilities such as the tools in bcc (BPF Compiler Collection).

I'll read through those pages, dig into sched:sched_switch in particular, and report back when I make notable progress and/or hit a wall. Thanks!

@atimeofday
Contributor

atimeofday commented Jan 24, 2025

I have made progress on three fronts:

  1. I got the ebpf-go Getting Started example with XDP monitoring to work.
  2. I found a few examples of bpf programs/functions using sched_switch to reference.
    • I have run into a bit of a wall actually finding any common format I should be using to hook into it...
  3. Very tentatively, I may have found how the Linux kernel developers and Red Hat already directly solved this problem/context. That being said, I currently do not have a very informed idea of what the viability, relevance, or overhead are for these tools.

Here is a truncated overview of the manuals and sections therein which looked relevant to me:

>man proc

NAME
       proc - process information, system information, and sysctl pseudo-filesystem

DESCRIPTION
       The proc filesystem is a pseudo-filesystem which provides an interface to kernel data structures.

https://serverfault.com/questions/190049/find-out-which-task-is-generating-a-lot-of-context-switches-on-linux

https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/4/html/introduction_to_system_administration/s2-resource-tools-sar

https://github.com/sysstat/sysstat

>sudo dnf install sysstat
>man pidstat

NAME
       pidstat - Report statistics for Linux tasks.

-d     Report I/O statistics
-r     Report page faults and memory utilization
-u     Report CPU utilization
-w     Report task switching activity

-h     Display  all  activities  horizontally on a single line, with no average statistics at the end of the report. This is intended to make it easier to be parsed by other programs.

/proc filesystem must be mounted for the pidstat command to work.

/proc/ and sysstat/pidstat seem to provide everything we could possibly need, already summarized in user-space whether we access it or not, so the overhead of using them could be near-zero.

If this sounds viable, I think I can skip steps 1-3, do 4, and start work on aggregating filtered/condensed /proc/ data in a usable CSV matrix format. If not, I can continue digging into manual implementation of eBPF sched_switch tracing.

@atimeofday
Contributor

Upon further research and some successful testing, /proc/-monitoring tools such as sysstat appear usable for basic, rough correlation of the necessary metrics. However, they are not as optimized as eBPF, and they likely lack the fidelity and precision needed for production-grade monitoring and millisecond-level response times. In case sysstat is of interest for a "v0.1" skeleton / proof of concept, here is one of the commands I developed to experiment with basic memory / sched_switch correlation:

pidstat -r | awk '{s+=$9} END {printf "Estimated Memory \t%.0f", s}' && echo '' && pidstat -wt | awk '{s+=$6} END {printf "Estimated Switches \t%.0f", s}'

I will continue researching eBPF sched_switch tracing. The most promising reference I have found so far is here: https://www.brendangregg.com/ebpf.html

Once any eBPF-based tool is working, cilium/ebpf-go/bpf2go should be able to generate most of the necessary code to call it from a Go program.

@atimeofday
Contributor

Progress!

I believe I have a working bpftrace script which aggregates IO-blocked context switches by process name as well as memory allocations, reclamations, and page faults, at a sustainable data rate. It should be relatively easy to adjust or add to.

Sample code:

#!/usr/bin/env bpftrace

tracepoint:exceptions:page_fault_user {
    @fault[comm] = count();  // Count page faults by process name
}

tracepoint:kmem:mm_page_alloc {
    @alloc[comm] = count();  // Count memory allocations by process name
}

tracepoint:vmscan:mm_vmscan_direct_reclaim_begin {
    @reclaim[comm] = count();  // Count memory reclaim events by process name
}

tracepoint:sched:sched_switch {
    if (@fault[args->prev_comm] || @alloc[args->prev_comm] || @reclaim[args->prev_comm]) {
        @switch[args->prev_comm] = count();  // Count context switches for processes with memory-related events
    }
}

Sample output:

@alloc[sudo]: 1
@alloc[ptyxis-agent]: 1
@alloc[kworker/6:1]: 1
@alloc[llvmpipe-7]: 2
@alloc[llvmpipe-10]: 2
@alloc[bpftrace]: 9
@alloc[gnome-shell]: 91
@fault[abrt-dump-journ]: 1
@fault[timeout]: 1
@fault[systemd-journal]: 10
@fault[llvmpipe-10]: 20
@fault[llvmpipe-5]: 44
@fault[llvmpipe-2]: 44
@fault[llvmpipe-9]: 44
@fault[llvmpipe-7]: 64
@fault[llvmpipe-8]: 88
@fault[llvmpipe-3]: 88
@fault[llvmpipe-6]: 88
@fault[llvmpipe-11]: 88
@fault[llvmpipe-1]: 132

@switch[timeout]: 1
@switch[ptyxis-agent]: 1
@switch[sudo]: 2
@switch[systemd-journal]: 3
@switch[abrt-dump-journ]: 3
@switch[kworker/6:1]: 4
@switch[llvmpipe-9]: 8
@switch[bpftrace]: 15
@switch[llvmpipe-2]: 112
@switch[llvmpipe-6]: 121
@switch[llvmpipe-8]: 128
@switch[llvmpipe-5]: 170
@switch[llvmpipe-10]: 201
@switch[llvmpipe-3]: 208
@switch[llvmpipe-11]: 214
@switch[llvmpipe-7]: 247
@switch[gnome-shell]: 256
@switch[llvmpipe-1]: 283

Status of Implementation Tasks:

  1. Proof of concept done; actual product pending further information
  2. Done?
  3. Done?
  4. Pending further information

Questions:

  • Where should this eBPF instrumentation be placed within the project structure?
  • Using cilium/ebpf may be redundant or overkill, and multiplies the code complexity by an order of magnitude. Should I focus on it for the tightest possible integration, or aim for the simplest possible implementation (for now)?
  • What other metrics should be collected, if any, if we know yet?
  • Does this look like it is on approximately the right track?

@yonch
Contributor Author

yonch commented Jan 26, 2025

nice progress!

Do you know if bpftrace supports reading perf counters?

I see "hardware" sampling points, where you can get code to run every e.g., 100k cache misses, but can you read the number of cache misses on a context switch?

@atimeofday
Copy link
Contributor

Excellent questions. Preliminary research indicates that bpftrace does not support reading perf counters directly since both are front-end implementations of (the same?) eBPF framework, but that there are other ways to correlate or integrate the functionality of both. From https://www.brendangregg.com/ebpf.html section 4.2, here is an overview of some available tools:

[Image: table of eBPF tracing front-ends from brendangregg.com, section 4.2]

This video may be a helpful resource/reference for how things are related: https://www.youtube.com/watch?v=_5Z2AU7QTH4

[Image: diagram from the linked talk]

As I currently understand:

The bpftrace script above passively hooks into tracepoints which 'fire' on each related event, so to break down this function...

tracepoint:sched:sched_switch {
    if (@fault[args->prev_comm] || @alloc[args->prev_comm] || @reclaim[args->prev_comm]) {
        @switch[args->prev_comm] = count();  // Count context switches for processes with memory-related events
    }
}

When a context switch occurs, the sched:sched_switch tracepoint fires, running the attached eBPF code. This particular version filters the number of events recorded to processes which were recorded when the page fault, page allocation, or memory reclamation tracepoints fired, then counts each time the sched_switch tracepoint fires for that process. This reduces the data rate I observed from about 65,000 events and 2.5MB per second to the above output of around 20 stat reports and 1KB per second.
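The filtering logic described above can be mirrored in plain Go to make the data-rate reduction concrete. This is a sketch of the technique only (all names are mine, not from the bpftrace script): switches are counted only for comms that have already triggered a memory-related event.

```go
package main

import "fmt"

// switchCounter mirrors the bpftrace filter: context switches are
// counted only for comms that already triggered a memory-related event.
type switchCounter struct {
	memEvents map[string]int // fault/alloc/reclaim counts by comm
	switches  map[string]int // filtered sched_switch counts by comm
}

func newSwitchCounter() *switchCounter {
	return &switchCounter{
		memEvents: map[string]int{},
		switches:  map[string]int{},
	}
}

// memEvent records a page fault, allocation, or reclaim for a comm.
func (c *switchCounter) memEvent(comm string) { c.memEvents[comm]++ }

// schedSwitch plays the role of the sched:sched_switch handler: the
// map lookup is the same check as @fault[...] || @alloc[...] || @reclaim[...].
func (c *switchCounter) schedSwitch(prevComm string) {
	if c.memEvents[prevComm] > 0 {
		c.switches[prevComm]++
	}
}

func main() {
	c := newSwitchCounter()
	c.memEvent("gnome-shell")
	c.schedSwitch("gnome-shell") // counted: has a prior memory event
	c.schedSwitch("idle-task")   // dropped: no memory events recorded
	fmt.Println(c.switches["gnome-shell"], c.switches["idle-task"]) // 1 0
}
```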

I believe more tracepoints can be added, including at least some which perf uses, and that perf can be used alongside bpftrace or a custom eBPF program. If we want to find the precise number of cache misses associated with an individual context switch event, that may be both difficult and very high-overhead.

I'll keep researching.

@atimeofday
Contributor

Update: I got a bpftrace script working through Go os/exec, recording per-process perf_event hardware counters for CPU cycles, instructions, and cache misses at set intervals. I am still working out whether and how conditional logic can split or filter the data on specific events as well as at set intervals.
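One possible shape for the os/exec wrapper, as a hedged sketch (the script name and function names are placeholders of mine, not from the working code): run bpftrace as a child process and parse its printed map output, which has the `@name[comm]: count` form shown earlier in this thread.

```go
package main

import (
	"bufio"
	"fmt"
	"os/exec"
	"regexp"
	"strconv"
	"strings"
)

// mapLine matches bpftrace map output such as `@switch[gnome-shell]: 256`.
var mapLine = regexp.MustCompile(`^@(\w+)\[(.+)\]: (\d+)$`)

// parseMaps turns bpftrace's printed maps into mapName -> comm -> count.
func parseMaps(out string) map[string]map[string]int64 {
	maps := map[string]map[string]int64{}
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		m := mapLine.FindStringSubmatch(strings.TrimSpace(sc.Text()))
		if m == nil {
			continue
		}
		if maps[m[1]] == nil {
			maps[m[1]] = map[string]int64{}
		}
		n, _ := strconv.ParseInt(m[3], 10, 64)
		maps[m[1]][m[2]] = n
	}
	return maps
}

func main() {
	// Hypothetical invocation; "switches.bt" is a placeholder script
	// name, and bpftrace needs root privileges to attach tracepoints.
	out, err := exec.Command("bpftrace", "switches.bt").Output()
	if err != nil {
		// Fall back to sample output when bpftrace is unavailable.
		out = []byte("@switch[gnome-shell]: 256\n@alloc[bpftrace]: 9\n")
	}
	for name, comms := range parseMaps(string(out)) {
		fmt.Println(name, comms)
	}
}
```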

@atimeofday
Contributor

atimeofday commented Feb 5, 2025

@yonch As you recommended, I looked into eBPF resources by Liz Rice from Isovalent (Cilium) and a few other speakers/writers, and I think I have a much better understanding of the overall eBPF landscape.

So, as I currently understand and think is relevant for this issue, from the ground up:


  • An ebpf-based program is divided into two sections: kernel-space ebpf code, and user-space data handling.
  • With cilium/ebpf, the kernel-space eBPF program is written in C and compiled to eBPF bytecode, and bpf2go generates a Go "API" for interacting with the compiled kernel-space program. Then, the user-space program is written in Go.
  • As of a year or two ago and potentially today, C and Rust are the only regular languages which can 'natively' compile to ebpf bytecode.

  • bpftrace is an ebpf-front-end language for writing both advanced kernel-space and basic user-space data handling in one file, with a library of premade tools which are sufficient for some use cases and offer formatting examples to work from.
  • bpftrace scripts primarily function in kernel-space, but use certain (well-documented, mostly I/O) functions/keywords to invoke a transfer to user-space.
  • bpftrace uses eBPF maps and perf ring buffers to transfer data to user-space in the fastest and most efficient manner for most cases.

  • bpftrace functions are well documented as either synchronous (kernel-space) or asynchronous (kernel+user-space), where for example print defaults to asynchronous, but it is possible to "coerce (and thus force a more expensive synchronous read) the type to an integer using a cast or by doing a comparison." (Quoted from docs)
  • bpftrace defaults to thread-safe, per-CPU maps and sync writes for performance and precision in aggregated statistics, and async reads because synchronously iterating over and collecting maps from each thread/CPU is expensive.
  • Sync and async reads and writes all have unavoidable tradeoffs and require different optimizations to mitigate.

  • Liz Rice strongly recommended starting with ebpf front-end languages/toolkits such as bpftrace, bpftool, and bcc, and then using bcc Python/Go/etc bindings to build tooling which they cannot handle natively.
  • iovisor/bcc pairs with iovisor/gobpf to write new bcc-framework tools in Go, and iovisor/gobpf points to Cilium docs for guidance on how to compile other languages (Go?) to ebpf bytecode with llvm.

Also, I figured out how to write a much more elegant bpftrace script which:

  1. aggregates per-process memory events, cache misses, cpu cycles, and cpu instructions
  2. fires when a context switch occurs
  3. checks if memory events or cache misses occurred during execution of the previous process
  4. counts context switches and records cycles per instruction if at least one has occurred
  5. then outputs this format at consistent millisecond intervals for each entry in the map of context switches:
Metrics at 739 ms:
Process comm: gnome-shell
PID: 2407
Context switches: 1
Cache misses: 3
Memory events: 1
Cycles per instruction: 2
Operation completed at 739 ms
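Step 4's cycles-per-instruction figure is just a ratio of two counters. A small Go helper (names are mine, not from the script) showing the integer-division behavior that would produce the whole-number "Cycles per instruction: 2" in the sample output:

```go
package main

import "fmt"

// cyclesPerInstruction computes the per-interval CPI from raw
// perf_event counter values. Integer division yields the whole-number
// CPI shown in the sample output; a zero instruction count returns 0
// to avoid dividing by zero when a process retired no counted
// instructions in the interval.
func cyclesPerInstruction(cycles, instructions uint64) uint64 {
	if instructions == 0 {
		return 0
	}
	return cycles / instructions
}

func main() {
	fmt.Println(cyclesPerInstruction(2_000_000, 1_000_000)) // 2
	fmt.Println(cyclesPerInstruction(123, 0))               // 0
}
```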

I will try to get the script nicely commented and uploaded in a pull request soon.

@yonch
Contributor Author

yonch commented Feb 5, 2025

That's great research! Looking forward to seeing what you end up with!
