Planning approach for reading cgroup information #4

Closed · hamiltont opened this issue Dec 26, 2019 · 1 comment

@hamiltont (Contributor):
This is not really an issue or bug report, so sorry about that. I wanted to lay out some of the background needed when thinking about reading data directly from cgroups (as proposed in #1 and #2). I am new to this background information, so take it with a bit of skepticism and feel free to propose corrections.

Background

We can only get limited information from the systemd dbus API, such as the list of units. More info, such as CPU-per-slice or blockIO-per-scope, is clearly desirable and would make systemd-exporter much more useful.

Unfortunately, "there's currently no systemd API to retrieve accounting information from cgroups. For now, if you need to retrieve this information use /proc/$PID/cgroup to determine the cgroup path for your process in the cpuacct controller (or whichever controller matters to you), and then read the attributes directly from the cgroup tree" link
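To make that lookup concrete, here is a minimal Go sketch (Go being the language systemd-exporter is written in) of parsing /proc/$PID/cgroup for a controller's cgroup path; the controller name and the printed mount prefix are illustrative assumptions, not exporter code:

```go
// Sketch: find the cgroup path for a given PID and controller by parsing
// /proc/$PID/cgroup. Field layout follows cgroups(7).
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// cgroupPathFor returns the cgroup path of pid in the named v1 controller
// (e.g. "cpuacct" or "memory"). An empty controller name matches the v2
// entry, which is reported with hierarchy ID 0 and an empty controller list.
func cgroupPathFor(pid int, controller string) (string, error) {
	f, err := os.Open(fmt.Sprintf("/proc/%d/cgroup", pid))
	if err != nil {
		return "", err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// Each line is "hierarchy-ID:controller-list:cgroup-path".
		parts := strings.SplitN(scanner.Text(), ":", 3)
		if len(parts) != 3 {
			continue
		}
		for _, c := range strings.Split(parts[1], ",") {
			if c == controller {
				return parts[2], nil
			}
		}
	}
	return "", fmt.Errorf("no cgroup entry for controller %q", controller)
}

func main() {
	// Example: read our own cpuacct path, then print where the v1 usage
	// file for it would typically live.
	path, err := cgroupPathFor(os.Getpid(), "cpuacct")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println("/sys/fs/cgroup/cpu,cpuacct" + path + "/cpuacct.usage")
}
```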

Complicating this issue, the kernel's cgroup v1 API is being replaced with the cgroup v2 API. The new v2 API is not backwards compatible, and programs using cgroups (including systemd) must be upgraded to use the new API. As we will see, adoption of this new v2 API is ongoing, and in practice most systemd installations in 2020 will utilize both the v1 and v2 APIs.

There appears to be consensus (from both kernel devs and systemd) that v2 is the future and should be utilized wherever possible. Unfortunately, since most current systemd installations use v1 for resource management, we must support v1 today and plan for v2 support in the future.

Kernel Background

As mentioned above, the main issue is the migration from v1 to v2 of the cgroup API.

More details are available here, but the big changes are:

  • v1 runs multiple hierarchies which are all disconnected. A cgroup called "systemd.slice" might exist inside the cpu controller's hierarchy and the same name may appear in the memory controller's hierarchy, but the kernel gives no guarantee that these refer to the same tasks. v2 flips this model - all controllers share one "unified" hierarchy
  • v1 manages threads, while v2 manages processes. It was found that thread-level management does not make sense in many cases (e.g. limiting memory per-thread)
  • v1 allowed different controllers to have different conventions within their own controller-specific hierarchy. v2 enforces some consistency
  • A cgroup may only be written to by one process. In our context, that means systemd is the only process that can mount/edit the v2 unified hierarchy.

Timeline:

  • 4.5 - cgroup v2 API released as non-experimental. Missing multiple features (cpu, freezer, device controller)
  • 5.2 - cgroup v2 is "ready for containers" with support for the freezer controller

systemd Background

Systemd can run in one of three modes to change how it uses the v1/v2 cgroup APIs. Whichever API systemd is using for resource management/control is the API we would need to read from to gather accounting metrics.

  • legacy mode - Leave /sys/fs/cgroup alone (on v1 API) and mount /sys/fs/cgroup/systemd for systemd process management (using v1 API)
  • hybrid - As of systemd v233, this mode is identical to legacy with one addition - mount the unified hierarchy (using the v2 API) at /sys/fs/cgroup/unified and use that for process management. Continue using the multiple cgroup v1 hierarchies for resource control (e.g. /sys/fs/cgroup/[cpu,memory,blkio,etc])
  • unified - use only new v2 API. Any software that uses cgroups v1 API will not function when running in this unified mode, as the cgroups v1 API will not be available in sysfs.

systemd version info:

  • 226 - added optional support for the unified hierarchy at /sys/fs/cgroup
  • 230 - official support for the linux 4.5 unified hierarchy
  • 232 - added support for the cpu controller in the unified hierarchy
  • 233 - default to hybrid control group structure
  • 243 - default to unified control group structure

Distro adoption notes

Because systemd-exporter is tied to systemd's use of cgroups, there is only limited value in knowing when distros will switch to v2 by default. From the perspective of systemd-exporter, it probably makes sense to support v1 as long as an install of systemd could be using it (e.g. until the option for 'hybrid' mode is removed). Still, it's easy to get some rough info on the approximate timeline.

Using distrowatch, I have quickly approximated when systemd version 243 is coming to different distros. v243 does not mean a distro must default to unified mode, but it is when systemd recommends they do so.

  • Fedora - default-unified was released Oct 2019
  • Ubuntu - 244 is in focal (so perhaps 20.04)
  • Manjaro - not on roadmap
  • Debian - in testing (bullseye) so likely 11
  • openSUSE - in the next release (e.g. Tumbleweed)
  • CentOS - not on roadmap
  • Arch - released v244. Did not check if default-unified is enabled

This author estimates community distros will use v1 as default until 2021 and enterprise distros until 2023, which seems reasonable given the above data.

Implementation Questions

1 - What systemd modes should we support?

IMO we should prioritize support for legacy and hybrid modes, which covers all systemd installs up to 243.

2 - How do we know if systemd is in unified, hybrid, or legacy mode?

Answer
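The linked answer isn't reproduced here, but one common heuristic (an assumption on my part, not necessarily what the link describes) is to check which filesystem types are mounted under /sys/fs/cgroup:

```go
// Sketch: guess which cgroup layout systemd is using by inspecting
// /sys/fs/cgroup. This mirrors a common heuristic; it is an assumption,
// not a definitive detection method.
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func cgroupMode() (string, error) {
	var st unix.Statfs_t
	if err := unix.Statfs("/sys/fs/cgroup", &st); err != nil {
		return "", err
	}
	if st.Type == unix.CGROUP2_SUPER_MAGIC {
		// Only the v2 unified hierarchy is mounted.
		return "unified", nil
	}
	// /sys/fs/cgroup holds the v1 hierarchies; check whether systemd also
	// mounted the v2 hierarchy at /sys/fs/cgroup/unified (hybrid mode).
	if err := unix.Statfs("/sys/fs/cgroup/unified", &st); err == nil &&
		st.Type == unix.CGROUP2_SUPER_MAGIC {
		return "hybrid", nil
	}
	return "legacy", nil
}

func main() {
	mode, err := cgroupMode()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("systemd cgroup mode:", mode)
}
```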

3 - What cgroup API should we utilize?

For hybrid and legacy modes, we must read the v1 cgroup API. For unified, we must read the v2 API.
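To illustrate how different the reads are, here is a hedged Go sketch pulling CPU usage for a cgroup path under each API; the unit path is made up and the mount points assume the default layouts described above:

```go
// Sketch: reading CPU usage for a unit's cgroup under the two APIs.
// Paths are illustrative; "system.slice/foo.service" is a made-up unit.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// cpuUsageV1 reads total CPU time (nanoseconds) from the v1 cpuacct
// controller, as used in legacy/hybrid mode.
func cpuUsageV1(cgPath string) (uint64, error) {
	b, err := os.ReadFile("/sys/fs/cgroup/cpu,cpuacct" + cgPath + "/cpuacct.usage")
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
}

// cpuUsageV2 reads "usage_usec" (microseconds) from cpu.stat in the v2
// unified hierarchy, as used in unified mode.
func cpuUsageV2(cgPath string) (uint64, error) {
	b, err := os.ReadFile("/sys/fs/cgroup" + cgPath + "/cpu.stat")
	if err != nil {
		return 0, err
	}
	for _, line := range strings.Split(string(b), "\n") {
		fields := strings.Fields(line)
		if len(fields) == 2 && fields[0] == "usage_usec" {
			return strconv.ParseUint(fields[1], 10, 64)
		}
	}
	return 0, fmt.Errorf("usage_usec not found in cpu.stat")
}

func main() {
	if ns, err := cpuUsageV1("/system.slice/foo.service"); err == nil {
		fmt.Println("v1 cpuacct.usage (ns):", ns)
	}
	if us, err := cpuUsageV2("/system.slice/foo.service"); err == nil {
		fmt.Println("v2 usage_usec (us):", us)
	}
}
```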

4 - Are there any limits to our v1 API usage?

As far as I can tell, we will need to read multiple files (from different controller hierarchies) for each unit. There is no way to read many files atomically, so we must accept that these are best-effort metrics. There is no upper bound on how much time is spent between reads. In some warped system dystopia, we could read cpu usage and memory usage multiple seconds apart.

Similarly, we have to accept that the kernel could update a cgroup (membership, limits, etc) in between us reading from two controllers. Even though memory and cpu both have a system.slice, AFAICT there is no guarantee they are the same system.slice.
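A tiny sketch of what those best-effort pair reads look like under v1 (the system.slice paths are just examples):

```go
// Sketch: under v1, CPU and memory accounting for the "same" unit live in
// separate controller hierarchies, so they require two independent reads.
package main

import (
	"fmt"
	"os"
)

func main() {
	cpu, err := os.ReadFile("/sys/fs/cgroup/cpu,cpuacct/system.slice/cpuacct.usage")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
	// Arbitrary time can pass here, and the kernel may modify either cgroup
	// between the two reads, so the pair is best-effort, not a snapshot.
	mem, err := os.ReadFile("/sys/fs/cgroup/memory/system.slice/memory.usage_in_bytes")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
	fmt.Printf("cpu (ns): %s mem (bytes): %s\n", cpu, mem)
}
```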

5 - What controllers should we read?

#1 and #2 start with CPU and memory, which seems like a reasonable starting point.

6 - What files do we need to read?

Minimally, reading /proc/$PID/cgroup is recommended. It is unclear whether dbus is the best place to get the list of PIDs we need to read this file for. Does it return all tasks or just the main one?

Perhaps /sys/fs/cgroup/unified/... would be better than dbus as a launching point.

This is the main TBD.
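As one possible starting point (an assumption, not a settled decision), a rough Go sketch that walks the hybrid-mode unified hierarchy and lists unit cgroups by their name suffix:

```go
// Sketch: enumerate unit cgroups by walking the unified hierarchy instead of
// asking dbus for PIDs. The /sys/fs/cgroup/unified mount point assumes
// hybrid mode.
package main

import (
	"fmt"
	"io/fs"
	"path/filepath"
	"strings"
)

func main() {
	root := "/sys/fs/cgroup/unified"
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.IsDir() {
			return nil
		}
		// systemd names unit cgroups after the unit, e.g. "ssh.service",
		// "system.slice", "session-1.scope".
		name := d.Name()
		if strings.HasSuffix(name, ".service") ||
			strings.HasSuffix(name, ".slice") ||
			strings.HasSuffix(name, ".scope") {
			rel, _ := filepath.Rel(root, path)
			fmt.Println(rel)
		}
		return nil
	})
	if err != nil {
		fmt.Println("walk error:", err)
	}
}
```

In unified mode, the same walk would start at /sys/fs/cgroup instead.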

@hamiltont (Contributor, Author):

Closing this as I think the current approach is working now :-)
