Tracking Issue for "zones can run out of resources" #7468

Open · 1 of 70 tasks

smklein opened this issue Feb 3, 2025 · 4 comments

Comments

@smklein (Collaborator) commented Feb 3, 2025

There are several issues related to "who is using CPU / RAM / disk space" on a sled, and to ensuring these resources are accounted for properly. This issue attempts to summarize those issues at a high level and track them in one place.

If we do not account for these resources, anything consuming a resource in an unbounded fashion can exhaust it for the other consumers on the sled, which could result in unexpected failures or performance degradation.

Definitions

Resources

Resources are finite quantities that exist on sleds and are necessary for zones to operate. They are used by zones, but may be used by other non-zone entities as well.

These include:

  • CPUs
  • RAM (both reservoir and non-reservoir usage)
  • Disk Space on datasets (used by both durable datasets and transient zone filesystems). The focus of this issue is on "disk space usage within U.2s specifically".

Resource Consumers

Consumers are entities on Sleds that utilize resources, and draw from the "shared pool" of them.

These include:

  • The Host OS + Global Zone
  • All control-plane zones allocated from the blueprint (Nexus, Crucible, Cockroachdb, DNS, NTP, etc)
  • All control-plane zones that are allocated from the sled (Switch Zone)
  • All control-plane zones that are allocated on-demand (Propolis, Probe Zones)

Why Use This Categorization

If you pick a consumer (e.g. "Switch Zone") and a resource (e.g., "RAM"), and there exists no upper-bound on usage, then it is possible for that consumer to negatively impact other occupants on that sled by preventing them from using resources that they expect to exist.
To definitively resolve this issue, we must define "buckets" from which consumers can draw resources. One such example: for disks, the Debug dataset has a reservation and a quota. Although we must account for space usage within this dataset, it is not possible for usage within this dataset to cause problems in other datasets. Similarly, usage of space by other datasets cannot starve the Debug dataset of space.

Tools to limit resource usage

illumos gives us tools for placing upper bounds on resource usage by consumers:

  • CPUs / RAM usage can be controlled by the capped-cpu and capped-memory properties of zonecfg
  • Disk space can be controlled on a per-dataset basis by using quotas and reservations
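
As a rough sketch of how these tools are driven in practice (illustrative only: the zone name, dataset path, and limit values below are hypothetical, and omicron's sled-agent has its own wrappers rather than shelling out exactly like this), the caps reduce to zonecfg(8) and zfs(8) invocations along these lines:

```rust
use std::io;
use std::process::Command;

/// Apply CPU and memory caps to an existing zone via zonecfg(8).
/// `capped-cpu` and `capped-memory` are the illumos resource controls
/// mentioned above; the zone name and limit values are illustrative.
fn cap_zone(zone: &str, ncpus: f64, physical_mem: &str) -> io::Result<()> {
    let script = format!(
        "add capped-cpu; set ncpus={ncpus}; end; \
         add capped-memory; set physical={physical_mem}; end; \
         commit"
    );
    let status = Command::new("zonecfg")
        .args(["-z", zone, script.as_str()])
        .status()?;
    if !status.success() {
        return Err(io::Error::new(
            io::ErrorKind::Other,
            format!("zonecfg failed for zone {zone}"),
        ));
    }
    Ok(())
}

/// Bound a dataset's space usage with a matching quota and reservation,
/// the same mechanism already used for the Debug dataset.
fn bound_dataset(dataset: &str, size: &str) -> io::Result<()> {
    for prop in [format!("quota={size}"), format!("reservation={size}")] {
        let status = Command::new("zfs")
            .args(["set", prop.as_str(), dataset])
            .status()?;
        if !status.success() {
            return Err(io::Error::new(
                io::ErrorKind::Other,
                format!("`zfs set {prop}` failed for {dataset}"),
            ));
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // Hypothetical zone and dataset names, purely for illustration.
    cap_zone("oxz_example", 2.0, "4g")?;
    bound_dataset("oxp_example/crypt/debug", "100G")?;
    Ok(())
}
```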

Resource Limits by Consumer

  • Host OS / Global Zone
    • Disk Usage: The host OS/GZ generally makes use of the rpool ramdisk, as well as the M.2s. It does not really have any unbounded space usage on U.2 disks.
    • CPU + RAM usage: Unbounded. It may be difficult to set an explicit bound here; it may be easier to "put a cap on all other zones" instead, so that whatever remains of the pool is dedicated to the host OS.
  • Blueprint-Controlled Zones (see: ZoneKind and DatasetKind)
    • Boundary + Internal NTP
      • Disk Usage from transient filesystem: Unbounded
      • CPU/RAM usage: Unbounded
    • Clickhouse
      • Disk Usage from transient filesystem: Unbounded
      • Disk Usage from durable filesystem: Unbounded
      • CPU/RAM usage: Unbounded
    • Clickhouse Keeper
      • Disk Usage from transient filesystem: Unbounded
      • Disk Usage from durable filesystem: Unbounded
      • CPU/RAM usage: Unbounded
    • Clickhouse Server
      • Disk Usage from transient filesystem: Unbounded
      • Disk Usage from durable filesystem: Unbounded
      • CPU/RAM usage: Unbounded
    • CockroachDB
      • Disk Usage from transient filesystem: Unbounded
      • Disk Usage from durable filesystem: Unbounded
      • CPU/RAM usage: Unbounded
    • Crucible
      • Disk Usage from transient filesystem: Unbounded
      • Disk Usage from durable filesystem: Nexus tracks and updates a "size used" column for each Crucible dataset, and tries to keep this usage below the size of the entire zpool. However, this size is considered relative to the entire zpool (where other datasets may exist!), and nothing accounts for the metadata used by Crucible itself (on top of the user-requested storage). Furthermore, there is no bound set on the durable dataset itself. Crucible does, however, have an internal dataset called regions, on which it applies quotas and reservations internally.
      • CPU/RAM usage: Unbounded
    • Crucible Pantry
      • Disk Usage from transient filesystem: Unbounded
      • CPU/RAM usage: Unbounded
    • Internal + External DNS
      • Disk Usage from transient filesystem: Unbounded
      • Disk Usage from durable filesystem: Unbounded
      • CPU/RAM usage: Unbounded
    • Nexus
      • Disk Usage from transient filesystem: Unbounded
      • CPU/RAM usage: Unbounded
    • Oximeter
      • Disk Usage from transient filesystem: Unbounded
      • CPU/RAM usage: Unbounded
  • Sled-Agent Controlled Zones
    • Switch Zone
  • Dynamically-Provisioned Zones
    • Propolis (note that the Propolis zone itself consumes resources beyond the bounds placed on the underlying instance)
      • Disk Usage from transient filesystem: Unbounded
      • CPU usage: We supply a value of vcpus for the instance, but impose no bound on the Propolis Zone.
      • Reservoir RAM usage: Nexus allocates a memory value for each instance, which Propolis uses to claim a portion of the sled's "memory reservoir". Reservoir capacity is cooperatively shared by the Propolis zones (the sled agent trusts each zone to consume only as much reservoir as it was provided; a "greedy" Propolis could starve other instances, although this seems unlikely).
      • Non-Reservoir RAM usage: We do not account for this amount (the allocation pretends this usage is zero, which is not true), nor do we bound it. See the sketch after this list.
    • Probe Zones
      • Disk Usage from transient filesystem: Unbounded
      • CPU/RAM usage: Unbounded
  • Other Datasets (See: DatasetKind)
    • Debug Dataset
      • Disk Usage: It's bounded to a maximum of 100 GiB, but has no reservation. Should it have a reservation?
    • Update
      • Disk Usage: Unbounded? But I cannot tell where this is being allocated. It may be possible to delete this.
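
To make the Propolis accounting gap concrete, here is a minimal sketch (hypothetical types and numbers, not the actual Nexus/sled-agent code) contrasting today's reservoir-only placement check with one that also budgets a zone's non-reservoir overhead:

```rust
/// Hypothetical sled-level accounting, not the real omicron types.
struct SledBudget {
    /// RAM carved out as the VMM reservoir (backs guest memory).
    reservoir_bytes: u64,
    /// Sum of guest memory already placed against the reservoir.
    reservoir_allocated: u64,
    /// Non-reservoir RAM left for the host OS, control-plane zones, etc.
    non_reservoir_free: u64,
}

impl SledBudget {
    /// Roughly the check done today: only the instance's guest memory is
    /// counted, and only against the reservoir.
    fn can_place(&self, guest_memory: u64) -> bool {
        self.reservoir_allocated + guest_memory <= self.reservoir_bytes
    }

    /// A fuller check would also budget the Propolis zone's non-reservoir
    /// overhead (process heap, emulation buffers, zone services), which is
    /// currently assumed to be zero and left unbounded.
    fn can_place_with_overhead(&self, guest_memory: u64, zone_overhead: u64) -> bool {
        self.can_place(guest_memory) && zone_overhead <= self.non_reservoir_free
    }
}

fn main() {
    let sled = SledBudget {
        reservoir_bytes: 512 << 30,     // 512 GiB reservoir
        reservoir_allocated: 480 << 30, // 480 GiB already placed
        non_reservoir_free: 4 << 30,    // 4 GiB of non-reservoir RAM left
    };
    // Today's accounting says a 16 GiB instance fits...
    assert!(sled.can_place(16 << 30));
    // ...but if its Propolis zone needs ~6 GiB of non-reservoir RAM,
    // a fuller accounting would reject the placement.
    assert!(!sled.can_place_with_overhead(16 << 30, 6 << 30));
}
```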

Issues

(Note that the "consumer" here does not necessarily imply blame - if a neighboring service has consumed excessive resources, a consumer may fail prematurely)

@karencfv (Contributor) commented Feb 3, 2025

I think RFD 413 might be relevant for this chunk of work?

@smklein (Collaborator, Author) commented Feb 3, 2025

> I think RFD 413 might be relevant for this chunk of work?

Definitely - FYI @gjcolombo. https://rfd.shared.oxide.computer/rfd/0413#_zone_memory_controls seems to discuss the zone controls in more depth, and https://rfd.shared.oxide.computer/rfd/0413#possible_budgets seems like an interesting starting point for setting some upper bounds. I do think we probably want to get monitoring integrated more tightly before setting hard limits here.

Also: https://rfd.shared.oxide.computer/rfd/0312 is related for the disk usage side of things.

@gjcolombo (Contributor) commented

> I do think we probably want to get monitoring integrated more tightly before setting hard limits here.

100% agreed here, especially since I'm sure many of the numbers in 413 section 7 are badly out of date (e.g. Crucible is much more efficient these days). We really need to get a clearer picture of what costs what before we start applying any limits (since the penalty for violating a usage limit might ultimately be termination of the violating process!).

@hawkw (Member) commented Feb 5, 2025

Something that came up while I was talking to @bcantrill about affinity and related work is that, presently, when an instance-start saga fails because we can't find a sled with sufficient resources (or, when @smklein's affinity work is done, because we can't find a sled with sufficient resources that's permitted by the instance's affinity rules), we put the instance in the Failed state. This is the same state that instances go to when the VMM process crashes or a sled abruptly reboots. Bryan suggested that it might be worth having a separate state for "can't currently be scheduled due to insufficient resources", to communicate to the user a difference between "something bad happened to this instance" and "we weren't able to start this instance due to reasons".

At present, we would want the "insufficient resources" state to be handled broadly similarly to Failed by control-plane logic, e.g. we should consider it eligible to auto-start...
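
For illustration only, a sketch of the distinction being proposed (hypothetical names, not the actual omicron instance-state types):

```rust
/// A minimal sketch of the proposed split: a dedicated variant for
/// "couldn't be scheduled due to insufficient resources" alongside the
/// existing Failed state. All names here are hypothetical.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InstanceState {
    Running,
    Stopped,
    /// Something bad happened to a running instance (VMM crash, sled reboot).
    Failed,
    /// The start saga could not find a sled with sufficient resources
    /// (or one permitted by the instance's affinity rules).
    InsufficientResources,
}

impl InstanceState {
    /// Both failure-like states would be treated similarly by control-plane
    /// logic, e.g. both eligible for auto-restart.
    fn eligible_for_auto_restart(self) -> bool {
        matches!(self, Self::Failed | Self::InsufficientResources)
    }
}

fn main() {
    let state = InstanceState::InsufficientResources;
    assert!(state.eligible_for_auto_restart());
    println!("{state:?} is eligible for auto-restart");
}
```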
