Tracking Issue for "zones can run out of resources" #7468

Open · 1 of 70 tasks

smklein opened this issue Feb 3, 2025 · 4 comments

Comments

@smklein (Collaborator) commented Feb 3, 2025

There are several issues related to "who is using CPU / RAM / disk space" on a sled, and to ensuring these resources are accounted for properly. This issue attempts to summarize those issues at a high level and track them in one place.

If we do not account for these resources, anything consuming a resource in an unbounded fashion can exhaust it for the other consumers on the sled, which could result in unexpected failures or performance degradation.

Definitions

Resources

Resources are finite quantities that exist on sleds and are necessary for zones to operate. They are used by zones, but may be used by other non-zone entities as well.

These include:

  • CPUs
  • RAM (both reservoir and non-reservoir usage)
  • Disk Space on datasets (used by both durable datasets and transient zone filesystems). The focus of this issue is on "disk space usage within U.2s specifically".

Resource Consumers

Consumers are entities on Sleds that utilize resources, and draw from the "shared pool" of them.

These include:

  • The Host OS + Global Zone
  • All control-plane zones allocated from the blueprint (Nexus, Crucible, Cockroachdb, DNS, NTP, etc)
  • All control-plane zones that are allocated from the sled (Switch Zone)
  • All control-plane zones that are allocated on-demand (Propolis, Probe Zones)

Why Use This Categorization

If you pick a consumer (e.g. "Switch Zone") and a resource (e.g., "RAM"), and there exists no upper-bound on usage, then it is possible for that consumer to negatively impact other occupants on that sled by preventing them from using resources that they expect to exist.
To definitively resolve this issue, we must define "buckets" from which consumers can draw resources. One such example: for disks, the Debug dataset has a reservation and a quota. Although we must account for space usage within this dataset, it is not possible for usage within this dataset to cause problems in other datasets. Similarly, usage of space by other datasets cannot starve the Debug dataset of space.

Tools to limit resource usage

illumos gives us tools for placing upper bounds on resource usage by consumers:

  • CPUs / RAM usage can be controlled by the capped-cpu and capped-memory properties of zonecfg
  • Disk space can be controlled on a per-dataset basis by using quotas and reservations
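
As a rough sketch of how these tools are driven in practice (illustrative only: the zone name, dataset path, and limit values below are hypothetical, and omicron's sled-agent has its own wrappers rather than shelling out exactly like this), the caps reduce to zonecfg(8) and zfs(8) invocations along these lines:

```rust
use std::io;
use std::process::Command;

/// Apply CPU and memory caps to an existing zone via zonecfg(8).
/// `capped-cpu` and `capped-memory` are the illumos resource controls
/// mentioned above; the zone name and limit values are illustrative.
fn cap_zone(zone: &str, ncpus: f64, physical_mem: &str) -> io::Result<()> {
    let script = format!(
        "add capped-cpu; set ncpus={ncpus}; end; \
         add capped-memory; set physical={physical_mem}; end; \
         commit"
    );
    let status = Command::new("zonecfg")
        .args(["-z", zone, script.as_str()])
        .status()?;
    if !status.success() {
        return Err(io::Error::new(
            io::ErrorKind::Other,
            format!("zonecfg failed for zone {zone}"),
        ));
    }
    Ok(())
}

/// Bound a dataset's space usage with a matching quota and reservation,
/// the same mechanism already used for the Debug dataset.
fn bound_dataset(dataset: &str, size: &str) -> io::Result<()> {
    for prop in [format!("quota={size}"), format!("reservation={size}")] {
        let status = Command::new("zfs")
            .args(["set", prop.as_str(), dataset])
            .status()?;
        if !status.success() {
            return Err(io::Error::new(
                io::ErrorKind::Other,
                format!("`zfs set {prop}` failed for {dataset}"),
            ));
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // Hypothetical zone and dataset names, purely for illustration.
    cap_zone("oxz_example", 2.0, "4g")?;
    bound_dataset("oxp_example/crypt/debug", "100G")?;
    Ok(())
}
```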

Resource Limits by Consumer

  • Host OS / Global Zone
    • Disk Usage: The host OS/GZ generally makes use of the rpool ramdisk, as well as the M.2s. It does not really have any unbounded space usage on U.2 disks.
    • CPU + RAM usage: Unbounded. It may be difficult to set an explicit bound here; it may be easier to "put a cap on all other zones" instead, so that whatever remains of the pool is dedicated to the host OS.
  • Blueprint-Controlled Zones (see: ZoneKind and DatasetKind)
    • Boundary + Internal NTP
      • Disk Usage from transient filesystem: Unbounded
      • CPU/RAM usage: Unbounded
    • Clickhouse
      • Disk Usage from transient filesystem: Unbounded
      • Disk Usage from durable filesystem: Unbounded
      • CPU/RAM usage: Unbounded
    • Clickhouse Keeper
      • Disk Usage from transient filesystem: Unbounded
      • Disk Usage from durable filesystem: Unbounded
      • CPU/RAM usage: Unbounded
    • Clickhouse Server
      • Disk Usage from transient filesystem: Unbounded
      • Disk Usage from durable filesystem: Unbounded
      • CPU/RAM usage: Unbounded
    • CockroachDB
      • Disk Usage from transient filesystem: Unbounded
      • Disk Usage from durable filesystem: Unbounded
      • CPU/RAM usage: Unbounded
    • Crucible
      • Disk Usage from transient filesystem: Unbounded
      • Disk Usage from durable filesystem: Nexus tracks and updates a "size used" column for each Crucible dataset, and tries to keep this usage below the size of the entire zpool. However, this size is considered relative to the entire zpool (where other datasets may exist!), and nothing accounts for the metadata used by Crucible itself (on top of the user-requested storage). Furthermore, there is no bound set on the durable dataset itself. Crucible does, however, have an internal dataset called regions, on which it applies quotas and reservations internally.
      • CPU/RAM usage: Unbounded
    • Crucible Pantry
      • Disk Usage from transient filesystem: Unbounded
      • CPU/RAM usage: Unbounded
    • Internal + External DNS
      • Disk Usage from transient filesystem: Unbounded
      • Disk Usage from durable filesystem: Unbounded
      • CPU/RAM usage: Unbounded
    • Nexus
      • Disk Usage from transient filesystem: Unbounded
      • CPU/RAM usage: Unbounded
    • Oximeter
      • Disk Usage from transient filesystem: Unbounded
      • CPU/RAM usage: Unbounded
  • Sled-Agent Controlled Zones
    • Switch Zone
  • Dynamically-Provisioned Zones
    • Propolis (note that the Propolis zone itself consumes resources beyond the bounds placed on the underlying instance)
      • Disk Usage from transient filesystem: Unbounded
      • CPU usage: We supply a value of vcpus for the instance, but impose no bound on the Propolis Zone.
      • Reservoir RAM usage: Nexus allocates a memory value for each instance, which Propolis uses to claim a portion of the sled's "memory reservoir". Reservoir capacity is cooperatively shared by the Propolis zones (the sled agent trusts each zone to consume only as much reservoir as it was provided; a "greedy" Propolis could starve other instances, although this seems unlikely).
      • Non-Reservoir RAM usage: We do not account for this amount (the allocation pretends this usage is zero, which is not true), nor do we bound it. See the sketch after this list.
    • Probe Zones
      • Disk Usage from transient filesystem: Unbounded
      • CPU/RAM usage: Unbounded
  • Other Datasets (See: DatasetKind)
    • Debug Dataset
      • Disk Usage: It's bounded to a maximum of 100 GiB, but has no reservation. Should it have a reservation?
    • Update
      • Disk Usage: Unbounded? But I cannot tell where this is being allocated. It may be possible to delete this.
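
To make the Propolis accounting gap concrete, here is a minimal sketch (hypothetical types and numbers, not the actual Nexus/sled-agent code) contrasting today's reservoir-only placement check with one that also budgets a zone's non-reservoir overhead:

```rust
/// Hypothetical sled-level accounting, not the real omicron types.
struct SledBudget {
    /// RAM carved out as the VMM reservoir (backs guest memory).
    reservoir_bytes: u64,
    /// Sum of guest memory already placed against the reservoir.
    reservoir_allocated: u64,
    /// Non-reservoir RAM left for the host OS, control-plane zones, etc.
    non_reservoir_free: u64,
}

impl SledBudget {
    /// Roughly the check done today: only the instance's guest memory is
    /// counted, and only against the reservoir.
    fn can_place(&self, guest_memory: u64) -> bool {
        self.reservoir_allocated + guest_memory <= self.reservoir_bytes
    }

    /// A fuller check would also budget the Propolis zone's non-reservoir
    /// overhead (process heap, emulation buffers, zone services), which is
    /// currently assumed to be zero and left unbounded.
    fn can_place_with_overhead(&self, guest_memory: u64, zone_overhead: u64) -> bool {
        self.can_place(guest_memory) && zone_overhead <= self.non_reservoir_free
    }
}

fn main() {
    let sled = SledBudget {
        reservoir_bytes: 512 << 30,     // 512 GiB reservoir
        reservoir_allocated: 480 << 30, // 480 GiB already placed
        non_reservoir_free: 4 << 30,    // 4 GiB of non-reservoir RAM left
    };
    // Today's accounting says a 16 GiB instance fits...
    assert!(sled.can_place(16 << 30));
    // ...but if its Propolis zone needs ~6 GiB of non-reservoir RAM,
    // a fuller accounting would reject the placement.
    assert!(!sled.can_place_with_overhead(16 << 30, 6 << 30));
}
```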

Issues

(Note that the "consumer" here does not necessarily imply blame - if a neighboring service has consumed excessive resources, a consumer may fail prematurely)

@karencfv (Contributor) commented Feb 3, 2025

I think RFD 413 might be relevant for this chunk of work?

@smklein (Collaborator, Author) commented Feb 3, 2025

> I think RFD 413 might be relevant for this chunk of work?

Definitely - FYI @gjcolombo. https://rfd.shared.oxide.computer/rfd/0413#_zone_memory_controls seems to discuss the zone controls in more depth, and https://rfd.shared.oxide.computer/rfd/0413#possible_budgets seems like an interesting starting point for setting some upper bounds. I do think we probably want to get monitoring integrated more tightly before setting hard limits here.

Also: https://rfd.shared.oxide.computer/rfd/0312 is related for the disk usage side of things.

@gjcolombo (Contributor) commented

> I do think we probably want to get monitoring integrated more tightly before setting hard limits here.

100% agreed here, especially since I'm sure many of the numbers in 413 section 7 are badly out of date (e.g. Crucible is much more efficient these days). We really need to get a clearer picture of what costs what before we start applying any limits (since the penalty for violating a usage limit might ultimately be termination of the violating process!).

@hawkw (Member) commented Feb 5, 2025

Something that came up while I was talking to @bcantrill about affinity and related work is that, presently, when an instance-start saga fails because we can't find a sled with sufficient resources (or, when @smklein's affinity work is done, because we can't find a sled with sufficient resources that's permitted by the instance's affinity rules), we put the instance in the Failed state. This is the same state that instances go to when the VMM process crashes or a sled abruptly reboots. Bryan suggested that it might be worth having a separate state for "can't currently be scheduled due to insufficient resources", to communicate to the user a difference between "something bad happened to this instance" and "we weren't able to start this instance due to reasons".

At present, we would want the "insufficient resources" state to be handled broadly similarly to Failed by control-plane logic, e.g. we should consider it eligible to auto-start...
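
For illustration only, a sketch of the distinction being proposed (hypothetical names, not the actual omicron instance-state types):

```rust
/// A minimal sketch of the proposed split: a dedicated variant for
/// "couldn't be scheduled due to insufficient resources" alongside the
/// existing Failed state. All names here are hypothetical.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InstanceState {
    Running,
    Stopped,
    /// Something bad happened to a running instance (VMM crash, sled reboot).
    Failed,
    /// The start saga could not find a sled with sufficient resources
    /// (or one permitted by the instance's affinity rules).
    InsufficientResources,
}

impl InstanceState {
    /// Both failure-like states would be treated similarly by control-plane
    /// logic, e.g. both eligible for auto-restart.
    fn eligible_for_auto_restart(self) -> bool {
        matches!(self, Self::Failed | Self::InsufficientResources)
    }
}

fn main() {
    let state = InstanceState::InsufficientResources;
    assert!(state.eligible_for_auto_restart());
    println!("{state:?} is eligible for auto-restart");
}
```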
