Docs on cpu performance #1840

Merged
merged 3 commits into from
Oct 14, 2024
Binary file added images/cpu-quota.png
Binary file added images/cpu-quota.webp
Binary file not shown.
33 changes: 33 additions & 0 deletions machines/cpu-performance.html.md
@@ -0,0 +1,33 @@
---
title: CPU Performance
layout: docs
nav: machines
---

We offer two kinds of virtual CPUs for Machines: `shared` and `performance`. Both run on the same physical hardware and have the same clock speed. The difference is how much time they are allowed to spend running your applications.


CPU Type | Period<sup>1</sup> | Baseline Quota<sup>1</sup> | Max Quota Balance<sup>1</sup>
-------- | ------ | -------------- | -----------------
`shared` | 80ms | 5ms (1/16th) | 500s
`performance` | 80ms | 50ms (10/16th) | 5000s

We enforce limits using the Linux [`cpu.cfs_quota_us` cgroup setting](https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.rst). For each 80ms period, we instruct the Linux scheduler to run `shared` vCPUs for no more than 5ms and `performance` vCPUs for no more than 50ms. If your application is working hard and reaches its quota, its vCPUs are suspended for the remainder of the 80ms period.

Quotas are shared between a machine's vCPUs. For example, a `shared-cpu-2x` machine is allowed to run for 10ms per 80ms period, regardless of which vCPU is using that time.
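As a quick check of the arithmetic above, here is a minimal Python sketch. The constants are copied from the table and may change; the function names are made up for illustration and are not part of any Fly.io API.

```python
# Figures copied from the table above; they are subject to change.
PERIOD_MS = 80
BASELINE_MS_PER_VCPU = {"shared": 5, "performance": 50}


def machine_quota_ms(cpu_kind: str, vcpus: int) -> int:
    """Run time allowed per period, pooled across all of a machine's vCPUs."""
    return BASELINE_MS_PER_VCPU[cpu_kind] * vcpus


def baseline_fraction(cpu_kind: str) -> float:
    """Share of each period a single vCPU may run at its baseline quota."""
    return BASELINE_MS_PER_VCPU[cpu_kind] / PERIOD_MS


print(machine_quota_ms("shared", 2))     # 10 (the shared-cpu-2x example)
print(baseline_fraction("shared"))       # 0.0625, i.e. 1/16
print(baseline_fraction("performance"))  # 0.625, i.e. 10/16
```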

<sup>1</sup> We might change these specific numbers if we feel like it.

## Bursting

APIs and human-facing web applications are sensitive to latency, and a 75ms pause in program execution (the remainder of an 80ms period after a `shared` vCPU has used its 5ms quota) is often unacceptable. These same types of applications often work hard in small bursts and remain idle much of the time. To avoid unfairly suspending the execution of vCPUs in these applications, we allow a balance of unused vCPU time to be accrued. The application is then allowed to spend its balance in bursts. When bursting, the vCPU is allowed to run at up to 100%. When the balance is depleted, the vCPU is limited to running at its baseline quota.
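The following Python sketch is a toy model of the accrue-and-spend behavior described above, for a single `shared` vCPU. It assumes the balance grows by whatever part of the baseline quota goes unused in each period and is capped at the maximum balance from the table. The real accounting may differ, and the `step` function is hypothetical.

```python
PERIOD_MS = 80.0


def step(balance_ms, demand_ms, baseline_ms=5.0, max_balance_ms=500_000.0):
    """Advance one 80ms period for one vCPU (toy model, not the real scheduler).

    Returns (ran_ms, throttled_ms, new_balance_ms).
    """
    demand_ms = min(demand_ms, PERIOD_MS)  # a vCPU cannot exceed 100%
    # The vCPU may spend its baseline quota plus any accrued balance.
    allowed_ms = min(PERIOD_MS, baseline_ms + balance_ms)
    ran_ms = min(demand_ms, allowed_ms)
    throttled_ms = demand_ms - ran_ms
    # Unused baseline is added to the balance; bursting draws it down.
    new_balance_ms = min(max_balance_ms, balance_ms + baseline_ms - ran_ms)
    return ran_ms, throttled_ms, new_balance_ms


balance = 0.0
for _ in range(30):  # 30 idle periods accrue 30 * 5ms of balance
    _, _, balance = step(balance, demand_ms=0.0)

history = []
for _ in range(3):  # then the application wants 100% of the vCPU
    ran, throttled, balance = step(balance, demand_ms=80.0)
    history.append((ran, throttled, balance))

print(history)
```

In this model the first two busy periods run at 100% by draining the balance, and the third is cut back to the 5ms baseline with 75ms throttled.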

## Monitoring

The easiest way to see your CPU utilization, baseline quota, and throttling is on your app's [Managed Grafana](/docs/monitoring/metrics/#managed-grafana) `Fly Instance` dashboard.

![chart showing CPU utilization, steal, baseline, and throttling](../images/cpu-quota.webp)

Here, we can see a machine that was running well below its baseline quota. It had accumulated a 50s/vCPU runtime balance. Then, during a burst of activity, CPU utilization exceeded the baseline quota, causing the balance to drain. When the balance reached 0, the machine was briefly throttled. When CPU utilization went down, throttling stopped and the balance accumulated again.

A related and somewhat misleading metric is CPU steal. You can see this under the `mode=steal` label in the `fly_instance_cpu` metric. Steal is the amount of time your vCPUs want to run but our scheduler isn't allowing them to. This can happen due to throttling when your machine exceeds its quota, but it can also be a sign that other machines on the same host are competing for resources. We publish a separate `fly_instance_cpu_throttle` metric that only includes time your vCPUs were throttled for exceeding quota.
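To turn two samples of a counter such as `fly_instance_cpu{mode="steal"}` into a fraction of wall-clock time, a sketch along these lines may help. It relies on the metrics docs describing `fly_instance_cpu` as a counter in centiseconds. The sample values and the `cpu_mode_fraction` helper are invented for illustration.

```python
def cpu_mode_fraction(count_start, count_end, elapsed_s):
    """Fraction of wall-clock time one CPU spent in a mode between two samples.

    Both counts are cumulative centiseconds (0.01s) for a single cpu_id/mode.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    delta_cs = count_end - count_start
    return (delta_cs / 100.0) / elapsed_s


# Hypothetical samples taken 60 seconds apart: 300 centiseconds of steal.
print(cpu_mode_fraction(1200, 1500, 60))  # 0.05, i.e. 5% of the interval
```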
2 changes: 1 addition & 1 deletion machines/guides-examples/machine-sizing.html.markerb
@@ -77,7 +77,7 @@ curl -i -X POST \
}'
```

The `cpu_kind` parameter can be one of `shared` or `performance`.
The `cpu_kind` parameter can be one of `shared` or `performance`. Learn more about CPU types and performance [here](/docs/machines/cpu-performance/).

If you're using `flyctl`, the equivalent command looks like this:

6 changes: 5 additions & 1 deletion monitoring/metrics.html.md
@@ -214,9 +214,13 @@ fly_instance_memory_vmalloc_chunk
- `load_average` is derived from [`/proc/loadavg`](https://www.kernel.org/doc/html/latest/filesystems/proc.html#id11) ([`getloadavg`](https://man7.org/linux/man-pages/man3/getloadavg.3.html)). It's a ["system load average"](https://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html) measuring the number of processes in the system run queue, with samples representing averages over 1, 5, and 15 `minutes`.

- `cpu` is derived from [`/proc/stat`](https://www.kernel.org/doc/html/latest/filesystems/proc.html#miscellaneous-kernel-statistics-in-proc-stat),
and counts the amount of time each CPU (`cpu_id`) has spent performing different kinds of work (`mode`, which may be one of `user`, `nice`, `system`, `idle`, `iowait`, `irq`, `softirq`, `steal`, `guest`, `guest_nice`).
and counts the amount of time each CPU (`cpu_id`) has spent performing different kinds of work (`mode`, which may be one of `user`, `nice`, `system`, `idle`, `iowait`, `irq`, `softirq`, `steal`, `guest`, `guest_nice`).
The time unit is 'clock ticks' of centiseconds (0.01 seconds).

- `cpu_baseline` is the baseline quota for CPU usage across all machine vCPUs. Learn more [here](/docs/machines/cpu-performance).

- `cpu_balance` is the accrued balance of unused baseline CPU quota across all machine vCPUs. Learn more [here](/docs/machines/cpu-performance/).

```
fly_instance_load_average{minutes}
fly_instance_cpu{cpu_id, mode} (Counter, centiseconds)
7 changes: 4 additions & 3 deletions partials/_machines_nav.html.erb
@@ -11,7 +11,7 @@
]
},
{
title: "Machines API",
title: "Machines API",
path: "/docs/machines/api/",
open: true,
links: [
@@ -24,7 +24,7 @@
]
},
{
title: "Machines and flyctl",
title: "Machines and flyctl",
open: true,
links: [
{ text: "Run a new Machine", path: "/docs/machines/flyctl/fly-machine-run/" },
@@ -46,10 +46,11 @@
links: [
{ text: "Machine states", path: "/docs/machines/machine-states/" },
{ text: "The Machine runtime environment", path: "/docs/machines/runtime-environment/" },
{ text: "CPU Performance", path: "/docs/machines/cpu-performance/" },
{ text: "flyctl Machine commands", path: "/docs/flyctl/machine/" }
]
}
]
%>

<%= partial "/docs/partials/accordion_nav", locals: { nav: @nav } %>
<%= partial "/docs/partials/accordion_nav", locals: { nav: @nav } %>