Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

safekeeper: add CPU profiling #9764

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open

safekeeper: add CPU profiling #9764

wants to merge 4 commits into from

Conversation

erikgrinaker
Copy link
Contributor

@erikgrinaker erikgrinaker commented Nov 14, 2024

Problem

We don't have a convenient way to gather CPU profiles from a running binary, e.g. during production incidents or end-to-end benchmarks, nor during microbenchmarks (particularly on macOS).

We would also like to have continuous profiling in production, likely using Grafana Cloud Profiles. We may choose to use either eBPF profiles or pprof profiles for this (pending testing and discussion with SREs), but pprof profiles appear useful regardless for the reasons listed above. See https://github.com/neondatabase/cloud/issues/14888.

This PR is intended as a proof of concept, to try it out in staging and drive further discussions about profiling more broadly.

Touches #9534.
Touches https://github.com/neondatabase/cloud/issues/14888.

Summary of changes

Adds a HTTP route /profile/cpu that takes a CPU profile and returns it. Defaults to an SVG flamegraph over a 5-second duration, but can also emit pprof Protobuf profiles for use with e.g. pprof or Grafana Alloy. Query parameters:

  • format: output format (pprof or svg)
  • frequency: sampling frequency in microseconds (default 100)
  • seconds: number of seconds to profile (default 5)

Also integrates pprof profiles into Criterion benchmarks, such that flamegraph reports can be taken with cargo bench ... --profile-duration <seconds>. Output under target/criterion/*/profile/flamegraph.svg.

Example profiles:

@erikgrinaker erikgrinaker self-assigned this Nov 14, 2024
@erikgrinaker erikgrinaker requested a review from a team as a code owner November 14, 2024 16:28
Comment on lines 578 to +579
static ALLOWLIST_ROUTES: Lazy<HashSet<Uri>> = Lazy::new(|| {
["/v1/status", "/metrics"]
["/v1/status", "/metrics", "/pprof/profile"]
Copy link
Contributor Author

@erikgrinaker erikgrinaker Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't put this behind auth, since it's pretty inconvenient. The perf impact is minor, and only one profile can run at a time, so it's not a DoS vector. It only exposes function and module names, so it's not a data leak risk.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably give this endpoint as well as the metrics endpoint actual JWT authentication eventually. I've filed a private issue https://github.com/neondatabase/cloud/issues/20372 as it is relevant to security.

For now I agree with you though.

Copy link

github-actions bot commented Nov 14, 2024

5490 tests run: 5247 passed, 0 failed, 243 skipped (full report)


Flaky tests (2)

Postgres 16

Code coverage* (full report)

  • functions: 31.5% (7931 of 25185 functions)
  • lines: 49.6% (62839 of 126631 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
e28059c at 2024-11-18T11:02:32.872Z :recycle:

@erikgrinaker
Copy link
Contributor Author

The CDDL-1.0 license used by the inferno crate was rejected by cargo-deny, so we'll have to get that blessed first:

erikgrinaker added a commit that referenced this pull request Nov 15, 2024
#9764, which adds profiling support to Safekeeper, pulls in the
dependency [`inferno`](https://crates.io/crates/inferno) via
[`pprof-rs`](https://crates.io/crates/pprof). This is licenced under the
[Common Development and Distribution License
1.0](https://spdx.org/licenses/CDDL-1.0.html), which is not allowed by
`cargo-deny`.

This patch allows the CDDL-1.0 license. It is a derivative of the
Mozilla Public License, which we already allow, but avoids some issues
around European copyright law that the MPL had. As such, I don't expect
this to be problematic.
@erikgrinaker erikgrinaker changed the title safekeeper: add pprof profiler safekeeper: add CPU profiling Nov 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants