Define quantiles for Prometheus summary metrics #2425

Open
ishaan-mehta opened this issue Feb 7, 2025 · 0 comments

What feature you would like to be added?

It would be useful to let the user configure the quantiles calculated for the Prometheus summary metrics, and at a minimum to calculate a predefined set of basic quantiles (such as 0.5, 0.9, and 0.99).

There are 3 such metrics defined here: https://github.com/kubeflow/spark-operator/blob/v2.1.0/internal/metrics/sparkapplication_metrics.go#L57.

  • spark_application_success_execution_time_seconds
  • spark_application_failure_execution_time_seconds
  • spark_application_start_latency_seconds

Why is this needed?

With this change, the summary metrics become more valuable (and flexible) for monitoring purposes. Seeing only the count and sum of the execution time and latency metrics is not very useful.

For example, right now, all that is exposed for start latency (the spark_application_start_latency_seconds metric) is the sum of start latency values in seconds across the controller's lifetime (accessible via spark_application_start_latency_seconds_sum) and the count of start latency values across the controller's lifetime (accessible via spark_application_start_latency_seconds_count).

Only the count and sum are exposed, even though Prometheus summaries support quantiles, because we do not provide Objectives when instantiating the SummaryOpts objects: https://github.com/kubeflow/spark-operator/blob/v2.1.0/internal/metrics/sparkapplication_metrics.go#L105. The Prometheus Go client documentation explains that by default (when the Objectives map is empty), the summary is created without calculating quantiles: https://github.com/prometheus/client_golang/blob/v1.20.5/prometheus/summary.go#L121.
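To make the shape of the missing piece concrete, here is a minimal sketch of an objectives map. The Objectives field of SummaryOpts in client_golang has the type map[float64]float64, mapping each target quantile to its allowed absolute error. The function name defaultObjectives and the specific error values are assumptions for illustration only, not something the project has chosen. The sketch uses only the standard library so it runs on its own.

```go
package main

import "fmt"

// defaultObjectives is a hypothetical helper (not part of spark-operator).
// It returns a quantile -> allowed absolute error map. A map of this type
// is what would be assigned to the Objectives field of SummaryOpts.
// The error values below are illustrative assumptions.
func defaultObjectives() map[float64]float64 {
	return map[float64]float64{
		0.5:  0.05,  // median, tolerated error of +/- 0.05
		0.9:  0.01,  // 90th percentile, tolerated error of +/- 0.01
		0.99: 0.001, // 99th percentile, tolerated error of +/- 0.001
	}
}

func main() {
	objectives := defaultObjectives()
	// Iterate over a fixed slice because Go map iteration order is random.
	for _, q := range []float64{0.5, 0.9, 0.99} {
		fmt.Printf("quantile=%g allowed_error=%g\n", q, objectives[q])
	}
}
```

With a non-empty map like this in place, each summary would expose one series per quantile in addition to the existing _sum and _count series.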

Describe the solution you would like

The user should be able to configure the calculated quantiles via the values.yaml file when deploying the Helm chart.

By default, the summary metrics should expose a predefined set of quantiles if the user does not provide any values (or even if user configuration is not supported). This can be a small list such as 0.5, 0.9, and 0.99.
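One possible way to handle the configured value on the operator side is sketched below, assuming the Helm chart passes the quantiles to the controller as a comma-separated string. The function name parseObjectives, the input format, the fallback values, and the (1 - q) / 10 rule for choosing the allowed error are all hypothetical choices for illustration, not an existing spark-operator or client_golang API.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseObjectives is a hypothetical helper. It turns a comma-separated
// quantile list such as "0.5,0.9,0.99" into a quantile -> allowed-error map.
// An empty input falls back to an assumed default set.
func parseObjectives(raw string) (map[float64]float64, error) {
	if strings.TrimSpace(raw) == "" {
		return map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001}, nil
	}
	objectives := make(map[float64]float64)
	for _, field := range strings.Split(raw, ",") {
		q, err := strconv.ParseFloat(strings.TrimSpace(field), 64)
		if err != nil {
			return nil, fmt.Errorf("invalid quantile %q: %w", field, err)
		}
		// Written as a negated range check so that NaN is rejected too.
		if !(q > 0 && q < 1) {
			return nil, fmt.Errorf("quantile %v is outside (0, 1)", q)
		}
		// Assumed heuristic: tighter allowed error for quantiles near 1.
		objectives[q] = (1 - q) / 10
	}
	return objectives, nil
}

func main() {
	objectives, err := parseObjectives("0.5, 0.95")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(objectives), "quantiles configured")
}
```

Rejecting out-of-range values at startup keeps a bad values.yaml entry from silently producing a summary without the expected quantiles.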

Describe alternatives you have considered

No response

Additional context

No response

Love this feature?

Give it a 👍 We prioritize the features with most 👍