What feature you would like to be added?
It would be useful to allow the user to configure the quantiles calculated by Prometheus summary metrics, and at the minimum calculate some pre-determined basic quantiles (such as `0.5`, `0.9`, `0.99`, etc.). There are 3 such metrics defined here: https://github.com/kubeflow/spark-operator/blob/v2.1.0/internal/metrics/sparkapplication_metrics.go#L57.

- `spark_application_success_execution_time_seconds`
- `spark_application_failure_execution_time_seconds`
- `spark_application_start_latency_seconds`
Why is this needed?
With this change, the summary metrics become more valuable (and flexible) for monitoring purposes; seeing just the count and sum of execution time and latency metrics is not very useful.

For example, right now, all that is exposed in terms of start latency (the `spark_application_start_latency_seconds` metric) is the sum of start latencies in seconds across the controller's lifetime (accessible via `spark_application_start_latency_seconds_sum`) and the count of start latency observations across the controller's lifetime (accessible via `spark_application_start_latency_seconds_count`).
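To illustrate, this is roughly what the controller's metrics endpoint exposes for this metric today (the sample values are made up); note that there are no `quantile`-labeled series at all:

```
# TYPE spark_application_start_latency_seconds summary
spark_application_start_latency_seconds_sum 42.7
spark_application_start_latency_seconds_count 12
```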
The reason only the count and sum are being exposed, despite Prometheus summaries having quantile support, is that we are not providing `Objectives` when instantiating the `SummaryOpts` objects: https://github.com/kubeflow/spark-operator/blob/v2.1.0/internal/metrics/sparkapplication_metrics.go#L105. See the documentation on the Prometheus Go client, which explains that by default (when the `Objectives` map is empty), the summary is created without calculating quantiles: https://github.com/prometheus/client_golang/blob/v1.20.5/prometheus/summary.go#L121.
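As a minimal standalone sketch of the fix (the help text, error tolerances, and registry setup here are illustrative, not the operator's actual code), populating `Objectives` in `prometheus.SummaryOpts` is enough to make the summary expose quantile series:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// With a nil/empty Objectives map the summary exposes only _sum and
	// _count; each entry here maps a target quantile to its allowed error.
	latency := prometheus.NewSummary(prometheus.SummaryOpts{
		Name: "spark_application_start_latency_seconds",
		Help: "Start latency of SparkApplications.", // illustrative help text
		Objectives: map[float64]float64{
			0.5:  0.05,
			0.9:  0.01,
			0.99: 0.001,
		},
	})

	reg := prometheus.NewRegistry()
	reg.MustRegister(latency)

	// Record some fake observations so the quantiles have data to work with.
	for i := 1; i <= 100; i++ {
		latency.Observe(float64(i) / 10.0)
	}

	// Gathering now yields one quantile-labeled series per configured
	// objective, in addition to the usual _sum and _count.
	mfs, err := reg.Gather()
	if err != nil {
		panic(err)
	}
	for _, mf := range mfs {
		fmt.Println(mf)
	}
}
```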
Describe the solution you would like

The user should be able to configure the calculated quantiles via the `values.yaml` file when deploying the Helm chart. We should also configure the summary metrics by default to expose some pre-defined quantiles if the user does not provide any values (or even if we do not allow user configuration); these can be a small list like `0.5`, `0.9`, `0.99`, etc. One possible shape for the chart values is sketched below.
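For instance, a rough sketch of what this could look like in `values.yaml` (the `metrics.summaryQuantiles` key is hypothetical and does not exist in the current chart):

```yaml
metrics:
  # Hypothetical key: maps each target quantile to its allowed absolute
  # error, mirroring the Objectives map of the Prometheus Go client.
  summaryQuantiles:
    "0.5": 0.05
    "0.9": 0.01
    "0.99": 0.001
```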
Describe alternatives you have considered

No response
Additional context
No response
Love this feature?
Give it a 👍 We prioritize the features with the most 👍