What feature you would like to be added?
It would be useful to allow the user to configure the quantiles calculated by Prometheus summary metrics, and at the minimum calculate some pre-determined basic quantiles (such as `0.5`, `0.9`, `0.99`, etc.). There are 3 such metrics defined here: https://github.com/kubeflow/spark-operator/blob/v2.1.0/internal/metrics/sparkapplication_metrics.go#L57.

- `spark_application_success_execution_time_seconds`
- `spark_application_failure_execution_time_seconds`
- `spark_application_start_latency_seconds`
Why is this needed?
With this change, the summary metrics become more valuable (and flexible) for monitoring purposes; seeing just the count and sum of execution time and latency metrics is not very useful.

For example, right now, all that is exposed in terms of start latency (the `spark_application_start_latency_seconds` metric) is the sum of start latencies in seconds across the controller's lifetime (accessible via `spark_application_start_latency_seconds_sum`) and the count of start latency observations across the controller's lifetime (accessible via `spark_application_start_latency_seconds_count`).
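To illustrate, this is roughly what the controller's metrics endpoint exposes for this metric today (the sample values are made up); note that there are no `quantile`-labeled series at all:

```
# TYPE spark_application_start_latency_seconds summary
spark_application_start_latency_seconds_sum 42.7
spark_application_start_latency_seconds_count 12
```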
The reason only the count and sum are being exposed, despite Prometheus summaries having quantile support, is that we are not providing `Objectives` when instantiating the `SummaryOpts` objects: https://github.com/kubeflow/spark-operator/blob/v2.1.0/internal/metrics/sparkapplication_metrics.go#L105. See the documentation on the Prometheus Go client, which explains that by default (when the `Objectives` map is empty), the summary is created without calculating quantiles: https://github.com/prometheus/client_golang/blob/v1.20.5/prometheus/summary.go#L121.
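As a minimal standalone sketch of the fix (the help text, error tolerances, and registry setup here are illustrative, not the operator's actual code), populating `Objectives` in `prometheus.SummaryOpts` is enough to make the summary expose quantile series:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// With a nil/empty Objectives map the summary exposes only _sum and
	// _count; each entry here maps a target quantile to its allowed error.
	latency := prometheus.NewSummary(prometheus.SummaryOpts{
		Name: "spark_application_start_latency_seconds",
		Help: "Start latency of SparkApplications.", // illustrative help text
		Objectives: map[float64]float64{
			0.5:  0.05,
			0.9:  0.01,
			0.99: 0.001,
		},
	})

	reg := prometheus.NewRegistry()
	reg.MustRegister(latency)

	// Record some fake observations so the quantiles have data to work with.
	for i := 1; i <= 100; i++ {
		latency.Observe(float64(i) / 10.0)
	}

	// Gathering now yields one quantile-labeled series per configured
	// objective, in addition to the usual _sum and _count.
	mfs, err := reg.Gather()
	if err != nil {
		panic(err)
	}
	for _, mf := range mfs {
		fmt.Println(mf)
	}
}
```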
Describe the solution you would like

The user should be able to configure the calculated quantiles via the `values.yaml` file when deploying the Helm chart. We should also configure the summary metrics by default to expose some pre-defined quantiles if the user does not provide any values (or even if we do not allow user configuration); these can be a small list like `0.5`, `0.9`, `0.99`, etc. One possible shape for the chart values is sketched below.
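For instance, a rough sketch of what this could look like in `values.yaml` (the `metrics.summaryQuantiles` key is hypothetical and does not exist in the current chart):

```yaml
metrics:
  # Hypothetical key: maps each target quantile to its allowed absolute
  # error, mirroring the Objectives map of the Prometheus Go client.
  summaryQuantiles:
    "0.5": 0.05
    "0.9": 0.01
    "0.99": 0.001
```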
Describe alternatives you have considered

No response
Additional context
No response
Love this feature?
Give it a 👍 We prioritize the features with the most 👍