
Grafana: improved aspnetcore.json #7021

Open
wants to merge 9 commits into base: main
76 changes: 44 additions & 32 deletions src/Grafana/dashboards/aspnetcore.json
@@ -198,7 +198,7 @@
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"expr": "histogram_quantile(0.50, sum(rate(http_server_request_duration_seconds_bucket{job=\"$job\", instance=\"$instance\"}[$__rate_interval])) by (le))",
"expr": "histogram_quantile(0.50, sum(rate(http_server_request_duration_seconds_bucket{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (le))",
Author comment:
You're going to see this change on every data plot on every chart:

{job="$job", instance="$instance"}

to

{job=~"$job", instance=~"$instance"}

The =~ is the key change here - it allows Grafana to expand the variables to include all of the selected job / instance values. I'm not going to comment on every instance of this change because it's all the same, but that's what it is.
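To make the difference concrete, here is a minimal PromQL sketch (the label values are invented for illustration):

# Exact matcher: only matches the single literal value Grafana substitutes
http_server_request_duration_seconds_bucket{job="my-app", instance="10.0.0.1:9090"}

# Regex matcher: matches whatever the multi-select variable expands to, e.g. "app-a|app-b"
http_server_request_duration_seconds_bucket{job=~"app-a|app-b", instance=~".+"}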

Author comment:
So this chart is the 99% latency chart; in addition to changing it to support multi-select, I've also modified it to use the $__range variable in Grafana, which corresponds to the dashboard's selected time window, instead of the $__rate_interval variable, which is roughly 4x the Prometheus scrape interval: https://grafana.com/blog/2020/09/28/new-in-grafana-7.2-__rate_interval-for-prometheus-rate-queries-that-just-work/

$__rate_interval is useful for rate queries, but that's not what we're measuring here - we're instead trying to answer "what was the 99th-percentile latency over X period of time?" The $__range value is better for that, and I've made that change in several charts.

Let's do a before and after comparison for this chart specifically: same app and time-range, but only a single instance:

Before

[screenshot]

This PR

[screenshot]

This is probably mostly a taste thing, but what I appreciate about the latter chart is the following:

  1. It's possible to "scope in" using cursor selection in Grafana if I want to look at a specific time period on the chart. With the original chart I get "no data" in many scenarios.
  2. This chart design scales down to smaller time scales for applications that have less request data - this application does ~1M requests per day, and I still get zero data with the default chart design unless I zoom out to roughly a 24-hour time window.

The only drawback of this second design is that really large outliers are harder to see than with the original chart. Averaging rates over a longer period of time is what makes this design more reliable at showing data over smaller time intervals - you can still see where spikes occur, such as this one here:

[screenshot]

But it's not nearly as pronounced. Happy to take feedback on what's more appropriate here, but making sure the chart worked correctly across a wider range of traffic workloads / time ranges was my objective here.
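For reference, a $__range-based version of the p99 query would look roughly like this (a sketch using the same metric and labels as the diff above; the exact committed panel JSON may differ slightly):

histogram_quantile(0.99, sum by (le) (
  rate(http_server_request_duration_seconds_bucket{job=~"$job", instance=~"$instance"}[$__range])
))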

"legendFormat": "p50",
"range": true,
"refId": "p50"
@@ -209,7 +209,7 @@
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"expr": "histogram_quantile(0.75, sum(rate(http_server_request_duration_seconds_bucket{job=\"$job\", instance=\"$instance\"}[$__rate_interval])) by (le))",
"expr": "histogram_quantile(0.75, sum(rate(http_server_request_duration_seconds_bucket{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (le))",
"hide": false,
"legendFormat": "p75",
"range": true,
@@ -221,7 +221,7 @@
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"expr": "histogram_quantile(0.90, sum(rate(http_server_request_duration_seconds_bucket{job=\"$job\", instance=\"$instance\"}[$__rate_interval])) by (le))",
"expr": "histogram_quantile(0.90, sum(rate(http_server_request_duration_seconds_bucket{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (le))",
"hide": false,
"legendFormat": "p90",
"range": true,
@@ -233,7 +233,7 @@
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"expr": "histogram_quantile(0.95, sum(rate(http_server_request_duration_seconds_bucket{job=\"$job\", instance=\"$instance\"}[$__rate_interval])) by (le))",
"expr": "histogram_quantile(0.95, sum(rate(http_server_request_duration_seconds_bucket{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (le))",
"hide": false,
"legendFormat": "p95",
"range": true,
@@ -245,7 +245,7 @@
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"expr": "histogram_quantile(0.98, sum(rate(http_server_request_duration_seconds_bucket{job=\"$job\", instance=\"$instance\"}[$__rate_interval])) by (le))",
"expr": "histogram_quantile(0.98, sum(rate(http_server_request_duration_seconds_bucket{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (le))",
"hide": false,
"legendFormat": "p98",
"range": true,
@@ -257,7 +257,7 @@
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"expr": "histogram_quantile(0.99, sum(rate(http_server_request_duration_seconds_bucket{job=\"$job\", instance=\"$instance\"}[$__rate_interval])) by (le))",
"expr": "histogram_quantile(0.99, sum(rate(http_server_request_duration_seconds_bucket{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (le))",
"hide": false,
"legendFormat": "p99",
"range": true,
@@ -269,7 +269,7 @@
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"expr": "histogram_quantile(0.999, sum(rate(http_server_request_duration_seconds_bucket{job=\"$job\", instance=\"$instance\"}[$__rate_interval])) by (le))",
"expr": "histogram_quantile(0.999, sum(rate(http_server_request_duration_seconds_bucket{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (le))",
"hide": false,
"legendFormat": "p99.9",
"range": true,
@@ -413,6 +413,7 @@
"showLegend": true
},
"tooltip": {
"maxHeight": 600,
"mode": "multi",
"sort": "none"
}
@@ -424,7 +425,7 @@
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"expr": "sum(rate(http_server_request_duration_seconds_bucket{job=\"$job\", instance=\"$instance\", http_response_status_code=~\"4..|5..\"}[$__rate_interval]) or vector(0)) / sum(rate(http_server_request_duration_seconds_bucket{job=\"$job\", instance=\"$instance\"}[$__rate_interval]))",
Author comment:
This is the error rate chart, which had some of the same problems that the Requests Duration chart did:

  1. It didn't show data over many intervals.
  2. The data it did show didn't always make a lot of sense.

Before

[screenshot]

I actually don't know what to make of those error rates - 0% last and 97.5% max? What does that mean?

This PR

We've changed the error rate chart to use $__range, and now the story it's telling is much clearer:

[screenshot]

I have a roughly 70% 400-level error rate over the past 24 hours - and that makes sense with what I know about this application: it runs a private NuGet package endpoint and we get a lot of 404s from clients looking for packages that this server doesn't host, as a function of how NuGet clients are designed.

If I narrow the range to the past 6 hours, which mostly covers Saturday, when our customers aren't working:

[screenshot]

That 404 rate scales down, with the rest of our traffic, to roughly 11%. It's now easier to tell that by looking at the charts than before.
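For reference, the error-rate ratio has this general shape (a sketch; the committed expressions are in the diff below, and the lookback window depends on the panel):

  sum(rate(http_server_request_duration_seconds_bucket{job=~"$job", instance=~"$instance", http_response_status_code=~"4..|5.."}[$__range]) or vector(0))
/
  sum(rate(http_server_request_duration_seconds_bucket{job=~"$job", instance=~"$instance"}[$__range]))

The or vector(0) keeps the numerator defined when there are no errors in the window, so the panel reports 0% instead of "no data".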

"expr": "sum(rate(http_server_request_duration_seconds_bucket{job=~\"$job\", instance=~\"$instance\", http_response_status_code=~\"4..|5..\"}[$__rate_interval]) or vector(0)) / sum(rate(http_server_request_duration_seconds_bucket{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval]))",
"legendFormat": "All",
"range": true,
"refId": "All"
@@ -435,7 +436,7 @@
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"expr": "sum(rate(http_server_request_duration_seconds_bucket{job=\"$job\", instance=\"$instance\", http_response_status_code=~\"4..\"}[$__rate_interval]) or vector(0)) / sum(rate(http_server_request_duration_seconds_bucket{job=\"$job\", instance=\"$instance\"}[$__rate_interval]))",
"expr": "sum(rate(http_server_request_duration_seconds_bucket{job=~\"$job\", instance=~\"$instance\", http_response_status_code=~\"4..\"}[$__rate_interval]) or vector(0)) / sum(rate(http_server_request_duration_seconds_bucket{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval]))",
"hide": false,
"legendFormat": "4XX",
"range": true,
@@ -447,7 +448,7 @@
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"expr": "sum(rate(http_server_request_duration_seconds_bucket{job=\"$job\", instance=\"$instance\", http_response_status_code=~\"5..\"}[$__rate_interval]) or vector(0)) / sum(rate(http_server_request_duration_seconds_bucket{job=\"$job\", instance=\"$instance\"}[$__rate_interval]))",
"expr": "sum(rate(http_server_request_duration_seconds_bucket{job=~\"$job\", instance=~\"$instance\", http_response_status_code=~\"5..\"}[$__rate_interval]) or vector(0)) / sum(rate(http_server_request_duration_seconds_bucket{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval]))",
"hide": false,
"legendFormat": "5XX",
"range": true,
@@ -532,6 +533,7 @@
"showLegend": false
},
"tooltip": {
"maxHeight": 600,
"mode": "single",
"sort": "none"
}
@@ -543,8 +545,8 @@
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"expr": "sum(kestrel_active_connections{job=\"$job\", instance=\"$instance\"})",
"legendFormat": "__auto",
"expr": "sum(kestrel_active_connections{job=~\"$job\", instance=~\"$instance\"})",
"legendFormat": "active connections",
Author comment:
Changed the legend to use a manual entry because otherwise, now that we can aggregate Kestrel data across multiple applications, you end up with a very verbose auto-generated label in the legend.

"range": true,
"refId": "A"
}
@@ -627,6 +629,7 @@
"showLegend": false
},
"tooltip": {
"maxHeight": 600,
"mode": "single",
"sort": "none"
}
@@ -638,8 +641,8 @@
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"expr": "sum(http_server_active_requests{job=\"$job\", instance=\"$instance\"})",
"legendFormat": "__auto",
"expr": "sum(http_server_active_requests{job=~\"$job\", instance=~\"$instance\"})",
"legendFormat": "active requests",
Author comment:
Same deal as with the Kestrel connections - manually renamed the legend here.

I toyed with the idea of breaking out active requests by service / instance - that's easily doable - but opted for something simpler: the user can get the same data by changing the template variables in the selector.
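If someone does want that breakdown on the panel itself, a hypothetical variant (not part of this PR) would be:

sum by (job, instance) (http_server_active_requests{job=~"$job", instance=~"$instance"})

paired with a legend format such as {{job}} - {{instance}}.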

"range": true,
"refId": "A"
}
@@ -695,8 +698,10 @@
"fields": "",
"values": false
},
"showPercentChange": false,
"text": {},
"textMode": "value"
"textMode": "value",
"wideLayout": true
},
"pluginVersion": "",
"targets": [
@@ -707,7 +712,7 @@
},
"editorMode": "code",
"exemplar": false,
"expr": "sum(http_server_request_duration_seconds_count{job=\"$job\", instance=\"$instance\"})",
"expr": "sum(increase(http_server_request_duration_seconds_count{job=~\"$job\", instance=~\"$instance\"}[$__range]))",
Author comment:
This is the "Total Requests" counter - the way it currently works, it just shows the counter's total value over all time, which means the value never goes down regardless of what time range you pick.

Before

Case in point - this is my "24 hour" view:

[screenshot]

This is my "past 6 hours" view with the current chart - I took this screenshot second, and it displays a higher value:

[screenshot]

This PR

So we've made two changes, other than the usual bits with the =~ selector for multi-value support:

  • Use the increase function in PromQL to compute the change in the counter between the start and end of the chart's time range, and
  • Use the [$__range] value to scope the increase function to that window.

Now we get what we're really interested in: the total requests that occurred during the time period:

Past 24 hours

[screenshot]

Past 6 hours

[screenshot]

That's more like it!
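To make the before / after concrete, both expressions from the diff above, side by side:

# Before: the lifetime value of the counter - it never shrinks when you narrow the time range
sum(http_server_request_duration_seconds_count{job="$job", instance="$instance"})

# After: only the requests that occurred inside the selected window
sum(increase(http_server_request_duration_seconds_count{job=~"$job", instance=~"$instance"}[$__range]))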

"instant": false,
"legendFormat": "__auto",
"range": true,
@@ -765,8 +770,10 @@
"fields": "",
"values": false
},
"showPercentChange": false,
"text": {},
"textMode": "value"
"textMode": "value",
"wideLayout": true
},
"pluginVersion": "",
"targets": [
@@ -777,7 +784,7 @@
},
"editorMode": "code",
"exemplar": false,
"expr": "sum(http_server_request_duration_seconds_count{job=\"$job\", instance=\"$instance\", error_type!=\"\"})",
"expr": "sum(increase(http_server_request_duration_seconds_count{job=~\"$job\", instance=~\"$instance\", error_type!=\"\"}[$__range]))",
Author comment:
This is the Unhandled Exceptions chart - I made the same changes to it as I did on the "Total Requests" chart.

Before

Over the last 6 hours

[screenshot]

This PR

Over the last 6 hours

[screenshot]

This is accurate - per my traces and logs, there haven't been any exceptions observed for this application over the past 6 hours.

"instant": false,
"legendFormat": "__auto",
"range": true,
@@ -835,7 +842,9 @@
"fields": "",
"values": false
},
"textMode": "value_and_name"
"showPercentChange": false,
"textMode": "value_and_name",
"wideLayout": true
},
"pluginVersion": "",
"targets": [
@@ -845,7 +854,7 @@
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"expr": "sum by (url_scheme) (\r\n max_over_time(http_server_request_duration_seconds_count{job=\"$job\", instance=\"$instance\"}[$__rate_interval])\r\n )",
"expr": "sum by (url_scheme) (\r\n increase(http_server_request_duration_seconds_count{job=~\"$job\", instance=~\"$instance\"}[$__range])\r\n )",
Author comment:
This is the "Requests Secured" chart - rather than picking out just the max value of this counter over the rate interval, which doesn't really change, we now compute the delta just like we did with the "Total Requests" chart.

Before

[screenshot]

Notice how the chart is always growing to the right? That's because the old query doesn't take the change in counter values into account - that's what the increase function does for us.

This PR

[screenshot]

Now we can see some livelier data, showing dips and valleys in HTTP traffic to boot.
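Unescaped, the committed expression is simply:

sum by (url_scheme) (
  increase(http_server_request_duration_seconds_count{job=~"$job", instance=~"$instance"}[$__range])
)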

"legendFormat": "{{scheme}}",
"range": true,
"refId": "A"
@@ -902,7 +911,9 @@
"fields": "",
"values": false
},
"textMode": "value_and_name"
"showPercentChange": false,
"textMode": "value_and_name",
"wideLayout": true
},
"pluginVersion": "",
"targets": [
@@ -912,7 +923,7 @@
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"expr": "sum by (method_route) (\r\n label_replace(max_over_time(http_server_request_duration_seconds_count{job=\"$job\", instance=\"$instance\"}[$__rate_interval]), \"method_route\", \"http/$1\", \"network_protocol_version\", \"(.*)\")\r\n )",
"expr": "sum by (method_route) (\r\n label_replace(increase(http_server_request_duration_seconds_count{job=~\"$job\", instance=~\"$instance\"}[$__range]), \"method_route\", \"http/$1\", \"network_protocol_version\", \"(.*)\")\r\n )",
Author comment:
This is the "Requests HTTP Protocol" chart - made the same changes to it as we did with "Requests Secured".

Before

[screenshot]

This PR

[screenshot]
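Unescaped, the committed expression reads as follows; label_replace copies the network_protocol_version label into a new method_route label prefixed with "http/" so the stat panel can group on it:

sum by (method_route) (
  label_replace(
    increase(http_server_request_duration_seconds_count{job=~"$job", instance=~"$instance"}[$__range]),
    "method_route", "http/$1", "network_protocol_version", "(.*)"
  )
)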

"legendFormat": "{{protocol}}",
"range": true,
"refId": "A"
@@ -1059,7 +1070,7 @@
},
"editorMode": "code",
"exemplar": false,
"expr": " topk(10,\r\n sum by (http_route, http_request_method, method_route) (\r\n label_join(max_over_time(http_server_request_duration_seconds_count{job=\"$job\", instance=\"$instance\", http_route!=\"\"}[$__rate_interval]), \"method_route\", \" \", \"http_request_method\", \"http_route\")\r\n ))",
"expr": " topk(10,\r\n sum by (http_route, http_request_method, method_route) (\r\n label_join(floor(increase(http_server_request_duration_seconds_count{job=~\"$job\", instance=~\"$instance\", http_route!=\"\"}[$__range])), \"method_route\", \" \", \"http_request_method\", \"http_route\")\r\n ))",
Author comment:
This is the "Top 10 Requested Endpoints" graph - in addition to adding multi-variable support, we've done two important things:

  1. Scoped the number of requests per endpoint to the $__range - same idea as some of the other changes; these values are now dynamic per the selected time period.
  2. Since we're now computing a delta with increase, I decided to floor that value so we can shave off the fractional part - an aesthetic choice.

Before

I had to scope to 24 hours because the 6 hour chart rendered no data - that issue does not occur in the new version in this PR.

[screenshot]

These are cumulative counter values - we didn't actually serve this much traffic to these endpoints during this time.

This PR

[screenshot]

These values are accurate and I can verify these numbers in our log aggregation system too.
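Unescaped, the committed query reads:

topk(10,
  sum by (http_route, http_request_method, method_route) (
    label_join(
      floor(increase(http_server_request_duration_seconds_count{job=~"$job", instance=~"$instance", http_route!=""}[$__range])),
      "method_route", " ", "http_request_method", "http_route"
    )
  )
)

label_join concatenates the HTTP method and route into a single method_route label (e.g. "GET /my/route"), floor trims the fractional part that increase introduces, and topk(10, ...) keeps only the ten busiest endpoints.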

"format": "table",
"instant": true,
"interval": "",
@@ -1233,7 +1244,7 @@
},
"editorMode": "code",
"exemplar": false,
"expr": " topk(10,\r\n sum by (http_route, http_request_method, method_route) (\r\n label_join(max_over_time(http_server_request_duration_seconds_count{job=\"$job\", instance=\"$instance\", http_route!=\"\", error_type!=\"\"}[$__rate_interval]), \"method_route\", \" \", \"http_request_method\", \"http_route\")\r\n ))",
Author comment:
This is the unhandled exceptions by endpoint chart - made the same changes as the "Top 10 Requested Endpoints" chart.

Before

I can't get any data to show up at all in the current incarnation of this chart, even if I scope out to 24 hours.

This PR

[screenshot]

"expr": " topk(10,\r\n sum by (http_route, http_request_method, method_route) (\r\n label_join(floor(increase(http_server_request_duration_seconds_count{job=~\"$job\", instance=~\"$instance\", http_route!=\"\", error_type!=\"\"}[$__rate_interval])), \"method_route\", \" \", \"http_request_method\", \"http_route\")\r\n ))",
"format": "table",
"instant": true,
"interval": "",
@@ -1270,9 +1281,9 @@
"type": "table"
}
],
"refresh": "10s",
"refresh": "1m",
"revision": 1,
"schemaVersion": 38,
"schemaVersion": 39,
"tags": [
"dotnet",
"prometheus",
@@ -1288,9 +1299,9 @@
},
"definition": "label_values(http_server_active_requests,job)",
"hide": 0,
"includeAll": false,
"includeAll": true,
Author comment:
Enables the "All" option to be used for this variable.

"label": "Job",
"multi": false,
"multi": true,
Author comment:
Enables multi-select to be used, which is now feasible since all of the PromQL queries have been updated to support it.
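For context, the relevant fields of the job variable end up looking roughly like this (trimmed to the fields this PR touches):

"includeAll": true,
"label": "Job",
"multi": true,
"name": "job",

With multi-select enabled and the =~ matchers above, Grafana expands $job into a regex alternation of all selected values (for example app-a|app-b).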

"name": "job",
"options": [],
"query": {
@@ -1311,9 +1322,9 @@
},
"definition": "label_values(http_server_active_requests{job=~\"$job\"},instance)",
"hide": 0,
"includeAll": false,
"includeAll": true,
"label": "Instance",
"multi": false,
"multi": true,
"name": "instance",
"options": [],
"query": {
@@ -1329,9 +1340,10 @@
]
},
"time": {
"from": "now-5m",
"from": "now-24h",
"to": "now"
},
"timeRangeUpdatedDuringEditOrView": false,
"timepicker": {
"refresh_intervals": [
"1s",
@@ -1350,6 +1362,6 @@
"timezone": "",
"title": "ASP.NET Core",
"uid": "KdDACDp4z",
"version": 1,
"version": 2,
Author comment:
Need to bump the dashboard version if this gets pushed to Grafana Cloud.

"weekStart": ""
}