Grafana: improved aspnetcore.json #7021
base: main
@@ -198,7 +198,7 @@ | |
"uid": "${DS_PROMETHEUS}" | ||
}, | ||
"editorMode": "code", | ||
"expr": "histogram_quantile(0.50, sum(rate(http_server_request_duration_seconds_bucket{job=\"$job\", instance=\"$instance\"}[$__rate_interval])) by (le))", | ||
"expr": "histogram_quantile(0.50, sum(rate(http_server_request_duration_seconds_bucket{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (le))", | ||
"legendFormat": "p50", | ||
"range": true, | ||
"refId": "p50" | ||
|
@@ -209,7 +209,7 @@ | |
"uid": "${DS_PROMETHEUS}" | ||
}, | ||
"editorMode": "code", | ||
"expr": "histogram_quantile(0.75, sum(rate(http_server_request_duration_seconds_bucket{job=\"$job\", instance=\"$instance\"}[$__rate_interval])) by (le))", | ||
"expr": "histogram_quantile(0.75, sum(rate(http_server_request_duration_seconds_bucket{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (le))", | ||
"hide": false, | ||
"legendFormat": "p75", | ||
"range": true, | ||
|
@@ -221,7 +221,7 @@ | |
"uid": "${DS_PROMETHEUS}" | ||
}, | ||
"editorMode": "code", | ||
"expr": "histogram_quantile(0.90, sum(rate(http_server_request_duration_seconds_bucket{job=\"$job\", instance=\"$instance\"}[$__rate_interval])) by (le))", | ||
"expr": "histogram_quantile(0.90, sum(rate(http_server_request_duration_seconds_bucket{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (le))", | ||
"hide": false, | ||
"legendFormat": "p90", | ||
"range": true, | ||
|
@@ -233,7 +233,7 @@ | |
"uid": "${DS_PROMETHEUS}" | ||
}, | ||
"editorMode": "code", | ||
"expr": "histogram_quantile(0.95, sum(rate(http_server_request_duration_seconds_bucket{job=\"$job\", instance=\"$instance\"}[$__rate_interval])) by (le))", | ||
"expr": "histogram_quantile(0.95, sum(rate(http_server_request_duration_seconds_bucket{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (le))", | ||
"hide": false, | ||
"legendFormat": "p95", | ||
"range": true, | ||
|
@@ -245,7 +245,7 @@ | |
"uid": "${DS_PROMETHEUS}" | ||
}, | ||
"editorMode": "code", | ||
"expr": "histogram_quantile(0.98, sum(rate(http_server_request_duration_seconds_bucket{job=\"$job\", instance=\"$instance\"}[$__rate_interval])) by (le))", | ||
"expr": "histogram_quantile(0.98, sum(rate(http_server_request_duration_seconds_bucket{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (le))", | ||
"hide": false, | ||
"legendFormat": "p98", | ||
"range": true, | ||
|
@@ -257,7 +257,7 @@ | |
"uid": "${DS_PROMETHEUS}" | ||
}, | ||
"editorMode": "code", | ||
"expr": "histogram_quantile(0.99, sum(rate(http_server_request_duration_seconds_bucket{job=\"$job\", instance=\"$instance\"}[$__rate_interval])) by (le))", | ||
"expr": "histogram_quantile(0.99, sum(rate(http_server_request_duration_seconds_bucket{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (le))", | ||
"hide": false, | ||
"legendFormat": "p99", | ||
"range": true, | ||
|
@@ -269,7 +269,7 @@ | |
"uid": "${DS_PROMETHEUS}" | ||
}, | ||
"editorMode": "code", | ||
"expr": "histogram_quantile(0.999, sum(rate(http_server_request_duration_seconds_bucket{job=\"$job\", instance=\"$instance\"}[$__rate_interval])) by (le))", | ||
"expr": "histogram_quantile(0.999, sum(rate(http_server_request_duration_seconds_bucket{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (le))", | ||
"hide": false, | ||
"legendFormat": "p99.9", | ||
"range": true, | ||
|
@@ -413,6 +413,7 @@ | |
"showLegend": true | ||
}, | ||
"tooltip": { | ||
"maxHeight": 600, | ||
"mode": "multi", | ||
"sort": "none" | ||
} | ||
|
@@ -424,7 +425,7 @@ | |
"uid": "${DS_PROMETHEUS}" | ||
}, | ||
"editorMode": "code", | ||
"expr": "sum(rate(http_server_request_duration_seconds_bucket{job=\"$job\", instance=\"$instance\", http_response_status_code=~\"4..|5..\"}[$__rate_interval]) or vector(0)) / sum(rate(http_server_request_duration_seconds_bucket{job=\"$job\", instance=\"$instance\"}[$__rate_interval]))", | ||
This is the error rate chart, which had some of the same problems that the Requests Duration chart did:

Before

I actually don't know what to make of those error rates - 0% last and 97.5% max? What does that mean?

This PR

We've changed the error rate chart to use the `=~` matchers. I have a roughly 70% 400-level error rate over the past 24 hours - and that makes sense with what I know about this application: it runs a private NuGet package endpoint and we get a lot of 404s from clients looking for packages that this server doesn't host, as a function of how NuGet clients are designed. If I scope to the past 6 hours, which mostly covers Saturdays when our customers aren't working, that 404 rate scales down, with the rest of our traffic, to roughly 11%. It's now easier to tell that by looking at the charts than before.
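For readability, here's the new "All"-series error-rate expression broken across lines (this is the same query that appears in the diff below; the `or vector(0)` keeps the numerator at zero when no 4xx/5xx series exist in the window):

sum(
  rate(http_server_request_duration_seconds_bucket{job=~"$job", instance=~"$instance", http_response_status_code=~"4..|5.."}[$__rate_interval])
  or vector(0)
)
/
sum(
  rate(http_server_request_duration_seconds_bucket{job=~"$job", instance=~"$instance"}[$__rate_interval])
)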
||
"expr": "sum(rate(http_server_request_duration_seconds_bucket{job=~\"$job\", instance=~\"$instance\", http_response_status_code=~\"4..|5..\"}[$__rate_interval]) or vector(0)) / sum(rate(http_server_request_duration_seconds_bucket{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval]))", | ||
"legendFormat": "All", | ||
"range": true, | ||
"refId": "All" | ||
|
@@ -435,7 +436,7 @@ | |
"uid": "${DS_PROMETHEUS}" | ||
}, | ||
"editorMode": "code", | ||
"expr": "sum(rate(http_server_request_duration_seconds_bucket{job=\"$job\", instance=\"$instance\", http_response_status_code=~\"4..\"}[$__rate_interval]) or vector(0)) / sum(rate(http_server_request_duration_seconds_bucket{job=\"$job\", instance=\"$instance\"}[$__rate_interval]))", | ||
"expr": "sum(rate(http_server_request_duration_seconds_bucket{job=~\"$job\", instance=~\"$instance\", http_response_status_code=~\"4..\"}[$__rate_interval]) or vector(0)) / sum(rate(http_server_request_duration_seconds_bucket{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval]))", | ||
"hide": false, | ||
"legendFormat": "4XX", | ||
"range": true, | ||
|
@@ -447,7 +448,7 @@ | |
"uid": "${DS_PROMETHEUS}" | ||
}, | ||
"editorMode": "code", | ||
"expr": "sum(rate(http_server_request_duration_seconds_bucket{job=\"$job\", instance=\"$instance\", http_response_status_code=~\"5..\"}[$__rate_interval]) or vector(0)) / sum(rate(http_server_request_duration_seconds_bucket{job=\"$job\", instance=\"$instance\"}[$__rate_interval]))", | ||
"expr": "sum(rate(http_server_request_duration_seconds_bucket{job=~\"$job\", instance=~\"$instance\", http_response_status_code=~\"5..\"}[$__rate_interval]) or vector(0)) / sum(rate(http_server_request_duration_seconds_bucket{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval]))", | ||
"hide": false, | ||
"legendFormat": "5XX", | ||
"range": true, | ||
|
@@ -532,6 +533,7 @@ | |
"showLegend": false | ||
}, | ||
"tooltip": { | ||
"maxHeight": 600, | ||
"mode": "single", | ||
"sort": "none" | ||
} | ||
|
@@ -543,8 +545,8 @@ | |
"uid": "${DS_PROMETHEUS}" | ||
}, | ||
"editorMode": "code", | ||
"expr": "sum(kestrel_active_connections{job=\"$job\", instance=\"$instance\"})", | ||
"legendFormat": "__auto", | ||
"expr": "sum(kestrel_active_connections{job=~\"$job\", instance=~\"$instance\"})", | ||
"legendFormat": "active connections", | ||
Changed the legend to use a manual entry because otherwise, now that we can aggregate Kestrel data across multiple applications, you end up with a very gross and verbose-looking auto-generated tooltip here on the legend.
||
"range": true, | ||
"refId": "A" | ||
} | ||
|
@@ -627,6 +629,7 @@ | |
"showLegend": false | ||
}, | ||
"tooltip": { | ||
"maxHeight": 600, | ||
"mode": "single", | ||
"sort": "none" | ||
} | ||
|
@@ -638,8 +641,8 @@ | |
"uid": "${DS_PROMETHEUS}" | ||
}, | ||
"editorMode": "code", | ||
"expr": "sum(http_server_active_requests{job=\"$job\", instance=\"$instance\"})", | ||
"legendFormat": "__auto", | ||
"expr": "sum(http_server_active_requests{job=~\"$job\", instance=~\"$instance\"})", | ||
"legendFormat": "active requests", | ||
Same deal as with the Kestrel connections - manually renamed the legend here. I toyed with the idea of breaking out active connections by service / instance and that's easily doable, but opted for something simpler - the user can get that same data by changing the template variables on the selector.
||
"range": true, | ||
"refId": "A" | ||
} | ||
|
@@ -695,8 +698,10 @@ | |
"fields": "", | ||
"values": false | ||
}, | ||
"showPercentChange": false, | ||
"text": {}, | ||
"textMode": "value" | ||
"textMode": "value", | ||
"wideLayout": true | ||
}, | ||
"pluginVersion": "", | ||
"targets": [ | ||
|
@@ -707,7 +712,7 @@ | |
}, | ||
"editorMode": "code", | ||
"exemplar": false, | ||
"expr": "sum(http_server_request_duration_seconds_count{job=\"$job\", instance=\"$instance\"})", | ||
"expr": "sum(increase(http_server_request_duration_seconds_count{job=~\"$job\", instance=~\"$instance\"}[$__range]))", | ||
This is the "Total Requests" counter - the way it's currently working is just counting the total value over all time, which means that regardless of what time range you pick, the value doesn't go down.

Before

Case in point - this is my "24 hour" view. This is my "past 6 hours" view with the current chart - I took this screenshot second and it displays a higher value.

This PR

So we've made two changes, other than the usual bits with the `=~` matchers: the counter is now wrapped in `increase()`, and it's evaluated over the dashboard's `$__range`, so only requests that occurred inside the selected window are counted.

Now we get what we're really interested in: the total requests that occurred during the time period:

Past 24 hours

Past 6 hours

That's more like it!
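Side by side, the change to this stat panel's query (both expressions are taken from the diff above):

# Before: cumulative counter - never goes down regardless of the selected time range
sum(http_server_request_duration_seconds_count{job="$job", instance="$instance"})

# After: requests that actually occurred inside the selected dashboard window
sum(increase(http_server_request_duration_seconds_count{job=~"$job", instance=~"$instance"}[$__range]))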
||
"instant": false, | ||
"legendFormat": "__auto", | ||
"range": true, | ||
|
@@ -765,8 +770,10 @@ | |
"fields": "", | ||
"values": false | ||
}, | ||
"showPercentChange": false, | ||
"text": {}, | ||
"textMode": "value" | ||
"textMode": "value", | ||
"wideLayout": true | ||
}, | ||
"pluginVersion": "", | ||
"targets": [ | ||
|
@@ -777,7 +784,7 @@ | |
}, | ||
"editorMode": "code", | ||
"exemplar": false, | ||
"expr": "sum(http_server_request_duration_seconds_count{job=\"$job\", instance=\"$instance\", error_type!=\"\"})", | ||
"expr": "sum(increase(http_server_request_duration_seconds_count{job=~\"$job\", instance=~\"$instance\", error_type!=\"\"}[$__range]))", | ||
"instant": false, | ||
"legendFormat": "__auto", | ||
"range": true, | ||
|
@@ -835,7 +842,9 @@ | |
"fields": "", | ||
"values": false | ||
}, | ||
"textMode": "value_and_name" | ||
"showPercentChange": false, | ||
"textMode": "value_and_name", | ||
"wideLayout": true | ||
}, | ||
"pluginVersion": "", | ||
"targets": [ | ||
|
@@ -845,7 +854,7 @@ | |
"uid": "${DS_PROMETHEUS}" | ||
}, | ||
"editorMode": "code", | ||
"expr": "sum by (url_scheme) (\r\n max_over_time(http_server_request_duration_seconds_count{job=\"$job\", instance=\"$instance\"}[$__rate_interval])\r\n )", | ||
"expr": "sum by (url_scheme) (\r\n increase(http_server_request_duration_seconds_count{job=~\"$job\", instance=~\"$instance\"}[$__range])\r\n )", | ||
This is the "Requests Secured" chart - rather than picking out just the max value of this counter over the given rate interval, which doesn't change, we now compute the delta just like we've done with the "Total Requests" chart.

Before

Notice how the chart is always growing to the right? That's because we're not taking the change in counter values into account - that's what the `increase()` function is for.

This PR

Now we can see some livelier data, showing dips and valleys in HTTP traffic to boot.
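The same before / after, pulled out of the diff for readability:

# Before: max of a monotonically increasing counter, so the panel only ever grows
sum by (url_scheme) (max_over_time(http_server_request_duration_seconds_count{job="$job", instance="$instance"}[$__rate_interval]))

# After: delta of the counter over the dashboard window, broken out by URL scheme
sum by (url_scheme) (increase(http_server_request_duration_seconds_count{job=~"$job", instance=~"$instance"}[$__range]))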
||
"legendFormat": "{{scheme}}", | ||
"range": true, | ||
"refId": "A" | ||
|
@@ -902,7 +911,9 @@ | |
"fields": "", | ||
"values": false | ||
}, | ||
"textMode": "value_and_name" | ||
"showPercentChange": false, | ||
"textMode": "value_and_name", | ||
"wideLayout": true | ||
}, | ||
"pluginVersion": "", | ||
"targets": [ | ||
|
@@ -912,7 +923,7 @@ | |
"uid": "${DS_PROMETHEUS}" | ||
}, | ||
"editorMode": "code", | ||
"expr": "sum by (method_route) (\r\n label_replace(max_over_time(http_server_request_duration_seconds_count{job=\"$job\", instance=\"$instance\"}[$__rate_interval]), \"method_route\", \"http/$1\", \"network_protocol_version\", \"(.*)\")\r\n )", | ||
"expr": "sum by (method_route) (\r\n label_replace(increase(http_server_request_duration_seconds_count{job=~\"$job\", instance=~\"$instance\"}[$__range]), \"method_route\", \"http/$1\", \"network_protocol_version\", \"(.*)\")\r\n )", | ||
"legendFormat": "{{protocol}}", | ||
"range": true, | ||
"refId": "A" | ||
|
@@ -1059,7 +1070,7 @@ | |
}, | ||
"editorMode": "code", | ||
"exemplar": false, | ||
"expr": " topk(10,\r\n sum by (http_route, http_request_method, method_route) (\r\n label_join(max_over_time(http_server_request_duration_seconds_count{job=\"$job\", instance=\"$instance\", http_route!=\"\"}[$__rate_interval]), \"method_route\", \" \", \"http_request_method\", \"http_route\")\r\n ))", | ||
"expr": " topk(10,\r\n sum by (http_route, http_request_method, method_route) (\r\n label_join(floor(increase(http_server_request_duration_seconds_count{job=~\"$job\", instance=~\"$instance\", http_route!=\"\"}[$__range])), \"method_route\", \" \", \"http_request_method\", \"http_route\")\r\n ))", | ||
This is the "Top 10 Requested Endpoints" graph - in addition to adding multi-variable support, we've done two important things: the counts are now computed with `increase()` over the dashboard's `$__range` instead of `max_over_time()` over `$__rate_interval`, and the result is wrapped in `floor()` so the table shows whole request counts. The full expression is reformatted below.

Before

I had to scope to 24 hours because the 6 hour chart rendered no data - that issue does not occur in the new version in this PR. These are cumulative counter values - we didn't actually serve this much traffic to these endpoints during this time.

This PR

These values are accurate, and I can verify these numbers in our log aggregation system too.
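Here's the new expression broken across lines (the same query as the single-line version in the diff above):

topk(10,
  sum by (http_route, http_request_method, method_route) (
    label_join(
      floor(increase(http_server_request_duration_seconds_count{job=~"$job", instance=~"$instance", http_route!=""}[$__range])),
      "method_route", " ", "http_request_method", "http_route"
    )
  )
)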
||
"format": "table", | ||
"instant": true, | ||
"interval": "", | ||
|
@@ -1233,7 +1244,7 @@ | |
}, | ||
"editorMode": "code", | ||
"exemplar": false, | ||
"expr": " topk(10,\r\n sum by (http_route, http_request_method, method_route) (\r\n label_join(max_over_time(http_server_request_duration_seconds_count{job=\"$job\", instance=\"$instance\", http_route!=\"\", error_type!=\"\"}[$__rate_interval]), \"method_route\", \" \", \"http_request_method\", \"http_route\")\r\n ))", | ||
"expr": " topk(10,\r\n sum by (http_route, http_request_method, method_route) (\r\n label_join(floor(increase(http_server_request_duration_seconds_count{job=~\"$job\", instance=~\"$instance\", http_route!=\"\", error_type!=\"\"}[$__rate_interval])), \"method_route\", \" \", \"http_request_method\", \"http_route\")\r\n ))", | ||
"format": "table", | ||
"instant": true, | ||
"interval": "", | ||
|
@@ -1270,9 +1281,9 @@ | |
"type": "table" | ||
} | ||
], | ||
"refresh": "10s", | ||
"refresh": "1m", | ||
"revision": 1, | ||
"schemaVersion": 38, | ||
"schemaVersion": 39, | ||
"tags": [ | ||
"dotnet", | ||
"prometheus", | ||
|
@@ -1288,9 +1299,9 @@ | |
}, | ||
"definition": "label_values(http_server_active_requests,job)", | ||
"hide": 0, | ||
"includeAll": false, | ||
"includeAll": true, | ||
Enables "All" to be used.
||
"label": "Job", | ||
"multi": false, | ||
"multi": true, | ||
Enables multi-select to be used, which is now feasible since all of the PromQL queries have been updated to support it.
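For reference, this is roughly what the `job` template variable looks like after these two changes (abbreviated - the full definition lives in the dashboard JSON):

{
  "definition": "label_values(http_server_active_requests,job)",
  "includeAll": true,
  "label": "Job",
  "multi": true,
  "name": "job"
}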
||
"name": "job", | ||
"options": [], | ||
"query": { | ||
|
@@ -1311,9 +1322,9 @@ | |
}, | ||
"definition": "label_values(http_server_active_requests{job=~\"$job\"},instance)", | ||
"hide": 0, | ||
"includeAll": false, | ||
"includeAll": true, | ||
"label": "Instance", | ||
"multi": false, | ||
"multi": true, | ||
"name": "instance", | ||
"options": [], | ||
"query": { | ||
|
@@ -1329,9 +1340,10 @@ | |
] | ||
}, | ||
"time": { | ||
"from": "now-5m", | ||
"from": "now-24h", | ||
"to": "now" | ||
}, | ||
"timeRangeUpdatedDuringEditOrView": false, | ||
"timepicker": { | ||
"refresh_intervals": [ | ||
"1s", | ||
|
@@ -1350,6 +1362,6 @@ | |
"timezone": "", | ||
"title": "ASP.NET Core", | ||
"uid": "KdDACDp4z", | ||
"version": 1, | ||
"version": 2, | ||
Need to bump the chart version if this gets pushed to Grafana Cloud.
||
"weekStart": "" | ||
} |
You're going to see this change on every data plot on every chart:

job="$job", instance="$instance"

to

job=~"$job", instance=~"$instance"

The `=~` being the key change here - this allows Grafana to expand the variables to include all of the selected `job` / `instance` values. I'm not going to comment on every instance of this change because it's all the same, but that's what this is.
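As a concrete (and hypothetical - the job names are made up) illustration of why the regex matcher is needed: when more than one value is selected, Grafana expands a multi-value Prometheus variable into a pipe-separated regex, which an equality matcher would treat as a literal string.

# Written in the dashboard:
sum(kestrel_active_connections{job=~"$job"})

# What Prometheus evaluates when the jobs "shop-api" and "billing-api" are both selected:
sum(kestrel_active_connections{job=~"shop-api|billing-api"})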
So this chart is the 99% latency chart, and in addition to changing it to support multi-select I've also modified it to use the `$__range` value in Grafana, which corresponds to the trailing time window, versus the `$__rate_interval` variable, which is just 4x the Prometheus scraping interval: https://grafana.com/blog/2020/09/28/new-in-grafana-7.2-__rate_interval-for-prometheus-rate-queries-that-just-work/

`$__rate_interval` is useful for rate queries, but that's not what we're measuring here - we're instead trying to determine "what was the 99% latency over X period of time?" The `$__range` value is better for that, and I've made that change in several charts.

Let's do a before and after comparison for this chart specifically: same app and time range, but only a single instance:

Before

This PR

This is probably mostly a taste thing, but there are a few things I appreciate about the latter chart. The only drawback of this second design is that it's harder to see really large outliers - those are easier to see with the original chart design. This design is averaging rates over a longer period of time, which is what makes it more reliable at showing data over smaller time intervals - you can still see where spikes occur, such as this instance here, but it's not nearly as pronounced. Happy to take feedback on what's more appropriate here, but making sure the chart worked correctly across a wider range of traffic workloads / time ranges was my objective here.
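A minimal sketch of the difference between the two windows, using the histogram from this dashboard (illustrative only - the exact expressions shipped in this PR are in the diff above):

# p99 from rates computed over $__rate_interval (roughly 4x the scrape interval):
histogram_quantile(0.99, sum by (le) (rate(http_server_request_duration_seconds_bucket{job=~"$job", instance=~"$instance"}[$__rate_interval])))

# p99 from rates computed over the trailing dashboard window ($__range):
histogram_quantile(0.99, sum by (le) (rate(http_server_request_duration_seconds_bucket{job=~"$job", instance=~"$instance"}[$__range])))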