forked from cloudfoundry/docs-cf-admin
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathtroubleshooting_slow_requests.html.md.erb
467 lines (378 loc) · 21.7 KB
/
troubleshooting_slow_requests.html.md.erb
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
---
title: Troubleshooting Slow Requests in Cloud Foundry
owner: CF for VMs Networking
---
<% current_page.data.title = "Troubleshooting Slow Requests in " + vars.app_runtime_abbr %>
This topic provides experiments that you can run to help determine what part of the
<%= vars.app_runtime_first %> request flow adds latency to your
requests. The procedures in this topic are intended to be run in the order presented.
## <a id="overview"></a>Overview
App requests typically transit the following components. Only the router (Gorouter) and app are within the scope of <%= vars.app_runtime_abbr %>. Operators can deploy the HAProxy load balancer that comes with <%= vars.app_runtime_abbr %> instead of, or in addition to, an infrastructure load balancer.
See the following diagram:
<%= image_tag("request_lifecycle.png", :alt => "There are four boxes labeled, from left to right, 'Client', 'Load Balancer', 'Router', and 'Backend'. A black box frames the 'Router' and 'Backend' boxes and is labeled 'CF'. A horizontal black arrow points left to right from the 'Client' box to the 'Load Balancer' box. A horizontal black arrow points left to right from the 'Load Balancer' box to the 'Router' box. A horizontal black arrow points left to right from the 'Router' box to the 'Backend' box. A horizontal black arrow points right to left from the 'Backend' box back to the 'Router' box. A horizontal black arrow points right to left from the 'Router' box back to the 'Load Balancer' box. A horizontal black arrow points right to left from the 'Load Balancer' box back to the 'Client' box.") %>
Any of the components in the diagram above can cause latency. Delays can also come from the network
itself.
To troubleshoot slow requests and diagnose what might be causing latency to your app requests:
* [Experiment 1: Measure Total Round-Trip App Requests](#total-latency)
* [Experiment 2: View Request Time in Access Logs](#within-cf)
* [Experiment 3: Duplicate Latency on Another Endpoint](#duplicate-latency)
* [Experiment 4: Remove the Load Balancer from the Request Path](#remove-the-load-balancer)
* [Experiment 5: Remove Gorouter from the Request Path](#remove-the-gorouter)
* [Experiment 6: Test the Network between the Router and the App](#test-internal-network)
After you determine the cause of latency, see the following sections for more information:
* [Use App Logs to Locate Delays in <%= vars.app_runtime_abbr %>](#app-logs)
* [Causes for Gorouter Latency](#gorouter-latency)
* [Operations Recommendations](#ops-recommendations)
## <a id="total-latency"></a> Experiment 1: Measure Total Round-Trip App Requests
To measure the total round-trip time for your deployed app that is experiencing latency, run:
```
time curl -v APP-ENDPOINT
```
Where `APP-ENDPOINT` is the URL endpoint for the deployed app.
For example:
<pre class="terminal">
$ time curl -v http://app1.app_domain.com
GET /hello HTTP/1.1
Host: app1.app_domain.com
User-Agent: curl/7.43.0
Accept: */*
HTTP/1.1 200 OK
Content-Type: application/json;charset=utf-8
Date: Tue, 14 Dec 2019 00:31:32 GMT
Server: nginx
X-Content-Type-Options: nosniff
X-Vcap-Request-Id: c30fad28-4972-46eb-7da6-9d07dc79b109
Content-Length: 602
hello world!
real 2m0.707s
user 0m0.005s
sys 0m0.007s
</pre>
The `real` time output shows that the request to `http://app1.app_domain.com` took approximately
two minutes, round-trip. This seems like an unreasonably long time, so it makes sense to find out
where the delay is occurring.
To narrow down the cause of latency, see the following table for information about the output that
you see after running `time curl -v APP-ENDPOINT`:
<table>
<tr>
<th>Result</th>
<th>Explanation</th>
<th>Action</th>
</tr>
<tr>
<td>The output shows latency.</td>
<td>A component of your app request flow is causing a delay.</td>
<td>Continue with the rest of the experiments to try to narrow down which component is causing
latency.</td>
</tr>
<tr>
<td>The output does not show latency.</td>
<td>Your app experiences inconsistent delays.</td>
<td>Try to find an endpoint that is consistently experiencing latency. If that is not possible,
then continue with the following experiments, running each of them multiple times.</td>
</tr>
<tr>
<td>The output shows the error: <code>Could not resolve host: NONEXISTENT.com</code></td>
<td>DNS failed to resolve.</td>
<td>Troubleshoot your DNS configuration.</td>
</tr>
<tr>
<td>The output of <code>curl</code> returns normally, but there is no value for
<code>X-Vcap-Request-Id</code>.</td>
<td>The request from the load balancer did not reach <%= vars.app_runtime_abbr %>.</td>
<td>Troubleshoot your load balancer configuration.</td>
</tr>
</table>
## <a id="within-cf"></a> Experiment 2: View Request Time in Access Logs
In <%= vars.app_runtime_abbr %>, delays can occur within Gorouter, within the app, or within the network between the two.
If you suspect that you are experiencing latency, the most important logs are the access logs. The
`cf logs` command streams log messages from Gorouter as well as from apps. This section describes
how to find and understand the access log timestamps.
To view request time in access logs:
1. (Optional) Run `cf apps` to determine the name of the app.
1. Run:
```
cf logs APP-NAME
```
Where `APP-NAME` is the name of the app.
1. From another command line window, send a request to your app.
1. After your app returns a response, enter `Ctrl-C` to stop streaming `cf logs`.
For example:
<pre class="terminal">
$ cf logs app1
2019-12-14T00:33:32.35-0800 [RTR/0] OUT app1.app_domain.com - [14/12/2019:00:31:32.348 +0000] "GET /hello HTTP/1.1" 200 0 60 "-" "HTTPClient/1.0 (2.7.1, ruby 2.3.3 (2019-11-21))" "10.0.4.207:20810" "10.0.48.67:61555" x_forwarded_for:"52.3.107.171" x_forwarded_proto:"http" vcap_request_id:"01144146-1e7a-4c77-77ab-49ae3e286fe9" response_time:120.00641734 gorouter_time:0.000217 app_id:"13ee085e-bdf5-4a48-aaaf-e854a8a975df" app_index:"0" x_b3_traceid:"3595985e7c34536a" x_b3_spanid:"3595985e7c34536a" x_b3_parentspanid:"-"
2019-12-14T00:32:32.35-0800 [APP/PROC/WEB/0]OUT app1 received request at [14/12/2019:00:32:32.348 +0000] with "vcap_request_id": "01144146-1e7a-4c77-77ab-49ae3e286fe9"
^C
</pre>
In the example above, the first line contains the following timestamps:
* `14/12/2016:00:31:32.348`: Gorouter receives request
* `response_time:120.00641734`: Time measured from when Gorouter receives the request to when
Gorouter finishes sending the response to the end user
* `gorouter_time:0.000217`: Gorouter response time, not including the time that Gorouter spent sending the request to the back end app or the time waiting for the response from the app.
The `response time` value shows that it took 120 seconds for Gorouter to send the request to the
back end app, receive a response, and send that response to the end user. However, the `gorouter_time` value
shows that the Gorouter took 0.000217 seconds to process the request and send the response back
to the end user. This means that the bulk of the time was spent in <%= vars.app_runtime_abbr %> either
within the app, or the network between the Gorouter and the app.
<br><br>
If `gorouter_time` also reported 120 seconds, it would indicate that the time was mostly spent either in the Gorouter itself, or in the network
between the end user and the Gorouter, as relatively little time was spent sending the request to the app.
<br><br>
To narrow down the cause of latency, see the following table for information about the output
that you see after running `cf logs APP-NAME`:
<table>
<tr>
<th>Result</th>
<th>Explanation</th>
<th>Action</th>
</tr>
<tr>
<td width="25%">The output shows latency within <code>gorouter_time</code>.</td>
<td>Gorouter, or the network between the end user and Gorouter, is causing the latency.</td>
<td>Continue with the rest of the experiments to try to narrow down which component is
causing latency.</td>
</tr>
<tr>
<td width="25%">The output shows latency within <code>response_time</code> but not in
<code>gorouter_time</code>.</td>
<td>The network between Gorouter and the app, or the app itself, is causing latency.</td>
<td>Jump to experiment 5, and continue with the rest of the experiments to try to narrow
down which component is causing latency.</td>
</tr>
<tr>
<td>The output does not show latency within <code>response_time</code>.</td>
<td>A component before Gorouter is causing latency to the request before it reaches Gorouter.</td>
<td>Continue with the rest of the experiments to try to narrow down which component is
causing latency.</td>
</tr>
<tr>
<td>The output does not show a log line.</td>
<td>Every incoming request should generate an access log message. If a request does not
generate an access log message, it means Gorouter did not receive the request.</td>
<td>Troubleshoot your Gorouter configuration.</td>
</tr>
</table>
## <a id="duplicate-latency"></a> Experiment 3: Duplicate Latency on Another Endpoint
The next step to debugging latency is finding an endpoint that consistently experiences delays. Use
a test app that does not make any internal or external requests or database calls. For example, see
[dora](https://github.com/cloudfoundry/cf-acceptance-tests/tree/master/assets/dora) on GitHub.
If you cannot push any apps to your foundation, find an API endpoint that does not make any
external calls to use for the rest of the experiments. For example, use a health or information
endpoint.
To duplicate latency on another endpoint:
1. Push an example app.
1. Measure a request's full round-trip time from the client and back as by running:
```
time curl -v TEST-APP-ENDPOINT
```
Where `TEST-APP-ENDPOINT` is the URL endpoint for the test app. While every network is
different, this request should take less than 0.2 seconds.
See the following table for information about the output that you see after running
`time curl -v TEST-APP-ENDPOINT`:
<table>
<tr>
<th>Result</th>
<th>Explanation</th>
<th>Action</th>
</tr>
<tr>
<td width="25%">The output shows latency.</td>
<td>One of the components in the app request path is causing latency.</td>
<td>Continue with the rest of the experiments, using the test app, to try to narrow down which
component is causing latency.</td>
</tr>
<tr>
<td>The output does not show latency.</td>
<td>Something in your app configuration is causing latency.</td>
<td>Evaluate your app for recent changes. See below for more information.</td>
</tr>
</table>
If this experiment shows that something in your app is causing latency, use the following questions
to start troubleshooting your app:
* Did you recently push any changes?
* Does your app make database calls?
* If so, have your database queries changed?
* Does your app make requests to other microservices?
* If so, is there a problem in a downstream app?
* Does your app log where it spends time? For more information, see
[Use App Logs to Locate Delays in <%= vars.app_runtime_abbr %>](#app_logs) below.
## <a id="remove-the-load-balancer"></a> Experiment 4: Remove the Load Balancer from Request Path
The next step is to remove the load balancer from the test path by sending the request directly to
Gorouter. You can do this by accessing the network where Gorouter is deployed, sending the traffic
directly to the Gorouter IP address, and adding the route in the host header.
To remove the load balancer from the request path:
1. Choose a router VM from your deployment and get its IP address. Record this value and use it as the `ROUTER-IP` when you run the command in a later step.
1. Run:
```
bosh ssh router/ROUTER-GUID
```
Where `ROUTER-GUID` is the unique identifier for the router VM. <%= vars.bosh_cli_link %>
3. To determine the amount of time a request takes when it skips the load balancer, run:
```
time curl ROUTER-IP -H "Host: TEST-APP-ENDPOINT"
```
Where:
* `ROUTER-IP` is the router VM IP address that you located in the first step.
* `TEST-APP-ENDPOINT` is the URL endpoint for the test app.
See the following table for information about the output that you see after removing the load
balancer from the app request path:
<table>
<tr>
<th>Result</th>
<th>Explanation</th>
<th>Action</th>
</tr>
<tr>
<td width="25%">The output shows latency.</td>
<td>The load balancer is not causing latency.</td>
<td>Continue with the rest of the experiments, using the test app, to try to narrow down which
component is causing latency.</td>
</tr>
<tr>
<td>The output does not show latency.</td>
<td>A component before Gorouter is causing latency.</td>
<td>Look at your load balancer logs and logs for any other components that exist between the
end client and Gorouter. Additionally, run these experiments with clients in different networks
to help isolate where issues occur.</td>
</tr>
</table>
## <a id="remove-the-gorouter"></a> Experiment 5: Remove Gorouter from Request Path
The next step is to remove Gorouter from the request path. You can SSH into the router VM and send
a request directly to the app.
To remove Gorouter from the app request path:
1. To retrieve the IP address and port number of the Diego Cell where your test app instance runs, run:
```
cf ssh TEST-APP -c "env | grep CF_INSTANCE_ADDR"
```
Where `TEST-APP` is the name of the test app.
For example:
<pre class="terminal">
$ cf ssh my-app -c "env | grep CF_INSTANCE_ADDR"
</p>
2. Choose any router VM from your deployment and SSH into it by running:
```
bosh ssh router/ROUTER-GUID
```
Where `ROUTER-GUID` is the unique identifier for the router VM. <%= vars.bosh_cli_link %>
3. To determine the amount of time a request takes when it skips Gorouter, run
`time curl CF_INSTANCE_ADDR`.
See the following table for information about the output that you see after removing Gorouter from
the app request path:
<table>
<tr>
<th>Result</th>
<th>Explanation</th>
<th>Action</th>
</tr>
<tr>
<td width="25%">The output shows latency.</td>
<td width="25%">Gorouter is not causing latency.</td>
<td>Continue with the rest of the experiments, using the test app, to try to narrow down which
component is causing latency.</td>
</tr>
<tr>
<td>The output does not show latency.</td>
<td>Gorouter is causing latency.</td>
<td>See <a href="#gorouter-latency">Causes for Gorouter Latency</a> and
<a href="#ops-recommendations">Operations Recommendations</a> below.</td>
</tr>
</table>
## <a id="test-internal-network"></a> Experiment 6: Test the Network between the Router and the App
The next step is to time how long it takes for your request to make it from the router VM to the
Diego Cell where your app is deployed. You can do this by using `tcpdump` on both VMs.
To test the network between the router and the app:
1. Choose a router VM from your deployment and record its IP address. Use this value as the
`ROUTER-IP` in later steps.
1. To get the IP address of the Diego Cell where your test app instance is running, run
`cf ssh TEST-APP -c "env | grep CF_INSTANCE_IP"`, where `TEST-APP` is the name of the test app.
1. To get the port number of the Diego Cell where your test app instance is running, run
`cf ssh TEST-APP -c "env | grep CF_INSTANCE_PORT"`, where `TEST-APP` is the name of the test app.
1. On the command line, locate the router VM that matches the `ROUTER-IP` value from the first step.
1. To SSH into the router VM, run `bosh ssh router/ROUTER-GUID`, where `ROUTER-GUID` is the unique
identifier for the router VM. <%= vars.bosh_cli_link %>
1. On the router VM, log in as root.
1. To capture all packets going to your app, run
`tcpdump 'dst CF_INSTANCE_IP and dst port CF_INSTANCE_PORT'`.
1. In a second command line window, SSH into the Diego Cell that corresponds with `CF_INSTANCE_IP`.
Run `bosh ssh digeo-cell/DIEGO-CELL-GUID`, where `DIEGO-CELL-GUID` is the unique identifier for the
Diego Cell where your app is running.
1. On the Diego Cell, log in as root.
1. To capture all packets going to your app, run
`tcpdump 'dst port CF_INSTANCE_PORT and src ROUTER-IP'`, where `ROUTER-IP` is the router VM IP
address that you recorded in the first step.
1. In a third command line window, run `ssh ROUTER-IP`, where `ROUTER-IP` is the router VM IP
address.
1. To make a request to your app, run `curl CF_INSTANCE_IP:CF_INSTANCE_PORT`.
1. Compare the first packet captured on the router VM with the first packet captured on the Diego
Cell. The packets should match. Use the timestamps to determine how long it took the packet to
traverse the network.
<p class="note"><strong>Note</strong>: If you use a test app, this should be the only traffic
to your app. If you are not using a test app and there is traffic to your app, then these
<code>tcpdump</code> commands could result in many packet captures. If the <code>tcpdump</code>
results are too verbose to track, you can write them to a <code>pcap</code> file and use
<code>wireshark</code> to find the important packets. To write <code>tcpdump</code> commands to
a file, use the <code>-w</code> flag. For example: <code>tcpdump -w router.pcap</code>.</p>
See the following table for information about the output that you see after testing the network
between the router and the app:
<table>
<tr>
<th>Result</th>
<th>Explanation</th>
<th>Action</th>
</tr>
<tr>
<td width="25%">The output shows latency.</td>
<td width="25%">Your network is causing latency.</td>
<td>Troubleshoot your network configuration.</td>
</tr>
<tr>
<td>The output does not show latency.</td>
<td>A component in your app request path is causing latency.</td>
<td>Repeat the diagnostic experiments to narrow down which component is causing latency. Follow
the steps in <a href="#total-latency">Experiment 1: Measure Total Round-Trip App Requests</a>
and ensure that you are consistently experiencing latency.</td>
</tr>
</table>
## <a id="app_logs"></a> Use App Logs to Locate Delays in <%= vars.app_runtime_abbr %>
To gain a more detailed picture of where delays exist in your request path, augment the logging
that your app generates. For example, call your logging library from the request handler to
generate log lines when your app receives a request and finishes processing it:
<pre class="terminal">
2019-12-14T00:33:32.35-0800 [RTR/0] OUT app1.app_domain.com - [14/12/2019:00:31:32.348 +0000] "GET /hello HTTP/1.1" 200 0 60 "-" "HTTPClient/1.0 (2.7.1, ruby 2.3.3 (2019-11-21))" "10.0.4.207:20810" "10.0.48.67:61555" x_forwarded_for:"52.3.107.171" x_forwarded_proto:"http" vcap_request_id:"01144146-1e7a-4c77-77ab-49ae3e286fe9" response_time:120.00641734 app_id:"13ee085e-bdf5-4a48-aaaf-e854a8a975df" app_index:"0" x_b3_traceid:"3595985e7c34536a" x_b3_spanid:"3595985e7c34536a" x_b3_parentspanid:"-"
2019-12-14T00:32:32.35-0800 [APP/PROC/WEB/0]OUT app1 received request at [14/12/2019:00:32:32.348 +0000] with "vcap_request_id": "01144146-1e7a-4c77-77ab-49ae3e286fe9"
2019-12-14T00:32:32.50-0800 [APP/PROC/WEB/0]OUT app1 finished processing req at [14/12/2019:00:32:32.500 +0000] with "vcap_request_id": "01144146-1e7a-4c77-77ab-49ae3e286fe9"
</pre>
By comparing the router access log messages from
[Experiment 2: View Request Time in Access Logs](#within-cf) with the new app logs above, you can
construct the following timeline:
* `14/12/2016:00:31:32.348`: Gorouter receives request
* `2016-12-14T00:32:32.35`: App receives request
* `2016-12-14T00:32:32.50`: App finishes processing request
* `2016-12-14T00:33:32.35`: Gorouter finishes processing request
The timeline indicates that Gorouter took close to 60 seconds to send the request to the app and
another 60 seconds to receive the response from the app. This suggests either of the following:
* A delay with Gorouter. See [Causes for Gorouter Latency](#gorouter-latency) below.
* Network latency between Gorouter and the Diego Cells that host the app.
## <a id="gorouter-latency"></a>Causes for Gorouter Latency
Two potential causes for Gorouter latency are:
* Routers are under heavy load from incoming client requests.
* Apps are taking a long time to process requests. This increases the number of concurrent threads
held open by Gorouter, reducing capacity to handle requests for other apps.
## <a id="ops-recommendations"></a> Operations Recommendations
* Monitor CPU load for Gorouters. At high CPU (70%+), latency increases. If the Gorouter CPU
reaches this threshold, consider adding another Gorouter instance.
* Monitor latency of all routers using metrics from the Firehose. Do not monitor the average
latency across all routers. Instead, monitor them individually on the same graph.
* Consider using Pingdom against an app on your <%= vars.app_runtime_abbr %> deployment to monitor
latency and uptime. For more information, see the [Pingdom](http://www.pingdom.com) website.
* Consider enabling access logs on your load balancer. To enable access logs, see your load
balancer documentation. Just as you use Gorouter access log messages above to determine latency
from Gorouter, you can compare load balancer logs to identify latency between the load balancer and
Gorouter. You can also compare load balancer response times with the client response times to
identify latency between client and load balancer.
* Deploy a nozzle to the Loggregator Firehose to track metrics for Gorouter. For more information,
see [Deploying a Nozzle to the Loggregator Firehose](../loggregator/nozzle-tutorial.html).
Available metrics include:
* CPU utilization
* Latency
* Requests per second