Troubleshooting Slow Requests in Cloud Foundry


This topic suggests ways that an operator of Cloud Foundry (CF) can diagnose the location of app request delays.

App Request Path

App requests typically transit the following components. Only the router (Gorouter) and the app itself are within the scope of Cloud Foundry. Operators may deploy the HAProxy load balancer that comes with CF instead of, or in addition to, an infrastructure load balancer.

(Diagram: Request lifecycle, showing the path of an app request from the client through the load balancer and Gorouter to the app.)

You can use time to measure a request's full round-trip time from the client, examine cf logs output to measure the time spent within Cloud Foundry, and add log messages to your app for fine-grained measurements of where the app itself spends time. By comparing these times, you can determine whether a delay comes from outside CF, from the Gorouter, or from the app.

The following sections walk through a scenario of diagnosing the source of a delay for an example app, app1.

Measure Total Round-Trip App Requests

On a command line, run time curl -v APP-ENDPOINT, replacing APP-ENDPOINT with the route of your app, to measure the total round-trip time. For example, for the deployed app app1:

$ time curl -v http://app1.app_domain.com

GET /hello HTTP/1.1
Host: app1.app_domain.com
User-Agent: curl/7.43.0
Accept: */*

HTTP/1.1 200 OK
Content-Type: application/json;charset=utf-8
Date: Tue, 14 Dec 2016 00:31:32 GMT
Server: nginx
X-Content-Type-Options: nosniff
X-Vcap-Request-Id: c30fad28-4972-46eb-7da6-9d07dc79b109
Content-Length: 602
hello world!
real    2m0.707s
user    0m0.005s
sys     0m0.007s

The real time output shows that the request to http://app1.app_domain.com took approximately 2 minutes, round-trip. This seems like an unreasonably long time, so it makes sense to find out where the delay is occurring. To narrow it down, the next step measures the part of that request response time that comes from within Cloud Foundry.

Note: If curl outputs an error like Could not resolve host: NONEXISTENT.com, then DNS failed to resolve the app's route. If curl returns normally but the response lacks an X-Vcap-Request-Id header, the request from the load balancer did not reach Cloud Foundry.
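
One quick way to check for that header is to dump only the response headers and filter for it. The following is a sketch that assumes the same app1 route as in the example above:

$ curl -s -o /dev/null -D - http://app1.app_domain.com | grep -i x-vcap-request-id
X-Vcap-Request-Id: c30fad28-4972-46eb-7da6-9d07dc79b109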

Measure App Requests within Cloud Foundry

The cf logs command streams log messages from the Gorouter as well as from apps. To see the timestamps of Gorouter messages to and from your app, do the following:

  1. If necessary, run cf apps to determine the name of the app.
  2. Run cf logs APP-NAME. Replace APP-NAME with the name of the app.
  3. From another terminal window, send a request to your app.
  4. After your app returns a response, enter Ctrl-C to stop streaming cf logs.

For example:

$ cf logs app1

2016-12-14T00:33:32.35-0800 [RTR/0]      OUT app1.app_domain.com - [14/12/2016:00:31:32.348 +0000] "GET /hello HTTP/1.1" 200 0 60 "-" "HTTPClient/1.0 (2.7.1, ruby 2.3.3 (2016-11-21))" "10.0.4.207:20810" "10.0.48.67:61555" x_forwarded_for:"52.3.107.171" x_forwarded_proto:"http" vcap_request_id:"01144146-1e7a-4c77-77ab-49ae3e286fe9" response_time:120.00641734 app_id:"13ee085e-bdf5-4a48-aaaf-e854a8a975df" app_index:"0" x_b3_traceid:"3595985e7c34536a" x_b3_spanid:"3595985e7c34536a" x_b3_parentspanid:"-"
2016-12-14T00:32:32.35-0800 [APP/PROC/WEB/0]OUT app1 received request at [14/12/2016:00:32:32.348 +0000] with "vcap_request_id": "01144146-1e7a-4c77-77ab-49ae3e286fe9"
^C

In the example above, the first line is the Gorouter access log message. It records both when the Gorouter received the request and how long the Gorouter took to process it:

  • 14/12/2016:00:31:32.348: Gorouter receives request
  • response_time:120.00641734: Gorouter round-trip processing time

This output shows that the Gorouter took 120 seconds to process the request; 00:31:32.348 plus 120 seconds lines up with the 00:33:32 timestamp on the Gorouter log line. The 2-minute delay measured above therefore occurs within CF, either in the Gorouter or in the app.

To determine whether the app is responsible, add logging to your app to measure where it is spending time.

Note: Every incoming request should generate an access log message. If a request does not generate an access log message, it means the Gorouter did not receive the request.
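
If you are not sure whether a given request produced an access log message, you can also filter the app's recent logs for Gorouter (RTR) lines. This is a sketch that assumes the app1 example above:

$ cf logs app1 --recent | grep '\[RTR'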

Use App Logs to Locate Delays in CF

To gain a more detailed picture of where delays exist in your request path, augment the logging that your app generates. For example, call your logging library from the request handler to generate log lines when your app receives a request and finishes processing it:

2016-12-14T00:33:32.35-0800 [RTR/0]      OUT app1.app_domain.com - [14/12/2016:00:31:32.348 +0000] "GET /hello HTTP/1.1" 200 0 60 "-" "HTTPClient/1.0 (2.7.1, ruby 2.3.3 (2016-11-21))" "10.0.4.207:20810" "10.0.48.67:61555" x_forwarded_for:"52.3.107.171" x_forwarded_proto:"http" vcap_request_id:"01144146-1e7a-4c77-77ab-49ae3e286fe9" response_time:120.00641734 app_id:"13ee085e-bdf5-4a48-aaaf-e854a8a975df" app_index:"0" x_b3_traceid:"3595985e7c34536a" x_b3_spanid:"3595985e7c34536a" x_b3_parentspanid:"-"
2016-12-14T00:32:32.35-0800 [APP/PROC/WEB/0]OUT app1 received request at [14/12/2016:00:32:32.348 +0000] with "vcap_request_id": "01144146-1e7a-4c77-77ab-49ae3e286fe9"
2016-12-14T00:32:32.50-0800 [APP/PROC/WEB/0]OUT app1 finished processing req at [14/12/2016:00:32:32.500 +0000] with "vcap_request_id": "01144146-1e7a-4c77-77ab-49ae3e286fe9"

Comparing the router access log messages from the previous section with the new app logs above, we can construct the following timeline:

  • 14/12/2016:00:31:32.348: Gorouter receives request
  • 2016-12-14T00:32:32.35: App receives request
  • 2016-12-14T00:32:32.50: App finishes processing request
  • 2016-12-14T00:33:32.35: Gorouter finishes processing request

The timeline indicates that the Gorouter took close to 60 seconds to send the request to the app and another 60 seconds to receive the response from the app. This suggests a delay either within the Gorouter itself or in network latency between the Gorouter and the Diego cell hosting the app.

Time the Gorouter Processing

To determine whether a Gorouter delay comes from the Gorouter itself or from network latency between the Gorouter and the app, log in to the Gorouter VM and compare the response times of calling the app in two different ways:

  • Call the app through the router proxy to process requests through the Gorouter
  • Directly call a specific instance of the app, bypassing Gorouter processing

To make this comparison, do the following:

  1. Log in to the router using bosh ssh. See the BOSH SSH documentation for more information.

  2. Use time and curl to measure the response time of an app request that originates on the Gorouter VM and is processed by the Gorouter, without passing through the client network or the load balancer:

    $ time curl -H "Host: app1.app_domain.com" http://IP-GOROUTER-VM:80
    

  3. Obtain the IP address and port of a specific app instance by running the following command and recording the host and port values for that instance:

    $ cf curl /v2/apps/$(cf app app1 --guid)/stats
    {
        "0": {
        "state": "RUNNING",
        "stats": {
            [...]
            "host": "10.10.148.39",
            "port": 60052,
            [...]
        },
    [...]
    }
    
  4. Use the host and port values to measure the response time when calling the app instance directly, bypassing Gorouter processing (a scripted variant combining steps 3 and 4 appears after these steps):

    $ time curl http://APP-HOST-IP:APP-PORT 
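
If you want to script steps 3 and 4 together, you can extract the host and port from the stats output and feed them straight to curl. The following is only a sketch: HOSTPORT is an illustrative shell variable, and the command assumes jq is installed and that you are targeting instance 0:

$ # sketch only: requires jq; HOSTPORT is an illustrative variable name
$ HOSTPORT=$(cf curl "/v2/apps/$(cf app app1 --guid)/stats" | jq -r '.["0"].stats | "\(.host):\(.port)"')
$ time curl "http://$HOSTPORT"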
    

If the Gorouter and direct response times are similar, it suggests network latency between the Gorouter and the Diego cell. If the Gorouter time is much longer than the direct-to-instance time, the Gorouter is slow processing requests. The next section explains why the Gorouter might be slow.

Potential Causes for Gorouter Latency

  • Routers are under heavy load from incoming client requests.
  • Apps are taking a long time to process requests. This increases the number of concurrent threads held open by the Gorouter, reducing capacity to handle requests for other apps.

Operations Recommendations

  • Monitor CPU load for Gorouters. At high CPU (70%+), latency increases. If Gorouter CPU reaches this threshold, consider adding another Gorouter instance. A quick spot-check example appears after this list.
  • Monitor latency of all routers using metrics from the Firehose. Do not monitor the average latency across all routers. Instead, monitor them individually on the same graph.
  • Consider using Pingdom against an app on your Cloud Foundry deployment to monitor latency and uptime.
  • Consider enabling access logs on your load balancer; see your load balancer's documentation for instructions. Just as the Gorouter access log messages above revealed latency within the Gorouter, load balancer logs can help you identify latency between the load balancer and the Gorouter. You can also compare load balancer response times with client-measured response times to identify latency between the client and the load balancer.
  • Deploy a nozzle to the Loggregator Firehose to track metrics for the Gorouter. Available metrics include:
    • CPU utilization
    • Latency
    • Requests per second
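
As a quick spot-check of Gorouter CPU load (the first recommendation in the list above), you can look at BOSH VM vitals. This is a sketch that assumes the BOSH v2 CLI and a deployment named cf whose router instance group name contains "router":

$ bosh -d cf vms --vitals | grep router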