Troubleshooting Router Error Responses


This topic helps operators understand and debug 502 errors that are a result of their infrastructure, Cloud Foundry (CF), or an app.

Overview

In your deployment, 502 errors can come from your infrastructure, from Cloud Foundry itself, or from an app.

If you are unsure of the source of 502 errors, see General Debugging Steps below.

General Debugging Steps

Some general debugging steps for any issue resulting in 502 errors are as follows:

  1. Gather the Gorouter logs and Diego Cell logs at the time of the incident.

  2. Review the logs and consider the following:

    1. Which errors are the Gorouters returning?
    2. Is Gorouter’s routing table accurate? Are the endpoints for the route as expected? For more information, see Dynamic Routing Table in the Gorouter documentation on GitHub.
    3. Do the Diego Cell logs have anything interesting about unexpected app crashes or restarts?
    4. Is the app healthy and handling requests successfully? You can use request tracing headers to verify. For more information, see HTTP Headers for Zipkin Tracing in HTTP Routing.
  3. Consider the following:

    • Does your load balancer log 502 errors, but Gorouter does not? This means that traffic is not reaching Gorouter.
    • Was there a recent platform change or upgrade that caused an increase in 502 errors?
    • Are any metrics spiking suspiciously? What are the CPU and memory utilization levels?

Log Formatting

Levels

The following table describes the log levels supported by Gorouter. The log level is specified in the configuration YAML file for Gorouter.

Level | Description | Examples
fatal | Gorouter is unable to handle any requests due to a fatal error. | Gorouter cannot bind to its TCP port; a CF component has published invalid data to Gorouter.
error | An unexpected error has occurred. | Gorouter failed to fetch a token from the UAA service.
info | An expected event has occurred. | Gorouter started or exited; Gorouter has begun to prune routes for stale droplets.
debug | A lower-level event has occurred. | Route registration; route unregistration.

Message Contents

This section provides a sample Gorouter log entry and an explanation of its contents.

[2017-02-01 22:54:08+0000] {"log_level":0,"timestamp":1485989648.0895808,"message":"endpoint-registered","source":"vcap.Gorouter.registry","data":{"uri":"0-*.login.bosh-lite.com","backend":"10.123.0.134:8080","modification_tag":{"guid":"","index":0}}}

Property | Description
log_level | Logging level of the message
timestamp | Epoch time of the log
message | Content of the log entry
source | Gorouter function that initiated the log entry
data | Additional information that varies based on the message
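
The properties above can be read programmatically. The following is a minimal Python sketch, not part of Gorouter, that assumes each log line consists of an optional bracketed timestamp prefix followed by a single JSON object, as in the sample entry. The function name parse_gorouter_log is hypothetical.

```python
import json
import re
from datetime import datetime, timezone

# Matches the optional "[2017-02-01 22:54:08+0000] " prefix seen in the sample entry.
PREFIX = re.compile(r"^\[[^\]]*\]\s*")


def parse_gorouter_log(line):
    """Return the JSON payload of one gorouter.log line as a dict (hypothetical helper)."""
    payload = PREFIX.sub("", line.strip(), count=1)
    return json.loads(payload)


# The sample entry from this section, on a single line.
sample = (
    '[2017-02-01 22:54:08+0000] {"log_level":0,"timestamp":1485989648.0895808,'
    '"message":"endpoint-registered","source":"vcap.Gorouter.registry",'
    '"data":{"uri":"0-*.login.bosh-lite.com","backend":"10.123.0.134:8080",'
    '"modification_tag":{"guid":"","index":0}}}'
)

entry = parse_gorouter_log(sample)
# The timestamp property is epoch time, so it converts directly to a UTC datetime.
when = datetime.fromtimestamp(entry["timestamp"], tz=timezone.utc)
print(entry["message"], entry["data"]["backend"], when.isoformat())
```

Because the contents of data vary by message, code that reads it should look up keys defensively rather than assume a fixed shape.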

Access Logs

This section provides details about Gorouter access logs.

Gorouter generates an access log in the following format when it receives a request:

<Request Host> - [<Start Date>] "<Request Method> <Request URL> <Request Protocol>" <Status Code> <Bytes Received> <Bytes Sent> "<Referrer>" "<User-Agent>" <Remote Address> <Backend Address> x_forwarded_for:"<X-Forwarded-For>" x_forwarded_proto:"<X-Forwarded-Proto>" vcap_request_id:<X-Vcap-Request-ID> response_time:<Response Time> gorouter_time:<Gorouter Time> app_id:<Application ID> app_index:<Application Index> x_cf_routererror:<X-Cf-RouterError> <Extra Headers>

Gorouter access logs are also redirected to syslog.

See the list below for more information about the Gorouter access log fields:

  • The following are optional fields: Status Code, Response Time, Application ID, Application Index, X-Cf-RouterError, and Extra Headers.

  • If the access log lacks a Status Code, Response Time, Application ID, Application Index, or X-Cf-RouterError, the corresponding field shows -.

  • Response Time is the total time it takes for the request to go through the Gorouter to the app and for the response to travel back through the Gorouter. This includes the time that the request spends traversing the network to the app and back again to the Gorouter. It also includes the time the app spends forming a response.

  • Gorouter Time is the total time it takes for the request to go through the Gorouter initially plus the time it takes for the response to travel back through the Gorouter. This does not include the time the request spends traversing the network to the app. This also does not include the time the app spends forming a response.

  • X-Cf-RouterError is populated if the Gorouter encounters an error. The returned values can help distinguish whether a non-2xx response code is due to an error in the Gorouter or the back end. For more information on the possible errors, see the Diagnose App Errors section.
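
As an illustration of how these fields can be used, the following Python sketch extracts the key:value fields from an access log line and subtracts Gorouter Time from Response Time to estimate the time spent outside Gorouter (network plus app). The sample line is made up for this example and follows the format shown above; the helper names are hypothetical, and real access log lines might quote values differently.

```python
import re

# Field names taken from the access log format in this section.
KEYS = ("x_forwarded_for", "x_forwarded_proto", "vcap_request_id",
        "response_time", "gorouter_time", "app_id", "app_index",
        "x_cf_routererror")
FIELD = re.compile(r"\b(" + "|".join(KEYS) + r'):("[^"]*"|\S+)')


def parse_access_fields(line):
    """Collect the key:value fields of an access log line; '-' becomes None."""
    fields = {}
    for key, value in FIELD.findall(line):
        value = value.strip('"')
        fields[key] = None if value == "-" else value
    return fields


def backend_time(fields):
    """Estimate seconds spent outside Gorouter, or None if a timing is missing."""
    try:
        return float(fields["response_time"]) - float(fields["gorouter_time"])
    except (KeyError, TypeError, ValueError):
        return None


# Made-up example line; addresses, IDs, and timings are illustrative only.
sample = (
    'app.example.com - [2018-07-05T17:59:10.921+0000] "GET / HTTP/1.1" 502 0 67 '
    '"-" "curl/7.54.0" "10.0.0.5:41234" "10.0.32.15:60099" '
    'x_forwarded_for:"10.0.0.5" x_forwarded_proto:"https" '
    'vcap_request_id:"example-id" response_time:0.015 gorouter_time:0.001 '
    'app_id:"-" app_index:"-" x_cf_routererror:"endpoint_failure"'
)

fields = parse_access_fields(sample)
print(fields["x_cf_routererror"], backend_time(fields))
```

A large gap between the two timings points toward the network or the app rather than Gorouter itself.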

Diagnose Gorouter Errors

This section describes the basic structure of Gorouter logs and how to diagnose Gorouter errors.

Gorouter Cannot Connect to the App Container

If Gorouter cannot connect to the app container, you might see this error in the gorouter.log:

[2018-07-05 17:59:10+0000] {"log_level":3,"timestamp":1530813550.92134,"message":
"backend-endpoint-failed","source":"vcap.gorouter","data":{"route-endpoint":
{"ApplicationId":"","Addr":"10.0.32.15:60099","Tags":null,"RouteServiceUrl":""},
"error":"dial tcp 10.0.32.15:60099: getsockopt: connection refused"}}

If Gorouter cannot establish an initial TCP connection to the backend, it retries the dial up to three times. If the connection still fails, Gorouter returns a 502 to the client and writes an entry to the access.log.

Any of the following can cause connection errors between Gorouter and the app container:

  • An app that is unresponsive, indicating an issue with the app.
  • A stale route in Gorouter, indicating an issue with the platform. For more information, see Diagnose Stale Routes below.
  • A corrupted app container, indicating a problem with the platform.

Gorouter Errors After Connecting

If Gorouter successfully dials the endpoint but an error occurs, you might see the following:

  • read: connection reset by peer errors. These can occur when the app closes the connection abruptly with a TCP RST packet and not the expected FIN-ACK. This causes Gorouter to retry the next endpoint. Gorouter does not currently retry on write: connection reset by peer failures.
  • TLS handshake errors. When these errors occur, the Gorouter retries up to three times. If it still fails, Gorouter can return a 502. These errors appear similar to the following in the gorouter.log, and a 502 error is logged in the access.log:
    [2018-07-05 18:20:54+0000] {"log_level":3,"timestamp":1530814854.4359834,"message":
    "backend-endpoint-failed","source":"vcap.gorouter","data":{"route-endpoint":
    {"ApplicationId":"","Addr":"10.0.16.17:61002","Tags":null,"RouteServiceUrl":""},
    "error":"x509:certificate is valid for 53079ca3-c4fe-4910-78b9-c1a6, not xxx"}}
    
  • If a client cancels a request before the server responds with headers, Gorouter logs a 499 error.

Diagnose Stale Routes in Gorouter

A stale route occurs when Gorouter contains out-of-date route information for a backend app. In nearly all cases, stale routes are self-correcting.

If TLS from Gorouter to apps and other backends is enabled, then when Gorouter detects that it is sending traffic to the wrong app, it prunes that backend app from its route table and terminates the connection. TLS from Gorouter to apps and other backends is enabled by default in cf-deployment v7.0.0.

Causes of Stale Routes

When a route is unmapped or when an app container is deleted because the app is deleted or moved, a deregister message is sent to Gorouter. This message tells Gorouter to delete the route mapping to that container.

If Gorouter does not receive this deregister message, the route becomes stale, and Gorouter continues to send traffic to the old container address.

You are more likely to have stale routes when the following are true:

  • SSL verification is not enabled.
  • You do not have Diego Release v2.34.0 or later deployed, which contains a fix that sends unregistration messages multiple times.
  • You unmapped a route to an app, but traffic to that route is still being sent to the app.

How to Locate Stale Routes

The following procedure helps you identify stale routes:

  1. Verify the state of the deployment. Run cf routes for all spaces and ensure the route is only mapped to the intended apps. Sometimes, there can be multiple routes using the same hostname and domain but with different paths. If the domain is shared, check all orgs as well.
  2. Examine the Gorouter routes table. It might be necessary to check multiple Gorouters, as it is possible that some received the proper deregister message and some did not.
    1. SSH to the VM where Gorouter is running.
    2. To print the entire Gorouter routes table, run: /var/vcap/jobs/gorouter/bin/retrieve-local-routes | jq .
    3. Find the entry for the suspected stale route. Note the values for address and private_instance_id.
  3. Cross-reference the Gorouter routes table entry with actual Long-Running Processes (LRPs):
    1. SSH onto the Diego Cell where the IP address matches the IP address that you found on the routes table entry.
    2. To get information about all of the actual LRPs, run: cfdot actual-lrps | jq .
    3. Look through the actual LRPs to find the instance ID that you noted from the routes table. If that instance ID exists and the port in the route table does not exist in the ports section, then there is likely a stale route.

      Note: You might be tempted to use the CAPI endpoint GET /v3/processes/:guid/stats to find out information about the host and ports the app is using. However, it is an app developer endpoint and does not provide complete information for operators. Use the cfdot CLI on the Diego Cell to view the actual LRPs directly and all at once.
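
The cross-referencing check in step 3 can be sketched in Python as follows. The address and private_instance_id keys come from the routes table as described above. The instance_guid and ports keys, and the assumption that each ports entry maps port names to integers, are assumptions about the cfdot output and should be verified against your deployment. The function name check_route is hypothetical.

```python
def check_route(route_entry, actual_lrps):
    """Compare one routes table entry against actual LRPs from the Diego Cell.

    Returns "likely-stale" when the instance exists but none of its ports match
    the port in the route entry, "ok" when a port matches, and
    "instance-not-found" when no LRP has that instance ID.
    """
    _, _, port_text = route_entry["address"].rpartition(":")
    routed_port = int(port_text)
    for lrp in actual_lrps:
        # Assumption: the LRP's instance_guid corresponds to private_instance_id.
        if lrp.get("instance_guid") != route_entry.get("private_instance_id"):
            continue
        known_ports = set()
        for mapping in lrp.get("ports", []):
            known_ports.update(v for v in mapping.values() if isinstance(v, int))
        return "ok" if routed_port in known_ports else "likely-stale"
    return "instance-not-found"


# Illustrative data only; the IDs and ports are made up.
route = {"address": "10.0.16.17:61002", "private_instance_id": "instance-a"}
lrps = [{"instance_guid": "instance-a",
         "ports": [{"container_port": 8080, "host_port": 61012}]}]
print(check_route(route, lrps))
```

Run the check against the routes table of each Gorouter, because some Gorouters might have received the deregister message while others did not.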

How to Fix Stale Routes

The following procedure helps you fix stale routes:

  1. Ensure that SSL verification is enabled. For more information, see With TLS Enabled in HTTP Routing.

    Note: Using TLS to verify app identity depends on SSL verification. If you disable SSL verification, there is no way to avoid misrouting.

  2. If there is a stale route, then restarting Gorouter fixes the immediate issue. If you restart all of the Gorouters and see the same issue for the exact same route, then the issue is not a stale route.
  3. If Gorouter is continually missing deregister messages, it might be because either the NATS message bus or the Gorouters are overwhelmed. Look at the VM usage and consider scaling.

Gorouter Error Classification Table

Use this table when you are debugging Gorouter errors. The table lists error types, status codes, and indicates if the Gorouter retries the errored request.

For each error, there is a backend-endpoint-failed log entry in gorouter.log and an error message in gorouter.err.log. Additionally, the access.log records the request status codes. For more information, see the Gorouter documentation on GitHub.

If the request is one that can be retried, Gorouter makes up to three attempts.

Error Type | Status Code | Can be retried? | Source of Issue | Evidence
Dial | 502 | Yes | App or Platform | Logs with error dial tcp
AttemptedTLSWithNonTLSBackend | 525 | Yes | Platform | Logs with error tls: first record does not look like a TLS handshake or backend_tls_handshake_failed metric increments
HostnameMismatch | 503 | Yes | Platform | Logs with error x509: certificate is valid for <x>, not <y> or backend_invalid_id metric increments
UntrustedCert | 526 | Yes | Platform | Logs with error prefix x509: certificate signed by unknown authority or backend_invalid_tls_cert metric increments
RemoteFailedCertCheck | 496 | Yes | Platform | Logs with error remote error: tls: bad certificate
ContextCancelled | 499 | No | Client/App | Logs with error context canceled. This status code appears in logs only. It is never returned to clients because it occurs when the downstream client closes the connection before Gorouter responds.
RemoteHandshakeFailure | 525 | Yes | Platform | Logs with error remote error: tls: handshake failure and backend_tls_handshake_failed metric increments
ExpiredOrNotYetValidCertFailure | 502 | Yes | Platform | Logs with error x509: certificate has expired or is not yet valid. For example, this error can occur if the Diego Cell clock drifts.
unknown | 502 | No | App or Platform | If the error is not one of the types above, then 502 is the default response. Gorouter tracks 502 errors with the gorouter.bad_gateways metric. For more information, see Router Error: 502 Bad Gateway.
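
When triaging many log entries, the table can be applied mechanically. The following Python sketch maps an error string from gorouter.log to the error type, status code, and retry behavior listed above. Matching on substrings is a simplification chosen for this example and is not how Gorouter itself classifies errors; the function name classify is hypothetical.

```python
# (substring to look for, error type, status code, retried?) from the table above.
RULES = [
    ("tls: first record does not look like a TLS handshake",
     "AttemptedTLSWithNonTLSBackend", 525, True),
    ("certificate is valid for", "HostnameMismatch", 503, True),
    ("certificate signed by unknown authority", "UntrustedCert", 526, True),
    ("remote error: tls: bad certificate", "RemoteFailedCertCheck", 496, True),
    ("context canceled", "ContextCancelled", 499, False),
    ("remote error: tls: handshake failure", "RemoteHandshakeFailure", 525, True),
    ("certificate has expired or is not yet valid",
     "ExpiredOrNotYetValidCertFailure", 502, True),
    ("dial tcp", "Dial", 502, True),
]


def classify(error_text):
    """Return (error type, status code, retried?) for a logged error string."""
    for needle, name, status, retried in RULES:
        if needle in error_text:
            return name, status, retried
    # Anything unrecognized falls through to the default 502 row.
    return "unknown", 502, False


print(classify("dial tcp 10.0.32.15:60099: getsockopt: connection refused"))
print(classify("x509:certificate is valid for 53079ca3-c4fe-4910-78b9-c1a6, not xxx"))
```

The two example strings are the error values from the sample log entries earlier in this topic.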

Diagnose App Errors

This section describes app-related 502 errors.

If 502 errors only occur in specific app instances and not all app instances on the platform, it is likely an app-related error. The app might be overloaded, unresponsive, or unable to connect to the database.

If all apps are experiencing 502 errors, then it could either be a platform issue, such as a misconfiguration, or an app issue, such as all apps being unable to connect to an upstream database.

Note: Gorouter does not retry any error response returned by the app.

Gorouter-Specific Response Headers

If Gorouter encounters an error connecting to an app backend, it populates the X-Cf-RouterError header to help distinguish the origin of a non-2xx response code.

The value of the X-Cf-RouterError header can be one of the following:

Value | Description
invalid_cf_app_instance_header | The provided value for the X-Cf-App-Instance header does not match the required format of APP_GUID:INSTANCE_ID.
empty_host | The value for the Host header is empty, or the Host header is equivalent to the remote address. Some load balancers optimistically set the Host header value to their IP address when no value is present.
unknown_route | The desired route does not exist in Gorouter's route table.
no_endpoints | There is an entry in the route table for the desired route, but there are no healthy endpoints available.
Connection Limit Reached | The backends associated with the route have reached their maximum number of connections. The maximum is set through the spec property router.backends.max_conns.
route_service_unsupported | Route services are not enabled or WebSockets requests are bound to route services. You can configure route services using the spec property router.route_services_secret. If the property is empty, route services are disabled.
endpoint_failure | The registered endpoint for the desired route failed to handle the request.
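
A client or test harness can use this header to attribute an error response. The following Python sketch treats a populated X-Cf-RouterError header on a non-2xx response as a Gorouter-reported error and its absence as an indication that the response came from the backend. That second inference is an assumption of this example, and the function name error_origin is hypothetical.

```python
def error_origin(status_code, headers):
    """Return (origin, reason) for a response, based on X-Cf-RouterError."""
    if 200 <= status_code < 300:
        return ("none", None)
    # Header names are case-insensitive, so normalize before looking up.
    lowered = {name.lower(): value for name, value in headers.items()}
    reason = lowered.get("x-cf-routererror")
    if reason and reason != "-":
        return ("gorouter", reason)
    # Assumption: without the header, the non-2xx response came from the backend.
    return ("backend", None)


print(error_origin(502, {"X-Cf-RouterError": "endpoint_failure"}))
print(error_origin(500, {"Content-Type": "text/plain"}))
```

For example, a 404 with unknown_route indicates that the route is missing from the route table, whereas a 404 without the header was most likely produced by the app.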