Loggregator Guide for Cloud Foundry Operators

This topic explains how operators of Cloud Foundry deployments can configure the Loggregator system to avoid data loss when the volume of logging and metrics data is high.

Scaling Loggregator

When the volume of log and metric data generated by Cloud Foundry components exceeds the storage buffer capacity of the Dopplers that collect it, data can be lost. Configuring System Logging explains how to scale the Loggregator system to keep up with high stream volume and minimize data loss.

Scaling Nozzles

You can scale nozzles using the subscription ID, specified when the nozzle connects to the Firehose. If you use the same subscription ID on each nozzle instance, the Firehose evenly distributes events across all instances of the nozzle. For example, if you have two nozzles with the same subscription ID, the Firehose sends half of the events to one nozzle and half to the other. Similarly, if you have three nozzles with the same subscription ID, the Firehose sends each instance one-third of the event traffic.
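The even split described above can be illustrated with a small simulation. This is a minimal sketch, not Loggregator code: the `distribute` function, the instance names, and the round-robin strategy are assumptions made for illustration, since the text only states that events are spread evenly across instances that share a subscription ID. The sketch also assumes that each distinct subscription ID receives its own copy of the stream.

```python
from collections import defaultdict


def distribute(events, subscriptions):
    """Simulate Firehose delivery to nozzle instances.

    `subscriptions` maps a (hypothetical) nozzle instance name to its
    subscription ID. Instances sharing an ID split the stream between
    them; round-robin is an assumed strategy, chosen only because it
    yields the even split described in the text.
    """
    # Group instances by subscription ID.
    groups = defaultdict(list)
    for instance, sub_id in subscriptions.items():
        groups[sub_id].append(instance)

    delivered = {instance: [] for instance in subscriptions}
    for index, event in enumerate(events):
        # Each subscription ID sees every event once, handed to
        # exactly one of the instances in its group.
        for instances in groups.values():
            target = instances[index % len(instances)]
            delivered[target].append(event)
    return delivered


events = list(range(12))
shared = distribute(
    events,
    {"nozzle-a": "my-sub", "nozzle-b": "my-sub", "nozzle-c": "my-sub"},
)
for name, received in shared.items():
    print(name, len(received))  # each instance receives 4 of the 12 events
```

With two instances under the same ID, the same call gives each instance six of the twelve events.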

Stateless nozzles should handle scaling gracefully. If a nozzle buffers or caches the data, the nozzle author must test the results of scaling the number of nozzle instances up or down.

Slow Nozzle Alerts

The Traffic Controller alerts nozzles that consume events too slowly. When a nozzle falls behind, Loggregator alerts it in two ways:

  • TruncatingBuffer alerts: If the nozzle consumes messages more slowly than they are produced, the Loggregator system may drop messages. In this case, Loggregator sends the log message "TB: Output channel too full. Dropped N messages", where N is the number of dropped messages. Loggregator also emits a CounterEvent named TruncatingBuffer.DroppedMessages. The nozzle receives both messages from the Firehose, which alerts the operator to the performance issue.

  • PolicyViolation error: The Traffic Controller periodically sends ping control messages over the Firehose WebSocket connection. If a client does not respond to a ping with a pong message within 30 seconds, the Traffic Controller closes the WebSocket connection with the WebSocket error code ClosePolicyViolation (1008). The nozzle should intercept this WebSocket close error and alert the operator to the performance issue.

An operator can scale the number of nozzles in response to these alerts to minimize the loss of data.
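A nozzle can recognize both alerts with a few lines of code. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the regular expression is built from the log message quoted above, and the code deliberately leaves out the WebSocket client itself, so it only classifies a log line or a close code that the nozzle has already received.

```python
import re

# WebSocket close code the text gives for ClosePolicyViolation.
CLOSE_POLICY_VIOLATION = 1008

# Pattern for the TruncatingBuffer log message quoted in the text.
_DROPPED_RE = re.compile(r"TB: Output channel too full\. Dropped (\d+) messages")


def dropped_message_count(log_line):
    """Return N from a TruncatingBuffer alert, or None for other log lines."""
    match = _DROPPED_RE.search(log_line)
    return int(match.group(1)) if match else None


def is_slow_consumer_close(close_code):
    """Return True when the close code is ClosePolicyViolation (1008)."""
    return close_code == CLOSE_POLICY_VIOLATION


def check_log_line(log_line, alert=print):
    """Call `alert` if the line reports dropped messages; return the count."""
    count = dropped_message_count(log_line)
    if count is not None:
        alert(f"slow nozzle: Loggregator dropped {count} messages")
    return count
```

For example, `check_log_line("TB: Output channel too full. Dropped 42 messages")` calls the alert callback and returns 42, and `is_slow_consumer_close(1008)` returns `True`. How the alert reaches the operator (log, metric, pager) is left to the nozzle author.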

Forwarding Logs to an External Service

You can configure Cloud Foundry to forward log data from components and apps to an external aggregator service instead of routing it to the Loggregator Firehose. Configuring System Logging explains how to enable log forwarding by specifying the aggregator address, port, and protocol.

Using Log Management Services explains how to bind applications to the external service and configure it to receive logs from Cloud Foundry.

Log Message Size Constraints

The Diego cell emits application logs as UDP messages to Metron. Because a single UDP datagram cannot carry more than about 64 KiB, Diego breaks up log messages greater than approximately 60 KiB into multiple envelopes to mitigate this constraint.
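The splitting behavior can be pictured with a short sketch. This is not Diego's implementation: the `split_log_message` function is hypothetical, the exact 60 KiB cutoff is an assumption (the text only says "approximately"), and splitting on fixed byte boundaries is chosen for simplicity.

```python
# Assumed cutoff; the text says "approximately 60 KiB".
MAX_ENVELOPE_BYTES = 60 * 1024


def split_log_message(message, limit=MAX_ENVELOPE_BYTES):
    """Split a log message (bytes) into chunks of at most `limit` bytes."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    if not message:
        # An empty message still produces one (empty) chunk.
        return [message]
    return [message[i:i + limit] for i in range(0, len(message), limit)]


chunks = split_log_message(b"x" * 150_000)
print([len(chunk) for chunk in chunks])  # [61440, 61440, 27120]
```

A consumer that needs the original message back has to reassemble the pieces itself, so a nozzle should not assume that one envelope always holds one complete log line.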
