Scaling Cloud Controller (cf-for-k8s)

This topic describes how and when to scale CAPI in cf-for-k8s, and includes details about some key metrics, heuristics, and logs.

cf-api-server

The cf-api-server is the primary container in CAPI. It, along with nginx, powers the Cloud Controller API that all users of Cloud Foundry interact with. In addition to serving external clients, cf-api-server also provides APIs for internal components within Cloud Foundry, such as logging and networking subsystems.

When to scale

When determining whether to scale cf-api-server, look for the following:

Key metrics

Cloud Controller emits the following metrics:

  • sum(rate(container_cpu_usage_seconds_total{container="cf-api-server"}[1m])) by (pod) is consistently above 0.85, that is, more than 85% of a single pod’s CPU allocation (see the alert rule sketch after this list).
  • cc_vitals_uptime is consistently low, indicating frequent restarts (possibly due to memory pressure).
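
If these metrics are scraped into Prometheus, the CPU query can be turned into a standing alert. The following rule is a minimal sketch, not part of cf-for-k8s; the alert name, threshold, and duration are illustrative:

    groups:
    - name: cf-api-server
      rules:
      - alert: CfApiServerHighCpu
        # Fires when a cf-api-server container sustains more than 85% of one core for 10 minutes.
        expr: sum(rate(container_cpu_usage_seconds_total{container="cf-api-server"}[1m])) by (pod) > 0.85
        for: 10m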

Heuristic failures

The following behaviors may occur:

  • Average response latency is elevated (a quick spot check follows this list).
  • Web UIs are slow to respond or requests time out.
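
For a rough spot check of API latency from a workstation, you can time a lightweight endpoint with the cf CLI. This supplements, rather than replaces, the metrics above:

    # Repeat several times and average the wall-clock time.
    time cf curl /v2/info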

Relevant logs

You can find the above heuristic failures in the following logs:

  • kapp logs -a cf -m 'cf-api-server%' -c cf-api-server
  • kapp logs -a cf -m 'cf-api-server%' -c nginx

How to scale

Before and after scaling Cloud Controller API pods, verify that the Cloud Controller database is not overloaded. All Cloud Controller processes are backed by the same database, so heavy load on the database degrades API performance regardless of how many Cloud Controller pods are deployed. Because Cloud Controller supports both PostgreSQL and MySQL, there is no single set of scaling guidance for the database; see the note below.

Cloud Controller API pods should primarily be scaled horizontally. Raising the pod’s CPU request past one core is not effective, because Ruby’s Global Interpreter Lock (GIL) prevents the cloud_controller_ng process from effectively using more than one CPU core at a time.
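
Because cf-for-k8s is rendered with ytt and deployed with kapp, the durable way to add replicas is an overlay rather than an ad hoc kubectl edit, which the next kapp deploy would revert. The following sketch assumes the rendered Deployment is named cf-api-server; confirm the name in your rendered templates:

    #@ load("@ytt:overlay", "overlay")

    #@overlay/match by=overlay.subset({"kind": "Deployment", "metadata": {"name": "cf-api-server"}})
    #@overlay/match-child-defaults missing_ok=True
    ---
    spec:
      replicas: 3

Supply this file as an additional -f argument when rendering the cf-for-k8s templates with ytt, then redeploy with kapp.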

Note Since Cloud Controller supports both PostgreSQL and MySQL external databases, there is no absolute guidance about what a healthy database looks like. In general, high database CPU utilization is a good indicator of scaling issues, but always give precedence to the documentation specific to your database.

cf-api-local-worker

Known as “local workers,” this container is primarily responsible for processing files uploaded to the API pods during cf push, such as packages and droplets, and for resource matching.

When to scale

When determining whether to scale cf-api-local-worker, look for the following:

Key metrics

Cloud Controller emits the following metrics:

  • cc_job_queue_length_cc_CF_API_SERVER_POD_NAME is continuously growing.
  • cc_job_queue_length_total is continuously growing (see the query sketch after this list).
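
Assuming these gauges are scraped into Prometheus, sustained growth can be surfaced with a query along these lines (the 15-minute window is arbitrary):

    # A persistently positive slope means jobs are enqueued faster than workers drain them.
    deriv(cc_job_queue_length_total[15m]) > 0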

Heuristic failures

The following behaviors may occur:

  • cf push is intermittently failing.
  • cf push average time is elevated (a rough baseline check follows this list).
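
To establish a rough baseline, time a push of a small, known-good test app and compare against normal behavior. APP_NAME is a placeholder:

    # Note the wall-clock time and compare across runs.
    time cf push APP_NAME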

Relevant logs

You can find the above heuristic failures in the following logs:

  • kapp logs -a cf -m 'cf-api-server%' -c cf-api-server

How to scale

Because local workers are co-located with the Cloud Controller API pod, they are scaled horizontally along with the API.

cf-api-worker

Known as “generic workers” or just “workers”, these pods are responsible for handling asynchronous work, batch deletes, and other periodic tasks scheduled by the cf-api-clock.

When to scale

When determining whether to scale cf-api-worker, look for the following:

Key metrics

Cloud Controller emits the following metrics:

  • cc_job_queue_length_cc_CF_API_WORKER_POD_NAME (for example, cc_job_queue_length_cc_cf_api_worker_565c45df86_h2nsp) is continuously growing.
  • cc_job_queue_length_total is continuously growing.

Heuristic failures

The following behaviors may occur:

  • cf delete-org ORG_NAME appears to leave the org’s contained resources around for a long time (a quick check follows this list).
  • Users report slow deletes for other resources.
  • cf-acceptance-tests generally succeed, but fail during cleanup.
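
One quick check for the first failure mode is to capture the org GUID before deleting, then poll for spaces that still reference it. ORG_NAME is a placeholder:

    # Capture the GUID first; the org record may disappear before its children do.
    ORG_GUID="$(cf org ORG_NAME --guid)"
    cf delete-org ORG_NAME -f
    # Repeat this query; a non-empty resources list means cleanup is still pending.
    cf curl "/v3/spaces?organization_guids=${ORG_GUID}"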

Relevant logs

You can find the above heuristic failures in the following log files:

  • kapp logs -a cf -m 'cf-api-worker%' -c cf-api-worker

How to scale

The cf-api-worker pod can safely scale horizontally in all deployments.
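
For a quick, temporary change, you can scale the Deployment directly. The Deployment name and namespace below are assumptions, so confirm them in your cluster; also note that a subsequent kapp deploy may revert the change, so persist it with an overlay like the one shown for cf-api-server:

    kubectl scale deployment cf-api-worker -n cf-system --replicas=2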

cf-api-clock and cf-api-deployment-updater

The cf-api-clock job runs the Eirini sync process and schedules periodic background jobs. The cf-api-deployment-updater job is responsible for handling v3 rolling app deployments. For more information, see Rolling App Deployments (Beta).

When to scale

When determining whether to scale cf-api-clock and cf-api-deployment-updater, look for the following:

Key metrics

Cloud Controller emits the following metrics:

  • sum(rate(container_cpu_usage_seconds_total{container="cf-api-clock"}[1m])) by (pod) is high (approaching 1.0).

Heuristic failures

The following behaviors may occur:

  • The number of workload pods in the cf-workloads namespace does not match the total process instance count reported through the Cloud Controller APIs (a quick count follows this list).
  • Rolling deployments are slow to increase and decrease instance counts.
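
To get the Kubernetes side of that comparison (assuming the default cf-workloads namespace):

    # Count running workload pods; compare against the instance totals reported by the Cloud Controller APIs.
    kubectl get pods -n cf-workloads --field-selector=status.phase=Running --no-headers | wc -l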

Relevant logs

You can find the above heuristic failures in the following log files:

  • kapp logs -a cf -m 'cf-api-clock%' -c cf-api-clock
  • kapp logs -a cf -m 'cf-api-deployment-updater%' -c cf-api-deployment-updater

How to scale

Both of these pods are singletons, so additional instances provide high availability through failover rather than increased throughput. Performance issues here are more likely caused by an overloaded database.
