Troubleshooting Cloud Foundry

Page last updated:

This guide provides help with diagnosing and resolving issues encountered when installing and running Cloud Foundry.

Troubleshoot Cloud Foundry Installation Issues

An installation or update can fail for reasons that have nothing to do with the software that you are installing. If an installation or update fails once, start over and try again. If it fails a second time, use the following information to troubleshoot the issue.

Timeouts in “creating bound missing vms” Phase

When deploying Cloud Foundry with BOSH, the “creating bound missing vms” phase occurs after package compilation. A process in the “creating bound missing vms” phase can time out if a BOSH Agent fails to start correctly or if the Agent cannot connect to the NATS message bus.

. . .
Started preparing package compilation > Finding packages to compile. Done (00:00:00)

Started creating bound missing vms
Started creating bound missing vms > api_worker_z1/0. Failed: Timed out pinging to 1da06ba3de2f after 600 seconds (00:12:45)

Error 450002: Timed out pinging to 1da06ba3de2f after 600 seconds

Perform the following steps to determine the cause of the time out:

  • Use your IaaS console to make sure the timed-out virtual machine (VM) is booting correctly
  • Check the Agent log on the VM that timed out for a “handshake” connection between the BOSH Director and the Agent
  • Use the Netcat networking utility to test for routing issues to the NATS IP and port from the VM that timed out

Use your IaaS console to make sure the timed-out VM is booting correctly

For details on how to use your IaaS console to make sure the timed out VM is booting correctly, see your IaaS documentation.

Check the Agent log on the VM that timed out for a “handshake” connection between the BOSH Director and the Agent

  1. Use your IaaS virtualization console to open a terminal window on the VM that timed out and log in as root.

  2. Open the /var/vcap/bosh/log/current log file in a text editor.

  3. Search the log file for a “handshake” between the BOSH Director and the BOSH Agent. This connection is represented in the log as a ping and a pong:

    . . .
    2013-10-03_14:35:48.58456 #[608] INFO: Message: {"method"=>"ping", "arguments"=>[], reply_to"=>"director.b668-1660944090e4"}
    2013-10-03_14:35:48.60182 #[608] INFO: reply_to:director.b668-1660944090e4: payload: {:value=>"pong"}
    
  4. If the handshake does not complete, the Agent cannot communicate with the Director.

Use Netcat to test for routing issues to the NATS IP

  1. Use your IaaS virtualization console to open a terminal window on the VM that timed out and log in as root.

  2. Open the /var/vcap/bosh/log/current log file in a text editor.

  3. Search the beginning of the log file for a line labeled INFO: loaded new infrastructure settings. This line contains a JSON blob of key/value pairs representing the expected infrastructure for the BOSH Agent.

    In this line, locate the IP address and port following nats://nats:nats@.

    . . .
    2013-10-03_14:35:21.83222 #[608] INFO: loaded new infrastructure settings:  {"vm"=>{"name"=>"vm-4d80ede4-b0a5", "id"=>"vm-360"}, {"user"=>"agent", "password"=>"agent"}}, "mbus"=>"nats://nats:nats@192.0.2.17:4222", "env"=>{"bosh"}}}}}
    
  4. Run the Netcat command nc -v IP-ADDRESS PORT to determine whether it is possible to establish a connection between the NATS message bus and this VM.

    $ nc -v 192.0.2.17 4222
    Connection to 192.0.2.17 4222 port [tcp/*] succeeded!
    

Out of Disk Space Error

If, during installation, log files fill all available disk space on a computer, the operating system reports an “Out of Disk Space” error.

To resolve this issue, delete these log files from the tmp directory of the affected computer manually or by rebooting.

Troubleshoot Cloud Foundry Operation Issues

Use the BOSH CLI for Troubleshooting

  1. Run bosh target DIRECTOR-IP-ADDRESS and provide your credentials to log into the BOSH Director.

    $ bosh target 192.0.2.10
    Target set to 'bosh'
    Your username: admin
    Enter password: *****
    Logged in as 'admin'
    
  2. Use the following BOSH commands to troubleshoot your deployment:

  • VMS: Lists all VMs in a deployment
  • Cloudcheck: Cloud consistency check and interactive repair
  • SSH: Start an interactive session or execute commands with a VM

BOSH VMS

bosh vms provides an overview of the virtual machines (VMs) BOSH is managing as part of the current deployment.

$ bosh vms

+-----------------------------------+---------+---------------------------------+---------------+
| Job/index                         | State   | Resource Pool                      | IPs           |
+-----------------------------------+---------+---------------------------------+---------------+
| unknown/unknown                   | running | smoke-tests                        | 192.0.2.14 |
| cloud_controller/0                | running | cloud_controller                   | 192.0.2.23 |
| collector/0                       | running | collector                          | 192.0.2.25 |
| consoledb/0                       | running | consoledb                          | 192.0.2.29 |
| dea/0                             | running | dea                                | 192.0.2.47 |
| health_manager/0                  | running | health_manager                  | 192.0.2.20 |
| loggregator/0                     | running | loggregator                     | 192.0.2.31 |
| loggregator_trafficcontroller/0   | running | loggregator_trafficcontroller   | 192.0.2.32 |
| nats/0                            | running | nats                               | 192.0.2.19 |
| nfs_server/0                      | running | nfs_server                         | 192.0.2.21 |
| router/0                          | running | router                             | 192.0.2.16 |
| saml_login/0                      | running | saml_login                         | 192.0.2.28 |
| syslog/0                          | running | syslog                             | 192.0.2.24 |
| uaa/0                             | running | uaa                                | 192.0.2.27 |
| uaadb/0                           | running | uaadb                              | 192.0.2.26 |
+-----------------------------------+---------+---------------------------------+---------------+

bosh vms may show a VM in an unknown state. Run bosh cloudcheck on VMs in an unknown state to have BOSH attempt to diagnose the problem.

You can also use bosh vms to identify VMs in your deployment, then use bosh ssh to SSH into an identified VM for further troubleshooting.

bosh vms supports the following arguments:

  • –details: Overview also includes Cloud ID, Agent ID, and whether or not the BOSH Resurrector has been enabled for each VM
  • –vitals: Overview also includes load, CPU, memory usage, swap usage, system disk usage, ephemeral disk usage, and persistent disk usage for each VM
  • –dns: Overview also includes the DNS A record for each VM

BOSH Cloudcheck

bosh cloudcheck attempts to detect differences between the VM state database that the BOSH Director maintains and the actual state of the VMs. For each difference detected, bosh cloudcheck offers repair options:

  • Reboot VM: Instructs BOSH to reboot a VM. Rebooting can resolve many transient errors.
  • Ignore problem: Instructs bosh cloudcheck to do nothing. You might want to instruct bosh cloudcheck to ignore a problem in order to run bosh ssh and attempt troubleshooting directly on the machine.
  • Reassociate VM with corresponding instance: Updates the BOSH Director state database. Use this option if you believe that the BOSH Director state database is in error and that a VM is correctly associated with a job.
  • Recreate VM using last known apply spec: Instructs BOSH to destroy a VM and recreate it from the deployment manifest the installer provides. Use this option if a VM is corrupted.
  • Delete VM reference: Instructs BOSH to delete a VM reference in the Director state database. If a VM reference exists in the state database, BOSH expects to find an agent running on the VM. Select this option only if you know this reference is in error. Once you delete the VM reference, BOSH can no longer control the VM.
Example Scenarios

Unresponsive Agent

$ bosh cloudcheck
ccdb/0 (vm-3e37133c-bc33-450e-98b1-f86d5b63502a) is not responding:
- Ignore problem
- Reboot VM
- Recreate VM using last known apply spec
- Delete VM reference (DANGEROUS!)

Missing VM

$ bosh cloudcheck
VM with cloud ID `vm-3e37133c-bc33-450e-98b1-f86d5b63502a' missing:
- Ignore problem
- Recreate VM using last known apply spec
- Delete VM reference (DANGEROUS!)

Unbound Instance VM

$ bosh cloudcheck
VM vm-3e37133c-bc33-450e-98b1-f86d5b63502a' reports itself asccdb/0' but does not have a bound instance:
- Ignore problem
- Delete VM (unless it has persistent disk)
- Reassociate VM with corresponding instance

Out of Sync VM

$ bosh cloudcheck
VM vm-3e37133c-bc33-450e-98b1-f86d5b63502a' is out of sync: expectedcf-d7293430724a2c421061: ccdb/0', got `cf-d7293430724a2c421061: nats/0':
- Ignore problem
- Delete VM (unless it has persistent disk)

BOSH SSH

Use bosh ssh to open secure shells into the VMs in your deployment.

To use bosh ssh:

  1. Run ssh-keygen -t rsa to provide BOSH with the correct public key.
  2. Accept the defaults.
  3. Run bosh ssh.
  4. Select a VM to access.
  5. Create a password for the temporary user the bosh ssh command creates. Use this password if you need sudo access in this session.

  $ bosh ssh
  1. ha_proxy/0
  2. nats/0
  3. etcd_and_metrics/0
  4. etcd_and_metrics/1
  5. etcd_and_metrics/2
  6. health_manager/0
  7. nfs_server/0
  8. ccdb/0
  9. cloud_controller/0
  10. clock_global/0
  11. cloud_controller_worker/0
  12. router/0
  13. uaadb/0
  14. uaa/0
  15. login/0
  16. consoledb/0
  17. dea/0
  18. loggregator/0
  19. loggregator_trafficcontroller/0
  20. smoke-tests/0

Choose an instance: 17 Enter password (use it to sudo on remote host): ******* Target deployment `cf_services-2c3c918a135ab5f91ee1'

Setting up ssh artifacts Starting interactive shell on job loggregator_trafficcontroller/0

View BOSH Logs

You can access BOSH logs by two methods:

  • Using the bosh ssh command to access the log location
  • Using the bosh logs command to output the logs to standard output or to a file

Using the BOSH SSH Command

Use bosh ssh to open secure shells into the VMs in your deployment, then access the logs on the VM.

To use bosh ssh:

  1. Run ssh-keygen -t rsa to provide BOSH with the correct public key.
  2. Accept the defaults.
  3. Run bosh ssh.
  4. Select a VM to access.
  5. Review the /var/vcap/bosh/log/current log file.

Using the BOSH Logs Command

Use bosh logs to output BOSH logs to standard output or to a file.

To use bosh logs:

  1. Run bosh vms to identify VMs in your deployment by job name and index.
  2. Run bosh logs JOB-NAME INDEX to view the logs from the identified VM.

Log into a Non-Responsive BOSH VM

A VM under heavy system load can stop responding to some commands but still function in a limited way. If the VM does not respond to bosh ssh, use the following steps to open a secure shell in a more direct manner:

  1. Run bosh vms and note the IP address of the non-responsive VM.
  2. Run ssh -t vcap@IP-ADDRESS 'sh' where IP-ADDRESS is the IP address of the non-responsive VM.

    $ bosh vms
    
    +---------------+---------+----------------+---------------+
    | Job/index     | State   | Resource Pool  | IPs           |
    +---------------+---------+----------------+---------------+
    | mysql_node/0  | unknown | mysql_node     | 192.0.2.53 |
    +---------------+---------+----------------+---------------+
    
    $ ssh -t vcap@192.0.2.53 'sh'
    

Terminate a BOSH SSH Session

Use ~., entered at the beginning of a new line, to terminate a bosh ssh or ssh session.

To terminate a bosh ssh or ssh initiated from the jumpbox or other server, use ~~., entered at the beginning of a new line. The outermost secure shell session consumes the second ~ and passes the remaining ~. command to the inner ssh session.

Debug a Failing Job

  1. Run bosh vms to determine which job VMs in your deployment are failing. Note the job name and index of the failing VM.
  2. Run bosh ssh JOB-NAME/INDEX to open a secure shell into the failing VM.
  3. Run sudo su - to enter the root environment with root privileges.
  4. Run monit summary to determine which processes are not running.
  5. Review the log files found in /var/vcap/sys/log/ to determine the root cause of the process failures. Some of these logs are formatted using steno with timestamps instead of human-formatted dates. You can use steno-prettify to make the logs more human-readable.
  6. Use monit restart all or monit restart PROCESS to start the processes.

Execute the following commands to set up steno-prettify in the cloudcontroller:

export CC_JOB_DIR=/var/vcap/jobs/cloud_controller_ng
source $CC_JOB_DIR/bin/ruby_version.sh

CC_PACKAGE_DIR=/var/vcap/packages/cloud_controller_ng
export BUNDLE_GEMFILE=$CC_PACKAGE_DIR/cloud_controller_ng/Gemfile
export HOME=/home/vcap # rake needs it to be set to run tasks

if [ -f $BUNDLE_GEMFILE ]; then
    alias  steno-prettify="bundle exec steno-prettify"
    echo "ready to use steno-prettify alias, try steno-prettify on one of the following files:"
    find /var/vcap/sys/log/ -name "*.log" | egrep  -v "err|out|ctl" | xargs ls -al
else
    echo "could not find Gemfile into ${STENO_DIR}:" `ls -al ${BUNDLE_GEMFILE}`
fi

View Cloud Controller Diagnostics

If your BOSH deployment experiences a resource spike, an infinite loop, or another undesirable performance issue, you can refer to the Cloud Controller (CC) diagnostics JSON file for debugging. The file contains data that are unavailable in the output from Cloud Foundry logging, including stack traces for running threads.

Follow the instructions below to SSH into the Cloud Controller VM and populate the diagnostics file with updated information.

  1. Provide BOSH with the correct public key.
    $ ssh-keygen -t rsa
    
  2. Accept the defaults.
  3. Begin an SSH session to the deployment.
    $ bosh ssh
    
  4. Enter the number corresponding to the cloud_controller from the list of VMs.
  5. Retrieve the PID of the CC process.
    $ cat /var/vcap/sys/run/cloud_controller_ng/cloud_controller_ng.pid 
  6. Send the user defined USR1 signal to the CC process to populate the JSON log file for reference. This command does not terminate the CC process.
    $ kill -USR1 CLOUD-CONTROLLER-PID
    
  7. Navigate to the JSON file in the diagnostics directory to see the record of the USR1 signal passed to your CC VM. Refer to the value of the cc.directories.diagnostics property in /var/vcap/jobs/cloud_controller_ng/config/cloud_controller_ng.yml for the location of the diagnostics directory. The default location of the diagnostics directory is /var/vcap/data/cloud_controller_ng/diagnostics.

Use the Interactive Cloud Controller Shell

The Cloud Controller jobs embeds a pry shell. This may allow users to interact with cc_ng classes, such as scripting some operations, or to access the cc_db using model classes. For examples, the CF CAPI Tracker

$ cd /var/vcap/jobs/cloud_controller_ng
$ bin/console
[...]

Troubleshoot Cloud Foundry Databases

The postgres BOSH job hosts the different databases used by Cloud Foundry, such as diego, ccng, and uaadb. If the postgres job reaches 100% persistent disk usage, it can impact performance. Perform the following steps to diagnose your postgres job:

  1. Retrieve the credentials for the postgres job from your BOSH manifest.
  2. Connect to the database using the PostgreSQL client of your choice. Alternatively, you can connect to the database using a PostgreSQL web interface such as phppgadmin-cf or open an SSH tunnel to the database using cf ssh.
  3. Use the following SQL query to identify the largest schema:

    SELECT schema_name,
           pg_size_pretty(sum(table_size)::bigint),
           (sum(table_size) / pg_database_size(current_database())) * 100
    FROM (
      SELECT pg_catalog.pg_namespace.nspname as schema_name,
             pg_relation_size(pg_catalog.pg_class.oid) as table_size
      FROM   pg_catalog.pg_class
         JOIN pg_catalog.pg_namespace ON relnamespace = pg_catalog.pg_namespace.oid
    ) t
    GROUP BY schema_name
    ORDER BY schema_name
    
  4. Using the output from the above query, identify the largest table.

Force a VM Recreate

If BOSH or an operator identifies a VM as corrupted, BOSH can recreate the VM. This recreation function can fail if a drain script on the VM is broken or times out. To resolve this issue:

  1. SSH into the VM and use the sv stop agent command to kill the BOSH Agent.

    $ bosh ssh myVM
    $ sv stop agent
    
  2. Let the BOSH Health Monitor automatically restart the VM. To follow the status of this process:

  • Run bosh tasks --no-filter to determine the task ID of the “scan and fix” task.
  • Run bosh task TASK-ID with the task ID of the “scan and fix” task.

    $ bosh tasks --no-filter
    +----+---------+--------------------+-------+--------------+--------+
    | #  | State   | Timestamp          | User  | Description  | Result |
    +----+---------+--------------------+-------+--------------+--------+
    | 83 | running | 01-21 12:07:34 UTC | admin | scan and fix | active |
    +----+---------+--------------------+-------+--------------+--------+
    
    $ bosh task 83
    
    Director task 83
      Started scanning 1 vms
    . . .
    

Scale Droplet Execution Agents

In older versions of Cloud Foundry that use the Droplet Execution Agent (DEA) architecture, DEAs stage and run applications. Follow the steps below to increase the resource pool for DEAs and the number of DEAs available in your deployment.

  1. Run bosh download manifest YOUR-DEPLOYMENT-NAME prod-current.yml to download a copy of your existing deployment manifest.
  2. Run bosh deployment prod-current.yml to instruct BOSH to reference the downloaded deployment manifest.
  3. Edit prod-current.yml, increasing the resource pool for DEAs and the number of DEAs.
  4. Run bosh deploy to update your deployment.

Recover from HM9000 Failure

In older versions of Cloud Foundry that use the Droplet Execution Agent (DEA) architecture, the Cloud Foundry Health Manager, HM9000, reconciles the expected states of applications and VMs in a deployment with their actual states. To do this, the HM9000 uses state information kept in etcd, a high-availability data store distributed across multiple nodes.

If etcd enters a bad state and becomes unable to write data, the HM9000 cannot operate correctly. To resolve this issue, delete the contents of the etcd data store. Doing so forces Cloud Foundry to rebuild the etcd data store from current data and allows the HM9000 to recover.

To delete the contents of the etcd data store:

  1. Run bosh vms to identify the etcd nodes in your deployment.
  2. Run bosh ssh to open a secure shell into each etcd node.
  3. Run monit stop etcd on each etcd node.
  4. Delete or move the etcd storage directory, /var/vcap/store/etcd, on each etcd node.
  5. Run the script /var/vcap/jobs/etcd/bin/pre-start to re-create the storage directory.
  6. Run monit start etcd on each etcd node.
Create a pull request or raise an issue on the source for this page in GitHub