BOSH Backup and Restore Developer's Guide

Overview

This guide describes the framework for release authors to add backup and restore functionality to their release by using BOSH Backup and Restore (BBR).

BBR is a framework for backing up and restoring BOSH deployments and BOSH Directors. BBR triggers the backup or restore process on the deployment or BOSH Director and transfers the backup artifacts to and from the deployment or BOSH Director.

The BBR framework consists of a command line interface (CLI) and a set of hooks, which call out to scripts. The framework uses BOSH to orchestrate the execution of backup and restore scripts. BBR allows release authors fine-grained control over how their release is backed up.

Backup Mechanism

Different systems require different strategies for backup and restore. To help release authors make consistent backups of their systems, the BBR framework provides hooks to lock scripts that bring a job to a consistent state before backup begins.

The following diagram illustrates an example BBR script execution sequence.

Script execution order

Because the framework is focused on the interface, hooks, backup destination, and intra-deployment orchestration, it is mechanism-agnostic. Release authors can choose to back up their release how they like, as long as the scripts they write have the correct name and directory structure.

To ensure that backup and restore scripts do not get out of sync with the releases themselves, release authors are responsible for naming, creating, testing, and troubleshooting their own backup and restore scripts.

Important: For all releases, be aware of the possibility that these backup and restore jobs may be collocated with those of other releases. Therefore, release authors must give these jobs unique and descriptive names to avoid name collisions.

Script Organization

BBR sets out a contract with release authors to call designated backup and restore scripts:

  • Backup
    • pre-backup-lock
    • backup
    • post-backup-unlock
  • Restore
    • pre-restore-lock
    • restore
    • post-restore-unlock
  • Metadata script (if needed)
    • metadata

Release authors must implement scripts as part of the BBR contract. Package and distribute the scripts as part of your BOSH release. The Exemplar Backup and Restore Release provides examples of how release authors can structure their jobs to implement the contract with BBR.

The BBR CLI can remotely locate and run the scripts, even if they are located on different VMs. For example, the lock/unlock scripts can be located on release VMs while, due to disk-space constraints, the backup/restore scripts are co-located on a separate backup-restore VM.

Scripts are executed in a specific order. For the backup workflow, the order is pre-backup-lock, backup, and then post-backup-unlock. For the restore workflow, the order is pre-restore-lock, restore, and then post-restore-unlock.

All scripts should be executable scripts. ERB tags may be used for templating.

These scripts are executed similarly to other release job scripts, such as start, stop, or drain, and you can use the job’s package dependencies.

BBR checks exit codes of the scripts when orchestrating. Exit code 0 indicates success, and any other exit code indicates a failure.

Directory Structure of a BOSH Release with BBR Scripts

The following is an example directory structure of a BOSH release that includes BBR scripts.

acme-release
├── README.md
├── config
│   ├── blobs.yml
│   └── final.yml
├── jobs
│   ├── backup-restore-acme
│   │   ├── monit
│   │   ├── spec
│   │   └── templates
│   │       ├── backup.sh.erb
│   │       ├── config.json.erb
│   │       └── restore.sh.erb
│   ├── lock-unlock-acme
│   │   ├── monit
│   │   ├── spec
│   │   └── templates
│   │       ├── post-backup-unlock.sh.erb
│   │       ├── pre-backup-lock.sh.erb
│   │       └── metadata.sh.erb (optional)
│   ├── lock-unlock-acme-interferer
│   │   ├── monit
│   │   ├── spec
│   │   └── templates
│   │       ├── post-backup-unlock.sh.erb
│   │       ├── pre-backup-lock.sh.erb
│   │       └── metadata.sh.erb (optional)
├── packages
└── src

Script Ordering across Jobs

During the backup workflow, all pre-backup-lock scripts are invoked before the backup scripts on all of the jobs in the deployment. All backup scripts are invoked before the post-backup-unlock scripts on all of the jobs in the deployment.

During the restore workflow, all pre-restore-lock scripts are invoked before the restore scripts on all of the jobs in the deployment. All restore scripts are invoked before the post-restore-unlock scripts on all of the jobs in the deployment.

By default pre-backup-lock and pre-restore-lock scripts from different jobs are invoked in an arbitary order. If you want to specify an order for lock scripts from specific jobs you can do so in the metadata script.

The post-backup-unlock and post-restore-unlock scripts from different jobs are invoked in an arbitrary order when no locking dependency is specified, or the opposite order of lock scripts as stated in metadata scripts. When a job specifies locking dependency to a job that does not exist in the current deployment, BBR ignores that dependency.

For more information, see Metadata Script.

Logs

The stdout and stderr streams are captured and sent to the operator who invokes the backup and restore.

Important: Release authors should avoid printing sensitive information to stdout or stderr. BBR prints any output from the scripts in case of failure. Make sure your script does not print any sensitive data, such as credentials. In particular, if you are using set -x in your script, be sure to add set +x before you use any credentials.

Backup Workflow

Pre-Backup-Lock

The release job can have a pre-backup-lock script that stops any processes that could make changes to the components being backed up. This script must allow the job to lock so that backups are consistent across a cluster.

For example, in a Cloud Foundry deployment, the pre-backup-lock script stops Cloud Controller processes that may make changes to its blobstore and database thus ensuring the blobstore and the Cloud Controller database are consistent with each other.

Job Configuration

To add a pre-backup-lock script to a job, do the following:

  1. Create a script with any name in the templates directory of your job.
  2. In the templates section of the release job spec file, add the script name and bin/bbr/pre-backup-lock as a key-value pair. For example:
    ---
    name: lock-unlock-acme
    templates:
      pre-backup-lock.sh.erb: bin/bbr/pre-backup-lock
    
  3. If you want your bin/bbr/pre-backup-lock scripts to run in a specific order, define that order using the optional metadata script. In the sample directory structure illustrated above, this script is called metadata.sh.erb.

    a. The metadata script specifies that the current job must be locked before some other job(s) in the deployment. When run, this script must print a YAML file to stdout. For example:
    #!/usr/bin/env bash
    
    echo "---
    backup_should_be_locked_before:
    - job_name: lock-unlock-acme
      release: acme-release" 
    b. Add an entry to the templates section of the release job spec file:
    templates:
      ...
      metadata.sh.erb: bin/bbr/metadata
    
  4. See Metadata Script for information about the properties you can use in this script.

Backup

The release job can have a backup script that dumps the backup of the job’s database to the directory specified by $BBR_ARTIFACT_DIRECTORY. For example, when backing up MySQL, this script can invoke the mysqldump binary for the MySQL adapter.

There must be at least one job in the deployment providing a backup script. If there isn’t a backup script, calling backup or pre-backup-check for the deployment will fail.

Job Configuration

To add a backup script to a release job:

  1. Create a script with any name in the templates directory of a release job.
  2. In the templates section of the release job spec file, add the script name and bin/bbr/backup as a key-value pair. For example:

    ---
    name: backup-restore-acme
    templates:
      backup.erb: bin/bbr/backup
    

Post-Backup-Unlock

The backup and restore job can have a post-backup-unlock script that will undo the operations done by pre-backup-lock.

Job Configuration

To add a post-backup-unlock script to a release job:

  1. Create a script with any name in the templates directory of a release job.
  2. In the templates section of the release job spec file, add the script name and bin/bbr/post-backup-unlock as a key-value pair. For example:

    ---
    name: lock-unlock-acme
    templates:
      post-backup-unlock.erb: bin/bbr/post-backup-unlock
    

Error Handling During Backup

If errors occur during the backup workflow, clean-up tasks are executed to put the system back in a working state.

A normal workflow for backing up a deployment does the following:

graph TB start(Start) --> check-deployment(Check that the deployment exists) check-deployment -- Yes --> make-local-artifact(Make a local artifact dir) check-deployment -- No --> exit(Exit) make-local-artifact -- Success --> pre-backup(Run pre-backup lock scripts) make-local-artifact -- Failure --> exit pre-backup -- Success --> backup(Run backup scripts) pre-backup -- Failure --> post-backup(Run post-backup-unlock scripts) backup --> post-backup post-backup -- backup was successful --> drain(Drains artifact from instance) post-backup -- backup failed --> cleanup(Removes backup from the instance) drain --> cleanup cleanup --> exit

Restore Workflow

Pre-Restore-Lock

The release job can have a pre-restore-lock script that stops any processes that could make changes to the components being restored. This script must allow the job to lock so that restorations are consistent across a cluster.

Job Configuration

To add a pre-restore-lock script to a job, do the following:

  1. Create a script with any name in the templates directory of your job.

  2. In the templates section of the release job spec file, add the script name and bin/bbr/pre-restore-lock as a key-value pair. For example:

    ---
    name: lock-unlock-acme
    templates:
      pre-restore-lock.sh.erb: bin/bbr/pre-restore-lock
    
  3. If you want your bin/bbr/pre-restore-lock scripts to run in a specific order, define that order using the optional metadata script. In the sample directory structure illustrated above, this script is called metadata.sh.erb.

    The pre-restore-lock scripts are called before any restore scripts have been called. Success indicates the job is ready to be restored.

    a. The metadata script specifies that the current job must be locked before some other job(s) in the deployment. When run, this script must print a YAML file to stdout. For example:

      #!/usr/bin/env bash
      echo "---
      restore_should_be_locked_before:
        job_name: lock-unlock-acme
        release: acme-release"
      

    b. Add an entry to the templates section of the release job spec file:

    templates:
      ...
      metadata.sh.erb: bin/bbr/metadata
    

    See Metadata Script for information about the properties you can use in this script.

Restore

If a release has a backup script, it should also have a restore script. The restore script expects a backup artifact to be provided in $BBR_ARTIFACT_DIRECTORY. For example, when restoring MySQL, the script invokes mysql to restore from a mysqldump file.

Job Configuration

To add a restore script to a release job:

  1. Create a script with any name in the templates directory of a release job.
  2. In the templates section of the release job spec file, add the script name and bin/bbr/restore as a key-value pair. For example:

    ---
    name: backup-restore-acme
    templates:
      restore.erb: bin/bbr/restore
    

Post-Restore-Unlock

The release job can have a post-restore-unlock script that resumes normal service operation. post-restore-unlock should be idempotent since it can be called multiple times even if pre-restore-lock has not been not called.

Job Configuration

To add a post-restore-unlock script to a release, do the following:

  1. Create a script with any name in the templates directory of your job.
  2. In the templates section of the release job spec file, add the script name and bin/bbr/post-restore-unlock as a key-value pair. For example:

    ---
    name: lock-unlock-acme
    templates:
      post-restore-unlock.sh.erb: bin/bbr/post-restore-unlock
    

Error Handling During Restore

If errors occur during the restore workflow, clean-up tasks are executed to put the system back in a working state.

A normal workflow for restoring a deployment does the following:

graph TB start(Start) --> validate-artifact(Check that the artifact is present and the checksums match) validate-artifact --> check-deployment-exists(Check that the deployment exists) check-deployment-exists -- Yes --> check-deployment-matches-backup(Check the structure of the backup matches the destination deployment) check-deployment-exists -- No --> exit(Exit) check-deployment-matches-backup -- Yes --> copy-to-remote(Copy the backup artifacts to the relevent VMs) check-deployment-matches-backup -- No --> cleanup(Cleanup, close ssh connections) copy-to-remote -- Success --> pre-restore-lock(Run pre-restore-lock scripts) copy-to-remote -- Failure --> cleanup pre-restore-lock -- Success --> restore(Run restore scripts) pre-restore-lock -- Failure --> post-restore-unlock(Run post-restore-unlock scripts) restore --> post-restore-unlock post-restore-unlock --> cleanup cleanup --> exit

Metadata Script

The metadata script is a optional script that would be executed by bbr before any other scripts are executed to get more information about the jobs, for example locking dependencies. The script is expected to print a yaml on standard out with more information about the job for backup and restore.

Job Configuration

To add a metadata script to a release job:

  1. Create a script with any name in the templates directory of a release job.
  2. In the templates section of the release job spec file, add the script name and bin/bbr/metadata as a key-value pair. For example:

    ---
    name: backup-restore-acme
    templates:
      metadata.erb: bin/bbr/metadata
    

Properties

  • backup_should_be_locked_before This property can be use to specify locking dependencies of the current job during backup. The jobs are specified as an array with their release names.

  #!/usr/bin/env bash
  echo "---
  backup_should_be_locked_before:
  - job_name: lock-unlock-acme
    release: acme-release" 

  • restore_should_be_locked_before If you want your bin/bbr/pre-restore-lock scripts to run in a specific order, define that order using the optional metadata script as in the above step, but with the restore_should_be_locked_before key. For example:
#!/usr/bin/env bash
echo "---
restore_should_be_locked_before:
- job_name: lock-unlock-acme
  release: acme-release" 

Testing

The functional testing pattern recommended for BBR scripts is:

  1. Create data for which the release is responsible.
  2. Back up the release.
  3. Delete data.
  4. Restore the release.
  5. Validate that restored data are correct.

Where scripts are implemented for releases that are part of larger deployments, you should perform end-to-end testing that validates consistency across releases.

Backup and Restore Utilities

If your release stores state in a Postgres or MySQL database, deploy the database-backup-restorer job from the backup-and-restore-sdk-release GitHub repository.

Ensure your job templates a config.json as follows:

{
  "username": "db user",
  "password": "db password",
  "host": "db host",
  "port": "db port",
  "adapter": "bbr supported adapter (e.g. mysql or postgres)",
  "database": "database name"
}

Ensure your backup and restore scripts call the appropriate database-backup-restorer binaries as follows:

Backup

/var/vcap/jobs/database-backup-restorer/bin/backup /path/to/config.json

cp <output_file> $ARTIFACT_DIRECTORY

Restore

cp $ARTIFACT_DIRECTORY <output_file>

/var/vcap/jobs/database-backup-restorer/bin/restore /path/to/config.json

BBR and Cloud Foundry Databases

Important: Release authors must ensure the following.
For CF Releases—Your scripts must respect the release_level_backup job property so the scripts run if set to true, but do nothing if set to false.

BBR provides a pattern for Cloud Foundry release authors that ensures their scripts abide by the BBR Framework contract.

A Cloud Foundry operator adds a backup-restore instance to their Cloud Foundry deployment. By default, release-specific database backup and restore job scripts are collocated on this instance along with the database-backup-restorer job from the backup-and-restore-sdk-release GitHub repository.

If your release requires that you stop your component’s processes during backup and restore, you also need to provide unlock and lock scripts. You may collocate these scripts on either the backup-restore instance or your component’s instance. You should collocate these scripts on the component’s instance if you want to interact directly with monit or your running job.

Example

Acme Release is a Cloud Foundry component that has no persistent disk on its component VM. The release is an API that stores its data in a MySQL database deployed on a different instance group in this CF deployment.

The release authors have chosen to create two new jobs to be incorporated into their existing release; one job includes the backup and restore scripts in its templates directory, and the other job includes the pre-backup-lock and post-backup-unlock scripts. This is because the operator will collocate the former job on the backup-restore VM and the latter job on the Acme Release VM.

Here is how the backup and restore job should be placed:

graph LR subgraph Backup Restore VM database-backup-restorer bbr-acmedb end subgraph Acme VM acme bbr-lock-unlock-acme end subgraph Acme Release jobs/acme-job-->acme jobs/bbr-lock-unlock-acme-->bbr-lock-unlock-acme jobs/bbr-acmedb-->bbr-acmedb end subgraph BBR SDK Release jobs/database-backup-restorer-->database-backup-restorer end classDef lavender fill:#8ca5ce,stroke:#333 classDef turquoise fill:#8bc1ce,stroke:#333 class jobs/acme-job lavender class jobs/bbr-lock-unlock-acme lavender class jobs/bbr-acmedb lavender class jobs/database-backup-restorer turquoise

Naming Conventions

Multiple backup and restore scripts will be collocated on the backup-restore instance, separated by job name. For this reason, each release should name the jobs containing their backup and restore scripts following the pattern bbr-[RELEASE_NAME]db. For example, in the scenario above, the job containing the backup and restore scripts is bbr-acmedb.

Acceptance Tests

The Disaster Recovery Acceptance Test Suite (DRATS) runs against a Cloud Foundry deployment to ensure that backup and restore works as expected. Specifically, DRATS adds state to a Cloud Foundry deployment by testing backup and restore during a CF operation such as pushing an app. DRATS backs up the deployment, restores from the backup, and asserts that the state is present after restore.

To add extra test cases, create a new TestCase that implements the TestCase interface.

You must implement the following methods:

  • PopulateState() must create some state in the Cloud Foundry deployment to be backed up. This deployment’s name must be set to environment variable DEPLOYMENT_TO_BACKUP.
  • CheckState() must assert that the state in the restored Cloud Foundry deployment matches that created by PopulateState(). The restored Cloud Foundry deployment’s name must be set to DEPLOYMENT_TO_RESTORE.
  • Cleanup() must clean up the state created in the Cloud Foundry deployment to be backed up.

You can find further instructions and examples here.

Running Acceptance Tests

Run acceptance tests as follows:

  1. Deploy Cloud Foundry.
  2. From the DRATS GitHub repository, run scripts/run_acceptance_tests.sh with the following environment variables set:
    • DEPLOYMENT_TO_BACKUP — name of the Cloud Foundry deployment
    • DEPLOYMENT_TO_RESTORE — name of the Cloud Foundry deployment
    • BOSH_URL — URL of BOSH Director that has deployed the above Cloud Foundry deployments
    • BOSH_CLIENT — BOSH Director username
    • BOSH_CLIENT_SECRET — BOSH Director password
    • BOSH_CERT_PATH — path to BOSH Director CA certificate
    • BBR_BUILD_PATH — path to BBR binary
Create a pull request or raise an issue on the source for this page in GitHub