Content-Addressable Storage

Determining CAS Layout and Size

This document describes how to configure the built-in distributed CAS as well as a backup CAS service for an EngFlow Remote Execution cluster.

Distributed CAS

EngFlow Remote Execution comes with a built-in distributed CAS that reuses worker disks to provide persistent storage of files. It automatically replicates all stored files to up to three worker instances (see --replica_count) and makes new copies whenever a worker instance is lost.

CAS Layout

The total disk space is subdivided into the core OS, the per-action execution directories (consisting of input and output trees), and the CAS (consisting of replicas and a local cache).

For example, a dual-core worker node with a 10 GB disk might be subdivided into 2 GB for the OS and two executors with 1 GB each, leaving 6 GB for the CAS, of which up to 3 GB is allocated for replicas.

  OS:                     2 GB
  Execution Directories:  2 workers * 1 GB = 2 GB
  CAS:                    6 GB (3 GB replicas + 3 GB cache)

  Disk size: 2 GB + 2 * 1 GB + 6 GB = 10 GB

For simplicity, we recommend controlling the CAS size indirectly by setting --disk_size to the total disk size. The worker then computes the CAS size as follows:

  • It reserves 20% of the disk size for the operating system.
  • It reserves --max_output_size times the number of local executors (see --worker_config) for the execution directories.
  • The remainder is allocated to the CAS, half of which is storage space for replicas.
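The computation above can be sketched as follows; this is an illustration of the documented arithmetic, and the function and variable names are not part of the EngFlow configuration:

```python
def compute_cas_layout(disk_size_gb, executors, max_output_size_gb):
    """Derive the CAS layout from the total disk size, as described above."""
    os_reserve = 0.20 * disk_size_gb            # 20% reserved for the OS
    exec_dirs = executors * max_output_size_gb  # per-executor output trees
    cas = disk_size_gb - os_reserve - exec_dirs # remainder goes to the CAS
    replicas = cas / 2                          # half of the CAS stores replicas
    return os_reserve, exec_dirs, cas, replicas

# The 10 GB example from above: 2 GB OS, 2 GB execution directories,
# 6 GB CAS, of which 3 GB for replicas.
print(compute_cas_layout(10, 2, 1))  # (2.0, 2, 6.0, 3.0)
```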

Note that the input trees are typically hard-linked into the CAS, so it is not necessary to account for the maximum input tree size here. It is an error if the CAS cache is smaller than the number of executors times the maximum input tree size.

You should observe the following rules when adding additional packages to the base image, increasing the number of executors, or increasing the maximum input and output tree sizes:

  • When adding packages to the base image: the OS size should stay below 10% of the disk size. If it is larger, then you either need to remove packages, or increase the disk size.

  • When increasing the number of executors: with the default maximum input and output tree sizes of 1 GB each, you will need approximately 4-5 GB of disk per executor.

  • When increasing the maximum input or output tree sizes: your disk should be at least 3 times the sum of the maximum input and output tree sizes per executor.
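One way to read these rules as a quick sanity check is sketched below. This is a hypothetical helper, not an EngFlow tool; it assumes the "per executor" qualifier in the last rule refers to the per-executor tree sizes, and it uses the lower bound of the 4-5 GB guideline:

```python
def check_disk_layout(disk_gb, os_gb, executors,
                      max_input_gb=1.0, max_output_gb=1.0):
    """Return warnings for layouts that violate the rules of thumb above."""
    warnings = []
    # Rule 1: the OS image should stay below 10% of the disk size.
    if os_gb > 0.10 * disk_gb:
        warnings.append("OS image exceeds 10% of the disk")
    # Rule 2: roughly 4-5 GB of disk per executor at the default tree sizes.
    if disk_gb < 4 * executors:
        warnings.append("less than ~4 GB of disk per executor")
    # Rule 3: disk at least 3x the sum of the per-executor tree sizes.
    if disk_gb < 3 * (max_input_gb + max_output_gb):
        warnings.append("disk smaller than 3x the per-executor tree sizes")
    return warnings

# The 10 GB example with a 1 GB OS image and two executors passes.
print(check_disk_layout(10, 1, 2))  # []
```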

Configuring an External Persistent Storage Service

In addition to the built-in distributed CAS, EngFlow Remote Execution supports backing up CAS data to an external persistent storage service. If files or file metadata is lost from the distributed CAS, the cluster automatically falls back to the persistent storage service. This can be used to run with smaller disks, or to support auto-scaling.

The EngFlow Remote Execution software is designed to minimize read and write accesses to the storage service. That is, it will preferentially use the built-in distributed CAS to fetch a file rather than the storage service.

Google Cloud Storage

In order to configure GCS as a fallback storage mechanism, you have to create a GCS bucket, ensure that both workers and schedulers have read and write access to the bucket, and then configure the location with the following flags:

If you are running outside of GCP, and you are not using application default credentials, then you also have to specify the location of a credentials file using --gcs_credentials.

Note that the EngFlow Remote Execution service will not manage the lifecycle of replicas stored on GCS. To avoid excessive storage use, we recommend setting up lifecycle rules in the GCS management console.
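For example, a lifecycle configuration that deletes objects 30 days after creation might look like the following; the 30-day retention period is an illustrative choice, not an EngFlow recommendation:

```json
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 30}
    }
  ]
}
```

Such a configuration can be applied with, e.g., gsutil lifecycle set lifecycle.json gs://<bucket-name>, where <bucket-name> is a placeholder for your bucket.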

Amazon S3

In order to configure Amazon S3 as a fallback storage mechanism, you have to create an S3 bucket, ensure that both workers and schedulers have read and write access to the bucket, and then configure the location with the following flags:

The IAM account needs these permissions, e.g. as part of a policy statement like the following, where <bucket-name> is a placeholder for your bucket:

{
    "Effect": "Allow",
    "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:ListBucket"
    ],
    "Resource": ["arn:aws:s3:::<bucket-name>", "arn:aws:s3:::<bucket-name>/*"]
}

If you are running outside of AWS, you have to provide credentials through the default AWS credential chain, e.g. by defining the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.

Note that the EngFlow Remote Execution service will not manage the lifecycle of replicas stored on S3. To avoid excessive storage use, we recommend setting up lifecycle rules in the S3 management console.
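As with GCS, an expiration rule can be expressed as a lifecycle configuration; the rule ID and the 30-day retention period below are illustrative placeholders:

```json
{
  "Rules": [
    {
      "ID": "expire-cas-objects",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Expiration": {"Days": 30}
    }
  ]
}
```

This can be applied via the console or with, e.g., aws s3api put-bucket-lifecycle-configuration --bucket <bucket-name> --lifecycle-configuration file://lifecycle.json.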

2021-09-21