Auto-Scaling

Configuring Auto-Scaling for Worker Pools

This document describes how to configure auto-scaling worker pools in an EngFlow Remote Execution cluster. Auto-scaling is the process of automatically adjusting the number of workers or schedulers to match the incoming load. Some of the notes here also apply to VM instances with reduced availability, such as GCP preemptible or AWS spot instances.

While the EngFlow Remote Execution software is designed to gracefully handle instance failures and to use additional resources that are added to the cluster, auto-scaling may remove more machines at once than the cluster can tolerate.

We first give an overview of the built-in failure recovery and scaling mechanisms, and then discuss options for handling cases that go beyond the built-in facilities. Finally, we provide an example for setting up auto-scaling on GCP.

Built-in Failure Recovery & Scaling Mechanisms

The EngFlow Remote Execution software is designed to transparently handle individual instance failures, both of workers and schedulers, and to automatically start using new instances as they are added to the cluster.

If the built-in distributed CAS is used, files are replicated up to three times (see --replica_count), so the system can handle up to two simultaneous worker failures without client-visible failures or data loss.

Similarly, action cache entries are replicated up to three times (see --metadata_replica_count), so the system can handle up to two simultaneous scheduler failures. Note that the action cache is exclusively stored on the schedulers.

However, auto-scaling will typically remove more than two instances at the same time when a significant drop in incoming load is detected.

Auto-Scaling More than 2 Instances

There are three options for handling the case where more than two machines are shut down at the same time:

  1. Rely on the client to re-upload the files / re-run the actions
  2. Separate CAS and worker instances, and disable auto-scaling for CAS instances
  3. Use an external persistent storage service

1. Relying on the Client

If the client has a copy of all the relevant files, a loss of files in the CAS can be recovered automatically by having the client re-upload them. This approach requires no additional configuration, provided that the client correctly handles server errors and is not explicitly configured to discard its copies. The same applies to lost action cache entries: the client recovers by re-running the corresponding actions.

Note that Bazel’s Build-without-the-Bytes feature explicitly disables downloading output files to the client, which can result in build errors if more than two instances fail or are removed simultaneously.

If you are planning to enable this feature in Bazel, we strongly recommend also enabling one of the other options below to avoid intermittent build failures, and increasing the replica retention time (see --default_replica_timeout).

2. Separating CAS and Worker Instances

A second option to increase reliability with auto-scaling is to use separate CAS and worker pools, and only enable auto-scaling for the worker pool. This requires configuring the worker pool without replica space, which signals to the worker instances that they should not store file metadata.

Otherwise, the loss of worker instances can lead to the loss of file metadata, which can make files unrecoverable even if they are still present on the CAS instances. Note that even when worker instances store no replicas, they still need CAS space for action input and output files as well as for the local cache.

You cannot use the generic --disk_size option here; instead, you have to configure the CAS layout explicitly.

This is an example configuration:

  • CAS instances:

      --worker_config=
      --max_cas_size=200gb
      --max_replica_size=200gb
    
  • Worker instances:

      --worker_config=1*cpu=2
      --disk_size=0
      --max_cas_size=8gb
      --max_replica_size=0
    

3. Configuring Additional File Backups

If the built-in mechanisms are not sufficient, it is possible to configure a cluster to rely on an external persistent storage service to keep an additional copy of files and action cache entries. If files or file metadata are lost, the cluster then automatically falls back to the persistent storage service.

See Content-Addressable Storage for details.

GCP Auto-Scaling

You can use GCP’s existing auto-scaler in combination with Google Cloud Operations (formerly StackDriver) monitoring and Google Cloud Storage backups.

You should first enable metrics export to GCO and verify that you can see metrics in the GCP Cloud Console. Enable metrics export to GCO by adding this to your worker configuration:

    --enable_stackdriver=true
    --monitoring_stackdriver_project=${var.project_name}

Automatic Setup with Terraform

If you are using the included Terraform configuration file as part of your GCP setup (see GCP setup), then you may have already created an auto-scaler for your worker pool that uses the used_executors metric.

In this case, you only need to update the Terraform configuration by setting the enable_autoscaler Terraform variable to true:

    variable "enable_autoscaler" {
      type    = bool
      default = true
    }
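
Instead of editing the default in place, a Terraform variable can also be overridden from a variable definitions file. The following is a minimal sketch, assuming that the default in the included configuration is left unchanged and that you keep a terraform.tfvars file next to it (the file location is an assumption, not something the included configuration requires):

    # terraform.tfvars (hypothetical location, next to the included configuration)
    # Overrides the default of the enable_autoscaler variable.
    enable_autoscaler = true

Terraform loads terraform.tfvars automatically on the next plan or apply, so the override takes effect without changing the variable declaration itself.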

Manual Setup

Alternatively, you can manually configure the auto-scaler through the GCP console. The auto-scaler is tied to your worker pool’s instance group manager. Select custom.googleapis.com/opencensus/com.engflow.re.exec/used_executors as the metric with a target utilization of 0.25.
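
For reference, the same settings can be expressed as a Terraform resource. This is only a sketch under stated assumptions, not the auto-scaler from the included configuration: the resource names, the zone, the replica limits, and the cooldown period are hypothetical placeholders, the worker pool is assumed to be a zonal managed instance group, and used_executors is assumed to be exported as a gauge metric.

    # Hypothetical auto-scaler for a worker pool; adjust all names and limits.
    resource "google_compute_autoscaler" "worker_autoscaler" {
      name = "engflow-worker-autoscaler"   # placeholder name
      zone = "us-central1-a"               # placeholder zone

      # Assumes an instance group manager resource named "workers" exists.
      target = google_compute_instance_group_manager.workers.id

      autoscaling_policy {
        min_replicas    = 1     # placeholder lower bound
        max_replicas    = 10    # placeholder upper bound
        cooldown_period = 60    # seconds; placeholder value

        metric {
          name   = "custom.googleapis.com/opencensus/com.engflow.re.exec/used_executors"
          target = 0.25
          type   = "GAUGE"      # assumption: the metric is a gauge
        }
      }
    }

Only the metric name and the target of 0.25 come from the manual setup described above; everything else should be chosen to fit your cluster.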

2021-09-21