Remote Execution Service¶
This document gives a high-level overview of the EngFlow Remote Execution software, details technical requirements, and provides a brief summary of deployment options.
For an overview of remote execution, see What is Remote Execution?
High-Level System Design Overview¶
An EngFlow Remote Execution cluster consists of two types of instances: schedulers and workers. The scheduler instances are the "brain" of the cluster: they coordinate with each other to distribute the incoming work and perform automated routine maintenance operations such as recovering from an instance loss. All client calls first go to a scheduler, which in turn calls other schedulers and workers to fulfill each operation.
The worker instances are the storage and execution units of the cluster. They store all uploaded and generated files persistently and are also responsible for executing actions. Typically, the majority of instances in a cluster are workers.
Components¶
- Clients (user and CI machines) connect to the service through a Load Balancer, which proxies requests to Schedulers.
- Schedulers:
  - terminate TLS and authenticate requests (optionally authenticating clients)
  - accept requests from clients, and delegate action execution requests to available workers
  - maintain an in-memory Action Cache that they share with each other
  - optionally: back up the AC contents to External Storage (GCS or S3)
- Workers:
  - execute actions
  - maintain an on-disk CAS (Content Addressable Storage) space for action inputs and outputs
  - transfer CAS blobs to each other in a peer-to-peer fashion, collectively forming a distributed CAS cloud
  - optionally: replicate CAS blobs for increased reliability
  - optionally: back up the CAS contents to External Storage (GCS or S3)
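The sketch below is a conceptual illustration of this division of labor, not the EngFlow implementation or API: a scheduler answers from its shared Action Cache when it can, and otherwise delegates the action to a worker, which reads inputs from and writes outputs to its local CAS. All names are hypothetical.

```python
# Illustrative sketch only (hypothetical names, not EngFlow's code or API):
# a scheduler consults its shared in-memory Action Cache before delegating
# an action to a worker, which executes it against its on-disk CAS.

from dataclasses import dataclass, field


@dataclass
class Worker:
    """Stores CAS blobs and executes actions."""
    cas: dict = field(default_factory=dict)  # digest -> blob

    def execute(self, action_digest: str, input_digests: list) -> str:
        inputs = [self.cas[d] for d in input_digests]    # fetch inputs from CAS
        result = f"result-of-{action_digest}/{len(inputs)}-inputs"  # stand-in for real execution
        self.cas[f"out-{action_digest}"] = result        # outputs go back into CAS
        return result


@dataclass
class Scheduler:
    """Accepts client requests, checks the Action Cache, delegates work."""
    workers: list
    action_cache: dict = field(default_factory=dict)     # action digest -> result

    def execute(self, action_digest: str, input_digests: list) -> str:
        if action_digest in self.action_cache:            # cache hit: no execution needed
            return self.action_cache[action_digest]
        worker = self.workers[hash(action_digest) % len(self.workers)]
        result = worker.execute(action_digest, input_digests)
        self.action_cache[action_digest] = result         # shared with other schedulers
        return result


if __name__ == "__main__":
    w = Worker(cas={"in-1": b"source file"})
    s = Scheduler(workers=[w])
    print(s.execute("compile-foo", ["in-1"]))   # executed on the worker
    print(s.execute("compile-foo", ["in-1"]))   # served from the Action Cache
```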
Storage¶
By default, the EngFlow Remote Execution cluster uses the workers' disks for persistent storage, automatically replicating files to multiple disks to ensure high availability. When a worker instance fails, the files it stored are automatically re-replicated from the remaining instances.
Configuring storage requires planning and monitoring to maintain disk space availability in production; the system cannot increase or decrease its disk space autonomously.
Alternatively, the cluster can be connected to an external storage service. In this setup, all incoming and generated files are written through to the storage service, and files are fetched from the storage service as needed for action execution. For performance, worker instances cache and reuse files locally as much as possible.
Note that the worker instances still require significant disk space to cache files and to temporarily store input and output trees of executed actions.
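As a rough illustration of this write-through behavior, a worker-side CAS backed by an external storage service might look like the sketch below. The class and method names are assumptions made for the example, not EngFlow's actual interfaces.

```python
# Conceptual sketch (assumed names, not EngFlow's API): every blob is written
# through to external storage, and a local on-disk copy is kept so subsequent
# actions can reuse it without a remote fetch.

class ExternalStorage:
    """Stand-in for an external storage service such as GCS or S3."""
    def __init__(self):
        self._blobs = {}

    def put(self, digest, blob):
        self._blobs[digest] = blob

    def get(self, digest):
        return self._blobs[digest]


class WriteThroughCAS:
    def __init__(self, backend: ExternalStorage):
        self._backend = backend
        self._local = {}                      # local disk cache; bounded in practice

    def put(self, digest, blob):
        self._backend.put(digest, blob)       # persist to external storage first
        self._local[digest] = blob            # keep a local copy for reuse

    def get(self, digest):
        if digest not in self._local:         # cache miss: fetch on demand
            self._local[digest] = self._backend.get(digest)
        return self._local[digest]


backend = ExternalStorage()
cas = WriteThroughCAS(backend)
cas.put("abc123", b"object file")
assert cas.get("abc123") == b"object file"    # served from the local cache
```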
Action Execution¶
Each worker instance can execute a certain number of actions concurrently. To that end, each worker provides a number of executors. Each executor can execute one action at a time and has a set of properties, such as the operating system it runs on, the CPU and RAM provisioned for it, and the availability of additional hardware or software resources. Schedulers consider executors interchangeable if they have identical properties.
Each incoming action has a set of requirements, such as the OS, CPU, RAM, and availability of other resources. The cluster can schedule an action to run on any executor that provides at least the required properties.
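The matching rule can be sketched as follows. This is an illustrative simplification, not the scheduler's actual implementation: an executor can run an action if it provides at least every property the action requires.

```python
# Illustrative matching rule (not EngFlow's scheduler code).

def executor_can_run(executor_props: dict, required_props: dict) -> bool:
    for key, required in required_props.items():
        provided = executor_props.get(key)
        if provided is None:
            return False
        # Numeric properties (CPU, RAM) must meet or exceed the requirement;
        # everything else (OS, labels) must match exactly.
        if isinstance(required, (int, float)):
            if provided < required:
                return False
        elif provided != required:
            return False
    return True


executor = {"os": "linux", "cpu": 8, "ram_mb": 16384}
action   = {"os": "linux", "cpu": 2, "ram_mb": 4096}
assert executor_can_run(executor, action)               # 8 CPUs >= 2, 16 GB >= 4 GB
assert not executor_can_run(executor, {"os": "windows"})  # OS does not match
```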
In order to provide predictable behavior and consistent performance, worker instances enforce a configurable level of action isolation. Fully isolated actions are allowed to access exactly the resources they are allocated, but nothing else.
Of particular interest is CPU isolation: actions are typically limited to a certain number of logical CPUs (CPU cores). Many compilers and tools are single-threaded, so restricting them to a single core is fine. However, tests, especially integration tests, are often multi-threaded and may run slowly when restricted to a single logical CPU.
You have to configure the required number of CPUs on the client side. When configuring CPU counts, you need to balance cluster utilization vs. build latency. Running with fewer CPUs improves utilization, while running multi-threaded actions with more CPUs reduces latency.
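For example, with Bazel the per-target requirements can be expressed through the standard `exec_properties` attribute. The sketch below is a BUILD file fragment; the `"cpu"` property key is an assumption for illustration, and the actual key depends on how your cluster maps platform properties to CPU allocation.

```python
# BUILD file sketch (Starlark). `exec_properties` is a standard Bazel attribute;
# the "cpu" key below is an assumed property name -- use whatever key your
# cluster is configured to map to per-action CPU allocation.
cc_test(
    name = "integration_test",
    srcs = ["integration_test.cc"],
    exec_properties = {
        "cpu": "4",  # assumed key: request 4 logical CPUs for this multi-threaded test
    },
)
```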
Requirements¶
Build Tool Requirements¶
The EngFlow Remote Execution Service implements the open-source Remote Execution API version 2.0.0. Any client that faithfully implements the same version should work. We have successfully tested these clients:
- Bazel
  - 6.x
  - 5.x
  - 4.x
Bazel Compatibility Notes¶
See Bazel Known Issues.