Remote Execution Service

High-level Overview of the System, Requirements, and Deployment Options

This document gives a high-level overview of the EngFlow Remote Execution software, details technical requirements, and provides a brief summary of deployment options.

High-Level System Design Overview

An EngFlow Remote Execution cluster consists of two types of instances, schedulers and workers. The scheduler instances are the “brain” of the cluster: they coordinate with each other to distribute the incoming work and perform automated routine maintenance operations such as recovering from an instance loss. All client calls first go to a scheduler, which in turn calls to other schedulers and workers to fulfill each operation.

The worker instances are the storage and execution units of the cluster. They store all uploaded and generated files persistently and are also responsible for executing actions. Typically, the majority of instances in a cluster are workers.

Components

  • Clients (user and CI machines) connect to the service through a Load Balancer, which proxies requests to Schedulers.

  • Schedulers:

    • terminate TLS and authenticate requests (optionally: authenticate clients)
    • accept requests from clients, and delegate action execution requests to available workers
    • maintain an in-memory Action Cache they share with each other
    • optionally: back up the AC contents to External Storage (GCS or S3)
  • Workers:

    • execute actions
    • maintain an on-disk CAS (Content Addressable Storage) space for action inputs and outputs
    • transfer CAS blobs to each other in a peer-to-peer fashion, collectively creating a distributed CAS cloud
    • optionally: replicate CAS blobs for increased of reliability
    • optionally: back up the CAS contents to External Storage (GCS or S3)

Storage

By default, the EngFlow Remote Execution cluster uses the worker’s disks for persistent storage, automatically replicating files to multiple disks to ensure high availability. When a worker instance fails, the files it contains are automatically re-replicated from the remaining instances.

Configuring storage requires planning and monitoring to maintain disk space availability in production - the system cannot increase or decrease disk space autonomously.

Alternatively, the cluster can be connected to an external storage service. In this setup, all incoming and generated files are written through to the storage service, and files are fetched from the storage service as needed for action execution. For performance, worker instances cache and reuse files locally as much as possible.

Note that the worker instances still require significant disk space to cache files and to temporarily store input and output trees of executed actions.

Action Execution

Each worker instance can execute a certain number of actions concurrently. To that end, each worker provides a number of executors. Each executor can execute one action at a time, and has a number of properties such as the operating system it runs on, CPU and RAM provisioned, as well as the availability of additional hardware or software resources. Schedulers consider executors interchangable if they have identical properties.

Each incoming action has a set of requirements, such as the OS, CPU, RAM, and availability of other resources. The cluster can schedule an action to run on any executor that provides at least the required properties.

In order to provide predictable behavior and consistent performance, worker instances enforce a configurable level of action isolation. Fully isolated actions are allowed to access exactly the resources they are allocated, but nothing else.

Of particular interest is CPU isolation: actions are typically limited to a certain number of logical CPUs (CPU cores). Many compilers and tools are single-threaded and restricting them to a single core is fine. However, tests, especially integration tests, are often multi-threaded, and may run slowly when restricted to a single logical CPU.

You have to configure the required number of CPUs on the client side. When configuring CPU counts, you need to balance cluster utilization vs. build latency. Running with fewer CPUs improves utilization, while running multi-threaded actions with more CPUs reduces latency.

Requirements

Build Tool Requirements

The EngFlow Remote Execution Service implements the open-source Remote Execution API version 2.0.0. Any client that faithfully implements the same version should work. We have successfully tested these clients:

Bazel Compatibility Notes

Bazel versions 3.7.{0,1,2} have an issue in the bundled Java builder, which sometimes suppress Java compiler errors when used with remote execution (see upstream bug at https://github.com/bazelbuild/bazel/issues/12959.

Bazel versions before 4.0.0 have an issue with dynamic linking when used with remote execution on macOS. Bazel 4.0.0 has a flag --incompatible_macos_set_install_name to fix the issue. As of 2021-02-04, this flag is not enabled by default yet (see upstream bug at https://github.com/bazelbuild/bazel/issues/7415).

As of 2020-07-13, Bazel does not support cross-platform remote execution: the host OS (what Bazel runs on) must be the same type of OS as the execution OS (what remote executors run on). For example you cannot run the remote executors on Linux and Bazel on Windows; both OSs have to be Linux. Note that there’s no restriction on the target OS or target architecture (what the compiled binaries will run on), so you can cross-compile to any platform as long as the compiler can run on the execution OS.

Update: as of 2021-02-04, we have successfully run builds on macOS, with remote execution running on Linux.

Deployment Requirements

To deploy the EngFlow Remote Execution Service, you need to provide machines, virtual machines, or a Kubernetes cluster. The minimum hardware and software requirements per instance are:

  • General

    • Linux-based operating system
      • Debian 10 (Buster)
      • Ubuntu 18.04 (Bionic Beaver)
    • macOS
      • 10.14 (Mojave)
      • 10.15 (Catalina)
  • Worker

    • 5 GB disk plus 5 GB disk per executor
    • 1 logical CPU per executor or 1 logical CPU if no executors are configured
    • 1 GB RAM plus 1 GB RAM per executor
  • Scheduler

    • 4 logical CPUs, 16 GB RAM

Note:

  • The schedulers do not have to run on the same OS as the workers.

  • If you have multiple schedulers, you also need to provide a gRPC-compatible load balancer that distributes the incoming calls to the schedulers.

  • As of 2020-08-05, EngFlow RE does not support distributed clusters where there is significant network latency between one part of the cluster and the rest, e.g., an on-prem worker pool with dedicated hardware connected to a main cluster hosted in the cloud.

Network Requirements

You must provide a reliable high-performance network connection between all the instances in a cluster, 1 Gigabit Ethernet equivalent or better. We recommend setting up a private network for each cluster such that a) only access from trusted networks is allowed, and b) only the public scheduler ports are reachable from outside the cluster. If your cluster is reachable from the public internet, you must provide an appropriately configured firewall, and also configure server and client authentication.

  • Ports on worker nodes must not be reachable from the public internet. Either the cluster must run on a private network, or an appropriately configured firewall must be used.

  • Only the public port on scheduler nodes may be reachable from the public internet. If a scheduler’s public port is reachable, then both server- and client-authentication must be configured such that it is impossible for an external party to impersonate either server (man-in-the-middle attacks) or client (no access to unauthenticated clients).

  • The only supported mechanism for server authentication is TLS, see Authentication. The included demo certificates should be considered public information, and must not be used for publicly accessible schedulers.

  • Supported mechanisms for client authentication include TLS client certificates as well as GCP authentication tokens, see Authentication.

  • Depending on configuration, actions can access and modify critical local resources as well as access any secrets stored on worker instances. Such clusters must only run trusted code.

EngFlow GmbH cannot be held liable for source code leaks, loss of data, or other issues arising from an improper network configuration.

Monitoring Requirements

The EngFlow Remote Execution software uses a monitoring middleware that supports integrating with a variety of monitoring backends and services. Out of the box, we support the following monitoring services:

  • Prometheus 2.7.0 or later
  • Google Cloud Operations (formerly StackDriver)
  • Zipkin 2.17.0 or later

Deployment Process

The deployment process differs based on what operating system you deploy on, and whether you deploy to a set of bare-metal machines, virtual machines, or an existing Kubernetes cluster.

This section provides a brief summary of the pros and cons of the different deployment types.

Bare-Metal & Virtual Machine Deployments

Setup instructions? See our guides.

Virtual machine deployments are easier to setup and maintain than bare-metal, with both providing high performance and strong action isolation.

On Linux, we recommend using Docker for action isolation. This means that actions are run in individual Docker containers which restrict CPU and memory usage, as well as access to the underlying machine. Docker prevents actions from interfering with each other and with the underlying system, providing predictable performance and high reliability.

Docker action isolation requires client configuration. We recommend the open-source Bazel Toolchains Rules for configuring Bazel. Since the execution is controlled through client configuration, the action execution environment can be changed or updated without having to reconfigure the remote execution cluster.

Finally, these clusters can dynamically reconfigure their executor configuration according to the incoming load.

Kubernetes Deployments

Setup on Kubernetes? See our guide.

Kubernetes deployments are similarly easy to setup as VM deployments, but can require higher maintenance due to their lower level of action isolation, which can cause increased build latency or unexpected and difficult-to-debug build failures.

The reason for the lower level of action isolation is that remote execution shares certain traits with Kubernetes which are difficult to replicate inside of Kubernetes.

Kubernetes runs individual services inside of Docker containers to isolate services from each other while allowing multiple services to run on the same underlying machine. Remote execution runs individual actions on the same underlying machine, and attempts to isolate these actions from each other.

However, processes running in Kubernetes only have limited options for isolating their subprocesses since Kubernetes restricts access to the relevant Kernel APIs. E.g., it is not generally possible to start a docker container from inside Kubernetes, which itself uses Docker to isolate services.

At the same time, Kubernetes is designed for fewer long-lived processes rather than many short-lived ones. This makes it impractical to use dedicated Kubernetes Pods for individual actions, because actions often only take 10s to 100s of milliseconds to run.

Therefore, in a Kubernetes deployment, we run actions inside the same Docker container as the worker instance itself runs, with only limited isolation from each other and from the worker instance itself.

That said, a Kubernetes deployment is still a viable option with some restrictions:

  • Changing the executor environment requires rebuilding the docker images and restarting an existing set of workers, or starting an additional set of workers using the new images. To minimize the need for rebuilding docker images and restarting clusters, we recommend avoiding dependencies on pre-installed tools and compilers, and instead using checked-in ones (either source or binary).

  • Dynamic reconfiguration of executors requires larger nodes which increases the risk of action conflicts - we recommend using static executor configurations with a very small set of different executor types, ideally only one or two. For example, use a large set of single-core executors for all build actions, and a small set of quad-core executors for multi-threaded tests.

  • Limited action isolation facilities increase the risk of action conflicts. We recommend using smaller nodes, ideally single-core nodes, to maintain as much isolation as possible.

Summary

  • Bare-Metal Deployments

    Pros:

    • May be necessary for some operating systems
    • Predictable performance
    • Excellent action isolation
    • Dynamic reconfiguration of execution environments
    • Dynamic reconfiguration of executors

    Cons:

    • Bare-metal machine deployments have high setup and maintenance costs
  • Virtual Machine Deployments

    Pros:

    • Predictable performance
    • Excellent action isolation
    • Dynamic reconfiguration of execution environments
    • Dynamic reconfiguration of executors
    • VMs are easy to setup and maintain compared to bare-metal machines

    Cons:

    • Not possible with all operating systems
  • Kubernetes Deployments

    Pros:

    • Easy to setup

    Cons:

    • Only supports Linux & Windows
    • Limited action isolation
    • Limited dynamic reconfiguration
    • Limited control over execution environment

Glossary

Glossary

Setup

Setup on a self-managed cluster

FAQ

Frequently Asked Questions

Service Configuration

How to configure the EngFlow Remote Execution Service

Client Configuration

How to Configure Clients

Monitoring

How to monitor the health and performance of the service

2021-09-21