
Kubernetes Setup

Set up and use a Kubernetes-based Remote Execution cluster

This document describes how to set up and use a Kubernetes-based cluster running the EngFlow Remote Execution software.

Before deploying to an actual Kubernetes cluster, you can try the service on a single machine.

Requirements

The EngFlow Remote Execution service consists of Schedulers and Workers. These are containerized processes; each instance runs in its own Kubernetes Pod. From here on, “instance” and “Pod” are used interchangeably.

Minimum hardware requirements for each Pod:

  • Worker: 10 GB SSD, 1 vCPU core, 1 GB RAM plus 1 GB RAM per vCPU core used for execution

  • Scheduler: 4 vCPU cores, 16 GB RAM
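
For example, a Worker Pod that uses 4 vCPU cores for execution needs 1 GB + 4 * 1 GB = 5 GB of RAM.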

Kubernetes cluster requirements:

  • Node Pools: since Scheduler and Worker Pods have different hardware requirements, they must be deployed to different kinds of Kubernetes Nodes, so you will need two Node Pools in your cluster.

  • Node Labels: to let Kubernetes deploy Pods to the right kinds of machines, Nodes in the worker pool and in the scheduler pool must have different Node Labels, e.g. engflow_nodetype: worker-node and engflow_nodetype: scheduler-node (see the labeling example after this list).

  • Hardware: if you configure Kubernetes to deploy more Pods than there are Nodes in a pool, size these Nodes to whole multiples of the above requirements so that Pods are not starved of resources.
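
As mentioned above, labeling the Nodes in each pool might look like this (the Node names here are illustrative):

kubectl label nodes worker-node-1 engflow_nodetype=worker-node
kubectl label nodes scheduler-node-1 engflow_nodetype=scheduler-node

The generated Pod specs can then target the right pool through a matching nodeSelector (e.g. engflow_nodetype: worker-node).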

Minimum software requirements:

  • Docker images:

    • Base OS: Ubuntu 18.04 or Debian 10

    • Java: openjdk-11-jdk-headless or openjdk-8-jdk-headless

  • Client: Bazel 1.0.0

Security requirements:

  • Ports on worker nodes must not be reachable from the public internet. Either the cluster must run on a private network, or an appropriately configured firewall must be used.

  • Only the public port on scheduler nodes may be reachable from the public internet. If a scheduler’s public port is reachable, then both server- and client-authentication must be configured such that it is impossible for an external party to impersonate either server (man-in-the-middle attacks) or client (no access to unauthenticated clients).

  • You must set up server authentication using TLS. The included demo certificates should be considered public information, and must not be used for publicly accessible schedulers.

  • You must either set up client authentication or configure a firewall or VPN such that only trusted clients can connect to the cluster.

  • In a Kubernetes-based cluster, actions are only partially sandboxed, and can access and modify critical local resources as well as access any secrets stored on worker nodes. Such clusters must only run trusted code.

Starting a Remote Execution Cluster

First, generate the configuration files, then use them to build container images for the EngFlow services, and finally deploy those images to your Kubernetes cluster.

1. Generate the configuration files

Run our provided config generation script and answer the questions:

python3 ./setup/k8s/gen-k8s-config.py

If you need help with any of the script’s questions, see Generate Config Files.

2. Review and customize the generated files

  • Adding packages to the base image: in the Dockerfiles, you can change the base OS layer, add more layers, or install more tools in the worker images. (But be careful not to remove or replace software required by the Remote Execution service stack.) If you plan to use TLS with your own certificates, you should copy them into the Scheduler’s Docker image.
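
    As a sketch, a generated worker Dockerfile might be extended like this (the packages are illustrative; keep the layers the service stack needs):

      RUN apt-get update && apt-get install -y --no-install-recommends git python3

    If you use your own TLS certificates, a COPY instruction in the Scheduler's Dockerfile can place them into the image (the destination path is an example):

      COPY my-server.crt my-server.key /etc/engflow/tls/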

  • Adjusting the pod configuration: in the yaml files, you can adjust Pod instance counts, Docker image names, image pull policy, etc. (see Kubernetes objects).
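
    As a sketch, such an adjustment in a generated yaml file might look like this excerpt (the values are examples; the structure of your generated files may differ):

      spec:
        replicas: 4
        template:
          spec:
            containers:
              - name: worker
                image: my-registry/engflow-worker:latest
                imagePullPolicy: IfNotPresent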

  • Disk size and layout: The total disk space is subdivided into the core OS, the per-action execution directories (consisting of input and output trees), and the CAS (consisting of replicas and a local cache).

    For example, a dual-core worker node with 10 GB disk might be subdivided into 2 GB for the OS, two executors with 1 GB each, leaving 6 GB for the CAS, of which we allocate up to 3 GB for replicas.

    OS      Execution Directories    CAS
    2 GB    2 workers * 1 GB         6 GB (3 GB replicas + 3 GB cache)

    Disk size: 2 GB + 2 * 1 GB + 6 GB = 10 GB

    For simplicity, we recommend controlling the CAS size indirectly by setting --disk_size to the total disk size. The worker then computes the CAS size as follows:

    • It reserves 20% of the disk size for the operating system.
    • It reserves --max_output_size times the number of local executors (see --worker_config) for the execution directories.
    • The remainder is allocated to the CAS, half of which is storage space for replicas.

    Note that the input trees are typically hard-linked into the CAS, so it is not necessary to account for the maximum input tree size here. It is an error if the number of executors times the maximum input tree size is larger than the CAS cache.
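
    As a sketch, the shell arithmetic below reproduces this computation for the 10 GB example above (the worker performs the real calculation internally; this is only an illustration):

      disk_size=10; max_output_size=1; executors=2
      os_reserve=$((disk_size / 5))                  # 20% reserved for the OS
      exec_dirs=$((executors * max_output_size))     # execution directories
      cas=$((disk_size - os_reserve - exec_dirs))    # remainder goes to the CAS
      echo "CAS: ${cas} GB, replicas: $((cas / 2)) GB"   # prints: CAS: 6 GB, replicas: 3 GB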

    You should observe the following rules when adding additional packages to the base image, increasing the number of executors, or increasing the maximum input and output tree sizes:

    • When adding packages to the base image: the OS size should stay below 10% of the disk size. If it is larger, then you either need to remove packages, or increase the disk size.

    • When increasing the number of executors: with the default maximum input and output tree sizes of 1 GB each, you will need approximately 4-5 GB of disk per executor.

    • When increasing the maximum input or output tree sizes: your disk should be at least 3 times the sum of the maximum input and output tree sizes per executor.
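
    For example, with a 2 GB maximum input tree and a 2 GB maximum output tree per executor, plan for at least 3 * (2 GB + 2 GB) = 12 GB of disk per executor.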

  • TLS configuration: in the config file, you can adjust the EngFlow service options (see Command line options). If you plan to use TLS with your own certificates, edit --tls_certificate and --tls_key here; if you do not plan to use TLS at all, you must add --insecure=true here.
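
    For example, the relevant lines in the config file might read (the certificate paths are illustrative):

      --tls_certificate=/etc/engflow/tls/my-server.crt
      --tls_key=/etc/engflow/tls/my-server.key

    or, to run without TLS:

      --insecure=true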

3. Build the Docker images and deploy the service to Kubernetes

Run the docker and kubectl commands printed by gen-k8s-config.py.

These commands build the Docker images, create the Kubernetes objects, and deploy the service to Kubernetes. With OpenShift, you can use oc instead of kubectl.
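
The printed commands are authoritative; as a rough sketch, they follow a pattern like this (the image and file names here are placeholders):

docker build -t my-registry/engflow-worker:latest -f worker/Dockerfile .
docker push my-registry/engflow-worker:latest
kubectl apply -f scheduler.yaml -f worker.yaml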

Now you should have a running Kubernetes cluster with Scheduler and Worker Pods, and a Load Balancer Service called scheduler-grpc-service in front of the Schedulers.
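
A quick way to check that the Pods came up (assuming the default namespace):

kubectl get pods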

4. Determine the service address

Find the public IP address and port of the Load Balancer scheduler-grpc-service.

You can do this from the web-based Kubernetes management console.
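
Alternatively, kubectl can print the address directly; the EXTERNAL-IP and PORT(S) columns show what you need:

kubectl get service scheduler-grpc-service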

5. Optional: Set up DNS

If you do not plan to use TLS, you can skip this step.

If you plan to use TLS with the certificate we supplied (engflow-ca.crt), then add a DNS entry for demo.engflow.com with the Load Balancer’s IP address. Alternatively, for testing you can temporarily add an entry to /etc/hosts.
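
For example, if the Load Balancer's IP address were 203.0.113.10 (a placeholder), the /etc/hosts line would be:

203.0.113.10 demo.engflow.com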


Related pages:

  • Generate Config Files: What information does the config generator need?

  • Kubernetes objects: What are the relevant Kubernetes objects?

  • Single-machine setup: Testing EngFlow Remote Execution on a single machine
