GCP Setup

Set up and use a Remote Execution cluster on Google Cloud Platform (GCP)

This document describes how to set up EngFlow Remote Execution in a VM-based deployment on Google Compute Engine (GCE), which is part of Google Cloud Platform (GCP). If you want to use Google Kubernetes Engine (GKE), see the Kubernetes Setup instead.

Requirements

In addition to the baseline requirements for running the EngFlow Remote Execution Service, you will need administrative access to a GCE project to create VM images and instance templates, to start VMs, and so on.

Automated Setup using Packer and Terraform

We provide a Packer config to create the base image, and a minimal Terraform config to start the cluster. The Terraform config includes a service account, an optional GCS bucket, scheduler and worker pools, an optional auto-scaler, and a TCP load balancer.

Both may need to be adjusted for your desired build environment and production deployment. We recommend first starting a basic cluster and adjusting the configuration only after verifying its operation.

1. Set Up Google Application Credentials

We recommend creating an additional service account to handle base image creation and cluster setup. This service account requires roles to create VM images and configure and start VMs.

You need to provide credentials for this service account to the packer and terraform tools to set up the cluster. Download the credentials as a .json file and set the GOOGLE_APPLICATION_CREDENTIALS environment variable to its path:

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
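
If you prefer the gcloud CLI to the Cloud Console, the service account and its key can be created roughly as follows. This is a minimal sketch under assumptions: the account name engflow-deployer and the two roles are illustrative choices, not requirements of the EngFlow package, and your Terraform config may need additional roles for the resources it creates.

```shell
# Hypothetical helper: create a deployment service account and download a key.
# Usage: create_deployer_account <project-id> <key-file>
create_deployer_account() {
  project="$1"
  key_file="$2"
  # "engflow-deployer" is an illustrative name, not one EngFlow requires.
  account="engflow-deployer@${project}.iam.gserviceaccount.com"

  gcloud iam service-accounts create engflow-deployer \
    --project="$project" \
    --display-name="EngFlow image and cluster setup" || return 1

  # Assumed roles; adjust them to what your Packer and Terraform configs create.
  for role in roles/compute.instanceAdmin.v1 roles/iam.serviceAccountUser; do
    gcloud projects add-iam-policy-binding "$project" \
      --member="serviceAccount:${account}" \
      --role="$role" || return 1
  done

  # Write the JSON key that GOOGLE_APPLICATION_CREDENTIALS will point to.
  gcloud iam service-accounts keys create "$key_file" \
    --iam-account="$account" || return 1
}
```

After sourcing the function, a call such as create_deployer_account my-project ./credentials.json (both arguments are placeholders) writes the key file to export.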

2. Create a Base Image

The included Terraform template uses the same base image for both scheduler and worker instances. If you want to use separate images (for example when you want to install extra tools on workers), then you need to adjust the Terraform config.

To create the base image with Packer, you need to install the packer command line tool (installation instructions).

Before generating the image, inspect the Packer configuration in setup/gcp/base-image.json and modify it to fit your requirements. However, we recommend first starting a minimal cluster and verifying its operation.

For the base OS, we recommend Debian 10. You can also use other versions or distributions (e.g., Ubuntu 18.04), as long as the openjdk-11-jdk-headless package is installed.

To generate the image, change into the setup/gcp directory and run:

packer build base-image.json

The resulting image will be called engflow-re-image-<year>-<month>-<day>-<hour>-<minute>.
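
Because the image name embeds a timestamp, it can be convenient to look up the most recent build instead of copying the name by hand. The helper below is a sketch, not part of the EngFlow package; it assumes the gcloud CLI is installed and authenticated for the project that Packer built the image in.

```shell
# Hypothetical helper: print the name of the newest engflow-re-image-* image.
# Usage: latest_engflow_image <project-id>
latest_engflow_image() {
  # Match on the name prefix from the Packer config, newest image first.
  gcloud compute images list \
    --project="$1" \
    --filter="name~^engflow-re-image-" \
    --sort-by="~creationTimestamp" \
    --limit=1 \
    --format="value(name)"
}
```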

3. Start the Cluster

We provide a Terraform config file in setup/gcp/main.tf, which includes scheduler and worker templates, instance group managers (configured with a fixed size), and an internal TCP load balancer. It also embeds the license file as well as the service config (setup/gcp/config).

At a minimum, you need to configure the following options before starting a cluster:

  • project_name - the GCP project name the cluster should run under
  • availability_zone - the target zone where the cluster should run

You should edit the setup/gcp/main.tf file and set these to the desired values. Additionally, you can configure the number of schedulers and workers in the cluster as well as configure Terraform remote state.

Start the cluster using:

terraform init
terraform apply

Once the cluster is running, Terraform prints the IP address of the load balancer, which is the endpoint that Bazel connects to. Note that the default configuration only allows connections from other machines in the same GCE network.

Note that the TCP load balancer balances connections, not requests: all requests that a client sends over a single connection go to the same backend. GCP’s HTTP/2 load balancers, on the other hand, do not support TLS client authentication (mTLS).
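
To point Bazel at the load balancer, pass its address with the --remote_executor flag. The helper below is a sketch that only formats the matching .bazelrc line; the port and the URL scheme are assumptions that depend on your service config (use grpcs if TLS is enabled), so treat the values as placeholders.

```shell
# Hypothetical helper: print a .bazelrc line for the remote execution endpoint.
# Usage: bazelrc_remote_line <load-balancer-ip> <port> [scheme]
bazelrc_remote_line() {
  # The scheme defaults to plaintext gRPC; pass "grpcs" for a TLS endpoint.
  printf 'build --remote_executor=%s://%s:%s\n' "${3:-grpc}" "$1" "$2"
}
```

For example, bazelrc_remote_line <ip> <port> >> .bazelrc appends the line to your project's .bazelrc. If the address has scrolled out of view, terraform output prints the configured outputs again.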

Manual Setup

This section outlines the manual process for setting up a cluster on GCE.

1. Create a Service Account

You have to create a new service account for the Remote Execution cluster. This service account is required on the scheduler and worker instances to perform discovery and to log monitoring data to Google Cloud Operations (formerly StackDriver). It must have at least the following role:

  • Compute Viewer aka roles/compute.viewer

    Used to auto-detect live scheduler and worker instances.

If you enable Google Cloud Operations (formerly StackDriver) monitoring with --enable_stackdriver, the service account requires these additional roles:

  • Monitoring Metric Writer aka roles/monitoring.metricWriter

    Necessary to write metrics to GCO.

  • Cloud Trace Agent aka roles/cloudtrace.agent

    Necessary to write performance traces to GCO; only needed if you set --monitoring_trace_probability to a non-zero value.

If you configure Google Cloud Storage as a backup CAS/Action Cache, then the service account requires these additional roles:

  • Storage Object Admin aka roles/storage.objectAdmin

    Necessary to read, write, and delete objects to and from GCS.

If you use Docker images stored in Google Container Registry (GCR), then the service account may require these additional roles:

  • Storage Object Viewer aka roles/storage.objectViewer

    Necessary to read Docker images from GCR. Note that this is a subset of Storage Object Admin (needed for GCS), so you do not need both.
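
The role assignments above can also be scripted with the gcloud CLI. The following is a minimal sketch, assuming an illustrative account name (engflow-re); pass only the roles that match the features you enable.

```shell
# Hypothetical helper: create the cluster service account and grant its roles.
# Usage: create_cluster_account <project-id> <role>...
# Example (monitoring with tracing, no GCS or GCR):
#   create_cluster_account my-project roles/compute.viewer \
#     roles/monitoring.metricWriter roles/cloudtrace.agent
create_cluster_account() {
  project="$1"
  shift
  # "engflow-re" is an illustrative account name.
  account="engflow-re@${project}.iam.gserviceaccount.com"

  gcloud iam service-accounts create engflow-re \
    --project="$project" \
    --display-name="EngFlow Remote Execution cluster" || return 1

  # Grant each role passed on the command line at the project level.
  for role in "$@"; do
    gcloud projects add-iam-policy-binding "$project" \
      --member="serviceAccount:${account}" \
      --role="$role" || return 1
  done
}
```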

2. Create a Base Image

You can use the same base image for both scheduler and worker instances, or you can create separate images (for example when you want to install extra tools on workers).

  1. Start a clean VM

    We support the following distributions for the base image:

    • Debian 10 (Buster)
    • Ubuntu 18.04 (Bionic Beaver)

  2. SSH into the VM

  3. Update the distribution using sudo apt update && sudo apt upgrade

  4. Install the engflow-re-services.deb package using sudo apt install ./engflow-re-services.deb

  5. Install the docker.io package using sudo apt install docker.io

  6. Move your license file to /etc/engflow/license using sudo mv license /etc/engflow/license

  7. Move your configuration file to /etc/engflow/config using sudo mv config /etc/engflow/config

  8. You can customize the base image at this point if you need additional software installed. However, we recommend using Docker images for customization rather than running actions directly on the underlying VM.

  9. Pull the Docker image you plan to use, e.g., docker pull gcr.io/cloud-marketplace/google/rbe-ubuntu16-04.

    Note: the RBE Docker images require authenticating with gcloud first: gcloud auth configure-docker

  10. Stop the VM

  11. Create an image snapshot of the VM
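
The command-line parts of this procedure can be collected into two small helpers. This is a sketch under assumptions: the -y flags make apt non-interactive, the three input files are in the current directory on the VM, and the VM's boot disk has the same name as the VM (the GCE default). Function and image names are illustrative.

```shell
# Steps 3-7: run on the VM, from the directory that holds
# engflow-re-services.deb, license, and config.
provision_base_vm() {
  sudo apt update &&
  sudo apt upgrade -y &&
  sudo apt install -y ./engflow-re-services.deb &&
  sudo apt install -y docker.io &&
  sudo mv license /etc/engflow/license &&
  sudo mv config /etc/engflow/config
}

# Steps 10-11: run from your workstation.
# Assumes the boot disk is named after the VM.
# Usage: snapshot_base_vm <vm-name> <zone> <image-name>
snapshot_base_vm() {
  gcloud compute instances stop "$1" --zone="$2" &&
  gcloud compute images create "$3" --source-disk="$1" --source-disk-zone="$2"
}
```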

3. Create Instance Templates

You need to create at least two templates: one for the worker instances and one for the scheduler instances. Depending on your intended cluster layout, you may need multiple worker templates. The steps to create instance templates are very similar in all cases.

  1. Create a new Instance Template

  2. Give it a descriptive name, e.g., worker-template-1

  3. Select a VM configuration, following the baseline requirements:

    • Scheduler: Quad-core, 4 GB RAM
    • Worker: Single-core, 1 GB RAM

  4. Select the VM image created previously; set disk size following the baseline requirements

  5. Select the service account created previously

  6. Management -> Labels: add a label with key engflow_re_cluster_name and value default

  7. Management -> Startup Script:

    • Worker:
      #!/bin/bash
      systemctl start worker
    • Scheduler:
      #!/bin/bash
      systemctl start scheduler

Do not enable HTTP or HTTPS in the firewall configuration unless you want to expose the cluster to the public internet.
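
The same template can be created with the gcloud CLI instead of the Cloud Console. The helper below is a minimal sketch: the function name, the <role>-template-1 naming scheme, and the cloud-platform access scope are assumptions, and the machine type and disk size are left as parameters so you can pick values that meet the baseline requirements.

```shell
# Hypothetical helper: create a scheduler or worker instance template.
# Usage: create_engflow_template <role> <machine-type> <image> <service-account-email> <disk-size>
#   <role> is "worker" or "scheduler" and selects the systemd unit to start.
create_engflow_template() {
  role="$1"
  # Write the two-line startup script to a temporary file.
  script="$(mktemp)" || return 1
  printf '#!/bin/bash\nsystemctl start %s\n' "$role" > "$script"

  # --scopes=cloud-platform is an assumption; IAM roles still limit access.
  gcloud compute instance-templates create "${role}-template-1" \
    --machine-type="$2" \
    --image="$3" \
    --boot-disk-size="$5" \
    --service-account="$4" \
    --scopes=cloud-platform \
    --labels=engflow_re_cluster_name=default \
    --metadata-from-file=startup-script="$script"
  status=$?
  rm -f "$script"
  return "$status"
}
```

For a scheduler, a machine type such as e2-standard-4 (4 vCPUs, 16 GB RAM) exceeds the listed configuration; it is only an example, not a recommendation from the baseline requirements.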

4. Start the Cluster

You need to start both schedulers and workers using the previously created templates; you can start them in any order.

  1. Create a new Instance Group from one of the templates

  2. Configure auto-scaling or set a fixed number of instances

  3. Configure the health check: TCP on the internal port set with --private_port
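
With the gcloud CLI, a fixed-size managed instance group with a TCP health check can be created as sketched below. The function name, the "-health" suffix, and the 300-second initial delay are assumptions; the port argument must match the value you set for --private_port in the service config.

```shell
# Hypothetical helper: create a health check and a fixed-size managed instance group.
# Usage: create_engflow_group <group-name> <template> <size> <zone> <private-port>
create_engflow_group() {
  # TCP health check against the internal port of the service.
  gcloud compute health-checks create tcp "$1-health" --port="$5" &&
  # The initial delay (in seconds) gives the service time to start; tune as needed.
  gcloud compute instance-groups managed create "$1" \
    --template="$2" \
    --size="$3" \
    --zone="$4" \
    --health-check="$1-health" \
    --initial-delay=300
}
```

Run it once per template, for example once for the scheduler template and once for each worker template. If you want auto-scaling instead of a fixed size, gcloud compute instance-groups managed set-autoscaling can be applied to the group afterwards.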

5. Create a TCP Load Balancer

  1. Create a new TCP Load Balancer

  2. Backend: select the scheduler instance group

  3. Frontend: TCP, port 443
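
The steps above can be approximated with the gcloud CLI as follows. This sketch assumes an internal TCP load balancer, matching the one in the Terraform config; the resource names, the regional health check, and the health-check port are illustrative and need to match your scheduler setup.

```shell
# Hypothetical helper: create an internal TCP load balancer in front of the schedulers.
# Usage: create_engflow_lb <name> <region> <scheduler-group> <group-zone> <network> <subnet> <health-port>
create_engflow_lb() {
  # Regional health check used by the backend service.
  gcloud compute health-checks create tcp "$1-hc" --region="$2" --port="$7" &&
  # Backend: the scheduler instance group.
  gcloud compute backend-services create "$1-backend" \
    --load-balancing-scheme=internal \
    --protocol=tcp \
    --region="$2" \
    --health-checks="$1-hc" \
    --health-checks-region="$2" &&
  gcloud compute backend-services add-backend "$1-backend" \
    --region="$2" \
    --instance-group="$3" \
    --instance-group-zone="$4" &&
  # Frontend: TCP, port 443.
  gcloud compute forwarding-rules create "$1-frontend" \
    --region="$2" \
    --load-balancing-scheme=internal \
    --network="$5" \
    --subnet="$6" \
    --ip-protocol=TCP \
    --ports=443 \
    --backend-service="$1-backend" \
    --backend-service-region="$2"
}
```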

2021-09-21