AWS Setup

Set up and use a Remote Execution cluster on Amazon Web Services (AWS)

This page is about VM-based clusters on AWS EC2. For Kubernetes clusters on AWS EKS (Amazon Elastic Kubernetes Service), see Kubernetes Setup.

These instructions are currently for running a Linux cluster on AWS. Need support for macOS instances? Contact us.

Summary

  1. Unpack the deployment kit
  2. Add your license
  3. Set up access credentials
  4. Create an AMI for the service
  5. Configure the service
  6. Deploy the cluster
  7. Verify the cluster
  8. Configure the client
  9. Run a build

Requirements

In addition to the baseline requirements for running the EngFlow Remote Execution Service, you will need administrative access to an AWS account to create VM images and instance templates, to start VMs, and so on.

1. Unpack the deployment kit

Unpack engflow-re-<VERSION>.zip.

It contains:

  • this documentation (./index.html)
  • the service package (./setup/engflow-re-services.deb)
  • Packer config to build a service AMI (./setup/aws/base-image.json)
  • Terraform config to deploy a cluster (./setup/aws/main.tf)
  • EngFlow config file (./setup/aws/config)

It does not contain a valid license file: ./setup/license is empty.

We recommend creating a Git repository from the unpacked files so you can save your current configuration and track changes.

2. Add your license

Replace ./setup/license with the license file we sent you.

3. Set up access credentials

Packer and Terraform need credentials with sufficient permissions.

  1. In AWS IAM, create a user with these permissions:

    • AmazonEC2FullAccess
    • Security Token Service
    • AmazonS3FullAccess

    For finer-grained permissions see the Packer documentation.
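
    If you prefer the command line, you can create the user and attach the managed policies with the AWS CLI. This is a sketch: the user name engflow-deploy is only an example, and Security Token Service access may require an additional policy depending on your setup.

    $ aws iam create-user --user-name engflow-deploy
    
    $ aws iam attach-user-policy --user-name engflow-deploy \
          --policy-arn arn:aws:iam::aws:policy/AmazonEC2FullAccess
    
    $ aws iam attach-user-policy --user-name engflow-deploy \
          --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
    
    $ aws iam create-access-key --user-name engflow-deploy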

  2. Set up the environment.

    $ aws configure
    

    Alternatively, if you have a credentials CSV file:

    $ export AWS_ACCESS_KEY_ID=$(tail -n1 credentials.csv | cut -d, -f3)
    
    $ export AWS_SECRET_ACCESS_KEY=$(tail -n1 credentials.csv | cut -d, -f4)
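
    To verify that the credentials work, you can ask AWS which identity they belong to:

    $ aws sts get-caller-identity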
    

4. Create an AMI for the service

All service instances (schedulers and workers) will be launched from this AMI.

(You can also use separate AMIs, for example if you want to install extra tools on the worker image. We will not discuss that here.)

  1. Customize ./setup/aws/base-image.json

    The default base_image_ami is Debian 10.5 (Buster) from https://wiki.debian.org/Cloud/AmazonEC2Image/Buster in the us-west-1 region. You can also use newer Debian versions; Ubuntu 18.04 or later should also work, as long as the openjdk-11-jdk-headless package is installed.

    Remember that AMIs are region-specific: make sure image_region matches the region of the base_image_ami.

    If you use the Debian AMI in a different region, make sure to pick the AMI ID listed for that region in the AMD64 AMI ID column on https://wiki.debian.org/Cloud/AmazonEC2Image/Buster.
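
    To double-check that the base AMI exists in your target region, you can describe it with the AWS CLI; the AMI ID and region below are the eu-central-1 example used in the next step:

    $ aws ec2 describe-images --image-ids ami-0e2b90ca04cae8da5 --region eu-central-1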

  2. Build the AMI.

    We recommend using Packer 1.6.0 or later.

    $ cd setup/aws/
    
    $ packer build -var base_image_ami=<BASE_AMI> -var image_region=<BASE_AMI_REGION> base-image.json
    

    Example (Debian 10.5 in the eu-central-1 region):

    $ packer build -var base_image_ami=ami-0e2b90ca04cae8da5 -var image_region=eu-central-1 base-image.json
    

    You can define the variables either in the “variables” map of the base-image.json, or on the command line (as shown above).
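
    Before building, you can sanity-check the template with packer validate, using the same variables:

    $ packer validate -var base_image_ami=<BASE_AMI> -var image_region=<BASE_AMI_REGION> base-image.json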

  3. Note the resulting AMI.

    When Packer finishes, it prints output similar to this:

    (...)
    Build 'amazon-ebs' finished.
    
    ==> Builds finished. The artifacts of successful builds are:
    --> amazon-ebs: AMIs were created:
    eu-central-1: ami-0123456789abcdef0
    

5. Configure the service

All service instances (schedulers and workers) will use the same config file.

For a first-time trial setup, we recommend using the default ./setup/aws/config.

Later (and especially before productionizing) you should customize this config.

See the Service Options Reference for more info.

You can add options for schedulers and workers in the same file. If you need separate options for either role, add them to the corresponding aws_launch_template's user_data in ./setup/aws/main.tf.

6. Deploy the cluster

For a first-time trial setup, we recommend deploying a simple cluster, e.g. with 2 schedulers and 3 workers, and with an internal-facing load balancer. The example commands below do that.

  1. Customize ./setup/aws/main.tf

    Parameters (e.g. worker and scheduler count) are defined at the top.

    This file includes definitions for a VPC, security group, instance profile, scheduler and worker templates, auto-scaling pools (configured with a static size), an Elastic IP (optional), and a Network Load Balancer (internal or internet-facing).

  2. Set up a key pair for SSH.

    You will need it to connect to schedulers and workers to look at their logs.

    Create a key pair in the EC2 web console. Remember that the key pair is region-specific, and must be in the same region as the cluster.
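
    Alternatively, you can create the key pair from the command line; the key name and region below are examples:

    $ aws ec2 create-key-pair --key-name my-key --region eu-central-1 \
          --query 'KeyMaterial' --output text > my-key.pem
    
    $ chmod 400 my-key.pem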

  3. Deploy the cluster.

    We recommend using Terraform 0.12 or later.

    Use the AMI built by Packer, and a zone in the same region as the AMI.

    $ cd setup/aws/
    
    $ terraform init
    
    $ terraform apply -var image_id=<AMI> \
                      -var availability_zone=<ZONE> \
                      -var ssh_key_name=<KEY_PAIR_NAME>
    

    Example:

    $ terraform apply -var image_id=ami-0123456789abcdef0 \
                      -var availability_zone=eu-central-1a \
                      -var ssh_key_name=my-key \
                      -var scheduler_count=2 \
                      -var worker_count=3 \
                      -var public_nlb=false
    

    If the Terraform call fails, check that you have valid credentials (AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID environment variables), or try deleting the ./setup/aws/.terraform directory and running terraform init again.
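
    To preview the resources Terraform will create before applying, you can run terraform plan with the same variables:

    $ terraform plan -var image_id=<AMI> \
                     -var availability_zone=<ZONE> \
                     -var ssh_key_name=<KEY_PAIR_NAME>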

  4. Note the Load Balancer’s IP.

    When Terraform completes, it prints something like:

    Apply complete! Resources: 20 added, 0 changed, 0 destroyed.
    
    Outputs:
    
    service_address = 10.0.1.10:443 (internal)
    

    Note: as of 2020-09-28 the Load Balancer will not answer pings, but will route gRPC traffic on port 443 to schedulers.
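
    Since the load balancer does not answer pings, a simple reachability check from a machine inside the VPC (or a peered network) is a plain TCP connection test with netcat; the IP below is the example address from the Terraform output:

    $ nc -vz 10.0.1.10 443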

  5. Check the EC2 web console.

    You should see all running instances, the configured load balancer, the load balancer’s target group, and the schedulers listed as the target group’s targets.

7. Verify the cluster

Instances discover each other by asking AWS for a list of endpoints in the security group.

As of 2020-08-19, the service does not have a status page yet. You need to SSH into an instance, either from an EC2 instance in the same VPC or from a machine in a peered network.
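
For example, from a machine in the same VPC (Debian-based AMIs typically use the admin login user; the worker IP and key file below are examples):

$ ssh -i my-key.pem admin@10.0.1.29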

Once you have connected to a worker, look at the service's output:

$ journalctl --unit worker

In the log, you should see that a cluster has formed:

Aug 12 15:33:33 ip-10-0-1-29 scheduler_service[790]: Members {size:5, ver:5} [
Aug 12 15:33:33 ip-10-0-1-29 scheduler_service[790]:         Member [10.0.1.29]:10081 - 381a734e-3213-4054-9aa5-e32e159f78e3 this
Aug 12 15:33:33 ip-10-0-1-29 scheduler_service[790]:         Member [10.0.1.117]:10081 - 5e178daf-fca1-4709-a408-0261d7c8133e
Aug 12 15:33:33 ip-10-0-1-29 scheduler_service[790]:         Member [10.0.1.200]:10081 - 9914770f-d255-424b-9438-fa3d70b2b67d
Aug 12 15:33:33 ip-10-0-1-29 scheduler_service[790]:         Member [10.0.1.173]:10081 - 8affa7e5-b4b9-4118-8e88-9591136b407a
Aug 12 15:33:33 ip-10-0-1-29 scheduler_service[790]:         Member [10.0.1.239]:10081 - 5e9211fb-438e-4987-8c72-a0d430299adf
Aug 12 15:33:33 ip-10-0-1-29 scheduler_service[790]: ]

On schedulers you should see two clusters: the same one as above, and another one containing only the schedulers.

8. Configure the client

If you ran Terraform with -var public_nlb=false, you will need an EC2 instance in the cluster’s VPC or a machine in a peered network.

(If you ran Terraform with -var public_nlb=true, then anyone can connect to the service through the public address. Consider all security risks before choosing this option.)

  1. Install Docker, git, and Bazel.

    $ sudo apt install -y git docker.io
    
    $ sudo adduser $(whoami) docker
    
    $ curl -L -o bazel https://github.com/bazelbuild/bazel/releases/download/3.3.0/bazel-3.3.0-linux-x86_64
    
    $ chmod +x bazel
    
    $ git clone https://github.com/EngFlow/example
    
  2. Add a mapping from the load balancer IP to demo.engflow.com.

    $ sudo vi /etc/hosts
    

    Add the IP printed by Terraform:

    10.0.1.10  demo.engflow.com
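
    Alternatively, you can append the entry in a single command:

    $ echo "10.0.1.10  demo.engflow.com" | sudo tee -a /etc/hosts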
    
  3. Log out and log in again so that the docker group membership takes effect.
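
    After logging back in, you can check that your user can talk to the Docker daemon:

    $ docker run --rm hello-world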

9. Run a build

Run in a terminal:

$ cd example

$ ../bazel build --config=engflow //java/...

Note: the first build can take a while because Bazel first downloads the Docker image locally, and the cluster software then downloads the Docker image on each worker. You will not see a performance improvement for builds of the example project; it is too small to benefit from the remote execution cluster.
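
The --config=engflow setting comes from the example repository's .bazelrc. As a rough sketch of what it amounts to (the exact flags are defined by the example repository and may differ), the config points Bazel's remote execution at the cluster, roughly equivalent to:

$ ../bazel build --remote_executor=grpcs://demo.engflow.com:443 //java/...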

2021-09-21