Cluster Startup

Cluster Startup: How the instances discover each other

This document describes how to configure the EngFlow Remote Execution cluster in order for the scheduler and worker instances to discover and connect to each other.

Specifying IP and Port

Most importantly, you need to configure the internal and external network connections for each instance. The internal network connection is used for cluster-internal traffic, which is security-sensitive and must be private. The external network connection is used to listen for client requests.

Typically, machines will have multiple network adapters, and need to select the correct network on which to listen for internal or external connections.

The internal network is selected with the --private_ip_selector flag. You should set this to the same internal CIDR range that is also used by the underlying deployment infrastructure. You can additionally configure the internal port with the --private_port flag, although this should typically not be necessary. Note that the instances use additional ports for internal traffic which are derived from the one configured here.

In addition to the primary private port configured with --private_port , scheduler instances communicate over --private_port + 1000, and all instances communicate over --private_port + 2000. Also see the Network Traffic page for more details.

For example, on GCP, private IPs are typically chosen from the 10.0.0.0/8 private network block, whereas a local Docker might use the 172.16.0.0/12 block.

By default, the schedulers listen on all available networks for incoming external client requests in addition to listening on the private IP for internal traffic. You can alternatively configure the schedulers to listen only on the private IP for external traffic by setting --public_bind_to_any=false . The public port is specified with the --public_port flag.

Discovery Mechanism

In order for scheduler and worker instances to automatically detect each other, you have to configure an appropriate discovery mechanism using the --discovery flag. We recommend using the available mechanisms from the underlying deployment platform, e.g., using GCP APIs on GCP, AWS APIs on AWS, and Kubernetes APIs on Kubernetes. For each of these, you may also need to configure appropriate end points and scopes, such as the GCP zone or AWS region. See the ‘service discovery’ section in the options reference for a complete list.

Alternatively, the service also supports multicast and static discovery.

In multicast discovery, the instances use the multicast protocol to discover each other, which requires that they are on the same local network. This mechanism works without prior knowledge of the instance IPs.

In static discovery, you have to provide a static list of instances (at least one), which all other instances connect to on startup. This requires prior knowledge of the respective IP addresses and port numbers.

Verification

Every time a machine joins or leaves the cluster, schedulers and workers log the members they know about.

The log looks like:

Members {size:3, ver:3} [
        Member [172.17.0.3]:10081 - 5515cf45-9244-44df-b086-bc8eca4c83ae this
        Member [172.17.0.4]:10081 - bafe827d-b4d9-46ff-bac5-0528de71964e
        Member [172.17.0.5]:10081 - 8ce4c57c-e860-4257-b98b-0a8330ad7606
]

There are two logical clusters: one for only schedulers, and one for all instances. So schedulers print two Members groups, and workers print just one.

If you connect (e.g. via SSH) to any of the machines then you can observe these logs, for example:

$ sudo journalctl -n10000 --unit=worker | grep "^ *Member"
2021-09-21