Service Options Reference

Description of all command-line options

Options common to all instances

--grpc_keep_alive_time=300s (duration)

The keep-alive time for gRPC connections.

--log_file= (string)

The location of a log file to write a full copy of the data to. The amount of data is unbounded, so avoid using this in production.

--log_level=INFO (string)

The verbosity level of local logging. Valid values are OFF, SEVERE, WARNING, INFO, and CONFIG.

--log_to_stderr=true (boolean)

Output logs to standard error. Note that systemd expects logs to go to stderr in order to manage them.

--private_bind_to_any=false (boolean; previous name: --bind_to_any)

Whether to configure the internal communication to listen on all local IPs. If your cluster is not connected to the public internet, and the --private_ip_selector mechanism does not work, then this flag might be usable as a workaround. DO NOT enable this for machines that are connected to the public internet.

--private_ip=null (string)

IP address to advertise for cluster-internal gRPC calls. This option is only needed when nodes run in an isolated network (e.g. with docker containers using network mode bridge) and can't be reached from other nodes using the discoverable IP address of the current node. Should only be used with --discovery=static. Please also use --private_bind_to_any when using this option.

--private_ip_selector=192.168.0.0/16 (string; previous name: --local_ip_selector)

A CIDR mask that is used to select a local IPv4 address for each instance. This should match whatever address range the underlying platform uses to generate local IPs. If this does not match any local IP, then the instance will attempt to do a reverse lookup on its own hostname. The instance fails if the hostname resolves to a loopback address. In order to set a fully-specified address, use a /32 selector. The resulting IP is only used for cluster-internal communication. DO NOT use a public address range here.
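The selector matching described above can be sketched in Python with the standard ipaddress module. This is an illustration only, not the service's implementation: the function name select_private_ip is hypothetical, picking the first matching address is an assumption, and the hostname reverse-lookup fallback is left out (the sketch returns None at that point).

```python
import ipaddress


def select_private_ip(local_ips, selector="192.168.0.0/16"):
    """Return the first local IPv4 address inside the selector range, or None."""
    network = ipaddress.ip_network(selector, strict=False)
    for ip in local_ips:
        addr = ipaddress.ip_address(ip)
        # Only IPv4 addresses are considered, matching the description above.
        if addr.version == 4 and addr in network:
            return ip
    # The real service would now try a reverse lookup on its hostname.
    return None
```

For example, select_private_ip(["10.0.0.5", "192.168.1.7"]) returns "192.168.1.7", and a /32 selector such as "10.0.0.5/32" matches only that single address.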

--private_port=9321 (integer; previous name: --internal_port)

Port to use for cluster-internal communication. You need to configure your network to allow traffic on this port between machines in the same cluster. In addition, you need to allow traffic on port + 1000 (schedulers only) and port + 2000 (all instances); also see --incompatible_use_low_offsets.
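As a quick aid for writing firewall rules, the port arithmetic above can be sketched as follows. The function name and the dictionary keys are hypothetical; the offsets are the ones stated for --private_port and --incompatible_use_low_offsets.

```python
def derived_ports(private_port, is_scheduler, use_low_offsets=False):
    """Return the cluster-internal ports an instance needs to have reachable."""
    # Default: port + 2000 (all instances) and port + 1000 (schedulers only).
    # With --incompatible_use_low_offsets: port + 1 and port + 2 respectively.
    all_offset, scheduler_offset = (1, 2) if use_low_offsets else (2000, 1000)
    ports = {
        "private": private_port,
        "all_instances": private_port + all_offset,
    }
    if is_scheduler:
        ports["schedulers_only"] = private_port + scheduler_offset
    return ports
```

With the default port 9321, a scheduler needs 9321, 10321, and 11321 open between cluster machines, while a worker needs only 9321 and 11321.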

Options to configure scheduler instances

--action_cache_size=2gb (capacity)

The maximum amount of memory that can be used for action cache entries by each scheduler. The resolution is 1 megabyte; the value is rounded down to the nearest whole megabyte as needed. (Example: 1600kb is rounded down to 1mb.) Setting a value smaller than 1mb disables the cache.
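The rounding rule can be illustrated with a small sketch. It assumes that 1mb equals 1024kb, which the text does not state; the function name is hypothetical.

```python
def effective_action_cache_mb(size_kb):
    """Round a size given in kilobytes down to whole megabytes.

    A result of 0 corresponds to a disabled cache.
    Assumption: 1 megabyte = 1024 kilobytes.
    """
    return max(size_kb, 0) // 1024
```

Under this assumption, 1600kb yields 1 (the example from the text) and 1023kb yields 0, which means the cache is disabled.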

--basic_auth_htpasswd=/etc/engflow/htpasswd (string)

Path to a htpasswd file containing user names and APR1-encrypted passwords. In a cluster with multiple scheduler instances, all of them must use the same password file. The server automatically reloads the file on changes (based on the last-modified time).

--client_auth=none (one of: {deny, none, basic, mtls, gcp_email, gcp_rbe})

The mechanism for determining gRPC authentication and permissions. Depending on the value, you also need to pass options to configure the authentication mechanism and permissions granted to each client.

For none, use --principal_based_permissions=*->role to set permissions.

For basic, use --basic_auth_htpasswd to set the path to the password file with Apache MD5-encoded passwords, and --principal_based_permissions to control per-user permissions. See the Authentication section for examples.

For mtls, see the --tls_trusted_certificate flag for more details and use --principal_based_permissions to set per-user permissions.

For gcp_email, use --principal_based_permissions to set per-user permissions.

For gcp_rbe, see the --gcp_rbe_auth_project flag for more details.

--enable_status_page=false (boolean; previous name: --experimental_status_page)

Whether to enable a scheduler status page. If this is true, the scheduler also starts an HTTP server on the same port as the gRPC server, which allows connecting with a standard browser. The HTTP server uses the authentication mechanism configured using --http_auth.

--experimental_cas_check_storage_only=false (boolean)

When enabled, we do CAS checks only in external storage. When disabled, we first check the distributed CAS then the storage.

--experimental_force_mnemonic_pool_name=[] (list of strings)

A list of mnemonic=pool-name pairs which are used to override pool names provided by the client. Use this to route actions to specific pools based on mnemonics. See Executor pools.

Note that this feature requires a client that provides action mnemonics; as of 2021-03-05, there is no released Bazel binary that supports this.

--experimental_profile_dir= (string)

Ignored if --experimental_profile_to_event_store=true and an event store is configured. If set to a non-empty path, the scheduler collects server-side profiling information, aggregates the data by build id, and writes individual files to this directory. This can help debug performance issues in a cluster. Note that the schedulers do not currently manage the used disk space, and may start failing if they run out of disk. As of 2021-03-23, it is not recommended to enable this in a production cluster.

--experimental_profile_to_event_store=false (boolean)

If set to true, the scheduler collects server-side profiling information, aggregates the data by build id, and writes it to the event store. This can help debug performance issues in a cluster. This new implementation supersedes the previous one that is available with --experimental_profile_dir; if both are enabled, then this one takes precedence but only if the event store is enabled (with --enable_bes=true; also see --experimental_event_storage).

--experimental_strict_transport_security=0d (duration)

Ignored unless --experimental_strict_http_headers is true. If set to a non-zero duration, sets the Strict-Transport-Security header with the given duration as the max-age on all HTTP responses. When the UI is accessed over an HTTPS connection and this header is returned, all future accesses to the same domain are forced to use HTTPS for at least the given duration. DO NOT SET THIS unless you are certain that you do not want to access this domain using HTTP for the foreseeable future.

--extend_replicas_on_cache_hit=true (boolean)

Whether to extend replica timeouts when there is a cache hit in the action cache. If this is true, then the action cache service only returns an action cache entry if the replica timeouts for all output files could be successfully extended. Otherwise it does not attempt to extend the timeouts. Setting this to false can improve performance at the increased risk of returning errors later when the client attempts to fetch the corresponding files from the CAS. We strongly recommend leaving this enabled when using build-without-the-bytes.

--force_pool_name=null (string)

If set to a non-empty value, the scheduler ignores the pool name provided in the action and uses this one instead to schedule the action. See Executor pools.

--gcp_rbe_auth_project=null (string; previous name: --experimental_google_auth_project)

Sets the GCP project to use when looking up permissions for OAuth 2.0-authenticated clients if --client_auth=gcp_rbe. The actual permissions are configured through GCP IAM by assigning the existing Google Cloud 'Remote Build Execution' roles to specific users or service accounts.

If you are using Bazel, you can authenticate as follows: for the first-time login, run gcloud auth application-default login. Afterwards, you can run Bazel with the --google_default_credentials flag. Alternatively, you can download a JSON file with access keys and use Bazel's --google_credentials option to specify the path to that file.

Note that EngFlow does not control the existence or availability of these GCP roles and cannot guarantee that this option continues to work. Furthermore, we cannot report usage of these permissions to GCP, so they may show up as 'over-granted' in the IAM permissions console.

Use with caution.

--grpc_max_calls_per_connection=400 (integer)

Sets the maximum number of concurrent calls per incoming gRPC connection.

--http_auth=deny (one of: {deny, none, basic, jwt, google_login})

The mechanism for determining HTTP2 authentication and permissions. Depending on the value, you also need to pass options to configure the authentication mechanism and permissions granted to each client.

Note that the /healthz page never requires authentication.

For none, use --principal_based_permissions=*->role to set permissions.

For basic, use --basic_auth_htpasswd to set the path to the password file with Apache MD5-encoded passwords, and --principal_based_permissions to control per-user permissions. See the Authentication section for examples.

For jwt, use --tls_certificate and --tls_key to set the signing certs.

For google_login the --experimental_google_client_id flag must also be set.

--insecure=false (boolean)

Whether to use unencrypted connections. We strongly recommend providing a TLS certificate and key (self-signed if necessary) and not setting this flag. It can be used temporarily for testing on a closed network. If this is set, then the settings for --tls_certificate and --tls_key are ignored.

--local_cas_existence_cache_expiry=120s (duration)

Specifies the maximum retention time of replicas in the CAS existence cache of this scheduler.

This flag is closely related to --cas_existence_cache_expiry but they operate on different levels: this flag controls lookups in the distributed CAS, not in the backup storage.

Setting this to 0 disables the cache. When the cache is enabled, the scheduler saves which blobs it has seen recently, and won't look them up in the CAS if the client asks for them again. This optimization can save about 1 second / 1000 input files per action. Choose a cache expiry value that is much shorter than --default_replica_timeout, and is at most the length of the average build; beyond that the risk is higher that the replica no longer really exists and builds may sporadically fail.

--max_batch_size=1kb (capacity)

The maximum batch size that clients are allowed to send to the CAS server batchUpdateBlobs call. This is a form of write-combining that might result in improved performance under the right network conditions. Only set this if you have benchmark results indicating that it is a net win. Note that some clients may not combine writes regardless of this server-side setting. If this is larger than the max gRPC message size, it is silently reduced to that value.

--max_queue_time=1h (duration)

The maximum amount of time an action is allowed to queue before it is aborted.

--max_queue_time_in_empty_pool=5m (duration)

The maximum amount of time an action is allowed to queue before it is aborted if it is assigned to a pool which has never had any executors. This is intentionally shorter than the maximum queue time to detect cases where the client is accidentally misconfigured.

If the worker pool is configured with auto-scaling, and it can scale down to zero workers, then this should be at least as long as the auto-scaling delay plus the time to boot a worker instance. Otherwise a recently restarted scheduler may time out actions prematurely.

--max_replicate_concurrency=0 (integer)

The maximum number of concurrent replicate calls from a scheduler. A negative or zero value indicates no limit. This may be useful to limit the CAS read/write load.

--metadata_replica_count=3 (integer)

The number of replicas to use for scheduler metadata such as the action cache. Must be at least one. Setting this to one can cause metadata loss when a scheduler is restarted, resulting in reduced build performance and build errors.

--principal_based_permissions=[*->user] (list of strings)

Configures the permissions for each principal. This option provides a generic mechanism for configuring permissions where principals are cryptographically authenticated through some other mechanism, such as TLS client certificates or OAuth 2.0 bearer tokens.

Each value must specify a principal and a role as principal->role. Principals can be specified directly as user@example.com or user, specified as all users in a domain *@example.com, or everyone * (just the star character). Roles must be one of none, admin, user, cache-reader, and cache-writer.

Permissions are evaluated based on most-specific to least-specific rather than in the order specified. Therefore, an exact principal match wins over a domain-based match, and the default setting applies only if no other rule applies.

Note that some authentication mechanisms implicitly refuse a connection if the client principal cannot be determined.
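The most-specific-first evaluation can be sketched as below. This is not the server's code: the function name resolve_role is hypothetical, and returning "none" when no rule matches at all is an assumption, since the text only describes the default * rule.

```python
def resolve_role(principal, rules, default="none"):
    """Pick the role for a principal, checking the most specific rule first."""
    # Each rule has the form principal->role; rule order does not matter.
    parsed = dict(rule.split("->", 1) for rule in rules)
    # 1. An exact principal match wins.
    if principal in parsed:
        return parsed[principal]
    # 2. Next, a domain-wide rule such as *@example.com.
    if "@" in principal:
        domain_rule = "*@" + principal.split("@", 1)[1]
        if domain_rule in parsed:
            return parsed[domain_rule]
    # 3. Finally, the catch-all rule (just the star character).
    return parsed.get("*", default)
```

Given the rules *->cache-reader, *@example.com->user, and alice@example.com->admin, the sketch resolves alice@example.com to admin, bob@example.com to user, and any other principal to cache-reader, regardless of the order in which the rules are listed.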

--public_bind_to_any=true (boolean)

Whether to configure the public port to listen on all local IPs. If set to false, then the scheduler nodes will only listen on the internal IP addresses specified with --private_ip_selector. DO NOT leave this at true for clusters that are connected to the public internet and that do not have authentication configured.

--public_port=8080 (integer)

The public port on which this cluster listens for connections. Note that typical Linux installations prevent non-root processes from listening on ports below 1024.

--strict_http_headers=false (boolean)

If set to true, sets a number of HTTP headers on all HTTP responses to improve the security of the UI, such as X-Frame-Options and Content-Security-Policy.

--tls_certificate= (string)

The file name of the TLS certificate to be used by the scheduler nodes to authenticate themselves to clients on the public cluster port (--public_port). A certificate and key are required to support encrypted connections. If this is a self-signed certificate, then you also have to configure the client with the same certificate. If you want to use unencrypted connections, you have to set --insecure=true.

--tls_key= (string)

The file name of the TLS key that matches the certificate given as --tls_certificate.

--tls_trusted_certificate= (string)

Required when --client_auth=mtls, must be empty otherwise. The file name of the certificate that is used by the scheduler nodes to authenticate clients (aka mutual TLS authentication or mTLS). In order to use this, you have to generate client certificates, sign them with the certificate given here, and configure the clients to authenticate themselves. Furthermore, you have to grant permissions to clients using the --principal_based_permissions flag. Note that incoming connections without a TLS certificate are always denied.

Bazel supports client authentication at version 3.1 or later: use the --tls_client_certificate and --tls_client_key options to enable client authentication.

Options to configure service discovery

--aws_region= (string)

Only used when --discovery=aws. Selects the AWS region to scan for instances. If unset, the current region is used.

--cluster_name=default (string)

Only used when --discovery=gcp or --discovery=aws. The cluster name used to auto-detect instances belonging to the same cluster. All instances must be tagged as engflow_re_cluster_name=[cluster_name] and scheduler instances must additionally be tagged as engflow_re_scheduler_name=[cluster_name].

--discovery=multicast (one of: {gcp, aws, k8s, static, multicast})

Select the discovery mechanism to use. This usually matches the platform that the software runs on.

--gcp_zones= (string)

Only used when --discovery=gcp. A comma-separated list of GCP zones in which to look for instances. If unset, discovery only searches the current zone where this instance runs.

--incompatible_use_low_offsets=false (boolean)

Hazelcast requires one or two ports in addition to the private ports (and the public ports for schedulers); if this is set, then it uses private_port + 1 (all instances) and private_port + 2 (schedulers). This flag must be set identically across all instances in the same cluster; changing the value is an incompatible change.

--k8s_all_pods_service=null (string)

Only used when --discovery=k8s. Name of the Kubernetes NodePort service that connects to all Pods.

--k8s_master=null (string)

Only used when --discovery=k8s. DNS or IP:port of the Kubernetes Master. Leave it empty to use the default; usually you don't need to specify this flag.

If some pods can't discover others and print errors like Failure in executing REST call (...) Caused by: java.net.UnknownHostException: kubernetes.default.svc, then override this flag with https://IP:port where IP and port are that of the Kubernetes Master (see output of kubectl cluster-info).

--k8s_namespace=engflow-re (string)

Only used when --discovery=k8s. Name of the Kubernetes namespace. All Kubernetes objects should be in this namespace, and it must match the namespace value in the yaml files.

--k8s_scheduler_pods_service=null (string)

Only used when --discovery=k8s. Name of the Kubernetes NodePort service that connects to all scheduler Pods.

--split_cluster_name=false (boolean)

Only used when --discovery=gcp or --discovery=aws. If true, then scheduler instances will auto-detect other scheduler instances using the tag engflow_re_scheduler_name=[cluster_name] in addition to auto-detecting worker instances using the tag engflow_re_cluster_name=[cluster_name].

--static_cas_node=[] (list of strings)

Only used when --discovery=static. IP address and port of another CAS node (scheduler or worker), e.g. 1.2.3.4:5678. The port must be that instance's --private_port + 2000. This instance joins that instance's cluster.

You don't have to list all instances' IP and port, but at least one that you list must be online so this one can join. The more instances you list, the less sensitive your cluster will be to machine start order. If you omit the port, nodes may fail to form a cluster. Also see --incompatible_use_low_offsets.

--static_scheduler=[] (list of strings)

Only used when --discovery=static. IP address and port of another scheduler (for the schedulers-only cluster), e.g. 1.2.3.4:5678. The port must be that instance's --private_port + 1000. This instance joins that instance's cluster.

You don't have to list all instances' IP and port, but at least one that you list must be online so this one can join. The more instances you list, the less sensitive your cluster will be to machine start order. If you omit the port, nodes may fail to form a cluster. Also see --incompatible_use_low_offsets.
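For static discovery, the peer addresses can be derived from the instances' IPs as in the following sketch. It assumes the default (high) port offsets and that every instance uses the same --private_port; the function name and the dictionary layout are hypothetical, and how list values are passed on the command line is not shown.

```python
def static_peer_addresses(scheduler_ips, worker_ips, private_port=9321):
    """Compute ip:port values for --static_scheduler and --static_cas_node."""
    return {
        # Schedulers-only cluster: private port + 1000.
        "static_scheduler": [
            f"{ip}:{private_port + 1000}" for ip in scheduler_ips
        ],
        # CAS cluster (schedulers and workers): private port + 2000.
        "static_cas_node": [
            f"{ip}:{private_port + 2000}" for ip in scheduler_ips + worker_ips
        ],
    }
```

Listing more than one address per flag makes the cluster less sensitive to machine start order, as noted above.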

Options to configure the CAS

--cas_path=/tmp/base/ (string)

The path under which the local CAS is stored and local execution trees are created. The local CAS and the local execution trees should be on the same file system to support hard-links and atomic file moves.

--default_replica_timeout=24h (duration)

The duration for which replicas are retained. Files whose retention duration has expired may be deleted if space is needed for new files. This applies to all CAS writes and existence checks, either initiated by a client, or initiated by a worker to store action outputs. Therefore, this needs to be set conservatively to the longest required duration; at a minimum, it should be set to the longest duration a single build can take. As of 2020-04-09, this is the only way to set replica durations.

--disk_size=0 (capacity)

The total disk size. If this is set to a non-zero value, then the CAS and replica sizes are computed automatically based on this number. Specifically, we set the total CAS size (--max_cas_size) to 80% of the given number (effectively reserving 20% of the space for the OS), minus the number of workers (--worker_config) times the maximum output tree size (--max_output_size). We set the max replica size to half that number.

If this flag is set to a non-zero value, then the --max_cas_size and --max_replica_size options are ignored. If neither this nor --max_cas_size and --max_replica_size are set, the total disk size is derived from the size of the volume --cas_path is on.
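The computation described for --disk_size can be written out as a short sketch. It reads "half that number" as half of the computed CAS size, which is an interpretation; the function name is hypothetical and all sizes are in the same unit (for example gigabytes).

```python
def derive_cas_sizes(disk_size, num_workers, max_output_size):
    """Derive (max_cas_size, max_replica_size) from the total disk size."""
    # 80% of the disk, minus room for each worker's maximum output tree.
    max_cas_size = disk_size * 80 // 100 - num_workers * max_output_size
    # Assumption: the replica budget is half of the computed CAS size.
    max_replica_size = max_cas_size // 2
    return max_cas_size, max_replica_size
```

For a 1000 GB disk with 4 workers and a 10 GB maximum output size, this gives a CAS size of 760 GB and a replica size of 380 GB.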

--enable_distributed_cas=true (boolean)

Whether this instance should participate in the distributed CAS. If this is true, the instance makes some or all local files available to other instances in the cluster. If false, the instance does not make local files available. However, it still uses the local disk to cache files for local use. The main use case for disabling this flag is for satellite clusters where a subset of machines is remote to the majority of the cluster and should not make their files available to the main cluster. Note that these instances can still pull files from the other instances in the main cluster, not just from external storage.

--experimental_async_storage_uploads=false (boolean)

If false, wait for successful uploads to both the distributed CAS and external storage. If true, do not wait for uploads to external storage to complete.

--experimental_force_lru=false (boolean)

This flag selects between two policies for storing metadata in the distributed CAS. If disabled, the metadata expires some time after the default replica lifetime. If enabled, the cluster evicts metadata based on available space rather than time. That means that metadata may be lost before the replica lifetime expires if there are a lot of files in the cluster, but has the advantage that metadata can stay around for longer if there is space available. If the cluster uses external storage, then you should use --experimental_opportunistic_cas instead, which changes the metadata policy in the same way, but additionally reduces re-replication when CAS nodes are lost.

--experimental_opportunistic_cas=false (boolean)

This flag selects between two policies for the distributed CAS. If disabled, the cluster only tracks 'replicas' which are guaranteed to be available for a certain amount of time and attempts to re-replicate files if CAS nodes are lost. If enabled, the cluster tracks all files in the distributed CAS and evicts files and metadata based on available space rather than time. This flag is intended to be enabled when using external storage as a backup.

--incompatible_batch_read_blobs_verifies_digests=false (boolean)

Transitional option to enable an incompatible bugfix for CAS.batchReadBlobs.

When false, CAS.batchReadBlobs does not verify digests in the request. If one has an incorrect hash or negative size, the response for it is NOT_FOUND. If a zero-byte digest has the wrong hash, the response is OK and an empty data blob.

When true, CAS.batchReadBlobs verifies all digests in the request. Digests with negative size, invalid hash, or zero-byte digests with non-matching hash will all get an INVALID_ARGUMENT response and no data returned.

--max_cas_size=0 (capacity)

The maximum total size of the local CAS, including replicas and locally cached files. The local CAS keeps files as long as possible, and only evicts them when this value is exceeded. Therefore, this needs to be smaller than the total available disk space by at least the number of local executors times the maximum output size per action when using hardlinks for inputs, or the combined input and output size when using copies (see --worker_config, --max_input_size, and --max_output_size). This flag is ignored if --disk_size is not set to 0.

--max_replica_size=0 (capacity)

The part of the local CAS that is available to replicas. That is, the total space used by replicas on the local machine may not exceed this value. This must be less than the CAS size minus the number of local executors times the maximum input size per action (see --worker_config and --max_input_size); otherwise the worker can run out of disk space. This flag is ignored if --disk_size is not set to 0.
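When setting --max_cas_size and --max_replica_size by hand, the two constraints above can be checked with a sketch like the following. The function and parameter names are hypothetical, and all sizes must use the same unit.

```python
def check_cas_limits(disk_space, max_cas_size, max_replica_size, executors,
                     max_input_size, max_output_size, hardlinks=True):
    """Return a list of violated constraints (empty if the sizes look safe)."""
    problems = []
    # With hardlinked inputs only outputs need extra space per executor;
    # with copied inputs, both inputs and outputs do.
    per_action = max_output_size if hardlinks else max_input_size + max_output_size
    if max_cas_size > disk_space - executors * per_action:
        problems.append("max_cas_size leaves too little room for execution trees")
    if max_replica_size >= max_cas_size - executors * max_input_size:
        problems.append("max_replica_size is too large for the CAS size")
    return problems
```

An empty result means both constraints hold; a non-empty result names the setting to reduce.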

--recover_cas_blobs=true (boolean)

Transitional option to roll out a bugfix.

If true, workers will scan the --cas_path for left-behind (or pre-loaded) CAS content.

If false, workers ignore such blobs. Please let EngFlow know if you find the need to disable this flag.

--replica_count=1 (integer)

The number of replicas for each CAS entry corresponding to a file that has a retention duration (see --default_replica_timeout). This must not exceed the number of nodes that participate in the distributed CAS (typically the same as the worker nodes). The system automatically re-replicates files if an existing node is lost, as long as the file does not exceed its retention duration (measured from the time it was written or existence-checked). As of 2020-04-09, the maximum supported --replica_count is 3.

--use_linux_acls=false (boolean)

If this is true, then the worker sets ACLs on the work directory to allow actions to access files in the input tree and itself to access files in the output tree, regardless of ownership. This makes it possible to run actions as another user, e.g., under Docker.

Enabling this flag requires the setfacl tool to be available on the host machine (e.g., on Debian, by installing the acl package) and that the file system supports ACLs (newer versions of Debian and Ubuntu support ACLs by default). Do not enable this flag on macOS or Windows.

Options to configure backup storage

--cas_existence_cache_expiry=24h (duration; previous name: --experimental_cas_existence_cache_expiry)

Used only when --external_storage is not none. Specifies the maximum retention time of replicas in the CAS existence cache.

This flag is closely related to --local_cas_existence_cache_expiry but they operate on different levels: this flag controls how long we count on the replica remaining in backup storage, before checking again (if requested).

Setting this to 0 disables the cache. When the cache is enabled, this instance saves which blobs it has seen recently, and won't look them up in the backup storage if something asks for them again. The higher this duration, the longer entries may be kept in the cache, and the less frequently we check the storage backend if the blob really exists or not. Note that actual retention time also depends on --cas_existence_cache_max_size and how full the cache is.

Choose a cache expiry value that is shorter than --default_replica_timeout and also shorter than how often you delete blobs from backup storage.

--cas_existence_cache_max_size=10000000 (integer; previous name: --experimental_cas_existence_cache_max_size)

Used only when --external_storage is not none. Specifies the maximum number of replicas in the CAS existence cache. Setting a higher value increases memory use (each entry adds about 100 bytes) but can significantly reduce the number of calls and upload traffic to the storage backend (potentially by 10x). Setting this value to 0 disables the cache; setting it to -1 means no upper bound; setting it to any positive number sets that maximum number of entries.
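To make the interaction between the expiry and the maximum size concrete, here is a minimal sketch of an existence cache with both limits. It is not the actual implementation: the class name, the injectable clock, and the oldest-first eviction policy are assumptions made for illustration.

```python
import time
from collections import OrderedDict


class ExistenceCache:
    """Remembers recently seen blob digests, bounded by age and entry count."""

    def __init__(self, expiry_seconds, max_size, clock=time.monotonic):
        self.expiry = expiry_seconds  # 0 disables the cache
        self.max_size = max_size      # 0 disables, -1 means no upper bound
        self.clock = clock
        self.entries = OrderedDict()  # digest -> time it was last recorded

    def enabled(self):
        return self.expiry > 0 and self.max_size != 0

    def record(self, digest):
        if not self.enabled():
            return
        self.entries.pop(digest, None)
        self.entries[digest] = self.clock()
        # Assumption: when full, the least recently recorded entry is evicted.
        while self.max_size > 0 and len(self.entries) > self.max_size:
            self.entries.popitem(last=False)

    def seen_recently(self, digest):
        seen_at = self.entries.get(digest)
        if seen_at is None:
            return False
        if self.clock() - seen_at >= self.expiry:
            del self.entries[digest]  # expired; the backend must be asked again
            return False
        return True
```

In this sketch an entry can disappear before its expiry when the cache is full, which mirrors the note that actual retention time also depends on --cas_existence_cache_max_size.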

--experimental_read_timeout=2m (duration)

Sets a timeout for proxy calls that acts as a fail-safe if the client reads very slowly, or if it does not propagate cancellation correctly; several versions of Bazel have this bug.

--external_storage=none (one of: {none, gcs, s3})

The kind of external storage to use to back up replicas, in addition to storing them on the worker machines. none means no backup, gcs means Google Cloud Storage (GCS), s3 means Amazon S3.

Deprecation: the values gcp and aws (synonyms for gcs and s3) are also supported, but deprecated. They will no longer be supported in version 2.0 and later.

--external_storage_scheduler_threads=50 (integer)

Only used when --external_storage is not none. Specifies how many threads to use to serve external storage requests on schedulers. The value is a positive integer.

--external_storage_worker_threads=50 (integer)

Only used when --external_storage is not none. Specifies how many threads to use to serve external storage requests on workers. The value is a positive integer.

--gcs_blobs_root=blobs (string)

Only used when --external_storage=gcs. Path in the GCS bucket for blobs.

--gcs_bucket=null (string)

Only used when --external_storage=gcs. Name of the GCS bucket.

--gcs_credentials=null (string)

Only used when --external_storage=gcs. Path to the JSON file with the GCS Service Account's credentials. Can be empty if GOOGLE_APPLICATION_CREDENTIALS is set to the JSON file's path.

--gcs_project_id=null (string)

Only used when --external_storage=gcs. Name of the GCP project ID for GCS use.

--incompatible_no_storage_backend_metrics=false (boolean)

Scheduled to be flipped in version 2.0.0.

Used only when --external_storage is not none. If true, the storage backend (GCS / AWS S3) will not report metrics. We recommend using the com.engflow.re.storage/* metrics instead, or the default metrics reported by the backend itself.

--incompatible_s3_use_structured_paths=false (boolean)

If true, use structured paths for CAS and Action Cache entries in S3. This can improve performance on large deployments.

--s3_blobs_root=blobs (string)

Only used when --external_storage=s3. Path in the S3 bucket for blobs.

If not empty, then we recommend you specify a relative path (foo/bar) and not an absolute path (/foo/bar). This is because Amazon S3 (and possibly other S3 implementations) treats a leading '/' as part of the first directory segment.

We also suggest not adding a trailing /; this is added automatically.

If the blobs root is non-empty, the final path of a blob is <blobs_root>/<subdir>/<blob>; otherwise it is <subdir>/<blob>.
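The path rule can be sketched as a small helper. The function name and the example subdirectory and blob names are hypothetical, and the sketch assumes the blobs root follows the recommendations above (relative, without a trailing /).

```python
def s3_blob_key(blobs_root, subdir, blob):
    """Build the object key for a blob from the configured blobs root."""
    # An empty root is skipped so the key does not start with '/'.
    parts = [blobs_root, subdir, blob] if blobs_root else [subdir, blob]
    return "/".join(parts)
```

With the default root, s3_blob_key("blobs", "cas", "abc123") gives "blobs/cas/abc123"; with an empty root the result is "cas/abc123".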

--s3_bucket=null (string)

Only used when --external_storage=s3. Name of the S3 bucket.

--s3_endpoint=null (string)

Only used when --external_storage=s3. Set this to override the computed S3 endpoint. This allows running against compatible implementations of S3.

--s3_region=null (string)

Only used when --external_storage=s3. Name of the S3 bucket's region. Can be empty if AWS_REGION is set to this value.

Options to configure the execution service

--action_execution_attempts=3 (integer)

How many times an action should be attempted if one of the retry conditions is true. These are controlled through separate flags, such as --experimental_retry_failure_due_to_signal.

--allow_docker=false (boolean)

Whether to enable dockerized execution. In order to use dockerized execution, the client also needs to send docker image ids, and the worker must have the corresponding docker images available. As of 2020-04-14, dockerized execution is only supported on Linux VMs (i.e., not on macOS or Kubernetes clusters).

--allow_local=false (boolean)

Whether to enable local execution. You must enable one of --allow_local, --allow_sandbox, or --allow_docker to be able to run actions at all. If multiple flags are enabled, then the strategy is selected based on the requested execution platform. In that case, the worker selects the first of docker, sandbox, and local in that order.
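The selection order can be sketched as follows. How the requested execution platform determines which strategies are applicable to an action is not spelled out here, so the sketch takes that as an input; the function name and both parameters are hypothetical.

```python
PREFERENCE = ("docker", "sandbox", "local")


def select_strategy(enabled, applicable):
    """Return the first strategy, in preference order, that is usable.

    enabled:    strategies turned on via --allow_docker, --allow_sandbox,
                and --allow_local.
    applicable: strategies that fit the action's requested platform
                (an assumption of this sketch).
    """
    for strategy in PREFERENCE:
        if strategy in enabled and strategy in applicable:
            return strategy
    return None  # no usable strategy; the action cannot run
```

For instance, with all three strategies enabled and an action that does not request a docker image (so only sandbox and local apply), the sketch returns "sandbox".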

--allow_sandbox=false (boolean; previous name: --sandbox)

Whether to enable sandboxed execution. If enabled, sandboxed execution is used for actions that do not specify a docker image. Also see --allow_local.

This enables the use of --sandbox_binary_path as a wrapper for each action. The behavior of the upstream linux-sandbox binary is to create a new user namespace and init process. It can optionally create a network namespace to block network access (see --sandbox_allow_network_access) and mount a tmpfs (see --sandbox_tmpfs_dir).

You can additionally control sandboxing features through action platform settings.

--debug_execute_requests=false (boolean)

If this is true, the worker prints the execute request in full detail to the log. This can generate very large amounts of output, so use with caution.

--docker_additional_env=[] (list of strings)

A list of additional environment variables that are set in every docker container. Changes to this flag are non-hermetic, i.e., the system returns existing cache entries and does not force a rerun of the affected actions.

--docker_additional_mounts=[] (list of strings)

A list of additional directories that are mounted into every docker container. Changes to this flag are non-hermetic, i.e., the system returns existing cache entries and does not force a rerun of the affected actions.

--docker_allow_any_runtime=true (boolean)

If false, then requesting a specific runtime will fail the execution unless it is explicitly allowed using --docker_allowed_runtimes.

--docker_allow_network_access=true (boolean)

If true, then actions can request access to sibling containers and the internet using the dockerNetwork platform setting. Otherwise actions requesting such access fail.

When enabled, action execution containers that are started with dockerNetwork=standard will be connected to a Docker bridge network. The network's name is set in the execution container as the $HOST_NETWORK_NAME environment variable.

When disabled, the value of --docker_default_network_mode is ignored and taken to be none.

--docker_allow_requesting_capabilities=true (boolean)

If false, then requesting capabilities will fail the execution.

--docker_allow_reuse=true (boolean)

Whether to allow reusing Docker containers. If true, we allow reusing a running Docker container for subsequent actions that specify the same image id and Docker options; otherwise we start a new container for every action. In order to enable container reuse, you also have to enable the dockerReuse platform option. Depending on the underlying machine, Docker startup can take several seconds.

--docker_allow_sibling_containers=true (boolean)

If true, then actions can request access to docker with the dockerSiblingContainers platform setting. Otherwise actions requesting such access fail.

--docker_allowed_runtimes=[] (list of strings)

Ignored if --docker_allow_any_runtime=true. A list of runtimes that clients are allowed to set. If you want to allow the default runtime, you have to add the empty string to this list.

--docker_content_trust=false (boolean)

Whether to enable docker's signature verification. When enabled, docker only allows running signed images.

--docker_cpu_limit=set (one of: {none, count, set})

Whether and how to limit docker action CPU usage. Use 'none' to apply no per-action limit, 'count' to set the maximum CPU usage in number of cores, and 'set' to restrict the action to a specific set of cores. Both 'count' and 'set' are computed from the --worker_config option; 'count' simply applies the number of cores, whereas 'set' computes non-overlapping CPU masks starting at 0. We recommend using 'set' if possible, and 'count' otherwise. Use 'none' only if CPU limitation does not work for some reason. Note that the 'set' setting assumes that the worker service has full control over the machine - another process assigning the same CPUs on the same machine can lead to conflicts and performance issues.
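As a rough illustration of the 'set' mode, the following sketch assigns non-overlapping core sets starting at core 0, given the number of cores reserved by each executor. The function name and the contiguous assignment order are assumptions made for this example; the text above only states that the masks do not overlap and start at 0.

```python
def assign_cpu_sets(cores_per_executor):
    """Assign each executor a disjoint, contiguous block of core ids.

    Assumption: cores are handed out in order, starting at core 0.
    """
    cpu_sets = []
    next_core = 0
    for count in cores_per_executor:
        cores = range(next_core, next_core + count)
        cpu_sets.append(",".join(str(core) for core in cores))
        next_core += count
    return cpu_sets


# One executor with 3 cores and two executors with 1 core each.
print(assign_cpu_sets([3, 1, 1]))  # ['0,1,2', '3', '4']
```

Because the sets start at core 0 regardless of what else runs on the machine, a second process that pins itself to the same cores would compete with the actions, which is the conflict the note above warns about.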

--docker_default_network_mode=off (one of: {off, standard, host})

Only used when --allow_docker=true. Ignored and always none if --docker_allow_network_access=false.

Specifies the default network mode for dockerized actions that don't request any particular dockerNetwork platform option.

--docker_disallowed_capabilities=[] (list of strings; previous name: --docker_blacklisted_capabilities)

A list of capabilities that must not be set in execution requests. A request setting a capability provided here fails execution.

--docker_drop_capabilities=[] (list of strings)

A list of docker capabilities that are dropped by default in addition to those that are already dropped by docker.

--docker_enable_ipv6=false (boolean)

Whether to enable IPv6 for the Docker network.

--docker_enforce_known_capabilities=true (boolean)

If true, then all capabilities that are requested to be added are checked against a list of known capabilities before they are passed to docker. If any requested capability is not known, execution fails.

--docker_ipv6_cidr=fd00::/16 (string)

Only used when --docker_enable_ipv6=true. The subnet CIDR range for IPv6 Docker networks. Worker instances use this to generate random IPv6 subnets for each executor; each generated subnet will begin with the given prefix, and have a subnet length given by --docker_ipv6_subnet_length. This can either be a private subnet (starting with fd00), which does not allow any outgoing IPv6 traffic, or it can be public, in which case it should be based on the IPv6 subnet assigned to the underlying machine.

For example, if the machine uses 2001:0db8:3333:4444:5555:6666:7777:8888/64, and this flag is set to 2001:0db8:3333:4444:ff00::/72, and the subnet length is 96, then the worker generates random subnets that look like 2001:0db8:3333:4444:ffXX:XXXX::/96, with each X replaced by a random hexadecimal digit.

Note: the value given here can be identical to the value configured in the Docker daemon's fixed-cidr-v6 configuration option.
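The subnet generation described for --docker_ipv6_cidr can be sketched with Python's standard ipaddress module. This is only an illustration of the arithmetic, not the worker's actual implementation; the function name random_docker_subnet is hypothetical.

```python
import ipaddress
import random


def random_docker_subnet(cidr, subnet_length, rng=None):
    """Pick a random subnet with prefix length `subnet_length` inside `cidr`."""
    rng = rng or random.Random()
    base = ipaddress.IPv6Network(cidr)
    if not base.prefixlen <= subnet_length <= 128:
        raise ValueError("subnet length must lie between the CIDR prefix length and 128")
    # Bits between the CIDR prefix and the subnet length are chosen at random.
    free_bits = subnet_length - base.prefixlen
    offset = rng.getrandbits(free_bits) if free_bits else 0
    address = int(base.network_address) | (offset << (128 - subnet_length))
    return ipaddress.IPv6Network((address, subnet_length))


subnet = random_docker_subnet("2001:0db8:3333:4444:ff00::/72", 96)
print(subnet.exploded)       # e.g. 2001:0db8:3333:4444:ffXX:XXXX:0000:0000/96
print(subnet.num_addresses)  # 4294967296, i.e. 2 ** (128 - 96)
```

With the default values (fd00::/16 and a subnet length of 112), each generated subnet has 2^16 = 65536 addresses.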

--docker_ipv6_subnet_length=112 (integer)

Only used when --docker_enable_ipv6=true. The subnet CIDR prefix length for IPv6 Docker networks; the generated Docker subnets will have 2^(128-X) addresses. See the documentation of --docker_ipv6_cidr for more details.

--docker_max_kernel_memory=0 (capacity)

This is passed to docker to limit the amount of kernel memory available to each action. If unset, then there is no limit applied to docker; memory use is still limited by the available machine memory.

--docker_max_memory=0 (capacity)

This is passed to docker to limit the amount of memory available to each action. If unset, then there is no limit applied to docker; memory use is still limited by the available machine memory.

--docker_process_limit=10000 (integer)

The maximum number of concurrent processes for a single action. This helps prevent runaway processes and fork bombs. Set to -1 for no limit, but beware that this allows build actions to fork-bomb the worker.

--docker_split_exec_run=true (boolean)

If true, the worker uses separate docker run and docker exec commands to run each action. This allows measuring Docker startup time, but does not work with all docker runtimes, such as runsc (also see the dockerRuntime platform option). This flag only affects non-reusable docker containers; reusable docker containers always have to be run with separate commands. I.e., if you want to use a docker runtime that does not support docker exec, you also have to set dockerReuse=False in the platform.

--docker_use_image_id=false (boolean)

Whether to resolve docker URLs to image ids and use those for docker run. This is a transitional option to migrate to using image ids for docker run, which in turn allows storing Docker images in the CAS to reduce download times and improve reliability.

--docker_use_process_wrapper=false (boolean)

Whether to run Docker actions through the process wrapper. This also requires setting --process_wrapper_binary_path. Note that this may fail at runtime if the selected Docker container is not compatible with the process-wrapper binary, which is usually linked against libc and libstdc++ among other system libraries.

--experimental_docker_avoid_fifo=true (boolean)

Whether to avoid using a FIFO to control Docker container shutdown. Enabling this improves compatibility with non-standard runtimes like gVisor (runsc). This is a transitional option to migrate to non-fifo containers.

--experimental_docker_force_reuse=false (boolean)

Whether to enforce reusing Docker containers. This is ignored if --docker_allow_reuse is false. If both are true, then the service attempts to reuse running Docker containers regardless of the client setting for the dockerReuse platform option.

--experimental_docker_use_platform_user=false (boolean)

Setting this flag changes the user / group selection for actions. If this flag is false, actions are run as the same user / group as the worker service. If this flag is true, then actions are run as 'nobody:nogroup' by default, and can optionally run as 'root:root' if the dockerRunAsRoot platform option is set to True. Setting this flag to true additionally requires --use_linux_acls=true; otherwise actions will fail due to file system access restrictions. If this flag is enabled, the worker behaves as if --docker_use_addgroup is also enabled.

--experimental_persistent_worker=false (boolean)

Whether to enable experimental support for remote persistent workers. Persistent workers are a mechanism in Bazel to reduce startup overhead for compilers and other tools, and are widely used for Java-based tools. Note that enabling support on the worker is not sufficient to use persistent workers - the client must also annotate the persistent worker inputs.

As of 2020-10-28, this requires a patched Bazel binary.

--experimental_persistent_worker_and_docker=false (boolean)

Whether to enable experimental support for remote persistent workers. Persistent workers are a mechanism in Bazel to reduce startup overhead for compilers and other tools, and are widely used for Java-based tools. Note that enabling support on the worker is not sufficient to use persistent workers - the client must also annotate the persistent worker inputs.

As of 2020-10-28, this requires a patched Bazel binary.

--experimental_persistent_worker_expand_param_files=true (boolean)

Bazel expands params files (passed with '@filename' to the worker), but considers this legacy behavior; the new '-flagfile' and '--flagfile' arguments are never expanded. This flag controls expansion for '@filename' parameters. If disabled, the service does not expand these parameters, which differs from Bazel, and may not be compatible with all persistent worker implementations.

--experimental_retry_failure_due_to_signal=false (boolean)

Whether to retry actions that fail due to a system signal (128 < exit code < 255). Use --action_execution_attempts to control the maximum number of attempts.
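The exit code range above can be checked as in the following sketch. Mapping an exit code to a signal number by subtracting 128 follows the common shell convention and is an assumption here, as are the function names.

```python
import signal


def failed_due_to_signal(exit_code):
    """Return True if the exit code falls in the range 128 < code < 255."""
    return 128 < exit_code < 255


def signal_name(exit_code):
    """Best-effort signal name, assuming the shell convention code = 128 + signal."""
    if not failed_due_to_signal(exit_code):
        return None
    try:
        return signal.Signals(exit_code - 128).name
    except ValueError:
        return None  # Not a signal number known on this platform.


print(failed_due_to_signal(137))  # True
print(failed_due_to_signal(1))    # False
print(signal_name(137))           # SIGKILL on Linux
```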

--ignore_unknown_platform_properties=false (boolean)

Whether to ignore unknown platform properties. If false, then actions that set unknown platform properties return an error. Otherwise such properties are silently ignored. Note that changing this flag does not affect existing entries in the action cache, i.e., the server may return cached entries even if re-executing the action would return an error due to unknown properties. All properties are part of the cache key.

--incompatible_keep_relative_argv0=false (boolean)

Transitional option to roll out a bugfix.

When enabled, we keep actions' relative argv0 as relative. When false, we absolutize the path (which may be harmful when it changes the tools' behavior). Flipping this flag changes the command lines we execute.

Removes support for building exec root using symlinks.

--keep_exec_directories_for_debugging=false (boolean; previous name: --debug_actions)

Whether to keep the execution directories after execution for debugging. You also need terminal access to the worker machines to inspect these directories. DO NOT enable this in a production cluster. This flag is silently ignored when persistent workers are used (either with --experimental_persistent_worker or --experimental_persistent_worker_and_docker) or when incremental exec roots are enabled (with --experimental_incremental_exec_root).

--max_download_concurrency=200 (integer)

The maximum number of concurrent downloads to a worker before an action starts. A negative or zero value indicates no limit. This may be useful to limit the CAS read load and to avoid running out of file descriptors.

--max_execution_timeout=15m (duration)

The maximum timeout for the execution of a single action. Clients typically only set timeouts for a subset of actions such as test actions to avoid cache fragmentation. The timeout set here applies to all execution requests that do not have a timeout set. In addition, it also provides an upper bound for execution requests that do have a timeout set, i.e., requested timeouts larger than this are silently ignored.
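One way to read the timeout rules above is the following sketch. It assumes that "silently ignored" means the maximum applies in place of a larger requested timeout; the function name and the use of seconds are choices made for this example.

```python
def effective_timeout_seconds(requested_seconds, max_seconds=15 * 60):
    """Return the timeout applied to an action.

    requested_seconds is None when the client did not set a timeout.
    Assumption: a requested timeout above the maximum is replaced by the maximum.
    """
    if requested_seconds is None or requested_seconds > max_seconds:
        return max_seconds
    return requested_seconds


print(effective_timeout_seconds(None))  # 900
print(effective_timeout_seconds(60))    # 60
print(effective_timeout_seconds(3600))  # 900
```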

--max_input_size=1gb (capacity)

The maximum total size of all inputs to an action. Actions that exceed this limit are aborted during setup.

--max_output_size=1gb (capacity)

The maximum total size of all outputs of an action. Actions that exceed this limit are aborted during or after execution.

--max_upload_concurrency=0 (integer)

The maximum number of concurrent uploads from a worker after an action completes. A negative or zero value indicates no limit. This may be useful to limit the CAS write load.

--notification_period=1m (duration)

Configures how often the service provides updates to the client about running actions. Note that this does not apply to queued actions.

--operation_retention_time=1m (duration)

Configures the duration for which the worker retains a finished action before deleting it locally. The worker uses these retained entries to answer waitExecution requests in case the client disconnects during execution. A very small value can cause unnecessary action retries and execution load, and a very large value can cause excessive memory use on the worker.

--process_wrapper_binary_path=/usr/bin/engflow/process-wrapper (string)

The path to a process-wrapper binary on the worker. The process-wrapper binary is part of a Bazel installation and provides improved control of action processes.

--process_wrapper_cpu_limit=none (one of: {none, set})

Whether and how to limit action CPU usage when using the process wrapper. Use 'none' to apply no per-action limit, and 'set' to restrict the action to an automatically computed set of cores. We recommend using 'set' if possible. Use 'none' only if CPU limitation does not work for some reason. Note that the 'set' setting assumes that the worker service has full control over the machine, as it assigns CPUs starting at 0.

--sandbox_allow_network_access=true (boolean)

If true, sandboxed actions can request network access by setting the platform option sandboxNetwork, e.g., exec_properties = { "sandboxNetwork": "standard" }. Otherwise, such actions fail.

--sandbox_binary_path=/usr/bin/engflow/linux-sandbox (string)

The path to a linux-sandbox binary on the worker. The linux-sandbox binary is part of a Bazel installation and uses Linux Kernel APIs to sandbox the execution of an action process.

--sandbox_grace_timeout=5s (duration)

How long to wait before sending SIGKILL after an action times out. When an action times out, we first send it SIGTERM and only send SIGKILL after this grace period. The value may be rounded up to the next larger whole second.

--sandbox_tmpfs_dir=null (string)

Sets the location for an empty tmpfs directory inside the sandbox.

--sandbox_writable_path=[] (list of strings)

Additional absolute paths that are writable within the sandbox.

--upload_outputs_on_failure=true (boolean)

If true, upload all action outputs into the CAS, even if the action failed with a non-zero code. If false, only upload stdout/stderr (and no other outputs) in the same case.

--use_process_wrapper=false (boolean)

Whether to enable the process wrapper for local actions. The process wrapper provides improved process control, ensuring a more consistent execution environment as well as killing all child processes reliably.

--worker_config=1*cpu=1 (string)

Configures the number and properties of local executors. Specify executor properties as a list of key-value pairs separated by commas, such as cpu=1,ram=2G,pool=c1-m2.

To specify multiple identical executors, prefix a set of properties with a number and a * character, such as 4*cpu=2. To specify multiple different executors, combine them with a + character, such as 1*cpu=3,ram=1G+2*cpu=1 (one executor with 3 cores and 1 GB of RAM, and two executors with 1 core). The comma operator has precedence over the star operator, which has precedence over the plus operator. Disable local execution by setting this flag to the empty string.

For automatic configuration, specify auto to create an executor for each available core. This option is useful when the number of cores is not known in advance.

For manual configuration, the only supported keys are cpu, ram, and pool. cpu specifies the number of cores to reserve, ram is silently ignored, and pool specifies the name of the pool for the executor.
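The operator precedence of --worker_config can be illustrated with a small parser sketch. This is not the service's parser; the function name is hypothetical, and two behaviors are assumptions: a group without a number and * counts as one executor, and auto yields one single-core executor per available core.

```python
def parse_worker_config(config, available_cores=1):
    """Expand a --worker_config value into one property dict per executor."""
    if config == "":
        return []  # Local execution disabled.
    if config == "auto":
        # Assumption: one executor with cpu=1 per available core.
        return [{"cpu": "1"} for _ in range(available_cores)]
    executors = []
    for group in config.split("+"):  # '+' binds most loosely.
        count_text, star, props_text = group.partition("*")
        if not star:
            # Assumption: no explicit count means a single executor.
            count_text, props_text = "1", group
        props = {}
        for pair in props_text.split(","):  # ',' binds most tightly.
            key, _, value = pair.partition("=")
            if key not in ("cpu", "ram", "pool"):
                raise ValueError(f"unsupported key: {key!r}")
            props[key] = value
        executors.extend(dict(props) for _ in range(int(count_text)))
    return executors


print(parse_worker_config("1*cpu=3,ram=1G+2*cpu=1"))
# [{'cpu': '3', 'ram': '1G'}, {'cpu': '1'}, {'cpu': '1'}]
```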

--xcode_locator=/usr/local/bin/engflow/xcode-locator (string)

The path to the xcode_locator binary on the worker. The xcode_locator binary is part of every installation of EngFlow Remote Execution Service on macOS and looks up the locations of local Xcode installations. This binary is currently required for local execution on macOS, and must be present on every macOS worker.

Options to configure the result store service

--experimental_google_client_id= (string)

Must be set if and only if --enable_bes=true and --http_auth=google_login.

The client ID from the "Client ID for Web application" page in GCP to enable using Google OAuth to authenticate users on the UI. You must have this client ID correctly configured in GCP to complete the authentication workflow. Note that the email address returned from Google will be matched against the --principal_based_permissions flag to determine permission level. In order to access the build UI, users must be given the admin role.

Monitoring options

--cloudwatch_dimensions=null (string)

Only considered when --enable_cloudwatch=true and ignored otherwise. Sets common dimensions of reported CloudWatch metrics. The value is a comma-separated list of key-value pairs, e.g. "customer=Acme Inc.,cluster=prod"; the order does not matter.

--cloudwatch_metrics_filter=[] (list of strings)

Required when --enable_cloudwatch=true, ignored otherwise. A list of regexes that filter metric names: a metric is reported to AWS CloudWatch only if it matches at least one of the regexes. Entries follow Java regex syntax. Matching is partial by default (e.g. "exec" matches every metric whose name contains this string); to match the whole metric name, use ^ and $. If empty or not specified, then no metrics are reported.

Example: report metrics about AWS S3 use: --cloudwatch_metrics_filter+=storage\.s3/; report download-related metrics but from any storage backend: --cloudwatch_metrics_filter+=storage\..*/download.
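The partial-matching behavior can be tried out with the sketch below. It uses Python's re module rather than Java's regex engine, which behaves the same for simple patterns like these, and the metric names are made up for illustration.

```python
import re


def is_reported(metric_name, filters):
    """A metric is reported if at least one filter matches part of its name."""
    return any(re.search(pattern, metric_name) for pattern in filters)


# Hypothetical metric names, used only to demonstrate matching.
filters = [r"storage\.s3/", r"storage\..*/download"]
print(is_reported("storage.s3/upload_bytes", filters))     # True
print(is_reported("storage.gcs/download_count", filters))  # True
print(is_reported("executor/queue_length", filters))       # False
print(is_reported("executor/queue_length", []))            # False: empty list reports nothing
print(is_reported("executor/queue_length", [r"^exec$"]))   # False: anchored to the whole name
```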

--cloudwatch_namespace=null (string)

Required when --enable_cloudwatch=true, ignored otherwise. Sets the namespace of reported metrics.

--cloudwatch_region=null (string)

Required when --enable_cloudwatch=true, ignored otherwise. Sets the AWS region of reported metrics.

--enable_cloudwatch=false (boolean)

Enables reporting metrics to AWS CloudWatch.

--enable_prometheus=false (boolean)

Enables a built-in webserver to export monitoring data to Prometheus (https://prometheus.io/). You may also need to set --prometheus_port and configure Prometheus to start scraping from all cluster nodes.

--enable_stackdriver=false (boolean)

Enables reporting of monitoring and tracing data to StackDriver (a monitoring system integrated into Google Cloud that also supports AWS). You also need to set --stackdriver_project and provide application default credentials that allow write access to StackDriver.

--enable_zipkin=false (boolean)

Enables reporting of performance traces to Zipkin (https://zipkin.io/). You also need to set --zipkin_endpoint.

--grpc_metrics=minimal (one of: {none, minimal, basic, all})

The gRPC library provides a number of metrics that can be logged for monitoring. This option selects what subset of metrics to log. Unfortunately, logging all metrics can be expensive (e.g., on Google Cloud Operations). For the minimal setting, all completed RPCs are logged, but no latency metrics, bytes, or messages.

--monitoring_trace_probability=0 (float; previous name: --monitoring_sample_probability)

Sets the probability of recording a performance trace for a given client request to a scheduler. Setting it to 0 disables tracing. Setting it to 1 enables tracing every request. Tracing a large fraction of the traffic is expensive, and should not be used for production clusters. Note that this flag is evaluated once on the scheduler for each incoming RPC call and then passed along on subsequent calls.

--netty_metrics=all (one of: {none, all})

The netty library provides a number of metrics that can be logged for monitoring. This option selects what subset of metrics to log. Unfortunately, logging all metrics can be expensive (e.g., on Google Cloud Operations).

--prometheus_bind_to_any=false (boolean; previous name: --monitoring_prometheus_bind_to_any)

Whether to bind to any local IP. If false, then only bind to the private IP selected with --private_ip_selector. If your cluster is connected to the public internet, then enabling this flag exposes your monitoring data publicly.

--prometheus_port=8888 (integer; previous name: --monitoring_prometheus_port)

Selects the local port to start a Prometheus-compatible webserver on.

--stackdriver_export_interval=1m (duration)

Configures the time between metrics exports to StackDriver.

--stackdriver_optimized_reporting=true (boolean)

Transitional option to enable automatic optimization of the metric export interval for each metric based on observed changes. I.e., metrics are only exported when they change rather than every interval. This can significantly reduce Stackdriver costs during periods of low cluster utilization such as nights and weekends.

--stackdriver_project= (string; previous name: --monitoring_stackdriver_project)

Selects the StackDriver project to send monitoring data to.

--zipkin_endpoint=http://localhost:9411/api/v2/spans (string; previous name: --monitoring_zipkin_endpoint)

Configures the Zipkin endpoint to push performance traces to.

Options to configure logging to external services

--aws_log_group_name=null (string)

Only used if --remote_logging_service=aws_cloudwatch. The name of the AWS log group, which must already exist.

--gcp_log_autodetect=true (boolean)

Only used if --remote_logging_service=google_cloud_operations. Whether to automatically detect log labels for this process, like the instance name and availability zone. If you log to GCP from outside of GCP, the automatic detection does not work correctly - in that case, set this flag to false.

--gcp_log_project_id=null (string)

Only used if --remote_logging_service=google_cloud_operations. The GCP project id to log to. Instances that run on GCP automatically detect the current project; you can use this flag to override the automatically detected project id, or provide one explicitly if the instance is not running on GCP.

--remote_log_level=info (one of: {off, severe, warning, info, verbose, all})

The verbosity level of remote logging.

--remote_logging_service=none (one of: {none, google_cloud_operations, aws_cloudwatch})

The external service to log to.

Appendix: flag syntax

Duration flags

You can specify a duration in milliseconds, seconds, minutes, hours, or days. Use the suffix ms, s, m, h, or d respectively:

--flag1=5s
# Means: 5 seconds

--flag2=90m
# Means: 90 minutes
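For reference, the duration suffixes can be interpreted as in the following sketch. It assumes whole-number values and lowercase suffixes; whether the service accepts other forms is not stated here, and the function name is hypothetical.

```python
import re

# Milliseconds per unit for each supported suffix.
_UNIT_MS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}


def parse_duration_ms(text):
    """Convert a duration such as '5s' or '90m' to milliseconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)", text)
    if match is None:
        raise ValueError(f"not a duration: {text!r}")
    return int(match.group(1)) * _UNIT_MS[match.group(2)]


print(parse_duration_ms("5s"))   # 5000
print(parse_duration_ms("90m"))  # 5400000
```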

Capacity flags

You can specify a capacity in bytes, kilobytes, megabytes, or gigabytes. (The multipliers are 1000, not 1024.) Use no suffix for bytes, or use the suffix kb, mb, or gb for the others respectively:

--flag1=10
# Means: 10 bytes

--flag2=10mb
# Means: 10 MB (10^6 bytes)
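The decimal multipliers can be made explicit with a short sketch. It assumes whole-number values and lowercase suffixes, and the function name is hypothetical.

```python
import re

# Decimal multipliers: 1 kb = 1000 bytes, not 1024.
_MULTIPLIER = {"": 1, "kb": 1000, "mb": 1000 ** 2, "gb": 1000 ** 3}


def parse_capacity_bytes(text):
    """Convert a capacity such as '10' or '10mb' to a number of bytes."""
    match = re.fullmatch(r"(\d+)(kb|mb|gb)?", text)
    if match is None:
        raise ValueError(f"not a capacity: {text!r}")
    return int(match.group(1)) * _MULTIPLIER[match.group(2) or ""]


print(parse_capacity_bytes("10"))    # 10
print(parse_capacity_bytes("10mb"))  # 10000000
print(parse_capacity_bytes("1gb"))   # 1000000000
```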

List flags

You can specify list flags multiple times. The += operator adds another value, and the = operator drops all accumulated (or default) values:

--flag=value1 --flag+=value2 --flag+=value3
# Means: [value1, value2, value3] (ignoring the default value)

--flag+=value1 --flag+=value2 --flag+=value3
# Means: [<default values>, value1, value2, value3]

--flag+=value1 --flag=value2 --flag+=value3
# Means: [value2, value3]
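The accumulation rules for list flags can be modeled as in the sketch below, which reproduces the three examples above. The function name is hypothetical, and the handling of an empty value (such as --flag=) is not covered.

```python
def resolve_list_flag(args, name, default):
    """Apply '=' (reset) and '+=' (append) occurrences of a list flag in order."""
    values = list(default)
    append_prefix = f"--{name}+="
    assign_prefix = f"--{name}="
    for arg in args:
        if arg.startswith(append_prefix):
            values.append(arg[len(append_prefix):])
        elif arg.startswith(assign_prefix):
            # '=' drops everything accumulated so far, including defaults.
            values = [arg[len(assign_prefix):]]
    return values


default = ["default"]
print(resolve_list_flag(["--flag=value1", "--flag+=value2", "--flag+=value3"], "flag", default))
# ['value1', 'value2', 'value3']
print(resolve_list_flag(["--flag+=value1", "--flag+=value2", "--flag+=value3"], "flag", default))
# ['default', 'value1', 'value2', 'value3']
print(resolve_list_flag(["--flag+=value1", "--flag=value2", "--flag+=value3"], "flag", default))
# ['value2', 'value3']
```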
2021-09-21