Service Options Reference

Description of all command-line options that configure action execution platforms.

OidcOptions


oidc_config

--oidc_config=[] (list of strings)

The path(s) to the OpenID Connect config file(s); each is either an absolute path or the ID of a secret prefixed with secretstore://.

Each file's or secret's content should be a JSON string with the keys:

  • issuer: optional; one of {GOOGLE, KEYCLOAK, OKTA, OTHER}; default is OTHER
  • client_id
  • client_secret (only required if using the authorization code flow; depends on the OIDC Identity Provider setup)
  • discovery_uri (only usable if the IdP supports config discovery)
  • auth_token_endpoint, token_endpoint, keys_endpoint (only used when discovery_uri is not set)

Content example (with config discovery in case the IdP supports it):

{
  "issuer": "GOOGLE",
  "client_id": "1234567890-XXXXXXXXXX.apps.googleusercontent.com",
  "discovery_uri": "https://accounts.google.com/.well-known/openid-configuration"
}

Content example (without config discovery, if the IdP doesn't support it):

{
  "issuer": "GOOGLE",
  "client_id": "1234567890-XXXXXXXXXX.apps.googleusercontent.com",
  "auth_token_endpoint": "https://accounts.google.com/o/oauth2/v2/auth",
  "token_endpoint": "https://oauth2.googleapis.com/token",
  "keys_endpoint": "https://www.googleapis.com/oauth2/v3/certs"
}

oidc_config_admin

--oidc_config_admin=[] (list of strings)

The path(s) to the OpenID Connect config file(s) for EngFlow admins. Same semantics and content format as --oidc_config.


Options common to all instances


aws_autoscaling_send_health_frequency

--aws_autoscaling_send_health_frequency=0s (duration)

If set to a non-zero value, this instance will report its health to the AWS auto scaling group at the frequency specified.


aws_cloudformation_send_resource_signal

--aws_cloudformation_send_resource_signal=false (boolean)

Whether to send a resource signal to CloudFormation once the server is up and running, at instance startup.


discovery_port

--discovery_port=0 (integer)

Port that schedulers advertise the service discovery service on. If not set, this is inferred from the private port of the local instance.


docker_volume_mount_path

--docker_volume_mount_path= (string)

The existing filesystem path where the volume meant for the docker data-root is mounted. When set, its usage is reported as part of the instance's disk usage metrics.


endpoint

--endpoint=null (string; previous name: --build_and_test_url)

The URL of the cluster from which the Build and Test UI can be accessed.

Format must be http(s)://<host>[:<port>][/<path>].

Example: http://localhost:8080 or https://example.com/remote-execution or http://10.0.0.5:12345/re.


experimental_docker_proxy_port

--experimental_docker_proxy_port=0 (integer)

Port to run the docker proxy on. If not set, this is inferred from the private port of the local instance (private port + 4).


gc_thrashing_rules

--gc_thrashing_rules=[1s:2, 20s:3, 1m:5] (list of strings)

Comma-separated list of length:count pairs to configure garbage collection thrashing monitoring. If, for any length:count pair, the unreclaimable tenured heap usage stays above --gc_thrashing_threshold_percent percent for count GC cycles within length, the process will crash itself.
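
For illustration, the default is equivalent to passing:

--gc_thrashing_rules=1s:2,20s:3,1m:5

A hypothetical stricter rule such as 10s:3 would crash the process once the unreclaimable tenured heap stays above --gc_thrashing_threshold_percent for 3 GC cycles within a 10-second window.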


gc_thrashing_threshold_percent

--gc_thrashing_threshold_percent=100 (integer)

Percentage of the tenured garbage collection heap to consider full. The process may eagerly crash if the unreclaimable heap usage remains above this threshold for too many garbage collection cycles. See --gc_thrashing_rules for further configuration.


graceful_shutdown_wait_time

--graceful_shutdown_wait_time=2s (duration)

Time to wait for connections to drain after removing the node from service discovery. Set to 0 to inhibit graceful shutdown.


grpc_max_calls_per_connection

--grpc_max_calls_per_connection=0 (integer)

Sets the maximum number of concurrent calls per incoming gRPC connection. The default is 400 on schedulers; on workers it is 10 times the number of executors, or 400 if there are no executors (for a cache-only instance).


grpc_max_message_size

--grpc_max_message_size=20mib (capacity)

The max message size for incoming gRPC calls.


grpc_max_metadata_size

--grpc_max_metadata_size=8kib (capacity)

The max incoming metadata size for gRPC calls.


healthz_port

--healthz_port=0 (integer)

If set to a positive value, enables an HTTP 1.1 server suitable for health checks.


internal_grpc_keep_alive_time

--internal_grpc_keep_alive_time=60s (duration; previous name: --grpc_keep_alive_time)

The keep-alive time for cluster-internal gRPC connections.


internal_tcp_connect_timeout

--internal_tcp_connect_timeout=5s (duration)

The connection timeout for cluster-internal gRPC connections.


log_file

--log_file= (string)

The location(-pattern) of a log file. Empty means 'Unused'. See also: https://docs.oracle.com/en/java/javase/11/docs/api/java.logging/java/util/logging/FileHandler.html. If the path ends in '.json', single-line JSON output will be written.
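
For example, a rotation-friendly pattern following the java.util.logging FileHandler conventions linked above (the path is a placeholder) might be:

--log_file=/var/log/engflow/server.%g.log

Here %g is the FileHandler generation number used when rotating files; a pattern ending in .json would instead produce single-line JSON output.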


log_file_count

--log_file_count=20 (integer)

Maximum number of log file rotations before re-using log file names.


log_file_limit

--log_file_limit=100mb (capacity)

Maximum size of a log file before it is rotated.

Set to 0 for unbounded. Though that is accepted, it is NOT recommended: it can fill up the disk.


log_level

--log_level=INFO (string)

The verbosity level of local logging. Valid values are OFF, SEVERE, WARNING, INFO, and CONFIG.


log_to_stderr

--log_to_stderr=true (boolean)

Output logs to standard error. Note systemd expects logs to go to stderr in order to manage them.


private_bind_to_any

--private_bind_to_any=false (boolean; previous name: --bind_to_any)

Whether to configure the internal communication to listen on all local IPs. If your cluster is not connected to the public internet, and the --private_ip_selector mechanism does not work, then this flag might be usable as a workaround. DO NOT enable this for machines that are connected to the public internet.


private_ip

--private_ip=null (string)

IP address to advertise for cluster-internal gRPC calls. This option is only needed when nodes run in an isolated network (e.g. with docker containers using network mode bridge) and can't be reached from other nodes using the discoverable IP address of the current node. Should only be used with --discovery=static. Please also use --private_bind_to_any when using this option.


private_ip_selector

--private_ip_selector=192.168.0.0/16 (string; previous name: --local_ip_selector)

A CIDR mask that is used to select a local IPv4 address for each instance. This should match whatever address range the underlying platform uses to generate local IPs. If this does not match any local IP, then the instance will attempt to do a reverse lookup on its own hostname. The instance fails if the hostname resolves to a loopback address. In order to set a fully-specified address, use a /32 selector. The resulting IP is only used for cluster-internal communication. DO NOT use a public address range here.


private_port

--private_port=9321 (integer; previous name: --internal_port)

Port to use for cluster-internal communication. You need to configure your network to allow traffic on this port between machines in the same cluster. In addition, you need to allow traffic on port + 1000 (schedulers only) and port + 2000 (all instances); also see --incompatible_use_low_offsets.
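
As a worked example with the default value (and --incompatible_use_low_offsets left at false), the firewall would need to allow:

--private_port=9321

9321  on all instances (cluster-internal gRPC)
10321 on schedulers only (9321 + 1000)
11321 on all instances (9321 + 2000)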


MetricTelemetryExportingOptions


metric_telemetry_enabled

--metric_telemetry_enabled=null (boolean)

Whether to enable the Metric-based telemetry service.


metric_telemetry_endpoint

--metric_telemetry_endpoint=null (string)

gRPC Endpoint to report OpenCensus metrics to.


metric_telemetry_grpc_handler_retry_delay

--metric_telemetry_grpc_handler_retry_delay=5s (duration)

How long the gRPC handler will wait before attempting to re-send a failed request.


metric_telemetry_grpc_handler_retry_limit

--metric_telemetry_grpc_handler_retry_limit=5 (integer)

How many times the gRPC handler will attempt to re-send failed requests.


metric_telemetry_interval

--metric_telemetry_interval=2m (duration)

Fixed Duration that specifies how often queued metrics should be reported.


Options to configure the license check


license_locator_duration_between_reader_calls

--license_locator_duration_between_reader_calls=10m (duration)

How long the locator must wait between calls to its internal license readers. Acts as a rate limiting flag, as the readers may be fetching from remote sources such as servers, cloud storage, etc.


license_locator_min_cached_licenses_validity

--license_locator_min_cached_licenses_validity=6h (duration)

The minimum amount of validity that the 'longest-lifespan-left' internally-cached license must have before the License Locator will try to immediately schedule a license refresh, to be executed asynchronously.


license_server_request_deadline

--license_server_request_deadline=60s (duration)

How long individual rpc requests to the License Server will be allowed to run for, before they are timed out. For example, in cases where the License Server is not available.


Options to configure scheduler instances


action_cache_size

--action_cache_size=2gb (capacity)

The maximum amount of memory that can be used for action cache entries by each scheduler. The resolution is 1 megabyte; the value is rounded down to the nearest whole megabyte as needed. (Example: 1600kb is rounded down to 1mb.) Setting a value smaller than 1mb disables the cache.


alternative_tls_trusted_certificates

--alternative_tls_trusted_certificates=[] (list of strings)

A list of files or secretstore URLs to load alternative trusted certificates from. All certificates provided here can be used to authenticate clients when --client_auth=mtls is set. See --tls_trusted_certificate for more details.


auth_service

--auth_service= (string)

Required when --client_auth=external, ignored otherwise. The auth service endpoint, must be "grpc://localhost:NNN" where NNN is the port.


basic_auth_htpasswd

--basic_auth_htpasswd=/etc/engflow/htpasswd (string)

Path to a htpasswd file containing user names and APR1-encrypted passwords. In a cluster with multiple scheduler instances, all of them must use the same password file. The server automatically reloads the file on changes (based on the last-modified time).
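
As a sketch (the user names are placeholders), entries in the APR1/MD5 format can be created with the Apache htpasswd tool, which uses that format when given -m:

# create the file with a first user
htpasswd -c -m /etc/engflow/htpasswd alice
# add or update another user in the same file
htpasswd -m /etc/engflow/htpasswd bob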


client_auth

--client_auth=none (one of: {deny, none, mtls, gcp_rbe, external})

The mechanism for determining gRPC authentication and permissions. Depending on the value, you also need to pass options to configure the authentication mechanism and permissions granted to each client.

For none, use --principal_based_permissions=*->role to set permissions.

For mtls, see the --tls_trusted_certificate flag for more details and use --principal_based_permissions to set per-user permissions.

For gcp_rbe, see the --gcp_rbe_auth_project flag for more details.

For external, please contact us for details.

For github_token, see the --experimental_github_auth_container flag for more details.
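
A minimal sketch of an mTLS setup; the file path and domain below are placeholders:

--client_auth=mtls
--tls_trusted_certificate=/etc/engflow/ca_cert.pem
--principal_based_permissions=*@example.com->user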


enable_bes

--enable_bes=true (boolean; previous name: --experimental_bes)

Enables gRPC endpoints that implement the Build Event Service, allowing schedulers to handle the Build Event Protocol.


ephemeral_trusted_cert

--ephemeral_trusted_cert=false (boolean)

Every server requires a public/private key pair to create and verify credentials (cookies, JWT, etc.). If --tls_trusted_certificate and --tls_trusted_key are set, then those are used for that purpose. Otherwise, if --tls_certificate and --tls_key are set, then those are used. If neither of these is set, then you can use this flag to create an ephemeral key pair. Credentials created in this way are never accepted by another server, or even by the same server after a restart.

For more permanent deployments, we recommend not using this flag, but instead generating a key pair, storing it in a secure location, and using the trusted flags, regardless of whether you are also specifying a TLS key pair.


experimental_allow_custom_roles

--experimental_allow_custom_roles=false (boolean)

Whether to allow custom roles on the cluster.


experimental_coalesce_actions

--experimental_coalesce_actions=true (boolean)

When true, coalesce identical actions.


experimental_enable_fetch_api

--experimental_enable_fetch_api=false (boolean)

Whether to enable support for the Asset fetch API. The fetch API is implemented by calling curl with appropriate parameters using the remote execution API. Note that only http and https URLs are supported.

In Bazel, enable use of the Asset fetch API by setting --experimental_remote_downloader to the same value as --remote_executor or --remote_cache.
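
For example, if Bazel already points at the cluster via --remote_executor, the same endpoint can be reused for the fetch API (the endpoint is a placeholder):

bazel build //... \
  --remote_executor=grpcs://cluster.example.com:443 \
  --experimental_remote_downloader=grpcs://cluster.example.com:443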


experimental_fetch_api_docker_image

--experimental_fetch_api_docker_image= (string)

Only used when --experimental_enable_fetch_api=true. If set to a non-empty value, then this is parsed as a canonical docker URL (e.g., docker://alpine/curl@sha256000...), which is in turn used as the container for all fetch calls. The referenced Docker image must be accessible and contain curl at /usr/bin/curl. If the image has an entrypoint, it is ignored.


experimental_fetch_api_max_attempts

--experimental_fetch_api_max_attempts=5 (integer)

Only used when --experimental_enable_fetch_api=true. Number of attempts for internal file uploads and action executions. This flag does not apply if curl fails with a non-zero exit code, only if there are gRPC protocol errors.


experimental_force_mnemonic_pool_name

--experimental_force_mnemonic_pool_name=[] (list of strings)

A list of mnemonic=pool-name pairs which are used to override pool names provided by the client. Use this to route actions to specific pools based on mnemonics. See Executor pools.

Note that this feature requires a client that provides action mnemonics; Bazel 5.0.0 and newer support this.
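
A sketch, assuming a pool named cpp-pool exists and that the flag is repeated once per mnemonic=pool-name pair:

--experimental_force_mnemonic_pool_name=CppCompile=cpp-pool
--experimental_force_mnemonic_pool_name=CppLink=cpp-pool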


experimental_force_sibling_containers_pool_name

--experimental_force_sibling_containers_pool_name=null (string)

The pool to which actions using "dockerSiblingContainers" should be rerouted. See Platform options - dockerSiblingContainers.


experimental_github_auth_container

--experimental_github_auth_container= (string)

Required when --client_auth=github_token, ignored otherwise. Specifies an existing container on ghcr.io.

Format: organisation_name/container_name:tag, all lower-case.

Example: engflow/hello-world:1.0

The container should be private. Only GitHub Actions runners of the organisation with a valid GITHUB_TOKEN shall have access. Other than that, the container can be anything; it won't be pulled or run, only checked for existence.


experimental_jwt_auth

--experimental_jwt_auth=false (boolean)

If enabled, and if --mtls_expiration is not zero and --tls_trusted_certificate and --tls_trusted_key are both set, then allow generating a JWT from the UI.


experimental_mnemonic_based_invocation_affinity

--experimental_mnemonic_based_invocation_affinity=[] (list of strings)

A list of mnemonics for which we reuse executors within the same invocation. This can reduce action setup time for actions with similar input trees but can also increase runtime for actions that use remote persistent workers.

Note that this feature requires a client that provides action mnemonics; Bazel 5.0.0 and newer support this.


experimental_strict_transport_security

--experimental_strict_transport_security=0d (duration)

If set to a non-zero duration, sets the Strict-Transport-Security header with the given duration as the max-age on all HTTP responses. When the UI is accessed over an HTTPS connection and this header is returned, all future accesses to the same domain are forced to use HTTPS for at least the given duration. DO NOT SET THIS unless you are certain that you do not want to access this domain using HTTP for the foreseeable future.


experimental_web_login_expiration

--experimental_web_login_expiration=23h (duration)

Only used when --http_auth=google_login, --http_auth=oidc_login, or --http_auth=basic. The amount of time for which a web login token is valid. This should be set to your company's max login policy if applicable.


extend_replicas_on_cache_hit

--extend_replicas_on_cache_hit=true (boolean)

Whether to extend replica timeouts when there is a cache hit in the action cache. If this is true, then the action cache service only returns an action cache entry if the replica timeouts for all output files could be successfully extended. Otherwise it does not attempt to extend the timeouts. Setting this to false can improve performance at the increased risk of returning errors later when the client attempts to fetch the corresponding files from the CAS. We strongly recommend leaving this enabled when using build-without-the-bytes.


force_pool_name

--force_pool_name=null (string)

If set to a non-empty value, the scheduler ignores the pool name provided in the action and uses this one instead to schedule the action. See Executor pools.


gcp_rbe_auth_project

--gcp_rbe_auth_project=null (string; previous name: --experimental_google_auth_project)

Sets the GCP project to use when looking up permissions for OAuth 2.0-authenticated clients if --client_auth=gcp_rbe. The actual permissions are configured through GCP IAM by assigning the existing Google Cloud 'Remote Build Execution' roles to specific users or service accounts.

If you are using Bazel, you can authenticate as follows: for the first-time login, run gcloud auth application-default login. Afterwards, you can run Bazel with the --google_default_credentials flag. Alternatively, you can download a JSON file with access keys and use Bazel's --google_credentials option to specify the path to that file.

Note that EngFlow does not control the existence or availability of these GCP roles and cannot guarantee that this option continues to work. Furthermore, we cannot report usage of these permissions to GCP, so they may show up as 'over-granted' in the IAM permissions console.

Use with caution.
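
The client-side workflow described above looks roughly like this (the key file path is a placeholder):

# one-time login to obtain application-default credentials
gcloud auth application-default login

# then build using those credentials
bazel build //... --google_default_credentials

# or point Bazel at a downloaded service-account key file instead
bazel build //... --google_credentials=/path/to/key.json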


google_client_id

--google_client_id= (string; previous name: --experimental_google_client_id)

Must be set if and only if --enable_bes=true and --http_auth=google_login.

The client ID from the "Client ID for Web application" page in GCP to enable using Google OAuth to authenticate users on the UI. You must have this client ID correctly configured in GCP to complete the authentication workflow. Note that the email address returned from Google will be matched against the --principal_based_permissions flag to determine permission level.


grpc_initial_flow_control_window

--grpc_initial_flow_control_window=1mib (capacity)

The initial flow control window for incoming gRPC calls.


grpc_reflection

--grpc_reflection=false (boolean)

Publicly enable the gRPC reflection protocol. See https://github.com/grpc/grpc/blob/master/doc/server-reflection.md.


http_auth

--http_auth=[deny] (list of strings)

The mechanism(s) for determining HTTP2 authentication and permissions. Depending on the values, you also need to pass options to configure the authentication mechanism and permissions granted to each client.

Note that the /healthz page never requires authentication.

For none, use --principal_based_permissions=*->role to set permissions.

For basic, use --basic_auth_htpasswd to set the path to the password file with Apache MD5-encoded passwords, and --principal_based_permissions to control per-user permissions. See the Authentication section for examples.

For google_login the --google_client_id flag must also be set.

For oidc_login the --oidc_config flags must also be set.


http_public_bind_to_any

--http_public_bind_to_any=true (boolean)

Only used when the HTTP and gRPC ports are split, i.e. --http_public_port has a positive value different from --public_port.

This flag is similar to --public_bind_to_any but affects only the --http_public_port.


http_public_port

--http_public_port=-1 (integer)

The public port on which this cluster listens for HTTP connections. When this is set to a positive integer different from --public_port, then HTTP and gRPC ports are split: HTTP is served on this port and gRPC on the --public_port. Otherwise they are both served on the --public_port.

Note that typical Linux installations prevent non-root processes from listening on ports 0-1024.


incompatible_force_mnemonic_pool_name_respects_explicit_pools

--incompatible_force_mnemonic_pool_name_respects_explicit_pools=false (boolean)

If this is true, --experimental_force_mnemonic_pool_name only changes the pool for actions that do not explicitly specify a pool.


incompatible_force_sibling_containers_pool_name_respects_explicit_pools

--incompatible_force_sibling_containers_pool_name_respects_explicit_pools=false (boolean)

If this is true, --experimental_force_sibling_containers_pool_name only changes the pool for actions that do not explicitly specify a pool.


incompatible_use_new_external_authentication

--incompatible_use_new_external_authentication=null (boolean)

Whether to use engflow.iam.authentication.v1.Authentication instead of engflow.auth.v1.AuthService for --client_auth=external.


insecure

--insecure=false (boolean)

Whether to use unencrypted connections. We strongly recommend providing a TLS certificate and key (self-signed if necessary) and avoid setting this flag. This can be temporarily used for testing on a closed network. If this is set, then the settings for --tls_certificate and --tls_key are ignored.


local_cas_existence_cache_expiry

--local_cas_existence_cache_expiry=30m (duration)

The maximum time to cache the existence of CAS entries. This flag is ignored if external storage is enabled; use --cas_existence_cache_max_size to control the cache size in that case.

If external storage is disabled, files are only stored in the distributed CAS, which is limited in size, and only guarantees the presence of files for the duration of --default_replica_timeout. Caching existence for longer than that can result in an increased rate of PRECONDITION_FAILED gRPC errors for Execute calls, but should be otherwise safe.

Setting this flag to 0 disables the cache.


max_batch_size

--max_batch_size=10mb (capacity)

The maximum batch size that clients are allowed to send to the CAS server batchUpdateBlobs call. This is a form of write-combining that might result in improved performance under the right network conditions. Only set this if you have benchmark results indicating that it is a net win. Note that some clients may not combine writes regardless of this server-side setting. If this is larger than the max gRPC message size, it is silently reduced to that value.


max_queue_time

--max_queue_time=1h (duration)

The maximum amount of time an action is allowed to queue before it is aborted.


max_queue_time_in_empty_pool

--max_queue_time_in_empty_pool=5m (duration)

The maximum amount of time an action is allowed to queue before it is aborted if it is assigned to a pool which has never had any executors. This is intentionally shorter than the maximum queue time to detect cases where the client is accidentally misconfigured.

If the worker pool is configured with auto-scaling, and it can scale down to zero workers, then this should be at least as long as the auto-scaling delay plus the time to boot a worker instance. Otherwise a recently restarted scheduler may time out actions prematurely.


max_replicate_concurrency

--max_replicate_concurrency=0 (integer)

The maximum number of concurrent replicate calls from a scheduler. A negative or zero value indicates no limit. This may be useful to limit the CAS read/write load.


metadata_replica_count

--metadata_replica_count=3 (integer)

The number of replicas to use for scheduler metadata such as the action cache. Must be at least one. Setting this to one can cause metadata loss when a scheduler is restarted, resulting in reduced build performance and build errors.


mtls_expiration

--mtls_expiration=90d (duration)

Only used when --tls_trusted_certificate and --tls_trusted_key are set. Sets the amount of time in days for which the generated client certificates are valid. Set to zero to disable the functionality.


principal_based_permissions

--principal_based_permissions=[] (list of strings)

Configures the permissions for each principal. This option provides a generic mechanism for configuring permissions where principals are cryptographically authenticated through some other mechanism, such as TLS client certificates or OAuth 2.0 bearer tokens.

Each value must specify a principal and a role as principal->role. Principals can be specified directly (e.g. alice@example.com, bob), as all users in a domain (e.g. *@example.com), or as everyone with * (just the star character). These are the only supported wildcard uses of the * character. Roles must be one of none, admin, user, cache-reader, or cache-writer.

Permissions are evaluated from most-specific to least-specific rather than in the order specified. Therefore, an exact principal match wins over a domain-based match, and the default setting applies only if no other rule applies.

Note that some authentication mechanisms implicitly refuse a connection if the client principal cannot be determined.
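
A sketch, assuming the flag is passed once per principal->role rule (the principals are placeholders):

--principal_based_permissions=admin@example.com->admin
--principal_based_permissions=*@example.com->user
--principal_based_permissions=*->none

With these rules, admin@example.com gets the admin role, other example.com users get user, and everyone else is denied, since the exact match wins over the domain match and the catch-all rule applies only when nothing more specific does.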


profile_to_event_store

--profile_to_event_store=true (boolean; previous name: --experimental_profile_to_event_store)

Ignored if --enable_bes=false. If set to true, the scheduler collects server-side profiling information, aggregates the data by build id, writes it to the event store, and provides the profile for download as a Chrome json profile in the UI. This can help troubleshoot performance issues in a build.


public_bind_to_any

--public_bind_to_any=true (boolean)

Whether to configure the public port to listen on all local IPs. If set to false, then the scheduler nodes will only listen on the internal IP addresses specified with --private_ip_selector. DO NOT leave this at true for clusters that are connected to the public internet and that do not have authentication configured.


public_port

--public_port=8080 (integer)

The public port on which this cluster listens for gRPC and HTTP connections.

By default, this port serves both types of requests. You can use --http_public_port to serve HTTP requests on another port, for example to route them through a proxy.

Note that typical Linux installations prevent non-root processes from listening on ports 0-1024.


tls_certificate

--tls_certificate= (string)

The path to the TLS certificate chain (in base-64 encoded X.509 format with OpenSSL BEGIN/END CERTIFICATE guards) to be used by the schedulers to authenticate themselves to clients on the public cluster port(s) (--public_port and, if specified, --http_public_port).

A certificate and key are required to support encrypted connections. If this is a self-signed certificate, then you also have to configure the client with the same certificate. If you want to use unencrypted connections, you have to set --insecure=true.


tls_cipher_suites

--tls_cipher_suites=[TLS_AES_128_GCM_SHA256, TLS_AES_256_GCM_SHA384, TLS_CHACHA20_POLY1305_SHA256, TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384, TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256, TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384, TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256, TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256, TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256] (list of strings)

Configures the set of ciphers that are supported on incoming TLS connections (server-side). The default list follows Mozilla's recommendations including both TLS 1.2 and TLS 1.3. If no cipher for a given TLS version is specified, that TLS version is effectively disabled.


tls_key

--tls_key= (string)

The file name of the TLS key (in base-64 encoded binary PKCS#8 format with OpenSSL BEGIN/END PRIVATE KEY guards) that matches the certificate given as --tls_certificate.
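
As a sketch for testing, a self-signed certificate and a matching PKCS#8 key can be generated with OpenSSL (the subject and file paths are placeholders; recent OpenSSL versions write the key with the required BEGIN/END PRIVATE KEY guards):

openssl req -x509 -newkey rsa:4096 -nodes -days 365 \
  -subj "/CN=cluster.example.com" \
  -keyout /etc/engflow/tls_key.pem \
  -out /etc/engflow/tls_certificate.pem

--tls_certificate=/etc/engflow/tls_certificate.pem
--tls_key=/etc/engflow/tls_key.pem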


tls_trusted_certificate

--tls_trusted_certificate= (string)

Required when --client_auth=mtls. The file name or secretstore URL of a certificate that is used by the schedulers to authenticate clients (aka mutual TLS authentication or mTLS).

You can generate client certificates yourself and sign them with the corresponding private key, or you can pass the key via --tls_trusted_key and set --mtls_expiration to allow logged-in users to generate their own certificates via the web UI.

In addition, you have to grant permissions to those authenticated clients using the --principal_based_permissions flag.

If you want to provide more than one trusted certificate, you can pass additional file names or secretstore URLs via the --alternative_tls_trusted_certificates flag.

Bazel supports TLS client authentication as of version 3.1: use Bazel's --tls_client_certificate and --tls_client_key options to enable client authentication.
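
A sketch of the corresponding Bazel invocation; the endpoint and file names are placeholders:

bazel build //... \
  --remote_executor=grpcs://cluster.example.com:443 \
  --tls_client_certificate=client_cert.pem \
  --tls_client_key=client_key.pem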


tls_trusted_key

--tls_trusted_key= (string)

The file name or secretstore URL of the TLS key that matches the certificate given as --tls_trusted_certificate. If provided and if --mtls_expiration is not zero, then logged-in users can generate their own client certificates via the web UI, which are signed with the key provided here.

The key provided here must match the certificate provided as --tls_trusted_certificate.


Options to configure service discovery


cluster_name

--cluster_name=default (string)

Only used when --discovery=gcp or aws. The cluster name used to auto-detect instances belonging to the same cluster. All instances must be tagged as engflow_re_cluster_name=[cluster_name] and scheduler instances must additionally be tagged as engflow_re_scheduler_name=[cluster_name].


common_hazelcast_partition_count

--common_hazelcast_partition_count=0 (integer)

Count of Hazelcast partitions in the common cluster. This option is not safe to change when the cluster is running. See https://docs.hazelcast.com/hazelcast/5.1/capacity-planning#partition-count.


discovery

--discovery=static (one of: {gcp, aws, k8s, static, multicast})

Select the discovery mechanism to use. This usually matches the platform that the software runs on.


gcp_zones

--gcp_zones= (string)

Only used when --discovery=gcp. A comma-separated list of GCP zones in which to look for instances. If unset, discovery only searches the current zone where this instance runs.


hazelcast_aws_az

--hazelcast_aws_az= (string)

Only used when --discovery=aws. If specified, then --aws_region is ignored.

Selects the AWS availability zone to scan for instances. If unset, the current zone is detected from the EC2 Instance Metadata Service.


hazelcast_aws_region

--hazelcast_aws_region= (string; previous name: --aws_region)

Only used when --discovery=aws, ignored when --hazelcast_aws_az is specified. Selects the AWS region to scan for instances. If unset, the current region is used.


hazelcast_die_on_demotion

--hazelcast_die_on_demotion=true (boolean)

Crash the process if the frontend Hazelcast cluster member within it transitions from a master to a non-master member.


incompatible_use_low_offsets

--incompatible_use_low_offsets=false (boolean)

Hazelcast requires one or two ports in addition to the private ports (and the public ports for schedulers); if this is set, then it uses private_port + 1 (all instances) and private_port + 2 (schedulers). This flag must be set identically across all instances in the same cluster; changing the value is an incompatible change.


k8s_master

--k8s_master=null (string)

Only used when --discovery=k8s. DNS or IP:port of the Kubernetes Master. Leave it empty to use the default; usually you don't need to specify this flag.

If some pods can't discover others and print errors like Failure in executing REST call (...) Caused by: java.net.UnknownHostException: kubernetes.default.svc, then override this flag with https://IP:port where IP and port are that of the Kubernetes Master (see output of kubectl cluster-info).


k8s_scheduler_pods_service

--k8s_scheduler_pods_service=null (string)

Only used when --discovery=k8s. Name of the Kubernetes NodePort service that connects to all scheduler Pods.


static_scheduler

--static_scheduler=[] (list of strings)

Only used when --discovery=static. IP address and port of another scheduler (for the schedulers-only cluster), e.g. 1.2.3.4:5678. The port must be that instance's --private_port + 1000. This instance joins that instance's cluster.

You don't have to list all instances' IP and port, but at least one that you list must be online so this one can join. The more instances you list, the less sensitive your cluster will be to machine start order. If you omit the port, nodes may fail to form a cluster. Also see --incompatible_use_low_offsets.
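
For example, with the default --private_port of 9321, an instance joining a scheduler at 10.0.0.5 (a placeholder address) would be started with:

--static_scheduler=10.0.0.5:10321

where 10321 is that scheduler's --private_port (9321) plus 1000.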


usage_instance_type

--usage_instance_type= (string)

Machine type of this instance, for example "t3.xlarge" or "n2-standard-4" or "onprem.linux".

You can leave this flag empty when running on AWS EC2; the value will be read from IMDSv2.

Otherwise you should set this flag anytime IMDSv2 is unavailable on the instance, for example because it is not an AWS EC2 VM.

As of 2024-06-25 this value is used for usage reporting (also known as resource accounting).


Options for the action cache


action_cache_replication_endpoints

--action_cache_replication_endpoints=[] (list of strings)

Clusters to replicate action cache writes to. This is a list of domain:port pairs like "other.cluster.com:443". The scheduler synthesizes a JWT to authenticate with the replica cluster and thus assumes it has the same credentials as this cluster.


action_cache_replication_max_buffer_size

--action_cache_replication_max_buffer_size=10000 (integer)

Maximum number of action cache writes to buffer per endpoint before new writes are dropped.


action_cache_replication_tls_authority_override

--action_cache_replication_tls_authority_override= (string)

DNS name to validate the replica cluster's certificate SANs (Subject Alternative Names) against.


action_cache_replication_tls_trusted_certificate

--action_cache_replication_tls_trusted_certificate= (string)

Certificate to verify action cache replica clusters against.


Options to configure the CAS


bytestream_chunk_size

--bytestream_chunk_size=1048560 (capacity; previous name: --bytestream_read_chunk_size)

Size of file chunks streamed by the ByteStream Read gRPC API.


cas_fallback_cluster

--cas_fallback_cluster= (string)

Cluster to send CAS blob read requests to if the blob is not found in the local CAS or external storage. If the fallback cluster is used to serve a blob, the blob is also copied into the local CAS and primary external storage bucket. This option takes the form of a domain name and port pair (for example, other.cluster.com:443).


cas_fallback_cluster_tls_authority_override

--cas_fallback_cluster_tls_authority_override= (string)

Authority (domain name) to use when connecting to the --cas_fallback_cluster. Useful if the remote endpoint's certificate SANs do not match the DNS name.


cas_fallback_cluster_tls_client_certificate

--cas_fallback_cluster_tls_client_certificate= (string)

TLS client certificate for authenticating to the --cas_fallback_cluster.


cas_fallback_cluster_tls_client_key

--cas_fallback_cluster_tls_client_key= (string)

TLS client key for authenticating to the --cas_fallback_cluster.


cas_fallback_cluster_tls_trusted_certificate

--cas_fallback_cluster_tls_trusted_certificate= (string)

Certificate to verify against when connecting to --cas_fallback_cluster.


cas_path

--cas_path=/tmp/base/ (string)

The path under which the local CAS is stored and local execution trees are created. The local CAS and the local execution trees should be on the same file system to support hard-links and atomic file moves.


default_replica_timeout

--default_replica_timeout=1h (duration)

The duration for which replicas are retained. Expired replicated files may be deleted if space is needed for new files. This applies to all CAS writes and existence checks, either initiated by a client, or initiated by a worker to store action outputs. Therefore, this needs to be set conservatively to the longest required duration - at a minimum, it should be set to the longest duration a single build can take. As of 2020-04-09, this is the only way to set replica durations.


disk_size

--disk_size=0 (capacity)

The total disk size. If this is set to a non-zero value, then the CAS and replica sizes are computed automatically based on this number. Specifically, we set the total CAS size (--max_cas_size) to 80% of the given number (effectively reserving 20% of the space for the OS), minus the number of workers (--worker_config) times the maximum output tree size (--max_output_size). We set the max replica size to half that number.

If this flag is set to a non-zero value, then the --max_cas_size and --max_replica_size options are ignored. If neither this nor --max_cas_size and --max_replica_size are set, the total disk size is derived from the size of the volume --cas_path is on.
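
A worked example with purely illustrative numbers (1000gb disk, 4 executors in --worker_config, --max_output_size=8gb):

max CAS size     = 80% of 1000gb - 4 * 8gb = 800gb - 32gb = 768gb
max replica size = 768gb / 2               = 384gb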


enable_distributed_cas

--enable_distributed_cas=true (boolean)

Whether this instance should participate in the distributed CAS. If this is true, the instance makes some or all local files available to other instances in the cluster. If false, the instance does not make local files available. However, it still uses the local disk to cache files for local use. The main use case for disabling this flag is for satellite clusters where a subset of machines is remote to the majority of the cluster and should not make their files available to the main cluster. Note that these instances can still pull files from the other instances in the main cluster, not just from external storage.


experimental_async_storage_uploads

--experimental_async_storage_uploads=false (boolean)

If false, wait for successful uploads to both the distributed CAS and external storage. If true, do not wait for uploads to external storage to complete.


experimental_bytestream_enable_ttnm_checker

--experimental_bytestream_enable_ttnm_checker=false (boolean)

Enables the time to next message checker. The checker instruments the gRPC requests and estimates the time to next message using an exponential moving average.


experimental_bytestream_iop_device_name

--experimental_bytestream_iop_device_name=/tmp/base/ (string)

The device name to monitor for bytestream iops.


experimental_bytestream_iop_disk_stats_poll

--experimental_bytestream_iop_disk_stats_poll=1s (duration)

How often to poll disk stats.


experimental_bytestream_iop_limit

--experimental_bytestream_iop_limit=-1 (integer)

Defines the upper limit of IOPS (input/output operations per second) before a worker returns RESOURCE_EXHAUSTED to read and write requests. This is intended to mitigate the issue where a few workers end up serving the majority of bytestream requests. An IOPS limit is OS- and possibly instance-dependent, so tuning is recommended. Requires --experimental_bytestream_iop_device_name.


experimental_bytestream_ttnm_estimator_weight

--experimental_bytestream_ttnm_estimator_weight=0.8 (float)

Provides the weight to the exponential moving average calculation that the time to next message checker uses for estimating load. Must be a float between 0 and 1.


experimental_bytestream_ttnm_exhaustion_threshold

--experimental_bytestream_ttnm_exhaustion_threshold=365d (duration)

Causes the ttnm resource check to refuse requests with RESOURCE_EXHAUSTED if the estimated time to next message is above this threshold. By default, this is set to 1 year, which effectively results in the metric being reported without interrupting requests.


experimental_randomized_read_probability

--experimental_randomized_read_probability=0 (float)

Probability that a bytestream read call is directed to a randomly selected worker instead of a worker that is known to have the file. Setting this to a small value - like 0.001 - can help equalize the load in cache-heavy clusters without significantly increasing external storage costs. Values less than or equal to 0 mean that no reads are randomly redirected. Values larger than or equal to 1 mean that all reads are randomly redirected.


internal_call_timeouts_per_size

--internal_call_timeouts_per_size= (duration histogram)

A histogram of gRPC timeouts dependent on the total bytes remaining (x=bytes, y=Duration). The timeouts are applied to the total Bytestream call, rather than individual chunks. Example value: 10s,100mib=30s,500mib=2m


max_cas_size

--max_cas_size=0 (capacity)

The maximum total size of the local CAS, including replicas and locally cached files. The local CAS keeps files as long as possible, and only evicts them when this value is exceeded. Therefore, this needs to be smaller than the total available disk space by at least the number of local executors times the maximum output size per action when using hardlinks for inputs, or the combined input and output size when using copies (see --worker_config, --max_input_size, and --max_output_size). This flag is ignored if --disk_size is not set to 0.


max_replica_size

--max_replica_size=0 (capacity)

The part of the local CAS that is available to replicas. That is, the total space used by replicas on the local machine may not exceed this value. This must be less than the CAS size minus the number of local executors times the maximum input size per action (see --worker_config and --max_input_size); otherwise the worker can run out of disk space. This flag is ignored if --disk_size is not set to 0.


max_upload_concurrency

--max_upload_concurrency=0 (integer)

The maximum number of concurrent uploads from a worker after an action completes. A negative or zero value indicates no limit. This may be useful to limit the CAS write load.


recover_cas_blobs

--recover_cas_blobs=true (boolean)

Transitional option to roll out a bugfix.

If true, workers will scan the --cas_path for left-behind (or pre-loaded) CAS content.

If false, workers ignore such blobs. Please let EngFlow know if you find the need to disable this flag.


replica_count

--replica_count=1 (integer)

The number of replicas for each CAS entry corresponding to a file that has a retention duration (see --default_replica_timeout). This must not exceed the number of nodes that participate in the distributed CAS (typically the same as the worker nodes). The system automatically re-replicates files if an existing node is lost, as long as the file does not exceed its retention duration (measured from the time it was written or existence-checked). As of 2020-04-09, the maximum supported --replica_count is 3.


replica_tracker_heap_size

--replica_tracker_heap_size=200mib (capacity)

Size allocated for replica tracker entries in a worker's JVM heap. The value provided is truncated to the nearest mebibyte.


stream_fallback_reads

--stream_fallback_reads=true (boolean)

When serving bytestream reads from external storage, send data to the client immediately instead of waiting for the entire backing external storage download to finish. This flag has no effect on Windows.


verify_cas_blobs_on_startup

--verify_cas_blobs_on_startup=none (one of: {none, blocking})

Only used when --recover_cas_blobs=true.

If set to 'blocking', workers verify all CAS blobs on startup, deleting any entries that are inconsistent with the expected digest.

If set to 'none', workers do not verify the CAS blobs. This can significantly reduce worker startup time, especially if the CAS is large.


Options to configure custom IAM roles


gcs_iam_roles_root

--gcs_iam_roles_root=iam (string)

Only used when --role_external_storage=gcs. Path in the GCS bucket for blobs.


role_chunks_size_for_uploads

--role_chunks_size_for_uploads=5mib (capacity)

The chunk size used for uploading to external storage.

When using --role_external_storage=s3, this also defines the threshold for using multipart uploads.


role_external_storage

--role_external_storage=none (one of: {none, gcs, s3})

The kind of external storage to use to store IAM role information. none means no backup, gcs means Google Cloud Storage (GCS), s3 means Amazon S3.


role_external_storage_threads

--role_external_storage_threads=1 (integer)

Only used when --role_external_storage is not none. Specifies how many threads to use to serve custom role storage requests on schedulers. The value is a positive integer.


role_gcs_bucket

--role_gcs_bucket=null (string)

Only used when --role_external_storage=gcs. Name of the GCS bucket.


role_s3_bucket

--role_s3_bucket=null (string)

Only used when --role_external_storage=s3. Name of the S3 bucket.


role_s3_endpoint

--role_s3_endpoint=null (string)

Only used when --role_external_storage=s3. Set this to override the computed S3 endpoint. This allows running against compatible implementations of S3.


role_s3_prefix

--role_s3_prefix=iam (string)

Only used when --role_external_storage=s3. Path in the S3 bucket for blobs.

If not empty, then we recommend you specify a relative path (foo/bar) and not an absolute path (/foo/bar). This is because Amazon S3 (and possibly other S3 implementations) treat a leading '/' to be part of the first directory segment.

We also suggest not to add a trailing /; this is added automatically.

If the blobs root is non-empty, the final path of a blob is <blobs_root>/<subdir>/<blob>; otherwise it is <subdir>/<blob>.


role_s3_region

--role_s3_region=null (string)

Only used when --role_external_storage=s3. Name of the S3 bucket's region. Can be empty if AWS_REGION is set to this value.


role_storage_gcs_credentials

--role_storage_gcs_credentials=null (string)

Only used when --role_external_storage=gcs. Path to the JSON file with the GCS Service Account's credentials. Can be empty if GOOGLE_APPLICATION_CREDENTIALS is set to the JSON file's path.


role_storage_gcs_project_id

--role_storage_gcs_project_id=null (string)

Only used when --role_external_storage=gcs. Name of the GCP project ID for GCS use.


Options to configure backup storage


cas_existence_cache_expiry

--cas_existence_cache_expiry=0s (duration; previous name: --experimental_cas_existence_cache_expiry)

Used only when --external_storage is not none. Specifies the maximum time entries are kept in the existence cache for external storage.

The default of 0 means that entries can be kept indefinitely; this is safe because the external storage GC explicitly flushes the cache when switching to a new generation, and items are only deleted from the old generation.

Note that the existence cache size is set by --cas_existence_cache_max_size; that is the recommended way to limit memory consumption by the cache.

Note the related flag --local_cas_existence_cache_expiry applies only to the existence cache for the distributed CAS.


cas_existence_cache_max_size

--cas_existence_cache_max_size=10000000 (integer; previous name: --experimental_cas_existence_cache_max_size)

Used only when --external_storage is not none. Specifies the maximum number of entries in the CAS existence cache. Setting a higher value increases memory use (~100 bytes / entry) but can significantly reduce the number of calls and upload traffic to the storage backend. Setting this value to 0 disables the cache; setting it to -1 means no upper bound.

Note the related flag --cas_existence_cache_expiry to set the expiration time.


chunks_size_for_uploads

--chunks_size_for_uploads=5mib (capacity)

The chunk size used for uploading to external storage.

When using --external_storage=s3, this also defines the threshold for using multipart uploads.


experimental_expiration_gc_buffer_in_days

--experimental_expiration_gc_buffer_in_days=-1 (integer)

Only valid with --experimental_use_expiration_gc. Defines how often object storage metadata is updated. Each time an object is accessed, the object's expiration is checked. If the object will expire in less than this number of days, the object metadata is updated. Higher numbers increase the consistency of expiration-based GC, but may also increase cloud costs.


experimental_expiration_use_async_ac

--experimental_expiration_use_async_ac=false (boolean)

Use async AC in expiration-based storage.


experimental_read_timeout

--experimental_read_timeout=2m (duration)

Sets a timeout for proxy calls that acts as a fail-safe if the client reads very slowly, or if it does not propagate cancellation correctly; several versions of Bazel have this bug.


experimental_skip_cas_exist_check_buffer

--experimental_skip_cas_exist_check_buffer=0b (capacity)

Only valid with --experimental_use_expiration_gc. Any CAS blob smaller than the buffer will be written automatically to external storage without an existence check.


experimental_use_expiration_gc

--experimental_use_expiration_gc=false (boolean)

Use expiration-based garbage collection. This garbage collection methodology leverages cloud-provider-specific features to remove least recently used objects via lifecycle rules.


external_storage

--external_storage=none (one of: {none, gcs, s3})

The kind of external storage to use to back up replicas, in addition to storing them on the worker machines. none means no backup, gcs means Google Cloud Storage (GCS), s3 means Amazon S3.

Deprecation: the values gcp and aws (synonyms for gcs and s3) are also supported, but deprecated. They will no longer be supported in version 2.0 and later.


external_storage_gc_window_days

--external_storage_gc_window_days=0 (integer)

Number of days to keep unused external blobs. A non-zero value enables a 'generational' garbage collector; a new generation is created every N days, with reads being served from both the current and previous generation and any such files copied to the current generation. Data that is older than one generation is deleted automatically.


external_storage_scheduler_threads

--external_storage_scheduler_threads=50 (integer)

Only used when --external_storage is not none. Specifies how many threads to use to serve external storage requests on schedulers. The value is a positive integer.


external_storage_worker_threads

--external_storage_worker_threads=50 (integer)

Only used when --external_storage is not none. Specifies how many threads to use to serve external storage requests on workers. The value is a positive integer.


gcs_blobs_root

--gcs_blobs_root=blobs (string)

Only used when --external_storage=gcs. Path in the GCS bucket for blobs.


gcs_bucket

--gcs_bucket=null (string)

Only used when --external_storage=gcs. Name of the GCS bucket.


gcs_credentials

--gcs_credentials=null (string)

Only used when --external_storage=gcs. Path to the JSON file with the GCS Service Account's credentials. Can be empty if GOOGLE_APPLICATION_CREDENTIALS is set to the JSON file's path.


gcs_project_id

--gcs_project_id=null (string)

Only used when --external_storage=gcs. Name of the GCP project ID for GCS use.


migrate_storage_max_cache_size

--migrate_storage_max_cache_size=1000000 (integer)

The number of storage migration operations that are tracked before being evicted. Evicted operations are logged, but allowed to complete to avoid data loss, and may be duplicated by additional requests. Higher numbers will increase the memory pressure associated with storing path strings.


migrate_storage_max_concurrent_operations

--migrate_storage_max_concurrent_operations=50 (integer)

The maximum number of concurrent storage migration operations that can run. Used to ensure that the external storage migration does not overload the storage service during heavy load.


migrate_storage_to_expiration_gc

--migrate_storage_to_expiration_gc=false (boolean)

Use expiration-based garbage collection. This garbage collection methodology leverages cloud-provider-specific features to remove least recently used objects via lifecycle rules.


migrate_storage_update_window_duration

--migrate_storage_update_window_duration=3d (duration)

Defines the duration in which a cluster will warm up the new external storage while still serving from the old. In general, this should be a few days to get the new storage warm enough to avoid cluster degradation.


s3_blobs_root

--s3_blobs_root=blobs (string)

Only used when --external_storage=s3. Path in the S3 bucket for blobs.

If not empty, then we recommend you specify a relative path (foo/bar) and not an absolute path (/foo/bar). This is because Amazon S3 (and possibly other S3 implementations) treat a leading '/' to be part of the first directory segment.

We also suggest not to add a trailing /; this is added automatically.

If the blobs root is non-empty, the final path of a blob is <blobs_root>/<subdir>/<blob>; otherwise it is <subdir>/<blob>.


s3_bucket

--s3_bucket=null (string)

Only used when --external_storage=s3. Name of the S3 bucket.


s3_endpoint

--s3_endpoint=null (string)

Only used when --external_storage=s3. Set this to override the computed S3 endpoint. This allows running against compatible implementations of S3.


s3_region

--s3_region=null (string)

Only used when --external_storage=s3. Name of the S3 bucket's region. Can be empty if AWS_REGION is set to this value.


Options to configure readonly backup storage


experimental_readonly_read_timeout

--experimental_readonly_read_timeout=2m (duration)

Sets a timeout for proxy calls that acts as a fail-safe if the client reads very slowly, or if it does not propagate cancellation correctly; several versions of Bazel have this bug.


experimental_readonly_storage_gc_window_days

--experimental_readonly_storage_gc_window_days=0 (integer)

Number of days to keep unused external blobs. This must match the primary cluster's window.


readonly_external_storage

--readonly_external_storage=none (one of: {none, gcs, s3})

The kind of external storage to use to back up replicas, in addition to storing them on the worker machines. none means no backup, gcs means Google Cloud Storage (GCS), s3 means Amazon S3.

Deprecation: the values gcp and aws (synonyms for gcs and s3) are also supported, but deprecated. They will no longer be supported in version 2.0 and later.


readonly_external_storage_threads

--readonly_external_storage_threads=50 (integer)

Only used when --readonly_external_storage is not none. Specifies how many threads to use to serve readonly external storage requests on workers. The value is a positive integer.


readonly_gcs_blobs_root

--readonly_gcs_blobs_root=blobs (string)

Only used when --readonly_external_storage=gcs. Path in the GCS bucket for blobs.


readonly_gcs_bucket

--readonly_gcs_bucket=null (string)

Only used when --readonly_external_storage=gcs. Name of the GCS bucket.


readonly_gcs_credentials

--readonly_gcs_credentials=null (string)

Only used when --readonly_external_storage=gcs. Path to the JSON file with the GCS Service Account's credentials. Can be empty if GOOGLE_APPLICATION_CREDENTIALS is set to the JSON file's path.


readonly_gcs_project_id

--readonly_gcs_project_id=null (string)

Only used when --readonly_external_storage=gcs. Name of the GCP project ID for GCS use.


readonly_s3_blobs_root

--readonly_s3_blobs_root=blobs (string)

Only used when --readonly_external_storage=s3. Path in the S3 bucket for blobs.

If not empty, then we recommend you specify a relative path (foo/bar) and not an absolute path (/foo/bar). This is because Amazon S3 (and possibly other S3 implementations) treat a leading '/' to be part of the first directory segment.

We also suggest not to add a trailing /; this is added automatically.

If the blobs root is non-empty, the final path of a blob is <blobs_root>/<subdir>/<blob>; otherwise it is <subdir>/<blob>.


readonly_s3_bucket

--readonly_s3_bucket=null (string)

Only used when --readonly_external_storage=s3. Name of the S3 bucket.


readonly_s3_endpoint

--readonly_s3_endpoint=null (string)

Only used when --readonly_external_storage=s3. Set this to override the computed S3 endpoint. This allows running against compatible implementations of S3.


readonly_s3_region

--readonly_s3_region=null (string)

Only used when --readonly_external_storage=s3. Name of the S3 bucket's region. Can be empty if AWS_REGION is set to this value.


Options to configure the event store service


event_blobs_root

--event_blobs_root=bes (string; previous name: --experimental_event_blobs_root)

Relative path within the storage location under which event store blobs should be stored. For disk storage, use --event_disk_path to change the absolute path.


event_bucket

--event_bucket=null (string; previous name: --experimental_event_bucket)

Only used when --event_storage=gcs or --event_storage=s3. Name of the bucket to store BEP events.


event_disk_path

--event_disk_path=/tmp/engflow/ (string; previous name: --experimental_event_disk_path)

Absolute path under which event store blobs should be stored if disk storage is enabled.


event_gcp_project_id

--event_gcp_project_id=null (string; previous name: --experimental_event_gcp_project_id)

Only used when --event_storage=gcs. The GCP project ID to use for GCS when storing BEP events.


event_read_cache_size

--event_read_cache_size=0 (capacity)

The cache size for replaying event streams.


event_s3_endpoint

--event_s3_endpoint=null (string; previous name: --experimental_event_s3_endpoint)

Only used when --event_storage=s3. The base URL for the S3 instance if using another service with an S3 compatible API.


event_s3_region

--event_s3_region=null (string; previous name: --experimental_event_s3_region)

Only used when --event_storage=s3. The region in which the S3 bucket is located.


event_storage

--event_storage=disk (one of: {null, in_memory, disk, gcs, s3}; previous name: --experimental_event_storage)

The kind of external storage to use to store BEP events. DO NOT use in_memory in production environments!
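
For example, a hypothetical S3-backed event store configuration (the bucket name and region are placeholders):

--event_storage=s3
--event_bucket=example-bes-events
--event_s3_region=us-east-1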


Options to configure the execution service


action_execution_attempts

--action_execution_attempts=3 (integer)

How many times an action should be attempted if one of the retry conditions is true. These are controlled through separate flags, such as --experimental_retry_failure_due_to_signal.


action_execution_stats_size

--action_execution_stats_size=2gb (capacity)

The maximum amount of memory that can be used for action execution stats by each scheduler when --use_smart_recommender=true. The stats can be used to automatically pick the "best" pool within a pool group (as specified by --pool_groups) based on previous executions.

The resolution is 1 megabyte; the value is rounded down to the nearest whole megabyte as needed (for example, 1600kb is rounded down to 1mb). Setting a value smaller than 1mb disables the cache.


allow_docker

--allow_docker=false (boolean)

Whether to enable dockerized execution. In order to use dockerized execution, the client also needs to send docker image ids, and the worker must have the corresponding docker images available. As of 2020-04-14, dockerized execution is only supported on Linux VMs.

This flag is ignored on macOS workers.


allow_local

--allow_local=false (boolean)

Whether to enable local execution. You must enable at least one of --allow_local or --allow_docker to be able to run actions at all. If multiple flags are enabled, then the strategy is selected based on the requested execution platform; in that case, the worker prefers docker over local.


debug_execute_requests

--debug_execute_requests=false (boolean)

If this is true, the worker prints the execute request in full detail to the log. This can generate very large amounts of output, so use with caution.


docker_additional_env

--docker_additional_env=[] (list of strings)

A list of additional environment variables that are set in every docker container. Changes to this flag are non-hermetic, i.e., the system returns existing cache entries and does not force a rerun of the affected actions.


docker_additional_mounts

--docker_additional_mounts=[] (list of strings)

A list of additional directories that are mounted into every docker container of the form /path/to/something or /outside_path=/inside_path. All paths must be absolute, and outside paths must exist on the local machine (where this service runs); inside paths may or may not exist.

Changes to this flag are non-hermetic, i.e., the system returns existing cache entries and does not force a rerun of the affected actions.
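
For example (the paths are placeholders; the single-path form is assumed to mount the directory at the same path inside the container):

--docker_additional_mounts+=/opt/toolchains
--docker_additional_mounts+=/srv/cache=/cache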


docker_allow_any_runtime

--docker_allow_any_runtime=true (boolean)

If false, then requesting a specific runtime will fail the execution unless it is explicitly allowed using --docker_allowed_runtimes.


docker_allow_network_access

--docker_allow_network_access=true (boolean)

If true, then actions can request access to sibling containers and the internet using the dockerNetwork platform setting. Otherwise actions requesting such access fail.

When enabled, action execution containers that are started with dockerNetwork=standard will be connected to a Docker bridge network. The network's name is set in the execution container as the $HOST_NETWORK_NAME environment variable.

When disabled, the value of --docker_default_network_mode is ignored and taken to be off.


docker_allow_requesting_capabilities

--docker_allow_requesting_capabilities=true (boolean)

If false, then requesting capabilities will fail the execution.


docker_allow_reuse

--docker_allow_reuse=true (boolean)

Whether to allow reusing Docker containers. If true, we allow reusing a running Docker container for subsequent actions that specify the same image id and Docker options; otherwise we start a new container for every action. Individual actions or builds can opt-out of container reuse with the dockerReuse platform option. Depending on the underlying machine, Docker startup can take several seconds.


docker_allow_sibling_containers

--docker_allow_sibling_containers=true (boolean)

If true, then actions can request access to docker with the dockerSiblingContainers platform setting. Otherwise actions requesting such access fail.


docker_allowed_runtimes

--docker_allowed_runtimes=[] (list of strings)

Ignored if --docker_allow_any_runtime=true. A list of runtimes that clients are allowed to set. If you want to allow the default runtime, you have to add the empty string to this list.


docker_clean_tmp

--docker_clean_tmp=false (boolean)

Only used when --docker_allow_reuse=true. Whether to clean /tmp after reusable Docker actions.


docker_container_startup_timeout

--docker_container_startup_timeout=5m (duration)

How long to wait for docker container startup (i.e., "docker run"). This is not part of the action's execution timeout.


docker_content_trust

--docker_content_trust=false (boolean)

Whether to enable docker's signature verification. When enabled, docker only allows running signed images.


docker_cpu_limit

--docker_cpu_limit=set (one of: {none, count, set})

Whether and how to limit docker action CPU usage. Use 'none' to apply no per-action limit, 'count' to set the maximum CPU usage in number of cores, and 'set' to restrict the action to a specific set of cores. Both 'count' and 'set' are computed from the --worker_config option; 'count' simply applies the number of cores, whereas 'set' computes non-overlapping CPU masks starting at 0. We recommend using 'set' if possible, and 'count' otherwise. Use 'none' only if CPU limitation does not work for some reason. Note that the 'set' setting assumes that the worker service has full control over the machine - another process assigning the same CPUs on the same machine can lead to conflicts and performance issues.


docker_default_network_mode

--docker_default_network_mode=off (one of: {off, standard, host})

Only used when --allow_docker=true. Ignored and considered to be off if --docker_allow_network_access=false.

Specifies the default network mode for dockerized actions that don't request any particular dockerNetwork platform option.


docker_disallowed_capabilities

--docker_disallowed_capabilities=[] (list of strings; previous name: --docker_blacklisted_capabilities)

A list of capabilities that must not be set in execution requests. A request setting a capability provided here fails execution.


docker_drop_capabilities

--docker_drop_capabilities=[] (list of strings)

A list of docker capabilities that are dropped by default in addition to those that are already dropped by docker.


docker_enable_ipv6

--docker_enable_ipv6=false (boolean)

Whether to enable IPv6 for the Docker network.


docker_enforce_known_capabilities

--docker_enforce_known_capabilities=true (boolean)

If true, then all capabilities that are requested to be added are checked against a list of known capabilities before they are passed to docker. If any requested capability is not known, execution fails.


docker_extra_flags

--docker_extra_flags=[] (list of strings)

Extra flags to pass to docker run.


docker_ipv6_cidr

--docker_ipv6_cidr=fd00::/16 (string)

Only used when --docker_enable_ipv6=true. The subnet CIDR range for IPv6 Docker networks. Worker instances use this to generate random IPv6 subnets for each executor; each generated subnet will begin with the given prefix, and have a subnet length given by --docker_ipv6_subnet_length. This can either be a private subnet (starting with fd00), which does not allow any outgoing IPv6 traffic, or it can be public, in which case it should be based on the IPv6 subnet assigned to the underlying machine.

For example, if the machine uses 2001:0db8:3333:4444:5555:6666:7777:8888/64, and this flag is set to 2001:0db8:3333:4444:ff00::/72, and the subnet length is 96, then the worker generates random subnets that look like 2001:0db8:3333:4444:ffXX:XXXX::/96, with each X replaced by a random hexadecimal digit.

Note: the value given here can be identical to the value configured in the Docker daemon's fixed-cidr-v6 configuration option.


docker_ipv6_subnet_length

--docker_ipv6_subnet_length=112 (integer)

Only used when --docker_enable_ipv6=true. The subnet CIDR prefix length for IPv6 Docker networks; the generated Docker subnets will have 2^(128-X) addresses. See the documentation of --docker_ipv6_cidr for more details.


docker_max_kernel_memory

--docker_max_kernel_memory=0 (capacity)

This is passed to docker to limit the amount of kernel memory available to each action. If unset, then there is no limit applied to docker; memory use is still limited by the available machine memory.


docker_max_memory

--docker_max_memory=0 (capacity)

Deprecated. Use --worker_config with a ram setting instead. If both are set, then we pass the maximum of the two values to docker to limit the amount of memory available to each action. If both are unset, then there is no limit applied to docker; memory use is still limited by the available machine memory.


docker_process_limit

--docker_process_limit=10000 (integer)

The maximum number of concurrent processes for a single action. This helps prevent runaway processes and fork bombs. Set to -1 for no limit, but beware this allows build actions to fork bomb.


docker_record_psi

--docker_record_psi=true (boolean)

Record Linux Pressure Stall Information in the docker runner.


docker_use_process_wrapper

--docker_use_process_wrapper=true (boolean)

Whether to run Docker actions through the process wrapper. This also requires setting --process_wrapper_binary_path. Note that this may fail at runtime if the selected Docker container is not compatible with the process-wrapper binary, which is usually linked against libc and libstdc++ among other system libraries.


experimental_detect_oom_based_on_memory_use

--experimental_detect_oom_based_on_memory_use=false (boolean)

By default the OOM detector only uses the exit code to detect OOMs. If this flag is enabled, the service additionally considers actions with a non-zero exit code that use more memory than the action memory limit as OOMs. This requires that a memory limit is set - either via --worker_config or --docker_max_memory - and that the process wrapper is enabled via --docker_use_process_wrapper. As of 2023-07-25, this only works for Linux Docker actions.


experimental_detect_oom_memory_factor

--experimental_detect_oom_memory_factor=1.0 (float)

If --experimental_detect_oom_based_on_memory_use is enabled, then this flag controls how the memory comparison is done as used > factor * available. By adjusting the factor to be less than 1.0, more failed actions are considered to be out of memory.


experimental_docker_force_reuse

--experimental_docker_force_reuse=false (boolean)

Whether to enforce reusing Docker containers. This is ignored if --docker_allow_reuse is false. If both are true, then the service attempts to reuse running Docker containers regardless of the client setting for the dockerReuse platform option.


experimental_executor_io_recovery

--experimental_executor_io_recovery=null (boolean)

If enabled, an executor may switch to a new directory if it cannot delete or rename files in its current directory before or after an action. This sometimes happens on Windows because files cannot be deleted or renamed while they are opened by another process.


experimental_force_module_cache_path_for_mnemonics

--experimental_force_module_cache_path_for_mnemonics=[] (list of strings)

A list of mnemonics for which EngFlow will force a value for -fmodules-cache-path. This is currently only useful for Objective-C actions.


experimental_historical_results

--experimental_historical_results=true (boolean)

If enabled, workers will upload execution results to the CAS. This enables retrieving uncached action results, including failures. Also set --experimental_historical_result_url to report back the URL for a UI endpoint where the historical execution results can be viewed.


experimental_mia_fallback_pools

--experimental_mia_fallback_pools=[] (list of strings)

Each value is a comma-separated list of fallback pools; whenever an action fails with a worker missing-in-action (MIA), then the action is retried on the next pool id in the sequence. For example, if you specify default,unscaled then an action that fails with an MIA in the default pool is retried in the unscaled pool.

It is important that the pools use the same architecture and operating system, and generally increase along all other dimensions (cpu, ram, input+output tree sizes, etc.). This property is not automatically verified.

Furthermore, you should ensure that the MIA fallback policy and the OOM fallback policies are order-independent; for example, you may have an MIA first, and then an OOM, or vice versa - the resulting pool should always be sufficiently large for the action.

This flag only works if --experimental_track_action_failures_across_retries is enabled.


experimental_oom_fallback_pools

--experimental_oom_fallback_pools=[] (list of strings)

Each value is a comma-separated list of fallback pools; whenever an action fails with an OOM, then the action is retried on the next pool id in the sequence. For example, if you specify default,more_ram,most_ram then an action that fails with an OOM in the default pool is retried in the more_ram pool, and an action that fails in the more_ram pool is retried in the most_ram pool.

It is important that the pools use the same architecture and operating system, and generally increase along all other dimensions (cpu, ram, input+output tree sizes, etc.). This property is not automatically verified.
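
Written out as a flag, the example above could look like:

--experimental_oom_fallback_pools+=default,more_ram,most_ram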


experimental_pool_groups

--experimental_pool_groups=[] (list of strings)

Each value is a comma-separated list of pools, as well as their relevant dimensions (cpu, ram, etc.); whenever an action execution is requested for a pool that is in one of the pool groups, instead of executing it on the requested pool, a different pool may be chosen. This choice is based on previous similar executions and how many resources were used there.

In contrast to --experimental_oom_fallback_pools, with this flag the initial execution of an action can run on the right-sized pool from the first time and thus avoid OOMs. It also allows action-based pool selection, which is more fine-grained compared to marking a whole target for execution on a specific pool.

It is important that each value only lists pools that use the same architecture and operating system. Pools must be listed in ascending order for all dimensions (cpu, ram, input+output tree sizes, etc.). This property is not automatically verified.

Additionally, the pool groups need to be disjoint, i.e. no pool is listed in two pool groups. If a pool is not listed in any group, it is interpreted as a single-pool group.

Example: the value small:cpu=1;ram=2gib,medium:cpu=1;ram=4gib,large:cpu=1;ram=8gib specifies a pool group consisting of three pools, small, medium, large with increasing RAM. A request for an action execution for pool small - where a previous similar execution used 1.4 GiB memory - would then be executed on medium, as executing on small would likely OOM.
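
As a command-line flag, that example could be written as follows (see the list-flag syntax in the appendix):

--experimental_pool_groups+=small:cpu=1;ram=2gib,medium:cpu=1;ram=4gib,large:cpu=1;ram=8gib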


experimental_pool_groups_allow_downgrade

--experimental_pool_groups_allow_downgrade=false (boolean)

When using --experimental_pool_groups, determines whether actions for a target with a tag specifying which pool to execute on may run on a smaller pool in the same group.

For example, take the pool group small:cpu=1;ram=2gib,medium:cpu=1;ram=4gib,large:cpu=1;ram=8gib and a request to execute an action on pool medium. Additionally, suppose the last execution of a similar action only took 1 GiB of memory. If --experimental_pool_groups_allow_downgrade=false, the action will be executed on medium. If the flag is true, on the other hand, the action will be executed on small. If the last execution took 5 GiB, irrespective of the flag's value, the action will be executed on large, upgrading the pool.


experimental_retry_failure_due_to_signal

--experimental_retry_failure_due_to_signal=false (boolean)

Whether to retry actions that fail due to a system signal (128 < exit code < 255). Use --action_execution_attempts to control the maximum number of attempts.


experimental_retry_persistent_worker_on_error

--experimental_retry_persistent_worker_on_error=null (string)

If this flag is set to a regex pattern, then persistent worker actions that return a non-zero exit code and an error message matching the given pattern are retried. This causes the current persistent worker to be shut down, and can therefore result in slower actions; use sparingly.

The total number of attempts is controlled via --action_execution_attempts.


experimental_use_recommended_pool

--experimental_use_recommended_pool=false (boolean)

When using --experimental_pool_groups, determines whether the recommended pool is used or not. Disabling this flag allows a dry-run of this feature, which only logs what behavioral changes would happen.

This is a migration flag, which we expect to enable by default and then remove.


experimental_windows_persistent_workers

--experimental_windows_persistent_workers=false (boolean; previous name: --experimental_persistent_worker)

Whether to enable experimental support for remote persistent workers on Windows.

See docs.engflow.com.


extra_xcode

--extra_xcode=[] (list of strings)

A list of paths to Xcode installations.


ignore_unknown_platform_properties

--ignore_unknown_platform_properties=false (boolean)

Whether to ignore unknown platform properties. If false, then actions that set unknown platform properties return an error. Otherwise such properties are silently ignored. Note that changing this flag does not affect existing entries in the action cache, i.e., the server may return cached entries even if re-executing the action would return an error due to unknown properties. All properties are part of the cache key.


incompatible_docker_prevent_memory_swap

--incompatible_docker_prevent_memory_swap=false (boolean)

If enabled, prevents docker containers from using swap memory.


incompatible_one_pool_per_worker

--incompatible_one_pool_per_worker=false (boolean)

Whether to prevent workers from configuring multiple pools.


incompatible_remove_symlink_execroot_strategy

--incompatible_remove_symlink_execroot_strategy=true (boolean)

Removes support for building the exec root using symlinks.


incompatible_require_canonical_container_image

--incompatible_require_canonical_container_image=false (boolean)

If enabled, the 'container-image' platform option must contain the digest of the container.


max_download_concurrency

--max_download_concurrency=200 (integer)

The maximum number of concurrent downloads to a worker before an action starts. A negative or zero value indicates no limit. This may be useful to limit the CAS read load and to prevent running out of file descriptors.


max_execution_timeout

--max_execution_timeout=15m (duration)

The maximum timeout for the execution of a single action. Clients typically only set timeouts for a subset of actions such as test actions to avoid cache fragmentation. The timeout set here applies to all execution requests that do not have a timeout set. In addition, it also provides an upper bound for execution requests that do have a timeout set, i.e., requested timeouts larger than this are silently ignored.


max_input_size

--max_input_size=4gb (capacity)

The maximum total size of all inputs to an action. Actions that exceed this limit are aborted during setup.


max_output_size

--max_output_size=4gb (capacity)

The maximum total size of all outputs of an action. Actions that exceed this limit are aborted during or after execution.


notification_period

--notification_period=1m (duration)

Configures how often the service provides updates to the client about running actions. Note that this does not apply to queued actions.


operation_retention_time

--operation_retention_time=1m (duration)

Configures the duration for which the worker retains a finished action before deleting it locally. The worker uses these retained entries to answer waitExecution requests in case the client disconnects during execution. A very small value can cause unnecessary action retries and execution load, and a very large value can cause excessive memory use on the worker.


process_wrapper_binary_path

--process_wrapper_binary_path=/usr/bin/engflow/process-wrapper (string)

The path to a process-wrapper binary on the worker. The process-wrapper binary is part of a Bazel installation and provides improved control of action processes.


process_wrapper_cpu_limit

--process_wrapper_cpu_limit=none (one of: {none, set})

Whether and how to limit action CPU usage when using the process wrapper. Use 'none' to apply no per-action limit, and 'set' to restrict the action to an automatically computed set of cores. We recommend using 'set' if possible. Use 'none' only if CPU limitation does not work for some reason. Note that the 'set' setting assumes that the worker service has full control over the machine, as it assigns CPUs starting at 0.


sandbox_allow_network_access

--sandbox_allow_network_access=true (boolean)

If true, sandboxed actions can request network access by setting the platform option sandboxNetwork, e.g., exec_properties = { "sandboxNetwork": "standard" }. Otherwise, such actions fail. Only applies to the macOS sandbox, i.e., if --experimental_allow_mac_sandbox is enabled.


sandbox_grace_timeout

--sandbox_grace_timeout=5s (duration)

How long to wait before sending SIGKILL after an action times out. When an action times out, we first send it SIGTERM and only send SIGKILL after this grace period. The value may be rounded up to the next larger whole second. This applies to any actions run within the process wrapper despite the name of this flag. See --use_process_wrapper or --docker_use_process_wrapper.


use_process_wrapper

--use_process_wrapper=false (boolean)

Whether to enable the process wrapper for local actions. The process wrapper provides improved process control, ensuring a more consistent execution environment as well as killing all child processes reliably.


warm_containers

--warm_containers=true (boolean)

Try to pull active cluster Docker containers onto the worker before accepting any actions.


warm_containers_timeout

--warm_containers_timeout=10m (duration)

Maximum time to pull active cluster Docker containers before accepting any actions. After this time, any remaining 'docker pull' operations are cancelled, and the worker begins accepting actions.


worker_config

--worker_config=auto (string)

Configures the number and properties of local executors. Specify executor properties as a list of key-value pairs separated by commas, such as cpu=1,ram=2gb,pool=c1_m2.

To specify multiple identical executors, prefix a set of properties with a number and a * character, such as 4*cpu=2. To specify multiple different executors, combine them with a + character, such as 1*cpu=3,ram=1gb+2*cpu=1 (one executor with 3 cores and 1 GB of RAM, and two executors with 1 core). The comma operator has precedence over the star operator, which has precedence over the plus operator. Disable local execution by setting this flag to the empty string.

For automatic configuration, specify auto to create an executor for each available core. This option is useful when the number of cores is not known in advance.

For manual configuration, the only supported keys are cpu, ram, and pool. cpu specifies the number of cores to reserve, ram specifies the maximum RAM used by the executor, and pool specifies the name of the pool for the executor.

cpu must be a positive integer (e.g., 2), fractional values are not supported.

ram should, if specified, also include a unit (e.g., 10b for 10 bytes, 5gib for 5 GiB). The unit is case-insensitive. If no unit is specified, the value is understood in bytes.

pool must, if specified, match the following expression: [a-z0-9_]+.
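
For illustration, a few configurations built from the examples above (pool names and sizes are placeholders):

--worker_config=auto
# One executor per available core

--worker_config=cpu=1,ram=2gb,pool=c1_m2
# A single executor with 1 core and 2 GB of RAM in pool c1_m2

--worker_config=1*cpu=3,ram=1gb+2*cpu=1
# One executor with 3 cores and 1 GB of RAM, plus two single-core executors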


Options to configure the result store service


bes_keyword_deny_list

--bes_keyword_deny_list=[] (list of strings)

A list of BES keyword prefixes to exclude from search results in the UI. This can be used to clean up the lists of keywords in the UI.


experimental_build_index_db_threads

--experimental_build_index_db_threads=10 (integer)

Only used when --experimental_build_index is enabled. Specifies how many threads to use to query the invocation index database. The value is a positive integer.


experimental_build_index_service_threads

--experimental_build_index_service_threads=4 (integer)

Only used when --experimental_build_index is enabled. Specifies how many threads to use to serve invocation index requests. The value is a positive integer.


experimental_filter_prefix_sort

--experimental_filter_prefix_sort=true (boolean)

Sorts filter fields used in invocation search to put prefix matches before substring matches.


report_per_event_replay_cpu_time_metrics

--report_per_event_replay_cpu_time_metrics=false (boolean)

If enabled, reports how much CPU time was spent replaying and reducing each individual event within all incoming BES. Reporting may be costly so use this sparingly.


Monitoring options


cas_cached_docker_metrics

--cas_cached_docker_metrics=true (boolean)

Whether to export metrics for the CAS-cached Docker pull strategy.


cloudwatch_dimensions

--cloudwatch_dimensions=null (string)

Only considered when --enable_cloudwatch=true and ignored otherwise. Sets common dimensions of reported CloudWatch metrics. The value is a comma-separated list of key-value pairs, e.g. "customer=Acme Inc.,cluster=prod", order does not matter.


cloudwatch_export_interval

--cloudwatch_export_interval=1m (duration)

Configures the time between metrics exports to CloudWatch.


cloudwatch_metrics_filter

--cloudwatch_metrics_filter=[.*] (list of strings)

Required when --enable_cloudwatch=true, ignored otherwise. A list of regexes that filter metric names: a metric is reported to AWS CloudWatch only if it matches any of the regexes. Entries follow Java regex syntax. Matching is partial by default (e.g. "exec" matches every metric whose name contains this string); to match the whole metric name, use ^ and $. If empty or not specified, then no metrics are reported.

Example: report metrics about AWS S3 use: --cloudwatch_metrics_filter+=storage\.s3/; report download-related metrics but from any storage backend: --cloudwatch_metrics_filter+=storage\..*/download.

Use --cloudwatch_metrics_filter=regex to override the default.
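
The examples above, written out as flag values:

--cloudwatch_metrics_filter+=storage\.s3/
# Report metrics about AWS S3 use

--cloudwatch_metrics_filter+=storage\..*/download
# Report download-related metrics from any storage backend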


cloudwatch_namespace

--cloudwatch_namespace=null (string)

Required when --enable_cloudwatch=true, ignored otherwise. Sets the namespace of reported metrics.


cloudwatch_region

--cloudwatch_region=null (string)

Required when --enable_cloudwatch=true, ignored otherwise. Sets the AWS region of reported metrics.


enable_cloudwatch

--enable_cloudwatch=false (boolean)

Enables reporting metrics to AWS CloudWatch.


enable_metrics_log

--enable_metrics_log=false (boolean)

Enables reporting metrics to the log; this can be used for testing or for log-based analytics.


enable_prometheus

--enable_prometheus=false (boolean)

Enables a built-in webserver to export monitoring data to Prometheus (https://prometheus.io/). You may also need to set --prometheus_port and configure Prometheus to start scraping from all cluster nodes.


enable_stackdriver

--enable_stackdriver=false (boolean)

Enables reporting of monitoring and tracing data to StackDriver (a monitoring system integrated into Google Cloud that also supports AWS). You also need to set --stackdriver_project and provide application default credentials that allow write access to StackDriver.


enable_zipkin

--enable_zipkin=false (boolean)

Enables reporting of performance traces to Zipkin (https://zipkin.io/). You also need to set --zipkin_endpoint.


execution_stage_latency_metrics

--execution_stage_latency_metrics=true (boolean)

Whether to export execution latency metrics by stage and pool.


grpc_metrics

--grpc_metrics=basic (one of: {none, minimal, basic, all})

The gRPC library provides a number of metrics that can be logged for monitoring. This option selects what subset of metrics to log. Unfortunately, logging all metrics can be expensive (e.g., on Google Cloud Operations). For the minimal setting, all completed RPCs are logged, but no latency metrics, bytes, or messages.


log_metrics_filter

--log_metrics_filter=[.*] (list of strings)

Required when --enable_metrics_log=true, ignored otherwise. A list of regexes that filter metric names: a metric is logged only if it matches any of the regexes. Entries follow Java regex syntax. Matching is partial by default (e.g. "exec" matches every metric whose name contains this string); to match the whole metric name, use ^ and $. If empty or not specified, then no metrics are reported.


lsof_report_interval

--lsof_report_interval=0s (duration)

Configures the time between reports of open file handles. Zero to disable.


monitoring_trace_probability

--monitoring_trace_probability=0 (float; previous name: --monitoring_sample_probability)

Sets the probability of recording a performance trace for a given client request to a scheduler. Setting it to 0 disables tracing. Setting it to 1 enables tracing every request. Tracing a large fraction of the traffic is expensive, and should not be used for production clusters. Note that this flag is evaluated once on the scheduler for each incoming RPC call and then passed along on subsequent calls.


prometheus_bind_to_any

--prometheus_bind_to_any=false (boolean; previous name: --monitoring_prometheus_bind_to_any)

Whether to bind to any local IP. If false, then only bind to the private IP selected with --private_ip_selector. If your cluster is connected to the public internet, then enabling this flag exposes your monitoring data publicly.


prometheus_port

--prometheus_port=8888 (integer; previous name: --monitoring_prometheus_port)

Selects the local port to start a prometheus-compatible webserver on.


stackdriver_export_interval

--stackdriver_export_interval=1m (duration)

Configures the time between metrics exports to StackDriver.


stackdriver_optimized_reporting

--stackdriver_optimized_reporting=true (boolean)

Transitional option to enable automatic optimization of the metric export interval for each metric based on observed changes. I.e., metrics are only exported when they change rather than every interval. This can significantly reduce Stackdriver costs during periods of low cluster utilization such as nights and weekends.


stackdriver_project

--stackdriver_project= (string; previous name: --monitoring_stackdriver_project)

Selects the StackDriver project to send monitoring data to.


zipkin_endpoint

--zipkin_endpoint=http://localhost:9411/api/v2/spans (string; previous name: --monitoring_zipkin_endpoint)

Configures the zipkin endpoint to push performance traces to.


Options to configure logging to external services


aws_log_group_name

--aws_log_group_name=null (string)

Only used if --remote_logging_service=aws_cloudwatch. The name of the AWS log group, which must already exist.


gcp_log_autodetect

--gcp_log_autodetect=true (boolean)

Only used if --remote_logging_service=google_cloud_operations. Whether to automatically detect log labels for this process, like the instance name and availability zone. If you log to GCP from outside of GCP, the automatic detection does not work correctly - in that case, set this flag to false.


gcp_log_project_id

--gcp_log_project_id=null (string)

Only used if --remote_logging_service=google_cloud_operations. The GCP project id to log to. Instances that run on GCP automatically detect the current project; you can use this flag to override the automatically detected project id, or provide one explicitly if the instance is not running on GCP.


remote_log_level

--remote_log_level=info (one of: {off, severe, warning, info, verbose, all})

The verbosity level of remote logging.


remote_logging_service

--remote_logging_service=none (one of: {none, google_cloud_operations, aws_cloudwatch})

The external service to log to.


Threading options


default_thread_pool_size

--default_thread_pool_size=0 (integer)

The size for the default executor.

A nonpositive value means use the number of CPU cores of the machine.


disk_thread_pool_size

--disk_thread_pool_size=0 (integer)

The size for the disk executor.

A nonpositive value means use the number of CPU cores of the machine.


network_thread_pool_size

--network_thread_pool_size=0 (integer)

The size for the network executor.

A nonpositive value means use the number of CPU cores of the machine.


slow_task_threshold

--slow_task_threshold=1s (duration)

Length of execution time before a task is considered slow.


Options to configure CI runners


experimental_bk_agent_default_version

--experimental_bk_agent_default_version=3.58.0 (string)

The version of the Buildkite agent to use when the "buildkite-agent-version" label is not set.


experimental_bk_agents_uuid

--experimental_bk_agents_uuid= (string)

The UUID to use for the Buildkite agents webhook URL. This must be identical for all schedulers in the same cluster.


experimental_bk_api_token

--experimental_bk_api_token= (string)

The file or secretstore URL to a Buildkite Token with permissions to read builds.


experimental_bk_polling_interval

--experimental_bk_polling_interval=10s (duration)

Defines how often queued jobs are polled via the Buildkite APIs.


experimental_bk_secrets_json

--experimental_bk_secrets_json= (string)

The file or secretstore URL to a JSON object with secrets as key/value pairs, e.g., {"BUILDKITE_AGENT_TOKEN":"my-secret-token"}. The maximum size of all keys and values is around 4kiB. This object must contain an entry for 'BUILDKITE_AGENT_TOKEN' or the agent will not be able to register itself.


experimental_ci_cluster

--experimental_ci_cluster= (string)

An identifier for the CI Runners cluster which can be used to target a specific cluster via the engflow-cluster label.


experimental_ci_runners_default_arch

--experimental_ci_runners_default_arch=x64 (one of: {unknown, x64, arm64})

Used for failing actions only. Must match the architecture of --ci_runners_default_pool.


experimental_ci_runners_default_docker_runtime

--experimental_ci_runners_default_docker_runtime=sysbox-runc (string)

Used for failing actions only. Must be compatible with --ci_runners_minimal_docker_image.


experimental_ci_runners_default_os

--experimental_ci_runners_default_os=linux (one of: {windows, macos, linux, unknown})

Used for failing actions only. Must match the OS of --ci_runners_default_pool.


experimental_ci_runners_default_pool

--experimental_ci_runners_default_pool= (string)

Used for failing actions only. Must match --ci_runners_default_os and --ci_runners_default_arch and be compatible with --ci_runners_minimal_docker_image.


experimental_ci_runners_docker_image

--experimental_ci_runners_docker_image= (string)

Only used when --experimental_enable_gh_runners_webhook=true. This must be a valid canonical docker URL (e.g., docker://alpine/curl@sha256000...), which is in turn used as the container for all fetch and unpacking calls required for CI runners. The referenced Docker image must contain curl, tar, and unzip. If the image has an entrypoint, it is ignored.


experimental_ci_runners_http_retries

--experimental_ci_runners_http_retries=5 (integer)

How many times to retry failing HTTP calls (GitHub or Buildkite API).


experimental_ci_runners_lost_job_timeout

--experimental_ci_runners_lost_job_timeout=20m (duration)

If a CI job has not been updated within this duration, it is re-processed (up to --experimental_ci_runners_process_retries times).


experimental_ci_runners_minimal_docker_image

--experimental_ci_runners_minimal_docker_image= (string)

Used for failing actions only. Must run on --ci_runners_default_os and --ci_runners_default_arch and be sufficient to start the GitHub action runner.


experimental_ci_runners_process_retries

--experimental_ci_runners_process_retries=0 (integer)

How many times to attempt to retry the processing of a CI job.


experimental_ci_runners_reapi_retries

--experimental_ci_runners_reapi_retries=5 (integer)

How many times to retry failing RE-API calls (the actual CI job execution is exempted).


experimental_ci_runners_status_in_ui

--experimental_ci_runners_status_in_ui=false (boolean)

Whether to show a CI Runners status page in the cluster's UI.


experimental_ci_runners_tracker_duration

--experimental_ci_runners_tracker_duration=24h (duration)

The amount of time a job status is retained in the distributed HZ map.


experimental_enable_bk_agents_webhook

--experimental_enable_bk_agents_webhook=false (boolean)

Whether to allow POST requests to /webhooks/buildkite/agent/<uuid> to trigger running the Buildkite agent.


experimental_enable_bk_polling

--experimental_enable_bk_polling=false (boolean)

Whether to poll for queued jobs via the Buildkite APIs. This requires a Buildkite API token with read_builds scope, see --experimental_bk_api_token.


experimental_enable_gh_polling

--experimental_enable_gh_polling=false (boolean)

Whether to poll for queued jobs via the GitHub APIs.


experimental_enable_gh_runners_webhook

--experimental_enable_gh_runners_webhook=false (boolean)

Whether to allow POST requests to /webhooks/github/runners/<uuid> to trigger running the GitHub action runner.


experimental_gh_polling_interval

--experimental_gh_polling_interval=10s (duration)

Defines how often queued jobs are polled via the GitHub APIs.


experimental_gh_repo

--experimental_gh_repo=[] (list of strings)

Name(s) of the repo(s) to poll, using <owner>/<repo> notation.


experimental_gh_result_delay

--experimental_gh_result_delay=1s (duration)

Defines the delay between a job finishing and querying GitHub for the job status.


experimental_gh_runner_default_version

--experimental_gh_runner_default_version=2.317.0 (string)

The version of the GH action runner to use when the "github-runner-version" label is not set.


experimental_gh_runner_idle_timeout

--experimental_gh_runner_idle_timeout=1m (duration)

The longest time a GH action runner will be idle (no job) before getting killed.


experimental_gh_runner_registration_type

--experimental_gh_runner_registration_type=organization (one of: {organization, repository, enterprise})

The level at which the GH runners are registering to GitHub.


experimental_gh_runner_timeout

--experimental_gh_runner_timeout=5h (duration)

The longest time a GH action runner will run before getting killed.


experimental_gh_runners_secret

--experimental_gh_runners_secret= (string)

The file or secretstore URL to a GitHub Personal Access Token or GitHub App with permissions to create action runners at the organization level. If the flag --experimental_gh_runners_type is set to personal_access_token, then this secret should only be the PAT, with no padding or other formatting. If the flag is set to github_app, then the secret should be a JSON object with two fields, github_app_id and private_key corresponding to the GitHub App.
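
For the github_app case, the secret's content might look like the following sketch (both values are placeholders):

{
  "github_app_id": "123456",
  "private_key": "-----BEGIN RSA PRIVATE KEY-----\n...\n-----END RSA PRIVATE KEY-----"
}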


experimental_gh_runners_type

--experimental_gh_runners_type=personal_access_token (one of: {personal_access_token, github_app})

Whether the GH Runners secret is a GitHub Personal Access Token or a GitHub App.


experimental_gh_runners_uuid

--experimental_gh_runners_uuid= (string)

The UUID to use for the GitHub runners webhook URL. This must be identical for all schedulers in the same cluster.


Appendix: flag syntax

Duration flags

You can specify a duration in milliseconds, seconds, minutes, hours, or days. Use the suffix ms, s, m, h, or d respectively:

--flag=5s
# Means: 5 seconds

--flag2=90m
# Means: 90 minutes

Capacity flags

You can specify a capacity in Bytes, in KiloBytes, MegaBytes, or GigaBytes, and in KibiBytes, MebiBytes, or GibiBytes. Use b or no suffix at all for Bytes; use the suffixes kb, mb, or gb for decimal units (1000 multipliers), and kib, mib, or gib for the binary units (1024 multipliers). Upper and lower-case are considered the same.

--flag1=10
--flag1=10b
# Both mean 10 bytes

--flag2=15mb
--flag2=15MB
# Both mean 15 MB (15 * 10^6 = 15_000_000 bytes)

--flag3=25gib
--flag3=25GiB
# Both mean 25 GiB (25 * 1024^3 = 26_843_545_600 bytes)

List flags

You can specify list flags multiple times. The += operator adds another value, and the = operator drops all accumulated (or default) values:

--flag=value1 --flag+=value2 --flag+=value3
# Means: [value1, value2, value3] (ignoring the default value)

--flag+=value1 --flag+=value2 --flag+=value3
# Means: [<default values>, value1, value2, value3]

--flag+=value1 --flag=value2 --flag+=value3
# Means: [value2, value3]