Release Notes

To see your currently deployed version visit [cluster_url]/restatus in your EngFlow cluster web UI. If you do not have the web UI enabled please ask your EngFlow contact which version you are currently running.

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

v2.28.2 (2023-01-30)

Cherrypicks

  • RE: return NOT_FOUND when workers go missing during execution.
  • probers: retry when Execute returns NOT_FOUND.

v2.28.1 (2023-01-27)

Cherrypicks

  • UI login: fix bug in OpenID Connect token requests.

v2.28.0 (2023-01-27)

Cherrypicks

  • RE: added an experimental flag, --experimental_always_retry_missing_worker_failures. When --experimental_always_retry_missing_worker_failures is enabled, the scheduler will always retry on UNAVAILABLE errors from workers.
  • RE: add a wait time before shutting down the gRPC server.

Added

  • Goma, --logging_timestamp_format: now supports the value fluent-bit, for Fluent-bit compatible timestamps in JSON logs ("%s.%L" format).
  • RE: deployed retriable probers to detect regressions.

Changed

  • RE, --log_file_limit: default is raised from 10mb to 100mb. This should avoid frequent log rotation when logging a lot.
  • fluent-bit: health-check is now enabled (but not exposed via any infrastructure).
  • fluent-bit: will now ship its own log-file by default.

Fixed

  • RE, Goma: the .deb installer now creates some static config files and empty log files for fluent-bit safety.

v2.27.2 (2023-01-25)

Cherrypicks

  • RE: added an experimental flag, --experimental_always_retry_missing_worker_failures. When --experimental_always_retry_missing_worker_failures is enabled, the scheduler will always retry on UNAVAILABLE errors from workers.

v2.27.1 (2023-01-25)

Cherrypicks

  • RE: add a wait time before shutting down the gRPC server.

v2.27.0 (2023-01-18)

Incompatible

  • --incompatible_track_availability_zone is now a no-op.

Added

  • This change log is now shown in the UI under /restatus.
  • Our public docs now describe how to engage with Customer Support.
  • The RE service can now emit single-line JSON logs, see --log_file.

Changed

  • With --client_auth=github_token, the principal name is now stable: github_token.

v2.26.0 (2023-01-12)

Changed

  • Enable S3 structured paths by default
  • Setting --mtls_expiration=0d is now allowed, and it disables downloading mTLS certificates from the UI.
  • Goma server now rotates log files to avoid filling the filesystem. New logging flags:

    • --logging_rotate_at_mb - Rotate log files when size x MB reached (default 10)
    • --logging_rotate_count - Number of rotated log files to keep (default 1000)
    • --logging_days_to_keep_rotated_files - Maximum days to keep rotated log files for (default 28)
    • --logging_compress_rotated_files - Compress rotated log files? (default true)
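
As a rough illustration of the defaults listed above, the following Python sketch renders the four rotation flags as command-line arguments. The `--name=value` syntax and the lowercase `true`/`false` rendering are assumptions made for this sketch, and `rotation_flags` is a hypothetical helper, not part of any EngFlow tooling.

```python
# Defaults as listed in the release notes above.
ROTATION_DEFAULTS = {
    "logging_rotate_at_mb": 10,
    "logging_rotate_count": 1000,
    "logging_days_to_keep_rotated_files": 28,
    "logging_compress_rotated_files": True,
}


def rotation_flags(overrides=None):
    """Render the rotation settings as --name=value arguments (hypothetical helper)."""
    overrides = overrides or {}
    unknown = set(overrides) - set(ROTATION_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown rotation flags: {sorted(unknown)}")
    settings = {**ROTATION_DEFAULTS, **overrides}
    args = []
    for name, value in settings.items():
        # Assumption: booleans are written as lowercase true/false.
        if isinstance(value, bool):
            value = "true" if value else "false"
        args.append(f"--{name}={value}")
    return args


# Example: rotate at 50 MB instead of the default 10 MB.
print(rotation_flags({"logging_rotate_at_mb": 50}))
```

Rejecting unknown names keeps a typo in an override from silently falling back to a default.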

v2.25.0 (2023-01-03)

Changed

  • Update third_party dependencies.
  • Add probers to release.

v2.24.0 (2022-12-27)

Changed

  • Update timezone selectors to use UI kit dropdown.
  • UI client TLS certs: allow --mtls_expiration=0d.
  • Add support for reading and storing protos (secretstore).

Fixed

  • UI: Fix the datetime picker.

v2.23.4 (2022-12-13)

Cherrypicks

  • auth: use --tls_trusted_certificate and --tls_trusted_key for signing and verifying JWTs.

v2.23.3 (2022-12-12)

Cherrypicks

  • Added an experimental flag, --experimental_docker_max_image_size_in_cas. When --experimental_docker_store_images_in_cas is enabled, workers cache docker container images in the CAS. The new flag sets a size limit on container image files stored in the CAS. It defaults to 10gib, the previously hard-coded limit.

v2.23.2 (2022-12-09)

Internal release. No publicly facing changes.

v2.23.1 (2022-12-09)

Internal release. No publicly facing changes.

v2.23.0 (2022-12-08)

Changed

  • Docs: Launched the new https://docs.engflow.com with search functionality.
  • Enabled --experimental_filter_known_replicas by default, causing the scheduler to confirm whether a CAS node holding a replica is alive before attempting a read.
  • Changed --internal_tcp_connect_timeout default value from 30s to 5s. This controls cluster-internal gRPC connections. Connection attempts to dead nodes will fail faster.
  • Enabled --warm_containers by default, causing workers to pull active cluster Docker containers before accepting actions.

Fixed

  • MacOS: worker no longer exits when --allow_docker=true; the flag is now ignored.
  • Added more Docker platform options to the affinity key to aid the scheduler in executor selection.

v2.22.1 (2022-11-24)

Changed

  • AWS: Install latest SSM agent in all MacOS machine images.

v2.22.0 (2022-11-23)

Changed

  • Added a Docker credential helper to the AMI and .deb package, which fetches the username/password from AWS Secrets Manager. It's called docker-credential-engflow-aws-secretsmanager.

Fixed

  • InputCas, stats: fix distributedCasLongestDownload.

v2.21.0 (2022-11-14)

Changed

  • Add flag to respect explicitly defined pools for --experimental_force_mnemonic_pool_name.

Fixed

  • Bug causing wrong Action Cache misses due to special-casing the empty blob.

v2.19.3 (2022-11-03)

Cherrypicks

  • Fix Action Cache misses due to special-casing the empty digest.

v2.19.2 (2022-11-02)

Cherrypicks

  • Sharing new AMIs.

v2.19.1 (2022-11-01)

Cherrypicks

  • Fix memory leak in notification queue service.

v2.19.0 (2022-10-24)

Changed

  • Goma: improve Goma cluster logging.
  • RE: update default --max_batch_size from 4mb to 10mb.
  • RE: update default --default_replica_timeout from 24hr to 1hr.
  • RE: update default --cas_existence_cache_expiry from 24hr to 0s.
  • RE: update default --local_cas_existence_cache_expiry from 120s to 30min.
  • RE: update default --hazelcast_aws_use_client_lib from false to true.

Fixed

  • UI: links render correctly in the CHANGELOG.md display of the Cluster Status page.
  • Goma: the grpc_keepalive_time and grpc_keepalive_timeout are now respected properly.
  • RE: correctly warm Docker containers on the default pool.
  • RE: Disconnected nodes will continue to try to reconnect to the cluster indefinitely.

Deprecated

  • The --client_auth=gcp_email, --client_auth=basic, and --http_auth=gcp_email options have been deprecated; they are not used by any clusters.
  • The --external_storage_gc_enable_deletion flag has been deprecated.

v2.18.1 (2022-10-19)

Cherrypicks

  • Goma: set gRPC keepalive time.

v2.18.0 (2022-10-12)

Deprecated

  • The --aws_cloudformation_stack_name and --aws_cloudformation_stack_resource flags are deprecated; the corresponding values are automatically read from instance metadata.

Added

  • Docs: Documentation around --client_auth=github_token.
  • Performance: --warm_containers=true can be used to automatically pull active Docker images onto new workers before they accept any actions.
  • Goma: new logging flags

    • --logging_output_encoding can now be supplied. json will emit single-line JSON, for consumption downstream and an enhanced querying UX.
    • --logging-timestamp-format can now be supplied. unix-utc will output millisecond-precision UNIX timestamps, for simpler datetime parsing downstream and an enhanced querying UX.

Changed

  • UI: the target tree view defaults to only showing non-successful targets.
  • Goma: more readable service logs - lines no longer have Java and systemd journal prefixes.

Fixed

  • RE: Fixed bug in used file tracking that caused workers to run out of disk space.

v2.17.1 (2022-10-11)

Cherrypicks

  • AWS: The x86_64 Debian AMIs now come with pre-downloaded software for supporting instance types with NVIDIA GPUs.

v2.17.0 (2022-09-30)

Changed

  • Performance: --bytestream_read_chunk_size is now 1 MiB by default; this can significantly improve machine-to-machine copy performance.
  • Removed metrics com.engflow.storage.ops/in_flight and com.engflow.storage.ops/stream_in_flight.

Added

  • UI: added button to download a summary of the invocation as markdown; this is intended for integration with other systems like a bug tracker or helpdesk.
  • Added thread pool metrics.
  • UI: Add tooltips to analytics summary to clarify values and improve display while loading data.
  • UI: Show release notes on the cluster status page.

Fixed

  • Goma: fix metrics exporting to CloudWatch.
  • Fix a rare hang when replicating a file to another machine.
  • Fix error handling when the GCS service is temporarily unavailable.

v2.16.4 (2022-09-28)

Cherrypicks

  • Fix high memory consumption when repeatedly reloading an invocation page.
  • Fix cas metrics reporting in the schedulers.
  • Fix heap dump helper tool.

v2.16.3 (2022-09-21)

Cherrypicks

  • Fix login loop when authorization header is sent.

v2.16.2 (2022-09-20)

Cherrypicks

  • Fix cookie parsing with HTTP/2 when multiple cookies are sent.

v2.16.1 (2022-09-20)

Cherrypicks

  • Fix login loop when multiple cookies are set for the UI domain.

v2.16.0 (2022-09-16)

Changed

  • Metrics: All metrics are reported to CloudWatch by default (if enabled).
  • Metrics: High-cardinality metrics for actions are disabled.
  • Metrics: BEP metrics are now in the com.engflow.bep namespace.

Added

  • Worker AMIs for Arm64 MacOS.
  • Support passing through arbitrary flags to docker run.

Fixed

  • Cleanup for --docker_clean_tmp runs as root.
  • Reduced cpu+memory consumption for very large Bazel event streams.
  • Fixed heap dump utility temporary directory creation.
  • Improve analytics summary consistency (build counts).
  • Fix race condition in handling of file uploads.
  • Fix potential hangs in multiple streaming calls.
  • Fix mTLS authentication fallback handling.
  • Fix existing scheduler metric to always be reported (previously was not reported if the cluster had no worker pools).

v2.15.2 (2022-09-15)

Cherrypicks

  • Goma: fix bug uploading chunks to CAS.

v2.15.1 (2022-09-14)

Cherrypicks

  • Goma: add latency metrics for inbound HTTP RPC requests.
  • Goma: ensure gomaOutput.toFileBlob returns after cancelation.
  • Goma: recache only upload if missing.
  • Scheduler: handle mTLS authentication fallback correctly.

v2.15.0 (2022-08-25)

Added

  • Added a documentation page about the Invocation Search page.
  • Extended Goma documentation.
  • Add new permission for viewing the UI, so that you no longer need admin access.
  • Monitoring: Added threadpool metrics.
  • UI Authentication: Support logging in with Okta.
  • UI Authentication: Web UI login for basic authentication.
  • UI Authentication: Support using multiple authentication methods together.
  • UI: Allow invocation page tabs to be opened in a separate window.

Fixed

  • Scheduler: Do not serve requests until we've had a chance to discover workers.
  • Fix Test XML file parsing when certain attributes are not present.
  • Avoid network loss during service shutdown.
  • Fix downloading compressed files.
  • UI: Fix messaging around Bazel profile availability while an invocation is still running.
  • UI: Fetch the console log only once.
  • UI: Fix auto-scroll functionality in console log.
  • UI: Ensure the correct test log is shown when switching targets.

v2.14.8 (2022-08-24)

Cherrypicks

  • Goma: Removed large debug logs.
  • Goma: Added log buffering and suppressed logs below "Info" level.
  • Goma: Added SIGUSR2 handler to collect CPU profiling data.

v2.14.7 (2022-08-23)

Cherrypicks

  • Fix an UNKNOWN error from bytestream Write calls.

v2.14.6 (2022-08-23)

Cherrypicks

  • Logging: Remove extraneous logging for some RPC calls.
  • Logging: Log time spent waiting for the client to send write data.
  • Bytestream: Use a 1 MB buffer for writes.
  • Monitoring: Add thread pools latency metrics.
  • Monitoring: More read/write storage metrics.

v2.14.5 (2022-08-20)

Cherrypicks

  • [goma] Implement digest cache in recache package.
  • [goma] Report more metrics for execution and RPC.
  • RemoteActionExecutor: fix action execution hangs.
  • ExecutionUnit: log target id.
  • Logging: log some RPC calls.

v2.14.4 (2022-08-19)

Cherrypicks

  • Fix (rare) NPE in replica selection.
  • Add metrics for distributed CAS fetches.

v2.14.3 (2022-08-12)

Cherrypicks

  • Guard process wrapper execution statistics behind a flag.
  • Retry recovery when losing a CAS node.
  • Fix temp directory handling for dockerized actions.

v2.14.2 (2022-08-10)

Cherrypicks

  • Fix storage metric being off by a factor of 1000.

v2.14.1 (2022-08-10)

Cherrypicks

  • Fix S3 metrics exporting that caused an excessive number of exported metrics and associated CloudWatch costs.

v2.14.0 (2022-08-10)

Added

  • Network: allow setting a TCP connect timeout via --internal_tcp_connect_timeout.
  • Metrics: add download call stats via com.engflow.re.cas/fetch_call_time.
  • UI: Analytics page - if scatter charts cannot be shown, add a button that narrows the search so they can be rendered
  • UI: Invocation search - allow filtering invocations by principals requester and runner

Fixed

  • UI: various styling fixes to improve consistency.
  • UI: Fix async bug that caused filters to not be applied on reset.
  • Metrics: report storage metrics correctly (com.engflow.storage.*/*).
  • Metrics: fix docs to indicate that distribution metrics are reported to CloudWatch.
  • S3: improve handling of "rate limited" errors.

v2.13.0 (2022-08-02)

Incompatible

  • --incompatible_keep_relative_argv0 is now a no-op.
  • --experimental_grpc_web is now a no-op.
  • Enable --experimental_inmemory_digests by default.
  • Enable --incompatible_strict_digest_verification by default.

Added

  • Partial graceful shutdown support for schedulers.
  • Support RE-API cache compression and gRPC-level compression simultaneously.
  • UI: add BES keywords to search index and allow filtering invocations by them (provided invocation indexing is enabled)
  • UI: improved rendering of large logs, including lazy loading
  • UI: add a badge to mark experimental features
  • UI: accessibility improvements

Fixed

  • Support BES streams with a high data rate.
  • Allow viewing an invocation's console output in fullscreen mode.

v2.12.1 (2022-07-25)

Added

  • Support proxying external storage reads through workers under --workers_handle_fallback_requests.

Fixed

  • Fix startup crashes in certain AWS configurations.

v2.12.0 (2022-07-21)

Added

  • Make the chunk size of ByteStream/Read responses configurable with --bytestream_read_chunk_size.
  • Add metric for response time of external authentication.
  • Add --experimental_force_module_cache_path_for_mnemonics flag for improving Objective-C builds.
  • Support changing the initial gRPC control flow window with --grpc_initial_flow_control_window.

Fixed

  • Make sure to use optimized TLS implementation when available.
  • Fix hangs when an error happens early during process startup.
  • Fix spurious cancellation of RPCs that check blob existence.
  • Fix spurious NOT_FOUND errors from ByteStream/Read.
  • Better handling of backend errors in the UI.

v2.11.2 (2022-07-19)

Cherrypicks

  • Race condition while checking CAS cache.
  • Possibly erroneous cancellation of futures from the cache.

v2.11.1 (2022-07-13)

Fixed

  • UI: Fix broken Google login page.

v2.11.0 (2022-07-12)

Added

  • UI: Show license info on cluster status page.
  • UI: Show error details when invocation page fails to load.

Fixed

  • UI: Fix bug where large test suites would not load.
  • UI: Fix bug where invocations would sometimes not load or hang.

v2.10.0 (2022-07-07)

Incompatible

  • The --incompatible_track_availability_zone flag has been flipped, which makes this release incompatible with v2.3.0 and earlier. If you are still running one of those versions, please upgrade to v2.9.0 before upgrading to this release.
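
The upgrade rule in this note can be sketched in Python as follows. This is a minimal illustration that encodes only the rule stated here (clusters on v2.3.0 or earlier go through v2.9.0 first). It ignores requirements from other releases, such as the full cluster restart for v2.0.0, and `parse_version` and `upgrade_path` are hypothetical helper names.

```python
def parse_version(tag):
    """Turn a tag such as 'v2.3.0' into a comparable tuple (2, 3, 0)."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


def upgrade_path(current, target="v2.10.0"):
    """Return the releases to deploy, in order, per the note above."""
    needs_stepping_stone = (
        parse_version(current) <= (2, 3, 0)
        and parse_version(target) >= (2, 10, 0)
    )
    return ["v2.9.0", target] if needs_stepping_stone else [target]


print(upgrade_path("v2.2.1"))  # ['v2.9.0', 'v2.10.0']
print(upgrade_path("v2.9.0"))  # ['v2.10.0']
```

Comparing integer tuples rather than strings matters here: as strings, "v2.10.0" would sort before "v2.9.0".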

Added

  • UI: Allow downloading mTLS client certificates from the UI. The CA is configured server-side using --tls_trusted_key and --tls_trusted_certificate.

Fixed

  • UI: Frontend would not show invocation or invocation search pages and instead presented an error.
  • UI: Elements in the UI would overlap each other in unexpected ways.
  • UI: Ensure fetching the log is not aborted prematurely.
  • UI: More clearly surface how many tests failed.
  • UI: Fix cluster status page hanging on load.

v2.9.0 (2022-06-28)

Incompatible

  • macOS: Workers no longer wait until at least one Xcode is available before accepting work.
  • The --advertised_port_offset flag is now a no-op.

Deprecated

  • macOS: Discovering available Xcode versions no longer relies on xcode-locator subprocess.

Added

  • auth: Add support for embedding permissions directly into mTLS client certificate.

Fixed

  • UI: Order test suites and test cases by status, listing failures first.
  • UI: Better reporting of aborted invocations.
  • UI: Ensure full-screen views are always scrollable.
  • UI: Expose if number of test cases does not match reported number of tests.
  • goma: Reduce impact of rate limiter on action execution.

v2.8.0 (2022-06-01)

Incompatible

  • The --incompatible_ignore_legacy_node_properties flag is now a no-op.

Added

  • Goma: Add metrics for how long requests are delayed.
  • Goma: Add metrics for client errors.
  • RE: Add digest of primary output to server-side profile.
  • Add support for TLS 1.3

Fixed

  • Goma: Enable RPC request rate limiter by default.

v2.7.2 (2022-05-24)

Cherrypicks

  • Fix INTERNAL error on certain remote persistent worker error conditions.

v2.7.1 (2022-05-23)

Cherrypicks

  • Docker containers were not getting reused. This was causing a performance hit.

v2.7.0 (2022-05-17)

Added

  • Monitoring: the new com.engflow.instance/gc_avg_duration metric shows the average duration spent in Java garbage collection since the last reported metric.

Fixed

  • Results UI: fixed an issue where some server-side profiles failed to load, returning an UNKNOWN error and HTTP 500.
  • Results UI: In the build status bar, cached builds were not included in the completed builds, and were categorized as "to build". This is now fixed.
  • macOS: Improved error message for actions failing because of too long command lines.
  • Results UI: fixed login redirection vulnerability that could lure the victim to the attacker's page.
  • AWS, MacOS AMI: install service that reaps the symbols cache to avoid filling up the disk.

v2.6.3 (2022-05-05)

Fixed

  • Fix loading some EngFlow profiles.

v2.6.2 (2022-05-03)

Fixed

  • UI: The login page and other elements were misaligned.

v2.6.1 (2022-05-03)

Fixed

  • UI: Navigating between pages sometimes crashed the frontend.

v2.6.0 (2022-05-03)

This release changes the database schema. In clusters that have it enabled, this results in an empty database after the upgrade.

Incompatible

  • The --incompatible_named_default_pool flag is now a no-op.
  • The --docker_use_image_id flag is now a no-op.

Added

  • Goma: added ability to limit concurrent connections. This should help avoid OOMs when clients upload a lot of input files.

Changed

  • Eventstore options are no longer experimental.
  • Improved error messages when server TLS certificates are not in the expected format.
  • Moved user settings to the side bar.
  • Improved styling of the cluster status and licenses pages.

Fixed

  • UI: Don't show workspace status chips (repo, branch, commit) on the invocation page if they are empty.
  • GCS: Correctly propagate errors when reading / writing events.
  • Fix rare case of schedulers losing track of workers when an incoming execute request is cancelled (previously, the worker was recovered after a timeout).
  • The service now retries execution requests internally in some cases to reduce the likelihood of build failures in clusters with very large worker machines and auto-scaling.
  • Fix popups to display outside their parents.
  • Fix display of the licenses page.
  • Fix display of failed tests on the invocation overview page.
  • Fix handling of timeouts for remote persistent workers - such actions were incorrectly always retried (independent of the retry policy).
  • Fix handling of test.xml reports that only specify a 'status' attribute.

v2.5.2 (2022-04-28)

Cherrypicks

  • Fix NPE when proxy-replaying an event stream from another scheduler (part 2).

v2.5.1 (2022-04-26)

Cherrypicks

  • Fix NPE when proxy-replaying an event stream from another scheduler.
  • Fix multicast discovery.

v2.5.0 (2022-04-13)

Incompatible

Added

  • Record time to create output tree in server-side profiles.

Changed

  • UI: Make Invocation page navigation horizontal

Fixed

  • macOS: Fix chronyd installation.
  • UI: Fix bug causing page to sometimes hang or not load.

v2.4.1 (2022-04-05)

Cherrypicks

  • MacOS AMIs: Fix chronyd installation by preventing brew from running as root.
  • UI: Fix potential issue with S3 data writing.

v2.4.0 (2022-04-05)

Added

  • Goma support for pushing metrics to Stackdriver.
  • Goma logs are uploaded to Cloudwatch.

Changed

  • Increase timeout for EngFlow Free to come online.

Fixed

  • Reduced redundant logging when invocation streams cannot be found.
  • Don't trigger alarms when the test.xml output file for a test is not uploaded from Bazel to the BES.
  • AWS: Cloudwatch metric reporting is more fault tolerant.
  • UI: Include more information in build UI stack traces included in the JavaScript console.
  • UI: Users can now change their timezone in Chrome.
  • UI: Improve various error messages in the UI to include more detail.

v2.3.4 (2022-03-29)

Cherrypicks

  • goma: add --exec-timeout to control execution timeout.
  • macOS: fix chrony installation script.

v2.3.3 (2022-03-28)

Cherrypicks

  • goma: add flags to control concurrency for storing action inputs and outputs.

v2.3.2 (2022-03-28)

Cherrypicks

  • Fix stack overflow on large invocations.
  • Correctly propagate max gRPC message size flags to the internal gRPC service.
  • Fix failure to report disk metrics.

v2.3.1 (2022-03-23)

Cherrypicks

  • Fix disk usage metrics not being reported and producing log spam.

v2.3.0 (2022-03-16)

Deprecated

  • Config: --enable_target_tree is now a no-op and will be removed in a future release. The flag is set to true by default.

Fixed

  • UI: Fix infinite loop on some prefix filters.
  • UI: clarify icon indicating that items can be opened.
  • GCP: Improve error reporting for "Partial findMissingBlobs failure".
  • Fix error when loading an invocation page corresponding to a server-side profile.
  • Fix "Async Stream not found" error to avoid automatically assuming this is a severe error.
  • Fix Docker service shutting down before worker service.

Added

  • GCS: Added com.engflow.storage.read/time_to_first_byte and com.engflow.storage.read/time_per_gb metrics.
  • UI: Improve error message when test.xml file is empty.
  • UI: Added analysis failure status for targets.
  • UI: Add a button to download test.xml.
  • UI: Improve UX around downloading test assets.
  • UI: Highlight matching items when searching in the target view.
  • Config: Turn --experimental_google_client_id into --google_client_id.

v2.2.1 (2022-03-08)

Cherrypicks

  • Fix NPE in EventStoreProfileGenerator
  • Permit more frequent keepalive times (10s)
  • Fix race condition in external storage GC causing INTERNAL error
  • Fix uncaught exception in UI when opening an invocation stream for profiling events
  • Install chronyd on macOS images

v2.2.0 (2022-03-03)

Incompatible

  • config: aws and gcp are no longer valid options for --external_storage. Use s3 or gcs instead.

Deprecated

  • config: --experimental_actions_execution_attempts is now a no-op and will be removed in a future release.
  • config: --experimental_gcs_direct_upload, --incompatible_reduce_memory_use, --enable_status_page, and --experimental_per_executor_dirs are now no-ops and will be removed in a future release.

Fixed

  • UI: Allow downloading (partial) EngFlow profiles while builds are still running.
  • UI: Remove links to Bazel command-line options for non-release versions.

Added

  • UI: Warn users if their upload strategy prevents Bazel profiles from being uploaded.
  • Add --incompatible_track_availability_zone, which changes the serialization format for one of the types we share between machines. This can be safely deployed while the cluster is running as long as all nodes are running at least v2.2.0. Do not enable when there are nodes that run an earlier version, or at the same time as upgrading to v2.2.0 (or later). We plan to flip this flag in v2.10.0.
  • docker: Record container start time in EngFlow profile.
  • macOS: Add /var/folders/wp to the allowlist for sandboxed actions.

Removed

  • metrics: com.engflow.re.cas/total_size, com.engflow.re.cas/total_replica_size, com.engflow.re.exec/running_actions, com.engflow.re.exec.docker/containers_created, com.engflow.re.exec.docker/container_creation_failed, and com.engflow.re.exec.docker/containers_destroyed are no longer reported.

v2.1.1 (2022-02-22)

Fixed

  • Make deletion of action execroots faster.

v2.1.0 (2022-02-18)

Changed

  • The Build Event Service is enabled by default (disable with --enable_bes=false).
  • The event store options are no longer experimental. Note that the on-disk location should now be controlled with the separate flag --event_disk_path rather than reusing --event_blobs_root for this purpose.
  • The unnamed worker pool is now called default (enable --incompatible_named_default_pool by default).
  • Improve the Bazel first-time setup instructions.

Fixed

  • Fix server-side profiles to correctly show all action attempts.
  • UI: fix parsing of GitHub URLs.
  • UI: fix icon titles.
  • UI: fix page-up/page-down keys in the console.
  • MacOS: fix repeated warnings about /proc/meminfo.
  • Correctly return NOT_FOUND instead of INTERNAL when an invocation could not be found.
  • Avoid action failures when uploading a file to secondary storage returns an error.

v2.0.0 (2022-01-13)

This release requires a full cluster shutdown and restart. Due to changes of the default settings for a number of incompatible flags, pre-2.0.0 instances may return errors when communicating with instances running 2.0.0 or later and vice versa.

Otherwise, this release is intentionally small to reduce the upgrade risk. In particular, we did not remove deprecated flags and metrics in 2.0.0 (except as noted below); they will be removed in a later release.

Added

  • Support TLS 1.3 for the UI and gRPC APIs.
  • Automatic Garbage Collection for External Storage.
  • Added an inline Profile Viewer.
  • Print warnings when using deprecated command-line flags.
  • UI: the invocation page sidebar can be navigated by keyboard.
  • UI: show 'View Logs' button for test logs.

Fixed

  • Fix permission denied when deleting an exec tree with unexpected mod bits.
  • UI: Some non-existent pages returned 404 (not found) for unauthenticated users; they now return 403 (unauthenticated). This was a potential information leak (benign).
  • UI: correctly show NOT_FOUND for missing nodes.
  • UI: console correctly uses the full height when maximized.
  • UI: fix color for the timezone selector.
  • UI: prevent horizontal overflow in the test view.
  • UI: improve performance of loading large console logs.
  • API: correctly return NOT_FOUND for calls to the results store.

Removed

  • Remove all s3-specific metrics com.engflow.re.storage.s3/*.
  • Remove all gcs-specific metrics com.engflow.re.storage.gcs/*.

Incompatible

  • Flipped --incompatible_reduce_memory_use.

v1.58.9 (2022-02-08)

Cherrypicks

  • Trigger new release due to flaky errors.

v1.58.8 (2022-02-07)

Cherrypicks

  • goma: consider include path when deriving common input/output prefix.

v1.58.7 (2022-02-03)

Cherrypicks

  • Log end-to-end build times from BES.

v1.58.6 (2022-01-28)

Cherrypicks

  • Work around OpenJDK 11.0.14 bug: ignore Host header in http2 requests.
  • ResultStore/GetTarget: respond with NOT_FOUND instead of INTERNAL.

v1.58.5 (2022-01-27)

Cherrypicks

  • Install Xcode 13.2.1 and cmd-line 13.2 on macOS

v1.58.4 (2022-01-07)

Cherrypicks

  • Fix compilation errors due to bad cherrypicks

v1.58.3 (2022-01-06)

Cherrypicks

  • UI: don't fail with INTERNAL error when target tree node is not found
  • UI: don't fail with Chunk too large
  • Profiler: Record retry attempts during input fetching for better profiling
  • Fix com.engflow.re.storage.existence_cache/* stats

v1.58.2 (2021-12-21)

Fixed

  • Fix profile download links.

v1.58.1 (2021-12-20)

Fixed

  • Fix incorrect timezone list.

v1.58.0 (2021-12-17)

We are preparing for a 2.0.0 release in early 2022. To reduce the amount of changes going into that, we have proactively flipped a few flags that were intended for 2.0.0 and that do not require a full cluster restart. We have already enabled these flags on all managed clusters without any issues.

Incompatible

  • Enable --incompatible_remove_symlink_execroot_strategy by default; this removes the symlink exec root strategy, which was never used in production due to being incompatible with dynamically linked binaries.
  • Enable --incompatible_keep_relative_argv0 by default; this makes the lookup of commands that use a relative command line consistent with POSIX shell lookup (including PATH), and is required by all remote execution clients that we are aware of being used in production.
  • Enable --incompatible_no_storage_backend_metrics by default; this removes a few deprecated metrics related to storage.

New

  • Server-side profiles now contain per-action input tree stats.
  • The --incompatible_named_default_pool flag changes the meaning of the Pool platform option, and allows selecting the default (unnamed) execution pool.
  • Add a dockerUseEntrypoint boolean platform option to disable use of the docker image entrypoint on a per-action basis.
  • Add --incompatible_strict_digest_verification to enable strict validation of digests across all API calls, superseding --incompatible_batch_read_blobs_verifies_digests; both will be enabled by default and removed after the 2.0.0 release.
  • UI: support resizing the tree view.
  • UI: show SCM status (if received from the workspace status command).
  • UI: add a timezone selector to the settings.

Changed

  • Increase the default gRPC max message size to 20mib to reduce issues with uploading large build events.
  • HTTP cookies now use SameSite=Lax to avoid requiring login every time a user follows an external link into the UI, e.g., from CI.
  • Reduce worker service memory footprint.
  • UI: improve consistency and usability.
  • UI: improve the ordering of targets in the overview tab.
  • UI: show local / remote build status icon.
  • UI: use ISO 8601 dates and 24-hour format by default.
  • UI: update target status icons for improved consistency and readability.
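
For readers unfamiliar with the SameSite=Lax cookie attribute mentioned in the list above, here is a small Python sketch using the standard library's `http.cookies` module. It only shows what such a cookie header looks like; the cookie name and value are hypothetical and say nothing about the cookies the EngFlow UI actually sets.

```python
from http.cookies import SimpleCookie

cookie = SimpleCookie()
cookie["session"] = "opaque-token"  # hypothetical cookie name and value
cookie["session"]["path"] = "/"
# Lax: the cookie is sent on top-level navigations from other sites
# (for example, following a link from a CI page), but not on
# cross-site subrequests such as embedded images or frames.
cookie["session"]["samesite"] = "Lax"

header_value = cookie["session"].OutputString()
print(header_value)
```

With SameSite=Strict the browser would drop the cookie when a user arrives from an external link, which is what forces a fresh login in that scenario.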

Fixed

  • Fix the affinity-based scheduler to take the absolute input root into account if set; this reduces docker container restarts and improves build performance.
  • GCP: images now use gcr instead of gcloud to authenticate docker operations with GCR, which is more reliable.
  • UI: fix linebreaks in the displayed build command line.
  • UI: fix issue where in-progress builds don't render correctly.
  • UI: deduplicate target configuration information.
  • UI: fix critical path display to use timestamp order (don't sort by length).

v1.57.4 (2021-12-15)

Cherrypicks

  • Fix performance regression in CAS downloads. This reverts a bugfix of --log_level, so the finest supported log level is again INFO.

v1.57.3 (2021-12-09)

Cherrypicks

  • Goma: avoid data corruption by resetting buffer upon download retrial

v1.57.2 (2021-12-08)

Cherrypicks

  • Server-side profile: add input_tree_stats to action details

v1.57.1 (2021-12-01)

Cherrypicks

  • Fix release pipeline

v1.57.0 (2021-11-30)

Incompatible

  • Boolean-type flags now enforce their value to be true or false. Previously any value other than the literal true was parsed as false; from now on this is an error.
  • Enable --split_cluster_name by default. If you're currently not setting this flag, make sure schedulers have a tag named engflow_re_scheduler_name with the same value as engflow_re_cluster_name.
  • --experimental_cas_check_storage_only is now a no-op.
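
The change to Boolean flag parsing can be sketched as follows. This is only an illustration of the behavior described above, not the service's actual parser; the function names are hypothetical, and treating the match as case-sensitive is an assumption.

```python
def parse_bool_legacy(value: str) -> bool:
    # Old behavior: anything other than the literal "true" silently becomes False.
    return value == "true"


def parse_bool_strict(value: str) -> bool:
    # New behavior: only "true" and "false" are accepted; anything else is an error.
    if value == "true":
        return True
    if value == "false":
        return False
    raise ValueError(f"expected 'true' or 'false', got {value!r}")
```

Under the old rule a typo such as `yes` or `1` would silently disable a flag; under the strict rule it is reported at startup, so it is worth checking existing configs for such values before upgrading.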

New

  • With --http_public_port, you can set a different port for HTTP requests than for gRPC (--public_port).
  • Free tier now supports the EventStore API.
  • Free tier now supports server-side profiles.
  • UI: Enabled target tree by default.
  • UI: Allow users to expand information-dense cards to full-screen.
  • UI: Follow the end of the console while loading.
  • UI: Allow users to filter the target tree by prefix or status.
  • UI: Added Overview tab to help quickly identify build issues.

Changed

  • Workers now create one one-core executor per available CPU core instead of just one one-core executor.
  • Improved compatibility of server-side profiles with Perfetto UI.

Fixed

  • Removed limit of concurrent connections from free tier.
  • Improved parsing of critical path in Build and Test UI.

Deprecated

  • --split_cluster_name is deprecated and will be removed in the next release.

Security

  • Removed debugging HTTP endpoints from Goma.
  • Restrict frame-ancestors from Content-Security-Policy.

v1.56.1 (2021-11-10)

Cherrypicks

  • AWS/GCP images: revert back to Debian 10

v1.56.0 (2021-11-08)

Changed

  • AWS/GCP images now use Debian 11
  • Logging: FindMissingBlobs is now less chatty
  • Docker container reuse is now enabled by default. To opt out for the entire cluster, set --docker_allow_reuse=false; to opt out for individual actions (or builds), set dockerReuse=False on them. If you want to opt out the entire cluster, we recommend setting --docker_allow_reuse=false before you upgrade. This change also switches all actions to separate docker run and docker exec invocations. If that causes problems, you can temporarily opt out the entire cluster by setting --docker_split_exec_run=false. Note that we plan to deprecate that option; please let us know if you do set this flag.

Fixed

  • --worker_config now correctly handles configurations with more than 2 GB of RAM per executor

Security

  • The HTTP UI now returns various security-related HTTP headers like Content-Security-Policy and X-Frame-Options by default, to prevent a number of attack scenarios (see --strict_http_headers)
  • Action inputs can now be absolute symlinks

v1.55.0 (2021-10-28)

Added

  • Docker pull times to EngFlow profile
  • --grpc_max_message_size flag to control gRPC max message size
  • Log messages regarding lost or corrupted CAS files

Changed

  • All target tree features are now controlled only by --enable_target_tree
  • --enable_status_page is now true by default
  • --principal_based_permissions now defaults to [] to restrict data access by default

Fixed

  • Race condition that would cause the PublishBuildToolEventStream gRPC call to fail
  • Linux ficlone call for creating action inputs
  • Basic auth using web browsers

Removed

  • Experimental flags related to the target tree

Security

  • Reduced surface area for phishing attacks

v1.54.1 (2021-10-19)

Cherrypicks

  • Fixed missing working directory for the cached docker strategy

v1.54.0 (2021-10-18)

Added

  • Exposed EventStore endpoint over gRPC
  • Tests parsed from test.xml are shown hierarchically

Changed

  • Server-side profiling is now always enabled when the BES is enabled; disable with --profile_to_event_store=false

Fixed

  • Fix CAS capacity accounting during recovery
  • Wait for CAS metadata writes after file upload; fixes file missing errors when no external storage is configured and build-without-the-bytes is enabled

Security

  • Require explicit principal permissions to be set when accessing HTTP endpoints
  • Patched various low-risk vulnerabilities

v1.53.1 (2021-10-15)

Cherrypicks

  • Don't run actions twice with the "cached docker" strategy

v1.53.0 (2021-10-07)

Added

  • Add memory usage and garbage collection metrics
  • UI: Add profile picture and user menu

Fixed

  • GCP, AWS: fix logging issues causing stuck instances
  • Fixed a bug where some metrics were not reported
  • Fixed crash on startup when --cas_path is undefined
  • UI: Fix broken alert bar
  • Improved error handling during rapid cluster size changes

Changed

  • Report metrics with granular counts of incoming actions
  • When --external_storage is enabled, --experimental_opportunistic_cas is now treated as true; previously you had to enable the flag explicitly. We no longer recommend setting --experimental_opportunistic_cas at all: when --external_storage is disabled, it's safer to use --experimental_force_lru instead
  • client_auth=gcp_rbe: Clients with "remotebuildexecution.blobs.create" permission can now also upload Build Event Streams. Previously such requests failed because (as of 2021-09-29) GCP has no permissions to control Build Event Stream uploads
  • Docker: Respect memory limit provided by --worker_config
  • Enable --experimental_per_executor_dirs by default
  • Increase default --max_batch_size to 4mb

Removed

  • Remove --experimental_profile_dir in favor of --experimental_profile_to_event_store.

v1.52.6 (2021-10-05)

Cherrypicks

  • Minor bugfixes

v1.52.5 (2021-10-05)

Cherrypicks

  • Prevent GCP logging problems from breaking the scheduler process.

v1.52.4 (2021-10-01)

Cherrypicks

  • Fixed bug propagation caused by Hazelcast errors

v1.52.3 (2021-09-29)

Cherrypicks

  • Minor bugfixes

v1.52.2 (2021-09-28)

Cherrypicks

  • Fixed a bug where long log messages would cause schedulers to hang
  • Reduce log spam
  • Fixed a bug where some metrics would not be reported

v1.52.1 (2021-09-22)

Cherrypicks

  • Fixed crash on startup when --cas_path is undefined

v1.52.0 (2021-09-20)

Changed

  • Add invocation IDs to logging and errors

Fixed

  • Various concurrency bugfixes for multi-scheduler clusters
  • Improve logging around failed gRPC calls
  • Reduce log spam for the gRPC NOT_FOUND response code
  • Improve macOS worker support
  • Fixed EngFlow internal profiling when running multiple schedulers

v1.51.2 (2021-09-14)

Cherrypicks

  • Minor bugfixes

v1.51.1 (2021-09-13)

Cherrypicks

  • Fix release pipeline

v1.51.0 (2021-09-08)

Added

  • CAS: Workers now pick up existing files from the CAS directory. It's no longer necessary to delete this directory after a worker is restarted. If this behavior breaks something, use --recover_cas_blobs=false and let EngFlow know.
  • Add metrics for inbound BEP events:
    • com.engflow.eventstore/new_inbound_stream
    • com.engflow.eventstore/new_inbound_bep_event
    • com.engflow.eventstore/new_outbound_bep_event
    • com.engflow.eventstore/new_outbound_stream
    • com.engflow.eventstore/ongoing_streams
  • The build results UI can now authenticate users with Google's login page. See --http_auth=google_login.

Changed

  • Enable --upload_outputs_on_failure by default.
  • UI: Update branding.
  • AWS, dashboard module: Reduce window size from 300s to 60s.
  • GCP, Terraform files: Move service accounts into a Terraform module.
  • The new default of --http_auth is deny. Make sure you override this flag as needed.
  • EngFlow .deb installer: depends on the full OpenJDK, not just the JRE

Fixed

  • Fixed a race condition with BES and multiple schedulers.
  • Use exec-root of executor when starting reusable docker containers. This fixes a bug causing containers not to start if the user has no permission to access the container's default workdir (e.g. when setting it to /root).
  • Do not proxy errors to cancelled client streams.
  • UI: Display correct status icon in target list view.
  • Fix race condition during BES live replay.

Deprecated

  • --docker_use_path is a no-op; please use --incompatible_keep_relative_argv0 instead.
  • --docker_use_addgroup is no longer supported.

v1.50.6 (2021-08-31)

Cherrypicks

  • Fixed unbounded thread creation

v1.50.5 (2021-08-24)

Cherrypicks

  • Added metrics around BES upload:
    • com.engflow.eventstore/new_inbound_stream
    • com.engflow.eventstore/new_inbound_bep_event
    • com.engflow.eventstore/new_outbound_bep_event
    • com.engflow.eventstore/new_outbound_stream
    • com.engflow.eventstore/ongoing_streams

v1.50.4 (2021-08-24)

Cherrypicks

  • Fix race condition during BES live replay

v1.50.3 (2021-08-05)

No change. Just re-triggering the release.

v1.50.0 (2021-08-05)

Added

  • Docker: the new --docker_default_network_mode flag controls the default value of "dockerNetwork" (when the client doesn't request any).
  • macOS: actions run with sandboxing if --experimental_allow_mac_sandbox is enabled

Deprecated

  • We've removed support for Java 8 and Ubuntu 16.04.

v1.49.0 (2021-07-12)

Added

  • Linux: support file cloning for file systems that support it
  • Docker: add a flag to enable Docker signature verification (--docker_content_trust)

Changed

  • Switch to React for the cluster status page (--enable_status_page)
  • macOS: packages now contain the process-wrapper binary, which can enforce proper shutdown of actions
  • GCP: various improvements to the Terraform configuration (enable Stackdriver integration by default, enable shielded VMs by default, add dashboard module)

Fixed

  • Profiling: fixed a hang when downloading large server-side profiles
  • Profiling: fixed a hang when downloading an unfinished server-side profile
  • Persistent workers: action timeouts are now properly enforced

v1.48.0 (2021-06-23)

Changed

  • Improved documentation around persistent workers.

Fixed

  • Reliability: Internal "Connection reset" calls no longer trigger INTERNAL gRPC errors.

v1.47.0 (2021-06-08)

Changed

  • AWS: The packer config in our release (base-image.json) now installs the AWS SSM agent.

Fixed

  • AWS: fix instance id retrieval on IMDSv2.
  • Reliability: The eviction policy on the action cache could previously cause long-running scheduler services to crash, even if all instances are individually restarted (the schedulers automatically replicate entries from removed instances).

Deprecated

  • The com.engflow.re.scheduler/available_workers metric is deprecated. We recommend using the new com.engflow.re.scheduler/existing_executors metric instead.

v1.46.1 (2021-05-25)

Cherrypicks

  • Fixed a stack overflow bug that caused schedulers to crash.

v1.46.0 (2021-05-24)

Changed

  • Updated the Bazel process-wrapper that is used by workers to isolate actions

Fixed

  • GCP logging is no longer enabled, even with the flag set, if an instance detects that it is running outside of GCP.
  • Persistent workers will be restarted if the kernel kills the action process. This mitigates the risk of leaking processes on poorly behaving actions.

v1.45.1 (2021-05-14)

Cherrypicks

  • Deployment kit, Dockerfile: fix v1.45.0 regressions

v1.45.0 (2021-05-13)

Deprecated

  • --auto_worker_expiration and --docker_use_init are now both no-ops. They were enabled by default in v1.31.0 and are now always on.
  • Deployment kit, Kubernetes: deleted the obsolete setup files (gen-k8s-config.py and templates/ directory); updated the documentation about the current Kustomization-based setup

Added

  • --experimental_async_storage_uploads: when enabled, the service no longer waits for async uploads to complete. This should improve performance in cases where such uploads are slow.

Changed

  • Deployment kit, Debian package: the package no longer "Depends" on OpenJDK; it now "Recommends" OpenJDK's JRE. This lets you skip installing that Java runtime, and use a different runtime. The dockerfile has a --build-arg to control that (see below).
  • Deployment kit, GCP Terraform file: added firewall rule to allow health checks; listen on port 443 instead of 8080; increase scheduler disk size
  • Deployment kit, AWS Terraform file: added use_s3 variable; can generate a random S3 bucket name; cluster_name is customizable
  • Deployment kit, engflow.Dockerfile: made it configurable via --build-arg; installing the JRE and Docker is now optional
  • CloudWatch: log stream names now show the machine's role (scheduler or worker) and are easier to read

Fixed

  • S3 / GCS: intermittent errors are now reported as UNAVAILABLE, not as INTERNAL error
  • gRPC / netty: closed channels are now reported as UNAVAILABLE error

v1.44.1 (2021-05-07)

Cherrypicks

  • Fixed an error where new nodes were unable to join the cluster due to third-party library incompatibilities

v1.44.0 (2021-04-30)

Added

  • Authentication: added a deny mode that denies all incoming requests for the --client_auth and --http_auth options

Changed

  • The --gcs_credentials flag is no longer deprecated
  • AWS deployment configuration: added more alerts

Fixed

  • The status web page is no longer available on the private scheduler port, only on the public port
  • Handle premature exit of the persistent worker process; these are now automatically retried and provide a better error message

v1.43.0 (2021-04-21)

Changed

  • Validate that --cas_path points to a writeable directory
  • Kubernetes: improve configuration to be less dependent on cluster config

Fixed

  • Fixed an error when the client sends an empty byte stream
  • Fixed basic auth documentation

v1.42.0 (2021-04-12)

Added

  • Kubernetes: you can override the default Kubernetes master address with --k8s_master. Normally this should not be necessary, except if you see discovery problems.
  • --worker_config now accepts auto, meaning to create 1 executor that uses all available cores.
  • Actions now log how many output files (and total bytes) they uploaded to the CAS. (Only when replication is enabled.)

Changed

  • Kubernetes: added Dockerfile; added affinity rules to the on-prem Kustomizable overlay

Fixed

  • Fixed Mac release packages that were broken since v1.38

v1.41.1 (2021-04-08)

Cherrypicks

  • Fixed Mac release packages

v1.41.0 (2021-04-07)

Added

  • Kubernetes: new and improved Kustomization-based deployment templates for K8s

Changed

  • Docker: forward env variables for AWS credentials as well as DOCKER_HOST to Docker invocations; this supports setups other than the default Docker socket
  • Added more metadata and failed actions to the server-side profile (see --experimental_profile_dir)

Fixed

  • Safeguarded against undeletable files when reusing exec roots
  • Fixed tracking of CAS file locations

v1.40.0 (2021-03-31)

Added

  • S3: The --incompatible_s3_use_structured_paths flag changes the directory structure, making blob access faster. This is an incompatible change: enabling the flag means the cluster won't find the old bucket content.

Changed

  • Docker FIFO creation: report stderr on failure
  • Docker internal retry: print if stderr was empty
  • S3: we now support more than 50 concurrent connections; see --external_storage_worker_threads and --external_storage_scheduler_threads
  • S3: retry failed downloads
  • S3: set IOException cause for generic errors, so error logs are more detailed

Fixed

  • AWS deployment kit: fix use of list option
  • AWS deployment kit: enable instance_refresh in the Terraform config
  • Docker "OCI runtime exec failed": fixed the --experimental_docker_internal_error_stderr_pattern semantics (added in v1.31), we now correctly retry such actions.
  • Docker: check container after every non-zero exit. This should help with containers that become unusable, e.g., due to a docker daemon restart.
  • Persistent workers: fixed the bug where workers sometimes failed to start, printing execution failed INTERNAL: Bad response from worker:

v1.39.0 (2021-03-23)

Added

  • Added an experimental server-side profiling implementation (see --experimental_profile_dir)

Changed

  • Logging: log average download rate per storage location; look for 'timing' in the worker logs
  • Enable recursive output tree action cache verification by default; previously, the action cache could return cache entries with output files that were no longer available in the CAS, breaking Bazel's build-without-the-bytes mode (--experimental_check_action_cache_recursively)
  • S3 / GCS: use 50 threads by default on workers and remove upper limit (50) on S3

v1.38.5 (2021-04-21)

Cherrypicks

  • Fix release package build

v1.38.4 (2021-04-21)

Cherrypicks

  • Fix release package build

v1.38.3 (2021-04-20)

Cherrypicks

  • Enable the recursive AC check by default
  • Check existence of tree blob
  • Clean exec root if input tree creation fails
  • Force-add replicas to the location map
  • Fixed Mac release packages

v1.38.2 (2021-03-25)

Cherrypicks

  • Add --incompatible_s3_use_structured_paths to use structured paths in S3, which may significantly improve performance under high load

v1.38.1 (2021-03-23)

Cherrypicks

  • Correctly propagate metadata for internal CAS download calls

v1.38.0 (2021-03-22)

Added

  • IPv6 support: added --docker_ipv6_cidr and --docker_ipv6_subnet_length to configure the IPv6 subnets for Dockerized actions

Changed

  • The service now returns an error for HTTP/1.X connections to the gRPC port
  • AWS: Improved deployment templates
  • File downloads are retried internally if there are more copies in the distributed CAS
  • If --experimental_per_executor_dirs is enabled, actions are always run in a deterministically-named directory
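
The internal download retry mentioned above can be pictured roughly as follows: if one copy of a blob cannot be fetched, the next known copy is tried before the error is surfaced. This is a minimal sketch under stated assumptions; `download_with_fallback`, the `fetch` callback, and the replica names are hypothetical and do not reflect the service's internal API.

```python
from typing import Callable, Iterable


def download_with_fallback(
    digest: str,
    replicas: Iterable[str],
    fetch: Callable[[str, str], bytes],
) -> bytes:
    """Try each replica in turn and return the first successful download."""
    last_error = None
    for replica in replicas:
        try:
            return fetch(replica, digest)
        except OSError as err:
            # This copy is unavailable; remember why and try the next one.
            last_error = err
    raise FileNotFoundError(f"no replica could serve {digest}") from last_error
```

The caller only sees a failure once every known copy has been tried, which is the point of retrying internally rather than returning the first error to the client.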

Fixed

  • IPv6 support: Dockerized actions run with an IPv6 localhost if IPv6 is enabled
  • Fix crash when enabling --experimental_per_executor_dirs in a cluster that has files in the work directory
  • Fix protocol error when a client attempts to execute an action with a lot of missing files
  • Fix reuse of Docker containers between persistent worker and normal actions

v1.37.4 (2021-03-22)

Cherrypicks

  • S3: do not force absolute blobs root; clarify requirements in documentation

v1.37.3 (2021-03-18)

Cherrypicks

  • S3: sanitize blobs root; add logging

v1.37.2 (2021-03-10)

Cherrypicks

  • Fix action cache recursive output directory check

v1.37.1 (2021-03-10)

Cherrypicks

  • Fix worker startup script

v1.37.0 (2021-03-09)

Added

  • Added a flag to support S3-compatible storage services like MinIO (--s3_endpoint)
  • Added an experimental option to force actions into specific pools by action mnemonic (--experimental_force_mnemonic_pool_name); note that this requires the client to send action mnemonics using the recently updated metadata proto

Changed

  • The --use_upload_to_rereplicate flag is now a no-op. Please remove it from your configs.
  • CloudWatch: --experimental_cloudwatch_no_instanceid, --aws_instance_id, --single_instance_monitoring, and --experimental_single_instance_monitoring are now no-op flags. Please remove these from config files. Instances always behave as if --experimental_cloudwatch_no_instanceid=true.

Fixed

  • Persistent workers: correctly use the relative working directory to look up parameter files and run workers; this is needed for Bazel @ HEAD to work
  • Action cache: fix handling of output directories to avoid returning stale action cache entries - this could cause Bazel client errors if build-without-the-bytes is enabled
  • Action execution: added a flag to stop absolutizing argv[0]; this could cause errors with hermetic C++ toolchains outputting absolute paths to .d files and failing Bazel's consistency checks (--incompatible_keep_relative_argv0); this will be enabled by default in a future release; note that this may break some builds that were relying on this (also see the Bazel issue https://github.com/bazelbuild/bazel/issues/13189)

v1.36.0 (2021-02-26)

Added

  • Added a flag to use per-executor working directories (--experimental_per_executor_dirs=true)
  • Added a flag to pass the executor id to local actions through an env variable (--experimental_local_provide_executor_id=true, ENGFLOW_EXECUTOR_ID)
  • Added a platform option to control the exec root strategy; this can be used to switch between the default, fast hardlink strategy which does not set file permissions to a copy strategy that sets the file permissions as requested by the client (experimentalActionInputStrategy=copy)

Changed

  • Improved logging for persistent workers
  • Increased default cache duration for CAS existence checks to external storage to 24h and 10 million entries
  • Increased default cache duration for CAS existence checks to the distributed CAS to 120 seconds
  • Limited download concurrency to at most 200 concurrent downloads by default to avoid running out of native memory or file descriptors
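
To make the two cache limits above concrete (a maximum number of entries and a maximum age per entry), here is a minimal sketch of a bounded existence cache with expiry. The class and method names are hypothetical, and evicting the oldest entry first is an assumption; the release notes do not describe the actual eviction order.

```python
import time
from collections import OrderedDict


class ExistenceCache:
    """Remembers recently seen digests, bounded by entry count and entry age."""

    def __init__(self, max_size, ttl_seconds, clock=time.monotonic):
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = OrderedDict()  # digest -> expiry time

    def mark_seen(self, digest):
        # Re-inserting moves the digest to the newest position.
        self._entries.pop(digest, None)
        self._entries[digest] = self._clock() + self._ttl
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)  # evict the oldest entry

    def recently_seen(self, digest):
        expiry = self._entries.get(digest)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            del self._entries[digest]  # entry is too old to trust
            return False
        return True
```

A longer expiry saves more round trips to the storage backend but widens the window in which a blob deleted from that backend is still reported as present.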

Fixed

  • Fixed issue where helper threads could go into a busy loop when Docker containers are reused; this may not result in client-visible build issues but causes high CPU load on the worker instances. This was introduced in 1.32.0 when the default for --experimental_docker_avoid_fifo was flipped

v1.35.1 (2021-02-17)

Cherrypicks

  • ExecutedActionMetadata: fix worker start timestamp

v1.35.0 (2021-02-16)

Added

  • Added a flag to use consecutive TCP/IP ports for internal traffic (--incompatible_use_low_offsets=true)
  • S3: Experimental support for multi-part uploads to handle files larger than 5 GB (--experimental_s3_use_transfer_manager=true)
  • Added a flag to disable participation in the distributed CAS; this is useful for satellite clusters where a few machines are remote to the main cluster (--enable_distributed_cas=false)
  • Logging: added a metric to monitor persistent worker use

Fixed

  • Fixed reporting of timestamps in execution result

Deprecated

  • The --experimental_gcs_direct_upload flag is a no-op. Please remove it from your configs.

v1.34.0 (2021-02-08)

Added

  • AWS: Improved AWS Terraform files in the release package to support dashboards and logging

Changed

  • GCS: use a new code path to upload blobs
  • AWS CloudWatch: --experimental_cloudwatch_no_instanceid=true by default. The --aws_instance_id and --single_instance_monitoring flags are deprecated, please remove them from configs.
  • --storage_range_requests is now a no-op. It has been enabled since v1.30

Deprecated

  • GCP: --gcs_credentials flag is deprecated, please use application default credentials instead

v1.33.0 (2021-02-05)

Added

  • AWS: Support for logging to CloudWatch logs; enable with --remote_logging_service=aws_cloudwatch and --aws_log_group_name=name
  • Execution responses include timestamps for client-side metric collection

Changed

  • Debugging: --keep_exec_directories_for_debugging now retains output files as well
  • Docs: clarify metrics documentation

Fixed

  • Fix --experimental_docker_store_images_in_cas to not cache temporary failures
  • Persistent workers: the service no longer waits for persistent worker processes to shut down, but terminates them forcefully

v1.32.4 (2021-02-05)

Cherrypicks

  • AMD64: fix debian package

v1.32.3 (2021-01-30)

Cherrypicks

  • CloudWatch: fix reporting of cumulative metrics

v1.32.2 (2021-01-29)

Cherrypicks

  • Fix gRPC metrics reporting

v1.32.1 (2021-01-28)

Cherrypicks

  • Fix an invalid name resulting in a SecurityException
  • Revert improved HTTP/1.1 handling; this caused health checks on AWS to fail

v1.32.0 (2021-01-27)

Added

  • macOS: we now release packages for macOS, and added documentation about setting up a basic macOS cluster
  • Docker: Support running containers by ID with --docker_use_image_id; this prevents docker run from attempting to pull the corresponding image
  • Docker: Log time needed to run docker pull
  • Docker: add --experimental_docker_store_images_in_cas to support storing docker images in the CAS to improve performance and reliability
  • AWS CloudWatch: add --experimental_cloudwatch_no_instanceid; when enabled, all machines will report metrics without InstanceId dimension, which makes the metrics aggregatable
  • Google Cloud Storage: implement a faster upload method, activated with --experimental_gcs_direct_upload=true

Changed

  • Improved Packer and Terraform templates
  • GCP: GCP images are more lightweight and boot faster
  • macOS: --xcode_locator now points to /usr/local/bin/engflow/xcode-locator by default; this is where the macOS package installs this binary
  • Running as a service now requires the file /etc/engflow/config to exist
  • Increase the default value of --default_replica_timeout to 24h
  • CAS / AC: reduce traffic to the storage backend (when using AWS S3 or Google Cloud Storage) with the help of a cache for recently seen blobs; you can customize its behavior with --experimental_cas_existence_cache_max_size and --experimental_cas_existence_cache_expiry
  • Docker: enable --experimental_docker_avoid_fifo by default for compatibility with gVisor
  • Docs: show metric units and aggregation type in the documentation

Fixed

  • CAS.batchReadBlobs can now respond with INVALID_ARGUMENT for invalid digests. This is an incompatible bugfix and it's disabled by default; enable it with --incompatible_batch_read_blobs_verifies_digests=true
  • Attempting to connect to the cluster via a HTTP/1.1 connection now returns a HTTP/1.1 error reply rather than simply closing the connection
  • CAS: Fix off-by-one error when replicating in the distributed CAS; previously, the cluster created one replica more than requested
  • Metrics: The com.engflow.re.cas/available_space metric is now clamped at zero; previously it was possible for it to temporarily dip below zero while running GC
  • Enable --use_upload_to_rereplicate by default; this fixes a rare mutual deadlock condition when two workers simultaneously attempt to upload files to each other
  • Upgrade gRPC library, which fixes an issue with slow up- and downloads on high-latency connections; unfortunately, we had to disable --experimental_log_unavailable_rpcs during the upgrade
  • Restore compatibility with Java 8
  • CAS: fix a potential scenario where the service could write an incomplete file to Google Cloud Storage
  • Cloud: templates now disable systemd / syslogd integration by default; having the integration enabled causes log lines to be duplicated to multiple log files, which could result in running out of disk space
  • Fixed NullPointerException in RereplicatingCasDownloader when --use_upload_to_rereplicate=true

Deprecated

  • AWS discovery: deprecate --aws_security_group; this flag is unnecessary as cluster members find each other by --cluster_name (we recommend also enabling --split_cluster_name=true)

v1.31.2 (2021-01-21)

Cherrypicks

  • RereplicatingCasDownloader: retain Context to fix NPE ("RequestMetadata not set in current context")

v1.31.1 (2020-12-17)

Cherrypicks

  • extraActionInputs: ensure directory exists before attempting to create input

v1.31.0 (2020-12-11)

Added

  • ARM64: we now release a Debian package for ARM64
  • Docker: Initial IPv6 support with --docker_enable_ipv6; this provides an isolated IPv6 network to actions which can be used for testing IPv6 code
  • Docker: Allow resolving executable paths against PATH; this is not compliant with the remote execution spec, but improves compatibility with existing open source projects that rely on this behavior, e.g., TensorFlow and Envoy
  • CAS: Document --experimental_opportunistic_cas - this flag switches to a different replication policy that reduces pressure on the distributed CAS if an external storage is configured; this improves reliability under load
  • Monitoring: Add a metric com.engflow.re.scheduler/existing_schedulers for the number of schedulers; this can be used to detect instances that are unable to report metrics, e.g., to Google Cloud Operations (formerly StackDriver)
  • Logging: log mTLS client authentication events
  • Docker: added --experimental_docker_internal_error_stderr_pattern to control automatic retries for some kinds of docker exec failures

Changed

  • S3: automatically retry failures after a delay
  • Docker: enable --docker_use_init by default; this helps avoid running out of PIDs when actions spawn a large number of subprocesses
  • Execution: enable --auto_worker_expiration by default; improves tracking of available workers

Fixed

  • Deployment: correctly set the Debian package architecture
  • Docker: correctly pass system capabilities to Docker
  • GCS: improved handling of "connection lost" errors

Deprecated

  • Options: the --docker_use_pull flag is now a no-op; the new code is always enabled

v1.30.0 (2020-12-04)

Added

  • GCP auth: print more server-side logs when authentication fails
  • Docker: the new --docker_use_init flag enables running Docker with a proper init process that reaps zombie processes, which avoids running out of PIDs when reusing docker containers
  • CAS: the new --use_upload_to_rereplicate flag enables using a new CAS re-replication code path that avoids a rare deadlock among worker machines

Changed

  • Debian package: the .deb version is now the release's SemVer, not the build date (check with dpkg -I engflow-re-services.deb)
  • Deployment kit (zip file): the k8s setup files are now under setup/k8s
  • Docker: print reason for container restart
  • External storage: enable range requests by default (see --storage_range_requests)
  • External storage: check on startup if we can access the storage backend
  • AWS Terraform file: renamed the need_external_docker parameter to public_worker_ip

Fixed

  • Build label: fixed missing build label in 1.28 and 1.29
  • Logging: fix the swapped invocation_id and action_digest in ExecutorServer's log line
  • Docs: show the service options' types correctly
  • Docs: display the version selector
  • CloudWatch: report metric units correctly
  • Fix uncaught IllegalStateException wrapping OperationTimeoutException from Hazelcast
  • CAS: detect on-disk file corruption
  • CAS: fix invalidating blobs that went missing with a PRECONDITION_FAILED
  • GCP: fixed Dockerized execution with cached containers on gVisor (requires --experimental_docker_avoid_fifo=true)

v1.29.1 (2020-11-19)

Cherrypicks

  • Monitoring: fix negative pool_utilization metric

v1.29.0 (2020-11-19)

Changed

  • AWS: improve deployment template (simplify role policy, add API endpoints)
  • Monitoring: add more context to logged error messages
  • Logging: log requested number of blobs for FindMissingBlob calls in addition to failed and missing digests
  • Logging: --debug_execute_requests also prints stderr for failed actions

Fixed

  • Execution: correctly create all requested output & input directories
  • CAS: do not unlist CAS nodes that fail due to timeouts; this could potentially result in a denial-of-service if the client sets small timeouts for large uploads
  • CloudWatch: respect max reporting batch size
  • CloudWatch: silently skip histogram metrics, which always failed to report
  • Documentation: correctly render metrics reporting percentages
  • S3: print the correct region name when the region is us-east-1
  • Networking: fix reporting of stream errors

v1.28.0 (2020-11-13)

Added

  • AWS, monitoring: --experimental_single_instance_monitoring is now called --single_instance_monitoring (the old name still works)
  • Add --external_storage_scheduler_threads and --external_storage_worker_threads to allow customizing the external storage thread pool

Changed

  • macOS: sign release
  • GCP, monitoring: Remove code to report metrics to Google Cloud Operations every 30 minutes
  • Logging: Correctly report missing blobs, improve GCS error logs
  • AWS: Improved Terraform template for cluster setup

Fixed

  • GCP, monitoring: fix sample reporting for charts that measure rates

v1.27.7 (2020-11-12)

Cherrypicks

  • Status page: fix --http_auth=none to allow access to the status page

v1.27.6 (2020-11-11)

Cherrypicks

  • Infrastructure: fix CI configuration for releases
  • Infrastructure: fix CI machine selection for releases
  • Monitoring: report two values before skipping; this should allow GCP metrics to drop to zero

Added

  • Logging: the --experimental_log_unavailable_rpcs flag (boolean) enables logging the stack trace of RPC calls that fail with UNAVAILABLE. We added this feature only for debugging, and we plan to remove it as soon as we can.
  • Monitoring: Added --enable_status_page to provide a basic cluster status page over HTTP/2 (only!) on the same IP+port as the gRPC endpoint (previously undocumented as --experimental_status_page)
  • Release archive now contains a CHANGELOG.md (this file)
  • CAS: Added an experimental flag to change the CAS re-replication policy to be less aggressive (--experimental_opportunistic_cas). Note:
    • This is an incompatible flag and may require downtime to roll out
    • This should only be enabled when external storage is enabled

Changed

  • Logging: log more detailed CAS upload errors, report INVALID_ARGUMENT correctly, report RESOURCE_EXHAUSTED instead of UNAVAILABLE when no workers are available
  • Logging: log a summary of missing blobs and failures for FindMissingBlobs calls
  • Monitoring: Report metrics to Google Cloud Operations at least every 30 minutes
  • Documentation: the "Bazel First-Time Setup" page now recommends --remote_timeout=600 instead of 3600
  • Docker: Pass --userns=host to Docker to explicitly disable user namespaces; previously, all actions failed when user namespaces were enabled in the Docker daemon

Fixed

  • Dockerized execution: disable user namespaces to avoid action failures
  • Code cleanup: several bugfixes found by static analyzers
  • Error handling: report an error if the output tree cannot be deleted (primarily when --experimental_docker_use_platform_user is enabled)

Deprecated

  • Options: the (undocumented) --affinity_scheduling flag is now a no-op; the new code is always enabled

v1.26.2 (2020-11-02)

Cherrypicks

  • Infrastructure: fixed version name computation in our release pipeline

Added

  • MacOS: create release
  • --experimental_cas_check_storage_only flag: to enable faster CAS checks (when --external_storage is not none)
  • Logging: worker logs the CAS size upon startup

Changed

  • Dockerized actions: add container hostname to /etc/hosts

Fixed

  • Monitoring: report CAS usage regularly, not just when doing a GC
  • Code cleanup: lots of bugfixes found by static analyzers

v1.25.1 (2020-10-30)

Added

  • Documentation: for Remote Persistent Workers
  • /healthz page
  • Status page: now authenticates clients, see the --http_auth flag
  • AWS, monitoring: support for single-machine-only monitoring; see the --experimental_single_instance_monitoring flag
  • Monitoring: the com.engflow.re.scheduler/pool_utilization metric shows what percentage of executors in a pool are currently used

Changed

  • Monitoring: the com.engflow.re.scheduler/queue_age metric now reports min/max ages broken down by executor pool
  • deb package: post-install script creates engflow user's home dir

Fixed

  • Monitoring: stuck actions are now removed from the scheduler's queue, and won't drive up the max age forever
  • Monitoring: fixed GCS metrics that tried reporting negative values
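
As a rough illustration of the pool_utilization metric introduced in this release, the percentage can be thought of as busy executors divided by total executors in a pool. The Python sketch below is an assumption about the arithmetic only. The function name and the clamping to the range 0 to 100 are hypothetical and are not taken from the EngFlow implementation.

```python
def pool_utilization(busy_executors: int, total_executors: int) -> float:
    """Return the percentage of executors in use, clamped to [0, 100].

    Hypothetical sketch: the clamp guards against transient bookkeeping
    errors (for example a negative busy count) producing an invalid value.
    """
    if total_executors <= 0:
        # An empty pool has nothing to utilize.
        return 0.0
    ratio = busy_executors / total_executors
    return 100.0 * min(1.0, max(0.0, ratio))


half_used = pool_utilization(4, 8)
negative_input = pool_utilization(-1, 8)
empty_pool = pool_utilization(0, 0)
```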

v1.24.1 (2020-10-28)

Cherrypicks

  • GcsClient copy: always set target of copy request

v1.23.1 (2020-10-28)

Cherrypicks

  • GcsClient copy: always set target of copy request

v1.24.0 (2020-10-19)

Added

  • Logging: log OpenCensus attempts to record negative values
  • Logging: LocalExecutionServer tracks and logs per-action timing
  • Monitoring: com.engflow.re.bytestream/read metric to monitor complete vs. partial ByteStream.read calls (hidden from docs because we wanted to use it for debugging only)

Changed

  • Monitoring: com.engflow.re.storage/ops_queue_size now shows the composition of external storage ops queue

Fixed

  • Don't execute an action if input fetch failed

v1.23.0 (2020-10-14)

Added

  • Status page: a simple status page with members list, on the same port as the remote execution service; enabled with --experimental_status_page=true

Fixed

  • ByteStream.read: properly implement resumable downloads from S3/GCS, guarded by --experimental_storage_range_requests=true
  • Fix user-visible cancellation exceptions
  • Fix a race condition in CloudCasDownloader
  • Fix an incorrectly reported AC corruption with S3/GCS
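
A resumable download generally works by re-issuing the read from the byte offset the client has already received (the ByteStream ReadRequest message carries a read_offset field for this purpose). The Python sketch below illustrates the pattern with an in-memory blob standing in for S3/GCS. The names read_from, resumable_download, and TransientError are hypothetical, and the simulated failure on the first attempt is only there to exercise the retry path.

```python
from typing import Iterator, Optional


class TransientError(Exception):
    """Simulated mid-stream failure."""


def read_from(blob: bytes, offset: int, chunk_size: int,
              fail_after_chunks: Optional[int]) -> Iterator[bytes]:
    """Yield chunks of blob starting at offset; optionally fail mid-stream."""
    sent = 0
    for start in range(offset, len(blob), chunk_size):
        if fail_after_chunks is not None and sent == fail_after_chunks:
            raise TransientError("stream interrupted")
        yield blob[start:start + chunk_size]
        sent += 1


def resumable_download(blob: bytes, chunk_size: int = 100,
                       max_attempts: int = 5) -> bytes:
    """Download blob, resuming from the bytes already received on failure."""
    received = bytearray()
    for attempt in range(max_attempts):
        # Only the first attempt is made to fail, after two chunks.
        fail_after = 2 if attempt == 0 else None
        try:
            for chunk in read_from(blob, len(received), chunk_size, fail_after):
                received.extend(chunk)
            return bytes(received)
        except TransientError:
            continue  # retry, starting at offset len(received)
    raise RuntimeError("download did not complete")


BLOB = bytes(range(256)) * 4  # 1024 bytes standing in for a stored object
result = resumable_download(BLOB)
```

The key point is that the retry starts at len(received) rather than at zero, so bytes that already arrived are never fetched twice.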

v1.22.0 (2020-10-06)

Added

  • Deployment kit: added an example Bazel project
  • Docs: add on-prem setup instructions

Changed

  • Return INVALID_ARGUMENT for too-large output trees

Fixed

  • Remote logging: Avoid infinite recursion when logging to GCP
  • ExecutionServer: Suppress cancellation exceptions, so they don't get reported to the client
  • Docker pull: report fine-grained pull errors

v1.20.1 (2020-10-02)

Cherrypicks

  • Scheduler: also listen to expiration/eviction to avoid losing workers
  • ExecutionServer: catch exceptions from onCompleted to avoid "call already cancelled" errors

v1.21.0 (2020-09-28)

Added

  • Added --experimental_docker_force_reuse flag

Changed

  • AWS, CloudWatch: --cloudwatch_dimensions is now optional
  • AWS, deployment kit: service endpoint now listens on port 443 (was 8080 before)
  • Improved CAS performance

Fixed

  • Fixed bugs in validating the output tree in the client's execution requests
  • Java 8: fixed a crash bug

v1.20.0 (2020-09-15)

Added

  • Workers: can auto-detect the disk size
  • Service interface: implemented ByteStream.QueryWriteStatus
  • Docs: added system diagram

Changed

  • External storage: use more threads: 50 on schedulers, 25 on workers
  • CentOS: use statically-linked netty-tcnative in the RPM package

v1.19.1 (2020-09-03)

Cherrypicks

  • Fix hazelcast InterruptedExceptions

v1.19.0 (2020-09-03)

Fixed

  • Hash mismatch issues

v1.18.0 (2020-09-02)

Added

  • New metric: com.engflow.re.storage/ops_queue_size

Changed

  • Enabled affinity-scheduling by default (--affinity_scheduling=true)
  • Changed server-side execution log message format to "id: message"

Fixed

  • Fewer DEADLINE_EXCEEDED client errors: more findMissingBlobs caching, reduced GCS/S3 traffic

v1.17.0 (2020-08-26)

Added

  • RPM package for CentOS 7