Release Notes

To see your currently deployed version, visit [cluster_url]/restatus in your EngFlow cluster web UI. If you do not have the web UI enabled, please ask your EngFlow contact which version you are currently running.

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

v2.99.0 (TBD - notes up to 837f84fd10f096d987aed27ce43d4691804f23a4 so far)

New

  • Autoscaling: Support scheduler-based graceful worker shutdown when scaling in worker pools. When enabled, clusters can be configured to wait for workers to finish the actions they are currently executing before shutting them down. This is especially useful for long-running actions: when aborted, they have to be retried from scratch, which can lengthen overall build times significantly. The feature works as follows: when fewer workers are needed to handle the anticipated load, the schedulers select workers and terminate them only when justified, rather than leaving the scale-in to the cloud provider's autoscaling group. The behavior is configurable per pool with two parameters: the maximum time an action may have been running on a worker for that worker to still be terminated immediately, and the maximum time a worker may spend completing running actions before the scheduler terminates it even if it is not idle, potentially aborting running actions that then have to be retried.
  • Autoscaling: The new metric com.engflow.re.scheduler/autoscaler_cluster_size_controller_op reports per pool and operation type the number of calls, as well as their status and latency.
  • Autoscaling: The new metric com.engflow.re.management.workercontrol/approx_mft_induced_idle_executor_duration reports approximately how long executors were idle on marked-for-termination workers.
  • Autoscaling: The new metrics com.engflow.re.management.workercontrol/mft_on_scheduler and com.engflow.re.management.workercontrol/mft_on_worker report per pool how long marked-for-termination connections between the master scheduler and workers lasted and the result.
  • RE: On GCP, support running workers on arm64-based compute nodes.
  • RE: The new metric com.engflow.re.exec/started_actions_per_pool reports how many actions started on each worker.
  • RE: The new metric com.engflow.re.scheduler/dequeued_actions reports per pool how many actions were removed from the queue, either because they started executing on a worker or because they were ejected from the queue for being too old.
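
The graceful shutdown decision described above can be pictured with a small sketch. This is one possible reading of the two per-pool parameters, not EngFlow's implementation; the names PoolShutdownPolicy, decide, and both field names are hypothetical, since the notes do not name the actual configuration options.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Decision(Enum):
    TERMINATE_NOW = "terminate_now"      # cheap to stop: idle or only young actions
    KEEP_DRAINING = "keep_draining"      # let running actions finish
    FORCE_TERMINATE = "force_terminate"  # drain budget exhausted; actions get retried


@dataclass(frozen=True)
class PoolShutdownPolicy:
    # Hypothetical field names for the two per-pool parameters.
    max_action_age_for_immediate_termination_s: float
    max_drain_time_s: float


def decide(policy: PoolShutdownPolicy,
           running_action_ages_s: Sequence[float],
           time_since_marked_s: float) -> Decision:
    """Decide what to do with a worker that was selected for scale-in."""
    # An idle worker can always be terminated without losing work.
    if not running_action_ages_s:
        return Decision.TERMINATE_NOW
    # Assumption: if every running action is still young, retrying is cheap.
    oldest = max(running_action_ages_s)
    if oldest <= policy.max_action_age_for_immediate_termination_s:
        return Decision.TERMINATE_NOW
    # The worker has been draining for too long; stop it even though it is busy.
    if time_since_marked_s >= policy.max_drain_time_s:
        return Decision.FORCE_TERMINATE
    return Decision.KEEP_DRAINING
```

For example, with a 30 second immediate-termination threshold and a 10 minute drain budget, a worker running a 2 minute old action keeps draining until either the action finishes or the 10 minutes are used up.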

Fixed

  • RE: On Windows, input symbolic link targets are no longer absolutized.
  • UI: Fix a bug where the critical path information card would not render its value when the invocation had finished.
  • UI: Ensure success percentages are truncated, so that even if just 1 test of 1000 fails, the success percentage is not 100%.
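
The truncation fix above can be illustrated with a minimal sketch. It assumes the percentage is shown with a fixed number of decimals; the function name and the decimals parameter are hypothetical and not taken from the UI code.

```python
def success_percentage(passed: int, total: int, decimals: int = 1) -> float:
    """Return the success percentage, truncated (never rounded up)."""
    if total <= 0:
        raise ValueError("total must be positive")
    scale = 10 ** decimals
    # Integer floor division truncates, so 100% only appears if all tests passed.
    return (passed * 100 * scale // total) / scale


# With rounding, 999 of 1000 at zero decimals would show as 100.
print(success_percentage(999, 1000, decimals=0))
print(success_percentage(999, 1000))
```

Rounding to the nearest value would hide a single failure among many tests, whereas truncation keeps the displayed value below 100 until every test passes.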

Changed

  • CAS: --enable_distributed_cas=false now prevents a worker from serving CAS blobs at all. Previously, the worker could still serve blobs in its local CAS from remote execution.
  • CAS: Introduce --cas_fallback_deadline and --cas_fallback_retries to adjust fallback downloads.
  • IAM: The engflow_roles custom OIDC claim may now be either an array of strings (previous behavior) or a string containing a comma-separated list of role names.
  • RE: Resource usage tag values may now include upper- and lower-case letters, digits, and some punctuation characters (!#$%*+.?@^_~-).
  • UI: Reduce UI bundle for the Build and Test UI by ~39%.
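
As an illustration of the two accepted shapes of the engflow_roles claim, the sketch below normalizes either form into a list of role names. The function name is hypothetical, and trimming whitespace around names and dropping empty entries are assumptions that the notes do not confirm.

```python
from typing import List, Sequence, Union


def parse_engflow_roles(claim: Union[str, Sequence[str]]) -> List[str]:
    """Normalize an engflow_roles claim value into a list of role names."""
    if isinstance(claim, str):
        # New form: a single string with comma-separated role names.
        parts = claim.split(",")
    elif isinstance(claim, (list, tuple)):
        # Previous form: an array of strings.
        parts = list(claim)
    else:
        raise TypeError("engflow_roles must be a string or an array of strings")

    roles = []
    for part in parts:
        if not isinstance(part, str):
            raise TypeError("role names must be strings")
        name = part.strip()  # assumption: surrounding whitespace is ignored
        if name:
            roles.append(name)
    return roles
```

Both `parse_engflow_roles(["admin", "user"])` and `parse_engflow_roles("admin, user")` yield the same two role names under these assumptions.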

Removed

  • Remove the redundant io.netty.buffer/used_heap_memory and io.netty.buffer/used_direct_memory metrics. The com.engflow.thirdparty.netty/used_heap_memory and com.engflow.thirdparty.netty/used_direct_memory metrics already report this information.

v2.98.1 (2024-12-11)

Internal release. No publicly facing changes.

v2.98.0 (2024-12-10)

New

  • RE: Added the list-style flag --experimental_detect_java_oom_from_action_output to improve OOM detection for failing actions, specifically failures caused by Java OutOfMemoryError exceptions. Valid values are stdout and stderr; each checks the first kB of the corresponding action output.
  • UI: Revamps the cache utilization card on the invocation performance tab to use data from Bazel's new compact execution log, if one is available and the cluster has running analyzers. Note that this feature will only work for the latest stable format of the compact execution log, shipping in Bazel 8.
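
To make the Java OOM check above more concrete, here is a rough sketch of scanning the start of an action's output. It is not the server's implementation: the marker string, the 1024-byte prefix (one reading of "first kB"), and the function name are all assumptions.

```python
from typing import Mapping, Sequence

OOM_MARKER = b"java.lang.OutOfMemoryError"  # assumed marker text
PREFIX_BYTES = 1024                         # assumed meaning of "first kB"
VALID_STREAMS = ("stdout", "stderr")


def looks_like_java_oom(outputs: Mapping[str, bytes],
                        streams_to_check: Sequence[str]) -> bool:
    """Return True if any selected stream mentions the marker in its prefix."""
    for stream in streams_to_check:
        if stream not in VALID_STREAMS:
            raise ValueError(f"unsupported stream: {stream}")
        # Only the beginning of the output is inspected, never the whole blob.
        prefix = outputs.get(stream, b"")[:PREFIX_BYTES]
        if OOM_MARKER in prefix:
            return True
    return False
```

Because only a prefix is inspected, a marker that appears late in a long output would not be detected in this sketch.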

Fixed

  • RE: Reduce the scheduler's retained memory by de-duplicating strings in client tags, worker references, affinity keys, pool ids, and strings in the request metadata.
  • UI: Fixed an issue where long target patterns could push the invocation timing information in the details page out of view.
  • RE: Support pulling images from registries on nonstandard ports.
  • UI: Fixes an issue where sometimes an unexpected error would appear in the analytics page.
  • UI: Show input tree on action details page even if it hasn't fully loaded into memory yet.

Changed

  • RE: CAS fallback will no longer fall back indefinitely.

v2.97.0 (2024-11-22)

New

  • GitHub CI Runners now support "auto" version.
  • macOS 14 AWS images are now included in the release.

Fixed

  • UI: Fixed an issue where failed targets sometimes didn't show up on the Highlights page of an invocation.
  • UI: Corrected the visual positioning of the "scroll to bottom" button in the invocation console view.
  • Linux: Increased ARP cache GC threshold to reduce network flakiness in large clusters.
  • RE: Reduced memory usage for action cache lookups.

Changed

  • UI: In the invocation details page, the Requester and Authenticated Runner fields will now be hidden when they are unset and --client_auth=none.

Deprecated

  • --experimental_docker_force_reuse is now a no-op and is now always disabled.
  • --stream_fallback_reads is now a no-op and is now always enabled or disabled depending on OS (disabled on Windows, enabled on others).

v2.96.0 (2024-11-14)

New

  • RE: Paths of inputs and outputs now support UTF-8 characters.
  • UI Metrics: The new com.engflow.observability.ui/page_load metric will track how long it took to navigate to and render the next page. This includes cases of client-side navigation, and cases of full-page (F5) browser refreshes, which can be distinguished using the page_load metric tag.
  • RE: Add --limit_total_worker_cores; when enabled, the autoscaler takes the maximum total cores into account when determining pool sizes.

Fixed

  • RE: Include more information in the output to the client when Docker containers fail to start.
  • RE: Fix rare issue where the autoscaler could get stuck.
  • CAS: Correctly handle UNIMPLEMENTED return code from workers that do not participate in the CAS.
  • CAS: Workers not serving distributed CAS now fall back to external storage after a distributed CAS failure.
  • UI: Allow download of extra test outputs in Firefox.
  • UI: Fixed a bug where existing targets would sometimes return a "missing" error.

Changed

  • Auth: When trying to create an existing role with custom roles API, return an already exists exception instead of an internal error.
  • RE: Adjusted autoscaling estimates for long-running actions.

Deprecated

  • --experimental_force_sibling_containers_pool_name is now a no-op.
  • --force_pool_name is now a no-op.

v2.95.1 (2024-11-13)

Fixed

  • CAS: Workers not serving distributed CAS now fall back to external storage after a distributed CAS failure.
  • CAS: Correctly handle UNIMPLEMENTED return code from workers that do not participate in the CAS.

v2.95.0 (2024-11-05)

New

  • RE: Add --docker_max_container_modified_size to restart reusable containers if they accumulate untracked data.
  • Metrics: Report com.engflow.secretstore/operation_duration_seconds for information on accessing secrets from schedulers.

Fixed

  • CAS: Execution-only workers fall back to both cache workers and external storage when a file is not present in the distributed CAS.
  • CAS: Invalidate CAS existence cache entries when the referenced blob is missing.
  • UI: Triple-dot menus now render on top of all other elements.
  • UI: The test UI now handles long test output lines better by displaying only the first 5 lines; the full output can still be viewed at the press of a button.
  • UI: Fixed an issue where streamed action outputs in the UI failed to reset the byte count from previously opened outputs.
  • UI: Fixed an issue rendering the invocation details page under some auth setups.
  • UI: Fixes a UI bug where the action details page would incorrectly claim there was an error while fetching stdout or stderr.
  • UI: Expanded packages in the target tree are automatically scrolled to the top of the viewport.
  • ResultStore: Fixed an error during target tree building when aspects with parameters are used.
  • ResultStore: Fixed an issue encountered while attempting to reduce corrupted BEP streams. This would manifest as an NPE displayed on an invocation's details page, in place of any other information.

Changed

  • UI: Reordered the columns in the worker table on the /restatus page.
  • Metrics: The com.engflow.resultstore/reduce_bes_count metric has been renamed to com.engflow.resultstore/new_reduce_bes_count.

v2.94.2 (2024-10-25)

Fixed

  • UI: Fixed an issue rendering the invocations page and invocation details page under some auth setups.

v2.94.1 (2024-10-18)

Fixed

  • CAS: Fixed an issue where blob expirations could not be longer than 22 days from scheduler startup.

v2.94.0 (2024-10-16)

New

  • RE: Add GCP Secret Manager credential helper for Docker.
  • RE: Support scaling down pools to 0 instances faster.
  • RE: Added metrics for SecretStore operations.
  • RE: Add flag --pools_config to communicate data on the pool configuration to the scheduler. This flag is intended to replace multiple other flags once the migration is complete.
  • RE: Add a new flag, --experimental_cas_read_preferred_nodes_only, which allows limiting the nodes used to read from in distributed CAS for certain cluster configurations.
  • CAS: Track the current GC window via the new metric com.engflow.storage.gc/gc_window_seconds.
  • CAS: Track staleness of CAS blobs when their expiration is refreshed via the new metric com.engflow.re.storage/time_to_expiry_seconds.

Changed

  • UI: Update Perfetto to v48.0.
  • UI: Expanded packages in the target tree are automatically scrolled to the top of the viewport.
  • UI: Always display all status filters on the target tree, even when the invocation hasn't completed.
  • UI: When a build is completed the target status filters will narrow based on the result of the invocation to help highlight potentially errant targets.
  • Update the persistent worker actions metric to match the container lifecycle metric - report pool and lifecycle event.
  • Update EngFlow server profiles to display the pool on the instance row, in the Perfetto UI.
  • CLT is now available in all supported regions.

Fixed

  • RE: Fixed an issue where external storage HTTP connections could leak when cancelled.
  • Logging: remove a kind of chatty scheduler log message ("Handling getXXX locally" and similar). This reduces costs of logging.
  • UI: Fixed bug in which failed targets don't show up properly on the highlights tab of an invocation's page.
  • UI: Fixed some cases where invocations would erroneously display <unknown> as the principal which requested or executed them, when viewed on the invocation search and invocation details pages.

Deprecated

  • RE: deprecated metric com.engflow.re.auth.async/call_count has been removed.

v2.93.1 (2024-10-10)

Fixed

  • RE: Fall back to fetch blobs from S3/GCS when a worker gets a corrupt (digest mismatch) blob from another worker.
  • RE: Fix issue where cancelled HTTP connections to external storage leaked.

v2.93.0 (2024-10-01)

New

  • RE: Add metric com.engflow.re.exec.poolgroups/oom_count reporting how many execute responses were classified as OOMs.
  • RE: Add metric com.engflow.re.exec.poolgroups/initial_recommendation_count reporting how often an action is executed on a different pool than requested.
  • RE: Add metric com.engflow.re.exec.poolgroups/recommendation_change_count reporting how often the smart recommender recommends a pool that differs from the pool the action was previously executed on.

Changed

  • RE: Improve the automatic OOM detection by factoring in cgroup oom_kill events, if available.
  • The --cas_existence_cache_expiry flag is now also applied to expiration storage.
  • The --k8s_namespace flag is now a no-op. This flag only mattered when deploying to Kubernetes, and it's unnecessary because the Pod can read its namespace from the Downward API.
  • The --k8s_all_pods_service flag is now a no-op. This flag only mattered when deploying to Kubernetes, and for a long time now it had to be always equal to --k8s_scheduler_pods_service; the duplication made this flag unnecessary.
  • --disable_pw_scheduled_threads is now a no-op.

Fixed

  • Fix race condition where HTTP connections were not properly closed when an AC read call was cancelled.
  • UI: Fixed bug in which failed targets don't show up properly on the highlights tab of an invocation's page.

v2.92.2 (2024-09-26)

New

  • RE: Add metric com.engflow.re.exec.poolgroups/oom_count reporting how many execute responses were classified as OOMs.
  • RE: Add metric com.engflow.re.exec.poolgroups/initial_recommendation_count reporting how often an action is executed on a different pool than requested.
  • RE: Add metric com.engflow.re.exec.poolgroups/recommendation_change_count reporting how often the smart recommender recommends a pool that differs from the pool the action was previously executed on.

v2.92.1 (2024-09-25)

New

  • The --cas_existence_cache_expiry flag is now also applied to expiration storage.

Fixed

  • RE: Fix race condition where HTTP connections were not properly closed when an AC read call was cancelled.

v2.92.0 (2024-09-23)

New

  • Record oom kills in the profile during action execution in Docker.
  • RE: Add metric com.engflow.docker.container/size to report sizes of containers.
  • RE: Add metric com.engflow.re.exec/max_rss_kib that reports the MaxRSS (maximum resident set size) for a successfully executed action, if available, to record how much memory actions use.
  • RE: Support smart pool recommendations. If enabled, this feature can improve performance and reduce costs of executing actions remotely. Clusters can specify groups of pools, and the cluster automatically selects which pool within a group to execute an action on remotely. The pool recommendation is based on previous execution statistics and currently selects by memory usage.
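
The idea of selecting a pool by memory usage can be sketched as follows. This is only an illustration under stated assumptions, not the recommender's actual algorithm: the function name, the pool group layout (pool name to available memory in KiB), the headroom factor, and the fallback rules are all hypothetical.

```python
from typing import Mapping, Sequence


def recommend_pool(pool_memory_kib: Mapping[str, int],
                   observed_max_rss_kib: Sequence[int],
                   default_pool: str,
                   headroom: float = 1.2) -> str:
    """Pick the smallest pool in a group that fits previously observed memory use."""
    if not observed_max_rss_kib:
        # No execution statistics yet: keep the pool the client requested.
        return default_pool
    needed_kib = max(observed_max_rss_kib) * headroom
    # Consider pools from smallest to largest memory.
    by_size = sorted(pool_memory_kib.items(), key=lambda item: item[1])
    for name, memory_kib in by_size:
        if memory_kib >= needed_kib:
            return name
    # Nothing is large enough: fall back to the largest pool in the group.
    return by_size[-1][0]
```

With a group of a 4 GiB and a 16 GiB pool, an action whose previous runs peaked near 3 GB would be sent to the smaller pool, while one that peaked near 4 GB would move to the larger pool because of the headroom.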

Changed

  • RE: The metrics com.engflow.resultstore/reduce_bes_replay_source_count, com.engflow.resultstore/reduce_bes_replay_removed_from_cache_count, com.engflow.resultstore/reduce_bes_completed_duration_since_finish_event and com.engflow.resultstore/reduce_bes_count now include a tag specifying which replay type (Combined, Invocation Metadata or Target Tree) the data refers to.

Fixed

  • RE: Fix a rare deadlock when actions time out from a pool.
  • UI: Improved target tree and target fetching performance.
  • UI: Add UI error boundaries to help debug rendering errors and explain to users what went wrong and potentially how to fix the issue.
  • UI: Emoji file extensions are rendered in the correct order in the Input Tree section of the Action Details page.
  • The fluent-bit configuration now allows dropping long log lines. Previously, if a log line was excessively long (32K), fluent-bit would stop log-shipping that service; this could lead to missing logs. Now it will just skip such lines.
  • RE: ActionCache replication can be symmetrical.

Deprecated

  • --experimental_docker_proxy is now a no-op.
  • --hazelcast_aws_use_sdk is now a no-op.
  • --experimental_docker_store_images_in_cas is now a no-op.

v2.91.3 (2024-09-18)

Fixed

  • RE: prevent reads after every S3 upload.
  • RE: Be stricter about keeping sysbox up.

v2.91.2 (2024-09-15)

Fixed

  • RE: fix pulling non-canonical container images.

v2.91.1 (2024-09-11)

Fixed

  • API: Fixed LIST endpoint for Docker images.

v2.89.4 (2024-09-10)

Fixed

  • UI: Fixed download paths for test resources.

v2.91.0 (2024-09-11)

New

  • UI: Action execution pages (/actions/executions/<execution-id>) will now show partial information if available.
  • RE: Add metric com.engflow.docker.container/existing reporting the number of existing docker containers on workers, aggregated by their state.
  • RE: Add metric com.engflow.docker.image/size reporting the sizes of existing docker images on workers.
  • UI: The Getting Started page now displays instructions for multiple client authentication methods, when enabled.
  • RE: Remote persistent workers are now always enabled on Linux and macOS.

Changed

  • RE: com.engflow.re.exec.docker/existing_containers metric is removed in favor of com.engflow.docker/containers.
  • UI: The com.engflow.observability.ui/page_load_with_data_requests metric now also records the load duration experienced by the user when opening the Invocation Details page and the Analytics page.

Fixed

  • UI: The requester and/or runner of an invocation is now correctly populated when doing BES processing on analyzer instances. Previously, the UI showed these as "unknown".
  • UI: Fixed download paths for test resources.
  • UI: Fixed some cases where the build and test UI could accidentally fetch some resources twice.

Deprecated

  • --hazelcast_aws_use_client_lib is now a no-op.
  • --docker_split_exec_run is now a no-op. Use --docker_allow_reuse instead.
  • --experimental_persistent_worker_expand_param_files is now a no-op.

v2.90.0 (2024-09-03)

New

  • UI: The new metric com.engflow.bes.replay/cpu_time_for_event reports how much CPU time was spent replaying and reducing a single event within a build's BES.

Fixed

  • UI: Principals displayed in the User Menu will now be clipped when they cannot fit in the space, instead of overflowing.
  • UI: Allow "Enter" to login when logging in via basic authentication.
  • UI: Improved page load times by requiring one less call to fetch feature toggles before starting page render.
  • UI: JSON Bazel profiles will now download with the correct extension.
  • UI: Fixed bug where test details would be fetched multiple times even when nothing changed.
  • Scheduler: built-in autoscaler fixes.

Changed

  • Remote persistent workers are now enabled by default on all supported platforms (Linux and macOS).
  • UI: Provide more meaningful information when Bazel profiles did not get uploaded correctly or are missing from the CAS.
  • BES: Significantly reduce the amount of memory used by file references when reducing the BES to extract target details.
  • Reduce the S3 upload buffer size and enable SDK multipart upload.
  • Remove pin on google-guest-agent to 1:20240528.00-g1 in the base Debian image on GCP.

Deprecated

  • --experimental_persistent_worker_and_docker is now a no-op.
  • --disable_profile_generator_cleanup_threads is now a no-op.

v2.89.3 (2024-09-03)

Fixed

  • RE: Fixed undercounting of the number of actions coalesced in com.engflow.re.scheduler/coalesced_executions metric and logs.

v2.89.2 (2024-08-26)

New

  • Scheduler: add the metrics com.engflow.re.scheduler/estimated_action_time and com.engflow.re.scheduler/estimated_induced_load, and change com.engflow.re.scheduler/desired_executors to report the pre-adjusted value to debug issues with the pool sizes.

Changed

  • RE scheduler: tweak autoscaling equation for faster scale-up on load spikes.
  • BES: Significantly reduce the amount of memory used by file references when reducing the BES to extract target details.

v2.89.1 (2024-08-21)

Changed

  • RE: report a retryable status code to the client when container startup times out.
  • RE: Collect more information when Docker containers fail to start.

Deprecated

  • --experimental_force_lru is now a no-op.
  • --http_compression is now a no-op.

v2.89.0 (2024-08-19)

Internal release. No publicly facing changes.

v2.88.0 (2024-08-13)

New

  • CI Runners: support polling multiple GitHub repositories.

Fixed

  • RE: Do not wait for containers to start when their entrypoint has crashed; fail immediately.
  • BES: If automatic indexing fails to save the last status of an invocation, update the index on the next BES reduction.
  • CLT: Fix bad paths and yaml syntax for grafana and prometheus.

Changed

  • UI: Changed "exit code" filter to "Bazel exit code" to avoid confusion with process exit codes.
  • UI: Update Perfetto Trace Viewer to v47.0 and minify files for faster viewing of the Bazel and EngFlow profile.
  • UI: The branch name displayed on the invocation search page will now also be clipped when it is longer than 15 characters, mimicking the visual behavior of the commit ID next to it.

v2.87.0 (2024-08-05)

New

  • UI Auth: The OIDC provider may now set engflow_roles to a list of role names. The user will have these roles instead of roles set with --principal_based_permissions.

Fixed

  • CAS: CAS Reads will no longer retry reading 0 bytes, which was previously caused by retrying a read that had already finished.
  • S3: Handle S3 416 errors gracefully.
  • UI: Fixed a bug that prevented users from setting a custom timezone.
  • UI: Fixed an issue where small durations (such as action queueing time, in the action details page) were incorrectly displayed as nearly 24 hours instead.

Incompatible

  • BES/EventStore: The service no longer translates deprecated timestamp and duration fields in millis to their corresponding new Timestamp and Duration counterparts. This is a behavioral change for BEP sent from Bazel clients running version 4.x or older. Bazel versions 5.0.0+ are not affected.

Changed

  • BES: Improved EventStore performance: if no sensitive data to redact is detected in an incoming BEP event, avoid packing the unchanged event before saving it to external storage or processing it otherwise.
  • BES: Build events received are now written to storage and acknowledged after at most 2 minutes.
  • BES: Improved EventStore performance by pre-filtering packed build events and only unpacking them if they might have data that should be redacted.
  • API: The v1 and v1alpha IdentityManagementServer APIs have been consolidated into a single v1 API.

v2.86.1 (2024-07-28)

Internal release. No publicly facing changes.

v2.86.0 (2024-07-28)

New

  • BES: The new metric com.engflow.eventstore/bep_event_ack_latency reports how long it takes to acknowledge a build event the client sent to EngFlow's BES.
  • UI: The new metric com.engflow.observability.ui/page_load_with_data_requests reports how long it took to render selected pages, including data requests needed to display the pages' initial contents. For example, for the invocation search page, it tracks how long it took to render the static page, plus the first set of invocations.
  • BES: The new metric com.engflow.bes.replay/cpu_time tracks how much CPU time was spent replaying and processing an invocation's build events.

Changed

  • CI Runners: Webhooks now trigger polling instead of reading the hook contents to improve security and reliability.
  • RE: Add a log message with the worker pool and action mnemonic when actions coalesce.
  • UI: Don't warn that --remote_download_minimal isn't set when using --experimental_remote_output_service.

Fixed

  • RE: Out-of-CAS-space is now treated as RESOURCE_EXHAUSTED.
  • UI: Fix invocation profile HTTP API sometimes returning 500 instead of 404 (NOT FOUND).
  • BES: Revert support for BES upload retries, as this introduced issues with live replays.

Deprecated

  • config: --experimental_junit_test_suite_upload_deadline is now a no-op.

v2.85.3 (2024-07-25)

Fixed

  • CAS: Ensure S3 always returns object expiration date.
  • CAS: Ensure the external storage request does not block the network thread pool.

v2.85.2 (2024-07-22)

New

  • BES: Add com.engflow.eventstore/bes_upload_delay to track ingestion performance.

v2.85.1 (2024-07-19)

Changed

  • BES: Remove com.engflow.eventstore/bep_event_ack_latency metric due to increased scheduler load.
  • CAS: Move blocking storage actions to a dedicated thread pool.
  • CAS: GCS Blocking calls now return storage metadata.
  • GCP: Pin google-guest-agent to 1:20240528.00-g1 to avoid network regression in 1:20240701.00-g1.

v2.85.0 (2024-07-15)

New

  • BES: Export com.engflow.eventstore/bep_event_ack_latency, which measures how long it takes to acknowledge a build event the client sent to EngFlow's BES.
  • BES: Add a read cache for BES replays to reduce the long tail for indexing performance.

Fixed

  • CI Runners: re-process jobs resulting in server-side errors.
  • CI Runners: more frequent running job status updates.
  • ResultStore: Fix a bug that prevented failed attempts to reduce the BES of an invocation to be retried in a timely manner.
  • UI: Fixed visual clipping with the help mode button, on the target tree status filter, inside an invocation's details page.
  • UI: Fixed an issue with help mode buttons not displaying correctly on the new compact invocation search page.

Changed

  • BES: --experimental_use_junit_test_suite_parser_v2 is now enabled by default and a no-op.
  • Metrics: Target tree processing logging will now include the fully-qualified invocation ID, if any exceptions are raised during this process.
  • S3: Blocking time to acquire failures will be treated as RESOURCE_EXHAUSTED.
  • UI: Disable auto-complete on search page. This was causing database issues that caused search to perform very slowly.

v2.84.1 (2024-07-10)

Cherrypicks

  • RE: fixes bug where partially downloaded files lost the successfully written prefix when falling back to external storage.

v2.84.0 (2024-07-03)

Changed

  • UI: Our invocation search page now sports an improved look, where invocations are displayed in a much more compact manner, thus allowing more invocations to be visible at the same time. For invocations that are loaded and visible on the page, you can also now Ctrl + F search for them using their invocation ID.
  • UI: Added a feature flag, --enable_expensive_invocation_index_queries, to control whether known expensive queries are enabled on the invocation search page. Disabling this flag may improve page-load and search time for very large clusters.
  • Monitoring: improved precision of most time distribution metrics.

Fixed

  • UI: Fixed an issue that prevented multi-value filters (such as BES keywords) from populating suggestions properly.
  • UI: Fixed an HTTP 500 error on the Cluster Status page.

Removed

  • UI: Secondary timezone functionality is removed to simplify user experience.
  • ResultStore: the flag --resultstore_async_db_write no longer has any effect. Database writes are now always asynchronous.

v2.83.2 (2024-07-03)

Fixed

  • Re-release to pick up the OpenSSH security update. The ssh port is not open on any cluster, so the security issue is low priority. However, we're preparing this release just in case.

v2.83.1 (2024-06-27)

Fixed

  • S3: Fix null pointer on asynchronous copy.

v2.83.0 (2024-06-18)

Fixed

  • RE: Correctly report CAS check cache hits (in com.engflow.re.storage/ops).
  • UI: In the target tree's status bar, better differentiate skipped targets from pending targets.
  • UI: Highlight invocations whose BES is still incomplete.
  • UI: It is now easier to see which targets are root-cause failures versus transitive failures.
  • RE: putCasBlob now respects instance_name correctly.

v2.82.0 (2024-06-13)

New

  • New BES metrics: com.engflow.eventstore/incomplete_batches_size estimates the in-flight event batches' sizes, and com.engflow.eventstore/flushing_batches_size shows the size of event batch blobs currently being uploaded to storage.
  • New operation gc_extend and result dimension for the com.engflow.re.storage/ops metric.
  • UI: On the invocations page, invocations can now also be filtered by command (build, test, run, etc.).
  • UI: The new metric com.engflow.resultstore/reduce_bes_completed_duration_since_finish_event measures how much time passed between the cluster receiving an invocation's last BES event, and us fully processing the BES for display in the UI.
  • UI: Retry saving the result of processing the BES if it fails.
  • UI: Add the metric com.engflow.resultstore/reduce_bes_replay_source_count that tracks how many times processed BES was requested, and where the data was retrieved from.
  • UI: Add the metric com.engflow.resultstore/reduce_bes_replay_removed_from_cache_count and log details to help debug BES processing.

Changed

  • The com.engflow.eventstore/batchstore_* metrics are no longer reported. They were not showing what we intended to.
  • Retire the now unused --experimental_summarize_invocations.

Fixed

  • Analyzers now report com.engflow.meta/engflow_version.
  • When moving ResultStore onto analyzer instances, early-exit if an invocation was not found.
  • UI: Fix bug where failing actions did not correctly update the overall status of a target.
  • UI: Disable "Open Bazel profile" button if the profile is a local file.
  • UI: Fix bug that didn't correctly detect --remote_download_outputs=minimal being set on an invocation, and unnecessarily suggested setting this value.
  • UI: Fix download URI for action outputs.
  • UI: Fix a bug where targets that were aborted in the analysis phase did not show up in the target tree.
  • UI: Update target details cards when new data becomes available for running invocations.
  • UI: Fix a bug where the status of an invocation with some aborted targets was shown as unknown, although we can be more specific.
  • UI: Fix a bug that hid targets with an analysis failure in the target tree.
  • UI: On the invocations page, support filtering for cancelled invocations.
  • UI: Fix a bug where the target tree did not render for running builds.
  • UI: Avoid duplicate processing of an invocation's BES.
  • UI: Fix bug which caused the data shown for running invocations to be out-of-date.

v2.81.1 (2024-06-07)

  • Cache: Add "tenants" to the prefix path in external storage.

v2.81.0 (2024-06-01)

New

  • UI: Redact sensitive information from URLs sent in the BES.
  • UI: The invocation creation date filters now default to including invocations from the last 24 hours, as opposed to the last month.
  • UI: Support opening the Bazel and EngFlow profile using a self-hosted version of Perfetto UI.
  • UI: The EngFlow profile now includes the read location and source of download events.
  • BES: Support a new well-known BES keyword engflow:Requester, which can be used to specify who requested a CI build. See documentation for details.
  • Metrics: com.engflow.eventstore/batchstore_* is now reported for BES event batching.
  • BES: The new flag --resultstore_async_db_write enables non-blocking writing of ResultStore database blobs.

Changed

  • UI: Style the documentation links on the top of many of the pages like other links.
  • UI: Redesign 404 pages throughout the Build and Test UI.
  • UI: Removed the "experimental" tag on now stable features in the UI.
  • BES: Reduce memory usage by proactively clearing retained resources for idle streams.

Fixed

  • UI: Display a help message if no action cache statistics were reported by Bazel.
  • UI: Fixed a bug where some legacy URLs were not redirected correctly.
  • Metrics: Unify thread pool metrics to improve monitoring.

v2.80.1 (2024-05-30)

Fixed

  • Cache: fixed a divide-by-zero error calculating download speeds that caused some requests to hang.

v2.80.0 (2024-05-22)

New

  • Metrics: added com.engflow.re.cas/requests_in_flight_incoming, counting number of ByteStream requests in flight, by method name and pool.
  • Metrics: added com.engflow.re.cas/requests_in_flight_outgoing, tracking the number of outgoing CAS requests by method and pool.
  • Metrics: com.engflow.caching.inmemory/* is now reported on analyzers.
  • RE: Setting --worker_config="" now disables the Execution service on a worker, useful for cache-only instances.
  • UI: Reduced the JS bundle size by 28%.
  • CI Runners: add status page to the UI.

Changed

  • --run_common_member is now a no-op.
  • Cache: when --enable_distributed_cas=false and --migration_enable_distributed_cas_disabled_semantics=true (new temporary flag), workers no longer serve local cache files to other workers.
  • UI: The invocation profile open buttons no longer use misleading external link icons.
  • UI: Reduced the padding around tooltip elements found in the UI.

Fixed

  • Fix accounting error in ByteStream reads that could lead to workers filling up with unremovable blobs.

v2.79.2 (2024-05-20)

Internal release. No publicly facing changes.

v2.79.1 (2024-05-17)

Internal release. No publicly facing changes.

v2.79.0 (2024-05-08)

New

  • Metrics: Added com.engflow.re.cas/fetch_retries, a count metric that is incremented when we retry a CAS fetch after an error.
  • Metrics: Added com.engflow.re.cas/load_shed_errors, counting ByteStream requests failed on workers due to load shedding.
  • Metrics: added com.engflow.re.cas/find_replicas tracking the time to look up which instances hold a copy of a file.
  • Metrics: for com.engflow.re.cas/fetch_call_time, split the DISTRIBUTED_CAS tag value into DISTRIBUTED_CAS_NEAR (worker with file in same availability zone), DISTRIBUTED_CAS_FAR (worker with file in other availability zone), and DISTRIBUTED_CAS_FALLBACK (worker without file).
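
The three tag values can be read as a simple decision on two facts about the worker that served the fetch. The following is a minimal sketch of that reading; the function name, its parameters, and the zone strings are hypothetical and only the tag names come from the note above.

```python
from enum import Enum


class FetchSource(Enum):
    DISTRIBUTED_CAS_NEAR = "DISTRIBUTED_CAS_NEAR"
    DISTRIBUTED_CAS_FAR = "DISTRIBUTED_CAS_FAR"
    DISTRIBUTED_CAS_FALLBACK = "DISTRIBUTED_CAS_FALLBACK"


def classify_fetch(worker_has_file: bool, worker_zone: str, local_zone: str) -> FetchSource:
    """Map a fetch to a tag value (hypothetical helper, for illustration)."""
    if not worker_has_file:
        # The worker does not hold the file.
        return FetchSource.DISTRIBUTED_CAS_FALLBACK
    if worker_zone == local_zone:
        # The worker holds the file and is in the same availability zone.
        return FetchSource.DISTRIBUTED_CAS_NEAR
    # The worker holds the file but is in another availability zone.
    return FetchSource.DISTRIBUTED_CAS_FAR
```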

v2.78.3 (2024-05-04)

Internal release. No publicly facing changes.

v2.78.2 (2024-05-04)

Fixed

  • GCS: Avoid writing data to GCS multiple times.

v2.78.1 (2024-05-03)

Fixed

  • External storage: Fix an unnecessarily large number of requests.

v2.78.0 (2024-04-30)

New

  • UI: Add an option to change the order of autocomplete suggestions in the search filters, so that BES keywords with a prefix match appear before other substring matches.
  • Performance: When serving bytestream reads from external storage, stream the data to the client rather than waiting for the entire backing external storage download to complete.
  • EngFlow profile: record time for shutting down the docker container that was used for a previous action.
  • GCS: Support uploading file chunks in parallel.
  • S3: Make S3 connection metrics available by enabling --experimental_record_s3_metrics.

Changed

  • UI: Collapse the "BEP parsing errors detected" alert box in the invocation page by default, and include in its title how many BEP errors were encountered.

Fixed

  • UI: Improve messaging for aborted invocations.
  • S3: Fix rate limiting for streaming S3 requests.
  • RE: Revert macOS update (from 2024-04 back to 2023-12) to fix permissions.
  • CI Runners: reduce GitHub API usage by a factor equal to the number of workflows in the repository.

Removed

  • UI: Remove the (reconstructed and inaccurate) "Bazel Command" field in the Configuration tab of the Invocation Details page.

v2.77.2 (2024-04-26)

Fixed

  • Revert macOS update to fix permissions.

v2.77.1 (2024-04-25)

Changed

  • Update the Linux kernel to 6.1.85-1.

v2.77.0 (2024-04-23)

Changed

  • API: removed permission from the user role to call the NotificationQueue API. It now requires admin or global-admin.
  • Platform: Support invocation indexing without using a notification queue.
  • The flag --keep_exec_directories_for_debugging is now a no-op.

v2.76.0 (2024-04-16)

Fixed

  • UI: the builtin admin role can no longer access invocations outside the default tenant.
  • EngFlow profile: record failed download calls in the profile.
  • RE: return a retryable error code when receiving a 502 during Docker pull.
  • RE logging: previously, when an IP address got reused, some log messages referenced the previous instance ID, making it look like the old instance was still alive. This is now fixed.
  • CI runners: correctly identify the executed GitHub job.
  • CI runners: correctly record the number of failed GitHub error propagation jobs.
  • CI runners: allow GitHub to update the job status before querying it.
  • CI runners: retry on HTTP GOAWAY and similar errors.
  • CI runners: correctly record the number and age of queued jobs.

v2.75.4 (2024-05-31)

Internal release. No publicly facing changes.

v2.75.3 (2024-05-31)

Cherrypicks

  • BES: The new flag --resultstore_async_db_write enables non-blocking writing of ResultStore database blobs.
  • BES: com.engflow.eventstore/batchstore_* now contains metrics on BES batching.

v2.75.2 (2024-05-15)

  • RE: return a retryable error code when receiving a 502 during Docker pull.

v2.75.1 (2024-04-09)

Fixed

  • UI: Fixed a bug where EngFlow profiles were not downloadable.
  • UI: Fixed a bug on the invocations page where results were not correctly filtered.

v2.75.0 (2024-04-04)

New

  • CI runners: adding job status and various error metrics.
  • CI runners: adding retries for HTTP and RE-API calls.
  • CI runners: caching GitHub tokens for 45 minutes.
  • Platform: Add metric com.engflow.notificationqueue/size reporting the approximate size of notification queues.
  • Platform: Add v2 implementation of a Hazelcast-backed notification queue.
  • Platform: Add new metric com.engflow.profiling/publish_invocation_event to measure adding entries to the EngFlow profile.

Changed

  • Platform: Add additional type checks to Hazelcast-backed distributed maps.

Fixed

  • CI runners: fix Buildkite x64 jobs.
  • CI runners: fixing GitHub error propagation for arm64.
  • Avoid a deadlock in the schedulers' shutdown hook that could hang the process indefinitely.

v2.74.4 (2024-04-11)

Fixed

  • Logging: Fixed an issue where, when an instance's IP address was reused, some logged messages would still reference the previous instance ID.

v2.74.3 (2024-03-26)

Fixed

  • Auth: With --http_auth=none, unauthenticated users can acquire tokens with the viewer role to view the UI.

v2.74.2 (2024-03-26)

New

  • Auth: Add new platform role of viewer as a default for --http_auth=none.

v2.74.1 (2024-03-22)

Fixed

  • Scheduler: Shut down even if the thread pool is deadlocked.

v2.74.0 (2024-03-21)

Fixed

  • CI runners: GH workflows will now work with >30 jobs per workflow.
  • CI runners: Fix jobs hanging unnoticed by our runners.
  • CI runners: No longer fail entire polling loop on one misconfigured job.
  • UI: The chip showing the CI details for an invocation now truncates long lines with an ellipsis.

New

  • CAS: Fine tune external storage migration from epoch to expiration gc with --migrate_storage_max_cache_size and --migrate_storage_max_concurrent_operations.
  • CI runners: Report job queue size and age metrics to dashboards.
  • Platform: Improved metrics reporting for notification queues.
  • RE: Add --docker_volume_mount_path=/mnt/engflow/docker option to track disk usage for the docker volume.
  • UI: Enable --experimental_invocation_comparison to be able to compare invocation metadata.

Changed

  • Platform: Restore the serialized format of cluster address to IP address due to backwards compatibility issues.
  • UI: Due to performance considerations, the autocomplete fields suggest values with a substring match by lexicographical sort. They no longer sort prefix matches to the top of the suggestions. E.g. for the search term "foo" the autocomplete suggestions will now show "afoo" before "foobar".
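
The "afoo" before "foobar" example can be reproduced with a short sketch. Both function names are hypothetical, and the code only illustrates the two orderings described above, not the UI's actual implementation.

```python
def suggest_lexicographic(term: str, values: list[str]) -> list[str]:
    """Substring matches in plain lexicographical order."""
    return sorted(v for v in values if term in v)


def suggest_prefix_first(term: str, values: list[str]) -> list[str]:
    """Substring matches with prefix matches sorted to the top."""
    matches = [v for v in values if term in v]
    # False sorts before True, so values starting with the term come first.
    return sorted(matches, key=lambda v: (not v.startswith(term), v))


values = ["foobar", "afoo", "bar"]
print(suggest_lexicographic("foo", values))  # ['afoo', 'foobar']
print(suggest_prefix_first("foo", values))   # ['foobar', 'afoo']
```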

v2.73.0 (2024-03-11)

Fixed

  • CI Runners: fixed the idle timeout condition.
  • BES: BEP uploading no longer halts if an unknown target label is sent to the BES.

New

  • UI: Display BEP parsing errors on the invocation details page.

v2.72.1 (2024-04-09)

Fixed

  • Logging: Fixed an issue where, when an instance's IP address was reused, some logged messages would still reference the previous instance ID.

v2.72.0 (2024-03-06)

Fixed

  • Auth: Fix mTLS generation when the trusted key+cert are stored in a secrets store.
  • CAS: Respect digest size in FindMissingBlobs.
  • CAS: Writes to the replica clusters will correctly report not found rather than a gRPC error.
  • CI Runners: GH runner idle time is capped to 1 minute.
  • CI Runners: Failed actions now have a fixed worker pool.
  • CI Runners: Control default GH action runner version via a flag.
  • UI: Colors are now ARIA-compliant with improved contrast.
  • UI: Correctly process BES events reporting aborted tests.
  • UI: Fix memory leak caused by publishing notification to a queue with no consumer.
  • UI: Remove extra calendar UI element used for debugging.
  • UI: Support a no-op notification queue to avoid leaking memory when no queue readers exist.

New

  • RE: Report Pressure Stall Information stats for actions executed on Linux.
  • CAS: --migrate_storage_to_expiration_gc will begin migrating to the new expiration based external storage.
  • IAM: Add built-in role root to allow all operations.
  • UI: Invocation BES can now include metadata on the source control management via keywords. See documentation for details.

Changed

  • Auth: JWTs issued before v2.48 are no longer supported.
  • Config: --incompatible_reject_instance_name is now a no-op.

v2.71.3 (2024-02-27)

Fixed

  • Fix NullPointerException in GCS code.

v2.71.2 (2024-02-27)

Fixed

  • Return correct release version from Cluster API.
  • UI: Remove stray creation time selector.

v2.71.1 (2024-02-26)

Fixed

  • Rebuild release due to flaky release process.

v2.71.0 (2024-02-22)

Fixed

  • CI Runners: ensure 1:1 job to runner correspondence and propagation of results and error logs to GitHub.
  • CI Runners: fix missing file errors for GH trace events.
  • CI Runners: fix metrics to correctly report OS and arch.
  • ResultStore: Ensure outdated cache entries for the target tree are evicted regularly.
  • RE: propagate Docker start failures correctly when Docker doesn't produce an error file.
  • RE/Windows: suppress the RUNFILES_MANIFEST_FILE and RUNFILES_MANIFEST_ONLY environment variables in Docker actions so that Bazel tests can run without special settings.
  • UI: Labels with canonical repository names (e.g. @@protobuf~21.7//:protobuf_lite) are now correctly displayed in the target tree.
  • UI: Fix race condition, which could lead to invocations not being indexed.
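
For readers unfamiliar with canonical repository names, the sketch below splits an absolute Bazel label into its parts. It is a simplified illustration with a hypothetical parse_label helper: it assumes the label contains "//" and does not validate characters, so it is not a full implementation of Bazel's label grammar.

```python
from typing import NamedTuple


class Label(NamedTuple):
    repo: str        # repository name without leading "@" characters
    canonical: bool  # True when the label starts with "@@"
    package: str
    target: str


def parse_label(label: str) -> Label:
    """Split an absolute label such as @@repo//pkg:target (simplified)."""
    repo_part, sep, rest = label.partition("//")
    if not sep:
        raise ValueError(f"not an absolute label: {label!r}")
    canonical = repo_part.startswith("@@")
    repo = repo_part.lstrip("@")
    package, colon, target = rest.partition(":")
    if not colon:
        # "//foo/bar" is shorthand for "//foo/bar:bar".
        target = package.rsplit("/", 1)[-1]
    return Label(repo, canonical, package, target)


print(parse_label("@@protobuf~21.7//:protobuf_lite"))
```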

New

  • ResultStore: Add metrics and logging for BES processing costs.
  • UI: The Cluster Status page now displays the contract duration.
  • UI: Use GetTree to populate the action input browser.
  • UI: In the search filters, the autocomplete fields now suggest values that have a prefix match on the current input before suggesting substring matches.
  • UI: In the search filters, the BES keywords autocomplete now also suggests values that have a substring match on the current input. Before, this only included prefix matches.
  • UI: Improve logging for better debuggability.

Changed

  • UI: --experimental_advanced_search is now a no-op (always true).
  • ResultStore API: For the GetInvocation API, the invocation.target_tree and invocation.invocation_metadata.bazel_invocation_metadata.target_information.target fields are now deprecated.

v2.70.1 (2024-02-15)

Fixed

  • UI: Accept OIDC keys that specify the algorithm family RSA, but omit the algorithm.
  • EventStore: When making gRPC calls, ensure unknown build errors are propagated correctly.

v2.70.0 (2024-02-07)

Fixed

  • ResultStore: Only send data after the first build event was processed.
  • ResultStore: When replaying invocations, set the correct last updated timestamp.

Incompatible

  • RE: The flag --workers_handle_fallback_requests is now a no-op.

v2.69.0 (2024-01-31)

New

  • Enable --experimental_http_compression by default.
  • Enable --experimental_historical_results by default.
  • --endpoint (formerly known as --build_and_test_url) is now required on all instances.
  • Improve UI accessibility by more strictly enforcing presence of aria-label in Icon Button elements.
  • Analyzer: When analyzing the EngFlow profile, include how much data was downloaded from the CAS in general and for locally executed actions.
  • UI: Improve protocol to download files. This may cause temporary glitches during deployment of the release (but not after the deployment is complete).

Fixed

  • S3: Return RESOURCE_EXHAUSTED instead of INTERNAL when client is overloaded.
  • Analyzer: Mark analyzer instances ready only after they have successfully connected to the cluster.
  • RE: Windows: use symbolic links for input files instead of hard links.
  • RE: Ensure empty pools are kept track of by the leader scheduler and autoscaled correctly.
  • RE: Ensure output files are cleaned up between actions when --experimental_tree_delta_exec_root=true.
  • UI: Adjust all icon buttons to share the same style of on-hover tooltip.
  • Mini: Fix issue that led to PERMISSION_DENIED errors.
  • Fix corruption of compressed HTTP responses that set a Content-Length header.

v2.68.2 (2024-01-24)

Fixed

  • ResultStore: Fixed a permission-denied issue while fetching results from other schedulers within the same cluster.

v2.68.1 (2024-01-24)

Fixed

  • UI: Fixed an issue where some icons on the left nav bar were displayed incorrectly.

v2.68.0 (2024-01-23)

New

  • Analyzer: Extend metrics to include number of analysis cache hits.
  • Analyzer: Provide more detailed information on analysis failures.
  • API: Add the HTTP endpoint /api/resultstore/v1/instances/[instance_name]/invocations/[invocation_id]/profiles/bazel, which redirects to an invocation's Bazel profile, provided it was uploaded to the CAS.
  • RE: Introduce support for configuring fallback worker pools to handle scaledown events more gracefully, using the --experimental_mia_fallback_pools flag.
  • UI: For test summaries, highlight if some shards or runs are missing. Factor in missing runs when determining the aggregate status of a test shard.
  • UI: For test targets, add menu items that allow users to copy the URL to the test's logs, test.xml and similar.
  • UI: Analyzer pool nodes are now listed on the cluster's status page.
  • Windows: Allow docker to communicate with the Internet if the --docker_allow_network_access=true and --docker_default_network_mode=standard flags are set.
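
A minimal sketch of building the URL for the new Bazel profile endpoint from its path template. The helper name, the example cluster URL, and the choice to percent-encode the path segments are assumptions; only the path layout comes from the note above.

```python
from urllib.parse import quote


def bazel_profile_url(cluster_url: str, instance_name: str, invocation_id: str) -> str:
    """Fill in the endpoint's path template (hypothetical helper)."""
    return (
        f"{cluster_url.rstrip('/')}/api/resultstore/v1"
        f"/instances/{quote(instance_name, safe='')}"
        f"/invocations/{quote(invocation_id, safe='')}"
        "/profiles/bazel"
    )


# "https://example.cluster" is a placeholder, not a real cluster.
print(bazel_profile_url("https://example.cluster/", "default", "abc-123"))
```

Because the endpoint answers with a redirect, an HTTP client needs to follow redirects to reach the profile itself.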

Incompatible

  • UI Authentication: --http_auth=okta_login is no longer supported. Please use --http_auth=oidc_login instead.

Changed

  • UI: Changed the invocation analysis endpoint from HTTP to gRPC.
  • UI: Replaced experimental badges with individual chips indicating documentation links and the experimental status of various features.

Fixed

  • Analyzer: Limited the size of the summaries sent for invocation analysis.
  • Profiling: Corrected key names for the action id and digest of action cache lookup events.
  • RE: Fixed excessive re-trying of actions caused by OOM or worker missing-in-action events.
  • ResultStore: Fixed a bug where Bazel profile URIs did not include a custom port.
  • ResultStore: Correctly handle multiple --bes_keyword_deny_list values.
  • S3: Fixed IllegalStateException in certain S3 retry cases.
  • UI: Fixed flicker in target tree view when an invocation is not yet complete.
  • UI: On the action details page, surface if an action was not executed.
  • Fixed the default cipher suites to work with TLS 1.3 when using the JDK SSL implementation (--experimental_select_ssl_impl=jdk).

v2.67.0 (2024-01-11)

New

  • RE: Support eagerly crashing when the JVM appears to be almost out of memory.

Incompatible

  • RE: When using --discovery=static, only specify --static_scheduler (at least one), not --static_cas_node, which is no longer used.
  • RE: Changed --service_discovery_mode to default to builtin.
  • RE: Running inside the Linux sandbox is no longer supported.
  • RE: Set --run_common_member=false by default. It can only be false if --service_discovery_mode=builtin (which is also set by default).
  • RE: --experimental_use_async_storage_for_eventstore is now a no-op on AWS S3.
  • RE: Set --discovery to static by default; deprecate multicast.
  • UI: --experimental_use_oidc_discovery is now a no-op. The discovery URI is used whenever it is included in the JSON provided via --oidc_config.

v2.66.0 (2024-01-03)

New

  • Analyzer: Reduce memory consumed when analyzing invocations.
  • Analyzer: Add metrics to keep track of the status of retrieving the Bazel and EngFlow profile for analysis.

Fixed

  • EngFlow profile: Server-side profile generation captures events from all schedulers.
  • EngFlow profile: Fixed formatting of the digest in the EngFlow profile action cache lookup event.

v2.65.1 (2024-01-02)

Fixed

  • ResultStore: Fixed the target tree not loading on older invocations.

v2.65.0 (2023-12-29)

New

  • RE: Added --warm_containers_timeout to control timeouts for worker container warming.

Changed

  • RE: Allow JWT and mTLS in combination with external auth.
  • RE: Record metrics for Caffeine caches.
  • UI: Update copyright footers.

Fixed

  • UI: Fixed an issue where invocation page load sometimes crashed and displayed a blank screen.
  • UI: Fixed a regression where invocations were not timestamped correctly and did not appear in the invocation search.
  • UI: In light mode, change the background color of disabled primary buttons for better contrast.
  • UI: When running in insecure mode, ensure that the Content-Security-Policy allows image sources from http.
  • UI: Fix breakage when an invocation's metadata does not include Bazel command details.
  • ResultStore: Fetches invocations from the cache in external storage instead of replaying multiple times.
  • RE: Fixed an issue with deleting stdout, stderr in Windows.
  • RE: Docker no longer inherits handles from other threads on Windows.
  • Analyzer: Report metrics on invocation analysis when using a separate analyzer pool.

v2.64.0 (2023-12-22)

New

  • AWS: Add --experimental_record_s3_metrics for tracking S3 specific information.
  • UI: Allow viewing historical execution results given the execution ID.

Changed

  • UI: Updated base font from Poppins to Roboto.
  • GRPC: Add logging for PROTOCOL_ERRORs.

Incompatible

  • GCP images no longer have exim4 installed.

Fixed

  • RE: Reduce memory use of persistent workers.
  • ResultStore: Avoid sending is_last more than once.
  • GRPC: Prevent PROTOCOL_ERROR errors by reducing max metadata length and de-duplicating response headers.

v2.63.4 (2023-12-16)

Fixed

  • RE: Reduce memory use of persistent workers.
  • BES/CAS: Set the correct offset for multi-part uploads to prevent upload corruption.

v2.63.3 (2023-12-14)

Fixed

  • UI: Fix issue with showing test reports from old invocations.

v2.63.2 (2023-12-12)

Fixed

  • Fix release process.

v2.63.1 (2023-12-12)

Fixed

  • Fix a GLIBC mismatch in mini.
  • Allow cluster-wide configuration of TCP keepalive.

v2.63.0 (2023-12-06)

Fixed

  • UI: Disable the "Analyze" button if the only available profile is stored on the local disk.
  • Fix a race condition when restarting the worker service, which could cause the worker to get into a bad state causing all subsequent actions to fail.

New

  • UI: Style backticked content in Bazel Invocation Analyzer suggestions as inline code blocks.
  • UI: Also linkify links in Bazel Invocation Analyzer suggestions that lead to one of Bazel's GitHub pages.
  • It is now possible to use Analyzer backends with self-signed certificates.
  • Add property to github-actions CI runner that disables dangling processes clean-up after job execution (default: false).
  • Record the status of CAS uploads in the EngFlow profile.
  • On Windows, executors can now recover from a failure to delete or rename a file in the exec root by switching to a new exec root. The executor will restart a persistent worker when this happens, but in most cases it does not need to restart a docker container.

v2.62.0 (2023-11-28)

Fixed

  • Correct issues establishing current license validity.
  • UI: Fixed error where the target status in the target tree would be inconsistent with the status on the target card.

Incompatible

  • --experimental_build_index_percentage is now a no-op.

v2.61.0 (2023-11-22)

Added

  • UI: Recommend setting the option --remote_download_outputs=minimal if it is not set and remote execution is used.

Changed

  • Moved the AWS CloudFormation signal after container warming and service discovery.

Fixed

  • UI: Fix render error when a test target's summary is empty.
  • UI: If fetching a test target's test report fails, show a more meaningful error message.
  • UI: Increase precision of percentages shown in the Cache Hits and Execution chart.
  • Analyzer: Fix a bug which prevented the Bazel Invocation Analyzer from fetching the Bazel profile from the CAS when using a separate analyzer pool.

Incompatible

  • --strict_http_headers is now a no-op and will be removed in a future release.
  • Secrets URLs must now explicitly declare the secretstore:// scheme when the secret is not on the local file system.
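
A minimal sketch of telling the two cases apart by URL scheme. The function name and the example values are hypothetical, and the format of whatever follows secretstore:// is not specified here.

```python
from urllib.parse import urlparse


def is_secret_store_url(value: str) -> bool:
    """True when the value explicitly uses the secretstore:// scheme."""
    return urlparse(value).scheme == "secretstore"


print(is_secret_store_url("secretstore://my-secret"))  # placeholder secret name
print(is_secret_store_url("/etc/engflow/key.pem"))     # placeholder local path
```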

v2.60.0 (2023-11-16)

Added

  • Extend which environment variables we redact before storing the BEP: also redact the value of variables whose names include credential. For previously stored BEP, adjust the UI code to retroactively redact these values.
  • ResultStore: fetching logs now requires the (new) resultstore:GetLogs permission.
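
The name-based redaction rule can be sketched as follows. The placeholder string, the function name, and the case-insensitive match are assumptions; the note above only states that variables whose names include credential are redacted.

```python
REDACTED = "<redacted>"  # hypothetical placeholder value


def redact_env(env: dict[str, str]) -> dict[str, str]:
    """Replace the values of variables whose names include "credential"."""
    return {
        # Case-insensitive matching is an assumption made for this sketch.
        name: (REDACTED if "credential" in name.lower() else value)
        for name, value in env.items()
    }


print(redact_env({"AWS_CREDENTIAL_FILE": "/secrets/aws", "PATH": "/usr/bin"}))
```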

Fixed

  • Do not cache invocation analyzer results if errors were encountered retrieving either the Bazel or EngFlow profile (allows re-analysis on transitory retrieval issues).
  • UI: Fix incorrect logs displayed for test attempts/shards.
  • RE: allow action outputs outside the working directory.

v2.59.1 (2023-11-13)

Fixed

  • ResultStore/GetTarget: Handle the case when target statuses are given for multiple configurations.

v2.59.0 (2023-11-07)

Added

  • Report container restarts count prior to action execution.
  • Add --experimental_bytestream_iop_limit to prevent overloaded workers from accepting ByteStream reads and writes.
  • Support marking EngFlow flags as sensitive, which will redact their value when shown in the Build and Test UI.

Fixed

  • Fix PERMISSION_DENIED from waitExecution.
  • Ensure HTTP responses with status 304 (Not Modified) are not compressed. This fixes connection errors in Safari.
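
A 304 response carries no body, so there is nothing to compress. The sketch below shows one way a server-side helper might skip compression in that case; maybe_compress and its parameters are hypothetical and do not describe EngFlow's implementation.

```python
import gzip


def maybe_compress(status: int, headers: dict[str, str], body: bytes,
                   accepts_gzip: bool) -> tuple[dict[str, str], bytes]:
    """Gzip the body unless the response must stay uncompressed."""
    if status == 304 or not body or not accepts_gzip:
        # 304 (Not Modified) and empty responses are passed through as-is.
        return headers, body
    compressed = gzip.compress(body)
    new_headers = dict(headers)
    new_headers["Content-Encoding"] = "gzip"
    # Content-Length must describe the bytes actually sent.
    new_headers["Content-Length"] = str(len(compressed))
    return new_headers, compressed
```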

v2.58.0 (2023-11-01)

Added

  • UI: Mark environment variables containing credentials as sensitive.

v2.57.1 (2023-11-01)

Fixed

  • Corrected an issue with clusters requesting unnecessary amounts of re-scaling during operation.

v2.57.0 (2023-10-25)

Added

  • UI: Action debug page: show richer errors.
  • UI: Show the action digest on the historical results page, and if cacheable, add a link to the AC view.
  • UI: Update paths to the Action Details page.

Fixed

  • RE: Fixed issue with built-in autoscaler getting out of sync with AWS auto-scaling group, preventing clusters from scaling up appropriately.
  • UI: Added help text on the Highlights page in case of a system failure.
  • UI: Fix bug in target tree prefix search that could lead to infinite loops.
  • UI: Fixed code blocks not indicating that there is more content.
  • UI: Remove non-HTTP/2 headers when converting headers to fix connection problems.

v2.56.2 (2023-10-26)

New

  • Mini: enable running with TLS.

Fixed

  • Propagate credentials to an internal gRPC server that expects it.

v2.56.1 (2023-10-19)

Fixed

  • Fixed an autoscaler issue where the requested instances would drop to the minimum during a deploy.
  • UI: Fixed UI bug when using --http_auth=basic authentication.

v2.56.0 (2023-10-19)

Added

  • UI: Expose the invocation's non-zero exit code as a tooltip in the summary at the top of the invocation details page.

Fixed

  • UI: Fixed error where failing tests would report as passing.
  • UI: Add "No Tests Found" status when running bazel test without requesting any test targets.
  • UI: Fixed bug where the UI became non-interactive when using the browser navigation while a modal was open.
  • Improved Docker shutdown on Windows.
  • Improved Docker path handling on Windows.
  • UI: Improve display of invocations while no Bazel metadata has been received yet.
  • UI: Improve target tree rendering when no target information is available.
  • UI: Better determine the invocation status by leveraging Bazel-specific exit codes.
  • UI: Fix bug where the instance name of an invocation was not added to the search index.

v2.55.2 (2023-11-12)

Fixed

  • ResultStore/GetTarget: Handle the case when target statuses are given for multiple configurations.

v2.55.1 (2023-10-16)

This release contains fixes for CVE-2023-39325, CVE-2023-3978, and CVE-2023-38545.

Fixed

  • UI: More clearly mark bazel test runs if there were no test targets.
  • UI: Ensure the UI stays interactive when using the back button while a modal is open.

v2.55.0 (2023-10-11)

Added

  • UI: Show statistics for cached / uncached actions.
  • UI: Improve display of test target details when using sharding, runs per test or test attempts.
  • UI: Add more hints to help mode.

Fixed

  • Improved error handling during input tree creation.
  • Remove environment variables RUNFILES_MANIFEST_ONLY and RUNFILES_MANIFEST_FILE from actions running locally without sandboxing.
  • UI: Fix bug where some test suites were rendered as passing, although they were failing.

Incompatible

  • Individual actions are now limited to at most 10000 cores.

v2.54.1 (2023-10-12)

This release contains fixes for CVE-2023-39325, CVE-2023-3978, and CVE-2023-38545.

v2.54.0 (2023-10-09)

Fixed

  • BES: invocations running for more than 3 hours are no longer considered timed out.
  • docker: errors from pulling containers are now retried.

Incompatible

  • platform: Java 17 is now required. Note that this does not affect action execution.
  • platform: --experimental_always_retry_missing_worker_failures is now a no-op.

v2.53.2 (2023-10-06)

Fixed

  • rhodonite OS: Update AWS CLI if already installed.
  • UI: Improve display of test target details when using sharding, runs per test or test attempts.

v2.53.1 (2023-10-06)

Fixed

  • rhodonite OS: Set custom uid and gid, as rhodonite reserves 108 and 114.
  • RE: Return UNAVAILABLE on docker pull failure.

v2.53.0 (2023-10-05)

Added

  • CAS: Add type=readonly dimension to external storage metrics.
  • Windows: experimental support for actions running in Docker containers.

Fixed

  • probers: log details about BatchUploadBlobs failures.
  • RE: Update AWS base images: debian 12 20231004-1523, macos 12.6.9-20230921-235406
  • RE: Update GCP base images: debian 12 v20231004
  • resultstore: stop compressing build logs.
  • UI: fix "flash" in dark-mode during loading.
  • UI: fix BES keyword logic causing query to return no results.
  • UI: prevent a page-reload when clicking a search-card.
  • Windows: ensure dockerd starts automatically when configured to store data on a secondary volume.

v2.52.2 (2023-09-29)

Fixed

  • BES: A problem when uploading BES from longer builds.

v2.52.1 (2023-09-29)

Fixed

  • (internal) Migrated to new CI fleet.

v2.52.0 (2023-09-19)

Added

Fixed

  • platform: Prefer the same availability zone for uploading or downloading blobs.

Removed

  • goma: --remote-instance-name is now a no-op.
  • platform: --experimental_record_reported_action is now a no-op.

Incompatible

  • platform: --incompatible_reject_instance_name now defaults to true.

v2.51.0 (2023-09-11)

Fixed

  • GCP: when fluent-bit is enabled for remote execution, the fluent-bit shipped logs now correctly set the log level for "S" (severe) messages. Previously these appeared with "default" level, now they appear with "error" level.
  • Linux: adjusted how kernel headers are installed to aid NVIDIA GPU driver installation. If the linux-headers-{VERSION} apt-package is not found for the current kernel version, an earlier version is installed.
  • Windows: disabled NGen and automatic updates for autoscaled instances to improve performance and prevent unplanned reboots. We update base images instead.

v2.50.1 (2023-09-07)

Fixed

  • Do not require an authenticated session to run the GetCapabilities RPC.

v2.50.0 (2023-08-29)

Changed

  • Executor: Now logs executor and action ID for action events.
  • Windows: Allow more time for the worker service to gracefully shut down when autoscaling terminates an instance.
  • Worker: Add timeout of 5s when gracefully shutting down.

Fixed

  • CAS: S3 acceleration handles cross region requests.
  • CAS: S3 rate limit errors on async CAS.

v2.49.1 (2023-08-17)

Fixed

  • Fix NullPointerException when using --client_auth=gcp_rbe.

v2.49.0 (2023-08-16)

Added

  • EngFlow Profile: Add pool name to executed actions.
  • Record historical action execution results for all actions, including failing and uncached, and report link to the corresponding action details page (Bazel shows this for failed actions).

Changed

  • UI: Enable help-mode by default.

Fixed

  • UI: Fixed test progress bar never completing.
  • UI: Fixed some codeblocks unnecessarily showing fade-out effect.
  • UI: Fixed various typos.
  • UI: Fixed invocation last updated time: report the time the server received the update.
  • CAS: Handle errors downloading from S3 gracefully.

v2.48.2 (2023-08-01)

Fixed

  • Fixed creation of Debian 12 images.
  • Fixed uid/gid allocation for the engflow user.

v2.48.1 (2023-07-27)

Fixed

  • Fixed handling of cancelled execute calls; release 2.48.0 introduced an error in cancellation handling that could cause workers to be marked as busy even though they were not, which significantly reduced the capacity of the cluster. This only triggered under high load with very short client-provided timeouts.
  • Record action memory limit and pool id in the server-side profiles.

v2.48.0 (2023-07-24)

New

  • Logging: more logging for failed service discovery calls.
  • Logging: improve logging for executed actions; this merges some log lines and changes structured field names to clarify units.
  • UI: show cached action results on the action page.

Changed

  • Actions that fail with OOM are no longer unconditionally retried.
  • BES: don't propagate certain errors to the client, so that partial data is still received even if there are issues.
  • Auth: all generated JWT now contain a version number.
  • Docker: check if the container image exists locally before pulling it.
  • Profile: profiles now contain multiple full executed events for the same action in case of server-side action retries.

Fixed

  • Various UI improvements in the action input browser.
  • In some cases, uploading to external storage failed in a way that left the cluster in an inconsistent state resulting in PRECONDITION_FAILED failures; retry the upload in those cases.

v2.47.4 (2023-07-17)

Fixed

  • Mini: fix a broken base image.

v2.47.3 (2023-07-18)

Fixed

  • Allow using images preexisting on the local Docker daemon. This mostly affects EngFlow/free.

v2.47.2 (2023-07-12)

Fixed

  • Mini: disable --docker_use_process_wrapper by default.
  • Avoid retries on OOMs if --experimental_retry_failure_due_to_signal=false.

v2.47.1 (2023-07-12)

Fixed

  • Mini: correctly handle the empty file. This only affects the EngFlow/free version.

v2.47.0 (2023-07-07)

New

  • CAS: Uploads and downloads now prefer nodes from the same subnet.
  • UI: We now show warnings when invocations miss recommended Bazel flags.
  • UI: Allow users to hide BES keywords from filter.

Fixed

  • Fix bugs that could cause processes to hang during shutdown.
  • Fix bug that caused Remote Asset API to fail.

v2.46.0 (2023-06-28)

New

  • UI: Limit the number of requested target patterns shown, and warn when patterns are truncated.
  • UI: For failing targets, add a link to the action details pages of their failing actions to ease debugging.

v2.45.2 (2023-06-22)

Fixed

  • Fixed HTTP mTLS support.

v2.45.1 (2023-06-21)

Fixed

  • Server-side profiles no longer have incorrect entries for failed actions.
  • HTTP APIs can be used with mTLS authentication.

v2.45.0 (2023-06-19)

Fixed

  • Optimize the invocation index page.
  • Reduce native memory usage to avoid OOMs.

New

  • Report extra info when an action's output tree is too large.

v2.44.0 (2023-06-13)

Fixed

  • Fix a bug that could sometimes leak memory during gRPC and HTTP requests.
  • goma: --remote-instance-name can now be empty.

New

  • Native memory consumption is now reported as a metric.
  • HTTP: GZIP compression of responses can be enabled with --experimental_http_compression=true.

Removed

  • RE: gRPC-level compression of blobs is removed in favor of Remote API compression. --experimental_compressed_cas_reads is now a no-op.
  • RE: --experimental_event_store_delay is no longer supported and now a no-op.

v2.41.1 (2023-06-13)

Fixed

  • RE: Workers: increase file descriptor limit to 32k.
  • S3: handle uploading empty blobs.

New

  • RE: export metric for direct buffer memory.

v2.43.0 (2023-06-07)

Fixed

  • Limit the number of dimensions exported for UI metrics to reduce potential reporting costs.
  • AWS: Fix a bug where uploading empty blobs to S3 failed.
  • Fix quotes in non-canonical docker image error.
  • UI: Various improvements to the Debug Action page.

New

  • EngFlow profile: add action cache lookup events.

Deprecated

  • --experimental_results_blobs_root is now a no-op.

Removed

  • No-op flag --xcode_locator is no longer supported.

v2.42.0 (2023-05-31)

Fixed

  • GCP: Disable Ops Agent after installation, preventing monitoring cost increase.
  • RE: Increase file descriptor limit to 32k.

New

  • UI: If an invocation is running, display when it was last updated.
  • UI: Add information on queue size and executing actions to the cluster status page.
  • Report metric for the number of open file descriptors.
  • Detailed logging for open file descriptors by type and directory.

v2.41.0 (2023-05-23)

Fixed

  • RE: actions creating unreadable outputs are now client errors, returning INVALID_ARGUMENT.

New

  • engflowapis: Add new API cluster.v1.Cluster/GetInfo to retrieve information about a cluster.
  • UI: Add information on executor pools to the cluster status page.
  • UI: Improve the speed at which the UI processes invocations.

v2.40.2 (2023-05-18)

Internal release. No publicly facing changes.

v2.40.1 (2023-05-18)

Fixed

  • RE: When a worker replies with DEADLINE_EXCEEDED, make client retry Execute.

v2.40.0 (2023-05-17)

Fixed

  • RE: NOT_FOUND errors are now returned in ExecuteResponse instead of returning a gRPC error.
  • RE: Allow retries for more failed executions.
  • UI: Fix bug in console view where scroll-to-bottom would not work when highlighting a line.

New

  • UI: Targets are now linked from the Configuration tab.
  • UI: Redact potentially sensitive data from environment variables.

v2.39.0 (2023-05-10)

Fixed

  • UI: fix bug where failed test logs would fail to display on some clusters.
  • RE: export client-triggered execution errors into a separate metric dimension for alerts and dashboards.
  • RE: improve action logging: add primary output path, merge lines for exit code and output stats.
  • RE: improve internal retries of failed docker start errors.

New

  • UI: Introduce help mode, which users can turn on to access tips on different UI components.
  • Auth/TLS: add --tls_cipher_suites flag to configure the set of supported ciphers for incoming TLS connections; only allow TLS 1.2 and 1.3.
  • RE: execution latency metrics are now exported by default.
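
Since only TLS 1.2 and 1.3 are accepted for incoming connections, clients pinned to older protocol versions will fail to connect. A minimal client-side sketch using Python's standard `ssl` module is shown below; it says nothing about the value format that --tls_cipher_suites expects on the server, and the cipher string is just an example in OpenSSL syntax.

```python
import ssl

# Client-side context that refuses anything older than TLS 1.2, matching the
# server-side restriction described in the release note.
context = ssl.create_default_context()
context.minimum_version = ssl.TLSVersion.TLSv1_2

# Optionally narrow the TLS 1.2 cipher list (OpenSSL cipher string syntax).
# TLS 1.3 cipher suites are not affected by set_ciphers().
context.set_ciphers("ECDHE+AESGCM")

print(context.minimum_version.name)
```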

v2.38.2 (2023-05-12)

Fixed

  • Infra: don't mix Debian 10/11 in the release build.

v2.38.1 (2023-05-12)

Fixed

  • RE: fewer client-side retries on execution errors when machines go missing.
  • RE: report client-triggered execution errors as a separate metric dimension for alerts and dashboards.

New

  • RE: report execution latency metrics by default.

v2.38.0 (2023-05-05)

Fixed

  • UI: Fix duplicate action failed messages.
  • UI: Fix bug where region next to sidebar is not clickable.
  • UI: Fix a bug related to automatically fetching new invocations on the "Invocations" page.
  • UI: Fix bug that led to snackbars (update notifications shown at the bottom of the page) sometimes not being shown.

v2.37.0 (2023-05-04)

Deprecated

  • Running EngFlow on Debian 10 is now deprecated. This does not affect the Docker containers that actions run in, nor the client machine that the build tool (e.g., Bazel) runs on.

Fixed

  • UI: Fix a bug in the target tree rendering, which could cause an infinite loop.
  • UI: Don't render empty test logs.
  • RE: Fix a bug where action descriptor was not deserialized correctly.

New

  • UI: In the target tree, add "copy" buttons for the target label.
  • RE: Export execution latency metrics by stage.
  • UI: Do not show an empty target tree for successful invocations. In-progress invocations, and invocations where not all targets were successful, still filter successful targets out by default.
  • UI: Show extra test outputs for test attempts.
  • RE, Docker: enable process wrapper by default.

v2.36.1 (2023-04-26)

Fixed

  • RE: Address ServiceDiscoveryServer lock contention in the workers with more nuanced concurrency.
  • RE: Invocation resource usage summary correctly aggregates execution pools.

v2.36.0 (2023-04-25)

Fixed

  • RE: Correctly parse default pool summary resource usage for the Invocation.

New

  • UI: Redact authorization headers from different Bazel flags in the BES before storing it.
  • RE: Include mnemonic in the action execution timing log.

v2.35.0 (2023-04-11)

Fixed

  • RE: Fix issues where AWS SNS subscription requests were rejected with unknown cluster keys in the URL.
  • UI: Logs will no longer drop lines on outputs that have newlines trimmed at the start.

New

  • Deploy: Performance enhancements around plan building.
  • RE: --experimental_enable_priority_pools will schedule high priority actions to pools postfixed with _high_priority.
  • RE: CAS now supports legacy GetTree API.
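
The pool naming used by --experimental_enable_priority_pools can be sketched as below. This is an assumption-laden illustration: the release note only states the `_high_priority` suffix, so how an action is classified as high priority is left to the hypothetical `is_high_priority` argument.

```python
HIGH_PRIORITY_SUFFIX = "_high_priority"  # suffix named in the release note


def target_pool(pool: str, is_high_priority: bool) -> str:
    """Return the pool name an action would be routed to under this scheme."""
    if is_high_priority and not pool.endswith(HIGH_PRIORITY_SUFFIX):
        return pool + HIGH_PRIORITY_SUFFIX
    return pool


# Hypothetical pool name "default".
print(target_pool("default", True))
print(target_pool("default", False))
```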

v2.34.1 (2023-03-03)

Cherrypicks

  • UI login: fixed a bug with Google OAuth.

v2.34.0 (2023-03-30)

Fixed

  • resultstore: Fix a bug where reading the profile stream before the invocation finishes was causing unintended side effects.

New

  • UI: In the invocation statistics, include percentiles of the wall times.

v2.33.0 (2023-03-21)

Incompatible

  • MacOS 11 (Big Sur) image for RBE workers is not supported anymore. MacOS 12 is supported.

Deprecated

  • --experimental_compressed_cas_reads is now deprecated and a no-op; it's always enabled.
  • --enable_bytestream_compression, --experimental_bytestream_compression, and --incompatible_s3_use_structured_paths are now no-ops and will be removed in a future release. They are now always enabled.

Fixed

  • RE: Fix a bug where we sometimes scheduled actions on terminated workers.
  • BES: Fix a bug where PublishBuildToolEventStream sometimes returned UNKNOWN.
  • UI: Fix Bazel command-line flag links.
  • AWS: Rate limit signals sent to CloudFormation to avoid exceeding the request quota.
  • UI: Fix a bug that prevented filtering invocations by certain times of the day.
  • UI: Fix bug where scrolling in the console view did not work.
  • UI: Top-level targets now correctly expand in the target tree.

New

  • UI: Enable advanced search by default.
  • UI: Improve differentiation for pending filters.
  • UI: Allow linking to the target tree with certain filters applied.

v2.32.0 (2023-03-14)

Added

  • Probers: logging all retried errors.
  • RE: Local logging can now be configured via JVM properties.

Fixed

  • RE: Fix async storage write/close race.
  • RE: Fix working directory "." corner case.
  • Probers: fix the execution prober not retrying properly after its first failure.

v2.31.0 (2023-03-07)

Added

  • UI Auth: Support reading permissions from OIDC's JWT payload.
  • RE: The new --experimental_summarize_invocations will include invocation remote resource usage from the resultstore.
  • RE API: Action outputs are now interpreted as relative to working directory, not exec root.

Fixed

  • UI: Fix bug where filter updates in the invocation search were not always applied.
  • UI Auth: Fix bug where OIDC was not always set up correctly on the admin login page.
  • RE: Now reports S3 503 errors as UNAVAILABLE.
  • RE: Retry CloudFormation SignalResource calls.

v2.30.0 (2023-02-18)

Added

  • UI: A separate administrator login at /adminlogin for EngFlow engineers to access cluster UI.
  • RE: Logging config --log_file to have its own parent directory.
  • RE: Bytestream/Writes, ContentAddressableStorage writes and S3 writes are now asynchronous.
  • RE: ContentAddressableStorage will now support in-transit compression/decompression.

Fixed

  • UI: Fix error displayed on login page.
  • RE: Logging config --log_level will now throw exception on incorrect values.
  • RE: Prevent BusyExecutorException from reaching the client; the scheduler now retries instead.
  • RE: Fix gRPC service for notification queue.

v2.29.0 (2023-02-01)

Added

  • RE: The new --experimental_always_retry_missing_worker_failures flag allows server-side execution retries in more cases than before. We expect this reduces the chance of client-visible execution failures.
  • UI: Support authenticating to the UI via OpenID Connect using --http_auth=oidc_login and setting --oidc_config.
  • UI, EngFlow profile: Actions now have a previous_action_runner field. We expect this helps us better understand why containers are restarted.

Fixed

  • RE: Worker now waits before shutting down the gRPC server. We expect this reduces the chance of NOT_FOUND errors for WaitExecution calls.
  • RE: Scheduler now retries executions server-side if the selected worker is busy. Previously this error was returned to the client, which then retried.
  • RE: Scheduler now won't retry WaitExecution if the worker went missing. This returns NOT_FOUND to the client, which should then retry. We expect this reduces the chance of UNAVAILABLE errors and Bazel exit code 34.
  • Goma: Fix timestamp format in JSON logs, so the nanosecond-fraction is zero-padded to 9 decimals.
  • UI: Multitude of bug fixes and usability improvements.
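
To see why zero-padding the nanosecond fraction matters, consider the sketch below. The exact layout of the Goma timestamp is not given in these notes, so a plain `seconds.fraction` string is assumed and `format_timestamp` is a hypothetical helper.

```python
def format_timestamp(seconds: int, nanos: int) -> str:
    """Render seconds plus a nanosecond fraction zero-padded to 9 digits."""
    return f"{seconds}.{nanos:09d}"


seconds, nanos = 1675209600, 5_000_000  # 5 ms past the second

unpadded = f"{seconds}.{nanos}"            # fraction reads as .5 s, not .005 s
padded = format_timestamp(seconds, nanos)  # fraction reads as .005 s

print(unpadded)
print(padded)
```

Without the padding, a downstream parser would interpret 5 ms as half a second.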

v2.28.2 (2023-01-30)

Cherrypicks

  • RE: return NOT_FOUND when workers go missing during execution.
  • probers: retry when Execute returns NOT_FOUND.
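
A client or prober reacting to NOT_FOUND might retry along the lines of the sketch below. It is not the probers' actual implementation: `NotFoundError`, `execute_with_retry`, and the attempt limit are hypothetical stand-ins for a real RPC client and its status codes.

```python
import time


class NotFoundError(Exception):
    """Hypothetical stand-in for an RPC that failed with status NOT_FOUND."""


def execute_with_retry(execute, max_attempts=3, backoff_seconds=0.0):
    """Call execute(); retry only on NotFoundError, up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return execute()
        except NotFoundError:
            if attempt == max_attempts:
                raise  # give up and surface the last NOT_FOUND
            time.sleep(backoff_seconds * attempt)  # simple linear backoff


# Simulated Execute call that loses its worker twice before succeeding.
calls = {"count": 0}


def flaky_execute():
    calls["count"] += 1
    if calls["count"] < 3:
        raise NotFoundError("worker went missing")
    return "ok"


result = execute_with_retry(flaky_execute)
print(result, calls["count"])
```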

v2.28.1 (2023-01-27)

Cherrypicks

  • UI login: fix bug in Open ID Connect token requests.

v2.28.0 (2023-01-27)

Cherrypicks

  • RE: added an experimental flag, --experimental_always_retry_missing_worker_failures. When --experimental_always_retry_missing_worker_failures is enabled, the scheduler will always retry on UNAVAILABLE errors from workers.
  • RE: add wait time before shutting down gRPC server.

Added

  • Goma, --logging_timestamp_format: now supports the value fluent-bit, for Fluent-bit compatible timestamps in JSON logs ("%s.%L" format).
  • RE: deployed retriable probers to detect regressions.

Changed

  • RE, --log_file_limit: default is raised from 10mb to 100mb. This should avoid frequent log rotation when logging a lot.
  • fluent-bit: health-check is now enabled (but not exposed via any infrastructure).
  • fluent-bit: will now ship its own log-file by default.

Fixed

  • RE, Goma: the .deb installer now creates some static config files and empty log files for fluent-bit safety.

v2.27.2 (2023-01-25)

Cherrypicks

  • RE: added an experimental flag, --experimental_always_retry_missing_worker_failures. When --experimental_always_retry_missing_worker_failures is enabled, the scheduler will always retry on UNAVAILABLE errors from workers.

v2.27.1 (2023-01-25)

Cherrypicks

  • RE: add wait time before shutting down gRPC server.

v2.27.0 (2023-01-18)

Incompatible

  • --incompatible_track_availability_zone is now a no-op.

Added

  • This change log is now shown in the UI under /restatus.
  • Our public docs now describe how to engage with Customer Support.
  • The RE service can now emit single-line JSON logs, see --log_file.

Changed

  • With --client_auth=github_token, the principal name is now stable: github_token.

v2.26.0 (2023-01-12)

Changed

  • Enable S3 structured paths by default
  • Setting --mtls_expiration=0d is now allowed, and it disables downloading mTLS certificates from the UI.
  • Goma server now rotates log files to avoid filling the filesystem. New logging flags:

    • --logging_rotate_at_mb - Rotate a log file when it reaches this size in MB (default 10)
    • --logging_rotate_count - Number of rotated log files to keep (default 1000)
    • --logging_days_to_keep_rotated_files - Maximum days to keep rotated log files for (default 28)
    • --logging_compress_rotated_files - Compress rotated log files? (default true)
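
As a rough sizing aid, the defaults above imply an upper bound on the disk space taken by rotated Goma logs. The sketch assumes each rotated file is at most the rotation size and ignores both compression and the 28-day limit, so real usage should be lower; the function name is hypothetical.

```python
ROTATE_AT_MB = 10    # --logging_rotate_at_mb default
ROTATE_COUNT = 1000  # --logging_rotate_count default


def max_rotated_log_mb(rotate_at_mb: int, rotate_count: int) -> int:
    """Upper bound (in MB) on disk used by rotated files, before compression."""
    return rotate_at_mb * rotate_count


bound = max_rotated_log_mb(ROTATE_AT_MB, ROTATE_COUNT)
print(bound)
```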

v2.25.0 (2023-01-03)

Changed

  • Update third_party dependencies.
  • Add probers to release.

v2.24.0 (2022-12-27)

Changed

  • Update timezone selectors to use UI kit dropdown.
  • UI client TLS certs: allow --mtls_expiration=0d.
  • Add support for reading and storing protos (secretstore).

Fixed

  • UI: Fix the datetime picker.

v2.23.4 (2022-12-13)

Cherrypicks

  • auth: use --tls_trusted_certificate and --tls_trusted_key for signing and verifying JWTs.

v2.23.3 (2022-12-12)

Cherrypicks

  • Added an experimental flag, --experimental_docker_max_image_size_in_cas. When --experimental_docker_store_images_in_cas is enabled, workers cache docker container images in the CAS. The new flag sets a size limit on container image files stored in the CAS. It defaults to 10gib, the previously hard-coded limit.

v2.23.2 (2022-12-09)

Internal release. No publicly facing changes.

v2.23.1 (2022-12-09)

Internal release. No publicly facing changes.

v2.23.0 (2022-12-08)

Changed

  • Docs: Launched the new https://docs.engflow.com with search functionality.
  • Enabled --experimental_filter_known_replicas by default, causing the scheduler to confirm whether a CAS node holding a replica is alive before attempting a read.
  • Changed --internal_tcp_connect_timeout default value from 30s to 5s. This controls cluster-internal gRPC connections. Connection attempts to dead nodes will fail faster.
  • Enabled --warm_containers by default, causing workers to pull active cluster Docker containers before accepting actions.

Fixed

  • MacOS: worker no longer exits when --allow_docker=true; the flag is now ignored.
  • Added more Docker platform options to the affinity key to aid the scheduler in executor selection.

v2.22.1 (2022-11-24)

Changed

  • AWS: Install latest SSM agent in all MacOS machine images.

v2.22.0 (2022-11-23)

Changed

  • Added a Docker credential helper to the AMI and .deb package, which fetches the username/password from AWS Secrets Manager. It's called docker-credential-engflow-aws-secretsmanager.

Fixed

  • InputCas, stats: fix distributedCasLongestDownload.

v2.21.0 (2022-11-14)

Changed

  • Add flag to respect explicitly defined pools for --experimental_force_mnemonic_pool_name.

Fixed

  • Bug causing wrong Action Cache misses due to special-casing the empty blob.

v2.19.3 (2022-11-03)

Cherrypicks

  • Fix Action Cache misses due to special-casing the empty digest.

v2.19.2 (2022-11-02)

Cherrypicks

  • Sharing new AMIs.

v2.19.1 (2022-11-01)

Cherrypicks

  • Fix memory leak in notification queue service.

v2.19.0 (2022-10-24)

Changed

  • Goma: improve Goma cluster logging.
  • RE: update default --max_batch_size from 4mb to 10mb.
  • RE: update default --default_replica_timeout from 24hr to 1hr.
  • RE: update default --cas_existence_cache_expiry from 24hr to 0s.
  • RE: update default --local_cas_existence_cache_expiry from 120s to 30min.
  • RE: update default --hazelcast_aws_use_client_lib from false to true.

Fixed

  • UI: links render correctly in the CHANGELOG.md display of the Cluster Status page.
  • Goma: the grpc_keepalive_time and grpc_keepalive_timeout are now respected properly.
  • RE: correctly warm Docker containers on the default pool.
  • RE: Disconnected nodes will continue to try to reconnect to the cluster indefinitely.

Deprecated

  • The --client_auth=gcp_email, --client_auth=basic, and --http_auth=gcp_email options have been deprecated; they are not used by any clusters.
  • The --external_storage_gc_enable_deletion flag has been deprecated.

v2.18.1 (2022-10-19)

Cherrypicks

  • Goma: set gRPC keepalive time.

v2.18.0 (2022-10-12)

Deprecated

  • The --aws_cloudformation_stack_name and --aws_cloudformation_stack_resource flags are deprecated; the corresponding values are automatically read from instance metadata.

Added

  • Docs: Documentation around --client_auth=github_token.
  • Performance: --warm_containers=true can be used to automatically pull active Docker images onto new workers before they accept any actions.
  • Goma: new logging flags

    • --logging_output_encoding can now be supplied. json will emit single-line JSON, for consumption downstream and an enhanced querying UX.
    • --logging-timestamp-format can now be supplied. unix-utc will output millisecond-precision UNIX timestamps, for simpler datetime parsing downstream and an enhanced querying UX.

Changed

  • UI: the target tree view defaults to only showing non-successful targets.
  • Goma: more readable service logs - lines no longer have Java and systemd journal prefixes.

Fixed

  • RE: Fixed bug in used file tracking that caused workers to run out of disk space.

v2.17.1 (2022-10-11)

Cherrypicks

  • AWS: The x86_64 Debian AMIs now come with pre-downloaded software for supporting instance types with NVIDIA GPUs.

v2.17.0 (2022-09-30)

Changed

  • Performance: --bytestream_read_chunk_size is now 1 MiB by default; this can significantly improve machine-to-machine copy performance.
  • Removed metrics com.engflow.storage.ops/in_flight and com.engflow.storage.ops/stream_in_flight.

Added

  • UI: added button to download a summary of the invocation as markdown; this is intended for integration with other systems like a bug tracker or helpdesk.
  • Added thread pool metrics.
  • UI: Add tooltips to analytics summary to clarify values and improve display while loading data.
  • UI: Show release notes on the cluster status page.

Fixed

  • Goma: fix metrics exporting to CloudWatch.
  • Fix a rare hang when replicating a file to another machine.
  • Fix error handling when the GCS service is temporarily unavailable.

v2.16.4 (2022-09-28)

Cherrypicks

  • Fix high memory consumption when repeatedly reloading an invocation page.
  • Fix cas metrics reporting in the schedulers.
  • Fix heap dump helper tool.

v2.16.3 (2022-09-21)

Cherrypicks

  • Fix login loop when authorization header is sent.

v2.16.2 (2022-09-20)

Cherrypicks

  • Fix cookie parsing with HTTP/2 when multiple cookies are sent.

v2.16.1 (2022-09-20)

Cherrypicks

  • Fix login loop when multiple cookies are set for the UI domain.

v2.16.0 (2022-09-16)

Changed

  • Metrics: All metrics are reported to CloudWatch by default (if enabled).
  • Metrics: High-cardinality metrics for actions are disabled.
  • Metrics: BEP metrics are now in the com.engflow.bep namespace.

Added

  • Worker AMIs for Arm64 MacOS.
  • Support passing through arbitrary flags to docker run.

Fixed

  • Cleanup for --docker_clean_tmp runs as root.
  • Reduced CPU and memory consumption for very large Bazel event streams.
  • Fixed heap dump utility temporary directory creation.
  • Improve analytics summary consistency (build counts).
  • Fix race condition in handling of file uploads.
  • Fix potential hangs in multiple streaming calls.
  • Fix mTLS authentication fallback handling.
  • Fix existing scheduler metric to always be reported (previously was not reported if the cluster had no worker pools).

v2.15.2 (2022-09-15)

Cherrypicks

  • Goma: fix bug uploading chunks to CAS.

v2.15.1 (2022-09-14)

Cherrypicks

  • Goma: add latency metrics for inbound HTTP RPC requests.
  • Goma: ensure gomaOutput.toFileBlob returns after cancelation.
  • Goma: recache only upload if missing.
  • Scheduler: handle mTLS authentication fallback correctly.

v2.15.0 (2022-08-25)

Added

  • Added a documentation page about the Invocation Search page.
  • Extended Goma documentation.
  • Add new permission for viewing the UI, so that you no longer need admin access.
  • Monitoring: Added threadpool metrics.
  • UI Authentication: Support logging in with Okta.
  • UI Authentication: Web UI login for basic authentication.
  • UI Authentication: Support using multiple authentication methods together.
  • UI: Allow invocation page tabs to be opened in a separate window.

Fixed

  • Scheduler: Do not serve requests until we've had a chance to discover workers.
  • Fix Test XML file parsing when certain attributes are not present.
  • Avoid network loss during service shutdown.
  • Fix downloading compressed files.
  • UI: Fix messaging around Bazel profile availability while an invocation is still running.
  • UI: Fetch the console log only once.
  • UI: Fix auto-scroll functionality in console log.
  • UI: Ensure the correct test log is shown when switching targets.

v2.14.8 (2022-08-24)

Cherrypicks

  • Goma: Removed large debug logs.
  • Goma: Added log buffering and suppressed logs below "Info" level.
  • Goma: Added SIGUSR2 handler to collect CPU profiling data.

v2.14.7 (2022-08-23)

Cherrypicks

  • Fix an UNKNOWN error from bytestream Write calls.

v2.14.6 (2022-08-23)

Cherrypicks

  • Logging: Remove extraneous logging for some RPC calls.
  • Logging: Log time spent waiting for the client to send write data.
  • Bytestream: Use a 1 MB buffer for writes.
  • Monitoring: Add thread pools latency metrics.
  • Monitoring: More read/write storage metrics.

v2.14.5 (2022-08-20)

Cherrypicks

  • Goma: Implement digest cache in recache package.
  • Goma: Report more metrics for execution and RPC.
  • RemoteActionExecutor: fix action execution hangs.
  • ExecutionUnit: log target id.
  • Logging: log some RPC calls.

v2.14.4 (2022-08-19)

Cherrypicks

  • Fix (rare) NPE in replica selection.
  • Add metrics for distributed CAS fetches.

v2.14.3 (2022-08-12)

Cherrypicks

  • Guard process wrapper execution statistics behind a flag.
  • Retry recovery when losing a CAS node.
  • Fix temp directory handling for dockerized actions.

v2.14.2 (2022-08-10)

Cherrypicks

  • Fix storage metric being off by a factor of 1000.

v2.14.1 (2022-08-10)

Cherrypicks

  • Fix S3 metrics exporting that caused an excessive number of exported metrics and associated CloudWatch costs.

v2.14.0 (2022-08-10)

Added

  • Network: allow setting a TCP connect timeout via --internal_tcp_connect_timeout.
  • Metrics: add download call stats via com.engflow.re.cas/fetch_call_time.
  • UI: Analytics page - if scatter charts cannot be shown, add a button that narrows the search so they can be rendered
  • UI: Invocation search - allow filtering invocations by principals requester and runner

Fixed

  • UI: various styling fixes to improve consistency.
  • UI: Fix async bug that caused filters to not be applied on reset.
  • Metrics: report storage metrics correctly (com.engflow.storage.*/*).
  • Metrics: fix docs to indicate that distribution metrics are reported to CloudWatch.
  • S3: improve handling of "rate limited" errors.

v2.13.0 (2022-08-02)

Incompatible

  • --incompatible_keep_relative_argv0 is now a no-op.
  • --experimental_grpc_web is now a no-op.
  • Enable --experimental_inmemory_digests by default.
  • Enable --incompatible_strict_digest_verification by default.

Added

  • Partial graceful shutdown support for schedulers.
  • Support RE-API cache compression and gRPC-level compression simultaneously.
  • UI: add BES keywords to search index and allow filtering invocations by them (provided invocation indexing is enabled)
  • UI: improved rendering of large logs, including lazy loading
  • UI: add a badge to mark experimental features
  • UI: accessibility improvements

Fixed

  • Support BES streams with a high data rate.
  • Allow viewing an invocation's console output in fullscreen mode.

v2.12.1 (2022-07-25)

Added

  • Support proxying external storage reads through workers under --workers_handle_fallback_requests.

Fixed

  • Fix startup crashes in certain AWS configurations.

v2.12.0 (2022-07-21)

Added

  • Make the chunk size of ByteStream/Read responses configurable with --bytestream_read_chunk_size.
  • Add metric for response time of external authentication.
  • Add --experimental_force_module_cache_path_for_mnemonics flag for improving Objective-C builds.
  • Support changing the initial gRPC flow control window with --grpc_initial_flow_control_window.

Fixed

  • Make sure to use optimized TLS implementation when available.
  • Fix hangs when an error happens early during process startup.
  • Fix spurious cancellation of RPCs that check blob existence.
  • Fix spurious NOT_FOUND errors from ByteStream/Read.
  • Better handling of backend errors in the UI.

v2.11.2 (2022-07-19)

Cherrypicks

  • Fix a race condition while checking the CAS cache.
  • Fix possibly erroneous cancellation of futures from the cache.

v2.11.1 (2022-07-13)

Fixed

  • UI: Fix broken Google login page.

v2.11.0 (2022-07-12)

Added

  • UI: Show license info on cluster status page.
  • UI: Show error details when invocation page fails to load.

Fixed

  • UI: Fix bug where large test suites would not load.
  • UI: Fix bug where invocations would sometimes not load or hang.

v2.10.0 (2022-07-07)

Incompatible

  • The --incompatible_track_availability_zone flag has been flipped, which makes this release incompatible with v2.3.0 and earlier. Please upgrade to v2.9.0 before upgrading to this release if you're still running an older version.

Added

  • UI: Allow downloading mTLS client certificates from the UI. The CA is configured server-side using --tls_trusted_key and --tls_trusted_certificate.

Fixed

  • UI: Frontend would not show invocation or invocation search pages and instead presented an error.
  • UI: Elements in the UI would overlap each other in unexpected ways.
  • UI: Ensure fetching the log is not aborted prematurely.
  • UI: More clearly surface how many tests failed.
  • UI: Fix cluster status page hanging on load.

v2.9.0 (2022-06-28)

Incompatible

  • macOS: Workers no longer wait until at least one Xcode is available before accepting work.
  • The --advertised_port_offset flag is now a no-op.

Deprecated

  • macOS: Discovering available Xcode versions no longer relies on xcode-locator subprocess.

Added

  • auth: Add support for embedding permissions directly into mTLS client certificate.

Fixed

  • UI: Order test suites and test cases by status, listing failures first.
  • UI: Better reporting of aborted invocations.
  • UI: Ensure full-screen views are always scrollable.
  • UI: Expose if number of test cases does not match reported number of tests.
  • goma: Reduce impact of rate limiter on action execution.

v2.8.0 (2022-06-01)

Incompatible

  • The --incompatible_ignore_legacy_node_properties flag is now a no-op.

Added

  • Goma: Add metrics for how long requests are delayed.
  • Goma: Add metrics for client errors.
  • RE: Add digest of primary output to server-side profile.
  • Add support for TLS 1.3

Fixed

  • Goma: Enable RPC request rate limiter by default.

v2.7.2 (2022-05-24)

Cherrypicks

  • Fix INTERNAL error on certain remote persistent worker error conditions.

v2.7.1 (2022-05-23)

Cherrypicks

  • Docker containers were not getting reused. This was causing a performance hit.

v2.7.0 (2022-05-17)

Added

  • Monitoring: the new com.engflow.instance/gc_avg_duration metric shows the average duration spent in Java garbage collection since the last reported metric.

Fixed

  • Results UI: fixed an issue where some server-side profiles failed to load, returning UNKNOWN and HTTP 500.
  • Results UI: In the build status bar, cached builds were not included in the completed builds, and were categorized as "to build". This is now fixed.
  • macOS: Improved error message for actions failing because of too long command lines.
  • Results UI: fixed login redirection vulnerability that could lure the victim to the attacker's page.
  • AWS, MacOS AMI: install service that reaps the symbols cache to avoid filling up the disk.

v2.6.3 (2022-05-05)

Fixed

  • Fix loading some EngFlow profiles.

v2.6.2 (2022-05-03)

Fixed

  • UI: The login page and other elements were misaligned.

v2.6.1 (2022-05-03)

Fixed

  • UI: Navigating between pages sometimes crashed the frontend.

v2.6.0 (2022-05-03)

This release changes the database schema. In clusters that have it enabled, this results in an empty database after the upgrade.

Incompatible

  • The --incompatible_named_default_pool flag is now a no-op.
  • The --docker_use_image_id flag is now a no-op.

Added

  • Goma: added ability to limit concurrent connections. This should help avoid OOMs when clients upload a lot of input files.

Changed

  • Eventstore options are no longer experimental.
  • Improved error messages when server TLS certificates are not in the expected format.
  • Moved user settings to the side bar.
  • Improved styling of the cluster status and licenses pages.

Fixed

  • UI: Don't show workspace status chips (repo, branch, commit) on the invocation page if they are empty.
  • GCS: Correctly propagate errors when reading / writing events.
  • Fix rare case of schedulers losing track of workers when an incoming execute request is cancelled (previously, the worker was recovered after a timeout).
  • The service now retries execution requests internally in some cases to reduce the likelihood of build failures in clusters with very large worker machines and auto-scaling.
  • Fix popups to display outside their parents.
  • Fix display of the licenses page.
  • Fix display of failed tests on the invocation overview page.
  • Fix handling of timeouts for remote persistent workers - such actions were incorrectly always retried (independent of the retry policy).
  • Fix handling of test.xml reports that only specify a 'status' attribute.

v2.5.2 (2022-04-28)

Cherrypicks

  • Fix NPE when proxy-replaying an event stream from another scheduler (part 2).

v2.5.1 (2022-04-26)

Cherrypicks

  • Fix NPE when proxy-replaying an event stream from another scheduler.
  • Fix multicast discovery.

v2.5.0 (2022-04-13)

Added

  • Record time to create output tree in server-side profiles.

Changed

  • UI: Make Invocation page navigation horizontal

Fixed

  • macOS: Fix chronyd installation.
  • UI: Fix bug causing page to sometimes hang or not load.

v2.4.1 (2022-04-05)

Cherrypicks

  • MacOS AMIs: Fix chronyd installation by preventing brew from running as root.
  • UI: Fix potential issue with S3 data writing.

v2.4.0 (2022-04-05)

Added

  • Goma support for pushing metrics to Stackdriver.
  • Goma logs are uploaded to CloudWatch.

Changed

  • Increase timeout for EngFlow Free to come online.

Fixed

  • Fix: Reduced redundant logging when invocation streams cannot be found.
  • Fix: Don't trigger alarms when the test.xml output file for a test is not uploaded from Bazel to the BES.
  • AWS: CloudWatch metric reporting is more fault tolerant.
  • UI: Include more information in build UI stack traces included in the JavaScript console.
  • UI: Users can now change their timezone in Chrome.
  • UI: Improve various error messages in the UI to include more detail.

v2.3.4 (2022-03-29)

Cherrypicks

  • goma: add --exec-timeout to control execution timeout.
  • macOS: fix chrony installation script.

v2.3.3 (2022-03-28)

Cherrypicks

  • goma: add flags to control concurrency for storing action inputs and outputs.

v2.3.2 (2022-03-28)

Cherrypicks

  • Fix stack overflow on large invocations.
  • Correctly propagate max gRPC message size flags to the internal gRPC service.
  • Fix failure to report disk metrics.

v2.3.1 (2022-03-23)

Cherrypicks

  • Fix disk usage metrics not being reported and producing log spam.

v2.3.0 (2022-03-16)

Deprecated

  • Config: --enable_target_tree is now a no-op and will be removed in a future release. The flag is set to true by default.

Fixed

  • UI: Fix infinite loop on some prefix filters.
  • UI: clarify icon indicating that items can be opened.
  • GCP: Improve error reporting for "Partial findMissingBlobs failure".
  • Fix error when loading an invocation page corresponding to a server-side profile.
  • Fix "Async Stream not found" error to avoid automatically assuming this is a severe error.
  • Fix Docker service shutting down before worker service.

Added

  • GCS: Added com.engflow.storage.read/time_to_first_byte and com.engflow.storage.read/time_per_gb metrics.
  • UI: Improve error message when test.xml file is empty.
  • UI: Added analysis failure status for targets.
  • UI: Add a button to download test.xml.
  • UI: Improve UX around downloading test assets.
  • UI: Highlight matching items when searching in the target view.
  • Config: Turn --experimental_google_client_id into --google_client_id.

v2.2.1 (2022-03-08)

Cherrypicks

  • Fix NPE in EventStoreProfileGenerator
  • Permit more frequent keepalive times (10s)
  • Fix race condition in external storage GC causing INTERNAL error
  • Fix uncaught exception in UI when opening an invocation stream for profiling events
  • Install chronyd on macOS images

v2.2.0 (2022-03-03)

Incompatible

  • config: aws and gcp are no longer valid options for --external_storage. Use s3 or gcs instead.
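
For configuration files that still use the removed values, the change above amounts to a value substitution. The following is a minimal sketch, assuming options are passed as --external_storage=<value> strings. The helper name migrate_external_storage and the aws-to-s3 / gcp-to-gcs pairing are assumptions made for illustration; they are not part of the EngFlow tooling.

```python
# Hypothetical helper, not an EngFlow tool.
# Assumption: "aws" maps to "s3" and "gcp" maps to "gcs".
RENAMED_VALUES = {"aws": "s3", "gcp": "gcs"}
FLAG = "--external_storage="


def migrate_external_storage(args):
    """Return a copy of args with removed --external_storage values replaced."""
    migrated = []
    for arg in args:
        if arg.startswith(FLAG):
            value = arg[len(FLAG):]
            # Values that are already valid (or unknown) pass through unchanged.
            arg = FLAG + RENAMED_VALUES.get(value, value)
        migrated.append(arg)
    return migrated
```

All other arguments are returned untouched, so the helper can be run over a full option list.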

Deprecated

  • config: --experimental_actions_execution_attempts is now a no-op and will be removed in a future release.
  • config: --experimental_gcs_direct_upload, --incompatible_reduce_memory_use, --enable_status_page, and --experimental_per_executor_dirs are now no-ops and will be removed in a future release.

Fixed

  • UI: Allow downloading (partial) EngFlow profiles while builds are still running.
  • UI: Remove links to Bazel command-line options for non-release versions.

Added

  • UI: Warn users if their upload strategy prevents Bazel profiles from being uploaded.
  • Add --incompatible_track_availability_zone, which changes the serialization format for one of the types we share between machines. This can be safely deployed while the cluster is running as long as all nodes are running at least v2.2.0. Do not enable when there are nodes that run an earlier version, or at the same time as upgrading to v2.2.0 (or later). We plan to flip this flag in v2.10.0.
  • docker: Record container start time in EngFlow profile.
  • macOS: Add /var/folders/wp to the allowlist for sandboxed actions.
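
The rollout constraint for --incompatible_track_availability_zone described above (every node must already run v2.2.0 or later) can be expressed as a pre-flight check. This is only a sketch: how you obtain the list of node versions is deployment-specific, and parse_version and safe_to_enable are hypothetical names.

```python
def parse_version(text):
    """Parse 'v2.2.0' or '2.2.0' into a tuple of ints that compares correctly."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


MIN_VERSION = parse_version("v2.2.0")


def safe_to_enable(node_versions):
    """True only if every listed node already runs v2.2.0 or later."""
    return all(parse_version(v) >= MIN_VERSION for v in node_versions)
```

Note that an empty list yields True, so callers should make sure the version list is complete before relying on the result.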

Removed

  • metrics: com.engflow.re.cas/total_size, com.engflow.re.cas/total_replica_size, com.engflow.re.exec/running_actions, com.engflow.re.exec.docker/containers_created, com.engflow.re.exec.docker/container_creation_failed, and com.engflow.re.exec.docker/containers_destroyed are no longer reported.

v2.1.1 (2022-02-22)

Fixed

  • Make deletion of action execroots faster.

v2.1.0 (2022-02-18)

Changed

  • The Build Event Service is enabled by default (disable with --enable_bes=false).
  • The event store options are no longer experimental. Note that the on-disk location should now be controlled with the separate flag --event_disk_path rather than reusing --event_blobs_root for this purpose.
  • The unnamed worker pool is now called default (enable --incompatible_named_default_pool by default).
  • Improve the Bazel first-time setup instructions.

Fixed

  • Fix server-side profiles to correctly show all action attempts.
  • UI: fix parsing of GitHub URLs.
  • UI: fix icon titles.
  • UI: fix page-up/page-down keys in the console.
  • MacOS: fix repeated warnings about /proc/meminfo.
  • Correctly return NOT_FOUND instead of INTERNAL when an invocation could not be found.
  • Avoid action failures when uploading a file to secondary storage returns an error.

v2.0.0 (2022-01-13)

This release requires a full cluster shutdown and restart. Due to changes of the default settings for a number of incompatible flags, pre-2.0.0 instances may return errors when communicating with instances running 2.0.0 or later and vice versa.

Otherwise, this release is intentionally small to reduce the upgrade risk. In particular, we did not remove deprecated flags and metrics in 2.0.0 (except as noted below); they will be removed in a later release.

Added

  • Support TLS 1.3 for the UI and gRPC APIs.
  • Automatic Garbage Collection for External Storage.
  • Added an inline Profile Viewer.
  • Print warnings when using deprecated command-line flags.
  • UI: the invocation page sidebar can be navigated by keyboard.
  • UI: show 'View Logs' button for test logs.

Fixed

  • Fix permission denied when deleting an exec tree with unexpected mod bits.
  • UI: Some non-existent pages returned 404 (not found) for unauthenticated users; they now return 403 (unauthenticated). This was a potential information leak (benign).
  • UI: correctly show NOT_FOUND for missing nodes.
  • UI: console correctly uses the full height when maximized.
  • UI: fix color for the timezone selector.
  • UI: prevent horizontal overflow in the test view.
  • UI: improve performance of loading large console logs.
  • API: correctly return NOT_FOUND for calls to the results store.

Removed

  • Remove all s3-specific metrics com.engflow.re.storage.s3/*.
  • Remove all gcs-specific metrics com.engflow.re.storage.gcs/*.

Incompatible

  • Flipped --incompatible_reduce_memory_use.

v1.58.9 (2022-02-08)

Cherrypicks

  • Trigger new release due to flaky errors.

v1.58.8 (2022-02-07)

Cherrypicks

  • goma: consider include path when deriving common input/output prefix.

v1.58.7 (2022-02-03)

Cherrypicks

  • Log end-to-end build times from BES.

v1.58.6 (2022-01-28)

Cherrypicks

  • Work around OpenJDK 11.0.14 bug: ignore Host header in http2 requests.
  • ResultStore/GetTarget: respond with NOT_FOUND instead of INTERNAL.

v1.58.5 (2022-01-27)

Cherrypicks

  • Install Xcode 13.2.1 and cmd-line 13.2 on macOS

v1.58.4 (2022-01-07)

Cherrypicks

  • Fix compilation errors due to bad cherrypicks

v1.58.3 (2022-01-06)

Cherrypicks

  • UI: don't fail with INTERNAL error when target tree node is not found
  • UI: don't fail with Chunk too large
  • Profiler: Record retry attempts during input fetching for better profiling
  • Fix com.engflow.re.storage.existence_cache/* stats

v1.58.2 (2021-12-21)

Fixed

  • Fix profile download links.

v1.58.1 (2021-12-20)

Fixed

  • Fix incorrect timezone list.

v1.58.0 (2021-12-17)

We are preparing for a 2.0.0 release in early 2022. To reduce the amount of changes going into that, we have proactively flipped a few flags that were intended for 2.0.0 and that do not require a full cluster restart. We have already enabled these flags on all managed clusters without any issues.

Incompatible

  • Enable --incompatible_remove_symlink_execroot_strategy by default; this removes the symlink exec root strategy, which was never used in production due to being incompatible with dynamically linked binaries.
  • Enable --incompatible_keep_relative_argv0 by default; this fixes the lookup of commands which use a relative command line to be consistent with POSIX shell lookup (including PATH) and is required by all remote execution clients that we are aware of being used in production.
  • Enable --incompatible_no_storage_backend_metrics by default; this removes a few deprecated metrics related to storage.

New

  • Server-side profiles now contain per-action input tree stats.
  • The --incompatible_named_default_pool flag changes the meaning of the Pool platform option, and allows selecting the default (unnamed) execution pool.
  • Add a dockerUseEntrypoint boolean platform option to disable use of the docker image entrypoint on a per-action basis.
  • Add --incompatible_strict_digest_verification to enable strict validation of digests across all API calls, superseding --incompatible_batch_read_blobs_verifies_digests; both will be enabled by default and removed after the 2.0.0 release.
  • UI: support resizing the tree view.
  • UI: show SCM status (if received from the workspace status command).
  • UI: add a timezone selector to the settings.
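
To illustrate what digest validation means for --incompatible_strict_digest_verification above: in the remote execution protocol a blob is identified by a hash plus a size, and a strict check confirms both. The sketch below assumes SHA-256 as the digest function; which checks the service actually performs, and on which calls, is not specified in these notes.

```python
import hashlib


def verify_digest(data: bytes, expected_hash: str, expected_size: int) -> bool:
    """Check that a blob matches both the size and the hash of its claimed digest."""
    if len(data) != expected_size:
        return False  # cheap check first: size mismatch needs no hashing
    return hashlib.sha256(data).hexdigest() == expected_hash.lower()
```

A client that uploads a blob under a digest it did not compute from the actual bytes would fail such a check.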

Changed

  • Increase the default gRPC max message size to 20 MiB to reduce issues with uploading large build events.
  • HTTP cookies now use SameSite=Lax to avoid requiring login every time a user follows an external link into the UI, e.g., from CI.
  • Reduce worker service memory footprint.
  • UI: improve consistency and usability.
  • UI: improve the ordering of targets in the overview tab.
  • UI: show local / remote build status icon.
  • UI: use ISO 8601 dates and 24-hour format by default.
  • UI: update target status icons for improved consistency and readability.
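
For readers unfamiliar with the SameSite=Lax cookie attribute mentioned above, this sketch shows what such a cookie looks like when built with Python's standard library (Python 3.8 or later). The cookie name and value are made up, and the HttpOnly and Secure attributes are added only for illustration; the cookies EngFlow actually sets may differ.

```python
from http.cookies import SimpleCookie

cookie = SimpleCookie()
cookie["session"] = "abc"                 # hypothetical cookie name and value
cookie["session"]["samesite"] = "Lax"     # sent on top-level navigation from other sites
cookie["session"]["httponly"] = True      # illustrative only
cookie["session"]["secure"] = True        # illustrative only

# Value of the Set-Cookie header for this cookie.
header = cookie["session"].OutputString()
```

With Lax, the browser still sends the cookie when a user clicks a link from another site (such as a CI page), whereas Strict would withhold it and force a new login.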

Fixed

  • Fix the affinity-based scheduler to take the absolute input root into account if set; this reduces docker container restarts and improves build performance.
  • GCP: images now use gcr instead of gcloud to authenticate docker operations with GCR, which is more reliable.
  • UI: fix linebreaks in the displayed build command line.
  • UI: fix issue where in-progress builds don't render correctly.
  • UI: deduplicate target configuration information.
  • UI: fix critical path display to use timestamp order (don't sort by length).

v1.57.4 (2021-12-15)

Cherrypicks

  • Fix performance regression in CAS downloads. This reverts a bugfix of --log_level, so the finest supported log level is again INFO.

v1.57.3 (2021-12-09)

Cherrypicks

  • Goma: avoid data corruption by resetting buffer upon download retrial

v1.57.2 (2021-12-08)

Cherrypicks

  • Server-side profile: add input_tree_stats to action details

v1.57.1 (2021-12-01)

Cherrypicks

  • Fix release pipeline

v1.57.0 (2021-11-30)

Incompatible

  • Boolean-type flags now enforce their value to be true or false. Previously any value other than the literal true was parsed as false; from now on this is an error.
  • Enable --split_cluster_name by default. If you're currently not setting this flag, make sure schedulers have a tag named engflow_re_scheduler_name with the same value as engflow_re_cluster_name.
  • --experimental_cas_check_storage_only is now a no-op.
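
The change to boolean flag parsing above can be modeled as follows. This is a sketch of the described behavior, not the actual option parser; in particular, treating the comparison as case-sensitive is an assumption.

```python
def parse_bool_lenient(value: str) -> bool:
    """Old behavior as described: anything other than the literal 'true' is False."""
    return value == "true"


def parse_bool_strict(value: str) -> bool:
    """New behavior as described: anything but 'true' or 'false' is an error."""
    if value == "true":
        return True
    if value == "false":
        return False
    raise ValueError(f"expected 'true' or 'false', got {value!r}")
```

A value such as yes or 1 silently disabled a flag under the old rule; under the new rule it is rejected, so it is worth auditing existing configs before upgrading.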

New

  • With --http_public_port, you can set a different port for HTTP requests than for gRPC (--public_port).
  • Free tier now supports the EventStore API.
  • Free tier now supports server-side profiles.
  • UI: Enabled target tree by default.
  • UI: Allow users to expand information-dense cards to full-screen.
  • UI: Follow the end of the console while loading.
  • UI: Allow users to filter the target tree by prefix or status.
  • UI: Added Overview tab to help quickly identify build issues.

Changed

  • Workers now create one one-core executor per available CPU core instead of just one one-core executor.
  • Improved compatibility of server-side profiles with Perfetto UI.

Fixed

  • Removed limit of concurrent connections from free tier.
  • Improved parsing of critical path in Build and Test UI.

Deprecated

  • --split_cluster_name is deprecated and will be removed in the next release.

Security

  • Removed debugging HTTP endpoints from Goma.
  • Restrict frame-ancestors from Content-Security-Policy.

v1.56.1 (2021-11-10)

Cherrypicks

  • AWS/GCP images: revert back to Debian 10

v1.56.0 (2021-11-08)

Changed

  • AWS/GCP images now use Debian 11
  • Logging: FindMissingBlobs is now less chatty
  • Docker container reuse is enabled by default; use --docker_allow_reuse=false to opt out the entire cluster, or set dockerReuse=False for all actions (or builds) that need to opt out. If you want to opt out the entire cluster, we recommend setting --docker_allow_reuse=false before you upgrade. This change also switches all actions to separate docker run and docker exec invocations. If that causes problems, you can temporarily opt out the entire cluster by setting --docker_split_exec_run=false. Note that we plan to deprecate that option; please let us know if you do set this flag.

Fixed

  • --worker_config now correctly handles configurations with more than 2GB ram per executor

Security

  • The HTTP UI now returns various security-related HTTP headers like Content-Security-Policy and X-Frame-Options by default, to prevent a number of attack scenarios (see --strict_http_headers)
  • Action inputs can now be absolute symlinks

v1.55.0 (2021-10-28)

Added

  • Docker pull times to EngFlow profile
  • --grpc_max_message_size flag to control gRPC max message size
  • Log messages regarding lost or corrupted CAS files

Changed

  • All target tree flags are now only controlled by --enable_target_tree
  • --enable_status_page is now true by default
  • --principal_based_permissions now defaults to [] to restrict data access by default

Fixed

  • Race condition that would cause the PublishBuildToolEventStream gRPC call to fail
  • Linux ficlone call for creating action inputs
  • Basic auth using web-browsers

Removed

  • Experimental flags related to the target tree

Security

  • Reduced surface area for phishing attacks

v1.54.1 (2021-10-19)

Cherrypicks

  • Fixed missing working directory for the cached docker strategy

v1.54.0 (2021-10-18)

Added

  • Exposed EventStore endpoint over gRPC
  • Tests parsed from test.xml are shown hierarchically

Changed

  • Server-side profiling is now always enabled when the BES is enabled; disable with --profile_to_event_store=false

Fixed

  • Fix CAS capacity accounting during recovery
  • Wait for CAS metadata writes after file upload; fixes file missing errors when no external storage is configured and build-without-the-bytes is enabled

Security

  • Require explicit principal permissions to be set when accessing HTTP endpoints
  • Patched various low-risk vulnerabilities

v1.53.1 (2021-10-15)

Cherrypicks

  • Don't run actions twice with the "cached docker" strategy

v1.53.0 (2021-10-07)

Added

  • Add memory usage and garbage collection metrics
  • UI: Add profile picture and user menu

Fixes

  • GCP, AWS: fix logging issues causing stuck instances
  • Fixed a bug where some metrics were not reported
  • Fixed crash on start up when --cas_path is undefined
  • UI: Fix broken alert bar
  • Improved error handling during rapid cluster size changes

Changes

  • Report metrics with granular counts of incoming actions
  • When --external_storage is enabled then --experimental_opportunistic_cas is now regarded as true. Previously you had to explicitly enable the flag. We no longer recommend setting --experimental_opportunistic_cas at all, because when --external_storage is disabled then it's safer to use --experimental_force_lru instead
  • client_auth=gcp_rbe: Clients with "remotebuildexecution.blobs.create" permission can now also upload Build Event Streams. Previously such requests failed because (as of 2021-09-29) GCP has no permissions to control Build Event Stream uploads
  • Docker: Respect memory limit provided by --worker_config
  • Enable --experimental_per_executor_dirs by default
  • Increase default --max_batch_size to 4mb

Removed

  • Remove --experimental_profile_dir in favor of --experimental_profile_to_event_store.

v1.52.6 (2021-10-05)

Cherrypicks

  • Minor bugfixes

v1.52.5 (2021-10-05)

Cherrypicks

  • Prevent GCP logging problems from breaking the scheduler process.

v1.52.4 (2021-10-01)

Cherrypicks

  • Fixed bug propagation caused by Hazelcast errors

v1.52.3 (2021-09-29)

Cherrypicks

  • Minor bugfixes

v1.52.2 (2021-09-28)

Cherrypicks

  • Fixed a bug where long log messages would cause schedulers to hang
  • Reduce log spam
  • Fixed a bug where some metrics would not be reported

v1.52.1 (2021-09-22)

Cherrypicks

  • Fixed crash on start up when --cas_path is undefined

v1.52.0 (2021-09-20)

Changes

  • Add invocation IDs to logging and errors

Fixes

  • Various concurrency bugfixes for multi-scheduler clusters
  • Improve logging around failed gRPC calls
  • Reduce log spam for the gRPC NOT_FOUND response code
  • Improve macOS worker support
  • Fixed EngFlow internal profiling when running multiple schedulers

v1.51.2 (2021-09-14)

Cherrypicks

  • Minor bugfixes

v1.51.1 (2021-09-13)

Cherrypicks

  • Fix release pipeline

v1.51.0 (2021-09-08)

Added

  • CAS: Workers now pick up existing files from the CAS directory. It's no longer necessary to delete this directory after a worker is restarted. If this behavior breaks something, use --recover_cas_blobs=false and let EngFlow know.
  • Add metrics for inbound BEP events.
    • com.engflow.eventstore/new_inbound_stream
    • com.engflow.eventstore/new_inbound_bep_event
    • com.engflow.eventstore/new_outbound_bep_event
    • com.engflow.eventstore/new_outbound_stream
    • com.engflow.eventstore/ongoing_streams
  • The build results UI can now authenticate users with Google's login page. See --http_auth=google_login.

Changed

  • Enable --upload_outputs_on_failure by default.
  • UI: Update branding.
  • AWS, dashboard module: Reduce window size from 300s to 60s.
  • GCP, Terraform files: Move service accounts into a Terraform module.
  • The new default of --http_auth is deny. Make sure you override this flag as needed.
  • EngFlow .deb installer: depends on the full OpenJDK, not just the JRE

Fixed

  • Fixed a race condition with BES and multiple schedulers.
  • Use exec-root of executor when starting reusable docker containers. This fixes a bug causing containers not to start if the user has no permission to access the container's default workdir (e.g. when setting it to /root).
  • Do not proxy errors to cancelled client streams.
  • UI: Display correct status icon in target list view.
  • Fix race condition during BES live replay.

Deprecated

  • --docker_use_path is a no-op; please use --incompatible_keep_relative_argv0 instead.
  • --docker_use_addgroup is no longer supported.

v1.50.6 (2021-08-31)

Cherrypicks

  • Fixed unbounded thread creation

v1.50.5 (2021-08-24)

Cherrypicks

  • Added Metrics around BES upload
    • com.engflow.eventstore/new_inbound_stream
    • com.engflow.eventstore/new_inbound_bep_event
    • com.engflow.eventstore/new_outbound_bep_event
    • com.engflow.eventstore/new_outbound_stream
    • com.engflow.eventstore/ongoing_streams

v1.50.4 (2021-08-24)

Cherrypicks

  • Fix race condition during BES live replay

v1.50.3 (2021-08-05)

No change. Just re-triggering the release.

v1.50.0 (2021-08-05)

Added

  • Docker: the new --docker_default_network_mode flag controls the default value of "dockerNetwork" (when the client doesn't request any).
  • MacOS: actions run with sandboxing if --experimental_allow_mac_sandbox is enabled

Deprecated

  • We've removed support for Java 8 and Ubuntu 16.04.

v1.49.0 (2021-07-12)

Added

  • Linux: support file cloning for file systems that support it
  • Docker: add a flag to enable Docker signature verification (--docker_content_trust)

Changed

  • Switch to react for the cluster status page (--enable_status_page)
  • MacOS: packages now contain the process-wrapper binary which can enforce proper shutdown of actions
  • GCP: various improvements to terraform configuration (enable stackdriver integration by default, enable shielded VMs by default, add dashboard module)

Fixed

  • Profiling: fixed a hang when downloading large server-side profiles
  • Profiling: fixed a hang when downloading an unfinished server-side profile
  • Persistent workers: action timeouts are now properly enforced

v1.48.0 (2021-06-23)

Changed

  • Improved documentation around persistent workers.

Fixed

  • Reliability: Internal "Connection reset" calls no longer trigger INTERNAL gRPC errors.

v1.47.0 (2021-06-08)

Changed

  • AWS: The packer config in our release (base-image.json) now installs the AWS SSM agent.

Fixed

  • AWS: fix instance id retrieval on IMDSv2.
  • Reliability: The eviction policy on the action cache could previously cause long-running scheduler services to crash, even if all instances are individually restarted (the schedulers automatically replicate entries from removed instances).

Deprecated

  • The com.engflow.re.scheduler/available_workers metric is deprecated. We recommend using the new com.engflow.re.scheduler/existing_executors metric instead.

v1.46.1 (2021-05-25)

Cherrypicks

  • Fixed a stack overflow bug that caused schedulers to crash.

v1.46.0 (2021-05-24)

Changed

  • Updated the Bazel process-wrapper that is used by workers to isolate actions

Fixed

  • GCP logging will not be enabled even with the flag set if the instances detect that they are running outside of GCP.
  • Persistent workers will be restarted if the kernel kills the action process. This mitigates the risk of leaking processes on poorly behaving actions.

v1.45.1 (2021-05-14)

Cherrypicks

  • Deployment kit, Dockerfile: fix v1.45.0 regressions

v1.45.0 (2021-05-13)

Deprecated

  • --auto_worker_expiration and --docker_use_init are now both no-ops. They were enabled by default in v1.31.0 and are now always on.
  • Deployment kit, Kubernetes: deleted the obsolete setup files (gen-k8s-config.py and templates/ directory); updated the documentation about the current Kustomization-based setup

Added

  • --experimental_async_storage_uploads: When enabled, the service does not wait for async uploads to complete. This should improve performance in cases where such uploads are slow.

Changed

  • Deployment kit, Debian package: the package no longer "Depends" on OpenJDK; it now "Recommends" OpenJDK's JRE. This lets you skip installing that Java runtime, and use a different runtime. The dockerfile has a --build-arg to control that (see below).
  • Deployment kit, GCP Terraform file: added firewall rule to allow health checks; listen on port 443 instead of 8080; increase scheduler disk size
  • Deployment kit, AWS Terraform file: added use_s3 variable; can generate a random S3 bucket name; cluster_name is customizable
  • Deployment kit, engflow.Dockerfile: made it configurable via --build-arg, installing the JRE and Docker are now optional
  • CloudWatch: log stream names now show the machine's role (scheduler or worker) and are easier to read

Fixed

  • S3 / GCS: intermittent errors are now reported as UNAVAILABLE, not as INTERNAL error
  • gRPC / netty: closed channels are now reported as UNAVAILABLE error

v1.44.1 (2021-05-07)

Cherrypicks

  • Fixed an error where new nodes were unable to join the cluster due to third-party library incompatibilities

v1.44.0 (2021-04-30)

Added

  • Authentication: added a deny mode that denies all incoming requests for the --client_auth and --http_auth options

Changed

  • The --gcs_credentials flag is no longer deprecated
  • AWS deployment configuration: added more alerts

Fixed

  • The status web page is no longer available on the private scheduler port, only on the public port
  • Handle premature exit of the persistent worker process; these are now automatically retried and provide a better error message

v1.43.0 (2021-04-21)

Changed

  • Validate that --cas_path points to a writeable directory
  • Kubernetes: improve configuration to be less dependent on cluster config

Fixed

  • Fixed an error when the client sends an empty byte stream
  • Fixed basic auth documentation

v1.42.0 (2021-04-12)

Added

  • Kubernetes: you can override the default Kubernetes master address with --k8s_master. Normally this should not be necessary, except if you see discovery problems.
  • --worker_config now accepts auto, which creates one executor that uses all available cores.
  • Actions now log how many output files (and total bytes) they uploaded to the CAS. (Only when replication is enabled.)

Changed

  • Kubernetes: added Dockerfile; added affinity rules to the on-prem Kustomizable overlay

Fixed

  • Fixed Mac release packages that were broken since v1.38

v1.41.1 (2021-04-08)

Cherrypicks

  • Fixed Mac release packages

v1.41.0 (2021-04-07)

Added

  • Kubernetes: new and improved Kustomization-based deployment templates for K8s

Changed

  • Docker: forward env variables for AWS credential as well as DOCKER_HOST to Docker invocations; this supports setups other than the default Docker socket
  • Added more metadata and failed actions to the server-side profile (see --experimental_profile_dir)

Fixed

  • Safeguarded against undeletable files when reusing exec roots
  • Fixed tracking of CAS file locations

v1.40.0 (2021-03-31)

Added

  • S3: The --incompatible_s3_use_structured_paths flag changes the directory structure, making blob access faster. This is an incompatible change: enabling the flag means the cluster won't find the old bucket content.

Changed

  • Docker FIFO creation: report stderr on failure
  • Docker internal retry: print if stderr was empty
  • S3: we now support more than 50 concurrent connections; see --external_storage_worker_threads and --external_storage_scheduler_threads
  • S3: retry failed downloads
  • S3: set IOException cause for generic errors, so error logs are more detailed

Fixed

  • AWS deployment kit: fix use of list option
  • AWS deployment kit: enable instance_refresh in the Terraform config
  • Docker "OCI runtime exec failed": fixed the --experimental_docker_internal_error_stderr_pattern semantics (added in v1.31), we now correctly retry such actions.
  • Docker: check container after every non-zero exit. This should help with containers that become unusable, e.g., due to a docker daemon restart.
  • Persistent workers: fixed the bug where workers sometimes failed to start, printing execution failed INTERNAL: Bad response from worker:

v1.39.0 (2021-03-23)

Added

  • Added an experimental server-side profiling implementation (see --experimental_profile_dir)

Changed

  • Logging: log average download rate per storage location; look for 'timing' in the worker logs
  • Enable recursive output tree action cache verification by default; previously, the action cache could return cache entries with output files that were no longer available in the CAS, breaking Bazel's build-without-the-bytes mode (--experimental_check_action_cache_recursively)
  • S3 / GCS: use 50 threads by default on workers and remove upper limit (50) on S3
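
The idea behind the recursive action cache verification above can be sketched as follows: a cached result is only served if every output it references, including the files inside output trees, is still present in the CAS. The data model here is a deliberately simplified, hypothetical stand-in; the real service works on protocol messages and reads tree contents from the tree blob.

```python
from dataclasses import dataclass, field


@dataclass
class CachedResult:
    """Simplified, hypothetical stand-in for a cached action result."""
    output_files: list = field(default_factory=list)   # digests of output files
    output_trees: dict = field(default_factory=dict)   # tree digest -> digests of files in the tree


def is_servable(result, cas):
    """True only if all outputs, including files inside output trees, are in the CAS."""
    for digest in result.output_files:
        if digest not in cas:
            return False
    for tree_digest, children in result.output_trees.items():
        if tree_digest not in cas:
            return False  # the tree description itself is gone
        if any(child not in cas for child in children):
            return False  # a file inside the tree is gone
    return True
```

A non-recursive check would stop at the tree digest and could return an entry whose nested files have been evicted, which is the failure mode that breaks build-without-the-bytes.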

v1.38.5 (2021-04-21)

Cherrypicks

  • Fix release package build

v1.38.4 (2021-04-21)

Cherrypicks

  • Fix release package build

v1.38.3 (2021-04-20)

Cherrypicks

  • Enable the recursive AC check by default
  • Check existence of tree blob
  • Clean exec root if input tree creation fails
  • Force-add replicas to the location map
  • Fixed Mac release packages

v1.38.2 (2021-03-25)

Cherrypicks

  • Add --incompatible_s3_use_structured_paths to use structured paths in S3, which may significantly improve performance under high load

v1.38.1 (2021-03-23)

Cherrypicks

  • Correctly propagate metadata for internal CAS download calls

v1.38.0 (2021-03-22)

Added

  • IPv6 support: added --docker_ipv6_cidr and --docker_ipv6_subnet_length to configure the IPv6 subnets for Dockerized actions

Changed

  • The service now returns an error for HTTP/1.X connections to the gRPC port
  • AWS: Improved deployment templates
  • File downloads are retried internally if there are more copies in the distributed CAS
  • If --experimental_per_executor_dirs is enabled, actions are always run in a deterministically-named directory
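
The internal download retry described above (try another copy if one replica fails) follows a common fallback pattern, sketched here. The function name, the fetch callback, and the choice of OSError as the retriable error type are assumptions for illustration only.

```python
def download_with_fallback(digest, replicas, fetch):
    """Try each known replica in turn; raise only if all of them fail.

    fetch(replica, digest) is a caller-supplied function that returns the
    blob's bytes or raises OSError on a transfer failure.
    """
    last_error = None
    for replica in replicas:
        try:
            return fetch(replica, digest)
        except OSError as error:
            last_error = error  # remember the failure and try the next copy
    raise LookupError(f"no replica could serve {digest}") from last_error
```

The caller only sees an error when every known copy has failed, which is why additional replicas make downloads more robust.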

Fixed

  • IPv6 support: Dockerized actions run with an IPv6 localhost if IPv6 is enabled
  • Fix crash when enabling --experimental_per_executor_dirs in a cluster that has files in the work directory
  • Fix protocol error when a client attempts to execute an action with a lot of missing files
  • Fix reuse of Docker containers between persistent worker and normal actions

v1.37.4 (2021-03-22)

Cherrypicks

  • S3: do not force absolute blobs root; clarify requirements in documentation

v1.37.3 (2021-03-18)

Cherrypicks

  • S3: sanitize blobs root; add logging

v1.37.2 (2021-03-10)

Cherrypicks

  • Fix action cache recursive output directory check

v1.37.1 (2021-03-10)

Cherrypicks

  • Fix worker startup script

v1.37.0 (2021-03-09)

Added

  • Added a flag to support S3-compatible storage services like MinIO (--s3_endpoint)
  • Added an experimental option to force actions into specific pools by action mnemonic (--experimental_force_mnemonic_pool_name); note that this requires the client to send action mnemonics using the recently updated metadata proto

Changed

  • The --use_upload_to_rereplicate flag is now a no-op. Please remove it from your configs.
  • CloudWatch: --experimental_cloudwatch_no_instanceid, --aws_instance_id, --single_instance_monitoring, and --experimental_single_instance_monitoring are now no-op flags. Please remove these from config files. Instances always behave as if --experimental_cloudwatch_no_instanceid=true.

Fixed

  • Persistent workers: correctly use the relative working directory to look up parameter files and run workers; this is needed for Bazel @ HEAD to work
  • Action cache: fix handling of output directories to avoid returning stale action cache entries - this could cause Bazel client errors if build-without-the-bytes is enabled
  • Action execution: added a flag to stop absolutizing argv[0]; this could cause errors with hermetic C++ toolchains outputting absolute paths to .d files and failing Bazel's consistency checks (--incompatible_keep_relative_argv0); this will be enabled by default in a future release; note that this may break some builds that were relying on this (also see the Bazel issue https://github.com/bazelbuild/bazel/issues/13189)

v1.36.0 (2021-02-26)

Added

  • Added a flag to use per-executor working directories (--experimental_per_executor_dirs=true)
  • Added a flag to pass the executor id to local actions through an env variable (--experimental_local_provide_executor_id=true, ENGFLOW_EXECUTOR_ID)
  • Added a platform option to control the exec root strategy; this can be used to switch between the default, fast hardlink strategy which does not set file permissions to a copy strategy that sets the file permissions as requested by the client (experimentalActionInputStrategy=copy)
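
An action that wants to use the ENGFLOW_EXECUTOR_ID environment variable mentioned above might read it as shown below, for example to keep scratch data separate per executor. This is a sketch: the format of the value is not documented here, so it is treated as an opaque string, and the directory layout is purely hypothetical.

```python
import os


def scratch_dir_for_executor(base="/tmp/scratch", environ=os.environ):
    """Derive a per-executor scratch directory from ENGFLOW_EXECUTOR_ID, if set."""
    executor_id = environ.get("ENGFLOW_EXECUTOR_ID")
    if executor_id is None:
        return base  # variable not provided; fall back to a shared directory
    return os.path.join(base, f"executor-{executor_id}")
```

The variable is only present when --experimental_local_provide_executor_id=true is set, so the fallback branch matters for clusters that have not enabled it.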

Changed

  • Improved logging for persistent workers
  • Increased default cache duration for CAS existence checks to external storage to 24h and 10 million entries
  • Increased default cache duration for CAS existence checks to the distributed CAS to 120 seconds
  • Limited download concurrency to at most 200 concurrent downloads by default to avoid running out of native memory or file descriptors
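
The existence-check caches above are bounded by both a duration and, for external storage, a number of entries. The sketch below shows one way such a cache can work; it is not the service's implementation, and the class and method names are hypothetical.

```python
import time
from collections import OrderedDict


class ExistenceCache:
    """Remembers positive existence checks for a limited time and number of entries."""

    def __init__(self, ttl_seconds, max_entries, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._max = max_entries
        self._clock = clock
        self._expiry = OrderedDict()  # digest -> expiry time, oldest insertion first

    def mark_present(self, digest):
        self._expiry.pop(digest, None)  # re-inserting moves the entry to the end
        self._expiry[digest] = self._clock() + self._ttl
        while len(self._expiry) > self._max:
            self._expiry.popitem(last=False)  # evict the oldest entry

    def is_known_present(self, digest):
        expiry = self._expiry.get(digest)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            del self._expiry[digest]  # expired; the caller must re-check the backend
            return False
        return True
```

With the defaults listed above, the external storage cache would correspond to ExistenceCache(ttl_seconds=24 * 3600, max_entries=10_000_000) and the distributed CAS cache to a 120-second TTL. A longer TTL saves backend round trips at the cost of trusting stale answers for longer.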

Fixed

  • Fixed issue where helper threads could go into a busy loop when Docker containers are reused; this may not result in client-visible build issues but causes high CPU load on the worker instances. This was introduced in 1.32.0 when the default for --experimental_docker_avoid_fifo was flipped

v1.35.1 (2021-02-17)

Cherrypicks

  • ExecutedActionMetadata: fix worker start timestamp

v1.35.0 (2021-02-16)

Added

  • Added a flag to use consecutive TCP/IP ports for internal traffic (--incompatible_use_low_offsets=true)
  • S3: Experimental support for multi-part uploads to handle files larger than 5 GB (--experimental_s3_use_transfer_manager=true)
  • Added a flag to disable participation in the distributed CAS; this is useful for satellite clusters where a few machines are remote to the main cluster (--enable_distributed_cas=false)
  • Logging: added a metric to monitor persistent worker use

Fixed

  • Fixed reporting of timestamps in execution result

Deprecated

  • The --experimental_gcs_direct_upload flag is a no-op. Please remove it from your configs.

v1.34.0 (2021-02-08)

Added

  • AWS: Improved AWS Terraform files in the release package to support dashboards and logging

Changed

  • GCS: use a new code path to upload blobs
  • AWS CloudWatch: --experimental_cloudwatch_no_instanceid=true by default. The --aws_instance_id and --single_instance_monitoring flags are deprecated; please remove them from your configs.
  • --storage_range_requests is now a no-op; range requests have been enabled by default since v1.30

Deprecated

  • GCP: the --gcs_credentials flag is deprecated; please use application default credentials instead

v1.33.0 (2021-02-05)

Added

  • AWS: Support for logging to CloudWatch logs; enable with --remote_logging_service=aws_cloudwatch and --aws_log_group_name=name
  • Execution responses include timestamps for client-side metric collection

Changed

  • Debugging: --keep_exec_directories_for_debugging now retains output files as well
  • Docs: clarify metrics documentation

Fixed

  • Fix --experimental_docker_store_images_in_cas to not cache temporary failures
  • Persistent workers: the service no longer waits for persistent worker processes to shut down, but terminates them forcefully

v1.32.4 (2021-02-05)

Cherrypicks

  • AMD64: fix Debian package

v1.32.3 (2021-01-30)

Cherrypicks

  • CloudWatch: fix reporting of cumulative metrics

v1.32.2 (2021-01-29)

Cherrypicks

  • Fix gRPC metrics reporting

v1.32.1 (2021-01-28)

Cherrypicks

  • Fix an invalid name resulting in a SecurityException
  • Revert improved HTTP/1.1 handling; this caused health checks on AWS to fail

v1.32.0 (2021-01-27)

Added

  • macOS: we now release packages for macOS, and added documentation about setting up a basic macOS cluster
  • Docker: Support running containers by image id with --docker_use_image_id; this prevents docker run from attempting to pull the corresponding image
  • Docker: Log time needed to run docker pull
  • Docker: add --experimental_docker_store_images_in_cas to support storing Docker images in the CAS to improve performance and reliability
  • AWS CloudWatch: add --experimental_cloudwatch_no_instanceid; when enabled, all machines will report metrics without InstanceId dimension, which makes the metrics aggregatable
  • Google Cloud Storage: implement a faster upload method, activated with --experimental_gcs_direct_upload=true

Changed

  • Improved Packer and Terraform templates
  • GCP: GCP images are more lightweight and boot faster
  • macOS: --xcode_locator now points to /usr/local/bin/engflow/xcode-locator by default; this is where the macOS package installs this binary
  • Running as a service now requires the file /etc/engflow/config to exist
  • Increase the default value of --default_replica_timeout to 24h
  • CAS / AC: reduce traffic to the storage backend (when using AWS S3 or Google Cloud Storage) with the help of a cache for recently seen blobs; you can customize its behavior with --experimental_cas_existence_cache_max_size and --experimental_cas_existence_cache_expiry
  • Docker: enable --experimental_docker_avoid_fifo by default for compatibility with gVisor
  • Docs: show metric units and aggregation type in the documentation
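
The existence cache for recently seen blobs mentioned above can be sketched as a size-bounded map whose entries expire after a fixed time. This is an illustrative Python sketch only, not the service's code; `ExistenceCache`, `find_missing`, and `backend_has` are hypothetical names, and the two constructor parameters merely play the roles of --experimental_cas_existence_cache_max_size and --experimental_cas_existence_cache_expiry. Caching only positive results is an assumption made here for simplicity.

```python
import time
from collections import OrderedDict


class ExistenceCache:
    """Remembers digests recently confirmed to exist in external storage."""

    def __init__(self, max_size, expiry_seconds, clock=time.monotonic):
        self._max_size = max_size
        self._expiry = expiry_seconds
        self._clock = clock
        self._seen = OrderedDict()  # digest -> time it was last confirmed

    def mark_present(self, digest):
        self._seen[digest] = self._clock()
        self._seen.move_to_end(digest)
        # Evict the least recently confirmed entries beyond the size limit.
        while len(self._seen) > self._max_size:
            self._seen.popitem(last=False)

    def is_known_present(self, digest):
        confirmed_at = self._seen.get(digest)
        if confirmed_at is None:
            return False
        if self._clock() - confirmed_at >= self._expiry:
            # Entry is stale; forget it so the caller asks the backend again.
            del self._seen[digest]
            return False
        return True


def find_missing(digests, cache, backend_has):
    """Returns missing digests, asking the backend only on cache misses."""
    missing = []
    for digest in digests:
        if cache.is_known_present(digest):
            continue
        if backend_has(digest):
            cache.mark_present(digest)
        else:
            missing.append(digest)
    return missing
```

A larger size or longer expiry saves more backend requests, at the cost of trusting stale answers for longer if blobs are deleted from the storage backend in the meantime.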

Fixed

  • CAS.batchReadBlobs can now respond with INVALID_ARGUMENT for invalid digests. This is an incompatible bugfix and it's disabled by default; enable it with --incompatible_batch_read_blobs_verifies_digests=true
  • Attempting to connect to the cluster via an HTTP/1.1 connection now returns an HTTP/1.1 error reply rather than simply closing the connection
  • CAS: Fix off-by-one error when replicating in the distributed CAS; previously, the cluster created one replica more than requested
  • Metrics: The com.engflow.re.cas/available_space metric is now clamped at zero; previously it was possible for it to temporarily dip below zero while running GC
  • Enable --use_upload_to_rereplicate by default; this fixes a rare mutual deadlock condition when two workers simultaneously attempt to upload files to each other
  • Upgrade gRPC library, which fixes an issue with slow up- and downloads on high-latency connections; unfortunately, we had to disable --experimental_log_unavailable_rpcs during the upgrade
  • Restore compatibility with Java 8
  • CAS: fix a potential scenario where the service could write an incomplete file to Google Cloud Storage
  • Cloud: templates now disable systemd / syslogd integration by default; having the integration enabled causes log lines to be duplicated to multiple log files, which could result in running out of disk space
  • Fixed NullPointerException in RereplicatingCasDownloader when --use_upload_to_rereplicate=true

Deprecated

  • AWS discovery: deprecate --aws_security_group; this flag is unnecessary as cluster members find each other by --cluster_name (we recommend also enabling --split_cluster_name=true)

v1.31.2 (2021-01-21)

Cherrypicks

  • RereplicatingCasDownloader: retain Context to fix NPE ("RequestMetadata not set in current context")

v1.31.1 (2020-12-17)

Cherrypicks

  • extraActionInputs: ensure directory exists before attempting to create input

v1.31.0 (2020-12-11)

Added

  • ARM64: we now release a Debian package for ARM64
  • Docker: Initial IPv6 support with --docker_enable_ipv6; this provides an isolated IPv6 network to actions which can be used for testing IPv6 code
  • Docker: Allow resolving executable paths against PATH; this is not compliant with the remote execution spec, but improves compatibility with existing open source projects that rely on this behavior, e.g., TensorFlow and Envoy
  • CAS: Document --experimental_opportunistic_cas - this flag switches to a different replication policy that reduces pressure on the distributed CAS if an external storage is configured; this improves reliability under load
  • Monitoring: Add a metric com.engflow.re.scheduler/existing_schedulers for the number of schedulers; this can be used to detect instances that are unable to report metrics, e.g., to Google Cloud Operations (formerly StackDriver)
  • Logging: log mTLS client authentication events
  • Docker: added --experimental_docker_internal_error_stderr_pattern to control automatic retries for some kinds of docker exec failures

Changed

  • S3: automatically retry failures after a delay
  • Docker: enable --docker_use_init by default; this helps avoid running out of PIDs when actions spawn a large number of subprocesses
  • Execution: enable --auto_worker_expiration by default; improves tracking of available workers

Fixed

  • Deployment: correctly set the Debian package architecture
  • Docker: correctly pass system capabilities to Docker
  • GCS: improved handling of "connection lost" errors

Deprecated

  • Options: the --docker_use_pull flag is now a no-op; the new code is always enabled

v1.30.0 (2020-12-04)

Added

  • GCP auth: print more server-side logs when authentication fails
  • Docker: the new --docker_use_init flag enables running Docker with a proper init process that reaps zombie processes, which avoids running out of PIDs when reusing Docker containers
  • CAS: the new --use_upload_to_rereplicate flag enables using a new CAS re-replication code path that avoids a rare deadlock among worker machines

Changed

  • Debian package: the .deb version is now the release's SemVer, not the build date (check with dpkg -I engflow-re-services.deb)
  • Deployment kit (zip file): the k8s setup files are now under setup/k8s
  • Docker: print reason for container restart
  • External storage: enable range requests by default (see --storage_range_requests)
  • External storage: check on startup if we can access the storage backend
  • AWS Terraform file: renamed the need_external_docker parameter to public_worker_ip

Fixed

  • Build label: fixed missing build label in 1.28 and 1.29
  • Logging: fix the swapped invocation_id and action_digest in ExecutorServer's log line
  • Docs: show the service options' types correctly
  • Docs: display the version selector
  • CloudWatch: report metric units correctly
  • Fix uncaught IllegalStateException wrapping OperationTimeoutException from Hazelcast
  • CAS: detect on-disk file corruption
  • CAS: fix invalidating blobs that went missing with a PRECONDITION_FAILED
  • GCP: fixed Dockerized execution with cached containers on gVisor (requires --experimental_docker_avoid_fifo=true)

v1.29.1 (2020-11-19)

Cherrypicks

  • Monitoring: fix negative pool_utilization metric

v1.29.0 (2020-11-19)

Changed

  • AWS: improve deployment template (simplify role policy, add API endpoints)
  • Monitoring: add more context to logged error messages
  • Logging: log the requested number of blobs for FindMissingBlobs calls in addition to failed and missing digests
  • Logging: --debug_execute_requests also prints stderr for failed actions

Fixed

  • Execution: correctly create all requested output & input directories
  • CAS: do not unlist CAS nodes that fail due to timeouts; this could potentially result in a denial-of-service if the client sets small timeouts for large uploads
  • CloudWatch: respect max reporting batch size
  • CloudWatch: silently skip histogram metrics, which always failed to report
  • Documentation: correctly render metrics reporting percentages
  • S3: print the correct region name when the region is us-east-1
  • Networking: fix reporting of stream errors

v1.28.0 (2020-11-13)

Added

  • AWS, monitoring: --experimental_single_instance_monitoring is now called --single_instance_monitoring (the old name still works)
  • Add --external_storage_scheduler_threads and --external_storage_worker_threads to allow customizing the external storage thread pool

Changed

  • macOS: sign release
  • GCP, monitoring: Remove code to report metrics to Google Cloud Operations every 30 minutes
  • Logging: Correctly report missing blobs, improve GCS error logs
  • AWS: Improved Terraform template for cluster setup

Fixed

  • GCP, monitoring: fix sample reporting for charts that measure rates

v1.27.7 (2020-11-12)

Cherrypicks

  • Status page: fix --http_auth=none to allow access to the status page

v1.27.6 (2020-11-11)

Cherrypicks

  • Infrastructure: fix CI configuration for releases
  • Infrastructure: fix CI machine selection for releases
  • Monitoring: report two values before skipping; this should allow GCP metrics to drop to zero

Added

  • Logging: the --experimental_log_unavailable_rpcs flag (boolean) enables logging the stack trace of RPC calls that fail with UNAVAILABLE. We added this feature only for debugging, and we plan to remove it as soon as we can.
  • Monitoring: Added --enable_status_page to provide a basic cluster status page over HTTP/2 (only!) on the same IP+port as the gRPC endpoint (previously undocumented as --experimental_status_page)
  • Release archive now contains a CHANGELOG.md (this file)
  • CAS: Added an experimental flag to change the CAS re-replication policy to be less aggressive (--experimental_opportunistic_cas). Note:
    • This is an incompatible flag and may require downtime to roll out
    • This should only be enabled when external storage is enabled

Changed

  • Logging: log more detailed CAS upload errors, report INVALID_ARGUMENT correctly, report RESOURCE_EXHAUSTED instead of UNAVAILABLE when no workers are available
  • Logging: log a summary of missing blobs and failures for FindMissingBlobs calls
  • Monitoring: Report metrics to Google Cloud Operations at least every 30 minutes
  • Documentation: the "Bazel First-Time Setup" page now recommends --remote_timeout=600 instead of 3600
  • Docker: Pass --userns=host to Docker to explicitly disable user namespaces; previously, all actions failed when user namespaces were enabled in the Docker daemon

Fixed

  • Dockerized execution: disable user namespaces to avoid action failures
  • Code cleanup: several bugfixes found by static analyzers
  • Error handling: report an error if the output tree cannot be deleted (primarily when --experimental_docker_use_platform_user is enabled)

Deprecated

  • Options: the (undocumented) --affinity_scheduling flag is now a no-op; the new code is always enabled

v1.26.2 (2020-11-02)

Cherrypicks

  • Infrastructure: fixed version name computation in our release pipeline

Added

  • macOS: create release
  • Added the --experimental_cas_check_storage_only flag to enable faster CAS checks (when --external_storage is not none)
  • Logging: worker logs the CAS size upon startup

Changed

  • Dockerized actions: add container hostname to /etc/hosts

Fixed

  • Monitoring: report CAS usage regularly, not just when doing a GC
  • Code cleanup: lots of bugfixes found by static analyzers

v1.25.1 (2020-10-30)

Added

  • Documentation: for Remote Persistent Workers
  • /healthz page
  • Status page: now authenticates clients, see the --http_auth flag
  • AWS, monitoring: support for single-machine-only monitoring; see the --experimental_single_instance_monitoring flag
  • Monitoring: the com.engflow.re.scheduler/pool_utilization metric shows what percentage of executors in a pool are currently used

Changed

  • Monitoring: the com.engflow.re.scheduler/queue_age metric now reports min/max ages broken down by executor pool
  • deb package: post-install script creates engflow user's home dir

Fixed

  • Monitoring: stuck actions are now removed from the scheduler's queue, and won't drive up the max age forever
  • Monitoring: fixed GCS metrics that tried reporting negative values

v1.24.1 (2020-10-28)

Cherrypicks

  • GcsClient copy: always set target of copy request

v1.23.1 (2020-10-28)

Cherrypicks

  • GcsClient copy: always set target of copy request

v1.24.0 (2020-10-19)

Added

  • Logging: log OpenCensus attempt to record negative value
  • Logging: LocalExecutionServer tracks and logs per-action timing
  • Monitoring: com.engflow.re.bytestream/read metric to monitor complete vs. partial ByteStream.read calls (hidden from docs because we wanted to use it for debugging only)

Changed

  • Monitoring: com.engflow.re.storage/ops_queue_size now shows the composition of external storage ops queue

Fixed

  • Don't execute an action if input fetch failed

v1.23.0 (2020-10-14)

Added

  • Status page: a simple status page with members list, on the same port as the remote execution service; enabled with --experimental_status_page=true

Fixed

  • ByteStream.read: properly implement resumable downloads from S3/GCS, guarded by --experimental_storage_range_requests=true
  • Fix user-visible cancellation exceptions
  • Fix a race condition in CloudCasDownloader
  • Fix an incorrectly reported AC corruption with S3/GCS

v1.22.0 (2020-10-06)

Added

  • Deployment kit: added an example Bazel project
  • Docs: add on-prem setup instructions

Changed

  • Return INVALID_ARGUMENT for too-large output trees

Fixed

  • Remote logging: Avoid infinite recursion when logging to GCP
  • ExecutionServer: Suppress cancellation exceptions, so they don't get reported to the client
  • Docker pull: fine-grained pull errors

v1.20.1 (2020-10-02)

Cherrypicks

  • Scheduler: also listen to expiration/eviction to avoid losing workers
  • ExecutionServer: catch exceptions from onCompleted to avoid "call already cancelled" errors

v1.21.0 (2020-09-28)

Added

  • Added --experimental_docker_force_reuse flag

Changed

  • AWS, CloudWatch: --cloudwatch_dimensions is now optional
  • AWS, deployment kit: service endpoint now listens on port 443 (was 8080 before)
  • Improved CAS performance

Fixed

  • Fixed bugs in validating the output tree in the client's execution requests
  • Java 8: Fixed a crash bug

v1.20.0 (2020-09-15)

Added

  • Workers: can auto-detect the disk size
  • Service interface: implemented ByteStream.QueryWriteStatus
  • Docs: added system diagram

Changed

  • External storage: use more threads: 50 on schedulers, 25 on workers
  • CentOS: use statically-linked netty-tcnative in the RPM package

v1.19.1 (2020-09-03)

Cherrypicks

  • Fix Hazelcast InterruptedExceptions

v1.19.0 (2020-09-03)

Fixed

  • Hash mismatch issues

v1.18.0 (2020-09-02)

Added

  • New metric: com.engflow.re.storage/ops_queue_size

Changed

  • Enabled affinity-scheduling by default (--affinity_scheduling=true)
  • Changed server-side execution log message format to "id: message"

Fixed

  • Fewer DEADLINE_EXCEEDED client errors: more findMissingBlobs caching, reduced GCS/S3 traffic

v1.17.0 (2020-08-26)

Added

  • RPM package for CentOS 7