Release Notes¶
To see your currently deployed version visit [cluster_url]/restatus in your EngFlow cluster web UI. If you do not have the web UI enabled please ask your EngFlow contact which version you are currently running.
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
v2.96.0 (2024-11-14)¶
New¶
- RE: Paths of inputs and outputs now support utf8 characters.
- UI Metrics: The new com.engflow.observability.ui/page_load metric will track how long it took to navigate to and render the next page. This includes cases of client-side navigation, and cases of full-page (F5) browser refreshes, which can be distinguished using the page_load metric tag.
- RE: Add --limit_total_worker_cores; when enabled, the autoscaler takes the maximum total cores into account when determining pool sizes.
Fixed¶
- RE: Include more information in the output to the client when Docker containers fail to start.
- RE: Fix rare issue where the autoscaler could get stuck.
- CAS: Correctly handle UNIMPLEMENTED return code from workers that do not participate in the CAS.
- CAS: workers not serving distributed CAS now fall back to external storage after a distributed CAS failure.
- UI: Allow download of extra test outputs in Firefox.
- UI: Fixed a bug where existing targets would sometimes return a "missing" error.
Changed¶
- Auth: When trying to create an existing role with custom roles API, return an already exists exception instead of an internal error.
- RE: Adjusted autoscaling estimates for long-running actions.
Deprecated¶
- --experimental_force_sibling_containers_pool_name is now a no-op.
- --force_pool_name is now a no-op.
v2.95.1 (2024-11-13)¶
Fixed¶
- CAS: Workers not serving distributed CAS now fall back to external storage after a distributed CAS failure.
- CAS: Correctly handle UNIMPLEMENTED return code from workers that do not participate in the CAS.
v2.95.0 (2024-11-05)¶
New¶
- RE: Add --docker_max_container_modified_size to restart reusable containers if they accumulate untracked data.
- Metrics: Report com.engflow.secretstore/operation_duration_seconds for information on accessing secrets from schedulers.
Fixed¶
- CAS: Execution-only workers fall back to both cache workers and external storage when a file is not present in the distributed CAS.
- CAS: Invalidate CAS existence cache entries when the referenced blob is missing.
- UI: Triple-dot menus now render on top of all other elements.
- UI: The test UI now handles long test output better by displaying only the first 5 lines, with a button to view the full output.
- UI: Fixed an issue where streamed action outputs in the UI failed to reset the byte count from previously opened outputs.
- UI: Fixed an issue rendering the invocation details page under some auth setups.
- UI: Fixes a UI bug where the action details page would incorrectly claim there was an error while fetching stdout or stderr.
- UI: Expanded packages in the target tree are automatically scrolled to the top of the viewport.
- ResultStore: Fixed an error during target tree building when aspects with parameters are used.
- ResultStore: Fixed an issue encountered while attempting to reduce corrupted BEP streams. This would manifest as an NPE displayed on an invocation's details page, in place of any other information.
Changed¶
- UI: Reordered the columns in the worker table on the /restatus page.
- Metrics: The com.engflow.resultstore/reduce_bes_count metric has been renamed to com.engflow.resultstore/new_reduce_bes_count.
v2.94.2 (2024-10-25)¶
Fixed¶
- UI: Fixed an issue rendering the invocations page and invocation details page under some auth setups.
v2.94.1 (2024-10-18)¶
Fixed¶
- CAS: Fixed an issue where blob expirations could not be longer than 22 days from scheduler startup.
v2.94.0 (2024-10-16)¶
New¶
- RE: Add GCP Secret Manager credential helper for Docker.
- RE: Support scaling down pools to 0 instances faster.
- RE: Added metrics for SecretStore operations.
- RE: Add flag --pools_config to communicate pool configuration data to the scheduler. This flag is intended to replace multiple other flags once the migration is complete.
- RE: Add a new flag, --experimental_cas_read_preferred_nodes_only, which allows limiting the nodes that are read from in distributed CAS for certain cluster configurations.
- CAS: Track the current GC window via the new metric com.engflow.storage.gc/gc_window_seconds.
- CAS: Track staleness of CAS blobs when their expiration is refreshed via the new metric com.engflow.re.storage/time_to_expiry_seconds.
Changed¶
- UI: Update Perfetto to v48.0.
- UI: Expanded packages in the target tree are automatically scrolled to the top of the viewport.
- UI: Always display all status filters on the target tree, even when the invocation hasn't completed.
- UI: When a build is completed the target status filters will narrow based on the result of the invocation to help highlight potentially errant targets.
- Update the persistent worker actions metric to match the container lifecycle metric - report pool and lifecycle event.
- Update EngFlow server profiles to display the pool on the instance row, in the Perfetto UI.
- CLT is now available in all supported regions.
Fixed¶
- RE: Fixed an issue where external storage HTTP connections could leak when cancelled.
- Logging: Remove a chatty scheduler log message ("Handling getXXX locally" and similar). This reduces logging costs.
- UI: Fixed bug in which failed targets don't show up properly on the highlights tab of an invocation's page.
- UI: Fixed some cases where invocations would erroneously display <unknown> as the principal which requested or executed them, when viewed on the invocation search and invocation details pages.
Deprecated¶
- RE: Deprecated metric com.engflow.re.auth.async/call_count has been removed.
v2.93.1 (2024-10-10)¶
Fixed¶
- RE: Fall back to fetch blobs from S3/GCS when a worker gets a corrupt (digest mismatch) blob from another worker.
- RE: Fix issue where cancelled HTTP connections to external storage leaked.
v2.93.0 (2024-10-01)¶
New¶
- RE: Add metric com.engflow.re.exec.poolgroups/oom_count reporting how many execute responses were classified as OOMs.
- RE: Add metric com.engflow.re.exec.poolgroups/initial_recommendation_count reporting how often an action is executed on a different pool than requested.
- RE: Add metric com.engflow.re.exec.poolgroups/recommendation_change_count reporting how often the smart recommender recommends a pool that differs from the pool the action was previously executed on.
Changed¶
- RE: Improve the automatic OOM detection by factoring in cgroup oom_kill events, if available.
- The --cas_existence_cache_expiry flag is now also applied to expiration storage.
- The --k8s_namespace flag is now a no-op. This flag only mattered when deploying to Kubernetes, and it's unnecessary because the Pod can read its namespace from the Downward API.
- The --k8s_all_pods_service flag is now a no-op. This flag only mattered when deploying to Kubernetes, and for a long time now it had to be always equal to --k8s_scheduler_pods_service; the duplication made this flag unnecessary.
- --disable_pw_scheduled_threads is now a no-op.
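The OOM detection change above mentions cgroup oom_kill events. As a rough illustration only (not EngFlow's implementation), the sketch below shows how an oom_kill counter can be read from the text of a cgroup v2 memory.events file and compared before and after an action. The function names and the sample contents are assumptions made for this example.

```python
def oom_kill_count(memory_events_text: str) -> int:
    """Return the oom_kill counter from cgroup v2 memory.events contents.

    Each line of memory.events has the form "<key> <value>".
    Returns 0 if the key is not present.
    """
    for line in memory_events_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "oom_kill":
            return int(parts[1])
    return 0


def was_oom_killed(before_text: str, after_text: str) -> bool:
    """Hypothetical helper: True if the counter grew while the action ran."""
    return oom_kill_count(after_text) > oom_kill_count(before_text)


# Sample file contents, made up for illustration.
before = "low 0\nhigh 0\nmax 2\noom 0\noom_kill 0\n"
after = "low 0\nhigh 0\nmax 5\noom 1\noom_kill 1\n"
print(was_oom_killed(before, after))  # True
```

Comparing two snapshots rather than checking for a non-zero value avoids blaming an action for a kill that happened earlier in a reused container.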
Fixed¶
- Fix race condition where HTTP connections were not properly closed when an AC read call was cancelled.
- UI: Fixed bug in which failed targets don't show up properly on the highlights tab of an invocation's page.
v2.92.2 (2024-09-26)¶
New¶
- RE: Add metric com.engflow.re.exec.poolgroups/oom_count reporting how many execute responses were classified as OOMs.
- RE: Add metric com.engflow.re.exec.poolgroups/initial_recommendation_count reporting how often an action is executed on a different pool than requested.
- RE: Add metric com.engflow.re.exec.poolgroups/recommendation_change_count reporting how often the smart recommender recommends a pool that differs from the pool the action was previously executed on.
v2.92.1 (2024-09-25)¶
New¶
- The --cas_existence_cache_expiry flag is now also applied to expiration storage.
Fixed¶
- RE: Fix race condition where HTTP connections were not properly closed when an AC read call was cancelled.
v2.92.0 (2024-09-23)¶
New¶
- Record oom kills in the profile during action execution in Docker.
- RE: Add metric com.engflow.docker.container/size to report sizes of containers.
- RE: Add metric com.engflow.re.exec/max_rss_kib that reports the MaxRSS (maximum resident set size) for a successfully executed action, if available, to record how much memory actions use.
- RE: Support smart pool recommendations. If enabled, this feature can improve performance and reduce costs of executing actions remotely. Clusters can specify groups of pools, and the pool within a group on which to execute an action is then selected automatically. The pool recommendation is based on previous execution statistics and currently selects by memory usage.
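The release notes do not describe how the smart pool recommender decides. Purely as a sketch of the general idea (select a pool within a group based on previously observed memory usage), the example below picks the smallest pool whose memory covers the largest observed MaxRSS. The Pool type, the recommend_pool function, and the selection rule are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    """Hypothetical description of a worker pool in a pool group."""
    name: str
    memory_kib: int


def recommend_pool(group: list[Pool], observed_max_rss_kib: list[int],
                   requested: Pool) -> Pool:
    """Pick the smallest pool that fits the largest observed MaxRSS.

    With no execution history, keep the pool that was requested.
    If no pool is large enough, fall back to the largest one.
    """
    if not observed_max_rss_kib:
        return requested
    needed = max(observed_max_rss_kib)
    fitting = [p for p in group if p.memory_kib >= needed]
    if not fitting:
        return max(group, key=lambda p: p.memory_kib)
    return min(fitting, key=lambda p: p.memory_kib)


small = Pool("small", 4 * 1024 * 1024)
large = Pool("large", 16 * 1024 * 1024)
# An action requested on "large" that historically used about 1 GiB.
print(recommend_pool([small, large], [1_000_000, 1_048_576], large).name)  # small
```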
Changed¶
- RE: The metrics com.engflow.resultstore/reduce_bes_replay_source_count, com.engflow.resultstore/reduce_bes_replay_removed_from_cache_count, com.engflow.resultstore/reduce_bes_completed_duration_since_finish_event and com.engflow.resultstore/reduce_bes_count now include a tag specifying which replay type (Combined, Invocation Metadata or Target Tree) the data refers to.
Fixed¶
- RE: Fix a rare deadlock when actions timeout from a pool.
- UI: Improved target tree and target fetching performance.
- UI: Add UI error boundaries to help debug rendering errors and explain to users what went wrong and potentially how to fix the issue.
- UI: Emoji file extensions are rendered in the correct order in the Input Tree section of the Action Details page.
- The fluent-bit configuration now allows dropping long log lines. Previously, if a log line was excessively long (32K), fluent-bit would stop log-shipping that service; this could lead to missing logs. Now it will just skip such lines.
- RE: ActionCache replication can be symmetrical.
Deprecated¶
- --experimental_docker_proxy is now a no-op.
- --hazelcast_aws_use_sdk is now a no-op.
- --experimental_docker_store_images_in_cas is now a no-op.
v2.91.3 (2024-09-18)¶
Fixed¶
- RE: prevent reads after every S3 upload.
- RE: Be stricter about keeping sysbox up.
v2.91.2 (2024-09-15)¶
Fixed¶
- RE: fix pulling non-canonical container images.
v2.91.1 (2024-09-11)¶
Fixed¶
- API: Fixed LIST endpoint for Docker images.
v2.89.4 (2024-09-10)¶
Fixed¶
- UI: Fixed download paths for test resources.
v2.91.0 (2024-09-11)¶
New¶
- UI: Action execution pages (/actions/executions/<execution-id>) will now show partial information if available.
- RE: Add metric com.engflow.docker.container/existing reporting the number of existing docker containers on workers, aggregated by their state.
- RE: Add metric com.engflow.docker.image/size reporting the sizes of existing docker images on workers.
- UI: The Getting Started page now displays instructions for multiple client authentication methods, when enabled.
- RE: Remote persistent workers are now always enabled on Linux and MacOS.
Changed¶
- RE: com.engflow.re.exec.docker/existing_containers metric is removed in favor of com.engflow.docker/containers.
- UI: The com.engflow.observability.ui/page_load_with_data_requests metric now also records the load duration experienced by the user when opening the Invocation Details page and the Analytics page.
Fixed¶
- UI: The requester and/or runner of an invocation is now correctly populated when doing BES processing on analyzer instances. Previously, the UI showed these as "unknown".
- UI: Fixed download paths for test resources.
- UI: Fixed some cases where the build and test UI could accidentally fetch some resources twice.
Deprecated¶
- --hazelcast_aws_use_client_lib is now a no-op.
- --docker_split_exec_run is now a no-op. Use --docker_allow_reuse instead.
- --experimental_persistent_worker_expand_param_files is now a no-op.
v2.90.0 (2024-09-03)¶
New¶
- UI: The new metric com.engflow.bes.replay/cpu_time_for_event reports how much CPU time was spent replaying and reducing a single event within a build's BES.
Fixed¶
- UI: Principals displayed in the User Menu will now be clipped when they cannot fit in the space, instead of overflowing.
- UI: Allow "Enter" to login when logging in via basic authentication.
- UI: Improved page load times by requiring one less call to fetch feature toggles before starting page render.
- UI: JSON Bazel profiles will now download with the correct extension.
- UI: Fixed bug where test details would be fetched multiple times even when nothing changed.
- Scheduler: built-in autoscaler fixes.
Changed¶
- Remote persistent workers are now enabled by default on all supported platforms (Linux and macOS).
- UI: Provide more meaningful information when Bazel profiles did not get uploaded correctly or are missing from the CAS.
- BES: Significantly reduce the amount of memory used by file references when reducing the BES to extract target details.
- Reduce S3 upload buffer size and enable SDK multipart upload.
- Remove pin on google-guest-agent to 1:20240528.00-g1 in the base Debian image on GCP.
Deprecated¶
- --experimental_persistent_worker_and_docker is now a no-op.
- --disable_profile_generator_cleanup_threads is now a no-op.
v2.89.3 (2024-09-03)¶
Fixed¶
- RE: Fixed undercounting of the number of actions coalesced in com.engflow.re.scheduler/coalesced_executions metric and logs.
v2.89.2 (2024-08-26)¶
New¶
- Scheduler: add the metrics com.engflow.re.scheduler/estimated_action_time and com.engflow.re.scheduler/estimated_induced_load, and change com.engflow.re.scheduler/desired_executors to report the pre-adjusted value to debug issues with the pool sizes.
Changed¶
- RE scheduler: tweak autoscaling equation for faster scale-up on load spikes.
- BES: Significantly reduce the amount of memory used by file references when reducing the BES to extract target details.
v2.89.1 (2024-08-21)¶
Changed¶
- RE: report a retryable status code to the client when container startup times out.
- RE: Collect more information when Docker containers fail to start.
Deprecated¶
- --experimental_force_lru is now a no-op.
- --http_compression is now a no-op.
v2.89.0 (2024-08-19)¶
Internal release. No publicly facing changes.
v2.88.0 (2024-08-13)¶
New¶
- CI Runners: support polling multiple GitHub repositories.
Fixed¶
- RE: Do not wait for containers to start when their entrypoint has crashed; fail immediately.
- BES: If automatic indexing fails to save the last status of an invocation, update the index on the next BES reduction.
- CLT: Fix bad paths and yaml syntax for grafana and prometheus.
Changed¶
- UI: Changed "exit code" filter to "Bazel exit code" to avoid confusion with process exit codes.
- UI: Update Perfetto Trace Viewer to v47.0 and minify files for faster viewing of the Bazel and EngFlow profile.
- UI: The branch name displayed on the invocation search page will now also be clipped when it is longer than 15 characters, mimicking the visual behavior of the commit ID next to it.
v2.87.0 (2024-08-05)¶
New¶
- UI Auth: The OIDC provider may now set engflow_roles to a list of role names. The user will have these roles instead of roles set with --principal_based_permissions.
Fixed¶
- CAS: CAS Reads will no longer retry reading 0 bytes, which was previously caused by retrying a read that had already finished.
- S3: Handle S3 416 errors gracefully.
- UI: Fixed a bug that prevented users from setting a custom timezone.
- UI: Fixed an issue where small durations (such as action queueing time, in the action details page) were incorrectly displayed as nearly 24 hours instead.
Incompatible¶
- BES/EventStore: The service no longer translates deprecated timestamp and duration fields in millis to their corresponding new Timestamp and Duration counterparts. This is a behavioral change for BEP sent from Bazel clients running version 4.x or older. Bazel versions 5.0.0+ are not affected.
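For readers who still need to handle BEP from older Bazel clients themselves, the translation described above amounts to splitting a millisecond value into whole seconds and nanoseconds, which is the shape used by the protobuf Timestamp and Duration types. The sketch below is a generic illustration, not the removed EngFlow code; the function name is an assumption, and negative inputs are rejected because the sign handling differs between Timestamp and Duration.

```python
def millis_to_seconds_nanos(millis: int) -> tuple[int, int]:
    """Split non-negative milliseconds into (seconds, nanos).

    The pair matches the seconds/nanos fields of protobuf's
    Timestamp and Duration messages for non-negative values.
    """
    if millis < 0:
        raise ValueError("negative values are out of scope for this sketch")
    seconds, remainder_ms = divmod(millis, 1000)
    return seconds, remainder_ms * 1_000_000


print(millis_to_seconds_nanos(1500))  # (1, 500000000)
```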
Changed¶
- BES: Improved EventStore performance: if no sensitive data to redact is detected in an incoming BEP event, avoid packing the unchanged event before saving it to external storage or processing it otherwise.
- BES: Build events received are now written to storage and acknowledged after at most 2 minutes.
- BES: Improved EventStore performance by pre-filtering packed build events and only unpacking them if they might have data that should be redacted.
- API: The v1 and v1alpha IdentityManagementServer APIs have been consolidated into a single v1 API.
v2.86.1 (2024-07-28)¶
Internal release. No publicly facing changes.
v2.86.0 (2024-07-28)¶
New¶
- BES: The new metric com.engflow.eventstore/bep_event_ack_latency reports how long it takes to acknowledge a build event the client sent to EngFlow's BES.
- UI: The new metric com.engflow.observability.ui/page_load_with_data_requests reports how long it took to render selected pages, including data requests needed to display the pages' initial contents. For example, for the invocation search page, it tracks how long it took to render the static page, plus the first set of invocations.
- BES: The new metric com.engflow.bes.replay/cpu_time tracks how much CPU time was spent replaying and processing an invocation's build events.
Changed¶
- CI Runners: Webhooks now trigger polling instead of reading the hook contents to improve security and reliability.
- RE: Add a log message with the worker pool and action mnemonic when actions coalesce.
- UI: Don't warn that --remote_download_minimal isn't set when using --experimental_remote_output_service.
Fixed¶
- RE: Out-of-CAS-space is now treated as RESOURCE_EXHAUSTED.
- UI: Fix invocation profile HTTP API sometimes returning 500 instead of 404 (NOT FOUND).
- BES: Revert support for BES upload retries, as this introduced issues with live replays.
Deprecated¶
- config: --experimental_junit_test_suite_upload_deadline is now a no-op.
v2.85.3 (2024-07-25)¶
Fixed¶
- CAS: Ensure S3 always returns object expiration date.
- CAS: Ensure the external storage request does not block the network thread pool.
v2.85.2 (2024-07-22)¶
New¶
- BES: Add com.engflow.eventstore/bes_upload_delay to track ingestion performance.
v2.85.1 (2024-07-19)¶
Changed¶
- BES: Remove com.engflow.eventstore/bep_event_ack_latency metric due to increased scheduler load.
- CAS: Move blocking storage actions to a dedicated thread pool.
- CAS: GCS Blocking calls now return storage metadata.
- GCP: Pin google-guest-agent to 1:20240528.00-g1 to avoid network regression in 1:20240701.00-g1.
v2.85.0 (2024-07-15)¶
New¶
- BES: Export com.engflow.eventstore/bep_event_ack_latency, which measures how long it takes to acknowledge a build event the client sent to EngFlow's BES.
- BES: Add a read cache for BES replays to reduce the long tail for indexing performance.
Fixed¶
- CI Runners: re-process jobs resulting in server-side errors.
- CI Runners: more frequent running job status updates.
- ResultStore: Fix a bug that prevented failed attempts to reduce the BES of an invocation to be retried in a timely manner.
- UI: Fixed visual clipping with the help mode button, on the target tree status filter, inside an invocation's details page.
- UI: Fixed an issue with help mode buttons not displaying correctly on the new compact invocation search page.
Changed¶
- BES: --experimental_use_junit_test_suite_parser_v2 is now enabled by default and a no-op.
- Metrics: Target tree processing logging will now include the fully-qualified invocation ID, if any exceptions are raised during this process.
- S3: Blocking time to acquire failures will be treated as RESOURCE_EXHAUSTED.
- UI: Disable auto-complete on search page. This was causing database issues that caused search to perform very slowly.
v2.84.1 (2024-07-10)¶
Cherrypicks¶
- RE: fixes bug where partially downloaded files lost the successfully written prefix when falling back to external storage.
v2.84.0 (2024-07-03)¶
Changed¶
- UI: Our invocation search page now sports an improved look, where invocations are displayed in a much more compact manner, thus allowing more invocations to be visible at the same time. For invocations that are loaded and visible on the page, you can also now Ctrl + F search for them using their invocation ID.
- UI: Added a feature-flag, --enable_expensive_invocation_index_queries, to control enablement of known expensive queries on the invocation search page. Disabling this flag may improve page-load and search time for very large clusters.
- Monitoring: improved precision of most time distribution metrics.
Fixed¶
- UI: Fixed an issue that prevented multi-value filters (such as BES keywords) from populating suggestions properly.
- UI: Fixed an HTTP 500 error on the Cluster Status page.
Removed¶
- UI: Secondary timezone functionality is removed to simplify user experience.
- ResultStore: the flag --resultstore_async_db_write no longer has any effect. Database writes are now always asynchronous.
v2.83.2 (2024-07-03)¶
Fixed¶
- Re-release to pick up the OpenSSH security update. The ssh port is not open on any cluster, so the security issue is low priority. However, we're preparing this release just in case.
v2.83.1 (2024-06-27)¶
Fixed¶
- S3: Fix null pointer on asynchronous copy.
v2.83.0 (2024-06-18)¶
Fixed¶
- RE: Correctly report CAS check cache hits (in com.engflow.re.storage/ops).
- UI: In the target tree's status bar, better differentiate skipped targets from pending targets.
- UI: Highlight invocations whose BES is still incomplete.
- UI: It is now easier to see which targets are root-cause failures versus transitive failures.
- RE: putCasBlob now respects instance_name correctly.
v2.82.0 (2024-06-13)¶
New¶
- New BES metrics: com.engflow.eventstore/incomplete_batches_size estimates the in-flight event batches' sizes, and com.engflow.eventstore/flushing_batches_size shows the size of event batch blobs currently being uploaded to storage.
- New operation gc_extend and result dimension for the com.engflow.re.storage/ops metric.
- UI: On the invocations page, invocations can now also be filtered by command (build, test, run, etc.).
- UI: The new metric com.engflow.resultstore/reduce_bes_completed_duration_since_finish_event measures how much time passed between the cluster receiving an invocation's last BES event, and us fully processing the BES for display in the UI.
- UI: Retry saving the result of processing the BES if it fails.
- UI: Add the metric com.engflow.resultstore/reduce_bes_replay_source_count that tracks how many times processed BES was requested, and where the data was retrieved from.
- UI: Add the metric com.engflow.resultstore/reduce_bes_replay_removed_from_cache_count and log details to help debug BES processing.
Changed¶
- The com.engflow.eventstore/batchstore_* metrics are no longer reported. They were not showing what we intended them to show.
- Retire the now unused --experimental_summarize_invocations.
Fixed¶
- Analyzers now report com.engflow.meta/engflow_version.
- When moving ResultStore onto analyzer instances, early-exit if an invocation was not found.
- UI: Fix bug where failing actions did not correctly update the overall status of a target.
- UI: Disable "Open Bazel profile" button if the profile is a local file.
- UI: Fix bug that didn't correctly detect --remote_download_outputs=minimal being set on an invocation, and unnecessarily suggested setting this value.
- UI: Fix download URI for action outputs.
- UI: Fix a bug where targets that were aborted in the analysis phase did not show up in the target tree.
- UI: Update target details cards when new data becomes available for running invocations.
- UI: Fix a bug where the status of an invocation with some aborted targets was shown as unknown, although we can be more specific.
- UI: Fix a bug that hid targets with an analysis failure in the target tree.
- UI: On the invocations page, support filtering for cancelled invocations.
- UI: Fix a bug where the target tree did not render for running builds.
- UI: Avoid duplicate processing of an invocation's BES.
- UI: Fix bug which caused the data shown for running invocations to be out-of-date.
v2.81.1 (2024-06-07)¶
- Cache: Add "tenants" to the prefix path in external storage.
v2.81.0 (2024-06-01)¶
New¶
- UI: Redact sensitive information from URLs sent in the BES.
- UI: The invocation creation date filters now default to including invocations from the last 24 hours, as opposed to the last month.
- UI: Support opening the Bazel and EngFlow profile using a self-hosted version of Perfetto UI.
- UI: The EngFlow profile now includes the read location and source of download events.
- BES: Support a new well-known BES keyword engflow:Requester, which can be used to specify who requested a CI build. See documentation for details.
- Metrics: com.engflow.eventstore/batchstore_* is now reported for BES event batching.
- BES: The new flag --resultstore_async_db_write enables non-blocking writing of ResultStore database blobs.
Changed¶
- UI: Style the documentation links on the top of many of the pages like other links.
- UI: Redesign 404 pages throughout the Build and Test UI.
- UI: Removed the "experimental" tag on now stable features in the UI.
- BES: Reduce memory usage by proactively clearing retained resources for idle streams.
Fixed¶
- UI: Display a help message if no action cache statistics were reported by Bazel.
- UI: Fixed a bug where some legacy URLs were not redirected correctly.
- Metrics: Unify thread pool metrics to improve monitoring.
v2.80.1 (2024-05-30)¶
Fixed¶
- Cache: fixed a divide-by-zero error calculating download speeds that caused some requests to hang.
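The release notes do not say how the fix was made. As a generic illustration of this class of bug, a transfer-rate calculation needs a guard for a zero (or clock-skewed negative) elapsed time. The function name and the choice to report 0.0 in that case are assumptions for this sketch.

```python
def download_speed_bytes_per_sec(bytes_received: int,
                                 elapsed_seconds: float) -> float:
    """Return the average transfer rate, guarding against zero elapsed time.

    A download that completes within the clock's resolution can report
    an elapsed time of 0, which would otherwise raise ZeroDivisionError.
    """
    if elapsed_seconds <= 0:
        return 0.0
    return bytes_received / elapsed_seconds


print(download_speed_bytes_per_sec(1000, 2.0))  # 500.0
print(download_speed_bytes_per_sec(1024, 0.0))  # 0.0
```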
v2.80.0 (2024-05-22)¶
New¶
- Metrics: added com.engflow.re.cas/requests_in_flight_incoming, counting number of ByteStream requests in flight, by method name and pool.
- Metrics: added com.engflow.re.cas/requests_in_flight_outgoing, tracking the number of outgoing CAS requests by method and pool.
- Metrics: com.engflow.caching.inmemory/* is now reported on analyzers.
- RE: Setting --worker_config="" now disables the Execution service on a worker, useful for cache-only instances.
- UI: Reduced the JS bundle size by 28%.
- CI Runners: add status page to the UI.
Changed¶
- --run_common_member is now a no-op.
- Cache: when --enable_distributed_cas=false and --migration_enable_distributed_cas_disabled_semantics=true (new temporary flag), workers no longer serve local cache files to other workers.
- UI: The invocation profile open buttons no longer use misleading external link icons.
- UI: Reduced the padding around tooltip elements found in the UI.
Fixed¶
- Fix accounting error in ByteStream reads that could lead to workers filling up with unremovable blobs.
v2.79.2 (2024-05-20)¶
Internal release. No publicly facing changes.
v2.79.1 (2024-05-17)¶
Internal release. No publicly facing changes.
v2.79.0 (2024-05-08)¶
New¶
- Metrics: Added com.engflow.re.cas/fetch_retries, a count metric that is incremented when we retry a CAS fetch after an error.
- Metrics: Added com.engflow.re.cas/load_shed_errors, counting ByteStream requests failed on workers due to load shedding.
- Metrics: added com.engflow.re.cas/find_replicas tracking the time to look up which instances hold a copy of a file.
- Metrics: for com.engflow.re.cas/fetch_call_time, split the DISTRIBUTED_CAS tag value into DISTRIBUTED_CAS_NEAR (worker with file in same availability zone), DISTRIBUTED_CAS_FAR (worker with file in other availability zone), and DISTRIBUTED_CAS_FALLBACK (worker without file).
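The three tag values above can be read as a simple decision on two facts about the worker being fetched from: whether it holds the file, and whether it is in the same availability zone. The sketch below restates that mapping; the function and parameter names are hypothetical, and only the tag strings come from the release notes.

```python
def fetch_source_tag(worker_has_file: bool, worker_zone: str,
                     local_zone: str) -> str:
    """Map a distributed CAS fetch to the tag values described above."""
    if not worker_has_file:
        return "DISTRIBUTED_CAS_FALLBACK"
    if worker_zone == local_zone:
        return "DISTRIBUTED_CAS_NEAR"
    return "DISTRIBUTED_CAS_FAR"


# Zone names are made up for illustration.
print(fetch_source_tag(True, "zone-a", "zone-a"))   # DISTRIBUTED_CAS_NEAR
print(fetch_source_tag(True, "zone-b", "zone-a"))   # DISTRIBUTED_CAS_FAR
print(fetch_source_tag(False, "zone-a", "zone-a"))  # DISTRIBUTED_CAS_FALLBACK
```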
v2.78.3 (2024-05-04)¶
Internal release. No publicly facing changes.
v2.78.2 (2024-05-04)¶
Fixed¶
- GCS: Avoid writing data to GCS multiple times.
v2.78.1 (2024-05-03)¶
Fixed¶
- External storage: Fix unnecessary large number of requests.
v2.78.0 (2024-04-30)¶
New¶
- UI: Add an option to change the order of autocomplete suggestions in the search filters, so that BES keywords with a prefix match appear before other substring matches.
- Performance: When serving bytestream reads from external storage, stream the data to the client rather than waiting for the entire backing external storage download to complete.
- EngFlow profile: record time for shutting down the docker container that was used for a previous action.
- GCS: Support uploading file chunks in parallel.
- S3: Make available S3 connection metrics by enabling --experimental_record_s3_metrics.
Changed¶
- UI: Collapse the "BEP parsing errors detected" alert box in the invocation page by default, and include in its title how many BEP errors were encountered.
Fixed¶
- UI: Improve messaging for aborted invocations.
- S3: Fix rate limiting for streaming S3 requests.
- RE: Revert MacOS update (from 2024-04 back to 2023-12) to fix permissions.
- CI Runners: reduce GitHub API usage by a factor of number of workflows in the repository.
Removed¶
- UI: Remove the (reconstructed and inaccurate) "Bazel Command" field in the Configuration tab of the Invocation Details page.
v2.77.2 (2024-04-26)¶
Fixed¶
- Revert MacOS update to fix permissions.
v2.77.1 (2024-04-25)¶
Changed¶
- Update the linux kernel to 6.1.85-1.
v2.77.0 (2024-04-23)¶
Changed¶
- API: removed permission from the user role to call the NotificationQueue API. It now requires admin or global-admin.
- Platform: Support invocation indexing without using a notification queue.
- The flag --keep_exec_directories_for_debugging is now a no-op.
v2.76.0 (2024-04-16)¶
Fixed¶
- UI: the builtin admin role can no longer access invocations outside the default tenant.
- EngFlow profile: record failed download calls in the profile.
- RE: return a retryable error code when receiving a 502 during Docker pull.
- RE logging: previously, when an IP address got reused, some log messages referenced the previous instance ID, making it look like the old instance was still alive. This is now fixed.
- CI runners: correctly identify the executed GitHub job.
- CI runners: correctly record the number of failed GitHub error propagation jobs.
- CI runners: allow GitHub to update the job status before querying it.
- CI runners: retry an http GOAWAY and similar errors.
- CI runners: correctly record the number and age of queued jobs.
v2.75.4 (2024-05-31)¶
Internal release. No publicly facing changes.
v2.75.3 (2024-05-31)¶
Cherrypicks¶
- BES: The new flag --resultstore_async_db_write enables non-blocking writing of ResultStore database blobs.
- BES: com.engflow.eventstore/batchstore_* now contains metrics on BES batching.
v2.75.2 (2024-05-15)¶
- RE: return a retryable error code when receiving a 502 during Docker pull.
v2.75.1 (2024-04-09)¶
Fixed¶
- UI: Fixed a bug where EngFlow profiles were not downloadable.
- UI: Fixed a bug on the invocations page where results were not correctly filtered.
v2.75.0 (2024-04-04)¶
New¶
- CI runners: adding job status and various error metrics.
- CI runners: adding retries for HTTP and RE-API calls.
- CI runners: caching GitHub tokens for 45 minutes.
- Platform: Add metric com.engflow.notificationqueue/size reporting the approximate size of notification queues.
- Platform: Add v2 implementation of a Hazelcast-backed notification queue.
- Platform: Add new metric com.engflow.profiling/publish_invocation_event to measure adding entries to the EngFlow profile.
Changed¶
- Platform: Add additional type checks to Hazelcast-backed distributed maps.
Fixed¶
- CI runners: fix Buildkite x64 jobs.
- CI runners: fixing GitHub error propagation for arm64.
- Avoid a deadlock in the schedulers' shutdown hook that could hang the process indefinitely.
v2.74.4 (2024-04-11)¶
Fixed¶
- Logging: Fixed an issue where, when an instance's IP address was reused, some logged messages would still reference the previous instance ID.
v2.74.3 (2024-03-26)¶
Fixed¶
- Auth: With --http_auth=none, unauthenticated users can acquire tokens with the viewer role to view the UI.
v2.74.2 (2024-03-26)¶
New¶
- Auth: Add new platform role of viewer as a default for --http_auth=none.
v2.74.1 (2024-03-22)¶
Fixed¶
- Scheduler: Shutdown even if the threadpool is deadlocked.
v2.74.0 (2024-03-21)¶
Fixed¶
- CI runners: GH workflows will now work with >30 jobs per workflow.
- CI runners: Fix jobs hanging unnoticed by our runners.
- CI runners: No longer fail entire polling loop on one misconfigured job.
- UI: The chip showing the CI details for an invocation now truncates long lines with an ellipsis.
New¶
- CAS: Fine tune external storage migration from epoch to expiration gc with --migrate_storage_max_cache_size and --migrate_storage_max_concurrent_operations.
- CI runners: Report job queue size and age metrics to dashboards.
- Platform: Improved metrics reporting for notification queues.
- RE: Add --docker_volume_mount_path=/mnt/engflow/docker option to track disk usage for the docker volume.
- UI: Enable --experimental_invocation_comparison to be able to compare invocation metadata.
Changed¶
- Platform: Restore the serialized format of cluster address to IP address due to backwards compatibility issues.
- UI: Due to performance considerations, the autocomplete fields suggest values with a substring match by lexicographical sort. They no longer sort prefix matches to the top of the suggestions. E.g. for the search term "foo" the autocomplete suggestions will now show "afoo" before "foobar".
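The ordering change described in the last note can be illustrated with a minimal Python sketch. The function names and sample values are hypothetical and are not part of EngFlow; the sketch only mirrors the described behavior and assumes a case-sensitive substring match.

```python
def suggest_lexicographic(term, values):
    """Behavior as described for this release: substring matches, sorted lexicographically."""
    return sorted(v for v in values if term in v)


def suggest_prefix_first(term, values):
    """Earlier behavior as described: prefix matches first, then other substring matches."""
    matches = [v for v in values if term in v]
    # False sorts before True, so prefix matches come first; ties break lexicographically.
    return sorted(matches, key=lambda v: (not v.startswith(term), v))


candidates = ["foobar", "afoo", "bar", "foo"]
print(suggest_lexicographic("foo", candidates))  # ['afoo', 'foo', 'foobar']
print(suggest_prefix_first("foo", candidates))   # ['foo', 'foobar', 'afoo']
```

A plain lexicographic sort needs no per-term ranking key, which is one plausible reading of the performance motivation; the note itself does not state the reason in more detail.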
v2.73.0 (2024-03-11)¶
Fixed¶
- CI Runners: fixed the idle timeout condition.
- BES: BEP uploading no longer halts if an unknown target label is sent to the BES.
New¶
- UI: Display BEP parsing errors on the invocation details page.
v2.72.1 (2024-04-09)¶
Fixed¶
- Logging: Fixed an issue where, when an instance's IP address was reused, some logged messages would still reference the previous instance ID.
v2.72.0 (2024-03-06)¶
Fixed¶
- Auth: Fix mTLS generation when the trusted key+cert are stored in a secrets store.
- CAS: Respect digest size in FindMissingBlobs.
- CAS: Writes to the replica clusters will correctly report not found rather than a grpc error.
- CI Runners: GH runner idle time is capped to 1 minute.
- CI Runners: Failed actions now have a fixed worker pool.
- CI Runners: Control default GH action runner version via a flag.
- UI: Colors are now aria compliant with improved contrast.
- UI: Correctly process BES events reporting aborted tests.
- UI: Fix memory leak caused by publishing notification to a queue with no consumer.
- UI: Remove extra calendar UI element used for debugging.
- UI: Support a no-op notification queue to avoid leaking memory when no queue readers exist.
New¶
- RE: Report Pressure Stall Information stats to actions executed on Linux.
- CAS: --migrate_storage_to_expiration_gc will begin migrating to the new expiration based external storage.
- IAM: Add built-in role root to allow all operations.
- UI: Invocation BES can now include metadata on the source control management via keywords. See documentation for details
Changed¶
- Auth: JWT issued before v2.48 are no longer supported.
- Config: --incompatible_reject_instance_name is now a no-op.
v2.71.3 (2024-02-27)¶
Fixed¶
- Fix NullPointerException in GCS code.
v2.71.2 (2024-02-27)¶
Fixed¶
- Return correct release version from Cluster API.
- UI: Remove stray creation time selector.
v2.71.1 (2024-02-26)¶
Fixed¶
- Rebuild release due to flaky release process.
v2.71.0 (2024-02-22)¶
Fixed¶
- CI Runners: ensure 1:1 job to runner correspondence and propagation of results and error logs to GitHub.
- CI Runners: fix missing file errors for GH trace events.
- CI Runners: fix metrics to correctly report OS and arch.
- ResultStore: Ensure outdated cache entries for the target tree are evicted regularly.
- RE: propagate Docker start failures correctly when Docker doesn't produce an error file.
- RE/Windows: suppress the RUNFILES_MANIFEST_FILE and RUNFILES_MANIFEST_ONLY environment variables in Docker actions so that Bazel tests can run without special settings.
- UI: Labels with canonical repository names (e.g. @@protobuf~21.7//:protobuf_lite) are now correctly displayed in the target tree.
- UI: Fix race condition, which could lead to invocations not being indexed.
New¶
- ResultStore: Add metrics and logging for BES processing costs.
- UI: The Cluster Status page now displays the contract duration.
- UI: Use GetTree to populate the action input browser.
- UI: In the search filters, the autocomplete fields now suggest values that have a prefix match on the current input before suggesting substring matches.
- UI: In the search filters, the BES keywords autocomplete now also suggests values that have a substring match on the current input. Before, this only included prefix matches.
- UI: Improve logging for better debuggability.
Changed¶
- UI: --experimental_advanced_search is now a no-op (always true).
- ResultStore API: For the GetInvocation API, the invocation.target_tree and invocation.invocation_metadata.bazel_invocation_metadata.target_information.target fields are now deprecated.
v2.70.1 (2024-02-15)¶
Fixed¶
- UI: Accept OIDC keys that specify the algorithm family RSA, but omit the algorithm.
- EventStore: When making gRPC calls, ensure unknown build errors are propagated correctly.
v2.70.0 (2024-02-07)¶
Fixed¶
- ResultStore: Only send data after the first build event was processed.
- ResultStore: When replaying invocations, set the correct last updated timestamp.
Incompatible¶
- RE: The flag --workers_handle_fallback_requests is now a no-op.
v2.69.0 (2024-01-31)¶
New¶
- Enable --experimental_http_compression by default.
- Enable --experimental_historical_results by default.
- --endpoint (formerly known as --build_and_test_url) is now required on all instances.
- Improve UI accessibility by more strictly enforcing presence of aria-label in Icon Button elements.
- Analyzer: When analyzing the EngFlow profile, include how much data was downloaded from the CAS in general and for locally executed actions.
- UI: Improve protocol to download files. This may cause temporary glitches during deployment of the release (but not after the deployment is complete).
Fixed¶
- S3: Return RESOURCE_EXHAUSTED instead of INTERNAL when client is overloaded.
- Analyzer: Mark analyzer instances ready only after they have successfully connected to the cluster.
- RE: Windows: use symbolic links for input files instead of hard links.
- RE: Ensure empty pools are kept track of by the leader scheduler and autoscaled correctly.
- RE: Ensure output files are cleaned up between actions when --experimental_tree_delta_exec_root=true.
- UI: Adjust all icon buttons to share the same style of on-hover tooltip.
- Mini: Fix issue that led to PERMISSION_DENIED errors.
- Fix corruption of compressed HTTP responses that set a Content-Length header.
v2.68.2 (2024-01-24)¶
Fixed¶
- ResultStore: Fixed a permission-denied issue while fetching results from other schedulers within the same cluster.
v2.68.1 (2024-01-24)¶
Fixed¶
- UI: Fixed an issue where some icons on the left nav bar were displayed incorrectly.
v2.68.0 (2024-01-23)¶
New¶
- Analyzer: Extend metrics to include number of analysis cache hits.
- Analyzer: Provide more detailed information on analysis failures.
- API: Add the HTTP endpoint /api/resultstore/v1/instances/[instance_name]/invocations/[invocation_id]/profiles/bazel, which redirects to an invocation's Bazel profile, provided it was uploaded to the CAS.
- RE: Introduce support for configuring fallback worker pools to handle scaledown events more gracefully, using the --experimental_mia_fallback_pools flag.
- UI: For test summaries, highlight if some shards or runs are missing. Factor in missing runs when determining the aggregate status of a test shard.
- UI: For test targets, add menu items that allow users to copy the URL to the test's logs, test.xml and similar.
- UI: Analyzer pool nodes are now listed on the cluster's status page.
- Windows: Allow docker to communicate with the Internet if the --docker_allow_network_access=true and --docker_default_network_mode=standard flags are set.
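As a rough illustration of the Bazel profile endpoint listed above, the following Python sketch fills in the path template from that note. The helper name, the cluster URL, the instance name and the invocation ID are all hypothetical, and percent-encoding the two path parameters is an assumption rather than documented behavior.

```python
from urllib.parse import quote


def bazel_profile_url(cluster_url, instance_name, invocation_id):
    """Build the URL of the Bazel profile redirect endpoint for one invocation."""
    # Path template taken from the release note; encoding the parameters is an assumption.
    path = (
        "/api/resultstore/v1/instances/"
        f"{quote(instance_name, safe='')}/invocations/"
        f"{quote(invocation_id, safe='')}/profiles/bazel"
    )
    return cluster_url.rstrip("/") + path


# Hypothetical values, for illustration only.
print(bazel_profile_url("https://cluster.example.com", "default", "1234-abcd"))
```

Because the endpoint redirects to the profile, an HTTP client would need to follow redirects and supply whatever authentication the cluster requires.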
Incompatible¶
- UI Authentication: --http_auth=okta_login is no longer supported. Please use --http_auth=oidc_login instead.
Changed¶
- UI: Changed the invocation analysis endpoint from http to grpc.
- UI: Replaced experimental badges with individual chips indicating documentation links and the experimental status of various features.
Fixed¶
- Analyzer: Limited the size of the summaries sent for invocation analysis.
- Profiling: Corrected key names for the action id and digest of action cache lookup events.
- RE: Fixed excessive re-trying of actions caused by OOM or worker missing-in-action events.
- ResultStore: Fixed a bug where Bazel profile URIs did not include a custom port.
- ResultStore: Correctly handle multiple --bes_keyword_deny_list values.
- S3: Fixed IllegalStateException in certain S3 retry cases.
- UI: Fixed flicker in target tree view when an invocation is not yet complete.
- UI: On the action details page, surface if an action was not executed.
- Fixed the default cipher suites to work with TLS 1.3 when using the JDK SSL implementation (--experimental_select_ssl_impl=jdk).
v2.67.0 (2024-01-11)¶
New¶
- RE: Support eagerly crashing when the JVM appears to be almost out of memory.
Incompatible¶
- RE: When using --discovery=static, only specify --static_scheduler (at least one), not --static_cas_node, which is no longer used.
- RE: Changed --service_discovery_mode to default to builtin.
- RE: Running inside the linux sandbox is no longer supported.
- RE: Set --run_common_member=false by default. It can only be false if --service_discovery_mode=builtin (which is also set by default).
- RE: --experimental_use_async_storage_for_eventstore is now a no-op on AWS S3.
- RE: Set --discovery to static by default; deprecate multicast.
- UI: --experimental_use_oidc_discovery is now a no-op. The discovery URI is used whenever it is included in the JSON provided via --oidc_config.
v2.66.0 (2024-01-03)¶
New¶
- Analyzer: Reduce memory consumed when analyzing invocations.
- Analyzer: Add metrics to keep track of the status of retrieving the Bazel and EngFlow profile for analysis.
Fixed¶
- EngFlow profile: Server-side profile generation captures events from all schedulers.
- EngFlow profile: Fixed formatting of the digest in the EngFlow profile action cache lookup event.
v2.65.1 (2024-01-02)¶
Fixed¶
- ResultStore: Fixed the target tree not loading on older invocations.
v2.65.0 (2023-12-29)¶
New¶
- RE: Added --warm_containers_timeout to control timeouts for worker container warming.
Changed¶
- RE: Allow JWT and mTLS in combination with external auth.
- RE: Record metrics for Caffeine caches.
- UI: Update copyright footers.
Fixed¶
- UI: Fixed an issue where invocation page load sometimes crashed and displayed a blank screen.
- UI: Fixed a regression where invocations were not timestamped correctly and did not appear in the invocation search.
- UI: In light mode, change the background color of disabled primary buttons for better contrast.
- UI: When running in insecure mode, ensure that the Content-Security-Policy allows image sources from http.
- UI: Fix breakage when an invocation's metadata does not include Bazel command details.
- ResultStore: Fetches invocations from the cache in external storage instead of replaying multiple times.
- RE: Fixed an issue with deleting stdout, stderr in Windows.
- RE: Docker no longer inherits handles from other threads on Windows.
- Analyzer: Report metrics on invocation analysis when using a separate analyzer pool.
v2.64.0 (2023-12-22)¶
New¶
- AWS: Add --experimental_record_s3_metrics for tracking S3 specific information.
- UI: Allow viewing historical execution results given the execution ID.
Changed¶
- UI: Updated base font from Poppins to Roboto.
- GRPC: Add logging for PROTOCOL_ERRORs.
Incompatible¶
- GCP images no longer have exim4 installed.
Fixed¶
- RE: Reduce memory use of persistent workers.
- ResultStore: Avoid sending is_last more than once.
- GRPC: Prevent PROTOCOL_ERROR errors by reducing max metadata length and de-duplicating response headers.
v2.63.4 (2023-12-16)¶
Fixed¶
- RE: Reduce memory use of persistent workers.
- BES/CAS: Set the correct offset for multi-part uploads to prevent upload corruption.
v2.63.3 (2023-12-14)¶
Fixed¶
- UI: Fix issue with showing test reports from old invocations.
v2.63.2 (2023-12-12)¶
Fixed¶
- Fix release process.
v2.63.1 (2023-12-12)¶
Fixed¶
- Fix a GLIBC mismatch in mini.
- Allow cluster-wide configuration of TCP keepalive.
v2.63.0 (2023-12-06)¶
Fixed¶
- UI: Disable the "Analyze" button if the only available profile is stored on the local disk.
- Fix a race condition when restarting the worker service, which could cause the worker to get into a bad state causing all subsequent actions to fail.
New¶
- UI: Style backticked content in Bazel Invocation Analyzer suggestions as inline code blocks.
- UI: Also linkify links in Bazel Invocation Analyzer suggestions that lead to one of Bazel's GitHub pages.
- It is now possible to use Analyzer backends with self-signed certificates.
- Add property to github-actions CI runner that disables dangling processes clean-up after job execution (default: false).
- Record the status of CAS uploads in the EngFlow profile.
- On Windows, executors can now recover from a failure to delete or rename a file in the exec root by switching to a new exec root. The executor will restart a persistent worker when this happens, but in most cases it does not need to restart a docker container.
v2.62.0 (2023-11-28)¶
Fixed¶
- Correct issues establishing current license validity.
- UI: Fixed error where the target status in the target tree would be inconsistent with the status on the target card.
Incompatible¶
- --experimental_build_index_percentage is now a no-op.
v2.61.0 (2023-11-22)¶
Added¶
- UI: Recommend setting the option --remote_download_outputs=minimal if it is not set and remote execution is used.
Changed¶
- Moved the AWS CloudFormation signal after container warming and service discovery.
Fixed¶
- UI: Fix render error when a test target's summary is empty.
- UI: If fetching a test target's test report fails, show a more meaningful error message.
- UI: Increase precision of percentages shown in the Cache Hits and Execution chart.
- Analyzer: Fix a bug which prevented the Bazel Invocation Analyzer from fetching the Bazel profile from the CAS when using a separate analyzer pool.
Incompatible¶
- --strict_http_headers is now a no-op and will be removed in a future release.
- Secrets URLs must now explicitly declare the secretstore:// scheme when the secret is not on the local file system.
v2.60.0 (2023-11-16)¶
Added¶
- Extend which environment variables we redact before storing the BEP: also redact the value of variables whose names include credential. For previously stored BEP, adjust the UI code to retroactively redact these values.
- ResultStore: fetching logs now requires the (new) resultstore:GetLogs permission.
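The redaction rule in the first item above can be sketched in a few lines of Python. This is only an illustration of the described rule, not EngFlow's implementation: the function name and the placeholder string are hypothetical, and matching the variable name case-insensitively is an assumption.

```python
REDACTED = "<REDACTED>"  # hypothetical placeholder value


def redact_env(env, needle="credential"):
    """Return a copy of env with the value hidden for every name containing needle."""
    # Case-insensitive matching is an assumption; the release note does not specify it.
    return {
        name: (REDACTED if needle in name.lower() else value)
        for name, value in env.items()
    }


env = {"AWS_CREDENTIAL_FILE": "/secrets/aws", "PATH": "/usr/bin"}
print(redact_env(env))
```

Only the values are replaced; variable names stay visible, so readers can still see which variables were set.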
Fixed¶
- Do not cache invocation analyzer results if errors were encountered retrieving either the Bazel or EngFlow profile (allows re-analysis on transitory retrieval issues).
- UI: Fix incorrect logs displayed for test attempts/shards.
- RE: allow action outputs outside the working directory.
v2.59.1 (2023-11-13)¶
Fixed¶
- ResultStore/GetTarget: Handle the case when target statuses are given for multiple configurations.
v2.59.0 (2023-11-07)¶
Added¶
- Report container restarts count prior to action execution.
- Add --experimental_bytestream_iop_limit to keep overloaded workers from accepting ByteStream reads and writes.
- Support marking EngFlow flags as sensitive, which will redact their value when shown in the Build and Test UI.
Fixed¶
- Fix PERMISSION_DENIED from waitExecution.
- Ensure HTTP responses with status 304 (Not Modified) are not compressed. This fixes connection errors in Safari.
v2.58.0 (2023-11-01)¶
Added¶
- UI: Mark environment variables containing credentials as sensitive.
v2.57.1 (2023-11-01)¶
Fixed¶
- Corrected an issue with clusters requesting unnecessary amounts of re-scaling during operation.
v2.57.0 (2023-10-25)¶
Added¶
- UI: Action debug page: show richer errors.
- UI: Show the action digest on the historical results page, and if cacheable, add a link to the AC view.
- UI: Update paths to the Action Details page.
Fixed¶
- RE: Fixed issue with built-in autoscaler getting out of sync with AWS auto-scaling group, preventing clusters from scaling up appropriately.
- UI: Added help text on the Highlights page in case of a system failure.
- UI: Fix bug in target tree prefix search that could lead to infinite loops.
- UI: Fixed code blocks not indicating that there is more content.
- UI: Remove non-HTTP/2 headers when converting headers to fix connection problems.
v2.56.2 (2023-10-26)¶
New¶
- Mini: enable running with TLS.
Fixed¶
- Propagate credentials to an internal gRPC server that expects it.
v2.56.1 (2023-10-19)¶
Fixed¶
- Fixed an autoscaler issue where the requested instances would drop to the minimum during a deploy.
- UI: Fixed UI bug when using --http_auth=basic authentication.
v2.56.0 (2023-10-19)¶
Added¶
- UI: Expose the invocation's non-zero exit code as a tooltip in the summary at the top of the invocation details page.
Fixed¶
- UI: Fixed error where failing tests would report as passing.
- UI: Add "No Tests Found" status when running bazel test without requesting any test targets.
- UI: Fixed bug where the UI became non-interactive when using the browser navigation while a modal was open.
- Improved Docker shutdown on Windows.
- Improved Docker path handling on Windows.
- UI: Improve display of invocations while no Bazel metadata has been received yet.
- UI: Improve target tree rendering when no target information is available.
- UI: Better determine the invocation status by leveraging Bazel-specific exit codes.
- UI: Fix bug where the instance name of an invocation was not added to the search index.
v2.55.2 (2023-11-12)¶
Fixed¶
- ResultStore/GetTarget: Handle the case when target statuses are given for multiple configurations.
v2.55.1 (2023-10-16)¶
This release contains fixes for CVE-2023-39325, CVE-2023-3978, and CVE-2023-38545.
Fixed¶
- UI: More clearly mark bazel test runs if there were no test targets.
- UI: Ensure the UI stays interactive when using the back button while a modal is open.
v2.55.0 (2023-10-11)¶
Added¶
- UI: Show statistics for cached / uncached actions.
- UI: Improve display of test target details when using sharding, runs per test or test attempts.
- UI: Add more hints to help mode.
Fixed¶
- Improved error handling during input tree creation.
- Remove environment variables RUNFILES_MANIFEST_ONLY and RUNFILES_MANIFEST_FILE from actions running locally without sandboxing.
- UI: Fix bug where some test suites were rendered as passing, although failing
Incompatible¶
- Individual actions are now limited to at most 10000 cores.
v2.54.1 (2023-10-12)¶
This release contains fixes for CVE-2023-39325, CVE-2023-3978, and CVE-2023-38545.
v2.54.0 (2023-10-09)¶
Fixed¶
- BES: invocations running for more than 3 hours are no longer considered timed out.
- docker: errors from pulling containers are now retried.
Incompatible¶
- platform: Java 17 is now required. Note that this does not affect actions execution.
- platform: --experimental_always_retry_missing_worker_failures is now a no-op.
v2.53.2 (2023-10-06)¶
Fixed¶
- rhodonite OS: Update AWS CLI if already installed.
- UI: Improve display of test target details when using sharding, runs per test or test attempts
v2.53.1 (2023-10-06)¶
Fixed¶
- rhodonite OS: Set custom uid and gid, as rhodonite reserves 108 and 114.
- RE: Return UNAVAILABLE on docker pull failure
v2.53.0 (2023-10-05)¶
Added¶
- CAS: Add type=readonly dimension to external storage metrics.
- Windows: experimental support for actions running in Docker containers.
Fixed¶
- probers: log details about BatchUploadBlobs failures.
- RE: Update AWS base images: debian 12 20231004-1523, macos 12.6.9-20230921-235406
- RE: Update GCP base images: debian 12 v20231004
- resultstore: stop compressing build logs.
- UI: fix "flash" in dark-mode during loading.
- UI: fix BES keyword logic causing query to return no results.
- UI: prevent a page-reload when clicking a search-card.
- Windows: ensure dockerd starts automatically when configured to store data on a secondary volume.
v2.52.2 (2023-09-29)¶
Fixed¶
- BES: Fixed a problem when uploading BES from longer builds.
v2.52.1 (2023-09-29)¶
Fixed¶
- (internal) Migrated to new CI fleet.
v2.52.0 (2023-09-19)¶
Added¶
- platform: Surface the instance_name an invocation belongs to and allow filtering and searching invocations by instance_name.
- platform: Add API for accessing IAM roles.
- platform: Add API for authentication using an external service.
Fixed¶
- platform: Prefer the same availability zone for uploading or downloading blobs.
Removed¶
- goma: --remote-instance-name is now a no-op.
- platform: --experimental_record_reported_action is now a no-op.
Incompatible¶
- platform: --incompatible_reject_instance_name now defaults to true.
v2.51.0 (2023-09-11)¶
Fixed¶
- GCP: when fluent-bit is enabled for remote execution, the fluent-bit shipped logs now correctly set the log level for "S" (severe) messages. Previously these appeared with "default" level, now they appear with "error" level.
- Linux: adjusted how kernel headers are installed to aid NVIDIA GPU driver installation. If the linux-headers-{VERSION} apt-package is not found for the current kernel version, an earlier version is installed.
- Windows: disabled NGen and automatic updates for autoscaled instances to improve performance and prevent unplanned reboots. We update base images instead.
v2.50.1 (2023-09-07)¶
Fixed¶
- Do not require an authenticated session to run the GetCapabilities RPC.
v2.50.0 (2023-08-29)¶
Changed¶
- Executor: Now logs executor and action ID for action events.
- Windows: Allow more time for the worker service to gracefully shut down when autoscaling terminates an instance.
- Worker: Add timeout of 5s when gracefully shutting down.
Fixed¶
- CAS: S3 acceleration handles cross region requests.
- CAS: S3 rate limit errors on async CAS.
v2.49.1 (2023-08-17)¶
Fixed¶
- Fix NullPointerException when using --client_auth=gcp_rbe.
v2.49.0 (2023-08-16)¶
Added¶
- EngFlow Profile: Add pool name to executed actions.
- Record historical action execution results for all actions, including failing and uncached, and report link to the corresponding action details page (Bazel shows this for failed actions).
Changed¶
- UI: Enable help-mode by default.
Fixed¶
- UI: Fixed test progress bar never completing.
- UI: Fixed some codeblocks unnecessarily showing fade-out effect.
- UI: Fixed various typos.
- UI: Fixed invocation last updated time: report the time the server received the update.
- CAS: Handle errors downloading from S3 gracefully
v2.48.2 (2023-08-01)¶
Fixed¶
- Fixed creation of Debian 12 images.
- Fixed uid/gid allocation for the engflow user.
v2.48.1 (2023-07-27)¶
Fixed¶
- Fixed handling of cancelled execute calls; release 2.48.0 introduced an error in cancellation handling that could cause workers to be marked as busy even though they were not, which significantly reduced the capacity of the cluster. This only triggered under high load with very short client-provided timeouts.
- Record action memory limit and pool id in the server-side profiles.
v2.48.0 (2023-07-24)¶
New¶
- Logging: more logging for failed service discovery calls.
- Logging: improve logging for executed actions; this merges some log lines and changes structured field names to clarify units.
- UI: show cached action results on the action page.
Changed¶
- Actions that fail with OOM are no longer unconditionally retried.
- BES: don't propagate certain errors to the client to receive partial data even if there are issues.
- Auth: all generated JWT now contain a version number.
- Docker: check if the container image exists locally before pulling it.
- Profile: profiles now contain multiple full executed events for the same action in case of server-side action retries.
Fixed¶
- Various UI improvements in the action input browser.
- In some cases, uploading to external storage failed in a way that left the cluster in an inconsistent state resulting in PRECONDITION_FAILED failures; retry the upload in those cases.
v2.47.4 (2023-07-17)¶
Fixed¶
- Mini: fix a broken base image.
v2.47.3 (2023-07-18)¶
Fixed¶
- Allow using images preexisting on the local Docker daemon. This mostly affects EngFlow/free.
v2.47.2 (2023-07-12)¶
Fixed¶
- Mini: disable --docker_use_process_wrapper by default.
- Avoid retries on OOMs if --experimental_retry_failure_due_to_signal=false.
v2.47.1 (2023-07-12)¶
Fixed¶
- Mini: correctly handle the empty file. This only affects the EngFlow/free version.
v2.47.0 (2023-07-07)¶
New¶
- CAS: Uploads and downloads now prefer nodes from the same subnet.
- UI: We now show warnings when invocations miss recommended Bazel flags.
- UI: Allow users to hide BES keywords from filter.
Fixed¶
- Fix bugs that could cause processes to hang during shutdown.
- Fix bug that caused Remote Asset API to fail.
v2.46.0 (2023-06-28)¶
New¶
- UI: Limit the number of requested target patterns shown, and warn when patterns are truncated.
- UI: For failing targets, add a link to the action details pages of their failing actions to ease debugging.
v2.45.2 (2023-06-22)¶
Fixed¶
- Fixed HTTP mTLS support.
v2.45.1 (2023-06-21)¶
Fixed¶
- Server-side profiles no longer have incorrect entries for failed actions.
- HTTP APIs can be used with mTLS authentication.
v2.45.0 (2023-06-19)¶
Fixed¶
- Optimize the invocation index page.
- Reduce native memory usage to avoid OOMs.
New¶
- Report extra info when an action's output tree is too large.
v2.44.0 (2023-06-13)¶
Fixed¶
- Fix a bug that could sometimes leak memory during gRPC and HTTP requests.
- goma: --remote-instance-name can now be empty.
New¶
- Native memory consumption is now reported as a metric.
- HTTP: GZIP compression of responses can be enabled with --experimental_http_compression=true.
Removed¶
- RE: gRPC-level compression of blobs is removed in favor of Remote API compression. --experimental_compressed_cas_reads is now a no-op.
- RE: --experimental_event_store_delay is no longer supported and now a no-op.
v2.41.1 (2023-06-13)¶
Fixed¶
- RE: Workers: increase file descriptor limit to 32k.
- S3: handle uploading empty blobs.
New¶
- RE: export metric for direct buffer memory.
v2.43.0 (2023-06-07)¶
Fixed¶
- Limit the number of dimensions exported for UI metrics to reduce potential reporting costs.
- AWS: Fix a bug where uploading empty blobs to S3 failed.
- Fix quotes in non-canonical docker image error.
- UI: Various improvements to the Debug Action page.
New¶
- EngFlow profile: add action cache lookup events.
Deprecated¶
- --experimental_results_blobs_root is now a no-op.
Removed¶
- No-op flag --xcode_locator is no longer supported.
v2.42.0 (2023-05-31)¶
Fixed¶
- GCP: Disable Ops Agent after installation, preventing monitoring cost increase.
- RE: Increase file descriptor limit to 32k.
New¶
- UI: If an invocation is running, display when it was last updated.
- UI: Add information on queue size and executing actions to the cluster status page.
- Report metric for the number of open file descriptors.
- Detailed logging for open file descriptors by type and directory.
v2.41.0 (2023-05-23)¶
Fixed¶
- RE: actions creating unreadable outputs are now client errors, returning ILLEGAL_ARGUMENT.
New¶
- engflowapis: Add new API cluster.v1.Cluster/GetInfo to retrieve information about a cluster.
- UI: Add information on executor pools to the cluster status page.
- UI: Improve the speed at which the UI processes invocations.
v2.40.2 (2023-05-18)¶
Internal release. No publicly facing changes.
v2.40.1 (2023-05-18)¶
Fixed¶
- RE: When a worker replies with DEADLINE_EXCEEDED, make client retry Execute.
v2.40.0 (2023-05-17)¶
Fixed¶
- RE: NOT_FOUND errors are now returned in ExecuteResponse instead of returning a gRPC error.
- RE: Allow retries for more failed executions.
- UI: Fix bug in console view where scroll-to-bottom would not work when highlighting a line.
New¶
- UI: Targets are now linked from the Configuration tab.
- UI: Redact potentially sensitive data from environment variables.
v2.39.0 (2023-05-10)¶
Fixed¶
- UI: fix bug where failed test logs would fail to display on some clusters.
- RE: export client-triggered execution errors into a separate metric dimension for alerts and dashboards.
- RE: improve action logging: add primary output path, merge lines for exit code and output stats.
- RE: improve internal retries of failed docker start errors.
New¶
- UI: Introduce help mode, which users can turn on to access tips on different UI components.
- Auth/TLS: add --tls_cipher_suites flag to configure the set of supported ciphers for incoming TLS connections; only allow TLS 1.2 and 1.3.
- RE: execution latency metrics are now exported by default.
v2.38.2 (2023-05-12)¶
Fixed¶
- Infra: don't mix Debian 10/11 in the release build.
v2.38.1 (2023-05-12)¶
Fixed¶
- RE: fewer client-side retries on execution errors when machines go missing.
- RE: report client-triggered execution errors as a separate metric dimension for alerts and dashboards.
New¶
- RE: report execution latency metrics by default.
v2.38.0 (2023-05-05)¶
Fixed¶
- UI: Fix duplicate action failed messages.
- UI: Fix bug where region next to sidebar is not clickable.
- UI: Fix a bug related to automatically fetching new invocations on the "Invocations" page.
- UI: Fix bug that led to snackbars (update notifications shown at the bottom of the page) sometimes not being shown.
New¶
- UI: A "debug action" page allows users to debug remotely executed actions by exploring the data available in build.bazel.remote.execution.v2.Action
v2.37.0 (2023-05-04)¶
Deprecated¶
- Running EngFlow on Debian 10 is now deprecated. This does not affect docker containers actions run in, nor the client machine the build (e.g., Bazel) runs on.
Fixed¶
- UI: Fix a bug in the target tree rendering, which could cause an infinite loop.
- UI: Don't render empty test logs.
- RE: Fix a bug where action descriptor was not deserialized correctly.
New¶
- UI: In the target tree, add "copy" buttons for the target label.
- RE: Export execution latency metrics by stage.
- UI: Do not show an empty target tree for successful invocations. In-progress invocations, and invocations where not all targets were successful, still filter successful targets out by default.
- UI: Show extra test outputs for test attempts.
- RE, Docker: enable process wrapper by default.
v2.36.1 (2023-04-26)¶
Fixed¶
- RE: Address ServiceDiscoveryServer lock contention in the workers with more nuanced concurrency.
- RE: Invocation resource usage summary correctly aggregates execution pools.
v2.36.0 (2023-04-25)¶
Fixed¶
- RE: Correctly parse default pool summary resource usage for the Invocation.
New¶
- UI: Redact authorization headers from different Bazel flags in the BES before storing it.
- RE: Include mnemonic in the action execution timing log.
v2.35.0 (2023-04-11)¶
Fixed¶
- RE: Fix issues where AWS SNS subscription requests were rejected with unknown cluster keys in the URL.
- UI: Logs will no longer drop lines on outputs that have newlines trimmed at the start.
New¶
- Deploy: Performance enhancements around plan building.
- RE: --experimental_enable_priority_pools will schedule high priority actions to pools postfixed with _high_priority.
- RE: CAS now supports legacy GetTree API.
v2.34.1 (2023-03-03)¶
Cherrypicks¶
- UI login: fixed a bug with Google OAuth.
v2.34.0 (2023-03-30)¶
Fixed¶
- resultstore: Fix a bug where reading the profile stream before the invocation finishes was causing unintended side effects.
New¶
- UI: In the invocation statistics, include percentiles of the wall times.
v2.33.0 (2023-03-21)¶
Incompatible¶
- MacOS 11 (Big Sur) image for RBE workers is not supported anymore. MacOS 12 is supported.
Deprecated¶
- --experimental_compressed_cas_reads is now deprecated and a no-op; it's always enabled.
- --enable_bytestream_compression, --experimental_bytestream_compression, and --incompatible_s3_use_structured_paths are now no-ops and will be removed in a future release. They are now always enabled.
Fixed¶
- RE: Fix a bug where we sometimes scheduled actions on terminated workers.
- BES: Fix a bug where `PublishBuildToolEventStream` sometimes returned `UNKNOWN`.
- UI: Fix Bazel command-line flag links.
- AWS: Rate limit signals sent to CloudFormation to avoid exceeding the request quota.
- UI: Fix a bug that prevented filtering invocations by certain times of the day.
- UI: Fix bug where scrolling in the console view did not work.
- UI: Top-level targets now correctly expand in the target tree.
New¶
- UI: Enable advanced search by default.
- UI: Improve differentiation for pending filters.
- UI: Allow linking to the target tree with certain filters applied.
v2.32.0 (2023-03-14)¶
Added¶
- Probers: logging all retried errors.
- RE: Local logging can now be configured via JVM properties.
Fixed¶
- RE: Fix async storage write/close race.
- RE: Fix working directory "." corner case.
- Probers: Fix the execution prober not retrying properly after its first failure.
v2.31.0 (2023-03-07)¶
Added¶
- UI Auth: Support reading permissions from OIDC's JWT payload.
- RE: The new `--experimental_summarize_invocations` will include invocation remote resource usage from the resultstore.
- RE API: Action outputs are now interpreted as relative to working directory, not exec root.
Fixed¶
- UI: Fix bug where filter updates in the invocation search were not always applied.
- UI Auth: Fix bug where OIDC was not always set up correctly on the admin login page.
- RE: Now reports S3 503 errors as UNAVAILABLE.
- RE: Retry CloudFormation SignalResource calls.
v2.30.0 (2023-02-18)¶
Added¶
- UI: A separate administrator login at /adminlogin for EngFlow engineers to access cluster UI.
- RE: Logging config `--log_file` to have its own parent directory.
- RE: Bytestream/Writes, ContentAddressableStorage writes and S3 writes are now asynchronous.
- RE: ContentAddressableStorage will now support in-transit compression/decompression.
Fixed¶
- UI: Fix error displayed on login page.
- RE: Logging config `--log_level` will now throw an exception on invalid values.
- RE: Prevent BusyExecutorException from reaching the client and let the scheduler retry instead.
- RE: Fix gRPC service for notification queue.
v2.29.0 (2023-02-01)¶
Added¶
- RE: The new `--experimental_always_retry_missing_worker_failures` flag allows server-side execution retries in more cases than before. We expect this reduces the chance of client-visible execution failures.
- UI: Support authenticating to the UI via OpenID Connect using `--http_auth=oidc_login` and setting `--oidc_config`.
- UI, EngFlow profile: Actions now have a `previous_action_runner` field. We expect this helps us better understand why containers are restarted.
Fixed¶
- RE: Worker now waits before shutting down the gRPC server. We expect this reduces the chance of `NOT_FOUND` errors for `WaitExecution` calls.
- RE: Scheduler now retries executions server-side if the selected worker is busy. Previously this error was returned to the client, which then retried.
- RE: Scheduler now won't retry `WaitExecution` if the worker went missing. This returns `NOT_FOUND` to the client, which should then retry. We expect this reduces the chance of `UNAVAILABLE` errors and Bazel exit code 34.
- Goma: Fix timestamp format in JSON logs, so the nanosecond-fraction is zero-padded to 9 decimals.
- UI: Multitude of bug fixes and usability improvements.
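The Goma timestamp fix above concerns zero-padding of the nanosecond fraction. A minimal Python sketch of why the padding matters, assuming an RFC 3339-style UTC layout (the actual Goma log layout is not specified here, and `format_timestamp` is a hypothetical helper name):

```python
from datetime import datetime, timezone


def format_timestamp(seconds: int, nanos: int) -> str:
    """Render a UTC timestamp whose fraction is always exactly 9 digits.

    The RFC 3339-style layout is an assumption made for illustration only.
    """
    if not 0 <= nanos < 1_000_000_000:
        raise ValueError("nanos must be in the range [0, 1_000_000_000)")
    base = datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")
    # ":09d" left-pads with zeros, so 5_000_000 ns becomes "005000000".
    # Without padding it would print "5000000", which reads as 0.5 s, not 0.005 s.
    return f"{base}.{nanos:09d}Z"
```

An unpadded fraction silently changes the value a downstream parser reads, which is why a fixed width of 9 digits is the safe choice.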
v2.28.2 (2023-01-30)¶
Cherrypicks¶
- RE: return `NOT_FOUND` when workers go missing during execution.
- probers: retry when `Execute` returns `NOT_FOUND`.
v2.28.1 (2023-01-27)¶
Cherrypicks¶
- UI login: fix bug in Open ID Connect token requests.
v2.28.0 (2023-01-27)¶
Cherrypicks¶
- RE: added an experimental flag, `--experimental_always_retry_missing_worker_failures`. When `--experimental_always_retry_missing_worker_failures` is enabled, the scheduler will always retry on UNAVAILABLE errors from workers.
- RE: add a wait time before shutting down the gRPC server.
Added¶
- Goma, `--logging_timestamp_format`: now supports the value `fluent-bit`, for Fluent-bit compatible timestamps in JSON logs ("%s.%L" format).
- RE: deployed retriable probers to detect regressions.
Changed¶
- RE, `--log_file_limit`: default is raised from 10mb to 100mb. This should avoid frequent log rotation when logging a lot.
- fluent-bit: health-check is now enabled (but not exposed via any infrastructure).
- fluent-bit: will now ship its own log-file by default.
Fixed¶
- RE, Goma: the .deb installer now creates some static config files and empty log files for fluent-bit safety.
v2.27.2 (2023-01-25)¶
Cherrypicks¶
- RE: added an experimental flag, `--experimental_always_retry_missing_worker_failures`. When `--experimental_always_retry_missing_worker_failures` is enabled, the scheduler will always retry on UNAVAILABLE errors from workers.
v2.27.1 (2023-01-25)¶
Cherrypicks¶
- RE: add a wait time before shutting down the gRPC server.
v2.27.0 (2023-01-18)¶
Incompatible¶
- `--incompatible_track_availability_zone` is now a no-op.
Added¶
- This change log is now shown in the UI under `/restatus`.
- Our public docs now describe how to engage with Customer Support.
- The RE service can now emit single-line JSON logs, see `--log_file`.
Changed¶
- With `--client_auth=github_token`, the principal name is now stable: `github_token`.
v2.26.0 (2023-01-12)¶
Changed¶
- Enable S3 structured paths by default
- Setting `--mtls_expiration=0d` is now allowed, and it disables downloading mTLS certificates from the UI.
- Goma server now rotates log files to avoid filling the filesystem. New logging flags:
  - `--logging_rotate_at_mb` - Rotate log files when size x MB reached (default 10)
  - `--logging_rotate_count` - Number of rotated log files to keep (default 1000)
  - `--logging_days_to_keep_rotated_files` - Maximum days to keep rotated log files for (default 28)
  - `--logging_compress_rotated_files` - Compress rotated log files? (default true)
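For readers unfamiliar with size-based rotation, the sketch below shows an analogous policy using Python's standard `logging.handlers.RotatingFileHandler`. It is not the Goma server's implementation: it only mirrors the "rotate at size" and "number of rotated files to keep" ideas, and does not cover age-based cleanup or compression. The function name, logger name, and file name are hypothetical.

```python
import logging
import logging.handlers
import os
import tempfile


def make_rotating_logger(path: str, rotate_at_mb: float, rotate_count: int) -> logging.Logger:
    """Build a logger that rotates `path` at a size limit and keeps a bounded history."""
    logger = logging.getLogger(f"rotation-demo-{path}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep demo output out of the root logger
    handler = logging.handlers.RotatingFileHandler(
        path,
        maxBytes=int(rotate_at_mb * 1024 * 1024),  # analogous to "rotate at x MB"
        backupCount=rotate_count,                  # analogous to "rotated files to keep"
    )
    logger.addHandler(handler)
    return logger


# Use a tiny limit (about 1 KiB) so rotation is visible after a few hundred lines.
log_dir = tempfile.mkdtemp()
demo_logger = make_rotating_logger(
    os.path.join(log_dir, "server.log"), rotate_at_mb=0.001, rotate_count=2
)
for i in range(500):
    demo_logger.info("line %d: %s", i, "x" * 40)

# Only the active file plus `rotate_count` rotated files remain on disk.
log_files = sorted(os.listdir(log_dir))
```

Bounding both the file size and the number of retained files puts a hard ceiling on disk usage, which is the stated goal of the rotation flags.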
v2.25.0 (2023-01-03)¶
Changed¶
- Update third_party dependencies.
- Add probers to release.
v2.24.0 (2022-12-27)¶
Changed¶
- Update timezone selectors to use UI kit dropdown.
- UI client TLS certs: allow --mtls_expiration=0d.
- Add support for reading and storing protos (secretstore).
Fixed¶
- UI: Fix the datetime picker.
v2.23.4 (2022-12-13)¶
Cherrypicks¶
- auth: use `--tls_trusted_certificate` and `--tls_trusted_key` for signing and verifying JWTs.
v2.23.3 (2022-12-12)¶
Cherrypicks¶
- Added an experimental flag, `--experimental_docker_max_image_size_in_cas`. When `--experimental_docker_store_images_in_cas` is enabled, workers cache docker container images in the CAS. The new flag sets a size limit on container image files stored in the CAS. It defaults to `10gib`, the previously hard-coded limit.
v2.23.2 (2022-12-09)¶
Internal release. No publicly facing changes.
v2.23.1 (2022-12-09)¶
Internal release. No publicly facing changes.
v2.23.0 (2022-12-08)¶
Changed¶
- Docs: Launched the new https://docs.engflow.com with search functionality.
- Enabled `--experimental_filter_known_replicas` by default, causing the scheduler to confirm whether a CAS node holding a replica is alive before attempting a read.
- Changed `--internal_tcp_connect_timeout` default value from 30s to 5s. This controls cluster-internal gRPC connections. Connection attempts to dead nodes will fail faster.
- Enabled `--warm_containers` by default, causing workers to pull active cluster Docker containers before accepting actions.
Fixed¶
- MacOS: worker no longer exits when `--allow_docker=true`; the flag is now ignored.
- Added more Docker platform options to the affinity key to aid the scheduler in executor selection.
v2.22.1 (2022-11-24)¶
Changed¶
- AWS: Install latest SSM agent in all MacOS machine images.
v2.22.0 (2022-11-23)¶
Changed¶
- Added a Docker credential helper to the AMI and `.deb` package, which fetches the username/password from AWS Secrets Manager. It's called `docker-credential-engflow-aws-secretsmanager`.
Fixed¶
- InputCas, stats: fix `distributedCasLongestDownload`.
v2.21.0 (2022-11-14)¶
Changed¶
- Add flag to respect explicitly defined pools for `--experimental_force_mnemonic_pool_name`.
Fixed¶
- Bug causing wrong Action Cache misses due to special-casing the empty blob.
v2.19.3 (2022-11-03)¶
Cherrypicks¶
- Fix Action Cache misses due to special-casing the empty digest.
v2.19.2 (2022-11-02)¶
Cherrypicks¶
- Sharing new AMIs.
v2.19.1 (2022-11-01)¶
Cherrypicks¶
- Fix memory leak in notification queue service.
v2.19.0 (2022-10-24)¶
Changed¶
- Goma: improve Goma cluster logging.
- RE: update default `--max_batch_size` from `4mb` to `10mb`.
- RE: update default `--default_replica_timeout` from `24hr` to `1hr`.
- RE: update default `--cas_existence_cache_expiry` from `24hr` to `0s`.
- RE: update default `--local_cas_existence_cache_expiry` from `120s` to `30min`.
- RE: update default `--hazelcast_aws_use_client_lib` from `false` to `true`.
Fixed¶
- UI: links render correctly in the CHANGELOG.md display of the Cluster Status page.
- Goma: the `grpc_keepalive_time` and `grpc_keepalive_timeout` are now respected properly.
- RE: correctly warm Docker containers on the default pool.
- RE: Disconnected nodes will continue to try to reconnect to the cluster indefinitely.
Deprecated¶
- The `--client_auth=gcp_email`, `--client_auth=basic`, and `--http_auth=gcp_email` options have been deprecated; they are not used by any clusters.
- The `--external_storage_gc_enable_deletion` flag has been deprecated.
v2.18.1 (2022-10-19)¶
Cherrypicks¶
- Goma: set gRPC keepalive time.
v2.18.0 (2022-10-12)¶
Deprecated¶
- The `--aws_cloudformation_stack_name` and `--aws_cloudformation_stack_resource` flags are deprecated; the corresponding values are automatically read from instance metadata.
Added¶
- Docs: Documentation around `--client_auth=github_token`.
- Performance: `--warm_containers=true` can be used to automatically pull active Docker images onto new workers before they accept any actions.
- Goma: new logging flags
  - `--logging_output_encoding` can now be supplied. `json` will emit single-line JSON, for consumption downstream and an enhanced querying UX.
  - `--logging-timestamp-format` can now be supplied. `unix-utc` will output millisecond-precision UNIX timestamps, for simpler datetime parsing downstream and an enhanced querying UX.
Changed¶
- UI: the target tree view defaults to only showing non-successful targets.
- Goma: more readable service logs - lines no longer have Java and systemd journal prefixes.
Fixed¶
- RE: Fixed bug in used file tracking that caused workers to run out of disk space.
v2.17.1 (2022-10-11)¶
Cherrypicks¶
- AWS: The x86_64 Debian AMIs now come with pre-downloaded software for supporting instance types with NVIDIA GPUs.
v2.17.0 (2022-09-30)¶
Changed¶
- Performance: `--bytestream_read_chunk_size` is now 1 MiB by default; this can significantly improve machine-to-machine copy performance.
- Removed metrics `com.engflow.storage.ops/in_flight` and `com.engflow.storage.ops/stream_in_flight`.
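To illustrate why the read chunk size mentioned above affects copy performance, here is a small Python sketch of chunked streaming. It is a generic illustration, not the service's ByteStream implementation; `read_chunks` and `count_chunks` are hypothetical helpers, and the 64 KiB figure used in the comparison is an arbitrary smaller size, not a documented previous default.

```python
import io
from typing import BinaryIO, Iterator

MIB = 1024 * 1024


def read_chunks(stream: BinaryIO, chunk_size: int = MIB) -> Iterator[bytes]:
    """Yield successive chunks of at most `chunk_size` bytes until the stream is exhausted."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return
        yield chunk


def count_chunks(payload: bytes, chunk_size: int) -> int:
    """Count how many chunks (that is, response messages) a payload would need."""
    return sum(1 for _ in read_chunks(io.BytesIO(payload), chunk_size))
```

For a payload of 3 MiB plus one byte, `count_chunks` returns 4 with 1 MiB chunks and 49 with 64 KiB chunks. Fewer, larger messages mean less per-message overhead on machine-to-machine copies.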
Added¶
- UI: added button to download a summary of the invocation as markdown; this is intended for integration with other systems like a bug tracker or helpdesk.
- Added thread pool metrics.
- UI: Add tooltips to analytics summary to clarify values and improve display while loading data.
- UI: Show release notes on the cluster status page.
Fixed¶
- Goma: fix metrics exporting to CloudWatch.
- Fix a rare hang when replicating a file to another machine.
- Fix error handling when the GCS service is temporarily unavailable.
v2.16.4 (2022-09-28)¶
Cherrypicks¶
- Fix high memory consumption when repeatedly reloading an invocation page.
- Fix cas metrics reporting in the schedulers.
- Fix heap dump helper tool.
v2.16.3 (2022-09-21)¶
Cherrypicks¶
- Fix login loop when `authorization` header is sent.
v2.16.2 (2022-09-20)¶
Cherrypicks¶
- Fix cookie parsing with HTTP/2 when multiple cookies are sent.
v2.16.1 (2022-09-20)¶
Cherrypicks¶
- Fix login loop when multiple cookies are set for the UI domain.
v2.16.0 (2022-09-16)¶
Changed¶
- Metrics: All metrics are reported to CloudWatch by default (if enabled).
- Metrics: High-cardinality metrics for actions are disabled.
- Metrics: BEP metrics are now in the `com.engflow.bep` namespace.
Added¶
- Worker AMIs for Arm64 MacOS.
- Support passing through arbitrary flags to `docker run`.
Fixed¶
- Cleanup for `--docker_clean_tmp` runs as root.
- Reduced cpu+memory consumption for very large Bazel event streams.
- Fixed heap dump utility temporary directory creation.
- Improve analytics summary consistency (build counts).
- Fix race condition in handling of file uploads.
- Fix potential hangs in multiple streaming calls.
- Fix mTLS authentication fallback handling.
- Fix existing scheduler metric to always be reported (previously was not reported if the cluster had no worker pools).
v2.15.2 (2022-09-15)¶
Cherrypicks¶
- Goma: fix bug uploading chunks to CAS.
v2.15.1 (2022-09-14)¶
Cherrypicks¶
- Goma: add latency metrics for inbound HTTP RPC requests.
- Goma: ensure `gomaOutput.toFileBlob` returns after cancelation.
- Goma: recache only upload if missing.
- Scheduler: handle mTLS authentication fallback correctly.
v2.15.0 (2022-08-25)¶
Added¶
- Added a documentation page about the Invocation Search page.
- Extended Goma documentation.
- Add new permission for viewing the UI, so that you no longer need admin access.
- Monitoring: Added threadpool metrics.
- UI Authentication: Support logging in with Okta.
- UI Authentication: Web UI login for basic authentication.
- UI Authentication: Support using multiple authentication methods together.
- UI: Allow invocation page tabs to be opened in a separate window.
Fixed¶
- Scheduler: Do not serve requests until we've had a chance to discover workers.
- Fix Test XML file parsing when certain attributes are not present.
- Avoid network loss during service shutdown.
- Fix downloading compressed files.
- UI: Fix messaging around Bazel profile availability while an invocation is still running.
- UI: Fetch the console log only once.
- UI: Fix auto-scroll functionality in console log.
- UI: Ensure the correct test log is shown when switching targets.
v2.14.8 (2022-08-24)¶
Cherrypicks¶
- Goma: Removed large debug logs.
- Goma: Added log buffering and suppressed logs below "Info" level.
- Goma: Added SIGUSR2 handler to collect CPU profiling data.
v2.14.7 (2022-08-23)¶
Cherrypicks¶
- Fix an `UNKNOWN` error from bytestream `Write` calls.
v2.14.6 (2022-08-23)¶
Cherrypicks¶
- Logging: Remove extraneous logging for some RPC calls.
- Logging: Log time spent waiting for the client to send write data.
- Bytestream: Use a 1 MB buffer for writes.
- Monitoring: Add thread pools latency metrics.
- Monitoring: More read/write storage metrics.
v2.14.5 (2022-08-20)¶
Cherrypicks¶
- [goma] Implement digest cache in recache package.
- [goma] Report more metrics for execution and RPC.
- RemoteActionExecutor: fix action execution hangs.
- ExecutionUnit: log target id.
- Logging: log some RPC calls.
v2.14.4 (2022-08-19)¶
Cherrypicks¶
- Fix (rare) NPE in replica selection.
- Add metrics for distributed CAS fetches.
v2.14.3 (2022-08-12)¶
Cherrypicks¶
- Guard process wrapper execution statistics behind a flag.
- Retry recovery when losing a CAS node.
- Fix temp directory handling for dockerized actions.
v2.14.2 (2022-08-10)¶
Cherrypicks¶
- Fix storage metric being off by a factor of 1000.
v2.14.1 (2022-08-10)¶
Cherrypicks¶
- Fix S3 metrics exporting that caused an excessive number of exported metrics and associated CloudWatch costs.
v2.14.0 (2022-08-10)¶
Added¶
- Network: allow setting a TCP connect timeout via `--internal_tcp_connect_timeout`.
- Metrics: add download call stats via `com.engflow.re.cas/fetch_call_time`.
- UI: Analytics page - if scatter charts cannot be shown, add a button that narrows the search so they can be rendered
- UI: Invocation search - allow filtering invocations by principals requester and runner
Fixed¶
- UI: various styling fixes to improve consistency.
- UI: Fix async bug that caused filters to not be applied on reset.
- Metrics: report storage metrics correctly (`com.engflow.storage.*/*`).
- Metrics: fix docs to indicate that distribution metrics are reported to CloudWatch.
- S3: improve handling of "rate limited" errors.
v2.13.0 (2022-08-02)¶
Incompatible¶
- `--incompatible_keep_relative_argv0` is now a no-op.
- `--experimental_grpc_web` is now a no-op.
- Enable `--experimental_inmemory_digests` by default.
- Enable `--incompatible_strict_digest_verification` by default.
Added¶
- Partial graceful shutdown support for schedulers.
- Support RE-API cache compression and gRPC-level compression simultaneously.
- UI: add BES keywords to search index and allow filtering invocations by them (provided invocation indexing is enabled)
- UI: improved rendering of large logs, including lazy loading
- UI: add a badge to mark experimental features
- UI: accessibility improvements
Fixed¶
- Support BES streams with a high data rate.
- Allow viewing an invocation's console output in fullscreen mode.
v2.12.1 (2022-07-25)¶
Added¶
- Support proxying external storage reads through workers under `--workers_handle_fallback_requests`.
Fixed¶
- Fix startup crashes in certain AWS configurations.
v2.12.0 (2022-07-21)¶
Added¶
- Make the chunk size of `ByteStream/Read` responses configurable with `--bytestream_read_chunk_size`.
- Add metric for response time of external authentication.
- Add `--experimental_force_module_cache_path_for_mnemonics` flag for improving Objective-C builds.
- Support changing the initial gRPC flow control window with `--grpc_initial_flow_control_window`.
Fixed¶
- Make sure to use optimized TLS implementation when available.
- Fix hangs when an error happens early during process startup.
- Fix spurious cancellation of RPCs that check blob existence.
- Fix spurious `NOT_FOUND` errors from `ByteStream/Read`.
- Better handling of backend errors in the UI.
v2.11.2 (2022-07-19)¶
Cherrypicks¶
- Race condition while checking CAS cache.
- Possibly erroneous cancellation of futures from the cache.
v2.11.1 (2022-07-13)¶
Fixed¶
- UI: Fix broken Google login page.
v2.11.0 (2022-07-12)¶
Added¶
- UI: Show license info on cluster status page.
- UI: Show error details when invocation page fails to load.
Fixed¶
- UI: Fix bug where large test suites would not load.
- UI: Fix bug where invocations would sometimes not load or hang.
v2.10.0 (2022-07-07)¶
Incompatible¶
- The `--incompatible_track_availability_zone` flag has been flipped, which makes this release incompatible with `v2.3.0` and earlier. Please upgrade to `v2.9.0` before upgrading to this release if you're still running an older version.
Added¶
- UI: Allow downloading `mTLS` client certificates from the UI. The CA is configured server-side using `--tls_trusted_key` and `--tls_trusted_certificate`.
Fixed¶
- UI: Frontend would not show invocation or invocation search pages and instead presented an error.
- UI: Elements in the UI would overlap each other in unexpected ways.
- UI: Ensure fetching the log is not aborted prematurely.
- UI: More clearly surface how many tests failed.
- UI: Fix cluster status page hanging on load.
v2.9.0 (2022-06-28)¶
Incompatible¶
- macOS: Workers no longer wait until at least one Xcode is available before accepting work.
- The `--advertised_port_offset` flag is now a no-op.
Deprecated¶
- macOS: Discovering available Xcode versions no longer relies on the `xcode-locator` subprocess.
Added¶
- auth: Add support for embedding permissions directly into mTLS client certificate.
Fixed¶
- UI: Order test suites and test cases by status, listing failures first.
- UI: Better reporting of aborted invocations.
- UI: Ensure full-screen views are always scrollable.
- UI: Expose if number of test cases does not match reported number of tests.
- goma: Reduce impact of rate limiter on action execution.
v2.8.0 (2022-06-01)¶
Incompatible¶
- The `--incompatible_ignore_legacy_node_properties` flag is now a no-op.
Added¶
- Goma: Add metrics for how long requests are delayed.
- Goma: Add metrics for client errors.
- RE: Add digest of primary output to server-side profile.
- Add support for TLS 1.3
Fixed¶
- Goma: Enable RPC request rate limiter by default.
v2.7.2 (2022-05-24)¶
Cherrypicks¶
- Fix `INTERNAL` error on certain remote persistent worker error conditions.
v2.7.1 (2022-05-23)¶
Cherrypicks¶
- Docker containers were not getting reused. This was causing a performance hit.
v2.7.0 (2022-05-17)¶
Added¶
- Monitoring: the new `com.engflow.instance/gc_avg_duration` metric shows the average duration spent in Java garbage collection since the last reported metric.
Fixed¶
- Results UI: fixed an issue where some server-side profiles failed to load with an `UNKNOWN` error, returning HTTP 500.
- Results UI: In the build status bar, cached builds were not included in the completed builds, and were categorized as "to build". This is now fixed.
- macOS: Improved error message for actions failing because of too long command lines.
- Results UI: fixed login redirection vulnerability that could lure the victim to the attacker's page.
- AWS, MacOS AMI: install service that reaps the symbols cache to avoid filling up the disk.
v2.6.3 (2022-05-05)¶
Fixed¶
- Fix loading some EngFlow profiles.
v2.6.2 (2022-05-03)¶
Fixed¶
- UI: The login page and other elements were misaligned.
v2.6.1 (2022-05-03)¶
Fixed¶
- UI: Navigating between pages sometimes crashed the frontend.
v2.6.0 (2022-05-03)¶
This release changes the database schema. In clusters that have it enabled, this results in an empty database after the upgrade.
Incompatible¶
- The `--incompatible_named_default_pool` flag is now a no-op.
- The `--docker_use_image_id` flag is now a no-op.
Added¶
- Goma: added ability to limit concurrent connections. This should help avoid OOMs when clients upload a lot of input files.
Changed¶
- Eventstore options are no longer experimental.
- Improved error messages when server TLS certificates are not in the expected format.
- Moved user settings to the side bar.
- Improved styling of the cluster status and licenses pages.
Fixed¶
- UI: Don't show workspace status chips (repo, branch, commit) on the invocation page if they are empty.
- GCS: Correctly propagate errors when reading / writing events.
- Fix rare case of schedulers losing track of workers when an incoming execute request is cancelled (previously, the worker was recovered after a timeout).
- The service now retries execution requests internally in some cases to reduce the likelihood of build failures in clusters with very large worker machines and auto-scaling.
- Fix popups to display outside their parents.
- Fix display of the licenses page.
- Fix display of failed tests on the invocation overview page.
- Fix handling of timeouts for remote persistent workers - such actions were incorrectly always retried (independent of the retry policy).
- Fix handling of `test.xml` reports that only specify a 'status' attribute.
v2.5.2 (2022-04-28)¶
Cherrypicks¶
- Fix NPE when proxy-replaying an event stream from another scheduler (part 2).
v2.5.1 (2022-04-26)¶
Cherrypicks¶
- Fix NPE when proxy-replaying an event stream from another scheduler.
- Fix multicast discovery.
v2.5.0 (2022-04-13)¶
Incompatible¶
- Remote persistent workers: flip `--incompatible_ignore_legacy_node_properties`. Please make sure to update Bazel to the latest version provided at https://docs.engflow.com/re/client/remote-persistent-workers.html when using remote persistent workers.
Added¶
- Record time to create output tree in server-side profiles.
Changed¶
- UI: Make Invocation page navigation horizontal
Fixed¶
- macOS: Fix chronyd installation.
- UI: Fix bug causing page to sometimes hang or not load.
v2.4.1 (2022-04-05)¶
Cherrypicks¶
- MacOS AMIs: Fix chronyd installation by preventing brew from running as root.
- UI: Fix potential issue with S3 data writing.
v2.4.0 (2022-04-05)¶
Added¶
- Goma support for pushing metrics to Stackdriver.
- Goma logs are uploaded to Cloudwatch.
Changed¶
- Increase timeout for EngFlow Free to come online.
Fixed¶
- Fix: Reduced redundant logging when invocation streams cannot be found.
- Fix: Don't trigger alarms when the test.xml output file for a test is not uploaded from Bazel to the BES.
- AWS: Cloudwatch metric reporting is more fault tolerant.
- UI: Include more information in build UI stack traces included in the JavaScript console.
- UI: Users can now change their timezone in Chrome.
- UI: Improve various error messages in the UI to include more detail.
v2.3.4 (2022-03-29)¶
Cherrypicks¶
- goma: add `--exec-timeout` to control execution timeout.
- macOS: fix chrony installation script.
v2.3.3 (2022-03-28)¶
Cherrypicks¶
- goma: add flags to control concurrency for storing action inputs and outputs.
v2.3.2 (2022-03-28)¶
Cherrypicks¶
- Fix stack overflow on large invocations.
- Correctly propagate max gRPC message size flags to the internal gRPC service.
- Fix failure to report disk metrics.
v2.3.1 (2022-03-23)¶
Cherrypicks¶
- Fix disk usage metrics not being reported and producing log spam.
v2.3.0 (2022-03-16)¶
Deprecated¶
- Config: `--enable_target_tree` is now a no-op and will be removed in a future release. The flag is set to `true` by default.
Fixed¶
- UI: Fix infinite loop on some prefix filters.
- UI: clarify icon indicating that items can be opened.
- GCP: Improve error reporting for "Partial findMissingBlobs failure".
- Fix error when loading an invocation page corresponding to a server-side profile.
- Fix "Async Stream not found" error to avoid automatically assuming this is a severe error.
- Fix Docker service shutting down before worker service.
Added¶
- GCS: Added `com.engflow.storage.read/time_to_first_byte` and `com.engflow.storage.read/time_per_gb` metrics.
- UI: Improve error message when test.xml file is empty.
- UI: Added analysis failure status for targets.
- UI: Add a button to download test.xml.
- UI: Improve UX around downloading test assets.
- UI: Highlight matching items when searching in the target view.
- Config: Turn `--experimental_google_client_id` into `--google_client_id`.
v2.2.1 (2022-03-08)¶
Cherrypicks¶
- Fix NPE in EventStoreProfileGenerator
- Permit more frequent keepalive times (10s)
- Fix race condition in external storage GC causing INTERNAL error
- Fix uncaught exception in UI when opening an invocation stream for profiling events
- Install chronyd on macOS images
v2.2.0 (2022-03-03)¶
Incompatible¶
- config: `aws` and `gcp` are no longer valid options for `--external_storage`. Use `s3` or `gcs` instead.
Deprecated¶
- config: `--experimental_actions_execution_attempts` is now a no-op and will be removed in a future release.
- config: `--experimental_gcs_direct_upload`, `--incompatible_reduce_memory_use`, `--enable_status_page`, and `--experimental_per_executor_dirs` are now no-ops and will be removed in a future release.
Fixed¶
- UI: Allow downloading (partial) EngFlow profiles while builds are still running.
- UI: Remove links to Bazel command-line options for non-release versions.
Added¶
- UI: Warn users if their upload strategy prevents Bazel profiles from being uploaded.
- Add `--incompatible_track_availability_zone`, which changes the serialization format for one of the types we share between machines. This can be safely deployed while the cluster is running as long as all nodes are running at least `v2.2.0`. Do not enable when there are nodes that run an earlier version, or at the same time as upgrading to `v2.2.0` (or later). We plan to flip this flag in `v2.10.0`.
- docker: Record container start time in EngFlow profile.
- macOS: Add `/var/folders/wp` to the allowlist for sandboxed actions.
Removed¶
- metrics: `com.engflow.re.cas/total_size`, `com.engflow.re.cas/total_replica_size`, `com.engflow.re.exec/running_actions`, `com.engflow.re.exec.docker/containers_created`, `com.engflow.re.exec.docker/container_creation_failed`, and `com.engflow.re.exec.docker/containers_destroyed` are no longer reported.
v2.1.1 (2022-02-22)¶
Fixed¶
- Make deletion of action execroots faster.
v2.1.0 (2022-02-18)¶
Changed¶
- The Build Event Service is enabled by default (disable with `--enable_bes=false`).
- The event store options are no longer experimental. Note that the on-disk location should now be controlled with the separate flag `--event_disk_path` rather than reusing `--event_blobs_root` for this purpose.
- The unnamed worker pool is now called `default` (enable `--incompatible_named_default_pool` by default).
- Improve the Bazel first-time setup instructions.
Fixed¶
- Fix server-side profiles to correctly show all action attempts.
- UI: fix parsing of GitHub URLs.
- UI: fix icon titles.
- UI: fix page-up/page-down keys in the console.
- MacOS: fix repeated warnings about /proc/meminfo.
- Correctly return `NOT_FOUND` instead of `INTERNAL` when an invocation could not be found.
- Avoid action failures when uploading a file to secondary storage returns an error.
v2.0.0 (2022-01-13)¶
This release requires a full cluster shutdown and restart. Due to changes of the default settings for a number of incompatible flags, pre-2.0.0 instances may return errors when communicating with instances running 2.0.0 or later and vice versa.
Otherwise, this release is intentionally small to reduce the upgrade risk. In particular, we did not remove deprecated flags and metrics in 2.0.0 (except as noted below); they will be removed in a later release.
Added¶
- Support TLS 1.3 for the UI and gRPC APIs.
- Automatic Garbage Collection for External Storage.
- Added an inline Profile Viewer.
- Print warnings when using deprecated command-line flags.
- UI: the invocation page sidebar can be navigated by keyboard.
- UI: show 'View Logs' button for test logs.
Fixed¶
- Fix permission denied when deleting an exec tree with unexpected mod bits.
- UI: Some non-existent pages returned 404 (not found) for unauthenticated users; they now return 403 (forbidden). This was a potential information leak (benign).
- UI: correctly show `NOT_FOUND` for missing nodes.
- UI: console correctly uses the full height when maximized.
- UI: fix color for the timezone selector.
- UI: prevent horizontal overflow in the test view.
- UI: improve performance of loading large console logs.
- API: correctly return `NOT_FOUND` for calls to the results store.
Removed¶
- Remove all s3-specific metrics `com.engflow.re.storage.s3/*`.
- Remove all gcs-specific metrics `com.engflow.re.storage.gcs/*`.
Incompatible¶
- Flipped `--incompatible_reduce_memory_use`.
v1.58.9 (2022-02-08)¶
Cherrypicks¶
- Trigger new release due to flaky errors.
v1.58.8 (2022-02-07)¶
Cherrypicks¶
- goma: consider include path when deriving common input/output prefix.
v1.58.7 (2022-02-03)¶
Cherrypicks¶
- Log end-to-end build times from BES.
v1.58.6 (2022-01-28)¶
Cherrypicks¶
- Work around OpenJDK 11.0.14 bug: ignore Host header in http2 requests.
- ResultStore/GetTarget: respond with `NOT_FOUND` instead of `INTERNAL`.
v1.58.5 (2022-01-27)¶
Cherrypicks¶
- Install Xcode 13.2.1 and cmd-line 13.2 on macOS
v1.58.4 (2022-01-07)¶
Cherrypicks¶
- Fix compilation errors due to bad cherrypicks
v1.58.3 (2022-01-06)¶
Cherrypicks¶
- UI: don't fail with INTERNAL error when target tree node is not found
- UI: don't fail with Chunk too large
- Profiler: Record retry attempts during input fetching for better profiling
- Fix `com.engflow.re.storage.existence_cache/*` stats
v1.58.2 (2021-12-21)¶
Fixed¶
- Fix profile download links.
v1.58.1 (2021-12-20)¶
Fixed¶
- Fix incorrect timezone list.
v1.58.0 (2021-12-17)¶
We are preparing for a 2.0.0 release in early 2022. To reduce the number of changes going into that release, we have proactively flipped a few flags that were intended for 2.0.0 and that do not require a full cluster restart. We have already enabled these flags on all managed clusters without any issues.
Incompatible¶
- Enable --incompatible_remove_symlink_execroot_strategy by default; this removes the symlink exec root strategy, which was never used in production due to being incompatible with dynamically linked binaries.
- Enable --incompatible_keep_relative_argv0 by default; this fixes the lookup of commands which use a relative command line to be consistent with POSIX shell lookup (including PATH) and is required by all remote execution clients that we are aware of being used in production.
- Enable --incompatible_no_storage_backend_metrics by default; this removes a few deprecated metrics related to storage.
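The shell-style lookup that --incompatible_keep_relative_argv0 aligns with can be illustrated by a minimal Python sketch. This is not EngFlow's implementation: resolve_command and its parameters are hypothetical names, and the sketch only assumes the usual POSIX convention that a command containing a slash is resolved against the working directory while a bare name is searched in PATH.

```python
import os
import shutil


def resolve_command(argv0, cwd, path_env):
    """Resolve argv0 roughly the way a POSIX shell would (illustrative only)."""
    if "/" in argv0:
        # A name containing a slash is taken relative to the working
        # directory (or as-is when absolute); PATH is not consulted.
        return os.path.normpath(os.path.join(cwd, argv0))
    # A bare name is searched in the directories listed in PATH.
    # shutil.which returns None when nothing executable is found.
    return shutil.which(argv0, path=path_env)
```

For example, resolve_command("bin/tool", "/work", "/usr/bin") yields "/work/bin/tool" without looking at PATH at all.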
New¶
- Server-side profiles now contain per-action input tree stats.
- The --incompatible_named_default_pool flag changes the meaning of the Pool platform option, and allows selecting the default (unnamed) execution pool.
- Add a dockerUseEntrypoint boolean platform option to disable use of the docker image entrypoint on a per-action basis.
- Add --incompatible_strict_digest_verification to enable strict validation of digests across all API calls, superseding --incompatible_batch_read_blobs_verifies_digests; both will be enabled by default and removed after the 2.0.0 release.
- UI: support resizing the tree view.
- UI: show SCM status (if received from the workspace status command).
- UI: add a timezone selector to the settings.
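To make "validation of digests" concrete: in the Remote Execution API a digest is a content hash plus a size in bytes. The following Python sketch shows one plausible check; it assumes SHA-256 as the digest function, verify_digest is a hypothetical helper, and the checks the service actually performs are not described in these notes.

```python
import hashlib


def verify_digest(blob: bytes, expected_hash: str, expected_size: int) -> bool:
    """Return True if blob matches the (hash, size) pair of a digest.

    Illustrative only: assumes SHA-256 and lowercase-hex hashes.
    """
    # Check the cheap property (size) before hashing the content.
    if len(blob) != expected_size:
        return False
    return hashlib.sha256(blob).hexdigest() == expected_hash.lower()
```

A blob whose size or content hash does not match is rejected rather than stored or served under the wrong key.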
Changed¶
- Increase the default gRPC max message size to 20 MiB to reduce issues with uploading large build events.
- HTTP cookies now use SameSite=Lax to avoid requiring login every time a user follows an external link into the UI, e.g., from CI.
- Reduce worker service memory footprint.
- UI: improve consistency and usability.
- UI: improve the ordering of targets in the overview tab.
- UI: show local / remote build status icon.
- UI: use ISO 8601 dates and 24-hour format by default.
- UI: update target status icons for improved consistency and readability.
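As an illustration of the SameSite=Lax cookie attribute mentioned above, here is a small sketch using Python's standard http.cookies module. The cookie name "session" and the helper build_session_cookie are hypothetical; the actual cookie names and attributes set by the UI are not documented here.

```python
from http.cookies import SimpleCookie


def build_session_cookie(value: str) -> str:
    """Build a Set-Cookie header value with SameSite=Lax (illustrative only)."""
    cookie = SimpleCookie()
    cookie["session"] = value  # hypothetical cookie name
    # Lax: the cookie is sent on top-level navigations from other sites
    # (e.g. following a link from CI), but not on cross-site subrequests.
    cookie["session"]["samesite"] = "Lax"
    cookie["session"]["httponly"] = True
    return cookie["session"].OutputString()
```

With SameSite=Strict the cookie would be withheld when a user arrives via an external link, which is what would force a fresh login.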
Fixed¶
- Fix the affinity-based scheduler to take the absolute input root into account if set; this reduces docker container restarts and improves build performance.
- GCP: images now use gcr instead of gcloud to authenticate docker operations with GCR, which is more reliable.
- UI: fix linebreaks in the displayed build command line.
- UI: fix issue where in-progress builds don't render correctly.
- UI: deduplicate target configuration information.
- UI: fix critical path display to use timestamp order (don't sort by length).
v1.57.4 (2021-12-15)¶
Cherrypicks¶
- Fix performance regression in CAS downloads. This reverts a bugfix of --log_level, so the finest supported log level is again INFO.
v1.57.3 (2021-12-09)¶
Cherrypicks¶
- Goma: avoid data corruption by resetting the buffer when a download is retried
v1.57.2 (2021-12-08)¶
Cherrypicks¶
- Server-side profile: add input_tree_stats to action details
v1.57.1 (2021-12-01)¶
Cherrypicks¶
- Fix release pipeline
v1.57.0 (2021-11-30)¶
Incompatible¶
- Boolean-type flags now enforce their value to be true or false. Previously any value other than the literal true was parsed as false; from now on this is an error.
- Enable --split_cluster_name by default. If you're currently not setting this flag, make sure schedulers have a tag named engflow_re_scheduler_name with the same value as engflow_re_cluster_name.
- --experimental_cas_check_storage_only is now a no-op.
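The change in boolean flag parsing can be sketched as follows. Both functions are hypothetical illustrations of the behavior described above, not the service's actual parser.

```python
def parse_bool_lenient(raw: str) -> bool:
    """Old behavior as described: anything other than 'true' is false."""
    return raw == "true"


def parse_bool_strict(raw: str) -> bool:
    """New behavior as described: only 'true' and 'false' are accepted."""
    if raw == "true":
        return True
    if raw == "false":
        return False
    raise ValueError(f"invalid boolean flag value: {raw!r}")
```

A typo such as "ture" used to silently disable a flag; under strict parsing it is rejected at startup instead.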
New¶
- With --http_public_port, you can set a different port for HTTP requests than for gRPC (--public_port).
- Free tier now supports the EventStore API.
- Free tier now supports server-side profiles.
- UI: Enabled target tree by default.
- UI: Allow users to expand information-dense cards to full-screen.
- UI: Follow the end of the console while loading.
- UI: Allow users to filter the target tree by prefix or status.
- UI: Added Overview tab to help quickly identify build issues.
Changed¶
- Workers now create one one-core executor per available CPU core instead of just one one-core executor.
- Improved compatibility of server-side profiles with Perfetto UI.
Fixed¶
- Removed limit of concurrent connections from free tier.
- Improved parsing of critical path in Build and Test UI.
Deprecated¶
- --split_cluster_name is deprecated and will be removed in the next release.
Security¶
- Removed debugging HTTP endpoints from Goma.
- Restrict frame-ancestors from Content-Security-Policy.
v1.56.1 (2021-11-10)¶
Cherrypicks¶
- AWS/GCP images: revert back to Debian 10
v1.56.0 (2021-11-08)¶
Changed¶
- AWS/GCP images now use Debian 11
- Logging: FindMissingBlobs is now less chatty
- Docker container reuse is enabled by default; use --docker_allow_reuse=false to opt out the entire cluster, or set dockerReuse=False for all actions (or builds) that need to opt out. If you want to opt out the entire cluster, we recommend setting --docker_allow_reuse=false before you upgrade. This change also switches all actions to separate docker run and docker exec invocations. If that causes problems, you can temporarily opt out the entire cluster by setting --docker_split_exec_run=false. Note that we plan to deprecate that option; please let us know if you do set this flag.
Fixed¶
- --worker_config now correctly handles configurations with more than 2GB ram per executor
Security¶
- The HTTP UI now returns various security-related HTTP headers like Content-Security-Policy and X-Frame-Options by default, to prevent a number of attack scenarios (see --strict_http_headers)
- Action inputs can now be absolute symlinks
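One way to confirm that a deployment returns the security headers named above is to inspect a response's header names. The sketch below is a hypothetical checker operating on a plain mapping of headers; it covers only the two headers mentioned in these notes and makes no assumption about their values.

```python
# Header names taken from the release note; values are deliberately not checked.
EXPECTED_HEADERS = ("Content-Security-Policy", "X-Frame-Options")


def missing_security_headers(headers):
    """Return the expected security headers absent from a response.

    `headers` is any mapping of header name to value; HTTP header names
    are case-insensitive, so the comparison is done in lowercase.
    """
    present = {name.lower() for name in headers}
    return [name for name in EXPECTED_HEADERS if name.lower() not in present]
```

An empty result means both headers were present in the response being inspected.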
v1.55.0 (2021-10-28)¶
Added¶
- Docker pull times to EngFlow profile
- --grpc_max_message_size flag to control gRPC max message size
- Log messages regarding lost or corrupted CAS files
Changed¶
- All target tree flags are now only controlled by --enable_target_tree
- --enable_status_page is now true by default
- --principal_based_permissions now defaults to [] to restrict data access by default
Fixed¶
- Race condition that would cause the PublishBuildToolEventStream gRPC call to fail
- Linux ficlone call for creating action inputs
- Basic auth using web-browsers
Removed¶
- Experimental flags related to the target tree
Security¶
- Reduced surface area for phishing attacks
v1.54.1 (2021-10-19)¶
Cherrypicks¶
- Fixed missing working directory for the cached docker strategy
v1.54.0 (2021-10-18)¶
Added¶
- Exposed EventStore endpoint over gRPC
- Tests parsed from test.xml are shown hierarchically
Changed¶
- Server-side profiling is now always enabled when the BES is enabled; disable with --profile_to_event_store=false
Fixed¶
- Fix CAS capacity accounting during recovery
- Wait for CAS metadata writes after file upload; fixes file missing errors when no external storage is configured and build-without-the-bytes is enabled
Security¶
- Require explicit principal permissions to be set when accessing HTTP endpoints
- Patched various low-risk vulnerabilities
v1.53.1 (2021-10-15)¶
Cherrypicks¶
- Don't run actions twice with the "cached docker" strategy
v1.53.0 (2021-10-07)¶
Added¶
- Add memory usage and garbage collection metrics
- UI: Add profile picture and user menu
Fixes¶
- GCP, AWS: fix logging issues causing stuck instances
- Fixed a bug where some metrics were not reported
- Fixed crash on start up when --cas_path is undefined
- UI: Fix broken alert bar
- Improved error handling during rapid cluster size changes
Changes¶
- Report metrics with granular counts of incoming actions
- When --external_storage is enabled then --experimental_opportunistic_cas is now regarded as true. Previously you had to explicitly enable the flag. We no longer recommend setting --experimental_opportunistic_cas at all, because when --external_storage is disabled then it's safer to use --experimental_force_lru instead
- client_auth=gcp_rbe: Clients with "remotebuildexecution.blobs.create" permission can now also upload Build Event Streams. Previously such requests failed because (as of 2021-09-29) GCP has no permissions to control Build Event Stream uploads
- Docker: Respect memory limit provided by --worker_config
- Enable --experimental_per_executor_dirs by default
- Increase default --max_batch_size to 4mb
Removed¶
- Remove --experimental_profile_dir in favor of --experimental_profile_to_event_store.
v1.52.6 (2021-10-05)¶
Cherrypicks¶
- Minor bugfixes
v1.52.5 (2021-10-05)¶
Cherrypicks¶
- Prevent GCP logging problems from breaking the scheduler process.
v1.52.4 (2021-10-01)¶
Cherrypicks¶
- Fixed bug propagation caused by Hazelcast errors
v1.52.3 (2021-09-29)¶
Cherrypicks¶
- Minor bugfixes
v1.52.2 (2021-09-28)¶
Cherrypicks¶
- Fixed a bug where long log messages would cause schedulers to hang
- Reduce log spam
- Fixed a bug where some metrics would not be reported
v1.52.1 (2021-09-22)¶
Cherrypicks¶
- Fixed crash on start up when --cas_path is undefined
v1.52.0 (2021-09-20)¶
Changes¶
- Add invocation IDs to logging and errors
Fixes¶
- Various concurrency bugfixes for multi-scheduler clusters
- Improve logging around failed gRPC calls
- Reduce log spam for the gRPC NOT_FOUND response code
- Improve macOS worker support
- Fixed EngFlow internal profiling when running multiple schedulers
v1.51.2 (2021-09-14)¶
Cherrypicks¶
- Minor bugfixes
v1.51.1 (2021-09-13)¶
Cherrypicks¶
- Fix release pipeline
v1.51.0 (2021-09-08)¶
Added¶
- CAS: Workers now pick up existing files from the CAS directory. It's no longer necessary to delete this directory after a worker is restarted. If this behavior breaks something, use --recover_cas_blobs=false and let EngFlow know.
- Add metrics for inbound BEP events.
  - com.engflow.eventstore/new_inbound_stream
  - com.engflow.eventstore/new_inbound_bep_event
  - com.engflow.eventstore/new_outbound_bep_event
  - com.engflow.eventstore/new_outbound_stream
  - com.engflow.eventstore/ongoing_streams
- The build results UI can now authenticate users with Google's login page. See --http_auth=google_login.
Changed¶
- Enable --upload_outputs_on_failure by default.
- UI: Update branding.
- AWS, dashboard module: Reduce window size from 300s to 60s.
- GCP, Terraform files: Move service accounts into a Terraform module.
- The new default of --http_auth is deny. Make sure you override this flag as needed.
- EngFlow .deb installer: depends on the full OpenJDK, not just the JRE
Fixed¶
- Fixed a race condition with BES and multiple schedulers.
- Use exec-root of executor when starting reusable docker containers. This fixes a bug causing containers not to start if the user has no permission to access the container's default workdir (e.g. when setting it to /root).
- Do not proxy errors to cancelled client streams.
- UI: Display correct status icon in target list view.
- Fix race condition during BES live replay.
Deprecated¶
- --docker_use_path is a no-op; please use --incompatible_keep_relative_argv0 instead.
- --docker_use_addgroup is no longer supported.
v1.50.6 (2021-08-31)¶
Cherrypicks¶
- Fixed unbounded thread creation
v1.50.5 (2021-08-24)¶
Cherrypicks¶
- Added Metrics around BES upload
- com.engflow.eventstore/new_inbound_stream
- com.engflow.eventstore/new_inbound_bep_event
- com.engflow.eventstore/new_outbound_bep_event
- com.engflow.eventstore/new_outbound_stream
- com.engflow.eventstore/ongoing_streams
v1.50.4 (2021-08-24)¶
Cherrypicks¶
- Fix race condition during BES live replay
v1.50.3 (2021-08-05)¶
No change. Just re-triggering the release.
v1.50.0 (2021-08-05)¶
Added¶
- Docker: the new --docker_default_network_mode flag controls the default value of "dockerNetwork" (when the client doesn't request any).
- MacOS: actions run with sandboxing if --experimental_allow_mac_sandbox is enabled
Deprecated¶
- We've removed support for Java 8 and Ubuntu 16.04.
v1.49.0 (2021-07-12)¶
Added¶
- Linux: support file cloning for file systems that support it
- Docker: add a flag to enable Docker signature verification (--docker_content_trust)
Changed¶
- Switch to React for the cluster status page (--enable_status_page)
- MacOS: packages now contain the process-wrapper binary which can enforce proper shutdown of actions
- GCP: various improvements to terraform configuration (enable stackdriver integration by default, enable shielded VMs by default, add dashboard module)
Fixed¶
- Profiling: fixed a hang when downloading large server-side profiles
- Profiling: fixed a hang when downloading an unfinished server-side profile
- Persistent workers: action timeouts are now properly enforced
v1.48.0 (2021-06-23)¶
Changed¶
- Improved documentation around persistent workers.
Fixed¶
- Reliability: Internal "Connection reset" calls no longer trigger INTERNAL gRPC errors.
v1.47.0 (2021-06-08)¶
Changed¶
- AWS: The packer config in our release (base-image.json) now installs the AWS SSM agent.
Fixed¶
- AWS: fix instance id retrieval on IMDSv2.
- Reliability: The eviction policy on the action cache could previously cause long-running scheduler services to crash, even if all instances are individually restarted (the schedulers automatically replicate entries from removed instances).
Deprecated¶
- The com.engflow.re.scheduler/available_workers metric is deprecated. We recommend using the new com.engflow.re.scheduler/existing_executors metric instead.
v1.46.1 (2021-05-25)¶
Cherrypicks¶
- Fixed a stack overflow bug that caused schedulers to crash.
v1.46.0 (2021-05-24)¶
Changed¶
- Updated the Bazel process-wrapper that is used by workers to isolate actions
Fixed¶
- GCP logging will not be enabled even with the flag set if the instances detect that they are running outside of GCP.
- Persistent workers will be restarted if the kernel kills the action process. This mitigates the risk of leaking processes on poorly behaving actions.
v1.45.1 (2021-05-14)¶
Cherrypicks¶
- Deployment kit, Dockerfile: fix v1.45.0 regressions
v1.45.0 (2021-05-13)¶
Deprecated¶
- --auto_worker_expiration and --docker_use_init are now both no-op. They were enabled by default in v1.31.0, now are always on.
- Deployment kit, Kubernetes: deleted the obsolete setup files (gen-k8s-config.py and templates/ directory); updated the documentation about the current Kustomization-based setup
Added¶
- --experimental_async_storage_uploads: This makes it so we don't wait for async uploads to complete. This should improve performance in cases where such uploads are slow.
Changed¶
- Deployment kit, Debian package: the package no longer "Depends" on OpenJDK; it now "Recommends" OpenJDK's JRE. This lets you skip installing that Java runtime, and use a different runtime. The dockerfile has a --build-arg to control that (see below).
- Deployment kit, GCP Terraform file: added firewall rule to allow health checks; listen on port 443 instead of 8080; increase scheduler disk size
- Deployment kit, AWS Terraform file: added use_s3 variable; can generate a random S3 bucket name; cluster_name is customizable
- Deployment kit, engflow.Dockerfile: made it configurable via --build-arg, installing the JRE and Docker are now optional
- CloudWatch: log stream names now show the machine's role (scheduler or worker) and are easier to read
Fixed¶
- S3 / GCS: intermittent errors are now reported as UNAVAILABLE, not as INTERNAL error
- gRPC / netty: closed channels are now reported as UNAVAILABLE error
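The distinction matters because gRPC documents UNAVAILABLE as a most likely transient condition that can be corrected by retrying with backoff, whereas INTERNAL signals a serious error. A client-side retry decision might look like the sketch below; the Code enum mirrors the standard gRPC numeric codes, while should_retry and its max_attempts default are hypothetical and not a description of any particular client.

```python
from enum import Enum


class Code(Enum):
    """Subset of the standard gRPC status codes."""
    INTERNAL = 13
    UNAVAILABLE = 14


# Assumption for illustration: only UNAVAILABLE is treated as transient.
RETRYABLE = {Code.UNAVAILABLE}


def should_retry(code: Code, attempt: int, max_attempts: int = 3) -> bool:
    """Decide whether a failed call should be retried (illustrative only)."""
    return code in RETRYABLE and attempt < max_attempts
```

Under this policy an intermittent storage error reported as UNAVAILABLE is retried, while the same error reported as INTERNAL would have failed the build immediately.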
v1.44.1 (2021-05-07)¶
Cherrypicks¶
- Fixed an error where new nodes were unable to join the cluster due to third-party library incompatibilities
v1.44.0 (2021-04-30)¶
Added¶
- Authentication: added a deny mode that denies all incoming requests for the --client_auth and --http_auth options
Changed¶
- The --gcs_credentials flag is no longer deprecated
- AWS deployment configuration: added more alerts
Fixed¶
- The status web page is no longer available on the private scheduler port, only on the public port
- Handle premature exit of the persistent worker process; these are now automatically retried and provide a better error message
v1.43.0 (2021-04-21)¶
Changed¶
- Validate that --cas_path points to a writeable directory
- Kubernetes: improve configuration to be less dependent on cluster config
Fixed¶
- Fixed an error when the client sends an empty byte stream
- Fixed basic auth documentation
v1.42.0 (2021-04-12)¶
Added¶
- Kubernetes: you can override the default Kubernetes master address with --k8s_master. Normally this should not be necessary, except if you see discovery problems.
- --worker_config now accepts auto, meaning to create 1 executor that uses all available cores.
- Actions now log how many output files (and total bytes) they uploaded to the CAS. (Only when replication is enabled.)
Changed¶
- Kubernetes: added Dockerfile; added affinity rules to the on-prem Kustomizable overlay
Fixed¶
- Fixed Mac release packages that were broken since v1.38
v1.41.1 (2021-04-08)¶
Cherrypicks¶
- Fixed Mac release packages
v1.41.0 (2021-04-07)¶
Added¶
- Kubernetes: new and improved Kustomization-based deployment templates for K8s
Changed¶
- Docker: forward env variables for AWS credentials as well as DOCKER_HOST to Docker invocations; this supports setups other than the default Docker socket
- Added more metadata and failed actions to the server-side profile (see --experimental_profile_dir)
Fixed¶
- Safeguarded against undeletable files when reusing exec roots
- Fixed tracking of CAS file locations
v1.40.0 (2021-03-31)¶
Added¶
- S3: The --incompatible_s3_use_structured_paths flag changes the directory structure, making blob access faster. This is an incompatible change: enabling the flag means the cluster won't find the old bucket content.
Changed¶
- Docker FIFO creation: report stderr on failure
- Docker internal retry: print if stderr was empty
- S3: we now support more than 50 concurrent connections; see --external_storage_worker_threads and --external_storage_scheduler_threads
- S3: retry failed downloads
- S3: set IOException cause for generic errors, so error logs are more detailed
Fixed¶
- AWS deployment kit: fix use of list option
- AWS deployment kit: enable instance_refresh in the Terraform config
- Docker "OCI runtime exec failed": fixed the --experimental_docker_internal_error_stderr_pattern semantics (added in v1.31), we now correctly retry such actions.
- Docker: check container after every non-zero exit. This should help with containers that become unusable, e.g., due to a docker daemon restart.
- Persistent workers: fixed the bug where workers sometimes failed to start, printing execution failed INTERNAL: Bad response from worker:
v1.39.0 (2021-03-23)¶
Added¶
- Added an experimental server-side profiling implementation (see --experimental_profile_dir)
Changed¶
- Logging: log average download rate per storage location; look for 'timing' in the worker logs
- Enable recursive output tree action cache verification by default; previously, the action cache could return cache entries with output files that were no longer available in the CAS, breaking Bazel's build-without-the-bytes mode (--experimental_check_action_cache_recursively)
- S3 / GCS: use 50 threads by default on workers and remove upper limit (50) on S3
v1.38.5 (2021-04-21)¶
Cherrypicks¶
- Fix release package build
v1.38.4 (2021-04-21)¶
Cherrypicks¶
- Fix release package build
v1.38.3 (2021-04-20)¶
Cherrypicks¶
- Enable the recursive AC check by default
- Check existence of tree blob
- Clean exec root if input tree creation fails
- Force-add replicas to the location map
- Fixed Mac release packages
v1.38.2 (2021-03-25)¶
Cherrypicks¶
- Add --incompatible_s3_use_structured_paths to use structured paths in S3, which may significantly improve performance under high load
v1.38.1 (2021-03-23)¶
Cherrypicks¶
- Correctly propagate metadata for internal CAS download calls
v1.38.0 (2021-03-22)¶
Added¶
- IPv6 support: added --docker_ipv6_cidr and --docker_ipv6_subnet_length to configure the IPv6 subnets for Dockerized actions
Changed¶
- The service now returns an error for HTTP/1.X connections to the gRPC port
- AWS: Improved deployment templates
- File downloads are retried internally if there are more copies in the distributed CAS
- If --experimental_per_executor_dirs is enabled, actions are always run in a deterministically-named directory
Fixed¶
- IPv6 support: Dockerized actions run with an IPv6 localhost if IPv6 is enabled
- Fix crash when enabling --experimental_per_executor_dirs in a cluster that has files in the work directory
- Fix protocol error when a client attempts to execute an action with a lot of missing files
- Fix reuse of Docker containers between persistent worker and normal actions
v1.37.4 (2021-03-22)¶
Cherrypicks¶
- S3: do not force absolute blobs root; clarify requirements in documentation
v1.37.3 (2021-03-18)¶
Cherrypicks¶
- S3: sanitize blobs root; add logging
v1.37.2 (2021-03-10)¶
Cherrypicks¶
- Fix action cache recursive output directory check
v1.37.1 (2021-03-10)¶
Cherrypicks¶
- Fix worker startup script
v1.37.0 (2021-03-09)¶
Added¶
- Added a flag to support S3-compatible storage services like MinIO (--s3_endpoint)
- Added an experimental option to force actions into specific pools by action mnemonic (--experimental_force_mnemonic_pool_name); note that this requires the client to send action mnemonics using the recently updated metadata proto
Changed¶
- The --use_upload_to_rereplicate flag is now a no-op. Please remove it from your configs.
- CloudWatch: --experimental_cloudwatch_no_instanceid, --aws_instance_id, --single_instance_monitoring, and --experimental_single_instance_monitoring are now no-op flags. Please remove these from config files. Instances always behave as if --experimental_cloudwatch_no_instanceid=true.
Fixed¶
- Persistent workers: correctly use the relative working directory to look up parameter files and run workers; this is needed for Bazel @ HEAD to work
- Action cache: fix handling of output directories to avoid returning stale action cache entries - this could cause Bazel client errors if build-without-the-bytes is enabled
- Action execution: added a flag (--incompatible_keep_relative_argv0) to stop absolutizing argv[0]; absolutizing could cause errors with hermetic C++ toolchains outputting absolute paths to .d files and failing Bazel's consistency checks; the flag will be enabled by default in a future release; note that this may break some builds that relied on the old behavior (also see the Bazel issue https://github.com/bazelbuild/bazel/issues/13189)
v1.36.0 (2021-02-26)¶
Added¶
- Added a flag to use per-executor working directories (--experimental_per_executor_dirs=true)
- Added a flag to pass the executor id to local actions through an env variable (--experimental_local_provide_executor_id=true, ENGFLOW_EXECUTOR_ID)
- Added a platform option to control the exec root strategy; this can be used to switch from the default, fast hardlink strategy, which does not set file permissions, to a copy strategy that sets the file permissions as requested by the client (experimentalActionInputStrategy=copy)
Changed¶
- Improved logging for persistent workers
- Increased default cache duration for CAS existence checks to external storage to 24h and 10 million entries
- Increased default cache duration for CAS existence checks to the distributed CAS to 120 seconds
- Limited download concurrency to at most 200 concurrent downloads by default to avoid running out of native memory or file descriptors
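The existence-check cache described above combines two bounds: a maximum number of entries and a maximum age per entry. The Python sketch below shows one way such a cache can work. It is an assumption-laden illustration (the ExistenceCache class, its method names, and the oldest-first eviction order are all hypothetical), not the service's data structure.

```python
import time
from collections import OrderedDict


class ExistenceCache:
    """Remembers recently seen blob digests, bounded by size and age."""

    def __init__(self, max_entries, ttl_seconds, clock=time.monotonic):
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = OrderedDict()  # digest -> expiry timestamp

    def record(self, digest):
        """Note that a blob was just confirmed to exist."""
        self._entries[digest] = self._clock() + self._ttl
        self._entries.move_to_end(digest)
        # Evict the least recently recorded entries once over capacity.
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)

    def probably_exists(self, digest):
        """True only if the blob was seen within the TTL window."""
        expiry = self._entries.get(digest)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            del self._entries[digest]
            return False
        return True
```

With the defaults from this release, such a cache would be constructed as ExistenceCache(10_000_000, 24 * 3600) for external storage; a longer TTL saves backend round trips at the cost of serving stale answers for blobs deleted in the meantime.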
Fixed¶
- Fixed issue where helper threads could go into a busy loop when Docker containers are reused; this may not result in client-visible build issues but causes high CPU load on the worker instances. This was introduced in 1.32.0 when the default for --experimental_docker_avoid_fifo was flipped
v1.35.1 (2021-02-17)¶
Cherrypicks¶
- ExecutedActionMetadata: fix worker start timestamp
v1.35.0 (2021-02-16)¶
Added¶
- Added a flag to use consecutive TCP/IP ports for internal traffic (--incompatible_use_low_offsets=true)
- S3: Experimental support for multi-part uploads to handle files larger than 5 GB (--experimental_s3_use_transfer_manager=true)
- Added a flag to disable participation in the distributed CAS; this is useful for satellite clusters where a few machines are remote to the main cluster (--enable_distributed_cas=false)
- Logging: added a metric to monitor persistent worker use
Fixed¶
- Fixed reporting of timestamps in execution result
Deprecated¶
- The --experimental_gcs_direct_upload flag is a no-op. Please remove it from your configs.
v1.34.0 (2021-02-08)¶
Added¶
- AWS: Improved AWS Terraform files in the release package to support dashboards and logging
Changed¶
- GCS: use a new code path to upload blobs
- AWS CloudWatch: --experimental_cloudwatch_no_instanceid=true by default. The --aws_instance_id and --single_instance_monitoring flags are deprecated, please remove them from configs.
- --storage_range_requests is now a no-op. It has been enabled since v1.30
Deprecated¶
- GCP: --gcs_credentials flag is deprecated, please use application default credentials instead
v1.33.0 (2021-02-05)¶
Added¶
- AWS: Support for logging to CloudWatch logs; enable with --remote_logging_service=aws_cloudwatch and --aws_log_group_name=name
- Execution responses include timestamps for client-side metric collection
Changed¶
- Debugging: --keep_exec_directories_for_debugging now retains output files as well
- Docs: clarify metrics documentation
Fixed¶
- Fix --experimental_docker_store_images_in_cas to not cache temporary failures
- Persistent workers: the service no longer waits for persistent worker processes to shut down, but terminates them forcefully
v1.32.4 (2021-02-05)¶
Cherrypicks¶
- AMD64: fix debian package
v1.32.3 (2021-01-30)¶
Cherrypicks¶
- CloudWatch: fix reporting of cumulative metrics
v1.32.2 (2021-01-29)¶
Cherrypicks¶
- Fix gRPC metrics reporting
v1.32.1 (2021-01-28)¶
Cherrypicks¶
- Fix an invalid name resulting in a SecurityException
- Revert improved HTTP/1.1 handling; this caused health checks on AWS to fail
v1.32.0 (2021-01-27)¶
Added¶
- macOS: we now release packages for macOS, and added documentation about setting up a basic macOS cluster
- Docker: Support running containers by id with --docker_use_image_id; this prevents docker run from attempting to pull the corresponding image
- Docker: Log time needed to run docker pull
- Docker: add --experimental_docker_store_images_in_cas to support storing docker images in the CAS to improve performance and reliability
- AWS CloudWatch: add --experimental_cloudwatch_no_instanceid; when enabled, all machines will report metrics without InstanceId dimension, which makes the metrics aggregatable
- Google Cloud Storage: implement a faster upload method, activated with --experimental_gcs_direct_upload=true
Changed¶
- Improved Packer and Terraform templates
- GCP: GCP images are more lightweight and boot faster
- MacOS: --xcode_locator now points to /usr/local/bin/engflow/xcode-locator by default; this is where the MacOS package installs this binary
- Running as a service now requires the file /etc/engflow/config to exist
- Increase the default value of --default_replica_timeout to 24h
- CAS / AC: reduce traffic to the storage backend (when using AWS S3 or Google Cloud Storage) with the help of a cache for recently seen blobs; you can customize its behavior with --experimental_cas_existence_cache_max_size and --experimental_cas_existence_cache_expiry
- Docker: enable --experimental_docker_avoid_fifo by default for compatibility with gVisor
- Docs: show metric units and aggregation type in the documentation
Fixed¶
- CAS.batchReadBlobs can now respond with INVALID_ARGUMENT for invalid digests. This is an incompatible bugfix and it's disabled by default; enable it with --incompatible_batch_read_blobs_verifies_digests=true
- Attempting to connect to the cluster via an HTTP/1.1 connection now returns an HTTP/1.1 error reply rather than simply closing the connection
- CAS: Fix off-by-one error when replicating in the distributed CAS; previously, the cluster created one replica more than requested
- Metrics: The com.engflow.re.cas/available_space metric is now clamped at zero; previously it was possible for it to temporarily dip below zero while running GC
- Enable --use_upload_to_rereplicate by default; this fixes a rare mutual deadlock condition when two workers simultaneously attempt to upload files to each other
- Upgrade gRPC library, which fixes an issue with slow up- and downloads on high-latency connections; unfortunately, we had to disable --experimental_log_unavailable_rpcs during the upgrade
- Restore compatibility with Java 8
- CAS: fix a potential scenario where the service could write an incomplete file to Google Cloud Storage
- Cloud: templates now disable systemd / syslogd integration by default; having the integration enabled causes log lines to be duplicated to multiple log files, which could result in running out of disk space
- Fixed NullPointerException in RereplicatingCasDownloader when --use_upload_to_rereplicate=true
Deprecated¶
- AWS discovery: deprecate --aws_security_group; this flag is unnecessary as cluster members find each other by --cluster_name (we recommend also enabling --split_cluster_name=true)
v1.31.2 (2021-01-21)¶
Cherrypicks¶
- RereplicatingCasDownloader: retain Context to fix NPE ("RequestMetadata not set in current context")
v1.31.1 (2020-12-17)¶
Cherrypicks¶
- extraActionInputs: ensure directory exists before attempting to create input
v1.31.0 (2020-12-11)¶
Added¶
- ARM64: we now release a Debian package for ARM64
- Docker: Initial IPv6 support with --docker_enable_ipv6; this provides an isolated IPv6 network to actions which can be used for testing IPv6 code
- Docker: Allow resolving executable paths against PATH; this is not compliant with the remote execution spec, but improves compatibility with existing open source projects that rely on this behavior, e.g., TensorFlow and Envoy
- CAS: Document --experimental_opportunistic_cas - this flag switches to a different replication policy that reduces pressure on the distributed CAS if an external storage is configured; this improves reliability under load
- Monitoring: Add a metric com.engflow.re.scheduler/existing_schedulers for the number of schedulers; this can be used to detect instances that are unable to report metrics, e.g., to Google Cloud Operations (formerly StackDriver)
- Logging: log mTLS client authentication events
- Docker: added --experimental_docker_internal_error_stderr_pattern to control automatic retries for some kinds of docker exec failures
Changed¶
- S3: automatically retry failures after a delay
- Docker: enable --docker_use_init by default; this helps avoid running out of PIDs when actions spawn a large number of subprocesses
- Execution: enable --auto_worker_expiration by default; improves tracking of available workers
Fixed¶
- Deployment: correctly set the Debian package architecture
- Docker: correctly pass system capabilities to Docker
- GCS: improved handling of "connection lost" errors
Deprecated¶
- Options: the --docker_use_pull flag is now a no-op; the new code is always enabled
v1.30.0 (2020-12-04)¶
Added¶
- GCP auth: print more server-side logs when authentication fails
- Docker: the new --docker_use_init flag enables running Docker with a proper init process that reaps zombie processes, which avoids running out of PIDs when reusing docker containers
- CAS: the new --use_upload_to_rereplicate flag enables using a new CAS re-replication code path that avoids a rare deadlock among worker machines
Changed¶
- Debian package: the .deb version is now the release's SemVer, not the build date (check with dpkg -I engflow-re-services.deb)
- Deployment kit (zip file): the k8s setup files are now under setup/k8s
- Docker: print reason for container restart
- External storage: enable range requests by default (see --storage_range_requests)
- External storage: check on startup if we can access the storage backend
- AWS Terraform file: renamed the need_external_docker parameter to public_worker_ip
- Build label: fixed missing build label in 1.28 and 1.29
- Logging: fix the swapped invocation_id and action_digest in ExecutorServer's log line
- Docs: show the service options' types correctly
- Docs: display the version selector
- CloudWatch: report metric units correctly
- Fix uncaught IllegalStateException wrapping OperationTimeoutException from Hazelcast
- CAS: detect on-disk file corruption
- CAS: fix invalidating blobs that went missing with a PRECONDITION_FAILED
- GCP: fixed Dockerized execution with cached containers on gVisor (requires --experimental_docker_avoid_fifo=true)
v1.29.1 (2020-11-19)¶
Cherrypicks¶
- Monitoring: fix negative pool_utilization metric
v1.29.0 (2020-11-19)¶
Changed¶
- AWS: improve deployment template (simplify role policy, add API endpoints)
- Monitoring: add more context to logged error messages
- Logging: log the requested number of blobs for FindMissingBlobs calls in addition to failed and missing digests
- Logging:
--debug_execute_requests
also prints stderr for failed actions
Fixed¶
- Execution: correctly create all requested output & input directories
- CAS: do not unlist CAS nodes that fail due to timeouts; this could potentially result in a denial-of-service if the client sets small timeouts for large uploads
- CloudWatch: respect max reporting batch size
- CloudWatch: silently skip histogram metrics, which always failed to report
- Documentation: correctly render metrics reporting percentages
- S3: print the correct region name when the region is us-east-1
- Networking: fix reporting of stream errors
v1.28.0 (2020-11-13)¶
Added¶
- AWS, monitoring:
--experimental_single_instance_monitoring
is now called
--single_instance_monitoring
(the old name still works)
- Add
--external_storage_scheduler_threads
and
--external_storage_worker_threads
to allow customizing the external storage thread pool
Changed¶
- MacOS: sign release
- GCP, monitoring: Remove code to report metrics to Google Cloud Operations every 30 minutes
- Logging: Correctly report missing blobs, improve GCS error logs
- AWS: Improved terraform template for cluster setup
Fixed¶
- GCP, monitoring: fix sample reporting for charts that measure rates
v1.27.7 (2020-11-12)¶
Cherrypicks¶
- Status page: fix --http_auth=none to allow access to the status page
v1.27.6 (2020-11-11)¶
Cherrypicks¶
- Infrastructure: fix CI configuration for releases
- Infrastructure: fix CI machine selection for releases
- Monitoring: report two values before skipping; this should fix GCP metrics to go down to zero
Added¶
- Logging: the
--experimental_log_unavailable_rpcs
flag (boolean) enables logging the stack trace of RPC calls that fail with UNAVAILABLE. We added this feature only for debugging, and we plan to remove it as soon as we can.
- Monitoring: Added
--enable_status_page
to provide a basic cluster status page over HTTP2 (only!) on the same IP+port as the gRPC endpoint (previously undocumented as
--experimental_status_page
)
- Release archive now contains a CHANGELOG.md (this file)
- CAS: Added an experimental flag to change the CAS re-replication policy to be less aggressive (
--experimental_opportunistic_cas
). Note:
  - This is an incompatible flag and may require downtime to roll out
  - This should only be enabled when external storage is enabled
Changed¶
- Logging: log more detailed CAS upload errors, report
INVALID_ARGUMENT
correctly, report
RESOURCE_EXHAUSTED
instead of
UNAVAILABLE
when no workers are available
- Logging: log a summary of missing blobs and failures for FindMissingBlobs calls
- Monitoring: Report metrics to Google Cloud Operations at least every 30 minutes
- Documentation: the "Bazel First-Time Setup" page now recommends
--remote_timeout=600
instead of
3600
- Docker: Pass
--userns=host
to Docker to explicitly disable user namespaces; previously, all actions failed when user namespaces were enabled in the Docker daemon
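To illustrate the Docker change above, here is a minimal sketch of a `docker run` command line that includes `--userns=host`. The helper name `docker_run_argv`, the `--rm` flag, and the example image and command are assumptions made for illustration; only `--userns=host` comes from the note above, and this is not necessarily how the service invokes Docker.

```python
from typing import List


def docker_run_argv(image: str, command: List[str]) -> List[str]:
    """Build a `docker run` argv that opts out of user namespace remapping.

    Hypothetical helper for illustration only.
    """
    return [
        "docker", "run", "--rm",
        # Use the host's user namespace even if the daemon enables remapping.
        "--userns=host",
        image,
        *command,
    ]


if __name__ == "__main__":
    # Example image and command are placeholders.
    print(" ".join(docker_run_argv("ubuntu:20.04", ["id", "-u"])))
```

Building the command as a list rather than a single string avoids shell quoting issues when it is later passed to something like `subprocess.run`.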
Fixed¶
- Dockerized execution: disable user namespaces to avoid action failures
- Code cleanup: several bugfixes found by static analyzers
- Error handling: report an error if the output tree cannot be deleted
(primarily when
--experimental_docker_use_platform_user
is enabled)
Deprecated¶
- Options: the (undocumented)
--affinity_scheduling
flag is now a no-op; the new code is always enabled
v1.26.2 (2020-11-02)¶
Cherrypicks¶
- Infrastructure: fixed version name computation in our release pipeline
Added¶
- MacOS: create release
- --experimental_cas_check_storage_only
flag: to enable faster CAS checks (when
--external_storage
is not none)
- Logging: worker logs the CAS size upon startup
Changed¶
- Dockerized actions: add container hostname to
/etc/hosts
Fixed¶
- Monitoring: report CAS usage regularly, not just when doing a GC
- Code cleanup: lots of bugfixes found by static analyzers
v1.25.1 (2020-10-30)¶
Added¶
- Documentation: for Remote Persistent Workers
- /healthz
page
- Status page: now authenticates clients, see the
--http_auth
flag
- AWS, monitoring: support for single-machine-only monitoring; see the
--experimental_single_instance_monitoring
flag
- Monitoring: the
com.engflow.re.scheduler/pool_utilization
metric shows what percentage of executors in a pool are currently used
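As a rough illustration of what a "percentage of executors in a pool currently used" figure means, the sketch below computes such a value from two counts. The function name, its parameters, and the clamping behavior are assumptions for illustration; the service's actual computation is not described in these notes.

```python
def pool_utilization_percent(busy_executors: int, total_executors: int) -> float:
    """Return the share of busy executors in a pool as a percentage (0-100).

    Hypothetical helper: the real metric's computation is not documented here.
    """
    if total_executors <= 0:
        # An empty pool has no meaningful utilization; report 0.
        return 0.0
    # Clamp so that transient inconsistencies never yield a value outside 0-100.
    busy = min(max(busy_executors, 0), total_executors)
    return 100.0 * busy / total_executors
```

Clamping the busy count is one simple way to keep a reported value from going negative or above 100 when the two counts are sampled at slightly different times.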
Changed¶
- Monitoring: the
com.engflow.re.scheduler/queue_age
metric now reports min/max ages broken down by executor pool
- deb package: post-install script creates engflow user's home dir
Fixed¶
- Monitoring: stuck actions are now removed from the scheduler's queue, and won't drive up the max age forever
- Monitoring: fixed GCS metrics that tried reporting negative values
v1.24.1 (2020-10-28)¶
Cherrypicks¶
- GcsClient copy: always set target of copy request
v1.23.1 (2020-10-28)¶
Cherrypicks¶
- GcsClient copy: always set target of copy request
v1.24.0 (2020-10-19)¶
Added¶
- Logging: log OpenCensus attempt to record negative value
- Logging: LocalExecutionServer tracks and logs per-action timing
- Monitoring:
com.engflow.re.bytestream/read
metric to monitor complete vs. partial
ByteStream.read
calls (hidden from docs because we wanted to use it for debugging only)
Changed¶
- Monitoring:
com.engflow.re.storage/ops_queue_size
now shows the composition of external storage ops queue
Fixed¶
- Don't execute an action if input fetch failed
v1.23.0 (2020-10-14)¶
Added¶
- Status page: a simple status page with members list, on the same port as
the remote execution service; enabled with
--experimental_status_page=true
Fixed¶
- ByteStream.read
: properly implement resumable downloads from S3/GCS, guarded by
--experimental_storage_range_requests=true
- Fix user-visible cancellation exceptions
- Fix a race condition in CloudCasDownloader
- Fix an incorrectly reported AC corruption with S3/GCS
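Resumable downloads of the kind mentioned above are typically built on HTTP range requests, which both S3 and GCS accept. The sketch below shows how a `Range` header for resuming at a given byte offset could be formed; the function name and parameters are hypothetical, and this is not the service's actual implementation.

```python
from typing import Optional


def resume_range_header(bytes_received: int, total_size: Optional[int] = None) -> Optional[str]:
    """Return the HTTP Range header value needed to resume a download.

    Hypothetical helper. Returns None when nothing is left to fetch.
    """
    if bytes_received < 0:
        raise ValueError("bytes_received must be non-negative")
    if total_size is not None and bytes_received >= total_size:
        # The whole object has already been received.
        return None
    # Open-ended range: everything from the first missing byte to the end.
    return f"bytes={bytes_received}-"
```

For example, a download interrupted after 1024 bytes would resume with `Range: bytes=1024-`, so the already-received prefix is not transferred again.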
v1.22.0 (2020-10-06)¶
Added¶
- Deployment kit: added an example Bazel project
- Docs: add on-prem setup instructions
Changed¶
- Return INVALID_ARGUMENT for too-large output trees
Fixed¶
- Remote logging: Avoid infinite recursion when logging to GCP
- ExecutionServer: Suppress cancellation exceptions, so they don't get reported to the client
- Docker pull: fine-grained pull errors
v1.20.1 (2020-10-02)¶
Cherrypicks¶
- Scheduler: also listen to expiration/eviction to avoid losing workers
- ExecutionServer: catch exceptions from onCompleted to avoid "call already cancelled" errors
v1.21.0 (2020-09-28)¶
Added¶
- Added
--experimental_docker_force_reuse
flag
Changed¶
- AWS, CloudWatch:
--cloudwatch_dimensions
is now optional
- AWS, deployment kit: service endpoint now listens on port 443 (was 8080 before)
- Improved CAS performance
Fixed¶
- Fixed bugs in validating the output tree in the client's execution requests
- Java 8: Fixed a crash bug
v1.20.0 (2020-09-15)¶
Added¶
- Workers: can auto-detect the disk size
- Service interface: implemented
ByteStream.QueryWriteStatus
- Docs: added system diagram
Changed¶
- External storage: use more threads: 50 on schedulers, 25 on workers
- CentOS: use statically-linked netty-tcnative in the RPM package
v1.19.1 (2020-09-03)¶
Cherrypicks¶
- Fix hazelcast InterruptedExceptions
v1.19.0 (2020-09-03)¶
Fixed¶
- Hash mismatch issues
v1.18.0 (2020-09-02)¶
Added¶
- New metric:
com.engflow.re.storage/ops_queue_size
Changed¶
- Enabled affinity-scheduling by default (
--affinity_scheduling=true
)
- Changed server-side execution log message format to "id: message"
Fixed¶
- Fewer
DEADLINE_EXCEEDED
client errors: more findMissingBlobs caching, reduced GCS/S3 traffic
v1.17.0 (2020-08-26)¶
Added¶
- RPM package for CentOS 7