Metrics Reference¶

Description of all metrics available for monitoring.

AnalyzeInvocation metrics¶

com_engflow_invocationanalyzer_bazel_profile_count¶

no unit: The number of Bazel profile fetch attempts.

Tags

status: The status of retrieving the Bazel profile.

com_engflow_invocationanalyzer_bazel_profile_size¶

no unit: The size of the uncompressed Bazel profile handled

com_engflow_invocationanalyzer_engflow_profile_count¶

no unit: The number of EngFlow profile fetch attempts.

Tags

status: The status of retrieving the EngFlow profile.

com_engflow_invocationanalyzer_engflow_profile_size¶

no unit: The size of the uncompressed EngFlow profile handled

com_engflow_invocationanalyzer_time_needed¶

no unit: The time distribution of handling individual profile analysis requests

Tags

status: The status of the analysis performed

Details: The time distribution of handling individual Bazel profiles.

OperationController metrics reporter¶

com_engflow_operationcontroller_active¶

no unit: The number of active operations (i.e., currently running).

Tags

name: The name of the OperationController

com_engflow_operationcontroller_latency¶

no unit: The latency to start operations (i.e., how long operations are waiting to be executed).

Tags

name: The name of the OperationController

com_engflow_operationcontroller_queued¶

no unit: The number of operations queued for execution.

Tags

name: The name of the OperationController

com_engflow_operationcontroller_runtime¶

no unit: The runtime of operations (i.e., the duration operations are running for).

Tags

name: The name of the OperationController

Metrics derived from raw BEP streams¶

com_engflow_bep_invocation_completed¶

no unit: Fired with the count of completed invocations reported to the BEP.

Tags

exit_code: The human-readable exit code of the invocation.

com_engflow_bep_invocation_duration¶

no unit: Fired on invocation completion with the average duration of the invocation.

com_engflow_bep_invocation_started¶

no unit: Fired with the count of newly started invocations reported to the BEP.

Blob-storage implementation metrics¶

com_engflow_blobstore_latency¶

no unit: The duration each operation takes.

Tags

operation
status

com_engflow_blobstore_ops¶

no unit: Fires every time an operation takes place.

Tags

operation

Docker proxy¶

com_engflow_dockerproxy_blob_upload_bytes¶

no unit: The size of Docker blobs that the proxy successfully uploaded to the CAS.

Tags

status

com_engflow_dockerproxy_cache_hit_bytes¶

no unit: The size of Docker blobs that the proxy could find in the CAS.

com_engflow_dockerproxy_cache_miss_bytes¶

no unit: The size of Docker blobs that the proxy expected but could not find in the CAS.

com_engflow_dockerproxy_known_blobs¶

no unit: The number of Docker blobs that the proxy has metadata about.

HTTP clients for the Docker proxy¶

com_engflow_dockerproxy_http_received_bytes¶

no unit: Bytes received over the HTTP client

Tags

client: The name of the HTTP client (can be used to distinguish layers)

com_engflow_dockerproxy_http_request_latency_seconds¶

no unit: Time taken to serve HTTP requests

Tags

client: The name of the HTTP client (can be used to distinguish layers)
status: The status code, reduced to 1xx..5xx, or FAILED if an exception occurred
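
The reduction described by the `status` tag can be illustrated with a small sketch. The function name `status_class` and its parameters are hypothetical and not part of the Docker proxy; this only shows one plausible mapping, assuming an exception is signalled separately from a status code.

```python
def status_class(status_code=None, failed=False):
    """Reduce an HTTP status code to its class ("1xx".."5xx"), or "FAILED".

    Hypothetical helper: `failed` stands in for "an exception occurred".
    """
    if failed or status_code is None:
        return "FAILED"
    if not 100 <= status_code <= 599:
        raise ValueError(f"not an HTTP status code: {status_code}")
    # Integer division keeps only the leading digit: 404 -> 4 -> "4xx".
    return f"{status_code // 100}xx"
```

Grouping by the reduced value keeps the number of distinct tag values small, which is usually what a latency distribution needs.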

com_engflow_dockerproxy_http_requests¶

no unit: Number of HTTP requests started on the HTTP client

Tags

client: The name of the HTTP client (can be used to distinguish layers)
method: The HTTP method of the request

BEP Event Storage and Replay¶

com_engflow_eventstore_bep_event_ack_latency¶

no unit: This is a distribution. Tracks how much time passed between receiving a build event and sending an acknowledgement to the client.

Tags

status

com_engflow_eventstore_bes_upload_delay¶

no unit: This is a distribution. Tracks how much longer an invocation's BES upload took compared to the invocation's duration as reported by the BES.

Tags

status

com_engflow_eventstore_build_event_owners¶

no unit: The total number of build owners residing in memory. A build owner is an internal piece of state associated with a particular build reporting build events.

com_engflow_eventstore_flushing_batches_size¶

no unit: The total size of complete build event batches that are currently being uploaded to storage. Normally, batches are flushed quickly, so this value should stay near zero; if it doesn't, that could mean we are falling behind with batch uploads. Every instance reports its own stats; sum them up to get a cluster-wide metric.
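
As a minimal sketch of the cluster-wide aggregation suggested above, assume the latest sample per instance has already been collected from your monitoring backend. The instance names and the `cluster_total` helper are hypothetical.

```python
def cluster_total(latest_by_instance):
    """Sum the most recent per-instance samples into one cluster-wide value."""
    return sum(latest_by_instance.values())


# Hypothetical latest samples of com_engflow_eventstore_flushing_batches_size.
samples = {"scheduler-0": 0, "scheduler-1": 4096, "scheduler-2": 1024}
total = cluster_total(samples)
```

The same summation applies to the other EventStore metrics in this section that state that every instance reports its own stats.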

com_engflow_eventstore_grpc_eventstore_ttfb¶

no unit: This is a distribution. Tracks how much time passed between requesting EventStore data via gRPC, and receiving the first byte.

Tags

type

com_engflow_eventstore_inbound_bep_events¶

no unit: Incremented whenever an event is received on an inbound stream.

Tags

type

Details: An inbound stream means a client (e.g. Bazel) sending BES events to a scheduler.

Every instance reports its own stats; sum them up to get a cluster-wide metric.

com_engflow_eventstore_incomplete_batches_size¶

no unit: The estimated size in bytes it would take to serialize all incomplete build event batches. These batches aren't yet written to storage. Actual JVM heap footprint is likely larger. Every instance reports its own stats; sum them up to get a cluster-wide metric.

com_engflow_eventstore_new_outbound_streams¶

no unit: Incremented whenever a new outbound BEP stream is created.
Details: An outbound stream means a scheduler replaying a BES stream in order to reduce it to a result stream. It makes no difference if the scheduler is replaying to itself or to another instance.

Every instance reports its own stats; sum them up to get a cluster-wide metric.

com_engflow_eventstore_ongoing_streams¶

no unit: The total number of streams that are inbound, outbound, or both.
Details: An inbound stream means a client (e.g. Bazel) sending BES events to a scheduler. An outbound stream means a scheduler replaying a BES stream in order to reduce it to a result stream. It makes no difference if the scheduler is replaying to itself or to another instance.

Every instance reports its own stats; sum them up to get a cluster-wide metric.

com_engflow_eventstore_outbound_bep_events¶

no unit: Incremented whenever an event is sent on an outbound stream.
Details: An outbound stream means a scheduler replaying a BES stream in order to reduce it to a result stream. It makes no difference if the scheduler is replaying to itself or to another instance.

Every instance reports its own stats; sum them up to get a cluster-wide metric.

Virtual Machine Instances¶

com_engflow_instance_by_pool_total_disk_space¶

no unit: The size of the volume.

Tags

instance_role: type of the instance (scheduler/worker/etc.)
volume: the name of the disk volume
pool: the name of the pool the instance serves

com_engflow_instance_by_pool_used_disk_percentage¶

no unit: The percentage of the volume that is currently used.

Tags

instance_role: type of the instance (scheduler/worker/etc.)
volume: the name of the disk volume
pool: the name of the pool the instance serves

com_engflow_instance_by_pool_used_disk_space¶

no unit: The total number of bytes used on the volume.

Tags

instance_role: type of the instance (scheduler/worker/etc.)
volume: the name of the disk volume
pool: the name of the pool the instance serves

com_engflow_instance_gc_avg_duration¶

no unit: The average duration spent in garbage collection since the last reported metric.

Tags

gc_type

com_engflow_instance_gc_count¶

no unit: The total number of garbage collections during the lifecycle of this process.

Tags

gc_type

com_engflow_instance_gc_time¶

no unit: The total estimated time in milliseconds performing garbage collection.

Tags

gc_type

com_engflow_instance_new_gc_avg_duration¶

no unit: The average duration spent in garbage collection since the last reported metric.

Tags

gc_type: GC Old generation / GC Young Generation
instance_role: type of the instance (scheduler/worker/etc.)

com_engflow_instance_new_gc_count¶

no unit: The total number of garbage collections during the lifecycle of this process.

Tags

gc_type: GC Old generation / GC Young Generation
instance_role: type of the instance (scheduler/worker/etc.)

com_engflow_instance_new_gc_time¶

no unit: The total wall time in milliseconds spent blocked in garbage collection since the start of the process. This measures time when the application is not running due to a collector pause.

Tags

gc_type: G1 Old generation / G1 Young Generation
instance_role: type of the instance (scheduler/worker/etc.)

com_engflow_instance_new_open_file_descriptors¶

no unit: The number of file descriptors the process has currently open.

Tags

instance_role: type of the instance (scheduler/worker/etc.)

com_engflow_instance_new_spot_instance_reclaim_count¶

no unit: The number of spot instances that were reclaimed.

Tags

instance_role

com_engflow_instance_new_total_disk_space¶

no unit: The size of the volume.

Tags

instance_role: type of the instance (scheduler/worker/etc.)
volume: the name of the disk volume

com_engflow_instance_new_total_system_memory¶

no unit: The total amount of system memory in bytes.

Tags

instance_role: type of the instance (scheduler/worker/etc.)

com_engflow_instance_new_used_disk_percentage¶

no unit: The percentage of the volume that is currently used.

Tags

instance_role: type of the instance (scheduler/worker/etc.)
volume: the name of the disk volume

com_engflow_instance_new_used_disk_space¶

no unit: The total number of bytes used on the volume.

Tags

instance_role: type of the instance (scheduler/worker/etc.)
volume: the name of the disk volume

com_engflow_instance_new_used_process_native_buffer_memory¶

no unit: The total amount of native buffer memory for this process in bytes.

Tags

instance_role: type of the instance (scheduler/worker/etc.)

com_engflow_instance_new_used_system_memory¶

no unit: The amount of used system memory in bytes.

Tags

instance_role: type of the instance (scheduler/worker/etc.)

com_engflow_instance_new_used_system_memory_percentage¶

no unit: The percentage of system memory used.

Tags

instance_role: type of the instance (scheduler/worker/etc.)

com_engflow_instance_total_disk_space¶

no unit: The size of the volume.

Tags

volume

com_engflow_instance_total_system_memory¶

no unit: The total amount of system memory in bytes.

com_engflow_instance_used_disk_percentage¶

no unit: The percentage of the volume that is currently used.

Tags

volume

com_engflow_instance_used_disk_space¶

no unit: The total number of bytes used on the volume.

Tags

volume

com_engflow_instance_used_system_memory¶

no unit: The amount of used system memory in bytes.

com_engflow_instance_used_system_memory_percentage¶

no unit: The percentage of system memory used.

com_engflow_jvm_heap_size¶

no unit: The heap size of the JVM, in bytes.

Tags

pool
dimension

Replica Tracker metrics¶

com_engflow_contentaddressablestorage_replicatracker_operation¶

no unit: The runtime of operations on the replica tracker.

Tags

name

Netty monitoring¶

com_engflow_thirdparty_netty_used_direct_memory¶

no unit: Direct (non-heap) memory use

Tags

buffer_name

com_engflow_thirdparty_netty_used_heap_memory¶

no unit: Heap memory use

Tags

buffer_name

Worker Control metrics¶

com_engflow_re_management_workercontrol_approx_mft_induced_idle_executor_duration¶

no unit: Approximately how much time all executors of all workers marked-for-termination were idle. Each scheduler reports the approximate duration of executor idleness as induced by the scheduler marking workers for termination. Only the master scheduler should report non-zero values. Sum up the idle durations reported by all schedulers to get the overall idleness.

Tags

pool: name of the pool ("default" for the default pool)

com_engflow_re_management_workercontrol_mft_on_scheduler¶

no unit: This is a distribution. Tracks how long the marked-for-termination call lasted on the scheduler.

Tags

pool: name of the pool ("default" for the default pool)
result: the result of the marked-for-termination call

com_engflow_re_management_workercontrol_mft_on_worker¶

no unit: This is a distribution. Tracks how long the marked-for-termination call lasted on the worker.

Tags

pool: name of the pool ("default" for the default pool)
result: the result of the marked-for-termination call

com_engflow_re_management_workercontrol_ongoing_mft_worker_count¶

no unit: Number of workers currently marked for termination, per pool, as reported by the scheduler.

Tags

pool: name of the pool ("default" for the default pool)

Action scheduling¶

com_engflow_re_scheduler_autoscaler_cluster_size_controller_op¶

no unit: Per pool, distribution of how long each cluster size controller operation took and what its completion status was.

Tags

pool: name of pool ("default" for the default pool)
op: the operation performed, "setClusterSize" or "reduceClusterSizeByInstance"
status: "succeeded" or "failed"

com_engflow_re_scheduler_autoscaler_set_size_operations¶

no unit: Per pool, number of attempts to set its autoscaling group's desired size

Tags

pool: name of pool ("default" for the default pool)
status: "succeeded" or "failed"

Details: Deprecated. Use com.engflow.re.scheduler/autoscaler_cluster_size_controller_op instead.

com_engflow_re_scheduler_available_workers¶

no unit: Deprecated; number of idle executors, per pool

Tags

pool: name of the pool ("default" for the default pool)

Details: Deprecated. Indicates the number of idle executors, per pool, according to this scheduler. Only schedulers report this metric. Every scheduler reports the same (or about the same) value. This metric is deprecated, because it may be imprecise: schedulers that are started while workers are busy may report a higher value than they should for several minutes. We recommend monitoring com.engflow.re.scheduler/existing_executors instead.

com_engflow_re_scheduler_coalesced_executions¶

no unit: Number of action requests coalesced into an existing execution

com_engflow_re_scheduler_cores_per_executor¶

no unit: Number of cores per executor, per pool

Tags

pool: name of the pool ("default" for the default pool)

Details: Number of cores per executor. Every scheduler reports the same (or about the same) value.

com_engflow_re_scheduler_dequeued_actions¶

no unit: Per pool, the number of actions that were removed from the queue, either because they started executing on a worker or because they were ejected from the queue for being too old.

Tags

pool: name of the pool ("default" for the default pool)
reason

com_engflow_re_scheduler_desired_executors¶

no unit: Number of desired executors, per pool

Tags

pool: name of the pool ("default" for the default pool)

Details: Indicates an estimate for the number of required executors per pool. Every scheduler reports its own estimate; sum the estimates to get the total desired pool size.

com_engflow_re_scheduler_estimated_action_time¶

no unit: Estimated action time

Tags

pool: name of the pool ("default" for the default pool)

Details: Indicates an estimate for the duration of an action. Every scheduler reports its own estimate.

com_engflow_re_scheduler_estimated_induced_load¶

no unit: Estimated induced wait time

Tags

pool: name of the pool ("default" for the default pool)

Details: Indicates an estimate for future incoming work in ms per ms. Every scheduler reports its own estimate.

com_engflow_re_scheduler_exceeded_licensed_max_worker_cores¶

no unit: How often the master scheduler scaled worker pools less than desired due to licensed maximum worker core limitations.

Only schedulers report this metric. Every scheduler reports the same (or about the same) value.

com_engflow_re_scheduler_existing_executors¶

no unit: Number of existing executors, per pool

Tags

pool: name of the pool ("default" for the default pool)

Details: Number of existing executors, per pool, according to this scheduler. Only schedulers report this metric. Every scheduler reports the same (or about the same) value.

com_engflow_re_scheduler_existing_schedulers¶

no unit: Number of existing schedulers
Details: Only schedulers report this metric. Every scheduler reports a constant "1". This can be used to detect schedulers that are unable to send monitoring metrics.
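
One way to use the constant "1" for detection is to compare the schedulers that reported a sample against the schedulers you expect to exist. The sketch below assumes you know the expected scheduler names; the names and the `scheduler_health` helper are hypothetical.

```python
def scheduler_health(expected_names, samples):
    """Return (number of reporting schedulers, sorted list of silent ones).

    `samples` maps scheduler name -> latest existing_schedulers value,
    which is always 1 for a scheduler that managed to report.
    """
    reporting = sum(samples.values())
    missing = sorted(set(expected_names) - set(samples))
    return reporting, missing


expected = ["scheduler-0", "scheduler-1", "scheduler-2"]
latest = {"scheduler-0": 1, "scheduler-2": 1}
count, silent = scheduler_health(expected, latest)
```

Here `silent` names the scheduler that sent no sample in the observation window, which is a candidate for an alert.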

com_engflow_re_scheduler_global_queue_size¶

no unit: Number of waiting actions, per pool

Tags

pool: name of the pool ("default" for the default pool)

Details: Indicates the number of actions waiting for execution, per pool, in the cluster. Only schedulers report this metric. The schedulers coordinate to calculate this sum.

com_engflow_re_scheduler_global_used_executors¶

no unit: Number of used executors, per pool

Tags

pool: name of the pool ("default" for the default pool)

Details: Indicates the number of executors that are in use, per pool, in the cluster. Only schedulers report this metric.

com_engflow_re_scheduler_licensed_max_worker_cores¶

no unit: Maximum number of worker cores permitted by the EngFlow license

com_engflow_re_scheduler_max_executors¶

no unit: Configured maximum number of executors, per pool

Tags

pool: name of the pool ("default" for the default pool)

Details: Indicates the configured maximum number of executors per pool.

com_engflow_re_scheduler_max_instances¶

no unit: Configured maximum number of instances, per pool

Tags

pool: name of the pool ("default" for the default pool)

Details: Indicates the configured maximum number of instances per pool.

com_engflow_re_scheduler_mia_count¶

no unit: Per pool, the number of actions that failed because the executing worker went missing-in-action

Tags

pool: name of the pool ("default" for the default pool)

com_engflow_re_scheduler_min_executors¶

no unit: Configured minimum number of executors, per pool

Tags

pool: name of the pool ("default" for the default pool)

Details: Indicates the configured minimum number of executors per pool.

com_engflow_re_scheduler_min_instances¶

no unit: Configured minimum number of instances, per pool

Tags

pool: name of the pool ("default" for the default pool)

Details: Indicates the configured minimum number of instances per pool.

com_engflow_re_scheduler_pool_utilization¶

no unit: Current executor utilization, per pool

Tags

pool: name of the pool ("default" for the default pool)

Details: Reports current executor utilization (used*100/total) per pool, as a percentage ([0..100]). Only schedulers report this metric. Every scheduler reports the same (or about the same) value.

To help make scale-up decisions when a pool is empty, utilization is reported as 100 if there are actions waiting and 0 if not. A pool may be empty if it was scaled down, or if it never existed (the client may request any pool name).
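
The reporting rule for this metric can be sketched as follows. The function and parameter names are hypothetical; the scheduler's actual implementation is not shown in this reference.

```python
def pool_utilization(used, total, waiting_actions):
    """Executor utilization of one pool as a percentage in [0, 100].

    An empty pool (total == 0) reports 100 if actions are waiting and 0
    otherwise, so that a scale-up rule can still trigger.
    """
    if total == 0:
        return 100.0 if waiting_actions > 0 else 0.0
    return used * 100 / total
```

For example, 3 busy executors out of 12 yield 25, while an empty pool with 5 queued actions yields 100.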

com_engflow_re_scheduler_queue_age¶

no unit: Min/max age of queued actions, per pool

Tags

pool: name of the pool ("default" for the default pool)
statistic: "min" (youngest) or "max" (oldest) action in the pool's queue

Details: Reports minimum and maximum age in each executor pool, i.e. how long entries have been waiting. Only schedulers report this metric. Every scheduler reports its own queue lengths. Changes in these values indicate a change in the cluster's throughput.

com_engflow_re_scheduler_queue_size¶

no unit: Number of waiting actions, per pool

Tags

pool: name of the pool ("default" for the default pool)

Details: Indicates the number of actions waiting for execution, per pool, on this scheduler. Only schedulers report this metric. Every scheduler reports its own queue lengths.

com_engflow_re_scheduler_target_desired_instances¶

no unit: Number of desired instances, per pool

Tags

pool: name of the pool ("default" for the default pool)

Details: Indicates the number of desired instances per pool, as reported by the master scheduler. Depending on which scaling method is active, either the scheduler sets the autoscaling group to this size, or the master scheduler marks instances for termination and terminates them to eventually reach this size.

Observability UI metrics¶

com_engflow_observability_ui_app_load¶

no unit: The duration of the initial application load.

Tags

page

com_engflow_observability_ui_caught_error¶

no unit: An error in the web client was manually caught and reported.

Tags

page
error_description

com_engflow_observability_ui_navigation¶

no unit: A single user navigation to a new page.

Tags

page

com_engflow_observability_ui_page_load¶

no unit: Measures the duration elapsed loading the current page.

Tags

page
load_type

com_engflow_observability_ui_page_load_with_data_requests¶

no unit: Measures the duration elapsed loading the current page, plus any page-specific data requests sent after it was initially loaded. This is currently only selectively enabled for some pages.

Tags

page
request_status

com_engflow_observability_ui_rendering_error¶

no unit: An error in the web client was caught by the rendering pipeline and reported.

Tags

page
section

com_engflow_observability_ui_uncaught_error¶

no unit: An uncaught error was thrown in the web client.

Tags

page
error_name

BES Replay Metrics¶

com_engflow_bes_replay_cpu_time¶

no unit: The amount of CPU time spent replaying and reducing build event streams, measured in milliseconds.

Tags

status
replay_type

com_engflow_bes_replay_cpu_time_for_event¶

no unit: The amount of CPU time spent replaying and reducing a single event within a build event stream, measured in milliseconds.

Tags

status
replay_type
event_type

ResultStore metrics¶

com_engflow_resultstore_reduce_bes_completed_duration_since_finish_event¶

no unit: This is a distribution. Tracks how much time passed between receiving an invocation's finish event, and completing the BES reduction.

Tags

status: Whether the BES reduction finished successfully.
replay_type

com_engflow_resultstore_reduce_bes_replay_removed_from_cache_count¶

no unit: The number of BES replays that were removed from the replay cache.

Tags

status: The status of the replay when it was removed from the cache.
replay_type

com_engflow_resultstore_reduce_bes_replay_source_count¶

no unit: The number of BES replays requested, tagged by where the data was fetched from, and the type of replay.

Tags

source: Indicates where the data was fetched from.
replay_type

Storage implementation metrics¶

com_engflow_storage_gc_gc_window_seconds¶

no unit: The storage service's GC window.

Tags

name: name of the storage service

com_engflow_storage_read_size¶

no unit: Number of file bytes sent to the client for a read request. May be smaller than the file size in case of error or partial read. Only recorded if the file was found.

Tags

name: name of the storage service
status: op result

com_engflow_storage_read_time_per_gb¶

no unit: Time taken per 1 billion bytes (1 GB) to download a file from storage.

Tags

name: name of the storage service
status: op result
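
One plausible reading of this normalization is the elapsed transfer time scaled to 1 billion bytes. The helper name and the choice of seconds are assumptions, since this reference does not state the time unit or the exact computation.

```python
BYTES_PER_GB = 1_000_000_000  # 1 GB as defined above: 1 billion bytes


def time_per_gb(elapsed_seconds, num_bytes):
    """Scale the time of one transfer to what 1 GB would take at that rate."""
    if num_bytes <= 0:
        raise ValueError("num_bytes must be positive")
    return elapsed_seconds * BYTES_PER_GB / num_bytes
```

Under this assumption, a 500 MB read that took 2 seconds is recorded as 4 seconds per GB, which makes transfers of different sizes comparable.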

com_engflow_storage_read_time_to_first_byte¶

no unit: Time taken between initiating a download and receiving the first byte.

Tags

name: name of the storage service
status: op result

com_engflow_storage_read_time_to_next_chunk¶

no unit: Time taken between being notified that the client is ready and sending the next response. May be recorded zero or multiple times for the same call, depending on control flow events.

Tags

name: name of the storage service
status: op result

com_engflow_storage_write_size¶

no unit: Number of file bytes received from the client for a write request. May be smaller than the file size in case of error.

Tags

name: name of the storage service
status: op result

com_engflow_storage_write_time_per_gb¶

no unit: Time taken per 1 billion bytes (1 GB) to upload a file to storage.

Tags

name: name of the storage service
status: op result

com_engflow_storage_write_time_to_commit¶

no unit: Time between being notified the write is complete and committing the write.

Tags

name: name of the storage service
status: op result

Integration metrics¶

com_engflow_integration_process_duration¶

no unit: Reports how long it took in milliseconds to process an event sent to a third party integration.

Tags

integration: The name of the service we are integrating with.
status: The status of trying to send an event to that integration.

NotificationQueue metrics¶

com_engflow_notificationqueue_dequeue_latency¶

no unit: This is a distribution. Refers to the time passed between the creation of a notification and its dequeuing.

Tags

expired: Whether the notification was expired and discarded.
name: The name of the queue.

com_engflow_notificationqueue_head_age¶

no unit: The age of the first notification in the queue, i.e. the time passed since the notification was created. Notably, it does NOT reflect how much time has passed since the notification was last published. For example, if a notification is not acknowledged and published anew, the head age may be disproportionately high compared to the age of the next notifications in the queue. This can lead to acceptable, occasional spikes.

Tags

name: The name of the queue.
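
The difference between creation time and publish time can be illustrated with a toy model. The `Notification` class, its fields, and the `head_age` function are hypothetical and only mirror the description above; returning 0 for an empty queue is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class Notification:
    created_at: float    # seconds; set once when the notification is created
    published_at: float  # seconds; updated every time it is (re)published


def head_age(queue, now):
    """Age of the first notification, measured from its creation time."""
    if not queue:
        return 0.0
    return now - queue[0].created_at


# The head was created at t=0 and republished at t=90 after a missed ack.
queue = [Notification(0.0, 90.0), Notification(95.0, 95.0)]
age = head_age(queue, now=100.0)
```

The head age is 100 seconds even though the notification was last published 10 seconds ago and the next notification is only 5 seconds old, which is the kind of spike described above.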

com_engflow_notificationqueue_publish¶

no unit: This is a distribution. Refers to the time needed to publish a notification.

Tags

status: The status of the operation.
name: The name of the queue.

com_engflow_notificationqueue_size¶

no unit: The approximate size of the queue.

Tags

name: The name of the queue.

Action execution¶

com_engflow_re_exec_completed_actions¶

no unit: Number of actions that ran to completion, grouped by exit code

Tags

exit_code: the action's exit code

Details: This metric reflects the rate of change. Each measurement indicates how many actions completed on this worker, in all pools combined, since the last time this metric was reported.

Only workers report this metric. All workers report their own values. We recommend grouping by exit_code=0 and exit_code!=0, and summing up the time series in the groups. This yields the rate of successful and unsuccessful action completion across the cluster.
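
The recommended grouping can be sketched like this, assuming the samples of one reporting interval are available as `(worker, exit_code, count)` tuples. The tuple layout, the worker names, and the helper name are hypothetical.

```python
def completion_by_outcome(samples):
    """Sum completed-action counts into exit_code=0 and exit_code!=0 groups."""
    totals = {"success": 0, "failure": 0}
    for _worker, exit_code, count in samples:
        # Tag values are typically strings, so compare on the string form.
        key = "success" if str(exit_code) == "0" else "failure"
        totals[key] += count
    return totals


interval = [
    ("worker-a", "0", 10),
    ("worker-b", "0", 5),
    ("worker-a", "1", 2),
    ("worker-b", "137", 1),
]
rates = completion_by_outcome(interval)
```

Because each measurement already covers only the time since the previous report, the two sums are per-interval completion counts for the whole cluster.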

com_engflow_re_exec_completed_actions_per_pool¶

no unit: Number of executed actions (not cached), grouped by pool and status

Tags

pool: name of the pool (_default_ for the default pool)
status: the action's status (ExecutionStatus: SUCCESS, NON_ZERO_EXIT, CLIENT_ERROR, ERROR)

Details: This metric reflects the rate of change. Each measurement indicates how many actions completed on this worker, per pool, since the last time this metric was reported.

Only workers report this metric. All workers report their own values. We recommend grouping by status, and summing up the time series in the groups. This yields the rate of successful and unsuccessful action completion across the pool/cluster.

com_engflow_re_exec_execution_latency¶

no unit: Bucketed latency (ms), grouped by pool and execution stage

Tags

pool: name of the pool (_default_ for the default pool)
stage: the action's stage (ExecutionStage: QUEUED, DOWNLOAD_INPUTS, EXECUTE_ACTION, UPLOAD_OUTPUTS, EXECUTOR_TOTAL)

com_engflow_re_exec_execution_latency_by_pool_group¶

no unit: Bucketed latency (ms), grouped by pool group and execution stage

Tags

pool_group: name of the pool group (_default_ for the default pool)
stage: the action's stage (ExecutionStage: QUEUED, DOWNLOAD_INPUTS, EXECUTE_ACTION, UPLOAD_OUTPUTS, EXECUTOR_TOTAL)

com_engflow_re_exec_executors_existing¶

no unit: Total number of executors on this worker, in all pools combined
Details: Only workers report this metric. All workers report their own values. You should sum up the time series to get the total number of executors in the cluster.

com_engflow_re_exec_executors_existing_per_pool¶

no unit: Total number of executors on this worker, per pool

Tags

pool: name of the pool (_default_ for the default pool)

Details: Only workers report this metric. All workers report their own values. You should sum up the time series to get the total number of executors in the pool/cluster.

com_engflow_re_exec_max_rss_kib¶

no unit: Reported MaxRSS (maximal resident set size) of successfully executed actions (not cached), grouped by pool

Tags

pool: name of the pool (_default_ for the default pool)

Details: This metric is a distribution. Each measurement indicates approximately how much memory (in KiB) a successful action (ExecutionStatus: SUCCESS) on this worker reportedly used.

Only workers report this metric. All workers report their own values.

com_engflow_re_exec_started_actions_per_pool¶

no unit: Number of started actions, grouped by pool

Tags

pool: name of the pool (_default_ for the default pool)

Details: This metric reflects the rate of change. Each measurement indicates how many actions started on this worker, per pool, since the last time this metric was reported.

Only workers report this metric. All workers report their own values.

com_engflow_re_exec_used_executors¶

no unit: Number of busy executors, in all pools
Details: Only workers report this metric. All workers report their own values. You should sum up the time series to get the total number of busy executors in the cluster.

com_engflow_re_exec_used_executors_per_pool¶

no unit: Number of busy executors, per pool

Tags

pool: name of the pool (_default_ for the default pool)

Details: Only workers report this metric. All workers report their own values. You should sum up the time series to get the total number of busy executors in the pool/cluster.
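
A minimal sketch of the per-pool summation, assuming the latest samples are available as `(worker, pool, value)` tuples. The tuple layout, the worker names, the `gpu` pool, and the helper name are hypothetical.

```python
from collections import defaultdict


def sum_by_pool(samples):
    """Sum the latest per-worker values into one total per pool."""
    totals = defaultdict(int)
    for _worker, pool, value in samples:
        totals[pool] += value
    return dict(totals)


busy = sum_by_pool([
    ("worker-a", "_default_", 3),
    ("worker-b", "_default_", 1),
    ("worker-a", "gpu", 2),
])
cluster_busy = sum(busy.values())
```

Summing the per-pool totals once more gives the cluster-wide number of busy executors.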

Hazelcast monitoring¶

com_engflow_re_hazelcast_is_master¶

no unit: Whether a machine is a cluster master; if this sums up to more than one (with the same name), then the cluster is unhealthy.

Tags

name: name of the Hazelcast cluster.
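
The health rule above can be sketched as a check over the latest samples, assuming they are available as `(cluster_name, is_master_value)` pairs, one per machine. The pair layout, the cluster names, and the helper name are hypothetical.

```python
from collections import Counter


def unhealthy_clusters(samples):
    """Return the names of clusters whose is_master values sum to more than one."""
    masters = Counter()
    for name, value in samples:
        masters[name] += value
    return sorted(name for name, total in masters.items() if total > 1)


latest = [
    ("scheduler-cluster", 1),
    ("scheduler-cluster", 0),
    ("cas-cluster", 1),
    ("cas-cluster", 1),  # a second master with the same name
]
split = unhealthy_clusters(latest)
```

Any name in the result identifies a cluster where more than one machine considers itself the master.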

com_engflow_re_hazelcast_map_entries¶

no unit: The number of entries in the Hazelcast map

Tags

cluster_name: name of the Hazelcast cluster
map_name: name of the Hazelcast map

com_engflow_re_hazelcast_map_memory_used¶

no unit: The amount of memory used for the map

Tags

cluster_name: name of the Hazelcast cluster
map_name: name of the Hazelcast map

com_engflow_re_hazelcast_member_count¶

no unit: The number of members in the cluster; only the master reports this value

Tags

name: name of the Hazelcast cluster.

com_engflow_re_hazelcast_op_time¶

no unit: Distribution of operation time

Tags

name: name of the distributed hash map
status: op result

com_engflow_thirdparty_hazelcast_partition_migration_finished¶

no unit: The number of finished Hazelcast partition migrations.

Tags

name: name of the Hazelcast cluster

com_engflow_thirdparty_hazelcast_partition_migration_started¶

no unit: The number of started Hazelcast partition migrations.

Tags

name: name of the Hazelcast cluster

com_engflow_thirdparty_hazelcast_partition_migration_time¶

no unit: Time of Hazelcast partition migrations, per Hazelcast cluster.

Tags

name: name of the Hazelcast cluster

Details: Reports the time of Hazelcast partition migrations.

com_engflow_thirdparty_hazelcast_replica_migration¶

no unit: The number of Hazelcast replica migrations.

Tags

name: name of the Hazelcast cluster
status: status of the operation (OK or FAILURE)

Uncaught exceptions¶

com_engflow_re_uncaught_exceptions¶

no unit: Fires every time there is an uncaught exception

Action queue metrics¶

com_engflow_remoteexecution_queue_age¶

no unit: Reports age of queued actions in each executor pool (i.e. how long entries have been waiting) - bucketed by priority.

Tags

pool: The name of the pool (_default_ for the default pool).
priority: The (numeric) priority of the action.

com_engflow_remoteexecution_queue_enqueued¶

no unit: Reports the number of actions enqueued - bucketed by priority.

Tags

pool: The name of the pool (_default_ for the default pool).
priority: The (numeric) priority of the action.

CAS server metrics¶

com_engflow_re_cas_missing_digests¶

no unit: The total number of missing digests seen by findMissingBlobs.

com_engflow_re_cas_requested_digests¶

no unit: The total number of digests requested by findMissingBlobs calls.
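
The two CAS server counters can be combined into a missing-digest ratio. The sketch below is illustrative only: it assumes both values were read over the same time window, and the function name is hypothetical.

```python
def missing_digest_ratio(missing, requested):
    """Fraction of requested digests that were reported missing.

    `missing` comes from com_engflow_re_cas_missing_digests and `requested`
    from com_engflow_re_cas_requested_digests, both assumed to cover the
    same time window.
    """
    if requested == 0:
        # No digests were requested, so there is nothing to be missing.
        return 0.0
    return missing / requested


ratio = missing_digest_ratio(25, 100)
```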

Remote Execution metrics¶

com_engflow_remoteexecution_queue_time¶

no unit: This is a distribution. Refers to the time actions are queued.

Tags

pool: The name of the pool.

Details: This is a distribution. Refers to the time actions are queued.

Invocation index monitoring¶

com_engflow_resultstore_index_sql_invocation_index_database_queue_size¶

no unit: All enqueued or in-progress invocation index database operations
Details: Reflects the number of incomplete operations (either queued or being worked on).

Every instance reports this metric. Every instance reports its own stats.

CAS usage¶

com_engflow_re_cas_check_blob_exists¶

no unit: Distribution of time needed to check whether a blob exists

Tags

status: operation result, e.g. OK, NOT_FOUND

Details: This is a distribution. Refers to the time needed to check whether a blob exists.

com_engflow_re_cas_fetch_call_time¶

no unit: Distribution of CAS fetch operation time

Tags

source: name of the CAS location, e.g. EXTERNAL_STORAGE, DISTRIBUTED_CAS_NEAR
status: op result, e.g. OK, UNAVAILABLE

Details: The time distribution of individual CAS download calls; each call is measured independently, including when falling back between different sources.

com_engflow_re_cas_fetch_retries¶

no unit: Count of retries needed when fetching a CAS blob.
Details: Count of retries needed when fetching a CAS blob. Incremented after each failure, so 0 indicates a fetch without errors.

com_engflow_re_cas_find_replicas_time¶

no unit: Distribution of time needed to find which instances have copies of a file
Details: Distribution of time needed to find which instances have copies of a file.

com_engflow_re_cas_load_shed_errors¶

no unit: Count of RESOURCE_EXHAUSTED errors returned by workers for CAS requests due to load shedding.

Tags

method: method returning the error, e.g., read

Details: Count of RESOURCE_EXHAUSTED errors returned by workers for CAS requests due to load shedding.

com_engflow_re_cas_remote_check_blob_exists¶

no unit: Distribution of time needed to check whether a blob exists

Tags

status: operation result, e.g. OK, NOT_FOUND
source: address used to check for the CAS blob.

Details: This is a distribution. Refers to the time needed to check whether a blob exists.

com_engflow_re_cas_requests_in_flight_incoming¶

no unit: Number of currently open incoming cache requests, by method and pool.

Tags

method: read or write
pool: name of the pool serving the request

Details: Number of currently open incoming cache requests, by method and pool.

com_engflow_re_cas_requests_in_flight_outgoing¶

no unit: Number of currently open outgoing cache requests, by method and pool.

Tags

method: read or write
pool: name of the pool originating the request

Details: Number of currently open outgoing cache requests, by method and pool. Includes both distributed CAS (ByteStream) and external storage.

com_engflow_re_cas_requests_served¶

no unit: Number of CAS requests served, by method, pool, and status

Tags

method: read or write
pool: name of the pool serving the request
status: result, for example, OK, RESOURCE_EXHAUSTED

Details: Number of CAS requests served, by method, pool, and status.

com_engflow_re_cas_time_to_next_message¶

no unit: Estimated number of milliseconds to the next gRPC message.

Tags

pool: name of the pool serving the request

Details: Estimated number of milliseconds to the next gRPC message.

Local CAS usage¶

com_engflow_re_cas_available_replica_space¶

no unit: Available storage space in the CAS that can be used for replicas
Details: Only workers report this metric. All workers report their own values.

com_engflow_re_cas_available_space¶

no unit: Available storage space in the CAS
Details: Only workers report this metric. All workers report their own values.

com_engflow_re_cas_free_time¶

no unit: Distribution of time needed to free space in the CAS
Details: This is a distribution. It refers to the deletion of expired replicas.

Only workers report this metric. All workers report their own values.

com_engflow_re_cas_freed_bytes¶

no unit: Bytes evicted from the local CAS in order to free space
Details: Bytes evicted from the local CAS in order to free space.

com_engflow_re_cas_gc_time¶

no unit: Distribution of time needed for the GC
Details: This is a distribution. It refers to the collection of expired replicas.

Only workers report this metric. All workers report their own values.

com_engflow_re_cas_lost_files_count¶

no unit: The number of files that were lost from the CAS
Details: The number of files that were deleted by some other process, or that the CAS instance detected no longer matched the expected digest.

Only workers report this metric. All workers report their own values.

com_engflow_re_cas_max_total_replica_size¶

no unit: The max total replica size
Details: This is the maximum amount of storage space the CAS is allowed to use for replicas.

Only workers report this metric. All workers report their own values.

com_engflow_re_cas_max_total_size¶

no unit: The max total CAS size on the node
Details: This is the maximum amount of storage space the CAS is allowed to use.

Only workers report this metric. All workers report their own values.

com_engflow_re_cas_replica_bytes¶

no unit: Total bytes used in the local CAS by replica files
Details: Only workers report this metric. All workers report their own values.

com_engflow_re_cas_total_size¶

no unit: Total bytes used by local CAS files
Details: Only workers report this metric. All workers report their own values.
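
Because the local CAS metrics are reported per worker, a per-worker fill level can be derived from com_engflow_re_cas_total_size and com_engflow_re_cas_max_total_size. The sketch below assumes both values are expressed in the same unit (bytes) and that they have been collected into a dictionary keyed by worker; the data layout and function name are hypothetical.

```python
def cas_fill_levels(per_worker):
    """Compute used/maximum CAS space for each worker.

    `per_worker` maps a worker name to a (total_size, max_total_size) pair,
    both assumed to be in bytes.
    """
    levels = {}
    for worker, (total_size, max_total_size) in per_worker.items():
        if max_total_size > 0:
            levels[worker] = total_size / max_total_size
    return levels


# Hypothetical values for two workers.
levels = cas_fill_levels({
    "worker-1": (50, 200),
    "worker-2": (150, 200),
})
```

The same calculation applies to com_engflow_re_cas_replica_bytes and com_engflow_re_cas_max_total_replica_size, under the same assumption about units.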

Client authorization¶

com_engflow_re_auth_async_duration¶

no unit: Authentication call duration
Details: This is a distribution. Only schedulers report this metric. Every scheduler reports its own stats.

SecretStore metrics¶

com_engflow_secretstore_operation_duration_seconds¶

no unit: Time taken to perform an operation on the secret store

Tags

store_type: The implementation of the instrumented secret store
operation: The secret store operation reported ('read', etc.)
status: The gRPC status code string, in SCREAMING_SNAKE_CASE

Licensing metrics¶

com_engflow_licensing_license_server_fetch_result¶

no unit: The result of attempted license renewals using the MyEngFlow License Server.

Tags

status

Docker use¶

com_engflow_re_exec_docker_container_lifecycle_events_count¶

no unit: Count of various lifecycle events relating to containers

Tags

pool
container_lifecycle_event_name

com_engflow_re_exec_docker_container_shutdown_time¶

no unit: The time needed to shut down a docker container

com_engflow_re_exec_docker_container_startup_time¶

no unit: The time needed to start a docker container

Tags

status: result of the operation, e.g. "OK", "FAILED"

com_engflow_re_exec_docker_containers_failed¶

no unit: The number of docker containers that failed

com_engflow_re_exec_docker_docker_proxy_failure_count¶

no unit: The number of times pulls through the docker proxy failed

com_engflow_re_exec_docker_image_pull_time¶

no unit: The time needed to pull a docker image

Tags

status: result of the operation, e.g. "OK", "FAILED"

com_engflow_re_exec_docker_image_warm_time¶

no unit: The time needed to warm a docker image after it has been pulled

Tags

status: result of the operation, e.g. "OK", "FAILED"

com_engflow_re_exec_docker_network_create_time¶

no unit: The time needed to create a docker network

Tags

status: result of the operation, e.g. "OK", "FAILED"

com_engflow_re_exec_docker_network_destroy_time¶

no unit: The time needed to destroy a docker network

Tags

status: result of the operation, e.g. "OK", "FAILED"

com_engflow_re_exec_docker_sibling_container_enabled¶

no unit: Counts times that sibling container access was requested via platform properties

Tags

pool

Persistent worker use¶

com_engflow_re_exec_worker_actions¶

no unit: The number of persistent worker actions run

Tags

reuse_status: new or reused
pool: name of the pool

Details: The number of persistent worker actions run, aggregated by whether or not they reused a previous persistent worker process.
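
One way to use the reuse_status tag is to compute the share of actions that reused an existing persistent worker process. This is a minimal sketch, assuming the counts per reuse_status have already been read for one pool; the dictionary layout and function name are hypothetical.

```python
def worker_reuse_ratio(counts):
    """Share of persistent worker actions that reused a worker process.

    `counts` maps the reuse_status tag value ("new" or "reused") to the
    number of actions observed for a single pool.
    """
    new = counts.get("new", 0)
    reused = counts.get("reused", 0)
    total = new + reused
    return reused / total if total else 0.0


reuse = worker_reuse_ratio({"new": 20, "reused": 80})
```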

Scheduler metrics¶

com_engflow_profiling_publish_invocation_event¶

no unit: Timing for publishing invocation events.

com_engflow_re_cas_entries_evicted¶

no unit: The number of CAS entries that were evicted due to memory size limitations

com_engflow_re_cas_entries_lost¶

no unit: The number of CAS entries that could not be recovered on CAS node shutdown events

com_engflow_re_hazelcast_batch_cas_node_remover_time¶

no unit: The duration of each batch-removal of lost CAS nodes from the Hz map opportunistic_cas_location.

Tags

ms

com_engflow_re_profiler_events¶

no unit: The number of server-side profile events recorded.

com_engflow_re_profiler_live_handles¶

no unit: The number of profiles being streamed to the eventstore.

com_engflow_re_remaining_certificate_validity_days¶

no unit: The number of remaining days before certificates expire

Tags

issuer: The issuer (issuer distinguished name) value from the certificate
serial_number: The serial number assigned to the certificate by the issuer

Details: Reports the number of remaining validity days for each X509 certificate processed by schedulers. The issuer and serial number uniquely identify certificates.
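
Since the issuer and serial number identify a certificate, an expiry check can group readings by that pair. The sketch below is an illustration under assumptions: the (issuer, serial_number, remaining_days) tuple layout, the 30-day default threshold, and the function name are all hypothetical.

```python
def expiring_certificates(samples, threshold_days=30):
    """Return certificates with fewer than `threshold_days` of validity left.

    `samples` is an iterable of (issuer, serial_number, remaining_days)
    tuples. The result maps (issuer, serial_number) to the remaining days.
    """
    remaining = {}
    for issuer, serial_number, days in samples:
        key = (issuer, serial_number)
        # Several schedulers may report the same certificate; keep the lowest value.
        remaining[key] = min(days, remaining.get(key, days))
    return {key: days for key, days in remaining.items() if days < threshold_days}


# Hypothetical readings: certificate "01" is reported by two schedulers.
expiring = expiring_certificates([
    ("CN=Example CA", "01", 12),
    ("CN=Example CA", "01", 13),
    ("CN=Example CA", "02", 200),
])
```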

com_engflow_re_remaining_license_time¶

no unit: The number of remaining days before the license expires

Java memory metrics¶

com_engflow_re_java_heap¶

no unit: The amount of heap memory used
Details: Every instance reports this metric. Every instance reports its own stats.

Meta metrics¶

com_engflow_meta_engflow_version¶

no unit: A heartbeat metric that reports the EngFlow build label if present and "missing_version" otherwise.

Tags

version

com_engflow_meta_parallel_engflow_version¶

no unit: A heartbeat metric that reports how many different EngFlow versions are currently registered with the cluster, as indicated by the build label. All instances that do not report a build label are treated as running the same version, which is counted as distinct from all other reported versions. Each scheduler reports its own metric.

Docker daemon¶

com_engflow_docker_container_existing¶

no unit: The number of existing containers.

Tags

daemon: The docker daemon this metric is for (e.g., host).
pool: The pool this metric is for (e.g., default or macos).
state: The state of the container (e.g., running, exited, ...).

com_engflow_docker_container_size¶

no unit: The size distribution of existing docker containers.

Tags

daemon: The docker daemon this metric is for (e.g., host).
pool: The pool this metric is for (e.g., default or macos).
scope: The scope of the value (container if the value is the size of a single container, and daemon if the value is the sum of all containers known to the docker daemon).
filesystem: The filesystem the value applies to (overlay if the value is the size of files modified by the container, and root if the value is the total size of the container including the image).

com_engflow_docker_image_size¶

no unit: The size distribution of existing docker images.

Tags

daemon: The docker daemon this metric is for (e.g., host).
pool: The pool this metric is for (e.g., default or macos).
scope: The scope of the value (image if the value is the size of a single image, and daemon if the value is the sum of all images known to the docker daemon).

DB Connection Pool usage¶

com_engflow_resultstore_index_db_cp_active_connections¶

no unit: The number of active connections in the pool

Tags

db_connection_pool_name

com_engflow_resultstore_index_db_cp_connection_acquire_time¶

no unit: The time it takes for the connection pool to acquire a DB connection

Tags

db_connection_pool_name

com_engflow_resultstore_index_db_cp_connection_create_time¶

no unit: The time it takes for the connection pool to create a new DB connection

Tags

db_connection_pool_name

com_engflow_resultstore_index_db_cp_connection_timeout_count¶

no unit: The count of timed-out connections

Tags

db_connection_pool_name

com_engflow_resultstore_index_db_cp_connection_usage_time¶

no unit: The duration of a use of a connection given by the connection pool

Tags

db_connection_pool_name

com_engflow_resultstore_index_db_cp_idle_connections¶

no unit: The number of idle connections in the pool

Tags

db_connection_pool_name

com_engflow_resultstore_index_db_cp_max_connections¶

no unit: Maximum number of connections existing in the pool

Tags

db_connection_pool_name

com_engflow_resultstore_index_db_cp_min_connections¶

no unit: Minimum number of connections existing in the pool

Tags

db_connection_pool_name

com_engflow_resultstore_index_db_cp_pending_connections¶

no unit: The number of pending connections in the pool

Tags

db_connection_pool_name

com_engflow_resultstore_index_db_cp_total_connections¶

no unit: The number of all currently existing connections in the pool

Tags

db_connection_pool_name

Caffeine cache metrics¶

com_engflow_caching_inmemory_evicts¶

no unit: Number of cache evictions tagged with the reason for eviction.

Tags

name: The name of the cache
reason: The reason for eviction

com_engflow_caching_inmemory_hits¶

no unit: Number of cache hits.

Tags

name: The name of the cache

com_engflow_caching_inmemory_loads¶

no unit: Timing and status information for cache loads.

Tags

name: The name of the cache
status: The status of the load

com_engflow_caching_inmemory_misses¶

no unit: Number of cache misses.

Tags

name: The name of the cache
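
The hit and miss counters can be combined into a hit ratio per cache. This is a sketch only: it assumes the two counters were read for the same cache name over the same time window, and the function name is hypothetical.

```python
def cache_hit_ratio(hits, misses):
    """Fraction of lookups served from the cache.

    `hits` comes from com_engflow_caching_inmemory_hits and `misses` from
    com_engflow_caching_inmemory_misses, for the same cache name.
    """
    lookups = hits + misses
    if lookups == 0:
        # No lookups were recorded, so a ratio is not meaningful; report 0.0.
        return 0.0
    return hits / lookups


hit_ratio = cache_hit_ratio(hits=90, misses=10)
```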

Credential management metrics¶

com_engflow_re_auth_credential_revocation_fetch_duration¶

no unit: How long it takes to fetch or recalculate the revocation list.

Tags

status: fetch/calculation status

Details: The time it takes for the credential manager to fetch/calculate the credential revocation list.

DB Query stats¶

com_engflow_resultstore_index_duration¶

no unit: The duration of a query

Tags

query_name
query_outcome

com_engflow_resultstore_index_preparation¶

no unit: The duration of creating a preparedQuery

Tags

query_name

Outgoing CI API Calls¶

com_engflow_ci_http_api_call_latency¶

no unit: Time taken for an outgoing API call, indexed by hostname and status

Tags

hostname: hostname of the API service that was called
status: HTTP status of the call (1xx, 2xx, 3xx, 4xx, 5xx, or UNKNOWN)

Details: Indicates the number and latency of API calls made on this scheduler, per service and status. The status only contains the abbreviated HTTP status code. Only schedulers report this metric. Every scheduler reports its own API calls.

CI runner¶

com_engflow_ci_basic_bentos_prepared¶

no unit: Number of Bento runs scheduled, split out by cold/warm reason

Tags

ci_family: name of the CI system (buildkite|github_actions)
snapshot_usage: whether a Bento was used, and if so, whether a snapshot was used too (none|unknown_bento|want_cold|no_known_snapshot|missing_snapshot|warm)

com_engflow_ci_basic_gh_error_propagation_jobs¶

no unit: Number of finished GitHub error propagation jobs

Tags

status: the status of the error propagation job (unknown|success|non_zero_exit|client_error|error)

Details: Number of finished GitHub error propagation jobs.

com_engflow_ci_basic_gh_runner_no_jobs¶

no unit: Number of finished GitHub runs that timed out and did not pick up any jobs.
Details: Number of finished GitHub runs that timed out and did not pick up any jobs.

com_engflow_ci_basic_gh_runner_wrong_job¶

no unit: Number of finished GitHub runs that picked up the wrong job.
Details: Number of finished GitHub runs that picked up the wrong job.

com_engflow_ci_basic_git_command_duration¶

no unit: The duration of individual git commands during a CI job

Tags

git_command: the git command (checkout|fetch|index-pack)
ci_family: name of the CI system (buildkite|github_actions)
runner_architecture: architecture of the requested runner for a given job
runner_os: OS of the requested runner for a given job

Details: Measures the duration of various git subcommands during a job's execution

com_engflow_ci_basic_job_duration¶

no unit: The duration of CI jobs

Tags

ci_family: name of the CI system (buildkite|github_actions)
runner_architecture: architecture of the requested runner for a given job
runner_os: OS of the requested runner for a given job

Details: Measures the duration of a job's execution.

com_engflow_ci_basic_jobs_completed¶

no unit: Number of jobs completed

Tags

ci_family: name of the CI system (buildkite|github_actions)
status: job completion status (unknown|success|non_zero_exit|client_error|error)

Details: The number of completed CI jobs, by status. This includes jobs for which we failed before starting the remote runner.

com_engflow_ci_basic_jobs_queue_age¶

no unit: Age of queued jobs

Tags

ci_family: name of the CI system (buildkite|github_actions)
statistic: min/max queue age (min|max)

Details: The age of queued jobs on GitHub/BuildKite. This metric is only reported if polling is enabled, and only for jobs that are configured for EngFlow CI runners. The value is how long ago the job was created, relative to the current poll time.

com_engflow_ci_basic_jobs_queued¶

no unit: Number of queued jobs

Tags

ci_family: name of the CI system (buildkite|github_actions)

Details: The number of queued jobs on GitHub/BuildKite. This metric is only reported if polling is enabled.

com_engflow_ci_basic_jobs_started¶

no unit: Number of jobs started

Tags

ci_family: name of the CI system (buildkite|github_actions)

Details: A job counts as "started" if the scheduler began executing it, i.e. the scheduler started fetching the runner agent, even if later the agent fails to start or to obtain a job from the CI system.

com_engflow_ci_basic_poll_duration_millis¶

no unit: The duration of polls against the remote CI system

Tags

ci_family: name of the CI system (buildkite|github_actions)
status: the status of the polling job (OK|Failure)

Details: Measures the duration of polls against the remote CI system

com_engflow_ci_basic_time_to_start_job¶

no unit: How long it takes for CI runners to start a job

Tags

ci_family: name of the CI system (buildkite|github_actions)
runner_architecture: architecture of the requested runner for a given job
runner_os: OS of the requested runner for a given job

Details: Measures the time between a given CI system requesting a job's execution and an action actually starting it on CI runners.

Metrics about logs¶

com_engflow_telemetry_logging_backend_error¶

no unit: The number of errors handling logs in the logging backend.

com_engflow_telemetry_logging_report_time¶

no unit: The duration of reporting a log statement to the logging backend, bucketed by their severity.

Tags

severity

gRPC factory metrics¶

com_engflow_grpc_factory_channels¶

no unit: The number of open gRPC channels in the channel factory

CI runner - GitHub Actions metrics¶

com_engflow_ci_github_metrics_api_primary_rate_limit_max¶

no unit: Maximum number of requests available in the current API rate limiter slot

Tags

API_RESOURCE_NAME

com_engflow_ci_github_metrics_api_primary_rate_limit_used¶

no unit: Number of requests used in the current API rate limiter slot

Tags

API_RESOURCE_NAME
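
The two rate limit metrics can be combined to see how much of the current slot is left. The sketch below is illustrative: it assumes both values were read for the same API_RESOURCE_NAME at the same time, and the function name is hypothetical.

```python
def rate_limit_headroom(limit_max, used):
    """Return (remaining requests, fraction used) for one API resource.

    `limit_max` comes from com_engflow_ci_github_metrics_api_primary_rate_limit_max
    and `used` from com_engflow_ci_github_metrics_api_primary_rate_limit_used.
    """
    remaining = max(limit_max - used, 0)
    fraction_used = used / limit_max if limit_max else 0.0
    return remaining, fraction_used


remaining, fraction_used = rate_limit_headroom(limit_max=5000, used=1250)
```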

RPC metrics¶

com_engflow_rpc_duration¶

no unit: Time distribution of RPCs

Tags

type: The type of the RPC (e.g., client or server)
protocol: The protocol of the RPC (e.g., rest)
method: The RPC method
status: The status of the operation (e.g., OK)