Metrics Reference

Description of all metrics available for monitoring.

AnalyzeInvocation metrics


com_engflow_invocationanalyzer_bazel_profile_count

no unit
1

Tags

  • status: The status of retrieving the Bazel profile.
Details

The number of Bazel profiles that were attempted to be fetched.


com_engflow_invocationanalyzer_bazel_profile_size

no unit
By
Details

The size of the uncompressed Bazel profile handled.


com_engflow_invocationanalyzer_engflow_profile_count

no unit
1

Tags

  • status: The status of retrieving the EngFlow profile.
Details

The number of EngFlow profiles that were attempted to be fetched.


com_engflow_invocationanalyzer_engflow_profile_size

no unit
By
Details

The size of the uncompressed EngFlow profile handled.


com_engflow_invocationanalyzer_time_needed

no unit
ms

Tags

  • status: The status of the analysis performed
Details

The time distribution of handling individual Bazel profiles.


OperationController metrics reporter


com_engflow_operationcontroller_active

no unit
The number of active operations (i.e., currently running).

Tags

  • name: The name of the OperationController


com_engflow_operationcontroller_latency

no unit
The latency to start operations (i.e., how long operations are waiting to be executed).

Tags

  • name: The name of the OperationController


com_engflow_operationcontroller_queued

no unit
The number of operations queued for execution.

Tags

  • name: The name of the OperationController


com_engflow_operationcontroller_runtime

no unit
The runtime of operations (i.e., the duration operations are running for).

Tags

  • name: The name of the OperationController


Metrics derived from raw BEP streams


com_engflow_bep_invocation_completed

no unit
Fired with the count of completed invocations reported to the BEP.

Tags

  • exit_code: The human readable exit code of the invocation.

com_engflow_bep_invocation_duration

no unit
Fired when an invocation completes, with the average duration of the invocation.

com_engflow_bep_invocation_started

no unit
Fired with the count of newly started invocations reported to the BEP.

Blob-storage implementation metrics


com_engflow_blobstore_latency

no unit
The duration each operation takes.

Tags

  • operation

  • status


com_engflow_blobstore_ops

no unit
Fires every time an operation takes place.

Tags

  • operation

Docker proxy


com_engflow_dockerproxy_blob_upload_bytes

no unit
The size of Docker blobs that the proxy successfully uploaded to the CAS.

Tags

  • status

com_engflow_dockerproxy_cache_hit_bytes

no unit
The size of Docker blobs that the proxy could find in the CAS.

com_engflow_dockerproxy_cache_miss_bytes

no unit
The size of Docker blobs that the proxy expected but could not find in the CAS.

com_engflow_dockerproxy_known_blobs

no unit
The number of Docker blobs that the proxy has metadata about.

HTTP clients for the Docker proxy


com_engflow_dockerproxy_http_received_bytes

no unit
Bytes received over the HTTP client

Tags

  • client: The name of the HTTP client (can be used to distinguish layers)

com_engflow_dockerproxy_http_request_latency_seconds

no unit
Time taken to serve HTTP requests

Tags

  • client: The name of the HTTP client (can be used to distinguish layers)

  • status: The status code, reduced to 1xx..5xx, or FAILED if an exception occurred
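
The reduction of status codes described for the status tag can be sketched as follows. This is a minimal illustration, not EngFlow's implementation: the function name `status_class` is hypothetical, `None` is used here to model a request that raised an exception, and mapping out-of-range codes to FAILED is an assumption.

```python
from typing import Optional


def status_class(code: Optional[int]) -> str:
    """Reduce an HTTP status code to its class (1xx..5xx).

    Hypothetical helper: None models a request that ended in an exception.
    Treating codes outside 100..599 as FAILED is an assumption of this sketch.
    """
    if code is None or not 100 <= code <= 599:
        return "FAILED"
    return f"{code // 100}xx"
```

Grouping by class keeps the number of distinct tag values small, which is usually what a latency dashboard needs.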


com_engflow_dockerproxy_http_requests

no unit
Number of HTTP requests started on the HTTP client

Tags

  • client: The name of the HTTP client (can be used to distinguish layers)

  • method: The HTTP method of the request


BEP Event Storage and Replay


com_engflow_eventstore_bep_event_ack_latency

no unit
This is a distribution. Tracks how much time passed between receiving a build event and sending an acknowledgement to the client.

Tags

  • status

com_engflow_eventstore_bes_upload_delay

no unit
This is a distribution. Tracks how much longer an invocation's BES upload took compared to the invocation's duration as reported by the BES.

Tags

  • status

com_engflow_eventstore_build_event_owners

no unit
The total number of build owners residing in memory. A build owner is an internal piece of state associated with a particular build reporting build events.

com_engflow_eventstore_flushing_batches_size

no unit
The total size of complete build event batches that are currently being uploaded to storage. Normally, batches are flushed quickly, so this value should stay near zero; if it doesn't, that could mean we are falling behind with batch uploads. Every instance reports its own stats; sum them up to get a cluster-wide metric.

com_engflow_eventstore_grpc_eventstore_ttfb

no unit
This is a distribution. Tracks how much time passed between requesting EventStore data via gRPC, and receiving the first byte.

Tags

  • type

com_engflow_eventstore_inbound_bep_events

no unit
Incremented whenever an event is received on an inbound stream.

Tags

  • type
Details

An inbound stream means a client (e.g. Bazel) sending BES events to a scheduler.

Every instance reports its own stats; sum them up to get a cluster-wide metric.


com_engflow_eventstore_incomplete_batches_size

no unit
The estimated size in bytes it would take to serialize all incomplete build event batches. These batches aren't yet written to storage. Actual JVM heap footprint is likely larger. Every instance reports its own stats; sum them up to get a cluster-wide metric.

com_engflow_eventstore_new_outbound_streams

no unit
Incremented whenever a new outbound BEP stream is created.
Details

An outbound stream means a scheduler replaying a BES stream in order to reduce it to a result stream. It makes no difference if the scheduler is replaying to itself or to another instance.

Every instance reports its own stats; sum them up to get a cluster-wide metric.


com_engflow_eventstore_ongoing_streams

no unit
The total number of streams that are inbound, outbound, or both.
Details

An inbound stream means a client (e.g. Bazel) sending BES events to a scheduler. An outbound stream means a scheduler replaying a BES stream in order to reduce it to a result stream. It makes no difference if the scheduler is replaying to itself or to another instance.

Every instance reports its own stats; sum them up to get a cluster-wide metric.


com_engflow_eventstore_outbound_bep_events

no unit
Incremented whenever an event is sent on an outbound stream.
Details

An outbound stream means a scheduler replaying a BES stream in order to reduce it to a result stream. It makes no difference if the scheduler is replaying to itself or to another instance.

Every instance reports its own stats; sum them up to get a cluster-wide metric.


Virtual Machine Instances


com_engflow_instance_by_pool_total_disk_space

no unit
The size of the volume.

Tags

  • instance_role: type of the instance (scheduler/worker/etc.)

  • volume: the name of the disk volume

  • pool: the name of the pool the instance serves


com_engflow_instance_by_pool_used_disk_percentage

no unit
The percentage of the volume that is currently used.

Tags

  • instance_role: type of the instance (scheduler/worker/etc.)

  • volume: the name of the disk volume

  • pool: the name of the pool the instance serves


com_engflow_instance_by_pool_used_disk_space

no unit
The total number of bytes used on the volume.

Tags

  • instance_role: type of the instance (scheduler/worker/etc.)

  • volume: the name of the disk volume

  • pool: the name of the pool the instance serves


com_engflow_instance_gc_avg_duration

no unit
The average duration spent in garbage collection since the last reported metric.

Tags

  • gc_type

com_engflow_instance_gc_count

no unit
The total number of garbage collections during the lifecycle of this process.

Tags

  • gc_type

com_engflow_instance_gc_time

no unit
The total estimated time in milliseconds performing garbage collection.

Tags

  • gc_type

com_engflow_instance_new_gc_avg_duration

no unit
The average duration spent in garbage collection since the last reported metric.

Tags

  • gc_type: G1 Old generation / G1 Young Generation

  • instance_role: type of the instance (scheduler/worker/etc.)


com_engflow_instance_new_gc_count

no unit
The total number of garbage collections during the lifecycle of this process.

Tags

  • gc_type: G1 Old generation / G1 Young Generation

  • instance_role: type of the instance (scheduler/worker/etc.)


com_engflow_instance_new_gc_time

no unit
The total wall time in milliseconds spent blocked in garbage collection since the start of the process. This measures time when the application is not running due to a collector pause.

Tags

  • gc_type: G1 Old generation / G1 Young Generation

  • instance_role: type of the instance (scheduler/worker/etc.)


com_engflow_instance_new_open_file_descriptors

no unit
The number of file descriptors the process has currently open.

Tags

  • instance_role: type of the instance (scheduler/worker/etc.)

com_engflow_instance_new_spot_instance_reclaim_count

no unit
The number of spot instances that were reclaimed.

Tags

  • instance_role

com_engflow_instance_new_total_disk_space

no unit
The size of the volume.

Tags

  • instance_role: type of the instance (scheduler/worker/etc.)

  • volume: the name of the disk volume


com_engflow_instance_new_total_system_memory

no unit
The total amount of system memory in bytes.

Tags

  • instance_role: type of the instance (scheduler/worker/etc.)

com_engflow_instance_new_used_disk_percentage

no unit
The percentage of the volume that is currently used.

Tags

  • instance_role: type of the instance (scheduler/worker/etc.)

  • volume: the name of the disk volume


com_engflow_instance_new_used_disk_space

no unit
The total number of bytes used on the volume.

Tags

  • instance_role: type of the instance (scheduler/worker/etc.)

  • volume: the name of the disk volume


com_engflow_instance_new_used_process_native_buffer_memory

no unit
The total amount of native buffer memory for this process in bytes.

Tags

  • instance_role: type of the instance (scheduler/worker/etc.)

com_engflow_instance_new_used_system_memory

no unit
The amount of used system memory in bytes.

Tags

  • instance_role: type of the instance (scheduler/worker/etc.)

com_engflow_instance_new_used_system_memory_percentage

no unit
The percentage of system memory used.

Tags

  • instance_role: type of the instance (scheduler/worker/etc.)

com_engflow_instance_total_disk_space

no unit
The size of the volume.

Tags

  • volume

com_engflow_instance_total_system_memory

no unit
The total amount of system memory in bytes.

com_engflow_instance_used_disk_percentage

no unit
The percentage of the volume that is currently used.

Tags

  • volume

com_engflow_instance_used_disk_space

no unit
The total number of bytes used on the volume.

Tags

  • volume

com_engflow_instance_used_system_memory

no unit
The amount of used system memory in bytes.

com_engflow_instance_used_system_memory_percentage

no unit
The percentage of system memory used.

Netty monitoring


com_engflow_thirdparty_netty_used_direct_memory

no unit
Direct (non-heap) memory use

Tags

  • buffer_name

com_engflow_thirdparty_netty_used_heap_memory

no unit
Heap memory use

Tags

  • buffer_name

Worker Control metrics


com_engflow_re_management_workercontrol_approx_mft_induced_idle_executor_duration

no unit
Approximately how much time all executors of all workers marked-for-termination were idle. Each scheduler reports the approximate duration of executor idleness as induced by the scheduler marking workers for termination. Only the master scheduler should report non-zero values. Sum up the idle durations reported by all schedulers to get the overall idleness.

Tags

  • pool: name of the pool ("default" for the default pool)

com_engflow_re_management_workercontrol_mft_on_scheduler

no unit
This is a distribution. Tracks how long the marked-for-termination call lasted on the scheduler.

Tags

  • pool: name of the pool ("default" for the default pool)

  • result: the result of the marked-for-termination call


com_engflow_re_management_workercontrol_mft_on_worker

no unit
This is a distribution. Tracks how long the marked-for-termination call lasted on the worker.

Tags

  • pool: name of the pool ("default" for the default pool)

  • result: the result of the marked-for-termination call


com_engflow_re_management_workercontrol_ongoing_mft_worker_count

no unit
Number of workers currently marked for termination, per pool, as reported by the scheduler.

Tags

  • pool: name of the pool ("default" for the default pool)

Action scheduling


com_engflow_re_scheduler_autoscaler_cluster_size_controller_op

no unit
Per pool, distribution of how long each cluster size controller operation took and what its completion status was.

Tags

  • pool: name of pool ("default" for the default pool)

  • op: the operation performed, "setClusterSize" or "reduceClusterSizeByInstance"

  • status: "succeeded" or "failed"


com_engflow_re_scheduler_autoscaler_set_size_operations

no unit
Per pool, number of attempts to set its autoscaling group's desired size

Tags

  • pool: name of pool ("default" for the default pool)

  • status: "succeeded" or "failed"

Details

Deprecated. Use com.engflow.re.scheduler/autoscaler_cluster_size_controller_op instead.


com_engflow_re_scheduler_available_workers

no unit
Deprecated; number of idle executors, per pool

Tags

  • pool: name of the pool ("default" for the default pool)
Details

Deprecated. Indicates the number of idle executors, per pool, according to this scheduler. Only schedulers report this metric. Every scheduler reports the same (or about the same) value. This metric is deprecated, because it may be imprecise: schedulers that are started while workers are busy may report a higher value than they should for several minutes. We recommend monitoring com.engflow.re.scheduler/existing_executors instead.


com_engflow_re_scheduler_coalesced_executions

no unit
Number of action requests coalesced into an existing execution


com_engflow_re_scheduler_cores_per_executor

no unit
Number of cores per executor, per pool

Tags

  • pool: name of the pool ("default" for the default pool)
Details

Number of cores per executor. Every scheduler reports the same (or about the same) value.


com_engflow_re_scheduler_dequeued_actions

no unit
Per pool, the number of actions that were removed from the queue, either because they started executing on a worker or because they were ejected from the queue for being too old.

Tags

  • pool: name of the pool ("default" for the default pool)

  • reason


com_engflow_re_scheduler_desired_executors

no unit
Number of desired executors, per pool

Tags

  • pool: name of the pool ("default" for the default pool)
Details

Indicates an estimate for the number of required executors per pool. Every scheduler reports its own estimate; sum the estimates to get the total desired pool size.
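
The per-pool summation can be sketched as below. This is only an illustration: the function name and the `(scheduler_id, pool, estimate)` tuple layout are assumptions for this sketch, not an EngFlow data format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def total_desired_executors(
    samples: Iterable[Tuple[str, str, float]],
) -> Dict[str, float]:
    """Sum per-scheduler estimates into one desired size per pool.

    Each sample is (scheduler_id, pool, estimate); this layout is
    hypothetical and chosen only for the example.
    """
    totals: Dict[str, float] = defaultdict(float)
    for _scheduler, pool, estimate in samples:
        totals[pool] += estimate
    return dict(totals)
```

In a monitoring backend the same result would typically come from a sum aggregation grouped by the pool tag.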


com_engflow_re_scheduler_estimated_action_time

no unit
Estimated action time

Tags

  • pool: name of the pool ("default" for the default pool)
Details

Indicates an estimate for the duration of an action. Every scheduler reports its own estimate.


com_engflow_re_scheduler_estimated_induced_load

no unit
Estimated induced wait time

Tags

  • pool: name of the pool ("default" for the default pool)
Details

Indicates an estimate for future incoming work in ms per ms. Every scheduler reports its own estimate.


com_engflow_re_scheduler_existing_executors

no unit
Number of existing executors, per pool

Tags

  • pool: name of the pool ("default" for the default pool)
Details

Number of existing executors, per pool, according to this scheduler. Only schedulers report this metric. Every scheduler reports the same (or about the same) value.


com_engflow_re_scheduler_existing_schedulers

no unit
Number of existing schedulers
Details

Only schedulers report this metric. Every scheduler reports a constant "1". This can be used to detect schedulers that are unable to send monitoring metrics.
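
One way to use the constant "1" is to sum the reported values and compare the sum with the number of schedulers you expect. The sketch below assumes the expected count is known from your deployment configuration; the function name is hypothetical.

```python
from typing import Iterable


def silent_scheduler_count(
    reported_values: Iterable[int], expected_schedulers: int
) -> int:
    """Return how many expected schedulers did not report the metric.

    reported_values holds the latest value (always 1) from each scheduler
    that did report; expected_schedulers is an assumed, externally known count.
    """
    reporting = sum(reported_values)
    return max(expected_schedulers - reporting, 0)
```

A non-zero result suggests that at least one scheduler is down or unable to send monitoring metrics.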


com_engflow_re_scheduler_global_queue_size

no unit
Number of waiting actions, per pool

Tags

  • pool: name of the pool ("default" for the default pool)
Details

Indicates the number of actions waiting for execution, per pool, in the cluster. Only schedulers report this metric. The schedulers coordinate to calculate this sum.


com_engflow_re_scheduler_global_used_executors

no unit
Number of used executors, per pool

Tags

  • pool: name of the pool ("default" for the default pool)
Details

Indicates the number of executors that are in use, per pool, in the cluster. Only schedulers report this metric.


com_engflow_re_scheduler_licensed_max_worker_cores

no unit
Maximum number of worker cores permitted by the EngFlow license

com_engflow_re_scheduler_max_instances

no unit
Configured maximum number of instances, per pool

Tags

  • pool: name of the pool ("default" for the default pool)
Details

Indicates the configured maximum number of instances per pool.


com_engflow_re_scheduler_mia_count

no unit
Per pool, the number of actions that failed because the executing worker went missing-in-action

Tags

  • pool: name of the pool ("default" for the default pool)

com_engflow_re_scheduler_min_instances

no unit
Configured minimum number of instances, per pool

Tags

  • pool: name of the pool ("default" for the default pool)
Details

Indicates the configured minimum number of instances per pool.


com_engflow_re_scheduler_pool_utilization

no unit
Current executor utilization, per pool

Tags

  • pool: name of the pool ("default" for the default pool)
Details

Reports current executor utilization (used*100/total) per pool, as a percentage ([0..100]). Only schedulers report this metric. Every scheduler reports the same (or about the same) value.

To help make scale-up decisions when a pool is empty, utilization is reported as 100 if there are actions waiting and 0 if not. A pool may be empty if it was scaled down, or if it never existed (the client may request any pool name).
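
The utilization rule, including the empty-pool case, can be sketched as follows. The function and parameter names are hypothetical; only the formula (used*100/total) and the 100/0 rule for empty pools come from the description above.

```python
def pool_utilization(used: int, total: int, waiting: int) -> float:
    """Executor utilization of a pool as a percentage in [0, 100].

    used and total are executor counts; waiting is the number of queued
    actions. All three names are assumptions made for this sketch.
    """
    if total == 0:
        # Empty pool: report 100 if actions are waiting, otherwise 0.
        return 100.0 if waiting > 0 else 0.0
    return used * 100 / total
```

Reporting 100 for an empty pool with waiting actions lets a utilization-based scale-up rule fire even though no executors exist yet.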


com_engflow_re_scheduler_queue_age

no unit
Min/max age of queued actions, per pool

Tags

  • pool: name of the pool ("default" for the default pool)

  • statistic: "min" (youngest) or "max" (oldest) action in the pool's queue

Details

Reports minimum and maximum age in each executor pool, i.e. how long entries have been waiting. Only schedulers report this metric. Every scheduler reports the ages for its own queue. Changes in these values indicate a change in the cluster's throughput.


com_engflow_re_scheduler_queue_size

no unit
Number of waiting actions, per pool

Tags

  • pool: name of the pool ("default" for the default pool)
Details

Indicates the number of actions waiting for execution, per pool, on this scheduler. Only schedulers report this metric. Every scheduler reports its own queue lengths.


com_engflow_re_scheduler_target_desired_instances

no unit
Number of desired instances, per pool

Tags

  • pool: name of the pool ("default" for the default pool)
Details

Indicates the number of desired instances per pool, as reported by the master scheduler. Depending on which scaling method is active, either the scheduler sets the autoscaling group to this size, or the master scheduler marks instances for termination and terminates them to eventually reach this size.


Observability UI metrics


com_engflow_observability_ui_app_load

no unit
The duration of the initial application load.

Tags

  • page

com_engflow_observability_ui_caught_error

no unit
An error in the web client was manually caught and reported.

Tags

  • page

  • error_description


com_engflow_observability_ui_navigation

no unit
A single user navigation to a new page.

Tags

  • page

com_engflow_observability_ui_page_load

no unit
ms

Tags

  • page

  • load_type


com_engflow_observability_ui_page_load_with_data_requests

no unit
Measures the duration elapsed loading the current page, plus any page-specific data requests sent after it was initially loaded. This is currently only selectively enabled for some pages.

Tags

  • page

  • request_status


com_engflow_observability_ui_rendering_error

no unit
An error in the web client was caught by the rendering pipeline and reported.

Tags

  • page

  • section


com_engflow_observability_ui_uncaught_error

no unit
An uncaught error was thrown in the web client.

Tags

  • page

  • error_name


BES Replay Metrics


com_engflow_bes_replay_cpu_time

no unit
The amount of CPU time spent replaying and reducing build event streams, measured in milliseconds.

Tags

  • status

  • replay_type


com_engflow_bes_replay_cpu_time_for_event

no unit
The amount of CPU time spent replaying and reducing a single event within a build event stream, measured in milliseconds.

Tags

  • status

  • replay_type

  • event_type


ResultStore metrics


com_engflow_resultstore_reduce_bes_completed_duration_since_finish_event

no unit
This is a distribution. Tracks how much time passed between receiving an invocation's finish event, and completing the BES reduction.

Tags

  • status: Whether the BES reduction finished successfully.

  • replay_type


com_engflow_resultstore_reduce_bes_replay_removed_from_cache_count

no unit
The number of BES replays that were removed from the replay cache.

Tags

  • status: The status of the replay when it was removed from the cache.

  • replay_type


com_engflow_resultstore_reduce_bes_replay_source_count

no unit
The number of BES replays requested, tagged by where the data was fetched from, and the type of replay.

Tags

  • source: Indicates where the data was fetched from.

  • replay_type


Storage implementation metrics


com_engflow_storage_gc_gc_window_seconds

no unit
The storage service's GC window.

Tags

  • name: name of the storage service

com_engflow_storage_read_size

no unit
By

Tags

  • name: name of the storage service

  • status: op result


com_engflow_storage_read_time_per_gb

no unit
ms

Tags

  • name: name of the storage service

  • status: op result


com_engflow_storage_read_time_to_first_byte

no unit
ms

Tags

  • name: name of the storage service

  • status: op result


com_engflow_storage_read_time_to_next_chunk

no unit
ms

Tags

  • name: name of the storage service

  • status: op result


com_engflow_storage_write_size

no unit
By

Tags

  • name: name of the storage service

  • status: op result


com_engflow_storage_write_time_per_gb

no unit
ms

Tags

  • name: name of the storage service

  • status: op result


com_engflow_storage_write_time_to_commit

no unit
ms

Tags

  • name: name of the storage service

  • status: op result


Integration metrics


com_engflow_integration_process_duration

no unit
Reports how long it took in milliseconds to process an event sent to a third-party integration.

Tags

  • integration: The name of the service we are integrating with.

  • status: The status of trying to send an event to that integration.


NotificationQueue metrics


com_engflow_notificationqueue_dequeue_latency

no unit
This is a distribution. Refers to the time passed between notification creation and dequeuing it.

Tags

  • expired: Whether the notification was expired and discarded.

  • name: The name of the queue.


com_engflow_notificationqueue_head_age

no unit
The age of the first notification in the queue, i.e. the time passed since the notification was created. Notably, it does NOT reflect how much time has passed since the notification was last published. For example, if a notification is not acknowledged and published anew, the head age may be disproportionately high compared to the age of the next notifications in the queue. This can lead to acceptable, occasional spikes.

Tags

  • name: The name of the queue.

com_engflow_notificationqueue_publish

no unit
This is a distribution. Refers to the time needed to publish a notification.

Tags

  • status: The status of the operation.

  • name: The name of the queue.


com_engflow_notificationqueue_size

no unit
The approximate size of the queue.

Tags

  • name: The name of the queue.

Action execution


com_engflow_re_exec_completed_actions

no unit
Number of actions that ran to completion, grouped by exit code

Tags

  • exit_code: the action's exit code
Details

This metric reflects the rate of change. Each measurement indicates how many actions completed on this worker, in all pools combined, since the last time this metric was reported.

Only workers report this metric. All workers report their own values. We recommend grouping by exit_code=0 and exit_code!=0, and summing up the time series in the groups. This yields the rate of successful and unsuccessful action completion across the cluster.
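
The recommended grouping can be sketched as below. This is an illustration under assumptions: the `(worker, exit_code, count)` tuple layout and the function name are hypothetical, and a real setup would express the same grouping in the monitoring backend's query language.

```python
from collections import Counter
from typing import Iterable, Tuple


def completion_by_outcome(samples: Iterable[Tuple[str, int, int]]) -> Counter:
    """Sum per-worker completion counts into success and failure totals.

    Each sample is (worker, exit_code, count) for one reporting interval;
    this layout is an assumption of the sketch.
    """
    totals: Counter = Counter()
    for _worker, exit_code, count in samples:
        totals["success" if exit_code == 0 else "failure"] += count
    return totals
```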


com_engflow_re_exec_completed_actions_per_pool

no unit
Number of executed actions (not cached), grouped by pool and status

Tags

  • pool: name of the pool (_default_ for the default pool)

  • status: the action's status (ExecutionStatus: SUCCESS, NON_ZERO_EXIT, CLIENT_ERROR, ERROR)

Details

This metric reflects the rate of change. Each measurement indicates how many actions completed on this worker, per pool, since the last time this metric was reported.

Only workers report this metric. All workers report their own values. We recommend grouping by status, and summing up the time series in the groups. This yields the rate of successful and unsuccessful action completion across the pool/cluster.


com_engflow_re_exec_execution_latency

no unit
ms

Tags

  • pool: name of the pool (_default_ for the default pool)

  • stage: the action's stage (ExecutionStage: QUEUED, DOWNLOAD_INPUTS, EXECUTE_ACTION, UPLOAD_OUTPUTS, EXECUTOR_TOTAL)


com_engflow_re_exec_executors_existing

no unit
Total number of executors on this worker, in all pools combined
Details

Only workers report this metric. All workers report their own values. You should sum up the time series to get the total number of executors in the cluster.


com_engflow_re_exec_executors_existing_per_pool

no unit
Total number of executors on this worker, per pool

Tags

  • pool: name of the pool (_default_ for the default pool)
Details

Only workers report this metric. All workers report their own values. You should sum up the time series to get the total number of executors in the pool/cluster.


com_engflow_re_exec_max_rss_kib

no unit
KiBy

Tags

  • pool: name of the pool (_default_ for the default pool)
Details

This metric is a distribution. Each measurement indicates approximately how much memory (in KiB) a successful action (ExecutionStatus: SUCCESS) on this worker reportedly used.

Only workers report this metric. All workers report their own values.


com_engflow_re_exec_started_actions_per_pool

no unit
Number of started actions, grouped by pool

Tags

  • pool: name of the pool (_default_ for the default pool)
Details

This metric reflects the rate of change. Each measurement indicates how many actions started on this worker, per pool, since the last time this metric was reported.

Only workers report this metric. All workers report their own values.


com_engflow_re_exec_used_executors

no unit
Number of busy executors, in all pools
Details

Only workers report this metric. All workers report their own values. You should sum up the time series to get the total number of busy executors in the cluster.


com_engflow_re_exec_used_executors_per_pool

no unit
Number of busy executors, per pool

Tags

  • pool: name of the pool (_default_ for the default pool)
Details

Only workers report this metric. All workers report their own values. You should sum up the time series to get the total number of busy executors in the pool/cluster.


Hazelcast monitoring


com_engflow_re_hazelcast_is_master

no unit
Whether a machine is a cluster master; if this sums up to more than one (with the same name), then the cluster is unhealthy.

Tags

  • name: name of the Hazelcast cluster.
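
The health check described for this metric can be sketched as follows. The `(cluster_name, is_master)` sample layout and the function name are assumptions for this example; the rule itself (a sum greater than one per cluster name is unhealthy) comes from the description above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def clusters_with_multiple_masters(
    samples: Iterable[Tuple[str, int]],
) -> List[str]:
    """Return names of clusters where more than one machine claims mastership.

    Each sample is (cluster_name, is_master) with is_master being 0 or 1,
    one sample per machine; the layout is hypothetical.
    """
    masters: Dict[str, int] = defaultdict(int)
    for name, is_master in samples:
        masters[name] += is_master
    return sorted(name for name, count in masters.items() if count > 1)
```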

com_engflow_re_hazelcast_map_entries

no unit
The number of entries in the Hazelcast map

Tags

  • cluster_name: name of the Hazelcast cluster

  • map_name: name of the Hazelcast map


com_engflow_re_hazelcast_map_memory_used

no unit
The amount of memory used for the map

Tags

  • cluster_name: name of the Hazelcast cluster

  • map_name: name of the Hazelcast map


com_engflow_re_hazelcast_member_count

no unit
The number of members in the cluster; only the master reports this value

Tags

  • name: name of the Hazelcast cluster.

com_engflow_re_hazelcast_op_time

no unit
ms

Tags

  • name: name of the distributed hash map

  • status: op result


com_engflow_thirdparty_hazelcast_partition_migration_finished

no unit
The number of finished Hazelcast partition migrations.

Tags

  • name: name of the Hazelcast cluster

com_engflow_thirdparty_hazelcast_partition_migration_started

no unit
The number of started Hazelcast partition migrations.

Tags

  • name: name of the Hazelcast cluster

com_engflow_thirdparty_hazelcast_partition_migration_time

no unit
Time of Hazelcast partition migrations, per Hazelcast cluster.

Tags

  • name: name of the Hazelcast cluster
Details

Reports the time of Hazelcast partition migrations.


com_engflow_thirdparty_hazelcast_replica_migration

no unit
The number of Hazelcast replica migrations.

Tags

  • name: name of the Hazelcast cluster

  • status: status of the operation (OK or FAILURE)


Uncaught exceptions


com_engflow_re_uncaught_exceptions

no unit
Fires every time there is an uncaught exception

Action queue metrics


com_engflow_remoteexecution_queue_age

no unit
Reports age of queued actions in each executor pool (i.e. how long entries have been waiting) - bucketed by priority.

Tags

  • pool: The name of the pool (_default_ for the default pool).

  • priority: The (numeric) priority of the action.


com_engflow_remoteexecution_queue_enqueued

no unit
Reports the number of actions enqueued - bucketed by priority.

Tags

  • pool: The name of the pool (_default_ for the default pool).

  • priority: The (numeric) priority of the action.


CAS server metrics


com_engflow_re_cas_missing_digests

no unit
The total number of missing digests seen by findMissingBlobs.

com_engflow_re_cas_requested_digests

no unit
The total number of digests requested by findMissingBlobs calls.

Remote Execution metrics


com_engflow_remoteexecution_queue_time

no unit
This is a distribution. Refers to the time actions are queued.

Tags

  • pool: The name of the pool.


Invocation index monitoring


com_engflow_resultstore_index_sql_invocation_index_database_queue_size

no unit
All enqueued or in-progress invocation index database operations
Details

Reflects the number of incomplete operations (either queued or being worked on).

Every instance reports this metric. Every instance reports its own stats.


CAS usage


com_engflow_re_cas_check_blob_exists

no unit
Distribution of time needed to check whether a blob exists

Tags

  • status: operation result, e.g. OK, NOT_FOUND
Details

This is a distribution. Refers to the time needed to check whether a blob exists.


com_engflow_re_cas_fetch_call_time

no unit
ms

Tags

  • source: name of the CAS location, e.g. EXTERNAL_STORAGE, DISTRIBUTED_CAS_NEAR

  • status: op result, e.g. OK, UNAVAILABLE

Details

The time distribution of individual CAS download calls; each call is measured independently, including when falling back between different sources.


com_engflow_re_cas_fetch_retries

no unit
Count of retries needed when fetching a CAS blob.
Details

Count of retries needed when fetching a CAS blob. Incremented after each failure, so 0 indicates a fetch without errors.
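
The counting rule above (increment after each failure, so 0 means a clean fetch) can be illustrated with a small retry loop. This is only a sketch of the semantics, not EngFlow's implementation. `fetch_with_retries`, `make_flaky_fetch`, and the `max_retries` parameter are hypothetical names.

```python
from typing import Callable, Tuple


def fetch_with_retries(fetch: Callable[[], bytes], max_retries: int = 3) -> Tuple[bytes, int]:
    """Call fetch() until it succeeds; return the blob and the number of failures seen."""
    retries = 0
    while True:
        try:
            return fetch(), retries
        except IOError:
            retries += 1  # incremented after each failure
            if retries > max_retries:
                raise


def make_flaky_fetch(failures: int) -> Callable[[], bytes]:
    """Build a fake fetch that fails the given number of times, then succeeds."""
    state = {"left": failures}

    def fetch() -> bytes:
        if state["left"] > 0:
            state["left"] -= 1
            raise IOError("simulated transient failure")
        return b"blob"

    return fetch


print(fetch_with_retries(make_flaky_fetch(2)))
```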


com_engflow_re_cas_find_replicas_time

no unit
ms
Details

Distribution of time needed to find which instances have copies of a file.


com_engflow_re_cas_load_shed_errors

no unit
Count of RESOURCE_EXHAUSTED errors returned by workers for CAS requests due to load shedding.

Tags

  • method: method returning the error, e.g., read
Details

Count of RESOURCE_EXHAUSTED errors returned by workers for CAS requests due to load shedding.


com_engflow_re_cas_remote_check_blob_exists

no unit
Distribution of time needed to check whether a blob exists

Tags

  • status: operation result, e.g. OK, NOT_FOUND

  • source: address used to check on the CAS blob.

Details

This is a distribution. Refers to the time needed to check whether a blob exists.


com_engflow_re_cas_requests_in_flight_incoming

no unit
Number of currently open incoming cache requests, by method and pool.

Tags

  • method: read or write

  • pool: name of the pool serving the request

Details

Number of currently open incoming cache requests, by method and pool.


com_engflow_re_cas_requests_in_flight_outgoing

no unit
Number of currently open outgoing cache requests, by method and pool.

Tags

  • method: read or write

  • pool: name of the pool originating the request

Details

Number of currently open outgoing cache requests, by method and pool. Includes both distributed CAS (ByteStream) and external storage.


com_engflow_re_cas_requests_served

no unit
Number of CAS requests served, by method, pool, and status

Tags

  • method: read or write

  • pool: name of the pool serving the request

  • status: result, for example, OK, RESOURCE_EXHAUSTED

Details

Number of CAS requests served, by method, pool, and status


com_engflow_re_cas_time_to_next_message

no unit
Estimated number of milliseconds to the next gRPC message.

Tags

  • pool: name of the pool serving the request
Details

Estimated number of milliseconds to the next gRPC message.


Local CAS usage


com_engflow_re_cas_available_replica_space

no unit
Available storage space in the CAS that can be used for replicas
Details

Only workers report this metric. All workers report their own values.


com_engflow_re_cas_available_space

no unit
Available storage space in the CAS
Details

Only workers report this metric. All workers report their own values.


com_engflow_re_cas_free_time

no unit
ms
Details

This is a distribution. It refers to the deletion of expired replicas.

Only workers report this metric. All workers report their own values.


com_engflow_re_cas_gc_time

no unit
ms
Details

This is a distribution. It refers to the collection of expired replicas.

Only workers report this metric. All workers report their own values.


com_engflow_re_cas_lost_files_count

no unit
The number of files that were lost from the CAS
Details

The number of files that were deleted by some other process, or that the CAS instance detected as no longer matching the expected digest.

Only workers report this metric. All workers report their own values.


com_engflow_re_cas_max_total_replica_size

no unit
The max total replica size
Details

This is the maximum amount of storage space the CAS is allowed to use for replicas.

Only workers report this metric. All workers report their own values.


com_engflow_re_cas_max_total_size

no unit
The max total CAS size on the node
Details

This is the maximum amount of storage space the CAS is allowed to use.

Only workers report this metric. All workers report their own values.
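
A per-worker fill level can be estimated from the two gauges above. The sketch below assumes that both values use the same unit and that used space is roughly `max_total_size - available_space`. Neither assumption is stated in this reference, and the function name is hypothetical.

```python
def cas_usage_fraction(max_total_size: float, available_space: float) -> float:
    """Estimate the fraction of the allowed CAS size that is in use on one worker."""
    if max_total_size <= 0:
        return 0.0
    # Assumption: used space is the configured maximum minus the available space.
    used = max(max_total_size - available_space, 0.0)
    return min(used / max_total_size, 1.0)


# Hypothetical gauge readings from a single worker.
print(cas_usage_fraction(1000, 250))
```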


Client authorization


com_engflow_re_auth_async_duration

no unit
ms
Details

This is a distribution. Only schedulers report this metric. Every scheduler reports its own stats.


SecretStore metrics


com_engflow_secretstore_operation_duration_seconds

no unit
Time taken to perform an operation on the secret store

Tags

  • store_type: The implementation of the instrumented secret store

  • operation: The secret store operation reported ('read', etc.)

  • status: The gRPC status code string, in SCREAMING_SNAKE_CASE


Licensing metrics


com_engflow_licensing_license_server_fetch_result

no unit
The result of attempted license renewals using the MyEngFlow License Server.

Tags

  • status

Docker use


com_engflow_re_exec_docker_container_lifecycle_events_count

no unit
Count of various lifecycle events relating to containers

Tags

  • pool

  • container_lifecycle_event_name


com_engflow_re_exec_docker_container_shutdown_time

no unit
ms

com_engflow_re_exec_docker_container_startup_time

no unit
ms

Tags

  • status: result of the operation, e.g. "OK", "FAILED"

com_engflow_re_exec_docker_containers_failed

no unit
The number of docker containers that failed

com_engflow_re_exec_docker_docker_proxy_failure_count

no unit
The number of times pulls through the docker proxy failed

com_engflow_re_exec_docker_image_pull_time

no unit
ms

Tags

  • status: result of the operation, e.g. "OK", "FAILED"

com_engflow_re_exec_docker_network_create_time

no unit
ms

Tags

  • status: result of the operation, e.g. "OK", "FAILED"

com_engflow_re_exec_docker_network_destroy_time

no unit
ms

Tags

  • status: result of the operation, e.g. "OK", "FAILED"

com_engflow_re_exec_docker_sibling_container_enabled

no unit
Counts times that sibling container access was requested via platform properties

Tags

  • pool

Persistent worker use


com_engflow_re_exec_worker_actions

no unit
The number of persistent worker actions run

Tags

  • reuse_status: new or reused

  • pool: name of the pool

Details

The number of persistent worker actions run, aggregated by whether they reused a previous persistent worker process or not


Scheduler metrics


com_engflow_profiling_publish_invocation_event

no unit
Timing for publishing invocation events.

com_engflow_re_cas_entries_evicted

no unit
The number of CAS entries that were evicted due to memory size limitations

com_engflow_re_cas_entries_lost

no unit
The number of CAS entries that could not be recovered on CAS node shutdown events

com_engflow_re_profiler_events

no unit
The number of server-side profile events recorded.

com_engflow_re_profiler_live_handles

no unit
The number of profiles being streamed to the eventstore.

com_engflow_re_remaining_certificate_validity_days

no unit
The number of remaining days before certificates expire

Tags

  • issuer: The issuer (issuer distinguished name) value from the certificate

  • serial_number: The serial number assigned to the certificate by the issuer

Details

Reports the number of remaining validity days for each X509 certificate processed by schedulers. The issuer and serial number uniquely identify certificates.
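
Because each certificate is identified by its issuer and serial number, an expiry check can key on that pair. The following is a minimal sketch, assuming the gauge values have already been collected into a dictionary. The 30-day threshold, the function name, and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple

CertKey = Tuple[str, str]  # (issuer, serial_number), as in the tags above


def expiring_certificates(samples: Dict[CertKey, float], threshold_days: float = 30.0) -> List[CertKey]:
    """Return the certificates whose remaining validity is below the threshold."""
    return sorted(key for key, days in samples.items() if days < threshold_days)


# Hypothetical gauge readings.
samples = {
    ("CN=Example Root CA", "01"): 400.0,
    ("CN=Example Issuing CA", "2A"): 12.0,
}
print(expiring_certificates(samples))
```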


com_engflow_re_remaining_license_time

no unit
The number of remaining days before the license expires

Java memory metrics


com_engflow_re_java_heap

no unit
The amount of heap memory used
Details

Every instance reports this metric. Every instance reports its own stats.


Meta metrics


com_engflow_meta_engflow_version

no unit
A heartbeat metric that reports the EngFlow build label if present and "missing_version" otherwise.

Tags

  • version

com_engflow_meta_parallel_engflow_version

no unit
A heartbeat metric that reports how many different EngFlow versions are currently registered with the cluster, as indicated by the build label. All instances that do not report a build label are treated as running the same version, distinct from all other reported versions. Each scheduler reports its own metric.
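
The counting rule described above can be sketched as follows: instances without a build label collapse into a single version, represented here by the "missing_version" placeholder. This only illustrates the rule and is not the scheduler's actual code. The function name and sample labels are hypothetical.

```python
from typing import Iterable, Optional


def count_parallel_versions(build_labels: Iterable[Optional[str]]) -> int:
    """Count distinct versions; all instances without a label count as one version."""
    versions = {label if label else "missing_version" for label in build_labels}
    return len(versions)


# Hypothetical labels reported by five instances, two of them without a label.
print(count_parallel_versions(["1.2", "1.2", None, "", "1.3"]))
```

A value above 1 that persists after a rollout may indicate instances still running an older version.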

Docker daemon


com_engflow_docker_container_existing

no unit
The number of existing containers.

Tags

  • daemon: The docker daemon this metric is for (e.g., host).

  • pool: The pool this metric is for (e.g., default or macos).

  • state: The state of the container (e.g., running, exited, ...).


com_engflow_docker_container_size

no unit
The size distribution of existing docker containers.

Tags

  • daemon: The docker daemon this metric is for (e.g., host).

  • pool: The pool this metric is for (e.g., default or macos).

  • scope: The scope of the value (container if the value is the size of a single container, and daemon if the value is the sum of all containers known to the docker daemon).

  • filesystem: The filesystem the value applies to (overlay if the value is the size of files modified by the container, and root if the value is the total size of the container including the image).


com_engflow_docker_image_size

no unit
The size distribution of existing docker images.

Tags

  • daemon: The docker daemon this metric is for (e.g., host).

  • pool: The pool this metric is for (e.g., default or macos).

  • scope: The scope of the value (image if the value is the size of a single image, and daemon if the value is the sum of all images known to the docker daemon).


DB Connection Pool usage


com_engflow_resultstore_index_db_cp_active_connections

no unit
The number of active connections in the pool

Tags

  • db_connection_pool_name

com_engflow_resultstore_index_db_cp_connection_acquire_time

no unit
us

Tags

  • db_connection_pool_name

com_engflow_resultstore_index_db_cp_connection_create_time

no unit
ms

Tags

  • db_connection_pool_name

com_engflow_resultstore_index_db_cp_connection_timeout_count

no unit
The count of timed-out connections

Tags

  • db_connection_pool_name

com_engflow_resultstore_index_db_cp_connection_usage_time

no unit
ms

Tags

  • db_connection_pool_name

com_engflow_resultstore_index_db_cp_idle_connections

no unit
The number of idle connections in the pool

Tags

  • db_connection_pool_name

com_engflow_resultstore_index_db_cp_max_connections

no unit
Maximum number of connections existing in the pool

Tags

  • db_connection_pool_name

com_engflow_resultstore_index_db_cp_min_connections

no unit
Minimum number of connections existing in the pool

Tags

  • db_connection_pool_name

com_engflow_resultstore_index_db_cp_pending_connections

no unit
The number of pending connections in the pool

Tags

  • db_connection_pool_name

com_engflow_resultstore_index_db_cp_total_connections

no unit
The number of all currently existing connections in the pool

Tags

  • db_connection_pool_name
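
The connection pool gauges above can be combined into simple health signals. The following sketch assumes the values come from the same pool (same db_connection_pool_name) at roughly the same time. The function names and the saturation rule are illustrative assumptions, not part of EngFlow.

```python
def pool_utilization(active: float, max_connections: float) -> float:
    """Fraction of the pool's maximum size that is currently in active use."""
    if max_connections <= 0:
        return 0.0
    return active / max_connections


def is_saturated(active: float, max_connections: float, pending: float) -> bool:
    """Assumed rule: the pool is saturated if it is fully used and callers are waiting."""
    return active >= max_connections and pending > 0


# Hypothetical gauge readings for one pool.
print(pool_utilization(8, 10), is_saturated(10, 10, 3))
```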

Caffeine cache metrics


com_engflow_caching_inmemory_evicts

no unit
Number of cache evictions tagged with the reason for eviction.

Tags

  • name: The name of the cache

  • reason: The reason for eviction


com_engflow_caching_inmemory_hits

no unit
Number of cache hits.

Tags

  • name: The name of the cache

com_engflow_caching_inmemory_loads

no unit
Timing and status information for cache loads.

Tags

  • name: The name of the cache

  • status: The status of the load


com_engflow_caching_inmemory_misses

no unit
Number of cache misses.

Tags

  • name: The name of the cache
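
A hit rate per cache follows directly from the hit and miss counters. A minimal sketch, assuming both inputs are counter increases for the same cache name over the same window (the function name is hypothetical):

```python
def cache_hit_rate(hits: float, misses: float) -> float:
    """Fraction of lookups served from the cache."""
    lookups = hits + misses
    if lookups <= 0:
        return 0.0  # no lookups in the window
    return hits / lookups


# Hypothetical counter increases for one cache.
print(cache_hit_rate(90, 10))
```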

DB Query stats


com_engflow_resultstore_index_duration

no unit
ms

Tags

  • query_name

  • query_outcome


com_engflow_resultstore_index_preparation

no unit
ms

Tags

  • query_name

Outgoing CI API Calls


com_engflow_ci_http_api_call_latency

no unit
ms

Tags

  • hostname: hostname of the API service that was called

  • status: HTTP status of the call (1xx, 2xx, 3xx, 4xx, 5xx, or UNKNOWN)

Details

Indicates the number of API calls made, per service and their status, on this scheduler. The status only contains the abbreviated HTTP status code. Only schedulers report this metric. Every scheduler reports its own API calls.
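
The abbreviated status values listed above (1xx through 5xx, or UNKNOWN) can be derived from a numeric HTTP status code as in this sketch. How EngFlow maps codes internally is not documented here, so the function and its handling of out-of-range codes are assumptions.

```python
from typing import Optional


def abbreviate_status(code: Optional[int]) -> str:
    """Map an HTTP status code to its class, e.g. 204 to "2xx"."""
    if code is None or not 100 <= code <= 599:
        return "UNKNOWN"  # assumed fallback for missing or out-of-range codes
    return f"{code // 100}xx"


print(abbreviate_status(204), abbreviate_status(503), abbreviate_status(None))
```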


CI runner


com_engflow_ci_basic_bentos_prepared

no unit
Number of Bento runs scheduled, split out by cold/warm reason

Tags

  • ci_family: name of the CI system (buildkite|github_actions)

  • snapshot_usage: whether a Bento was used, and if so, whether a snapshot was used too (none|unknown_bento|want_cold|no_known_snapshot|missing_snapshot|warm)


com_engflow_ci_basic_gh_error_propagation_jobs

no unit
Number of finished GitHub error propagation jobs

Tags

  • status: the status of the error propagation job
Details

Number of finished GitHub error propagation jobs.


com_engflow_ci_basic_gh_runner_no_jobs

no unit
Number of finished GitHub runs that timed out and did not pick up any jobs.
Details

Number of finished GitHub runs that timed out and did not pick up any jobs.


com_engflow_ci_basic_gh_runner_wrong_job

no unit
Number of finished GitHub runs that picked up the wrong job.
Details

Number of finished GitHub runs that picked up the wrong job.


com_engflow_ci_basic_jobs_completed

no unit
Number of jobs completed

Tags

  • status: job completion status
Details

The number of completed CI jobs, by status. This includes jobs for which we failed before starting the remote runner.


com_engflow_ci_basic_jobs_queue_age

no unit
Age of queued jobs

Tags

  • ci_family: name of the CI of the job

  • statistic: min/max queue age

Details

The age of queued jobs on GitHub/Buildkite. This metric is only reported if polling is enabled, and only for jobs that are configured for EngFlow CI runners. The value is how long ago the job was created, relative to the current poll time.
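
The min/max statistic can be pictured as follows: each queued job's age is the poll time minus its creation time, and the smallest and largest ages are reported. This is a sketch only. The unit of the real metric is not stated in this reference, so the use of seconds here is an assumption, as is the function name.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable


def queue_age_stats(created_at: Iterable[datetime], poll_time: datetime) -> Dict[str, float]:
    """Return the min and max age, in seconds, of queued jobs relative to poll_time."""
    ages = [(poll_time - created).total_seconds() for created in created_at]
    if not ages:
        return {}  # nothing queued, nothing to report
    return {"min": min(ages), "max": max(ages)}


# Hypothetical poll with two queued jobs, created 30 s and 5 min before the poll.
poll = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
jobs = [poll - timedelta(seconds=30), poll - timedelta(minutes=5)]
print(queue_age_stats(jobs, poll))
```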


com_engflow_ci_basic_jobs_queued

no unit
Number of queued jobs

Tags

  • ci_family: name of the CI of the job
Details

The number of queued jobs on GitHub/Buildkite. This metric is only reported if polling is enabled.


com_engflow_ci_basic_jobs_started

no unit
Number of jobs started

Tags

  • ci_family: name of the CI system (buildkite|github_actions)
Details

A job counts as "started" if the scheduler began executing it, i.e. the scheduler started fetching the runner agent, even if later the agent fails to start or to obtain a job from the CI system.


com_engflow_ci_basic_poll_duration_millis

no unit
Duration of polls against the remote CI system

Tags

  • ci_family: name of the CI system to poll

  • status: the status of the polling job

Details

Measures the duration of polls against the remote CI system


com_engflow_ci_full_builtin_step_duration

no unit
The duration of individual built-in steps in a CI job

Tags

  • ci_family: name of the CI of the job

  • git_repository: name of the repository of the job

  • step: job step name

  • runner_architecture: architecture of the requested runner for a given job

  • runner_os: OS of the requested runner for a given job

Details

Measures the execution duration of each built-in (non-user-defined) step of a job.


com_engflow_ci_full_git_command_duration

no unit
The duration of individual git commands during a CI job

Tags

  • git_command: git command: checkout/fetch/index-pack

  • ci_family: name of the CI of the job

  • git_repository: name of the repository of the job

  • pipeline: name of the pipeline that triggered the job

  • job: job name

  • runner_architecture: architecture of the requested runner for a given job

  • runner_os: OS of the requested runner for a given job

Details

Measures the duration of various git subcommands during a job's execution


com_engflow_ci_full_job_duration

no unit
The duration of CI jobs

Tags

  • ci_family: name of the CI of the job

  • git_repository: name of the repository of the job

  • pipeline: name of the pipeline that triggered the job

  • job: job name

  • runner_architecture: architecture of the requested runner for a given job

  • runner_os: OS of the requested runner for a given job

Details

Measures the duration of a job's execution.


com_engflow_ci_full_jobs_completed

no unit
Number of jobs completed

Tags

  • status: job completion status

  • ci_family: name of the CI of the job

  • git_repository: name of the repository of the job

  • pipeline: name of the pipeline that triggered the job

  • job: job name

  • runner_architecture: architecture of the requested runner for a given job

  • runner_os: OS of the requested runner for a given job

Details

The number of completed CI jobs, by status. This includes jobs for which we failed before starting the remote runner.


com_engflow_ci_full_jobs_started

no unit
Number of jobs that CI Agents successfully started

Tags

  • ci_family: name of the CI of the job

  • git_repository: name of the repository of the job

  • pipeline: name of the pipeline that triggered the job

  • job: job name

  • runner_architecture: architecture of the requested runner for a given job

  • runner_os: OS of the requested runner for a given job

Details

The number of jobs started on this scheduler. Every scheduler reports its own jobs.


com_engflow_ci_full_step_duration

no unit
The duration of individual user-defined steps in a CI job

Tags

  • ci_family: name of the CI of the job

  • git_repository: name of the repository of the job

  • pipeline: name of the pipeline that triggered the job

  • job: job name

  • step: job step name

  • runner_architecture: architecture of the requested runner for a given job

  • runner_os: OS of the requested runner for a given job

Details

Measures the execution duration of each user-defined step of a job.


com_engflow_ci_full_time_to_start_job

no unit
How long it takes for CI runners to start a job

Tags

  • ci_family: name of the CI of the job

  • git_repository: name of the repository of the job

  • pipeline: name of the pipeline that triggered the job

  • job: job name

  • runner_architecture: architecture of the requested runner for a given job

  • runner_os: OS of the requested runner for a given job

Details

Measures the time between a given CI requesting a job's execution and an action actually starting it on CI runners.


gRPC factory metrics


com_engflow_grpc_factory_channels

no unit
The number of open gRPC channels in the channel factory

CI runner - GitHub Actions metrics


com_engflow_ci_github_metrics_api_primary_rate_limit_max

no unit
Maximum number of requests available in the current API rate limiter slot

Tags

  • API_RESOURCE_NAME

com_engflow_ci_github_metrics_api_primary_rate_limit_used

no unit
Number of requests used in the current API rate limiter slot

Tags

  • API_RESOURCE_NAME
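
The two rate limit gauges can be combined to show how much of the budget is left for each API resource. A minimal sketch, assuming both values refer to the same API_RESOURCE_NAME and the same rate limiter slot (the function names and sample values are hypothetical):

```python
def rate_limit_remaining(limit_max: float, used: float) -> float:
    """Requests still available in the current rate limiter slot."""
    return max(limit_max - used, 0.0)


def rate_limit_used_fraction(limit_max: float, used: float) -> float:
    """Fraction of the current slot's budget that has been consumed."""
    if limit_max <= 0:
        return 0.0
    return min(used / limit_max, 1.0)


# Hypothetical gauge readings for one API resource.
print(rate_limit_remaining(5000, 1250), rate_limit_used_fraction(5000, 1250))
```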

RPC metrics


com_engflow_rpc_duration

no unit
Time distribution of RPCs

Tags

  • type: The type of the RPC (e.g., client or server)

  • protocol: The protocol of the RPC (e.g., rest)

  • method: The RPC method

  • status: The status of the operation (e.g., OK)