Metrics Reference

Description of all metrics available for monitoring.

AnalyzeInvocation metrics


com.engflow.invocationanalyzer/bazel_profile_count

no unit
The number of attempts to fetch a Bazel profile

Tags

  • status: The status of retrieving the Bazel profile.
Details

The number of attempts to fetch a Bazel profile.


com.engflow.invocationanalyzer/bazel_profile_size

bytes
The size of the uncompressed Bazel profile handled
Details

The size of the uncompressed Bazel profile handled.


com.engflow.invocationanalyzer/engflow_profile_count

no unit
The number of attempts to fetch an EngFlow profile

Tags

  • status: The status of retrieving the EngFlow profile.
Details

The number of attempts to fetch an EngFlow profile.


com.engflow.invocationanalyzer/engflow_profile_size

bytes
The size of the uncompressed EngFlow profile handled
Details

The size of the uncompressed EngFlow profile handled.


com.engflow.invocationanalyzer/time_needed

milliseconds
The time distribution of handling individual profile analysis requests

Tags

  • status: The status of the analysis performed
Details

The time distribution of handling individual Bazel profiles.


OpenCensus OperationController metrics reporter


com.engflow.operationcontroller/active

no unit
The number of active operations (i.e., currently running).

Tags

  • name: The name of the OperationController
Details

The number of active operations (i.e., currently running).


com.engflow.operationcontroller/latency

milliseconds
The latency to start operations (i.e., how long operations are waiting to be executed).

Tags

  • name: The name of the OperationController
Details

The latency to start operations (i.e., how long operations are waiting to be executed).


com.engflow.operationcontroller/queued

no unit
The number of operations queued for execution.

Tags

  • name: The name of the OperationController
Details

The number of operations queued for execution.


com.engflow.operationcontroller/runtime

milliseconds
The runtime of operations (i.e., the duration operations are running for).

Tags

  • name: The name of the OperationController
Details

The runtime of operations (i.e., the duration operations are running for).


Metrics derived from raw BEP streams


com.engflow.bep/invocation_completed

no unit
Fired with the count of completed invocations reported to the BEP.

Tags

  • exit_code: The human readable exit code of the invocation.

com.engflow.bep/invocation_duration

milliseconds
Fired on invocation completed with the average duration of the invocation.

com.engflow.bep/invocation_started

no unit
Fired with the count of newly started invocations reported to the BEP.

Blob-storage implementation metrics


com.engflow.blobstore/latency

milliseconds
The duration each operation takes.

Tags

  • operation

  • status


com.engflow.blobstore/ops

no unit
Fires every time an operation takes place.

Tags

  • operation

Docker proxy


com.engflow.dockerproxy/blob_upload_bytes

bytes
The size of Docker blobs that the proxy successfully uploaded to the CAS.

Tags

  • status

com.engflow.dockerproxy/cache_hit_bytes

bytes
The size of Docker blobs that the proxy could find in the CAS.

com.engflow.dockerproxy/cache_miss_bytes

bytes
The size of Docker blobs that the proxy expected but could not find in the CAS.

com.engflow.dockerproxy/known_blobs_total

no unit
The number of Docker blobs that the proxy has metadata about.

HTTP clients for the Docker proxy


com.engflow.dockerproxy/http_received_bytes_total

bytes
Bytes received over the HTTP client

Tags

  • client: The name of the HTTP client (can be used to distinguish layers)

com.engflow.dockerproxy/http_request_latency_seconds_total

s
Time it's taken to serve HTTP requests

Tags

  • client: The name of the HTTP client (can be used to distinguish layers)

  • status: The status code, reduced to 1xx..5xx, or FAILED if an exception occurred
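
As a rough illustration of the status tag, the sketch below reduces a status code to its class. The helper name status_class, the use of None for a request that raised an exception, and the treatment of out-of-range codes are assumptions made for this example, not the proxy's actual implementation.

```python
from typing import Optional


def status_class(code: Optional[int]) -> str:
    """Reduce an HTTP status code to its class ("1xx" through "5xx").

    Hypothetical helper: None stands in for a request that raised an
    exception before any status code was received. Mapping out-of-range
    codes to "FAILED" is an assumption of this sketch.
    """
    if code is None or not 100 <= code <= 599:
        return "FAILED"
    return f"{code // 100}xx"
```

Reducing the code to five classes plus FAILED keeps the number of distinct tag values small, which is usually what a monitoring backend wants from a tag.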


com.engflow.dockerproxy/http_requests_total

no unit
Number of HTTP requests started on the HTTP client

Tags

  • client: The name of the HTTP client (can be used to distinguish layers)

  • method: The HTTP method of the request


BEP Event Storage and Replay


com.engflow.eventstore/bep_event_ack_latency

milliseconds
This is a distribution. Tracks how much time passed between receiving a build event and sending an acknowledgement to the client.

Tags

  • status

com.engflow.eventstore/bes_upload_delay

milliseconds
This is a distribution. Tracks how much longer an invocation's BES upload took compared to the invocation's duration as reported by the BES.

Tags

  • status

com.engflow.eventstore/build_event_owners

no unit
The total number of build owners residing in memory. A build owner is an internal piece of state associated with a particular build that reports build events.

com.engflow.eventstore/flushing_batches_size

no unit
The total size of complete build event batches that are currently being uploaded to storage. Normally, batches are flushed quickly, so this value should stay near zero; if it doesn't, that could mean we are falling behind with batch uploads. Every instance reports its own stats; sum them up to get a cluster-wide metric.

com.engflow.eventstore/grpc_eventstore_ttfb

milliseconds
This is a distribution. Tracks how much time passed between requesting EventStore data via gRPC, and receiving the first byte.

Tags

  • type

com.engflow.eventstore/inbound_bep_events

no unit
Incremented whenever an event is received on an inbound stream.

Tags

  • type
Details

An inbound stream means a client (e.g. Bazel) sending BES events to a scheduler.

Every instance reports its own stats; sum them up to get a cluster-wide metric.
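
The cluster-wide value can be obtained by summing the per-instance samples. The following is a minimal sketch of that aggregation; the sample shape, the instance names, and the event type names are hypothetical and only serve the illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical sample shape: (instance name, value of the "type" tag, count).
Sample = Tuple[str, str, int]


def cluster_wide_totals(samples: Iterable[Sample]) -> Dict[str, int]:
    """Sum per-instance counts into one cluster-wide total per event type."""
    totals: Dict[str, int] = defaultdict(int)
    for _instance, event_type, count in samples:
        totals[event_type] += count
    return dict(totals)


# Made-up samples from two instances.
samples = [
    ("scheduler-0", "progress", 120),
    ("scheduler-1", "progress", 80),
    ("scheduler-1", "finished", 3),
]
totals = cluster_wide_totals(samples)
```

In practice this sum is normally expressed as an aggregation in the monitoring backend's query language rather than in application code.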


com.engflow.eventstore/incomplete_batches_size

no unit
The estimated size in bytes it would take to serialize all incomplete build event batches. These batches aren't yet written to storage. Actual JVM heap footprint is likely larger. Every instance reports its own stats; sum them up to get a cluster-wide metric.

com.engflow.eventstore/new_outbound_streams

no unit
Incremented whenever a new outbound BEP stream is created.
Details

An outbound stream means a scheduler replaying a BES stream in order to reduce it to a result stream. It makes no difference if the scheduler is replaying to itself or to another instance.

Every instance reports its own stats; sum them up to get a cluster-wide metric.


com.engflow.eventstore/ongoing_streams

no unit
The total number of streams that are inbound, outbound, or both.
Details

An inbound stream means a client (e.g. Bazel) sending BES events to a scheduler. An outbound stream means a scheduler replaying a BES stream in order to reduce it to a result stream. It makes no difference if the scheduler is replaying to itself or to another instance.

Every instance reports its own stats; sum them up to get a cluster-wide metric.


com.engflow.eventstore/outbound_bep_events

no unit
Incremented whenever an event is sent on an outbound stream.
Details

An outbound stream means a scheduler replaying a BES stream in order to reduce it to a result stream. It makes no difference if the scheduler is replaying to itself or to another instance.

Every instance reports its own stats; sum them up to get a cluster-wide metric.


Virtual Machine Instances


com.engflow.instance.new/gc_avg_duration

milliseconds
The average duration spent in garbage collection since the last reported metric.

Tags

  • gc_type: GC Old generation / GC Young Generation

  • instance_role: type of the instance (scheduler/worker/etc.)


com.engflow.instance.new/gc_count

no unit
The total number of garbage collections during the lifecycle of this process.

Tags

  • gc_type: GC Old generation / GC Young Generation

  • instance_role: type of the instance (scheduler/worker/etc.)


com.engflow.instance.new/gc_time

milliseconds
The total wall time in milliseconds spent blocked in garbage collection since the start of the process. This measures time when the application is not running due to a collector pause.

Tags

  • gc_type: G1 Old generation / G1 Young Generation

  • instance_role: type of the instance (scheduler/worker/etc.)
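
Because gc_time is cumulative since the start of the process, the time spent in a given interval is the difference between two consecutive samples. The sketch below shows one way to turn that into a pause fraction; the function name and the restart handling are assumptions for illustration.

```python
def gc_pause_fraction(gc_time_prev_ms: float, gc_time_now_ms: float,
                      interval_ms: float) -> float:
    """Fraction of a sampling interval spent paused in garbage collection.

    gc_time is cumulative since process start, so the time spent in one
    interval is the difference between two consecutive samples.
    """
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    delta = gc_time_now_ms - gc_time_prev_ms
    if delta < 0:
        # The counter went backwards. Assumption: the process restarted and
        # the counter began again from zero.
        delta = gc_time_now_ms
    return delta / interval_ms
```

For example, samples of 1200 ms and 1500 ms taken 60 seconds apart correspond to 300 ms of pauses, or 0.5% of the interval.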


com.engflow.instance.new/open_file_descriptors

no unit
The number of file descriptors the process has currently open.

Tags

  • instance_role: type of the instance (scheduler/worker/etc.)

com.engflow.instance.new/total_disk_space

bytes
The size of the volume.

Tags

  • instance_role: type of the instance (scheduler/worker/etc.)

  • volume: the name of the disk volume


com.engflow.instance.new/total_system_memory

bytes
The total amount of system memory in bytes.

Tags

  • instance_role: type of the instance (scheduler/worker/etc.)

com.engflow.instance.new/used_disk_percentage

percentage
The percentage of the volume that is currently used.

Tags

  • instance_role: type of the instance (scheduler/worker/etc.)

  • volume: the name of the disk volume


com.engflow.instance.new/used_disk_space

bytes
The total number of bytes used on the volume.

Tags

  • instance_role: type of the instance (scheduler/worker/etc.)

  • volume: the name of the disk volume


com.engflow.instance.new/used_process_native_buffer_memory

bytes
The total amount of native buffer memory for this process in bytes.

Tags

  • instance_role: type of the instance (scheduler/worker/etc.)

com.engflow.instance.new/used_system_memory

bytes
The amount of used system memory in bytes.

Tags

  • instance_role: type of the instance (scheduler/worker/etc.)

com.engflow.instance.new/used_system_memory_percentage

percentage
The percentage of system memory used.

Tags

  • instance_role: type of the instance (scheduler/worker/etc.)

com.engflow.instance/gc_avg_duration

milliseconds
The average duration spent in garbage collection since the last reported metric.

Tags

  • gc_type

com.engflow.instance/gc_count

no unit
The total number of garbage collections during the lifecycle of this process.

Tags

  • gc_type

com.engflow.instance/gc_time

milliseconds
The total estimated time in milliseconds performing garbage collection.

Tags

  • gc_type

com.engflow.instance/total_disk_space

bytes
The size of the volume.

Tags

  • volume

com.engflow.instance/total_system_memory

bytes
The total amount of system memory in bytes.

com.engflow.instance/used_disk_percentage

percentage
The percentage of the volume that is currently used.

Tags

  • volume

com.engflow.instance/used_disk_space

bytes
The total number of bytes used on the volume.

Tags

  • volume

com.engflow.instance/used_system_memory

bytes
The amount of used system memory in bytes.

com.engflow.instance/used_system_memory_percentage

percentage
The percentage of system memory used.

Netty monitoring


com.engflow.thirdparty.netty/used_direct_memory

bytes
Direct (non-heap) memory use

Tags

  • buffer_name

com.engflow.thirdparty.netty/used_heap_memory

bytes
Heap memory use

Tags

  • buffer_name

Worker Control metrics


com.engflow.re.management.workercontrol/approx_mft_induced_idle_executor_duration

milliseconds
Approximately how much time all executors of all workers marked-for-termination were idle. Each scheduler reports the approximate duration of executor idleness as induced by the scheduler marking workers for termination. Only the master scheduler should report non-zero values. Sum up the idle durations reported by all schedulers to get the overall idleness.

Tags

  • pool: name of the pool ("default" for the default pool)

com.engflow.re.management.workercontrol/mft_on_scheduler

milliseconds
This is a distribution. Tracks how long the marked-for-termination call lasted on the scheduler.

Tags

  • pool: name of the pool ("default" for the default pool)

  • result: the result of the marked-for-termination call


com.engflow.re.management.workercontrol/mft_on_worker

milliseconds
This is a distribution. Tracks how long the marked-for-termination call lasted on the worker.

Tags

  • pool: name of the pool ("default" for the default pool)

  • result: the result of the marked-for-termination call


Action scheduling


com.engflow.re.scheduler/autoscaler_cluster_size_controller_op

milliseconds
Per pool, distribution of how long each cluster size controller operation took and what its completion status was.

Tags

  • op: the operation performed, "setClusterSize" or "reduceClusterSizeByInstance"

  • pool: name of pool ("default" for the default pool)

  • status: "succeeded" or "failed"


com.engflow.re.scheduler/autoscaler_set_size_operations

no unit
Per pool, number of attempts to set its autoscaling group's desired size

Tags

  • pool: name of pool ("default" for the default pool)

  • status: "succeeded" or "failed"

Details

Deprecated. Use com.engflow.re.scheduler/autoscaler_cluster_size_controller_op instead.


com.engflow.re.scheduler/available_workers

no unit
Deprecated; number of idle executors, per pool

Tags

  • pool: name of the pool ("default" for the default pool)
Details

Deprecated. Indicates the number of idle executors, per pool, according to this scheduler. Only schedulers report this metric. Every scheduler reports the same (or about the same) value. This metric is deprecated, because it may be imprecise: schedulers that are started while workers are busy may report a higher value than they should for several minutes. We recommend monitoring com.engflow.re.scheduler/existing_executors instead.


com.engflow.re.scheduler/coalesced_executions

no unit
Number of action requests coalesced into an existing execution
Details

Number of action requests coalesced into an existing execution.


com.engflow.re.scheduler/dequeued_actions

no unit
Per pool, the number of actions that were removed from the queue, either because they started executing on a worker or because they were ejected from the queue for being too old.

Tags

  • pool: name of the pool ("default" for the default pool)

  • reason


com.engflow.re.scheduler/desired_executors

no unit
Number of desired executors, per pool

Tags

  • pool: name of the pool ("default" for the default pool)
Details

Indicates an estimate for the number of required executors per pool. Every scheduler reports its own estimate - they should be summed up to get the total desired pool size.


com.engflow.re.scheduler/estimated_action_time

milliseconds
Estimated action time

Tags

  • pool: name of the pool ("default" for the default pool)
Details

Indicates an estimate for duration of an action. Every scheduler reports its own estimate.


com.engflow.re.scheduler/estimated_induced_load

no unit
Estimated induced wait time

Tags

  • pool: name of the pool ("default" for the default pool)
Details

Indicates an estimate for future incoming work in ms per ms. Every scheduler reports its own estimate.


com.engflow.re.scheduler/existing_executors

no unit
Number of existing executors, per pool

Tags

  • pool: name of the pool ("default" for the default pool)
Details

Number of existing executors, per pool, according to this scheduler. Only schedulers report this metric. Every scheduler reports the same (or about the same) value.
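
The two metrics aggregate differently: desired_executors is summed across schedulers, while existing_executors is roughly the same on every scheduler and should not be summed. A minimal sketch of comparing them for one pool follows. The function name is hypothetical, and taking the maximum of the existing_executors samples is an assumption of this sketch, not a documented recommendation.

```python
from typing import List


def executor_shortfall(desired_by_scheduler: List[int],
                       existing_by_scheduler: List[int]) -> int:
    """Return how many executors a pool lacks (negative means surplus).

    desired_executors: each scheduler reports its own estimate, so sum them.
    existing_executors: every scheduler reports about the same value, so
    summing would overcount; this sketch takes the maximum instead.
    """
    desired_total = sum(desired_by_scheduler)
    existing = max(existing_by_scheduler, default=0)
    return desired_total - existing
```

With two schedulers estimating 4 and 6 desired executors and both reporting 8 existing executors, the shortfall is 2.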


com.engflow.re.scheduler/existing_schedulers

no unit
Number of existing schedulers
Details

Only schedulers report this metric. Every scheduler reports a constant "1". This can be used to detect schedulers that are unable to send monitoring metrics.
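
One way to use this for detection is to compare the schedulers that are expected to exist with those that recently reported a value. The sketch below assumes the list of expected scheduler names is known from elsewhere; the function name and input shapes are hypothetical.

```python
from typing import Dict, Set


def silent_schedulers(expected: Set[str], latest: Dict[str, int]) -> Set[str]:
    """Return expected schedulers with no recent existing_schedulers sample.

    `latest` maps a scheduler name to its most recent reported value
    (always 1), restricted to the alerting window. Both inputs are
    hypothetical shapes chosen for this example.
    """
    reporting = {name for name, value in latest.items() if value == 1}
    return expected - reporting
```

An equivalent check is to alert when the sum of the metric over a time window drops below the expected number of schedulers.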


com.engflow.re.scheduler/global_queue_size

no unit
Number of waiting actions, per pool

Tags

  • pool: name of the pool ("default" for the default pool)
Details

Indicates the number of actions waiting for execution, per pool, in the cluster. Only schedulers report this metric. The schedulers coordinate to calculate this sum.


com.engflow.re.scheduler/global_used_executors

no unit
Number of used executors, per pool

Tags

  • pool: name of the pool ("default" for the default pool)
Details

Indicates the number of executors that are in use, per pool, in the cluster. Only schedulers report this metric.


com.engflow.re.scheduler/mia_count

no unit
Per pool, the number of actions that failed because the executing worker went missing-in-action

Tags

  • pool: name of the pool ("default" for the default pool)

com.engflow.re.scheduler/pool_utilization

percentage
Current executor utilization, per pool

Tags

  • pool: name of the pool ("default" for the default pool)
Details

Reports current executor utilization (used*100/total) per pool, as a percentage ([0..100]). Only schedulers report this metric. Every scheduler reports the same (or about the same) value.

To help make scale-up decisions when a pool is empty, utilization is reported as 100 if there are actions waiting and 0 if not. A pool may be empty if it was scaled down, or if it never existed (the client may request any pool name).
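
The rule described here can be sketched as a small function. The function and parameter names are hypothetical; only the formula (used*100/total) and the empty-pool behavior come from the description.

```python
def pool_utilization(used: int, total: int, waiting_actions: int) -> float:
    """Executor utilization of one pool as a percentage in [0, 100]."""
    if total == 0:
        # Empty pool: report 100 if actions are waiting, otherwise 0.
        return 100.0 if waiting_actions > 0 else 0.0
    return used * 100 / total
```

Reporting 100 for an empty pool with waiting actions lets a threshold-based autoscaling rule trigger a scale-up even though there are no executors to measure.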


com.engflow.re.scheduler/queue_age

milliseconds
Min/max age of queued actions, per pool

Tags

  • pool: name of the pool ("default" for the default pool)

  • statistic: "min" (youngest) or "max" (oldest) action in the pool's queue

Details

Reports the minimum and maximum age of queued actions in each executor pool, i.e. how long entries have been waiting. Only schedulers report this metric. Every scheduler reports its own queue ages. Changes in these values indicate a change in the cluster's throughput.
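
To make the two statistics concrete, the sketch below computes them from a list of enqueue timestamps. The function name, the input shape, and the value reported for an empty queue are assumptions for illustration.

```python
from typing import Dict, List


def queue_age_stats(enqueue_times_ms: List[int], now_ms: int) -> Dict[str, int]:
    """Return the "min" (youngest) and "max" (oldest) age in milliseconds."""
    if not enqueue_times_ms:
        # Assumption: an empty queue is reported as age 0 for both statistics.
        return {"min": 0, "max": 0}
    ages = [now_ms - t for t in enqueue_times_ms]
    return {"min": min(ages), "max": max(ages)}
```

A steadily growing "max" with a small "min" suggests that new actions keep arriving while old ones are not being picked up.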


com.engflow.re.scheduler/queue_size

no unit
Number of waiting actions, per pool

Tags

  • pool: name of the pool ("default" for the default pool)
Details

Indicates the number of actions waiting for execution, per pool, on this scheduler. Only schedulers report this metric. Every scheduler reports its own queue lengths.


Observability UI metrics


com.engflow.observability.ui/app_load

milliseconds
The duration of the initial application load.

Tags

  • page

com.engflow.observability.ui/caught_error

no unit
An error in the web client was manually caught and reported.

Tags

  • error_description

  • page


com.engflow.observability.ui/navigation

no unit
A single user navigation to a new page.

Tags

  • page

com.engflow.observability.ui/page_load

milliseconds
Measures the duration elapsed loading the current page.

Tags

  • load_type

  • page


com.engflow.observability.ui/page_load_with_data_requests

milliseconds
Measures the duration elapsed loading the current page, plus any page-specific data requests sent after it was initially loaded. This is currently only selectively enabled for some pages.

Tags

  • page

  • request_status


com.engflow.observability.ui/rendering_error

no unit
An error in the web client was caught by the rendering pipeline and reported.

Tags

  • page

  • section


com.engflow.observability.ui/uncaught_error

no unit
An uncaught error was thrown in the web client.

Tags

  • error_name

  • page


BES Replay Metrics


com.engflow.bes.replay/cpu_time

milliseconds
The amount of CPU time spent replaying and reducing build event streams, measured in milliseconds.

Tags

  • replay_type

  • status


com.engflow.bes.replay/cpu_time_for_event

milliseconds
The amount of CPU time spent replaying and reducing a single event within a build event stream, measured in milliseconds.

Tags

  • event_type

  • replay_type

  • status


ResultStore metrics


com.engflow.resultstore/reduce_bes_completed_duration_since_finish_event

milliseconds
This is a distribution. Tracks how much time passed between receiving an invocation's finish event and completing the BES reduction.

Tags

  • replay_type

  • status: Whether the BES reduction finished successfully.


com.engflow.resultstore/reduce_bes_replay_removed_from_cache_count

no unit
The number of BES replays that were removed from the replay cache.

Tags

  • replay_type

  • status: The status of the replay when it was removed from the cache.


com.engflow.resultstore/reduce_bes_replay_source_count

no unit
The number of BES replays requested, tagged by where the data was fetched from, and the type of replay.

Tags

  • replay_type

  • source: Indicates where the data was fetched from.


Storage implementation metrics


com.engflow.storage.gc/gc_window_seconds

s
The storage service's GC window.

Tags

  • name: name of the storage service

com.engflow.storage.read/size

bytes
Number of file bytes sent to the client for a read request. May be smaller than the file size in case of error or partial read. Only recorded if the file was found.

Tags

  • name: name of the storage service

  • status: op result


com.engflow.storage.read/time_per_gb

milliseconds
Time taken per 1 billion bytes (1 GB) to download a file from storage.

Tags

  • name: name of the storage service

  • status: op result
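
The normalization to 1 GB can be sketched as follows. The function name and the rejection of empty transfers are assumptions for this example; how the storage service itself treats zero-byte reads is not stated here.

```python
BYTES_PER_GB = 1_000_000_000  # 1 GB as used here: 1 billion bytes, not 2**30


def time_per_gb_ms(elapsed_ms: float, bytes_transferred: int) -> float:
    """Normalize a transfer's duration to milliseconds per 1 GB."""
    if bytes_transferred <= 0:
        # Assumption: empty transfers are rejected rather than recorded.
        raise ValueError("bytes_transferred must be positive")
    return elapsed_ms * BYTES_PER_GB / bytes_transferred
```

For example, a 250 MB download that takes 50 ms corresponds to 200 ms per GB. Normalizing by size makes transfers of very different sizes comparable in one distribution.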


com.engflow.storage.read/time_to_first_byte

milliseconds
Time taken between initiating a download and receiving the first byte.

Tags

  • name: name of the storage service

  • status: op result


com.engflow.storage.read/time_to_next_chunk

milliseconds
Time taken between being notified that the client is ready and sending the next response. May be recorded zero or multiple times for the same call, depending on control flow events.

Tags

  • name: name of the storage service

  • status: op result


com.engflow.storage.write/size

bytes
Number of file bytes received from the client for a write request. May be smaller than the file size in case of error.

Tags

  • name: name of the storage service

  • status: op result


com.engflow.storage.write/time_per_gb

milliseconds
Time taken per 1 billion bytes (1 GB) to upload a file to storage.

Tags

  • name: name of the storage service

  • status: op result


com.engflow.storage.write/time_to_commit

milliseconds
Time between being notified the write is complete and committing the write.

Tags

  • name: name of the storage service

  • status: op result


NotificationQueue metrics


com.engflow.notificationqueue/dequeue_latency

milliseconds
This is a distribution. Refers to the time passed between notification creation and dequeuing it.

Tags

  • expired: Whether the notification was expired and discarded.

  • name: The name of the queue.


com.engflow.notificationqueue/head_age

milliseconds
The age of the first notification in the queue, i.e. the time passed since the notification was created. Notably, it does NOT reflect how much time has passed since the notification was last published. For example, if a notification is not acknowledged and published anew, the head age may be disproportionately high compared to the age of the next notifications in the queue. This can lead to acceptable, occasional spikes.

Tags

  • name: The name of the queue.

com.engflow.notificationqueue/publish

milliseconds
This is a distribution. Refers to the time needed to publish a notification.

Tags

  • name: The name of the queue.

  • status: The status of the operation.


com.engflow.notificationqueue/size

no unit
The approximate size of the queue.

Tags

  • name: The name of the queue.

Action execution


com.engflow.re.exec/completed_actions

no unit
Number of actions that ran to completion, grouped by exit code

Tags

  • exit_code: the action's exit code
Details

This metric reflects the rate of change. Each measurement indicates how many actions completed on this worker, in all pools combined, since the last time this metric was reported.

Only workers report this metric. All workers report their own values. We recommend grouping by exit_code=0 and exit_code!=0, and summing up the time series in the groups. This yields the rate of successful and unsuccessful action completion across the cluster.
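
The recommended grouping can be sketched as below. The sample shape and worker names are hypothetical, and the exit_code tag is assumed to be a string, as tag values typically are.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical sample shape: (worker name, exit_code tag, number of actions
# completed since the previous report).
Sample = Tuple[str, str, int]


def completion_rates(samples: Iterable[Sample]) -> Dict[str, int]:
    """Sum per-worker counts into successful and unsuccessful groups."""
    rates = {"successful": 0, "unsuccessful": 0}
    for _worker, exit_code, count in samples:
        group = "successful" if exit_code == "0" else "unsuccessful"
        rates[group] += count
    return rates


# Made-up samples from two workers within one reporting interval.
samples = [("w0", "0", 40), ("w1", "0", 35), ("w1", "1", 2), ("w0", "137", 1)]
rates = completion_rates(samples)
```

Collapsing all non-zero exit codes into one group keeps the number of time series small while still separating successes from failures.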


com.engflow.re.exec/completed_actions_per_pool

no unit
Number of executed actions (not cached), grouped by pool and status

Tags

  • pool: name of the pool (_default_ for the default pool)

  • status: the action's status (ExecutionStatus: SUCCESS, NON_ZERO_EXIT, CLIENT_ERROR, ERROR)

Details

This metric reflects the rate of change. Each measurement indicates how many actions completed on this worker, per pool, since the last time this metric was reported.

Only workers report this metric. All workers report their own values. We recommend grouping by status, and summing up the time series in the groups. This yields the rate of successful and unsuccessful action completion across the pool/cluster.


com.engflow.re.exec/execution_latency

milliseconds
Bucketed latency (ms), grouped by pool and execution stage

Tags

  • pool: name of the pool (_default_ for the default pool)

  • stage: the action's stage (ExecutionStage: QUEUED, DOWNLOAD_INPUTS, EXECUTE_ACTION, UPLOAD_OUTPUTS, EXECUTOR_TOTAL)


com.engflow.re.exec/executors_existing

no unit
Total number of executors on this worker, in all pools combined
Details

Only workers report this metric. All workers report their own values. You should sum up the time series to get the total number of executors in the cluster.


com.engflow.re.exec/executors_existing_per_pool

no unit
Total number of executors on this worker, per pool

Tags

  • pool: name of the pool (_default_ for the default pool)
Details

Only workers report this metric. All workers report their own values. You should sum up the time series to get the total number of executors in the pool/cluster.


com.engflow.re.exec/max_rss_kib

KiBy
Reported MaxRSS (maximal resident set size) of successfully executed actions (not cached), grouped by pool

Tags

  • pool: name of the pool (_default_ for the default pool)
Details

This metric is a distribution. Each measurement indicates approximately how much memory (in KiB) a successful action (ExecutionStatus: SUCCESS) on this worker reportedly used.

Only workers report this metric. All workers report their own values.


com.engflow.re.exec/started_actions_per_pool

no unit
Number of started actions, grouped by pool

Tags

  • pool: name of the pool (_default_ for the default pool)
Details

This metric reflects the rate of change. Each measurement indicates how many actions started on this worker, per pool, since the last time this metric was reported.

Only workers report this metric. All workers report their own values.


com.engflow.re.exec/used_executors

no unit
Number of busy executors, in all pools
Details

Only workers report this metric. All workers report their own values. You should sum up the time series to get the total number of busy executors in the cluster.


com.engflow.re.exec/used_executors_per_pool

no unit
Number of busy executors, per pool

Tags

  • pool: name of the pool (_default_ for the default pool)
Details

Only workers report this metric. All workers report their own values. You should sum up the time series to get the total number of busy executors in the pool/cluster.


Hazelcast monitoring


com.engflow.re.hazelcast.map/entries

no unit
The number of entries in the Hazelcast map

Tags

  • cluster_name: name of the Hazelcast cluster

  • map_name: name of the Hazelcast map


com.engflow.re.hazelcast.map/memory_used

bytes
The amount of memory used for the map

Tags

  • cluster_name: name of the Hazelcast cluster

  • map_name: name of the Hazelcast map


com.engflow.re.hazelcast/is_master

no unit
Whether a machine is a cluster master; if this sums up to more than one (with the same name), then the cluster is unhealthy.

Tags

  • name: name of the Hazelcast cluster.
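
The health condition described for is_master can be checked by summing the metric per cluster name. The sketch below assumes each reporting machine contributes one (cluster name, value) sample; the function name and cluster names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def clusters_with_multiple_masters(samples: Iterable[Tuple[str, int]]) -> List[str]:
    """Return names of Hazelcast clusters whose is_master values sum above 1.

    Each sample is a hypothetical (cluster name, is_master value) pair, one
    per reporting machine.
    """
    masters: Counter = Counter()
    for cluster_name, is_master in samples:
        masters[cluster_name] += is_master
    return sorted(name for name, total in masters.items() if total > 1)


# Made-up samples: cluster "re" has two machines claiming to be master.
samples = [("re", 1), ("re", 0), ("re", 1), ("bes", 1), ("bes", 0)]
unhealthy = clusters_with_multiple_masters(samples)
```

Grouping by the name tag matters because several Hazelcast clusters may report this metric at the same time, each with its own master.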

com.engflow.re.hazelcast/member_count

no unit
The number of members in the cluster; only the master reports this value

Tags

  • name: name of the Hazelcast cluster.

com.engflow.re.hazelcast/op_time

milliseconds
Distribution of operation time

Tags

  • name: name of the distributed hash map

  • status: op result


com.engflow.thirdparty.hazelcast/partition_migration_finished

no unit
The number of finished Hazelcast partition migrations.

Tags

  • name: name of the Hazelcast cluster

com.engflow.thirdparty.hazelcast/partition_migration_started

no unit
The number of started Hazelcast partition migrations.

Tags

  • name: name of the Hazelcast cluster

com.engflow.thirdparty.hazelcast/partition_migration_time

milliseconds
Time of Hazelcast partition migrations, per Hazelcast cluster.

Tags

  • name: name of the Hazelcast cluster
Details

Reports the time of Hazelcast partition migrations.


com.engflow.thirdparty.hazelcast/replica_migration

no unit
The number of Hazelcast replica migrations.

Tags

  • name: name of the Hazelcast cluster

  • status: status of the operation (OK or FAILURE)


Uncaught exceptions


com.engflow.re/uncaught_exceptions

no unit
Fires every time there is an uncaught exception

CAS server metrics


com.engflow.re.cas/missing_digests

no unit
The total number of missing digests seen by findMissingBlobs.

com.engflow.re.cas/requested_digests

no unit
The total number of digests requested by findMissingBlobs calls.

Remote Execution metrics


com.engflow.remoteexecution/queue_time

milliseconds
This is a distribution. Refers to the time actions are queued.

Tags

  • pool: The name of the pool.
Details

This is a distribution. Refers to the time actions are queued.


Invocation index monitoring


com.engflow.resultstore.index/sql_invocation_index_database_queue_size

no unit
All enqueued or in-progress invocation index database operations
Details

Reflects the number of incomplete operations (either queued or being worked on).

Every instance reports this metric. Every instance reports its own stats.


CAS usage


com.engflow.re.cas/check_blob_exists

milliseconds
Distribution of time needed to check whether a blob exists

Tags

  • status: operation result, e.g. OK, NOT_FOUND
Details

This is a distribution. Refers to the time needed to check whether a blob exists.


com.engflow.re.cas/fetch_call_time

milliseconds
Distribution of CAS fetch operation time

Tags

  • source: name of the CAS location, e.g. EXTERNAL_STORAGE, DISTRIBUTED_CAS_NEAR

  • status: op result, e.g. OK, UNAVAILABLE

Details

The time distribution of individual CAS download calls; each call is measured independently, including when falling back between different sources.


com.engflow.re.cas/fetch_retries

no unit
Count of retries needed when fetching a CAS blob.
Details

Count of retries needed when fetching a CAS blob. Incremented after each failure, so 0 indicates a fetch without errors.
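
The counting rule (incremented after each failure, so 0 means no errors) can be illustrated with a generic retry loop. This is a sketch under stated assumptions: the fetch callable, the max_retries limit, and the use of IOError as the failure signal are hypothetical and not taken from the CAS implementation.

```python
from typing import Callable, Tuple


def fetch_with_retries(fetch: Callable[[], bytes],
                       max_retries: int) -> Tuple[bytes, int]:
    """Call `fetch` until it succeeds; return the blob and the retry count.

    The count is incremented after each failure, so 0 means the first
    attempt succeeded. `fetch` and `max_retries` are hypothetical.
    """
    retries = 0
    while True:
        try:
            return fetch(), retries
        except IOError:
            retries += 1
            if retries > max_retries:
                raise


# Usage with a made-up fetch that fails twice before succeeding.
attempts = {"n": 0}


def flaky_fetch() -> bytes:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise IOError("unavailable")
    return b"blob"


blob, retries = fetch_with_retries(flaky_fetch, max_retries=5)
```

In this usage the fetch succeeds on the third attempt, so the recorded retry count is 2.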


com.engflow.re.cas/find_replicas_time

milliseconds
Distribution of time needed to find which instances have copies of a file
Details

Distribution of time needed to find which instances have copies of a file.


com.engflow.re.cas/load_shed_errors

no unit
Count of RESOURCE_EXHAUSTED errors returned by workers for CAS requests due to load shedding.

Tags

  • method: method returning the error, e.g., read
Details

Count of RESOURCE_EXHAUSTED errors returned by workers for CAS requests due to load shedding.


com.engflow.re.cas/remote_check_blob_exists

milliseconds
Distribution of time needed to check whether a blob exists

Tags

  • source: address used to check on the CAS blob.

  • status: operation result, e.g. OK, NOT_FOUND

Details

This is a distribution. Refers to the time needed to check whether a blob exists.


com.engflow.re.cas/requests_in_flight_incoming

no unit
Number of currently open incoming cache requests, by method and pool.

Tags

  • method: read or write

  • pool: name of the pool serving the request

Details

Number of currently open incoming cache requests, by method and pool.


com.engflow.re.cas/requests_in_flight_outgoing

no unit
Number of currently open outgoing cache requests, by method and pool.

Tags

  • method: read or write

  • pool: name of the pool originating the request

Details

Number of currently open outgoing cache requests, by method and pool. Includes both distributed CAS (ByteStream) and external storage.


com.engflow.re.cas/time_to_next_message

milliseconds
Estimated number of milliseconds to the next gRPC message.

Tags

  • pool: name of the pool serving the request
Details

Estimated number of milliseconds to the next gRPC message.


Local CAS usage


com.engflow.re.cas/available_replica_space

bytes
Available storage space in the CAS that can be used for replicas
Details

Only workers report this metric. All workers report their own values.


com.engflow.re.cas/available_space

bytes
Available storage space in the CAS
Details

Only workers report this metric. All workers report their own values.


com.engflow.re.cas/free_time

milliseconds
Distribution of time needed to free space in the CAS
Details

This is a distribution. It refers to the deletion of expired replicas.

Only workers report this metric. All workers report their own values.


com.engflow.re.cas/gc_time

milliseconds
Distribution of time needed for the GC
Details

This is a distribution. It refers to the collection of expired replicas.

Only workers report this metric. All workers report their own values.


com.engflow.re.cas/lost_files_count

no unit
The number of files that were lost from the CAS
Details

The number of files that were deleted by some other process, or that the CAS instance detected as no longer matching the expected digest.

Only workers report this metric. All workers report their own values.


com.engflow.re.cas/max_total_replica_size

bytes
The max total replica size
Details

This is the maximum amount of storage space the CAS is allowed to use for replicas.

Only workers report this metric. All workers report their own values.


com.engflow.re.cas/max_total_size

bytes
The max total CAS size on the node
Details

This is the maximum amount of storage space the CAS is allowed to use.

Only workers report this metric. All workers report their own values.
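
As a rough illustration, a per-worker utilization figure could be derived from available_space and max_total_size. This is a sketch that assumes both values describe the same storage budget on the same worker; the function name and the sample values are hypothetical.

```python
def cas_used_fraction(available_space: int, max_total_size: int) -> float:
    """Fraction of the CAS budget in use, clamped to the range [0.0, 1.0]."""
    if max_total_size <= 0:
        raise ValueError("max_total_size must be positive")
    used = max_total_size - available_space
    return min(max(used / max_total_size, 0.0), 1.0)


GIB = 1024 ** 3

# Hypothetical sample from one worker: 25 GiB available out of 100 GiB.
print(cas_used_fraction(25 * GIB, 100 * GIB))  # 0.75
```

Because every worker reports its own values, the fraction is only meaningful when both inputs come from the same worker.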


Client authorization


com.engflow.re.auth.async/duration

milliseconds
Authentication call duration
Details

This is a distribution. Only schedulers report this metric. Every scheduler reports its own stats.


SecretStore metrics


com.engflow.secretstore/operation_duration_seconds

s
Time taken to perform an operation on the secret store

Tags

  • operation: The secret store operation reported ('read', etc.)

  • status: The gRPC status code string, in SCREAMING_SNAKE_CASE

  • store_type: The implementation of the instrumented secret store


Licensing metrics


com.engflow.licensing/license_server_fetch_result

no unit
The result of attempted license renewals using the MyEngFlow License Server.

Tags

  • status

Docker use


com.engflow.re.exec.docker/cas_fetch_seconds

s
The time spent pulling Docker images from the CAS. This includes fetching only

Tags

  • status: result of the operation, e.g. "OK", "FAILED"

com.engflow.re.exec.docker/cas_pull_bytes

bytes
The size of Docker images pulled from the CAS

Tags

  • status: result of the operation, e.g. "OK", "FAILED"

com.engflow.re.exec.docker/cas_pull_seconds

s
The time spent pulling Docker images from the CAS. This includes both fetching and loading

Tags

  • status: result of the operation, e.g. "OK", "FAILED"

com.engflow.re.exec.docker/cas_save_seconds

s
The time spent uploading Docker images to the CAS. This includes saving only

Tags

  • status: result of the operation, e.g. "OK", "FAILED"

com.engflow.re.exec.docker/cas_upload_bytes

bytes
The size of Docker images uploaded to the CAS

Tags

  • status: result of the operation, e.g. "OK", "FAILED"

com.engflow.re.exec.docker/cas_upload_seconds

s
The time spent uploading Docker images to the CAS. This includes both saving and uploading

Tags

  • status: result of the operation, e.g. "OK", "FAILED"

com.engflow.re.exec.docker/container_lifecycle_events_count

no unit
Count of various lifecycle events relating to containers

Tags

  • container_lifecycle_event_name

  • pool


com.engflow.re.exec.docker/container_shutdown_time

milliseconds
The time needed to shut down a docker container

com.engflow.re.exec.docker/container_startup_time

milliseconds
The time needed to start a docker container

Tags

  • status: result of the operation, e.g. "OK", "FAILED"

com.engflow.re.exec.docker/containers_failed

no unit
The number of docker containers that failed

com.engflow.re.exec.docker/docker_proxy_failure_count

no unit
The number of times pulls through the docker proxy failed

com.engflow.re.exec.docker/image_pull_time

milliseconds
The time needed to pull a docker image

Tags

  • status: result of the operation, e.g. "OK", "FAILED"

com.engflow.re.exec.docker/network_create_time

milliseconds
The time needed to create a docker network

Tags

  • status: result of the operation, e.g. "OK", "FAILED"

com.engflow.re.exec.docker/network_destroy_time

milliseconds
The time needed to destroy a docker network

Tags

  • status: result of the operation, e.g. "OK", "FAILED"

com.engflow.re.exec.docker/sibling_container_enabled_total

no unit
Counts the number of times that sibling container access was requested via platform properties

Tags

  • pool

Persistent worker use


com.engflow.re.exec.worker/actions

no unit
The number of persistent worker actions run

Tags

  • pool: name of the pool

  • reuse_status: new or reused

Details

The number of persistent worker actions run, aggregated by whether or not they reused a previous persistent worker process.


Scheduler metrics


com.engflow.profiling/publish_invocation_event

milliseconds
Timing for publishing invocation events.

com.engflow.re.cas/entries_evicted

no unit
The number of CAS entries that were evicted due to memory size limitations

com.engflow.re.cas/entries_lost

no unit
The number of CAS entries that could not be recovered on CAS node shutdown events

com.engflow.re.profiler/events

no unit
The number of server-side profile events recorded.

com.engflow.re.profiler/live_handles

no unit
The number of profiles being streamed to the eventstore.

com.engflow.re/remaining_certificate_validity_days

days
The number of remaining days before certificates expire

Tags

  • issuer: The issuer (issuer distinguished name) value from the certificate

  • serial_number: The serial number assigned to the certificate by the issuer

Details

Reports the number of remaining validity days for each X509 certificate processed by schedulers. The issuer and serial number uniquely identify certificates.
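
A minimal sketch of how the two tags might be used when checking for expiring certificates. The sample layout, the threshold, and the certificate names are hypothetical; only the use of the (issuer, serial_number) pair as the key follows from the description above.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical samples: (issuer, serial_number, remaining validity in days).
Sample = Tuple[str, str, float]


def expiring_soon(samples: Iterable[Sample], threshold_days: float = 30.0) -> List[Tuple[str, str]]:
    """Return the (issuer, serial_number) keys whose latest value is below the threshold."""
    latest: Dict[Tuple[str, str], float] = {}
    for issuer, serial_number, days in samples:
        # The pair identifies a certificate; a serial number alone may repeat across issuers.
        latest[(issuer, serial_number)] = days
    return sorted(key for key, days in latest.items() if days < threshold_days)


samples = [
    ("CN=Example Root CA", "01", 200.0),
    ("CN=Example Root CA", "02", 12.0),
    ("CN=Other CA", "01", 45.0),
]
print(expiring_soon(samples))  # [('CN=Example Root CA', '02')]
```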


com.engflow.re/remaining_license_time

days
The number of remaining days before the license expires

Java memory metrics


com.engflow.re/java_heap

bytes
The amount of heap memory used
Details

Every instance reports this metric. Every instance reports its own stats.


Meta metrics


com.engflow.meta/engflow_version

no unit
A heartbeat metric that reports the EngFlow build label if present and "missing_version" otherwise.

Tags

  • version

com.engflow.meta/parallel_engflow_version

no unit
A heartbeat metric that reports how many different EngFlow versions are currently registered with the cluster, as indicated by the build label. All instances that do not report a build label are treated as running the same version, distinct from all other reported versions. Each scheduler reports its own metric.

Docker daemon


com.engflow.docker.container/existing

no unit
The number of existing containers.

Tags

  • daemon: The docker daemon this metric is for (e.g., host).

  • pool: The pool this metric is for (e.g., default or macos).

  • state: The state of the container (e.g., running, exited, ...).


com.engflow.docker.container/size

bytes
The size distribution of existing docker containers.

Tags

  • daemon: The docker daemon this metric is for (e.g., host).

  • filesystem: The filesystem the value applies to (overlay if the value is the size of files modified by the container, and root if the value is the total size of the container including the image).

  • pool: The pool this metric is for (e.g., default or macos).

  • scope: The scope of the value (container if the value is the size of a single container, and daemon if the value is the sum of all containers known to the docker daemon).


com.engflow.docker.image/size

bytes
The size distribution of existing docker images.

Tags

  • daemon: The docker daemon this metric is for (e.g., host).

  • pool: The pool this metric is for (e.g., default or macos).

  • scope: The scope of the value (image if the value is the size of a single image, and daemon if the value is the sum of all images known to the docker daemon).


DB Connection Pool usage


com.engflow.resultstore.index/db_cp_active_connections

no unit
The number of active connections in the pool

Tags

  • db_connection_pool_name

com.engflow.resultstore.index/db_cp_connection_acquire_time

us
The time it takes for the connection pool to acquire a DB connection

Tags

  • db_connection_pool_name

com.engflow.resultstore.index/db_cp_connection_create_time

milliseconds
The time it takes for the connection pool to create a new DB connection

Tags

  • db_connection_pool_name

com.engflow.resultstore.index/db_cp_connection_timeout_count

no unit
The count of timed-out connections

Tags

  • db_connection_pool_name

com.engflow.resultstore.index/db_cp_connection_usage_time

milliseconds
The duration of a use of a connection given by the connection pool

Tags

  • db_connection_pool_name

com.engflow.resultstore.index/db_cp_idle_connections

no unit
The number of idle connections in the pool

Tags

  • db_connection_pool_name

com.engflow.resultstore.index/db_cp_max_connections

no unit
Maximum number of connections existing in the pool

Tags

  • db_connection_pool_name

com.engflow.resultstore.index/db_cp_min_connections

no unit
Minimum number of connections existing in the pool

Tags

  • db_connection_pool_name

com.engflow.resultstore.index/db_cp_pending_connections

no unit
The number of pending connections in the pool

Tags

  • db_connection_pool_name

com.engflow.resultstore.index/db_cp_total_connections

no unit
The number of all currently existing connections in the pool

Tags

  • db_connection_pool_name

Caffeine cache metrics


com.engflow.caching.inmemory/evicts

no unit
Number of cache evictions tagged with the reason for eviction.

Tags

  • name: The name of the cache

  • reason: The reason for eviction


com.engflow.caching.inmemory/hits

no unit
Number of cache hits.

Tags

  • name: The name of the cache

com.engflow.caching.inmemory/loads

milliseconds
Timing and status information for cache loads.

Tags

  • name: The name of the cache

  • status: The status of the load


com.engflow.caching.inmemory/misses

no unit
Number of cache misses.

Tags

  • name: The name of the cache
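
The hits and misses counters can be combined into a per-cache hit ratio. The sketch below assumes both counters were read over the same time window for the same name tag; the cache names and values are hypothetical.

```python
from typing import Optional


def hit_ratio(hits: int, misses: int) -> Optional[float]:
    """Return hits / (hits + misses), or None when there were no lookups."""
    total = hits + misses
    if total == 0:
        return None
    return hits / total


# Hypothetical counter values, keyed by the `name` tag.
counts = {
    "example_cache": {"hits": 900, "misses": 100},
    "idle_cache": {"hits": 0, "misses": 0},
}
for name, c in counts.items():
    print(name, hit_ratio(c["hits"], c["misses"]))
```

Returning None for an idle cache avoids reporting a misleading ratio of 0 when no lookups occurred.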

DB Query stats


com.engflow.resultstore.index/duration

milliseconds
The duration of a query

Tags

  • query_name

  • query_outcome


com.engflow.resultstore.index/preparation

milliseconds
The duration of creating a preparedQuery

Tags

  • query_name

Outgoing CI API Calls


com.engflow.ci.http/api_call_latency

milliseconds
Time taken for an outgoing API call, indexed by hostname and status

Tags

  • hostname: hostname of the API service that was called

  • status: HTTP status of the call (1xx, 2xx, 3xx, 4xx, 5xx, or UNKNOWN)

Details

Indicates the number of API calls made on this scheduler, per service and status. The status only contains the abbreviated HTTP status code. Only schedulers report this metric. Every scheduler reports its own API calls.


CI runner


com.engflow.ci.basic/gh_error_propagation_jobs

no unit
Number of finished GitHub error propagation jobs

Tags

  • status: the status of the error propagation job
Details

Number of finished GitHub error propagation jobs.


com.engflow.ci.basic/gh_runner_no_jobs

no unit
Number of finished GitHub runs that timed out and did not pick up any jobs.
Details

Number of finished GitHub runs that timed out and did not pick up any jobs.


com.engflow.ci.basic/gh_runner_wrong_job

no unit
Number of finished GitHub runs that picked up the wrong job.
Details

Number of finished GitHub runs that picked up the wrong job.


com.engflow.ci.basic/jobs_completed

no unit
Number of jobs completed

Tags

  • status: job completion status
Details

The number of completed CI jobs, by status. This includes jobs for which we failed before starting the remote runner.


com.engflow.ci.basic/jobs_queue_age

milliseconds
Age of queued jobs

Tags

  • ci_family: name of the CI of the job

  • statistic: min/max queue age

Details

The age of queued jobs on GitHub/BuildKite. This metric is only reported if polling is enabled, and only for jobs that are configured for EngFlow CI runners. The value is how long ago the job was created, relative to the current poll time.
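
The age calculation can be sketched as follows. The function name and the timestamps are hypothetical; the sketch only illustrates "poll time minus creation time" and the min/max statistic tag, not the actual poller.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional


def queue_age_stats(created: List[datetime], poll_time: datetime) -> Optional[Dict[str, int]]:
    """Return min/max queue age in milliseconds, or None when no jobs are queued."""
    ages_ms = [int((poll_time - c).total_seconds() * 1000) for c in created]
    if not ages_ms:
        return None
    return {"min": min(ages_ms), "max": max(ages_ms)}


poll_time = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
created = [
    datetime(2024, 1, 1, 11, 59, 30, tzinfo=timezone.utc),  # queued 30 s ago
    datetime(2024, 1, 1, 11, 55, 0, tzinfo=timezone.utc),   # queued 5 min ago
]
print(queue_age_stats(created, poll_time))  # {'min': 30000, 'max': 300000}
```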


com.engflow.ci.basic/jobs_queued

no unit
Number of queued jobs

Tags

  • ci_family: name of the CI of the job
Details

The number of queued jobs on GitHub/BuildKite. This metric is only reported if polling is enabled.


com.engflow.ci.basic/poll_duration_millis

milliseconds
Duration of polls against the remote CI system

Tags

  • ci_family: name of the CI system to poll

  • status: the status of the polling job

Details

Measures the duration of polls against the remote CI system.


com.engflow.ci.full/builtin_step_duration

milliseconds
The duration of individual built-in steps in a CI job

Tags

  • ci_family: name of the CI of the job

  • git_repository: name of the repository of the job

  • runner_architecture: architecture of the requested runner for a given job

  • runner_os: OS of the requested runner for a given job

  • step: job step name

Details

Measures the duration of a job's individual built-in (non-user-defined) steps' execution.


com.engflow.ci.full/git_command_duration

milliseconds
The duration of individual git commands during a CI job

Tags

  • ci_family: name of the CI of the job

  • git_command: git command: checkout/fetch/index-pack

  • git_repository: name of the repository of the job

  • job: job name

  • pipeline: name of the pipeline that triggered the job

  • runner_architecture: architecture of the requested runner for a given job

  • runner_os: OS of the requested runner for a given job

Details

Measures the duration of various git subcommands during a job's execution.


com.engflow.ci.full/job_duration

milliseconds
The duration of CI jobs

Tags

  • ci_family: name of the CI of the job

  • git_repository: name of the repository of the job

  • job: job name

  • pipeline: name of the pipeline that triggered the job

  • runner_architecture: architecture of the requested runner for a given job

  • runner_os: OS of the requested runner for a given job

Details

Measures the duration of a job's execution.


com.engflow.ci.full/jobs_completed

no unit
Number of jobs completed

Tags

  • ci_family: name of the CI of the job

  • git_repository: name of the repository of the job

  • job: job name

  • pipeline: name of the pipeline that triggered the job

  • runner_architecture: architecture of the requested runner for a given job

  • runner_os: OS of the requested runner for a given job

  • status: job completion status

Details

The number of completed CI jobs, by status. This includes jobs for which we failed before starting the remote runner.


com.engflow.ci.full/jobs_started

no unit
Number of jobs started

Tags

  • ci_family: name of the CI of the job

  • git_repository: name of the repository of the job

  • job: job name

  • pipeline: name of the pipeline that triggered the job

  • runner_architecture: architecture of the requested runner for a given job

  • runner_os: OS of the requested runner for a given job

Details

The number of jobs started on this scheduler. Every scheduler reports its own jobs.


com.engflow.ci.full/step_duration

milliseconds
The duration of individual user-defined steps in a CI job

Tags

  • ci_family: name of the CI of the job

  • git_repository: name of the repository of the job

  • job: job name

  • pipeline: name of the pipeline that triggered the job

  • runner_architecture: architecture of the requested runner for a given job

  • runner_os: OS of the requested runner for a given job

  • step: job step name

Details

Measures the duration of a job's individual steps' execution.


com.engflow.ci.full/time_to_start_job

milliseconds
How long it takes for CI runners to start a job

Tags

  • ci_family: name of the CI of the job

  • git_repository: name of the repository of the job

  • job: job name

  • pipeline: name of the pipeline that triggered the job

  • runner_architecture: architecture of the requested runner for a given job

  • runner_os: OS of the requested runner for a given job

Details

Measures the time between a given CI requesting a job's execution and an action actually starting it on CI runners.


gRPC factory metrics


com.engflow.grpc.factory/channels

no unit
The number of open gRPC channels in the channel factory

CI runner - GitHub Action metrics


com.engflow.ci.github.metrics/api_primary_rate_limit_max

no unit
Maximum number of requests available in the current API rate limiter slot

Tags

  • API_RESOURCE_NAME

com.engflow.ci.github.metrics/api_primary_rate_limit_used

no unit
Number of requests used in the current API rate limiter slot

Tags

  • API_RESOURCE_NAME

RPC metrics


com.engflow.rpc/duration

s
Time distribution of RPCs

Tags

  • method: The RPC method

  • protocol: The protocol of the RPC (e.g., rest)

  • status: The status of the operation (e.g., OK)

  • type: The type of the RPC (e.g., client or server)