Metrics Reference

Description of all metrics available for monitoring.

AnalyzeInvocationProfileHandler metrics


com.engflow.invocationanalyzer/bazel_profile_size

bytes
The size of the uncompressed Bazel profile handled


com.engflow.invocationanalyzer/engflow_profile_size

bytes
The size of the uncompressed EngFlow profile handled


com.engflow.invocationanalyzer/time_needed

milliseconds
The time distribution of handling individual profile analysis requests

Tags

  • status: The status of the analysis performed
Details

The time distribution of handling individual Bazel profiles.


Metrics derived from raw BEP streams


com.engflow.bep/invocation_completed

no unit
Fired with the count of completed invocations reported to the BEP.

Tags

  • exit_code: The human readable exit code of the invocation.

com.engflow.bep/invocation_duration

milliseconds
Fired when an invocation completes, with the average duration of the invocation.

com.engflow.bep/invocation_started

no unit
Fired with the count of newly started invocations reported to the BEP.

Blob-storage implementation metrics


com.engflow.blobstore/ops

no unit
Fires every time an operation takes place.

Tags

  • operation

BEP Event Storage and Replay


com.engflow.eventstore/build_event_owners

no unit
The total number of build owners residing in memory. A build owner is an internal piece of state associated with a particular build that reports build events.

com.engflow.eventstore/inbound_bep_events

no unit
Fired whenever an event is received on an inbound stream.

Tags

  • type

com.engflow.eventstore/new_outbound_streams

no unit
Fired whenever a new outbound BEP stream is read.

com.engflow.eventstore/ongoing_streams

no unit
The total number of streams that are inbound, outbound, or both.

com.engflow.eventstore/outbound_bep_events

no unit
Fired whenever an event is sent on an outbound stream.

Virtual Machine Instances


com.engflow.instance.new/gc_avg_duration

milliseconds
The average duration spent in garbage collection since the last reported metric.

Tags

  • gc_type: G1 Old generation / G1 Young Generation

  • instance_role: type of the instance (scheduler/worker/etc.)


com.engflow.instance.new/gc_count

no unit
The total number of garbage collections during the lifecycle of this process.

Tags

  • gc_type: G1 Old generation / G1 Young Generation

  • instance_role: type of the instance (scheduler/worker/etc.)


com.engflow.instance.new/gc_time

milliseconds
The total estimated time in milliseconds performing garbage collection.

Tags

  • gc_type: G1 Old generation / G1 Young Generation

  • instance_role: type of the instance (scheduler/worker/etc.)


com.engflow.instance.new/open_file_descriptors

no unit
The number of file descriptors the process has currently open.

Tags

  • instance_role: type of the instance (scheduler/worker/etc.)

com.engflow.instance.new/total_disk_space

bytes
The size of the volume.

Tags

  • instance_role: type of the instance (scheduler/worker/etc.)

  • volume: the name of the disk volume


com.engflow.instance.new/total_system_memory

bytes
The total amount of system memory in bytes.

Tags

  • instance_role: type of the instance (scheduler/worker/etc.)

com.engflow.instance.new/used_disk_percentage

percentage
The percentage of the volume that is currently used.

Tags

  • instance_role: type of the instance (scheduler/worker/etc.)

  • volume: the name of the disk volume


com.engflow.instance.new/used_disk_space

bytes
The total number of bytes used on the volume.

Tags

  • instance_role: type of the instance (scheduler/worker/etc.)

  • volume: the name of the disk volume


com.engflow.instance.new/used_system_memory

bytes
The amount of used system memory in bytes.

Tags

  • instance_role: type of the instance (scheduler/worker/etc.)

com.engflow.instance.new/used_system_memory_percentage

percentage
The percentage of system memory used.

Tags

  • instance_role: type of the instance (scheduler/worker/etc.)

com.engflow.instance/gc_avg_duration

milliseconds
The average duration spent in garbage collection since the last reported metric.

Tags

  • gc_type

com.engflow.instance/gc_count

no unit
The total number of garbage collections during the lifecycle of this process.

Tags

  • gc_type

com.engflow.instance/gc_time

milliseconds
The total estimated time in milliseconds performing garbage collection.

Tags

  • gc_type

com.engflow.instance/total_disk_space

bytes
The size of the volume.

Tags

  • volume

com.engflow.instance/total_system_memory

bytes
The total amount of system memory in bytes.

com.engflow.instance/used_disk_percentage

percentage
The percentage of the volume that is currently used.

Tags

  • volume

com.engflow.instance/used_disk_space

bytes
The total number of bytes used on the volume.

Tags

  • volume

com.engflow.instance/used_system_memory

bytes
The amount of used system memory in bytes.

com.engflow.instance/used_system_memory_percentage

percentage
The percentage of system memory used.

Netty monitoring


com.engflow.thirdparty.netty/used_direct_memory

bytes
Direct (non-heap) memory use

Tags

  • buffer_name

com.engflow.thirdparty.netty/used_heap_memory

bytes
Heap memory use

Tags

  • buffer_name

io.netty.buffer/used_direct_memory

bytes
Direct (non-heap) memory use

Tags

  • buffer_name

io.netty.buffer/used_heap_memory

bytes
Heap memory use

Tags

  • buffer_name

Action scheduling


com.engflow.re.scheduler/available_workers

no unit
Deprecated; number of idle executors, per pool

Tags

  • name: name of the pool ("default" for the default pool)
Details

Deprecated. Indicates the number of idle executors, per pool, according to this scheduler. Only schedulers report this metric. Every scheduler reports the same (or about the same) value. This metric is deprecated, because it may be imprecise: schedulers that are started while workers are busy may report a higher value than they should for several minutes. We recommend monitoring com.engflow.re.scheduler/existing_executors instead.


com.engflow.re.scheduler/desired_executors

no unit
Number of desired executors, per pool

Tags

  • name: name of the pool ("default" for the default pool)
Details

Indicates an estimate for the number of required executors per pool. Every scheduler reports its own estimate; sum these estimates to get the total desired pool size.
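The summation described above can be sketched as follows. This is a minimal illustration, not an EngFlow API: the sample shape `(scheduler_id, pool_name, value)` and the function name `total_desired_executors` are assumptions made for the example.

```python
from collections import defaultdict


def total_desired_executors(samples):
    """Sum per-scheduler estimates into one desired size per pool.

    `samples` is a hypothetical iterable of (scheduler_id, pool_name, value)
    tuples holding the latest desired_executors value of each scheduler.
    """
    totals = defaultdict(int)
    for _scheduler_id, pool_name, value in samples:
        totals[pool_name] += value
    return dict(totals)


# Hypothetical samples from two schedulers.
samples = [
    ("scheduler-1", "default", 12),
    ("scheduler-2", "default", 8),
    ("scheduler-1", "gpu", 3),
]
print(total_desired_executors(samples))
```

In a real monitoring backend the same result would normally come from a sum aggregation grouped by the name tag.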


com.engflow.re.scheduler/existing_executors

no unit
Number of existing executors, per pool

Tags

  • name: name of the pool ("default" for the default pool)
Details

Number of existing executors, per pool, according to this scheduler. Only schedulers report this metric. Every scheduler reports the same (or about the same) value.


com.engflow.re.scheduler/existing_schedulers

no unit
Number of existing schedulers
Details

Only schedulers report this metric. Every scheduler reports a constant "1". This can be used to detect schedulers that are unable to send monitoring metrics.
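One possible way to use this metric for detection, as a sketch: sum the latest values and compare the result with the number of schedulers you expect to be running. The expected count and the function name `missing_scheduler_count` are assumptions for illustration; they are not part of the metric itself.

```python
def missing_scheduler_count(expected_count, latest_values):
    """Return how many schedulers appear to be silent.

    `latest_values` holds the most recent existing_schedulers value (a
    constant 1) from every scheduler that reported in the current window.
    `expected_count` is a hypothetical, operator-supplied cluster size.
    """
    reporting = sum(latest_values)
    return max(0, expected_count - reporting)


# Three schedulers expected, only two reported: one is silent.
print(missing_scheduler_count(3, [1, 1]))
```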


com.engflow.re.scheduler/pool_utilization

percentage
Current executor utilization, per pool

Tags

  • name: name of the pool ("default" for the default pool)
Details

Reports current executor utilization (used*100/total) per pool, as a percentage ([0..100]). Only schedulers report this metric. Every scheduler reports the same (or about the same) value.

To help make scale-up decisions when a pool is empty, utilization is reported as 100 if there are actions waiting and 0 if not. A pool may be empty if it was scaled down, or if it never existed (the client may request any pool name).
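The rule above can be restated as a small function. This is a sketch that mirrors the documented behavior (used*100/total, with the empty-pool special case), not the scheduler's actual implementation; the function and parameter names are illustrative.

```python
def pool_utilization(used, total, waiting_actions):
    """Compute utilization as a percentage in [0, 100].

    An empty pool (total == 0) reports 100 when actions are waiting and 0
    otherwise, so that autoscalers still see a scale-up signal.
    """
    if total == 0:
        return 100.0 if waiting_actions > 0 else 0.0
    return used * 100.0 / total


print(pool_utilization(3, 4, 0))   # regular pool
print(pool_utilization(0, 0, 5))   # empty pool with queued actions
print(pool_utilization(0, 0, 0))   # empty pool, nothing waiting
```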


com.engflow.re.scheduler/queue_age

milliseconds
Min/max age of queued actions, per pool

Tags

  • name: name of the pool ("default" for the default pool)

  • statistic: "min" (youngest) or "max" (oldest) action in the pool's queue

Details

Reports minimum and maximum age in each executor pool, i.e. how long entries have been waiting. Only schedulers report this metric. Every scheduler reports the ages for its own queues. Changes in these values indicate a change in the cluster's throughput.


com.engflow.re.scheduler/queue_size

no unit
Number of waiting actions, per pool

Tags

  • name: name of the pool ("default" for the default pool)
Details

Indicates the number of actions waiting for execution, per pool, on this scheduler. Only schedulers report this metric. Every scheduler reports its own queue lengths.


Observability UI metrics


com.engflow.observability.ui/app_load

milliseconds
The duration of the initial application load.

Tags

  • page

com.engflow.observability.ui/caught_error

no unit
An error occurred in the web client and was manually caught and reported.

Tags

  • error_description

  • page


com.engflow.observability.ui/navigation

no unit
A single user navigation to a new page.

Tags

  • page

com.engflow.observability.ui/uncaught_error

no unit
An uncaught error was thrown in the web client.

Tags

  • error_name

  • page


Storage implementation metrics


com.engflow.storage.read/size

bytes
Number of file bytes sent to the client for a read request. May be smaller than the file size in case of error or partial read. Only recorded if the file was found.

Tags

  • name: name of the storage service

  • status: op result


com.engflow.storage.read/time_per_gb

milliseconds
Time taken per 1 billion bytes (1 GB) to download a file from storage.

Tags

  • name: name of the storage service

  • status: op result
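As an illustration of the normalization, the sketch below scales an observed transfer time to 1 billion bytes. It assumes the metric is derived as elapsed time divided by bytes transferred; the source does not confirm the exact computation, and the function name is hypothetical.

```python
BYTES_PER_GB = 1_000_000_000  # 1 GB as defined above (decimal, not GiB)


def time_per_gb_ms(elapsed_ms, transferred_bytes):
    """Scale an observed transfer time to milliseconds per 1 GB.

    Returns None for empty transfers, where the ratio is undefined.
    """
    if transferred_bytes <= 0:
        return None
    return elapsed_ms * BYTES_PER_GB / transferred_bytes


# A 250 MB download that took 500 ms.
print(time_per_gb_ms(500, 250_000_000))
```

Normalizing this way makes transfers of different sizes comparable in one distribution.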


com.engflow.storage.read/time_to_first_byte

milliseconds
Time between initiating a download and receiving the first byte.

Tags

  • name: name of the storage service

  • status: op result


com.engflow.storage.read/time_to_next_chunk

milliseconds
Time between being notified that the client is ready and sending the next response. May be recorded zero or multiple times for the same call, depending on control flow events.

Tags

  • name: name of the storage service

  • status: op result


com.engflow.storage.write/size

bytes
Number of file bytes received from the client for a write request. May be smaller than the file size in case of error.

Tags

  • name: name of the storage service

  • status: op result


com.engflow.storage.write/time_per_gb

milliseconds
Time taken per 1 billion bytes (1 GB) to upload a file to storage.

Tags

  • name: name of the storage service

  • status: op result


com.engflow.storage.write/time_to_commit

milliseconds
Time between being notified the write is complete and committing the write.

Tags

  • name: name of the storage service

  • status: op result


NotificationQueue metrics


com.engflow.notificationqueue/publish

milliseconds
This is a distribution. Refers to the time needed to publish a notification.

Tags

  • name: The name of the queue.

  • status: The status of the operation.



Action execution


com.engflow.re.exec/completed_actions

no unit
Number of actions that ran to completion, grouped by exit code

Tags

  • exit_code: the action's exit code
Details

This metric reflects the rate of change. Each measurement indicates how many actions completed on this worker, in all pools combined, since the last time this metric was reported.

Only workers report this metric. All workers report their own values. We recommend grouping by exit_code=0 and exit_code!=0, and summing up the time series in the groups. This yields the rate of successful and unsuccessful action completion across the cluster.
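The recommended grouping can be sketched as follows, assuming each measurement is available as a `(worker_id, exit_code, count)` tuple. That shape and the function name are illustrative assumptions, not an EngFlow API.

```python
def completions_by_outcome(samples):
    """Sum completed_actions across workers, split by exit_code == 0 or not.

    `samples` is a hypothetical iterable of (worker_id, exit_code, count)
    tuples for one reporting interval. Tag values are compared as strings,
    because monitoring systems usually export tags as text.
    """
    totals = {"success": 0, "failure": 0}
    for _worker_id, exit_code, count in samples:
        key = "success" if str(exit_code) == "0" else "failure"
        totals[key] += count
    return totals


samples = [
    ("worker-1", "0", 40),
    ("worker-1", "1", 2),
    ("worker-2", "0", 35),
    ("worker-2", "137", 1),
]
print(completions_by_outcome(samples))
```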


com.engflow.re.exec/completed_actions_per_pool

no unit
Number of executed actions (not cached), grouped by pool and status

Tags

  • pool: name of the pool (_default_ for the default pool)

  • status: the action's status (ExecutionStatus: SUCCESS, NON_ZERO_EXIT, ERROR)

Details

This metric reflects the rate of change. Each measurement indicates how many actions completed on this worker, per pool, since the last time this metric was reported.

Only workers report this metric. All workers report their own values. We recommend grouping by status, and summing up the time series in the groups. This yields the rate of successful and unsuccessful action completion across the pool/cluster.


com.engflow.re.exec/execution_latency

milliseconds
Bucketed latency (ms), grouped by pool and execution stage

Tags

  • pool: name of the pool (_default_ for the default pool)

  • stage: the action's stage (ExecutionStage: QUEUED, DOWNLOAD_INPUTS, EXECUTE_ACTION, UPLOAD_OUTPUTS, EXECUTOR_TOTAL)


com.engflow.re.exec/executors_existing

no unit
Total number of executors on this worker, in all pools combined
Details

Only workers report this metric. All workers report their own values. You should sum up the time series to get the total number of executors in the cluster.


com.engflow.re.exec/executors_existing_per_pool

no unit
Total number of executors on this worker, per pool

Tags

  • pool: name of the pool (_default_ for the default pool)
Details

Only workers report this metric. All workers report their own values. You should sum up the time series to get the total number of executors in the pool/cluster.


com.engflow.re.exec/used_executors

no unit
Number of busy executors, in all pools
Details

Only workers report this metric. All workers report their own values. You should sum up the time series to get the total number of busy executors in the cluster.


com.engflow.re.exec/used_executors_per_pool

no unit
Number of busy executors, per pool

Tags

  • pool: name of the pool (_default_ for the default pool)
Details

Only workers report this metric. All workers report their own values. You should sum up the time series to get the total number of busy executors in the pool/cluster.


Hazelcast monitoring


com.engflow.re.hazelcast.map/entries

no unit
The number of entries in the Hazelcast map

Tags

  • cluster_name: name of the Hazelcast cluster

  • map_name: name of the Hazelcast map


com.engflow.re.hazelcast.map/memory_used

bytes
The amount of memory used for the map

Tags

  • cluster_name: name of the Hazelcast cluster

  • map_name: name of the Hazelcast map


com.engflow.re.hazelcast/is_master

no unit
Whether a machine is a cluster master; if this sums up to more than one (with the same name), then the cluster is unhealthy.

Tags

  • name: name of the Hazelcast cluster.
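The health check described for is_master can be sketched like this. It assumes each instance reports 1 when it is the master and 0 otherwise, and that samples arrive as `(instance_id, cluster_name, value)` tuples; both the value encoding and the sample shape are assumptions for the example.

```python
from collections import Counter


def clusters_with_multiple_masters(samples):
    """Return the names of clusters whose is_master values sum to more than 1.

    `samples` is a hypothetical iterable of (instance_id, cluster_name, value)
    tuples, with value assumed to be 1 for a master and 0 otherwise.
    """
    masters = Counter()
    for _instance_id, cluster_name, value in samples:
        masters[cluster_name] += value
    return sorted(name for name, count in masters.items() if count > 1)


samples = [
    ("node-a", "cluster-1", 1),
    ("node-b", "cluster-1", 0),
    ("node-c", "cluster-2", 1),
    ("node-d", "cluster-2", 1),  # second master: unhealthy
]
print(clusters_with_multiple_masters(samples))
```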

com.engflow.re.hazelcast/member_count

no unit
The number of members in the cluster; only the master reports this value

Tags

  • name: name of the Hazelcast cluster.

com.engflow.re.hazelcast/op_time

milliseconds
Distribution of operation time

Tags

  • name: name of the distributed hash map

  • status: op result


com.engflow.thirdparty.hazelcast/partition_migration_finished

no unit
The number of finished Hazelcast partition migrations.

Tags

  • name: name of the Hazelcast cluster

com.engflow.thirdparty.hazelcast/partition_migration_started

no unit
The number of started Hazelcast partition migrations.

Tags

  • name: name of the Hazelcast cluster

com.engflow.thirdparty.hazelcast/partition_migration_time

milliseconds
Time of Hazelcast partition migrations, per Hazelcast cluster.

Tags

  • name: name of the Hazelcast cluster
Details

Reports the time of Hazelcast partition migrations.


com.engflow.thirdparty.hazelcast/replica_migration

no unit
The number of Hazelcast replica migrations.

Tags

  • name: name of the Hazelcast cluster

  • status: status of the operation (OK or FAILURE)


Uncaught exceptions


com.engflow.re/uncaught_exceptions

no unit
Fires every time there is an uncaught exception

CAS server metrics


com.engflow.re.cas/missing_digests

no unit
The total number of missing digests seen by findMissingBlobs.

com.engflow.re.cas/requested_digests

no unit
The total number of digests requested by a findMissingBlobs call.

Remote Execution metrics


com.engflow.remoteexecution/queue_time

milliseconds
This is a distribution. Refers to the time actions are queued.

Tags

  • pool: The name of the pool.


Invocation index monitoring


com.engflow.resultstore.index/sql_invocation_index_database_queue_size

no unit
All enqueued or in-progress invocation index database operations
Details

Reflects the number of incomplete operations (either queued or being worked on).

Every instance reports this metric. Every instance reports its own stats.


CAS usage


com.engflow.re.cas/available_replica_space

bytes
Available storage space in the CAS that can be used for replicas
Details

Only workers report this metric. All workers report their own values.


com.engflow.re.cas/available_space

bytes
Available storage space in the CAS
Details

Only workers report this metric. All workers report their own values.


com.engflow.re.cas/check_blob_exists

milliseconds
Distribution of time needed to check whether a blob exists

Tags

  • status: operation result, e.g. OK, NOT_FOUND
Details

This is a distribution. Refers to the time needed to check whether a blob exists.


com.engflow.re.cas/fetch_call_time

milliseconds
Distribution of CAS fetch operation time

Tags

  • source: name of the CAS location, e.g. EXTERNAL_STORAGE, DISTRIBUTED_CAS

  • status: op result, e.g. OK, UNAVAILABLE

Details

The time distribution of individual CAS download calls; each call is measured independently, including when falling back between different sources.


com.engflow.re.cas/free_time

milliseconds
Distribution of time needed to free space in the CAS
Details

This is a distribution. It refers to the deletion of expired replicas.

Only workers report this metric. All workers report their own values.


com.engflow.re.cas/gc_time

milliseconds
Distribution of time needed for the GC
Details

This is a distribution. It refers to the collection of expired replicas.

Only workers report this metric. All workers report their own values.


com.engflow.re.cas/lost_files_count

no unit
The number of files that were lost from the CAS
Details

The number of files that were deleted by some other process, or that the CAS instance detected no longer matched the expected digest.

Only workers report this metric. All workers report their own values.


com.engflow.re.cas/max_total_replica_size

bytes
The max total replica size
Details

This is the maximum amount of storage space the CAS is allowed to use for replicas.

Only workers report this metric. All workers report their own values.


com.engflow.re.cas/max_total_size

bytes
The max total CAS size on the node
Details

This is the maximum amount of storage space the CAS is allowed to use.

Only workers report this metric. All workers report their own values.


Client authorization


com.engflow.re.auth.async/call_count

no unit
Number of calls made
Details

Deprecated. Though it may seem so, this metric doesn't actually track client connection attempts accurately.

Use com.engflow.re.auth.async/duration aggregated by count instead.


com.engflow.re.auth.async/duration

milliseconds
Authentication call duration
Details

This is a distribution. Only schedulers report this metric. Every scheduler reports its own stats.


External storage use


com.engflow.re.storage.existence_cache/evictions

no unit
Evictions from the ExternalStorage CAS existence cache

com.engflow.re.storage.existence_cache/hits

no unit
Hits on the ExternalStorage CAS existence cache

com.engflow.re.storage.existence_cache/misses

no unit
Misses on the ExternalStorage CAS existence cache

com.engflow.re.storage/gc_check

no unit
GC status updates

Tags

  • result

com.engflow.re.storage/gc_deleted_objects

no unit
Count of objects deleted by GC
Details

Logged when GC deletes an object.


com.engflow.re.storage/ops

no unit
All completed external storage operations

Tags

  • operation

  • result


com.engflow.re.storage/ops_queue_size

no unit
All enqueued or in-progress external storage operations

Tags

  • operation

com.engflow.re.storage/proxy_stall_time_ms

milliseconds
Total milliseconds reads are blocked by client flow control


com.engflow.re.storage/traffic

bytes
All external storage traffic

Tags

  • operation

Docker use


com.engflow.re.exec.docker/container_shutdown_time

milliseconds
The time needed to shut down a docker container

com.engflow.re.exec.docker/container_startup_time

milliseconds
The time needed to start a docker container

Tags

  • status: result of the operation, e.g. "OK", "FAILED"

com.engflow.re.exec.docker/containers_failed

no unit
The number of docker containers that failed

com.engflow.re.exec.docker/existing_containers

no unit
The number of running docker containers

com.engflow.re.exec.docker/image_pull_time

milliseconds
The time needed to pull a docker image

Tags

  • status: result of the operation, e.g. "OK", "FAILED"

com.engflow.re.exec.docker/network_create_time

milliseconds
The time needed to create a docker network

Tags

  • status: result of the operation, e.g. "OK", "FAILED"

com.engflow.re.exec.docker/network_destroy_time

milliseconds
The time needed to destroy a docker network

Tags

  • status: result of the operation, e.g. "OK", "FAILED"

Persistent worker use


com.engflow.re.exec.worker/actions

no unit
The number of persistent worker actions run

Tags

  • reuse_status: new or reused
Details

The number of persistent worker actions run, aggregated by whether they reused a previous persistent worker process or not


Scheduler metrics


com.engflow.re.cas/entries_evicted

no unit
The number of CAS entries that were evicted due to memory size limitations

com.engflow.re.cas/entries_lost

no unit
The number of CAS entries that could not be recovered on CAS node shutdown events

com.engflow.re.profiler/events

no unit
The number of server-side profile events recorded.

com.engflow.re.profiler/live_handles

no unit
The number of profiles being streamed to the eventstore.

com.engflow.re/remaining_license_time

days
The number of remaining days before the license expires

Java memory metrics


com.engflow.re/java_heap

bytes
The amount of heap memory used
Details

Every instance reports this metric. Every instance reports its own stats.


Meta metrics


com.engflow.meta/engflow_version

no unit
A heartbeat metric that reports the EngFlow build label if present and "missing_version" otherwise.

Tags

  • version

DB Connection Pool usage


com.engflow.resultstore.index/db_cp_active_connections

no unit
The number of active connections in the pool

Tags

  • db_connection_pool_name

com.engflow.resultstore.index/db_cp_connection_acquire_time

microseconds
The time it takes for the connection pool to acquire a DB connection

Tags

  • db_connection_pool_name

com.engflow.resultstore.index/db_cp_connection_create_time

milliseconds
The time it takes for the connection pool to create a new DB connection

Tags

  • db_connection_pool_name

com.engflow.resultstore.index/db_cp_connection_timeout_count

no unit
The count of timed-out connections

Tags

  • db_connection_pool_name

com.engflow.resultstore.index/db_cp_connection_usage_time

milliseconds
The duration of a use of a connection given by the connection pool

Tags

  • db_connection_pool_name

com.engflow.resultstore.index/db_cp_idle_connections

no unit
The number of idle connections in the pool

Tags

  • db_connection_pool_name

com.engflow.resultstore.index/db_cp_max_connections

no unit
Maximum number of connections existing in the pool

Tags

  • db_connection_pool_name

com.engflow.resultstore.index/db_cp_min_connections

no unit
Minimum number of connections existing in the pool

Tags

  • db_connection_pool_name

com.engflow.resultstore.index/db_cp_pending_connections

no unit
The number of pending connections in the pool

Tags

  • db_connection_pool_name

com.engflow.resultstore.index/db_cp_total_connections

no unit
The number of all currently existing connections in the pool

Tags

  • db_connection_pool_name

DB Query stats


com.engflow.resultstore.index/duration

milliseconds
The duration of a query

Tags

  • query_name

  • query_outcome


com.engflow.resultstore.index/preparation

milliseconds
The duration of creating a preparedQuery

Tags

  • query_name