Metrics Reference¶
Description of all metrics available for monitoring.
AnalyzeInvocationProfileHandler metrics¶
com.engflow.invocationanalyzer/bazel_profile_size¶
- bytes
- The size of the uncompressed Bazel profile handled
- Details
The size of the uncompressed Bazel profile handled.
com.engflow.invocationanalyzer/engflow_profile_size¶
- bytes
- The size of the uncompressed EngFlow profile handled
- Details
The size of the uncompressed EngFlow profile handled.
com.engflow.invocationanalyzer/time_needed¶
- milliseconds
- The time distribution of handling individual profile analysis requests
Tags
status
: The status of the analysis performed
- Details
The time distribution of handling individual Bazel profiles.
Metrics derived from raw BEP streams¶
com.engflow.bep/invocation_completed¶
- no unit
- Fired with the count of completed invocations reported to the BEP.
Tags
exit_code
: The human readable exit code of the invocation.
com.engflow.bep/invocation_duration¶
- milliseconds
- Fired when an invocation completes, with the average duration of the invocation.
com.engflow.bep/invocation_started¶
- no unit
- Fired with the count of newly started invocations reported to the BEP.
Blob-storage implementation metrics¶
com.engflow.blobstore/ops¶
- no unit
- Fires every time an operation takes place.
Tags
operation
BEP Event Storage and Replay¶
com.engflow.eventstore/build_event_owners¶
- no unit
- The total number of build owners residing in memory. A build owner is an internal piece of state associated with a particular build that reports build events.
com.engflow.eventstore/inbound_bep_events¶
- no unit
- Fired whenever an event is received on an inbound stream.
Tags
type
com.engflow.eventstore/new_outbound_streams¶
- no unit
- Fired whenever a new outbound BEP stream is read.
com.engflow.eventstore/ongoing_streams¶
- no unit
- The total number of streams that are inbound, outbound, or both.
com.engflow.eventstore/outbound_bep_events¶
- no unit
- Fired whenever an event is sent on an outbound stream.
Virtual Machine Instances¶
com.engflow.instance.new/gc_avg_duration¶
- milliseconds
- The average duration spent in garbage collection since the last reported metric.
Tags
gc_type
: G1 Old generation / G1 Young Generation
instance_role
: type of the instance (scheduler/worker/etc.)
com.engflow.instance.new/gc_count¶
- no unit
- The total number of garbage collections during the lifecycle of this process.
Tags
gc_type
: G1 Old generation / G1 Young Generation
instance_role
: type of the instance (scheduler/worker/etc.)
com.engflow.instance.new/gc_time¶
- milliseconds
- The total estimated time in milliseconds performing garbage collection.
Tags
gc_type
: G1 Old generation / G1 Young Generation
instance_role
: type of the instance (scheduler/worker/etc.)
com.engflow.instance.new/open_file_descriptors¶
- no unit
- The number of file descriptors the process has currently open.
Tags
instance_role
: type of the instance (scheduler/worker/etc.)
com.engflow.instance.new/total_disk_space¶
- bytes
- The size of the volume.
Tags
instance_role
: type of the instance (scheduler/worker/etc.)
volume
: the name of the disk volume
com.engflow.instance.new/total_system_memory¶
- bytes
- The total amount of system memory in bytes.
Tags
instance_role
: type of the instance (scheduler/worker/etc.)
com.engflow.instance.new/used_disk_percentage¶
- percentage
- The percentage of the volume that is currently used.
Tags
instance_role
: type of the instance (scheduler/worker/etc.)
volume
: the name of the disk volume
com.engflow.instance.new/used_disk_space¶
- bytes
- The total number of bytes used on the volume.
Tags
instance_role
: type of the instance (scheduler/worker/etc.)
volume
: the name of the disk volume
com.engflow.instance.new/used_system_memory¶
- bytes
- The amount of used system memory in bytes.
Tags
instance_role
: type of the instance (scheduler/worker/etc.)
com.engflow.instance.new/used_system_memory_percentage¶
- percentage
- The percentage of system memory used.
Tags
instance_role
: type of the instance (scheduler/worker/etc.)
com.engflow.instance/gc_avg_duration¶
- milliseconds
- The average duration spent in garbage collection since the last reported metric.
Tags
gc_type
com.engflow.instance/gc_count¶
- no unit
- The total number of garbage collections during the lifecycle of this process.
Tags
gc_type
com.engflow.instance/gc_time¶
- milliseconds
- The total estimated time in milliseconds performing garbage collection.
Tags
gc_type
com.engflow.instance/total_disk_space¶
- bytes
- The size of the volume.
Tags
volume
com.engflow.instance/total_system_memory¶
- bytes
- The total amount of system memory in bytes.
com.engflow.instance/used_disk_percentage¶
- percentage
- The percentage of the volume that is currently used.
Tags
volume
com.engflow.instance/used_disk_space¶
- bytes
- The total number of bytes used on the volume.
Tags
volume
com.engflow.instance/used_system_memory¶
- bytes
- The amount of used system memory in bytes.
com.engflow.instance/used_system_memory_percentage¶
- percentage
- The percentage of system memory used.
Netty monitoring¶
com.engflow.thirdparty.netty/used_direct_memory¶
- bytes
- Direct (non-heap) memory use
Tags
buffer_name
com.engflow.thirdparty.netty/used_heap_memory¶
- bytes
- Heap memory use
Tags
buffer_name
io.netty.buffer/used_direct_memory¶
- bytes
- Direct (non-heap) memory use
Tags
buffer_name
io.netty.buffer/used_heap_memory¶
- bytes
- Heap memory use
Tags
buffer_name
Action scheduling¶
com.engflow.re.scheduler/available_workers¶
- no unit
- Deprecated; number of idle executors, per pool
Tags
name
: name of the pool ("default" for the default pool)
- Details
Deprecated. Indicates the number of idle executors, per pool, according to this scheduler. Only schedulers report this metric. Every scheduler reports the same (or about the same) value. This metric is deprecated, because it may be imprecise: schedulers that are started while workers are busy may report a higher value than they should for several minutes. We recommend monitoring com.engflow.re.scheduler/existing_executors instead.
com.engflow.re.scheduler/desired_executors¶
- no unit
- Number of desired executors, per pool
Tags
name
: name of the pool ("default" for the default pool)
- Details
Indicates an estimate for the number of required executors per pool. Every scheduler reports its own estimate - they should be summed up to get the total desired pool size.
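The summing step described for desired_executors can be sketched as follows. This is only an illustration: the `samples` tuples and the function name are hypothetical, and how you fetch the latest value per scheduler depends on your monitoring backend.

```python
from collections import defaultdict


def total_desired_executors(samples):
    """Sum per-scheduler estimates into one desired size per pool.

    `samples` is a hypothetical list of (scheduler_id, pool_name, value)
    tuples holding the latest desired_executors value from each scheduler.
    """
    totals = defaultdict(int)
    for _scheduler_id, pool_name, value in samples:
        totals[pool_name] += value
    return dict(totals)


# Made-up example values, for illustration only.
samples = [
    ("scheduler-1", "default", 12),
    ("scheduler-2", "default", 8),
    ("scheduler-1", "gpu", 3),
]
print(total_desired_executors(samples))  # {'default': 20, 'gpu': 3}
```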
com.engflow.re.scheduler/existing_executors¶
- no unit
- Number of existing executors, per pool
Tags
name
: name of the pool ("default" for the default pool)
- Details
Number of existing executors, per pool, according to this scheduler. Only schedulers report this metric. Every scheduler reports the same (or about the same) value.
com.engflow.re.scheduler/existing_schedulers¶
- no unit
- Number of existing schedulers
- Details
Only schedulers report this metric. Every scheduler reports a constant "1". This can be used to detect schedulers that are unable to send monitoring metrics.
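One way to use the constant "1" from existing_schedulers is to compare the schedulers that reported against the ones you expect. The sketch below assumes you already know the expected scheduler identifiers; the function name and the sample layout are hypothetical.

```python
def schedulers_not_reporting(expected_schedulers, samples):
    """Return expected schedulers with no existing_schedulers sample.

    `samples` is a hypothetical list of (scheduler_id, value) pairs
    collected during the most recent reporting window.
    """
    reporting = {scheduler_id for scheduler_id, value in samples if value == 1}
    return sorted(set(expected_schedulers) - reporting)


# Made-up example: scheduler-3 sent nothing in this window.
expected = ["scheduler-1", "scheduler-2", "scheduler-3"]
window = [("scheduler-1", 1), ("scheduler-2", 1)]
print(schedulers_not_reporting(expected, window))  # ['scheduler-3']
```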
com.engflow.re.scheduler/pool_utilization¶
- percentage
- Current executor utilization, per pool
Tags
name
: name of the pool ("default" for the default pool)
- Details
Reports current executor utilization (used*100/total) per pool, as a percentage ([0..100]). Only schedulers report this metric. Every scheduler reports the same (or about the same) value.
To help make scale-up decisions when a pool is empty, utilization is reported as 100 if there are actions waiting and 0 if not. A pool may be empty if it was scaled down, or if it never existed (the client may request any pool name).
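The pool_utilization rule, including the empty-pool case, can be written down as a small function. This is a sketch of the formula as described in the text, not the scheduler's actual implementation; the parameter names are assumptions.

```python
def pool_utilization(used, total, waiting_actions):
    """Return executor utilization as a percentage in [0, 100].

    `used` and `total` are executor counts for one pool, and
    `waiting_actions` is the number of queued actions for that pool.
    """
    if total == 0:
        # Empty pool: report 100 if actions are waiting, 0 otherwise.
        return 100.0 if waiting_actions > 0 else 0.0
    return used * 100.0 / total


print(pool_utilization(3, 4, 0))  # 75.0
print(pool_utilization(0, 0, 5))  # 100.0
print(pool_utilization(0, 0, 0))  # 0.0
```

Reporting 100 for an empty pool with waiting actions gives an autoscaler a clear signal to add executors even though no ratio can be computed.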
com.engflow.re.scheduler/queue_age¶
- milliseconds
- Min/max age of queued actions, per pool
Tags
name
: name of the pool ("default" for the default pool)
statistic
: "min" (youngest) or "max" (oldest) action in the pool's queue
- Details
Reports the minimum and maximum age of queued actions in each executor pool, i.e. how long entries have been waiting. Only schedulers report this metric. Every scheduler reports its own queue ages. Changes in these values indicate a change in the cluster's throughput.
com.engflow.re.scheduler/queue_size¶
- no unit
- Number of waiting actions, per pool
Tags
name
: name of the pool ("default" for the default pool)
- Details
Indicates the number of actions waiting for execution, per pool, on this scheduler. Only schedulers report this metric. Every scheduler reports its own queue lengths.
Observability UI metrics¶
com.engflow.observability.ui/app_load¶
- milliseconds
- The duration of the initial application load.
Tags
page
com.engflow.observability.ui/caught_error¶
- no unit
- An error in the web client was manually caught and reported.
Tags
error_description
page
com.engflow.observability.ui/navigation¶
- no unit
- A single user navigation to a new page.
Tags
page
com.engflow.observability.ui/uncaught_error¶
- no unit
- An uncaught error was thrown in the web client.
Tags
error_name
page
Storage implementation metrics¶
com.engflow.storage.read/size¶
- bytes
- Number of file bytes sent to the client for a read request. May be smaller than the file size in case of error or partial read. Only recorded if the file was found.
Tags
name
: name of the storage service
status
: op result
com.engflow.storage.read/time_per_gb¶
- milliseconds
- Time taken per 1 billion bytes (1 GB) to download a file from storage.
Tags
name
: name of the storage service
status
: op result
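To make the time_per_gb unit concrete, the sketch below normalizes a transfer's elapsed time to 1 billion bytes. The source only defines the unit; the formula and the function name are assumptions for illustration, not a description of how the server computes the metric.

```python
BYTES_PER_GB = 1_000_000_000  # 1 GB as defined above: 1 billion bytes


def time_per_gb_ms(elapsed_ms, bytes_transferred):
    """Scale an elapsed time in milliseconds to a per-GB figure."""
    if bytes_transferred <= 0:
        raise ValueError("bytes_transferred must be positive")
    return elapsed_ms * BYTES_PER_GB / bytes_transferred


# A 500 MB download that took 250 ms corresponds to 500 ms per GB.
print(time_per_gb_ms(250, 500_000_000))  # 500.0
```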
com.engflow.storage.read/time_to_first_byte¶
- milliseconds
- Time taken between initiating a download and receiving the first byte.
Tags
name
: name of the storage service
status
: op result
com.engflow.storage.read/time_to_next_chunk¶
- milliseconds
- Time taken between being notified that the client is ready and sending the next response. May be recorded zero or multiple times for the same call, depending on flow control events.
Tags
name
: name of the storage service
status
: op result
com.engflow.storage.write/size¶
- bytes
- Number of file bytes received from the client for a write request. May be smaller than the file size in case of error.
Tags
name
: name of the storage service
status
: op result
com.engflow.storage.write/time_per_gb¶
- milliseconds
- Time taken per 1 billion bytes (1 GB) to upload a file to storage.
Tags
name
: name of the storage service
status
: op result
com.engflow.storage.write/time_to_commit¶
- milliseconds
- Time between being notified the write is complete and committing the write.
Tags
name
: name of the storage service
status
: op result
NotificationQueue metrics¶
com.engflow.notificationqueue/publish¶
- milliseconds
- This is a distribution. Refers to the time needed to publish a notification.
Tags
name
: The name of the queue.
status
: The status of the operation.
- Details
This is a distribution. Refers to the time needed to publish a notification.
Action execution¶
com.engflow.re.exec/completed_actions¶
- no unit
- Number of actions that ran to completion, grouped by exit code
Tags
exit_code
: the action's exit code
- Details
This metric reflects the rate of change. Each measurement indicates how many actions completed on this worker, in all pools combined, since the last time this metric was reported.
Only workers report this metric. All workers report their own values. We recommend grouping by exit_code=0 and exit_code!=0, and summing up the time series in the groups. This yields the rate of successful and unsuccessful action completion across the cluster.
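The recommended grouping for completed_actions (exit_code=0 versus exit_code!=0, summed across workers) might look like this. The sample layout and function name are hypothetical; exit codes are compared as strings on the assumption that tag values arrive as text.

```python
def completion_rates(samples):
    """Split completed-action counts into successful and unsuccessful.

    `samples` is a hypothetical list of (worker_id, exit_code, count)
    tuples, one per worker and exit code for a single reporting interval.
    """
    rates = {"successful": 0, "unsuccessful": 0}
    for _worker_id, exit_code, count in samples:
        group = "successful" if str(exit_code) == "0" else "unsuccessful"
        rates[group] += count
    return rates


# Made-up interval data from two workers.
interval = [
    ("worker-1", "0", 40),
    ("worker-1", "1", 2),
    ("worker-2", "0", 35),
    ("worker-2", "137", 1),
]
print(completion_rates(interval))  # {'successful': 75, 'unsuccessful': 3}
```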
com.engflow.re.exec/completed_actions_per_pool¶
- no unit
- Number of executed actions (not cached), grouped by pool and status
Tags
pool
: name of the pool (_default_ for the default pool)
status
: the action's status (ExecutionStatus: SUCCESS, NON_ZERO_EXIT, ERROR)
- Details
This metric reflects the rate of change. Each measurement indicates how many actions completed on this worker, per pool, since the last time this metric was reported.
Only workers report this metric. All workers report their own values. We recommend grouping by status, and summing up the time series in the groups. This yields the rate of successful and unsuccessful action completion across the pool/cluster.
com.engflow.re.exec/execution_latency¶
- milliseconds
- Bucketed latency (ms), grouped by pool and execution stage
Tags
pool
: name of the pool (_default_ for the default pool)
stage
: the action's stage (ExecutionStage: QUEUED, DOWNLOAD_INPUTS, EXECUTE_ACTION, UPLOAD_OUTPUTS, EXECUTOR_TOTAL)
com.engflow.re.exec/executors_existing¶
- no unit
- Total number of executors on this worker, in all pools combined
- Details
Only workers report this metric. All workers report their own values. You should sum up the time series to get the total number of executors in the cluster.
com.engflow.re.exec/executors_existing_per_pool¶
- no unit
- Total number of executors on this worker, per pool
Tags
pool
: name of the pool (_default_ for the default pool)
- Details
Only workers report this metric. All workers report their own values. You should sum up the time series to get the total number of executors in the pool/cluster.
com.engflow.re.exec/used_executors¶
- no unit
- Number of busy executors, in all pools
- Details
Only workers report this metric. All workers report their own values. You should sum up the time series to get the total number of busy executors in the cluster.
com.engflow.re.exec/used_executors_per_pool¶
- no unit
- Number of busy executors, per pool
Tags
pool
: name of the pool (_default_ for the default pool)
- Details
Only workers report this metric. All workers report their own values. You should sum up the time series to get the total number of busy executors in the pool/cluster.
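Because every worker reports its own executor counts, a per-pool view needs the per-worker series summed first. The sketch below combines used_executors_per_pool and executors_existing_per_pool samples into (busy, existing) pairs per pool. The tuple layout and function names are hypothetical.

```python
from collections import defaultdict


def sum_by_pool(samples):
    """Sum (worker_id, pool, value) samples into one value per pool."""
    totals = defaultdict(int)
    for _worker_id, pool, value in samples:
        totals[pool] += value
    return dict(totals)


def busy_and_existing(used_samples, existing_samples):
    """Return {pool: (busy_executors, existing_executors)}."""
    used = sum_by_pool(used_samples)
    existing = sum_by_pool(existing_samples)
    return {pool: (used.get(pool, 0), total) for pool, total in existing.items()}


# Made-up samples from two workers in the default pool.
used = [("worker-1", "_default_", 3), ("worker-2", "_default_", 1)]
existing = [("worker-1", "_default_", 4), ("worker-2", "_default_", 4)]
print(busy_and_existing(used, existing))  # {'_default_': (4, 8)}
```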
Hazelcast monitoring¶
com.engflow.re.hazelcast.map/entries¶
- no unit
- The number of entries in the Hazelcast map
Tags
cluster_name
: name of the Hazelcast cluster
map_name
: name of the Hazelcast map
com.engflow.re.hazelcast.map/memory_used¶
- bytes
- The amount of memory used for the map
Tags
cluster_name
: name of the Hazelcast cluster
map_name
: name of the Hazelcast map
com.engflow.re.hazelcast/is_master¶
- no unit
- Whether a machine is a cluster master; if this sums up to more than one (with the same name), then the cluster is unhealthy.
Tags
name
: name of the Hazelcast cluster.
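The health rule for is_master (a per-cluster sum greater than one means the cluster is unhealthy) can be checked with a few lines. The sample layout and the function name are assumptions for illustration.

```python
from collections import defaultdict


def clusters_with_multiple_masters(samples):
    """Return names of Hazelcast clusters whose is_master values sum above 1.

    `samples` is a hypothetical list of (instance_id, cluster_name, value)
    tuples, where value is 1 if the instance reports itself as master.
    """
    masters = defaultdict(int)
    for _instance_id, cluster_name, value in samples:
        masters[cluster_name] += value
    return sorted(name for name, count in masters.items() if count > 1)


# Made-up samples: the cluster named "cas" has two self-declared masters.
reported = [
    ("node-1", "cas", 1),
    ("node-2", "cas", 1),
    ("node-1", "scheduler", 1),
    ("node-2", "scheduler", 0),
]
print(clusters_with_multiple_masters(reported))  # ['cas']
```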
com.engflow.re.hazelcast/member_count¶
- no unit
- The number of members in the cluster; only the master reports this value
Tags
name
: name of the Hazelcast cluster.
com.engflow.re.hazelcast/op_time¶
- milliseconds
- Distribution of operation time
Tags
name
: name of the distributed hash map
status
: op result
com.engflow.thirdparty.hazelcast/partition_migration_finished¶
- no unit
- The number of finished Hazelcast partition migrations.
Tags
name
: name of the Hazelcast cluster
com.engflow.thirdparty.hazelcast/partition_migration_started¶
- no unit
- The number of started Hazelcast partition migrations.
Tags
name
: name of the Hazelcast cluster
com.engflow.thirdparty.hazelcast/partition_migration_time¶
- milliseconds
- Time of Hazelcast partition migrations, per Hazelcast cluster.
Tags
name
: name of the Hazelcast cluster
- Details
Reports the time of Hazelcast partition migrations.
com.engflow.thirdparty.hazelcast/replica_migration¶
- no unit
- The number of Hazelcast replica migrations.
Tags
name
: name of the Hazelcast cluster
status
: status of the operation (OK or FAILURE)
Uncaught exceptions¶
com.engflow.re/uncaught_exceptions¶
- no unit
- Fires every time there is an uncaught exception
CAS server metrics¶
com.engflow.re.cas/missing_digests¶
- no unit
- The total number of missing digests seen by findMissingBlobs.
com.engflow.re.cas/requested_digests¶
- no unit
- The total number of digests requested by findMissingBlobs calls.
Remote Execution metrics¶
com.engflow.remoteexecution/queue_time¶
- milliseconds
- This is a distribution. Refers to the time actions are queued.
Tags
pool
: The name of the pool.
- Details
This is a distribution. Refers to the time actions are queued.
Invocation index monitoring¶
com.engflow.resultstore.index/sql_invocation_index_database_queue_size¶
- no unit
- All enqueued or in-progress invocation index database operations
- Details
Reflects the number of incomplete operations (either queued or being worked on).
Every instance reports this metric. Every instance reports its own stats.
CAS usage¶
com.engflow.re.cas/available_replica_space¶
- bytes
- Available storage space in the CAS that can be used for replicas
- Details
Only workers report this metric. All workers report their own values.
com.engflow.re.cas/available_space¶
- bytes
- Available storage space in the CAS
- Details
Only workers report this metric. All workers report their own values.
com.engflow.re.cas/check_blob_exists¶
- milliseconds
- Distribution of time needed to check whether a blob exists
Tags
status
: operation result, e.g. OK, NOT_FOUND
- Details
This is a distribution. Refers to the time needed to check whether a blob exists.
com.engflow.re.cas/fetch_call_time¶
- milliseconds
- Distribution of CAS fetch operation time
Tags
source
: name of the CAS location, e.g. EXTERNAL_STORAGE, DISTRIBUTED_CAS
status
: op result, e.g. OK, UNAVAILABLE
- Details
The time distribution of individual CAS download calls; each call is measured independently, including when falling back between different sources.
com.engflow.re.cas/free_time¶
- milliseconds
- Distribution of time needed to free space in the CAS
- Details
This is a distribution. It refers to the deletion of expired replicas.
Only workers report this metric. All workers report their own values.
com.engflow.re.cas/gc_time¶
- milliseconds
- Distribution of time needed for the GC
- Details
This is a distribution. It refers to the collection of expired replicas.
Only workers report this metric. All workers report their own values.
com.engflow.re.cas/lost_files_count¶
- no unit
- The number of files that were lost from the CAS
- Details
The number of files that were deleted by some other process, or that the CAS instance detected as no longer matching the expected digest.
Only workers report this metric. All workers report their own values.
com.engflow.re.cas/max_total_replica_size¶
- bytes
- The max total replica size
- Details
This is the maximum amount of storage space the CAS is allowed to use for replicas.
Only workers report this metric. All workers report their own values.
com.engflow.re.cas/max_total_size¶
- bytes
- The max total CAS size on the node
- Details
This is the maximum amount of storage space the CAS is allowed to use.
Only workers report this metric. All workers report their own values.
Client authorization¶
com.engflow.re.auth.async/call_count¶
- no unit
- Number of calls made
- Details
Deprecated. Though it may seem so, this metric doesn't actually track client connection attempts accurately.
Use com.engflow.re.auth.async/duration aggregated by count instead.
com.engflow.re.auth.async/duration¶
- milliseconds
- Authentication call duration
- Details
This is a distribution. Only schedulers report this metric. Every scheduler reports its own stats.
External storage use¶
com.engflow.re.storage.existence_cache/evictions¶
- no unit
- Evictions from the ExternalStorage CAS existence cache
com.engflow.re.storage.existence_cache/hits¶
- no unit
- Hits on the ExternalStorage CAS existence cache
com.engflow.re.storage.existence_cache/misses¶
- no unit
- Misses on the ExternalStorage CAS existence cache
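The hits and misses counters of the existence cache are commonly combined into a hit ratio. The sketch below assumes both values are counts taken over the same time window; that assumption and the function name are not from the source.

```python
def existence_cache_hit_ratio(hits, misses):
    """Return hits / (hits + misses), or None if there were no lookups."""
    lookups = hits + misses
    if lookups == 0:
        return None
    return hits / lookups


print(existence_cache_hit_ratio(90, 10))  # 0.9
```

Returning None for an idle window avoids reporting a misleading 0 when the cache simply was not used.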
com.engflow.re.storage/gc_check¶
- no unit
- GC status updates
Tags
result
com.engflow.re.storage/gc_deleted_objects¶
- no unit
- Count of objects deleted for GC
- Details
Logged when GC deletes an object
com.engflow.re.storage/ops¶
- no unit
- All completed external storage operations
Tags
operation
result
com.engflow.re.storage/ops_queue_size¶
- no unit
- All enqueued or in-progress external storage operations
Tags
operation
com.engflow.re.storage/proxy_stall_time_ms¶
- milliseconds
- Total milliseconds reads are blocked by client flow control
- Details
Total milliseconds reads are blocked by client flow control
com.engflow.re.storage/traffic¶
- bytes
- All external storage traffic
Tags
operation
Docker use¶
com.engflow.re.exec.docker/container_shutdown_time¶
- milliseconds
- The time needed to shut down a docker container
com.engflow.re.exec.docker/container_startup_time¶
- milliseconds
- The time needed to start a docker container
Tags
status
: result of the operation, e.g. "OK", "FAILED"
com.engflow.re.exec.docker/containers_failed¶
- no unit
- The number of docker containers that failed
com.engflow.re.exec.docker/existing_containers¶
- no unit
- The number of running docker containers
com.engflow.re.exec.docker/image_pull_time¶
- milliseconds
- The time needed to pull a docker image
Tags
status
: result of the operation, e.g. "OK", "FAILED"
com.engflow.re.exec.docker/network_create_time¶
- milliseconds
- The time needed to create a docker network
Tags
status
: result of the operation, e.g. "OK", "FAILED"
com.engflow.re.exec.docker/network_destroy_time¶
- milliseconds
- The time needed to destroy a docker network
Tags
status
: result of the operation, e.g. "OK", "FAILED"
Persistent worker use¶
com.engflow.re.exec.worker/actions¶
- no unit
- The number of persistent worker actions run
Tags
reuse_status
: new or reused
- Details
The number of persistent worker actions run, aggregated by whether they reused a previous persistent worker process or not
Scheduler metrics¶
com.engflow.re.cas/entries_evicted¶
- no unit
- The number of CAS entries that were evicted due to memory size limitations
com.engflow.re.cas/entries_lost¶
- no unit
- The number of CAS entries that could not be recovered on CAS node shutdown events
com.engflow.re.profiler/events¶
- no unit
- The number of server-side profile events recorded.
com.engflow.re.profiler/live_handles¶
- no unit
- The number of profiles being streamed to the eventstore.
com.engflow.re/remaining_license_time¶
- days
- The number of remaining days before the license expires
Java memory metrics¶
com.engflow.re/java_heap¶
- bytes
- The amount of heap memory used
- Details
Every instance reports this metric. Every instance reports its own stats.
Meta metrics¶
com.engflow.meta/engflow_version¶
- no unit
- A heartbeat metric that reports the EngFlow build label if present and "missing_version" otherwise.
Tags
version
DB Connection Pool usage¶
com.engflow.resultstore.index/db_cp_active_connections¶
- no unit
- The number of active connections in the pool
Tags
db_connection_pool_name
com.engflow.resultstore.index/db_cp_connection_acquire_time¶
- microseconds
- The time it takes for the connection pool to acquire a DB connection
Tags
db_connection_pool_name
com.engflow.resultstore.index/db_cp_connection_create_time¶
- milliseconds
- The time it takes for the connection pool to create a new DB connection
Tags
db_connection_pool_name
com.engflow.resultstore.index/db_cp_connection_timeout_count¶
- no unit
- The count of timed-out connections
Tags
db_connection_pool_name
com.engflow.resultstore.index/db_cp_connection_usage_time¶
- milliseconds
- The duration of a use of a connection given by the connection pool
Tags
db_connection_pool_name
com.engflow.resultstore.index/db_cp_idle_connections¶
- no unit
- The number of idle connections in the pool
Tags
db_connection_pool_name
com.engflow.resultstore.index/db_cp_max_connections¶
- no unit
- Maximum number of connections existing in the pool
Tags
db_connection_pool_name
com.engflow.resultstore.index/db_cp_min_connections¶
- no unit
- Minimum number of connections existing in the pool
Tags
db_connection_pool_name
com.engflow.resultstore.index/db_cp_pending_connections¶
- no unit
- The number of pending connections in the pool
Tags
db_connection_pool_name
com.engflow.resultstore.index/db_cp_total_connections¶
- no unit
- The number of all currently existing connections in the pool
Tags
db_connection_pool_name
DB Query stats¶
com.engflow.resultstore.index/duration¶
- milliseconds
- The duration of a query
Tags
query_name
query_outcome
com.engflow.resultstore.index/preparation¶
- milliseconds
- The duration of creating a preparedQuery
Tags
query_name