Metrics Reference¶
Description of all metrics available for monitoring.
AnalyzeInvocation metrics¶
com.engflow.invocationanalyzer/bazel_profile_count¶
- no unit
- The number of Bazel profiles that were attempted to be fetched
Tags
status
: The status of retrieving the Bazel profile.
- Details
The number of Bazel profiles that were attempted to be fetched.
com.engflow.invocationanalyzer/bazel_profile_size¶
- bytes
- The size of the uncompressed Bazel profile handled
- Details
The size of the uncompressed Bazel profile handled.
com.engflow.invocationanalyzer/engflow_profile_count¶
- no unit
- The number of EngFlow profiles that were attempted to be fetched
Tags
status
: The status of retrieving the EngFlow profile.
- Details
The number of EngFlow profiles that were attempted to be fetched.
com.engflow.invocationanalyzer/engflow_profile_size¶
- bytes
- The size of the uncompressed EngFlow profile handled
- Details
The size of the uncompressed EngFlow profile handled.
com.engflow.invocationanalyzer/time_needed¶
- milliseconds
- The time distribution of handling individual profile analysis requests
Tags
status
: The status of the analysis performed
- Details
The time distribution of handling individual Bazel profiles.
Opencensus OperationController metrics reporter¶
com.engflow.operationcontroller/active¶
- no unit
- The number of active operations (i.e., currently running).
Tags
name
: The name of the OperationController
- Details
The number of active operations (i.e., currently running).
com.engflow.operationcontroller/latency¶
- milliseconds
- The latency to start operations (i.e., how long operations are waiting to be executed).
Tags
name
: The name of the OperationController
- Details
The latency to start operations (i.e., how long operations are waiting to be executed).
com.engflow.operationcontroller/queued¶
- no unit
- The number of operations queued for execution.
Tags
name
: The name of the OperationController
- Details
The number of operations queued for execution.
com.engflow.operationcontroller/runtime¶
- milliseconds
- The runtime of operations (i.e., the duration operations are running for).
Tags
name
: The name of the OperationController
- Details
The runtime of operations (i.e., the duration operations are running for).
Metrics derived from raw BEP streams¶
com.engflow.bep/invocation_completed¶
- no unit
- Fired with the count of completed invocations reported to the BEP.
Tags
exit_code
: The human readable exit code of the invocation.
com.engflow.bep/invocation_duration¶
- milliseconds
- Fired on invocation completed with the average duration of the invocation.
com.engflow.bep/invocation_started¶
- no unit
- Fired with the count of newly started invocations reported to the BEP.
Blob-storage implementation metrics¶
com.engflow.blobstore/latency¶
- milliseconds
- The duration each operation takes.
Tags
- operation
- status
com.engflow.blobstore/ops¶
- no unit
- Fires every time an operation takes place.
Tags
operation
Docker proxy¶
com.engflow.dockerproxy/blob_upload_bytes¶
- bytes
- The size of Docker blobs that the proxy successfully uploaded to the CAS.
Tags
status
com.engflow.dockerproxy/cache_hit_bytes¶
- bytes
- The size of Docker blobs that the proxy could find in the CAS.
com.engflow.dockerproxy/cache_miss_bytes¶
- bytes
- The size of Docker blobs that the proxy expected but could not find in the CAS.
com.engflow.dockerproxy/known_blobs_total¶
- no unit
- The number of Docker blobs that the proxy has metadata about.
HTTP clients for the Docker proxy¶
com.engflow.dockerproxy/http_received_bytes_total¶
- bytes
- Bytes received over the HTTP client
Tags
client
: The name of the HTTP client (can be used to distinguish layers)
com.engflow.dockerproxy/http_request_latency_seconds_total¶
- s
- Time taken to serve HTTP requests
Tags
- client: The name of the HTTP client (can be used to distinguish layers)
- status: The status code, reduced to 1xx..5xx, or FAILED if an exception occurred
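The reduction of status codes described for the status tag can be pictured with a minimal Python sketch. The function name status_class, the use of None for a request that raised an exception, and the treatment of out-of-range codes are assumptions made for illustration; they are not taken from the Docker proxy itself.

```python
from typing import Optional


def status_class(status_code: Optional[int]) -> str:
    """Reduce an HTTP status code to its class label, e.g. 404 -> "4xx".

    Assumption: None stands for a request that ended in an exception,
    and codes outside 100..599 are also labeled FAILED.
    """
    if status_code is None or not 100 <= status_code <= 599:
        return "FAILED"
    return f"{status_code // 100}xx"
```

Grouping by the reduced label keeps the number of distinct tag values small, which is usually what a monitoring backend wants from a tag.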
com.engflow.dockerproxy/http_requests_total¶
- no unit
- Number of HTTP requests started on the HTTP client
Tags
- client: The name of the HTTP client (can be used to distinguish layers)
- method: The HTTP method of the request
BEP Event Storage and Replay¶
com.engflow.eventstore/bep_event_ack_latency¶
- milliseconds
- This is a distribution. Tracks how much time passed between receiving a build event and sending an acknowledgement to the client.
Tags
status
com.engflow.eventstore/bes_upload_delay¶
- milliseconds
- This is a distribution. Tracks how much longer an invocation's BES upload took compared to the invocation's duration as reported by the BES.
Tags
status
com.engflow.eventstore/build_event_owners¶
- no unit
- The total number of build owners residing in memory. A build owner is an internal piece of state associated with a particular build reporting build events.
com.engflow.eventstore/flushing_batches_size¶
- no unit
- The total size of complete build event batches that are currently being uploaded to storage. Normally, batches are flushed quickly, so this value should stay near zero; if it doesn't, that could mean we are falling behind with batch uploads. Every instance reports its own stats; sum them up to get a cluster-wide metric.
com.engflow.eventstore/grpc_eventstore_ttfb¶
- milliseconds
- This is a distribution. Tracks how much time passed between requesting EventStore data via gRPC, and receiving the first byte.
Tags
type
com.engflow.eventstore/inbound_bep_events¶
- no unit
- Incremented whenever an event is received on an inbound stream.
Tags
type
- Details
An inbound stream means a client (e.g. Bazel) sending BES events to a scheduler.
Every instance reports its own stats; sum them up to get a cluster-wide metric.
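Several EventStore metrics are reported per instance and need to be summed to obtain a cluster-wide value. The following Python sketch shows that aggregation on already-exported samples. The sample layout (instance, metric, value), the instance names, and the values are hypothetical; in practice the sum would normally be expressed in the query language of the monitoring backend.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Assumption: one sample is (instance name, metric name, value).
Sample = Tuple[str, str, float]


def cluster_totals(samples: Iterable[Sample]) -> Dict[str, float]:
    """Sum the per-instance values of each metric into one cluster-wide value."""
    totals: Dict[str, float] = defaultdict(float)
    for _instance, metric, value in samples:
        totals[metric] += value
    return dict(totals)


# Hypothetical samples taken at the same point in time.
samples = [
    ("scheduler-1", "com.engflow.eventstore/inbound_bep_events", 120.0),
    ("scheduler-2", "com.engflow.eventstore/inbound_bep_events", 80.0),
    ("scheduler-1", "com.engflow.eventstore/ongoing_streams", 3.0),
]
totals = cluster_totals(samples)
```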
com.engflow.eventstore/incomplete_batches_size¶
- no unit
- The estimated size in bytes it would take to serialize all incomplete build event batches. These batches aren't yet written to storage. Actual JVM heap footprint is likely larger. Every instance reports its own stats; sum them up to get a cluster-wide metric.
com.engflow.eventstore/new_outbound_streams¶
- no unit
- Incremented whenever a new outbound BEP stream is created.
- Details
An outbound stream means a scheduler replaying a BES stream in order to reduce it to a result stream. It makes no difference if the scheduler is replaying to itself or to another instance.
Every instance reports its own stats; sum them up to get a cluster-wide metric.
com.engflow.eventstore/ongoing_streams¶
- no unit
- The total number of streams that are inbound, outbound, or both.
- Details
An inbound stream means a client (e.g. Bazel) sending BES events to a scheduler. An outbound stream means a scheduler replaying a BES stream in order to reduce it to a result stream. It makes no difference if the scheduler is replaying to itself or to another instance.
Every instance reports its own stats; sum them up to get a cluster-wide metric.
com.engflow.eventstore/outbound_bep_events¶
- no unit
- Incremented whenever an event is sent on an outbound stream.
- Details
An outbound stream means a scheduler replaying a BES stream in order to reduce it to a result stream. It makes no difference if the scheduler is replaying to itself or to another instance.
Every instance reports its own stats; sum them up to get a cluster-wide metric.
Virtual Machine Instances¶
com.engflow.instance.new/gc_avg_duration¶
- milliseconds
- The average duration spent in garbage collection since the last reported metric.
Tags
- gc_type: GC Old generation / GC Young Generation
- instance_role: type of the instance (scheduler/worker/etc.)
com.engflow.instance.new/gc_count¶
- no unit
- The total number of garbage collections during the lifecycle of this process.
Tags
- gc_type: GC Old generation / GC Young Generation
- instance_role: type of the instance (scheduler/worker/etc.)
com.engflow.instance.new/gc_time¶
- milliseconds
- The total wall time in milliseconds spent blocked in garbage collection since the start of the process. This measures time when the application is not running due to a collector pause.
Tags
- gc_type: G1 Old generation / G1 Young Generation
- instance_role: type of the instance (scheduler/worker/etc.)
com.engflow.instance.new/open_file_descriptors¶
- no unit
- The number of file descriptors the process has currently open.
Tags
instance_role
: type of the instance (scheduler/worker/etc.)
com.engflow.instance.new/total_disk_space¶
- bytes
- The size of the volume.
Tags
- instance_role: type of the instance (scheduler/worker/etc.)
- volume: the name of the disk volume
com.engflow.instance.new/total_system_memory¶
- bytes
- The total amount of system memory in bytes.
Tags
instance_role
: type of the instance (scheduler/worker/etc.)
com.engflow.instance.new/used_disk_percentage¶
- percentage
- The percentage of the volume that is currently used.
Tags
- instance_role: type of the instance (scheduler/worker/etc.)
- volume: the name of the disk volume
com.engflow.instance.new/used_disk_space¶
- bytes
- The total number of bytes used on the volume.
Tags
- instance_role: type of the instance (scheduler/worker/etc.)
- volume: the name of the disk volume
com.engflow.instance.new/used_process_native_buffer_memory¶
- bytes
- The total amount of native buffer memory for this process in bytes.
Tags
instance_role
: type of the instance (scheduler/worker/etc.)
com.engflow.instance.new/used_system_memory¶
- bytes
- The amount of used system memory in bytes.
Tags
instance_role
: type of the instance (scheduler/worker/etc.)
com.engflow.instance.new/used_system_memory_percentage¶
- percentage
- The percentage of system memory used.
Tags
instance_role
: type of the instance (scheduler/worker/etc.)
com.engflow.instance/gc_avg_duration¶
- milliseconds
- The average duration spent in garbage collection since the last reported metric.
Tags
gc_type
com.engflow.instance/gc_count¶
- no unit
- The total number of garbage collections during the lifecycle of this process.
Tags
gc_type
com.engflow.instance/gc_time¶
- milliseconds
- The total estimated time in milliseconds performing garbage collection.
Tags
gc_type
com.engflow.instance/total_disk_space¶
- bytes
- The size of the volume.
Tags
volume
com.engflow.instance/total_system_memory¶
- bytes
- The total amount of system memory in bytes.
com.engflow.instance/used_disk_percentage¶
- percentage
- The percentage of the volume that is currently used.
Tags
volume
com.engflow.instance/used_disk_space¶
- bytes
- The total number of bytes used on the volume.
Tags
volume
com.engflow.instance/used_system_memory¶
- bytes
- The amount of used system memory in bytes.
com.engflow.instance/used_system_memory_percentage¶
- percentage
- The percentage of system memory used.
Netty monitoring¶
com.engflow.thirdparty.netty/used_direct_memory¶
- bytes
- Direct (non-heap) memory use
Tags
buffer_name
com.engflow.thirdparty.netty/used_heap_memory¶
- bytes
- Heap memory use
Tags
buffer_name
Worker Control metrics¶
com.engflow.re.management.workercontrol/approx_mft_induced_idle_executor_duration¶
- milliseconds
- Approximately how much time all executors of all workers marked-for-termination were idle. Each scheduler reports the approximate duration of executor idleness as induced by the scheduler marking workers for termination. Only the master scheduler should report non-zero values. Sum up the idle durations reported by all schedulers to get the overall idleness.
Tags
pool
: name of the pool ("default" for the default pool)
com.engflow.re.management.workercontrol/mft_on_scheduler¶
- milliseconds
- This is a distribution. Tracks how long the marked-for-termination call lasted on the scheduler.
Tags
- pool: name of the pool ("default" for the default pool)
- result: the result of the marked-for-termination call
com.engflow.re.management.workercontrol/mft_on_worker¶
- milliseconds
- This is a distribution. Tracks how long the marked-for-termination call lasted on the worker.
Tags
- pool: name of the pool ("default" for the default pool)
- result: the result of the marked-for-termination call
Action scheduling¶
com.engflow.re.scheduler/autoscaler_cluster_size_controller_op¶
- milliseconds
- Per pool, distribution of how long each cluster size controller operation took and what its completion status was.
Tags
- op: the operation performed, "setClusterSize" or "reduceClusterSizeByInstance"
- pool: name of pool ("default" for the default pool)
- status: "succeeded" or "failed"
com.engflow.re.scheduler/autoscaler_set_size_operations¶
- no unit
- Per pool, number of attempts to set its autoscaling group's desired size
Tags
- pool: name of pool ("default" for the default pool)
- status: "succeeded" or "failed"
- Details
Deprecated. Use com.engflow.re.scheduler/autoscaler_cluster_size_controller_op instead.
com.engflow.re.scheduler/available_workers¶
- no unit
- Deprecated; number of idle executors, per pool
Tags
pool
: name of the pool ("default" for the default pool)
- Details
Deprecated. Indicates the number of idle executors, per pool, according to this scheduler. Only schedulers report this metric. Every scheduler reports the same (or about the same) value. This metric is deprecated because it may be imprecise: schedulers that are started while workers are busy may report a higher value than they should for several minutes. We recommend monitoring com.engflow.re.scheduler/existing_executors instead.
com.engflow.re.scheduler/coalesced_executions¶
- no unit
- Number of action requests coalesced into an existing execution
- Details
Number of action requests coalesced into an existing execution.
com.engflow.re.scheduler/dequeued_actions¶
- no unit
- Per pool, the number of actions that were removed from the queue, either because they started executing on a worker or because they were ejected from the queue after getting too old.
Tags
- pool: name of the pool ("default" for the default pool)
- reason
com.engflow.re.scheduler/desired_executors¶
- no unit
- Number of desired executors, per pool
Tags
pool
: name of the pool ("default" for the default pool)
- Details
Indicates an estimate for the number of required executors per pool. Every scheduler reports its own estimate - they should be summed up to get the total desired pool size.
com.engflow.re.scheduler/estimated_action_time¶
- milliseconds
- Estimated action time
Tags
pool
: name of the pool ("default" for the default pool)
- Details
Indicates an estimate for duration of an action. Every scheduler reports its own estimate.
com.engflow.re.scheduler/estimated_induced_load¶
- no unit
- Estimated induced wait time
Tags
pool
: name of the pool ("default" for the default pool)
- Details
Indicates an estimate for future incoming work in ms per ms. Every scheduler reports its own estimate.
com.engflow.re.scheduler/existing_executors¶
- no unit
- Number of existing executors, per pool
Tags
pool
: name of the pool ("default" for the default pool)
- Details
Number of existing executors, per pool, according to this scheduler. Only schedulers report this metric. Every scheduler reports the same (or about the same) value.
com.engflow.re.scheduler/existing_schedulers¶
- no unit
- Number of existing schedulers
- Details
Only schedulers report this metric. Every scheduler reports a constant "1". This can be used to detect schedulers that are unable to send monitoring metrics.
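Because every healthy scheduler reports a constant 1, a scheduler that cannot send metrics shows up as a missing sample. A minimal Python sketch of that check follows. The function name, the scheduler names, and the idea of comparing against a known list of expected schedulers are assumptions for illustration.

```python
from typing import Dict, Iterable, List


def silent_schedulers(expected: Iterable[str], latest_samples: Dict[str, float]) -> List[str]:
    """Return the expected schedulers that did not report existing_schedulers == 1.

    Assumption: latest_samples maps a scheduler name to its most recent
    com.engflow.re.scheduler/existing_schedulers value within the alert window.
    """
    return sorted(name for name in expected if latest_samples.get(name) != 1)
```

For example, with expected schedulers a, b and c and samples only from a and c, the function returns ["b"].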
com.engflow.re.scheduler/global_queue_size¶
- no unit
- Number of waiting actions, per pool
Tags
pool
: name of the pool ("default" for the default pool)
- Details
Indicates the number of actions waiting for execution, per pool, in the cluster. Only schedulers report this metric. The schedulers coordinate to calculate this sum.
com.engflow.re.scheduler/global_used_executors¶
- no unit
- Number of used executors, per pool
Tags
pool
: name of the pool ("default" for the default pool)
- Details
Indicates the number of executors that are in use, per pool, in the cluster. Only schedulers report this metric.
com.engflow.re.scheduler/mia_count¶
- no unit
- Per pool, the number of actions that failed because the executing worker went missing-in-action
Tags
pool
: name of the pool ("default" for the default pool)
com.engflow.re.scheduler/pool_utilization¶
- percentage
- Current executor utilization, per pool
Tags
pool
: name of the pool ("default" for the default pool)
- Details
Reports current executor utilization (used*100/total) per pool, as a percentage ([0..100]). Only schedulers report this metric. Every scheduler reports the same (or about the same) value.
To help make scale-up decisions when a pool is empty, utilization is reported as 100 if there are actions waiting and 0 if not. A pool may be empty if it was scaled down, or if it never existed (the client may request any pool name).
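The utilization rule, including the special case for an empty pool, can be written as a short Python sketch. The function and parameter names are hypothetical; only the formula used*100/total and the 100/0 rule for empty pools come from the description above.

```python
def pool_utilization(used: int, total: int, waiting_actions: int) -> float:
    """Executor utilization of one pool as a percentage in [0, 100]."""
    if total == 0:
        # Empty pool: report 100 if actions are waiting, 0 otherwise,
        # so that an autoscaler still sees a signal to scale up.
        return 100.0 if waiting_actions > 0 else 0.0
    return used * 100 / total
```

With 3 of 12 executors busy the result is 25.0. An empty pool with 5 waiting actions yields 100.0.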
com.engflow.re.scheduler/queue_age¶
- milliseconds
- Min/max age of queued actions, per pool
Tags
- pool: name of the pool ("default" for the default pool)
- statistic: "min" (youngest) or "max" (oldest) action in the pool's queue
- Details
Reports the minimum and maximum age of queued actions in each executor pool, i.e. how long entries have been waiting. Only schedulers report this metric. Every scheduler reports the values for its own queue. Changes in these values indicate a change in the cluster's throughput.
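As an illustration of the two statistic values, the Python sketch below derives the "min" and "max" ages from the enqueue timestamps of one pool's queue. The function name, the timestamp representation, and the empty result for an empty queue are assumptions; what the scheduler reports for an empty queue is not stated here.

```python
from typing import Dict, Sequence


def queue_age_ms(enqueue_times_ms: Sequence[int], now_ms: int) -> Dict[str, int]:
    """Return the age of the youngest ("min") and oldest ("max") queued action."""
    if not enqueue_times_ms:
        # Assumption: nothing is reported for an empty queue.
        return {}
    ages = [now_ms - enqueued for enqueued in enqueue_times_ms]
    return {"min": min(ages), "max": max(ages)}
```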
com.engflow.re.scheduler/queue_size¶
- no unit
- Number of waiting actions, per pool
Tags
pool
: name of the pool ("default" for the default pool)
- Details
Indicates the number of actions waiting for execution, per pool, on this scheduler. Only schedulers report this metric. Every scheduler reports its own queue lengths.
Observability UI metrics¶
com.engflow.observability.ui/app_load¶
- milliseconds
- The duration of the initial application load.
Tags
page
com.engflow.observability.ui/caught_error¶
- no unit
- An error in the web client was manually caught and reported.
Tags
- error_description
- page
com.engflow.observability.ui/navigation¶
- no unit
- A single user navigation to a new page.
Tags
page
com.engflow.observability.ui/page_load¶
- milliseconds
- Measures the duration elapsed loading the current page.
Tags
- load_type
- page
com.engflow.observability.ui/page_load_with_data_requests¶
- milliseconds
- Measures the duration elapsed loading the current page, plus any page-specific data requests sent after it was initially loaded. This is currently only selectively enabled for some pages.
Tags
- page
- request_status
com.engflow.observability.ui/rendering_error¶
- no unit
- An error in the web client was caught by the rendering pipeline and reported.
Tags
- page
- section
com.engflow.observability.ui/uncaught_error¶
- no unit
- An uncaught error was thrown in the web client.
Tags
- error_name
- page
BES Replay Metrics¶
com.engflow.bes.replay/cpu_time¶
- milliseconds
- The amount of CPU time spent replaying and reducing build event streams, measured in milliseconds.
Tags
- replay_type
- status
com.engflow.bes.replay/cpu_time_for_event¶
- milliseconds
- The amount of CPU time spent replaying and reducing a single event within a build event stream, measured in milliseconds.
Tags
- event_type
- replay_type
- status
ResultStore metrics¶
com.engflow.resultstore/reduce_bes_completed_duration_since_finish_event¶
- milliseconds
- This is a distribution. Tracks how much time passed between receiving an invocation's finish event and completing the BES reduction.
Tags
- replay_type
- status: Whether the BES reduction finished successfully.
com.engflow.resultstore/reduce_bes_replay_removed_from_cache_count¶
- no unit
- The number of BES replays that were removed from the replay cache.
Tags
- replay_type
- status: The status of the replay when it was removed from the cache.
com.engflow.resultstore/reduce_bes_replay_source_count¶
- no unit
- The number of BES replays requested, tagged by where the data was fetched from, and the type of replay.
Tags
- replay_type
- source: Indicates where the data was fetched from.
Storage implementation metrics¶
com.engflow.storage.gc/gc_window_seconds¶
- s
- The storage service's GC window.
Tags
name
: name of the storage service
com.engflow.storage.read/size¶
- bytes
- Number of file bytes sent to the client for a read request. May be smaller than the file size in case of error or partial read. Only recorded if the file was found.
Tags
- name: name of the storage service
- status: op result
com.engflow.storage.read/time_per_gb¶
- milliseconds
- Time taken per 1 billion bytes (1 GB) to download a file from storage.
Tags
- name: name of the storage service
- status: op result
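The time_per_gb value normalizes a transfer time to 1 billion bytes so that files of different sizes can be compared. The following Python sketch shows one way such a value could be derived from a single transfer. The function name and the rejection of empty files are assumptions; how the storage service handles zero-byte files is not stated here.

```python
def time_per_gb_ms(elapsed_ms: float, size_bytes: int) -> float:
    """Scale the duration of one transfer to 1 GB (1 billion bytes)."""
    if size_bytes <= 0:
        # Assumption: a rate is undefined for empty transfers.
        raise ValueError("size_bytes must be positive")
    return elapsed_ms * 1_000_000_000 / size_bytes
```

A 500 MB file downloaded in 250 ms gives 500.0 ms per GB.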
com.engflow.storage.read/time_to_first_byte¶
- milliseconds
- Time taken between initiating a download and receiving the first byte.
Tags
- name: name of the storage service
- status: op result
com.engflow.storage.read/time_to_next_chunk¶
- milliseconds
- Time taken between being notified that the client is ready and sending the next response. May be recorded 0 or multiple times for the same call, depending on control flow events.
Tags
- name: name of the storage service
- status: op result
com.engflow.storage.write/size¶
- bytes
- Number of file bytes received from the client for a write request. May be smaller than the file size in case of error.
Tags
- name: name of the storage service
- status: op result
com.engflow.storage.write/time_per_gb¶
- milliseconds
- Time taken per 1 billion bytes (1 GB) to upload a file to storage.
Tags
- name: name of the storage service
- status: op result
com.engflow.storage.write/time_to_commit¶
- milliseconds
- Time between being notified the write is complete and committing the write.
Tags
- name: name of the storage service
- status: op result
NotificationQueue metrics¶
com.engflow.notificationqueue/dequeue_latency¶
- milliseconds
- This is a distribution. Refers to the time passed between notification creation and dequeuing it.
Tags
- expired: Whether the notification was expired and discarded.
- name: The name of the queue.
com.engflow.notificationqueue/head_age¶
- milliseconds
- The age of the first notification in the queue, i.e. the time passed since the notification was created. Notably, it does NOT reflect how much time has passed since the notification was last published. For example, if a notification is not acknowledged and published anew, the head age may be disproportionately high compared to the age of the next notifications in the queue. This can lead to acceptable, occasional spikes.
Tags
name
: The name of the queue.
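The difference between the head age and the time since the last publish can be shown with a small Python sketch. The Notification class and its field names are hypothetical and only model the two timestamps discussed above.

```python
from dataclasses import dataclass


@dataclass
class Notification:
    created_ms: int         # when the notification was first created
    last_published_ms: int  # when it was most recently (re-)published


def head_age_ms(head: Notification, now_ms: int) -> int:
    """Age of the head notification, measured from creation, not from the last publish."""
    return now_ms - head.created_ms
```

A notification created at 0 ms and published anew at 9000 ms has a head age of 10000 ms at time 10000 ms, although it was last published only 1000 ms earlier. This is the kind of spike the description calls acceptable.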
com.engflow.notificationqueue/publish¶
- milliseconds
- This is a distribution. Refers to the time needed to publish a notification.
Tags
- name: The name of the queue.
- status: The status of the operation.
com.engflow.notificationqueue/size¶
- no unit
- The approximate size of the queue.
Tags
name
: The name of the queue.
Action execution¶
com.engflow.re.exec/completed_actions¶
- no unit
- Number of actions that ran to completion, grouped by exit code
Tags
exit_code
: the action's exit code
- Details
This metric reflects the rate of change. Each measurement indicates how many actions completed on this worker, in all pools combined, since the last time this metric was reported.
Only workers report this metric. All workers report their own values. We recommend grouping by exit_code=0 and exit_code!=0, and summing up the time series in the groups. This yields the rate of successful and unsuccessful action completion across the cluster.
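The recommended grouping can be sketched in Python as follows. The sample layout (worker, exit code, count since the last report) and the function name are assumptions for illustration; a monitoring backend would normally express the same grouping in its own query language.

```python
from typing import Dict, Iterable, Tuple

# Assumption: one sample is (worker name, exit_code tag, actions completed since last report).
Sample = Tuple[str, int, float]


def completion_by_outcome(samples: Iterable[Sample]) -> Dict[str, float]:
    """Sum completed actions across all workers, split into exit_code=0 and exit_code!=0."""
    totals = {"exit_code=0": 0.0, "exit_code!=0": 0.0}
    for _worker, exit_code, count in samples:
        key = "exit_code=0" if exit_code == 0 else "exit_code!=0"
        totals[key] += count
    return totals
```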
com.engflow.re.exec/completed_actions_per_pool¶
- no unit
- Number of executed actions (not cached), grouped by pool and status
Tags
- pool: name of the pool (_default_ for the default pool)
- status: the action's status (ExecutionStatus: SUCCESS, NON_ZERO_EXIT, CLIENT_ERROR, ERROR)
- Details
This metric reflects the rate of change. Each measurement indicates how many actions completed on this worker, per pool, since the last time this metric was reported.
Only workers report this metric. All workers report their own values. We recommend grouping by status, and summing up the time series in the groups. This yields the rate of successful and unsuccessful action completion across the pool/cluster.
com.engflow.re.exec/execution_latency¶
- milliseconds
- Bucketed latency (ms), grouped by pool and execution stage
Tags
- pool: name of the pool (_default_ for the default pool)
- stage: the action's stage (ExecutionStage: QUEUED, DOWNLOAD_INPUTS, EXECUTE_ACTION, UPLOAD_OUTPUTS, EXECUTOR_TOTAL)
com.engflow.re.exec/executors_existing¶
- no unit
- Total number of executors on this worker, in all pools combined
- Details
Only workers report this metric. All workers report their own values. You should sum up the time series to get the total number of executors in the cluster.
com.engflow.re.exec/executors_existing_per_pool¶
- no unit
- Total number of executors on this worker, per pool
Tags
pool
: name of the pool (_default_ for the default pool)
- Details
Only workers report this metric. All workers report their own values. You should sum up the time series to get the total number of executors in the pool/cluster.
com.engflow.re.exec/max_rss_kib¶
- KiBy
- Reported MaxRSS (maximal resident set size) of successfully executed actions (not cached), grouped by pool
Tags
pool
: name of the pool (_default_ for the default pool)
- Details
This metric is a distribution. Each measurement indicates approximately how much memory (in KiB) a successful action (ExecutionStatus: SUCCESS) on this worker reportedly used.
Only workers report this metric. All workers report their own values.
com.engflow.re.exec/started_actions_per_pool¶
- no unit
- Number of started actions, grouped by pool
Tags
pool
: name of the pool (_default_ for the default pool)
- Details
This metric reflects the rate of change. Each measurement indicates how many actions started on this worker, per pool, since the last time this metric was reported.
Only workers report this metric. All workers report their own values.
com.engflow.re.exec/used_executors¶
- no unit
- Number of busy executors, in all pools
- Details
Only workers report this metric. All workers report their own values. You should sum up the time series to get the total number of busy executors in the cluster.
com.engflow.re.exec/used_executors_per_pool¶
- no unit
- Number of busy executors, per pool
Tags
pool
: name of the pool (_default_ for the default pool)
- Details
Only workers report this metric. All workers report their own values. You should sum up the time series to get the total number of busy executors in the pool/cluster.
Hazelcast monitoring¶
com.engflow.re.hazelcast.map/entries¶
- no unit
- The number of entries in the Hazelcast map
Tags
- cluster_name: name of the Hazelcast cluster
- map_name: name of the Hazelcast map
com.engflow.re.hazelcast.map/memory_used¶
- bytes
- The amount of memory used for the map
Tags
- cluster_name: name of the Hazelcast cluster
- map_name: name of the Hazelcast map
com.engflow.re.hazelcast/is_master¶
- no unit
- Whether a machine is a cluster master; if this sums up to more than one (with the same name), then the cluster is unhealthy.
Tags
name
: name of the Hazelcast cluster.
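The health rule for is_master (at most one master per cluster name) can be checked with a short Python sketch. The sample layout (machine, cluster name, is_master value) and the function name are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Assumption: one sample is (machine, Hazelcast cluster name, is_master as 0 or 1).
Sample = Tuple[str, str, int]


def clusters_with_multiple_masters(samples: Iterable[Sample]) -> List[str]:
    """Return the cluster names whose is_master values sum to more than one."""
    masters: Counter = Counter()
    for _machine, cluster, is_master in samples:
        masters[cluster] += is_master
    return sorted(name for name, count in masters.items() if count > 1)
```

Any cluster name returned by this function would be considered unhealthy under the rule above.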
com.engflow.re.hazelcast/member_count¶
- no unit
- The number of members in the cluster; only the master reports this value
Tags
name
: name of the Hazelcast cluster.
com.engflow.re.hazelcast/op_time¶
- milliseconds
- Distribution of operation time
Tags
- name: name of the distributed hash map
- status: op result
com.engflow.thirdparty.hazelcast/partition_migration_finished¶
- no unit
- The number of finished Hazelcast partition migrations.
Tags
name
: name of the Hazelcast cluster
com.engflow.thirdparty.hazelcast/partition_migration_started¶
- no unit
- The number of started Hazelcast partition migrations.
Tags
name
: name of the Hazelcast cluster
com.engflow.thirdparty.hazelcast/partition_migration_time¶
- milliseconds
- Time of Hazelcast partition migrations, per Hazelcast cluster.
Tags
name
: name of the Hazelcast cluster
- Details
Reports the time of Hazelcast partition migrations.
com.engflow.thirdparty.hazelcast/replica_migration¶
- no unit
- The number of Hazelcast replica migrations.
Tags
- name: name of the Hazelcast cluster
- status: status of the operation (OK or FAILURE)
Uncaught exceptions¶
com.engflow.re/uncaught_exceptions¶
- no unit
- Fires every time there is an uncaught exception
CAS server metrics¶
com.engflow.re.cas/missing_digests¶
- no unit
- The total number of missing digests seen by findMissingBlobs.
com.engflow.re.cas/requested_digests¶
- no unit
- The total number of digests requested by a findMissingBlobs call
Remote Execution metrics¶
com.engflow.remoteexecution/queue_time¶
- milliseconds
- This is a distribution. Refers to the time actions are queued.
Tags
pool
: The name of the pool.
- Details
This is a distribution. Refers to the time actions are queued.
Invocation index monitoring¶
com.engflow.resultstore.index/sql_invocation_index_database_queue_size¶
- no unit
- All enqueued or in-progress invocation index database operations
- Details
Reflects the number of incomplete operations (either queued or being worked on).
Every instance reports this metric. Every instance reports its own stats.
CAS usage¶
com.engflow.re.cas/check_blob_exists¶
- milliseconds
- Distribution of time needed to check whether a blob exists
Tags
status
: operation result, e.g. OK, NOT_FOUND
- Details
This is a distribution. Refers to the time needed to check whether a blob exists.
com.engflow.re.cas/fetch_call_time¶
- milliseconds
- Distribution of CAS fetch operation time
Tags
- source: name of the CAS location, e.g. EXTERNAL_STORAGE, DISTRIBUTED_CAS_NEAR
- status: op result, e.g. OK, UNAVAILABLE
- Details
The time distribution of individual CAS download calls; each call is measured independently, including when falling back between different sources.
com.engflow.re.cas/fetch_retries¶
- no unit
- Count of retries needed when fetching a CAS blob.
- Details
Count of retries needed when fetching a CAS blob. Incremented after each failure, so 0 indicates a fetch without errors.
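The counting rule (increment after each failure, so 0 means a fetch without errors) can be illustrated with a generic retry loop in Python. The function name, the use of IOError as the retryable failure, and the max_attempts limit are assumptions; the actual retry policy of the CAS client is not described here.

```python
from typing import Callable, Tuple


def fetch_with_retry_count(fetch: Callable[[], bytes], max_attempts: int = 3) -> Tuple[bytes, int]:
    """Call fetch until it succeeds and return (blob, number of retries needed)."""
    retries = 0
    while True:
        try:
            return fetch(), retries
        except IOError:
            retries += 1  # incremented after each failure
            if retries >= max_attempts:
                raise  # give up after max_attempts failed attempts
```

A fetch that succeeds on the first call yields a retry count of 0. One that fails twice before succeeding yields 2.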
com.engflow.re.cas/find_replicas_time¶
- milliseconds
- Distribution of time needed to find which instances have copies of a file
- Details
Distribution of time needed to find which instances have copies of a file.
com.engflow.re.cas/load_shed_errors¶
- no unit
- Count of RESOURCE_EXHAUSTED errors returned by workers for CAS requests due to load shedding.
Tags
method
: method returning the error, e.g. read
- Details
Count of RESOURCE_EXHAUSTED errors returned by workers for CAS requests due to load shedding.
com.engflow.re.cas/remote_check_blob_exists¶
- milliseconds
- Distribution of time needed to check whether a blob exists
Tags
- source: address used to check on the CAS blob.
- status: operation result, e.g. OK, NOT_FOUND
- Details
This is a distribution. Refers to the time needed to check whether a blob exists.
com.engflow.re.cas/requests_in_flight_incoming¶
- no unit
- Number of currently open incoming cache requests, by method and pool.
Tags
- method: read or write
- pool: name of the pool serving the request
- Details
Number of currently open incoming cache requests, by method and pool.
com.engflow.re.cas/requests_in_flight_outgoing¶
- no unit
- Number of currently open outgoing cache requests, by method and pool.
Tags
- method: read or write
- pool: name of the pool originating the request
- Details
Number of currently open outgoing cache requests, by method and pool. Includes both distributed CAS (ByteStream) and external storage.
com.engflow.re.cas/time_to_next_message¶
- milliseconds
- Estimated number of milliseconds to the next gRPC message.
Tags
pool
: name of the pool serving the request
- Details
Estimated number of milliseconds to the next gRPC message.
Local CAS usage¶
com.engflow.re.cas/available_replica_space¶
- bytes
- Available storage space in the CAS that can be used for replicas
- Details
Only workers report this metric. All workers report their own values.
com.engflow.re.cas/available_space¶
- bytes
- Available storage space in the CAS
- Details
Only workers report this metric. All workers report their own values.
com.engflow.re.cas/free_time¶
- milliseconds
- Distribution of time needed to free space in the CAS
- Details
This is a distribution. It refers to the deletion of expired replicas.
Only workers report this metric. All workers report their own values.
com.engflow.re.cas/gc_time¶
- milliseconds
- Distribution of time needed for the GC
- Details
This is a distribution. It refers to the collection of expired replicas.
Only workers report this metric. All workers report their own values.
com.engflow.re.cas/lost_files_count¶
- no unit
- The number of files that were lost from the CAS
- Details
The number of files that were deleted by some other process, or that the CAS instance detected no longer matched the expected digest.
Only workers report this metric. All workers report their own values.
com.engflow.re.cas/max_total_replica_size¶
- bytes
- The max total replica size
- Details
This is the maximum amount of storage space the CAS is allowed to use for replicas.
Only workers report this metric. All workers report their own values.
com.engflow.re.cas/max_total_size¶
- bytes
- The max total CAS size on the node
- Details
This is the maximum amount of storage space the CAS is allowed to use.
Only workers report this metric. All workers report their own values.
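As a rough illustration of how these gauges might be combined on a dashboard, the sketch below derives a per-worker utilization fraction from max_total_size and available_space. It assumes that used space equals max_total_size minus available_space, which this reference does not state explicitly; the function name and the sample values are hypothetical.

```python
def cas_utilization(max_total_size: int, available_space: int) -> float:
    """Return the fraction of the CAS size limit in use, from 0.0 to 1.0.

    Assumption: used = max_total_size - available_space.
    """
    if max_total_size <= 0:
        raise ValueError("max_total_size must be positive")
    used = max_total_size - available_space
    # Clamp the result in case the two gauges were sampled at different times.
    return min(1.0, max(0.0, used / max_total_size))


# Example: a worker allowed 100 GiB with 25 GiB still available.
GIB = 1024 ** 3
print(cas_utilization(100 * GIB, 25 * GIB))
```

Because every worker reports its own values, the fraction is best computed per worker before any aggregation.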
Client authorization¶
com.engflow.re.auth.async/duration¶
- milliseconds
- Authentication call duration
- Details
This is a distribution. Only schedulers report this metric. Every scheduler reports its own stats.
SecretStore metrics¶
com.engflow.secretstore/operation_duration_seconds¶
- s
- Time taken to perform an operation on the secret store
Tags
-
operation
: The secret store operation reported ('read', etc.) -
status
: The gRPC status code string, in SCREAMING_SNAKE_CASE -
store_type
: The implementation of the instrumented secret store
Licensing metrics¶
com.engflow.licensing/license_server_fetch_result¶
- no unit
- The result of attempted license renewals using the MyEngFlow License Server.
Tags
status
Docker use¶
com.engflow.re.exec.docker/cas_fetch_seconds¶
- s
- The time spent fetching Docker images from the CAS. This includes fetching only, not loading
Tags
status
: result of the operation, e.g. "OK", "FAILED"
com.engflow.re.exec.docker/cas_pull_bytes¶
- bytes
- The size of Docker images pulled from the CAS
Tags
status
: result of the operation, e.g. "OK", "FAILED"
com.engflow.re.exec.docker/cas_pull_seconds¶
- s
- The time spent pulling Docker images from the CAS. This includes both fetching and loading
Tags
status
: result of the operation, e.g. "OK", "FAILED"
com.engflow.re.exec.docker/cas_save_seconds¶
- s
- The time spent saving Docker images for upload to the CAS. This includes saving only, not uploading
Tags
status
: result of the operation, e.g. "OK", "FAILED"
com.engflow.re.exec.docker/cas_upload_bytes¶
- bytes
- The size of Docker images uploaded to the CAS
Tags
status
: result of the operation, e.g. "OK", "FAILED"
com.engflow.re.exec.docker/cas_upload_seconds¶
- s
- The time spent uploading Docker images to the CAS. This includes both saving and uploading
Tags
status
: result of the operation, e.g. "OK", "FAILED"
com.engflow.re.exec.docker/container_lifecycle_events_count¶
- no unit
- Count of various lifecycle events relating to containers
Tags
-
container_lifecycle_event_name
-
pool
com.engflow.re.exec.docker/container_shutdown_time¶
- milliseconds
- The time needed to shut down a docker container
com.engflow.re.exec.docker/container_startup_time¶
- milliseconds
- The time needed to start a docker container
Tags
status
: result of the operation, e.g. "OK", "FAILED"
com.engflow.re.exec.docker/containers_failed¶
- no unit
- The number of docker containers that failed
com.engflow.re.exec.docker/docker_proxy_failure_count¶
- no unit
- The number of times pulls through the docker proxy failed
com.engflow.re.exec.docker/image_pull_time¶
- milliseconds
- The time needed to pull a docker image
Tags
status
: result of the operation, e.g. "OK", "FAILED"
com.engflow.re.exec.docker/network_create_time¶
- milliseconds
- The time needed to create a docker network
Tags
status
: result of the operation, e.g. "OK", "FAILED"
com.engflow.re.exec.docker/network_destroy_time¶
- milliseconds
- The time needed to destroy a docker network
Tags
status
: result of the operation, e.g. "OK", "FAILED"
com.engflow.re.exec.docker/sibling_container_enabled_total¶
- no unit
- Counts times that sibling container access was requested via platform properties
Tags
pool
Persistent worker use¶
com.engflow.re.exec.worker/actions¶
- no unit
- The number of persistent worker actions run
Tags
-
pool
: name of the pool -
reuse_status
: new or reused
- Details
The number of persistent worker actions run, aggregated by whether or not they reused a previous persistent worker process.
Scheduler metrics¶
com.engflow.profiling/publish_invocation_event¶
- milliseconds
- Timing for publishing invocation events.
com.engflow.re.cas/entries_evicted¶
- no unit
- The number of CAS entries that were evicted due to memory size limitations
com.engflow.re.cas/entries_lost¶
- no unit
- The number of CAS entries that could not be recovered on CAS node shutdown events
com.engflow.re.profiler/events¶
- no unit
- The number of server-side profile events recorded.
com.engflow.re.profiler/live_handles¶
- no unit
- The number of profiles being streamed to the eventstore.
com.engflow.re/remaining_certificate_validity_days¶
- days
- The number of remaining days before certificates expire
Tags
-
issuer
: The issuer (issuer distinguished name) value from the certificate -
serial_number
: The serial number assigned to the certificate by the issuer
- Details
Reports the number of remaining validity days for each X509 certificate processed by schedulers. The issuer and serial number uniquely identify certificates.
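A minimal sketch of how such a value could be derived and used for alerting is shown below. The helper names, the 30-day threshold, and the rounding rule (partial days rounded down) are assumptions for illustration; the reference does not say how the scheduler rounds.

```python
from datetime import datetime, timezone


def remaining_validity_days(not_after: datetime, now: datetime) -> int:
    """Whole days between now and the certificate's expiry time.

    Assumption: partial days are rounded down.
    """
    return (not_after - now).days


def expiring_soon(days_by_cert: dict, threshold_days: int = 30) -> list:
    """Return the (issuer, serial_number) keys below the threshold, sorted."""
    return sorted(key for key, days in days_by_cert.items() if days < threshold_days)


# Hypothetical samples keyed by the two tags that identify a certificate.
samples = {("CN=Example CA", "01"): 12, ("CN=Example CA", "02"): 200}
print(expiring_soon(samples))
```

Keying on both issuer and serial_number mirrors the tag pair that uniquely identifies each certificate.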
com.engflow.re/remaining_license_time¶
- days
- The number of remaining days before the license expires
Java memory metrics¶
com.engflow.re/java_heap¶
- bytes
- The amount of heap memory used
- Details
Every instance reports this metric. Every instance reports its own stats.
Meta metrics¶
com.engflow.meta/engflow_version¶
- no unit
- A heartbeat metric that reports the EngFlow build label if present and "missing_version" otherwise.
Tags
version
com.engflow.meta/parallel_engflow_version¶
- no unit
- A heartbeat metric that reports how many different EngFlow versions are currently registered with the cluster, as indicated by the build label. All instances that do not report a build label are treated as running the same version, distinct from all other reported versions. Each scheduler reports its own metric.
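The counting rule described for parallel_engflow_version can be sketched as below. This is an illustration under stated assumptions, not the scheduler's code: the function name is hypothetical, and treating both `None` and the empty string as "no build label" is a choice made for the example.

```python
def count_parallel_versions(build_labels) -> int:
    """Count distinct versions among the registered instances.

    Instances without a build label all fall into one shared bucket that is
    distinct from every reported label.
    """
    missing = object()  # sentinel that cannot collide with a real label
    return len({label if label else missing for label in build_labels})


# Two labelled versions plus two unlabelled instances count as three versions.
print(count_parallel_versions(["1.0", "1.0", None, "", "2.0"]))
```

A value above 1 that persists after a rollout finishes may indicate instances that were not upgraded.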
Docker daemon¶
com.engflow.docker.container/existing¶
- no unit
- The number of existing containers.
Tags
-
daemon
: The docker daemon this metric is for (e.g., host). -
pool
: The pool this metric is for (e.g., default or macos). -
state
: The state of the container (e.g., running, exited, ...).
com.engflow.docker.container/size¶
- bytes
- The size distribution of existing docker containers.
Tags
-
daemon
: The docker daemon this metric is for (e.g., host). -
filesystem
: The filesystem the value applies to (overlay if the value is the size of files modified by the container, and root if the value is the total size of the container including the image). -
pool
: The pool this metric is for (e.g., default or macos). -
scope
: The scope of the value (container if the value is the size of a single container, and daemon if the value is the sum of all containers known to the docker daemon).
com.engflow.docker.image/size¶
- bytes
- The size distribution of existing docker images.
Tags
-
daemon
: The docker daemon this metric is for (e.g., host). -
pool
: The pool this metric is for (e.g., default or macos). -
scope
: The scope of the value (image if the value is the size of a single image, and daemon if the value is the sum of all images known to the docker daemon).
DB Connection Pool usage¶
com.engflow.resultstore.index/db_cp_active_connections¶
- no unit
- The number of active connections in the pool
Tags
db_connection_pool_name
com.engflow.resultstore.index/db_cp_connection_acquire_time¶
- us
- The time it takes for the connection pool to acquire a DB connection
Tags
db_connection_pool_name
com.engflow.resultstore.index/db_cp_connection_create_time¶
- milliseconds
- The time it takes for the connection pool to create a new DB connection
Tags
db_connection_pool_name
com.engflow.resultstore.index/db_cp_connection_timeout_count¶
- no unit
- The count of timed-out connections
Tags
db_connection_pool_name
com.engflow.resultstore.index/db_cp_connection_usage_time¶
- milliseconds
- The duration of a use of a connection given by the connection pool
Tags
db_connection_pool_name
com.engflow.resultstore.index/db_cp_idle_connections¶
- no unit
- The number of idle connections in the pool
Tags
db_connection_pool_name
com.engflow.resultstore.index/db_cp_max_connections¶
- no unit
- Maximum number of connections existing in the pool
Tags
db_connection_pool_name
com.engflow.resultstore.index/db_cp_min_connections¶
- no unit
- Minimum number of connections existing in the pool
Tags
db_connection_pool_name
com.engflow.resultstore.index/db_cp_pending_connections¶
- no unit
- The number of pending connections in the pool
Tags
db_connection_pool_name
com.engflow.resultstore.index/db_cp_total_connections¶
- no unit
- The number of all currently existing connections in the pool
Tags
db_connection_pool_name
Caffeine cache metrics¶
com.engflow.caching.inmemory/evicts¶
- no unit
- Number of cache evictions tagged with the reason for eviction.
Tags
-
name
: The name of the cache -
reason
: The reason for eviction
com.engflow.caching.inmemory/hits¶
- no unit
- Number of cache hits.
Tags
name
: The name of the cache
com.engflow.caching.inmemory/loads¶
- milliseconds
- Timing and status information for cache loads.
Tags
-
name
: The name of the cache -
status
: The status of the load
com.engflow.caching.inmemory/misses¶
- no unit
- Number of cache misses.
Tags
name
: The name of the cache
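The hits and misses counts are commonly combined into a hit rate per cache name. The sketch below shows that arithmetic; the function name and the sample numbers are hypothetical, and returning 0.0 for a cache with no lookups is a convention chosen for the example.

```python
def hit_rate(hits: int, misses: int) -> float:
    """Return hits / (hits + misses), or 0.0 when there were no lookups."""
    lookups = hits + misses
    return hits / lookups if lookups else 0.0


# Hypothetical per-cache samples, keyed by the `name` tag.
samples = {"actions": (900, 100), "digests": (0, 0)}
for name, (hits, misses) in samples.items():
    print(name, hit_rate(hits, misses))
```

A falling hit rate alongside rising evicts may suggest that the cache is undersized for its workload.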
DB Query stats¶
com.engflow.resultstore.index/duration¶
- milliseconds
- The duration of a query
Tags
-
query_name
-
query_outcome
com.engflow.resultstore.index/preparation¶
- milliseconds
- The duration of creating a preparedQuery
Tags
query_name
Outgoing CI API Calls¶
com.engflow.ci.http/api_call_latency¶
- milliseconds
- Time taken for an outgoing API call, indexed by hostname and status
Tags
-
hostname
: hostname of the API service that was called -
status
: HTTP status of the call (1xx, 2xx, 3xx, 4xx, 5xx, or UNKNOWN)
- Details
Indicates the number of API calls made on this scheduler, per service and status. The status tag contains only the HTTP status class (e.g., 2xx), not the full status code. Only schedulers report this metric. Every scheduler reports its own API calls.
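The status tag values listed above (1xx through 5xx, or UNKNOWN) correspond to HTTP status classes. A minimal sketch of such a mapping follows; the function name is hypothetical, and mapping missing or out-of-range codes to UNKNOWN is an assumption about when that value is used.

```python
from typing import Optional


def status_class(code: Optional[int]) -> str:
    """Map an HTTP status code to its class label, e.g. 204 -> "2xx"."""
    if code is None or not 100 <= code <= 599:
        return "UNKNOWN"  # assumed fallback, e.g. when no response arrived
    return f"{code // 100}xx"


for code in (204, 404, 503, None):
    print(code, status_class(code))
```

Grouping by class keeps the tag cardinality small, at the cost of not distinguishing, say, 404 from 429.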
CI runner¶
com.engflow.ci.basic/gh_error_propagation_jobs¶
- no unit
- Number of finished GitHub error propagation jobs
Tags
status
: the status of the error propagation job
- Details
Number of finished GitHub error propagation jobs.
com.engflow.ci.basic/gh_runner_no_jobs¶
- no unit
- Number of finished GitHub runs that timed out and did not pick up any jobs.
- Details
Number of finished GitHub runs that timed out and did not pick up any jobs.
com.engflow.ci.basic/gh_runner_wrong_job¶
- no unit
- Number of finished GitHub runs that picked up the wrong job.
- Details
Number of finished GitHub runs that picked up the wrong job.
com.engflow.ci.basic/jobs_completed¶
- no unit
- Number of jobs completed
Tags
status
: job completion status
- Details
The number of completed CI jobs, by status. This includes jobs for which we failed before starting the remote runner.
com.engflow.ci.basic/jobs_queue_age¶
- milliseconds
- Age of queued jobs
Tags
-
ci_family
: name of the CI of the job -
statistic
: min/max queue age
- Details
The age of queued jobs on GitHub/BuildKite. This metric is only reported if polling is enabled, and only for jobs that are configured for EngFlow CI runners. The value is how long ago the job was created, relative to the current poll time.
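The age computation described above can be sketched as follows. The function name and the input shape (a list of job creation timestamps) are assumptions for illustration; only the definition of age as poll time minus creation time, and the min/max statistic, come from the description.

```python
from datetime import datetime, timedelta, timezone


def queue_age_stats(created_times, poll_time):
    """Return {"min": ms, "max": ms} for the queued jobs, or {} if none."""
    one_ms = timedelta(milliseconds=1)
    # Age is measured relative to the current poll time, in whole milliseconds.
    ages = [(poll_time - created) // one_ms for created in created_times]
    if not ages:
        return {}
    return {"min": min(ages), "max": max(ages)}


poll = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
jobs = [poll - timedelta(seconds=30), poll - timedelta(minutes=5)]
print(queue_age_stats(jobs, poll))
```

The max value is usually the more useful alerting signal, since it tracks the job that has waited longest.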
com.engflow.ci.basic/jobs_queued¶
- no unit
- Number of queued jobs
Tags
ci_family
: name of the CI of the job
- Details
The number of queued jobs on GitHub/BuildKite. This metric is only reported if polling is enabled.
com.engflow.ci.basic/poll_duration_millis¶
- milliseconds
- Duration of polls against the remote CI system
Tags
-
ci_family
: name of the CI system to poll -
status
: the status of the polling job
- Details
Measures the duration of polls against the remote CI system
com.engflow.ci.full/builtin_step_duration¶
- milliseconds
- The duration of individual built-in steps in a CI job
Tags
-
ci_family
: name of the CI of the job -
git_repository
: name of the repository of the job -
runner_architecture
: architecture of the requested runner for a given job -
runner_os
: OS of the requested runner for a given job -
step
: job step name
- Details
Measures the execution time of each built-in (non-user-defined) step of a job.
com.engflow.ci.full/git_command_duration¶
- milliseconds
- The duration of individual git commands during a CI job
Tags
-
ci_family
: name of the CI of the job -
git_command
: git command: checkout/fetch/index-pack -
git_repository
: name of the repository of the job -
job
: job name -
pipeline
: name of the pipeline that triggered the job -
runner_architecture
: architecture of the requested runner for a given job -
runner_os
: OS of the requested runner for a given job
- Details
Measures the duration of various git subcommands during a job's execution
com.engflow.ci.full/job_duration¶
- milliseconds
- The duration of CI jobs
Tags
-
ci_family
: name of the CI of the job -
git_repository
: name of the repository of the job -
job
: job name -
pipeline
: name of the pipeline that triggered the job -
runner_architecture
: architecture of the requested runner for a given job -
runner_os
: OS of the requested runner for a given job
- Details
Measures the duration of a job's execution.
com.engflow.ci.full/jobs_completed¶
- no unit
- Number of jobs completed
Tags
-
ci_family
: name of the CI of the job -
git_repository
: name of the repository of the job -
job
: job name -
pipeline
: name of the pipeline that triggered the job -
runner_architecture
: architecture of the requested runner for a given job -
runner_os
: OS of the requested runner for a given job -
status
: job completion status
- Details
The number of completed CI jobs, by status. This includes jobs for which we failed before starting the remote runner.
com.engflow.ci.full/jobs_started¶
- no unit
- Number of jobs started
Tags
-
ci_family
: name of the CI of the job -
git_repository
: name of the repository of the job -
job
: job name -
pipeline
: name of the pipeline that triggered the job -
runner_architecture
: architecture of the requested runner for a given job -
runner_os
: OS of the requested runner for a given job
- Details
The number of jobs started on this scheduler. Every scheduler reports its own jobs.
com.engflow.ci.full/step_duration¶
- milliseconds
- The duration of individual user-defined steps in a CI job
Tags
-
ci_family
: name of the CI of the job -
git_repository
: name of the repository of the job -
job
: job name -
pipeline
: name of the pipeline that triggered the job -
runner_architecture
: architecture of the requested runner for a given job -
runner_os
: OS of the requested runner for a given job -
step
: job step name
- Details
Measures the execution time of each user-defined step of a job.
com.engflow.ci.full/time_to_start_job¶
- milliseconds
- How long it takes for CI runners to start a job
Tags
-
ci_family
: name of the CI of the job -
git_repository
: name of the repository of the job -
job
: job name -
pipeline
: name of the pipeline that triggered the job -
runner_architecture
: architecture of the requested runner for a given job -
runner_os
: OS of the requested runner for a given job
- Details
Measures the time between a given CI requesting a job's execution and an action actually starting it on CI runners.
gRPC factory metrics¶
com.engflow.grpc.factory/channels¶
- no unit
- The number of open gRPC channels in the channel factory
CI runner - GitHub Action metrics¶
com.engflow.ci.github.metrics/api_primary_rate_limit_max¶
- no unit
- Maximum number of requests available in the current API rate limiter slot
Tags
API_RESOURCE_NAME
com.engflow.ci.github.metrics/api_primary_rate_limit_used¶
- no unit
- Number of requests used in the current API rate limiter slot
Tags
API_RESOURCE_NAME
RPC metrics¶
com.engflow.rpc/duration¶
- s
- Time distribution of RPCs
Tags
-
method
: The RPC method -
protocol
: The protocol of the RPC (e.g., rest) -
status
: The status of the operation (e.g., OK) -
type
: The type of the RPC (e.g., client or server)