Metrics Reference

Description of all metrics available for monitoring

Blob-storage implementation metrics

com.engflow.blobstore/ops (no unit)

Fires every time an operation takes place.

Tags
  • operation

BEP Event Storage and Replay

com.engflow.eventstore/build_event_owners (no unit)

The total number of build owners residing in memory. A build owner is an internal piece of state associated with a particular build that reports build events.

com.engflow.eventstore/inbound_bep_events (no unit)

Fired whenever an event is received on an inbound stream.

Tags
  • type
com.engflow.eventstore/invocation_attempts (no unit)

Fired whenever a new BEP "invocation attempt started" event is received.

com.engflow.eventstore/new_outbound_streams (no unit)

Fired whenever a new outbound BEP stream is read.

com.engflow.eventstore/ongoing_streams (no unit)

The total number of streams that are inbound, outbound, or both.

com.engflow.eventstore/outbound_bep_events (no unit)

Fired whenever an event is sent on an outbound stream.

Virtual Machine Instances

com.engflow.instance/total_disk_space (bytes)

The size of the volume.

Tags
  • volume
com.engflow.instance/used_disk_percentage (percentage)

The percentage of the volume that is currently used.

Tags
  • volume
com.engflow.instance/used_disk_space (bytes)

The total number of bytes used on the volume.

Tags
  • volume

Netty monitoring

com.engflow.thirdparty.netty/used_direct_memory (bytes)

Direct (non-heap) memory use.

Tags
  • buffer_name
com.engflow.thirdparty.netty/used_heap_memory (bytes)

Heap memory use.

Tags
  • buffer_name
io.netty.buffer/used_direct_memory (bytes)

Direct (non-heap) memory use.

Tags
  • buffer_name
io.netty.buffer/used_heap_memory (bytes)

Heap memory use.

Tags
  • buffer_name

Action scheduling

com.engflow.re.scheduler/availability_map_size (no unit)

Number of busy executors in all pools.

Details

Only schedulers report this metric. All schedulers report their own values. You should sum up the time series to get the total number of busy executors in the cluster. The result should be equal to the sum of com.engflow.re.exec/used_executors.
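
The summation described above can be sketched as follows. This is a minimal illustration, not part of the product: the instance names, timestamps, and sample values are hypothetical, and the input layout (a mapping from instance to a list of (timestamp, value) points) is an assumption about how your monitoring backend exports the series.

```python
from collections import defaultdict

def sum_series(series_by_instance):
    """Sum several time series point-wise, keyed by timestamp."""
    totals = defaultdict(int)
    for points in series_by_instance.values():
        for timestamp, value in points:
            totals[timestamp] += value
    return dict(sorted(totals.items()))

# Hypothetical samples of com.engflow.re.scheduler/availability_map_size.
scheduler_busy = {
    "scheduler-1": [(0, 3), (60, 5)],
    "scheduler-2": [(0, 2), (60, 1)],
}
# Hypothetical samples of com.engflow.re.exec/used_executors.
worker_used = {
    "worker-1": [(0, 4), (60, 2)],
    "worker-2": [(0, 1), (60, 4)],
}

cluster_busy = sum_series(scheduler_busy)
cluster_used = sum_series(worker_used)
# In a healthy cluster the two totals are expected to agree.
print(cluster_busy == cluster_used)
```

A persistent difference between the two totals suggests that schedulers and workers disagree about which executors are busy.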

com.engflow.re.scheduler/available_workers (no unit)

Deprecated; number of idle executors, per pool.

Tags
  • name: name of the pool ("_default_" for the default pool)
Details

Deprecated. Indicates the number of idle executors, per pool, according to this scheduler. Only schedulers report this metric. Every scheduler reports the same (or about the same) value. This metric is deprecated, because it may be imprecise: schedulers that are started while workers are busy may report a higher value than they should for several minutes. We recommend monitoring com.engflow.re.scheduler/existing_executors instead.

com.engflow.re.scheduler/existing_executors (no unit)

Number of existing executors, per pool.

Tags
  • name: name of the pool ("_default_" for the default pool)
Details

Number of existing executors, per pool, according to this scheduler. Only schedulers report this metric. Every scheduler reports the same (or about the same) value.

com.engflow.re.scheduler/existing_schedulers (no unit)

Number of existing schedulers.

Details

Only schedulers report this metric. Every scheduler reports a constant "1". This can be used to detect schedulers that are unable to send monitoring metrics.
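
One way to use the constant "1" for detection is to compare the schedulers you expect against those that reported a sample. The sketch below assumes you already know the expected scheduler names; the names and the input layout are hypothetical.

```python
def missing_schedulers(expected, reported):
    """Return the expected scheduler names that did not report a "1" sample."""
    alive = {name for name, value in reported.items() if value == 1}
    return sorted(set(expected) - alive)

# Hypothetical inventory and latest existing_schedulers samples.
expected = ["scheduler-1", "scheduler-2", "scheduler-3"]
reported = {"scheduler-1": 1, "scheduler-3": 1}

silent = missing_schedulers(expected, reported)
print(silent)
```

Equivalently, summing the time series and alerting when the sum drops below the expected scheduler count detects the same condition without naming the silent instance.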

com.engflow.re.scheduler/owner_map_size (no unit)

Number of entries in the owner map.

Details

Deprecated. This is rarely useful and measures an internal data structure that is subject to change. Schedulers use the owner map to keep track of actions. The size of this map indicates how many actions are being executed.

com.engflow.re.scheduler/pool_utilization (percentage)

Current executor utilization, per pool.

Tags
  • name: name of the pool ("_default_" for the default pool)
Details

Reports current executor utilization (used*100/total) per pool, as a percentage ([0..100]). Only schedulers report this metric. Every scheduler reports the same (or about the same) value.

To help make scale-up decisions when a pool is empty, utilization is reported as 100 if there are actions waiting and 0 if not. A pool may be empty if it was scaled down, or if it never existed (the client may request any pool name).
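
The reporting rule above, including the empty-pool case, can be written as a small function. This is a sketch of the documented formula only; the function and parameter names are illustrative and do not correspond to any EngFlow API.

```python
def pool_utilization(used, total, waiting_actions):
    """Return pool utilization as a percentage in the range [0, 100].

    An empty pool (total == 0) reports 100 when actions are waiting
    and 0 otherwise, so that autoscalers can still react to demand.
    """
    if total == 0:
        return 100.0 if waiting_actions > 0 else 0.0
    return used * 100.0 / total

print(pool_utilization(3, 4, 0))  # 3 of 4 executors busy
print(pool_utilization(0, 0, 5))  # empty pool with queued actions
print(pool_utilization(0, 0, 0))  # empty pool, nothing waiting
```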

com.engflow.re.scheduler/queue_age (milliseconds)

Min/max age of queued actions, per pool.

Tags
  • name: name of the pool ("_default_" for the default pool)
  • statistic: "min" (youngest) or "max" (oldest) action in the pool's queue
Details

Reports the minimum and maximum age of queued actions in each executor pool, i.e. how long entries have been waiting. Only schedulers report this metric. Every scheduler reports the ages for its own queues. Changes in these values indicate a change in the cluster's throughput.

com.engflow.re.scheduler/queue_size (no unit)

Number of waiting actions, per pool.

Tags
  • name: name of the pool ("_default_" for the default pool)
Details

Indicates the number of actions waiting for execution, per pool, on this scheduler. Only schedulers report this metric. Every scheduler reports its own queue lengths.
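
Because every scheduler reports only its own queues, a cluster-wide view needs the per-scheduler values added up per pool. The following sketch assumes samples arrive as (scheduler, pool name, queue size) tuples; the scheduler names, the "gpu" pool, and the values are hypothetical.

```python
from collections import defaultdict

def queue_size_per_pool(samples):
    """Add up queue sizes from all schedulers, grouped by the "name" tag."""
    totals = defaultdict(int)
    for _scheduler, pool, size in samples:
        totals[pool] += size
    return dict(totals)

# Hypothetical queue_size samples taken at the same point in time.
samples = [
    ("scheduler-1", "_default_", 12),
    ("scheduler-2", "_default_", 8),
    ("scheduler-1", "gpu", 3),
]

waiting = queue_size_per_pool(samples)
print(waiting)
```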

Action execution

com.engflow.re.exec/completed_actions (no unit)

Number of actions that ran to completion, grouped by exit code.

Tags
  • exit_code: the action's exit code
Details

This metric reflects the rate of change. Each measurement indicates how many actions completed on this worker, in all pools combined, since the last time this metric was reported.

Only workers report this metric. All workers report their own values. We recommend grouping by exit_code=0 and exit_code!=0, and summing up the time series in the groups. This yields the rate of successful and unsuccessful action completion across the cluster.
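
The recommended grouping can be sketched as below. The sample layout (worker, exit code, count) and the worker names are assumptions made for illustration; only the exit_code tag comes from this reference.

```python
def completion_totals(samples):
    """Split completed-action counts into exit_code == 0 and exit_code != 0."""
    totals = {"successful": 0, "unsuccessful": 0}
    for _worker, exit_code, count in samples:
        key = "successful" if exit_code == 0 else "unsuccessful"
        totals[key] += count
    return totals

# Hypothetical completed_actions samples from one reporting interval.
samples = [
    ("worker-1", 0, 40),
    ("worker-1", 1, 2),
    ("worker-2", 0, 35),
    ("worker-2", 137, 1),
]

totals = completion_totals(samples)
print(totals)
```

Since each measurement covers only the time since the previous report, the totals describe one reporting interval rather than a running count.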

com.engflow.re.exec/executors_existing (no unit)

Total number of executors on this worker, in all pools combined.

Details

Only workers report this metric. All workers report their own values. You should sum up the time series to get the total number of executors in the cluster.

com.engflow.re.exec/running_actions (no unit)

Number of running actions.

Details

Deprecated. This metric measured something between executor usage and actual action runtime, but what it actually measured was not well-defined. Use com.engflow.re.exec/used_executors to measure executor usage.

com.engflow.re.exec/used_executors (no unit)

Number of busy executors, in all pools.

Details

Only workers report this metric. All workers report their own values. You should sum up the time series to get the total number of busy executors in the cluster.

Uncaught exceptions

com.engflow.re/uncaught_exceptions (no unit)

Fires every time there is an uncaught exception.

CAS server metrics

com.engflow.re.cas/missing_digests (no unit)

The total number of missing digests seen by findMissingBlobs.

com.engflow.re.cas/requested_digests (no unit)

The total number of digests requested by findMissingBlobs calls.

CAS usage

com.engflow.re.cas/available_replica_space (bytes)

Available storage space in the CAS that can be used for replicas.

Details

Only workers report this metric. All workers report their own values.

com.engflow.re.cas/available_space (bytes)

Available storage space in the CAS.

Details

Only workers report this metric. All workers report their own values.

com.engflow.re.cas/free_time (milliseconds)

Distribution of time needed to free space in the CAS.

Details

This is a distribution. It refers to the deletion of expired replicas.

Only workers report this metric. All workers report their own values.

CloudWatch note: as of 2021-02-04 this metric is not reported to CloudWatch.

com.engflow.re.cas/gc_time (milliseconds)

Distribution of time needed for the GC.

Details

This is a distribution. It refers to the collection of expired replicas.

Only workers report this metric. All workers report their own values.

CloudWatch note: as of 2021-02-04 this metric is not reported to CloudWatch.

com.engflow.re.cas/lost_files_count (no unit)

The number of files that were lost from the CAS.

Details

The number of files that were deleted by some other process, or that the CAS instance detected as no longer matching the expected digest.

Only workers report this metric. All workers report their own values.

com.engflow.re.cas/max_total_replica_size (bytes)

The max total replica size.

Details

This is the maximum amount of storage space the CAS is allowed to use for replicas.

Only workers report this metric. All workers report their own values.

com.engflow.re.cas/max_total_size (bytes)

The max total CAS size on the node.

Details

This is the maximum amount of storage space the CAS is allowed to use.

Only workers report this metric. All workers report their own values.

Client authorization

com.engflow.re.auth.async/call_count (no unit)

Number of calls made.

Details

Deprecated. Though it may seem so, this metric doesn't actually track client connection attempts accurately.

Use com.engflow.re.auth.async/duration aggregated by count instead.

com.engflow.re.auth.async/duration (milliseconds)

Authentication call duration.

Details

This is a distribution. Only schedulers report this metric. Every scheduler reports its own stats.

CloudWatch note: as of 2021-02-04 this metric is not reported to CloudWatch.

External storage use

com.engflow.re.storage/ops (no unit)

All completed external storage operations.

Tags
  • operation: the type of operation, e.g. "cas_check", "ac_upload"
  • result: the result of the operation, e.g. "successful", "interrupted"
Details

This metric reflects the rate of change. Each measurement indicates how many operations completed on this instance since the last time this metric was reported.

Every instance reports this metric. Every instance reports its own stats.
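
Since each measurement is a count for one reporting interval, dividing by the interval length gives an operations-per-second rate. The sketch below assumes a 60-second reporting interval, which is a hypothetical value; check your monitoring configuration for the real one. The result values are the examples listed under Tags.

```python
def ops_rate(samples, interval_seconds):
    """Convert per-interval operation counts into operations per second.

    samples: (result, count) pairs reported during a single interval.
    """
    rates = {}
    for result, count in samples:
        rates[result] = rates.get(result, 0.0) + count / interval_seconds
    return rates

# Hypothetical com.engflow.re.storage/ops samples from two instances.
samples = [("successful", 120), ("interrupted", 30), ("successful", 60)]

rates = ops_rate(samples, interval_seconds=60)
print(rates)
```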

com.engflow.re.storage/ops_queue_size (no unit)

All enqueued or in-progress external storage operations.

Tags
  • operation: the type of operation, e.g. "cas_check", "ac_upload"
Details

Reflects the number of incomplete operations (either queued or being worked on).

Every instance reports this metric. Every instance reports its own stats.

com.engflow.re.storage/traffic (bytes)

All external storage traffic.

Tags
  • operation
Details

This metric may be imprecise; the source of truth is the set of metrics published by the storage backend itself.

Every instance reports this metric. Every instance reports its own stats.

Amazon S3 use

com.engflow.re.storage.s3/download_bytes (bytes)

Total amount of data downloaded from S3.

Details

Deprecated, see --incompatible_no_storage_backend_metrics.

Each machine measures this separately. Summing up the measurements from all instances can help estimate S3 traffic costs.

com.engflow.re.storage.s3/download_cache_misses (no unit)

Number of cache misses in S3.

Details

Deprecated, see --incompatible_no_storage_backend_metrics.

This means download attempts that failed because the blob was missing from S3.

com.engflow.re.storage.s3/download_complete (no unit)

Number of completed downloads from S3.

Details

Deprecated, see --incompatible_no_storage_backend_metrics.

This means successful and complete downloads.

com.engflow.re.storage.s3/download_fail (no unit)

Number of failed downloads from S3.

Details

Deprecated, see --incompatible_no_storage_backend_metrics.

This means download attempts that started (i.e. blob was found) but failed.

com.engflow.re.storage.s3/upload_bytes (bytes)

Total amount of data uploaded to S3.

Details

Deprecated, see --incompatible_no_storage_backend_metrics.

Each machine measures this separately. Summing up the measurements from all instances can help estimate S3 traffic costs.

com.engflow.re.storage.s3/upload_cache_hits (no unit)

Number of cache hits in S3.

Details

Deprecated, see --incompatible_no_storage_backend_metrics.

This means no-op upload attempts, i.e. those that succeeded because the blob was already uploaded.

com.engflow.re.storage.s3/upload_complete (no unit)

Number of completed uploads to S3.

Details

Deprecated, see --incompatible_no_storage_backend_metrics.

Successful and complete uploads, i.e. the blob was not present before the upload, and was successfully uploaded.

com.engflow.re.storage.s3/upload_fail (no unit)

Number of failed uploads to S3.

Details

Deprecated, see --incompatible_no_storage_backend_metrics.

This means upload attempts that started (i.e. blob was new) but failed.

Google Cloud Storage (GCS) use

com.engflow.re.storage.gcs/download_bytes (bytes)

Total amount of data downloaded from GCS.

Details

Deprecated, see --incompatible_no_storage_backend_metrics.

Each machine measures this separately. Summing up the measurements from all instances can help estimate GCS traffic costs.

com.engflow.re.storage.gcs/download_cache_misses (no unit)

Number of cache misses in GCS.

Details

Deprecated, see --incompatible_no_storage_backend_metrics.

This means download attempts that failed because the blob was missing from GCS.

com.engflow.re.storage.gcs/download_complete (no unit)

Number of completed downloads from GCS.

Details

Deprecated, see --incompatible_no_storage_backend_metrics.

This means successful and complete downloads.

com.engflow.re.storage.gcs/download_fail (no unit)

Number of failed downloads from GCS.

Details

Deprecated, see --incompatible_no_storage_backend_metrics.

This means download attempts that started (i.e. blob was found) but failed.

com.engflow.re.storage.gcs/upload_bytes (bytes)

Total amount of data uploaded to GCS.

Details

Deprecated, see --incompatible_no_storage_backend_metrics.

Each machine measures this separately. Summing up the measurements from all instances can help estimate GCS traffic costs.

com.engflow.re.storage.gcs/upload_cache_hits (no unit)

Number of cache hits in GCS.

Details

Deprecated, see --incompatible_no_storage_backend_metrics.

This means no-op upload attempts, i.e. those that succeeded because the blob was already uploaded.

com.engflow.re.storage.gcs/upload_complete (no unit)

Number of completed uploads to GCS.

Details

Deprecated, see --incompatible_no_storage_backend_metrics.

Successful and complete uploads, i.e. the blob was not present before the upload, and was successfully uploaded.

com.engflow.re.storage.gcs/upload_fail (no unit)

Number of failed uploads to GCS.

Details

Deprecated, see --incompatible_no_storage_backend_metrics.

This means upload attempts that started (i.e. blob was new) but failed.

Docker use

com.engflow.re.exec.docker/container_creation_failed (no unit)

The number of docker containers that failed during creation.

Details

Deprecated. Instead use container_startup_time, filtered by status and aggregated by count.
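
Replacing the deprecated counter with a count over container_startup_time might look like the following. It assumes the distribution is available as individual (status, duration in milliseconds) samples, which depends on your monitoring backend; the sample values are hypothetical and the status values are the examples listed for that metric.

```python
from collections import Counter

def count_by_status(samples):
    """Count container_startup_time samples per value of the "status" tag."""
    return Counter(status for status, _duration_ms in samples)

# Hypothetical container_startup_time samples.
startup_samples = [("OK", 850.0), ("FAILED", 1200.0), ("OK", 910.0)]

counts = count_by_status(startup_samples)
failed = counts["FAILED"]       # stands in for container_creation_failed
started = sum(counts.values())  # all start attempts, regardless of status
print(failed, started)
```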

com.engflow.re.exec.docker/container_shutdown_time (milliseconds)

The time needed to shut down a docker container.

Details

CloudWatch note: as of 2021-02-04 this metric is not reported to CloudWatch.

com.engflow.re.exec.docker/container_startup_time (milliseconds)

The time needed to start a docker container.

Tags
  • status: result of the operation, e.g. "OK", "FAILED"
Details

CloudWatch note: as of 2021-02-04 this metric is not reported to CloudWatch.

com.engflow.re.exec.docker/containers_created (no unit)

The number of docker containers created.

Details

Deprecated. Instead use container_startup_time, aggregated by count.

com.engflow.re.exec.docker/containers_destroyed (no unit)

The number of docker containers destroyed.

Details

Deprecated. Instead use container_shutdown_time, aggregated by count.

com.engflow.re.exec.docker/containers_failed (no unit)

The number of docker containers that failed.

com.engflow.re.exec.docker/image_pull_time (milliseconds)

The time needed to pull a docker image.

Tags
  • status: result of the operation, e.g. "OK", "FAILED"
Details

CloudWatch note: as of 2021-02-04 this metric is not reported to CloudWatch.

com.engflow.re.exec.docker/network_create_time (milliseconds)

The time needed to create a docker network.

Tags
  • status: result of the operation, e.g. "OK", "FAILED"
Details

CloudWatch note: as of 2021-02-04 this metric is not reported to CloudWatch.

com.engflow.re.exec.docker/network_destroy_time (milliseconds)

The time needed to destroy a docker network.

Tags
  • status: result of the operation, e.g. "OK", "FAILED"
Details

CloudWatch note: as of 2021-02-04 this metric is not reported to CloudWatch.

com.engflow.re.exec.docker/networks_created (no unit)

The number of docker networks created.

Details

Deprecated. Instead use network_create_time, aggregated by count.

com.engflow.re.exec.docker/networks_destroyed (no unit)

The number of docker networks destroyed.

Details

Deprecated. Instead use network_destroy_time, aggregated by count.

Persistent worker use

com.engflow.re.exec.worker/actions (no unit)

The number of persistent worker actions run.

Tags
  • reuse_status: `new` or `reused`
Details

The number of persistent worker actions run, aggregated by whether or not they reused a previous persistent worker process.

Scheduler metrics

com.engflow.re.ac.distributed/entries (no unit)

The number of action cache entries on a specific scheduler instance.

com.engflow.re.ac.distributed/memory_used (bytes)

The amount of memory used for the action cache, in bytes.

com.engflow.re.cas/entries_evicted (no unit)

The number of CAS entries that were evicted due to memory size limitations.

com.engflow.re.cas/entries_lost (no unit)

The number of CAS entries that could not be recovered on CAS node shutdown events.

com.engflow.re.profiler/events (no unit)

The number of server-side profile events recorded.

com.engflow.re.profiler/live_handles (no unit)

The number of profiles being streamed to the eventstore.

com.engflow.re.scheduler/build_id (no unit)

The number of distinct build ids for which the service received at least one action.

com.engflow.re/remaining_license_time (days)

The number of remaining days before the license expires.

Java memory metrics

com.engflow.re/java_heap (bytes)

The amount of heap memory used.

Details

Every instance reports this metric. Every instance reports its own stats.

2021-09-21