Version 1.43 of the documentation is no longer actively maintained. The site that you are currently viewing is an archived snapshot. For up-to-date documentation, see the latest version.
Metrics Reference
Action scheduling
com.engflow.re.scheduler/availability_map_size
(no unit)-
Number of busy executors.
- Details
Indicates the total number of busy executors, across all pools. Only schedulers report this metric. Every scheduler reports the same (or about the same) value. Do not sum up values from multiple schedulers!
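Because every scheduler reports roughly the same value, a dashboard needs one representative number rather than a sum. A minimal sketch of one way to do that is shown below. The function name and the choice of the median are assumptions for illustration; the maximum, or any single scheduler's value, would serve equally well.

```python
from statistics import median


def busy_executors(latest_values):
    """Combine per-scheduler readings of availability_map_size.

    latest_values: list with the most recent value from each scheduler.
    The readings describe the same cluster-wide quantity, so they are
    reduced to one representative value instead of being added together.
    """
    if not latest_values:
        raise ValueError("no scheduler reported a value")
    return median(latest_values)


# Three schedulers reporting about the same number of busy executors.
print(busy_executors([40, 40, 41]))
```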
com.engflow.re.scheduler/available_workers
(no unit)-
Number of idle executors, per pool.
- Tags
name
: name of the pool ("_default_" for the default pool)
- Details
Indicates the number of idle executors, per pool, according to this scheduler. Only schedulers report this metric. Every scheduler reports the same (or about the same) value.
com.engflow.re.scheduler/existing_schedulers
(no unit)-
Number of existing schedulers.
- Details
Only schedulers report this metric. Every scheduler reports a constant "1". This can be used to detect schedulers that are unable to send monitoring metrics.
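As a rough illustration of the detection described above, the sketch below counts the schedulers that reported the constant "1" in one interval and compares that count with the number of schedulers you expect to be running. The function name, the `expected` parameter, and the scheduler identifiers are hypothetical and not part of any monitoring API.

```python
def silent_scheduler_count(latest_values, expected):
    """Return how many expected schedulers did not report in this interval.

    latest_values: dict mapping a (hypothetical) scheduler id to its most
        recent value of com.engflow.re.scheduler/existing_schedulers.
    expected: number of schedulers the deployment should be running.
    """
    reporting = sum(1 for value in latest_values.values() if value == 1)
    return max(expected - reporting, 0)


# Two of three expected schedulers sent the metric.
samples = {"scheduler-a": 1, "scheduler-b": 1}
print(silent_scheduler_count(samples, expected=3))
```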
com.engflow.re.scheduler/owner_map_size
(no unit)-
Number of entries.
- Details
Deprecated. This is rarely useful and measures an internal data structure that is subject to change. Schedulers use the owner map to keep track of actions. The size of this map indicates how many actions are being executed.
com.engflow.re.scheduler/pool_utilization
(percentage)-
Current executor utilization, per pool.
- Tags
name
: name of the pool ("_default_" for the default pool)
- Details
Reports current executor utilization (used*100/total) per pool, as a percentage ([0..100]). Only schedulers report this metric. Every scheduler reports the same (or about the same) value.
To help make scale-up decisions when a pool is empty, utilization is reported as 100 if there are actions waiting and 0 if not. A pool may be empty if it was scaled down, or if it never existed (the client may request any pool name).
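The rules above can be sketched as a small function. This only illustrates the formula and the empty-pool special case as described here; it is not the scheduler's actual implementation, and the function and parameter names are assumptions.

```python
def pool_utilization(used, total, waiting):
    """Utilization of one pool as a percentage in [0, 100].

    used: number of busy executors in the pool.
    total: number of executors in the pool (0 for an empty pool).
    waiting: number of actions queued for the pool.
    """
    if total == 0:
        # Empty pool: report 100 if actions are waiting, 0 otherwise.
        return 100.0 if waiting > 0 else 0.0
    return used * 100.0 / total


print(pool_utilization(used=3, total=4, waiting=0))
print(pool_utilization(used=0, total=0, waiting=5))
```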
com.engflow.re.scheduler/queue_age
(milliseconds)-
Min/max age of queued actions, per pool.
- Tags
name
: name of the pool ("_default_" for the default pool)
statistic
: "min" (youngest) or "max" (oldest) action in the pool's queue
- Details
Reports minimum and maximum age in each executor pool, i.e. how long entries have been waiting. Only schedulers report this metric. Every scheduler reports its own queue lengths. Changes in these values indicate a change in the cluster's throughput.
com.engflow.re.scheduler/queue_size
(no unit)-
Number of waiting actions, per pool.
- Tags
name
: name of the pool ("_default_" for the default pool)
- Details
Indicates the number of actions waiting for execution, per pool, on this scheduler. Only schedulers report this metric. Every scheduler reports its own queue lengths.
Action execution
com.engflow.re.exec/completed_actions
(no unit)-
Number of actions that ran to completion, grouped by exit code.
- Tags
exit_code
: the action's exit code
- Details
This metric reflects the rate of change. Each measurement indicates how many actions completed on this worker, in all pools combined, since the last time this metric was reported.
Only workers report this metric. All workers report their own values. We recommend grouping by
exit_code=0
and
exit_code!=0
, and summing up the time series in the groups. This yields the rate of successful and unsuccessful action completion across the cluster.
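A minimal sketch of the recommended grouping follows. It assumes the measurements for one reporting interval are available as `(worker, exit_code, count)` tuples; that layout, the worker names, and the function name are hypothetical.

```python
def completion_totals(samples):
    """Sum completed-action counts into successful and unsuccessful groups.

    samples: iterable of (worker, exit_code, count) tuples, one per time
        series, all taken from the same reporting interval.
    """
    totals = {"successful": 0, "unsuccessful": 0}
    for _worker, exit_code, count in samples:
        group = "successful" if exit_code == 0 else "unsuccessful"
        totals[group] += count
    return totals


interval = [
    ("worker-1", 0, 120),
    ("worker-1", 1, 3),
    ("worker-2", 0, 95),
    ("worker-2", 137, 2),
]
print(completion_totals(interval))
```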
com.engflow.re.exec/executors_existing
(no unit)-
Total number of executors on this worker, in all pools combined.
- Details
Only workers report this metric. All workers report their own values. You should sum up the time series to get the total number of executors in the cluster.
com.engflow.re.exec/running_actions
(no unit)-
Number of running actions.
- Details
Deprecated. This metric measured something between executor usage and actual action runtime, but what it actually measured was not well-defined. Use com.engflow.re.exec/used_executors to measure executor usage.
com.engflow.re.exec/used_executors
(no unit)-
Number of busy executors, in all pools.
- Details
Only workers report this metric. All workers report their own values. You should sum up the time series to get the total number of busy executors in the cluster.
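Summing the per-worker series gives cluster-wide totals for both executor metrics. The sketch below assumes the latest `used_executors` and `executors_existing` values are available per worker as a dictionary; that layout and the function name are assumptions for illustration.

```python
def cluster_executor_totals(per_worker):
    """Return (busy, existing) executor totals for the whole cluster.

    per_worker: dict mapping a (hypothetical) worker id to a tuple of
        (used_executors, executors_existing) from the same interval.
    """
    busy = sum(used for used, _existing in per_worker.values())
    existing = sum(existing for _used, existing in per_worker.values())
    return busy, existing


workers = {"worker-1": (3, 4), "worker-2": (1, 4)}
print(cluster_executor_totals(workers))
```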
CAS usage
com.engflow.re.cas/available_replica_space
(bytes)-
Available storage space in the CAS that can be used for replicas.
- Details
Only workers report this metric. All workers report their own values.
com.engflow.re.cas/available_space
(bytes)-
Available storage space in the CAS.
- Details
Only workers report this metric. All workers report their own values.
com.engflow.re.cas/free_time
(milliseconds)-
Distribution of time needed to free space in the CAS.
- Details
This is a distribution. It refers to the deletion of expired replicas.
Only workers report this metric. All workers report their own values.
CloudWatch note: as of 2021-02-04 this metric is not reported to CloudWatch.
com.engflow.re.cas/gc_time
(milliseconds)-
Distribution of time needed for the GC.
- Details
This is a distribution. It refers to the collection of expired replicas.
Only workers report this metric. All workers report their own values.
CloudWatch note: as of 2021-02-04 this metric is not reported to CloudWatch.
com.engflow.re.cas/lost_files_count
(no unit)-
The number of files that were lost from the CAS.
- Details
Counts files that were deleted by some other process, and files that the CAS instance detected as no longer matching their expected digest.
Only workers report this metric. All workers report their own values.
com.engflow.re.cas/max_total_replica_size
(bytes)-
The max total replica size.
- Details
This is the maximum amount of storage space the CAS is allowed to use for replicas.
Only workers report this metric. All workers report their own values.
com.engflow.re.cas/max_total_size
(bytes)-
The max total CAS size on the node.
- Details
This is the maximum amount of storage space the CAS is allowed to use.
Only workers report this metric. All workers report their own values.
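One way to turn these values into a fill level for a single worker is sketched below. It assumes that `available_space` is the unused part of `max_total_size`. This reference does not state that relationship explicitly, so treat the result as an approximation. The function name is hypothetical.

```python
def cas_fill_percent(available_space, max_total_size):
    """Approximate how full one worker's CAS is, as a percentage.

    Assumes available_space (bytes) is the unused part of max_total_size
    (bytes); this relationship is an assumption, not documented behavior.
    """
    if max_total_size <= 0:
        raise ValueError("max_total_size must be positive")
    used = max_total_size - available_space
    return used * 100.0 / max_total_size


print(cas_fill_percent(available_space=25, max_total_size=100))
```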
com.engflow.re.cas/total_replica_size
(bytes)-
The total replica size.
- Details
Deprecated, please use
com.engflow.re.cas/available_replica_space
instead. This is the total replica size.
com.engflow.re.cas/total_size
(bytes)-
Total CAS size.
- Details
Deprecated, please use
com.engflow.re.cas/available_space
instead. Combined size of all files stored in the CAS.
Client authorization
com.engflow.re.auth.async/call_count
(no unit)-
Number of calls made.
- Details
Deprecated. Despite its name, this metric does not accurately track client connection attempts.
Use
com.engflow.re.auth.async/duration
aggregated by count instead.
com.engflow.re.auth.async/duration
(milliseconds)-
Authentication call duration.
- Details
This is a distribution. Only schedulers report this metric. Every scheduler reports its own stats.
CloudWatch note: as of 2021-02-04 this metric is not reported to CloudWatch.
External storage use
com.engflow.re.storage/ops
(no unit)-
All completed external storage operations.
- Tags
operation
: the type of operation, e.g. "cas_check", "ac_upload"
result
: the result of the operation, e.g. "successful", "interrupted"
- Details
This metric reflects the rate of change. Each measurement indicates how many operations completed on this instance since the last time this metric was reported.
Every instance reports this metric. Every instance reports its own stats.
com.engflow.re.storage/ops_queue_size
(no unit)-
All enqueued or in-progress external storage operations.
- Tags
operation
: the type of operation, e.g. "cas_check", "ac_upload"
- Details
Reflects the number of incomplete operations (either queued or being worked on).
Every instance reports this metric. Every instance reports its own stats.
com.engflow.re.storage/traffic
(bytes)-
All external storage traffic.
- Tags
operation
- Details
This metric may be imprecise; the source of truth is the set of metrics published by the storage backend itself.
Every instance reports this metric. Every instance reports its own stats.
Amazon S3 use
com.engflow.re.storage.s3/download_bytes
(bytes)-
Total amount of data downloaded from S3.
- Details
Deprecated, see
--incompatible_no_storage_backend_metrics
. Each machine measures this separately. Summing up the measurements from all instances can help estimate S3 traffic costs.
com.engflow.re.storage.s3/download_cache_misses
(no unit)-
Number of cache misses in S3.
- Details
Deprecated, see
--incompatible_no_storage_backend_metrics
. This means download attempts that failed because the blob was missing from S3.
com.engflow.re.storage.s3/download_complete
(no unit)-
Number of completed downloads from S3.
- Details
Deprecated, see
--incompatible_no_storage_backend_metrics
. This means successful and complete downloads.
com.engflow.re.storage.s3/download_fail
(no unit)-
Number of failed downloads from S3.
- Details
Deprecated, see
--incompatible_no_storage_backend_metrics
. This means download attempts that started (i.e. blob was found) but failed.
com.engflow.re.storage.s3/upload_bytes
(bytes)-
Total amount of data uploaded to S3.
- Details
Deprecated, see
--incompatible_no_storage_backend_metrics
. Each machine measures this separately. Summing up the measurements from all instances can help estimate S3 traffic costs.
com.engflow.re.storage.s3/upload_cache_hits
(no unit)-
Number of cache hits in S3.
- Details
Deprecated, see
--incompatible_no_storage_backend_metrics
. This means no-op upload attempts, i.e. those that succeeded because the blob was already uploaded.
com.engflow.re.storage.s3/upload_complete
(no unit)-
Number of completed uploads to S3.
- Details
Deprecated, see
--incompatible_no_storage_backend_metrics
. Successful and complete uploads, i.e. the blob was not present before the upload, and was successfully uploaded.
com.engflow.re.storage.s3/upload_fail
(no unit)-
Number of failed uploads to S3.
- Details
Deprecated, see
--incompatible_no_storage_backend_metrics
. This means upload attempts that started (i.e. blob was new) but failed.
Google Cloud Storage (GCS) use
com.engflow.re.storage.gcs/download_bytes
(bytes)-
Total amount of data downloaded from GCS.
- Details
Deprecated, see
--incompatible_no_storage_backend_metrics
. Each machine measures this separately. Summing up the measurements from all instances can help estimate GCS traffic costs.
com.engflow.re.storage.gcs/download_cache_misses
(no unit)-
Number of cache misses in GCS.
- Details
Deprecated, see
--incompatible_no_storage_backend_metrics
. This means download attempts that failed because the blob was missing from GCS.
com.engflow.re.storage.gcs/download_complete
(no unit)-
Number of completed downloads from GCS.
- Details
Deprecated, see
--incompatible_no_storage_backend_metrics
. This means successful and complete downloads.
com.engflow.re.storage.gcs/download_fail
(no unit)-
Number of failed downloads from GCS.
- Details
Deprecated, see
--incompatible_no_storage_backend_metrics
. This means download attempts that started (i.e. blob was found) but failed.
com.engflow.re.storage.gcs/upload_bytes
(bytes)-
Total amount of data uploaded to GCS.
- Details
Deprecated, see
--incompatible_no_storage_backend_metrics
. Each machine measures this separately. Summing up the measurements from all instances can help estimate GCS traffic costs.
com.engflow.re.storage.gcs/upload_cache_hits
(no unit)-
Number of cache hits in GCS.
- Details
Deprecated, see
--incompatible_no_storage_backend_metrics
. This means no-op upload attempts, i.e. those that succeeded because the blob was already uploaded.
com.engflow.re.storage.gcs/upload_complete
(no unit)-
Number of completed uploads to GCS.
- Details
Deprecated, see
--incompatible_no_storage_backend_metrics
. Successful and complete uploads, i.e. the blob was not present before the upload, and was successfully uploaded.
com.engflow.re.storage.gcs/upload_fail
(no unit)-
Number of failed uploads to GCS.
- Details
Deprecated, see
--incompatible_no_storage_backend_metrics
. This means upload attempts that started (i.e. blob was new) but failed.
Docker use
com.engflow.re.exec.docker/container_creation_failed
(no unit)-
The number of docker containers that failed during creation.
- Details
Deprecated. Instead use container_startup_time, filtered by status and aggregated by count.
com.engflow.re.exec.docker/container_shutdown_time
(milliseconds)-
The time needed to shut down a docker container.
- Details
CloudWatch note: as of 2021-02-04 this metric is not reported to CloudWatch.
com.engflow.re.exec.docker/container_startup_time
(milliseconds)-
The time needed to start a docker container.
- Tags
status
: result of the operation, e.g. "OK", "FAILED"
- Details
CloudWatch note: as of 2021-02-04 this metric is not reported to CloudWatch.
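Several deprecated counters in this section are replaced by a timing metric "aggregated by count", optionally filtered by status. The sketch below shows that idea on raw samples. It assumes each sample is a `(status, duration_ms)` pair; the layout and the function name are hypothetical, and a real monitoring backend would normally apply its own count aggregation to the distribution.

```python
from collections import Counter


def startup_counts_by_status(samples):
    """Count container startup samples per status tag.

    samples: iterable of (status, duration_ms) pairs taken from
        com.engflow.re.exec.docker/container_startup_time.
    """
    return Counter(status for status, _duration_ms in samples)


observed = [("OK", 850), ("OK", 910), ("FAILED", 3000)]
counts = startup_counts_by_status(observed)
print(counts["OK"], counts["FAILED"])
```

Here `sum(counts.values())` stands in for the deprecated containers_created counter, and `counts["FAILED"]` for container_creation_failed.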
com.engflow.re.exec.docker/containers_created
(no unit)-
The number of docker containers created.
- Details
Deprecated. Instead use container_startup_time, aggregated by count.
com.engflow.re.exec.docker/containers_destroyed
(no unit)-
The number of docker containers destroyed.
- Details
Deprecated. Instead use container_shutdown_time, aggregated by count.
com.engflow.re.exec.docker/containers_failed
(no unit)-
The number of docker containers that failed.
com.engflow.re.exec.docker/image_pull_time
(milliseconds)-
The time needed to pull a docker image.
- Tags
status
: result of the operation, e.g. "OK", "FAILED"
- Details
CloudWatch note: as of 2021-02-04 this metric is not reported to CloudWatch.
com.engflow.re.exec.docker/network_create_time
(milliseconds)-
The time needed to create a docker network.
- Tags
status
: result of the operation, e.g. "OK", "FAILED"
- Details
CloudWatch note: as of 2021-02-04 this metric is not reported to CloudWatch.
com.engflow.re.exec.docker/network_destroy_time
(milliseconds)-
The time needed to destroy a docker network.
- Tags
status
: result of the operation, e.g. "OK", "FAILED"
- Details
CloudWatch note: as of 2021-02-04 this metric is not reported to CloudWatch.
com.engflow.re.exec.docker/networks_created
(no unit)-
The number of docker networks created.
- Details
Deprecated. Instead use network_create_time, aggregated by count.
com.engflow.re.exec.docker/networks_destroyed
(no unit)-
The number of docker networks destroyed.
- Details
Deprecated. Instead use network_destroy_time, aggregated by count.
Persistent worker use
com.engflow.re.exec.worker/actions
(no unit)-
The number of persistent worker actions run.
- Tags
reuse_status
: `new` or `reused`
- Details
The number of persistent worker actions run, aggregated by whether or not they reused a previous persistent worker process.
Scheduler metrics
com.engflow.re.ac.distributed/entries
(no unit)-
The number of action cache entries on a specific scheduler instance.
com.engflow.re.ac.distributed/memory_used
(bytes)-
The amount of memory used for the action cache.
com.engflow.re.cas/entries_evicted
(no unit)-
The number of CAS entries that were evicted due to memory size limitations.
com.engflow.re.cas/entries_lost
(no unit)-
The number of CAS entries that could not be recovered on CAS node shutdown events.
com.engflow.re.scheduler/build_id
(no unit)-
The number of distinct build ids for which the service received at least one action.
com.engflow.re/remaining_license_time
(days)-
The time remaining before the license expires.
Java memory metrics
com.engflow.re/java_heap
(bytes)-
The amount of heap memory used.
- Details
Every instance reports this metric. Every instance reports its own stats.