Build analytics

Beta

This feature is currently in beta. Contact EngFlow if you'd like to use it.

EngFlow captures build performance data for every remotely executed invocation. While this data is surfaced in the Build and Test UI as the EngFlow Profile, build analytics makes this data available as a queryable data store for you to extract powerful build performance insights.

Queryable build data deepens your visibility into:

  • Build health information
  • Action-level metrics
  • Build trends over a period of time that's significant to your organization

Data schema

EngFlow build analytics records data from the Remote Execution API v2 (REAPI) as rows in an Apache Iceberg table. If your EngFlow RE cluster is deployed on AWS, this data is stored in AWS’s S3 Tables. If your cluster is deployed on GCP, this data is stored in a GCS bucket, leveraging the invocation index database as the Iceberg catalog.

See Iceberg table schema for the full schema.

Enabling build analytics for your cluster

Contact EngFlow to enable build analytics for your cluster.

Note that enabling build analytics results in a small increase in your cloud costs. We recommend monitoring the associated cost during the first 30 days of usage.

Querying build analytics data

This section describes how to query your cluster's build analytics data, including the access and tooling you need to get started. Instructions are provided for both AWS-hosted clusters (queried with DuckDB) and GCP-hosted clusters (queried with the BigQuery client); you may substitute a similar tool of your choice.

AWS clusters

Step 1: Install prerequisite tooling and verify access

  1. Make sure you have the following core tools installed:

    • Python 3.14 or above
    • A Python package manager, such as uv or pip
  2. Install required dependencies. The pyproject.toml file below lists the required packages. Create a copy of this file in the directory where you'll be running your queries:

    pyproject.toml
    [project]
    name = "analytics-example"
    version = "0.1.0"
    dependencies = [
        "duckdb>=1.4.2",
        "ipykernel>=7.1.0",
        "jupyter>=1.1.1",
        "pandas>=2.3.3",
        "pip>=25.3",
        "plotly>=6.4.0",
        "pyarrow>=22.0.0",
    ]
    requires-python = ">=3.14"
    
  3. Run uv sync to install the dependencies.

  4. Install the AWS CLI.

  5. Make sure you can access S3 tables.

Step 2: Create a Jupyter Notebook

Using Jupyter Notebook in an IDE

You can do this step in an IDE like Visual Studio Code. If using an IDE, select the Python interpreter from the virtual environment created by uv as your notebook kernel. This ensures the notebook has access to all installed dependencies.

Start Jupyter Notebook and create a notebook named engflow_analytics.ipynb. Then, create the following cells in your notebook:

Python
# Import required libraries
import duckdb
import json
Python
# Install and load the Iceberg extension for DuckDB
# This enables DuckDB to read Apache Iceberg table format
duckdb.sql("INSTALL iceberg")
duckdb.sql("LOAD iceberg")
Python
# Authenticate with AWS using Single Sign-On
! aws sso login
Python
# Export temporary AWS credentials for the specified cluster profile
# Update cluster name and region
credentials = !aws configure export-credentials --profile <your-cluster-name> --region us-east-1 --format process
credentials = json.loads(''.join(credentials))
print(f'These credentials will expire on {credentials["Expiration"]} and need to be refreshed.')
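The `--format process` output follows the AWS credential_process JSON shape (`Version`, `AccessKeyId`, `SecretAccessKey`, `SessionToken`, `Expiration`). As a minimal sketch of how to parse it and check the expiry, using a placeholder payload rather than real credentials:

```python
import json
from datetime import datetime, timezone

# Placeholder payload in the credential_process JSON shape emitted by
# `aws configure export-credentials --format process` (values are fabricated).
sample = '''{
  "Version": 1,
  "AccessKeyId": "ASIAEXAMPLE",
  "SecretAccessKey": "example-secret",
  "SessionToken": "example-token",
  "Expiration": "2030-01-01T00:00:00+00:00"
}'''

credentials = json.loads(sample)

# Temporary SSO credentials expire; refresh them once this is in the past.
expires = datetime.fromisoformat(credentials["Expiration"])
still_valid = expires > datetime.now(timezone.utc)
print(still_valid)
```

In the notebook, the same fields are read straight out of the `credentials` dict when building the DuckDB secret.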
Python
# Find the S3 Tables bucket ARN that contains observability data
aws_arn = !aws --profile <your-cluster-name> s3tables list-table-buckets
table_bucket_result = json.loads(''.join(aws_arn))['tableBuckets']
table_bucket = [x['arn'] for x in table_bucket_result if '-observability' in x['name']][0]
table_bucket
Python
# Update region

region = 'us-east-1'

# Create a secret in DuckDB containing AWS credentials
duckdb.sql(f"""
  CREATE OR REPLACE SECRET s3table_secret (
    TYPE s3,
    KEY_ID '{credentials['AccessKeyId']}',
    SECRET '{credentials['SecretAccessKey']}',
    SESSION_TOKEN '{credentials['SessionToken']}',
    REGION '{region}'
);
""")

# Attach the S3 Tables database to DuckDB for querying
duckdb.sql(f"""
ATTACH '{table_bucket}' AS s3_tables (
    TYPE iceberg,
    SECRET s3table_secret,
    ENDPOINT_TYPE s3_tables
)
""")
Python
# Create a local DuckDB table from the remote executions table
# This is optional but can improve query performance for repeated queries
# It can take a long time to cold start if scanning over large data sets
# We recommend filtering to the applicable date ranges
duckdb.sql("""
    CREATE OR REPLACE TABLE executions AS
    SELECT * FROM s3_tables.observability.executions;
""")

Step 3: Run a test query

To verify your setup, run a query to get the number of test executions that ran yesterday:

Python
# Update the database name, schema, and table name
a = duckdb.sql("""
    SELECT COUNT(*) AS tests_that_ran_yesterday
    FROM <database>.<schema>.<table>
    WHERE start_timestamp_day = CURRENT_DATE() - INTERVAL 1 DAY
        AND action_mnemonic = 'TestRunner'
""").to_df()
a

If you see a result showing the number of test executions, your setup is working as expected.

GCP clusters

Step 1: Install prerequisite tooling and verify access

  1. Make sure you have the following core tools installed:

    • Python 3.14 or above
    • A Python package manager, such as uv or pip
  2. Install required dependencies. The pyproject.toml file below lists the required packages. Create a copy of this file in the directory where you'll be running your queries:

    pyproject.toml
    [project]
    name = "analytics-example"
    version = "0.1.0"
    dependencies = [
        "duckdb>=1.4.2",
        "google-cloud-bigquery[all]>=3.38.0",
        "ipykernel>=7.1.0",
        "jupyter>=1.1.1",
        "pandas>=2.3.3",
        "pip>=25.3",
        "plotly>=6.4.0",
        "pyarrow>=22.0.0",
    ]
    requires-python = ">=3.14"
    
  3. Run uv sync to install the dependencies.

  4. Install the Google Cloud CLI.

  5. Make sure you can access BigQuery datasets.

Step 2: Create a Jupyter Notebook

Using Jupyter Notebook in an IDE

You can do this step in an IDE like Visual Studio Code. If using an IDE, select the Python interpreter from the virtual environment created by uv as your notebook kernel. This ensures the notebook has access to all installed dependencies.

Start Jupyter Notebook and create a notebook named engflow_analytics.ipynb. Then, create the following cells in your notebook:

Python
# Import required libraries
import duckdb
import pandas as pd
import plotly.express as px
import pyarrow
Python
# Authenticate with Google Cloud using application default credentials
! gcloud auth application-default login
Python
# Initialize a BigQuery client to query the analytics dataset
from google.cloud import bigquery

# Update your project ID
client = bigquery.Client(project='my-project-id-123abc456')

Step 3: Run a test query

To verify your setup, run a query to get the number of test executions that ran yesterday:

Python
# Query BigQuery to count test executions from yesterday
# Update the project ID, dataset name, and table name
jb = client.query_and_wait("""
SELECT COUNT(*) AS tests_that_ran_yesterday
FROM `<projectId>.<datasetName>.<tableName>`
WHERE start_timestamp_day = CURRENT_DATE() - INTERVAL 1 DAY
    AND action_mnemonic = 'TestRunner'
""")

# Convert query results to a pandas DataFrame for analysis
jb.to_dataframe()

If you see a result showing the number of test executions, your setup is working as expected.


Next steps

Now that you've verified your setup, you can:

  • Explore the full Iceberg table schema.
  • Create custom queries to analyze your build performance.
  • Set up automated reports or dashboards using the queried data.
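As a starting point for trend analysis, once query results are in a pandas DataFrame, a daily series is a small groupby away. This sketch uses fabricated rows; only the `start_timestamp_day` and `action_mnemonic` column names mirror the queries above:

```python
import pandas as pd

# Fabricated stand-in for query results: one row per execution.
df = pd.DataFrame({
    "start_timestamp_day": ["2025-01-01", "2025-01-01", "2025-01-02"],
    "action_mnemonic": ["TestRunner", "CppCompile", "TestRunner"],
})

# Daily count of test executions: the kind of series you would feed
# into a plotly line chart or a dashboard.
trend = (
    df[df["action_mnemonic"] == "TestRunner"]
    .groupby("start_timestamp_day")
    .size()
    .rename("test_executions")
    .reset_index()
)
print(trend)
```

From here, `plotly.express.line(trend, x="start_timestamp_day", y="test_executions")` renders the trend chart.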