Azure (staging & prod)
This page builds a staging and a production environment for the backend and workers on Azure. Read the Overview first — in particular the two runtime shapes and the no-app-level-auth access model, which are what make the backend a managed service and each worker a VM.
Azure is the worked example here because it is what this project’s environments run on, but nothing about the design is Azure-specific: the backend is “a managed container with a volume and a single replica” and a worker is “a VM with a container runtime on the private network.” Any provider offering those works the same way.
The example assets referenced below live under deployments/azure/ and deployments/env/.
They are copy-pasteable starting points with placeholder values, not a managed
IaC pipeline — adapt them rather than running them blind.
Topology
```
                 private network (Tailscale tailnet or Azure VNet)
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│   Azure Container App            VM / VM Scale Set                  │
│   ┌───────────────────┐          ┌──────────────┐  ┌──────────────┐ │
│   │ tcab-backend      │◀─────────│ tcab-worker  │  │ tcab-worker  │ │
│   │ 1 replica         │          │ + Docker     │  │ + Docker     │ │
│   │ + volume (state)  │          └──────────────┘  └──────────────┘ │
│   └─────────┬─────────┘          each individually addressable      │
│             │                                                       │
│   web console (operator browser, joins the same network)            │
└─────────────┼───────────────────────────────────────────────────────┘
              │ outbound only
              ▼
   Cloudflare R2 (snapshot) + Pages deploy hook ──▶ public gallery
```

Everything sits on one private network per environment. The backend’s only inbound traffic is from workers and operators on that network; its only outbound traffic is the snapshot upload to Cloudflare R2 and the deploy-hook call that rebuilds the public gallery. Workers reach the backend over the same network and reach model APIs and package registries from inside their run containers.
One environment is one resource group — rg-tcab-staging and
rg-tcab-prod — so the two are fully isolated and tearing one down is a single
operation. Build staging first, confirm the flow, then repeat for prod with
prod’s own secrets and a TCAB_ENV=prod tag.
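As a rough illustration of the one-resource-group-per-environment rule, the sketch below maps an environment name to its resource group and prints the az commands that would create or delete it. The LOCATION default, the helper names, and the choice to apply TCAB_ENV as an Azure resource tag are assumptions for illustration, not something the example assets prescribe.

```shell
#!/usr/bin/env bash
# Sketch: one resource group per environment.
# Assumptions: LOCATION default, helper names, TCAB_ENV applied as an Azure tag.
# Commands are only printed unless DRY_RUN=0 is set.

LOCATION="${LOCATION:-westeurope}"

run() {
  # Print the command in dry-run mode (the default); execute it otherwise.
  if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

rg_name() {
  # Map an environment name to its resource group; reject anything else.
  case "$1" in
    staging|prod) printf 'rg-tcab-%s\n' "$1" ;;
    *) echo "unknown environment: $1" >&2; return 1 ;;
  esac
}

create_env_group() {
  local rg
  rg="$(rg_name "$1")" || return 1
  run az group create --name "$rg" --location "$LOCATION" --tags "TCAB_ENV=$1"
}

delete_env_group() {
  # Tearing an environment down is this single operation.
  local rg
  rg="$(rg_name "$1")" || return 1
  run az group delete --name "$rg" --yes
}

create_env_group staging
```

Rejecting any name other than staging or prod keeps a typo from silently creating a third, unmanaged environment.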
Prerequisites
- An Azure subscription and the az CLI, logged in (az login).
- An Azure Container Registry (or another registry the Container App can pull from) to hold the backend image.
- A private network. The default here is a Tailscale tailnet with an auth key per environment; the Azure-native alternative is at the end.
- The publishing credentials a worker needs if it will publish runs: a GITHUB_TOKEN and a Cloudflare API token, plus the backend’s R2 credentials and site deploy-hook URL. See .env.backend.example and .env.worker.example for the full list; treat all of them as secrets.
The example deployments/azure/az-provision.sh collects the commands below into one annotated script you can step through.
Backend on Azure Container Apps
The backend with its default SQLite store is stateful: it owns a database file, an on-disk definition store, a checkout it ingests from, and a headless browser for rendering references. Hosting it on Container Apps that way works, but three things are non-negotiable and follow directly from that:

1. A single replica. SQLite is single-writer and the store is local, so the app must be pinned to minReplicas = maxReplicas = 1. This service coordinates publishes and serves a low-traffic API; it is not something you scale out.
2. A persistent volume. Mount an Azure Files share at the SQLite database path (in TCAB_BACKEND_DATABASE_URL) and at the paths TCAB_BACKEND_STORE and TCAB_BACKEND_CHECKOUT point to, so the database, store, and checkout survive a revision or restart. Prefer an NFS Azure Files share for the SQLite file, because SMB file locking interacts poorly with SQLite. A volume survives restarts but is not a backup; see Backups.
3. An image with a browser. The stock binary has no Chromium. Build the backend image from deployments/azure/backend.Dockerfile, which layers the tcab-backend binary and a headless Chromium; set TCAB_REFERENCE_BROWSER to that browser if it is not auto-detected.
Constraints 1 and 2 are properties of the SQLite store, not the backend
itself. Point TCAB_BACKEND_DATABASE_URL at a managed PostgreSQL instance
(see Backups) and the backend becomes
stateless: no volume for the database, no single-replica pin. Constraint 3 (the
browser image) and the volume for the definition store and checkout still apply,
since those remain on local disk.
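For the default SQLite store, the single-replica pin can also be applied and checked from the CLI after the app exists. The following is a minimal sketch under that assumption; the helper names are hypothetical, and the commands are only printed unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch: pin tcab-backend to exactly one replica and check the result.
# Helper names are hypothetical. Set DRY_RUN=0 to execute instead of print.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

pin_single_replica() {
  # $1 = resource group, e.g. rg-tcab-staging
  run az containerapp update -g "$1" -n tcab-backend \
    --min-replicas 1 --max-replicas 1
}

scale_value() {
  # $1 = resource group, $2 = minReplicas or maxReplicas
  run az containerapp show -g "$1" -n tcab-backend \
    --query "properties.template.scale.$2" -o tsv
}

is_pinned() {
  # $1 = minReplicas, $2 = maxReplicas; succeeds only for the 1/1 pin.
  [ "$1" = "1" ] && [ "$2" = "1" ]
}

pin_single_replica rg-tcab-staging
```

With DRY_RUN=0, feeding the two scale_value results into is_pinned gives a quick post-deploy check that a later revision has not widened the replica range.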
Internal ingress only — the Container App must not have a public FQDN. Workers and operators reach it over the private network, and its outbound R2 and deploy-hook calls do not require inbound exposure.
Steps:
```shell
# Build and push the backend image (includes Chromium).
az acr build -r <registry> -t tcab-backend:<tag> \
  -f deployments/azure/backend.Dockerfile .

# Create the Container Apps environment and an Azure Files volume for state,
# then deploy the app pinned to one replica with internal-only ingress.
# deployments/azure/containerapp.yaml is an example app definition; fill in the
# image, the mounted volume, and the env values from
# deployments/env/backend.<env>.env.example.
az containerapp create -g rg-tcab-<env> -n tcab-backend \
  --yaml deployments/azure/containerapp.yaml
```

Provide the backend’s secrets (R2 keys, deploy-hook URL) as Container Apps secrets referenced by the env vars, not as plain values in the YAML. The non-secret settings (bind address, the volume paths, TCAB_ENV) come from deployments/env/backend.<env>.env.example.
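One way to wire a secret without putting its value in the YAML is to store it as a Container Apps secret and point the env var at it with a secretref. The sketch below shows that for the deploy-hook URL; the secret name site-deploy-hook-url and the helper name are assumptions, and the value is redacted in the default dry-run output.

```shell
#!/usr/bin/env bash
# Sketch: store a value as a Container Apps secret and reference it from an env var.
# The secret name and helper name are hypothetical. Set DRY_RUN=0 to execute.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

set_backend_secret() {
  # $1 = resource group, $2 = secret name, $3 = env var name, $4 = secret value
  local rg="$1" secret_name="$2" env_var="$3" value="$4"
  # Never echo the real value while only printing commands.
  if [ "${DRY_RUN:-1}" = "1" ]; then value='<redacted>'; fi
  run az containerapp secret set -g "$rg" -n tcab-backend \
    --secrets "$secret_name=$value"
  run az containerapp update -g "$rg" -n tcab-backend \
    --set-env-vars "$env_var=secretref:$secret_name"
}

set_backend_secret rg-tcab-staging site-deploy-hook-url \
  TCAB_SITE_DEPLOY_HOOK_URL "${DEPLOY_HOOK_URL:-unset}"
```

The same pattern would apply to the R2 credentials, using whatever variable names the env template lists.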
Ingesting definitions
The backend serves the catalog from the checkout at TCAB_BACKEND_CHECKOUT,
populated by calling POST /ingest (see the backend API). Put the
repository on the mounted volume and refresh it when test cases change — for
example a small scheduled job that pulls the repo into the volume and then calls
POST /ingest. Because the volume is persistent, this is a periodic update, not
something that happens on every restart.
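A scheduled refresh of that kind might look like the sketch below: fast-forward the checkout on the volume, then ask the backend to re-ingest. The function name, the example URL and port, and the assumption that POST /ingest needs no request body are illustrative only; check the backend API before relying on them.

```shell
#!/usr/bin/env bash
# Sketch: periodic refresh of the definitions checkout followed by POST /ingest.
# Assumptions: function name, backend URL, and that /ingest takes no body.
# Commands are only printed unless DRY_RUN=0 is set.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

refresh_definitions() {
  # $1 = checkout path on the mounted volume, $2 = backend base URL
  local checkout="$1" backend_url="$2"
  # Fast-forward only: a diverged checkout should fail loudly, not merge.
  run git -C "$checkout" pull --ff-only || return 1
  # --fail makes curl exit non-zero on an HTTP error status.
  run curl --fail --silent --show-error -X POST "$backend_url/ingest"
}

refresh_definitions "${TCAB_BACKEND_CHECKOUT:-/data/checkout}" \
  "${BACKEND_URL:-http://tcab-backend.example:8080}"
```

Run from cron or a systemd timer on a host that can reach both the volume and the backend's private address.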
Simpler alternative. If running a stateful service on Container Apps feels like more than you want, the backend also runs cleanly as a tcab-backend systemd service on a VM (the same VM you already run workers on, or its own). The example deployments/systemd/tcab-backend.service unit and a local Chromium make this a one-box deployment with no Azure Files or custom image. The managed path above is the default; this is a valid, simpler trade.
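For the VM route, installing the example unit could be sketched as follows. It assumes the tcab-backend binary and its configuration are already in place where the unit expects them, and that /etc/systemd/system is the target directory; the commands are only printed unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch: install and start the example systemd unit on a VM.
# Assumes the binary and its config already exist where the unit expects them.
# Set DRY_RUN=0 (as root) to execute instead of print.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

install_backend_service() {
  run install -m 0644 deployments/systemd/tcab-backend.service \
    /etc/systemd/system/tcab-backend.service
  run systemctl daemon-reload
  # Start now and on every boot.
  run systemctl enable --now tcab-backend.service
}

install_backend_service
```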
Workers on VMs
A worker must run on a host with a container runtime, so each worker is a VM. Provision identical nodes (one to start for staging, a small pool for prod) with the cloud-init in deployments/azure/worker-cloud-init.yaml, which on first boot:

- installs Docker and pulls the harness container images;
- installs the tcab-worker binary and a tcab-worker.service systemd unit that reads /etc/test-cabinet/worker.env;
- joins the private network (a Tailscale auth key is passed in as cloud-init data) so the node comes up with its own stable private address.
```shell
az vm create -g rg-tcab-<env> -n tcab-worker-1 \
  --image Ubuntu2404 --size <size> \
  --custom-data deployments/azure/worker-cloud-init.yaml
# Repeat per node, or use a VM Scale Set with the same custom-data.
```

Each worker’s /etc/test-cabinet/worker.env is built from
deployments/env/worker.<env>.env.example:
it sets TCAB_BACKEND_URL to the backend’s private address, TCAB_ENV, and the
harness API keys (and, if the worker publishes, the GitHub and Cloudflare
credentials). Inject the secret values from your secret store; the committed
template carries placeholders only.
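To make the shape of that file concrete, here is a minimal sketch that writes a worker.env with owner-only permissions. It generates the file from arguments rather than from the committed template, which is a simplification; the helper name, the example backend address, and the way extra KEY=VALUE pairs are passed in are all assumptions.

```shell
#!/usr/bin/env bash
# Sketch: write a worker.env with owner-only permissions.
# Assumptions: helper name, argument layout, example backend address.

write_worker_env() {
  # $1 = output file, $2 = environment, $3 = backend private URL,
  # remaining args = extra KEY=VALUE lines fetched from a secret store.
  local out="$1" env="$2" backend_url="$3"
  shift 3
  (
    umask 077   # the file holds secrets: readable by its owner only
    {
      printf 'TCAB_BACKEND_URL=%s\n' "$backend_url"
      printf 'TCAB_ENV=%s\n' "$env"
      for kv in "$@"; do printf '%s\n' "$kv"; done
    } > "$out"
  )
}

# Demo against a temporary file; a real node would target /etc/test-cabinet/worker.env.
demo="$(mktemp)"
write_worker_env "$demo" staging "http://100.64.0.10:8080"
cat "$demo"
rm -f "$demo"
```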
A “pool” is individually-addressed nodes
Because worker jobs are per-instance,
do not put workers behind a single load balancer — a run submitted through
one address must be polled on that same node, and a round-robin LB would scatter
the follow-up requests. Scale by adding nodes, each registered in the
web console by its own private URL. A mesh VPN makes
this painless: every node has its own 100.x address the moment it joins. A VM
Scale Set is fine for provisioning identical nodes, but address them
individually, not through the scale set’s load balancer.
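The sketch below illustrates scaling by adding nodes: it prints one az vm create per node and turns a list of node names and private addresses into the per-node URLs an operator would register. The VM size, the port, the http scheme, and the helper names are placeholders chosen for illustration; the address list can come from whatever the private network reports.

```shell
#!/usr/bin/env bash
# Sketch: provision a pool node by node and list each node's own URL.
# Assumptions: VM size, port, http scheme, helper names.
# Commands are only printed unless DRY_RUN=0 is set.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

provision_pool() {
  # $1 = environment, $2 = node count, $3 = VM size
  local env="$1" count="$2" size="$3" i=1
  while [ "$i" -le "$count" ]; do
    run az vm create -g "rg-tcab-$env" -n "tcab-worker-$i" \
      --image Ubuntu2404 --size "$size" \
      --custom-data deployments/azure/worker-cloud-init.yaml
    i=$((i + 1))
  done
}

worker_urls() {
  # stdin: "<name> <address>" per line; stdout: "<name> <url>" per node.
  # $1 = port the worker listens on (placeholder).
  local port="$1" name addr
  while read -r name addr; do
    [ -n "$name" ] || continue
    printf '%s http://%s:%s\n' "$name" "$addr" "$port"
  done
}

provision_pool staging 1 Standard_B2s
printf 'tcab-worker-1 100.64.0.11\n' | worker_urls 8080
```

Each output line is a separate registration in the web console, so a run submitted to one node is always polled at that same address.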
Per-environment differences
Staging and prod are the same topology; keep them that way so staging actually rehearses prod. Only these differ:

| | Staging | Prod |
|---|---|---|
| Resource group | rg-tcab-staging | rg-tcab-prod |
| TCAB_ENV | staging | prod |
| Workers | one node is enough | a pool sized to demand |
| Private network | its own tailnet tag / VNet | its own tailnet tag / VNet |
| Secrets | staging keys & tokens | prod keys & tokens |
Use separate Cloudflare R2 buckets (and deploy hooks) per environment if you want
staging publishes to land in a separate gallery dataset from prod; point each
backend’s TCAB_R2_* and TCAB_SITE_DEPLOY_HOOK_URL at the right one.
Alternative: an Azure-native private network
If you would rather not depend on a third-party mesh VPN, replace Tailscale with an Azure VNet:
- Put the worker VMs and the Container Apps environment on private subnets of one VNet per environment, with NSGs restricting traffic to within the VNet.
- Give the backend internal ingress on the VNet so workers resolve it by its private address.
- Reach the environment as an operator through an Azure VPN Gateway or Azure Bastion rather than any public endpoint.
The trade is more Azure-specific networking to stand up and maintain, and per-VM private addressing to manage for the individually-addressed worker requirement, versus Tailscale handing each node a stable address for free. The service configuration is otherwise unchanged — only how hosts find each other differs.
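A rough sketch of the VNet variant follows. All resource names and address ranges are placeholders, and the subnet delegation assumes a workload-profiles Container Apps environment; treat it as a starting point to check against current Azure documentation, not a tested layout. Commands are only printed unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch: one VNet per environment with a worker subnet, a backend subnet,
# and an internal-only Container Apps environment.
# Assumptions: all names, address ranges, and the workload-profiles delegation.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

create_private_network() {
  # $1 = environment, $2 = subscription ID (used to build the subnet resource ID)
  local env="$1" sub="$2"
  local rg="rg-tcab-$env" vnet="vnet-tcab-$env" nsg="nsg-tcab-$env"
  local backend_subnet_id="/subscriptions/$sub/resourceGroups/$rg/providers/Microsoft.Network/virtualNetworks/$vnet/subnets/snet-backend"

  run az network vnet create -g "$rg" -n "$vnet" --address-prefixes 10.10.0.0/16
  # A fresh NSG has only the default rules: intra-VNet and Azure load balancer
  # traffic is allowed, all other inbound traffic is denied.
  run az network nsg create -g "$rg" -n "$nsg"
  run az network vnet subnet create -g "$rg" --vnet-name "$vnet" -n snet-workers \
    --address-prefixes 10.10.1.0/24 --network-security-group "$nsg"
  run az network vnet subnet create -g "$rg" --vnet-name "$vnet" -n snet-backend \
    --address-prefixes 10.10.2.0/23 --network-security-group "$nsg" \
    --delegations Microsoft.App/environments
  # Internal-only: the environment gets no public endpoint.
  run az containerapp env create -g "$rg" -n "cae-tcab-$env" \
    --infrastructure-subnet-resource-id "$backend_subnet_id" --internal-only true
}

create_private_network staging "${AZURE_SUBSCRIPTION_ID:-00000000-0000-0000-0000-000000000000}"
```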
Operating these environments
Two cross-cutting concerns have their own pages:
- Backups — the only irreplaceable data is the backend’s database, so backups reduce to protecting that one store. The strategy is tied to the backend-hosting choice above: a SQLite backend on a VM streams to object storage with Litestream, while managed PostgreSQL hands you provider-managed point-in-time restore.
- Telemetry — choosing and wiring an OTLP collector
for staging and prod, tagged by
TCAB_ENV. Enable it in both environments.