First Time Setup

This guide takes a fresh checkout of The Test Cabinet to the point where you can launch a run. It covers the toolchain, the container runtime, the harness image, the headless browser, and credentials: the five things a run needs that the repository alone does not provide.

The project is in early development, so setup assumes some familiarity with Rust, Node, and containers. Building holds the authoritative build details; this guide is the task-oriented version that sits on top of it.

Runs are driven by the tcab CLI (binary tcab, crate test-cabinet-cli). There are two ways to invoke it, and the rest of these guides use the first:

  • A released binary: tcab run …. Released binaries are published on GitHub (Linux static-musl, Windows, macOS).
  • From a source checkout: cargo run -p test-cabinet-cli -- run …. Everything after -- is passed to tcab. This is the form to use while working in the repository.

Wherever a guide shows tcab <args>, the source-checkout equivalent is cargo run -p test-cabinet-cli -- <args>.
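
If you switch between the two forms often, a small shell function can make the source-checkout form answer to the same name. This is a convenience sketch rather than something the repository ships; it assumes you call it from inside the checkout and that no released tcab binary is meant to take precedence in that shell.

```shell
# Hypothetical wrapper: forwards every argument to the CLI built from source,
# so `tcab harnesses` becomes `cargo run -p test-cabinet-cli -- harnesses`.
tcab() {
  cargo run -p test-cabinet-cli -- "$@"
}
```

Defined this way, the commands in the later guides can be pasted unchanged while you work in the repository.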

The repository is both a Cargo (Rust) and an npm (TypeScript) workspace. Build both once:

Terminal window
cargo build --workspace # Rust: core, CLI, desktop shell
npm install # TypeScript: installs every workspace

The pinned Rust toolchain is declared in rust-toolchain.toml. Format and lint with cargo fmt --all and cargo clippy --workspace.

If you are on a distribution without the generic FHS dynamic loader (notably NixOS), build the fully static tcab instead with cargo build-portable (an alias that targets x86_64-unknown-linux-musl); see Portable build for the musl prerequisites.

Every run executes inside an isolated container so a model cannot reach the host filesystem or other runs’ outputs (see Execution). You need Podman (preferred) or Docker on PATH. The runtime is auto-detected; override it with TCAB_CONTAINER_RUNTIME=<binary>.
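
To check which runtime a run is likely to pick up, the lookup can be approximated in shell. This is a sketch of the behaviour described above, not tcab's actual detection code: the function name pick_runtime is hypothetical, and trying podman before docker is an assumption based on Podman being the preferred runtime.

```shell
# Sketch (assumed order): honour the override, else podman, else docker.
pick_runtime() {
  # An explicit TCAB_CONTAINER_RUNTIME wins over anything found on PATH.
  if [ -n "${TCAB_CONTAINER_RUNTIME:-}" ]; then
    printf '%s\n' "$TCAB_CONTAINER_RUNTIME"
    return 0
  fi
  for candidate in podman docker; do
    if command -v "$candidate" >/dev/null 2>&1; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  echo "no container runtime found on PATH" >&2
  return 1
}
```

Calling pick_runtime prints the chosen binary name, or returns a non-zero status when neither runtime is installed.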

Runs always execute Linux containers, so platform expectations differ:

  • Linux — rootless Podman runs containers directly on the host. tcab adds --userns=keep-id so the mounted repository stays writable by the run user.
  • macOS — Podman runs containers inside its managed Linux VM (podman machine init && podman machine start). The VM shares your home directory but not the OS temp directory, which is why staged inputs default to ~/.tcab (below). On Apple Silicon the machine is arm64, so harness images build and run arm64 by default.
  • Windows — Podman runs on its WSL2 backend, so WSL must be installed (wsl --install) before podman machine init.

Where a run stages its mountable inputs — the seeded repository, collected artifacts, and capture scratch — is resolved as --work-dir, then TCAB_WORK_DIR, then ~/.tcab. It must be a path the runtime can mount; on macOS and Windows that rules out the OS temp directory, which is why the default is home-based.
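
The precedence can be written out as a short shell function. This is an illustrative sketch, not the CLI's implementation: resolve_work_dir is a hypothetical name, and treating an empty value the same as an unset one is an assumption.

```shell
# Sketch: first non-empty of the --work-dir value, TCAB_WORK_DIR, then ~/.tcab.
resolve_work_dir() {
  flag_value="${1:-}"   # value that would be passed to --work-dir, if any
  if [ -n "$flag_value" ]; then
    printf '%s\n' "$flag_value"
  elif [ -n "${TCAB_WORK_DIR:-}" ]; then
    printf '%s\n' "$TCAB_WORK_DIR"
  else
    printf '%s\n' "$HOME/.tcab"
  fi
}
```

For example, resolve_work_dir /srv/stage prints /srv/stage regardless of the environment, while a call with no argument falls back to TCAB_WORK_DIR and then to the home-based default.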

A run drives an agent harness inside the container, so the harness’s run-container image must be built once. From the containers/ directory (see its README.md):

Terminal window
cd containers && DOCKER=podman ./build.sh claude # builds the base + claude image

Build the image for whichever harness you intend to run. The supported harness slugs are claude, codex, cline, antigravity, goose, kilo, opencode, and pi. Confirm availability without starting a run:

Terminal window
tcab harnesses # human-readable table; add --json for machine output

The validator and the reference renderer shell out to a Playwright browser driver. Install the Chromium revision the driver expects through the pinning workspace — a bare npx playwright fetches a different version:

Terminal window
npm exec -w @test-cabinet/browser-driver -- playwright install chromium

The driver (packages/browser-driver/driver.mjs) is located relative to the working directory; override with TCAB_BROWSER_DRIVER. A run will not start unless every one of the selected variant’s reference mockups renders, since those screenshots are both the seeded visual targets and the validation baselines — a render failure aborts the run before a harness session is spent. (The seed, validate, and catalog commands degrade per-view instead of aborting.)

The harness needs an API key for its model provider. The CLI keeps the different kinds of credential separate (see CLI Authentication); for a basic run you only need the harness key.

Each harness reads a specific variable — ANTHROPIC_API_KEY for claude, OPENAI_API_KEY for codex, OPENROUTER_API_KEY for the OpenRouter-backed harnesses. The CLI loads a .env from the working directory (or any parent) on startup; copy .env.example to .env and fill in the keys. Variables already exported in the shell take precedence over the file. The key is passed into the run container as a secret and is never written into the seeded repository.
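
A filled-in .env might look like the fragment below. It lists only the three variables named above with placeholder values; the real .env.example may contain more entries or different placeholders, so treat this as a sketch rather than a copy of that file. Set only the keys for the harnesses you intend to run.

```shell
# .env (sketch): placeholder values, replace with your own keys.
ANTHROPIC_API_KEY=replace-with-your-anthropic-key    # read by the claude harness
OPENAI_API_KEY=replace-with-your-openai-key          # read by the codex harness
OPENROUTER_API_KEY=replace-with-your-openrouter-key  # read by the OpenRouter-backed harnesses
```

Because exported shell variables take precedence over the file, a stale key left in your shell session will override the value you put here.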

Run from the repository root so the test-cases/ catalog and the browser driver resolve (override the catalog location with TCAB_TEST_CASES_DIR):

Terminal window
tcab run \
--test-case pong --version v1.0.0 --variant base \
--harness claude --model anthropic/claude-opus-4 \
--out-dir runs

This renders the references and seeds a fresh repository with the selected variant’s specs and screenshots. It then renders the prompt and hands it to the harness in a container, printing the live event stream as the session runs. Afterwards it builds and load-checks the result, runs the declared checks, and writes runs/<id>/run-record.json alongside a copy of the implementation. --variant is required; --max-runtime <seconds> overrides the case’s default cap for this invocation.