Harness Events
Overview
While a run is in progress, its harness is doing work: running commands, reading and writing files, emitting assistant messages, and occasionally reporting its own errors. The agent harness layer converts that activity into a single stream of normalized harness events so that callers can observe a run as it happens without needing to understand any harness specific output format.
Every supported harness reports its activity differently. The harness layer is responsible for translating each harness’s raw output into the normalized event types defined here, exactly as it translates raw usage into the normalized token classes in Metrics. Callers — the testing harness application, its command line interface, and the desktop shell — consume one uniform stream regardless of which harness produced it.
This solves a concrete problem: without an event stream the only signal a caller gets is the final outcome, so a run appears to sit silently until it finishes and a failure surfaces as a single opaque line. Emitting events as they arrive lets callers show live progress and, when a harness fails, see the harness’s own diagnostic output rather than a truncated summary.
Event Stream
A harness invocation produces an ordered stream of events as the harness runs. Events are delivered to the caller in the order the harness emits them, before the invocation completes, so a caller can render progress live.
Each event is one of the normalized event types below. Every event carries a discriminator identifying its type, and callers branch on that discriminator rather than inspecting a generic payload.
Common Fields
Every event, regardless of type, carries the following fields:
- Type: the discriminator slug identifying the event type. Each type below defines its own slug.
- Timestamp: an ISO 8601 timestamp for when the event was observed. Most harnesses do not stamp their own output, so this is the time the testing harness saw the line rather than a harness provided time.
- Session ID (optional): the harness reported session identifier the event belongs to, when the harness exposes one. Unlike a dedicated session manager, The Test Cabinet does not mint its own session IDs; this field carries the underlying harness’s identifier when it can be determined and is otherwise unset.
The type discriminator is inline on every event. Events do not nest their type
specific data under a payload field; callers check the type field and read the
type specific fields directly.
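As an illustration of the inline discriminator, the sketch below models an event as a flat dictionary and branches on its type. The key names `type`, `timestamp`, `session_id`, `message`, and `command` are assumptions chosen for this example; the text above names the fields but not their serialized spelling.

```python
from datetime import datetime, timezone


def describe(event: dict) -> str:
    """Branch on the inline discriminator and read type specific fields directly."""
    kind = event["type"]
    if kind == "agent":
        return f"agent said: {event['message']}"
    if kind == "command":
        return f"ran: {event['command']}"
    # Types this caller does not render specially still carry the common fields.
    return f"{kind} event at {event['timestamp']}"


event = {
    "type": "command",  # discriminator, inline rather than nested under a payload
    "timestamp": datetime.now(timezone.utc).isoformat(),  # when the line was observed
    "session_id": None,  # unset when the harness exposes no identifier
    "command": "ls -la",  # type specific field, read directly from the event
}
print(describe(event))
```

Because the type specific fields sit next to the common ones, a caller needs only one lookup on the discriminator before reading them.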
Event Types
Agent Message
Generated when an agent emits a plain natural language message that is not structured tool activity, a harness diagnostic, or a terminal result the harness reports separately.
- Discriminator: agent
- Message: the plain text emitted by the agent.
Command
Generated when an agent runs a shell command. If a harness does not differentiate shell commands used for reading, searching, or listing files from ordinary commands, those operations are reported as command events rather than the dedicated file operation events below.
- Discriminator: command
- Command: the shell command the agent attempted to run.
- Working directory (optional): the directory the command ran from, when the harness reports it.
- Exit code (optional): the process exit code, when the command reached a point where one exists and the harness reports it.
- Is success (optional): whether the command succeeded. Agent caused failures (for example a malformed command) are still command events with this set to false. Unset when the harness does not report command success.
File Read
Generated when an agent reads a file. Reports the operation that occurred, never the data returned by it.
- Discriminator: read
- Path: the file that was read, as an absolute path when it can be determined. The path is not guaranteed to exist.
- Start line / End line (optional): the inclusive line range read, when the harness reports it.
- Is success (optional): whether the read succeeded. This is not the same as whether the path exists; a read can fail for other reasons such as permissions. Unset when the harness does not report it.
File Write
Generated when an agent writes to a file. Reports where the write occurred, never the written payload.
- Discriminator: write
- Path: the file that was written, as an absolute path when it can be determined. The path is not guaranteed to exist.
- Start line / End line (optional): the inclusive line range written, when the harness reports it.
- Is success (optional): whether the write succeeded, on the same terms as a read’s success field.
File Search
Generated when an agent searches the filesystem or searches within files. Reports the search that occurred, never the results. If a harness does not differentiate search commands from ordinary shell commands, searches are reported as command events instead.
- Discriminator: search
- Query: the search pattern, file name, glob, or other search expression.
- Path (optional): the file or directory scope searched, as an absolute path when set.
- Is success (optional): whether the search completed, which is not the same as whether it matched anything.
Directory List
Generated when an agent lists directory contents. Reports the listing operation, never the entries returned.
- Discriminator: list
- Path (optional): the directory whose contents were listed, as an absolute path when set.
- Is success (optional): whether the listing completed.
Skill
Generated when an agent uses a skill, but only when the harness differentiates skill use from an ordinary file read. When a harness reports skill files as ordinary reads, those are reported as read events instead.
- Discriminator: skill
- Path: the skill file that was read, as an absolute path when it can be determined.
- Skill name (optional): the harness provided name for the skill.
- Start line / End line (optional): the inclusive line range read.
- Is success (optional): whether the skill use completed.
Orchestration
Generated when a harness reports subagent orchestration activity, such as a subagent starting or completing.
- Discriminator: orchestration
- Action: one of subagent_started, subagent_completed, or subagent_failed.
- Subagent ID (optional): the harness provided identifier for the subagent.
- Subagent name (optional): the harness provided display or role name.
- Is success (optional): whether the action completed successfully, most meaningful for terminal actions.
Harness Error
Generated when the underlying harness reports an error caused by the harness itself. This is not used for agent caused errors; a malformed command an agent ran is a command event with its success field set to false.
- Discriminator: error
- Message: a human readable description of the error.
- Code (optional): a harness provided stable error code, when one exists.
Warning
Generated when the underlying harness reports output indicating a potential issue. Harness diagnostics printed to standard error that are not clearly fatal are surfaced as warnings.
- Discriminator: warning
- Message: a human readable description of the potential issue.
- Code (optional): a harness provided stable warning code, when one exists.
Unknown
Generated when the harness layer cannot classify a piece of harness output as any of the types above. Preserving these rather than dropping them keeps the stream lossless, which matters most when diagnosing a failing harness.
- Discriminator: unknown
- Raw: the original harness output that could not be classified. It may be any JSON value, including a string for non JSON output.
Translating Harness Output
Each harness emits its activity in its own format, and the harness layer maps that format onto the event types above. Two broad strategies are used:
- Structured mapping. When a harness emits a documented machine readable event stream, the harness layer parses it and maps each event to its precise normalized type. Codex, Claude Code, Cline, Goose, Kilo Code, OpenCode, and Pi are mapped this way.
- Best effort mapping. For harnesses whose event formats are not yet modeled
in detail, the harness layer surfaces output as it streams — recognizable
diagnostics become warning or error events and everything else becomes an
unknown event carrying the raw output. This still gives callers live visibility
and full failure output, and a harness can be promoted to a structured mapping
later without changing the event contract. Antigravity is mapped this way: it
authenticates only with a Google account, so it cannot run in The Test
Cabinet’s API-key-only mode, and its plain
--print output carries no structured stream to model.
A structured mapping’s exact field names are confirmed against real CLI output
rather than a published schema. Where a harness’s stream has not yet been
captured from a real run, the mapping reads each field from a small set of
candidate locations and falls back to an unknown event rather than guessing — and
the raw.jsonl and events.jsonl files
a run records make it straightforward to confirm and refine those field names
against an actual stream.
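The candidate location lookup described above might look like the following minimal sketch. The helper name `first_present`, the candidate key paths, and the single `read` tool shown are all hypothetical; they stand in for whatever a given mapping actually probes.

```python
from typing import Any, Optional, Sequence, Tuple


def first_present(record: dict, candidates: Sequence[Tuple[str, ...]]) -> Optional[Any]:
    """Return the value at the first candidate key path that exists, else None."""
    for path in candidates:
        node: Any = record
        for key in path:
            if not isinstance(node, dict) or key not in node:
                break  # this candidate is absent, try the next one
            node = node[key]
        else:
            return node  # every key on the path was found
    return None


# Hypothetical locations where a harness might report the tool name.
TOOL_NAME_CANDIDATES = [("tool",), ("part", "tool"), ("data", "tool_name")]


def classify_record(record: dict) -> dict:
    """Map a record when the tool is recognized, otherwise keep it as unknown."""
    tool = first_present(record, TOOL_NAME_CANDIDATES)
    if tool == "read":  # only one tool is shown in this sketch
        return {"type": "read"}
    return {"type": "unknown", "raw": record}  # fall back rather than guess


print(classify_record({"part": {"tool": "read"}}))
print(classify_record({"something": "else"}))
```

Falling back to an unknown event keeps the raw record available, so a later capture can confirm which candidate location is the real one.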
Regardless of strategy, output a harness writes to standard error is surfaced as warning events while the run is in progress, and an invocation that exits non zero produces a terminal error event carrying the harness’s own failure output. The exit status alone is never the only signal a caller receives.
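A rough sketch of that rule follows, assuming a hypothetical `run_and_collect` helper and the illustrative key names `type` and `message`. To stay short it discards standard output; a real harness layer would read both streams concurrently.

```python
import subprocess
import sys
from typing import List


def run_and_collect(argv: List[str]) -> List[dict]:
    """Surface stderr lines as warnings and a non zero exit as a terminal error."""
    events: List[dict] = []
    stderr_lines: List[str] = []
    proc = subprocess.Popen(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.PIPE, text=True
    )
    assert proc.stderr is not None
    for line in proc.stderr:  # yields each line as the child writes it
        text = line.rstrip("\n")
        stderr_lines.append(text)
        events.append({"type": "warning", "message": text})
    code = proc.wait()
    if code != 0:
        # Carry the harness's own failure output, not just the exit status.
        message = "\n".join(stderr_lines) or f"exited with status {code}"
        events.append({"type": "error", "message": message})
    return events


# A stand-in child process that complains on stderr and then fails.
child = [
    sys.executable,
    "-c",
    "import sys; print('disk nearly full', file=sys.stderr); sys.exit(3)",
]
events = run_and_collect(child)
print(events)
```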
Codex Event Mapping
Codex is run with codex exec --json, which emits a line delimited JSON stream on
standard output. Each non empty line is a complete JSON object. The stream has two
layers: lifecycle events describing the conversation and turn boundaries, and item
events wrapping a streamed item that carries its own type.
Lifecycle events are consumed for metadata rather than emitted as activity:
| Codex event | Handling |
|---|---|
| thread.started | Carries thread_id, captured as the session ID for later events. |
| turn.started | Marks the start of a turn. No event is emitted. |
| turn.completed | Carries usage, consumed for usage. No event. |
Items are reported first as item.started and then as item.completed. The
normalized event is derived from the completed state so that terminal information
such as a command’s exit code is available; the started notification is not turned
into a duplicate event. Completed items map as follows:
| Codex item type | Event |
|---|---|
| command_execution | command, or a recognized file operation (see below) |
| file_change | one write per changed path |
| agent_message | agent |
| error | error |
| any other item type | unknown |
Codex runs file operations through shell commands rather than dedicated tools, so
command_execution items are inspected before falling back to a command event.
When the command is a bash -lc invocation whose first simple command is a known
file operation, it is mapped accordingly: cat and print only sed ranges map to
read events, rg, grep, and find map to search events, and ls maps to a
list event. A sed -n '10,20p' path range maps to a read event with start line 10
and end line 20. Paths are reported as absolute when they can be determined from
the command; because Codex does not report a per command working directory,
relative paths are surfaced as written. Commands that are not confidently a file
operation remain command events, with the item’s exit code mapped to the exit code
and success fields.
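The classification above can be approximated with a short sketch. It is deliberately narrower than the description: it handles cat, print only sed ranges, rg, grep, and ls, leaves find out, and assumes shell operators are separated by whitespace. The function name `classify` and the output key names are illustrative assumptions.

```python
import re
import shlex

SED_RANGE = re.compile(r"^(\d+),(\d+)p$")
OPERATORS = {"|", "||", "&&", ";"}


def classify(command: str) -> dict:
    """Map a bash -lc invocation to a file operation, else keep it a command."""
    fallback = {"type": "command", "command": command}
    try:
        outer = shlex.split(command)
        if len(outer) != 3 or outer[:2] != ["bash", "-lc"]:
            return fallback
        words = shlex.split(outer[2])
    except ValueError:  # unbalanced quotes: not confidently a file operation
        return fallback
    for index, word in enumerate(words):
        if word in OPERATORS:  # keep only the first simple command
            words = words[:index]
            break
    if not words:
        return fallback
    name, args = words[0], words[1:]
    positional = [arg for arg in args if not arg.startswith("-")]
    if name == "cat" and len(args) == 1:
        return {"type": "read", "path": args[0]}
    if name == "sed" and len(args) == 3 and args[0] == "-n":
        match = SED_RANGE.match(args[1])
        if match:
            return {"type": "read", "path": args[2],
                    "start_line": int(match[1]), "end_line": int(match[2])}
    if name in ("rg", "grep") and positional:
        event = {"type": "search", "query": positional[0]}
        if len(positional) > 1:
            event["path"] = positional[1]
        return event
    if name == "ls":
        event = {"type": "list"}
        if positional:
            event["path"] = positional[0]
        return event
    return fallback


print(classify("bash -lc \"sed -n '10,20p' src/app.py\""))
print(classify("npm test"))
```

Relative paths are returned as written, matching the note above that no per command working directory is available to resolve them.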
Codex exposes a single diagnostic channel through error items, which it uses for
both true errors and advisory notices (for example reporting that a bypass flag is
enabled). Because Codex provides no severity signal, all error items map to error
events. Codex does not emit dedicated skill, warning, or orchestration activity in
this version, so those event types have no Codex source; if a future version adds
one, the corresponding event type must be produced from it. Any line that fails to
parse as JSON, and any item type not listed above, becomes an unknown event so the
stream stays lossless.
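Putting the Codex rules together, a minimal mapper might look like the sketch below. The JSON keys `item`, `text`, `message`, `command`, `exit_code`, `changes`, and `path` are assumptions about the stream's shape, the output key names are illustrative, and treating unlisted top level event types as unknown is also an assumption. File operation classification of commands is omitted.

```python
import json
from typing import Iterable, Iterator


def map_codex_stream(lines: Iterable[str]) -> Iterator[dict]:
    """Translate line delimited Codex JSON into normalized events."""
    session_id = None
    for line in lines:
        if not line.strip():
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            yield {"type": "unknown", "session_id": session_id, "raw": line}
            continue
        kind = record.get("type") if isinstance(record, dict) else None
        if kind == "thread.started":
            session_id = record.get("thread_id")  # reused for later events
        elif kind in ("turn.started", "turn.completed", "item.started"):
            continue  # metadata, usage, or a duplicate of the completed item
        elif kind == "item.completed" and isinstance(record.get("item"), dict):
            item = record["item"]
            base = {"session_id": session_id}
            item_type = item.get("type")
            if item_type == "agent_message":
                yield {"type": "agent", **base, "message": item.get("text", "")}
            elif item_type == "error":
                yield {"type": "error", **base, "message": item.get("message", "")}
            elif item_type == "command_execution":
                code = item.get("exit_code")
                yield {"type": "command", **base, "command": item.get("command", ""),
                       "exit_code": code,
                       "is_success": None if code is None else code == 0}
            elif item_type == "file_change":
                for change in item.get("changes", []):  # one write per path
                    yield {"type": "write", **base, "path": change.get("path")}
            else:
                yield {"type": "unknown", **base, "raw": item}
        else:
            yield {"type": "unknown", "session_id": session_id, "raw": record}


sample = [
    '{"type": "thread.started", "thread_id": "t1"}',
    '{"type": "item.started", "item": {"type": "agent_message"}}',
    '{"type": "item.completed", "item": {"type": "agent_message", "text": "done"}}',
    "not json",
]
for event in map_codex_stream(sample):
    print(event)
```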
Claude Code Event Mapping
Claude Code is run with claude --print --output-format stream-json --verbose,
which emits a line delimited JSON stream on standard output. Each non empty line
is a complete JSON object carrying a top level type. The stream is stateful: an
assistant event introduces a tool use, and the operation it requested is only
turned into a normalized event once the matching tool-result arrives in a later
user event. Pairing the requested operation with its observed result is what
lets a file read report both the path the agent asked for and whether the read
succeeded. Any event may carry a session_id; the first non empty one seen is
captured as the session ID for the stream.
Top level events map as follows:
| Claude Code event | Handling |
|---|---|
| system | Session lifecycle metadata. The init event’s cwd is captured to resolve relative paths; init, status, and thinking_tokens subtypes emit no event. Any other subtype becomes unknown. |
| assistant | Text content becomes an agent message and tool-use content is recorded for correlation (see below). |
| user | Tool-result content resolves a recorded tool use into its event; echoed prompt or injected-context text emits no event. |
| rate_limit_event | Consumed as credential state, except a non allowed status, which becomes a warning. |
| result | The terminal result; its usage and final output are consumed for metrics, and only a reported terminal error becomes an error. |
| stream_event | Lower-level partial telemetry that the completed assistant and user events restate, so it is consumed. |
| any other type | unknown |
Within an assistant message, text blocks are joined into one agent message,
while thinking and redacted_thinking blocks are model reasoning and carry no
activity. Each tool-use block is recognized by name and recorded; an unrecognized
tool (an MCP tool, web tool, todo tool, and the like) or a malformed tool-use
block becomes an unknown event. Recognized tools map to events when their
tool-result arrives, whose is_error and interruption flags set the success
field:
| Claude Code tool | Event |
|---|---|
| Read | read, with the line range derived from the offset and limit input |
| Write, Edit, MultiEdit, NotebookEdit | write |
| Grep, Glob | search |
| LS | list |
| Bash | command, or a recognized file operation classified from the command exactly as a Codex command is |
| Skill | skill, with the path synthesized as skills/<name>/SKILL.md under the workspace |
| StructuredOutput | Native delivery of --json-schema output; the tool use and its result emit no event |
| any other tool | unknown |
A tool-result is paired with its tool use by a unique tool_use_id; an ambiguous
match emits an unknown event rather than guessing the operation. A read result
that arrives without a recorded tool use is still recovered as a read event from
the file metadata it carries, and any other unpaired tool-result becomes an
unknown event so the stream stays lossless. As with every harness, file
operations report only the operation that occurred — the path, optional line
range, and success — never the contents the operation returned.
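The stateful pairing described above can be sketched as follows. For brevity only the Read tool is treated as recognized, and an unpaired result becomes unknown rather than being recovered. The block keys `message`, `content`, `id`, `name`, `input`, `file_path`, `offset`, and `limit` are assumptions about the stream's shape, as are joining text blocks with a newline and treating offset as the first line read.

```python
from typing import Dict, Iterable, Iterator


def read_event(tool_input: dict, result: dict) -> dict:
    """Build a read event from the requested input and the observed result."""
    event = {
        "type": "read",
        "path": tool_input.get("file_path"),
        "is_success": not result.get("is_error", False),
    }
    offset, limit = tool_input.get("offset"), tool_input.get("limit")
    if offset is not None and limit is not None:
        event["start_line"] = offset  # assumed to be the first line read
        event["end_line"] = offset + limit - 1  # inclusive end of the range
    return event


def map_claude_events(records: Iterable[dict]) -> Iterator[dict]:
    pending: Dict[str, dict] = {}  # tool_use_id -> recorded tool-use block
    for record in records:
        content = record.get("message", {}).get("content")
        blocks = content if isinstance(content, list) else []
        if record.get("type") == "assistant":
            texts = [b.get("text", "") for b in blocks if b.get("type") == "text"]
            if texts:
                yield {"type": "agent", "message": "\n".join(texts)}
            for block in blocks:
                if block.get("type") == "tool_use":
                    pending[block.get("id")] = block  # wait for its result
        elif record.get("type") == "user":
            for block in blocks:
                if block.get("type") != "tool_result":
                    continue
                use = pending.pop(block.get("tool_use_id"), None)
                if use is None or use.get("name") != "Read":
                    yield {"type": "unknown", "raw": block}
                else:
                    yield read_event(use.get("input", {}), block)


records = [
    {"type": "assistant", "message": {"content": [
        {"type": "text", "text": "Reading."},
        {"type": "tool_use", "id": "tu_1", "name": "Read",
         "input": {"file_path": "/w/a.py", "offset": 5, "limit": 10}},
    ]}},
    {"type": "user", "message": {"content": [
        {"type": "tool_result", "tool_use_id": "tu_1", "is_error": False},
    ]}},
]
for event in map_claude_events(records):
    print(event)
```

Holding the tool use until its result arrives is what lets a single event report both the requested path and the observed success.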
Claude Code does not emit a stable source for orchestration activity in this version, so that event type has no Claude Code source; if a future version adds one, the corresponding event type must be produced from it.
Cline Event Mapping
Cline is run with cline --json, a line delimited JSON stream. Cline 3.x wraps
every record in a top-level type: hook_event (lifecycle bookkeeping,
consumed), agent_event (agent activity nested in its event object), and
run_result (the terminal record, whose final text and usage are consumed
elsewhere). The session id is captured from a sessionId, session_id, or id
field; the taskId/task_id fields name the in-memory conversation, not the
session, and are never captured.
Within an agent_event, the nested event’s type and contentType drive the
mapping. Iteration boundaries, per-step usage, and done are consumed. A text
block’s streaming delta arrives on content_start and is consumed; the matching
content_end carries the complete text and becomes an agent
message. A tool call’s input arrives on content_start (recorded against its
toolCallId) and is resolved when the content_end carries the tool output,
whose success flag — or, for a batch, every item succeeding — sets the success
field. Tool names map as follows:
| Cline tool | Event |
|---|---|
| run_commands, execute_command, bash | one command per command (a commands array or single string) |
| read_files, read_file | one read per file (a files array or single path) |
| editor, write_to_file, replace_in_file, new_rule | write |
| apply_patch | one write per file named by the patch markers |
| search_files, search_codebase | search |
| list_files | list |
| skills, use_skill | skill |
| any other tool | unknown |
Older Cline versions emit a flat say/ask stream instead of the wrapped records.
That legacy stream is handled conservatively: a say text or completion result
and an ask followup become agent messages, reasoning is consumed, a diagnostic
say becomes an error, and everything else — including legacy
tool activity, which is not reconstructed — becomes an unknown event. Cline does
not emit a stable orchestration source in this version.
Goose Event Mapping
Goose is run with goose run --output-format stream-json, a line delimited JSON
stream of message, notification, error, and complete events. The session
id comes from the named session Goose is launched with. The complete event
carries usage and is consumed; it also flushes the final assistant text.
A message event carries a serialized conversation message whose content is an
array of blocks processed in order. Assistant text blocks are accumulated into
one pending message — Goose streams a message as cumulative-or-delta records
sharing one id, so a record that restates the pending text replaces it and any
other same-id record is appended — and flushed as an agent
message when other activity follows or the run completes. User text and
thinking/redactedThinking blocks carry no activity. A toolRequest block is
recorded against its call id and resolved when the matching toolResponse
arrives, whose toolResult.status sets the success field. Tool names, after
stripping an extension prefix such as developer__, map as follows:
| Goose tool | Event |
|---|---|
| text_editor | read or write, by its command |
| read | read |
| write, edit | write |
| grep, glob | search |
| list | list |
| shell | command, or a recognized file operation |
| load_skill, skill | skill |
| todo__* | consumed (internal session state, no event) |
| any other tool | unknown |
notification events are surfaced as unknown rather than parsed from prose, and
error events become error events.
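Two of the Goose rules above lend themselves to a small sketch: merging cumulative or delta text records, and normalizing tool names. Both helper names are hypothetical. Reading "restates the pending text" as "starts with the pending text" is an interpretation, and checking the todo__ prefix before stripping is an assumption about ordering.

```python
from typing import Optional


def merge_text(pending: str, incoming: str) -> str:
    """Combine a same-id text record with the pending assistant text."""
    if incoming.startswith(pending):
        return incoming  # cumulative record: it restates what is pending
    return pending + incoming  # delta record: append the new fragment


def normalize_tool_name(tool_name: str) -> Optional[str]:
    """Strip an extension prefix, or return None for consumed todo tools."""
    if tool_name.startswith("todo__"):
        return None  # internal session state, no event
    _, separator, rest = tool_name.partition("__")
    return rest if separator else tool_name


text = ""
for fragment in ["Hel", "Hello", " world"]:
    text = merge_text(text, fragment)
print(text)
print(normalize_tool_name("developer__shell"))
```

The merged text would then be flushed as a single agent message once other activity follows or the run completes.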
Kilo Code Event Mapping
Kilo Code is run with kilo run --format json and is built on OpenCode-style
runtime events (see OpenCode), so it shares that
stream shape: step_start/step_finish boundaries (consumed; the latter carries
usage), reasoning (consumed), text (an agent message),
self-contained tool_use events, and error events. The session id is captured
from sessionID. Kilo extends the OpenCode tool set with workflow and semantic
tools:
| Kilo tool | Event |
|---|---|
| task, agent_manager | orchestration when the spawned agent/session is identified, otherwise unknown |
| codesearch | search |
| all OpenCode tools | as in the OpenCode mapping |
OpenCode Event Mapping
OpenCode is run with opencode run --format json, a line delimited JSON stream
of step_start, text, tool_use, step_finish, reasoning, and error
events, with the session id at sessionID. Step boundaries carry usage and are
consumed, reasoning is model thinking and is consumed, and a text event becomes
an agent message.
A tool_use event is self-contained — it carries the tool name, input, and a
terminal status in one event, so no request/response correlation is needed — and
its completed/error status sets the success field. Tool names map as follows:
| OpenCode tool | Event |
|---|---|
| read | read |
| write, edit | write |
| apply_patch | one write per file named by the patch markers |
| grep, glob | search |
| bash | command, or a recognized file operation |
| skill | skill |
| lsp | search when it carries a query/symbol, otherwise unknown |
| any other tool (webfetch, websearch, todowrite, question, …) | unknown |
error events become error events. OpenCode does not expose
orchestration in this version.
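Because a tool_use event is self-contained, it can be mapped with a single lookup and no pending state, as this sketch suggests. The nesting under `part` and `state` is an assumption about the stream's shape, the output key names are illustrative, and the bash, apply_patch, and lsp rows are omitted for brevity.

```python
TOOL_EVENT = {
    "read": "read",
    "write": "write",
    "edit": "write",
    "grep": "search",
    "glob": "search",
    "skill": "skill",
}

STATUS_SUCCESS = {"completed": True, "error": False}


def map_tool_use(record: dict) -> dict:
    """Map one self-contained tool_use event to a normalized event."""
    part = record.get("part", {})
    event_type = TOOL_EVENT.get(part.get("tool"))
    if event_type is None:
        return {"type": "unknown", "raw": record}  # webfetch, todowrite, ...
    status = part.get("state", {}).get("status")
    return {
        "type": event_type,
        "session_id": record.get("sessionID"),
        "is_success": STATUS_SUCCESS.get(status),  # None when not terminal
    }


record = {
    "type": "tool_use",
    "sessionID": "ses_1",
    "part": {"tool": "edit", "state": {"status": "completed"}},
}
print(map_tool_use(record))
```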
Pi Event Mapping
Pi is run with pi --mode json --print, a line delimited JSON stream of lifecycle
markers (session, agent_start/agent_end, turn_start/turn_end,
message_start, message_update) and the two activity-bearing records,
message_end and tool_execution_end. The session id is captured from the
session record’s id. Lifecycle markers, the partial message_update deltas,
and turn_end (consumed for usage) emit no event.
A message_end record whose message role is assistant becomes an
agent message — its content is a string or an array of text
parts — while a non-assistant message (such as the echoed user prompt) is
ignored. A tool_execution_end record is self-contained and carries a
toolName, structured input, and a terminal status; its status or error field
sets the success field. Tool names are matched case-insensitively:
| Pi tool | Event |
|---|---|
| read | read |
| write, edit | write |
| search, grep, glob | search |
| list | list |
| bash, shell | command, or a recognized file operation |
| any other tool | unknown |
Pi does not emit a dedicated skill, warning, or orchestration source in this version.
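A minimal sketch of the two activity-bearing Pi records follows. The keys `message`, `role`, `content`, and `text` are assumptions about the record shape, and text parts are joined without a separator as a further assumption. Success handling and file operation classification of shell commands are omitted, and every other record type is treated as a lifecycle marker.

```python
from typing import Optional

PI_TOOL_EVENTS = {
    "read": "read",
    "write": "write",
    "edit": "write",
    "search": "search",
    "grep": "search",
    "glob": "search",
    "list": "list",
    "bash": "command",
    "shell": "command",
}


def map_pi_record(record: dict) -> Optional[dict]:
    """Return a normalized event, or None when the record emits no event."""
    kind = record.get("type")
    if kind == "message_end":
        message = record.get("message", {})
        if message.get("role") != "assistant":
            return None  # for example the echoed user prompt
        content = message.get("content", "")
        if isinstance(content, list):  # an array of text parts
            content = "".join(
                part.get("text", "") for part in content if isinstance(part, dict)
            )
        return {"type": "agent", "message": content}
    if kind == "tool_execution_end":
        name = str(record.get("toolName", "")).lower()  # case-insensitive match
        event_type = PI_TOOL_EVENTS.get(name)
        if event_type is None:
            return {"type": "unknown", "raw": record}
        return {"type": event_type}
    return None  # lifecycle markers and partial deltas in this sketch


print(map_pi_record({"type": "tool_execution_end", "toolName": "Bash"}))
```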