test-harness

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

Test the Claude Code harness — the hooks, skills, settings, and CLAUDE.md that steer an agent — as the assembled machine it ships as. vigiles gives three tiers, cheapest first; this skill picks the right one, writes the test, and runs it.

The guiding rule: start at the cheapest tier that can answer the question, and climb only when it genuinely can't. Two of the three tiers need no model and no API key, so they run on every commit for free — reach for the paid real-model tier only when the question actually requires a real model.

将Claude Code harness（用于引导Agent的hooks、skills、settings和CLAUDE.md）作为完整交付的系统进行测试。vigiles提供三个层级，成本从低到高；本skill会选择合适的层级，编写并运行测试。

核心原则：从能解决问题的最低成本层级开始，仅在该层级确实无法解决时再升级。三个层级中有两个无需模型和API密钥，可在每次提交时免费运行——只有当问题确实需要真实模型时，才使用付费的真实模型层级。

Step 0 — Pick the tier (the judgment call)

步骤0——选择层级（判断决策）

Match what you're testing to the cheapest tier that can answer it:

What you're testing	Tier	Cost	API
"Does this hook block/allow event X?" — pure hook logic, every event type (incl. Edit/Write, PreCompact, SessionEnd, SubagentStop)	Unit	free, milliseconds, no `claude`	`runHook`
"Is the hook actually wired into the assembled plugin and does it fire in a real session?"	Deterministic	free, no API key (real `claude` + scripted mock)	`runHarnessTest` + `scriptModel`
"Did the injected context (a SessionStart hook, a `/command` ) actually reach the model?"	Deterministic	free, no API key	`runHarnessTest` → `trace.modelRequests` / `assertRequestContains`
"Does this skill's description trigger when it should (recall) and stay quiet when it shouldn't (precision)?"	Eval	paid (real model)	`measureTriggerRate` (+ `irrelevantPrompts` ) → `assertTriggerRate({ min, maxFalsePositive })`
"Does this harness change move what the agent does?" (A/B, signal vs noise)	Eval	paid (real model)	`runEval` + `assertSignificant`

Most harness questions — block/allow, wired-in, context-landed — never need a model. Only "does the model trigger / behave differently" needs the eval tier.

If the unit and deterministic tiers can both answer it, prefer unit: it's faster and reaches events the deterministic mock can't drive.

根据测试内容匹配能解决问题的最低成本层级：

测试内容	层级	成本	API
"此hook是否拦截/允许事件X？"——纯hook逻辑，所有事件类型（包括Edit/Write、PreCompact、SessionEnd、SubagentStop）	Unit	免费，毫秒级，无需 `claude`	`runHook`
"hook是否真正接入已组装的插件，并在真实会话中触发？"	Deterministic	免费，无需API密钥（真实 `claude` + 脚本化模拟）	`runHarnessTest` + `scriptModel`
"注入的上下文（SessionStart hook、 `/command` ）是否真正传递到模型？"	Deterministic	免费，无需API密钥	`runHarnessTest` → `trace.modelRequests` / `assertRequestContains`
"此skill的描述触发是否符合预期（召回率），且不该触发时保持静默（精确率）？"	Eval	付费（真实模型）	`measureTriggerRate` (+ `irrelevantPrompts` ) → `assertTriggerRate({ min, maxFalsePositive })`
"此harness变更是否改变Agent的行为？"（A/B测试，信号与噪声）	Eval	付费（真实模型）	`runEval` + `assertSignificant`

大多数harness相关问题——拦截/允许、接入状态、上下文生效——都无需模型。只有“模型是否触发/行为是否改变”才需要eval层级。

如果unit和deterministic层级都能解决问题，优先选择unit：速度更快，且能覆盖deterministic模拟无法驱动的事件。

Step 1 — Ensure vigiles is installed

步骤1——确保vigiles已安装

Check whether

vigiles

is a dependency (

package.json

), and install it as a dev dependency if not:

bash

npm i -D vigiles    # or: pnpm add -D vigiles / yarn add -D vigiles

The deterministic tier additionally needs the

claude

CLI on PATH (no API key):

npm i -g @anthropic-ai/claude-code

. The eval tier needs model auth. If the

claude

CLI is missing, you can still write and run unit-tier tests.

检查

vigiles

是否为依赖项（

package.json

），如果未安装则将其作为开发依赖安装：

bash

npm i -D vigiles    # 或：pnpm add -D vigiles / yarn add -D vigiles

deterministic层级还需要

claude

CLI在PATH中（无需API密钥）：

npm i -g @anthropic-ai/claude-code

。eval层级需要模型授权。如果缺少

claude

CLI，仍可编写并运行unit层级的测试。

Step 2 — Locate the harness surface to test

步骤2——定位要测试的harness组件

Find what the project actually ships, in this order:

.claude/settings.json

.claude/settings.local.json

— inline

hooks

.claude-plugin/plugin.json

— a plugin manifest (

hooks

skills

agents

mcpServers

```
hooks/hooks.json
```
— the plugin hooks convention (e.g. obra/superpowers).

skills/<name>/SKILL.md

agents/<name>.md

commands/<name>.md

Pick one concrete thing to pin down — a specific

PreToolUse

hook, a specific

SessionStart

injection, a specific skill.

按以下顺序查找项目实际交付的内容：

.claude/settings.json

.claude/settings.local.json

——内联

hooks

。

.claude-plugin/plugin.json

——插件清单（

hooks

skills

agents

mcpServers

）。

```
hooks/hooks.json
```
——插件hooks约定（如obra/superpowers）。

skills/<name>/SKILL.md

agents/<name>.md

commands/<name>.md

。

选择一个具体的测试对象——特定的

PreToolUse

hook、特定的

SessionStart

注入、特定的skill。

Step 3 — Write the test for the chosen tier

步骤3——为所选层级编写测试

Unit (
runHook
) — hand a hook a synthesized event, assert the decision:

import { runHook, assertHookBlocked } from "vigiles/testing";

const r = runHook(hookCommand, {
  hook_event_name: "PreToolUse",
  tool_name: "Bash",
  tool_input: { command: "git commit --no-verify" },
});
assertHookBlocked(r); // exit 2 / decision:"block" / permissionDecision:"deny"

Testing a hook you didn't write (a vendored third-party script)? Mark it

{ trusted: false }

and it runs confined under bubblewrap by default (read-only host, cleared env, no network egress). Add

{ recordEgress: true }

to also record what it tries to reach —

r.egress

plus

assertNoEgress(r)

assertEgressOnly(r, [...])

— the supply-chain check for "what does this skill phone home to / install from?". When the hook's setup needs a real install,

{ egress: { allow: ["registry.npmjs.org"] } }

lets it reach only that allowlist (a packet-layer

nft

wall, so a raw socket off-list is dropped too) →

r.egress

(allowed hosts) +

r.egressDropped

. Be precise about the boundaries: see

docs/sandboxing.md

(it blocks destruction and egress, but does NOT isolate reads of host files, and only under bwrap).

Deterministic (
runHarnessTest
) — load the real plugin, drive a scripted mock model, assert the hook fired (or the context landed):

import {
  runHarnessTest,
  scriptModel,
  assertHookFired,
  assertRequestContains,
} from "vigiles/testing";

const r = await runHarnessTest({
  pluginDir: "./", // or { settings: { hooks: {...} } }
  transcript: true,
  model: scriptModel([{ text: "ok" }]),
});
assertHookFired(r, "SessionStart");
assertRequestContains(r, "expected injected text"); // did it actually land?

Eval (
runEval
) — A/B the change on vs off across real-model trials, then gate on significance, not eyeballing:

import { runEval, assertSignificant } from "vigiles/testing";

const report = await runEval({
  arms: { off: {}, on: { pluginDir: "./" } },
  task: "…a task the harness change should affect…",
  measure: (ctx) => ({ ok: /* a bare predicate over the trace */ true }),
  trials: 6,
  cache: "readwrite",
});
assertSignificant(report, { baseline: "off", arm: "on", metric: "ok" });

Unit（
runHook
）——为hook提供合成事件，断言决策结果：

import { runHook, assertHookBlocked } from "vigiles/testing";

const r = runHook(hookCommand, {
  hook_event_name: "PreToolUse",
  tool_name: "Bash",
  tool_input: { command: "git commit --no-verify" },
});
assertHookBlocked(r); // exit 2 / decision:"block" / permissionDecision:"deny"

测试未自行编写的hook（第三方脚本）？标记为

{ trusted: false }

，默认会在bubblewrap隔离环境中运行（只读主机、清空环境、无网络出口）。添加

{ recordEgress: true }

还可记录它尝试访问的地址——

r.egress

配合

assertNoEgress(r)

assertEgressOnly(r, [...])

——用于供应链检查“此skill会向哪些地址发送请求/从哪里安装依赖？”。当hook的安装需要真实环境时，

{ egress: { allow: ["registry.npmjs.org"] } }

可允许它仅访问该白名单（基于数据包层的

nft

防火墙，不在白名单内的原始套接字请求会被拦截）→

r.egress

（允许的主机）+

r.egressDropped

。请明确边界限制：详见

docs/sandboxing.md

（它会拦截破坏性操作和网络出口，但不会隔离对主机文件的读取，且仅在bwrap环境下生效）。

Deterministic（
runHarnessTest
）——加载真实插件，驱动脚本化模拟模型，断言hook是否触发（或上下文是否生效）：

import {
  runHarnessTest,
  scriptModel,
  assertHookFired,
  assertRequestContains,
} from "vigiles/testing";

const r = await runHarnessTest({
  pluginDir: "./", // 或 { settings: { hooks: {...} } }
  transcript: true,
  model: scriptModel([{ text: "ok" }]),
});
assertHookFired(r, "SessionStart");
assertRequestContains(r, "expected injected text"); // 上下文是否真正生效？

Eval（
runEval
）——在真实模型试验中对比变更开启/关闭的A/B效果，然后基于显著性而非主观判断验证结果：

import { runEval, assertSignificant } from "vigiles/testing";

const report = await runEval({
  arms: { off: {}, on: { pluginDir: "./" } },
  task: "…harness变更应影响的任务…",
  measure: (ctx) => ({ ok: /* 基于追踪记录的布尔判断 */ true }),
  trials: 6,
  cache: "readwrite",
});
assertSignificant(report, { baseline: "off", arm: "on", metric: "ok" });

Step 4 — Run it

步骤4——运行测试

In a runner (node:test / vitest / jest) the tests are plain async functions. Or use the zero-setup CLI, which discovers and runs the files:

bash

npx vigiles test                 # *.harness.{mjs,ts} — unit + deterministic, no API key
npx vigiles eval --trials=6      # *.eval.{mjs,ts} — real model (local / nightly, not CI)

Unit-tier

runHook

tests need no

claude

and always run — write and run them even with no

claude

installed. A tier that genuinely can't run reports a loud

⊘ SKIPPED

(tallied separately, never a fake

✓

); a standalone script emits one via

skip(reason)

from

vigiles/testing

. A skip passes by default, but in a CI job that asserts the capability is present, run vigiles test --no-skip
so a skipped tier fails — a green-with-skips is untested surface. Keep unit + deterministic tests in CI (free); run evals locally or on a schedule with auth.

在测试运行器（node:test / vitest / jest）中，测试是普通的异步函数。也可使用零配置CLI，它会自动发现并运行测试文件：

bash

npx vigiles test                 # *.harness.{mjs,ts} —— unit + deterministic，无需API密钥
npx vigiles eval --trials=6      # *.eval.{mjs,ts} —— 真实模型（本地/夜间环境，不建议在CI中运行）

Unit层级的

runHook

测试无需

claude

，始终可运行——即使未安装

claude

也可编写并运行。确实无法运行的层级会显示醒目的

⊘ SKIPPED

（单独统计，不会显示为虚假的

✓

）；独立脚本可通过

vigiles/testing

中的

skip(reason)

触发跳过。跳过默认视为通过，但在需要验证功能存在的CI任务中，请运行**

vigiles test --no-skip

**，这样跳过的层级会导致失败——带跳过的绿色状态意味着存在未测试的内容。将unit + deterministic测试放在CI中（免费）；在本地或定时任务中运行需要授权的eval测试。

When the user didn't say what to test

当用户未指定测试内容时

Don't ask them to specify — pick something real and demonstrate. Scan the harness surface (Step 2), choose the cheapest meaningful test, write it, run it, and show the result. Good default picks, in order:

A
```
PreToolUse
```
hook → unit-test that it blocks the thing it's meant to block (and allows a safe sibling).
A
```
SessionStart
```
hook that injects context → deterministic test that the text actually reaches the model (
```
assertRequestContains
```
).
A skill → deterministic test that it resolves via
```
pluginDir
```
, then offer the paid
```
measureTriggerRate
```
eval as a follow-up.

Then say which tier you used and why, and offer to climb a tier if the cheaper test can't fully answer their question.

不要让用户明确指定——选择真实场景进行演示。扫描harness组件（步骤2），选择成本最低且有意义的测试，编写并运行，展示结果。推荐的默认选择顺序：

```
PreToolUse
```
hook → 单元测试验证它是否拦截预期内容（并允许安全操作）。
注入上下文的
```
SessionStart
```
hook → deterministic测试验证文本是否真正传递到模型（
```
assertRequestContains
```
）。
skill → deterministic测试验证它是否通过
```
pluginDir
```
加载，然后提供付费的
```
measureTriggerRate
```
eval作为后续选项。

然后说明使用的层级及原因，并提出如果低成本测试无法完全解决问题，可升级层级。

Reference

参考资料

The full guide — every tier, testing skills for real, "fired ≠ landed", the safe-by-default sandbox, the coverage matrix, and how it compares to promptfoo — is in

docs/harness-testing.md

完整指南——包括所有层级、skill的真实测试、“触发≠生效”、默认安全沙箱、覆盖矩阵以及与promptfoo的对比——详见

docs/harness-testing.md

。