test-harness

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese
Test the Claude Code harness — the hooks, skills, settings, and CLAUDE.md that steer an agent — as the assembled machine it ships as. vigiles gives three tiers, cheapest first; this skill picks the right one, writes the test, and runs it.
The guiding rule: start at the cheapest tier that can answer the question, and climb only when it genuinely can't. Two of the three tiers need no model and no API key, so they run on every commit for free — reach for the paid real-model tier only when the question actually requires a real model.
将Claude Code harness(用于引导Agent的hooks、skills、settings和CLAUDE.md)作为完整交付的系统进行测试。vigiles提供三个层级,成本从低到高;本skill会选择合适的层级,编写并运行测试。
核心原则:从能解决问题的最低成本层级开始,仅在该层级确实无法解决时再升级。三个层级中有两个无需模型和API密钥,可在每次提交时免费运行——只有当问题确实需要真实模型时,才使用付费的真实模型层级。

Step 0 — Pick the tier (the judgment call)

步骤0——选择层级(判断决策)

Match what you're testing to the cheapest tier that can answer it:
What you're testingTierCostAPI
"Does this hook block/allow event X?" — pure hook logic, every event type (incl. Edit/Write, PreCompact, SessionEnd, SubagentStop)Unitfree, milliseconds, no
claude
runHook
"Is the hook actually wired into the assembled plugin and does it fire in a real session?"Deterministicfree, no API key (real
claude
+ scripted mock)
runHarnessTest
+
scriptModel
"Did the injected context (a SessionStart hook, a
/command
) actually reach the model?"
Deterministicfree, no API key
runHarnessTest
trace.modelRequests
/
assertRequestContains
"Does this skill's description trigger when it should (recall) and stay quiet when it shouldn't (precision)?"Evalpaid (real model)
measureTriggerRate
(+
irrelevantPrompts
) →
assertTriggerRate({ min, maxFalsePositive })
"Does this harness change move what the agent does?" (A/B, signal vs noise)Evalpaid (real model)
runEval
+
assertSignificant
Most harness questions — block/allow, wired-in, context-landed — never need a model. Only "does the model trigger / behave differently" needs the eval tier.
If the unit and deterministic tiers can both answer it, prefer unit: it's faster and reaches events the deterministic mock can't drive.
根据测试内容匹配能解决问题的最低成本层级:
测试内容层级成本API
"此hook是否拦截/允许事件X?"——纯hook逻辑,所有事件类型(包括Edit/Write、PreCompact、SessionEnd、SubagentStop)Unit免费,毫秒级,无需
claude
runHook
"hook是否真正接入已组装的插件,并在真实会话中触发?"Deterministic免费,无需API密钥(真实
claude
+ 脚本化模拟)
runHarnessTest
+
scriptModel
"注入的上下文(SessionStart hook、
/command
)是否真正传递到模型?"
Deterministic免费,无需API密钥
runHarnessTest
trace.modelRequests
/
assertRequestContains
"此skill的描述触发是否符合预期(召回率),且不该触发时保持静默(精确率)?"Eval付费(真实模型)
measureTriggerRate
(+
irrelevantPrompts
) →
assertTriggerRate({ min, maxFalsePositive })
"此harness变更是否改变Agent的行为?"(A/B测试,信号与噪声)Eval付费(真实模型)
runEval
+
assertSignificant
大多数harness相关问题——拦截/允许、接入状态、上下文生效——都无需模型。只有“模型是否触发/行为是否改变”才需要eval层级。
如果unit和deterministic层级都能解决问题,优先选择unit:速度更快,且能覆盖deterministic模拟无法驱动的事件。

Step 1 — Ensure vigiles is installed

步骤1——确保vigiles已安装

Check whether
vigiles
is a dependency (
package.json
), and install it as a dev dependency if not:
bash
npm i -D vigiles    # or: pnpm add -D vigiles / yarn add -D vigiles
The deterministic tier additionally needs the
claude
CLI on PATH (no API key):
npm i -g @anthropic-ai/claude-code
. The eval tier needs model auth. If the
claude
CLI is missing, you can still write and run unit-tier tests.
检查
vigiles
是否为依赖项(
package.json
),如果未安装则将其作为开发依赖安装:
bash
npm i -D vigiles    # 或:pnpm add -D vigiles / yarn add -D vigiles
deterministic层级还需要
claude
CLI在PATH中(无需API密钥):
npm i -g @anthropic-ai/claude-code
。eval层级需要模型授权。如果缺少
claude
CLI,仍可编写并运行unit层级的测试。

Step 2 — Locate the harness surface to test

步骤2——定位要测试的harness组件

Find what the project actually ships, in this order:
  1. .claude/settings.json
    /
    .claude/settings.local.json
    — inline
    hooks
    .
  2. .claude-plugin/plugin.json
    — a plugin manifest (
    hooks
    ,
    skills
    ,
    agents
    ,
    mcpServers
    ).
  3. hooks/hooks.json
    — the plugin hooks convention (e.g. obra/superpowers).
  4. skills/<name>/SKILL.md
    ,
    agents/<name>.md
    ,
    commands/<name>.md
    .
Pick one concrete thing to pin down — a specific
PreToolUse
hook, a specific
SessionStart
injection, a specific skill.
按以下顺序查找项目实际交付的内容:
  1. .claude/settings.json
    /
    .claude/settings.local.json
    ——内联
    hooks
  2. .claude-plugin/plugin.json
    ——插件清单(
    hooks
    ,
    skills
    ,
    agents
    ,
    mcpServers
    )。
  3. hooks/hooks.json
    ——插件hooks约定(如obra/superpowers)。
  4. skills/<name>/SKILL.md
    ,
    agents/<name>.md
    ,
    commands/<name>.md
选择一个具体的测试对象——特定的
PreToolUse
hook、特定的
SessionStart
注入、特定的skill。

Step 3 — Write the test for the chosen tier

步骤3——为所选层级编写测试

Unit (
runHook
)
— hand a hook a synthesized event, assert the decision:
ts
import { runHook, assertHookBlocked } from "vigiles/testing";

const r = runHook(hookCommand, {
  hook_event_name: "PreToolUse",
  tool_name: "Bash",
  tool_input: { command: "git commit --no-verify" },
});
assertHookBlocked(r); // exit 2 / decision:"block" / permissionDecision:"deny"
Testing a hook you didn't write (a vendored third-party script)? Mark it
{ trusted: false }
and it runs confined under bubblewrap by default (read-only host, cleared env, no network egress). Add
{ recordEgress: true }
to also record what it tries to reach —
r.egress
plus
assertNoEgress(r)
/
assertEgressOnly(r, [...])
— the supply-chain check for "what does this skill phone home to / install from?". When the hook's setup needs a real install,
{ egress: { allow: ["registry.npmjs.org"] } }
lets it reach only that allowlist (a packet-layer
nft
wall, so a raw socket off-list is dropped too) →
r.egress
(allowed hosts) +
r.egressDropped
. Be precise about the boundaries: see
docs/sandboxing.md
(it blocks destruction and egress, but does NOT isolate reads of host files, and only under bwrap).
Deterministic (
runHarnessTest
)
— load the real plugin, drive a scripted mock model, assert the hook fired (or the context landed):
ts
import {
  runHarnessTest,
  scriptModel,
  assertHookFired,
  assertRequestContains,
} from "vigiles/testing";

const r = await runHarnessTest({
  pluginDir: "./", // or { settings: { hooks: {...} } }
  transcript: true,
  model: scriptModel([{ text: "ok" }]),
});
assertHookFired(r, "SessionStart");
assertRequestContains(r, "expected injected text"); // did it actually land?
Eval (
runEval
)
— A/B the change on vs off across real-model trials, then gate on significance, not eyeballing:
ts
import { runEval, assertSignificant } from "vigiles/testing";

const report = await runEval({
  arms: { off: {}, on: { pluginDir: "./" } },
  task: "…a task the harness change should affect…",
  measure: (ctx) => ({ ok: /* a bare predicate over the trace */ true }),
  trials: 6,
  cache: "readwrite",
});
assertSignificant(report, { baseline: "off", arm: "on", metric: "ok" });
Unit(
runHook
——为hook提供合成事件,断言决策结果:
ts
import { runHook, assertHookBlocked } from "vigiles/testing";

const r = runHook(hookCommand, {
  hook_event_name: "PreToolUse",
  tool_name: "Bash",
  tool_input: { command: "git commit --no-verify" },
});
assertHookBlocked(r); // exit 2 / decision:"block" / permissionDecision:"deny"
测试未自行编写的hook(第三方脚本)?标记为
{ trusted: false }
,默认会在bubblewrap隔离环境中运行(只读主机、清空环境、无网络出口)。添加
{ recordEgress: true }
还可记录它尝试访问的地址——
r.egress
配合
assertNoEgress(r)
/
assertEgressOnly(r, [...])
——用于供应链检查“此skill会向哪些地址发送请求/从哪里安装依赖?”。当hook的安装需要真实环境时,
{ egress: { allow: ["registry.npmjs.org"] } }
可允许它仅访问该白名单(基于数据包层的
nft
防火墙,不在白名单内的原始套接字请求会被拦截)→
r.egress
(允许的主机)+
r.egressDropped
。请明确边界限制:详见
docs/sandboxing.md
(它会拦截破坏性操作和网络出口,但不会隔离对主机文件的读取,且仅在bwrap环境下生效)。
Deterministic(
runHarnessTest
——加载真实插件,驱动脚本化模拟模型,断言hook是否触发(或上下文是否生效):
ts
import {
  runHarnessTest,
  scriptModel,
  assertHookFired,
  assertRequestContains,
} from "vigiles/testing";

const r = await runHarnessTest({
  pluginDir: "./", // 或 { settings: { hooks: {...} } }
  transcript: true,
  model: scriptModel([{ text: "ok" }]),
});
assertHookFired(r, "SessionStart");
assertRequestContains(r, "expected injected text"); // 上下文是否真正生效?
Eval(
runEval
——在真实模型试验中对比变更开启/关闭的A/B效果,然后基于显著性而非主观判断验证结果:
ts
import { runEval, assertSignificant } from "vigiles/testing";

const report = await runEval({
  arms: { off: {}, on: { pluginDir: "./" } },
  task: "…harness变更应影响的任务…",
  measure: (ctx) => ({ ok: /* 基于追踪记录的布尔判断 */ true }),
  trials: 6,
  cache: "readwrite",
});
assertSignificant(report, { baseline: "off", arm: "on", metric: "ok" });

Step 4 — Run it

步骤4——运行测试

In a runner (node:test / vitest / jest) the tests are plain async functions. Or use the zero-setup CLI, which discovers and runs the files:
bash
npx vigiles test                 # *.harness.{mjs,ts} — unit + deterministic, no API key
npx vigiles eval --trials=6      # *.eval.{mjs,ts} — real model (local / nightly, not CI)
Unit-tier
runHook
tests need no
claude
and always run — write and run them even with no
claude
installed. A tier that genuinely can't run reports a loud
⊘ SKIPPED
(tallied separately, never a fake
); a standalone script emits one via
skip(reason)
from
vigiles/testing
. A skip passes by default, but in a CI job that asserts the capability is present, run
vigiles test --no-skip
so a skipped tier fails — a green-with-skips is untested surface. Keep unit + deterministic tests in CI (free); run evals locally or on a schedule with auth.
在测试运行器(node:test / vitest / jest)中,测试是普通的异步函数。也可使用零配置CLI,它会自动发现并运行测试文件:
bash
npx vigiles test                 # *.harness.{mjs,ts} —— unit + deterministic,无需API密钥
npx vigiles eval --trials=6      # *.eval.{mjs,ts} —— 真实模型(本地/夜间环境,不建议在CI中运行)
Unit层级的
runHook
测试无需
claude
始终可运行——即使未安装
claude
也可编写并运行。确实无法运行的层级会显示醒目的
⊘ SKIPPED
(单独统计,不会显示为虚假的
);独立脚本可通过
vigiles/testing
中的
skip(reason)
触发跳过。跳过默认视为通过,但在需要验证功能存在的CI任务中,请运行**
vigiles test --no-skip
**,这样跳过的层级会导致失败——带跳过的绿色状态意味着存在未测试的内容。将unit + deterministic测试放在CI中(免费);在本地或定时任务中运行需要授权的eval测试。

When the user didn't say what to test

当用户未指定测试内容时

Don't ask them to specify — pick something real and demonstrate. Scan the harness surface (Step 2), choose the cheapest meaningful test, write it, run it, and show the result. Good default picks, in order:
  1. A
    PreToolUse
    hook → unit-test that it blocks the thing it's meant to block (and allows a safe sibling).
  2. A
    SessionStart
    hook that injects context → deterministic test that the text actually reaches the model (
    assertRequestContains
    ).
  3. A skill → deterministic test that it resolves via
    pluginDir
    , then offer the paid
    measureTriggerRate
    eval as a follow-up.
Then say which tier you used and why, and offer to climb a tier if the cheaper test can't fully answer their question.
不要让用户明确指定——选择真实场景进行演示。扫描harness组件(步骤2),选择成本最低且有意义的测试,编写并运行,展示结果。推荐的默认选择顺序:
  1. PreToolUse
    hook → 单元测试验证它是否拦截预期内容(并允许安全操作)。
  2. 注入上下文的
    SessionStart
    hook → deterministic测试验证文本是否真正传递到模型(
    assertRequestContains
    )。
  3. skill → deterministic测试验证它是否通过
    pluginDir
    加载,然后提供付费的
    measureTriggerRate
    eval作为后续选项。
然后说明使用的层级及原因,并提出如果低成本测试无法完全解决问题,可升级层级。

Reference

参考资料

The full guide — every tier, testing skills for real, "fired ≠ landed", the safe-by-default sandbox, the coverage matrix, and how it compares to promptfoo — is in
docs/harness-testing.md
.
完整指南——包括所有层级、skill的真实测试、“触发≠生效”、默认安全沙箱、覆盖矩阵以及与promptfoo的对比——详见
docs/harness-testing.md