test-harness
Test the Claude Code harness (the hooks, skills, settings, and CLAUDE.md that
steer an agent) as the assembled machine it ships as. vigiles gives three tiers,
cheapest first; this skill picks the right one, writes the test, and runs it.
The guiding rule: start at the cheapest tier that can answer the question, and
climb only when it genuinely can't. Two of the three tiers need no model and no
API key, so they run on every commit for free — reach for the paid real-model
tier only when the question actually requires a real model.
Step 0 — Pick the tier (the judgment call)
Match what you're testing to the cheapest tier that can answer it:
| What you're testing | Tier | Cost | API |
|---|---|---|---|
| "Does this hook block/allow event X?" (pure hook logic, every event type, incl. Edit/Write, PreCompact, SessionEnd, SubagentStop) | Unit | free, milliseconds, no `claude` | `runHook` |
| "Is the hook actually wired into the assembled plugin and does it fire in a real session?" | Deterministic | free, no API key (real `claude` CLI) | `runHarnessTest` |
| "Did the injected context (a SessionStart hook, a …) actually reach the model?" | Deterministic | free, no API key | `runHarnessTest` |
| "Does this skill's description trigger when it should (recall) and stay quiet when it shouldn't (precision)?" | Eval | paid (real model) | `measureTriggerRate` |
| "Does this harness change move what the agent does?" (A/B, signal vs noise) | Eval | paid (real model) | `runEval` |
Most harness questions — block/allow, wired-in, context-landed — never need a
model. Only "does the model trigger / behave differently" needs the eval tier.
If the unit and deterministic tiers can both answer it, prefer unit: it's
faster and reaches events the deterministic mock can't drive.
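The table above reduces to a small decision rule. The following is a minimal sketch of that rule, assuming a hypothetical `Question` type and `pickTier` helper; neither is part of vigiles, and the question labels are invented for illustration.

```typescript
// Hypothetical illustration of "cheapest tier that can answer it"; not a vigiles API.
type Tier = "unit" | "deterministic" | "eval";

type Question =
  | "hook-blocks-or-allows" // pure hook logic on a synthesized event
  | "hook-wired-and-fires"  // hook registered in the assembled plugin
  | "context-landed"        // injected text reached the model request
  | "skill-trigger-rate"    // recall / precision of a skill description
  | "behavior-ab";          // does the change move what the agent does?

// Tiers able to answer each question, listed cheapest first.
const capableTiers: Record<Question, readonly [Tier, ...Tier[]]> = {
  "hook-blocks-or-allows": ["unit", "deterministic"],
  "hook-wired-and-fires": ["deterministic"],
  "context-landed": ["deterministic"],
  "skill-trigger-rate": ["eval"],
  "behavior-ab": ["eval"],
};

// Always take the first (cheapest) capable tier.
function pickTier(question: Question): Tier {
  return capableTiers[question][0];
}
```

Listing the capable tiers cheapest-first keeps the "prefer unit when both can answer" rule in the data rather than in branching logic.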
Step 1 — Ensure vigiles is installed
Check whether `vigiles` is a dependency (`package.json`), and install it as a
dev dependency if not:
```bash
npm i -D vigiles   # or: pnpm add -D vigiles / yarn add -D vigiles
```
The deterministic tier additionally needs the `claude` CLI on PATH (no API key):
`npm i -g @anthropic-ai/claude-code`. The eval tier needs model auth. If the
`claude` CLI is missing, you can still write and run unit-tier tests.
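The dependency check can be scripted. This is a rough sketch assuming the standard `dependencies` / `devDependencies` fields of `package.json`; the `hasDependency` helper and the sample `pkg` object are hypothetical, not something vigiles provides.

```typescript
// Hypothetical helper; only the two standard package.json dependency fields are inspected.
interface PackageJson {
  dependencies?: Record<string, string>;
  devDependencies?: Record<string, string>;
}

function hasDependency(pkg: PackageJson, name: string): boolean {
  return (
    pkg.dependencies?.[name] !== undefined ||
    pkg.devDependencies?.[name] !== undefined
  );
}

// In a real script, pkg would be parsed from the project's package.json file.
const pkg: PackageJson = { devDependencies: { typescript: "^5.0.0" } };

// Suggest the install command from the step above only when vigiles is absent.
const installHint = hasDependency(pkg, "vigiles") ? null : "npm i -D vigiles";
```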
Step 2 — Locate the harness surface to test
Find what the project actually ships, in this order:
- `.claude/settings.json` / `.claude/settings.local.json`: inline `hooks`.
- `.claude-plugin/plugin.json`: a plugin manifest (`hooks`, `skills`, `agents`, `mcpServers`).
- `hooks/hooks.json`: the plugin hooks convention (e.g. obra/superpowers).
- `skills/<name>/SKILL.md`, `agents/<name>.md`, `commands/<name>.md`.
Pick one concrete thing to pin down: a specific `PreToolUse` hook, a specific
`SessionStart` injection, a specific skill.
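The search order above can be sketched as a first-match lookup. The paths are the fixed ones from the list; `firstSurface` and the injected `exists` callback are hypothetical, and the per-name `skills/`, `agents/`, and `commands/` entries are left out because they need a directory listing rather than a fixed path.

```typescript
// Hypothetical sketch; paths are taken from the list above, in the same order.
const surfaceCandidates = [
  ".claude/settings.json",
  ".claude/settings.local.json",
  ".claude-plugin/plugin.json",
  "hooks/hooks.json",
] as const;

// `exists` is injected so the lookup can be exercised without touching the disk;
// in a real script it could be node:fs existsSync.
function firstSurface(exists: (path: string) => boolean): string | undefined {
  return surfaceCandidates.find((candidate) => exists(candidate));
}

// Example with a stubbed file system that ships a plugin manifest and hooks.json.
const present = new Set<string>(["hooks/hooks.json", ".claude-plugin/plugin.json"]);
const found = firstSurface((path) => present.has(path));
```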
Step 3 — Write the test for the chosen tier
Unit (`runHook`): hand a hook a synthesized event and assert the decision.
```ts
import { runHook, assertHookBlocked } from "vigiles/testing";

// hookCommand is the hook under test, defined elsewhere in the test file.
const r = runHook(hookCommand, {
  hook_event_name: "PreToolUse",
  tool_name: "Bash",
  tool_input: { command: "git commit --no-verify" },
});
assertHookBlocked(r); // exit 2 / decision:"block" / permissionDecision:"deny"
```
Testing a hook you didn't write (a vendored third-party script)? Mark it
`{ trusted: false }` and it runs confined under bubblewrap by default (read-only
host, cleared env, no network egress). Add `{ recordEgress: true }` to also
record what it tries to reach: `r.egress` plus `assertNoEgress(r)` /
`assertEgressOnly(r, [...])` is the supply-chain check for "what does this skill
phone home to / install from?". When the hook's setup needs a real install,
`{ egress: { allow: ["registry.npmjs.org"] } }` lets it reach only that
allowlist (a packet-layer `nft` wall, so a raw socket off-list is dropped too),
yielding `r.egress` (allowed hosts) + `r.egressDropped`. Be precise about the
boundaries: see `docs/sandboxing.md` (it blocks destruction and
egress, but does NOT isolate reads of host files, and only under bwrap).
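For the unit-tier example, it can help to picture the logic of a hook that the synthesized event would exercise. This is only a sketch of such logic: the `decide` function, its types, and the message text are hypothetical, and a real hook script would read the event JSON from stdin and exit with the returned code.

```typescript
// Illustrative hook logic; field names mirror the synthesized event in the unit example.
interface PreToolUseEvent {
  hook_event_name: string;
  tool_name: string;
  tool_input: { command?: string };
}

interface HookResult {
  exitCode: number; // 2 is the blocking exit code noted in the unit example
  stderr: string;
}

function decide(event: PreToolUseEvent): HookResult {
  const command = event.tool_input.command ?? "";
  const isBash =
    event.hook_event_name === "PreToolUse" && event.tool_name === "Bash";
  // Match "git commit" followed anywhere later by the long --no-verify flag.
  // The short form (-n) is deliberately not covered in this sketch.
  if (isBash && /\bgit\s+commit\b.*--no-verify\b/.test(command)) {
    return { exitCode: 2, stderr: "Blocked: do not skip commit hooks." };
  }
  return { exitCode: 0, stderr: "" };
}

const blocked = decide({
  hook_event_name: "PreToolUse",
  tool_name: "Bash",
  tool_input: { command: "git commit --no-verify" },
});
const allowed = decide({
  hook_event_name: "PreToolUse",
  tool_name: "Bash",
  tool_input: { command: "git commit -m 'fix typo'" },
});
```

Testing both the blocked command and a safe sibling guards against a hook that blocks everything.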
Deterministic (`runHarnessTest`): load the real plugin, drive a scripted
mock model, assert the hook fired (or the context landed).
```ts
import {
  runHarnessTest,
  scriptModel,
  assertHookFired,
  assertRequestContains,
} from "vigiles/testing";

const r = await runHarnessTest({
  pluginDir: "./", // or { settings: { hooks: {...} } }
  transcript: true,
  model: scriptModel([{ text: "ok" }]),
});
assertHookFired(r, "SessionStart");
assertRequestContains(r, "expected injected text"); // did it actually land?
```
Eval (`runEval`): A/B the change on vs off across real-model trials, then
gate on significance, not eyeballing.
```ts
import { runEval, assertSignificant } from "vigiles/testing";

const report = await runEval({
  arms: { off: {}, on: { pluginDir: "./" } },
  task: "…a task the harness change should affect…",
  measure: (ctx) => ({ ok: /* a bare predicate over the trace */ true }),
  trials: 6,
  cache: "readwrite",
});
assertSignificant(report, { baseline: "off", arm: "on", metric: "ok" });
```
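The source does not say which statistical test `assertSignificant` applies. To illustrate why gating on significance differs from eyeballing, here is a sketch of a one-sided Fisher exact test on pass counts; the function names are hypothetical and this is not necessarily what vigiles computes.

```typescript
// Illustrative only: a one-sided Fisher exact test, not necessarily the vigiles method.
function logFactorial(n: number): number {
  let sum = 0;
  for (let i = 2; i <= n; i++) sum += Math.log(i);
  return sum;
}

function logChoose(n: number, k: number): number {
  return logFactorial(n) - logFactorial(k) - logFactorial(n - k);
}

// Probability of seeing at least `onPass` passes in the "on" arm if both arms
// actually shared the same pass rate (margins held fixed).
function fisherOneSided(
  onPass: number,
  onTrials: number,
  offPass: number,
  offTrials: number,
): number {
  const totalPass = onPass + offPass;
  const total = onTrials + offTrials;
  const maxOnPass = Math.min(onTrials, totalPass);
  let p = 0;
  for (let k = onPass; k <= maxOnPass; k++) {
    // Hypergeometric probability that k of the passes land in the "on" arm.
    p += Math.exp(
      logChoose(onTrials, k) +
        logChoose(offTrials, totalPass - k) -
        logChoose(total, totalPass),
    );
  }
  return p;
}

const clearEffect = fisherOneSided(6, 6, 0, 6); // 1/924, about 0.0011
const weakEffect = fisherOneSided(4, 6, 2, 6);  // 262/924, about 0.28
```

With six trials per arm, 4/6 versus 2/6 looks like an improvement but is well within noise, while 6/6 versus 0/6 is not.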
Step 4 — Run it
In a runner (node:test / vitest / jest) the tests are plain async functions. Or
use the zero-setup CLI, which discovers and runs the files:
```bash
npx vigiles test             # *.harness.{mjs,ts}: unit + deterministic, no API key
npx vigiles eval --trials=6  # *.eval.{mjs,ts}: real model (local / nightly, not CI)
```
Unit-tier `runHook` tests need no `claude` and always run, so write and run them
even with no `claude` installed. A tier that genuinely can't run reports a loud
`⊘ SKIPPED` (tallied separately, never a fake `✓`); a standalone script emits one
via `skip(reason)` from `vigiles/testing`. A skip passes by default, but in a CI
job that asserts the capability is present, run `vigiles test --no-skip` so a
skipped tier fails: a green-with-skips is untested surface. Keep unit +
deterministic tests in CI (free); run evals locally or on a schedule with auth.
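The skip policy described above can be modeled as a small function. This sketch assumes a hypothetical `Status` type and `exitCode` helper; it mirrors the described behavior of `--no-skip` and is not the vigiles implementation.

```typescript
// Hypothetical model of the skip policy; not the vigiles implementation.
type Status = "pass" | "fail" | "skip";

// A skip passes by default; with noSkip set, any skipped test fails the run.
function exitCode(results: Status[], noSkip: boolean): number {
  const failed = results.some((status) => status === "fail");
  const skipped = results.some((status) => status === "skip");
  return failed || (noSkip && skipped) ? 1 : 0;
}

const defaultRun = exitCode(["pass", "skip"], false); // skips tolerated
const strictRun = exitCode(["pass", "skip"], true);   // skips treated as failures
```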
When the user didn't say what to test
Don't ask them to specify — pick something real and demonstrate. Scan the
harness surface (Step 2), choose the cheapest meaningful test, write it, run it,
and show the result. Good default picks, in order:
- A `PreToolUse` hook → unit-test that it blocks the thing it's meant to block (and allows a safe sibling).
- A `SessionStart` hook that injects context → deterministic test that the text actually reaches the model (`assertRequestContains`).
- A skill → deterministic test that it resolves via `pluginDir`, then offer the paid `measureTriggerRate` eval as a follow-up.
Then say which tier you used and why, and offer to climb a tier if the cheaper
test can't fully answer their question.
Reference
The full guide — every tier, testing skills for real, "fired ≠ landed", the
safe-by-default sandbox, the coverage matrix, and how it compares to promptfoo —
is in `docs/harness-testing.md`.