Harness

The verification infrastructure that makes agent work trustworthy.

Principles

Environment > instruction — the harness matters more than the prompt
Mechanical enforcement > documentation — git hooks and CI gates > prose
Separate builder from judge — self-evaluation is unreliable; spawn independent evaluators
Deterministic where possible — lint/format/push hardcoded, implementation agentic
Context is a public good — push knowledge into the repo; what agents can't access doesn't exist
Scoped rules over global rules — per-directory/file-pattern rules, not a global dump
Progressive disclosure — small entry points, load detail on demand
Accept and correct > prevent all errors — small steady error rate with rapid correction beats perfection-seeking that serializes everything

The 7-Layer Stack

A complete harness has these seven layers. Name them when grading, e.g. "we have layers 1-3, missing 4-7":

1. Boot: single command starts the app
2. Smoke: app is alive (health endpoint, `--version`), under 5 seconds
3. Interact: agent can exercise the app (Playwright, curl, shell scripts)
4. E2e: key user flows on real surfaces (not mocks)
5. Enforce: git hooks, CI gates, custom lint rules with agent-readable errors
6. Observe: structured logs, health endpoints, error traces queryable by agent
7. Isolate: per-worktree or per-container, so parallel agents don't collide
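
As an illustration of the Smoke layer, here is a minimal sketch of a health check with a time budget. The function names, the message wording, and the idea of a single health URL are assumptions made for this example, not part of the harness itself; only the 5-second budget comes from the list above.

```python
import time
import urllib.error
import urllib.request
from typing import Optional, Tuple

SMOKE_BUDGET_SECONDS = 5.0  # the "under 5 seconds" budget from the Smoke layer


def judge_smoke(status: Optional[int], elapsed: float,
                budget: float = SMOKE_BUDGET_SECONDS) -> Tuple[bool, str]:
    """Turn one health-check observation into (alive, agent-readable message)."""
    if status is None:
        return False, "smoke FAILED: no response from health endpoint"
    if not 200 <= status < 300:
        return False, f"smoke FAILED: health endpoint returned HTTP {status}"
    if elapsed > budget:
        return False, f"smoke FAILED: took {elapsed:.2f}s, budget is {budget:.0f}s"
    return True, f"smoke ok: HTTP {status} in {elapsed:.2f}s"


def smoke_check(url: str, budget: float = SMOKE_BUDGET_SECONDS) -> Tuple[bool, str]:
    """GET a (hypothetical) health URL once and judge the result."""
    started = time.monotonic()
    status: Optional[int] = None
    try:
        with urllib.request.urlopen(url, timeout=budget) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # the server answered, but with an error status
    except OSError:
        status = None  # refused, timed out, DNS failure, and similar
    return judge_smoke(status, time.monotonic() - started, budget)
```

Keeping the verdict logic in `judge_smoke` separate from the network call means the rule itself can be tested without a running app.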

Workflow

1. Audit

Grade the repo across four dimensions. For each: status (pass/partial/fail), evidence (file or command), gap (what's missing).

Bootable: one command starts the app and confirms it's running
Testable: tests hit the real running app, not just mocks. Detect `jest.mock` / `vi.mock` / `unittest.mock`; mock-only = zero
Observable: structured logs, health endpoints, or error traces queryable by agent
Verifiable: agent can produce evidence (screenshots, response logs, traces)
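
The mock detection under Testable can be approximated with a plain text scan. The sketch below is a heuristic under stated assumptions: it only looks for the three literal names listed above, and it treats a suite as "mock-only" when every test file mentions one of them. Both helper names are hypothetical, and the real scoring rule lives in references/grading.md.

```python
import re
from typing import Dict, List

# The three markers named in the Testable dimension.
MOCK_PATTERN = re.compile(r"\b(jest\.mock|vi\.mock|unittest\.mock)\b")


def find_mock_usage(source: str) -> List[str]:
    """Return the distinct mock markers that appear in one test file's text."""
    return sorted(set(MOCK_PATTERN.findall(source)))


def is_mock_only(test_files: Dict[str, str]) -> bool:
    """Heuristic: True when there are test files and every one of them uses mocks.

    `test_files` maps a file path to its source text.
    """
    if not test_files:
        return False
    return all(find_mock_usage(text) for text in test_files.values())
```

A text scan misses forms such as `from unittest import mock`, so a hit is a reason to read the test, not a final verdict.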

Use parallel subagents where available (one per dimension); otherwise audit sequentially. Grade using references/grading.md. Lowest dimension = overall grade.
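
The "lowest dimension = overall grade" rule is easy to state as code. This is a sketch that uses only the three statuses from the audit step; the actual scale is defined in references/grading.md and is not reproduced here. The class and function names are illustrative.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Status(IntEnum):
    FAIL = 0
    PARTIAL = 1
    PASS = 2


@dataclass(frozen=True)
class DimensionResult:
    name: str       # bootable / testable / observable / verifiable
    status: Status
    evidence: str   # file or command that supports the status
    gap: str        # what's missing; empty when nothing is


def overall_status(results: List[DimensionResult]) -> Status:
    """The weakest dimension decides the overall result."""
    if not results:
        raise ValueError("audit produced no dimension results")
    return min(result.status for result in results)
```

For example, a repo that passes three dimensions but is only partially observable comes out as `Status.PARTIAL` overall.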

2. Setup

Based on grade, build missing layers in priority order:

Boot → Smoke → Interact → E2e → Enforce → Observe → Isolate

Each piece should be independently useful. Stop after any step if remaining gaps aren't blocking. See references/setup-patterns.md for concrete patterns by project type.

3. Verify

Prove changes work on real surfaces. The agent that wrote the code must not verify it — spawn an independent evaluator. If subagents are unavailable, use a fresh session or hand off to human review. Do not self-certify with implementation context still loaded.

Boot the app, interact with it (Playwright CLI for UI, curl for APIs, CLI invocation)
Check nearby flows and likely regressions, not just the exact diff
Investigate anything odd instead of rationalizing it
Max 2 verification cycles — escalate after that, don't loop
Keep proof: commands run, screenshots, response logs, traces
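
The cycle limit and the "keep proof" rule can be sketched as a bounded loop. Here `evaluate` is a hypothetical callable that stands in for the independent evaluator (a subagent, a fresh session, or a human); the verdict strings and record shape are assumptions for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

MAX_CYCLES = 2  # "Max 2 verification cycles" from the list above


@dataclass
class VerificationRecord:
    verdict: str                                      # "verified" or "escalate"
    proof: List[str] = field(default_factory=list)    # evidence kept per cycle


def verify_with_limit(evaluate: Callable[[int], Tuple[bool, str]],
                      max_cycles: int = MAX_CYCLES) -> VerificationRecord:
    """Run the evaluator at most `max_cycles` times, then stop either way."""
    record = VerificationRecord(verdict="escalate")
    for cycle in range(1, max_cycles + 1):
        passed, evidence = evaluate(cycle)
        record.proof.append(f"cycle {cycle}: {evidence}")
        if passed:
            record.verdict = "verified"
            break
    return record
```

The loop never retries a third time: when both cycles fail, the caller gets `"escalate"` together with the evidence from each attempt.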

For subagent lanes, evaluator pattern, and cost trade-offs:

references/verification.md

4. Document

Keep the repo legible to humans and agents.

`AGENTS.md` ≈ 100 lines: table of contents, not encyclopedia. Points to `docs/`
`README.md`: human-facing overview, setup, usage
Scoped rules per directory/file pattern, not global dump
Update docs as part of the work, not after. Doc drift = test failure
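
One way to treat doc drift as a test failure is a small check that a test or CI gate can call. The sketch below assumes `AGENTS.md` uses Markdown links to point into the repo, and it reads "≈ 100 lines" loosely as a budget of 120; the function name, the budget, and the message wording are all assumptions.

```python
import re
from pathlib import Path
from typing import List

# Captures the target of a Markdown link, without any #fragment.
LINK_TARGET = re.compile(r"\]\(([^)#\s]+)")


def check_agents_md(repo_root, max_lines: int = 120) -> List[str]:
    """Return a list of problems; an empty list means the check passes."""
    root = Path(repo_root)
    agents = root / "AGENTS.md"
    if not agents.is_file():
        return ["AGENTS.md is missing"]

    text = agents.read_text(encoding="utf-8")
    problems: List[str] = []

    line_count = len(text.splitlines())
    if line_count > max_lines:
        problems.append(f"AGENTS.md has {line_count} lines (budget {max_lines})")

    for target in LINK_TARGET.findall(text):
        if "://" in target or target.startswith("mailto:"):
            continue  # external links are out of scope for this check
        if not (root / target).exists():
            problems.append(f"AGENTS.md links to missing path: {target}")
    return problems
```

Wiring it into the test suite is then one line, for example `assert check_agents_md(".") == []`.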

For AGENTS.md structure, scoped rules, and hygiene:

references/documentation.md

5. Specify (when warranted)

For non-trivial features, write a spec before coding. Not a throwaway PRD — a living contract.

Define what, why, acceptance criteria, non-goals
Define conformance tests or acceptance checks — the mechanical definition of "done"
Get human approval on spec before implementation when scope is non-trivial
Break into testable tasks
Capture decisions during implementation and flow them back to the spec
Reconcile spec ↔ code ↔ tests after implementation
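
To make "the mechanical definition of done" concrete, a spec can carry its acceptance checks as callables. This is a minimal sketch with hypothetical field names; how specs are actually stored and reconciled is covered in references/specifications.md.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Spec:
    what: str
    why: str
    non_goals: List[str] = field(default_factory=list)
    # name -> check that returns True when the criterion is met
    acceptance_checks: Dict[str, Callable[[], bool]] = field(default_factory=dict)
    # decisions captured during implementation and flowed back into the spec
    decisions: List[str] = field(default_factory=list)

    def failing_checks(self) -> List[str]:
        """Names of acceptance checks that do not pass yet."""
        return [name for name, check in self.acceptance_checks.items() if not check()]

    def is_done(self) -> bool:
        """Done means: at least one check exists and none of them fail."""
        return bool(self.acceptance_checks) and not self.failing_checks()
```

A spec with no checks is never "done" in this sketch, which keeps an empty contract from passing by default.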

For the SDD triangle, conformance tests, and the 70/30 rule:

references/specifications.md

Anti-Patterns

Mock-only tests — pass by construction, verify nothing
Self-evaluation — agent grades own work, always passes
Global AGENTS.md dump — fills context before work starts
Infinite retry loops — max 2 CI rounds, then hand back with partial result
All-agentic pipeline — lint/push/format should be deterministic
Context flooding — running full test suites floods context, agent hallucinates. Run targeted subsets, swallow passing output, surface only errors
Designing the perfect harness upfront — iterate from failures, not theory
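
The remedy for context flooding (swallow passing output, surface only errors) can be sketched as a thin wrapper around a test command. The helper name and the 40-line tail are assumptions for this example; the command is whatever targeted subset the project already runs.

```python
import subprocess
from typing import List


def run_quiet(command: List[str], tail_lines: int = 40) -> str:
    """Run a command; return '' when it passes, else a short failure report."""
    completed = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr so errors stay in order
        text=True,
    )
    if completed.returncode == 0:
        return ""  # passing output is dropped on purpose
    tail = completed.stdout.splitlines()[-tail_lines:]
    header = f"FAILED (exit {completed.returncode}): {' '.join(command)}"
    return header + "\n" + "\n".join(tail)
```

An agent that calls `run_quiet` sees nothing for a green run and only the last lines of output for a red one, which keeps the context small in both cases.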

Output

After any harness work, report:

Grade: before and after (using `references/grading.md` scale)
Dimensions: bootable / testable / observable / verifiable, each with status + evidence
What changed: specific files added or modified
Gaps: remaining gaps ranked by impact
Verify readiness: C+ = can verify, D/F = fix harness first
Confidence: `ship it` / `needs review` / `blocked`

References

`references/grading.md`: harness quality grading scale with mechanical criteria
`references/setup-patterns.md`: boot, smoke, e2e, isolation, enforcement patterns
`references/verification.md`: verify workflow, evaluator pattern, subagent lanes, cost
`references/documentation.md`: AGENTS.md rules, scoped rules, README patterns, docs hygiene
`references/specifications.md`: SDD triangle, conformance tests, acceptance criteria
`references/industry-examples.md`: OpenAI, Anthropic, Stripe, Uber, Datadog, Cursor patterns. Read when designing a harness strategy or justifying investment, not during routine work

Each reference file includes source URLs for the research and articles it draws from.
