Harness
The verification infrastructure that makes agent work trustworthy.
Principles
- Environment > instruction — the harness matters more than the prompt
- Mechanical enforcement > documentation — git hooks and CI gates > prose
- Separate builder from judge — self-evaluation is unreliable; spawn independent evaluators
- Deterministic where possible — lint/format/push hardcoded, implementation agentic
- Context is a public good — push knowledge into the repo; what agents can't access doesn't exist
- Scoped rules over global rules — per-directory/file-pattern rules, not a global dump
- Progressive disclosure — small entry points, load detail on demand
- Accept and correct > prevent all errors — small steady error rate with rapid correction beats perfection-seeking that serializes everything
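The "scoped rules over global rules" principle can be sketched as a lookup that returns only the rules relevant to the file being edited. This is a minimal illustration, not a prescribed format: the patterns, the rule wording, and the `rules_for` helper are all hypothetical.

```python
from fnmatch import fnmatch

# Hypothetical scoped rules: glob pattern -> rule text. The patterns and the
# rule wording are illustrative assumptions, not taken from a real repo.
SCOPED_RULES = {
    "src/api/*": "Every handler returns structured JSON errors.",
    "src/ui/*": "Verify UI changes in a real browser session.",
    "*.sql": "Migrations must be reversible.",
}


def rules_for(path: str, scoped_rules: dict = SCOPED_RULES) -> list:
    """Return only the rules whose pattern matches the file being edited."""
    return [rule for pattern, rule in scoped_rules.items() if fnmatch(path, pattern)]
```

An agent editing `src/api/users.py` would load one rule instead of all three, which keeps unrelated guidance out of its context.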
The 7-Layer Stack
Every harness has these layers. Name them when grading — "we have layers 1-3, missing 4-7":
1. Boot: single command starts the app
2. Smoke: app is alive (health endpoint, `--version`). Under 5 seconds
3. Interact: agent can exercise the app (Playwright, curl, shell scripts)
4. E2e: key user flows on real surfaces (not mocks)
5. Enforce: git hooks, CI gates, custom lint rules with agent-readable errors
6. Observe: structured logs, health endpoints, error traces queryable by agent
7. Isolate: per-worktree or per-container, parallel agents don't collide
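As one possible shape for the Smoke layer, the sketch below runs a command and reports whether it exits cleanly inside the 5-second budget. The `smoke_check` helper is hypothetical, and the running Python interpreter stands in for the app's own `--version` command.

```python
import subprocess
import sys
import time


def smoke_check(command: list, budget_seconds: float = 5.0) -> bool:
    """Return True if the command exits 0 within the time budget."""
    started = time.monotonic()
    try:
        result = subprocess.run(
            command,
            capture_output=True,
            timeout=budget_seconds,
            check=False,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Too slow, or the binary could not be started: treat as not alive.
        return False
    elapsed = time.monotonic() - started
    return result.returncode == 0 and elapsed <= budget_seconds


if __name__ == "__main__":
    # Assumption: the Python interpreter is a stand-in for the real app binary.
    print(smoke_check([sys.executable, "--version"]))
```

A health-endpoint variant would follow the same pattern with an HTTP request in place of the subprocess call.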
Workflow
1. Audit
Grade the repo across four dimensions. For each: `status` (pass/partial/fail), `evidence` (file or command), `gap` (what's missing).
- Bootable: one command starts the app and confirms it's running
- Testable: tests hit the real running app, not just mocks. Detect `jest.mock` / `vi.mock` / `unittest.mock`; mock-only = zero
- Observable: structured logs, health endpoints, or error traces queryable by agent
- Verifiable: agent can produce evidence (screenshots, response logs, traces)
Use parallel subagents where available (one per dimension); otherwise audit sequentially. Grade using `references/grading.md`. Lowest dimension = overall grade.
2. Setup
Based on grade, build missing layers in priority order:
Boot → Smoke → Interact → E2e → Enforce → Observe → Isolate
Each piece should be independently useful. Stop after any step if remaining gaps aren't blocking. See `references/setup-patterns.md` for concrete patterns by project type.
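The priority order can be expressed as a small helper that lists what is still missing and picks the next layer to build. The layer names come from the order above; the function names are hypothetical.

```python
from typing import Optional

# Priority order taken from the Setup step.
LAYER_ORDER = ["Boot", "Smoke", "Interact", "E2e", "Enforce", "Observe", "Isolate"]


def missing_layers(present: set) -> list:
    """Return the layers still to build, highest priority first."""
    unknown = present - set(LAYER_ORDER)
    if unknown:
        raise ValueError(f"unknown layers: {sorted(unknown)}")
    return [layer for layer in LAYER_ORDER if layer not in present]


def next_layer(present: set) -> Optional[str]:
    """Return the next layer to build, or None when the stack is complete."""
    remaining = missing_layers(present)
    return remaining[0] if remaining else None
```

Whether to stop after a given step remains a judgment call about which gaps are blocking, so the helper only orders the work and does not decide when to stop.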
3. Verify
Prove changes work on real surfaces. The agent that wrote the code must not verify it — spawn an independent evaluator. If subagents are unavailable, use a fresh session or hand off to human review. Do not self-certify with implementation context still loaded.
- Boot the app, interact with it (Playwright CLI for UI, curl for APIs, CLI invocation)
- Check nearby flows and likely regressions, not just the exact diff
- Investigate anything odd instead of rationalizing it
- Max 2 verification cycles — escalate after that, don't loop
- Keep proof: commands run, screenshots, response logs, traces
For subagent lanes, evaluator pattern, and cost trade-offs: `references/verification.md`
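The two-cycle limit and the "keep proof" rule might look like the loop below. It assumes the independent evaluator and the builder are exposed as two callables, `evaluate` and `fix`. Both names and the returned dictionary shape are hypothetical.

```python
from typing import Callable, Tuple

MAX_CYCLES = 2  # "Max 2 verification cycles" from the list above.


def run_verification(
    evaluate: Callable[[], Tuple[bool, str]],
    fix: Callable[[str], None],
) -> dict:
    """Run up to MAX_CYCLES evaluate/fix rounds, keeping proof from each round."""
    proof = []
    for cycle in range(1, MAX_CYCLES + 1):
        # The evaluator is assumed to run without the builder's context.
        passed, evidence = evaluate()
        proof.append(f"cycle {cycle}: {evidence}")
        if passed:
            return {"outcome": "verified", "cycles": cycle, "proof": proof}
        if cycle < MAX_CYCLES:
            fix(evidence)  # hand the failure evidence back to the builder
    # Budget exhausted: escalate instead of looping.
    return {"outcome": "escalate", "cycles": MAX_CYCLES, "proof": proof}
```

The `proof` list is returned on both paths, so an escalation still carries the commands and responses collected so far.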
4. Document
Keep the repo legible to humans and agents.
- `AGENTS.md` ≈ 100 lines: table of contents, not encyclopedia. Points to `docs/`
- `README.md`: human-facing overview, setup, usage
- Scoped rules per directory/file pattern, not global dump
- Update docs as part of the work, not after. Doc drift = test failure
For AGENTS.md structure, scoped rules, and hygiene: `references/documentation.md`
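One way to treat doc drift as a test failure is a mechanical check on `AGENTS.md` itself. The sketch below is an assumption-heavy illustration: the 120-line ceiling is an arbitrary reading of "≈ 100 lines", and `check_agents_md` is a hypothetical helper.

```python
def check_agents_md(text: str, max_lines: int = 120) -> list:
    """Return a list of problems; an empty list means the file passes."""
    problems = []
    line_count = len(text.splitlines())
    # Assumption: 120 is a tolerant ceiling for "about 100 lines".
    if line_count > max_lines:
        problems.append(f"AGENTS.md has {line_count} lines; limit is {max_lines}")
    # A table of contents should point into docs/ rather than inline everything.
    if "docs/" not in text:
        problems.append("AGENTS.md does not point to docs/")
    return problems
```

Run from a test or CI gate, a non-empty result fails the build in the same way a broken unit test would.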
5. Specify (when warranted)
For non-trivial features, write a spec before coding. Not a throwaway PRD — a living contract.
- Define what, why, acceptance criteria, non-goals
- Define conformance tests or acceptance checks — the mechanical definition of "done"
- Get human approval on spec before implementation when scope is non-trivial
- Break into testable tasks
- Capture decisions during implementation and flow them back to the spec
- Reconcile spec ↔ code ↔ tests after implementation
For the SDD triangle, conformance tests, and the 70/30 rule: `references/specifications.md`
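A spec with a mechanical definition of "done" can be modelled as data plus named acceptance checks. This is only a sketch of the idea: the `Spec` fields mirror the bullets above, while the class, the `failing_checks` helper, and the use of a plain dictionary as the built artifact are assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Spec:
    what: str
    why: str
    non_goals: List[str] = field(default_factory=list)
    # Each acceptance check maps a name to a predicate over the built artifact.
    acceptance_checks: Dict[str, Callable[[dict], bool]] = field(default_factory=dict)


def failing_checks(spec: Spec, artifact: dict) -> List[str]:
    """Return names of acceptance checks that fail; an empty list means done."""
    return [name for name, check in spec.acceptance_checks.items() if not check(artifact)]
```

Because the checks live on the spec, a decision captured during implementation can be recorded by adding or changing a check, which keeps spec, code, and tests aligned.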
Anti-Patterns
- Mock-only tests — pass by construction, verify nothing
- Self-evaluation — agent grades own work, always passes
- Global AGENTS.md dump — fills context before work starts
- Infinite retry loops — max 2 CI rounds, then hand back with partial result
- All-agentic pipeline — lint/push/format should be deterministic
- Context flooding — running full test suites floods context, agent hallucinates. Run targeted subsets, swallow passing output, surface only errors
- Designing the perfect harness upfront — iterate from failures, not theory
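The remedy for context flooding (swallow passing output, surface only errors) can be sketched as a filter over test results. `CaseResult` and `summarize_for_agent` are hypothetical names. A real setup would fill the list from the test runner's own report.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CaseResult:
    name: str
    passed: bool
    output: str


def summarize_for_agent(results: List[CaseResult]) -> str:
    """Swallow passing output; surface only failures plus a one-line tally."""
    failures = [r for r in results if not r.passed]
    lines = [f"{len(results) - len(failures)} passed, {len(failures)} failed"]
    for failure in failures:
        lines.append(f"FAIL {failure.name}: {failure.output}")
    return "\n".join(lines)
```

The agent sees one tally line plus the failing cases, however much the passing tests logged.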
Output
After any harness work, report:
- Grade: before and after (using `references/grading.md` scale)
- Dimensions: bootable / testable / observable / verifiable, each with status + evidence
- What changed: specific files added or modified
- Gaps: remaining gaps ranked by impact
- Verify readiness: C+ = can verify, D/F = fix harness first
- Confidence: `ship it` / `needs review` / `blocked`
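The Dimensions part of the report could be held in a small structure like the one below. It is a sketch under one loud assumption: the "lowest dimension wins" rule is applied to the pass/partial/fail statuses, because the letter-grade scale lives in `references/grading.md` and is not reproduced here. All class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumed ordering of the audit statuses, worst first. This is not the
# letter-grade scale from references/grading.md.
STATUS_RANK = {"fail": 0, "partial": 1, "pass": 2}
DIMENSIONS = ("bootable", "testable", "observable", "verifiable")


@dataclass
class DimensionResult:
    status: str    # pass / partial / fail
    evidence: str  # file or command
    gap: str = ""  # what's missing


def overall_status(dimensions: Dict[str, DimensionResult]) -> str:
    """The lowest dimension decides the overall result."""
    missing = [name for name in DIMENSIONS if name not in dimensions]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    statuses = (dimensions[name].status for name in DIMENSIONS)
    return min(statuses, key=STATUS_RANK.__getitem__)


def format_dimensions(dimensions: Dict[str, DimensionResult]) -> List[str]:
    """Render one report line per dimension: status plus evidence."""
    return [
        f"{name}: {dimensions[name].status} ({dimensions[name].evidence or 'no evidence'})"
        for name in DIMENSIONS
    ]
```

Mapping the result onto a letter grade and the C+ readiness threshold is left to the grading reference.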
References
- `references/grading.md`: harness quality grading scale with mechanical criteria
- `references/setup-patterns.md`: boot, smoke, e2e, isolation, enforcement patterns
- `references/verification.md`: verify workflow, evaluator pattern, subagent lanes, cost
- `references/documentation.md`: AGENTS.md rules, scoped rules, README patterns, docs hygiene
- `references/specifications.md`: SDD triangle, conformance tests, acceptance criteria
- `references/industry-examples.md`: OpenAI, Anthropic, Stripe, Uber, Datadog, Cursor patterns. Read when designing a harness strategy or justifying investment, not during routine work
Each reference file includes source URLs for the research and articles it draws from.