harness

Harness

The verification infrastructure that makes agent work trustworthy.

Principles

  • Environment > instruction — the harness matters more than the prompt
  • Mechanical enforcement > documentation — git hooks and CI gates > prose
  • Separate builder from judge — self-evaluation is unreliable; spawn independent evaluators
  • Deterministic where possible — lint/format/push hardcoded, implementation agentic
  • Context is a public good — push knowledge into the repo; what agents can't access doesn't exist
  • Scoped rules over global rules — per-directory/file-pattern rules, not a global dump
  • Progressive disclosure — small entry points, load detail on demand
  • Accept and correct > prevent all errors — small steady error rate with rapid correction beats perfection-seeking that serializes everything

The 7-Layer Stack

Every harness has these layers. Name them when grading — "we have layers 1-3, missing 4-7":
  1. Boot: single command starts the app
  2. Smoke: app is alive (health endpoint, --version). Under 5 seconds
  3. Interact: agent can exercise the app (Playwright, curl, shell scripts)
  4. E2e: key user flows on real surfaces (not mocks)
  5. Enforce: git hooks, CI gates, custom lint rules with agent-readable errors
  6. Observe: structured logs, health endpoints, error traces queryable by agent
  7. Isolate: per-worktree or per-container, parallel agents don't collide
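
As a rough illustration of naming layers when grading, the sketch below turns a set of present layers into the kind of one-line summary quoted above. The function names and the idea of passing layer names as a set are assumptions for illustration, not part of the skill.

```python
# Layer names in stack order, taken from the list above.
LAYERS = ["Boot", "Smoke", "Interact", "E2e", "Enforce", "Observe", "Isolate"]


def _ranges(numbers):
    """Collapse sorted ints into text such as '1-3' or '1 and 5'."""
    groups = []
    start = prev = None
    for n in numbers:
        if start is None:
            start = prev = n
        elif n == prev + 1:
            prev = n
        else:
            groups.append((start, prev))
            start = prev = n
    if start is not None:
        groups.append((start, prev))
    return " and ".join(str(a) if a == b else f"{a}-{b}" for a, b in groups)


def layer_summary(present):
    """present: set of layer names found in the repo (hypothetical input shape)."""
    have = [i for i, name in enumerate(LAYERS, start=1) if name in present]
    missing = [i for i in range(1, len(LAYERS) + 1) if i not in have]
    parts = []
    if have:
        parts.append(f"we have layers {_ranges(have)}")
    if missing:
        parts.append(f"missing {_ranges(missing)}")
    return ", ".join(parts)


print(layer_summary({"Boot", "Smoke", "Interact"}))
# prints: we have layers 1-3, missing 4-7
```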

Workflow

1. Audit

Grade the repo across four dimensions. For each: status (pass/partial/fail), evidence (file or command), gap (what's missing).
  • Bootable: one command starts the app and confirms it's running
  • Testable: tests hit the real running app, not just mocks. Detect jest.mock / vi.mock / unittest.mock; mock-only = zero
  • Observable: structured logs, health endpoints, or error traces queryable by agent
  • Verifiable: agent can produce evidence (screenshots, response logs, traces)
Use parallel subagents where available (one per dimension); otherwise audit sequentially. Grade using references/grading.md. Lowest dimension = overall grade.
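
A minimal sketch of the audit record and the "lowest dimension wins" rule, assuming the three statuses order as fail < partial < pass. The real scale lives in references/grading.md and is not reproduced here; the class and function names are hypothetical. The mock detector is a plain text search, so a hit only flags a file for a closer look.

```python
import re
from dataclasses import dataclass

# Assumed ordering for illustration only; defer to references/grading.md.
STATUS_RANK = {"fail": 0, "partial": 1, "pass": 2}

# Matches the three mock entry points named in the audit step.
MOCK_PATTERN = re.compile(r"\b(jest\.mock|vi\.mock|unittest\.mock)\b")


@dataclass
class DimensionResult:
    name: str      # bootable / testable / observable / verifiable
    status: str    # pass / partial / fail
    evidence: str  # file or command
    gap: str       # what's missing


def uses_mocks(test_source: str) -> bool:
    """True if the test source mentions one of the known mock APIs."""
    return MOCK_PATTERN.search(test_source) is not None


def overall_status(results):
    """The overall result is the weakest dimension."""
    return min(results, key=lambda r: STATUS_RANK[r.status]).status


results = [
    DimensionResult("bootable", "pass", "make dev", ""),
    DimensionResult("testable", "fail", "tests/api.test.ts", "only jest.mock tests"),
    DimensionResult("observable", "partial", "GET /health", "no structured logs"),
    DimensionResult("verifiable", "pass", "playwright screenshots", ""),
]
print(overall_status(results))  # prints: fail
```

The file names and commands in the sample results are made up for the example.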

2. Setup

Based on grade, build missing layers in priority order:
Boot → Smoke → Interact → E2e → Enforce → Observe → Isolate
Each piece should be independently useful. Stop after any step if remaining gaps aren't blocking. See references/setup-patterns.md for concrete patterns by project type.
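
For the Smoke layer, one possible shape is a small script that runs a liveness command and enforces the 5-second budget. This is a sketch, not a pattern from references/setup-patterns.md; `smoke_check` is a hypothetical helper, and the Python interpreter's own `--version` stands in for the app's command.

```python
import subprocess
import sys
import time


def smoke_check(command, budget_seconds=5.0):
    """Run a liveness command.

    Returns (ok, elapsed_seconds). ok is False on a non-zero exit,
    a timeout, or a command that cannot be started.
    """
    started = time.monotonic()
    try:
        completed = subprocess.run(
            command, capture_output=True, timeout=budget_seconds
        )
        ok = completed.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        ok = False
    return ok, time.monotonic() - started


if __name__ == "__main__":
    # Stand-in command; replace with the app's own --version or health probe.
    ok, elapsed = smoke_check([sys.executable, "--version"])
    print(f"smoke {'ok' if ok else 'FAILED'} in {elapsed:.2f}s")
    sys.exit(0 if ok else 1)
```

Exiting non-zero on failure lets the same script double as a CI gate or git hook step.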

3. Verify

Prove changes work on real surfaces. The agent that wrote the code must not verify it — spawn an independent evaluator. If subagents are unavailable, use a fresh session or hand off to human review. Do not self-certify with implementation context still loaded.
  • Boot the app, interact with it (Playwright CLI for UI, curl for APIs, CLI invocation)
  • Check nearby flows and likely regressions, not just the exact diff
  • Investigate anything odd instead of rationalizing it
  • Max 2 verification cycles — escalate after that, don't loop
  • Keep proof: commands run, screenshots, response logs, traces
For subagent lanes, evaluator pattern, and cost trade-offs: references/verification.md
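
The cycle cap and proof-keeping rules above could be wired up roughly as follows. `evaluate` stands in for whatever spawns the independent evaluator (subagent, fresh session, or human) and `fix` for the builder's follow-up; both are hypothetical callables, and the real evaluator pattern is described in references/verification.md.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    passed: bool
    proof: List[str] = field(default_factory=list)  # commands run, logs, screenshot paths


def verify(
    evaluate: Callable[[], Verdict],
    fix: Callable[[Verdict], None],
    max_cycles: int = 2,
) -> Tuple[str, List[str]]:
    """Run at most max_cycles evaluations, then escalate instead of looping."""
    proof: List[str] = []
    for cycle in range(1, max_cycles + 1):
        verdict = evaluate()
        proof.extend(f"cycle {cycle}: {item}" for item in verdict.proof)
        if verdict.passed:
            return "verified", proof
        if cycle < max_cycles:
            fix(verdict)  # builder reacts to the evaluator's findings
    return "escalate", proof
```

The function returns the collected proof in both outcomes, so an escalation still hands over what was tried.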

4. Document

Keep the repo legible to humans and agents.
  • AGENTS.md ≈ 100 lines: table of contents, not encyclopedia. Points to docs/
  • README.md: human-facing overview, setup, usage
  • Scoped rules per directory/file pattern, not global dump
  • Update docs as part of the work, not after. Doc drift = test failure
For AGENTS.md structure, scoped rules, and hygiene: references/documentation.md
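
One way to treat doc drift as a test failure is a mechanical check on AGENTS.md itself. The sketch below assumes a 120-line ceiling as slack around the "≈ 100 lines" guideline and checks only for a mention of docs/; both thresholds and the function name are illustrative choices.

```python
def check_agents_md(text, max_lines=120):
    """Return agent-readable problems; an empty list means the check passes.

    max_lines is an assumed ceiling, not a number from the skill.
    """
    lines = text.splitlines()
    problems = []
    if len(lines) > max_lines:
        problems.append(
            f"AGENTS.md: {len(lines)} lines, budget is {max_lines}. "
            "Move detail into docs/ and link to it."
        )
    if not any("docs/" in line for line in lines):
        problems.append(
            "AGENTS.md: no pointer to docs/. Add links to the detailed documents."
        )
    return problems


sample = "# Map\n- Architecture: docs/architecture.md\n- Testing: docs/testing.md\n"
print(check_agents_md(sample))  # prints: []
```

Each message says what is wrong and what to do next, which is the property the Enforce layer asks of errors.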

5. Specify (when warranted)

For non-trivial features, write a spec before coding. Not a throwaway PRD — a living contract.
  • Define what, why, acceptance criteria, non-goals
  • Define conformance tests or acceptance checks — the mechanical definition of "done"
  • Get human approval on spec before implementation when scope is non-trivial
  • Break into testable tasks
  • Capture decisions during implementation and flow them back to the spec
  • Reconcile spec ↔ code ↔ tests after implementation
For the SDD triangle, conformance tests, and the 70/30 rule: references/specifications.md
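
The spec fields listed above map naturally onto a small record with a readiness check. This is a sketch under the assumption that a spec is tracked as structured data; the field names and `blocking_gaps` are hypothetical, and the actual spec format is whatever references/specifications.md prescribes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Spec:
    what: str
    why: str
    acceptance_criteria: List[str] = field(default_factory=list)
    non_goals: List[str] = field(default_factory=list)
    conformance_checks: List[str] = field(default_factory=list)  # commands or test IDs
    decisions: List[str] = field(default_factory=list)           # captured during implementation
    approved_by_human: bool = False


def blocking_gaps(spec: Spec) -> List[str]:
    """List what still has to be filled in before implementation starts."""
    gaps = []
    if not spec.what.strip():
        gaps.append("what: describe the feature")
    if not spec.why.strip():
        gaps.append("why: state the motivation")
    if not spec.acceptance_criteria:
        gaps.append("acceptance_criteria: list at least one")
    if not spec.non_goals:
        gaps.append("non_goals: state what is out of scope")
    if not spec.conformance_checks:
        gaps.append("conformance_checks: add a mechanical definition of done")
    if not spec.approved_by_human:
        gaps.append("approval: get human sign-off before implementation")
    return gaps
```

The `decisions` list is where choices made during implementation would flow back, keeping the spec a living contract.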

Anti-Patterns

  • Mock-only tests — pass by construction, verify nothing
  • Self-evaluation — agent grades own work, always passes
  • Global AGENTS.md dump — fills context before work starts
  • Infinite retry loops — max 2 CI rounds, then hand back with partial result
  • All-agentic pipeline — lint/push/format should be deterministic
  • Context flooding — running full test suites floods context, agent hallucinates. Run targeted subsets, swallow passing output, surface only errors
  • Designing the perfect harness upfront — iterate from failures, not theory
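
To make the context-flooding remedy concrete, here is a sketch of a wrapper that swallows passing output and surfaces only the tail of a failure. `run_quietly` and the 30-line tail are assumptions for illustration; the command passed in would be a targeted test subset rather than the full suite.

```python
import subprocess
import sys


def run_quietly(command, tail_lines=30):
    """Run a command; return one line on success, the output tail on failure."""
    completed = subprocess.run(command, capture_output=True, text=True)
    label = " ".join(command)
    if completed.returncode == 0:
        # Passing output is dropped so it never reaches the agent's context.
        return f"PASS: {label}"
    output = (completed.stdout + completed.stderr).splitlines()
    return "\n".join([f"FAIL ({completed.returncode}): {label}", *output[-tail_lines:]])


if __name__ == "__main__":
    # Stand-in for a noisy but passing test run.
    print(run_quietly([sys.executable, "-c", "print('noise\\n' * 1000)"]))
```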

Output

After any harness work, report:
  • Grade: before and after (using references/grading.md scale)
  • Dimensions: bootable / testable / observable / verifiable, each with status + evidence
  • What changed: specific files added or modified
  • Gaps: remaining gaps ranked by impact
  • Verify readiness: C+ = can verify, D/F = fix harness first
  • Confidence: ship it / needs review / blocked
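
If the report is assembled programmatically, a record like the one below keeps the six fields together and rejects an unknown confidence value. The class, its field names, and the sample values are hypothetical; grades are passed through as opaque strings because the scale itself is defined in references/grading.md.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

CONFIDENCE = ("ship it", "needs review", "blocked")


@dataclass
class HarnessReport:
    grade_before: str
    grade_after: str
    dimensions: Dict[str, Tuple[str, str]]  # name -> (status, evidence)
    changed_files: List[str]
    gaps: List[str]                         # ranked by impact, highest first
    verify_ready: bool
    confidence: str

    def render(self) -> str:
        if self.confidence not in CONFIDENCE:
            raise ValueError(f"confidence must be one of {CONFIDENCE}")
        lines = [f"Grade: {self.grade_before} -> {self.grade_after}", "Dimensions:"]
        for name, (status, evidence) in self.dimensions.items():
            lines.append(f"  {name}: {status} ({evidence})")
        lines.append("What changed: " + ", ".join(self.changed_files))
        lines.append("Gaps: " + "; ".join(self.gaps))
        readiness = "can verify" if self.verify_ready else "fix harness first"
        lines.append(f"Verify readiness: {readiness}")
        lines.append(f"Confidence: {self.confidence}")
        return "\n".join(lines)
```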

References

  • references/grading.md: harness quality grading scale with mechanical criteria
  • references/setup-patterns.md: boot, smoke, e2e, isolation, enforcement patterns
  • references/verification.md: verify workflow, evaluator pattern, subagent lanes, cost
  • references/documentation.md: AGENTS.md rules, scoped rules, README patterns, docs hygiene
  • references/specifications.md: SDD triangle, conformance tests, acceptance criteria
  • references/industry-examples.md: OpenAI, Anthropic, Stripe, Uber, Datadog, Cursor patterns. Read when designing a harness strategy or justifying investment, not during routine work
Each reference file includes source URLs for the research and articles it draws from.