
Agentic Harness Design

Patterns for building multi-agent systems that produce high-quality outputs on long, complex tasks—covering generator/evaluator loops, context management, and task decomposition.

Core Architecture: Planner → Generator → Evaluator

The three-agent split addresses the two main failure modes of a solo agent:
  1. Context degradation — models lose coherence as the context window fills; some exhibit "context anxiety" and wrap up prematurely.
  2. Self-evaluation bias — agents reliably over-praise their own output; separating producer from judge is the key lever.
Planner takes a short user prompt (1–4 sentences) and expands it into a full product spec. Keep it at the level of deliverables and high-level architecture—not granular implementation details, which cascade errors downstream. Ask the planner to identify opportunities to weave AI-native features into the spec.
Generator implements against the spec. Works in sprints (one feature at a time) when the model needs scaffolding. Stronger models can run as a single continuous session with SDK-level compaction handling context growth. Self-evaluates at the end of each sprint before handoff.
Evaluator grades the generator's output against agreed criteria. Uses a live browser tool (e.g. Playwright MCP) to interact with the running app rather than scoring static screenshots. Produces specific, actionable findings.
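
The control flow above can be sketched as follows. `call_agent` is a hypothetical stand-in for a real agent SDK invocation, stubbed here with canned responses so the loop structure is runnable; role names and prompt shapes are illustrative, not a fixed API.

```python
# Sketch of the planner -> generator -> evaluator loop.
# `call_agent` is a hypothetical placeholder for an agent SDK call,
# stubbed with canned responses to exercise the control flow.
def call_agent(role: str, prompt: str) -> str:
    canned = {
        "planner": "SPEC: task list app with AI-native smart grouping",
        "generator": "BUILD: implemented task list + smart grouping",
        "evaluator": "PASS",  # a real evaluator returns specific findings
    }
    return canned[role]

def run_harness(user_prompt: str, max_sprints: int = 5) -> str:
    # Planner: expand a 1-4 sentence prompt into a deliverable-level spec.
    spec = call_agent("planner", user_prompt)
    output = ""
    for sprint in range(max_sprints):
        # Generator: implement one feature per sprint against the spec.
        output = call_agent("generator", f"Spec:\n{spec}\nSprint {sprint}")
        # Evaluator: a separate agent grades the result, which avoids
        # the self-evaluation bias of asking the generator to judge itself.
        verdict = call_agent("evaluator", f"Grade against the spec:\n{output}")
        if verdict.startswith("PASS"):
            break
    return output

result = run_harness("Build me a to-do app.")
```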

Sprint Contracts

Before each sprint, generator and evaluator negotiate a sprint contract: the generator proposes what it will build and how success will be verified; the evaluator reviews and agrees. This bridges the gap between a high-level spec and testable behavior without over-specifying implementation upfront.
Contracts are communicated via files—one agent writes, the other reads and responds in kind. This keeps both agents grounded in agreed scope.
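
A minimal sketch of the file-based exchange, assuming a shared working directory. The file names (`sprint_contract.md`, `contract_review.md`) and the contract sections are illustrative conventions, not a required schema.

```python
# Sketch of file-based sprint-contract negotiation in a shared workdir.
# File names and section headings are illustrative assumptions.
from pathlib import Path

def propose_contract(workdir: Path, goals: list[str], checks: list[str]) -> Path:
    """Generator side: write what it will build and how success is verified."""
    path = workdir / "sprint_contract.md"
    lines = ["# Sprint Contract", "## Goals"]
    lines += [f"- {g}" for g in goals]
    lines += ["## Verification"]
    lines += [f"- {c}" for c in checks]
    path.write_text("\n".join(lines))
    return path

def review_contract(workdir: Path, agreed: bool, notes: str = "") -> Path:
    """Evaluator side: read the proposal and respond in kind, via a file."""
    path = workdir / "contract_review.md"
    status = "AGREED" if agreed else "REVISE"
    path.write_text(f"# Review\nStatus: {status}\n{notes}")
    return path
```

Because both sides read and write the same artifacts, the agreed scope survives context compaction or resets on either agent.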

Context Management

Context resets vs. compaction are not equivalent:
  • Compaction summarizes earlier context in-place; the same agent continues. Context anxiety can persist.
  • A context reset starts a fresh agent with a structured handoff artifact containing prior state and next steps. Provides a clean slate at the cost of orchestration overhead.
Use resets when the model exhibits context anxiety (observable as premature wrap-up behavior). Stronger models (e.g. Opus 4.6+) often sustain long sessions without resets, making compaction sufficient.
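
A context reset can be sketched as serializing a structured handoff artifact and seeding a fresh agent with only that artifact. The field names here are illustrative, not a fixed schema.

```python
# Sketch of a context reset via a structured handoff artifact.
# Field names ("completed", "next_steps", "decisions") are illustrative.
import json

def write_handoff(path: str, completed: list, next_steps: list, decisions: list) -> None:
    artifact = {
        "completed": completed,    # what prior sprints delivered
        "next_steps": next_steps,  # ordered remaining work
        "decisions": decisions,    # constraints the fresh agent must honor
    }
    with open(path, "w") as f:
        json.dump(artifact, f, indent=2)

def reset_prompt(path: str) -> str:
    # The fresh agent sees only the artifact, not the old transcript,
    # so anxiety from a nearly-full context window cannot carry over.
    with open(path) as f:
        artifact = json.load(f)
    return (
        "You are resuming a project mid-flight. Prior state:\n"
        + json.dumps(artifact, indent=2)
        + "\nContinue from next_steps."
    )
```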

Evaluator Calibration

Out of the box, evaluators are lenient. Calibration steps:
  1. Write explicit grading criteria that encode your quality bar. Turn subjective judgments ("is this good?") into concrete, gradable terms.
  2. Use few-shot examples with detailed score breakdowns to anchor the evaluator's judgment.
  3. Set hard thresholds per criterion. If any falls below threshold, the sprint fails.
  4. Read evaluator logs after each run. When its judgment diverges from yours, update the prompt to resolve it.
  5. Instruct the evaluator to be skeptical; it is far easier to tune a standalone evaluator to be critical than to make a generator self-critical.
For design tasks, weight criteria that the model under-delivers on by default (originality, coherence) more heavily than what it handles well (functional correctness, technical craft).
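
Steps 3 and the weighting advice can be sketched as hard per-criterion thresholds on top of a weighted score. Criterion names, thresholds, and weights below are illustrative; note the heavier weights on originality and coherence for a design task.

```python
# Sketch of hard per-criterion thresholds plus design-task weighting.
# Criteria, thresholds, and weights are illustrative assumptions.
THRESHOLDS = {"originality": 7, "coherence": 7, "functional": 6, "craft": 6}
WEIGHTS = {"originality": 0.35, "coherence": 0.30, "functional": 0.20, "craft": 0.15}

def grade(scores: dict[str, int]) -> tuple[bool, float]:
    # Any single criterion below its threshold fails the sprint outright,
    # regardless of how high the weighted average is.
    failed = [c for c, t in THRESHOLDS.items() if scores[c] < t]
    weighted = sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)
    return (not failed, weighted)
```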

Harness Simplification Principle

Every component encodes an assumption about what the model can't do solo. As models improve, those assumptions go stale. After each model upgrade:
  • Re-run representative tasks with simplified harnesses
  • Remove components one at a time and measure the impact on output quality
  • Only add complexity when simpler approaches demonstrably fall short
Sprint decomposition, context resets, and per-sprint evaluation loops may all become unnecessary overhead as model capability increases.
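
The ablation procedure above can be sketched as follows. `run_tasks` is a hypothetical benchmark runner that takes a set of enabled components and returns a mean quality score; the component names and tolerance are illustrative.

```python
# Sketch of the post-upgrade ablation loop: drop one harness component
# at a time and keep only components whose removal measurably hurts quality.
# `run_tasks` is a hypothetical benchmark runner; names are illustrative.
COMPONENTS = ["sprint_decomposition", "context_resets", "per_sprint_eval"]

def ablate(run_tasks, enabled=None, tolerance=0.02):
    enabled = set(COMPONENTS if enabled is None else enabled)
    baseline = run_tasks(enabled)
    removable = []
    for comp in COMPONENTS:
        score = run_tasks(enabled - {comp})
        # A component is removable if quality drops by no more than
        # the tolerance when it is disabled.
        if baseline - score <= tolerance:
            removable.append(comp)
    return baseline, removable
```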

When to Use This Skill

Use this skill when:
  • Building agent harnesses for tasks that run longer than a single context window
  • Output quality from a single-agent approach is plateauing
  • The task has subjective quality dimensions (design, UX) where binary correctness checks don't apply
  • You need to decide whether to use context resets vs. compaction vs. a fresh session architecture
  • Calibrating an evaluator agent to grade reliably against your quality bar