Agentic Harness Design
Patterns for building multi-agent systems that produce high-quality outputs on long, complex tasks—covering generator/evaluator loops, context management, and task decomposition.
Core Architecture: Planner → Generator → Evaluator
The three-agent split addresses the two main failure modes of solo agents:
- Context degradation — models lose coherence as the context window fills; some exhibit "context anxiety" and wrap up prematurely.
- Self-evaluation bias — agents reliably over-praise their own output; separating producer from judge is the key lever.
Planner takes a short user prompt (1–4 sentences) and expands it into a full product spec. Keep it at the level of deliverables and high-level architecture—not granular implementation details, which cascade errors downstream. Ask the planner to identify opportunities to weave AI-native features into the spec.
Generator implements against the spec. Works in sprints (one feature at a time) when the model needs scaffolding. Stronger models can run as a single continuous session with SDK-level compaction handling context growth. Self-evaluates at the end of each sprint before handoff.
Evaluator grades the generator's output against agreed criteria. Uses a live browser tool (e.g. Playwright MCP) to interact with the running app rather than scoring static screenshots. Produces specific, actionable findings.
Sprint Contracts
Before each sprint, generator and evaluator negotiate a sprint contract: the generator proposes what it will build and how success will be verified; the evaluator reviews and agrees. This bridges the gap between a high-level spec and testable behavior without over-specifying implementation upfront.
Contracts are communicated via files—one agent writes, the other reads and responds in kind. This keeps both agents grounded in agreed scope.
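A file-mediated contract exchange might look like the sketch below. The JSON schema (`scope`, `checks`, `status`) and function names are assumptions for illustration, not a fixed format.

```python
import json
from pathlib import Path

def propose_contract(workdir: Path, sprint: int,
                     scope: str, checks: list[str]) -> Path:
    # Generator writes its proposed contract for the sprint:
    # what it will build, and how success will be verified.
    path = workdir / f"sprint_{sprint}_contract.json"
    path.write_text(json.dumps({
        "sprint": sprint,
        "scope": scope,
        "checks": checks,
        "status": "proposed",
    }, indent=2))
    return path

def review_contract(path: Path, approve: bool, notes: str = "") -> dict:
    # Evaluator reads the proposal and responds in the same file,
    # so both agents share one grounded record of agreed scope.
    contract = json.loads(path.read_text())
    contract["status"] = "agreed" if approve else "rejected"
    contract["evaluator_notes"] = notes
    path.write_text(json.dumps(contract, indent=2))
    return contract
```

Keeping the contract in a file (rather than in either agent's context) means it survives compaction and context resets unchanged.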
Context Management
Context resets and compaction are not equivalent:
- Compaction summarizes earlier context in-place; the same agent continues. Context anxiety can persist.
- A context reset starts a fresh agent with a structured handoff artifact containing prior state and next steps. Provides a clean slate at the cost of orchestration overhead.
Use resets when the model exhibits context anxiety (observable as premature wrap-up behavior). Stronger models (e.g. Opus 4.6+) often sustain long sessions without resets, making compaction sufficient.
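A structured handoff artifact for a context reset can be sketched as follows. The field names (`completed`, `next_steps`, `open_issues`) are illustrative assumptions; the point is that the incoming agent's entire knowledge of prior work flows through this one file.

```python
import json
from pathlib import Path

def write_handoff(path: Path, completed: list[str],
                  next_steps: list[str], open_issues: list[str]) -> None:
    # Outgoing agent serializes its state before its context is discarded.
    path.write_text(json.dumps({
        "completed": completed,
        "next_steps": next_steps,
        "open_issues": open_issues,
    }, indent=2))

def seed_prompt(path: Path) -> str:
    # Incoming agent starts with a clean slate, seeded only by the artifact.
    state = json.loads(path.read_text())
    lines = ["Continuing a long-running task from a handoff artifact.",
             "Completed:"]
    lines += [f"- {c}" for c in state["completed"]]
    lines += ["Next steps:"] + [f"- {s}" for s in state["next_steps"]]
    lines += ["Open issues:"] + [f"- {i}" for i in state["open_issues"]]
    return "\n".join(lines)
```

Because the new agent never inherits the old transcript, any accumulated context anxiety is discarded along with it; the trade-off is the orchestration code needed to write and validate the artifact.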
Evaluator Calibration
Out of the box, evaluators are lenient. Calibration steps:
- Write explicit grading criteria that encode your quality bar. Turn subjective judgments ("is this good?") into concrete, gradable terms.
- Use few-shot examples with detailed score breakdowns to anchor the evaluator's judgment.
- Set hard thresholds per criterion. If any falls below threshold, the sprint fails.
- Read evaluator logs after each run. When its judgment diverges from yours, update the prompt to resolve it.
- Instruct the evaluator to be skeptical; it is far easier to tune a standalone evaluator to be critical than to make a generator self-critical.
For design tasks, weight criteria that the model under-delivers on by default (originality, coherence) more heavily than what it handles well (functional correctness, technical craft).
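The thresholds-plus-weights scheme above can be made concrete with a small scoring helper. The criteria names and numbers are illustrative assumptions; the mechanics to note are that any single below-threshold criterion fails the sprint outright, while the weights only shape the aggregate score.

```python
def grade(scores: dict[str, float],
          thresholds: dict[str, float],
          weights: dict[str, float]) -> tuple[bool, float]:
    # Hard thresholds: one criterion below its floor fails the sprint,
    # regardless of how well the weighted total looks.
    passed = all(scores[c] >= thresholds[c] for c in thresholds)
    # Weighted average for trend tracking across sprints.
    total = sum(scores[c] * weights[c] for c in weights) / sum(weights.values())
    return passed, total

# Design task: weight originality/coherence above functional correctness,
# since models under-deliver on the former by default.
scores = {"originality": 6.0, "coherence": 8.0, "correctness": 9.0}
thresholds = {"originality": 7.0, "coherence": 7.0, "correctness": 8.0}
weights = {"originality": 3.0, "coherence": 3.0, "correctness": 1.0}
ok, total = grade(scores, thresholds, weights)
# Fails despite a decent weighted total: originality 6.0 < threshold 7.0.
```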
Harness Simplification Principle
Every component encodes an assumption about what the model can't do solo. As models improve, those assumptions go stale. After each model upgrade:
- Re-run representative tasks with simplified harnesses
- Remove components one at a time and measure the impact on output quality
- Only add complexity when simpler approaches demonstrably fall short
Sprint decomposition, context resets, and per-sprint evaluation loops may all become unnecessary overhead as model capability increases.
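The remove-one-component-at-a-time step can be framed as a simple ablation loop. This is a sketch under the assumption that you already have a calibrated evaluator producing a scalar quality score; `run_task` is a hypothetical callback that runs a representative task with a given set of harness components active.

```python
def ablate_harness(components: list[str], run_task) -> dict[str, float]:
    # run_task(active_components) -> quality score from the evaluator.
    # Returns, per component, the quality lost when it is removed alone.
    baseline = run_task(components)
    impact = {}
    for c in components:
        reduced = [x for x in components if x != c]
        impact[c] = baseline - run_task(reduced)
    return impact

# Components whose removal costs ~nothing are candidates for deletion
# after a model upgrade.
```

In practice you would average each measurement over several runs, since single-run quality scores from an evaluator are noisy.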
When to Use This Skill
Use this skill when:
- Building agent harnesses for tasks that run longer than a single context window
- Output quality from a single-agent approach is plateauing
- The task has subjective quality dimensions (design, UX) where binary correctness checks don't apply
- You need to decide whether to use context resets vs. compaction vs. a fresh session architecture
- Calibrating an evaluator agent to grade reliably against your quality bar