harness-creator
Harness Creator
Production harness engineering for AI coding agents.
For: Engineers building or extending coding-agent runtimes, custom agents, multi-session workflows, or anyone who wants their agent to work reliably across sessions.
Not for: Prompt engineering, model selection, generic software architecture, or one-off agent tasks.
All principles are grounded in the Learn Harness Engineering framework and production agent runtime decisions.
Choose Your Problem
| If you want to... | Read |
|---|---|
| Make the agent remember corrections and project rules between sessions | Memory Persistence |
| Package reusable workflows and domain knowledge | Skill Runtime |
| Let the agent work powerfully but not dangerously | Tool Registry & Safety |
| Give the agent the right context at the right cost | Context Engineering |
| Split work across multiple agents without chaos | Multi-agent Coordination |
| Extend behavior with hooks, background tasks, startup logic | Lifecycle & Bootstrap |
| Build the complete 5-subsystem harness | Five Subsystems Guide |
Before you start building: Read the Gotchas — these are the non-obvious failure modes that cost the most time.
The Five-Subsystem Harness Framework
Every harness consists of five subsystems:
- Instructions (Recipe Shelf): AGENTS.md, CLAUDE.md, docs/ hierarchy
- State (Prep Station): feature_list.json, progress.md, session-handoff.md
- Verification (Quality Check Window): Verification commands, test suites, type checks
- Scope (Task Boundaries): One-feature-at-a-time policies, definition of done
- Lifecycle (Session Management): init.sh, clean-state checklists, handoff procedures
When creating or improving a harness, systematically address each subsystem.
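As a rough illustration, the file-backed subsystems can be written down as a mapping from subsystem to the artifacts listed above, which makes it easy to see what a repository is still missing. This is a minimal sketch: the `missing_artifacts` helper and the `SUBSYSTEM_ARTIFACTS` table are hypothetical names, not part of the bundled scripts, and Verification and Scope are left out because they are policies rather than single files.

```python
import tempfile
from pathlib import Path

# Artifacts named in the subsystem list above. Verification and Scope are
# policies rather than files, so this sketch (an assumption) only checks the
# file-backed subsystems.
SUBSYSTEM_ARTIFACTS = {
    "instructions": ["AGENTS.md"],
    "state": ["feature_list.json", "progress.md"],
    "lifecycle": ["init.sh"],
}


def missing_artifacts(repo_root: Path) -> dict:
    """Return, per subsystem, the expected files that do not exist yet."""
    report = {}
    for subsystem, files in SUBSYSTEM_ARTIFACTS.items():
        absent = [name for name in files if not (repo_root / name).exists()]
        if absent:
            report[subsystem] = absent
    return report


# Demo on a throwaway directory that only contains AGENTS.md.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "AGENTS.md").write_text("# AGENTS.md\n")
    report = missing_artifacts(root)

print(report)
# {'state': ['feature_list.json', 'progress.md'], 'lifecycle': ['init.sh']}
```

An empty report only means the files exist; it says nothing about whether their contents are any good, which is what the assessment phase is for.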
Creating a Harness
Phase 1: Context Gathering
Start by understanding the user's situation:
- What project is this for? (tech stack, size, complexity)
- What agent tool are they using? (Claude Code, Codex, Cursor, etc.)
- What exists already? (any AGENTS.md, progress tracking, verification?)
- What problems are they experiencing? (agent overreach, lost context, broken tests?)
- What's the team's tolerance for structure? (minimal vs. comprehensive)
If the user hasn't provided this context, ask before proceeding.
Phase 2: Harness Assessment (Existing Projects)
If the user has an existing harness, assess it using the five-subsystem framework:
For each subsystem, score 1-5:
- 5: Exemplary, documented, consistently followed
- 4: Good, mostly complete, occasional gaps
- 3: Adequate, covers basics, missing polish
- 2: Weak, incomplete, inconsistently applied
- 1: Missing or actively harmful
Identify the lowest-scoring subsystem — that's the bottleneck. Focus improvement efforts there first.
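The scoring step can be sketched in a few lines. The subsystem keys and the `find_bottleneck` helper are illustrative names chosen here, and the tie-breaking rule (first subsystem in list order wins) is an assumption, since the text does not say how to break ties.

```python
SUBSYSTEMS = ("instructions", "state", "verification", "scope", "lifecycle")


def find_bottleneck(scores: dict) -> str:
    """Return the lowest-scoring subsystem; ties go to the first in SUBSYSTEMS order."""
    missing = [name for name in SUBSYSTEMS if name not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    for name in SUBSYSTEMS:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} score must be 1-5, got {scores[name]}")
    # min() returns the first minimal element, which gives the tie-break above.
    return min(SUBSYSTEMS, key=lambda name: scores[name])


example = {"instructions": 4, "state": 2, "verification": 3, "scope": 2, "lifecycle": 5}
print(find_bottleneck(example))  # prints "state"
```

Requiring a score for every subsystem keeps an unassessed area from being silently treated as healthy.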
Phase 3: Design
Based on the assessment, design the harness components:
Instructions:
- Create a short AGENTS.md (~50-100 lines) as the routing layer
- Link to detailed docs in docs/ directory (ARCHITECTURE.md, PRODUCT.md, etc.)
- Define startup workflow: what the agent reads before coding
State:
- Create feature_list.json with feature definitions and status tracking
- Create or update progress.md for session continuity
- Design session-handoff.md template if needed
Verification:
- List explicit verification commands in AGENTS.md
- Ensure init.sh runs verification
- Design quality score tracking if appropriate
Scope:
- Define one-feature-at-a-time policy
- Document feature dependencies
- Create definition of done checklist
Lifecycle:
- Create init.sh for initialization
- Design clean-state checklist
- Document session handoff procedure
Phase 4: Implementation
Create the harness files. Use bundled scripts where available:
```bash
# Use bundled scripts from scripts/ directory
# (See scripts/ section for available tools)
```
Phase 5: Testing and Benchmarking
Test the harness with real agent sessions:
- Baseline: Run a representative task without the harness
- With Harness: Run the same task with the harness
- Measure: Success rate, time, token usage, rework
- Compare: Quantify the improvement
For rigorous benchmarking, see the "Running Benchmarks" section below.
Harness File Templates
AGENTS.md Structure
A minimal AGENTS.md should include:
```markdown
# AGENTS.md

[One-sentence project purpose]

## Startup Workflow

Before writing code:
- [Step 1: e.g., Read this file]
- [Step 2: e.g., Read ARCHITECTURE.md]
- [Step 3: e.g., Run ./init.sh]
- [Step 4: e.g., Read feature_list.json]

## Working Rules

- [Rule 1: e.g., One feature at a time]
- [Rule 2: e.g., Verification required before claiming done]
- [Rule 3: e.g., Update progress before ending session]

## Required Artifacts

- `feature_list.json`: Feature state tracker
- `progress.md`: Session continuity log
- `init.sh`: Standard startup and verification

## Definition of Done

A feature is done when:
- Implementation complete
- Verification passed
- Evidence recorded
- Repository restartable

## End of Session

Before ending:
- Update progress.md
- Update feature_list.json
- Record blockers/risks
- Commit with descriptive message
- Leave clean restart path
```
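A check along the following lines could flag an AGENTS.md that has drifted away from this skeleton. It is only a sketch: `check_agents_md` is a hypothetical helper (not the bundled scripts/validate-harness.ts), the section names are taken from the template above, and the 100-line ceiling is the upper end of the ~50-100 line guidance from the design phase.

```python
REQUIRED_SECTIONS = (
    "Startup Workflow",
    "Working Rules",
    "Required Artifacts",
    "Definition of Done",
    "End of Session",
)
MAX_LINES = 100  # upper end of the ~50-100 line guidance for a routing-layer file


def check_agents_md(text: str) -> list:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems = []
    lines = text.splitlines()
    if len(lines) > MAX_LINES:
        problems.append(f"{len(lines)} lines; the guidance is at most {MAX_LINES}")
    # Compare against heading text with any leading '#' markers removed.
    titles = {line.lstrip("# ").strip() for line in lines}
    for section in REQUIRED_SECTIONS:
        if section not in titles:
            problems.append(f"missing section: {section}")
    return problems


sample = (
    "# AGENTS.md\n\n"
    "## Startup Workflow\n- Read this file\n\n"
    "## Working Rules\n- One feature at a time\n"
)
for problem in check_agents_md(sample):
    print(problem)
# missing section: Required Artifacts
# missing section: Definition of Done
# missing section: End of Session
```

Matching on section titles rather than exact heading syntax keeps the check usable whether or not the file uses `#` headings.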
feature_list.json Structure
```json
{
  "features": [
    {
      "id": "feat-001",
      "name": "Document Import",
      "description": "Allow users to import PDF and TXT documents",
      "dependencies": [],
      "status": "done",
      "evidence": "tests pass, manual verification on 2024-01-15"
    },
    {
      "id": "feat-002",
      "name": "Document Chunking",
      "description": "Split documents into ~500 char chunks with metadata",
      "dependencies": ["feat-001"],
      "status": "in-progress",
      "evidence": ""
    }
  ]
}
```
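To show how this file can drive a one-feature-at-a-time policy, here is a small sketch that picks the next feature to work on. The `next_feature` helper is hypothetical, and the `"pending"` status is an assumption: the example above only shows `"done"` and `"in-progress"`, so any other status is treated here as not started.

```python
import json
from typing import Optional

# Trimmed-down copy of the structure above; "feat-003" and the "pending"
# status are invented for illustration.
FEATURE_LIST = """
{
  "features": [
    {"id": "feat-001", "dependencies": [], "status": "done"},
    {"id": "feat-002", "dependencies": ["feat-001"], "status": "in-progress"},
    {"id": "feat-003", "dependencies": ["feat-002"], "status": "pending"}
  ]
}
"""


def next_feature(features: list) -> Optional[dict]:
    """Pick the single feature to work on, or None if nothing is workable."""
    by_id = {f["id"]: f for f in features}
    for f in features:
        unknown = [d for d in f["dependencies"] if d not in by_id]
        if unknown:
            raise ValueError(f"{f['id']} depends on unknown features: {unknown}")
    # Finish anything already in progress before starting new work.
    for f in features:
        if f["status"] == "in-progress":
            return f
    # Otherwise take the first unfinished feature whose dependencies are all done.
    for f in features:
        deps_done = all(by_id[d]["status"] == "done" for d in f["dependencies"])
        if f["status"] != "done" and deps_done:
            return f
    return None


features = json.loads(FEATURE_LIST)["features"]
print(next_feature(features)["id"])  # prints "feat-002"
```

Returning `None` when every remaining feature is blocked makes the blocked state explicit, so it can be recorded as a blocker in progress.md instead of being worked around.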
init.sh Structure
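```bash
#!/bin/bash
# Exit on the first failing step so a broken state is never reported as verified.
set -e

echo "=== Installing dependencies ==="
npm install

echo "=== Running type check ==="
npm run check  # assumes package.json defines a "check" script

echo "=== Running tests ==="
npm test

echo "=== Building application ==="
npm run build

echo "=== Verification complete ==="
```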
Running Benchmarks
To measure harness effectiveness:
Step 1: Define Representative Tasks
Pick 2-3 tasks that are:
- Real work the user would actually do
- Challenging enough to fail without proper harness
- Verifiable (clear success criteria)
Step 2: Run Comparative Sessions
For each task:
- Without Harness: Run the task on a clean repo copy
- With Harness: Run the same task with the harness in place
Record:
- Success/failure
- Time taken
- Token usage
- Rework required
- Session restarts needed
Step 3: Aggregate Results
Calculate:
- Success rate improvement
- Time efficiency change
- Token efficiency change
- Qualitative feedback
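The quantitative part of this aggregation might look like the sketch below. `SessionResult` and `summarize` are hypothetical names, the fields cover only some of the metrics recorded in Step 2, and the numbers in the example are placeholders made up to exercise the code, not measured results.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class SessionResult:
    task: str
    with_harness: bool
    success: bool
    minutes: float
    tokens: int


def summarize(results: list) -> dict:
    """Compare with-harness runs against baseline runs (harness minus baseline)."""
    baseline = [r for r in results if not r.with_harness]
    harness = [r for r in results if r.with_harness]
    if not baseline or not harness:
        raise ValueError("need at least one run in each condition")

    def rate(runs):
        # Fraction of runs that succeeded.
        return sum(r.success for r in runs) / len(runs)

    return {
        "success_rate_delta": rate(harness) - rate(baseline),
        "minutes_delta": mean(r.minutes for r in harness) - mean(r.minutes for r in baseline),
        "tokens_delta": mean(r.tokens for r in harness) - mean(r.tokens for r in baseline),
    }


# Placeholder values for illustration only.
results = [
    SessionResult("import", False, False, 40.0, 120_000),
    SessionResult("import", True, True, 30.0, 90_000),
    SessionResult("chunking", False, True, 20.0, 60_000),
    SessionResult("chunking", True, True, 25.0, 70_000),
]
summary = summarize(results)
print(summary)
```

A positive `success_rate_delta` and negative time and token deltas favor the harness. With only 2-3 tasks the deltas are indicative rather than statistically meaningful, which is why qualitative feedback stays on the list.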
Step 4: Iterate
Use results to identify:
- Which harness components add most value
- Which components are over-engineered
- Where to focus improvement efforts
Bundled Resources
References (Deep-Dive Patterns)
| Document | Covers |
|---|---|
| Memory Persistence | Four-level instruction hierarchy, auto-memory taxonomy, background extraction |
| Context Engineering | Select / Compress / Isolate / Write operations, budget management |
| Tool Registry | Fail-closed registration, per-call concurrency, permission pipeline |
| Multi-Agent | Coordinator / Fork / Swarm patterns, context sharing |
| Lifecycle & Bootstrap | Hook system, long-running tasks, dependency-ordered init |
| Gotchas | 15 non-obvious failure modes with fixes |
Templates
- templates/agents.md: AGENTS.md / CLAUDE.md skeleton
- templates/feature-list.json: Feature state tracker
- templates/init.sh: Standard initialization script
- templates/progress.md: Session progress log
- templates/session-handoff.md: Session handoff template
Scripts (Optional)
- scripts/create-harness.ts: Generate harness files from templates
- scripts/validate-harness.ts: Check harness completeness
- scripts/run-benchmark.ts: Execute harness effectiveness comparison
Gotchas
Non-obvious principles that will cause bugs if you violate them:
- Memory index caps fire silently: long entries become invisible once the cap is hit. Keep hooks to one line.
- Priority ordering is counterintuitive: local beats project, which beats user, which beats org. Test the full stack.
- Extraction timing creates a race window: the user can start the next turn before background extraction completes.
- Derivable content doesn't belong in memory: architecture and code patterns are already in the repo.
- Concurrency classification is per-call, not per-tool: the same tool can be safe for some inputs and unsafe for others.
- Permission evaluation has side effects: it tracks denials, transforms modes, and updates state.
- Most async work skips the "pending" state: work units register directly as "running".
- Fork children must not fork: a recursion guard preserves the single-level invariant.
- Context builders are memoized but manually invalidated: add invalidation or face staleness.
- Hook trust is all-or-nothing: one untrusted hook disables the entire extension system.
- Eviction requires notification: a terminal work unit is only GC-eligible after its parent has been notified.
- Skill listing budgets are tight: front-load distinctive trigger language, because tails get cut.
Full guide: Gotchas (15 failure modes with fixes).
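The priority-ordering gotcha is easy to get backwards, so a small sketch may help. It assumes each level contributes simple key/value settings and that a higher-priority level overrides a lower one key by key. A real runtime may combine instruction layers differently (for example by concatenating them), and the `resolve` helper and setting names below are invented for illustration.

```python
# Highest priority first, following the ordering stated in the gotcha above.
PRIORITY = ("local", "project", "user", "org")


def resolve(settings_by_level: dict) -> dict:
    """Merge per-level settings so that higher-priority levels win on conflicts."""
    merged = {}
    # Apply the lowest priority first so later (higher-priority) levels overwrite.
    for level in reversed(PRIORITY):
        merged.update(settings_by_level.get(level, {}))
    return merged


# Hypothetical settings at each level.
stack = {
    "org": {"test_command": "make test", "style": "org-default"},
    "user": {"style": "user-pref"},
    "project": {"test_command": "npm test"},
    "local": {"style": "local-override"},
}
print(resolve(stack))
# {'test_command': 'npm test', 'style': 'local-override'}
```

Testing with a value set at every level, as the gotcha recommends, is what exposes a reversed iteration order. A test that only sets one level passes either way.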
When to Use This Skill
Use this skill when:
- User says "I need to set up AGENTS.md for my project"
- User wants to improve their agent's reliability
- User is experiencing agent failures, lost context, or broken work
- User asks "how do I make my agent work better?"
- User wants to benchmark harness effectiveness
- User needs templates for harness files
- User is following the Learn Harness Engineering course
Communication Style
- Explain harness concepts in practical terms (kitchen analogy works well)
- Focus on measurable outcomes, not theoretical perfection
- Start minimal, add structure as needed
- Show before/after comparisons to build confidence
- Acknowledge tradeoffs (more structure = more reliability but more upfront work)
Getting Started
If the user is new to harness engineering:
- Start with assessment: Run the five-subsystem assessment on their current setup
- Pick lowest-scoring subsystem: Focus improvement efforts there first
- Create minimal viable harness: AGENTS.md + init.sh + feature_list.json
- Test with real task: Measure before/after improvement
If the user is experienced:
- Ask what specific problem: Don't assume — let them describe the pain point
- Understand harness maturity: What exists already? What's working?
- Design targeted improvements: Use reference patterns for guidance
- Optionally run benchmarks: Quantify impact with before/after comparison
When NOT to Use This Skill
This skill is about the harness around an agent, not:
- Prompt engineering or system prompt design
- Model selection or fine-tuning
- Generic software architecture (MVC, microservices)
- Chat UIs or conversational interfaces
- LLM API integration basics
If your question is about the model itself rather than the system around it, this skill does not apply.