harness-creator

Harness Creator

Production harness engineering for AI coding agents.
For: Engineers building or extending coding-agent runtimes, custom agents, multi-session workflows, or anyone who wants their agent to work reliably across sessions.
Not for: Prompt engineering, model selection, generic software architecture, or one-off agent tasks.
All principles are grounded in the Learn Harness Engineering framework and production agent runtime decisions.

Choose Your Problem

| If you want to... | Read |
|---|---|
| Make the agent remember corrections and project rules between sessions | Memory Persistence |
| Package reusable workflows and domain knowledge | Skill Runtime |
| Let the agent work powerfully but not dangerously | Tool Registry & Safety |
| Give the agent the right context at the right cost | Context Engineering |
| Split work across multiple agents without chaos | Multi-agent Coordination |
| Extend behavior with hooks, background tasks, startup logic | Lifecycle & Bootstrap |
| Build the complete 5-subsystem harness | Five Subsystems Guide |

Before you start building: Read the Gotchas. These are the non-obvious failure modes that cost the most time.

The Five-Subsystem Harness Framework

Every harness consists of five subsystems:
  1. Instructions (Recipe Shelf): AGENTS.md, CLAUDE.md, docs/ hierarchy
  2. State (Prep Station): feature_list.json, progress.md, session-handoff.md
  3. Verification (Quality Check Window): Verification commands, test suites, type checks
  4. Scope (Task Boundaries): One-feature-at-a-time policies, definition of done
  5. Lifecycle (Session Management): init.sh, clean-state checklists, handoff procedures
When creating or improving a harness, systematically address each subsystem.
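One way to make "systematically address each subsystem" checkable is to map each subsystem to the files that suggest it has been addressed. The sketch below is illustrative only: `SUBSYSTEM_ARTIFACTS` and `missingSubsystems` are hypothetical names, not part of the bundled scripts, and because Verification and Scope are policies rather than single files, the files listed for them are rough proxies chosen for this example.

```typescript
type Subsystem = "instructions" | "state" | "verification" | "scope" | "lifecycle";

// Hypothetical mapping: a subsystem counts as "addressed" if any listed file exists.
// The entries for verification and scope are proxies, not a definition.
const SUBSYSTEM_ARTIFACTS: { subsystem: Subsystem; anyOf: string[] }[] = [
  { subsystem: "instructions", anyOf: ["AGENTS.md", "CLAUDE.md"] },
  { subsystem: "state", anyOf: ["feature_list.json", "progress.md", "session-handoff.md"] },
  { subsystem: "verification", anyOf: ["init.sh"] },
  { subsystem: "scope", anyOf: ["feature_list.json"] },
  { subsystem: "lifecycle", anyOf: ["init.sh", "session-handoff.md"] },
];

// Returns the subsystems for which none of the proxy files are present.
function missingSubsystems(presentFiles: string[]): Subsystem[] {
  return SUBSYSTEM_ARTIFACTS
    .filter((entry) => !entry.anyOf.some((file) => presentFiles.indexOf(file) !== -1))
    .map((entry) => entry.subsystem);
}

// A repo with only instructions and a progress log still lacks three subsystems.
console.log(missingSubsystems(["AGENTS.md", "progress.md"]));
```

A file being present says nothing about its quality, so a check like this only flags gaps; the scoring in Phase 2 is still needed to judge what exists.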

Creating a Harness

Phase 1: Context Gathering

Start by understanding the user's situation:
  1. What project is this for? (tech stack, size, complexity)
  2. What agent tool are they using? (Claude Code, Codex, Cursor, etc.)
  3. What exists already? (any AGENTS.md, progress tracking, verification?)
  4. What problems are they experiencing? (agent overreach, lost context, broken tests?)
  5. What's the team's tolerance for structure? (minimal vs. comprehensive)
If the user hasn't provided this context, ask before proceeding.

Phase 2: Harness Assessment (Existing Projects)

If the user has an existing harness, assess it against the five-subsystem framework.
Score each subsystem from 1 to 5:
  • 5: Exemplary, documented, consistently followed
  • 4: Good, mostly complete, occasional gaps
  • 3: Adequate, covers basics, missing polish
  • 2: Weak, incomplete, inconsistently applied
  • 1: Missing or actively harmful
Identify the lowest-scoring subsystem — that's the bottleneck. Focus improvement efforts there first.
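The scoring step can be sketched as a small function that validates the 1-5 scores and returns the lowest-scoring subsystem. This is a minimal sketch, not the bundled tooling: `Scores` and `findBottleneck` are hypothetical names, and breaking ties by the order in which the subsystems are listed is an assumption made for this example.

```typescript
type Subsystem = "instructions" | "state" | "verification" | "scope" | "lifecycle";
type Scores = { [K in Subsystem]: number };

// Fixed order, used to break ties (an assumption: earlier subsystems win ties).
const ORDER: Subsystem[] = ["instructions", "state", "verification", "scope", "lifecycle"];

// Validates every score and returns the lowest-scoring subsystem.
function findBottleneck(scores: Scores): { subsystem: Subsystem; score: number } {
  let worst: Subsystem = "instructions";
  for (const subsystem of ORDER) {
    const value = scores[subsystem];
    if (Math.floor(value) !== value || value < 1 || value > 5) {
      throw new RangeError(`${subsystem}: score must be an integer from 1 to 5, got ${value}`);
    }
    if (value < scores[worst]) worst = subsystem;
  }
  return { subsystem: worst, score: scores[worst] };
}

const example: Scores = { instructions: 4, state: 2, verification: 3, scope: 2, lifecycle: 5 };
console.log(findBottleneck(example)); // { subsystem: 'state', score: 2 }
```

When two subsystems tie, it is worth looking at both rather than trusting the tie-break.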

Phase 3: Design

Based on the assessment, design the harness components:
Instructions:
  • Create a short AGENTS.md (~50-100 lines) as the routing layer
  • Link to detailed docs in docs/ directory (ARCHITECTURE.md, PRODUCT.md, etc.)
  • Define startup workflow: what the agent reads before coding
State:
  • Create feature_list.json with feature definitions and status tracking
  • Create or update progress.md for session continuity
  • Design session-handoff.md template if needed
Verification:
  • List explicit verification commands in AGENTS.md
  • Ensure init.sh runs verification
  • Design quality score tracking if appropriate
Scope:
  • Define one-feature-at-a-time policy
  • Document feature dependencies
  • Create definition of done checklist
Lifecycle:
  • Create init.sh for initialization
  • Design clean-state checklist
  • Document session handoff procedure

Phase 4: Implementation

Create the harness files. Use bundled scripts where available:
```bash
# Use bundled scripts from scripts/ directory
# (See scripts/ section for available tools)
```

Phase 5: Testing and Benchmarking

Test the harness with real agent sessions:
  1. Baseline: Run a representative task without the harness
  2. With Harness: Run the same task with the harness
  3. Measure: Success rate, time, token usage, rework
  4. Compare: Quantify the improvement
For rigorous benchmarking, see the "Running Benchmarks" section below.

Harness File Templates

AGENTS.md Structure

A minimal AGENTS.md should include:
```markdown
# AGENTS.md

[One-sentence project purpose]

## Startup Workflow

Before writing code:
1. [Step 1: e.g., Read this file]
2. [Step 2: e.g., Read ARCHITECTURE.md]
3. [Step 3: e.g., Run ./init.sh]
4. [Step 4: e.g., Read feature_list.json]

## Working Rules

- [Rule 1: e.g., One feature at a time]
- [Rule 2: e.g., Verification required before claiming done]
- [Rule 3: e.g., Update progress before ending session]

## Required Artifacts

- `feature_list.json`: Feature state tracker
- `progress.md`: Session continuity log
- `init.sh`: Standard startup and verification

## Definition of Done

A feature is done when:
- Implementation complete
- Verification passed
- Evidence recorded
- Repository restartable

## End of Session

Before ending:
1. Update progress.md
2. Update feature_list.json
3. Record blockers/risks
4. Commit with descriptive message
5. Leave clean restart path
```

feature_list.json Structure

```json
{
  "features": [
    {
      "id": "feat-001",
      "name": "Document Import",
      "description": "Allow users to import PDF and TXT documents",
      "dependencies": [],
      "status": "done",
      "evidence": "tests pass, manual verification on 2024-01-15"
    },
    {
      "id": "feat-002",
      "name": "Document Chunking",
      "description": "Split documents into ~500 char chunks with metadata",
      "dependencies": ["feat-001"],
      "status": "in-progress",
      "evidence": ""
    }
  ]
}
```
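The `dependencies` and `status` fields make it possible to apply the one-feature-at-a-time policy mechanically. The following is a minimal sketch that assumes the structure shown above; `nextFeature` and `doneWithoutEvidence` are hypothetical helper names rather than part of the bundled scripts, and treating any status other than `done` or `in-progress` as "not yet started" is an assumption.

```typescript
interface Feature {
  id: string;
  name: string;
  description: string;
  dependencies: string[];
  status: string; // "done" and "in-progress" appear in the example; other values are assumed
  evidence: string;
}

// Picks the single feature to work on next, or null if nothing is workable.
function nextFeature(features: Feature[]): Feature | null {
  // One feature at a time: resume in-progress work before starting anything new.
  const inProgress = features.filter((f) => f.status === "in-progress");
  if (inProgress.length > 0) return inProgress[0];

  // Otherwise take the first unfinished feature whose dependencies are all done.
  const doneIds = features.filter((f) => f.status === "done").map((f) => f.id);
  const ready = features.filter(
    (f) => f.status !== "done" && f.dependencies.every((d) => doneIds.indexOf(d) !== -1)
  );
  return ready.length > 0 ? ready[0] : null;
}

// A "done" feature with empty evidence contradicts the definition of done.
function doneWithoutEvidence(features: Feature[]): string[] {
  return features
    .filter((f) => f.status === "done" && f.evidence.trim() === "")
    .map((f) => f.id);
}
```

Running a check like `doneWithoutEvidence` at the end of a session is one way to catch features marked done before verification was recorded.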

init.sh Structure

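```bash
#!/bin/bash
# Standard startup and verification. Stops at the first failing step.
set -euo pipefail

echo "=== Installing dependencies ==="
npm install

# Assumes package.json defines a "check" script that runs the type checker.
echo "=== Running type check ==="
npm run check

echo "=== Running tests ==="
npm test

# Assumes package.json defines a "build" script.
echo "=== Building application ==="
npm run build

echo "=== Verification complete ==="
```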

Running Benchmarks

To measure harness effectiveness:

Step 1: Define Representative Tasks

Pick 2-3 tasks that are:
  • Real work the user would actually do
  • Challenging enough to fail without proper harness
  • Verifiable (clear success criteria)

Step 2: Run Comparative Sessions

For each task:
  • Without Harness: Run the task on a clean repo copy
  • With Harness: Run the same task with the harness in place
Record:
  • Success/failure
  • Time taken
  • Token usage
  • Rework required
  • Session restarts needed

Step 3: Aggregate Results

Calculate:
  • Success rate improvement
  • Time efficiency change
  • Token efficiency change
  • Qualitative feedback
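The quantitative part of this aggregation can be sketched as below. It is an illustration under stated assumptions, not the behavior of `scripts/run-benchmark.ts`: the `SessionResult` shape and the `summarize` and `compare` names are hypothetical, and the sample numbers are made up purely to exercise the functions.

```typescript
// Hypothetical record for one benchmark session (field names are assumptions).
interface SessionResult {
  task: string;
  withHarness: boolean;
  success: boolean;
  minutes: number;
  tokens: number;
}

// Arithmetic mean; NaN for an empty group so missing data is not mistaken for zero.
function mean(values: number[]): number {
  return values.length === 0 ? NaN : values.reduce((a, b) => a + b, 0) / values.length;
}

// Summarizes either the baseline sessions or the harness sessions.
function summarize(results: SessionResult[], withHarness: boolean) {
  const group = results.filter((r) => r.withHarness === withHarness);
  return {
    successRate: mean(group.map((r) => (r.success ? 1 : 0))),
    meanMinutes: mean(group.map((r) => r.minutes)),
    meanTokens: mean(group.map((r) => r.tokens)),
  };
}

// Harness minus baseline: a positive success delta and negative time/token deltas are improvements.
function compare(results: SessionResult[]) {
  const base = summarize(results, false);
  const harness = summarize(results, true);
  return {
    successRateDelta: harness.successRate - base.successRate,
    minutesDelta: harness.meanMinutes - base.meanMinutes,
    tokensDelta: harness.meanTokens - base.meanTokens,
  };
}

// Made-up sample data, for illustration only.
const sample: SessionResult[] = [
  { task: "A", withHarness: false, success: false, minutes: 30, tokens: 100000 },
  { task: "B", withHarness: false, success: true, minutes: 50, tokens: 140000 },
  { task: "A", withHarness: true, success: true, minutes: 20, tokens: 80000 },
  { task: "B", withHarness: true, success: true, minutes: 30, tokens: 100000 },
];
console.log(compare(sample));
```

With only 2-3 tasks these deltas are indicative rather than statistically meaningful, which is why the qualitative feedback belongs next to them.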

Step 4: Iterate

Use results to identify:
  • Which harness components add most value
  • Which components are over-engineered
  • Where to focus improvement efforts

Bundled Resources

References (Deep-Dive Patterns)

| Document | Covers |
|---|---|
| Memory Persistence | Four-level instruction hierarchy, auto-memory taxonomy, background extraction |
| Context Engineering | Select / Compress / Isolate / Write operations, budget management |
| Tool Registry | Fail-closed registration, per-call concurrency, permission pipeline |
| Multi-Agent | Coordinator / Fork / Swarm patterns, context sharing |
| Lifecycle & Bootstrap | Hook system, long-running tasks, dependency-ordered init |
| Gotchas | 15 non-obvious failure modes with fixes |

Templates

  • templates/agents.md: AGENTS.md / CLAUDE.md skeleton
  • templates/feature-list.json: Feature state tracker
  • templates/init.sh: Standard initialization script
  • templates/progress.md: Session progress log
  • templates/session-handoff.md: Session handoff template

Scripts (Optional)

  • scripts/create-harness.ts: Generate harness files from templates
  • scripts/validate-harness.ts: Check harness completeness
  • scripts/run-benchmark.ts: Execute harness effectiveness comparison

Gotchas

Non-obvious principles that will cause bugs if you violate them:
  1. Memory index caps fire silently: once the cap is hit, long entries become invisible. Keep hooks to one line.
  2. Priority ordering is counterintuitive: local beats project, project beats user, and user beats org. Test the full stack.
  3. Extraction timing creates a race window: the user can start the next turn before background extraction completes.
  4. Derivable content doesn't belong in memory: architecture and code patterns are already in the repo.
  5. Concurrency classification is per-call, not per-tool: the same tool can be safe for some inputs and unsafe for others.
  6. Permission evaluation has side effects: it tracks denials, transforms modes, and updates state.
  7. Most async work skips the "pending" state: work units register directly as "running".
  8. Fork children must not fork: a recursion guard preserves the single-level invariant.
  9. Context builders are memoized but manually invalidated: add invalidation or face staleness.
  10. Hook trust is all-or-nothing: one untrusted hook disables the entire extension system.
  11. Eviction requires notification: a terminal work unit is only GC-eligible after its parent has been notified.
  12. Skill listing budgets are tight: front-load distinctive trigger language, because tails get cut.
Full guide: Gotchas (15 failure modes with fixes).
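To make the priority-ordering gotcha concrete, the sketch below resolves a stack of instruction layers so that local overrides project, project overrides user, and user overrides org. It is a simplified model under stated assumptions: real runtimes merge free-text instruction files rather than key-value rules, and `Layer`, `PRECEDENCE`, and `resolveRules` are hypothetical names introduced for illustration.

```typescript
type Level = "org" | "user" | "project" | "local";

// Lowest to highest priority, following "local beats project beats user beats org".
const PRECEDENCE: Level[] = ["org", "user", "project", "local"];

// Simplified stand-in for an instruction file: a level plus key-value rules.
interface Layer {
  level: Level;
  rules: { [key: string]: string };
}

interface Resolved {
  [key: string]: { value: string; from: Level };
}

// Applies layers from lowest to highest priority so later layers overwrite earlier ones.
function resolveRules(layers: Layer[]): Resolved {
  const sorted = layers
    .slice()
    .sort((a, b) => PRECEDENCE.indexOf(a.level) - PRECEDENCE.indexOf(b.level));
  const resolved: Resolved = {};
  for (const layer of sorted) {
    for (const key of Object.keys(layer.rules)) {
      resolved[key] = { value: layer.rules[key], from: layer.level };
    }
  }
  return resolved;
}

// The input order does not matter; only the level does.
const stack: Layer[] = [
  { level: "local", rules: { indent: "2 spaces" } },
  { level: "org", rules: { indent: "tabs", tests: "required" } },
  { level: "user", rules: { indent: "4 spaces" } },
];
const resolved = resolveRules(stack);
console.log(resolved.indent.from, resolved.tests.from); // local org
```

Recording where each rule came from (`from`) is what makes "test the full stack" practical: an unexpected value can be traced to the layer that set it.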

When to Use This Skill

Use this skill when:
  • User says "I need to set up AGENTS.md for my project"
  • User wants to improve their agent's reliability
  • User is experiencing agent failures, lost context, or broken work
  • User asks "how do I make my agent work better?"
  • User wants to benchmark harness effectiveness
  • User needs templates for harness files
  • User is following the Learn Harness Engineering course

Communication Style

  • Explain harness concepts in practical terms (kitchen analogy works well)
  • Focus on measurable outcomes, not theoretical perfection
  • Start minimal, add structure as needed
  • Show before/after comparisons to build confidence
  • Acknowledge tradeoffs (more structure = more reliability but more upfront work)

Getting Started

If the user is new to harness engineering:

  1. Start with assessment: Run the five-subsystem assessment on their current setup
  2. Pick lowest-scoring subsystem: Focus improvement efforts there first
  3. Create minimal viable harness: AGENTS.md + init.sh + feature_list.json
  4. Test with real task: Measure before/after improvement

If the user is experienced:

  1. Ask what specific problem: Don't assume — let them describe the pain point
  2. Understand harness maturity: What exists already? What's working?
  3. Design targeted improvements: Use reference patterns for guidance
  4. Optionally run benchmarks: Quantify impact with before/after comparison

When NOT to Use This Skill

This skill is about the harness around an agent, not:
  • Prompt engineering or system prompt design
  • Model selection or fine-tuning
  • Generic software architecture (MVC, microservices)
  • Chat UIs or conversational interfaces
  • LLM API integration basics
If your question is about the model itself rather than the system around it, this skill does not apply.

Further Resources
