# Harness Creator

Production harness engineering for AI coding agents.

**For:** Engineers building or extending coding-agent runtimes, custom agents, or multi-session workflows, and anyone who wants their agent to work reliably across sessions.

**Not for:** Prompt engineering, model selection, generic software architecture, or one-off agent tasks.

All principles are grounded in the Learn Harness Engineering framework and in production agent-runtime decisions.
## Choose Your Problem

| If you want to... | Read |
|---|---|
| Make the agent remember corrections and project rules between sessions | Memory Persistence |
| Package reusable workflows and domain knowledge | Skill Runtime |
| Let the agent work powerfully but not dangerously | Tool Registry & Safety |
| Give the agent the right context at the right cost | Context Engineering |
| Split work across multiple agents without chaos | Multi-agent Coordination |
| Extend behavior with hooks, background tasks, startup logic | Lifecycle & Bootstrap |
| Build the complete 5-subsystem harness | Five Subsystems Guide |
**Before you start building:** read the Gotchas — these are the non-obvious failure modes that cost the most time.
## The Five-Subsystem Harness Framework

Every harness consists of five subsystems:
- Instructions (Recipe Shelf): AGENTS.md, CLAUDE.md, docs/ hierarchy
- State (Prep Station): feature_list.json, progress.md, session-handoff.md
- Verification (Quality Check Window): Verification commands, test suites, type checks
- Scope (Task Boundaries): One-feature-at-a-time policies, definition of done
- Lifecycle (Session Management): init.sh, clean-state checklists, handoff procedures
When creating or improving a harness, systematically address each subsystem.
## Creating a Harness

### Phase 1: Context Gathering
Start by understanding the user's situation:
- What project is this for? (tech stack, size, complexity)
- What agent tool are they using? (Claude Code, Codex, Cursor, etc.)
- What exists already? (any AGENTS.md, progress tracking, verification?)
- What problems are they experiencing? (agent overreach, lost context, broken tests?)
- What's the team's tolerance for structure? (minimal vs. comprehensive)
If the user hasn't provided this context, ask before proceeding.
### Phase 2: Harness Assessment (Existing Projects)
If the user has an existing harness, assess it using the five-tuple framework:
For each subsystem, score 1-5:
- 5: Exemplary, documented, consistently followed
- 4: Good, mostly complete, occasional gaps
- 3: Adequate, covers basics, missing polish
- 2: Weak, incomplete, inconsistently applied
- 1: Missing or actively harmful
Identify the lowest-scoring subsystem — that's the bottleneck. Focus improvement efforts there first.
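The scoring-and-bottleneck step above can be sketched in a few lines. This is an illustrative sketch, not a bundled tool; the subsystem names come from the five-tuple framework, and the scores shown are hypothetical.

```typescript
// Minimal sketch of the five-tuple assessment: score each subsystem 1-5,
// then pick the lowest-scoring one as the bottleneck.
type Subsystem = "instructions" | "state" | "verification" | "scope" | "lifecycle";

type Assessment = Record<Subsystem, number>; // each value 1-5

function findBottleneck(a: Assessment): Subsystem {
  // The subsystem with the lowest score is where improvement pays off first.
  return (Object.entries(a) as [Subsystem, number][])
    .reduce((lowest, entry) => (entry[1] < lowest[1] ? entry : lowest))[0];
}

// Hypothetical scores for an existing project:
const scores: Assessment = {
  instructions: 4,
  state: 2,
  verification: 3,
  scope: 3,
  lifecycle: 4,
};

console.log(findBottleneck(scores)); // "state": fix state tracking first
```

In this hypothetical assessment, State scores lowest, so the improvement effort starts with feature_list.json and progress.md rather than with more instructions.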
### Phase 3: Design

Based on the assessment, design the harness components:

**Instructions:**
- Create a short AGENTS.md (~50-100 lines) as the routing layer
- Link to detailed docs in the docs/ directory (ARCHITECTURE.md, PRODUCT.md, etc.)
- Define the startup workflow: what the agent reads before coding

**State:**
- Create feature_list.json with feature definitions and status tracking
- Create or update progress.md for session continuity
- Design a session-handoff.md template if needed

**Verification:**
- List explicit verification commands in AGENTS.md
- Ensure init.sh runs verification
- Design quality-score tracking if appropriate

**Scope:**
- Define a one-feature-at-a-time policy
- Document feature dependencies
- Create a definition-of-done checklist

**Lifecycle:**
- Create init.sh for initialization
- Design a clean-state checklist
- Document the session handoff procedure
### Phase 4: Implementation

Create the harness files. Use bundled scripts where available:

```bash
# Use bundled scripts from scripts/ directory
# (See scripts/ section for available tools)
```
### Phase 5: Testing and Benchmarking
Test the harness with real agent sessions:
- Baseline: Run a representative task without the harness
- With Harness: Run the same task with the harness
- Measure: Success rate, time, token usage, rework
- Compare: Quantify the improvement
For rigorous benchmarking, see the "Running Benchmarks" section below.
## Harness File Templates

### AGENTS.md Structure

A minimal AGENTS.md should include:

```markdown
# AGENTS.md

[One-sentence project purpose]

## Startup Workflow

Before writing code:

1. [Step 1: e.g., Read this file]
2. [Step 2: e.g., Read ARCHITECTURE.md]
3. [Step 3: e.g., Run ./init.sh]
4. [Step 4: e.g., Read feature_list.json]

## Working Rules

- [Rule 1: e.g., One feature at a time]
- [Rule 2: e.g., Verification required before claiming done]
- [Rule 3: e.g., Update progress before ending session]

## Required Artifacts

- `feature_list.json`: Feature state tracker
- `progress.md`: Session continuity log
- `init.sh`: Standard startup and verification

## Definition of Done

A feature is done when:

- [ ] Implementation complete
- [ ] Verification passed
- [ ] Evidence recorded
- [ ] Repository restartable

## End of Session

Before ending:

1. Update progress.md
2. Update feature_list.json
3. Record blockers/risks
4. Commit with descriptive message
5. Leave clean restart path
```
### feature_list.json Structure

```json
{
  "features": [
    {
      "id": "feat-001",
      "name": "Document Import",
      "description": "Allow users to import PDF and TXT documents",
      "dependencies": [],
      "status": "done",
      "evidence": "tests pass, manual verification on 2024-01-15"
    },
    {
      "id": "feat-002",
      "name": "Document Chunking",
      "description": "Split documents into ~500 char chunks with metadata",
      "dependencies": ["feat-001"],
      "status": "in-progress",
      "evidence": ""
    }
  ]
}
```
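A feature list in this shape can be sanity-checked before a session starts. The sketch below is illustrative; the field names match the template above, but the validation rules (dependencies must reference known ids, "done" requires evidence) are assumptions rather than part of the template:

```typescript
// Sketch of a feature_list.json sanity check: every dependency must refer to a
// known feature id, and a feature marked "done" must carry evidence.
interface Feature {
  id: string;
  name: string;
  description: string;
  dependencies: string[];
  status: "todo" | "in-progress" | "done";
  evidence: string;
}

function validateFeatures(features: Feature[]): string[] {
  const ids = new Set(features.map((f) => f.id));
  const errors: string[] = [];
  for (const f of features) {
    for (const dep of f.dependencies) {
      if (!ids.has(dep)) errors.push(`${f.id}: unknown dependency ${dep}`);
    }
    if (f.status === "done" && f.evidence.trim() === "") {
      errors.push(`${f.id}: marked done without evidence`);
    }
  }
  return errors;
}

// Entries abbreviated from the template above:
const features: Feature[] = [
  { id: "feat-001", name: "Document Import", description: "Import PDF/TXT",
    dependencies: [], status: "done", evidence: "tests pass" },
  { id: "feat-002", name: "Document Chunking", description: "Split into chunks",
    dependencies: ["feat-001"], status: "in-progress", evidence: "" },
];

console.log(validateFeatures(features)); // [] when the list is consistent
```

Running such a check from init.sh catches stale or inconsistent state before the agent starts work on top of it.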
### init.sh Structure

```bash
#!/bin/bash
set -e

echo "=== Installing dependencies ==="
npm install

echo "=== Running type check ==="
npm run check

echo "=== Running tests ==="
npm test

echo "=== Building application ==="
npm run build

echo "=== Verification complete ==="
```
## Running Benchmarks

To measure harness effectiveness:

### Step 1: Define Representative Tasks
Pick 2-3 tasks that are:
- Real work the user would actually do
- Challenging enough to fail without proper harness
- Verifiable (clear success criteria)
### Step 2: Run Comparative Sessions
For each task:
- Without Harness: Run the task on a clean repo copy
- With Harness: Run the same task with the harness in place
Record:
- Success/failure
- Time taken
- Token usage
- Rework required
- Session restarts needed
### Step 3: Aggregate Results
Calculate:
- Success rate improvement
- Time efficiency change
- Token efficiency change
- Qualitative feedback
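The aggregation in Step 3 can be sketched as follows. The metric names mirror the list above; the formulas (simple before/after averages) are one reasonable choice, not a prescribed method:

```typescript
// Sketch of Step 3: aggregate with/without-harness runs into comparison metrics.
interface Run {
  success: boolean;
  minutes: number;
  tokens: number;
}

function successRate(runs: Run[]): number {
  return runs.filter((r) => r.success).length / runs.length;
}

function compare(baseline: Run[], withHarness: Run[]) {
  const avg = (runs: Run[], f: (r: Run) => number) =>
    runs.reduce((sum, r) => sum + f(r), 0) / runs.length;
  return {
    // Positive values mean the harness improved the metric.
    successRateDelta: successRate(withHarness) - successRate(baseline),
    minutesSaved: avg(baseline, (r) => r.minutes) - avg(withHarness, (r) => r.minutes),
    tokensSaved: avg(baseline, (r) => r.tokens) - avg(withHarness, (r) => r.tokens),
  };
}

// Hypothetical runs of the same task, with and without the harness:
const baseline: Run[] = [
  { success: false, minutes: 45, tokens: 120_000 },
  { success: true, minutes: 40, tokens: 100_000 },
];
const withHarness: Run[] = [
  { success: true, minutes: 30, tokens: 70_000 },
  { success: true, minutes: 28, tokens: 64_000 },
];

console.log(compare(baseline, withHarness));
```

Qualitative feedback still matters; the numbers tell you whether the harness helps, the notes tell you which components did the helping.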
### Step 4: Iterate
Use results to identify:
- Which harness components add most value
- Which components are over-engineered
- Where to focus improvement efforts
## Bundled Resources

### References (Deep-Dive Patterns)

| Document | Covers |
|---|---|
| Memory Persistence | Four-level instruction hierarchy, auto-memory taxonomy, background extraction |
| Context Engineering | Select / Compress / Isolate / Write operations, budget management |
| Tool Registry | Fail-closed registration, per-call concurrency, permission pipeline |
| Multi-Agent | Coordinator / Fork / Swarm patterns, context sharing |
| Lifecycle & Bootstrap | Hook system, long-running tasks, dependency-ordered init |
| Gotchas | 15 non-obvious failure modes with fixes |
### Templates

- — AGENTS.md / CLAUDE.md skeleton
- `templates/feature-list.json` — Feature state tracker
- — Standard initialization script
- — Session progress log
- `templates/session-handoff.md` — Session handoff template
### Scripts (Optional)

- `scripts/create-harness.ts` — Generate harness files from templates
- `scripts/validate-harness.ts` — Check harness completeness
- — Execute harness effectiveness comparison
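As a rough illustration of what a completeness check like `scripts/validate-harness.ts` might do (the bundled script may differ; the required-file list is assumed from the templates in this guide):

```typescript
// Sketch of a harness completeness check: verify the core artifacts exist.
// The file names are the ones this guide's templates produce.
import * as fs from "node:fs";
import * as path from "node:path";

const REQUIRED = ["AGENTS.md", "feature_list.json", "progress.md", "init.sh"];

function missingArtifacts(projectRoot: string): string[] {
  return REQUIRED.filter((f) => !fs.existsSync(path.join(projectRoot, f)));
}

const missing = missingArtifacts(process.cwd());
if (missing.length > 0) {
  console.error(`Harness incomplete, missing: ${missing.join(", ")}`);
} else {
  console.log("All core harness artifacts present.");
}
```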
## Gotchas

Non-obvious principles that will cause bugs if you violate them:

- **Memory index caps fire silently** — long entries become invisible once the cap is hit. Keep hooks to one line.
- **Priority ordering is counterintuitive** — local beats project beats user beats org. Test the full stack.
- **Extraction timing creates a race window** — the user can start the next turn before background extraction completes.
- **Derivable content doesn't belong in memory** — architecture and code patterns are already in the repo.
- **Concurrent classification is per-call, not per-tool** — the same tool is safe for some inputs and unsafe for others.
- **Permission evaluation has side effects** — it tracks denials, transforms modes, and updates state.
- **Most async work skips the "pending" state** — work units register directly as "running".
- **Fork children must not fork** — a recursive guard preserves the single-level invariant.
- **Context builders are memoized but manually invalidated** — add invalidation or face staleness.
- **Hook trust is all-or-nothing** — one untrusted hook disables the entire extension system.
- **Eviction requires notification** — a terminal work unit is only GC-eligible after its parent is notified.
- **Skill listing budgets are tight** — front-load distinctive trigger language; tails get cut.

Full guide: Gotchas — 15 failure modes with fixes.
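As one example, the "fork children must not fork" rule is typically enforced with a depth guard. A minimal sketch, with illustrative names not taken from any particular runtime:

```typescript
// Sketch of the single-level fork invariant: a child agent created by fork()
// refuses to fork again, so the agent tree never grows deeper than one level.
class AgentSession {
  constructor(private readonly isForkChild: boolean = false) {}

  fork(): AgentSession {
    if (this.isForkChild) {
      // Recursive guard: preserves the single-level invariant.
      throw new Error("fork children must not fork");
    }
    return new AgentSession(true);
  }
}

const root = new AgentSession();
const child = root.fork(); // allowed: root is not a fork child
let blocked = false;
try {
  child.fork(); // violates the invariant
} catch {
  blocked = true;
}
console.log(blocked); // true
```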
## When to Use This Skill
Use this skill when:
- User says "I need to set up AGENTS.md for my project"
- User wants to improve their agent's reliability
- User is experiencing agent failures, lost context, or broken work
- User asks "how do I make my agent work better?"
- User wants to benchmark harness effectiveness
- User needs templates for harness files
- User is following the Learn Harness Engineering course
## Communication Style
- Explain harness concepts in practical terms (kitchen analogy works well)
- Focus on measurable outcomes, not theoretical perfection
- Start minimal, add structure as needed
- Show before/after comparisons to build confidence
- Acknowledge tradeoffs (more structure = more reliability but more upfront work)
## Getting Started
If the user is new to harness engineering:
- Start with assessment: Run the five-tuple assessment on their current setup
- Pick lowest-scoring subsystem: Focus improvement efforts there first
- Create minimal viable harness: AGENTS.md + init.sh + feature_list.json
- Test with real task: Measure before/after improvement
If the user is experienced:
- Ask what specific problem: Don't assume — let them describe the pain point
- Understand harness maturity: What exists already? What's working?
- Design targeted improvements: Use reference patterns for guidance
- Optionally run benchmarks: Quantify impact with before/after comparison
## When NOT to Use This Skill
This skill is about the harness around an agent, not:
- Prompt engineering or system prompt design
- Model selection or fine-tuning
- Generic software architecture (MVC, microservices)
- Chat UIs or conversational interfaces
- LLM API integration basics
If your question is about the model itself rather than the system around it, this skill does not apply.
## Further Resources