Meta-Harness Optimization


Skill by ara.so — Daily 2026 Skills collection.
Meta-Harness is a framework for automated end-to-end search over model harnesses — the scaffolding code around a fixed base model that controls what the model stores, retrieves, and sees while working on a task. Rather than hand-crafting prompts and memory systems, Meta-Harness proposes, evaluates, and evolves harness implementations automatically.

Core Concepts


| Term | Meaning |
|------|---------|
| Harness | All code around the base model: memory, retrieval, prompt construction, tool use |
| Proposer Agent | LLM (e.g. Claude Code) that proposes new harness variants |
| Evaluator | Runs proposed harnesses on a benchmark, returns a score |
| Meta-Loop | Iterative propose → evaluate → feedback cycle |
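The meta-loop row can be sketched in a few lines. This is a schematic of the propose → evaluate → feedback cycle only, not the shipped `meta_harness.py`; the `propose` and `evaluate` callables here are toy placeholders:

```python
def meta_loop(propose, evaluate, iterations=10):
    """Schematic propose → evaluate → feedback cycle."""
    history = []
    best = (None, float("-inf"))
    for _ in range(iterations):
        harness = propose(history)        # proposer agent suggests a variant
        score = evaluate(harness)         # benchmark run → scalar score
        history.append((harness, score))  # feedback for the next proposal
        if score > best[1]:
            best = (harness, score)
    return best

# Toy usage: the "harness" is just an integer knob, and the score
# peaks when the knob equals 3.
best = meta_loop(
    propose=lambda h: (h[-1][0] + 1) if h else 0,
    evaluate=lambda k: -abs(k - 3),
)
print(best)  # → (3, 0)
```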

Installation


Meta-Harness uses `uv` for dependency management. Each reference experiment is self-contained:

```bash
# Text classification experiment
cd reference_examples/text_classification
uv sync

# Terminal-Bench 2 experiment
cd reference_examples/terminal_bench_2
uv sync
```

No global pip install is needed. All dependencies are managed per-experiment via `pyproject.toml`.

Quick Start


Text Classification (Memory System Search)


```bash
cd reference_examples/text_classification

# Run 1 iteration of meta-harness optimization
uv run python meta_harness.py --iterations 1

# Run more iterations for better optimization
uv run python meta_harness.py --iterations 10
```

Terminal-Bench 2 (Scaffold Evolution)


```bash
cd reference_examples/terminal_bench_2

# Smoke test with a single task
uv run bash scripts/run_eval.sh agents.baseline_kira:AgentHarness full 1 1 -i extract-elf
```

General eval format:

```
run_eval.sh <agent_module:AgentClass> <split> <num_tasks> <num_workers> [flags]
```

Applying Meta-Harness to a New Domain


The recommended workflow uses the onboarding document with your AI coding assistant:

1. Open ONBOARDING.md in your coding assistant (Claude Code, Cursor, etc.) and have a conversation about your domain. This produces domain_spec.md.
2. domain_spec.md will contain:
   - What the harness controls in your domain
   - How to evaluate harness quality (benchmark / metric)
   - What the proposer agent should modify
   - Constraints and budget considerations
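For illustration only, a domain_spec.md for a hypothetical SQL-generation domain might cover those four points like this. Every detail below is an invented example, not shipped content:

```markdown
# Domain Spec: SQL query generation (hypothetical example)

## What the harness controls
- Which schema snippets are included in the prompt
- Few-shot example selection and ordering

## Evaluation
- Benchmark: 200 held-out natural-language → SQL pairs
- Metric: execution accuracy (higher is better)

## What the proposer may modify
- HarnessConfig fields and prompt templates in harness.py

## Constraints
- Budget: at most 500 model calls per iteration
```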

Minimum Required Components for a New Domain


```
my_domain/
├── pyproject.toml          # uv-managed dependencies
├── domain_spec.md          # generated via ONBOARDING.md conversation
├── meta_harness.py         # main optimization loop
├── harness.py              # base harness implementation
├── evaluator.py            # benchmark runner → numeric score
└── claude_wrapper.py       # proposer agent wrapper
```

Implementing a Harness


A harness wraps a base model and manages context/memory/tools:

harness.py — minimal harness structure

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class HarnessConfig:
    model: str = "claude-3-5-sonnet-20241022"
    memory_strategy: str = "last_k"
    k: int = 5
    retrieval_enabled: bool = False
    system_prompt: str = "You are a helpful assistant."


class AgentHarness:
    def __init__(self, config: HarnessConfig):
        self.config = config
        self.memory: list[dict] = []

    def reset(self):
        self.memory = []

    def _build_context(self, new_input: str) -> list[dict]:
        """Core harness logic: what does the model see?"""
        if self.config.memory_strategy == "last_k":
            recent = self.memory[-self.config.k:]
        elif self.config.memory_strategy == "all":
            recent = self.memory[:]
        else:
            recent = []
        return recent + [{"role": "user", "content": new_input}]

    def step(self, user_input: str) -> str:
        messages = self._build_context(user_input)
        # Call base model with constructed context
        # (call_model is your model-API wrapper, not shown here)
        response = call_model(
            model=self.config.model,
            system=self.config.system_prompt,
            messages=messages,
        )
        # Update memory
        self.memory.append({"role": "user", "content": user_input})
        self.memory.append({"role": "assistant", "content": response})
        return response
```

Implementing the Evaluator



evaluator.py — runs harness on benchmark, returns score

```python
from harness import AgentHarness, HarnessConfig


def evaluate_harness(config: HarnessConfig, dataset: list[dict]) -> float:
    """
    Evaluate a harness configuration on a dataset.
    Returns a scalar score (higher is better).
    """
    harness = AgentHarness(config)
    correct = 0
    for example in dataset:
        harness.reset()
        prediction = harness.step(example["input"])
        if grade(prediction, example["label"]):
            correct += 1
    return correct / len(dataset)


def grade(prediction: str, label: str) -> bool:
    """Task-specific grading logic."""
    return label.lower().strip() in prediction.lower()
```
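The only contract the meta-loop needs from an evaluator is "harness in, scalar score out." A self-contained sketch with a stub harness (a hypothetical stand-in, not the shipped AgentHarness) illustrates that contract end to end:

```python
class StubHarness:
    """Hypothetical stand-in for AgentHarness: canned keyword classifier."""
    def reset(self):
        pass

    def step(self, user_input: str) -> str:
        return "sports" if "game" in user_input.lower() else "business"


def grade(prediction: str, label: str) -> bool:
    return label.lower().strip() in prediction.lower()


def evaluate(harness, dataset) -> float:
    # Same shape as evaluate_harness: accuracy over the dataset
    correct = 0
    for ex in dataset:
        harness.reset()
        if grade(harness.step(ex["input"]), ex["label"]):
            correct += 1
    return correct / len(dataset)


dataset = [
    {"input": "The game went to overtime", "label": "Sports"},
    {"input": "Markets rallied today", "label": "Business"},
]
print(evaluate(StubHarness(), dataset))  # → 1.0
```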

The Meta-Harness Loop



meta_harness.py — the optimization loop

```python
import json
from pathlib import Path

from evaluator import evaluate_harness
from claude_wrapper import run_proposer


def meta_harness_loop(
    iterations: int = 10,
    train_dataset: list = None,
    val_dataset: list = None,
):
    history: list[dict] = []
    best_score = 0.0
    best_config = None

    for i in range(iterations):
        print(f"\n=== Iteration {i+1}/{iterations} ===")

        # 1. Propose: ask the proposer agent for a new harness variant
        proposal = run_proposer(
            history=history,
            task_description="Optimize the memory system for text classification.",
            code_context=Path("harness.py").read_text(),
        )

        # 2. Evaluate: run the proposed harness
        try:
            new_config = parse_proposal(proposal)
            score = evaluate_harness(new_config, train_dataset)
        except Exception as e:
            score = 0.0
            print(f"Evaluation failed: {e}")

        # 3. Record: log result for proposer feedback
        record = {
            "iteration": i + 1,
            "proposal": proposal,
            "score": score,
        }
        history.append(record)
        print(f"Score: {score:.4f}")

        if score > best_score:
            best_score = score
            best_config = new_config
            print(f"New best: {best_score:.4f}")

    # Final validation on held-out set
    if best_config and val_dataset:
        val_score = evaluate_harness(best_config, val_dataset)
        print(f"\nFinal val score: {val_score:.4f}")

    return best_config, history
```
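The loop calls a `parse_proposal` helper that is not defined in the snippet above. One way to implement it, sketched here as an assumption rather than the shipped code, is to pull the first JSON object out of the proposer's free-form reply and build a config from it (the `HarnessConfig` below is a cut-down mirror of the one in harness.py):

```python
import json
import re
from dataclasses import dataclass


@dataclass
class HarnessConfig:  # cut-down mirror of harness.py, for illustration
    memory_strategy: str = "last_k"
    k: int = 5


def parse_proposal(proposal: str) -> HarnessConfig:
    """Extract the first JSON object from the proposer's reply."""
    match = re.search(r"\{.*?\}", proposal, re.DOTALL)
    if match is None:
        raise ValueError("no JSON config found in proposal")
    return HarnessConfig(**json.loads(match.group(0)))


reply = 'Try a longer window:\n{"memory_strategy": "last_k", "k": 12}\nThen rerun.'
print(parse_proposal(reply))  # → HarnessConfig(memory_strategy='last_k', k=12)
```

Note the non-greedy regex only handles flat (non-nested) JSON objects, which is enough when the proposer is told to emit a single config dict.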

Proposer Agent Wrapper (Claude Code)


The shipped examples use Claude Code as the proposer. Adapt `claude_wrapper.py`:

claude_wrapper.py — wraps proposer agent calls

````python
import json
from pathlib import Path

import anthropic


def run_proposer(
    history: list[dict],
    task_description: str,
    code_context: str,
) -> str:
    """
    Call Claude Code (or another proposer) to suggest harness modifications.
    Logs all interactions for reproducibility.
    """
    prompt = build_proposer_prompt(history, task_description, code_context)

    # Example: call Claude via API
    client = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY env var
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    result = response.content[0].text

    # Log for reproducibility
    log_entry = {"prompt": prompt, "response": result}
    with open("proposer_log.jsonl", "a") as f:
        f.write(json.dumps(log_entry) + "\n")

    return result


def build_proposer_prompt(
    history: list[dict],
    task_description: str,
    code_context: str,
) -> str:
    history_str = "\n".join(
        f"Iteration {h['iteration']}: score={h['score']:.4f}\nProposal:\n{h['proposal']}"
        for h in history[-5:]  # last 5 for context window
    )
    return f"""You are optimizing a model harness for: {task_description}

Current harness code:
```python
{code_context}
```

Optimization history (recent):
{history_str if history_str else "No history yet — this is the first iteration."}

Propose a modified HarnessConfig or changes to the harness code that may improve performance.
Output your proposal as a JSON config dict, followed by any code changes.
"""
````

Environment Variables

```bash
# Required for Claude-based proposer
export ANTHROPIC_API_KEY=your_key_here

# Optional: control model used
export PROPOSER_MODEL=claude-opus-4-5
export EVALUATOR_MODEL=claude-3-5-sonnet-20241022
```

Reference Experiment Structure


Text Classification (`reference_examples/text_classification/`)

Searches over memory system configurations for a classification task:
  • Proposer modifies memory strategy, retrieval settings, prompt templates
  • Evaluator scores on held-out classification benchmark
  • Optimized config is saved for reuse

```bash
uv run python meta_harness.py --iterations 20 --dataset ag_news
```

Terminal-Bench 2 (`reference_examples/terminal_bench_2/`)

Evolves agent scaffolding for computer-use / terminal tasks:

```bash
# Run baseline agent on a specific task
uv run bash scripts/run_eval.sh agents.baseline_kira:AgentHarness full 1 1 -i extract-elf
```

Arguments: module:Class <split> <num_tasks> <num_workers> [task_filter]

Optimized artifact: stanford-iris-lab/meta-harness-tbench2-artifact

Common Patterns


Saving and Loading Optimized Configs


```python
import json
from dataclasses import asdict

# Save
with open("best_config.json", "w") as f:
    json.dump(asdict(best_config), f, indent=2)

# Load
with open("best_config.json") as f:
    data = json.load(f)
config = HarnessConfig(**data)
```

Adding Early Stopping


```python
PATIENCE = 3
no_improve = 0

for i in range(iterations):
    score = evaluate_harness(config, dataset)
    if score > best_score + 1e-4:
        best_score = score
        no_improve = 0
    else:
        no_improve += 1
    if no_improve >= PATIENCE:
        print(f"Early stop at iteration {i+1}")
        break
```

Parallel Evaluation


```python
from concurrent.futures import ProcessPoolExecutor

def batch_evaluate(configs, dataset, num_workers=4):
    with ProcessPoolExecutor(max_workers=num_workers) as executor:
        futures = [executor.submit(evaluate_harness, c, dataset) for c in configs]
        return [f.result() for f in futures]
```

Troubleshooting


| Problem | Likely Cause | Fix |
|---------|--------------|-----|
| `uv sync` fails | Missing Python version | Install Python 3.11+ via `pyenv` |
| Proposer returns unparseable JSON | Prompt too vague | Add explicit JSON schema to proposer prompt |
| Scores don't improve | Too few iterations or search space too large | Increase `--iterations`, narrow config space |
| API rate limits | Too many evaluator calls | Add `time.sleep()` or batch requests |
| Claude Code not found | CLI not installed | `npm install -g @anthropic-ai/claude-code` |
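For the unparseable-JSON row, one concrete form of "add an explicit JSON schema" is to append a format constraint to the proposer prompt. This is a sketch under the assumption that the config fields match harness.py, not shipped code:

```python
# Constrain the proposer's output format to cut JSON parse failures.
# Field names assume the HarnessConfig shown in harness.py.
SCHEMA_HINT = (
    "Respond with ONE JSON object only, no prose, matching:\n"
    '{"memory_strategy": "last_k" | "all" | "none", '
    '"k": <int>, "retrieval_enabled": <bool>}'
)


def with_schema(prompt: str) -> str:
    """Append the schema constraint to any proposer prompt."""
    return prompt + "\n\n" + SCHEMA_HINT


print(with_schema("Propose a modified HarnessConfig."))
```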

Citation


```bibtex
@misc{lee2026metaharnessendtoendoptimizationmodel,
      title={Meta-Harness: End-to-End Optimization of Model Harnesses},
      author={Yoonho Lee and Roshen Nair and Qizheng Zhang and Kangwook Lee and Omar Khattab and Chelsea Finn},
      year={2026},
      eprint={2603.28052},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2603.28052},
}
```