Meta-Harness Optimization


Skill by ara.so — Daily 2026 Skills collection.
Meta-Harness is a framework for automated end-to-end search over model harnesses — the scaffolding code around a fixed base model that controls what the model stores, retrieves, and sees while working on a task. Rather than hand-crafting prompts and memory systems, Meta-Harness proposes, evaluates, and evolves harness implementations automatically.

Core Concepts


| Term | Meaning |
|------|---------|
| Harness | All code around the base model: memory, retrieval, prompt construction, tool use |
| Proposer Agent | LLM (e.g. Claude Code) that proposes new harness variants |
| Evaluator | Runs proposed harnesses on a benchmark, returns a score |
| Meta-Loop | Iterative propose → evaluate → feedback cycle |
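The meta-loop row can be sketched in a few lines. This is a schematic of the propose → evaluate → feedback cycle only, not the shipped `meta_harness.py`; the `propose` and `evaluate` callables here are toy placeholders:

```python
def meta_loop(propose, evaluate, iterations=10):
    """Schematic propose → evaluate → feedback cycle."""
    history = []
    best = (None, float("-inf"))
    for _ in range(iterations):
        harness = propose(history)        # proposer agent suggests a variant
        score = evaluate(harness)         # benchmark run → scalar score
        history.append((harness, score))  # feedback for the next proposal
        if score > best[1]:
            best = (harness, score)
    return best

# Toy usage: the "harness" is just an integer knob, and the score
# peaks when the knob equals 3.
best = meta_loop(
    propose=lambda h: (h[-1][0] + 1) if h else 0,
    evaluate=lambda k: -abs(k - 3),
)
print(best)  # → (3, 0)
```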

Installation


Meta-Harness uses `uv` for dependency management. Each reference experiment is self-contained:

```bash
# Text classification experiment
cd reference_examples/text_classification
uv sync

# Terminal-Bench 2 experiment
cd reference_examples/terminal_bench_2
uv sync
```

No global pip install is needed. All dependencies are managed per-experiment via `pyproject.toml`.

Quick Start


Text Classification (Memory System Search)


```bash
cd reference_examples/text_classification

# Run 1 iteration of meta-harness optimization
uv run python meta_harness.py --iterations 1

# Run more iterations for better optimization
uv run python meta_harness.py --iterations 10
```

Terminal-Bench 2 (Scaffold Evolution)


```bash
cd reference_examples/terminal_bench_2

# Smoke test with a single task
uv run bash scripts/run_eval.sh agents.baseline_kira:AgentHarness full 1 1 -i extract-elf
```

General eval format:

```
run_eval.sh <agent_module:AgentClass> <split> <num_tasks> <num_workers> [flags]
```

Applying Meta-Harness to a New Domain


The recommended workflow uses the onboarding document with your AI coding assistant:

1. Open ONBOARDING.md in your coding assistant (Claude Code, Cursor, etc.) and have a conversation about your domain. This produces domain_spec.md.
2. domain_spec.md will contain:
   - What the harness controls in your domain
   - How to evaluate harness quality (benchmark / metric)
   - What the proposer agent should modify
   - Constraints and budget considerations
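For illustration only, a domain_spec.md for a hypothetical SQL-generation domain might cover those four points like this. Every detail below is an invented example, not shipped content:

```markdown
# Domain Spec: SQL query generation (hypothetical example)

## What the harness controls
- Which schema snippets are included in the prompt
- Few-shot example selection and ordering

## Evaluation
- Benchmark: 200 held-out natural-language → SQL pairs
- Metric: execution accuracy (higher is better)

## What the proposer may modify
- HarnessConfig fields and prompt templates in harness.py

## Constraints
- Budget: at most 500 model calls per iteration
```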

Minimum Required Components for a New Domain


```
my_domain/
├── pyproject.toml          # uv-managed dependencies
├── domain_spec.md          # generated via ONBOARDING.md conversation
├── meta_harness.py         # main optimization loop
├── harness.py              # base harness implementation
├── evaluator.py            # benchmark runner → numeric score
└── claude_wrapper.py       # proposer agent wrapper
```

Implementing a Harness


A harness wraps a base model and manages context/memory/tools:

harness.py — minimal harness structure

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class HarnessConfig:
    model: str = "claude-3-5-sonnet-20241022"
    memory_strategy: str = "last_k"
    k: int = 5
    retrieval_enabled: bool = False
    system_prompt: str = "You are a helpful assistant."


class AgentHarness:
    def __init__(self, config: HarnessConfig):
        self.config = config
        self.memory: list[dict] = []

    def reset(self):
        self.memory = []

    def _build_context(self, new_input: str) -> list[dict]:
        """Core harness logic: what does the model see?"""
        if self.config.memory_strategy == "last_k":
            recent = self.memory[-self.config.k:]
        elif self.config.memory_strategy == "all":
            recent = self.memory[:]
        else:
            recent = []
        return recent + [{"role": "user", "content": new_input}]

    def step(self, user_input: str) -> str:
        messages = self._build_context(user_input)
        # Call base model with constructed context
        # (call_model is your model-API wrapper, not shown here)
        response = call_model(
            model=self.config.model,
            system=self.config.system_prompt,
            messages=messages,
        )
        # Update memory
        self.memory.append({"role": "user", "content": user_input})
        self.memory.append({"role": "assistant", "content": response})
        return response
```

Implementing the Evaluator



evaluator.py — runs harness on benchmark, returns score

```python
from harness import AgentHarness, HarnessConfig


def evaluate_harness(config: HarnessConfig, dataset: list[dict]) -> float:
    """
    Evaluate a harness configuration on a dataset.
    Returns a scalar score (higher is better).
    """
    harness = AgentHarness(config)
    correct = 0
    for example in dataset:
        harness.reset()
        prediction = harness.step(example["input"])
        if grade(prediction, example["label"]):
            correct += 1
    return correct / len(dataset)


def grade(prediction: str, label: str) -> bool:
    """Task-specific grading logic."""
    return label.lower().strip() in prediction.lower()
```
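The only contract the meta-loop needs from an evaluator is "harness in, scalar score out." A self-contained sketch with a stub harness (a hypothetical stand-in, not the shipped AgentHarness) illustrates that contract end to end:

```python
class StubHarness:
    """Hypothetical stand-in for AgentHarness: canned keyword classifier."""
    def reset(self):
        pass

    def step(self, user_input: str) -> str:
        return "sports" if "game" in user_input.lower() else "business"


def grade(prediction: str, label: str) -> bool:
    return label.lower().strip() in prediction.lower()


def evaluate(harness, dataset) -> float:
    # Same shape as evaluate_harness: accuracy over the dataset
    correct = 0
    for ex in dataset:
        harness.reset()
        if grade(harness.step(ex["input"]), ex["label"]):
            correct += 1
    return correct / len(dataset)


dataset = [
    {"input": "The game went to overtime", "label": "Sports"},
    {"input": "Markets rallied today", "label": "Business"},
]
print(evaluate(StubHarness(), dataset))  # → 1.0
```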

The Meta-Harness Loop



meta_harness.py — the optimization loop

```python
import json
from pathlib import Path

from evaluator import evaluate_harness
from claude_wrapper import run_proposer


def meta_harness_loop(
    iterations: int = 10,
    train_dataset: list = None,
    val_dataset: list = None,
):
    history: list[dict] = []
    best_score = 0.0
    best_config = None

    for i in range(iterations):
        print(f"\n=== Iteration {i+1}/{iterations} ===")

        # 1. Propose: ask the proposer agent for a new harness variant
        proposal = run_proposer(
            history=history,
            task_description="Optimize the memory system for text classification.",
            code_context=Path("harness.py").read_text(),
        )

        # 2. Evaluate: run the proposed harness
        try:
            new_config = parse_proposal(proposal)
            score = evaluate_harness(new_config, train_dataset)
        except Exception as e:
            score = 0.0
            print(f"Evaluation failed: {e}")

        # 3. Record: log result for proposer feedback
        record = {
            "iteration": i + 1,
            "proposal": proposal,
            "score": score,
        }
        history.append(record)
        print(f"Score: {score:.4f}")

        if score > best_score:
            best_score = score
            best_config = new_config
            print(f"New best: {best_score:.4f}")

    # Final validation on held-out set
    if best_config and val_dataset:
        val_score = evaluate_harness(best_config, val_dataset)
        print(f"\nFinal val score: {val_score:.4f}")

    return best_config, history
```
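The loop calls a `parse_proposal` helper that is not defined in the snippet above. One way to implement it, sketched here as an assumption rather than the shipped code, is to pull the first JSON object out of the proposer's free-form reply and build a config from it (the `HarnessConfig` below is a cut-down mirror of the one in harness.py):

```python
import json
import re
from dataclasses import dataclass


@dataclass
class HarnessConfig:  # cut-down mirror of harness.py, for illustration
    memory_strategy: str = "last_k"
    k: int = 5


def parse_proposal(proposal: str) -> HarnessConfig:
    """Extract the first JSON object from the proposer's reply."""
    match = re.search(r"\{.*?\}", proposal, re.DOTALL)
    if match is None:
        raise ValueError("no JSON config found in proposal")
    return HarnessConfig(**json.loads(match.group(0)))


reply = 'Try a longer window:\n{"memory_strategy": "last_k", "k": 12}\nThen rerun.'
print(parse_proposal(reply))  # → HarnessConfig(memory_strategy='last_k', k=12)
```

Note the non-greedy regex only handles flat (non-nested) JSON objects, which is enough when the proposer is told to emit a single config dict.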

Proposer Agent Wrapper (Claude Code)


The shipped examples use Claude Code as the proposer. Adapt `claude_wrapper.py`:

claude_wrapper.py — wraps proposer agent calls

````python
import json
from pathlib import Path

import anthropic


def run_proposer(
    history: list[dict],
    task_description: str,
    code_context: str,
) -> str:
    """
    Call Claude Code (or another proposer) to suggest harness modifications.
    Logs all interactions for reproducibility.
    """
    prompt = build_proposer_prompt(history, task_description, code_context)

    # Example: call Claude via API
    client = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY env var
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    result = response.content[0].text

    # Log for reproducibility
    log_entry = {"prompt": prompt, "response": result}
    with open("proposer_log.jsonl", "a") as f:
        f.write(json.dumps(log_entry) + "\n")

    return result


def build_proposer_prompt(
    history: list[dict],
    task_description: str,
    code_context: str,
) -> str:
    history_str = "\n".join(
        f"Iteration {h['iteration']}: score={h['score']:.4f}\nProposal:\n{h['proposal']}"
        for h in history[-5:]  # last 5 for context window
    )
    return f"""You are optimizing a model harness for: {task_description}

Current harness code:
```python
{code_context}
```

Optimization history (recent):
{history_str if history_str else "No history yet — this is the first iteration."}

Propose a modified HarnessConfig or changes to the harness code that may improve performance.
Output your proposal as a JSON config dict, followed by any code changes.
"""
````

Environment Variables

```bash
# Required for Claude-based proposer
export ANTHROPIC_API_KEY=your_key_here

# Optional: control model used
export PROPOSER_MODEL=claude-opus-4-5
export EVALUATOR_MODEL=claude-3-5-sonnet-20241022
```

Reference Experiment Structure


Text Classification (`reference_examples/text_classification/`)

Searches over memory system configurations for a classification task:
  • Proposer modifies memory strategy, retrieval settings, prompt templates
  • Evaluator scores on held-out classification benchmark
  • Optimized config is saved for reuse

```bash
uv run python meta_harness.py --iterations 20 --dataset ag_news
```

Terminal-Bench 2 (`reference_examples/terminal_bench_2/`)

Evolves agent scaffolding for computer-use / terminal tasks:

```bash
# Run baseline agent on a specific task
uv run bash scripts/run_eval.sh agents.baseline_kira:AgentHarness full 1 1 -i extract-elf
```

Arguments: module:Class <split> <num_tasks> <num_workers> [task_filter]

Optimized artifact: stanford-iris-lab/meta-harness-tbench2-artifact

Common Patterns


Saving and Loading Optimized Configs


```python
import json
from dataclasses import asdict

# Save
with open("best_config.json", "w") as f:
    json.dump(asdict(best_config), f, indent=2)

# Load
with open("best_config.json") as f:
    data = json.load(f)
config = HarnessConfig(**data)
```

Adding Early Stopping


```python
PATIENCE = 3
no_improve = 0

for i in range(iterations):
    score = evaluate_harness(config, dataset)
    if score > best_score + 1e-4:
        best_score = score
        no_improve = 0
    else:
        no_improve += 1
    if no_improve >= PATIENCE:
        print(f"Early stop at iteration {i+1}")
        break
```

Parallel Evaluation


```python
from concurrent.futures import ProcessPoolExecutor

def batch_evaluate(configs, dataset, num_workers=4):
    with ProcessPoolExecutor(max_workers=num_workers) as executor:
        futures = [executor.submit(evaluate_harness, c, dataset) for c in configs]
        return [f.result() for f in futures]
```

Troubleshooting


| Problem | Likely Cause | Fix |
|---------|--------------|-----|
| `uv sync` fails | Missing Python version | Install Python 3.11+ via `pyenv` |
| Proposer returns unparseable JSON | Prompt too vague | Add explicit JSON schema to proposer prompt |
| Scores don't improve | Too few iterations or search space too large | Increase `--iterations`, narrow config space |
| API rate limits | Too many evaluator calls | Add `time.sleep()` or batch requests |
| Claude Code not found | CLI not installed | `npm install -g @anthropic-ai/claude-code` |
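For the unparseable-JSON row, one concrete form of "add an explicit JSON schema" is to append a format constraint to the proposer prompt. This is a sketch under the assumption that the config fields match harness.py, not shipped code:

```python
# Constrain the proposer's output format to cut JSON parse failures.
# Field names assume the HarnessConfig shown in harness.py.
SCHEMA_HINT = (
    "Respond with ONE JSON object only, no prose, matching:\n"
    '{"memory_strategy": "last_k" | "all" | "none", '
    '"k": <int>, "retrieval_enabled": <bool>}'
)


def with_schema(prompt: str) -> str:
    """Append the schema constraint to any proposer prompt."""
    return prompt + "\n\n" + SCHEMA_HINT


print(with_schema("Propose a modified HarnessConfig."))
```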

Citation


```bibtex
@misc{lee2026metaharnessendtoendoptimizationmodel,
      title={Meta-Harness: End-to-End Optimization of Model Harnesses},
      author={Yoonho Lee and Roshen Nair and Qizheng Zhang and Kangwook Lee and Omar Khattab and Chelsea Finn},
      year={2026},
      eprint={2603.28052},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2603.28052},
}
```