optimize
Run the optimization loop. Each round, the orchestrator writes structured briefs and spawns parallel subagents that execute within them. Each subagent is semi-autonomous: it reads the pointer traces, forms the concrete edit, runs experiments, and can iterate within its branch. Runs until interrupted or the stall limit is reached.
Host conventions
This skill runs on any host that implements the Agent Skills spec. When the body uses generic phrases, apply the host's best-fit equivalent:
- "spawn N subagents in parallel" -- use your host's parallel-subagent or background-task tool if you have one (e.g. Agent with run_in_background, spawn_agent + wait_agent, spawn_agents_on_csv for batch). Respect the host's concurrency cap -- if N exceeds it, run in batches. If the host has no parallel-subagent tool, run them serially and note the reduced round width in the final summary.
- Slash commands shown in user-facing copy (e.g. /evo:optimize) -- translate to your host's mention syntax when speaking to the user (e.g. $evo optimize on Codex -- plugin namespace then skill name, separated by a space).
- evo dispatch (optimization subagents only, host-specific). On claude-code, the optimization subagents in step 5 are launched via evo dispatch run --background instead of the host's parallel-Task tool. This shares an explorer's KV cache across siblings of the same parent (~99% prefix reuse). On every other host (codex, opencode, openclaw, hermes, generic), evo dispatch is unsupported and exits 2 with guidance -- step 5 stays on the host's parallel-Task primitive there. The scan sub-agents in step 3 always use the host's parallel-Task tool regardless of host (they don't allocate experiments, so fork-cache doesn't apply).
Configuration
These defaults can be overridden via arguments:
/optimize [subagents=N] [budget=N] [stall=N]

- subagents: number of parallel subagents per round (default: 5)
- budget: max iterations each subagent can run within its branch (default: 5)
- stall: consecutive rounds with no improvement before auto-stopping (default: 5)
Prerequisites
- Workspace must be initialized (evo status should succeed)
- A baseline experiment must be committed (run /discover first)
- All benchmark dependencies must be available in the environment
Architecture
Orchestrator (this agent):
- Reads state, identifies failure patterns cross-cutting the tree
- Writes one brief per subagent: objective + parent + boundaries + pointer traces
- Verifies briefs are diverse (no two attacking the same surface)
- Collects results, prunes dead branches, adjusts strategy
Subagent A (brief, budget: N iterations):
- Reads its pointer traces, forms the concrete edit
- Creates experiment, edits target, runs benchmark, analyzes
- If budget remains and sees a promising follow-up, continues
- Can run up to N serial experiments on its own branch
- Returns: what it tried, what worked, what it learned
Subagent B (different brief, budget: N iterations):
- Same protocol, non-overlapping objective
...

Both layers read traces; the depth differs. The orchestrator scans for cross-cutting patterns (which failures are common, which branches plateau) -- enough to pick N non-overlapping briefs. Subagents read their pointer traces in depth, enough to commit to a concrete edit. Structured briefs are what prevent parallel subagents from duplicating each other's work.
Trace instrumentation style: .evo/meta.json's instrumentation_mode records sdk vs inline. Subagents must stay consistent with it (see skills/subagent/SKILL.md for details).
The Loop
Repeat until interrupted or stall limit reached:
1. Read current state
Step 0 (always, before anything else): ensure you have read .evo/project.md. Its contents define the optimization metric and goal; every pattern you report MUST be tied to a failure mode of that goal. If you haven't read it yet in this session, read it now -- before listing experiments, opening any outcome.json, or spawning sub-agents. If the file is missing, stop and tell the user.

Step 0.1 (one-time auto-migration): ensure the workspace knows what runtime you are. Run evo host show. If it prints <not set>, the workspace pre-dates the host signature field -- declare your runtime now: evo host set <one of: claude-code, codex, opencode, openclaw, hermes, generic>. You know what runtime you're operating in (your system prompt, env vars like CLAUDECODE=1, your own self-identity); pick the matching value. If genuinely unsure, use generic. Do not ask the user -- this is a one-time silent migration. Subsequent rounds skip this step (evo host show returns the recorded value). After this, on claude-code the dispatch fork-cache becomes available (step 5a); other runtimes get no behavior change.

evo scratchpad         # full state: tree, best path, frontier, annotations, diffs, gates, what-not-to-try
evo frontier           # explorable nodes ranked by the configured strategy (JSON envelope: {strategy, nodes[{id,score,rank,...}], generated_at})
evo status             # one-line summary
evo annotations        # all annotations (filterable with --task/--exp)
evo path <id>          # root-to-node chain with scores
evo diff <id>          # diff vs parent
evo diff <id> <other>  # diff between any two experiments
evo gate list <id>     # effective gates for a node (inherited from ancestors)
2. Analyze state and do structural aggregation
From the scratchpad, frontier, traces, and annotations, determine:
- Which frontier nodes are most promising (evo frontier returns them already ranked under the configured strategy -- use its ordering rather than re-ranking; override with evo frontier --strategy ... only if you have a specific reason)
- What failure patterns are most common and impactful
- What strategies have been tried and their outcomes
- Which branches are plateauing or exhausted
- What gates exist on each frontier node (evo gate list <id>) -- subagents must satisfy these
Read the "Awaiting Decision" section of the scratchpad. Evaluated nodes (ran, bad outcome, not yet discarded) are a cross-agent signal: if three subagents in the last round produced evaluated nodes that all failed the same gate, surface the pattern -- maybe the gate is too tight, maybe the approach has a shared flaw. Either tell the next round to avoid it, or propose a brief that attacks it directly. Without this cross-cutting read, each subagent rediscovers the same wall independently.
Structural pass. For the evaluated nodes this round, load their outcome.json files into Python and aggregate: co-occurring gate_failures, shared zero-score task IDs in benchmark.result.tasks, recurring substrings across error fields.

Emit intersections explicitly. After computing the per-pattern sets (call them A, B, ...), MUST emit each pairwise intersection A ∩ B as a distinct pattern entry whenever at least 2 experiments exhibit both. Intersections carry different strategic implications from their components (compound failures warrant different briefs than single-failure clusters) and do not reconstruct from sub-agent summaries -- this is a parent-level aggregation that must happen inline.

Improvers are a pattern too. Enumerate the committed improvers (experiments with outcome=committed and score > parent_score) as a distinct pattern entry: they are candidate parent nodes for next-round branching and feed the brief's Parent node field.

Hold all these findings; step 4's brief-writing combines them with the scan sub-agents' findings from step 3.
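A minimal sketch of this structural pass, assuming each evaluated node's outcome.json carries the gate_failures and benchmark.result.tasks fields named above (the exact field layout is an assumption; error-substring mining is omitted for brevity):

```python
import json
from itertools import combinations
from pathlib import Path

def structural_pass(exp_dirs):
    """Aggregate outcome.json across this round's evaluated nodes."""
    patterns = {}  # pattern key -> set of experiment IDs exhibiting it
    for d in exp_dirs:
        d = Path(d)
        outcome = json.loads((d / "outcome.json").read_text())
        for gate in outcome.get("gate_failures", []):
            patterns.setdefault(f"gate:{gate}", set()).add(d.name)
        tasks = outcome.get("benchmark", {}).get("result", {}).get("tasks", {})
        for task_id, task in tasks.items():
            if task.get("score") == 0:
                patterns.setdefault(f"zero:{task_id}", set()).add(d.name)
    # Emit each pairwise intersection as its own pattern entry
    # whenever at least 2 experiments exhibit both components.
    for (ka, a), (kb, b) in combinations(list(patterns.items()), 2):
        both = a & b
        if len(both) >= 2:
            patterns[f"{ka} ∩ {kb}"] = both
    return patterns
```

The snapshot in `combinations(list(...), 2)` matters: intersections are derived from the base patterns only, not from other intersections.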
3. Spawn scan sub-agents for cross-cutting free-text analysis
Hard rule (primary delegation). The orchestrator MUST spawn at least one scan sub-agent via your host's parallel-subagent tool in every round before emitting any pattern. This applies to all scan input -- outcome.json, traces/task_*.json, annotations, and error fields alike -- regardless of file size, structure, or whether the orchestrator believes a script would be faster. An inline Python aggregation over outcome.json does NOT substitute for delegation; it may supplement sub-agent findings (step 2's structural pass still runs), but step 3's scan sub-agents MUST still run. If you reach step 4 without a completed scan sub-agent call in step 3, you have violated this rule -- stop and spawn one.

Narrow exception (verification). After scan sub-agents have returned findings, the orchestrator MAY read individual trace files to: verify a specific finding before citing it in a brief, spot-check a pattern the orchestrator is unsure about, or pull a short quote for a brief's Objective or Pointer Traces field. These verification reads must be narrow (<=3 trace files per round, targeted at experiment IDs already surfaced by sub-agents). This exception does NOT let you skip the hard rule above -- it only governs what you may do after sub-agents have already run.
Partition the evaluated experiments into batches small enough that each sub-agent can read its batch's traces in one pass. Spawn one scan sub-agent per batch in a single batch using your host's parallel-subagent tool (see "Host conventions"). They must execute in parallel, not sequentially.
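The batching itself is mechanical; a sketch, where the per-batch cap is an illustrative guess at what one sub-agent can read in a single pass:

```python
def partition(exp_ids, batch_size=8):
    """Split the round's evaluated experiments into scan-sub-agent batches."""
    return [exp_ids[i:i + batch_size] for i in range(0, len(exp_ids), batch_size)]
```

One scan sub-agent is then spawned per returned batch, all in a single parallel call.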
Pass this brief verbatim as the sub-agent's prompt:
You are a read-only evo scan sub-agent. Do not run experiments or edit code.

Start by reading .evo/project.md to understand the optimization goal and metric. All your findings should be relevant to this goal.

Your batch: [exp_IDs].

For each experiment, read outcome.json and traces/task_*.json. Also consider hypothesis and prose error text.

Find patterns that will populate the next round's subagent briefs:
- Shared failure causes -- root-cause reasons recurring across 2+ experiments (the why, not the surface gate name). Feeds brief objectives.
- Wall patterns -- approaches or gates multiple experiments consistently fail on. Feeds brief boundaries / anti-patterns.
- Compound-failure standouts -- single experiments hitting multiple failure modes. Feeds brief pointer traces.
Prioritize patterns tied to the goal's core failure modes or critical tasks. Deprioritize incidental observations. Skip: trace-shape statistics, fixture-structural facts, hypothesis-string-reuse, or anything the orchestrator can't act on in a brief.

If your batch is still too heavy, partition further and spawn scan sub-agents recursively (same brief, smaller batch).

Return JSON only:

{"findings": [{"description": "<short>", "experiment_ids": ["exp_XXXX", ...], "evidence": ["<short snippet>", ...]}]}

Evidence must be verbatim quotes from outcome.json fields, trace messages, or error text -- not paraphrases. Each description must be supported by the quoted evidence. Do not speculate about causal chains (e.g., "approach X regresses because it removes Y") unless a specific trace message or error field directly states that mechanism. If you cannot cite verbatim evidence for a finding, drop it -- err on under-reporting.

Evidence: short quotes (<200 chars each), max 3 per finding.
Wait for all scan sub-agents to return. Reconcile near-duplicate findings (timeout_error ≈ error_timeout) by judgment and combine with the structural-pass findings from step 2.

Verify every pattern before emitting it. For each pattern in your final output, confirm that at least one reported experiment's outcome.json or trace content contains evidence that directly supports the pattern's description. If you cannot cite a specific field value or quoted message as evidence, drop the pattern. Do not emit speculative causal attributions ("approach X regresses because it removes Y") unless the trace or error text explicitly states that mechanism. This filter applies to both sub-agent findings and your own inline observations.
These unified, verified cross-cutting findings feed step 4's brief-writing.
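The reconciliation step can be sketched as a token-set merge over the findings JSON; the normalization heuristic here (lowercase, underscores to spaces, order-insensitive) is an assumption, not an evo API:

```python
def reconcile(findings):
    """Merge near-duplicate findings (e.g. 'timeout_error' vs 'error_timeout')
    whose descriptions contain the same normalized tokens."""
    merged = []
    for f in findings:
        key = frozenset(f["description"].lower().replace("_", " ").split())
        for m in merged:
            if m["_key"] == key:  # same tokens in any order -> same finding
                m["experiment_ids"] = sorted(set(m["experiment_ids"]) | set(f["experiment_ids"]))
                m["evidence"] = (m["evidence"] + f["evidence"])[:3]  # cap at 3 quotes
                break
        else:
            merged.append({**f, "_key": key})
    return [{k: v for k, v in m.items() if k != "_key"} for m in merged]
```

Judgment still applies on top: two findings can be distinct even when their wording nearly matches.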
4. Write subagent briefs
Write one brief per subagent with these four fields:
- Objective -- one sentence describing the bottleneck to attack and the evidence for it. Should name where in the system's behavior the gain is hiding (e.g., "tool-use error recovery fails after the first bad call across tasks 2, 5, 7") but must not name specific files, functions, or concrete edits -- that's the subagent's job after it reads the code.
- Parent node -- which experiment to branch from.
- Boundaries / anti-patterns -- what this subagent should NOT try, explicitly called out with reasons. Include approaches already tried and discarded (from "What Not To Try"), gates it must not regress, and anything adjacent subagents in this round are doing (so it doesn't duplicate).
- Pointer traces -- task IDs the subagent should study first, with a one-line reason each.
Be specific and bounded. Vague briefs like "improve accuracy" cause subagents to duplicate each other's work; structured briefs prevent it.
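The four fields can be carried as a small structure when composing briefs (names here are illustrative, not an evo API):

```python
from dataclasses import dataclass, field

@dataclass
class Brief:
    objective: str         # bottleneck + evidence; no file, function, or edit names
    parent: str            # experiment ID to branch from
    boundaries: list[str]  # explicit anti-patterns, each with a reason
    pointer_traces: dict[str, str] = field(default_factory=dict)  # task ID -> one-line reason
```

Keeping pointer_traces as a mapping forces the one-line reason the brief format requires.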
Diversity check (before spawning). Re-read the N briefs side by side. If two briefs:
- point at the same objective phrased differently, OR
- cite overlapping pointer traces without meaningfully different framings, OR
- attack the same area of the system,
merge or re-scope one of them. The frontier/pruning logic handles tree-level exploration vs exploitation algorithmically -- the orchestrator's job is just to make sure the round's N briefs don't collapse onto each other.
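One rough way to flag the pointer-trace overlap condition mechanically (the 0.5 Jaccard threshold and the brief shape are assumptions; the objective and surface checks above remain judgment calls):

```python
from itertools import combinations

def overlapping_pairs(briefs, threshold=0.5):
    """Return index pairs of briefs whose pointer-trace sets overlap heavily."""
    flagged = []
    for (i, a), (j, b) in combinations(enumerate(briefs), 2):
        ta, tb = set(a["pointer_traces"]), set(b["pointer_traces"])
        if ta and tb:
            jaccard = len(ta & tb) / len(ta | tb)
            if jaccard >= threshold:
                flagged.append((i, j))
    return flagged
```

A flagged pair is a candidate for merging or re-scoping, not an automatic rejection: shared traces with meaningfully different framings are allowed.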
5. Spawn parallel optimization subagents
The mechanism depends on the host recorded by evo init / step 0.1's migration (see evo host show). Both paths produce the same observable outcome: N parallel children, each allocating an experiment under its assigned parent and running the worker protocol up to budget. The fork-cache path is faster + cheaper when available.
evo initevo host show机制取决于/步骤0.1迁移记录的主机(请参阅)。两种路径产生相同的可观察结果:N个并行子代理,每个在指定的父节点下分配一个实验,并运行工作协议直至达到预算。当分支缓存可用时,该路径更快、成本更低。
5a. claude-code → evo dispatch
On claude-code, dispatch one child per brief. Each call allocates a new experiment under <parent>, ensures an explorer session for that parent is warm (lazy on first dispatch), and forks a child via claude -p --resume <SID> --fork-session.
# Fan out N children in parallel
for brief in <each brief>:
evo dispatch run
--parent <parent_id>
-m "<brief: objective + boundaries + pointer traces, formatted as you wrote it>"
--budget <budget>
[--explore-context "<focus hint, only on the first dispatch of the round>"]
--background
Block until all complete
evo dispatch wait
The `--explore-context` flag is shared per parent (it shapes the explorer's read pass) -- pass it on the first dispatch from a given parent in a round, omit on subsequent ones. If you need a different focus mid-round, pass `--refresh-explorer` to force a rebuild.
Each child inherits the explorer's transcript (worker protocol from `subagent/SKILL.md` + the parent's code reads), and its first user message is just the brief + budget. You don't pass the protocol again per child -- that's what the explorer's KV cache is for.
5b. all other hosts → host's parallel-Task primitive (existing path)
Spawn all subagents in a single batch using your host's parallel-subagent tool (see "Host conventions"). They must execute in parallel, not sequentially -- serial execution defeats the per-round width.
Pick a faster model for straightforward briefs and a stronger model for harder ones requiring deeper trace analysis, if your host exposes per-call model selection.
Each subagent prompt must include:
- An instruction to read skills/subagent/SKILL.md and follow its protocol
- The four-field brief verbatim (objective, parent, boundaries/anti-patterns, pointer traces)
- The iteration budget
- A one-paragraph scratchpad summary (current best score, frontier nodes, recent failures) for context
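Assembling one subagent prompt from those four parts might look like this (the template wording is illustrative; only the checklist items are from the spec):

```python
def subagent_prompt(brief_text, budget, scratchpad_summary):
    """Compose one optimization-subagent prompt per the checklist above."""
    return "\n\n".join([
        "Read skills/subagent/SKILL.md and follow its protocol.",
        brief_text,  # the four-field brief, verbatim
        f"Iteration budget: {budget}",
        f"Context: {scratchpad_summary}",
    ])
```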
6. Collect results and update state
After all subagents complete:
- Review each subagent's summary
- Record the round's best score and compare to the previous best
- If no subagent improved the score, increment the stall counter
- If any improved, reset the stall counter
- Check if subagents added new gates -- note these in your state tracking
- If multiple experiments failed the same gate, consider whether the gate is too restrictive or the briefs were aimed at the wrong surface
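The stall-counter bookkeeping above reduces to a small update rule (a sketch; the state shape is assumed):

```python
def update_stall(round_best, previous_best, stall_counter):
    """Reset the stall counter on improvement, otherwise increment it.

    Returns the new (best_score, stall_counter) pair.
    """
    if round_best > previous_best:
        return round_best, 0
    return previous_best, stall_counter + 1
```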
Cross-cut the round's evaluated nodes. Before moving on, read experiments/<id>/attempts/NNN/outcome.json for each evaluated node from this round. The structured gates[] entries and benchmark.result let you spot shared failure modes the subagent summaries may have glossed over (e.g., three different subagents produced evaluated nodes whose gate_failures all included refund_flow -- that's a structural constraint the next round must confront, not three independent bad hypotheses).

Prune dead branches where 3+ children all regressed:
evo prune <exp_id> --reason "exhausted: N children all regressed"

Update notes with cross-cutting learnings:
evo set <exp_id> --note "key insight from round N"
7. Continue or stop
Continue if:
- Stall counter < stall limit
- User hasn't interrupted
- Score hasn't reached the theoretical maximum
Stop if:
- Stall counter >= stall limit (N consecutive rounds with no improvement)
- Score reached theoretical maximum (1.0 for max metric, 0.0 for min metric)
- User interrupted
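The continue/stop conditions above combine into one predicate (assuming a max-style metric whose theoretical maximum is 1.0):

```python
def should_stop(stall_counter, stall_limit, best_score, interrupted,
                theoretical_max=1.0):
    """True once any stop condition from the list above holds."""
    return (
        stall_counter >= stall_limit   # N consecutive rounds, no improvement
        or best_score >= theoretical_max
        or interrupted                 # user asked to stop
    )
```

For a min-style metric, flip the score comparison against 0.0.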
On stop, print a final summary:
- Best score achieved and experiment ID
- Total experiments run across all rounds
- The winning diff: evo diff <best_exp_id>
- Suggested next steps if the score hasn't converged
Go back to step 1.
Resetting the eval epoch
evo infra -m "<reason>" --breaking resets current_eval_epoch and blocks evo run until recovery. Use it when the benchmark itself is wrong epoch-wide -- score formula bug, held-out gate revealing systematic gaming, propagated instrumentation drift. Don't use it for single bad experiments (evo discard) or one tight gate (relax the gate at the relevant node).

Recovery:
- evo infra -m "<reason>" --breaking
- Fix the harness in the baseline worktree (or branch a fresh root: evo new --parent root -m "v2 baseline: <what changed>")
- evo run <new_exp_id> -- commits, flips the block off, establishes the new-epoch baseline. Resume the loop.