# Discover
Internal procedure for `evo:discover`. The user only sees the user-facing prompts, the dashboard URL, and the baseline score -- everything else is the agent's choreography.

## Host conventions
This skill runs on any host that implements the Agent Skills spec. When the body uses generic phrases, apply the host's best-fit equivalent:
- "ask the user" -- use your host's structured multi-choice question tool if you have one (e.g. `AskUserQuestion`, `request_user_input`). If the host has none, phrase the question as plain text in your next reply and wait for the user's answer.
- File paths like `references/...` -- relative to this `SKILL.md`; resolve from the skill directory.
- Slash commands shown in user-facing copy (e.g. `/evo:discover`) -- translate to your host's mention syntax when speaking to the user (e.g. `$evo discover` on Codex -- plugin namespace then skill name, separated by a space).
## 0. Verify the evo CLI is available and in sync with the plugin
Before anything else, run:
```bash
evo-version-check
```
This wraps `evo --version` and additionally asserts the installed CLI matches the plugin manifest version (hosts refetch the plugin on version bumps, but do not reinstall the globally-installed CLI -- drift between the two breaks skills silently).
Four outcomes to handle:
- Exit 0, `evo-version-check: OK (plugin=X, cli=X)` -- continue to step 1.
- Exit 1, "plugin manifest and installed CLI disagree" -- stop and show the user the script's stderr verbatim; it tells them the command to run (`uv tool install --force evo-hq-cli==<version>`). Then re-invoke this skill.
- Exit 2, "evo CLI not on PATH" -- stop and tell the user: "`evo-hq-cli` isn't on your PATH. Install it once: `uv tool install evo-hq-cli` (or `pipx install evo-hq-cli`). Then re-invoke this skill."
- `evo-version-check: command not found` -- the host's plugin install is incomplete (missing the `bin/` wrapper). Fall back to running `evo --version` directly and check for `evo-hq-cli` in the output; if it's a different package (commonly `evo 1.x` -- the unrelated SLAM tool), tell the user to uninstall it and install `evo-hq-cli` in its place.
Do not try to auto-install. Host sandbox + network policy may block it; leaving the install as a user action keeps failure modes clear.
## Guiding principles
- Main stays clean. Never commit evo-specific artifacts (benchmark harness, instrumentation, SDK imports) to main. Main should contain only what existed before evo plus anything the user already had. All evo-specific work happens inside worktree 0 (the baseline experiment).
- Baseline is a worktree, not a main commit. `evo init` creates `.evo/`, but nothing in main changes. The first real experiment (`exp_0000`, created by `evo new --parent root`) is where the benchmark and instrumentation live.
- Ask the user as little as possible. Every question is a beat of friction. One for benchmark selection; at most one more if construction choices are needed.
- Relay the dashboard URL verbatim when it prints. This is the user's window into the run.
## 1. Explore the repo
Understand what the codebase does. Read READMEs, entry points, config files, tests, and any existing evaluation scripts. Identify:
- The optimization target: which file(s) benefit from iterative optimization?
- Metric direction for each candidate: is higher better (`max`) or lower better (`min`)?
- Critical behaviors worth gating: invariants that must never break regardless of score (e.g., "refund flow works", "core tests pass", "output is valid JSON"). Gates are commands that exit 0 on success, non-zero on failure.
## 2. Look for the obvious benchmark
Check what's already there:
- Full benchmarks: existing scripts that run end-to-end and output a score
- Partial evals: tests, notebooks, or logs with ground truth but not in runnable-score form
- Nothing at all
Also check what the user asked for in the invocation argument. If they named a specific metric or target, that's intent.
If one benchmark is obviously the right one -- a runnable eval that measures what the user clearly cares about, or what the repo is plainly built to do -- use it. Skip step 3, go to step 4 with that benchmark as the only candidate.
If it's not obvious -- multiple candidate surfaces, no existing eval, user didn't specify intent, or the existing eval covers a narrow slice while the interesting optimization sits elsewhere -- run step 3.
## 3. Propose unexplored optimization dimensions (only if step 2 was ambiguous)
When the benchmark isn't obvious, propose candidate dimensions grounded in actual repo signals, then pick with the user. See `references/proposing-dimensions.md` for the full rubric, project-type examples, and presentation format. Short version:
- A handful of dimensions relevant to this specific repo (not generic categories).
- Ground each in repo signals: already-instrumented code, stated goals in READMEs, TODO/FIXME patterns, domain defaults.
- Rank by signal × slack × cost, argued in prose (no numeric scores -- they're vibes).
## 4. Ask the user to pick the benchmark
If step 2 produced one obvious benchmark, confirm it in one sentence and move on -- no ranked list needed.
Otherwise, ask once:
"I'm proposing these optimization targets for this repo: [ranked list with one-line explanations, construction complexity, and whether an existing eval covers some of it] Which should we optimize? Recommended: [default pick with reasoning]."
Record the selection. If step 3 ran, save non-picked dimensions to `.evo/project.md` under "Future experiment candidates" after init.
## 5. Ask the user for instrumentation mode
Three cases, in order of how to handle them:
1. Selected benchmark already exists AND is already instrumented for evo (you can see `from evo_agent import Run`, an `import { Run } from '@evo-hq/evo-agent'`, or the inline `log_task`/`logTask` helpers in the benchmark source). No wiring needed. Skip this question entirely. Detect the instrumentation style from the source and pass the matching value to `--instrumentation-mode <sdk|inline>` on `evo init` in step 7.
2. Selected benchmark already exists but is NOT instrumented (it just prints a score JSON, or it's a test runner that doesn't yet write per-task traces). Wiring is needed. Ask the question.
3. Selected benchmark needs to be constructed from scratch (case B or C from step 4). Wiring is needed. Ask the question.
For cases 2 and 3, ask once:
"I can wire up the benchmark in one of two ways:
- SDK mode -- install the SDK (Python: `pip install evo-hq-agent` / Node: `npm install @evo-hq/evo-agent`). Richer per-task logs, ~5 lines of user code.
- Inline mode -- paste a ~30-line helper directly into the benchmark. Zero new dependencies. Same data contract."
Pass the answer to `evo init` via `--instrumentation-mode <sdk|inline>` in step 7. Never install packages without this confirmation. If you skip the question (case 1), still pass the detected mode to `evo init` so optimize/subagent runs see a consistent value.
## 6. Prepare main (without committing to it)
The agent never creates commits on main. Main stays byte-identical to what the user committed before evo ran. Two things to set up, both local-only.
Order matters: do 6a (audit) before 6b (excludes). The excludes in 6b will hide files inside `node_modules/`, `dist/`, `build/`, etc. from `git status`. If you run the audit after adding excludes, you'll be blind to anything missing inside those directories -- and benchmark dependencies often live exactly there.
### 6a. Detect (don't auto-commit) dirty or untracked dependencies
Run three checks, in this order:
1. Tracked-but-modified files -- run `git diff --name-only` and `git diff --cached --name-only`. If any output line is the optimization target, an existing benchmark file, a gate-referenced script, or any of their import-graph dependencies, stop and ask the user to commit or stash before continuing. Do not commit on their behalf -- the user might be in the middle of an unrelated change.
2. Untracked files visible to git -- run `git status --short --untracked-files=all` and look for `??` entries that the target or gates will reference. Classify each:
   - Part of the user's project (e.g., a smoke test they wrote but hadn't committed) -- stop and ask the user to commit it to main themselves.
   - Evo-specific new files (a new gate script you're about to write, a new test fixture) -- do not create these in main. Defer to step 10; they go into the baseline worktree and commit to experiment 0's branch. Every descendant experiment inherits via git branching.
3. Explicit paths inside soon-to-be-ignored directories -- inspect the benchmark command and every gate command for path references (e.g., `./dist/eval-helper`, `node_modules/some-tool/cli.js`, `build/golden_outputs/`). For each such path, run `git ls-files --error-unmatch <path>` to confirm it's tracked. If any aren't, stop and ask the user to commit them. This catches dependencies that step 6b is about to hide from `git status`.
Any one of these three checks failing is a hard stop. Do not proceed to 6b or beyond until the working tree is clean with respect to anything evo will read.
Anything else (benchmark harness, instrumentation) always gets constructed inside the baseline worktree, never in main.
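Check 3 can be sketched as a small helper. This is hypothetical glue, not part of the evo CLI, and the path heuristic is deliberately crude -- refine it per repo:

```python
# Hypothetical helper for check 3: find path-like tokens in a benchmark or
# gate command and report which of them git does not track.
import re
import subprocess

def referenced_paths(command: str) -> list[str]:
    # Crude heuristic: whitespace-separated tokens containing a slash,
    # excluding flags like --target.
    return [t for t in re.split(r"\s+", command) if "/" in t and not t.startswith("-")]

def untracked(paths: list[str]) -> list[str]:
    # git ls-files --error-unmatch exits non-zero for untracked paths
    missing = []
    for p in paths:
        proc = subprocess.run(
            ["git", "ls-files", "--error-unmatch", p],
            capture_output=True,
        )
        if proc.returncode != 0:
            missing.append(p)
    return missing
```

If `untracked(referenced_paths(cmd))` returns anything, that is the hard stop from check 3.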
### 6b. Add local-only git excludes
After the audit passes, append to `.git/info/exclude` (not `.gitignore` -- we do not commit to main):
```
.evo/
__pycache__/
*.pyc
.pytest_cache/
node_modules/
dist/
build/
```
.git/info/exclude.gitignore.evo/
__pycache__/
*.pyc
.pytest_cache/
node_modules/
dist/
## 7. Initialize the workspace
```bash
evo init --target <file> --benchmark "<command using {worktree} and {target}>" --metric <max|min> \
  --host <claude-code|codex|opencode|openclaw|hermes|generic> \
  --instrumentation-mode <sdk|inline> [--gate "<gate command>"]
```
`--host` is one of `claude-code`, `codex`, `opencode`, `openclaw`, `hermes`, or `generic`; it is recorded in `.evo/meta.json` and read by `evo dispatch`. Pass `claude-code` when this `discover` flow runs on Claude Code; it can be changed later with `evo host set <value>`.
Placeholder semantics. Benchmark and gate commands support two placeholders, resolved lazily at run time by `evo run` / gate evaluation:
- `{worktree}` resolves to the absolute path of the experiment's worktree directory (e.g. `/path/to/repo/.evo/run_0000/worktrees/exp_0000`). Use this to reference files that live on the experiment branch, not on main.
- `{target}` resolves to the absolute path of the target file inside that worktree (e.g. `{worktree}/agent/solve.py`). Use this when your benchmark needs to load or exec the target dynamically.
Critical rule: `evo run` executes from the main repo root. When the benchmark script is constructed inside the worktree (the default in this flow), the command must reference it via `{worktree}` or the path won't resolve.
Example for a benchmark written at `{worktree}/benchmark.py` that will be committed to exp_0000:
```bash
evo init \
  --target agent/solve.py \
  --benchmark "python3 {worktree}/benchmark.py --target {target}" \
  --metric max \
  --host claude-code
```
If the project uses a specific interpreter (poetry, pipenv, a venv), qualify it: `"poetry run python {worktree}/benchmark.py ..."`, `".venv/bin/python {worktree}/benchmark.py ..."`, etc.
On success, `evo init` creates `.evo/`, registers the `root` node, and starts the dashboard, printing a line like `Dashboard live: http://127.0.0.1:8080 (pid 12345)`. Relay that line back to the user verbatim. If port 8080 is busy, evo auto-increments -- show whatever port prints. The URL is how the user watches the run.
## 8. Set up gates
Gates inherit down the experiment tree -- children automatically get all ancestor gates.
Gate semantics (read this first). `evo run` decides "gate passed" purely from the command's exit code: 0 = pass, non-zero = fail. A benchmark-style command that just prints `{"score": 0.0}` and exits 0 passes the gate. That defeats the purpose. Every gate command must be wired to exit non-zero when the protected behavior regresses. Two ways to do that:
- Test-suite gates -- `pytest`, `cargo test`, `npm test`, etc. already exit non-zero on failure. Use them as-is.
- Score-threshold gates -- gate the benchmark on a minimum acceptable score. The benchmark script needs a flag like `--min-score <float>` that exits 1 when the computed score falls below the threshold. The `inline_instrumentation.{py,js}` helpers in `references/` show the pattern: `write_result()` returns the final score; the script can then compare and `sys.exit(1)`.
Examples:
```bash
# Test-suite gate: pytest already exits non-zero on failures (use uv run --with if pytest isn't already a dep)
evo gate add root --name core_tests --command "uv run --with pytest pytest tests/core/ -x"

# Score-threshold gate: benchmark exits 1 if pass rate on protected tasks drops below 0.9
evo gate add root --name refund_flow --command "python3 {worktree}/benchmark.py --target {target} --task-ids 5 --min-score 0.9"

# Custom validation: smoke test that crashes (non-zero exit) on broken target
evo gate add root --name no_crash --command "python3 smoke_test.py --target {target}"
```
If a benchmark you constructed doesn't yet have a `--min-score` mode, add it now (a few lines: parse the threshold flag, compute the score, `sys.exit(1)` if below). Without it the gate is decorative.
Gate commands support `{target}` and `{worktree}` placeholders with the same semantics as benchmark commands (resolved at run time, not at registration). Registering a gate that references `{worktree}/benchmark.py` before the benchmark exists is safe -- the placeholder resolves only when the gate is evaluated, which happens during `evo run` after the benchmark is committed.
Verify registered gates:
```bash
evo gate list root
```
Gate pairing rule based on benchmark provenance:
- If the selected benchmark already existed in the repo (not constructed from scratch): gates are optional at this step, but if you register any benchmark-derived gate, it must use a `--min-score` score-threshold (or equivalent) -- not a bare invocation. Subagents can add more during optimization.
- If the benchmark was constructed from scratch (case B or C from the A/B/C classification): a Goodhart-mitigation gate is mandatory before the baseline can run, AND that gate must be a real pass/fail check (score-threshold or correctness assertion that exits non-zero on regression), not a bare benchmark rerun. See `references/constructing-benchmark.md` section 6 on "Required gate pairing." Do not proceed to `evo new` or `evo run` without it. This is the safety against metric gaming -- it is not optional.
## 9. Create the baseline worktree
```bash
evo new --parent root -m "baseline: instrument + score"
```
This returns the experiment id (typically `exp_0000`) and its worktree path. All subsequent construction work happens inside that worktree -- never in main.
## 10. Work inside the baseline worktree
Cd into the worktree path returned by `evo new`. Then:
evo new切换到返回的工作树路径。然后执行以下步骤:
### 10a. Construct the benchmark (if needed)
If the selected benchmark is new, build it in the worktree. See `references/constructing-benchmark.md` for the full procedure:
- Design the scoring function (range, direction, meaningful-improvement threshold)
- Assemble test cases (10-20 for programmatic, 15-30 for fuzzy, realistic workload for perf)
- Write the runnable harness (helper/SDK writes the score JSON to `$EVO_RESULT_PATH`; stdout and stderr are free for user output)
- Goodhart check (document gaming strategies, mitigate each with a gate or held-out slice)
- Held-out validation slice (60/70 training, 30/40 held-out) if the benchmark is hand-written
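Under those conventions, a minimal harness skeleton might look like this. All names and the task format are hypothetical stand-ins; only the score-JSON contract is from the procedure above:

```python
# Hypothetical benchmark harness skeleton: score each task, aggregate,
# and write the score JSON where the evo wire protocol expects it.
import json
import os

def run_task(task: dict) -> bool:
    # placeholder: exercise the target on one task and judge the output
    return task["output"] == task["expected"]

def run_benchmark(tasks: list[dict]) -> float:
    # fraction of tasks passed, in [0, 1], direction max
    passed = sum(1 for t in tasks if run_task(t))
    return passed / len(tasks)

def write_score(score: float) -> None:
    # evo sets $EVO_RESULT_PATH when it invokes the benchmark;
    # fall back to a local file for manual runs
    path = os.environ.get("EVO_RESULT_PATH", "result.json")
    with open(path, "w") as f:
        json.dump({"score": score}, f)
```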
Do not run separate determinism checks during setup. Note the benchmark's determinism property in `project.md` (step 12) and move on. Variance surfaces during optimization itself, where it can be handled with real evidence rather than guessed at during setup.

### 10b. Apply instrumentation
Based on the instrumentation mode passed to `evo init`:
Paths below are relative to this `SKILL.md` (resolve them against the skill directory).
- SDK mode: add `from evo_agent import Run` (Python) or `import { Run } from '@evo-hq/evo-agent'` (Node) to the benchmark script. Wrap the eval loop per `references/sdk_python.py` or `references/sdk_node.js`.
- Inline mode: copy the helper from `references/inline_instrumentation.py` (or `.js`) into the benchmark. Use `log_task`/`logTask` per task and `write_result`/`writeResult` once at the end.
The wire protocol is the same either way: `task_<id>.json` written to `$EVO_TRACES_DIR`, score JSON written to `$EVO_RESULT_PATH`. Stdout is free for user output.
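As an illustration of that data contract only -- the real helper is `references/inline_instrumentation.py`, and this sketch shows just the file layout, not its richer logging:

```python
# Illustrative sketch of the inline-instrumentation wire protocol:
# one task_<id>.json per task in $EVO_TRACES_DIR, one score JSON at
# $EVO_RESULT_PATH.
import json
import os

def log_task(task_id: str, payload: dict) -> None:
    # per-task trace, in the directory evo provides
    traces_dir = os.environ.get("EVO_TRACES_DIR", ".")
    with open(os.path.join(traces_dir, f"task_{task_id}.json"), "w") as f:
        json.dump(payload, f)

def write_result(score: float) -> float:
    # single score JSON at the path evo provides; returns the score so a
    # --min-score comparison can reuse it
    result_path = os.environ.get("EVO_RESULT_PATH", "result.json")
    with open(result_path, "w") as f:
        json.dump({"score": score}, f)
    return score
```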
### 10c. Cheap validation run
Before the full baseline, validate the toolchain with the cheapest possible end-to-end run (single task, smallest split, dry-run flag -- whatever is fastest).
Important: run this from the main repo root, not from inside the worktree. The validation writes traces to `.evo/validate/`, which must resolve to the workspace's `.evo/` at the main repo root. If you run from the worktree, the relative path creates `<worktree>/.evo/validate/` and those artifacts get staged into the experiment commit when you run `git add` later.
Resolve `{worktree}`, `{target}`, and the validator script path yourself before running. Evo substitutes `{worktree}`/`{target}` only inside `evo run`, not in a plain shell. The validation here is a plain shell call, so build the command with concrete absolute paths:
- `WORKTREE` = the worktree path returned by `evo new`
- `TARGET` = `$WORKTREE/<relative target path, e.g. agent/solve.py>`
- `VALIDATOR` = `<absolute path to this skill dir>/scripts/validate_result.py` -- resolve by taking the absolute path of this `SKILL.md`'s directory and appending `scripts/validate_result.py`
```bash
# from main repo root
WORKTREE="<...>"
TARGET="$WORKTREE/<...>"
VALIDATOR="<...>/scripts/validate_result.py"
mkdir -p .evo/validate
ATTEMPT="$(mktemp -d .evo/validate/run-XXXXXX)"
mkdir -p "$ATTEMPT/traces"
EVO_TRACES_DIR="$ATTEMPT/traces" \
EVO_RESULT_PATH="$ATTEMPT/result.json" \
EVO_EXPERIMENT_ID=validate \
python3 "$WORKTREE/benchmark.py" --target "$TARGET" \
  >"$ATTEMPT/stdout.log" 2>"$ATTEMPT/stderr.log"
python3 "$VALIDATOR" "$ATTEMPT/result.json"
```
Adapt the benchmark invocation (interpreter, args) to whatever you stored with `evo init`. The non-negotiable part is that the resulting bash command contains no literal `{worktree}`, `{target}`, or relative-script paths -- expand all of them to absolute paths before the shell runs the line. Each invocation gets a fresh attempt subdir, so re-running on failure is safe.
The validator asserts `result.json` exists, is non-empty, and is a JSON object with a numeric `score`. Also verify:
- All dependencies resolve and the command completes.
- Traces appear in `$EVO_TRACES_DIR` (if applicable).
- Each gate script runs cleanly on the unmodified target.
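The real validator is `scripts/validate_result.py` in the skill directory; a stand-in making the same assertions might look like:

```python
# Stand-in for scripts/validate_result.py (illustrative, not the shipped
# script): assert result.json exists, is non-empty, and is a JSON object
# with a numeric "score".
import json
import os

def validate(path: str) -> None:
    assert os.path.exists(path), f"{path} does not exist"
    assert os.path.getsize(path) > 0, f"{path} is empty"
    with open(path) as f:
        data = json.load(f)
    assert isinstance(data, dict), "result must be a JSON object"
    score = data.get("score")
    assert isinstance(score, (int, float)) and not isinstance(score, bool), \
        "score must be numeric"
```

Invoke it as `validate(sys.argv[1])` under a `__main__` guard so a failed assertion becomes a non-zero exit.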
Fix any issues and re-validate before proceeding.
### 10d. Commit inside the worktree
Logical commits are ideal but not required. Minimal acceptable:
- `add: benchmark harness + test cases`
- `add: instrumentation` (only in SDK mode -- inline mode keeps the harness and instrumentation in one file, so this commit collapses into the previous one)
Use git from inside the worktree directory. These commits are on the experiment's branch, not main.
Before the first commit in the worktree, add a `.gitignore` for build artifacts and any stray evo workspace writes that shouldn't land on the experiment branch. At minimum:
```
.evo/
__pycache__/
*.pyc
.pytest_cache/
node_modules/
dist/
build/
```
Otherwise, running the benchmark once before committing will drag bytecode caches, `.pytest_cache/`, or stray `.evo/` writes into the experiment's tree and pollute every descendant branch. Belt-and-suspenders with step 10c's "run from main repo root" rule: even if cwd slips, the ignore catches it.
## 11. Run the baseline
First, cd back to main repo root. If the previous step left the shell inside the worktree, `evo run` will fail with "workspace not initialized" because `.evo/` only lives at the main repo root.
```bash
cd <main-repo-root>
evo run exp_0000
```
On success, `evo run` leaves the experiment in the `committed` state and prints a line like `COMMITTED exp_0000 0.4286`.
Do NOT call `evo done` afterward. In the current CLI, `evo run` is terminal: the experiment is already committed when it returns successfully, and calling `evo done exp_0000 --score <n>` errors with `"exp_0000 has status 'committed' -- cannot record again"`. The `evo done` command exists for cases where a human recorded a score outside of `evo run`, which is not the discover flow.
If gates failed, `evo run` exits non-zero and leaves the experiment in a failed state. Fix the benchmark or target inside the worktree, commit, then `evo run exp_0000` again.
If `evo run` fails with a path error (typically: `benchmark.py` not found), the stored benchmark command in `.evo/run_0000/config.json` is missing the `{worktree}` placeholder. Fix by re-initializing: `evo discard exp_0000 --reason "benchmark command missing {worktree}"`, then re-run step 7 with the correct `--benchmark` string. Hand-editing `config.json` works but is technical debt.
## 12. Write .evo/project.md
Lives at the top level of `.evo/` (run-agnostic, stable path regardless of active run). `evo init` creates an empty stub; overwrite it.
Document:
- What the target does
- What can be changed by optimization vs what must stay stable
- How to interpret benchmark output (score meaning, direction)
- Benchmark determinism -- one line, pick what fits:
  - `deterministic by construction` -- pure code, no randomness, no network
  - `uses LLMs with temp=0` -- expected to be deterministic in practice; flag if it isn't
  - `sampling-based, variance expected` -- inherent noise; optimize will need multi-run strategies
- Environment requirements discovered during validation
- What each gate protects
- Benchmark gaming risks identified during the Goodhart check
- Future experiment candidates (the non-picked dimensions from step 3)
## 13. Report to the user
End the skill by reporting in chat:
- The dashboard URL (if not already mentioned)
- The baseline experiment ID and score
- The chosen optimization dimension and why
- A one-liner on next steps: "Run `/evo:optimize` to start the optimization loop."
- Resume after crash: if the host, the shell, or the machine restarts mid-flow, re-invoke `evo:optimize`. Evo reads `.evo/` and resumes from the last committed experiment -- no special restore procedure.
- State is local to this machine: experiment commits on branches like `evo/run_0000/exp_*` survive `git push --all`, but orchestration state (graph, annotations, project notes) lives only in `.evo/`. If that history matters to you, back up `.evo/` separately (e.g., `tar -czf evo-state-$(date +%F).tar.gz .evo/`).
## Inspection commands (for debugging, reference only)
```bash
evo get <id>                         # full experiment detail with scores
evo traces <id> <task>               # per-task trace
evo annotate <id> <task> "analysis"  # record failure analysis
evo scratchpad                       # full state: tree, best path, frontier, annotations, diffs, gates
evo gate list <id>                   # effective gates at a node (inherited)
```
## Rules
- Do NOT modify main after `evo init` unless the user explicitly asks. All new artifacts live in worktree 0.
- Do NOT install packages without the user's confirmation from step 5.
- Do NOT skip the held-out gate pairing when the benchmark was constructed from scratch. The gate is the safety net against Goodhart gaming, regardless of whether the benchmark is deterministic.
- Do NOT skip the Goodhart check when the benchmark was constructed from scratch. Gate pairing is mandatory, not optional.