# Discover
Internal procedure for `evo:discover`. The user only sees the user-facing prompts, the dashboard URL, and the baseline score -- everything else is the agent's choreography.

## Host conventions
This skill runs on any host that implements the Agent Skills spec. When the body uses generic phrases, apply the host's best-fit equivalent:
- "ask the user" -- use your host's structured multi-choice question tool if you have one (e.g. `AskUserQuestion`, `request_user_input`). If the host has none, phrase the question as plain text in your next reply and wait for the user's answer.
- File paths like `references/...` -- relative to this `SKILL.md`; resolve from the skill directory.
- Slash commands shown in user-facing copy (e.g. `/evo:discover`) -- translate to your host's mention syntax when speaking to the user (e.g. `$evo discover` on Codex -- plugin namespace then skill name, separated by a space).
## 0. Verify the evo CLI is available and in sync with the plugin
Before anything else, run:
```bash
evo-version-check
```
This wraps `evo --version` and additionally asserts the installed CLI matches the plugin manifest version (hosts refetch the plugin on version bumps, but do not reinstall the globally-installed CLI -- drift between the two breaks skills silently).
Four outcomes to handle:
- Exit 0, `evo-version-check: OK (plugin=X, cli=X)` -- continue to step 1.
- Exit 1, "plugin manifest and installed CLI disagree" -- stop and show the user the script's stderr verbatim; it tells them the command to run (`uv tool install --force evo-hq-cli==<version>`). Then re-invoke this skill.
- Exit 2, "evo CLI not on PATH" -- stop and tell the user: "`evo-hq-cli` isn't on your PATH. Install it once: `uv tool install evo-hq-cli` (or `pipx install evo-hq-cli`). Then re-invoke this skill."
- `evo-version-check: command not found` -- the host's plugin install is incomplete (missing the `bin/` wrapper). Fall back to running `evo --version` directly and check for `evo-hq-cli` in the output; if it's a different package (commonly `evo 1.x` -- the unrelated SLAM tool), tell the user to uninstall it and install `evo-hq-cli` in its place.
Do not try to auto-install. Host sandbox + network policy may block it; leaving the install as a user action keeps failure modes clear.
## Guiding principles
- Main stays clean. Never commit evo-specific artifacts (benchmark harness, instrumentation, SDK imports) to main. Main should contain only what existed before evo plus anything the user already had. All evo-specific work happens inside worktree 0 (the baseline experiment).
- Baseline is a worktree, not a main commit. `evo init` creates `.evo/`, but nothing in main changes. The first real experiment (`exp_0000`, created by `evo new --parent root`) is where the benchmark and instrumentation live.
- Ask the user as little as possible. Every question is a beat of friction. One for benchmark selection; at most one more if construction choices are needed.
- Relay the dashboard URL verbatim when it prints. This is the user's window into the run.
## 1. Explore the repo
Understand what the codebase does. Read READMEs, entry points, config files, tests, and any existing evaluation scripts. Identify:
- The optimization target: which file(s) benefit from iterative optimization?
- Metric direction for each candidate: is higher better (`max`) or lower better (`min`)?
- Critical behaviors worth gating: invariants that must never break regardless of score (e.g., "refund flow works", "core tests pass", "output is valid JSON"). Gates are commands that exit 0 on success, non-zero on failure.
## 2. Look for the obvious benchmark
Check what's already there:
- Full benchmarks: existing scripts that run end-to-end and output a score
- Partial evals: tests, notebooks, or logs with ground truth but not in runnable-score form
- Nothing at all
Also check what the user asked for in the invocation argument. If they named a specific metric or target, that's intent.
If one benchmark is obviously the right one -- a runnable eval that measures what the user clearly cares about, or what the repo is plainly built to do -- use it. Skip step 3, go to step 4 with that benchmark as the only candidate.
If it's not obvious -- multiple candidate surfaces, no existing eval, user didn't specify intent, or the existing eval covers a narrow slice while the interesting optimization sits elsewhere -- run step 3.
## 3. Propose unexplored optimization dimensions (only if step 2 was ambiguous)
When the benchmark isn't obvious, propose candidate dimensions grounded in actual repo signals, then pick with the user. See `references/proposing-dimensions.md` for the full rubric, project-type examples, and presentation format. Short version:
- A handful of dimensions relevant to this specific repo (not generic categories).
- Ground each in repo signals: already-instrumented code, stated goals in READMEs, TODO/FIXME patterns, domain defaults.
- Rank by signal × slack × cost, argued in prose (no numeric scores -- they're vibes).
## 4. Ask the user to pick the benchmark
If step 2 produced one obvious benchmark, confirm it in one sentence and move on -- no ranked list needed.
Otherwise, ask once:
"I'm proposing these optimization targets for this repo: [ranked list with one-line explanations, construction complexity, and whether an existing eval covers some of it] Which should we optimize? Recommended: [default pick with reasoning]."
Record the selection. If step 3 ran, save non-picked dimensions to `.evo/project.md` under "Future experiment candidates" after init.
## 5. Ask the user for instrumentation mode
Three cases, in order of how to handle them:
1. Selected benchmark already exists AND is already instrumented for evo (you can see `from evo_agent import Run`, an `import { Run } from '@evo-hq/evo-agent'`, or the inline `log_task`/`logTask` helpers in the benchmark source). No wiring needed. Skip this question entirely. Detect the instrumentation style from the source and pass the matching value to `--instrumentation-mode <sdk|inline>` on `evo init` in step 7.
2. Selected benchmark already exists but is NOT instrumented (it just prints a score JSON, or it's a test runner that doesn't yet write per-task traces). Wiring is needed. Ask the question.
3. Selected benchmark needs to be constructed from scratch (case B or C from step 4). Wiring is needed. Ask the question.
For cases 2 and 3, ask once:
"I can wire up the benchmark in one of two ways:
- SDK mode -- install the SDK (Python: `pip install evo-hq-agent` / Node: `npm install @evo-hq/evo-agent`). Richer per-task logs, ~5 lines of user code.
- Inline mode -- paste a ~30-line helper directly into the benchmark. Zero new dependencies. Same data contract."
Pass the answer to `evo init` via `--instrumentation-mode <sdk|inline>` in step 7. Never install packages without this confirmation. If you skip the question (case 1), still pass the detected mode to `evo init` so optimize/subagent runs see a consistent value.
## 6. Prepare main (without committing to it)
The agent never creates commits on main. Main stays byte-identical to what the user committed before evo ran. Two things to set up, both local-only.
Order matters: do 6a (audit) before 6b (excludes). The excludes in 6b will hide files inside `node_modules/`, `dist/`, `build/`, etc. from `git status`. If you run the audit after adding excludes, you'll be blind to anything missing inside those directories -- and benchmark dependencies often live exactly there.
### 6a. Detect (don't auto-commit) dirty or untracked dependencies
Run three checks, in this order:
1. Tracked-but-modified files -- run `git diff --name-only` and `git diff --cached --name-only`. If any output line is the optimization target, an existing benchmark file, a gate-referenced script, or any of their import-graph dependencies, stop and ask the user to commit or stash before continuing. Do not commit on their behalf -- the user might be in the middle of an unrelated change.
2. Untracked files visible to git -- run `git status --short --untracked-files=all` and look for `??` entries that the target or gates will reference. Classify each:
   - Part of the user's project (e.g., a smoke test they wrote but hadn't committed) -- stop and ask the user to commit it to main themselves.
   - Evo-specific new files (a new gate script you're about to write, a new test fixture) -- do not create these in main. Defer to step 10; they go into the baseline worktree and commit to experiment 0's branch. Every descendant experiment inherits via git branching.
3. Explicit paths inside soon-to-be-ignored directories -- inspect the benchmark command and every gate command for path references (e.g., `./dist/eval-helper`, `node_modules/some-tool/cli.js`, `build/golden_outputs/`). For each such path, run `git ls-files --error-unmatch <path>` to confirm it's tracked. If any aren't, stop and ask the user to commit them. This catches dependencies that step 6b is about to hide from `git status`.
Any one of these three checks failing is a hard stop. Do not proceed to 6b or beyond until the working tree is clean with respect to anything evo will read.
Anything else (benchmark harness, instrumentation) always gets constructed inside the baseline worktree, never in main.
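Check 3 can be sketched as a small helper. This is hypothetical glue, not part of the evo CLI, and the path heuristic is deliberately crude -- refine it per repo:

```python
# Hypothetical helper for check 3: find path-like tokens in a benchmark or
# gate command and report which of them git does not track.
import re
import subprocess

def referenced_paths(command: str) -> list[str]:
    # Crude heuristic: whitespace-separated tokens containing a slash,
    # excluding flags like --target.
    return [t for t in re.split(r"\s+", command) if "/" in t and not t.startswith("-")]

def untracked(paths: list[str]) -> list[str]:
    # git ls-files --error-unmatch exits non-zero for untracked paths
    missing = []
    for p in paths:
        proc = subprocess.run(
            ["git", "ls-files", "--error-unmatch", p],
            capture_output=True,
        )
        if proc.returncode != 0:
            missing.append(p)
    return missing
```

If `untracked(referenced_paths(cmd))` returns anything, that is the hard stop from check 3.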
### 6b. Add local-only git excludes
After the audit passes, append to `.git/info/exclude` (not `.gitignore` -- we do not commit to main):
```
.evo/
__pycache__/
*.pyc
.pytest_cache/
node_modules/
dist/
build/
```
.git/info/exclude.gitignore.evo/
__pycache__/
*.pyc
.pytest_cache/
node_modules/
dist/
## 7. Initialize the workspace
```bash
evo init --target <file> --benchmark "<command using {worktree} and {target}>" --metric <max|min> \
  --host <claude-code|codex|opencode|openclaw|hermes|generic> \
  --instrumentation-mode <sdk|inline> [--gate "<gate command>"]
```
`--host` is one of `claude-code`, `codex`, `opencode`, `openclaw`, `hermes`, or `generic`; it is recorded in `.evo/meta.json` and read by `evo dispatch`. Pass `claude-code` when this `discover` flow runs on Claude Code; it can be changed later with `evo host set <value>`.
Placeholder semantics. Benchmark and gate commands support two placeholders, resolved lazily at run time by `evo run` / gate evaluation:
- `{worktree}` resolves to the absolute path of the experiment's worktree directory (e.g. `/path/to/repo/.evo/run_0000/worktrees/exp_0000`). Use this to reference files that live on the experiment branch, not on main.
- `{target}` resolves to the absolute path of the target file inside that worktree (e.g. `{worktree}/agent/solve.py`). Use this when your benchmark needs to load or exec the target dynamically.
Critical rule: `evo run` executes from the main repo root. When the benchmark script is constructed inside the worktree (the default in this flow), the command must reference it via `{worktree}` or the path won't resolve.
Example for a benchmark written at `{worktree}/benchmark.py` that will be committed to exp_0000:
```bash
evo init \
  --target agent/solve.py \
  --benchmark "python3 {worktree}/benchmark.py --target {target}" \
  --metric max \
  --host claude-code
```
If the project uses a specific interpreter (poetry, pipenv, a venv), qualify it: `"poetry run python {worktree}/benchmark.py ..."`, `".venv/bin/python {worktree}/benchmark.py ..."`, etc.
On success, `evo init` creates `.evo/`, registers the `root` node, and starts the dashboard, printing a line like `Dashboard live: http://127.0.0.1:8080 (pid 12345)`. Relay that line back to the user verbatim. If port 8080 is busy, evo auto-increments -- show whatever port prints. The URL is how the user watches the run.
## 8. Set up gates
Gates inherit down the experiment tree -- children automatically get all ancestor gates.
Gate semantics (read this first). `evo run` decides "gate passed" purely from the command's exit code: 0 = pass, non-zero = fail. A benchmark-style command that just prints `{"score": 0.0}` and exits 0 passes the gate. That defeats the purpose. Every gate command must be wired to exit non-zero when the protected behavior regresses. Two ways to do that:
- Test-suite gates -- `pytest`, `cargo test`, `npm test`, etc. already exit non-zero on failure. Use them as-is.
- Score-threshold gates -- gate the benchmark on a minimum acceptable score. The benchmark script needs a flag like `--min-score <float>` that exits 1 when the computed score falls below the threshold. The `inline_instrumentation.{py,js}` helpers in `references/` show the pattern: `write_result()` returns the final score; the script can then compare and `sys.exit(1)`.
Examples:
```bash
# Test-suite gate: pytest already exits non-zero on failures (use uv run --with if pytest isn't already a dep)
evo gate add root --name core_tests --command "uv run --with pytest pytest tests/core/ -x"

# Score-threshold gate: benchmark exits 1 if pass rate on protected tasks drops below 0.9
evo gate add root --name refund_flow --command "python3 {worktree}/benchmark.py --target {target} --task-ids 5 --min-score 0.9"

# Custom validation: smoke test that crashes (non-zero exit) on broken target
evo gate add root --name no_crash --command "python3 smoke_test.py --target {target}"
```
If a benchmark you constructed doesn't yet have a `--min-score` mode, add it now (a few lines: parse the threshold flag, compute the score, `sys.exit(1)` if below). Without it the gate is decorative.
Gate commands support `{target}` and `{worktree}` placeholders with the same semantics as benchmark commands (resolved at run time, not at registration). Registering a gate that references `{worktree}/benchmark.py` before the benchmark exists is safe -- the placeholder resolves only when the gate is evaluated, which happens during `evo run` after the benchmark is committed.
Verify registered gates:
```bash
evo gate list root
```
Gate pairing rule based on benchmark provenance:
- If the selected benchmark already existed in the repo (not constructed from scratch): gates are optional at this step, but if you register any benchmark-derived gate, it must use a `--min-score` score-threshold (or equivalent) -- not a bare invocation. Subagents can add more during optimization.
- If the benchmark was constructed from scratch (case B or C from the A/B/C classification): a Goodhart-mitigation gate is mandatory before the baseline can run, AND that gate must be a real pass/fail check (score-threshold or correctness assertion that exits non-zero on regression), not a bare benchmark rerun. See `references/constructing-benchmark.md` section 6 on "Required gate pairing." Do not proceed to `evo new` or `evo run` without it. This is the safety against metric gaming -- it is not optional.
## 9. Create the baseline worktree
```bash
evo new --parent root -m "baseline: instrument + score"
```
This returns the experiment id (typically `exp_0000`) and its worktree path. All subsequent construction work happens inside that worktree -- never in main.
## 10. Work inside the baseline worktree
Cd into the worktree path returned by `evo new`. Then:
evo new切换到返回的工作树路径。然后执行以下步骤:
### 10a. Construct the benchmark (if needed)
If the selected benchmark is new, build it in the worktree. See `references/constructing-benchmark.md` for the full procedure:
- Design the scoring function (range, direction, meaningful-improvement threshold)
- Assemble test cases (10-20 for programmatic, 15-30 for fuzzy, realistic workload for perf)
- Write the runnable harness (helper/SDK writes the score JSON to `$EVO_RESULT_PATH`; stdout and stderr are free for user output)
- Goodhart check (document gaming strategies, mitigate each with a gate or held-out slice)
- Held-out validation slice (60/70 training, 30/40 held-out) if the benchmark is hand-written
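Under those conventions, a minimal harness skeleton might look like this. All names and the task format are hypothetical stand-ins; only the score-JSON contract is from the procedure above:

```python
# Hypothetical benchmark harness skeleton: score each task, aggregate,
# and write the score JSON where the evo wire protocol expects it.
import json
import os

def run_task(task: dict) -> bool:
    # placeholder: exercise the target on one task and judge the output
    return task["output"] == task["expected"]

def run_benchmark(tasks: list[dict]) -> float:
    # fraction of tasks passed, in [0, 1], direction max
    passed = sum(1 for t in tasks if run_task(t))
    return passed / len(tasks)

def write_score(score: float) -> None:
    # evo sets $EVO_RESULT_PATH when it invokes the benchmark;
    # fall back to a local file for manual runs
    path = os.environ.get("EVO_RESULT_PATH", "result.json")
    with open(path, "w") as f:
        json.dump({"score": score}, f)
```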
Do not run separate determinism checks during setup. Note the benchmark's determinism property in `project.md` (step 12) and move on. Variance surfaces during optimization itself, where it can be handled with real evidence rather than guessed at during setup.

### 10b. Apply instrumentation
Based on the instrumentation mode passed to `evo init`:
Paths below are relative to this `SKILL.md` (resolve them against the skill directory).
- SDK mode: add `from evo_agent import Run` (Python) or `import { Run } from '@evo-hq/evo-agent'` (Node) to the benchmark script. Wrap the eval loop per `references/sdk_python.py` or `references/sdk_node.js`.
- Inline mode: copy the helper from `references/inline_instrumentation.py` (or `.js`) into the benchmark. Use `log_task`/`logTask` per task and `write_result`/`writeResult` once at the end.
The wire protocol is the same either way: `task_<id>.json` written to `$EVO_TRACES_DIR`, score JSON written to `$EVO_RESULT_PATH`. Stdout is free for user output.
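As an illustration of that data contract only -- the real helper is `references/inline_instrumentation.py`, and this sketch shows just the file layout, not its richer logging:

```python
# Illustrative sketch of the inline-instrumentation wire protocol:
# one task_<id>.json per task in $EVO_TRACES_DIR, one score JSON at
# $EVO_RESULT_PATH.
import json
import os

def log_task(task_id: str, payload: dict) -> None:
    # per-task trace, in the directory evo provides
    traces_dir = os.environ.get("EVO_TRACES_DIR", ".")
    with open(os.path.join(traces_dir, f"task_{task_id}.json"), "w") as f:
        json.dump(payload, f)

def write_result(score: float) -> float:
    # single score JSON at the path evo provides; returns the score so a
    # --min-score comparison can reuse it
    result_path = os.environ.get("EVO_RESULT_PATH", "result.json")
    with open(result_path, "w") as f:
        json.dump({"score": score}, f)
    return score
```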
### 10c. Cheap validation run
Before the full baseline, validate the toolchain with the cheapest possible end-to-end run (single task, smallest split, dry-run flag -- whatever is fastest).
Important: run this from the main repo root, not from inside the worktree. The validation writes traces to `.evo/validate/`, which must resolve to the workspace's `.evo/` at the main repo root. If you run from the worktree, the relative path creates `<worktree>/.evo/validate/` and those artifacts get staged into the experiment commit when you run `git add` later.
Resolve `{worktree}`, `{target}`, and the validator script path yourself before running. Evo substitutes `{worktree}`/`{target}` only inside `evo run`, not in a plain shell. The validation here is a plain shell call, so build the command with concrete absolute paths:
- `WORKTREE` = the worktree path returned by `evo new`
- `TARGET` = `$WORKTREE/<relative target path, e.g. agent/solve.py>`
- `VALIDATOR` = `<absolute path to this skill dir>/scripts/validate_result.py` -- resolve by taking the absolute path of this `SKILL.md`'s directory and appending `scripts/validate_result.py`
```bash
# from main repo root
WORKTREE="<...>"
TARGET="$WORKTREE/<...>"
VALIDATOR="<...>/scripts/validate_result.py"
mkdir -p .evo/validate
ATTEMPT="$(mktemp -d .evo/validate/run-XXXXXX)"
mkdir -p "$ATTEMPT/traces"
EVO_TRACES_DIR="$ATTEMPT/traces" \
EVO_RESULT_PATH="$ATTEMPT/result.json" \
EVO_EXPERIMENT_ID=validate \
python3 "$WORKTREE/benchmark.py" --target "$TARGET" \
  >"$ATTEMPT/stdout.log" 2>"$ATTEMPT/stderr.log"
python3 "$VALIDATOR" "$ATTEMPT/result.json"
```
Adapt the benchmark invocation (interpreter, args) to whatever you stored with `evo init`. The non-negotiable part is that the resulting bash command contains no literal `{worktree}`, `{target}`, or relative-script paths -- expand all of them to absolute paths before the shell runs the line. Each invocation gets a fresh attempt subdir, so re-running on failure is safe.
The validator asserts `result.json` exists, is non-empty, and is a JSON object with a numeric `score`. Also verify:
- All dependencies resolve and the command completes.
- Traces appear in `$EVO_TRACES_DIR` (if applicable).
- Each gate script runs cleanly on the unmodified target.
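The real validator is `scripts/validate_result.py` in the skill directory; a stand-in making the same assertions might look like:

```python
# Stand-in for scripts/validate_result.py (illustrative, not the shipped
# script): assert result.json exists, is non-empty, and is a JSON object
# with a numeric "score".
import json
import os

def validate(path: str) -> None:
    assert os.path.exists(path), f"{path} does not exist"
    assert os.path.getsize(path) > 0, f"{path} is empty"
    with open(path) as f:
        data = json.load(f)
    assert isinstance(data, dict), "result must be a JSON object"
    score = data.get("score")
    assert isinstance(score, (int, float)) and not isinstance(score, bool), \
        "score must be numeric"
```

Invoke it as `validate(sys.argv[1])` under a `__main__` guard so a failed assertion becomes a non-zero exit.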
Fix any issues and re-validate before proceeding.
### 10d. Commit inside the worktree
Logical commits are ideal but not required. Minimal acceptable:
- `add: benchmark harness + test cases`
- `add: instrumentation` (only in SDK mode -- inline mode keeps the harness and instrumentation in one file, so this commit collapses into the previous one)
Use git from inside the worktree directory. These commits are on the experiment's branch, not main.
Before the first commit in the worktree, add a `.gitignore` for build artifacts and any stray evo workspace writes that shouldn't land on the experiment branch. At minimum:
```
.evo/
__pycache__/
*.pyc
.pytest_cache/
node_modules/
dist/
build/
```
Otherwise, running the benchmark once before committing will drag bytecode caches, `.pytest_cache/`, or stray `.evo/` writes into the experiment's tree and pollute every descendant branch. Belt-and-suspenders with step 10c's "run from main repo root" rule: even if cwd slips, the ignore catches it.
## 11. Run the baseline
First, cd back to main repo root. If the previous step left the shell inside the worktree, `evo run` will fail with "workspace not initialized" because `.evo/` only lives at the main repo root.
```bash
cd <main-repo-root>
evo run exp_0000
```
On success, `evo run` leaves the experiment in the `committed` state and prints a line like `COMMITTED exp_0000 0.4286`.
Do NOT call `evo done` afterward. In the current CLI, `evo run` is terminal: the experiment is already committed when it returns successfully, and calling `evo done exp_0000 --score <n>` errors with `"exp_0000 has status 'committed' -- cannot record again"`. The `evo done` command exists for cases where a human recorded a score outside of `evo run`, which is not the discover flow.
If gates failed, `evo run` exits non-zero and leaves the experiment in a failed state. Fix the benchmark or target inside the worktree, commit, then `evo run exp_0000` again.
If `evo run` fails with a path error (typically: `benchmark.py` not found), the stored benchmark command in `.evo/run_0000/config.json` is missing the `{worktree}` placeholder. Fix by re-initializing: `evo discard exp_0000 --reason "benchmark command missing {worktree}"`, then re-run step 7 with the correct `--benchmark` string. Hand-editing `config.json` works but is technical debt.
## 12. Write .evo/project.md
Lives at the top level of `.evo/` (run-agnostic, stable path regardless of active run). `evo init` creates an empty stub; overwrite it.
Document:
- What the target does
- What can be changed by optimization vs what must stay stable
- How to interpret benchmark output (score meaning, direction)
- Benchmark determinism -- one line, pick what fits:
  - `deterministic by construction` -- pure code, no randomness, no network
  - `uses LLMs with temp=0` -- expected to be deterministic in practice; flag if it isn't
  - `sampling-based, variance expected` -- inherent noise; optimize will need multi-run strategies
- Environment requirements discovered during validation
- What each gate protects
- Benchmark gaming risks identified during the Goodhart check
- Future experiment candidates (the non-picked dimensions from step 3)
## 13. Report to the user
End the skill by reporting in chat:
- The dashboard URL (if not already mentioned)
- The baseline experiment ID and score
- The chosen optimization dimension and why
- A one-liner on next steps: "Run `/evo:optimize` to start the optimization loop."
- Resume after crash: if the host, the shell, or the machine restarts mid-flow, re-invoke `evo:optimize`. Evo reads `.evo/` and resumes from the last committed experiment -- no special restore procedure.
- State is local to this machine: experiment commits on branches like `evo/run_0000/exp_*` survive `git push --all`, but orchestration state (graph, annotations, project notes) lives only in `.evo/`. If that history matters to you, back up `.evo/` separately (e.g., `tar -czf evo-state-$(date +%F).tar.gz .evo/`).
## Inspection commands (for debugging, reference only)
```bash
evo get <id>                         # full experiment detail with scores
evo traces <id> <task>               # per-task trace
evo annotate <id> <task> "analysis"  # record failure analysis
evo scratchpad                       # full state: tree, best path, frontier, annotations, diffs, gates
evo gate list <id>                   # effective gates at a node (inherited)
```
## Rules
- Do NOT modify main after `evo init` unless the user explicitly asks. All new artifacts live in worktree 0.
- Do NOT install packages without the user's confirmation from step 5.
- Do NOT skip the held-out gate pairing when the benchmark was constructed from scratch. The gate is the safety net against Goodhart gaming, regardless of whether the benchmark is deterministic.
- Do NOT skip the Goodhart check when the benchmark was constructed from scratch. Gate pairing is mandatory, not optional.