skill-evolve


Skill Evolution


You are evolving your own skills. This is the only skill that modifies other skills. Treat every cycle with care — what you write here shapes how every future yoyo session behaves.

When to use


Only when invoked via `scripts/skill_evolve.sh`. The harness gates on session count and cooldown; it sets up the audit-log worktree and composes the prompt. Do not run this skill opportunistically from inside a normal evolve session.

Hard rules (read first, every cycle)


These four rules cannot be violated. Each cycle either honors all four or writes a `refused` event and exits.

HARD RULE #1 — Eligible targets only (allow-list)


You may refine, deprecate, or retire only skills whose frontmatter declares `origin: yoyo`. Any other value, OR a missing `origin:` field, means the skill is off-limits. This is an allow-list: silence means "don't touch."
Three categories of skill exist:

| `origin:` value | Source | You may edit? |
|---|---|---|
| `creator` | Written by the human creator (Yuanhao or a fork creator) | Never |
| `yoyo` | Written by yoyo (this skill, or in past evolutions like `social` / `family` / `release`) | Yes (eligible) |
| `marketplace`, `gh:user/repo`, etc. | Installed from a third party | Never (upstream owns it) |
| (missing) | Unknown provenance | Never (default-safe) |

Today the eligible set is exactly the skills whose SKILL.md declares `origin: yoyo`:
  • social
  • family
  • release
  • any skill you previously spawned (which inherit `origin: yoyo` from the Create template)
Defense in depth: if a skill has `core: true` set, refuse even if `origin: yoyo` is also somehow present. The two flags should never co-occur, but the conservative move is to honor the deny-flag.
If a recurring pattern suggests a non-eligible skill needs change (e.g., a core skill, or an installed marketplace skill), do not edit it. Instead, write a learning to `memory/learnings.jsonl` with `source: "skill-evolve"` and a clear `pattern_key`, and append a `meta-suggestion` block to `skills/_journal.md`. The human creator will decide.
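The allow-list check above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `parse_frontmatter` and `is_eligible` are hypothetical, and the frontmatter is parsed crudely as flat `key: value` lines rather than as real YAML.

```python
import re


def parse_frontmatter(text: str) -> dict:
    """Crudely read 'key: value' lines between the leading '---' markers."""
    match = re.match(r"---\n(.*?)\n---\n", text, re.DOTALL)
    if not match:
        return {}
    fields = {}
    for line in match.group(1).splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    return fields


def is_eligible(skill_md_text: str) -> bool:
    """Allow-list: only origin: yoyo is editable, and core: true always denies."""
    fields = parse_frontmatter(skill_md_text)
    if fields.get("core") == "true":  # deny-flag wins even next to origin: yoyo
        return False
    return fields.get("origin") == "yoyo"  # missing or any other value is off-limits
```

A skill with no frontmatter at all parses to an empty dict and is therefore refused, which matches the "silence means don't touch" default.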

HARD RULE #2 — Never edit yourself


You must NEVER modify `skills/skill-evolve/SKILL.md`. If you believe this skill needs improvement, append a `meta-suggestion` block to `skills/_journal.md` and stop:
```
## evt-XXXX meta-suggestion
- ts: <ISO8601>
- target: skills/skill-evolve/SKILL.md
- suggestion: <one-paragraph description>
```

HARD RULE #3 — One mutation per cycle


Each cycle produces exactly one of:
  • a refinement diff (one skill, ≤30 added lines, ≤15 removed)
  • a candidate skill draft (one new directory)
  • a retirement (one `git mv` to `skills_attic/`)
  • a `NO-OP` event (you found nothing worth doing)
If you find yourself wanting to do two things, pick the one with the strongest evidence and write the second to `memory/learnings.jsonl` for next cycle.
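As an illustration of the refinement budget, the sketch below checks the output of `git diff --numstat` (one tab-separated `added removed path` row per file) against the ≤30 / ≤15 limits. The function name is hypothetical, and it assumes the diff was already scoped to the target skill's directory so that exactly one row is expected.

```python
def within_refine_budget(numstat: str, max_added: int = 30, max_removed: int = 15) -> bool:
    """Check `git diff --numstat` output against the one-skill refine budget."""
    rows = [line.split("\t", 2) for line in numstat.splitlines() if line.strip()]
    if len(rows) != 1 or len(rows[0]) != 3:
        return False  # a refinement touches exactly one file
    added, removed, _path = rows[0]
    if not (added.isdigit() and removed.isdigit()):
        return False  # git prints "-" for binary files
    return int(added) <= max_added and int(removed) <= max_removed
```

For example, a row of `12	4	skills/social/SKILL.md` passes, while 31 added lines or 16 removed lines does not.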

HARD RULE #4 — Refine and Create events must declare an expected outcome


Every `refine` and `create` event in `skills/_journal.md` MUST include an `expected:` line: a freeform prose commitment naming (a) a concrete observable signal that should change, (b) a horizon (e.g. "within ~5 sessions" or "by next cycle"), and (c) a fallback move if the prediction does not hold.
If you cannot articulate all three, the edit is not justified by evidence: NO-OP the cycle instead of committing a refine/create without an `expected:` line. This is decision-observability discipline (paper: arxiv 2604.25850) at the cognitive layer. There is no validator, but a future cycle re-reads the line as informal evidence and a human reads it as an audit trail.
`expected:` is forbidden on `retire`, `revive`, `meta-suggestion`, `refused`, `NO-OP`, and `init` events (they do not ship a behavioral change, so there is nothing to predict).
The body of the line is freeform prose. See step 7 ("Append the event to `skills/_journal.md`") for the template position and worked examples; see "What an `expected:` line must do (and must not be)" later in this document for the anti-patterns to refuse.
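The text is explicit that no validator exists for the content of an `expected:` line. The sketch below therefore covers only the mechanical part of the rule, namely which event types must carry the line and which must not. The function name and the way the event body is passed in are assumptions made for illustration.

```python
REQUIRES_EXPECTED = {"refine", "create"}
FORBIDS_EXPECTED = {"retire", "revive", "meta-suggestion", "refused", "NO-OP", "init"}


def expected_line_ok(event_type: str, event_body: str) -> bool:
    """Presence check only: says nothing about whether the prediction is any good."""
    has_expected = any(
        line.lstrip("-• ").startswith("expected:") for line in event_body.splitlines()
    )
    if event_type in REQUIRES_EXPECTED:
        return has_expected
    if event_type in FORBIDS_EXPECTED:
        return not has_expected
    return False  # unknown event type
```

Judging whether the line names a signal, a horizon, and a fallback remains a reading task for the next cycle and for the human auditor.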

Glossary


  • session: one run of `scripts/evolve.sh` (the main evolution loop). There are ~3 per day.
  • cycle: one run of this skill, invoked from `scripts/skill_evolve.sh`. Cycles are gated by a session-counter and a 24h cooldown, so they fire roughly once every 5+ sessions.
  • real cycle: a cycle that produced one of `refine | create | retire | meta-suggestion`. Excludes `init`, `refused`, and `NO-OP`.

Bootstrap (first three real cycles only)


We are mid-life, not at Day 1, so the cold-start rules from the original design are softened, but the first three real cycles still get extra constraints to let the loop settle.
To know which cycle you are in, count the non-init, non-refused, non-NO-OP entries in `skills/_journal.md`:
```bash
cycle_index=$(grep -E '^## .*evt-[0-9]+ (refine|create|retire|meta-suggestion)' skills/_journal.md | wc -l)
# cycle_index=0 → this is the first real cycle
# cycle_index=1 → second
# cycle_index>=2 → third and later: full lifecycle unlocked
```


- **First real cycle** (`cycle_index == 0`): only `refine` or `NO-OP` allowed. Do not create. Do not retire.
- **Second real cycle** (`cycle_index == 1`): `refine`, `create`, or `NO-OP`. No retirement yet.
- **Third real cycle onward** (`cycle_index >= 2`): full lifecycle unlocked (`refine` | `create` | `retire` | `NO-OP`).

(Note: the gate-counter at `.skill_evolve_counter` is unrelated to this — it just controls when the cycle fires, not what it can do.)
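The bootstrap gating can be sketched as a lookup from `cycle_index` to the permitted actions. This is a minimal sketch: the regular expression is the one used by the `grep` in this section, while the function names are hypothetical.

```python
import re

# Same pattern as the grep above: journal headings for real cycles only.
REAL_EVENT = re.compile(
    r"^## .*evt-[0-9]+ (refine|create|retire|meta-suggestion)", re.MULTILINE
)


def cycle_index(journal_text: str) -> int:
    """Count real-cycle headings; init, refused and NO-OP are ignored."""
    return len(REAL_EVENT.findall(journal_text))


def allowed_actions(index: int) -> set:
    """Actions permitted during a cycle, following the three bootstrap bullets."""
    if index == 0:
        return {"refine", "NO-OP"}
    if index == 1:
        return {"refine", "create", "NO-OP"}
    return {"refine", "create", "retire", "NO-OP"}
```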


Lifecycle states


Every eligible skill carries a `status:` field in its frontmatter. Five states. Important: yoagent always loads anything with a valid `<dir>/SKILL.md` regardless of status. `status:` is your bookkeeping, telling you what to do next, not what the loader does. The only way to fully un-load a skill from the agent's prompt is to `git mv` its directory to `skills_attic/` (sibling of `skills/`, not scanned by `--skills`).

| State | `status:` value | Description-prefix | Entry condition | Exit condition |
|---|---|---|---|---|
| dormant | `dormant` | none | a recurring pattern not yet ratified | ratified by you → `candidate` |
| candidate | `candidate` | `[CANDIDATE — unreviewed]` (you write it on Create) | you draft a new skill | ≥2 successful invocations → `active`; 3 sessions without one → back to `dormant` |
| active | `active` | none | promoted from `candidate` | refinement applied → `refined`; score < 0.3 → `deprecated` |
| refined | `refined` | none | you applied a diff | falls back to `active` after 1 session if score holds |
| deprecated | `deprecated` | none | `score < 0.3` or 10 sessions unused | revived by use → `active`; 5 more idle → `git mv` to `skills_attic/` |

The `[CANDIDATE — unreviewed]` prefix is agent-written when you Create a skill (see Create template below). Nothing in the loader injects it. It tells future sessions to treat the skill as experimental.

Cycle execution sequence


Run these steps in order, every cycle.

1. Read evidence


```bash
# Latest cycles:
tail -n 200 skills/_journal.md

# Recent self-reflection:
tail -n 50 memory/learnings.jsonl

# Top of journal (newest entries are at top):
head -n 200 journals/JOURNAL.md

# Recent runs:
gh run list --json url,conclusion,createdAt,name -L 10 || echo "[]"

# Audit evidence (set by harness, points at audit-log worktree):
ls "${YOYO_AUDIT_DIR:-/tmp/audit-read/sessions}" 2>/dev/null | tail -30
```

**First-run handling**: if `$YOYO_AUDIT_DIR` is unset or its directory is empty, the audit-log branch hasn't accumulated evidence yet (this is normal on the first 1–2 cycles). In that case:

- Skip the per-session audit.jsonl mining in step 3 ("Mine patterns").
- Use only `memory/learnings.jsonl` and `journals/JOURNAL.md` for complaint and use signals.
- Lean toward **NO-OP** — without audit evidence, scoring is too noisy to support a confident refine/create/retire decision.
- Write the NO-OP event with note: `evidence: only learnings (audit-log unavailable)`.

2. Enumerate eligible skills


```bash
# Allow-list: only skills declaring origin: yoyo are eligible.
# Defense in depth: also exclude anything carrying core: true.
for d in skills/*/; do
  name=$(basename "$d")
  [ "$name" = "skill-evolve" ] && continue
  [ -f "$d/SKILL.md" ] || continue
  grep -q "^core: true" "$d/SKILL.md" && continue
  grep -q "^origin: yoyo$" "$d/SKILL.md" || continue
  echo "$name"
done
```

3. Mine patterns


This step has two layers: counting (the basic signals) and diagnosing (understanding why failures happened, not just that they did). Diagnosis is what turns recurrence into actionable refinement targets.

3a. Count basic signals


For each eligible skill, count:
  • Complaint signals: entries in `memory/learnings.jsonl` whose `pattern_key` or `title` / `takeaway` mentions the skill and uses negative language ("wrong", "didn't", "instead", "should have").
  • Failure signals: tool-call failures in `${YOYO_AUDIT_DIR}/day-*/audit.jsonl` where the bash command or args reference the skill's domain.
  • Use signals: number of sessions where any string from the skill's frontmatter `keywords:` list appears in that session's `audit.jsonl`. This is `uses`.
  • Win signals: out of those sessions, count the ones where `outcome.json` has `test_ok: true` AND `tasks_succeeded >= 1`. This is `wins`.
If a skill's frontmatter is missing `keywords:`, fall back to its name as the only keyword (likely noisy; flag it in `_journal.md` so the operator can add proper keywords).
Compute `wins/uses` and update the EMA score:
```
new_score = 0.3 * blended + 0.7 * old_score
blended   = 0.5 * (wins/uses) + 0.3 * (1 - complaints/uses) + 0.2 * mention_rate
```
Update the skill's frontmatter with the new values: `score`, `uses`, `wins`, and `last_used` (= the timestamp of the most-recent matching session). These updates are part of your single allowed mutation per cycle: you may bundle them into a refine event, or write a tiny "score-update" event when nothing else changes (this counts as a NO-OP for the bootstrap counter).
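To make the arithmetic concrete, here is a small sketch of the score update. The function names are hypothetical, `mention_rate` is taken as a given input because the text does not define how it is computed, and raising on `uses == 0` is an assumption about how to handle the undefined ratio.

```python
def blended_signal(wins: int, uses: int, complaints: int, mention_rate: float) -> float:
    """Weighted mix of win ratio, complaint-free ratio and mention rate."""
    if uses == 0:
        raise ValueError("no uses recorded; the ratios are undefined")
    return 0.5 * (wins / uses) + 0.3 * (1 - complaints / uses) + 0.2 * mention_rate


def updated_score(old_score: float, wins: int, uses: int,
                  complaints: int, mention_rate: float) -> float:
    """Exponential moving average: 30% new evidence, 70% previous score."""
    return 0.3 * blended_signal(wins, uses, complaints, mention_rate) + 0.7 * old_score
```

With `old_score = 0.5`, `wins = 3`, `uses = 4`, `complaints = 1` and `mention_rate = 0.5`, the blended value is 0.375 + 0.225 + 0.1 = 0.7, so the new score is 0.3 × 0.7 + 0.7 × 0.5 = 0.56. The 0.7 weight on the old score means a single bad cycle moves the score only modestly.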

3b. Diagnose the cause (trace-based)


Counting tells you which skill is struggling. Diagnosing tells you what to fix. Borrowed from the GEPA pattern (Genetic-Pareto Prompt Evolution): read the actual execution traces, don't just count failures.
For each skill where `complaint_signals ≥ 2` OR `(wins/uses) < 0.5` (with `uses ≥ 3`), open the relevant session's `audit.jsonl` and look for these failure-mode patterns:

| Pattern in audit.jsonl | Likely cause | Refinement direction |
|---|---|---|
| Same `bash` command retried 3+ times with small arg variations | Skill missing a concrete command example | Add a verbatim example in `## Procedure` |
| `edit_file <P>` followed within 2 tool calls by `git checkout … <P>` (same path), repeated in ≥2 distinct sessions | Agent edited and reverted the SAME path: likely the change was rejected by build/test, not just exploratory | Add a `## Pitfalls` entry naming the brittle pattern |
| `success: false` with the same `tool` and similar `args` across multiple sessions | Skill's procedure has a recurring blind spot | Add a `## Pitfalls` entry; consider a "do this first" prelude |
| Long bash sequences (10+ tool calls) without intermediate `read_file` of relevant docs | Skill points at non-existent docs OR doesn't tell agent to verify state | Add a "verify your assumptions" step in `## Procedure` |
| Tool calls that should be there per `keywords:` are absent | Skill isn't actually being invoked when it should be | The `description:` is too weak: refine that field instead of the body |

For each candidate refinement target, write a 1-2 sentence cause hypothesis:
```
target: social
hypothesis: 3 sessions show repeated `gh api graphql` calls with malformed `categoryId`
            args (sessions day-52, day-55, day-57). Skill's Procedure mentions categoryId
            but doesn't show the format. Refinement: add a verbatim example.
```
Carry this hypothesis into step 4 (action selection) and step 5 (Refine, where it tells you what to write in the diff). Without a hypothesis, you're guessing; with one, the refinement is targeted and the eval (Refine step R4) has something concrete to compare.
If no clear hypothesis emerges from the traces, prefer NO-OP over speculative refinement. Counting alone is not a license to mutate.
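One of the trace patterns above, the same tool failing across multiple sessions, can be sketched as follows. The record shape is an assumption: each audit record is treated as a dict with `tool` and `success` keys, and sessions are keyed by a label such as `day-52`. The real `audit.jsonl` schema may differ, and the "similar args" comparison is deliberately left out.

```python
from collections import defaultdict


def recurring_failures(sessions: dict, min_sessions: int = 2) -> dict:
    """Map each failing tool to the sessions where it failed, if seen in enough sessions."""
    seen = defaultdict(set)
    for session_id, records in sessions.items():
        for record in records:
            if record.get("success") is False:
                seen[record.get("tool")].add(session_id)
    return {
        tool: sorted(ids) for tool, ids in seen.items() if len(ids) >= min_sessions
    }
```

A hit from this function is only a pointer to which traces to read. The cause hypothesis still has to come from looking at the commands and arguments themselves.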

4. Pick exactly one action


Decision order (first match wins):
  1. Retire (third cycle onward only): if any skill has `score < 0.3` AND `last_used` ≥ 10 sessions ago, retire the lowest-scoring one. Skip if there are < 2 active eligible skills (don't bottom out the library).
  2. Refine: if any skill (a) has `complaint_signals ≥ 2`, OR (b) has `(wins/uses) < 0.5` with `uses ≥ 3`, AND in either case has not been refined in the last 3 sessions (`last_evolved` check), refine it. This matches the diagnosis-trigger condition in step 3b. Pick the target with the strongest evidence (highest complaint count, or lowest wins-ratio if no complaints).
  3. Create (second cycle onward only, and only if active skill count < 25): if any `pattern_key` appears in ≥3 distinct sessions of `learnings.jsonl` AND no existing eligible skill covers it (≥3 keyword overlap → refine that one instead), draft a new skill.
  4. NO-OP: nothing meets the bars. Write a `NO-OP` event with a one-line note about what evidence you considered.
If you've written 3 consecutive `NO-OP` events, also write `evolution_saturation: true` to the event; the harness reads this and extends the cooldown.
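The first-match-wins order can be sketched as a single function. Everything here is illustrative: `SkillStats` and its field names are hypothetical, every skill passed in is assumed to be active and eligible, "not refined in the last 3 sessions" is read as `sessions_since_evolved >= 3`, and the keyword-overlap test for Create is reduced to a count supplied by the caller.

```python
from dataclasses import dataclass


@dataclass
class SkillStats:
    name: str
    score: float
    uses: int
    wins: int
    complaints: int
    sessions_since_used: int
    sessions_since_evolved: int


def win_ratio(skill: SkillStats) -> float:
    return skill.wins / skill.uses if skill.uses else 1.0


def pick_action(skills: list, cycle_index: int, uncovered_pattern_sessions: int = 0):
    """Return (action, target_name) following the retire > refine > create > NO-OP order."""
    retirable = [s for s in skills if s.score < 0.3 and s.sessions_since_used >= 10]
    if cycle_index >= 2 and retirable and len(skills) >= 2:
        return ("retire", min(retirable, key=lambda s: s.score).name)

    def needs_refine(s: SkillStats) -> bool:
        low_ratio = s.uses >= 3 and win_ratio(s) < 0.5
        return (s.complaints >= 2 or low_ratio) and s.sessions_since_evolved >= 3

    refinable = [s for s in skills if needs_refine(s)]
    if refinable:
        # Most complaints first; ties go to the lowest win ratio.
        target = max(refinable, key=lambda s: (s.complaints, -win_ratio(s)))
        return ("refine", target.name)

    if cycle_index >= 1 and len(skills) < 25 and uncovered_pattern_sessions >= 3:
        return ("create", None)
    return ("NO-OP", None)
```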

5. Execute the action


Refine


Refinement uses a snapshot + A/B eval pattern (borrowed from Anthropic's skill-creator). The goal: never commit a refinement that doesn't measurably improve the skill on at least one concrete prompt.

Step R1: Snapshot the baseline. Before editing, copy the current SKILL.md to a temp location:
```bash
mkdir -p /tmp/skill-evolve-baseline
cp "skills/<target>/SKILL.md" "/tmp/skill-evolve-baseline/<target>.SKILL.md"
```

Step R2: Generate 2-3 synthetic test prompts. Read the target skill's `## When to use` and `## Procedure` sections. Derive concrete prompts a future agent might receive that should trigger this skill. Examples for `social`:
  • "Reply to discussion #42 with a thoughtful response"
  • "Post a 1-in-4-chance proactive riff in The Show category"
  • "Find unanswered questions in the Journal Club category"
Write them to `/tmp/skill-evolve-eval/<target>/prompts.json`:
```json
[
  {"id": "p1", "prompt": "...", "expects": "<one-sentence success criterion>"},
  {"id": "p2", "prompt": "...", "expects": "..."}
]
```

Step R3: Write the candidate diff. Use `edit_file` to apply your refinement. Constraints:
  • ≤30 added lines, ≤15 removed lines (diff stat)
  • Touch only the `## Pitfalls` and `## Procedure` sections (or the skill's "what to do" body). Never touch the top-level `description:`, and never any frontmatter field except the four bookkeeping fields established in step 3a: `score`, `uses`, `wins`, `last_used`. (`last_evolved` is also updated, to today's date.)

Step R4: A/B compare. For each test prompt, generate a 1-3 sentence summary of how each version (baseline, candidate) would handle the prompt: what tools the agent would call, in what order, and what the outcome would look like.
Two execution modes, in order of preference:
  • Preferred (sub-agent A/B): if you have `sub_agent` available, dispatch two sub-agent calls in parallel:
    • Sub-agent A: read `/tmp/skill-evolve-baseline/<target>.SKILL.md` + the test prompt → output JSON `{"summary": "...", "tool_sequence": ["bash", "edit_file", ...]}`
    • Sub-agent B: same with the candidate file
    • Use the structured outputs to compare apples-to-apples.
  • Fallback (single-agent sequential): if `sub_agent` isn't available or returned an error, read the baseline file and write a baseline summary; then read the candidate file and write a candidate summary. Be deliberate about not letting the candidate read bias the baseline read: write the baseline summary BEFORE looking at the candidate.
For each prompt, decide one of:
  • `candidate-better`: candidate's procedure is more specific, addresses the prompt more directly
  • `tie`: no meaningful difference
  • `baseline-better`: regression (the refinement made things worse)

Step R5: Decide. Commit the refinement only if:
  • 0 prompts came out `baseline-better`, AND
  • At least 1 prompt came out `candidate-better`
Otherwise: revert the edit (`cp /tmp/skill-evolve-baseline/<target>.SKILL.md skills/<target>/SKILL.md`) and write a `NO-OP` event with `eval-result: regression` (or `eval-result: tie`).

Step R6: Append eval summary to the `_journal.md` event. Add an `eval-summary:` field to the event:
```
- eval-summary: 2/2 prompts candidate-better, 0 regressions
```
Or for a NO-OP-after-eval:
```
- eval-summary: 1/2 baseline-better — refinement was a regression on prompt p2 ("..."). Reverted.
```
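The commit rule from step R5 reduces to a count over the per-prompt verdicts. The sketch below is one way to express it; the function name is hypothetical and the summary strings simply mirror the journal examples in this section.

```python
def decide(verdicts: list) -> tuple:
    """Return (commit?, journal text) from per-prompt A/B verdicts."""
    regressions = verdicts.count("baseline-better")
    improvements = verdicts.count("candidate-better")
    if regressions == 0 and improvements >= 1:
        summary = f"{improvements}/{len(verdicts)} prompts candidate-better, 0 regressions"
        return True, summary
    # Any regression blocks the commit; an all-tie result also does.
    result = "regression" if regressions else "tie"
    return False, f"eval-result: {result}"
```

Note the asymmetry: one `baseline-better` verdict vetoes the change no matter how many prompts improved.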

Create


Draft `skills/<new-name>/SKILL.md`:
```yaml
---
name: <new-name>
description: "[CANDIDATE — unreviewed] <pushy one-line trigger description, ≤200 chars total>"
tools: [bash, read_file, ...]
origin: yoyo
status: candidate
score: 0.5
uses: 0
wins: 0
last_used: null
last_evolved: <today>
parent_pattern_key: <kebab-case verb.object>
keywords: ["<distinctive substring 1>", "<distinctive substring 2>", "..."]   # ≥3 strings that, if found in a session's audit.jsonl, indicate this skill was used
---

# <Title>

## When to use
<concrete trigger conditions>

## Quick reference
<one-screen cheat sheet>

## Procedure
<numbered steps>

## Pitfalls
<things that have gone wrong before>

## Verification
<how the skill knows it succeeded>
```
The `[CANDIDATE — unreviewed]` prefix is critical: it tells the agent in future sessions to treat the skill as experimental, not as system-prompt-grade truth.

Retire


```bash
git mv skills/<name>/ skills_attic/<name>/
```
Soft delete. Recoverable. If yoyo invokes the skill's domain again within 3 cycles, you may revive it (move back, reset score to 0.5).

6. Validate


Before committing, run all of these. If any fails, write `refused` and exit:
```bash
# YAML frontmatter parses (use python3 since yq may not be installed):
python3 -c "
import sys, re
content = open('skills/<name>/SKILL.md').read()
m = re.match(r'---\n(.*?)\n---\n', content, re.DOTALL)
assert m, 'no frontmatter'
fm = m.group(1)
assert len(fm) <= 1900, f'frontmatter too long: {len(fm)}'
# crude parse
for line in fm.splitlines():
    if line.strip() and ':' not in line:
        sys.exit(f'invalid line: {line}')
"

# Description ≤ 200 chars:
desc=$(grep '^description:' skills/<name>/SKILL.md | head -1 | sed 's/^description: *//')
[ "${#desc}" -le 200 ] || { echo "description too long"; exit 1; }

# Body token estimate (~ word count, ceiling 5000):
body_words=$(awk '/^---$/{n++; next} n>=2' skills/<name>/SKILL.md | wc -w)
[ "$body_words" -le 5000 ] || { echo "body too long"; exit 1; }

# Build still works (the meta-skill itself shouldn't break the build, but defense in depth):
cargo build --release 2>&1 | tail -5
```

7. Append the event to `skills/_journal.md`

Get the next event number:
```bash
last=$(grep -oE 'evt-[0-9]+' skills/_journal.md | sort -u | tail -1)
last=${last:-evt-0000}   # empty journal: start counting from zero
# 10# forces base 10; a zero-padded value such as 0008 would otherwise be parsed as octal and fail
n=$((10#${last#evt-} + 1))
evt=$(printf 'evt-%04d' "$n")
```
Append (using `>>`, never overwrite):
```
## <ISO8601> <evt-NNNN> <type>
- skill: <name or "-">
- trigger: <one-line summary of evidence>
- diff: <+A -B (path)> or "n/a"
- validation: <pass | reason for refusal>
- score-delta: <old> → <new>
- parent-event: <evt-NNNN>
- expected: <observable signal | horizon | fallback> # required for refine/create only; forbidden on all other types
- note: <optional one-line>
```

Where `<type>` is one of: `init`, `refine`, `create`, `retire`, `revive`, `meta-suggestion`, `refused`, `NO-OP`.
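For comparison, the event-number step can also be sketched in Python, where `int()` reads zero-padded digits as decimal. The function name is hypothetical, and taking the numeric maximum (rather than the last match in file order) is an assumption about which entry counts as the latest.

```python
import re


def next_event_id(journal_text: str) -> str:
    """Return the next zero-padded event id, e.g. evt-0010 after evt-0009."""
    numbers = [int(n) for n in re.findall(r"evt-([0-9]+)", journal_text)]
    highest = max(numbers) if numbers else 0  # empty journal starts at evt-0001
    return f"evt-{highest + 1:04d}"
```

Because `parent-event:` lines also contain `evt-NNNN` strings, scanning the whole text (as the `grep -oE` in this step does) picks those up too; that is harmless as long as a parent id is never larger than the newest event.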

What an `expected:` line must do (and must not be)

A good
expected:
line names all three of: a concrete observable signal, a horizon, and a fallback move.
Concrete observables you may reference:
  • A skill's frontmatter
    uses
    /
    wins
    /
    score
    (e.g. "social.uses should grow by ≥3 over the next 5 sessions")
  • A specific failure cluster's recurrence in audit-log sessions (e.g. "the gh-discussion-comment STUCK cluster should drop to 0 hits within 5 sessions")
  • A trace pattern from step 3b (e.g. "the
    git checkout
    revert-after-edit pattern on social/SKILL.md should not recur in the next 3 sessions")
  • A concrete tool-call sequence that should/should not appear in audit.jsonl
Horizons: "by next cycle", "within ~3 sessions", "within ~5 sessions", "within 7 days". Do not say "eventually" or omit the horizon.
Fallbacks: name the next move if the prediction does not hold. Examples: "...otherwise this is a sub-skill candidate, not a prose refine"; "...otherwise the
description:
is the wrong target — try refining the body instead"; "...otherwise retire the skill".
Worked examples:
For a
refine
event:
- expected: STUCK rate on the gh-discussion-comment cluster should drop to 0
  within the next ~5 evolve sessions; if not, the prose tweak was insufficient
  and a helper script (sub-skill) is the right next step
For a
create
event:
- expected: at least 2 sessions in the next 5 should match this skill's
  keywords[] AND have outcome.json.test_ok=true (i.e. wins ≥ 2 by next cycle);
  if uses < 2 by then, the description: is too narrow and needs widening, or
  the pattern was a one-off and the skill should retire
Anti-patterns to refuse (these do not satisfy HARD RULE #4 — NO-OP instead of writing them):
  • "feels better"
  • "will be more readable"
  • "the prose is now clearer"
  • "users will like it"
  • "yoyo will use this skill more" (no horizon, no signal)
  • "this should help" (no horizon, no signal, no fallback)
If your candidate `expected:` line reads like one of those, you do not have a theory of impact: the evidence does not justify a mutation this cycle. Write `NO-OP` and move on.
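
The horizon and fallback requirements above can be roughly pre-checked before an event is written. The following is a minimal sketch, not part of the harness: the function name `check_expected`, the regular expressions, and the fallback cue words are all assumptions made for illustration. A line that passes may still lack a concrete observable signal, which this sketch does not try to detect.

```python
import re

# Assumed patterns, modeled on the horizons the text lists: "by next cycle",
# "within ~3 sessions", "within ~5 sessions", "within 7 days".
HORIZON = re.compile(
    r"by next cycle|(?:within|over|in) (?:the )?(?:next )?~?\d+ (?:evolve )?(?:sessions|days)",
    re.IGNORECASE,
)
# Assumed cue words for a fallback clause ("otherwise ...", "if not, ...",
# "if ... by then, ...").
FALLBACK = re.compile(r"\botherwise\b|\bif not\b|\bif\b.+\bby then\b", re.IGNORECASE)


def check_expected(line: str) -> list[str]:
    """Return the parts that appear to be missing; an empty list means none."""
    missing = []
    if not HORIZON.search(line):
        missing.append("horizon")
    if not FALLBACK.search(line):
        missing.append("fallback")
    return missing


if __name__ == "__main__":
    # An anti-pattern from the list above fails both checks.
    print(check_expected("this should help"))
```

A non-empty result is a hint that the candidate line resembles the anti-patterns and that `NO-OP` may be the right outcome; an empty result is not proof that the line is good.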

8. Commit

```bash
git add skills/ skills_attic/ memory/learnings.jsonl
git commit -m "skill-evolve: <type> <skill-name>" || true
```
The harness pushes (or doesn't, depending on its config). Do not push from inside this skill.

Anti-bloat ceilings

Before any `create` action, verify all of these:
  • Active skill count (any with `status: active` or `status: refined`) ≤ 25 before this create. If at the limit, you must `retire` first or write `NO-OP`.
  • Total skill count in `skills/` (excluding any skill with `core: true`) ≤ 30.
  • The new skill's frontmatter is ≤ 1900 chars.
  • The new skill's description is ≤ 200 chars (including the `[CANDIDATE — unreviewed]` prefix).
  • The new skill's body is ≤ 5000 words.
  • No existing eligible skill has ≥3 keyword overlap with the new skill's `When to use` section. If one does, refine that skill instead.
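
The ceilings above can be sketched as a single pre-create check. This is an illustrative sketch under stated assumptions, not the harness's implementation: `Candidate`, `ceiling_violations`, and the `existing_keywords` mapping are hypothetical names, both counts are taken before the create and treated as blocking once they reach the limit (one reading of "if at the limit"), and keyword overlap is approximated by matching words in the `When to use` text.

```python
import re
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical summary of the skill about to be created."""
    frontmatter: str
    description: str  # assumed to already include the candidate prefix
    body: str
    when_to_use: str


def ceiling_violations(active, total_non_core, cand, existing_keywords):
    """Return a list of violated ceilings; empty means the create may proceed.

    active and total_non_core are counts taken before the create.
    existing_keywords maps an eligible skill name to its set of keywords.
    """
    problems = []
    if active >= 25:  # assumption: at the limit already blocks a create
        problems.append("active skill count at limit: retire first or NO-OP")
    if total_non_core >= 30:  # assumption: same reading as the active limit
        problems.append("total non-core skill count at limit")
    if len(cand.frontmatter) > 1900:
        problems.append("frontmatter exceeds 1900 chars")
    if len(cand.description) > 200:
        problems.append("description exceeds 200 chars")
    if len(cand.body.split()) > 5000:
        problems.append("body exceeds 5000 words")
    # Approximate overlap: keywords that occur as words in the new section.
    words = set(re.findall(r"[a-z0-9-]+", cand.when_to_use.lower()))
    for name, keywords in existing_keywords.items():
        if len(words & {k.lower() for k in keywords}) >= 3:
            problems.append(f"overlaps {name}: refine it instead")
    return problems
```

Returning every violation at once, rather than stopping at the first, keeps the journal note informative when a candidate breaks more than one ceiling.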

Failure modes you must guard against

| Mode | What it looks like | What you do |
| --- | --- | --- |
| Skill thrashing | Same skill refined twice within 3 sessions | Read `last_evolved` before refining; if < 3 sessions ago, pick a different target or NO-OP |
| Saturation | 3 consecutive NO-OP events in `_journal.md` | Add `evolution_saturation: true` to the third event; harness will extend cooldown |
| Self-edit attempt | Pattern points at `skill-evolve` itself | HARD RULE #2: write `meta-suggestion` and stop |
| Core-edit attempt | Pattern points at one of the core 4 | HARD RULE #1: write `learnings.jsonl` entry and stop |
| Skill collision | New skill's triggers overlap an existing skill | Refine the existing skill instead |
| Identity drift | Pattern would contradict IDENTITY.md / PERSONALITY.md | Refuse; write a `learnings.jsonl` entry noting the contradiction |
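
The thrashing and saturation guards reduce to two small predicates. The sketch below is hypothetical: it assumes the caller has already turned `last_evolved` into a count of sessions elapsed and has the event types from `_journal.md` in chronological order, neither of which the text specifies how to obtain.

```python
def is_thrashing(sessions_since_last_evolved: int) -> bool:
    """True when the skill was evolved fewer than 3 sessions ago."""
    return sessions_since_last_evolved < 3


def needs_saturation_flag(previous_types: list[str], current_type: str) -> bool:
    """True when the event being written is the third consecutive NO-OP.

    previous_types holds earlier journal event types, oldest first.
    """
    return current_type == "NO-OP" and previous_types[-2:] == ["NO-OP", "NO-OP"]
```

When `is_thrashing` holds, the table's response is to pick a different target or write NO-OP; when `needs_saturation_flag` holds, the event being written gets `evolution_saturation: true`.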

What good looks like

A healthy `skills/_journal.md` after 30 days:
  • 4–10 events total (you don't run every session, and a sizable share of cycles are NO-OP)
  • Mix of refine (~50%), create (~10%), retire (~10%), NO-OP (~30%)
  • Zero `refused: self-edit` or `refused: core-edit` events (your hard rules are holding)
  • Per-skill EMA scores trending up or stable (not down)
  • `pattern_key` recurrence dispersal falling over time: yoyo is internalizing patterns, not re-discovering them
If you see thrashing, score decay, or many refusals, write a `meta-suggestion` and let the human creator tighten the loop.
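
To compare a journal against the event mix described above, the shares can be computed from the list of event types. This is a rough sketch that assumes the types have already been extracted from `skills/_journal.md`; the function names are hypothetical, and the target percentages in the text are approximate guides rather than thresholds to enforce.

```python
from collections import Counter


def event_mix(event_types: list[str]) -> dict[str, float]:
    """Return each event type's share of the total, or {} for an empty journal."""
    counts = Counter(event_types)
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()} if total else {}


def refusal_count(event_types: list[str]) -> int:
    """Count events whose type starts with "refused" (healthy journals have zero)."""
    return sum(1 for t in event_types if t.startswith("refused"))
```

A non-zero `refusal_count`, or a mix far from the rough targets, is the kind of evidence a `meta-suggestion` could cite.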