cekura-self-improving-agent

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

Cekura Self-Improving Agent

Cekura 自改进Agent

Purpose

目标

Close the loop on agent prompt and tool-config quality. Ingest evaluation signal (scenario IDs to run, completed runs, a result batch, or production call logs), classify failures, diagnose where the prompt or tool config has gaps / conflicts / ambiguities, propose targeted edits, apply them, and re-run validation — iterating until the agent reaches 100% pass rate on the validation set or the iteration cap is reached.

Exit gate. The voice/channel/infra filter informs what to fix (the Optimization phase only proposes edits for prompt-following failures), not when to stop. Any remaining failure of any class keeps the loop alive. Only the iteration cap or a genuine 100% pass ends the loop.

Currently supported for VAPI and self-hosted (websocket). Retell support is intentionally disabled and will be re-enabled in a future revision.

实现Agent提示词和工具配置质量的闭环优化。接收评估信号（待运行的场景ID、已完成的运行记录、结果批次或生产调用日志），分类失败案例，诊断提示词或工具配置中的缺口/冲突/歧义，提出针对性修改建议，应用修改并重新运行验证——循环迭代直到Agent在验证集上达到100%通过率或达到迭代上限。

退出条件。语音/渠道/基础设施过滤器用于确定「需要修复什么」（优化阶段仅针对提示词遵循失败提出修改建议），而非「何时停止」。任何类型的剩余失败都会让循环持续。只有达到迭代上限或真正实现100%通过才会终止循环。

目前支持VAPI和自托管（websocket）模式。Retell支持已被有意禁用，将在未来版本中重新启用。

Architecture — orchestrator over a sequence of focused sub-phases

架构——聚焦子阶段序列的编排器

This SKILL.md is a thin orchestrator. Optimization is split into five sub-phases living in

phases/optimization/

, with Setup, Overfitting Gate, and Eval as standalone phases on either side:

                  ┌────────────────┐
   user input ─→  │  Setup phase   │  (phases/setup.md)
                  │  runs once     │
                  └───────┬────────┘
                          │  (mode, sub-flavor, agent, redeploy_command)
                          ▼
              ┌───  ┌───────────────────────────┐
              │     │ Optimization · Collect    │  (phases/optimization/collect.md)
              │     │ fetch + filter + inspect  │
              │     │ provider call state       │
              │     └───────┬───────────────────┘
              │             │  (kept failures + Signal 5 end-of-call attribution)
              │             ▼
              │     ┌────────────────────────────────────┐
              │     │ Optimization · Early-End-Call      │  (phases/optimization/
              │     │ Diagnose                           │   early-end-call-diagnose.md)
              │     │ flag main-agent-ended-early →      │
              │     │ propose closure-rule / code edits  │
              │     └───────┬────────────────────────────┘
              │             │  (early-end edits proposed; pass-through if none)
              │             ▼
              │     ┌────────────────────────────────────┐
              │     │ Optimization · Diagnose    │  (phases/optimization/
              │     │ classify Gap/Conflict/Ambig/       │   diagnose.md)
              │     │ CodeBug-other/Upstream →           │
              │     │ propose edits → present combined   │
              │     └───────┬────────────────────────────┘
              │             │  (user-approved combined edit set)
              │             ▼
              │     ┌───────────────────────────┐
              │     │ Optimization · Apply      │  (phases/optimization/apply.md)
              │     │ PATCH / Edit → redeploy   │
              │     └───────┬───────────────────┘
              │             │  (writes landed; live agent restarted)
              │             ▼
              │     ┌───────────────────────────┐
              │     │ Optimization · Sync       │  (phases/optimization/sync.md)
              │     │ re-fetch + verify         │
              │     └───────┬───────────────────┘
              │             │  (verified state matches intent)
              │             ▼
              │     ┌───────────────────────────┐
              │     │ Overfitting Gate          │  (phases/overfitting-gate.md)
              │     │ scrub transcript quotes / │
              │     │ scenario IDs / narrow     │
              │     │ clauses; apply cleanup    │
              │     │ (pass-through if clean)   │
              │     └───────┬───────────────────┘
              │             │  (gate-cleaned state)
              │             ▼
              │     ┌───────────────────────────┐
              │     │ Eval phase                │  (phases/eval.md)
              │     │ validate → re-collect →   │
              │     │ decide                    │
              │     └───────┬───────────────────┘
              │             │
              │     ┌───────┴────────────────────┐
              │     │                            │
   hand back  │     ▼                            ▼  exit
   to Collect │  failure set < 100%        full set = 100% (success)
              │  OR regression             OR iteration cap
              │  OR mitigation edits       OR all-Upstream
              │                            OR oscillation / no-change
              └────  (loop)                OR 3× same-shape failure
                                           (surface + pause for user)

Setup runs once. It resolves the run mode and sub-flavor, loads the agent (its config and prompt source), and (for self-hosted live targets) collects the

redeploy_command

. Setup is a hard gate — Optimization · Collect will not start until Setup is complete.

Optimization is five sub-phases that run in series, each with one job:

Collect (
```
phases/optimization/collect.md
```
) — fetch runs / call logs / pasted failures, pre-filter by per-run verdict (keep
```
failure
```
+
```
reviewed_failure
```
, drop
```
success
```
+
```
reviewed_success
```
), apply the voice/channel filter, inspect provider call state for every kept failure (Signals 1–5, including end-of-call attribution), build the failure summary. Output: kept failure set + per-failure signals.
Early-End-Call Diagnose (
```
phases/optimization/early-end-call-diagnose.md
```
) — specialized triage for failures where the main agent ended the call before the scenario's required steps completed. Flags failures matching {main-agent-ended + scenario-incomplete in expected-outcome bullets} via a two-check verdict-first checklist (no rationale, no borderline cases), diagnoses root cause (too-permissive closure rules / orchestration-code end-of-call detection / VAPI handoff misconfigured), proposes minimal closure-rule prompt edits or — for websocket /
```
file
```
mode — orchestration-code edits gating closure on captured state. Pass-through if no failures match. Proposed edits are NOT applied here; they flow into the combined proposal in Diagnose.
Diagnose (
```
phases/optimization/diagnose.md
```
) — classify every non-early-end failure (Gap / Conflict / Ambiguity / non-early-end CodeBug / Upstream), propose minimal scoped edits, de-conflict with early-end proposals, then present the combined proposal (early-end + rest) to the user. This is where the user-facing diff-approval gate fires in
```
auto_mode: false
```
.
Apply (
```
phases/optimization/apply.md
```
) — land the approved edits via per-provider apply machinery (VAPI PATCH / Cekura platform tools /
```
Edit
```
on source file / render rewritten prompt), then run
```
redeploy_command
```
(or fire the manual restart gate).
Sync (
```
phases/optimization/sync.md
```
) — re-fetch the just-edited artifacts and verify each changed field landed correctly. Catches VAPI nested-object replacement and
```
Edit
```
ambiguous-anchor drift. Roll back to Apply on drift.

Overfitting Gate is the "scrub the just-applied edits" phase. Because the diagnose sub-phases read failing transcripts, they sometimes leak transcript-specific phrasing into proposals — verbatim quotes, scenario IDs / names, hardcoded test data, hyper-narrow case clauses, transcript-cloned few-shot examples. The gate re-reads what was just synced, scores each edit against five overfitting signatures, and emits cleanup edits (REVISE to a generalized form, or STRIP entirely) before Eval validates. On clean iterations the gate is a one-line pass-through. Full procedure:

phases/overfitting-gate.md

Eval is the "verify and decide" phase. It builds the validation set, runs it against the gate-cleaned live agent, re-collects failures with the same logic Collect uses, and decides: hand back to Collect with a new failure summary, trigger a final regression sweep, declare success, or surface a stop condition (oscillation / no-change / 3× same-shape / iteration cap / all-Upstream). Full procedure:

phases/eval.md

Run phases strictly sequentially — never parallelize across phase boundaries. Each phase consumes the previous phase's outputs as hard pre-conditions (Diagnose reads Collect's kept-failure set + Signal 5; Apply reads Diagnose's approved combined proposal; Sync reads Apply's written artifacts; Eval reads the gate-cleaned live state). Pre-fetching artifacts from a later phase "to save round-trips" — e.g., fetching the

result_id

payload during Setup, reading the source file before Diagnose has classified failures, or running validation before Sync has verified the writes landed — produces work against premature assumptions, conflates phase responsibilities, and makes failures harder to localize. The orchestrator enters phase N+1 only after phase N's hand-off conditions are met. Parallelizing tool calls within a single phase step is fine when those calls are genuinely independent (e.g., fetching multiple referenced tools during Setup Step 1.3); the rule is about phase boundaries, not intra-phase tool batching.

Loop hand-off rules:

Setup → Collect when mode + sub-flavor + agent +
```
redeploy_command
```
are all resolved.
Collect → Early-End-Call Diagnose when the kept failure set is populated and Signal 5 (end-of-call attribution) is recorded for every kept failure. If kept = 0, skip the rest of Optimization (and the Gate, and Eval) — surface the funnel summary and stop.
Early-End-Call Diagnose → Diagnose always (whether or not any failures were flagged as early-end; the pass-through case is the no-edit branch).
Diagnose → Apply when the combined proposal is non-empty AND the user has approved it (in
```
auto_mode: false
```
) or auto-mode auto-accepted it. If the combined proposal is empty (all-Upstream or all-KEEP-on-low-confidence), skip Apply / Sync / Gate / Eval — surface upstream hand-offs and stop.
Apply → Sync after writes lands and the redeploy step (for self-hosted live targets) completes successfully. A non-zero
```
redeploy_command
```
exit halts the loop here; the user decides retry vs. abort.
Sync → Overfitting Gate after every changed field is verified. Drift detection rolls back to Apply rather than proceeding to the Gate.
Overfitting Gate → Eval after gate scoring finishes. If cleanup edits were needed, after Step GATE.7 sync confirms; if no flags were found, straight to Eval as a pass-through.
Eval → Collect when the failure set is non-empty AND none of the stop conditions fire (oscillation, no-change, 3× same-shape, iteration cap, all-Upstream, all-voice-with-no-mitigation). Each Eval → Collect hand-back counts toward
```
max_iterations
```
.
Eval → Exit on 100% pass on the full set (after the regression sweep), or on any stop condition (surfaced to user, loop halted).

本SKILL.md是一个轻量编排器。优化流程被拆分为

phases/optimization/

中的五个子阶段，同时包含Setup、Overfitting Gate和Eval作为独立阶段：

                  ┌────────────────┐
   用户输入 ─→  │  Setup阶段   │  (phases/setup.md)
                  │  仅运行一次     │
                  └───────┬────────┘
                          │  (mode, sub-flavor, agent, redeploy_command)
                          ▼
              ┌───  ┌───────────────────────────┐
              │     │ 优化·收集    │  (phases/optimization/collect.md)
              │     │ 获取 + 过滤 + 检查  │
              │     │ 服务商调用状态       │
              │     └───────┬───────────────────┘
              │             │  (保留的失败案例 + Signal 5 通话结束归因)
              │             ▼
              │     ┌────────────────────────────────────┐
              │     │ 优化·提前结束通话诊断    │  (phases/optimization/
              │     │ 诊断                           │   early-end-call-diagnose.md)
              │     │ 标记主Agent提前结束通话 →      │
              │     │ 提出关闭规则/代码修改建议  │
              │     └───────┬────────────────────────────┘
              │             │  (提出提前结束修改建议；无匹配则直接传递)
              │             ▼
              │     ┌────────────────────────────────────┐
              │     │ 优化·诊断    │  (phases/optimization/
              │     │ 分类缺口/冲突/歧义/       │   diagnose.md)
              │     │ 其他代码错误/上游问题 →           │
              │     │ 提出修改建议 → 整合展示   │
              │     └───────┬────────────────────────────┘
              │             │  (用户批准的整合修改集)
              │             ▼
              │     ┌───────────────────────────┐
              │     │ 优化·应用      │  (phases/optimization/apply.md)
              │     │ PATCH / 编辑 → 重新部署   │
              │     └───────┬───────────────────┘
              │             │  (修改已落地；重启在线Agent)
              │             ▼
              │     ┌───────────────────────────┐
              │     │ 优化·同步       │  (phases/optimization/sync.md)
              │     │ 重新获取 + 验证         │
              │     └───────┬───────────────────┘
              │             │  (验证状态与预期一致)
              │             ▼
              │     ┌───────────────────────────┐
              │     │ 过拟合检查门          │  (phases/overfitting-gate.md)
              │     │ 清理转录引用 / │
              │     │ 场景ID / 窄范围     │
              │     │ 条款；应用清理操作    │
              │     │ (无问题则直接传递)   │
              │     └───────┬───────────────────┘
              │             │  (检查门清理后的状态)
              │             ▼
              │     ┌───────────────────────────┐
              │     │ 评估阶段                │  (phases/eval.md)
              │     │ 验证 → 重新收集 →   │
              │     │ 决策                    │
              │     └───────┬───────────────────┘
              │             │
              │     ┌───────┴────────────────────┐
              │     │                            │
   返回至收集 │     ▼                            ▼  退出
   阶段       │  失败集 < 100%        全集 = 100% (成功)
              │  或出现回归             或达到迭代上限
              │  或需要缓解性修改       或全部为上游问题
              │                            或振荡/无变化
              └────  (循环)                或连续3次相同形态失败
                                           (告知用户并暂停)

Setup仅运行一次。它解析运行模式和子类型，加载Agent（其配置和提示词源），并（针对自托管在线目标）收集

redeploy_command

。Setup是硬性关卡——优化·收集阶段在Setup完成前不会启动。

优化包含五个按顺序运行的子阶段，每个阶段负责一项任务：

收集 (
```
phases/optimization/collect.md
```
) — 获取运行记录/调用日志/粘贴的失败案例，按每次运行的 verdict 预过滤（保留
```
failure
```
+
```
reviewed_failure
```
，丢弃
```
success
```
+
```
reviewed_success
```
），应用语音/渠道过滤器，检查每个保留失败案例的服务商调用状态（信号1-5，包括通话结束归因），生成失败摘要。输出：保留的失败集 + 每个失败案例的信号。
提前结束通话诊断 (
```
phases/optimization/early-end-call-diagnose.md
```
) — 针对主Agent在场景要求步骤完成前提前结束通话的失败案例进行专门分类。通过两步优先检查 verdict 的清单（无理由说明，无边界情况）标记符合{主Agent结束通话 + 场景预期结果未完成}的失败案例，诊断根本原因（关闭规则过于宽松/编排代码通话结束检测/VAPI交接配置错误），提出最小化的关闭规则提示词修改建议，或针对websocket/
```
file
```
模式提出基于捕获状态控制关闭的编排代码修改建议。无匹配失败则直接传递。此处提出的修改建议不会立即应用；会流入诊断阶段的整合提案中。
诊断 (
```
phases/optimization/diagnose.md
```
) — 对所有非提前结束的失败案例进行分类（缺口/冲突/歧义/非提前结束的代码错误/上游问题），提出最小化的针对性修改建议，与提前结束的提案进行冲突解决，然后展示整合后的提案（提前结束+其他案例）给用户。在
```
auto_mode: false
```
时，此处会触发用户可见的差异批准关卡。
应用 (
```
phases/optimization/apply.md
```
) — 通过对应服务商的应用机制落地已批准的修改（VAPI PATCH / Cekura平台工具 / 源文件
```
Edit
```
/ 渲染重写后的提示词），然后运行
```
redeploy_command
```
（或触发手动重启关卡）。
同步 (
```
phases/optimization/sync.md
```
) — 重新获取刚修改的工件并验证每个修改字段是否正确落地。捕获VAPI嵌套对象替换和
```
Edit
```
模糊锚点漂移问题。若出现漂移则回滚到应用阶段。

过拟合检查门是「清理刚应用的修改」阶段。由于诊断子阶段会读取失败的转录内容，有时会将转录特定措辞泄露到提案中——逐字引用、场景ID/名称、硬编码测试数据、超窄范围案例条款、转录克隆的少样本示例。检查门会重新读取刚同步的内容，根据五个过拟合特征对每个修改进行评分，并在验证前输出清理修改建议（修改为通用形式，或完全删除）。无问题的迭代中，检查门仅为一行直接传递。完整流程：

phases/overfitting-gate.md

。

评估是「验证与决策」阶段。它构建验证集，针对检查门清理后的在线Agent运行验证，使用与收集阶段相同的逻辑重新收集失败案例，并做出决策：将新的失败摘要返回至收集阶段，触发最终回归扫描，宣布成功，或触发停止条件（振荡/无变化/连续3次相同形态/迭代上限/全部上游问题）。完整流程：

phases/eval.md

。

严格按顺序运行阶段——切勿跨阶段并行。每个阶段都将前一阶段的输出作为硬性前置条件（诊断阶段读取收集阶段的保留失败集+信号5；应用阶段读取诊断阶段的已批准整合提案；同步阶段读取应用阶段的已写入工件；评估阶段读取检查门清理后的在线状态）。提前获取后续阶段的工件「以节省往返时间」——例如在Setup阶段获取

result_id

payload，在诊断阶段分类失败前读取源文件，或在同步阶段验证写入落地前运行验证——会基于过早假设产生无效工作，混淆阶段职责，且难以定位故障。只有在阶段N的交接条件满足后，编排器才会进入阶段N+1。单个阶段步骤内并行工具调用是允许的，只要这些调用真正独立（例如在Setup步骤1.3中获取多个引用工具）；规则针对的是阶段边界，而非阶段内的工具批量处理。

循环交接规则：

Setup → 收集：当mode + sub-flavor + agent +
```
redeploy_command
```
全部解析完成时。
收集 → 提前结束通话诊断：当保留失败集已填充且每个保留失败案例都已记录信号5（通话结束归因）时。若保留失败集为空，则跳过剩余优化阶段（以及检查门和评估）——展示漏斗摘要并停止。
提前结束通话诊断 → 诊断：始终执行（无论是否有失败案例被标记为提前结束；无修改建议则直接传递）。
诊断 → 应用：当整合提案非空且用户已批准（
```
auto_mode: false
```
时）或自动模式自动接受时。若整合提案为空（全部上游问题或全部低置信度保留），则跳过应用/同步/检查门/评估——展示上游交接信息并停止。
应用 → 同步：写入落地且重新部署步骤（针对自托管在线目标）成功完成后。
```
redeploy_command
```
非零退出码会在此处终止循环；由用户决定重试还是中止。
同步 → 过拟合检查门：每个修改字段验证完成后。检测到漂移则回滚到应用阶段。
过拟合检查门 → 评估：检查门评分完成后。若需要清理修改，则在步骤GATE.7同步确认后执行；若无标记，则直接传递至评估。
评估 → 收集：当失败集非空且未触发任何停止条件（振荡、无变化、连续3次相同形态、迭代上限、全部上游问题、全部语音问题无缓解措施）时。每次评估→收集的交接都会计入
```
max_iterations
```
。
评估 → 退出：全集100%通过（适用时需完成回归扫描），或触发任何停止条件（告知用户，终止循环）。

Modes and providers (resolved during Setup)

模式与服务商（在Setup阶段解析）

The skill organizes providers under

providers/

vapi
— VAPI agents. Both system prompts and tool definitions are editable directly via the VAPI API. Tool config covers function declarations, referenced tool definitions (
```
name
```
,
```
description
```
,
```
parameters
```
, spoken
```
messages
```
like
```
request-start
```
/
```
request-complete
```
/
```
request-failed
```
, and handoff
```
destinations
```
), and which tools each squad member references via its
```
toolIds
```
array. Edits land on VAPI; the live agent picks them up immediately. See
```
providers/vapi/overview.md
```
.
self_hosted
— umbrella for any agent the user runs themselves. The supported sub-flavor is websocket
: custom websocket servers (e.g., Python / Node / Go) whose system prompt, tool definitions, and conversation-orchestration code live in the user's source code. Editable surface is the user's source file via the
```
Edit
```
tool — covering the system prompt, tool schemas, AND orchestration code (conversation-history management, message wiring, state-preservation logic, keepalive / retry plumbing) when a failure's root cause is in code rather than prompt wording. Business logic (what a tool computes or what an external service returns) and security-sensitive code (API keys, auth, signing) remain out of scope. The Cekura agent record's
llm_system_prompt
field is NOT the source of truth in this mode — do not read it, and never ask the user to paste their prompt while a workspace is reachable. Always source the prompt from the workspace: start with the file currently open in the IDE (
```
ide_opened_file
```
), then grep project files for the system-prompt string constant. The user restarts their websocket server before re-validation; in auto mode the gate is skipped. A degenerate
```
offline
```
variant covers the "no live websocket reachable" case — the skill renders the rewritten prompt for manual application and asks for pasted failures each iteration (offline variant supports prompt edits only, never code edits). See
```
providers/self-hosted/websocket.md
```
.

The

providers/self-hosted/overview.md

file documents the self-hosted routing.

本技能将服务商组织在

providers/

目录下：

vapi
— VAPI Agent。系统提示词和工具定义均可通过VAPI API直接编辑。工具配置包括函数声明、引用的工具定义（
```
name
```
、
```
description
```
、
```
parameters
```
、语音
```
messages
```
如
```
request-start
```
/
```
request-complete
```
/
```
request-failed
```
，以及交接
```
destinations
```
），以及每个团队成员通过
```
toolIds
```
数组引用的工具。修改会落地到VAPI；在线Agent会立即生效。详见
```
providers/vapi/overview.md
```
。
self_hosted
— 涵盖用户自行运行的所有Agent。支持的子类型为**
```
websocket
```
：自定义websocket服务器（如Python/Node/Go），其系统提示词、工具定义和对话编排代码**存放在用户的源代码中。可编辑范围为用户的源文件，通过
```
Edit
```
工具实现——包括系统提示词、工具模式，以及当失败根本原因在代码而非提示词措辞时的编排代码（对话历史管理、消息路由、状态保留逻辑、保活/重试管道）。业务逻辑（工具计算内容或外部服务返回结果）和安全敏感代码（API密钥、认证、签名）不在范围内。在此模式下，Cekura Agent记录的
llm_system_prompt
字段并非可信源——请勿读取，且在可访问工作区时切勿要求用户粘贴其提示词。始终从工作区获取提示词：从IDE当前打开的文件（
```
ide_opened_file
```
）开始，然后在项目文件中搜索系统提示词字符串常量。用户需在重新验证前重启其websocket服务器；自动模式下会跳过该关卡。简化的
```
offline
```
变体适用于「无法访问在线websocket」的场景——技能会渲染重写后的提示词供手动应用，并在每次迭代时要求用户粘贴失败案例（离线变体仅支持提示词修改，不支持代码修改）。详见
```
providers/self-hosted/websocket.md
```
。

providers/self-hosted/overview.md

文件记录了自托管路由规则。

Performing Platform Actions

执行平台操作

When this skill suggests creating, listing, updating, or evaluating something on Cekura, prefer using available platform tools over describing API calls or dashboard steps. In Claude Code with the Cekura plugin installed, these tools are auto-configured and handle authentication, parameter validation, and error handling for you. Fall back to direct API endpoints or dashboard guidance only when no tools are available in the current session.

VAPI mode: VAPI write operations (assistant PATCH, tool create / PATCH / delete) are not exposed through Cekura platform tools — they go directly to the VAPI API with
```
VAPI_KEY
```
. Full curl bodies in
```
providers/vapi/phase-4-apply.md
```
.
Self-hosted / websocket: file edits land via the
```
Edit
```
tool on the user's source code — system prompt, tool schemas, and conversation-orchestration code (history management, message wiring, state) are all in scope; optional
```
mcp__cekura__aiagents_partial_update
```
to sync the Cekura description as a mirror. Full flow in
```
providers/self-hosted/websocket.md
```
.

当本技能建议在Cekura上创建、列出、更新或评估内容时，优先使用可用的平台工具，而非描述API调用或控制台步骤。在安装了Cekura插件的Claude Code中，这些工具会自动配置，并为您处理认证、参数验证和错误处理。仅当当前会话中无可用工具时，才回退到直接API端点或控制台指导。

VAPI模式：VAPI写入操作（助手PATCH、工具创建/PATCH/删除）未通过Cekura平台工具暴露——直接使用
```
VAPI_KEY
```
调用VAPI API。完整curl请求体见
```
providers/vapi/phase-4-apply.md
```
。
自托管/websocket：文件修改通过
```
Edit
```
工具落地到用户源代码——系统提示词、工具模式和对话编排代码（历史管理、消息路由、状态）均在范围内；可选
```
mcp__cekura__aiagents_partial_update
```
同步Cekura描述作为镜像。完整流程见
```
providers/self-hosted/websocket.md
```
。

How to Use This Skill

如何使用本技能

This is an interactive, multi-iteration workflow. The user supplies one of:

VAPI / self-hosted modes (any live target) — an
```
agent_id
```
plus exactly one of:
```
scenario_ids
```
,
```
result_id
```
,
```
run_ids
```
, or
```
call_ids
```
.
Self-hosted / websocket / offline variant — a
```
prompt
```
(pasted text or read-only file path) plus pasted
```
{transcript, expected_outcome, verdict}
```
blocks. No live agent required.

Optionally:

```
max_iterations
```
(default 10) — caps the loop. Each Eval → Optimization hand-back counts as one iteration.
```
mode
```
(
```
vapi
```
/
```
self_hosted
```
) — explicit override if the resolution would otherwise be ambiguous.
```
self_hosted_flavor
```
(
```
websocket
```
) — explicit override; self-hosted resolves to websocket.
```
redeploy_command
```
(self-hosted only) — shell command(s) the skill should run after each apply step to restart the live agent before re-validation. If provided, the Optimization phase runs this automatically and the user-side restart gate is skipped entirely. If set to the literal string
```
"manual"
```
(or not provided in
```
auto_mode: false
```
), the skill falls back to the canonical "pause and ask the user to restart" gate. Collected at the end of Setup Step 1.3 for self-hosted modes — see Setup Step 1.4. VAPI mode ignores this field (VAPI edits land live; nothing to redeploy).
```
auto_mode
```
(default true) — when true, skip the diff-approval gate at the end of Diagnose (Step DIAGNOSE.5) and the overfitting-gate cleanup approval (Step GATE.5) on every iteration. With
```
redeploy_command
```
configured, the skill is fully end-to-end autonomous for self-hosted modes (auto-apply → auto-redeploy → auto-validate). Without
```
redeploy_command
```
, auto_mode skips the routine user-side deployment pauses too and trusts the user to keep their live system in sync (the no-change detector in Eval Step EVAL.3 catches stale-state cases after the fact). The iteration cap, oscillation detection, validation-set stability, and the user's ability to interrupt mid-loop all still apply. Set
```
auto_mode: false
```
only when you want a per-iteration diff-approval pause AND (if
```
redeploy_command
```
is unset) explicit user-side deployment gates before validation.

这是一个交互式多迭代工作流。用户需提供以下内容之一：

VAPI/自托管模式（任何在线目标） —
```
agent_id
```
加上以下之一：
```
scenario_ids
```
、
```
result_id
```
、
```
run_ids
```
或
```
call_ids
```
。
自托管/websocket/离线变体 —
```
prompt
```
（粘贴文本或只读文件路径）加上粘贴的
```
{transcript, expected_outcome, verdict}
```
块。无需在线Agent。

可选参数：

```
max_iterations
```
（默认10）——循环上限。每次评估→优化的交接计为一次迭代。
```
mode
```
（
```
vapi
```
/
```
self_hosted
```
）——若解析存在歧义，可显式覆盖。
```
self_hosted_flavor
```
（
```
websocket
```
）——显式覆盖；自托管默认解析为websocket。
```
redeploy_command
```
（仅自托管模式）——每次应用步骤后，技能应运行的shell命令，用于在重新验证前重启在线Agent。若提供，优化阶段会自动运行该命令，跳过用户侧重启关卡。若设置为字面字符串
```
"manual"
```
（或
```
auto_mode: false
```
时未提供），技能会回退到标准的「暂停并要求用户重启」关卡。自托管模式下在Setup步骤1.3末尾收集——详见Setup步骤1.4。VAPI模式忽略此字段（VAPI修改实时生效；无需重新部署）。
```
auto_mode
```
（默认true）——设为true时，跳过诊断阶段末尾的差异批准关卡（步骤DIAGNOSE.5）和每次迭代的过拟合检查门清理批准（步骤GATE.5）。配置
```
redeploy_command
```
后，自托管模式下技能可实现端到端完全自动化（自动应用→自动重新部署→自动验证）。未配置
```
redeploy_command
```
时，自动模式也会跳过常规用户侧部署暂停，信任用户保持其在线系统同步（评估步骤EVAL.3中的无变化检测器会事后捕获 stale-state 情况）。迭代上限、振荡检测、验证集稳定性以及用户中断循环的能力仍然有效。仅当您希望每次迭代都暂停进行差异批准，且（若未设置
```
redeploy_command
```
）在验证前有明确的用户侧部署关卡时，才设置
```
auto_mode: false
```
。

When to ask for feedback or clarification (applies in every phase)

何时请求反馈或澄清（适用于所有阶段）

Ask for feedback or clarification wherever required, even in auto mode. Auto mode skips routine gates; it does NOT make the skill silent on genuinely ambiguous inputs or risky decisions. Pause and ask when:

The user's input is ambiguous or incomplete (e.g.,
```
agent_id
```
+
```
prompt
```
supplied without a mode; structured-config file where the prompt field can't be identified safely; empty / one-line / clearly-non-production prompt).
Self-hosted live targets — the
redeploy_command
must be resolved at Setup Step 1.4 before Optimization begins (see Setup Step 1.4's hard-gate note). Even in
```
auto_mode: true
```
, this one-time setup question is required — auto-mode skips the per-iteration restart pauses, not the one-time question that defines how the restart happens. Reply must be a real shell command OR the literal
```
"manual"
```
. Do not silently default to "hope the user restarted."
Websocket /
file
variant — confirm which file is the live source before any
Edit
. The IDE-opened file is a hint, not authority. Grep the workspace for the system-prompt string constant; if >1 file matches, ask which is live. Files named
```
original_*.py
```
,
```
*.bak
```
,
```
*.snapshot.py
```
, anything under
```
archive/
```
or
```
backup/
```
are strong "probably not the live source" signals — pause and confirm rather than editing them.
Self-hosted / websocket / offline variant — there is no automated path to re-collect failures, so the skill must ask for pasted failures after each iteration.
The skill needs to widen the validation set, switch input types mid-loop, or change the validation comparison set in any way — never silent in either mode.
Oscillation is detected (same scenario flipping pass/fail across iterations) or a no-change signature appears (identical post-edit failures two iterations in a row). Surface and pause; do not burn the iteration cap.
Same failure shape persists across three consecutive iterations — stop iterating at the same edit surface and escalate to a larger change. After two no-change iterations the prompt/tool layer has demonstrably failed to fix the issue; a third same-shape failure is the cue to surface architectural alternatives instead of producing iter 4 of the same kind of edit. Concretely: (a) switch to a stronger model (e.g.
```
gpt-4o-mini → gpt-4o
```
, a single-line edit in the user's code or a Cekura agent config field); (b) add a programmatic guard in orchestration code that enforces the missed behavior deterministically (websocket /
```
file
```
only); (c) restructure the agent flow into explicit named states gated on collected fields rather than relying on natural-conversation prompting; (d) split the scenario into a narrower validation set and hand off to
```
cekura-eval-design
```
if the evaluator is the issue. Present these options to the user; do not pick one autonomously, since each carries real cost (model swap → ~10× per-token spend, programmatic guard → invasive code, flow restructure → larger refactor). The exception is when the user has already explicitly directed one of these paths — then proceed.
Most kept failures cluster on one or two metrics whose explanations look subjective — hand off to
```
cekura-metric-improvement
```
instead of iterating blindly.
All kept failures classify as Upstream/data — surface the hand-off and stop the loop early; do not propose phantom prompt edits.
A diagnosis is low-confidence ("could be Conflict or Ambiguity, depending on intent") — ask the user to disambiguate rather than guessing.

When in doubt, ask. A short clarifying question costs less than a wrong PATCH against a live agent or a wasted iteration. The "don't pre-emptively pause" rule applies to per-iteration user-side gates only — auto mode runs validation directly after each apply without asking "have you restarted?" each time, because the one-time

redeploy_command

collected at Setup Step 1.4 either handles the restart automatically (real command) or has explicit user buy-in to a manual cadence (

"manual"

sentinel). Do NOT use this rule to skip Setup Step 1.4 itself, or to skip clarifying which file is the live source — those are one-time setup questions, not per-iteration pauses.

无论何时需要，即使在自动模式下，也要请求反馈或澄清。自动模式仅跳过常规关卡；并非让技能在面对真正模糊的输入或风险决策时保持沉默。出现以下情况时暂停并询问：

用户输入模糊或不完整（例如，
```
agent_id
```
+
```
prompt
```
未提供模式；结构化配置文件中无法安全识别提示词字段；空/单行/明显非生产环境的提示词）。
自托管在线目标——必须在Setup步骤1.4优化开始前解析
redeploy_command
（详见Setup步骤1.4的硬性关卡说明）。即使在
```
auto_mode: true
```
时，这个一次性设置问题也是必需的——自动模式跳过的是每次迭代的重启暂停，而非定义重启方式的一次性问题。回复必须是真实的shell命令或字面字符串
```
"manual"
```
。请勿默认假设「用户会自行重启」。
Websocket/
file
变体——在任何
Edit
前确认哪个文件是在线源。IDE打开的文件只是提示，并非权威。在工作区中搜索系统提示词字符串常量；若找到多个匹配文件，询问哪个是在线源。命名为
```
original_*.py
```
、
```
*.bak
```
、
```
*.snapshot.py
```
的文件，以及
```
archive/
```
或
```
backup/
```
下的任何文件，都是强烈的「可能不是在线源」信号——暂停并确认后再编辑。
自托管/websocket/离线变体——没有自动重新收集失败案例的路径，因此技能必须在每次迭代后要求用户粘贴失败案例。
技能需要扩大验证集、中途切换输入类型，或以任何方式更改验证比较集时——无论哪种模式都不能静默操作。
检测到振荡（同一场景在迭代间反复通过/失败）或出现无变化特征（连续两次迭代后失败案例完全相同）。告知用户并暂停；不要消耗迭代上限。
连续三次迭代出现相同形态的失败——停止在同一编辑层面迭代，升级为更大的变更。经过两次无变化迭代后，提示词/工具层面显然无法修复问题；第三次相同形态失败是提示采用架构替代方案的信号，而非继续生成同类修改的第4次迭代。具体包括：(a) 切换到更强的模型（如
```
gpt-4o-mini → gpt-4o
```
，用户代码或Cekura Agent配置字段中的单行修改）；(b) 在编排代码中添加程序化 guard，确定性强制执行缺失的行为（仅websocket/
```
file
```
模式）；(c) 将Agent流程重构为基于收集字段的显式命名状态，而非依赖自然对话提示；(d) 将场景拆分为更窄的验证集，若评估器存在问题则交接给
```
cekura-eval-design
```
。向用户展示这些选项；不要自主选择，因为每个选项都有实际成本（模型切换→约10倍每token成本，程序化guard→侵入式代码，流程重构→更大规模重构）。例外情况是用户已明确指示其中一条路径——此时可继续执行。
大多数保留失败案例集中在一两个解释看起来主观的指标上——交接给
```
cekura-metric-improvement
```
而非盲目迭代。
所有保留失败案例均归类为上游/数据问题——展示交接信息并提前终止循环；不要提出虚假的提示词修改建议。
诊断置信度低（「可能是冲突或歧义，取决于意图」）——请用户澄清而非猜测。

如有疑问，务必询问。简短的澄清问题比错误地PATCH在线Agent或浪费一次迭代成本更低。「不要提前暂停」规则仅适用于每次迭代的用户侧关卡——自动模式在每次应用后直接运行验证，无需每次询问「您是否已重启？」，因为Setup步骤1.4收集的一次性

redeploy_command

要么自动处理重启（真实命令），要么用户已明确同意手动节奏（

"manual"

标记）。请勿使用此规则跳过Setup步骤1.4本身，或跳过确认哪个文件是在线源——这些是一次性设置问题，而非每次迭代的暂停。

Orchestration flow

编排流程

The orchestrator runs the referenced files in sequence, with the loop point at Eval handing back to Optimization · Collect:

Setup — load
```
phases/setup.md
```
and walk through Steps 1.1–1.4. On completion, verify the Setup completion checklist before continuing.
Optimization · Collect (iteration N) — load
```
phases/optimization/collect.md
```
and walk through Steps COLLECT.1–5. On iteration 1, reads the raw input (
```
scenario_ids
```
/
```
result_id
```
/
```
run_ids
```
/
```
call_ids
```
/ pasted failures). On iteration 2+, reads the failure set Eval handed back. Produces the kept failure set + Signal-5 end-of-call attribution per failure.
Optimization · Early-End-Call Diagnose (iteration N) — load
```
phases/optimization/early-end-call-diagnose.md
```
and walk through Steps EARLY.1–3. Flags failures matching the early-end pattern, diagnoses the responsible layer (prompt closure rules / orchestration-code end-of-call detection / VAPI handoff), proposes minimal fixes. Pass-through if zero failures match.
Optimization · Diagnose (iteration N) — load
```
phases/optimization/diagnose.md
```
and walk through Steps DIAGNOSE.1–5. Classifies every non-early-end failure (Gap / Conflict / Ambiguity / non-early-end CodeBug / Upstream), proposes minimal edits, de-conflicts with early-end proposals, and presents the combined diff to the user. If the combined proposal is empty (all-Upstream or all-KEEP), stop the loop here — skip Apply / Sync / Gate / Eval and surface upstream hand-offs.
Optimization · Apply (iteration N) — load
```
phases/optimization/apply.md
```
and walk through Steps APPLY.1–2. Lands the combined edit set per-provider; runs
```
redeploy_command
```
(or fires manual restart gate) for self-hosted live targets. A non-zero
```
redeploy_command
```
exit halts here for user decision.
Optimization · Sync (iteration N) — load
```
phases/optimization/sync.md
```
and walk through Step SYNC.1. Re-fetches the just-edited artifacts and verifies each changed field landed. Drift rolls back to Apply.
Overfitting Gate (iteration N) — load
```
phases/overfitting-gate.md
```
and walk through Steps GATE.1–7. Inventories this iteration's edits, scores them against five overfitting signatures (verbatim transcript quote, scenario-specific identifier, hardcoded test-data value, hyper-narrow case clause, transcript-cloned few-shot example), decides REVISE / STRIP / KEEP per flagged edit, and applies cleanup edits if needed. On no-flag iterations the gate is a one-line pass-through to Eval (no extra apply round-trip). Code-stream edits (websocket /
```
file
```
orchestration code) and pure-deletion edits are not scored by the gate.
Eval (iteration N) — load
```
phases/eval.md
```
and walk through Steps EVAL.1–4. Eval's Step EVAL.4 emits exactly one of these decisions:
- Exit (success) — 100% pass on the full set (after the regression sweep, when applicable). Report cumulative diff + iterations used. Stop.
- Exit (stop condition) — iteration cap hit, oscillation detected, no-change signature for the second time, 3× same-shape failure, all-Upstream re-classified, or stochastic flake. Surface to the user and stop.
- Hand back to Optimization · Collect — failure set still has failures (post-iteration or post-regression-sweep). Re-enter step 2 above with the new failure summary as input. Increment iteration counter.

When

auto_mode: false

, every Step DIAGNOSE.5 (combined proposal approval) AND every Step GATE.5 (gate cleanup approval) is a user-gated decision. All other phase boundaries happen automatically once their phase completes.

When

auto_mode: true

, the routine diff-approval gate at Step DIAGNOSE.5 and the routine cleanup approval at Step GATE.5 are both skipped (still rendered for transparency, then auto-accepted). The Setup hard gate at Step 1.4 is NOT skipped. The Gate's tension-case pause ("REVISE would invalidate the fix") and the large-strip-set pause ("gate would strip > half the iteration's edits") fire even in auto mode. Every "ask for feedback or clarification" trigger from the list above still pauses the orchestrator.

Announce every phase entry in your user-facing output. At each phase boundary state which iteration and phase you are entering — e.g., a one-line header like

Iteration 3 · Overfitting Gate

or a sentence that names the phase as you begin its first step. This is a hard requirement, not stylistic — a missing announcement in the trace is the same signal as a missing phase, and it is the single most effective check against silently skipping a phase. The Overfitting Gate is the most-skipped phase on iter 2+ (the iteration feels incremental, the previous-iter Gate was a pass-through, and the orchestrator is tempted to apply-and-validate without re-walking the full pipeline); naming the phase before doing its work makes the elision impossible. Re-load the phase file (

Read

on the relevant

phases/...md

) on each entry rather than working from memory — phase files carry pre-flight checklists that catch upstream-incomplete states, and those checklists are useless if the orchestrator never re-reads them. Cost is one line per phase boundary plus one Read per phase per iteration.

编排器按顺序运行引用的文件，循环点为评估阶段返回至优化·收集阶段：

Setup — 加载
```
phases/setup.md
```
并执行步骤1.1–1.4。完成后，验证Setup完成清单再继续。
优化·收集（迭代N） — 加载
```
phases/optimization/collect.md
```
并执行步骤COLLECT.1–5。迭代1时，读取原始输入（
```
scenario_ids
```
/
```
result_id
```
/
```
run_ids
```
/
```
call_ids
```
/粘贴的失败案例）。迭代2及以上时，读取评估阶段返回的失败集。输出保留的失败集 + 每个失败案例的信号5通话结束归因。
优化·提前结束通话诊断（迭代N） — 加载
```
phases/optimization/early-end-call-diagnose.md
```
并执行步骤EARLY.1–3。标记符合提前结束模式的失败案例，诊断责任层（提示词关闭规则/编排代码通话结束检测/VAPI交接），提出最小化修复建议。无匹配失败则直接传递。
优化·诊断（迭代N） — 加载
```
phases/optimization/diagnose.md
```
并执行步骤DIAGNOSE.1–5。对所有非提前结束的失败案例进行分类（缺口/冲突/歧义/非提前结束的代码错误/上游问题），提出最小化修改建议，与提前结束提案进行冲突解决，并向用户展示整合后的差异。若整合提案为空（全部上游问题或全部保留），在此处终止循环——跳过应用/同步/检查门/评估并展示上游交接信息。
优化·应用（迭代N） — 加载
```
phases/optimization/apply.md
```
并执行步骤APPLY.1–2。通过对应服务商落地整合修改集；对自托管在线目标运行
```
redeploy_command
```
（或触发手动重启关卡）。
```
redeploy_command
```
非零退出码会在此处终止，由用户决策。
优化·同步（迭代N） — 加载
```
phases/optimization/sync.md
```
并执行步骤SYNC.1。重新获取刚修改的工件并验证每个修改字段是否落地。出现漂移则回滚到应用阶段。
过拟合检查门（迭代N） — 加载
```
phases/overfitting-gate.md
```
并执行步骤GATE.1–7。盘点本次迭代的修改，根据五个过拟合特征（逐字转录引用、场景特定标识符、硬编码测试数据值、超窄范围案例条款、转录克隆的少样本示例）进行评分，对每个标记的修改决定修改/删除/保留，并在需要时应用清理修改。无标记迭代时，检查门直接传递至评估（无需额外应用往返）。代码流修改（websocket/
```
file
```
编排代码）和纯删除修改不参与检查门评分。
评估（迭代N） — 加载
```
phases/eval.md
```
并执行步骤EVAL.1–4。评估步骤EVAL.4会输出以下决策之一：
- 退出（成功） — 全集100%通过（适用时需完成回归扫描）。报告累计差异+使用的迭代次数。停止。
- 退出（停止条件） — 达到迭代上限、检测到振荡、第二次出现无变化特征、连续3次相同形态失败、重新归类为全部上游问题，或随机波动。告知用户并停止。
- 返回至优化·收集 — 失败集仍有失败案例（迭代后或回归扫描后）。以上述新失败摘要作为输入重新进入步骤2。迭代计数器递增。

当

auto_mode: false

时，每次步骤DIAGNOSE.5（整合提案批准）和步骤GATE.5（检查门清理批准）都是用户决策关卡。所有其他阶段边界在阶段完成后自动进行。

当

auto_mode: true

时，步骤DIAGNOSE.5的常规差异批准关卡和步骤GATE.5的常规清理批准关卡均会跳过（仍会展示以保证透明度，然后自动接受）。步骤1.4的Setup硬性关卡不会跳过。检查门的紧张情况暂停（「修改会使修复无效」）和大量删除暂停（「检查门将删除超过一半的迭代修改」）即使在自动模式下也会触发。上述所有「请求反馈或澄清」的触发条件仍会暂停编排器。

在面向用户的输出中宣布每个阶段的进入。在每个阶段边界说明当前迭代和进入的阶段——例如，一行标题如

迭代3 · 过拟合检查门

或开始第一步时命名阶段的句子。这是硬性要求，而非风格问题——跟踪中缺少公告等同于跳过阶段，这是防止静默跳过阶段最有效的检查方式。过拟合检查门是迭代2及以上最容易被跳过的阶段（迭代感觉增量式，前一次迭代检查门直接传递，编排器倾向于直接应用并验证而不重新执行完整流程）；在执行工作前命名阶段可避免遗漏。每次进入阶段时重新加载阶段文件（读取相关

phases/...md

）而非依赖记忆——阶段文件包含飞行前检查清单，可捕获上游未完成状态，若编排器不重新读取则这些清单无用。成本为每个阶段边界一行内容加上每次迭代每个阶段一次读取。

Common Pitfalls

常见陷阱

Parallelizing across phase boundaries. Pre-fetching artifacts a later phase will consume — most commonly fetching the
```
result_id
```
payload during Setup, but also reading the source file before Diagnose has classified failures, or running validation before Sync verified the writes landed — produces work against premature assumptions and makes failures harder to localize. Each phase's pre-conditions are the previous phase's outputs; enter phase N+1 only after phase N completes. Batching independent tool calls inside a single phase step is fine (e.g., fetching multiple VAPI tool definitions during Setup Step 1.3); the rule is about phase boundaries.
Skipping Setup Step 1.4 in self-hosted modes — the deploy-path foot-gun. Setup Step 1.4 collects
```
redeploy_command
```
and is a HARD GATE before Optimization · Collect; auto-mode does NOT skip it. The failure mode looks like this: edits land in the user's file, the live process still runs the old code, validation comes back with transcripts that look slightly different (different LLM nondeterminism) but the same bullets failing for the same reasons. You then attribute the persistent failures to "weak instruction-following" and write stronger prompt edits — the wrong root cause. The skill burns its iteration cap iterating on prompts that never reach the live agent. Auto-mode's no-change detector (Eval Step EVAL.3) is a backstop, not a primary control — by the time it fires, real iterations are already spent. Fix: before Optimization, the run state MUST have either (a)
```
redeploy_command
```
set to a shell command the skill will run after every apply, OR (b)
```
redeploy_command == "manual"
```
with user buy-in to the per-iteration restart pause. Ask once, at Setup Step 1.4. Don't conflate "skip the per-iteration restart pause" (which auto-mode does) with "skip the one-time setup question" (which it must not).
Editing the wrong source file in websocket /
file
variant. Trusting the IDE-opened file as authoritative without grepping the workspace for the system-prompt string constant. Filenames like
```
original_*.py
```
,
```
*.bak
```
,
```
*.snapshot.py
```
, anything under
```
archive/
```
/
```
backup/
```
/
```
old/
```
are strong "not the live source" signals — pause and confirm before editing. Symptom is identical to the Setup Step 1.4 skip above: edits land in a file the server doesn't read. Fix: always grep for the prompt constant (e.g.
```
SYSTEM_PROMPT = {
```
or whatever the constant is named). If grep returns >1 hit, ask which is live. The IDE pointer is a hint to start from, not authority.
Treating an iter-1 result that looks "slightly improved" as confirmation that edits landed. If iter 1 flips a small number of scenarios but the remaining failures show the same bullets failing for the same reasons with transcripts within nondeterminism distance of the originals, the most likely explanation is that edits DIDN'T reach the live process — not that the prompt change was "directionally right but too weak." Strengthening the prompt in iter 2 in this state is the wrong move. Verify the deploy path first (Setup Step 1.4's
```
redeploy_command
```
actually ran without error; the file you edited is the file the server reads).
Asking the user to redeploy / restart / re-apply before triggering evals in auto mode.
```
auto_mode
```
is on by default and skips BOTH the diff-approval gate AND the per-iteration user-side deployment pauses. The skill proceeds straight to validation. Don't render "before continuing, redeploy your server" instruction blocks in the default path. If results come back unchanged, surface the no-change hypothesis after the fact (Eval Step EVAL.3 already does this). This rule applies only when Setup Step 1.4 has already resolved
redeploy_command
— it does NOT license skipping Setup Step 1.4 itself.
Exiting on failure-set 100% without running the regression sweep. A 2/2 pass on the originally-failing subset is a milestone, not the finish line. The exit gate is 100% on the full set (every scenario the user originally provided), and the only way to confirm that is to actually run the full set after the failure subset hits 100%. Skipping the sweep masks regressions where an edit fixed scenarios A & B but broke scenario C. Eval Step EVAL.4's decision tree enforces this — never declare success on failure-set 100% alone.
Skipping the early-end-call-diagnose sub-phase or treating it as redundant with diagnose. The two sub-phases are NOT interchangeable. Early-end is triaged first because the pattern dominates any other diagnosis on the same scenario — if the call ended at turn 4 of a required 8-step scenario, prompt edits targeting step-5-onward behavior are wasted work. Diagnose explicitly skips the early-end CodeBug pattern (Step DIAGNOSE.3 notes this); without the early-end sub-phase, those failures fall through unclassified and get attributed to weak instruction-following.
Treating auto mode as fully silent. Auto mode skips routine gates, NOT the skill's responsibility to ask for clarification on genuinely ambiguous inputs or risky decisions. Ambiguous mode resolution, prompt-source ambiguity (which file? which variable?), low-confidence diagnoses, oscillation, no-change signatures, all-upstream failure sets, and metric-quality clusters all require an explicit pause-and-ask.
Auto mode masking diagnosis quality. Without the per-iteration human read on the diff, a bad diagnosis lands silently and shows up only as a failed re-validation. Treat oscillation and no-change signatures as harder stops in auto mode — surface and pause rather than burn the iteration cap.
Producing iter 4 of the same edit kind after three same-shape failures. When the same scenario fails with the same sub-outcome bullet across three consecutive iterations of prompt-layer (or tool-config-layer) edits, the layer being edited is demonstrably not the layer that fixes it. Keep going and you're paying compute to confirm a known result. Eval Step EVAL.4 case 6 mandates a stop: present architectural alternatives (model swap, programmatic guard, flow restructure, evaluator hand-off) and wait for the user to choose. Don't autonomously pick one — each has real cost. Also don't paper over the situation with "let me try once more, slightly stronger wording" — that is iter 4 of the same edit kind.
Forcing
auto_mode: false
for routine work. The diff-approval + deployment-gate pauses are useful when calibrating the skill against a new agent. For repeat use against an agent whose diagnosis quality you've already validated, the default
```
auto_mode: true
```
is correct.
Proposing tool-config edits in the offline variant. Only prompt edits are valid there — tool findings must be surfaced as upstream hand-offs, not edits.
Proposing VAPI-shaped edits in self-hosted modes. Spoken
```
messages
```
(
```
request-start
```
,
```
request-complete
```
,
```
request-failed
```
), handoff
```
destinations
```
, squad
```
model.toolIds
```
— none of these exist outside VAPI. Diagnose must filter these edit candidates out for self-hosted mode.
Treating Cekura's
description
as the source of truth in websocket mode. It is at best a mirror; the live prompt is in the user's source code. Editing the description does nothing to the live agent unless the user's code reads from it.
Reading
llm_system_prompt
from the Cekura agent record in self-hosted mode, or asking the user to paste their prompt. For
```
assistant_provider == "self_hosted"
```
(websocket),
```
llm_system_prompt
```
is almost always empty — the live prompt lives in the user's workspace (source file). Do NOT pull
```
llm_system_prompt
```
and do NOT ask "paste your current system prompt so I can run improve-prompt against it." Instead, locate the prompt in the workspace: first the IDE-opened file (
```
ide_opened_file
```
context), then grep project files for the prompt string constant, and edit it directly via the
```
Edit
```
tool. Asking the user to paste is only acceptable in the explicit
```
offline
```
variant where no workspace is reachable.
Applying
Edit
with a non-unique
old_string
in websocket /
file
variant. The Edit tool fails on ambiguous matches. Use enough surrounding context (5–10 lines on either side) for every anchor.
Hallucinating variable-injection findings without runtime state. Especially common in the websocket /
```
offline
```
variant. Don't claim "the runtime didn't receive
```
{{accountId}}
```
" unless the transcript itself shows the placeholder leaking.
Shortcutting Optimization · Collect Step COLLECT.3 by reading result-level summary fields instead of per-run
evaluation_status
. A
```
results_retrieve
```
payload exposes both: per-run
```
evaluation_status
```
(post-review, authoritative) AND result-level aggregates (
```
failed_workflow_runs
```
,
```
failed_reasons.issues
```
,
```
failed_runs_count
```
,
```
success_rate
```
) computed from raw machine scores before human review. The aggregates lump
```
failure
```
and
```
reviewed_success
```
into the same buckets — using them silently smuggles human-overridden passes into the kept set and produces edits that contradict the reviewer. The four-bucket filter only works when applied to each run's own
```
evaluation_status
```
. Same rule for
```
run_ids
```
(use per-item verdict) and
```
call_ids
```
(use per-log verdict). The Step COLLECT.5 funnel line must cite
```
per-run evaluation_status
```
as the source so the skip is auditable.
Skipping the variable-state inspection (Optimization · Collect Step COLLECT.4) and mapping failures only to prompt sections. Produces phantom prompt fixes for failures actually rooted upstream. Also breaks the early-end-call-diagnose sub-phase, which depends on Signal 5 (end-of-call attribution) being captured in COLLECT.4.
Quitting the loop the moment failures look non-prompt. The exit gate is 100% pass rate or the iteration cap — not "first sight of an infra-shaped failure." Re-classify with fresh eyes before declaring upstream. In websocket /
```
file
```
mode, also check whether the failure is a CodeBug (history truncation, missing forwarding, broken state) — those are in-scope for editing, not hand-offs.
Iterating prompt-wording when the diagnosis is CodeBug. If oscillation or a no-change signature surfaces and the failure shape matches a CodeBug signal (agent forgets earlier turns, agent ignores explicit don't-re-ask rules despite the prompt being clear, etc.) — stop iterating the prompt. Move to the orchestration-code stream. Repeated prompt-only edits will not converge if the plumbing prevents the agent from following the instructions.
Touching business logic, auth code, or dependencies in websocket-mode code edits. Orchestration-code edits are scoped to plumbing: history management, message wiring, state preservation, keepalive. Tool implementation bodies, API keys / auth code, secrets handling, dependency lists, and framework imports remain out of scope. When in doubt, hand off rather than edit.
Proposing code edits in VAPI or websocket
offline
. The orchestration-code stream exists only for websocket /
```
file
```
. In other modes, code-shaped findings become upstream hand-offs — the skill cannot reach VAPI infrastructure code, and the offline variant has no live file to edit.
Skipping the per-iteration user gate in
auto_mode: false
. The skill applies edits to a live agent. Every PATCH / Edit must be preceded by explicit approval of that iteration's proposed diff.
Skipping the Optimization · Sync Step SYNC.1 re-fetch. VAPI's PATCH semantics replace nested objects wholesale; a malformed body can silently wipe
```
messages
```
or
```
destinations
```
while returning 200. For websocket /
```
file
```
, an
```
Edit
```
call with an ambiguous anchor can land in the wrong spot. Always re-read and verify.
Editing dynamic-variable placeholders (
{{...}}
). They're owned by the calling system. Touch them only if the user explicitly asks.
Patching a tool's spoken
messages
to mask a prompt issue. If the agent says the wrong thing, fix the prompt — unless the tool's
```
request-start
```
message is itself the offending utterance.
Iterating with a noisy metric. If most kept failures come from one metric whose explanations look subjective, the metric is probably miscalibrated — hand off to
```
cekura-metric-improvement
```
first.
Surfacing small-sample / overfitting caveats. Internal calibration of confidence is fine; user-facing hedging reads as a stall. Note that mechanical overfitting in proposed edits (verbatim transcript quotes, scenario IDs in the prompt, hardcoded test-data values) is a different concern — the Overfitting Gate phase handles that automatically by scrubbing the just-applied edits before Eval. The "no caveats" rule applies to user-facing language about worry; it does not turn off the gate's mechanical scrub.
Skipping the Overfitting Gate or treating it as a one-shot pre-flight. The gate runs every iteration where Optimization produced non-zero edits. On no-flag iterations it's a one-line pass-through; on iterations where the LLM diagnosing pulled phrasing directly from the failing transcripts, it's the only thing standing between a memorized fix and a passing-but-non-generalizing agent. Do not short-circuit the gate to "save time" — its cost when there's nothing to scrub is negligible. The drift is most likely on iter 2+ when later edits feel incremental, the previous-iter Gate was a pass-through, and the orchestrator is tempted to apply-and-validate without re-walking the full pipeline — that is exactly when transcript-leak risk compounds (the LLM has now seen the failing transcripts on multiple diagnoses). The countermeasure is the phase-announcement rule in the Orchestration flow section above: if you can't point to the line in your output where you said "Iteration N · Overfitting Gate", you skipped it. Also: a new or revised system-prompt string literal embedded in a source file (e.g., a
```
SYSTEM_PROMPT = {...}
```
or an
```
OVERRIDE_PROMPT = {...}
```
block) is a prompt edit and MUST be scored — only orchestration-control-flow code is skipped within the Gate.
Treating expected-outcome failures and metric failures the same. Expected-outcome failures are first-class signal about agent behavior; metric failures may reflect either the agent or the metric.
Mass-deleting "unused"-looking tools. A tool with no references in this agent's squad members may still be referenced elsewhere. Prefer reference removal over delete.

跨阶段并行。提前获取后续阶段将使用的工件——最常见的是在Setup阶段获取
```
result_id
```
payload，也包括在诊断阶段分类失败前读取源文件，或在同步阶段验证写入落地前运行验证——会基于过早假设产生无效工作，且难以定位故障。每个阶段的前置条件是前一阶段的输出；仅在阶段N完成后进入阶段N+1。单个阶段步骤内批量处理独立工具调用是允许的（例如在Setup步骤1.3中获取多个VAPI工具定义）；规则针对的是阶段边界。
自托管模式下跳过Setup步骤1.4——部署路径隐患。Setup步骤1.4收集
```
redeploy_command
```
，是优化·收集前的硬性关卡；自动模式不会跳过。故障模式表现为：修改落地到用户文件，但在线进程仍运行旧代码，验证返回的转录内容看起来略有不同（不同LLM非确定性），但相同项目因相同原因失败。然后您将持续失败归因于「指令遵循能力弱」并编写更强的提示词修改——这是错误的根本原因。技能会消耗迭代上限迭代从未到达在线Agent的提示词。自动模式的无变化检测器（评估步骤EVAL.3）是后援，而非主要控制——当它触发时，已浪费了真实的迭代次数。修复：优化前，运行状态必须具备(a)
```
redeploy_command
```
设置为技能每次应用后将运行的shell命令，或(b)
```
redeploy_command == "manual"
```
且用户同意每次迭代重启暂停。在Setup步骤1.4询问一次。不要混淆「跳过每次迭代的重启暂停」（自动模式会执行）和「跳过一次性设置问题」（绝对不能）。
websocket/
file
变体中编辑错误的源文件。未经在工作区搜索系统提示词字符串常量就信任IDE打开的文件为权威。命名为
```
original_*.py
```
、
```
*.bak
```
、
```
*.snapshot.py
```
的文件，以及
```
archive/
```
/
```
backup/
```
/
```
old/
```
下的任何文件，都是强烈的「不是在线源」信号——编辑前暂停并确认。症状与跳过Setup步骤1.4相同：修改落地到服务器未读取的文件。修复：始终搜索提示词常量（例如
```
SYSTEM_PROMPT = {
```
或任何常量名称）。若搜索返回多个结果，询问哪个是在线源。IDE指针是起点提示，而非权威。
将迭代1中看起来「略有改进」的结果视为修改已落地的确认。若迭代1翻转了少量场景，但剩余失败案例显示相同项目因相同原因失败，且转录内容与原始内容在非确定性范围内，最可能的解释是修改未到达在线进程——而非提示词更改「方向正确但力度不足」。在此状态下迭代2加强提示词是错误的做法。首先验证部署路径（Setup步骤1.4的
```
redeploy_command
```
是否实际无错误运行；编辑的文件是否为服务器读取的文件）。
自动模式下触发评估前要求用户重新部署/重启/重新应用。
```
auto_mode
```
默认开启，会跳过差异批准关卡和每次迭代的用户侧部署暂停。技能直接进行验证。默认路径中不要显示「继续前请重新部署服务器」的说明块。若结果未变化，事后展示无变化假设（评估步骤EVAL.3已执行此操作）。此规则仅适用于Setup步骤1.4已解析
redeploy_command
的情况——并不允许跳过Setup步骤1.4本身。
失败集100%通过时未运行回归扫描就退出。原始失败子集2/2通过是里程碑，而非终点。退出条件是全集（用户最初提供的每个场景）100%通过，唯一确认方式是失败子集达到100%后实际运行全集。跳过扫描会掩盖修改修复场景A&B但破坏场景C的回归情况。评估步骤EVAL.4的决策树强制执行此规则——切勿仅因失败集100%通过就宣布成功。
跳过提前结束通话诊断子阶段或认为其与诊断阶段冗余。两个子阶段不可互换。提前结束案例优先分类，因为该模式会覆盖同一场景的任何其他诊断——若通话在要求的8步场景的第4步结束，针对第5步及以后行为的提示词修改是无用功。诊断阶段明确跳过提前结束的代码错误模式（步骤DIAGNOSE.3注明）；若无提前结束子阶段，这些失败案例会未分类并归因于弱指令遵循能力。
将自动模式视为完全静默。自动模式仅跳过常规关卡，而非免除技能对真正模糊输入或风险决策的澄清责任。模糊的模式解析、提示词源模糊（哪个文件？哪个变量？）、低置信度诊断、振荡、无变化特征、全部上游失败集和指标质量集群都需要显式暂停并询问。
自动模式掩盖诊断质量问题。没有每次迭代的人工差异检查，错误的诊断会静默落地，仅在重新验证失败时显现。自动模式下将振荡和无变化特征视为更严格的停止条件——告知用户并暂停，而非消耗迭代上限。
连续三次相同形态失败后生成同类修改的第4次迭代。当同一场景在连续三次提示词层面（或工具配置层面）迭代后仍以相同子结果项目失败时，正在编辑的层面显然无法修复问题。继续下去只是花费算力确认已知结果。评估步骤EVAL.4案例6强制停止：展示架构替代方案（模型切换、程序化guard、流程重构、评估器交接）并等待用户选择。不要自主选择——每个选项都有实际成本。也不要用「让我再试一次，措辞更强一点」掩盖问题——这是同类修改的第4次迭代。
常规工作强制使用
auto_mode: false
。差异批准+部署关卡在针对新Agent校准技能时有用。针对已验证诊断质量的Agent重复使用时，默认
```
auto_mode: true
```
是正确选择。
离线变体中提出工具配置修改。此处仅提示词修改有效——工具问题必须作为上游交接信息展示，而非修改建议。
自托管模式中提出VAPI风格的修改。语音
```
messages
```
（
```
request-start
```
、
```
request-complete
```
、
```
request-failed
```
）、交接
```
destinations
```
、团队
```
model.toolIds
```
——这些在VAPI外不存在。诊断阶段必须为自托管模式过滤这些修改候选。
websocket模式中将Cekura的
description
视为可信源。它最多是镜像；在线提示词在用户源代码中。编辑描述对在线Agent无影响，除非用户代码读取该描述。
自托管模式中从Cekura Agent记录读取
llm_system_prompt
，或要求用户粘贴其提示词。对于
```
assistant_provider == "self_hosted"
```
（websocket），
```
llm_system_prompt
```
几乎总是空的——在线提示词存放在用户工作区（源文件）中。请勿读取
```
llm_system_prompt
```
，也不要询问「请粘贴您当前的系统提示词以便我运行提示词优化」。相反，在工作区中定位提示词：首先是IDE打开的文件（
```
ide_opened_file
```
上下文），然后在项目文件中搜索提示词字符串常量，并通过
```
Edit
```
工具直接编辑。仅在明确的
```
offline
```
变体（无法访问工作区）中才允许要求用户粘贴。
websocket/
file
变体中使用非唯一
old_string
应用
Edit
。Edit工具在模糊匹配时会失败。每个锚点使用足够的上下文（两侧5-10行）。
无运行时状态就凭空提出变量注入问题。websocket/
```
offline
```
变体中尤其常见。除非转录内容本身显示占位符泄露，否则不要声称「运行时未收到
```
{{accountId}}
```
」。
通过读取结果级汇总字段而非每次运行的
evaluation_status
来 shortcut 优化·收集步骤COLLECT.3。
```
results_retrieve
```
payload同时暴露两者：每次运行的
```
evaluation_status
```
（审核后，权威）和结果级汇总（
```
failed_workflow_runs
```
、
```
failed_reasons.issues
```
、
```
failed_runs_count
```
、
```
success_rate
```
），后者由原始机器分数计算在人工审核前。汇总将
```
failure
```
和
```
reviewed_success
```
归为同一桶——使用它们会将人工覆盖的通过案例偷偷混入保留集，生成与审核者矛盾的修改。四桶过滤仅在应用于每次运行的
```
evaluation_status
```
时有效。
```
run_ids
```
（使用每个项目的verdict）和
```
call_ids
```
（使用每个日志的verdict）同理。步骤COLLECT.5的漏斗行必须引用
```
per-run evaluation_status
```
作为来源，以便审核跳过情况。
跳过变量状态检查（优化·收集步骤COLLECT.4），仅将失败映射到提示词部分。为实际根因在上游的失败生成虚假的提示词修复。也会破坏提前结束通话诊断子阶段，该阶段依赖COLLECT.4中捕获的信号5（通话结束归因）。
一看到非提示词失败就终止循环。退出条件是100%通过率或迭代上限——而非「首次发现基础设施形态的失败」。重新分类后再宣布上游问题。websocket/
```
file
```
模式下，还需检查失败是否为代码错误（历史截断、缺失转发、状态损坏）——这些属于编辑范围，而非交接。
诊断为代码错误时仍迭代提示词措辞。若出现振荡或无变化特征，且失败形态匹配代码错误信号（Agent忘记之前的轮次、Agent忽略明确的不要重复询问规则尽管提示词清晰等）——停止迭代提示词。转向编排代码流。若管道阻止Agent遵循指令，重复的仅提示词修改无法收敛。
websocket模式代码编辑中触及业务逻辑、认证代码或依赖项。编排代码编辑范围限于管道：历史管理、消息路由、状态保留、保活。工具实现主体、API密钥/认证代码、密钥处理、依赖列表和框架导入不在范围内。如有疑问，交接而非编辑。
VAPI或websocket
offline
模式中提出代码修改。编排代码流仅存在于websocket/
```
file
```
模式。其他模式中，代码形态问题成为上游交接信息——技能无法触及VAPI基础设施代码，离线变体无在线文件可编辑。
auto_mode: false
时跳过每次迭代的用户关卡。技能会修改在线Agent。每次PATCH/Edit前必须获得对本次迭代提案差异的明确批准。
跳过优化·同步步骤SYNC.1重新获取。VAPI的PATCH语义会整体替换嵌套对象；格式错误的请求体可能在返回200时静默擦除
```
messages
```
或
```
destinations
```
。对于websocket/
```
file
```
，模糊锚点的
```
Edit
```
调用可能落地到错误位置。始终重新读取并验证。
编辑动态变量占位符（
{{...}}
）。它们属于调用系统。仅当用户明确要求时才修改。
修改工具的语音
messages
以掩盖提示词问题。若Agent说错话，修复提示词——除非工具的
```
request-start
```
消息本身就是问题表述。
使用嘈杂的指标进行迭代。若大多数保留失败案例来自一个解释看起来主观的指标，该指标可能校准错误——先交接给
```
cekura-metric-improvement
```
。
展示小样本/过拟合警告。内部校准置信度是可以的；面向用户的犹豫听起来像是拖延。注意：提案修改中的机械过拟合（逐字转录引用、提示词中的场景ID、硬编码测试数据值）是不同的问题——过拟合检查门阶段会在评估前自动清理刚应用的修改。「无警告」规则适用于用户可见的担忧表述；不会关闭检查门的机械清理。
跳过过拟合检查门或视为一次性飞行前检查。每次优化产生非零修改时，检查门都会运行。无标记迭代时仅为一行直接传递；当诊断LLM直接从失败转录内容中提取措辞的迭代中，它是区分记忆化修复和通过但不泛化Agent的唯一屏障。不要为「节省时间」而短路检查门——无内容清理时成本可忽略。迭代2及以上最可能出现漂移，此时后续修改感觉增量式，前一次迭代检查门直接传递，编排器倾向于直接应用并验证而不重新执行完整流程——这正是转录泄露风险累积的时候（LLM现在已在多次诊断中看到失败转录内容）。对策是编排流程部分中的阶段公告规则：若您无法在输出中找到「迭代N · 过拟合检查门」的行，说明您跳过了该阶段。另外：源文件中嵌入的新或修订的系统提示词字符串字面量（例如
```
SYSTEM_PROMPT = {...}
```
或
```
OVERRIDE_PROMPT = {...}
```
块）属于提示词修改，必须进行评分——检查门仅跳过编排控制流代码。
将预期结果失败和指标失败视为相同。预期结果失败是Agent行为的一等信号；指标失败可能反映Agent或指标本身的问题。
批量删除看起来「未使用」的工具。当前Agent团队成员中无引用的工具可能在其他地方被引用。优先移除引用而非删除。

Next Steps

后续步骤

After this skill, the user typically needs:

For tool / KB / provider-integration issues surfaced in Eval Step EVAL.4 → invoke cekura-create-agent
For metric-quality issues (noisy or miscalibrated metric judges) → invoke cekura-metric-improvement
For test-suite gaps (the eval set itself is too narrow) → invoke cekura-eval-design
For metric definition / design questions → invoke cekura-metric-design

使用本技能后，用户通常需要：

针对评估步骤EVAL.4中暴露的工具/知识库/服务商集成问题 → 调用cekura-create-agent
针对指标质量问题（嘈杂或校准错误的指标判断） → 调用cekura-metric-improvement
针对测试套件缺口（评估集本身过窄） → 调用cekura-eval-design
针对指标定义/设计问题 → 调用cekura-metric-design

Documentation

文档

Public docs: https://docs.cekura.ai
Concepts: https://docs.cekura.ai/documentation/key-concepts/
Integrations: https://docs.cekura.ai/documentation/integrations/
VAPI assistant API: https://docs.vapi.ai/api-reference/assistants
VAPI tool API: https://docs.vapi.ai/api-reference/tools

公开文档：https://docs.cekura.ai
概念：https://docs.cekura.ai/documentation/key-concepts/
集成：https://docs.cekura.ai/documentation/integrations/
VAPI助手API：https://docs.vapi.ai/api-reference/assistants
VAPI工具API：https://docs.vapi.ai/api-reference/tools

Directory Layout

目录结构

cekura-self-improving-agent/
├── SKILL.md                                  # this file — orchestrator (loop point: Eval → Optimization · Collect)
├── phases/
│   ├── setup.md                              # Resolve mode, fetch agent, collect redeploy_command
│   ├── optimization/
│   │   ├── collect.md                        # Fetch + filter failures + inspect provider call state (incl. Signal 5)
│   │   ├── early-end-call-diagnose.md        # Triage main-agent-ended-early failures → closure-rule / code edits
│   │   ├── diagnose.md               # Classify Gap/Conflict/Ambig/CodeBug-other/Upstream → propose → present
│   │   ├── apply.md                          # Land combined edit set → redeploy
│   │   └── sync.md                           # Re-fetch + verify; drift rolls back to apply
│   ├── overfitting-gate.md                   # Scrub the just-applied edits for transcript/scenario overfitting
│   └── eval.md                               # Build validation set → run → re-collect → decide loop/exit/sweep
├── agents/                                   # MCP-agnostic helpers
└── providers/
    ├── vapi/
    │   ├── overview.md                       # VAPI-mode editable surfaces, anti-patterns
    │   ├── phase-1-fetch.md                  # assistant/squad/tool fetch curl bodies, edge cases
    │   └── phase-4-apply.md                  # PATCH/POST/DELETE curl bodies, loop guardrails
    └── self-hosted/
        ├── overview.md                       # self-hosted overview, shared characteristics
        └── websocket.md                      # websocket sub-flavor — file Edit + restart gate; offline variant
└── references/                               # cross-cutting (shared by every phase)
    ├── phase-2-failure-collection.md         # failure summary template, metric hand-off
    ├── phase-3-diagnosis.md                  # classification table, before/after templates
    └── dynamic-variables-debugging.md        # variable-state per-signal decision tree

cekura-self-improving-agent/
├── SKILL.md                                  # 本文件 — 编排器（循环点：评估 → 优化·收集）
├── phases/
│   ├── setup.md                              # 解析模式，获取Agent，收集redeploy_command
│   ├── optimization/
│   │   ├── collect.md                        # 获取 + 过滤失败案例 + 检查服务商调用状态（含信号5）
│   │   ├── early-end-call-diagnose.md        # 分类主Agent提前结束通话失败案例 → 关闭规则/代码修改
│   │   ├── diagnose.md               # 分类缺口/冲突/歧义/其他代码错误/上游问题 → 提案 → 展示
│   │   ├── apply.md                          # 落地整合修改集 → 重新部署
│   │   └── sync.md                           # 重新获取 + 验证；漂移则回滚到应用阶段
│   ├── overfitting-gate.md                   # 清理刚应用的修改，避免转录/场景过拟合
│   └── eval.md                               # 构建验证集 → 运行 → 重新收集 → 决定循环/退出/扫描
├── agents/                                   # 与MCP无关的助手
└── providers/
    ├── vapi/
    │   ├── overview.md                       # VAPI模式可编辑范围，反模式
    │   ├── phase-1-fetch.md                  # 助手/团队/工具获取curl请求体，边缘情况
    │   └── phase-4-apply.md                  # PATCH/POST/DELETE curl请求体，循环护栏
    └── self-hosted/
        ├── overview.md                       # 自托管概述，共享特性
        └── websocket.md                      # websocket子类型 — 文件Edit + 重启关卡；离线变体
└── references/                               # 跨领域（所有阶段共享）
    ├── phase-2-failure-collection.md         # 失败摘要模板，指标交接
    ├── phase-3-diagnosis.md                  # 分类表，前后模板
    └── dynamic-variables-debugging.md        # 变量状态按信号决策树

Phase Files (loaded on demand)

阶段文件（按需加载）

phases/setup.md
— Mode and sub-flavor resolution, agent fetch per provider,
```
redeploy_command
```
hard gate, Setup completion checklist.
phases/optimization/collect.md
— Scenario execution wait, fetch runs / call logs, verdict pre-filter (per-run
```
evaluation_status
```
), voice-channel filter, accumulate, provider call-state inspection with Signals 1–5 (including end-of-call attribution), failure summary.
phases/optimization/early-end-call-diagnose.md
— Two-check verdict-first triage ({main-agent-ended + scenario-incomplete in expected-outcome bullets}; no rationale, no borderline cases), diagnose responsible layer (closure rules / orchestration code / VAPI handoff), propose minimal early-end fixes. Pass-through if no matches.
phases/optimization/diagnose.md
— Re-read the agent's prompt + tool config, map non-early-end failures to those artifacts + variable state, classify (Gap / Conflict / Ambiguity / non-early-end CodeBug / Upstream), propose minimal scoped edits, de-conflict with early-end proposals, present the combined diff.
phases/optimization/apply.md
— Apply the combined edit set per-provider machinery, then run
```
redeploy_command
```
(or fire manual restart gate). Non-zero exit halts.
phases/optimization/sync.md
— Re-fetch / re-read just-edited artifacts, verify each changed field landed. Drift handling per failure mode; rolls back to apply on drift.
phases/overfitting-gate.md
— Inventory the just-applied edits, score against five overfitting signatures (verbatim transcript quote, scenario-specific identifier, hardcoded test-data value, hyper-narrow case clause, transcript-cloned few-shot example), decide REVISE / STRIP / KEEP, apply + sync cleanup edits when needed; pass-through when no flags.
phases/eval.md
— Validation-set construction (failure set vs. full set), validation execution, failure re-collection, decision tree (loop / sweep / exit / stop condition), iteration cap.

phases/setup.md
— 模式和子类型解析，按服务商获取Agent，
```
redeploy_command
```
硬性关卡，Setup完成清单。
phases/optimization/collect.md
— 场景执行等待，获取运行记录/调用日志，verdict预过滤（每次运行的
```
evaluation_status
```
），语音渠道过滤，累积，服务商调用状态检查（信号1-5，含通话结束归因），失败摘要。
phases/optimization/early-end-call-diagnose.md
— 两步优先检查verdict的分类（{主Agent结束通话 + 场景预期结果未完成}；无理由说明，无边界情况），诊断责任层（关闭规则/编排代码/VAPI交接），提出最小化提前结束修复建议。无匹配则直接传递。
phases/optimization/diagnose.md
— 重新读取Agent的提示词+工具配置，将非提前结束失败案例映射到这些工件+变量状态，分类（缺口/冲突/歧义/非提前结束代码错误/上游问题），提出最小化针对性修改建议，与提前结束提案冲突解决，展示整合差异。
phases/optimization/apply.md
— 通过对应服务商机制应用整合修改集，然后运行
```
redeploy_command
```
（或触发手动重启关卡）。非零退出码终止。
phases/optimization/sync.md
— 重新获取/读取刚修改的工件，验证每个修改字段是否落地。按故障模式处理漂移；漂移则回滚到应用阶段。
phases/overfitting-gate.md
— 盘点刚应用的修改，根据五个过拟合特征（逐字转录引用、场景特定标识符、硬编码测试数据值、超窄范围案例条款、转录克隆的少样本示例）评分，决定修改/删除/保留，必要时应用+同步清理修改；无标记则直接传递。
phases/eval.md
— 验证集构建（失败集vs全集），验证执行，失败重新收集，决策树（循环/扫描/退出/停止条件），迭代上限。

Reference Files (loaded on demand)

参考文件（按需加载）

providers/vapi/overview.md
— VAPI editable surfaces, what's PATCHable directly, anti-patterns.
providers/vapi/phase-1-fetch.md
— Provider-gate error message shapes, VAPI assistant + squad + tool fetch curl bodies, member summary template, Setup phase edge cases.
providers/vapi/phase-4-apply.md
— VAPI PATCH / POST / DELETE curl bodies, tool-backup pattern, validation-set construction, loop guardrails, iteration-cap exit messaging.
providers/self-hosted/overview.md
— Self-hosted umbrella; self-hosted routes to websocket.
providers/self-hosted/websocket.md
— Websocket sub-flavor gate, source-file discovery,
```
Edit
```
-based apply path, restart-server gate, pasted-prompt / pasted-failures degenerate
```
offline
```
variant, websocket-specific edge cases.
references/phase-2-failure-collection.md
— Full failure-summary template, the metric-improvement hand-off wording, edge cases (no failures / all-errored / mixed inputs), and the no-overfitting-caveats rule.
references/phase-3-diagnosis.md
— Full classification table with examples, before/after templates per edit surface, tool-edit anti-patterns, the manual-vs-automated-improver guidance, Optimization-phase anti-patterns.
references/dynamic-variables-debugging.md
— Per-signal decision tree for variable state, where each signal lives in the Cekura payload, the direct-VAPI fallback, the
```
runs_bulk_retrieve
```
bare-string gotcha, squad per-member-message caveats.

providers/vapi/overview.md
— VAPI可编辑范围，可直接PATCH的内容，反模式。
providers/vapi/phase-1-fetch.md
— 服务商关卡错误消息形态，VAPI助手+团队+工具获取curl请求体，成员摘要模板，Setup阶段边缘情况。
providers/vapi/phase-4-apply.md
— VAPI PATCH/POST/DELETE curl请求体，工具备份模式，验证集构建，循环护栏，迭代上限退出消息。
providers/self-hosted/overview.md
— 自托管总览；自托管路由到websocket。
providers/self-hosted/websocket.md
— websocket子类型关卡，源文件发现，基于Edit的应用路径，重启服务器关卡，粘贴提示词/粘贴失败案例的简化
```
offline
```
变体，websocket特定边缘情况。
references/phase-2-failure-collection.md
— 完整失败摘要模板，指标改进交接措辞，边缘情况（无失败/全部错误/混合输入），无过拟合警告规则。
references/phase-3-diagnosis.md
— 带示例的完整分类表，每个编辑范围的前后模板，工具修改反模式，手动vs自动改进指南，优化阶段反模式。
references/dynamic-variables-debugging.md
— 变量状态按信号决策树，每个信号在Cekura payload中的位置，直接VAPI回退，
```
runs_bulk_retrieve
```
裸字符串陷阱，团队每个成员消息注意事项。