agent-introspection-debugging

Agent Introspection Debugging

Use this skill when an agent run is failing repeatedly, consuming tokens without progress, looping on the same tools, or drifting away from the intended task.

This is a workflow skill, not a hidden runtime. It teaches the agent to debug itself systematically before escalating to a human.

When to Activate

Maximum tool call / loop-limit failures
Repeated retries with no forward progress
Context growth or prompt drift that starts degrading output quality
File-system or environment state mismatch between expectation and reality
Tool failures that are likely recoverable with diagnosis and a smaller corrective action

Scope Boundaries

Activate this skill for:

capturing failure state before retrying blindly
diagnosing common agent-specific failure patterns
applying contained recovery actions
producing a structured human-readable debug report

Do not use this skill as the primary source for:

feature verification after code changes; use `verification-loop`
framework-specific debugging when a narrower ECC skill already exists
runtime promises the current harness cannot enforce automatically

Four-Phase Loop

Phase 1: Failure Capture

Before trying to recover, record the failure precisely.

Capture:

error type, message, and stack trace when available
last meaningful tool call sequence
what the agent was trying to do
current context pressure: repeated prompts, oversized pasted logs, duplicated plans, or runaway notes
current environment assumptions: cwd, branch, relevant service state, expected files

Minimum capture template:

Failure Capture

Session / task:
Goal in progress:
Error:
Last successful step:
Last failed tool / command:
Repeated pattern seen:
Environment assumptions to verify:
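
The template above can also be filled programmatically. The following is a minimal sketch, not part of the skill itself: the `FailureCapture` class, its field names, and its default values are hypothetical and chosen only to mirror the template fields.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FailureCapture:
    """One record per failure; fields mirror the capture template (names are illustrative)."""
    session: str
    goal: str
    error: str
    last_successful_step: str = "unknown"
    last_failed_command: str = "unknown"
    repeated_pattern: str = "none observed"
    assumptions_to_verify: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the record as plain text, one template field per line."""
        assumptions = ", ".join(self.assumptions_to_verify) or "none listed"
        return "\n".join([
            "Failure Capture",
            f"Session / task: {self.session}",
            f"Goal in progress: {self.goal}",
            f"Error: {self.error}",
            f"Last successful step: {self.last_successful_step}",
            f"Last failed tool / command: {self.last_failed_command}",
            f"Repeated pattern seen: {self.repeated_pattern}",
            f"Environment assumptions to verify: {assumptions}",
        ])
```

Defaulting unknown fields to explicit placeholders such as "unknown" keeps a partially filled capture honest about what was not recorded.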

Phase 2: Root-Cause Diagnosis

Match the failure to a known pattern before changing anything.

| Pattern | Likely Cause | Check |
| --- | --- | --- |
| Maximum tool calls / repeated same command | loop or no-exit observer path | inspect the last N tool calls for repetition |
| Context overflow / degraded reasoning | unbounded notes, repeated plans, oversized logs | inspect recent context for duplication and low-signal bulk |
| `ECONNREFUSED` / timeout | service unavailable or wrong port | verify service health, URL, and port assumptions |
| `429` / quota exhaustion | retry storm or missing backoff | count repeated calls and inspect retry spacing |
| file missing after write / stale diff | race, wrong cwd, or branch drift | re-check path, cwd, git status, and actual file existence |
| tests still failing after “fix” | wrong hypothesis | isolate the exact failing test and re-derive the bug |
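
As a rough illustration of matching a failure to a known pattern, the sketch below maps error text to a pattern name and its check from the table. The function name `classify_failure`, the pattern labels, and the regular expressions are assumptions made for this example; a real agent would likely need richer signals than keyword matching.

```python
import re
from typing import Optional, Tuple

# (pattern label, regex over the error text, check copied from the table).
# The regexes are illustrative guesses, not an exhaustive matcher.
KNOWN_PATTERNS = [
    ("connection refused / timeout", r"ECONNREFUSED|timed? ?out",
     "verify service health, URL, and port assumptions"),
    ("rate limit / quota", r"\b429\b|quota|rate limit",
     "count repeated calls and inspect retry spacing"),
    ("tool-call limit / loop", r"max(imum)? tool calls|loop limit",
     "inspect the last N tool calls for repetition"),
    ("context overflow", r"context (length|overflow|window)",
     "inspect recent context for duplication and low-signal bulk"),
    ("missing file / stale state", r"ENOENT|No such file",
     "re-check path, cwd, git status, and actual file existence"),
]


def classify_failure(error_text: str) -> Optional[Tuple[str, str]]:
    """Return (pattern label, suggested check) for the first match, else None."""
    for label, regex, check in KNOWN_PATTERNS:
        if re.search(regex, error_text, flags=re.IGNORECASE):
            return label, check
    return None  # Unknown pattern: fall back to the diagnosis questions.
```

Returning `None` rather than a default guess keeps an unrecognized failure from being forced into the wrong row.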

Diagnosis questions:

is this a logic failure, state failure, environment failure, or policy failure?
did the agent lose the real objective and start optimizing the wrong subtask?
is the failure deterministic or transient?
what is the smallest reversible action that would validate the diagnosis?
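
For the `ECONNREFUSED` / timeout pattern, one small and reversible validating action is a plain TCP probe. The sketch below assumes the service listens on TCP; the helper name `port_is_open` and the one-second default timeout are hypothetical choices for this example.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

The probe reads state without changing it, so it is safe to run before deciding whether the failure is deterministic (wrong port) or transient (service still starting).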

Phase 3: Contained Recovery

Recover with the smallest action that yields new evidence for or against the diagnosis.

Safe recovery actions:

stop repeated retries and restate the hypothesis
trim low-signal context and keep only the active goal, blockers, and evidence
re-check the actual filesystem / branch / process state
narrow the task to one failing command, one file, or one test
switch from speculative reasoning to direct observation
escalate to a human when the failure is high-risk or externally blocked
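
Re-checking the actual filesystem and branch state can be done with direct observation rather than memory. The following is a sketch under stated assumptions: the function names and message formats are hypothetical, and the branch lookup relies on the `git` CLI being installed (it returns `None` otherwise).

```python
import os
import subprocess
from pathlib import Path
from typing import List, Optional


def current_git_branch(cwd: str) -> Optional[str]:
    """Return the checked-out branch name, or None if git fails or cwd is not a repo."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--abbrev-ref", "HEAD"],
            cwd=cwd, capture_output=True, text=True, timeout=5,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    return result.stdout.strip() if result.returncode == 0 else None


def check_world_state(expected_cwd: str, expected_files: List[str],
                      expected_branch: Optional[str] = None) -> List[str]:
    """Return a list of mismatches; an empty list means every assumption held."""
    mismatches = []
    actual_cwd = os.path.realpath(os.getcwd())
    if actual_cwd != os.path.realpath(expected_cwd):
        mismatches.append(f"cwd is {actual_cwd}, expected {expected_cwd}")
    for rel in expected_files:
        if not (Path(expected_cwd) / rel).exists():
            mismatches.append(f"missing file: {rel}")
    if expected_branch is not None:
        branch = current_git_branch(expected_cwd)
        if branch != expected_branch:
            mismatches.append(f"branch is {branch}, expected {expected_branch}")
    return mismatches
```

Every call here only reads state, which keeps the check within the "smallest reversible action" spirit of this phase.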

Do not claim unsupported auto-healing actions like “reset agent state” or “update harness config” unless you are actually doing them through real tools in the current environment.

Contained recovery checklist:

Recovery Action

Diagnosis chosen:
Smallest action taken:
Why this is safe:
What evidence would prove the fix worked:

Phase 4: Introspection Report

End with a report that makes the recovery legible to the next agent or human.

Agent Self-Debug Report

Session / task:
Failure:
Root cause:
Recovery action:
Result: success | partial | blocked
Token / time burn risk:
Follow-up needed:
Preventive change to encode later:
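
If the report is generated in code, constraining the result field to the three allowed values helps keep reports comparable. This is an illustrative sketch only: `Result`, `SelfDebugReport`, and the default placeholder strings are hypothetical names, not something the skill defines.

```python
from dataclasses import dataclass
from enum import Enum


class Result(Enum):
    """The three outcomes the report template allows."""
    SUCCESS = "success"
    PARTIAL = "partial"
    BLOCKED = "blocked"


@dataclass
class SelfDebugReport:
    session: str
    failure: str
    root_cause: str
    recovery_action: str
    result: Result
    burn_risk: str = "unknown"
    follow_up: str = "none"
    preventive_change: str = "none"

    def render(self) -> str:
        """Render the report as plain text, one template field per line."""
        return "\n".join([
            "Agent Self-Debug Report",
            f"Session / task: {self.session}",
            f"Failure: {self.failure}",
            f"Root cause: {self.root_cause}",
            f"Recovery action: {self.recovery_action}",
            f"Result: {self.result.value}",
            f"Token / time burn risk: {self.burn_risk}",
            f"Follow-up needed: {self.follow_up}",
            f"Preventive change to encode later: {self.preventive_change}",
        ])
```

Constructing `Result` from any other string raises `ValueError`, so a vague outcome such as "fixed" cannot slip into a report unnoticed.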

Recovery Heuristics

Prefer these interventions in order:

Restate the real objective in one sentence.
Verify the world state instead of trusting memory.
Shrink the failing scope.
Run one discriminating check.
Only then retry.

Bad pattern:

retrying the same action three times with slightly different wording
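
This bad pattern can be flagged mechanically by comparing recent tool calls for near-duplicates. The sketch below uses the standard library's `difflib.SequenceMatcher`; the function name, the 0.9 similarity threshold, and the repeat count of three are assumptions for illustration and would need tuning against real call logs.

```python
from difflib import SequenceMatcher
from typing import List


def looks_like_retry_loop(recent_calls: List[str],
                          threshold: float = 0.9,
                          min_repeats: int = 3) -> bool:
    """Return True if the last `min_repeats` calls are all near-identical to the first of them."""
    if len(recent_calls) < min_repeats:
        return False
    tail = recent_calls[-min_repeats:]
    first = tail[0]
    # ratio() is 1.0 for identical strings and approaches 0.0 for unrelated ones.
    return all(
        SequenceMatcher(None, first, other).ratio() >= threshold
        for other in tail[1:]
    )
```

For example, three calls that differ only by a trailing flag would be flagged, while three distinct commands would not.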

Good pattern:

capture failure
classify the pattern
run one direct check
change the plan only if the check supports it

Integration with ECC

Use `verification-loop` after recovery if code was changed.
Use `continuous-learning-v2` when the failure pattern is worth turning into an instinct or later skill.
Use `council` when the issue is not technical failure but decision ambiguity.
Use `workspace-surface-audit` if the failure came from conflicting local state or repo drift.

Output Standard

When this skill is active, do not end with “I fixed it” alone.

Always provide:

the failure pattern
the root-cause hypothesis
the recovery action
the evidence that the situation is now better or still blocked
