Error Recovery

Value: Courage -- autonomous operation requires resilience. Recovering from errors without human intervention keeps the workflow moving. Knowing when to escalate prevents wasted effort on unrecoverable situations.

Purpose

Teaches agents to handle unexpected errors during autonomous operation (API failures, build tool crashes, permission issues, resource exhaustion). Provides classification, retry strategies, and escalation rules. Prevents the failure modes of infinite retry loops, silent error swallowing, and unnecessary human interruptions for recoverable issues.

Practices

Classify Before Acting

When an error occurs, classify it before attempting recovery:
| Category | Examples | Recovery |
| --- | --- | --- |
| Transient | Network timeout, 503, rate limit, lock contention | Retry with backoff |
| Environmental | Missing dependency, wrong version, port conflict | Fix environment, then retry |
| Permission | File permission denied, auth token expired | Escalate to user |
| Logic | Assertion failure, type error, schema mismatch | Do NOT retry -- investigate |
| Resource | Out of memory, disk full, context exhaustion | Reduce scope or escalate |
Do not retry logic errors. If a test fails, an assertion fires, or a type mismatch occurs, retrying will produce the same result. Switch to the debugging protocol instead.
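
As a minimal sketch of the classification step, the table can be written as a lookup. The keyword lists, the `UNKNOWN` category, and the names `classify` and `RECOVERY` are illustrative assumptions, not part of this skill; a real agent would classify from richer signals than message text.

```python
from enum import Enum


class Category(Enum):
    TRANSIENT = "transient"
    ENVIRONMENTAL = "environmental"
    PERMISSION = "permission"
    LOGIC = "logic"
    RESOURCE = "resource"
    UNKNOWN = "unknown"  # assumption: errors that match nothing get escalated


# Hypothetical keyword heuristics based on the examples in the table.
# Order matters: the first category with a matching keyword wins.
KEYWORDS = [
    (Category.PERMISSION, ("permission denied", "token expired")),
    (Category.RESOURCE, ("out of memory", "disk full", "context limit")),
    (Category.TRANSIENT, ("timeout", "timed out", "503", "rate limit", "lock contention")),
    (Category.ENVIRONMENTAL, ("missing dependency", "wrong version", "address already in use")),
    (Category.LOGIC, ("assertion", "type error", "typeerror", "schema mismatch")),
]

# Recovery action per category, mirroring the table's last column.
RECOVERY = {
    Category.TRANSIENT: "retry-with-backoff",
    Category.ENVIRONMENTAL: "fix-environment",
    Category.PERMISSION: "escalate",
    Category.LOGIC: "investigate",
    Category.RESOURCE: "reduce-scope-or-escalate",
    Category.UNKNOWN: "escalate",
}


def classify(message: str) -> Category:
    """Return the first category whose keywords appear in the message."""
    text = message.lower()
    for category, keywords in KEYWORDS:
        if any(keyword in text for keyword in keywords):
            return category
    return Category.UNKNOWN
```

For example, `RECOVERY[classify("AssertionError: expected 2")]` yields `"investigate"`, so a failing assertion is never routed to the retry path.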

Retry Strategy: Exponential Backoff

For transient errors, retry with exponential backoff:
  1. First retry: Wait 2 seconds
  2. Second retry: Wait 5 seconds
  3. Third retry: Wait 15 seconds
  4. After third failure: Stop retrying and escalate
Never retry more than 3 times for the same error. Never retry without waiting. Never use a fixed retry loop without backoff.
Rate limit handling: If the error includes a `Retry-After` header or equivalent, respect it. Do not retry before the indicated time.
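
The retry schedule above could be sketched as follows. `TransientError`, `EscalationNeeded`, and `retry_with_backoff` are hypothetical names introduced for illustration, and `retry_after` is assumed to carry a `Retry-After` value already parsed into seconds. The `sleep` parameter is injectable so the schedule can be exercised without real waiting.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

BACKOFF_SECONDS = (2, 5, 15)  # waits before the first, second, and third retry


class TransientError(Exception):
    """Hypothetical error type; retry_after holds a Retry-After value in seconds."""

    def __init__(self, message: str, retry_after: Optional[float] = None):
        super().__init__(message)
        self.retry_after = retry_after


class EscalationNeeded(Exception):
    """Raised once all three retries have failed."""


def retry_with_backoff(
    operation: Callable[[], T],
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    try:
        return operation()  # initial attempt, not counted as a retry
    except TransientError as exc:
        last_error = exc
    for delay in BACKOFF_SECONDS:
        # Never retry earlier than the server-indicated time.
        sleep(max(delay, last_error.retry_after or 0))
        try:
            return operation()
        except TransientError as exc:
            last_error = exc
    raise EscalationNeeded(f"3 retries exhausted: {last_error}") from last_error
```

Only `TransientError` is caught, so any other exception propagates immediately rather than being retried.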

Error Logging

When an error occurs, log it in a structured format before attempting recovery:

Error Log: [timestamp]

  • Category: transient | environmental | permission | logic | resource
  • Error: [exact error message]
  • Context: [what was happening when the error occurred]
  • Action taken: [retry | escalate | investigate | fix-environment]
  • Outcome: [resolved | escalated | investigating]

In pipeline mode, append to `.factory/audit-trail/slices/<slice-id>/error-log.md`. In standalone mode, write to the project's scratch directory or memory.
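
A minimal sketch of producing and appending such an entry is shown below. The function names `format_error_log` and `append_entry` are assumptions for illustration, and the exact markdown styling of the entry (heading level, bullet marker) is a guess at the template's intent. The caller supplies the log path, since the slice id is only known at runtime.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

CATEGORIES = {"transient", "environmental", "permission", "logic", "resource"}
ACTIONS = {"retry", "escalate", "investigate", "fix-environment"}
OUTCOMES = {"resolved", "escalated", "investigating"}


def format_error_log(
    category: str,
    error: str,
    context: str,
    action: str,
    outcome: str,
    timestamp: Optional[str] = None,
) -> str:
    """Render one log entry, rejecting values outside the template's vocabulary."""
    checks = ((category, CATEGORIES, "category"),
              (action, ACTIONS, "action"),
              (outcome, OUTCOMES, "outcome"))
    for value, allowed, name in checks:
        if value not in allowed:
            raise ValueError(f"unknown {name}: {value!r}")
    stamp = timestamp or datetime.now(timezone.utc).isoformat(timespec="seconds")
    return "\n".join([
        f"Error Log: {stamp}",
        "",
        f"- Category: {category}",
        f"- Error: {error}",
        f"- Context: {context}",
        f"- Action taken: {action}",
        f"- Outcome: {outcome}",
    ]) + "\n"


def append_entry(log_path: Path, entry: str) -> None:
    """Append an entry to the log file, creating parent directories if needed."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(entry + "\n")
```

Validating the category, action, and outcome up front keeps the log machine-readable: a misspelled value fails loudly instead of producing an entry that later tooling cannot parse.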

Environmental Recovery

For environmental errors (missing tools, wrong versions, port conflicts):
  1. Identify the specific environmental issue
  2. Attempt a targeted fix (install missing dependency, kill conflicting process, clear stale lock file)
  3. Verify the fix resolved the issue
  4. Retry the original operation ONCE
  5. If it fails again, escalate -- the environment may need manual intervention
Port conflicts: Check for processes using the port with `lsof -i :<port>` or equivalent. If the process is not related to the current project, report it to the user rather than killing it.
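
The five-step environmental recovery flow above can be sketched as a small function. `recover_environment` and its three callables are hypothetical: the caller supplies the targeted fix, the verification check, and the original operation.

```python
from typing import Callable


def recover_environment(
    fix: Callable[[], None],
    verify: Callable[[], bool],
    operation: Callable[[], None],
) -> str:
    """Apply a targeted fix, verify it, then retry the operation exactly once."""
    fix()
    if not verify():
        # The fix did not take effect, so retrying would fail the same way.
        return "escalate: fix could not be verified"
    try:
        operation()  # the single permitted retry
    except Exception as exc:
        return f"escalate: retry failed after fix ({exc})"
    return "resolved"
```

Returning a status string rather than looping makes the "retry ONCE" rule structural: there is no code path that attempts the operation a second time.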

Context Exhaustion Recovery

When approaching context limits during long operations:
  1. Write current state to WORKING_STATE.md immediately
  2. Complete the current atomic operation if possible
  3. Signal that continuation is needed
  4. Do NOT start new operations that cannot complete in remaining context
This prevents the failure mode of starting work that cannot be finished, leaving the project in an inconsistent state.
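
As a rough sketch of steps 1 and 4, the following assumes the agent can estimate token costs. The `reserve_tokens` default, the layout of the state file, and both function names are illustrative assumptions; only the `WORKING_STATE.md` filename comes from this skill.

```python
from pathlib import Path
from typing import List


def save_working_state(directory: Path, completed: List[str], remaining: List[str]) -> Path:
    """Write a snapshot of progress to WORKING_STATE.md (hypothetical layout)."""
    lines = ["Working State", "", "Completed:"]
    lines += [f"- {item}" for item in completed]
    lines += ["", "Remaining:"]
    lines += [f"- {item}" for item in remaining]
    path = Path(directory) / "WORKING_STATE.md"
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


def can_start(estimated_tokens: int, remaining_tokens: int, reserve_tokens: int = 2000) -> bool:
    """Only start an operation that fits in the remaining context, keeping a reserve."""
    return estimated_tokens + reserve_tokens <= remaining_tokens
```

The reserve exists so that saving state and signalling continuation remain possible even when an estimate turns out to be low.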

Escalation Rules

Escalate to the user in any of these cases:
  • Permission errors (you cannot fix what you cannot access)
  • Logic errors after investigation (the bug needs human insight)
  • 3 retries exhausted for a transient error (the service may be down)
  • Environmental fix failed (the environment may need manual repair)
  • Resource exhaustion (context limit, disk space, memory)
  • Any error you cannot classify (unknown errors are dangerous)
How to escalate: Provide the error category, the exact error message, what you tried, and what you recommend. Do not just say "an error occurred."
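
One way to make sure an escalation always carries the four required pieces is to build it from a small record. The `Escalation` class and its output layout are illustrative assumptions, not a format this skill prescribes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Escalation:
    category: str
    error: str           # the exact error message, not a paraphrase
    attempts: List[str]  # what was already tried
    recommendation: str

    def render(self) -> str:
        """Build the message shown to the user; refuse to send a vague report."""
        if not self.error.strip():
            raise ValueError("include the exact error message")
        tried = "; ".join(self.attempts) if self.attempts else "nothing yet"
        return (
            f"Category: {self.category}\n"
            f"Error: {self.error}\n"
            f"Tried: {tried}\n"
            f"Recommendation: {self.recommendation}"
        )
```

For instance, `Escalation("transient", "HTTP 503 from registry", ["retry after 2s", "retry after 5s", "retry after 15s"], "check registry status").render()` yields a four-line report rather than "an error occurred."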

Pipeline Integration

In factory pipeline mode, error recovery integrates with the rework protocol:
  • Transient errors during CI: auto-retry once (standard/full autonomy)
  • Build tool crashes: classify and apply the appropriate strategy
  • Gate failures: these are NOT errors -- they are expected feedback from quality gates. Do not apply error recovery to gate failures.

Enforcement Note

This skill provides advisory guidance. It instructs the agent on error classification and recovery strategies but cannot mechanically enforce retry limits or prevent silent error swallowing. The agent follows these practices by convention. If you observe the agent retrying endlessly or ignoring errors, point it out.

Verification

After recovering from an error, verify:
  • Error was classified before any recovery attempt
  • Logic errors were NOT retried (investigated instead)
  • Retries used exponential backoff (not immediate)
  • No more than 3 retries for the same error
  • Error was logged with category, message, context, and outcome
  • Escalation included the error category, message, attempts, and recommendation
  • Environmental fixes were verified before retrying the operation
  • State was saved before context exhaustion recovery
If any criterion is not met, revisit the relevant practice.

Dependencies

This skill works standalone. For enhanced workflows, it integrates with:
  • debugging-protocol: Logic errors escalate to the debugging protocol for systematic investigation
  • pipeline: Pipeline gate failures are handled by the rework protocol, not error recovery
  • session-reflection: Recurring errors become system prompt refinements
  • ci-integration: CI failures classify as transient (infra) or logic (test failure) for appropriate handling
Missing a dependency? Install with:
npx skills add jwilger/agent-skills --skill debugging-protocol