error-recovery

Error Recovery

Value: Courage -- autonomous operation requires resilience. Recovering from errors without human intervention keeps the workflow moving. Knowing when to escalate prevents wasted effort on unrecoverable situations.

Purpose

Teaches agents to handle unexpected errors during autonomous operation (API failures, build tool crashes, permission issues, resource exhaustion). Provides classification, retry strategies, and escalation rules. Prevents the failure modes of infinite retry loops, silent error swallowing, and unnecessary human interruptions for recoverable issues.

Practices

Classify Before Acting

When an error occurs, classify it before attempting recovery:

Category	Examples	Recovery
Transient	Network timeout, 503, rate limit, lock contention	Retry with backoff
Environmental	Missing dependency, wrong version, port conflict	Fix environment, then retry
Permission	File permission denied, auth token expired	Escalate to user
Logic	Assertion failure, type error, schema mismatch	Do NOT retry -- investigate
Resource	Out of memory, disk full, context exhaustion	Reduce scope or escalate

Do not retry logic errors. If a test fails, an assertion fires, or a type mismatch occurs, retrying will produce the same result. Switch to the debugging protocol instead.
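
The table above can be sketched as a lookup from category to recovery action. This is a minimal illustration, not part of the skill itself: the `Category` enum, the `HINTS` keyword lists, and the `classify` function are hypothetical names, and matching on message text is an assumption made for brevity. A real classifier would more likely inspect exit codes, HTTP status codes, or exception types.

```python
from enum import Enum
from typing import Optional


class Category(Enum):
    TRANSIENT = "transient"
    ENVIRONMENTAL = "environmental"
    PERMISSION = "permission"
    LOGIC = "logic"
    RESOURCE = "resource"


# Recovery action per category, copied from the table above.
RECOVERY = {
    Category.TRANSIENT: "retry with backoff",
    Category.ENVIRONMENTAL: "fix environment, then retry",
    Category.PERMISSION: "escalate to user",
    Category.LOGIC: "do not retry; investigate",
    Category.RESOURCE: "reduce scope or escalate",
}

# Hypothetical keyword hints, for illustration only.
HINTS = [
    (Category.TRANSIENT, ("timeout", "503", "rate limit", "lock contention")),
    (Category.ENVIRONMENTAL, ("not found", "wrong version", "address already in use")),
    (Category.PERMISSION, ("permission denied", "token expired")),
    (Category.LOGIC, ("assertion", "typeerror", "schema")),
    (Category.RESOURCE, ("out of memory", "no space left", "context length")),
]


def classify(message: str) -> Optional[Category]:
    """Return the first matching category, or None if the error is unclassifiable."""
    lowered = message.lower()
    for category, keywords in HINTS:
        if any(keyword in lowered for keyword in keywords):
            return category
    return None  # unknown errors are escalated, never retried
```

Returning `None` for an unrecognized error keeps the unknown case explicit, so the caller can escalate rather than guess.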

Retry Strategy: Exponential Backoff

For transient errors, retry with exponential backoff, waiting longer before each successive attempt:

First retry: Wait 2 seconds
Second retry: Wait 5 seconds
Third retry: Wait 15 seconds
After the third retry fails: Stop retrying and escalate

Never retry more than 3 times for the same error. Never retry without waiting. Never use a fixed retry loop without backoff.

Rate limit handling: If the error includes a `Retry-After` header or equivalent, respect it. Do not retry before the indicated time.
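
A minimal sketch of this retry schedule follows, assuming the error has already been classified as transient. `TransientError`, `RetriesExhausted`, and `retry_with_backoff` are hypothetical names introduced for illustration, and the `retry_after` attribute stands in for whatever the failing call reports through a `Retry-After` header or equivalent.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

# Waits before the first, second, and third retry, as listed above.
BACKOFF_SECONDS = (2, 5, 15)


class TransientError(Exception):
    """Hypothetical marker for an error already classified as transient."""

    def __init__(self, message: str, retry_after: Optional[float] = None):
        super().__init__(message)
        self.retry_after = retry_after  # seconds requested by the server, if any


class RetriesExhausted(Exception):
    """Raised once the third retry has failed; the caller should escalate."""


def retry_with_backoff(
    operation: Callable[[], T],
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    try:
        return operation()
    except TransientError as err:
        last = err
    for delay in BACKOFF_SECONDS:
        # Never retry earlier than a server-provided Retry-After value.
        sleep(max(delay, last.retry_after or 0))
        try:
            return operation()
        except TransientError as err:
            last = err
    raise RetriesExhausted(str(last)) from last
```

The `sleep` parameter is injectable only so the schedule can be exercised without real waiting; any other exception type propagates immediately, which keeps logic errors out of the retry loop.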

Error Logging

When an error occurs, log it to a structured format before attempting recovery:

Error Log: [timestamp]

Category: transient | environmental | permission | logic | resource
Error: [exact error message]
Context: [what was happening when the error occurred]
Action taken: [retry | escalate | investigate | fix-environment]
Outcome: [resolved | escalated | investigating]


In pipeline mode, append to
`.factory/audit-trail/slices/<slice-id>/error-log.md`.
In standalone mode, write to the project's scratch directory or memory.
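
As a rough sketch, an entry in the format above could be built and appended like this. The function names `format_entry` and `append_entry` are hypothetical, and the ISO 8601 UTC timestamp is an assumption, since the template only says `[timestamp]`.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def format_entry(
    category: str,
    error: str,
    context: str,
    action: str,
    outcome: str,
    timestamp: Optional[str] = None,
) -> str:
    """Render one log entry using the field names from the template above."""
    # Assumption: ISO 8601 in UTC; the template does not fix a timestamp format.
    ts = timestamp or datetime.now(timezone.utc).isoformat(timespec="seconds")
    return (
        f"Error Log: {ts}\n\n"
        f"Category: {category}\n"
        f"Error: {error}\n"
        f"Context: {context}\n"
        f"Action taken: {action}\n"
        f"Outcome: {outcome}\n"
    )


def append_entry(log_path: Path, entry: str) -> None:
    """Append an entry to the log file, creating parent directories if needed."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(entry + "\n")
```

Opening the file in append mode means earlier entries are never overwritten, which matches the "append to" instruction for pipeline mode.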

Environmental Recovery

For environmental errors (missing tools, wrong versions, port conflicts):

Identify the specific environmental issue
Attempt a targeted fix (install missing dependency, kill conflicting process, clear stale lock file)
Verify the fix resolved the issue
Retry the original operation ONCE
If it fails again, escalate -- the environment may need manual intervention

Port conflicts: Check for processes using the port with `lsof -i :<port>` or equivalent. If the process is not related to the current project, report it to the user rather than killing it.
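
The fix, verify, retry-once sequence above might be sketched as follows. `recover_environment` and its three callables are hypothetical; the environmental issue is assumed to have been identified already, and the caller supplies the targeted fix, the check that confirms it, and the original operation.

```python
from typing import Callable


def recover_environment(
    fix: Callable[[], None],
    verify: Callable[[], bool],
    operation: Callable[[], None],
) -> str:
    """Apply a targeted fix, verify it, then retry the operation exactly once.

    Returns "resolved" or "escalated", matching the Outcome values used in
    the error log.
    """
    fix()
    if not verify():
        return "escalated"  # the fix did not take; do not retry blindly
    try:
        operation()  # the single permitted retry
    except Exception:
        # A second failure suggests the environment needs manual intervention.
        return "escalated"
    return "resolved"
```

There is deliberately no loop here: one retry after a verified fix, then escalation.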

Context Exhaustion Recovery

When approaching context limits during long operations:

Write current state to WORKING_STATE.md immediately
Complete the current atomic operation if possible
Signal that continuation is needed
Do NOT start new operations that cannot complete in remaining context

This prevents the failure mode of starting work that cannot be finished, leaving the project in an inconsistent state.
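
One way to picture the rules above is a budget check before each new operation plus a state dump. Everything here is an assumption made for illustration: the token estimates, the `reserve` margin, the function names, and the layout written to WORKING_STATE.md, which the skill does not specify.

```python
from pathlib import Path
from typing import Iterable


def can_start(estimated_tokens: int, remaining_tokens: int, reserve: int = 2000) -> bool:
    """Allow a new operation only if it fits in the remaining context.

    `reserve` is a hypothetical safety margin kept free for writing state
    and signalling that continuation is needed.
    """
    return estimated_tokens + reserve <= remaining_tokens


def save_working_state(directory: Path, summary: str, next_steps: Iterable[str]) -> Path:
    """Write the current state to WORKING_STATE.md (layout is an assumption)."""
    path = directory / "WORKING_STATE.md"
    lines = ["Working State", "", summary, "", "Next steps:"]
    lines.extend(f"- {step}" for step in next_steps)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path
```

Checking the budget before starting, rather than reacting after a failure, is what keeps half-finished work out of the project.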

Escalation Rules

Escalate to the user in any of these cases:

Permission errors (you cannot fix what you cannot access)
Logic errors after investigation (the bug needs human insight)
3 retries exhausted for a transient error (the service may be down)
Environmental fix failed (the environment may need manual repair)
Resource exhaustion (context limit, disk space, memory)
Any error you cannot classify (unknown errors are dangerous)

How to escalate: Provide the error category, the exact error message, what you tried, and what you recommend. Do not just say "an error occurred."
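
A small sketch of an escalation message that carries the four required pieces of information is shown below. `build_escalation` is a hypothetical helper, and the exact layout of the message is an assumption.

```python
from typing import Sequence


def build_escalation(
    category: str,
    error: str,
    attempted: Sequence[str],
    recommendation: str,
) -> str:
    """Assemble an escalation message with category, error, attempts, and advice."""
    if not error.strip():
        # Refuse to produce a bare "an error occurred" style report.
        raise ValueError("the exact error message is required")
    tried = "\n".join(f"- {step}" for step in attempted) or "- nothing attempted"
    return (
        f"Category: {category}\n"
        f"Error: {error}\n"
        f"Tried:\n{tried}\n"
        f"Recommendation: {recommendation}\n"
    )
```

For example, after three failed retries against a package registry, `attempted` would list each retry with its wait time, and `recommendation` might suggest checking the registry's status page.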

Pipeline Integration

In factory pipeline mode, error recovery integrates with the rework protocol:

Transient errors during CI: auto-retry once (standard/full autonomy)
Build tool crashes: classify and apply the appropriate strategy
Gate failures: these are NOT errors -- they are expected feedback from quality gates. Do not apply error recovery to gate failures.

Enforcement Note

This skill provides advisory guidance. It instructs the agent on error classification and recovery strategies but cannot mechanically enforce retry limits or prevent silent error swallowing. The agent follows these practices by convention. If you observe the agent retrying endlessly or ignoring errors, point it out.

Verification

Dependencies

This skill works standalone. For enhanced workflows, it integrates with:

debugging-protocol: Logic errors escalate to the debugging protocol for systematic investigation
pipeline: Pipeline gate failures are handled by the rework protocol, not error recovery
session-reflection: Recurring errors become system prompt refinements
ci-integration: CI failures classify as transient (infra) or logic (test failure) for appropriate handling

Missing a dependency? Install with:

npx skills add jwilger/agent-skills --skill debugging-protocol
