error-recovery
Error Recovery Protocol
When an error occurs, stop, think, and try the right recovery strategy. No blind retries — understand the error signal first, then act.
Core principle: Every error carries a signal. Read the signal first, then act.
Error Classification
Classify every error into one of 4 categories — the recovery strategy depends on the category:
Transient Error
Retrying usually fixes it. Infrastructure or network related.
- Examples: timeout, rate limit (429), connection drop, temporary service outage
- Strategy: Wait & Retry with exponential backoff
Configuration Error
Environment or setup issue. Code is correct but setup is wrong.
- Examples: missing env variable, wrong file path, permission denied, missing dependency
- Strategy: Fix & Continue — identify the issue, fix it, re-run
Logic Error
Code or approach is wrong. Retrying produces the same error.
- Examples: KeyError, TypeError, wrong algorithm, expectation mismatch
- Strategy: Alternative Approach — try a different method
Permanent / External Error
Outside your control and cannot be fixed locally. Caused by an external service or a permission boundary.
- Examples: 403 Forbidden, 404 Not Found, quota exceeded, API deprecated
- Strategy: Escalation — inform the user, ask for direction
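The four categories above can be sketched as a small lookup. This is only an illustration: the `Category` enum, the `STRATEGY` table, and the `classify` helper are hypothetical names, and the mapping from Python exception types to categories is an assumption based on the examples listed. Exception type alone is not always enough (a missing environment variable can surface as a `KeyError`, for instance), so the error message should still be read before acting.

```python
from enum import Enum


class Category(Enum):
    TRANSIENT = "Transient"
    CONFIG = "Config"
    LOGIC = "Logic"
    PERMANENT = "Permanent"


# Recovery strategy per category, as listed in the section above.
STRATEGY = {
    Category.TRANSIENT: "Wait & Retry",
    Category.CONFIG: "Fix & Continue",
    Category.LOGIC: "Alternative Approach",
    Category.PERMANENT: "Escalation",
}


def classify(exc=None, status=None):
    """Return a Category, or None when the category cannot be determined."""
    # HTTP status codes named in the examples are checked first.
    if status == 429:
        return Category.TRANSIENT
    if status in (403, 404):
        return Category.PERMANENT
    # Assumed mapping from built-in exception types to categories.
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return Category.TRANSIENT
    if isinstance(exc, (FileNotFoundError, PermissionError, ImportError)):
        return Category.CONFIG
    if isinstance(exc, (KeyError, TypeError)):
        return Category.LOGIC
    return None  # unknown: the protocol treats this as a reason to escalate
```

Returning `None` for anything unrecognized keeps the "category cannot be determined" case explicit instead of guessing.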
Retry Strategy
For transient errors, use exponential backoff:
Attempt 1: Retry immediately
Attempt 2: Wait 2 seconds
Attempt 3: Wait 4 seconds
Attempt 4: Wait 8 seconds -> move on or escalate
Maximum retries: 3 attempts. If all 3 fail → re-evaluate the category.
Rate limit (429) special rule:
- If the response has a Retry-After header, wait that duration
- Otherwise wait 60 seconds, then retry
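A minimal sketch of this schedule in Python follows. The names `backoff_delays`, `rate_limit_wait`, and `retry_transient` are hypothetical. The sketch assumes the 3-retry cap takes precedence, so the default schedule is 0, 2, and 4 seconds; it also assumes `headers` is a plain dict keyed exactly as `Retry-After`, and it only handles the seconds form of that header (an HTTP-date value falls back to the 60-second default).

```python
import time

MAX_RETRIES = 3
RATE_LIMIT_DEFAULT_WAIT = 60  # seconds, used when Retry-After is absent


def backoff_delays(max_retries=MAX_RETRIES):
    """Delay before each retry: 0 s first, then 2 s, 4 s, 8 s, ..."""
    return [0 if i == 0 else 2 ** i for i in range(max_retries)]


def rate_limit_wait(headers):
    """Seconds to wait after a 429 response."""
    try:
        return max(0, int(headers.get("Retry-After")))
    except (TypeError, ValueError):
        # Header missing or not a plain number of seconds.
        return RATE_LIMIT_DEFAULT_WAIT


def retry_transient(operation, is_transient, sleep=time.sleep,
                    max_retries=MAX_RETRIES):
    """Call operation; retry transient failures on the backoff schedule.

    Non-transient errors are re-raised at once. When every retry fails,
    the last error is re-raised so the caller can re-evaluate the category.
    """
    try:
        return operation()
    except Exception as exc:
        if not is_transient(exc):
            raise
        last = exc
    for delay in backoff_delays(max_retries):
        sleep(delay)
        try:
            return operation()
        except Exception as exc:
            if not is_transient(exc):
                raise
            last = exc
    raise last
```

Passing `sleep` as a parameter is a design choice for testability: a test can substitute a recorder and check the schedule without actually waiting.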
Decision Tree
Error received
|
Classify the error
|
+------------------------------------+
| Transient? -> Wait & Retry (max 3)|
| Config? -> Fix & Continue |
| Logic? -> Alternative approach|
| Permanent? -> Escalation |
+------------------------------------+
|
Every strategy fails -> Escalation
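The tree can be read as a dispatch on the category with escalation as the fallback. The sketch below is one possible shape, not a prescribed implementation: `recover`, `handlers`, and `escalate` are hypothetical names, categories are plain strings, and each handler is assumed to return `True` when its strategy succeeded.

```python
def recover(category, handlers, escalate):
    """Route an already-classified error to its recovery strategy.

    handlers maps "Transient", "Config", and "Logic" to zero-argument
    callables that return True on success. Permanent errors, unknown
    categories, and failed strategies all end in escalation.
    """
    if category == "Permanent" or category not in handlers:
        return escalate(category)
    if handlers[category]():
        return "Recovered"
    # The chosen strategy failed, so fall through to escalation.
    return escalate(category)
```

For example, `recover("Config", handlers, escalate)` runs only the Config handler and escalates if it reports failure.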
Escalation Protocol
Escalate to the user when:
- 3 retries failed
- Permanent / external error
- 2 consecutive different strategies failed
- Error category cannot be determined
ERROR ESCALATION
================================
Failed step : [step name]
Error : [error message summary]
Category : [Transient / Config / Logic / Permanent]
Tried : [what was attempted — short list]
Result : All strategies exhausted
================================
Options:
A) [Alternative approach suggestion]
B) [Simpler / partial solution]
C) Skip this step, continue
D) Stop the task
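The escalation conditions and the report template above could be sketched as follows. `should_escalate` and `format_escalation` are hypothetical helpers; the counters passed in (`retries_failed`, `strategies_failed`) are assumed to be tracked by the caller, `category` is `None` when the error could not be classified, and the column alignment is a guess at the original layout.

```python
RULE = "=" * 32


def should_escalate(retries_failed, strategies_failed, category):
    """True when any of the four escalation conditions holds."""
    return (
        retries_failed >= 3           # 3 retries failed
        or category == "Permanent"    # permanent / external error
        or strategies_failed >= 2     # 2 consecutive different strategies failed
        or category is None           # category cannot be determined
    )


def format_escalation(step, error, category, tried, alternative, simpler):
    """Fill in the ERROR ESCALATION template as a single string."""
    lines = [
        "ERROR ESCALATION",
        RULE,
        f"Failed step : {step}",
        f"Error       : {error}",
        f"Category    : {category}",
        f"Tried       : {', '.join(tried)}",
        "Result      : All strategies exhausted",
        RULE,
        "Options:",
        f"A) {alternative}",
        f"B) {simpler}",
        "C) Skip this step, continue",
        "D) Stop the task",
    ]
    return "\n".join(lines)
```

Options C and D are fixed in the template, so only A and B are parameters here.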
Partial Success
For bulk operations where some items succeed and some fail:
PARTIAL SUCCESS
================================
Successful : N / Total
Failed : M items
================================
Failed items:
- [item]: [reason]
Options:
A) Retry only failed items
B) Continue with successful items, skip failed
C) Cancel all
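One way to produce this report is to collect successes and failures per item instead of stopping at the first error. The sketch below assumes that approach; `run_bulk` and `partial_success_report` are hypothetical names, and `action` stands for whatever per-item operation the bulk task performs.

```python
def run_bulk(items, action):
    """Apply action to every item, recording failures instead of aborting."""
    succeeded, failed = [], []
    for item in items:
        try:
            action(item)
            succeeded.append(item)
        except Exception as exc:
            failed.append((item, str(exc)))
    return succeeded, failed


def partial_success_report(succeeded, failed):
    """Fill in the PARTIAL SUCCESS template as a single string."""
    rule = "=" * 32
    total = len(succeeded) + len(failed)
    lines = [
        "PARTIAL SUCCESS",
        rule,
        f"Successful : {len(succeeded)} / {total}",
        f"Failed     : {len(failed)} items",
        rule,
        "Failed items:",
    ]
    lines += [f"- {item}: {reason}" for item, reason in failed]
    lines += [
        "Options:",
        "A) Retry only failed items",
        "B) Continue with successful items, skip failed",
        "C) Cancel all",
    ]
    return "\n".join(lines)
```

Because `failed` keeps the original items, option A can be served by calling `run_bulk` again on `[item for item, _ in failed]`.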
Error Log
Log every error and recovery attempt:
[ERROR LOG]
Step : [step name / number]
Error : [message]
Category : [type]
Attempt 1: [strategy] -> [result]
Attempt 2: [strategy] -> [result]
Result : Recovered / Escalated
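The log entry above maps naturally onto a small record type. The `ErrorLog` dataclass below is a hypothetical sketch of that idea: it assumes attempts are recorded in order as (strategy, result) pairs and that the final result is one of the two values in the template.

```python
from dataclasses import dataclass, field


@dataclass
class ErrorLog:
    step: str
    error: str
    category: str
    attempts: list = field(default_factory=list)

    def record(self, strategy, outcome):
        """Append one recovery attempt, successful or not."""
        self.attempts.append((strategy, outcome))

    def render(self, result):
        """Return the [ERROR LOG] entry; result is Recovered or Escalated."""
        if result not in ("Recovered", "Escalated"):
            raise ValueError("result must be 'Recovered' or 'Escalated'")
        lines = [
            "[ERROR LOG]",
            f"Step     : {self.step}",
            f"Error    : {self.error}",
            f"Category : {self.category}",
        ]
        for number, (strategy, outcome) in enumerate(self.attempts, start=1):
            lines.append(f"Attempt {number}: {strategy} -> {outcome}")
        lines.append(f"Result   : {result}")
        return "\n".join(lines)
```

A typical use would be `log.record("Wait & Retry", "timeout again")` after each attempt, followed by one `log.render("Recovered")` or `log.render("Escalated")` at the end.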
When to Skip
- Error is expected behavior (e.g., "file not found" when checking existence)
- User said "ignore errors, continue"
- One-off, non-repeatable task
Guardrails
- Never blind-retry a logic error — retrying won't help, change the approach.
- Always log every attempt — even successful recoveries need a record.
- Cross-skill: integrates with checkpoint-guardian (risk assessment before retry), memory-ledger (logs errors and fixes), and agent-reviewer (retrospective analysis).