error-coordinator

Error Coordinator

Purpose

Provides expertise in building resilient multi-agent systems with robust error handling, failure detection, and recovery mechanisms. Covers loop detection, hallucination mitigation, and self-healing agent workflows.

When to Use

Designing error handling for agent systems
Implementing retry and recovery strategies
Building self-healing AI workflows
Detecting agent loops and infinite recursion
Mitigating hallucinations in agent outputs
Implementing circuit breakers for agents
Coordinating failure recovery across agents

Quick Start

Invoke this skill when:

Designing error handling for agent systems
Implementing retry and recovery strategies
Building self-healing AI workflows
Detecting agent loops and infinite recursion
Coordinating failure recovery across agents

Do NOT invoke when:

Organizing agent teams (use agent-organizer)
Debugging application errors (use debugger)
Handling production incidents (use incident-responder)
Detecting code error patterns (use error-detective)

Decision Framework

Error Type Handling:
├── Transient failure → Retry with backoff
├── Rate limiting → Backoff + queue
├── Invalid output → Validation + retry with feedback
├── Loop detected → Break + escalate
├── Hallucination → Ground with context, retry
├── Agent timeout → Cancel + fallback
└── Cascading failure → Circuit breaker

Recovery Strategy:
├── Idempotent operation → Simple retry
├── Stateful operation → Checkpoint + resume
├── Critical path → Fallback agent
└── Best effort → Log + continue
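
The two trees above can be encoded as plain lookup tables, so a coordinator picks a response without branching logic. This is a minimal sketch: the `ErrorType` enum, the action strings, the operation-kind keys, and the `plan` function are all hypothetical names chosen for illustration, not part of any existing API.

```python
from enum import Enum


class ErrorType(Enum):
    TRANSIENT_FAILURE = "transient_failure"
    RATE_LIMITING = "rate_limiting"
    INVALID_OUTPUT = "invalid_output"
    LOOP_DETECTED = "loop_detected"
    HALLUCINATION = "hallucination"
    AGENT_TIMEOUT = "agent_timeout"
    CASCADING_FAILURE = "cascading_failure"


# Hypothetical action names, one per branch of the Error Type Handling tree.
HANDLING = {
    ErrorType.TRANSIENT_FAILURE: "retry_with_backoff",
    ErrorType.RATE_LIMITING: "backoff_and_queue",
    ErrorType.INVALID_OUTPUT: "validate_and_retry_with_feedback",
    ErrorType.LOOP_DETECTED: "break_and_escalate",
    ErrorType.HALLUCINATION: "ground_with_context_and_retry",
    ErrorType.AGENT_TIMEOUT: "cancel_and_fallback",
    ErrorType.CASCADING_FAILURE: "circuit_breaker",
}

# Hypothetical operation kinds, one per branch of the Recovery Strategy tree.
RECOVERY = {
    "idempotent": "simple_retry",
    "stateful": "checkpoint_and_resume",
    "critical_path": "fallback_agent",
    "best_effort": "log_and_continue",
}


def plan(error_type: ErrorType, operation_kind: str) -> tuple:
    """Return (handling, recovery) for a failure; raises KeyError on unknown input."""
    return HANDLING[error_type], RECOVERY[operation_kind]
```

For example, `plan(ErrorType.RATE_LIMITING, "stateful")` yields the backoff-and-queue handling paired with checkpoint-and-resume recovery. Failing loudly on unknown input is a design choice of this sketch; a real coordinator might escalate instead.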

Core Workflows

1. Loop Detection System

Track agent invocation history
Detect repeated state patterns
Set maximum iteration limits
Implement escape hatch triggers
Log loop occurrences for analysis
Escalate to supervisor or human
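
The steps above could be sketched roughly as follows. The `LoopDetector` class, the `LoopDetected` exception, and the default limits are assumptions made for illustration. A repeated state is detected here by hashing the agent name together with a JSON-serializable state dict, which is only one of several possible fingerprinting schemes.

```python
import hashlib
import json
from collections import Counter
from typing import List


class LoopDetected(Exception):
    """Raised when a run should be stopped and escalated (hypothetical name)."""


class LoopDetector:
    def __init__(self, max_iterations: int = 25, max_repeats: int = 3):
        self.max_iterations = max_iterations  # hard cap on recorded invocations
        self.max_repeats = max_repeats        # identical states tolerated before breaking
        self.history: List[str] = []          # ordered fingerprints, kept for later analysis
        self.seen: Counter = Counter()        # fingerprint -> number of occurrences

    @staticmethod
    def fingerprint(agent: str, state: dict) -> str:
        # sort_keys makes the hash independent of dict ordering;
        # default=str keeps non-JSON values from raising.
        payload = json.dumps({"agent": agent, "state": state},
                             sort_keys=True, default=str)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def record(self, agent: str, state: dict) -> None:
        """Record one invocation; raise LoopDetected when a limit is hit."""
        fp = self.fingerprint(agent, state)
        self.history.append(fp)
        self.seen[fp] += 1
        if len(self.history) > self.max_iterations:
            raise LoopDetected(f"iteration limit {self.max_iterations} exceeded")
        if self.seen[fp] >= self.max_repeats:
            raise LoopDetected(
                f"{agent} reached the same state {self.seen[fp]} times")


if __name__ == "__main__":
    detector = LoopDetector(max_iterations=10, max_repeats=3)
    try:
        for _ in range(5):
            detector.record("planner", {"goal": "summarize", "step": 1})
    except LoopDetected as exc:
        # The escape hatch: stop the run and hand off to a supervisor or human.
        print(f"escalating: {exc}")
```

The caller decides what escalation means. The detector only signals that the run should stop, and leaves `history` available for logging.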

2. Hallucination Mitigation

Ground responses with source data
Implement output validation
Cross-check with retrieval
Add confidence scoring
Flag low-confidence outputs
Provide feedback for retry
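
One rough way to picture this workflow is the sketch below. It makes a strong simplifying assumption: a sentence counts as grounded when enough of its words also appear in the source texts. That lexical overlap is a crude stand-in for real retrieval-based cross-checking. The names `check_grounding`, `retry_feedback`, and `GroundingReport`, and both thresholds, are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GroundingReport:
    confidence: float  # fraction of sentences judged supported
    unsupported: List[str] = field(default_factory=list)


def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def check_grounding(answer: str, sources: List[str],
                    min_overlap: float = 0.6) -> GroundingReport:
    """Score an answer by how many of its sentences overlap with the sources."""
    source_tokens = set().union(*(_tokens(s) for s in sources))
    # Split on sentence-ending punctuation; ignore fragments without words.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip())
                 if _tokens(s)]
    unsupported = []
    for sentence in sentences:
        words = _tokens(sentence)
        if len(words & source_tokens) / len(words) < min_overlap:
            unsupported.append(sentence)
    confidence = 1.0 - len(unsupported) / len(sentences) if sentences else 0.0
    return GroundingReport(confidence, unsupported)


def retry_feedback(report: GroundingReport,
                   threshold: float = 0.8) -> Optional[str]:
    """Return feedback text for a retry, or None if the answer passes."""
    if report.confidence >= threshold:
        return None
    listed = "\n".join(f"- {s}" for s in report.unsupported)
    return ("These statements are not supported by the provided sources; "
            "revise or remove them:\n" + listed)
```

A low-confidence report is both the flag and the raw material for the retry prompt: the unsupported sentences are passed back to the agent along with the sources.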

3. Circuit Breaker Implementation

Track failure rates per agent
Define failure threshold
Open circuit on threshold breach
Provide fallback behavior
Implement a half-open state that lets a trial call through to test recovery
Close circuit on recovery
Monitor and alert on breaker state
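
A minimal sketch of the closed, open, and half-open cycle follows. For brevity it counts consecutive failures rather than a true failure rate, and it leaves out monitoring and alerting. The `CircuitBreaker` class, its parameters, and the injectable `clock` are assumptions made for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the circuit is open and no fallback was given (hypothetical)."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable so tests can fake time
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, func, *args, fallback=None, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still open: short-circuit to the fallback instead of calling the agent.
                if fallback is not None:
                    return fallback(*args, **kwargs)
                raise CircuitOpenError("circuit is open")
            self.state = "half_open"  # timeout elapsed: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.state = "closed"
        return result
```

To track agents separately, one breaker can be kept per agent name, for instance in a `collections.defaultdict(CircuitBreaker)`. This sketch is not thread-safe; concurrent callers would need a lock around the state changes.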

Best Practices

Implement timeouts for all agent calls
Use exponential backoff with jitter
Log all failures with full context
Design for graceful degradation
Test failure scenarios explicitly
Monitor error rates and patterns
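
Several of these practices fit in one small helper: a bounded retry count, exponential backoff with jitter, and a log entry for each failure. The sketch below uses "full jitter", where each delay is drawn uniformly between zero and the capped exponential value. The `call_with_retry` function and its parameters are hypothetical. It assumes the wrapped callable enforces its own timeout, since this helper does not cancel a hung call.

```python
import logging
import random
import time


def call_with_retry(func, max_retries: int = 4, base: float = 0.5,
                    cap: float = 30.0, retry_on=(Exception,),
                    sleep=time.sleep):
    """Call func(); on failure retry up to max_retries times with jittered backoff."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except retry_on as exc:
            if attempt == max_retries:
                raise  # retry budget exhausted: surface the last error
            # Full jitter: uniform in [0, min(cap, base * 2**attempt)].
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            logging.warning("attempt %d failed (%r); retrying in %.2fs",
                            attempt + 1, exc, delay)
            sleep(delay)
```

Passing a narrower `retry_on`, such as `(TimeoutError, ConnectionError)`, keeps non-transient errors from being retried. The `sleep` parameter exists so tests can capture delays instead of waiting.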

Anti-Patterns

Anti-Pattern	Problem	Correct Approach
Infinite retries	Resource exhaustion	Max retry limits
Silent failures	Hidden problems	Log and alert
No timeouts	Hung processes	Always set timeouts
Fixed retry interval	Thundering herd	Exponential backoff with jitter
No fallbacks	Complete failure	Graceful degradation
