error-coordinator

Error Coordinator

Purpose

Provides expertise in building resilient multi-agent systems with robust error handling, failure detection, and recovery mechanisms. Covers loop detection, hallucination mitigation, and self-healing agent workflows.

When to Use

  • Designing error handling for agent systems
  • Implementing retry and recovery strategies
  • Building self-healing AI workflows
  • Detecting agent loops and infinite recursion
  • Mitigating hallucinations in agent outputs
  • Implementing circuit breakers for agents
  • Coordinating failure recovery across agents

Quick Start

Invoke this skill when:
  • Designing error handling for agent systems
  • Implementing retry and recovery strategies
  • Building self-healing AI workflows
  • Detecting agent loops and infinite recursion
  • Coordinating failure recovery across agents
Do NOT invoke when:
  • Organizing agent teams (use agent-organizer)
  • Debugging application errors (use debugger)
  • Handling production incidents (use incident-responder)
  • Detecting code error patterns (use error-detective)

Decision Framework

Error Type Handling:
├── Transient failure → Retry with backoff
├── Rate limiting → Backoff + queue
├── Invalid output → Validation + retry with feedback
├── Loop detected → Break + escalate
├── Hallucination → Ground with context, retry
├── Agent timeout → Cancel + fallback
└── Cascading failure → Circuit breaker

Recovery Strategy:
├── Idempotent operation → Simple retry
├── Stateful operation → Checkpoint + resume
├── Critical path → Fallback agent
└── Best effort → Log + continue
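
The two trees above can be encoded as lookup tables so that a coordinator picks a handling action and a recovery strategy in one place. This is a minimal sketch: the enum names, the string labels, and the `plan` helper are illustrative assumptions, not part of any existing agent framework.

```python
from enum import Enum


class ErrorType(Enum):
    TRANSIENT_FAILURE = "transient_failure"
    RATE_LIMITING = "rate_limiting"
    INVALID_OUTPUT = "invalid_output"
    LOOP_DETECTED = "loop_detected"
    HALLUCINATION = "hallucination"
    AGENT_TIMEOUT = "agent_timeout"
    CASCADING_FAILURE = "cascading_failure"


class OperationKind(Enum):
    IDEMPOTENT = "idempotent"
    STATEFUL = "stateful"
    CRITICAL_PATH = "critical_path"
    BEST_EFFORT = "best_effort"


# Direct transcription of the "Error Type Handling" tree.
ERROR_HANDLING = {
    ErrorType.TRANSIENT_FAILURE: "retry with backoff",
    ErrorType.RATE_LIMITING: "backoff + queue",
    ErrorType.INVALID_OUTPUT: "validation + retry with feedback",
    ErrorType.LOOP_DETECTED: "break + escalate",
    ErrorType.HALLUCINATION: "ground with context, retry",
    ErrorType.AGENT_TIMEOUT: "cancel + fallback",
    ErrorType.CASCADING_FAILURE: "circuit breaker",
}

# Direct transcription of the "Recovery Strategy" tree.
RECOVERY_STRATEGY = {
    OperationKind.IDEMPOTENT: "simple retry",
    OperationKind.STATEFUL: "checkpoint + resume",
    OperationKind.CRITICAL_PATH: "fallback agent",
    OperationKind.BEST_EFFORT: "log + continue",
}


def plan(error_type: ErrorType, operation_kind: OperationKind) -> tuple[str, str]:
    """Return (handling action, recovery strategy) for a failure."""
    return ERROR_HANDLING[error_type], RECOVERY_STRATEGY[operation_kind]
```

Keeping the mapping in data rather than in scattered `if` branches makes it easy to check that every error type has a handler.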

Core Workflows

1. Loop Detection System

  1. Track agent invocation history
  2. Detect repeated state patterns
  3. Set maximum iteration limits
  4. Implement escape hatch triggers
  5. Log loop occurrences for analysis
  6. Escalate to supervisor or human
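
The six steps above could be sketched roughly as follows. The `LoopDetector` class, the `run_with_loop_guard` helper, and the `step`/`escalate` callables are hypothetical names chosen for illustration. Fingerprinting the state with a SHA-256 hash of its JSON form is one possible way to detect repeated states, not the only one.

```python
import hashlib
import json
import logging
from collections import Counter

logger = logging.getLogger("loop_detector")


class LoopDetected(Exception):
    """Raised when an agent repeats a state or exceeds the iteration cap."""


class LoopDetector:
    def __init__(self, max_iterations=25, max_repeats=3):
        self.max_iterations = max_iterations  # step 3: hard iteration limit
        self.max_repeats = max_repeats        # step 4: escape hatch trigger
        self.history = []                     # step 1: invocation history
        self.seen = Counter()

    def record(self, agent, state):
        # Step 2: identical (agent, state) pairs map to the same fingerprint.
        payload = json.dumps([agent, state], sort_keys=True, default=str)
        fingerprint = hashlib.sha256(payload.encode()).hexdigest()
        self.history.append((agent, fingerprint))
        self.seen[fingerprint] += 1
        if self.seen[fingerprint] >= self.max_repeats:
            raise LoopDetected(f"{agent} reached the same state {self.seen[fingerprint]} times")
        if len(self.history) > self.max_iterations:
            raise LoopDetected(f"exceeded {self.max_iterations} iterations")


def run_with_loop_guard(step, detector, escalate):
    """Call step() until it reports done; step() returns (agent, state, done)."""
    while True:
        agent, state, done = step()
        try:
            detector.record(agent, state)
        except LoopDetected as exc:
            logger.warning("loop detected: %s", exc)  # step 5: log for analysis
            return escalate(exc)                      # step 6: supervisor or human
        if done:
            return state
```

In this sketch the repeat check and the iteration cap are independent, so an agent that keeps producing slightly different states is still stopped by `max_iterations`.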

2. Hallucination Mitigation

  1. Ground responses with source data
  2. Implement output validation
  3. Cross-check with retrieval
  4. Add confidence scoring
  5. Flag low-confidence outputs
  6. Provide feedback for retry
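
As a rough illustration of steps 2 to 6, the sketch below scores each claim in an agent's output against the retrieved source passages and builds retry feedback for the unsupported ones. The lexical-overlap score is a deliberately naive stand-in for a real grounding check (for example an entailment model or a second retrieval pass). The function names and the 0.6 threshold are assumptions made for this example.

```python
import re


def _tokens(text):
    """Lowercased alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def confidence(claim, sources):
    """Fraction of the claim's tokens found in the best-matching source."""
    claim_tokens = _tokens(claim)
    if not claim_tokens:
        return 0.0
    return max(
        (len(claim_tokens & _tokens(source)) / len(claim_tokens) for source in sources),
        default=0.0,
    )


def review_output(claims, sources, threshold=0.6):
    """Return (flagged claims, feedback text or None) for a list of claims."""
    flagged = [c for c in claims if confidence(c, sources) < threshold]
    if not flagged:
        return flagged, None
    feedback = (
        "These statements are not supported by the provided sources; "
        "revise or remove them:\n" + "\n".join(f"- {c}" for c in flagged)
    )
    return flagged, feedback
```

The feedback string is meant to be appended to the retry prompt so the agent knows which statements to ground or drop.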

3. Circuit Breaker Implementation

  1. Track failure rates per agent
  2. Define failure threshold
  3. Open circuit on threshold breach
  4. Provide fallback behavior
  5. Implement half-open state for testing
  6. Close circuit on recovery
  7. Monitor and alert on breaker state
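
A minimal circuit breaker along these lines might look like the sketch below. It counts consecutive failures rather than a true failure rate, which is a simplification. The class name, the parameters, and the injectable `clock` are assumptions for illustration. In practice one breaker instance would be kept per agent.

```python
import logging
import time

logger = logging.getLogger("circuit_breaker")


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # step 2: failure threshold
        self.reset_timeout = reset_timeout          # seconds before a trial call
        self.clock = clock
        self.failures = 0                           # step 1: consecutive failures
        self.state = "closed"
        self.opened_at = 0.0

    def _set_state(self, new_state):
        if new_state != self.state:
            # Step 7: every transition is logged so it can be monitored.
            logger.warning("breaker %s -> %s", self.state, new_state)
            self.state = new_state

    def call(self, fn, fallback):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()                   # step 4: fallback while open
            self._set_state("half_open")            # step 5: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self._set_state("open")             # step 3: open on breach
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        self._set_state("closed")                   # step 6: close on recovery
        return result
```

A failed trial call in the half-open state reopens the circuit immediately and restarts the timeout, so a still-broken agent receives only one probe per `reset_timeout` window.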

Best Practices

  • Implement timeouts for all agent calls
  • Use exponential backoff with jitter
  • Log all failures with full context
  • Design for graceful degradation
  • Test failure scenarios explicitly
  • Monitor error rates and patterns
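
Several of these practices (a retry cap, exponential backoff with jitter, logging each failure) fit in one small helper. The sketch below uses the "full jitter" variant, where each delay is drawn uniformly between zero and the exponential bound. The `call_with_retry` name and its parameters are illustrative assumptions. The wrapped `fn` is expected to enforce its own timeout and raise `TimeoutError` when it expires.

```python
import logging
import random
import time

logger = logging.getLogger("retry")


def call_with_retry(
    fn,
    max_retries=3,
    base=0.5,
    cap=30.0,
    sleep=time.sleep,
    retry_on=(TimeoutError, ConnectionError),
):
    """Call fn(), retrying transient errors with capped, jittered backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on as exc:
            if attempt == max_retries:
                # Retry budget exhausted: log and surface the error.
                logger.error("giving up after %d attempts: %r", attempt + 1, exc)
                raise
            # Full jitter: uniform in [0, min(cap, base * 2**attempt)].
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            logger.warning("attempt %d failed (%r); retrying in %.2fs", attempt + 1, exc, delay)
            sleep(delay)
```

Errors outside `retry_on` propagate immediately, so validation failures or programming errors are not retried by accident.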

Anti-Patterns

Anti-Pattern          Problem               Correct Approach
Infinite retries      Resource exhaustion   Max retry limits
Silent failures       Hidden problems       Log and alert
No timeouts           Hung processes        Always set timeouts
Same retry interval   Thundering herd       Exponential backoff with jitter
No fallbacks          Complete failure      Graceful degradation