resilience-patterns

Resilience Patterns Skill

Production-grade resilience patterns for distributed systems and LLM-based workflows. Covers circuit breakers, bulkheads, retry strategies, and LLM-specific resilience techniques.

Overview

  • Building fault-tolerant multi-agent systems
  • Implementing LLM API integrations with proper error handling
  • Designing distributed workflows that need graceful degradation
  • Adding observability to failure scenarios
  • Protecting systems from cascade failures

Core Patterns

1. Circuit Breaker Pattern (reference: circuit-breaker.md)

Prevents cascade failures by "tripping" when a service exceeds failure thresholds.
+-------------------------------------------------------------------+
|                    Circuit Breaker States                         |
+-------------------------------------------------------------------+
|                                                                   |
|    +----------+     failures >= threshold    +----------+         |
|    |  CLOSED  | ---------------------------> |   OPEN   |         |
|    | (normal) |                              | (reject) |         |
|    +----+-----+                              +----+-----+         |
|         |                                         |               |
|         | success                    timeout      |               |
|         |                            expires      |               |
|         |         +------------+                  |               |
|         |         | HALF_OPEN  |<-----------------+               |
|         +---------+  (probe)   |                                  |
|                   +------------+                                  |
|                                                                   |
|   CLOSED:    Allow requests, count failures                       |
|   OPEN:      Reject immediately, return fallback                  |
|   HALF_OPEN: Allow probe request to test recovery                 |
|                                                                   |
+-------------------------------------------------------------------+
Key Configuration:
  • failure_threshold: Failures before opening (default: 5)
  • recovery_timeout: Seconds before attempting recovery (default: 30)
  • half_open_requests: Probes to allow in half-open (default: 1)
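
The state machine and the three settings above can be sketched as a small class. This is an illustrative, single-threaded sketch, not the contents of scripts/circuit-breaker.py: the class and method names, the injectable `clock`, and the choice to close the circuit after `half_open_requests` successful probes are all assumptions.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0,
                 half_open_requests=1, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_requests = half_open_requests
        self._clock = clock  # injectable so tests need not sleep
        self.state = State.CLOSED
        self._failures = 0
        self._opened_at = 0.0
        self._probe_successes = 0

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit is open")
            # Recovery timeout expired: let probe requests through.
            self.state = State.HALF_OPEN
            self._probe_successes = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self._failures += 1
        # Any failed probe, or too many failures while closed, opens the circuit.
        if self.state is State.HALF_OPEN or self._failures >= self.failure_threshold:
            self.state = State.OPEN
            self._opened_at = self._clock()

    def _on_success(self):
        if self.state is State.HALF_OPEN:
            self._probe_successes += 1
            if self._probe_successes < self.half_open_requests:
                return  # assumption: keep probing until enough probes succeed
        self._failures = 0
        self.state = State.CLOSED
```

A caller would typically catch `CircuitOpenError` and return a fallback value, matching the "Reject immediately, return fallback" behavior of the OPEN state.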

2. Bulkhead Pattern (reference: bulkhead-pattern.md)

Isolates failures by partitioning resources into independent pools.
+-------------------------------------------------------------------+
|                      Bulkhead Isolation                           |
+-------------------------------------------------------------------+
|                                                                   |
|   +------------------+  +------------------+                      |
|   | TIER 1: Critical |  | TIER 2: Standard |                      |
|   |  (5 workers)     |  |  (3 workers)     |                      |
|   |  +-+ +-+ +-+     |  |  +-+ +-+ +-+     |                      |
|   |  |#| |#| | |     |  |  |#| | | | |     |                      |
|   |  +-+ +-+ +-+     |  |  +-+ +-+ +-+     |                      |
|   |  +-+ +-+         |  |                  |                      |
|   |  | | | |         |  |  Queue: 2        |                      |
|   |  +-+ +-+         |  |                  |                      |
|   |  Queue: 0        |  +------------------+                      |
|   +------------------+                                            |
|                                                                   |
|   +------------------+                                            |
|   | TIER 3: Optional |   # = Active request                       |
|   |  (2 workers)     |     = Available slot                       |
|   |  +-+ +-+         |                                            |
|   |  |#| |#| FULL!   |   Tier 1: synthesis, quality_gate          |
|   |  +-+ +-+         |   Tier 2: analysis agents                  |
|   |  Queue: 5        |   Tier 3: enrichment, optional features    |
|   +------------------+                                            |
|                                                                   |
+-------------------------------------------------------------------+
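Tier Configuration (OrchestKit):

| Tier | Workers | Queue | Timeout | Use Case |
|------|---------|-------|---------|----------|
| 1 (Critical) | 5 | 10 | 300s | Synthesis, quality gate |
| 2 (Standard) | 3 | 5 | 120s | Content analysis agents |
| 3 (Optional) | 2 | 3 | 60s | Enrichment, caching |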
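
One way to express these tiers is a semaphore per tier plus a bounded count of waiting callers. The sketch below uses asyncio and is not the contents of scripts/bulkhead.py; `TierConfig`, `Bulkhead`, `BulkheadFullError`, and the decision to apply the timeout to execution only (not to time spent queued) are assumptions made for illustration.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class TierConfig:
    workers: int
    queue: int
    timeout: float  # seconds


# Values copied from the tier configuration above.
TIERS = {
    1: TierConfig(workers=5, queue=10, timeout=300),
    2: TierConfig(workers=3, queue=5, timeout=120),
    3: TierConfig(workers=2, queue=3, timeout=60),
}


class BulkheadFullError(Exception):
    """Raised when a tier has no free worker and no queue capacity."""


class Bulkhead:
    """Create instances inside a running event loop."""

    def __init__(self, config: TierConfig):
        self._config = config
        self._slots = asyncio.Semaphore(config.workers)
        self._waiting = 0

    async def run(self, make_coro):
        # Reject immediately if every worker is busy and the queue is full.
        if self._slots.locked() and self._waiting >= self._config.queue:
            raise BulkheadFullError("tier is saturated")
        self._waiting += 1
        try:
            await self._slots.acquire()
        finally:
            self._waiting -= 1
        try:
            # Assumption: the timeout covers execution, not time spent queued.
            return await asyncio.wait_for(make_coro(), self._config.timeout)
        finally:
            self._slots.release()
```

Because each tier owns its own semaphore, a saturated Tier 3 rejects new optional work without consuming any Tier 1 capacity.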

3. Retry Strategies (reference: retry-strategies.md)

Intelligent retry logic with exponential backoff and jitter.
+-------------------------------------------------------------------+
|                   Exponential Backoff + Jitter                    |
+-------------------------------------------------------------------+
|                                                                   |
|   Attempt 1:  --> X (fail)                                        |
|               wait: 1s +/- 0.5s                                   |
|                                                                   |
|   Attempt 2:  --> X (fail)                                        |
|               wait: 2s +/- 1s                                     |
|                                                                   |
|   Attempt 3:  --> X (fail)                                        |
|               wait: 4s +/- 2s                                     |
|                                                                   |
|   Attempt 4:  --> OK (success)                                    |
|                                                                   |
|   Formula: delay = min(base * 2^attempt, max_delay) * jitter      |
|   Jitter:  random(0.5, 1.5) to prevent thundering herd            |
|                                                                   |
+-------------------------------------------------------------------+
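
The formula in the diagram can be sketched in a few lines. Note that the diagram's waits (1s, 2s, 4s) only match the formula if `attempt` is zero-based, so the first failure uses `2^0`. The function names, the `max_delay` default of 60 seconds, and the injectable `sleep` are assumptions; the retry loop here retries on every exception, whereas real code would filter by error classification first.

```python
import random
import time


def backoff_delay(attempt, base=1.0, max_delay=60.0):
    """Seconds to wait after the given zero-based failed attempt."""
    capped = min(base * 2 ** attempt, max_delay)
    # Jitter factor in [0.5, 1.5] spreads out clients that failed together.
    return capped * random.uniform(0.5, 1.5)


def retry(func, max_attempts=4, sleep=time.sleep):
    """Call func until it succeeds or max_attempts is exhausted."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt))
```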
Error Classification for Retries:
```python
RETRYABLE_ERRORS = {
    # HTTP/Network
    408, 429, 500, 502, 503, 504,  # HTTP status codes
    ConnectionError, TimeoutError,  # Network errors

    # LLM-specific
    "rate_limit_exceeded",
    "model_overloaded",
    "context_length_exceeded",  # Retry with truncation
}

NON_RETRYABLE_ERRORS = {
    400, 401, 403, 404,  # Client errors
    "invalid_api_key",
    "content_policy_violation",
    "invalid_request_error",
}
```
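
Because the retryable set mixes status codes, exception classes, and provider error strings, a plain membership test will not match a raised exception instance or a subclass such as `ConnectionResetError`. A hedged sketch of a classifier that handles each kind separately follows; the split into three collections and the `is_retryable` name are assumptions, and unknown errors default to non-retryable.

```python
RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}
RETRYABLE_EXCEPTIONS = (ConnectionError, TimeoutError)
RETRYABLE_CODES = {
    "rate_limit_exceeded",
    "model_overloaded",
    "context_length_exceeded",  # only useful if the retry truncates input
}


def is_retryable(error):
    """Classify an HTTP status, an exception instance, or an error-code string."""
    if isinstance(error, BaseException):
        # isinstance also matches subclasses, e.g. ConnectionResetError.
        return isinstance(error, RETRYABLE_EXCEPTIONS)
    if isinstance(error, int):
        return error in RETRYABLE_STATUS
    return error in RETRYABLE_CODES
```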

4. LLM-Specific Resilience (reference: llm-resilience.md)

Patterns specific to LLM API integrations.
+-------------------------------------------------------------------+
|                    LLM Fallback Chain                             |
+-------------------------------------------------------------------+
|                                                                   |
|   Request --> [Primary Model] --success--> Response               |
|                     |                                             |
|                   fail                                            |
|                     v                                             |
|               [Fallback Model] --success--> Response              |
|                     |                                             |
|                   fail                                            |
|                     v                                             |
|               [Cached Response] --hit--> Response                 |
|                     |                                             |
|                   miss                                            |
|                     v                                             |
|               [Default Response] --> Graceful Degradation         |
|                                                                   |
|   Example Chain:                                                  |
|   1. claude-sonnet-4-5-20251101 (primary)                         |
|   2. gpt-5.2-mini (fallback)                                      |
|   3. Semantic cache lookup                                        |
|   4. "Analysis unavailable" + partial results                     |
|                                                                   |
+-------------------------------------------------------------------+
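
The chain above reduces to trying each step in order and recording where the answer came from. In this sketch the models are plain callables standing in for real API clients, and the cache is an exact-match dict rather than a semantic cache; `FallbackResult` and `run_fallback_chain` are hypothetical names, not the contents of scripts/llm-fallback-chain.py.

```python
from dataclasses import dataclass


@dataclass
class FallbackResult:
    text: str
    source: str      # "model", "cache", or "default"
    degraded: bool   # True when no model produced the answer


def run_fallback_chain(prompt, models, cache, default_text):
    """Try each model callable in order, then the cache, then a default."""
    for call_model in models:
        try:
            return FallbackResult(call_model(prompt), "model", False)
        except Exception:
            continue  # this model failed: move to the next one
    cached = cache.get(prompt)
    if cached is not None:
        return FallbackResult(cached, "cache", True)
    return FallbackResult(default_text, "default", True)
```

Returning the `source` and `degraded` fields lets callers log or surface the fact that a response is a degraded one instead of silently passing it through.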
Token Budget Management:
+-------------------------------------------------------------------+
|                     Token Budget Guard                            |
+-------------------------------------------------------------------+
|                                                                   |
|   Input: 8,000 tokens                                             |
|   +---------------------------------------------+                 |
|   |#################################            |                 |
|   +---------------------------------------------+                 |
|                                          ^                        |
|                                          |                        |
|                                    Context Limit (16K)            |
|                                                                   |
|   Strategy when approaching limit:                                |
|   1. Summarize earlier context (compress 4:1)                     |
|   2. Drop low-priority content (optional fields)                  |
|   3. Split into multiple requests                                 |
|   4. Fail fast with "content too large" error                     |
|                                                                   |
+-------------------------------------------------------------------+
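
Two of the strategies above, dropping low-priority content and failing fast, can be sketched without calling a model. The token estimate below is a crude assumption of roughly four characters per token; a real guard would use the tokenizer of the target model. `estimate_tokens`, `fit_to_budget`, and `ContentTooLargeError` are hypothetical names, not the contents of scripts/token-budget.py.

```python
class ContentTooLargeError(ValueError):
    """Raised when required content alone exceeds the token budget."""


def estimate_tokens(text):
    # Assumption: about 4 characters per token. Replace with a real tokenizer.
    return max(1, len(text) // 4)


def fit_to_budget(sections, budget):
    """Drop optional sections until the estimated total fits the budget.

    sections is a list of (priority, text); priority 0 is required and
    larger numbers are less important. Original order is preserved.
    """
    kept = dict(enumerate(sections))
    total = sum(estimate_tokens(text) for _, text in sections)
    # Visit optional sections from least to most important.
    droppable = sorted((i for i, (prio, _) in kept.items() if prio > 0),
                       key=lambda i: kept[i][0], reverse=True)
    for i in droppable:
        if total <= budget:
            break
        total -= estimate_tokens(kept.pop(i)[1])
    if total > budget:
        raise ContentTooLargeError(f"content too large: {total} > {budget} tokens")
    return [text for _, text in kept.values()]
```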

Quick Reference

| Pattern | When to Use | Key Benefit |
|---------|-------------|-------------|
| Circuit Breaker | External service calls | Prevent cascade failures |
| Bulkhead | Multi-tenant/multi-agent | Isolate failures |
| Retry + Backoff | Transient failures | Automatic recovery |
| Fallback Chain | Critical operations | Graceful degradation |
| Token Budget | LLM calls | Cost control, prevent failures |

OrchestKit Integration Points

  1. Workflow Agents: Each agent wrapped with circuit breaker + bulkhead tier
  2. LLM Calls: All model invocations use fallback chain + retry logic
  3. External APIs: Circuit breaker on YouTube, arXiv, GitHub APIs
  4. Database Ops: Bulkhead isolation for read vs write operations

Files in This Skill

References (Conceptual Guides)

  • references/circuit-breaker.md - Deep dive on circuit breaker pattern
  • references/bulkhead-pattern.md - Bulkhead isolation strategies
  • references/retry-strategies.md - Retry algorithms and error classification
  • references/llm-resilience.md - LLM-specific patterns
  • references/error-classification.md - How to categorize errors

Templates (Code Patterns)

  • scripts/circuit-breaker.py - Ready-to-use circuit breaker class
  • scripts/bulkhead.py - Semaphore-based bulkhead implementation
  • scripts/retry-handler.py - Configurable retry decorator
  • scripts/llm-fallback-chain.py - Multi-model fallback pattern
  • scripts/token-budget.py - Token budget guard implementation

Examples

  • examples/orchestkit-workflow-resilience.md - Full OrchestKit integration example

Checklists

  • checklists/pre-deployment-resilience.md - Production readiness checklist
  • checklists/circuit-breaker-setup.md - Circuit breaker configuration guide

2026 Best Practices

  1. Adaptive Thresholds: Use sliding windows, not fixed counters
  2. Observability First: Every circuit trip = alert + metric + trace
  3. Graceful Degradation: Always have a fallback, even if partial
  4. Health Endpoints: Separate health check from circuit state
  5. Chaos Testing: Regularly test failure scenarios in staging
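
To illustrate the first practice, a sliding window tracks the failure rate over recent calls instead of a counter that never forgets. The sketch below is one possible shape, with assumed names and defaults (`SlidingWindowFailureRate`, a 60-second window, a minimum of 10 calls before a rate is reported); a breaker could trip when `failure_rate()` crosses a chosen threshold.

```python
import time
from collections import deque


class SlidingWindowFailureRate:
    def __init__(self, window_seconds=60.0, min_calls=10, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.min_calls = min_calls
        self._clock = clock  # injectable so tests need not sleep
        self._events = deque()  # (timestamp, succeeded) pairs, oldest first

    def record(self, succeeded):
        now = self._clock()
        self._events.append((now, succeeded))
        self._evict(now)

    def _evict(self, now):
        # Forget outcomes that have aged out of the window.
        while self._events and now - self._events[0][0] > self.window_seconds:
            self._events.popleft()

    def failure_rate(self):
        self._evict(self._clock())
        if len(self._events) < self.min_calls:
            return 0.0  # too little data to judge
        failures = sum(1 for _, ok in self._events if not ok)
        return failures / len(self._events)
```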

Related Skills

  • observability-monitoring - Metrics and alerting for circuit breaker state changes
  • caching-strategies - Cache as fallback layer in degradation scenarios
  • error-handling-rfc9457 - Structured error responses for resilience failures
  • background-jobs - Async processing with retry and failure handling

Key Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Circuit breaker recovery | Half-open probe | Gradual recovery, prevents immediate re-failure |
| Retry algorithm | Exponential backoff + jitter | Prevents thundering herd, respects rate limits |
| Bulkhead isolation | Semaphore-based tiers | Simple, efficient, prioritizes critical operations |
| LLM fallback | Model chain with cache | Graceful degradation, cost optimization, availability |

Capability Details

circuit-breaker

Keywords: circuit breaker, failure threshold, cascade failure, trip, half-open
Solves:
  • Prevent cascade failures when external services fail
  • Automatically recover when services come back online
  • Fail fast instead of waiting for timeouts

bulkhead

Keywords: bulkhead, isolation, semaphore, thread pool, resource pool, tier
Solves:
  • Isolate failures to prevent entire system crashes
  • Prioritize critical operations over optional ones
  • Limit concurrent requests to protect resources

retry-strategies

Keywords: retry, backoff, exponential, jitter, thundering herd
Solves:
  • Handle transient failures automatically
  • Avoid overwhelming recovering services
  • Classify errors as retryable vs non-retryable

llm-resilience

Keywords: LLM, fallback, model, token budget, rate limit, context length
Solves:
  • Handle LLM API rate limits gracefully
  • Fall back to alternative models when primary fails
  • Manage token budgets to prevent context overflow

error-classification

Keywords: error, retryable, transient, permanent, classification
Solves:
  • Determine which errors should be retried
  • Categorize errors by severity and recoverability
  • Map HTTP status codes to resilience actions