resilience-patterns
Resilience Patterns Skill
Production-grade resilience patterns for distributed systems and LLM-based workflows. Covers circuit breakers, bulkheads, retry strategies, and LLM-specific resilience techniques.
Overview
- Building fault-tolerant multi-agent systems
- Implementing LLM API integrations with proper error handling
- Designing distributed workflows that need graceful degradation
- Adding observability to failure scenarios
- Protecting systems from cascade failures
Core Patterns
1. Circuit Breaker Pattern (reference: circuit-breaker.md)
Prevents cascade failures by "tripping" when a service exceeds failure thresholds.
+-------------------------------------------------------------------+
| Circuit Breaker States |
+-------------------------------------------------------------------+
| |
| +----------+ failures >= threshold +----------+ |
| | CLOSED | ----------------------------> | OPEN | |
| | (normal) | | (reject) | |
| +----+-----+ +----+-----+ |
| | | |
| | success timeout | |
| | expires | |
| | +------------+ | |
| | | HALF_OPEN |<-----------------+ |
| +---------+ (probe) | |
| +------------+ |
| |
| CLOSED: Allow requests, count failures |
| OPEN: Reject immediately, return fallback |
| HALF_OPEN: Allow probe request to test recovery |
| |
+-------------------------------------------------------------------+
Key Configuration:
- failure_threshold: Failures before opening (default: 5)
- recovery_timeout: Seconds before attempting recovery (default: 30)
- half_open_requests: Probes to allow in half-open (default: 1)
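The state machine and the three settings above can be sketched as a small synchronous class. This is a minimal illustration, not the contents of scripts/circuit-breaker.py: the class and exception names, the injectable `clock` parameter, and the choice to close the circuit after a single successful probe are all assumptions made for the example.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0,
                 half_open_requests=1, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_requests = half_open_requests
        self._clock = clock  # injectable so tests do not need to sleep
        self.state = State.CLOSED
        self._failures = 0
        self._opened_at = 0.0
        self._probes = 0

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open, request rejected")
            # Recovery timeout expired: allow probe requests.
            self.state = State.HALF_OPEN
            self._probes = 0
        if self.state is State.HALF_OPEN:
            if self._probes >= self.half_open_requests:
                raise CircuitOpenError("probe limit reached")
            self._probes += 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # A success resets the failure count; in HALF_OPEN it closes the circuit.
        self._failures = 0
        self.state = State.CLOSED

    def _on_failure(self):
        self._failures += 1
        # A failed probe reopens immediately; otherwise wait for the threshold.
        if self.state is State.HALF_OPEN or self._failures >= self.failure_threshold:
            self.state = State.OPEN
            self._opened_at = self._clock()
```

A caller would wrap each outbound request as `breaker.call(fetch, url)` and catch `CircuitOpenError` to return a fallback. The sketch is not thread-safe; a shared breaker would need a lock around the state transitions.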
2. Bulkhead Pattern (reference: bulkhead-pattern.md)
Isolates failures by partitioning resources into independent pools.
+-------------------------------------------------------------------+
| Bulkhead Isolation |
+-------------------------------------------------------------------+
| |
| +------------------+ +------------------+ |
| | TIER 1: Critical | | TIER 2: Standard | |
| | (5 workers) | | (3 workers) | |
| | +-+ +-+ +-+ | | +-+ +-+ +-+ | |
| | |#| |#| | | | | |#| | | | | | |
| | +-+ +-+ +-+ | | +-+ +-+ +-+ | |
| | +-+ +-+ | | | |
| | | | | | | | Queue: 2 | |
| | +-+ +-+ | | | |
| | Queue: 0 | +------------------+ |
| +------------------+ |
| |
| +------------------+ |
| | TIER 3: Optional | # = Active request |
| | (2 workers) | = Available slot |
| | +-+ +-+ | |
| | |#| |#| FULL! | Tier 1: synthesis, quality_gate |
| | +-+ +-+ | Tier 2: analysis agents |
| | Queue: 5 | Tier 3: enrichment, optional features |
| +------------------+ |
| |
+-------------------------------------------------------------------+
Tier Configuration (OrchestKit):
| Tier | Workers | Queue | Timeout | Use Case |
|---|---|---|---|---|
| 1 (Critical) | 5 | 10 | 300s | Synthesis, quality gate |
| 2 (Standard) | 3 | 5 | 120s | Content analysis agents |
| 3 (Optional) | 2 | 3 | 60s | Enrichment, caching |
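One way to realize these tiers is a semaphore per tier, with a bounded count of waiting callers standing in for the queue. The sketch below is an assumption-laden illustration, not the contents of scripts/bulkhead.py: the `Bulkhead` class, `BulkheadFullError`, and the `TIERS` dictionary are hypothetical names, and only the numbers come from the table.

```python
import asyncio


class BulkheadFullError(Exception):
    """Raised when a tier has no free worker and its queue is full."""


class Bulkhead:
    def __init__(self, workers, queue_size, timeout):
        self._slots = asyncio.Semaphore(workers)
        self._queue_size = queue_size
        self._timeout = timeout
        self._waiting = 0

    async def run(self, coro_fn, *args):
        # Reject immediately if every worker is busy and the queue is full.
        if self._slots.locked() and self._waiting >= self._queue_size:
            raise BulkheadFullError("tier saturated")
        self._waiting += 1
        try:
            await self._slots.acquire()
        finally:
            self._waiting -= 1
        try:
            # The tier timeout bounds execution time, not time spent queued.
            return await asyncio.wait_for(coro_fn(*args), self._timeout)
        finally:
            self._slots.release()


# Worker, queue, and timeout values taken from the tier table above.
TIERS = {
    1: dict(workers=5, queue_size=10, timeout=300),
    2: dict(workers=3, queue_size=5, timeout=120),
    3: dict(workers=2, queue_size=3, timeout=60),
}
```

Create each `Bulkhead(**TIERS[n])` inside the running event loop and route work through `await bulkhead.run(agent_fn, payload)`. Because each tier owns its own semaphore, a saturated Tier 3 cannot consume Tier 1 capacity.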
3. Retry Strategies (reference: retry-strategies.md)
Intelligent retry logic with exponential backoff and jitter.
+-------------------------------------------------------------------+
| Exponential Backoff + Jitter |
+-------------------------------------------------------------------+
| |
| Attempt 1: --> X (fail) |
| wait: 1s +/- 0.5s |
| |
| Attempt 2: --> X (fail) |
| wait: 2s +/- 1s |
| |
| Attempt 3: --> X (fail) |
| wait: 4s +/- 2s |
| |
| Attempt 4: --> OK (success) |
| |
| Formula: delay = min(base * 2^attempt, max_delay) * jitter |
| Jitter: random(0.5, 1.5) to prevent thundering herd |
| |
+-------------------------------------------------------------------+
Error Classification for Retries:
```python
# Transient failures: worth retrying after a backoff delay.
RETRYABLE_ERRORS = {
    # HTTP/Network
    408, 429, 500, 502, 503, 504,   # HTTP status codes
    ConnectionError, TimeoutError,  # Network errors
    # LLM-specific
    "rate_limit_exceeded",
    "model_overloaded",
    "context_length_exceeded",      # Retry with truncation
}

# Permanent failures: the same request will fail again, so do not retry.
NON_RETRYABLE_ERRORS = {
    400, 401, 403, 404,  # Client errors
    "invalid_api_key",
    "content_policy_violation",
    "invalid_request_error",
}
```
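The backoff formula and the retryable/non-retryable split can be combined into one small retry helper. This is a sketch under stated assumptions: `ApiError` is a hypothetical exception carrying an HTTP status, attempts are numbered from 0 so the first wait is about `base` seconds, and the `sleep` and `rng` parameters exist only to make the helper testable.

```python
import random
import time

RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}


class ApiError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def backoff_delay(attempt, base=1.0, max_delay=60.0, rng=random.uniform):
    """delay = min(base * 2^attempt, max_delay) * jitter, with attempt from 0."""
    return min(base * 2 ** attempt, max_delay) * rng(0.5, 1.5)


def is_retryable(exc):
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return True
    return isinstance(exc, ApiError) and exc.status in RETRYABLE_STATUS


def retry(func, max_attempts=4, sleep=time.sleep, **backoff_kwargs):
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            # Give up on permanent errors or when attempts are exhausted.
            if last_attempt or not is_retryable(exc):
                raise
            sleep(backoff_delay(attempt, **backoff_kwargs))
```

Because the jitter multiplies the capped value, a single wait can reach 1.5 times `max_delay`; apply the cap after the jitter instead if a hard upper bound matters.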
4. LLM-Specific Resilience (reference: llm-resilience.md)
Patterns specific to LLM API integrations.
+-------------------------------------------------------------------+
| LLM Fallback Chain |
+-------------------------------------------------------------------+
| |
| Request --> [Primary Model] --success--> Response |
| | |
| fail |
| v |
| [Fallback Model] --success--> Response |
| | |
| fail |
| v |
| [Cached Response] --hit--> Response |
| | |
| miss |
| v |
| [Default Response] --> Graceful Degradation |
| |
| Example Chain: |
| 1. claude-sonnet-4-5-20251101 (primary) |
| 2. gpt-5.2-mini (fallback) |
| 3. Semantic cache lookup |
| 4. "Analysis unavailable" + partial results |
| |
+-------------------------------------------------------------------+
Token Budget Management:
+-------------------------------------------------------------------+
| Token Budget Guard |
+-------------------------------------------------------------------+
| |
| Input: 8,000 tokens |
| +---------------------------------------------+ |
| |################################# | |
| +---------------------------------------------+ |
| ^ |
| | |
| Context Limit (16K) |
| |
| Strategy when approaching limit: |
| 1. Summarize earlier context (compress 4:1) |
| 2. Drop low-priority content (optional fields) |
| 3. Split into multiple requests |
| 4. Fail fast with "content too large" error |
| |
+-------------------------------------------------------------------+
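A token budget guard along these lines might drop optional content first and fail fast when the prompt still does not fit (strategies 2 and 4 in the diagram). The sketch below is illustrative only: the `Section` type, the `reserve_for_output` parameter, and the four-characters-per-token estimate are assumptions, and a real implementation would use the model's own tokenizer.

```python
from dataclasses import dataclass


class ContentTooLargeError(Exception):
    """Raised when the prompt cannot be made to fit the budget."""


@dataclass
class Section:
    text: str
    optional: bool = False


def estimate_tokens(text):
    # Crude assumption: roughly 4 characters per token.
    return (len(text) + 3) // 4


def fit_to_budget(sections, context_limit=16_000, reserve_for_output=1_000):
    budget = context_limit - reserve_for_output
    kept = list(sections)
    total = sum(estimate_tokens(s.text) for s in kept)
    # Drop optional sections, last one first, until the prompt fits.
    for i in range(len(kept) - 1, -1, -1):
        if total <= budget:
            break
        if kept[i].optional:
            total -= estimate_tokens(kept[i].text)
            del kept[i]
    if total > budget:
        raise ContentTooLargeError(
            f"content too large: {total} > {budget} tokens")
    return kept
```

Summarizing earlier context or splitting into multiple requests (strategies 1 and 3) would slot in before the final check, since both need model calls or task-specific logic that this sketch leaves out.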
Quick Reference
| Pattern | When to Use | Key Benefit |
|---|---|---|
| Circuit Breaker | External service calls | Prevent cascade failures |
| Bulkhead | Multi-tenant/multi-agent | Isolate failures |
| Retry + Backoff | Transient failures | Automatic recovery |
| Fallback Chain | Critical operations | Graceful degradation |
| Token Budget | LLM calls | Cost control, prevent failures |
OrchestKit Integration Points
- Workflow Agents: Each agent wrapped with circuit breaker + bulkhead tier
- LLM Calls: All model invocations use fallback chain + retry logic
- External APIs: Circuit breaker on YouTube, arXiv, GitHub APIs
- Database Ops: Bulkhead isolation for read vs write operations
Files in This Skill
References (Conceptual Guides)
- references/circuit-breaker.md - Deep dive on circuit breaker pattern
- references/bulkhead-pattern.md - Bulkhead isolation strategies
- references/retry-strategies.md - Retry algorithms and error classification
- references/llm-resilience.md - LLM-specific patterns
- references/error-classification.md - How to categorize errors
Templates (Code Patterns)
- scripts/circuit-breaker.py - Ready-to-use circuit breaker class
- scripts/bulkhead.py - Semaphore-based bulkhead implementation
- scripts/retry-handler.py - Configurable retry decorator
- scripts/llm-fallback-chain.py - Multi-model fallback pattern
- scripts/token-budget.py - Token budget guard implementation
Examples
- examples/orchestkit-workflow-resilience.md - Full OrchestKit integration example
Checklists
- checklists/pre-deployment-resilience.md - Production readiness checklist
- checklists/circuit-breaker-setup.md - Circuit breaker configuration guide
2026 Best Practices
- Adaptive Thresholds: Use sliding windows, not fixed counters
- Observability First: Every circuit trip = alert + metric + trace
- Graceful Degradation: Always have a fallback, even if partial
- Health Endpoints: Separate health check from circuit state
- Chaos Testing: Regularly test failure scenarios in staging
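To illustrate the first practice, a sliding window judges the failure rate over recent calls only, so old failures age out instead of accumulating in a fixed counter. The sketch below is an assumption, not code from this skill: the class name, the `min_calls` guard against tripping on too few samples, and the 50% default threshold are illustrative choices.

```python
import time
from collections import deque


class SlidingWindowFailureRate:
    def __init__(self, window_seconds=60.0, min_calls=10, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.min_calls = min_calls
        self._clock = clock  # injectable so tests do not need to sleep
        self._events = deque()  # (timestamp, succeeded) pairs, oldest first

    def record(self, succeeded):
        self._events.append((self._clock(), succeeded))
        self._evict()

    def _evict(self):
        # Discard outcomes that have fallen out of the time window.
        cutoff = self._clock() - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def failure_rate(self):
        self._evict()
        if len(self._events) < self.min_calls:
            return 0.0  # too few samples to judge
        failures = sum(1 for _, ok in self._events if not ok)
        return failures / len(self._events)

    def should_trip(self, threshold=0.5):
        return self.failure_rate() >= threshold
```

A circuit breaker could call `record()` after every request and consult `should_trip()` in place of comparing a raw failure count against a fixed threshold.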
Related Skills
- observability-monitoring - Metrics and alerting for circuit breaker state changes
- caching-strategies - Cache as fallback layer in degradation scenarios
- error-handling-rfc9457 - Structured error responses for resilience failures
- background-jobs - Async processing with retry and failure handling
Key Decisions
| Decision | Choice | Rationale |
|---|---|---|
| Circuit breaker recovery | Half-open probe | Gradual recovery, prevents immediate re-failure |
| Retry algorithm | Exponential backoff + jitter | Prevents thundering herd, respects rate limits |
| Bulkhead isolation | Semaphore-based tiers | Simple, efficient, prioritizes critical operations |
| LLM fallback | Model chain with cache | Graceful degradation, cost optimization, availability |
Capability Details
circuit-breaker
Keywords: circuit breaker, failure threshold, cascade failure, trip, half-open
Solves:
- Prevent cascade failures when external services fail
- Automatically recover when services come back online
- Fail fast instead of waiting for timeouts
bulkhead
Keywords: bulkhead, isolation, semaphore, thread pool, resource pool, tier
Solves:
- Isolate failures to prevent entire system crashes
- Prioritize critical operations over optional ones
- Limit concurrent requests to protect resources
retry-strategies
Keywords: retry, backoff, exponential, jitter, thundering herd
Solves:
- Handle transient failures automatically
- Avoid overwhelming recovering services
- Classify errors as retryable vs non-retryable
llm-resilience
Keywords: LLM, fallback, model, token budget, rate limit, context length
Solves:
- Handle LLM API rate limits gracefully
- Fall back to alternative models when primary fails
- Manage token budgets to prevent context overflow
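The fallback behavior described for LLM calls (primary model, fallback model, cached response, default response) can be sketched as one ordered loop. Everything here is a hypothetical stand-in: `run_with_fallbacks`, the `(name, callable)` model pairs, and the exact-match dictionary used in place of a semantic cache are assumptions for illustration, not the contents of scripts/llm-fallback-chain.py.

```python
import logging

logger = logging.getLogger("llm_fallback")


def run_with_fallbacks(prompt, models, cache=None,
                       default="Analysis unavailable"):
    """Try each model in order, then the cache, then a default response.

    `models` is a list of (name, callable) pairs; each callable takes the
    prompt and returns a response or raises. Returns (response, source).
    """
    for name, call in models:
        try:
            return call(prompt), name
        except Exception as exc:  # production code should catch narrower types
            logger.warning("model %s failed: %s", name, exc)
    # Exact-match lookup stands in for a semantic cache in this sketch.
    if cache is not None and prompt in cache:
        return cache[prompt], "cache"
    return default, "default"
```

Returning the source alongside the response lets callers and metrics distinguish a full answer from a degraded one, which supports the observability practice listed earlier.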
error-classification
Keywords: error, retryable, transient, permanent, classification
Solves:
- Determine which errors should be retried
- Categorize errors by severity and recoverability
- Map HTTP status codes to resilience actions
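One possible shape for such a mapping is a lookup table from status codes and LLM error codes to an action. The sketch below reuses the codes from the retry classification earlier in this document; the `Action` enum, the `classify` function, and the conservative fail-fast default for unknown codes are assumptions added for illustration.

```python
from enum import Enum


class Action(Enum):
    RETRY = "retry"                      # transient: back off and retry
    RETRY_TRUNCATED = "retry_truncated"  # retry after shrinking the input
    FAIL_FAST = "fail_fast"              # permanent: surface to the caller


ACTIONS = {
    **{code: Action.RETRY for code in (408, 429, 500, 502, 503, 504)},
    "rate_limit_exceeded": Action.RETRY,
    "model_overloaded": Action.RETRY,
    "context_length_exceeded": Action.RETRY_TRUNCATED,
    **{code: Action.FAIL_FAST for code in (400, 401, 403, 404)},
    "invalid_api_key": Action.FAIL_FAST,
    "content_policy_violation": Action.FAIL_FAST,
    "invalid_request_error": Action.FAIL_FAST,
}


def classify(error_code):
    # Unknown codes fail fast here; that default is an assumption.
    return ACTIONS.get(error_code, Action.FAIL_FAST)
```

Keeping the table as data rather than branching logic makes it easy to review and to extend with provider-specific codes.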