resilience-patterns

Resilience Patterns Skill

Production-grade resilience patterns for distributed systems and LLM-based workflows. Covers circuit breakers, bulkheads, retry strategies, and LLM-specific resilience techniques.

Overview

Use this skill when:

Building fault-tolerant multi-agent systems
Implementing LLM API integrations with proper error handling
Designing distributed workflows that need graceful degradation
Adding observability to failure scenarios
Protecting systems from cascade failures

Core Patterns

1. Circuit Breaker Pattern (reference: circuit-breaker.md)

Prevents cascade failures by "tripping" when a service exceeds failure thresholds.

+-------------------------------------------------------------------+
|                    Circuit Breaker States                         |
+-------------------------------------------------------------------+
|                                                                   |
|    +----------+     failures >= threshold    +----------+         |
|    |  CLOSED  | ---------------------------> |   OPEN   |         |
|    | (normal) |                              | (reject) |         |
|    +----+-----+                              +----+-----+         |
|         |                                         |               |
|         | success                    timeout      |               |
|         |                            expires      |               |
|         |         +------------+                  |               |
|         |         | HALF_OPEN  |<-----------------+               |
|         +---------+  (probe)   |                                  |
|                   +------------+                                  |
|                                                                   |
|   CLOSED:    Allow requests, count failures                       |
|   OPEN:      Reject immediately, return fallback                  |
|   HALF_OPEN: Allow probe request to test recovery                 |
|                                                                   |
+-------------------------------------------------------------------+

Key Configuration:

`failure_threshold`: Failures before opening (default: 5)
`recovery_timeout`: Seconds before attempting recovery (default: 30)
`half_open_requests`: Probes to allow in half-open (default: 1)
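
The state machine and the three settings above can be sketched roughly as follows. This is a minimal, single-threaded illustration and not the skill's `scripts/circuit-breaker.py`; the class name, the `CircuitOpenError` exception, and the injectable `clock` parameter are assumptions made for the example.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Hypothetical error raised when a call is rejected without being attempted."""


class CircuitBreaker:
    """Illustrative breaker; not thread-safe."""

    def __init__(self, failure_threshold=5, recovery_timeout=30.0,
                 half_open_requests=1, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_requests = half_open_requests
        self._clock = clock  # injectable so tests do not need to sleep
        self.state = State.CLOSED
        self._failures = 0
        self._opened_at = 0.0
        self._probes = 0

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open, rejecting call")
            # Recovery timeout expired: move to HALF_OPEN and start probing.
            self.state = State.HALF_OPEN
            self._probes = 0
        if self.state is State.HALF_OPEN:
            if self._probes >= self.half_open_requests:
                raise CircuitOpenError("probe limit reached")
            self._probes += 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success, including a successful probe, closes the circuit.
        self.state = State.CLOSED
        self._failures = 0

    def _on_failure(self):
        self._failures += 1
        # A failed probe reopens immediately; otherwise wait for the threshold.
        if self.state is State.HALF_OPEN or self._failures >= self.failure_threshold:
            self.state = State.OPEN
            self._opened_at = self._clock()
```

A caller would typically catch `CircuitOpenError` and return a fallback value, which matches the "Reject immediately, return fallback" behaviour of the OPEN state.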

2. Bulkhead Pattern (reference: bulkhead-pattern.md)

Isolates failures by partitioning resources into independent pools.

+-------------------------------------------------------------------+
|                      Bulkhead Isolation                           |
+-------------------------------------------------------------------+
|                                                                   |
|   +------------------+  +------------------+                      |
|   | TIER 1: Critical |  | TIER 2: Standard |                      |
|   |  (5 workers)     |  |  (3 workers)     |                      |
|   |  +-+ +-+ +-+     |  |  +-+ +-+ +-+     |                      |
|   |  |#| |#| | |     |  |  |#| | | | |     |                      |
|   |  +-+ +-+ +-+     |  |  +-+ +-+ +-+     |                      |
|   |  +-+ +-+         |  |                  |                      |
|   |  | | | |         |  |  Queue: 2        |                      |
|   |  +-+ +-+         |  |                  |                      |
|   |  Queue: 0        |  +------------------+                      |
|   +------------------+                                            |
|                                                                   |
|   +------------------+                                            |
|   | TIER 3: Optional |   # = Active request                       |
|   |  (2 workers)     |     = Available slot                       |
|   |  +-+ +-+         |                                            |
|   |  |#| |#| FULL!   |   Tier 1: synthesis, quality_gate          |
|   |  +-+ +-+         |   Tier 2: analysis agents                  |
|   |  Queue: 5        |   Tier 3: enrichment, optional features    |
|   +------------------+                                            |
|                                                                   |
+-------------------------------------------------------------------+

Tier Configuration (OrchestKit):

| Tier | Workers | Queue | Timeout | Use Case |
| --- | --- | --- | --- | --- |
| 1 (Critical) | 5 | 10 | 300s | Synthesis, quality gate |
| 2 (Standard) | 3 | 5 | 120s | Content analysis agents |
| 3 (Optional) | 2 | 3 | 60s | Enrichment, caching |
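
One way to picture the tiers is a semaphore per tier plus a cap on waiting requests. The sketch below uses `asyncio` and is not the skill's `scripts/bulkhead.py`: the `Bulkhead` class, `BulkheadFullError`, and the `TIERS` mapping are hypothetical names, and it assumes the timeout covers execution time only, not time spent queued.

```python
import asyncio


class BulkheadFullError(Exception):
    """Hypothetical error raised when a tier's workers and queue are both full."""


class Bulkhead:
    """Illustrative tier: `workers` concurrent slots plus `queue_size` waiters."""

    def __init__(self, workers, queue_size, timeout):
        self._slots = asyncio.Semaphore(workers)
        self._capacity = workers + queue_size  # running + waiting
        self._in_flight = 0
        self._timeout = timeout

    async def run(self, coro_func, *args):
        # Reject immediately instead of queueing without bound.
        if self._in_flight >= self._capacity:
            raise BulkheadFullError("tier saturated")
        self._in_flight += 1
        try:
            async with self._slots:
                # Assumption: the timeout applies to execution, not queue wait.
                return await asyncio.wait_for(coro_func(*args), self._timeout)
        finally:
            self._in_flight -= 1


# Settings copied from the tier table above (timeouts in seconds).
TIERS = {
    1: dict(workers=5, queue_size=10, timeout=300),
    2: dict(workers=3, queue_size=5, timeout=120),
    3: dict(workers=2, queue_size=3, timeout=60),
}


async def main():
    # Build bulkheads inside the running event loop.
    critical = Bulkhead(**TIERS[1])

    async def synthesize(text):
        await asyncio.sleep(0.01)
        return text.upper()

    return await critical.run(synthesize, "draft")


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Because each tier has its own semaphore, a saturated Tier 3 cannot take worker slots away from Tier 1.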

3. Retry Strategies (reference: retry-strategies.md)

Intelligent retry logic with exponential backoff and jitter.

+-------------------------------------------------------------------+
|                   Exponential Backoff + Jitter                    |
+-------------------------------------------------------------------+
|                                                                   |
|   Attempt 1:  --> X (fail)                                        |
|               wait: 1s +/- 0.5s                                   |
|                                                                   |
|   Attempt 2:  --> X (fail)                                        |
|               wait: 2s +/- 1s                                     |
|                                                                   |
|   Attempt 3:  --> X (fail)                                        |
|               wait: 4s +/- 2s                                     |
|                                                                   |
|   Attempt 4:  --> OK (success)                                    |
|                                                                   |
|   Formula: delay = min(base * 2^attempt, max_delay) * jitter      |
|   Jitter:  random(0.5, 1.5) to prevent thundering herd            |
|                                                                   |
+-------------------------------------------------------------------+
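
The formula in the diagram can be written out as a short sketch. It assumes `attempt` is zero-based (so the wait after the first failure uses `2^0`), and the `max_delay` default of 60 seconds, the `retry` helper, and the injectable `rng` and `sleep` parameters are assumptions for illustration rather than part of the skill's `scripts/retry-handler.py`.

```python
import random
import time


def backoff_delay(attempt, base=1.0, max_delay=60.0, rng=random.uniform):
    """Delay before the next try; `attempt` is 0 after the first failure."""
    capped = min(base * 2 ** attempt, max_delay)
    # Multiplicative jitter in [0.5, 1.5] spreads out simultaneous retries.
    return capped * rng(0.5, 1.5)


def retry(func, max_attempts=4, sleep=time.sleep, **backoff_kwargs):
    """Call `func` until it succeeds or `max_attempts` tries are used up."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt, **backoff_kwargs))
```

With `base=1.0`, the nominal waits are 1s, 2s, and 4s, each scaled by the jitter factor, which matches the three waits shown in the diagram. A production version would also check whether the error is retryable before sleeping.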

Error Classification for Retries:

```python
RETRYABLE_ERRORS = {
    # HTTP/Network
    408, 429, 500, 502, 503, 504,  # HTTP status codes
    ConnectionError, TimeoutError,  # Network errors

    # LLM-specific
    "rate_limit_exceeded",
    "model_overloaded",
    "context_length_exceeded",  # Retry with truncation
}

NON_RETRYABLE_ERRORS = {
    400, 401, 403, 404,  # Client errors
    "invalid_api_key",
    "content_policy_violation",
    "invalid_request_error",
}
```
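
A classifier built on these two sets might look like the sketch below. The `ApiError` type and the `is_retryable` function are hypothetical; real client libraries expose status codes and error codes in their own ways. The sketch checks the provider error code before the HTTP status, on the assumption that the code is the more specific signal (for example, a context-length error that arrives with a 400 status).

```python
RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}
RETRYABLE_CODES = {"rate_limit_exceeded", "model_overloaded", "context_length_exceeded"}
RETRYABLE_EXCEPTIONS = (ConnectionError, TimeoutError)

NON_RETRYABLE_STATUS = {400, 401, 403, 404}
NON_RETRYABLE_CODES = {"invalid_api_key", "content_policy_violation", "invalid_request_error"}


class ApiError(Exception):
    """Hypothetical error carrying an HTTP status and a provider error code."""

    def __init__(self, status=None, code=None):
        super().__init__(f"status={status} code={code}")
        self.status = status
        self.code = code


def is_retryable(exc):
    """Return True when the error looks transient under the sets above."""
    if isinstance(exc, RETRYABLE_EXCEPTIONS):
        return True
    if not isinstance(exc, ApiError):
        return False  # unknown errors are treated as permanent
    # The provider code is more specific than the status, so check it first.
    if exc.code in RETRYABLE_CODES:
        return True
    if exc.code in NON_RETRYABLE_CODES:
        return False
    if exc.status in NON_RETRYABLE_STATUS:
        return False
    return exc.status in RETRYABLE_STATUS
```

Note that `context_length_exceeded` is only worth retrying if the caller truncates or summarizes the input first; resending the same request would fail again.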

4. LLM-Specific Resilience (reference: llm-resilience.md)

Patterns specific to LLM API integrations.

+-------------------------------------------------------------------+
|                    LLM Fallback Chain                             |
+-------------------------------------------------------------------+
|                                                                   |
|   Request --> [Primary Model] --success--> Response               |
|                     |                                             |
|                   fail                                            |
|                     v                                             |
|               [Fallback Model] --success--> Response              |
|                     |                                             |
|                   fail                                            |
|                     v                                             |
|               [Cached Response] --hit--> Response                 |
|                     |                                             |
|                   miss                                            |
|                     v                                             |
|               [Default Response] --> Graceful Degradation         |
|                                                                   |
|   Example Chain:                                                  |
|   1. claude-sonnet-4-5-20251101 (primary)                         |
|   2. gpt-5.2-mini (fallback)                                      |
|   3. Semantic cache lookup                                        |
|   4. "Analysis unavailable" + partial results                     |
|                                                                   |
+-------------------------------------------------------------------+
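
The chain in the diagram can be sketched as a loop over model callables followed by a cache lookup and a default. Everything named here (`run_fallback_chain`, the result dictionary keys, the `(name, callable)` pairs) is an assumption for illustration; the model callables stand in for whatever client functions a real integration would use.

```python
def run_fallback_chain(prompt, models, cache_lookup,
                       default_text="Analysis unavailable"):
    """Try each model in order, then the cache, then a default response.

    `models` is a sequence of (name, callable) pairs; each callable takes the
    prompt and returns text or raises. `cache_lookup` returns text or None.
    """
    errors = []
    for name, call_model in models:
        try:
            return {"source": name, "text": call_model(prompt), "degraded": False}
        except Exception as exc:
            # Record the failure and fall through to the next option.
            errors.append((name, repr(exc)))
    cached = cache_lookup(prompt)
    if cached is not None:
        return {"source": "cache", "text": cached, "degraded": True, "errors": errors}
    # Last resort: a fixed message so callers still receive a response.
    return {"source": "default", "text": default_text, "degraded": True, "errors": errors}
```

Returning a `degraded` flag lets downstream steps decide whether a cached or default answer is acceptable for their purpose.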

Token Budget Management:

+-------------------------------------------------------------------+
|                     Token Budget Guard                            |
+-------------------------------------------------------------------+
|                                                                   |
|   Input: 8,000 tokens                                             |
|   +---------------------------------------------+                 |
|   |#################################            |                 |
|   +---------------------------------------------+                 |
|                                          ^                        |
|                                          |                        |
|                                    Context Limit (16K)            |
|                                                                   |
|   Strategy when approaching limit:                                |
|   1. Summarize earlier context (compress 4:1)                     |
|   2. Drop low-priority content (optional fields)                  |
|   3. Split into multiple requests                                 |
|   4. Fail fast with "content too large" error                     |
|                                                                   |
+-------------------------------------------------------------------+
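
Steps 2 and 4 of the strategy (drop low-priority content, then fail fast) can be sketched as follows. The four-characters-per-token estimate is a crude assumption that a real guard would replace with the model's tokenizer, and `fit_to_budget`, `estimate_tokens`, and `ContentTooLargeError` are hypothetical names, not the contents of `scripts/token-budget.py`.

```python
class ContentTooLargeError(Exception):
    """Hypothetical error for input that cannot be made to fit the budget."""


def estimate_tokens(text):
    # Crude assumption: roughly 4 characters per token.
    return max(1, len(text) // 4)


def fit_to_budget(sections, budget):
    """Drop optional sections until the estimated total fits `budget`.

    `sections` is a list of (name, text, required) tuples ordered from
    highest to lowest priority.
    """
    kept = list(sections)

    def total(items):
        return sum(estimate_tokens(text) for _, text, _ in items)

    # Walk from the lowest-priority end, removing optional sections as needed.
    for i in range(len(kept) - 1, -1, -1):
        if total(kept) <= budget:
            break
        if not kept[i][2]:
            del kept[i]
    if total(kept) > budget:
        # Required content alone is too large: fail fast.
        raise ContentTooLargeError(
            f"content too large: {total(kept)} > {budget} tokens"
        )
    return kept
```

Summarizing earlier context and splitting into multiple requests (steps 1 and 3) would slot in before the fail-fast branch.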

Quick Reference

| Pattern | When to Use | Key Benefit |
| --- | --- | --- |
| Circuit Breaker | External service calls | Prevent cascade failures |
| Bulkhead | Multi-tenant/multi-agent | Isolate failures |
| Retry + Backoff | Transient failures | Automatic recovery |
| Fallback Chain | Critical operations | Graceful degradation |
| Token Budget | LLM calls | Cost control, prevent failures |

OrchestKit Integration Points

Workflow Agents: Each agent wrapped with circuit breaker + bulkhead tier
LLM Calls: All model invocations use fallback chain + retry logic
External APIs: Circuit breaker on YouTube, arXiv, GitHub APIs
Database Ops: Bulkhead isolation for read vs write operations

Files in This Skill

References (Conceptual Guides)

`references/circuit-breaker.md` - Deep dive on circuit breaker pattern
`references/bulkhead-pattern.md` - Bulkhead isolation strategies
`references/retry-strategies.md` - Retry algorithms and error classification
`references/llm-resilience.md` - LLM-specific patterns
`references/error-classification.md` - How to categorize errors

Templates (Code Patterns)

`scripts/circuit-breaker.py` - Ready-to-use circuit breaker class
`scripts/bulkhead.py` - Semaphore-based bulkhead implementation
`scripts/retry-handler.py` - Configurable retry decorator
`scripts/llm-fallback-chain.py` - Multi-model fallback pattern
`scripts/token-budget.py` - Token budget guard implementation

Examples

`examples/orchestkit-workflow-resilience.md` - Full OrchestKit integration example

Checklists

`checklists/pre-deployment-resilience.md` - Production readiness checklist
`checklists/circuit-breaker-setup.md` - Circuit breaker configuration guide

2026 Best Practices

Adaptive Thresholds: Use sliding windows, not fixed counters
Observability First: Every circuit trip = alert + metric + trace
Graceful Degradation: Always have a fallback, even if partial
Health Endpoints: Separate health check from circuit state
Chaos Testing: Regularly test failure scenarios in staging
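
As an illustration of the first practice, a sliding-window failure rate can replace a fixed failure counter. The sketch below is one possible shape under stated assumptions: the class name, the 60-second window, the `min_calls` floor, and the 50% trip threshold are all hypothetical defaults, not values taken from this skill.

```python
import time
from collections import deque


class SlidingWindowFailureRate:
    """Tracks call outcomes over the last `window_seconds`."""

    def __init__(self, window_seconds=60.0, min_calls=10, clock=time.monotonic):
        self._window = window_seconds
        self._min_calls = min_calls  # avoid tripping on a tiny sample
        self._clock = clock
        self._events = deque()  # (timestamp, succeeded) pairs, oldest first

    def record(self, succeeded):
        self._events.append((self._clock(), succeeded))
        self._evict()

    def _evict(self):
        # Discard outcomes that have aged out of the window.
        cutoff = self._clock() - self._window
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def failure_rate(self):
        self._evict()
        if len(self._events) < self._min_calls:
            return 0.0
        failures = sum(1 for _, ok in self._events if not ok)
        return failures / len(self._events)

    def should_trip(self, threshold=0.5):
        return self.failure_rate() >= threshold
```

Unlike a fixed counter, old failures age out on their own, so a burst of errors an hour ago does not keep the breaker close to tripping.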

Related Skills

Key Decisions

| Decision | Choice | Rationale |
| --- | --- | --- |
| Circuit breaker recovery | Half-open probe | Gradual recovery, prevents immediate re-failure |
| Retry algorithm | Exponential backoff + jitter | Prevents thundering herd, respects rate limits |
| Bulkhead isolation | Semaphore-based tiers | Simple, efficient, prioritizes critical operations |
| LLM fallback | Model chain with cache | Graceful degradation, cost optimization, availability |

Capability Details

circuit-breaker

Keywords: circuit breaker, failure threshold, cascade failure, trip, half-open

Solves:

Prevent cascade failures when external services fail
Automatically recover when services come back online
Fail fast instead of waiting for timeouts

bulkhead

Keywords: bulkhead, isolation, semaphore, thread pool, resource pool, tier

Solves:

Isolate failures to prevent entire system crashes
Prioritize critical operations over optional ones
Limit concurrent requests to protect resources

retry-strategies

Keywords: retry, backoff, exponential, jitter, thundering herd

Solves:

Handle transient failures automatically
Avoid overwhelming recovering services
Classify errors as retryable vs non-retryable

llm-resilience

Keywords: LLM, fallback, model, token budget, rate limit, context length

Solves:

Handle LLM API rate limits gracefully
Fall back to alternative models when primary fails
Manage token budgets to prevent context overflow

error-classification

Keywords: error, retryable, transient, permanent, classification

Solves:

Determine which errors should be retried
Categorize errors by severity and recoverability
Map HTTP status codes to resilience actions
