chaos-engineering

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

When this skill is activated, always start your first response with the 🧢 emoji.

当激活本Skill时，首次回复请务必以🧢表情开头。

Chaos Engineering

混沌工程

A practitioner's framework for running controlled failure experiments in production systems. This skill covers how to design, execute, and learn from chaos experiments - from simple latency injections to full game days - with an emphasis on safety, minimal blast radius, and translating findings into durable resilience improvements.

这是一套用于在生产系统中开展受控故障实验的从业者框架。本Skill涵盖混沌实验的设计、执行与经验总结——从简单的延迟注入到完整的故障演练（Game Day），重点关注安全性、最小化影响范围（Blast Radius），并将实验发现转化为持久的系统韧性提升。

When to use this skill

何时使用本Skill

Trigger this skill when the user:

Wants to design a chaos experiment or fault injection scenario
Is setting up a chaos engineering program from scratch
Needs to implement network latency, packet loss, or service dependency failures
Is planning or facilitating a game day exercise
Needs to validate circuit breakers, retries, or failover logic under real failure conditions
Wants to measure and improve MTTR (Mean Time to Recovery)
Is evaluating chaos tooling (Chaos Monkey, Litmus, Gremlin, AWS Fault Injection Simulator)

Do NOT trigger this skill for:

Writing standard retry or circuit breaker code without the intent to test it under chaos (use backend-engineering skill)
Load testing or performance benchmarking that does not involve failure injection (use performance-engineering skill)

当用户有以下需求时触发本Skill：

想要设计混沌实验或故障注入场景
从零开始搭建混沌工程体系
需要实现网络延迟、丢包或服务依赖故障
正在规划或组织故障演练（Game Day）活动
需要在真实故障条件下验证断路器、重试或故障转移逻辑
想要测量并优化MTTR（平均恢复时间）
正在评估混沌工具（Chaos Monkey、Litmus、Gremlin、AWS Fault Injection Simulator）

以下情况请勿触发本Skill：

编写标准重试或断路器代码但无意在混沌环境下测试（请使用后端开发Skill）
不涉及故障注入的负载测试或性能基准测试（请使用性能工程Skill）

Key principles

核心原则

Define steady state before breaking anything - You cannot detect a deviation without a baseline. Before every experiment, define the precise metric (p99 latency, error rate, success count) that proves the system is healthy. If the system is already degraded, stop and fix it first.
Start small in staging, graduate to production slowly - Every experiment starts in a non-production environment. Only move to production after the hypothesis is proven correct in staging and blast radius is understood. Even in production, target a small traffic percentage or a single availability zone first.
Minimize blast radius - The experiment scope must be as small as possible. Isolate the failure to one service, one host, or one region. Have a kill switch ready before starting. The goal is learning, not causing an incident.
Build the hypothesis before turning on the failure - A hypothesis has three parts: "When X happens, the system will Y, as evidenced by Z metric." Without a pre-written hypothesis you cannot distinguish a passing experiment from an outage.
Automate experiments and run them continuously - A chaos experiment run once is a one-time curiosity. Automated experiments that run on every deploy catch regressions before production. The goal is a resilience gate in CI/CD, not a quarterly fire drill.

在故障注入前定义稳态 - 没有基准就无法检测偏差。每次实验前，定义能证明系统健康的精确指标（如p99延迟、错误率、成功请求数）。如果系统已处于降级状态，请先修复再开始实验。
从 staging 环境小范围起步，逐步扩展到生产环境 - 所有实验先在非生产环境开展。只有在staging环境验证假设成立且明确影响范围后，再迁移到生产环境。即使在生产环境，也应先针对小比例流量或单个可用区进行实验。
最小化影响范围 - 实验范围应尽可能小。将故障隔离到单个服务、主机或区域。实验前准备好终止开关（Kill Switch）。实验的目标是学习，而非引发事故。
在注入故障前建立假设 - 假设包含三个部分："当X发生时，系统会出现Y表现，可通过Z指标验证。" 没有预先编写的假设，你将无法区分实验成功与系统故障。
自动化实验并持续运行 - 仅运行一次的混沌实验只是一次性的尝试。在每次部署时自动运行实验，可在问题进入生产环境前发现回归。目标是将韧性验证纳入CI/CD流程，而非每季度一次的应急演练。

Core concepts

核心概念

Steady State Hypothesis

稳态假设

The foundation of every experiment. A steady state is a measurable, normal behavior of the system:

Hypothesis template:
  "Under normal conditions, [service] processes [metric] at [baseline value].
   When [failure condition] is introduced, [metric] will remain within [acceptable range]
   because [resilience mechanism] will compensate."

Example:
  "Under normal conditions, the checkout service processes 95% of requests in <500ms.
   When the inventory service has 500ms of added latency, checkout p99 will remain
   <800ms because the circuit breaker will open and return cached availability data."

Metrics for steady state (RED method):

Rate - requests per second
Errors - error rate (%)
Duration - latency percentiles (p50, p95, p99)

这是所有实验的基础。稳态是系统可测量的正常行为：

假设模板:
  "在正常条件下，[服务]的[指标]维持在[基准值]。
   当引入[故障条件]时，[指标]将保持在[可接受范围]内，
   因为[韧性机制]会进行补偿。"

示例:
  "在正常条件下，结账服务95%的请求响应时间<500ms。
   当库存服务增加500ms延迟时，结账服务的p99延迟仍将<800ms，
   因为断路器会打开并返回缓存的库存可用数据。"

稳态指标（RED方法）：

Rate（速率） - 每秒请求数
Errors（错误） - 错误率（%）
Duration（时长） - 延迟百分位数（p50、p95、p99）

Blast Radius

影响范围（Blast Radius）

The maximum potential impact of the experiment if something goes wrong. Always quantify before starting:

Blast radius dimension	Example	How to constrain
Traffic percentage	5% of prod requests	Feature flags, canary routing
Infrastructure scope	1 of 3 availability zones	Target specific AZ tags
Service scope	One pod/instance in the fleet	Target single hostname
Time scope	10-minute window	Automated kill switch with timeout

实验出现问题时的最大潜在影响。实验前务必量化：

影响范围维度	示例	限制方法
流量比例	生产环境5%的请求	功能开关、金丝雀路由
基础设施范围	3个可用区中的1个	针对特定AZ标签
服务范围	集群中的单个Pod/实例	针对特定主机名
时间范围	10分钟窗口	带超时的自动终止开关

Experiment Lifecycle

实验生命周期

1. DEFINE    -> Write steady state hypothesis + success/failure criteria
2. SCOPE     -> Identify target environment, blast radius, and rollback mechanism
3. INSTRUMENT -> Confirm observability is in place to measure the hypothesis metric
4. RUN       -> Inject failure; observe metric in real time
5. ANALYZE   -> Did steady state hold? If not, why? What was the real failure mode?
6. IMPROVE   -> Fix the gap. Update runbooks. Automate the experiment.
7. REPEAT    -> Re-run to confirm the fix. Graduate to broader scope.

1. 定义    -> 编写稳态假设 + 成功/失败标准
2. 划定范围 -> 确定目标环境、影响范围与回滚机制
3. 部署观测工具 -> 确认已部署可测量假设指标的观测系统
4. 运行    -> 注入故障；实时观测指标
5. 分析    -> 稳态是否保持？如果没有，原因是什么？真实的故障模式是什么？
6. 改进    -> 修复漏洞，更新运行手册，自动化实验
7. 重复    -> 重新运行实验验证修复效果，逐步扩大范围

Failure Modes Taxonomy

故障模式分类

Category	Examples	Common tools
Network	Latency, packet loss, DNS failure, partition	tc netem, Toxiproxy, Gremlin
Resource	CPU saturation, memory pressure, disk full, fd exhaustion	stress-ng, Chaos Monkey
Dependency	Service unavailable, slow response, bad responses (500/400)	Wiremock, Litmus, FIS
Infrastructure	Pod kill, node drain, AZ outage, region failover	Chaos Monkey, Litmus, FIS
Application	Exception injection, clock skew, thread pool exhaustion	Byte Monkey, custom middleware
Data	Corrupt payload, missing field, schema mismatch	Custom fuzz harness

类别	示例	常用工具
网络	延迟、丢包、DNS故障、网络分区	tc netem、Toxiproxy、Gremlin
资源	CPU饱和、内存压力、磁盘满、文件句柄耗尽	stress-ng、Chaos Monkey
依赖服务	服务不可用、响应缓慢、错误响应（500/400）	Wiremock、Litmus、FIS
基础设施	Pod销毁、节点驱逐、可用区 outage、区域故障转移	Chaos Monkey、Litmus、FIS
应用程序	异常注入、时钟偏移、线程池耗尽	Byte Monkey、自定义中间件
数据	payload损坏、字段缺失、 schema不匹配	自定义模糊测试工具

Common tasks

常见任务

Design a chaos experiment

设计混沌实验

Use this template to structure every experiment:

markdown

undefined

使用以下模板构建所有实验：

markdown

undefined

Chaos Experiment: [Short Name]

混沌实验：[简短名称]

Date: YYYY-MM-DD Hypothesis: When [failure condition], [service] will [expected behavior] as evidenced by [metric staying within range].

Steady State (before):

Metric: checkout.success_rate
Baseline: >= 99.5%
Measured via: Datadog SLO dashboard / Prometheus query

Failure injection:

Tool: Toxiproxy / Litmus / AWS FIS
Target: inventory-service, 1 of 5 pods
Type: HTTP 503 response, 100% of requests to /api/stock
Duration: 10 minutes

Blast radius:

Scope: Single pod in staging environment
Traffic affected: ~20% of inventory requests

Kill switch:

kubectl delete chaosexperiment inventory-latency

Success criteria:

checkout.success_rate remains >= 99.5% during injection
Circuit breaker opens within 30s
Fallback (cached stock) is served to users

Failure criteria:

checkout.success_rate drops below 99% for > 2 minutes
Any user-visible 500 errors during injection

Result: [PASS / FAIL] Finding: [What actually happened] Action: [Ticket number + fix description]

undefined

日期： YYYY-MM-DD 假设： 当[故障条件]发生时，[服务]将[预期行为] 可通过[指标维持在指定范围内]验证。

稳态（实验前）：

指标：checkout.success_rate
基准值：>= 99.5%
测量方式：Datadog SLO仪表盘 / Prometheus查询

故障注入：

工具：Toxiproxy / Litmus / AWS FIS
目标：inventory-service，5个Pod中的1个
类型：对/api/stock的100%请求返回HTTP 503响应
时长：10分钟

影响范围：

范围：staging环境中的单个Pod
受影响流量：约20%的库存服务请求

终止开关：

kubectl delete chaosexperiment inventory-latency

成功标准：

实验期间checkout.success_rate保持>=99.5%
断路器在30秒内打开
向用户返回降级（缓存库存）响应

失败标准：

checkout.success_rate连续2分钟低于99%
实验期间出现用户可见的500错误

结果： [通过/失败] 发现： [实际发生的情况] 行动项： [工单编号 + 修复描述]

undefined

Implement network latency injection

实现网络延迟注入

Inject latency at the network level using Linux Traffic Control (

tc

) or Toxiproxy (application-level proxy). Prefer Toxiproxy for service-specific targeting; prefer

tc

for host-level experiments.

Using Toxiproxy (service-level, recommended for staging):

bash

undefined

使用Linux流量控制工具（

tc

）或Toxiproxy（应用层代理）在网络层面注入延迟。针对服务级别的实验优先使用Toxiproxy；针对主机级别的实验优先使用

tc

。

使用Toxiproxy（服务级别，推荐用于staging环境）：

bash

undefined

Install and start Toxiproxy

安装并启动Toxiproxy

toxiproxy-server &

Create a proxy for the downstream service

为下游服务创建代理

toxiproxy-cli create --listen 0.0.0.0:8474 --upstream inventory-svc:8080 inventory_proxy

Add 200ms of latency with 50ms jitter to 100% of connections

为100%的连接添加200ms延迟和50ms抖动

toxiproxy-cli toxic add inventory_proxy
--type latency
--attribute latency=200
--attribute jitter=50
--toxicity 1.0

Point your service at localhost:8474 instead of inventory-svc:8080

将你的服务指向localhost:8474，而非inventory-svc:8080

... run the experiment, observe metrics ...

... 运行实验，观测指标 ...

Remove the toxic (kill switch)

移除延迟注入（终止开关）

toxiproxy-cli toxic remove inventory_proxy --toxicName latency_downstream


**Using tc netem (host-level, for infrastructure experiments):**

```bash

toxiproxy-cli toxic remove inventory_proxy --toxicName latency_downstream


**使用tc netem（主机级别，用于基础设施实验）：**

```bash

Add 300ms latency + 30ms jitter to all outbound traffic on eth0

为eth0网卡的所有出站流量添加300ms延迟和30ms抖动

sudo tc qdisc add dev eth0 root netem delay 300ms 30ms

Add 10% packet loss

添加10%的丢包率

sudo tc qdisc change dev eth0 root netem loss 10%

Remove (kill switch)

移除规则（终止开关）

sudo tc qdisc del dev eth0 root


> Always test the kill switch before starting the experiment. A failed kill switch
> turns a chaos experiment into a real incident.

sudo tc qdisc del dev eth0 root


> 实验前务必测试终止开关。失效的终止开关会将混沌实验变成真实的生产事故。

Simulate service dependency failure

模拟服务依赖故障

Test what happens when a downstream service becomes unavailable. Use Wiremock or a simple mock server to return error responses:

javascript

// Using Wiremock (Java/Docker) - stub 100% 503s for /api/stock
{
  "request": { "method": "GET", "urlPattern": "/api/stock/.*" },
  "response": {
    "status": 503,
    "headers": { "Content-Type": "application/json" },
    "body": "{\"error\": \"Service Unavailable\"}",
    "fixedDelayMilliseconds": 5000
  }
}

// Verify your circuit breaker opened:
//   - Log line: "Circuit breaker OPEN for inventory-service"
//   - Metric: circuit_breaker_state{service="inventory"} == 1
//   - Fallback response served to callers

Checklist for dependency failure experiments:

Circuit breaker opens within the configured threshold
Fallback value or cached response is served (not a 500)
Downstream errors do not propagate to user-facing error rate
Circuit breaker closes when dependency recovers
Alerting fires within SLO window, not after it

测试下游服务不可用时的系统表现。使用Wiremock或简单的Mock服务器返回错误响应：

javascript

// 使用Wiremock（Java/Docker）- 对/api/stock返回100%的503响应
{
  "request": { "method": "GET", "urlPattern": "/api/stock/.*" },
  "response": {
    "status": 503,
    "headers": { "Content-Type": "application/json" },
    "body": "{\"error\": \"Service Unavailable\"}",
    "fixedDelayMilliseconds": 5000
  }
}

// 验证断路器是否打开：
//   - 日志行："Circuit breaker OPEN for inventory-service"
//   - 指标：circuit_breaker_state{service="inventory"} == 1
//   - 向调用方返回降级响应

依赖故障实验检查清单：

断路器在配置阈值内打开
返回降级值或缓存响应（而非500错误）
下游错误未扩散到用户可见的错误率中
依赖恢复后断路器关闭
告警在SLO窗口内触发，而非事后触发

Run a game day - facilitation guide

故障演练（Game Day）组织指南

A game day is a structured, cross-team exercise that rehearses failure scenarios. It combines chaos experiments with human coordination practice.

Preparation (2 weeks before):

Choose a realistic scenario (e.g., "Primary database AZ goes down")
Define the experiment scope and blast radius in writing
Confirm on-call rotation and escalation paths are documented
Brief all participants: on-call engineers, product owner, leadership observer
Set up a dedicated incident Slack channel and shared dashboard link

Day-of agenda (3-hour format):

00:00 - 00:15  Kickoff: review scenario, confirm kill switches, assign roles
               Roles: Incident Commander, Chaos Operator, Scribe, Observer
00:15 - 00:30  Baseline check: confirm steady state metrics look healthy
00:30 - 01:30  Inject failure; team responds as if it were a real incident
               Scribe records every action and timestamp
01:30 - 01:45  Halt injection; confirm system recovers to steady state
01:45 - 02:30  Hot debrief: timeline walkthrough while memory is fresh
               Key questions: What surprised you? Where were the gaps?
02:30 - 03:00  Action items: each gap gets a ticket, owner, and due date

Post-game day outputs:

Updated runbook with gaps filled
Action items tracked in a backlog with SLO-aligned due dates
Recorded MTTR for the scenario (use as a benchmark for next game day)
Decision on whether to automate the experiment in CI

故障演练是结构化的跨团队演练，模拟故障场景，将混沌实验与人员协作训练相结合。

准备工作（提前2周）：

选择真实场景（如"主数据库所在可用区故障"）
书面定义实验范围与影响范围
确认值班轮换与升级路径已文档化
向所有参与者交底：值班工程师、产品负责人、管理层观察员
搭建专用的事故Slack频道与共享仪表盘链接

当日议程（3小时版）：

00:00 - 00:15  启动会：回顾场景，确认终止开关，分配角色
               角色：事故指挥官、混沌操作员、记录员、观察员
00:15 - 00:30  基准检查：确认稳态指标正常
00:30 - 01:30  注入故障；团队按真实事故流程响应
               记录员记录所有行动与时间戳
01:30 - 01:45  停止故障注入；确认系统恢复到稳态
01:45 - 02:30  即时复盘：在记忆清晰时梳理时间线
               核心问题：什么让你意外？哪里存在漏洞？
02:30 - 03:00  行动项：每个漏洞对应一个工单、负责人与截止日期

故障演练输出物：

更新后的运行手册（填补漏洞）
纳入待办事项的行动项（与SLO对齐的截止日期）
场景的MTTR记录（作为下次演练的基准）
是否将实验自动化纳入CI的决策

Test database failover

测试数据库故障转移

Verify that your application correctly handles a primary database failover without data loss or extended downtime:

bash

undefined

验证应用程序能否在数据库主节点故障转移时正确处理，无数据丢失或长时间停机：

bash

undefined

1. Confirm replication lag is near zero before starting

1. 实验前确认复制延迟接近零

psql -h replica -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"

2. Start continuous writes to the primary (background process)

2. 持续向主节点写入数据（后台进程）

while true; do psql -h primary -c "INSERT INTO chaos_probe (ts) VALUES (now());" 2>&1 sleep 0.5 done & PROBE_PID=$!

3. Inject: promote the replica (or use your cloud provider's failover API)

3. 注入故障：提升副本为新主节点（或使用云服务商的故障转移API）

AWS RDS: aws rds failover-db-cluster --db-cluster-identifier my-cluster

Manual: pg_ctl promote -D /var/lib/postgresql/data

手动操作: pg_ctl promote -D /var/lib/postgresql/data

4. Observe:

4. 观测：

- How long until the application reconnects?

- 应用程序重新连接耗时多久？

- Were any writes lost? (check probe table row count)

- 是否有写入丢失？（检查探针表行数）

- Did health checks detect the failover promptly?

- 健康检查是否及时检测到故障转移？

- Did connection pool recover without restart?

- 连接池是否无需重启即可恢复？

5. Kill the probe writer

5. 停止探针写入进程

kill $PROBE_PID

6. Measure:

6. 测量：

- Connection downtime: seconds between last successful write and first write to new primary

- 连接停机时间：最后一次成功写入到首次写入新主节点的间隔秒数

- Data loss: rows missing from probe table

- 数据丢失：探针表中缺失的行数

- Recovery time: time until application traffic normalizes

- 恢复时间：应用流量恢复正常的时间


**Success criteria:** Connection re-established within 30s, zero data loss, no
application restart required.


**成功标准：** 30秒内重新建立连接，零数据丢失，无需重启应用程序。

Implement circuit breaker validation

实现断路器验证

After implementing a circuit breaker, verify it actually works under failure conditions. This is the most commonly skipped verification step.

python

undefined

实现断路器后，务必在故障条件下验证其有效性。这是最常被跳过的验证步骤。

python

undefined

Validation test: assert circuit breaker opens under failure threshold

验证测试：断言断路器在达到故障阈值时打开

import pytest import time from unittest.mock import patch

def test_circuit_breaker_opens_on_failure_threshold(): cb = CircuitBreaker(threshold=5, reset_ms=30000) failures = 0

def failing_op():
    raise ConnectionError("downstream unavailable")

# Exhaust the threshold
for _ in range(5):
    with pytest.raises((ConnectionError, CircuitOpenError)):
        cb.call(failing_op)

# Next call must fast-fail without calling the dependency
call_count = 0
def counting_op():
    nonlocal call_count
    call_count += 1
    return "ok"

with pytest.raises(CircuitOpenError):
    cb.call(counting_op)

assert call_count == 0, "Circuit breaker must NOT call the dependency when OPEN"
assert cb.state == OPEN

def test_circuit_breaker_recovers_after_reset_timeout(): cb = CircuitBreaker(threshold=5, reset_ms=100) # 100ms for test speed # ... trip the breaker ... time.sleep(0.15) # Should transition to HALF-OPEN and allow one trial call result = cb.call(lambda: "ok") assert cb.state == CLOSED


**Experiment to run in staging:**
1. Deploy with circuit breaker configured
2. Use Toxiproxy to make the dependency return 503
3. Verify: breaker opens within threshold, fallback activates, logs confirm state transitions
4. Remove the toxic, verify: breaker moves to half-open, trial succeeds, breaker closes

import pytest import time from unittest.mock import patch

def test_circuit_breaker_opens_on_failure_threshold(): cb = CircuitBreaker(threshold=5, reset_ms=30000) failures = 0

def failing_op():
    raise ConnectionError("downstream unavailable")

# 耗尽故障阈值
for _ in range(5):
    with pytest.raises((ConnectionError, CircuitOpenError)):
        cb.call(failing_op)

# 下一次调用必须快速失败，不调用依赖服务
call_count = 0
def counting_op():
    nonlocal call_count
    call_count += 1
    return "ok"

with pytest.raises(CircuitOpenError):
    cb.call(counting_op)

assert call_count == 0, "Circuit breaker must NOT call the dependency when OPEN"
assert cb.state == OPEN

def test_circuit_breaker_recovers_after_reset_timeout(): cb = CircuitBreaker(threshold=5, reset_ms=100) # 测试用100ms超时 # ... 触发断路器 ... time.sleep(0.15) # 应切换到半开状态并允许一次测试调用 result = cb.call(lambda: "ok") assert cb.state == CLOSED


**在staging环境运行的实验：**
1. 部署配置好断路器的应用
2. 使用Toxiproxy让依赖服务返回503
3. 验证：断路器在阈值内打开，降级逻辑激活，日志确认状态切换
4. 移除故障注入，验证：断路器切换到半开状态，测试调用成功，断路器关闭

Measure and improve MTTR

测量并优化MTTR

MTTR (Mean Time to Recovery) is the primary output metric of a chaos engineering program. Improve it by reducing each phase:

Incident timeline phases:
  Detection  - time from failure start to alert firing
  Triage     - time from alert to understanding root cause
  Response   - time from diagnosis to fix applied
  Recovery   - time from fix applied to steady state restored

MTTR = Detection + Triage + Response + Recovery

Measurement query (Prometheus example):

promql

undefined

MTTR（平均恢复时间）是混沌工程体系的核心输出指标。通过减少以下各阶段耗时来优化：

事故时间线阶段：
  检测  - 故障发生到告警触发的时间
  分诊  - 告警触发到定位根因的时间
  响应  - 根因定位到修复实施的时间
  恢复  - 修复实施到系统恢复稳态的时间

MTTR = 检测 + 分诊 + 响应 + 恢复

测量查询（Prometheus示例）：

promql

undefined

Time from incident start (SLO breach) to recovery (SLO restored)

从事故开始（SLO违规）到恢复（SLO恢复）的时间

Track this per incident type in a spreadsheet; compute rolling mean

按事故类型在表格中追踪，计算滚动平均值

Alert on SLO burn rate (detection proxy):

基于SLO燃烧速率告警（检测代理）：

( rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) ) > 0.01 # >1% error rate


**Improvement levers by phase:**

| Phase | Common gap | Fix |
|---|---|---|
| Detection | Alert fires 10 min after incident | Lower burn rate window; add synthetic monitors |
| Triage | Engineers don't know which runbook to use | Link runbook URL directly in alert body |
| Response | Fix requires manual steps | Automate the fix (restart script, failover trigger) |
| Recovery | Traffic does not drain back after fix | Add health check gates to deployment pipeline |

> Track MTTR per failure category. A single average hides that your database failovers
> recover in 2 min but your certificate expiry incidents take 45 min.

---

( rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) ) > 0.01 # >1%错误率


**各阶段优化手段：**

| 阶段 | 常见漏洞 | 修复方案 |
|---|---|---|
| 检测 | 告警在事故发生10分钟后才触发 | 缩小燃烧速率窗口；添加合成监控 |
| 分诊 | 工程师不知道使用哪个运行手册 | 在告警内容中直接链接运行手册URL |
| 响应 | 修复需要手动操作 | 自动化修复（重启脚本、故障转移触发） |
| 恢复 | 修复后流量未自动回切 | 在部署流水线中添加健康检查门控 |

> 按故障类别追踪MTTR。单一平均值会掩盖问题，比如数据库故障转移2分钟即可恢复，但证书过期事故需要45分钟。

---

Anti-patterns

反模式

Anti-pattern	Why it's wrong	What to do instead
Running chaos in production before staging	Turns an experiment into an incident	Always validate hypothesis in staging first; graduate scope incrementally
No hypothesis before starting	Cannot distinguish experiment result from coincidence	Write the three-part hypothesis (condition, behavior, metric) before touching anything
Missing kill switch	Experiment cannot be stopped if it goes wrong	Test the kill switch before injecting; automate it with a timeout
Chaos without observability	Impossible to measure steady state deviation	Confirm dashboards and alerts are live before starting; abort if blind
One-time game days without automation	Resilience regresses between exercises	Automate the experiment; run in CI on every deploy or weekly schedule
Targeting production at full scale first	Single experiment can cause a real outage	Start with 1 pod / 1% traffic / 1 AZ; expand only after confirming safety

反模式	危害	正确做法
未在staging环境验证就直接在生产环境运行混沌实验	将实验变成生产事故	先在staging环境验证假设，逐步扩大范围
实验前未建立假设	无法区分实验结果与偶然事件	在操作前编写包含三部分的假设（条件、行为、指标）
未设置终止开关	实验失控时无法停止	实验前测试终止开关，添加超时自动终止
无观测系统的混沌实验	无法测量稳态偏差	实验前确认仪表盘与告警正常；无观测则终止实验
一次性故障演练未自动化	两次演练间韧性退化	自动化实验；在每次部署或每周定期运行
首次在生产环境就全量实验	单一实验即可引发生产 outage	从1个Pod / 1%流量 / 1个可用区开始；确认安全后再扩展

References

参考资料

For experiment catalogs, failure injection recipes, and advanced tooling guidance:

```
references/experiment-catalog.md
```
- ready-to-use experiments organized by failure type

Only load the references file if the current task requires a specific experiment recipe.

如需实验目录、故障注入方案与高级工具指南：

```
references/experiment-catalog.md
```
- 按故障类型分类的即用型实验

仅当当前任务需要特定实验方案时才加载参考资料文件。