root-cause-analysis

Root Cause Analysis

Systematic approaches for identifying the true source of problems, not just symptoms.

RCA Methods Overview

| Method | Best For | Complexity | Time |
| --- | --- | --- | --- |
| 5 Whys | Simple, linear problems | Low | 15-30 min |
| Fishbone | Multi-factor problems | Medium | 30-60 min |
| Fault Tree | Critical systems, safety | High | 1-4 hours |
| Timeline Analysis | Incident investigation | Medium | 30-90 min |

5 Whys Method

Iteratively ask "why" to drill down from symptom to root cause.

Process

Problem Statement: [Clear description of the issue]
Why #1: [First level cause]
Why #2: [Deeper cause]
Why #3: [Even deeper]
Why #4: [Getting to root]
Why #5: [Root cause identified]
Action: [Fix that addresses root cause]

Example: Production Outage

**Problem:** Website was down for 2 hours

**Why 1:** Why was the website down?
→ The application server ran out of memory and crashed.

**Why 2:** Why did the server run out of memory?
→ A memory leak in the image processing service accumulated over time.

**Why 3:** Why was there a memory leak?
→ The service wasn't releasing image buffers after processing.

**Why 4:** Why weren't buffers being released?
→ The cleanup code had a bug introduced in last week's release.

**Why 5:** Why wasn't the bug caught before release?
→ We don't have automated memory leak detection in our test suite.

**Root Cause:** Missing automated memory leak testing
**Action:** Add memory profiling to CI pipeline, add cleanup tests
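
The chain in the example above can also be kept as structured data, which makes it easier to review and to render consistently. The following is a minimal sketch, not part of the original method: the `FiveWhys` class, its `ask` method, and the `deepest_answer` property are hypothetical names, and treating the last recorded answer as the candidate root cause is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class FiveWhys:
    """One causal chain: a problem, its question/answer pairs, and an action."""
    problem: str
    whys: list = field(default_factory=list)  # (question, answer) tuples
    action: str = ""

    def ask(self, question: str, answer: str) -> "FiveWhys":
        # Each answer should be backed by evidence before it is recorded.
        self.whys.append((question, answer))
        return self

    @property
    def deepest_answer(self) -> str:
        # Assumption: the last recorded answer is the candidate root cause.
        if not self.whys:
            raise ValueError("no answers recorded yet")
        return self.whys[-1][1]

    def to_markdown(self) -> str:
        lines = [f"**Problem:** {self.problem}", ""]
        for number, (question, answer) in enumerate(self.whys, start=1):
            lines += [f"**Why {number}:** {question}", f"→ {answer}", ""]
        lines.append(f"**Root Cause:** {self.deepest_answer}")
        if self.action:
            lines.append(f"**Action:** {self.action}")
        return "\n".join(lines)


chain = FiveWhys(
    problem="Website was down for 2 hours",
    action="Add memory profiling to CI pipeline, add cleanup tests",
)
chain.ask("Why was the website down?",
          "The application server ran out of memory and crashed.")
chain.ask("Why did the server run out of memory?",
          "A memory leak in the image processing service accumulated over time.")
chain.ask("Why was there a memory leak?",
          "The service wasn't releasing image buffers after processing.")
chain.ask("Why weren't buffers being released?",
          "The cleanup code had a bug introduced in last week's release.")
chain.ask("Why wasn't the bug caught before release?",
          "We don't have automated memory leak detection in our test suite.")

print(chain.to_markdown())
```

Keeping the chain as a list of pairs makes it obvious when an analysis stopped early or skipped a step.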

5 Whys Best Practices

| Do | Don't |
| --- | --- |
| Base answers on evidence | Guess or assume |
| Stay focused on one causal chain | Branch too early |
| Keep asking until actionable | Stop at symptoms |
| Involve people closest to issue | Assign blame |
| Document your reasoning | Skip steps |

When 5 Whys Falls Short

  • Multiple contributing factors (use Fishbone)
  • Complex system interactions (use Fault Tree)
  • Organizational/process issues (need broader analysis)
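
The guidance above can be expressed as a small selection helper. This is only a sketch: the function name, its parameters, and the order in which the conditions are checked are assumptions, not rules from this document.

```python
def suggest_method(multiple_factors: bool = False,
                   complex_interactions: bool = False,
                   organizational: bool = False) -> str:
    """Suggest an RCA method from coarse problem characteristics."""
    # Assumed priority: the most demanding characteristic wins.
    if complex_interactions:
        return "Fault Tree"
    if multiple_factors:
        return "Fishbone"
    if organizational:
        return "Broader analysis"
    # Simple, linear problems stay with 5 Whys.
    return "5 Whys"


print(suggest_method())                        # simple, linear problem
print(suggest_method(multiple_factors=True))   # several contributing factors
```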

Fishbone Diagram (Ishikawa)

Visualize multiple potential causes organized by category.

Standard Categories (6 M's)

                    ┌─────────────┐
        Methods ────┤             │
                    │             │
      Machines ─────┤             │
                    │             ├──── PROBLEM
     Materials ─────┤             │
                    │             │
    Measurement ────┤             │
                    │             │
    Environment ────┤             │
                    │             │
       People ──────┤             │
                    └─────────────┘

Software-Specific Categories

                    ┌─────────────┐
          Code ─────┤             │
                    │             │
 Infrastructure ────┤             │
                    │             ├──── BUG/INCIDENT
   Dependencies ────┤             │
                    │             │
   Configuration ───┤             │
                    │             │
        Process ────┤             │
                    │             │
        People ─────┤             │
                    └─────────────┘

Fishbone Example: API Latency Spike

                              ┌─────────────────┐
                              │                 │
        Code ─────────────────┤                 │
         │                    │                 │
         ├─ N+1 query issue   │                 │
         ├─ Missing index     │   API LATENCY   │
         └─ Sync blocking call│      SPIKE      │
                              │                 │
  Infrastructure ─────────────┤                 │
         │                    │                 │
         ├─ DB connection pool│                 │
         ├─ Network saturation│                 │
         └─ Insufficient RAM  │                 │
                              │                 │
  Dependencies ───────────────┤                 │
         │                    │                 │
         ├─ External API slow │                 │
         ├─ Redis timeout     │                 │
         └─ CDN cache miss    │                 │
                              └─────────────────┘

Fishbone Process

  1. Define the problem clearly (the fish head)
  2. Identify major categories (the bones)
  3. Brainstorm causes for each category
  4. Analyze relationships between causes
  5. Prioritize most likely root causes
  6. Verify with data/testing
  7. Take action on confirmed causes
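
Steps 2 to 7 above can be tracked with a simple data structure once the brainstorm is done. The sketch below is illustrative only: the `Cause` class and helper functions are hypothetical, and the likelihood scores are made-up team estimates for a few causes taken from the API latency example.

```python
from dataclasses import dataclass


@dataclass
class Cause:
    category: str            # the "bone" this cause hangs from
    description: str
    likelihood: int          # hypothetical team estimate, 1 (low) to 5 (high)
    verified: bool = False   # set to True once data or testing confirms it


def group_by_category(causes):
    """Return {category: [descriptions]} in brainstorm order."""
    grouped = {}
    for cause in causes:
        grouped.setdefault(cause.category, []).append(cause.description)
    return grouped


def prioritize(causes):
    """Most likely causes first, so verification effort goes there."""
    return sorted(causes, key=lambda cause: cause.likelihood, reverse=True)


def confirmed(causes):
    """Only causes backed by data or testing should drive action."""
    return [cause for cause in causes if cause.verified]


causes = [
    Cause("Code", "N+1 query issue", 4),
    Cause("Code", "Missing index", 5),
    Cause("Infrastructure", "Network saturation", 2),
    Cause("Dependencies", "Redis timeout", 3),
]
causes[1].verified = True  # e.g. a query plan showed a full table scan

for cause in prioritize(causes):
    status = "verified" if cause.verified else "unverified"
    print(f"{cause.likelihood} {cause.category}: {cause.description} ({status})")
```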

Fault Tree Analysis (FTA)

Top-down, deductive analysis for critical systems.

FTA Symbols

┌─────┐
│ TOP │  Top Event (the failure being analyzed)
└──┬──┘
┌──┴──┐
│ AND │  All inputs must occur for output
└─────┘

┌──┴──┐
│ OR  │  Any input causes output
└─────┘

┌─────┐
│  ○  │  Basic Event (root cause)
└─────┘

┌─────┐
│  ◇  │  Undeveloped Event (needs more analysis)
└─────┘

FTA Example: Authentication Failure

                    ┌────────────────────┐
                    │   USER CANNOT      │
                    │   AUTHENTICATE     │
                    └─────────┬──────────┘
                          ┌───┴───┐
                          │  OR   │
                          └───┬───┘
           ┌──────────────────┼──────────────────┐
           │                  │                  │
    ┌──────┴──────┐    ┌──────┴──────┐    ┌──────┴──────┐
    │  Invalid    │    │   Auth      │    │  Account    │
    │  Credentials│    │   Service   │    │  Locked     │
    │             │    │   Down      │    │             │
    └──────┬──────┘    └──────┬──────┘    └─────────────┘
           │                  │
       ┌───┴───┐          ┌───┴───┐
       │  OR   │          │  OR   │
       └───┬───┘          └───┬───┘
    ┌──────┼──────┐    ┌──────┼──────┐
    │      │      │    │      │      │
   ○       ○      ○    ○      ○      ◇
 Wrong   Expired Token DB   Redis  External
Password  Token  Invalid Down  Down   Auth
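
A fault tree like the one above can be evaluated mechanically: given the set of basic events that have occurred, walk the gates upward to see whether the top event occurs. The following is a minimal sketch under stated assumptions. The `Event` and `Gate` classes and the `occurs` function are hypothetical names, "Account Locked" and the undeveloped "External Auth" event are treated as plain leaf events, and no probabilities are modeled.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """A leaf of the tree: a basic or undeveloped event."""
    name: str


@dataclass
class Gate:
    """An intermediate event whose output depends on its inputs."""
    name: str
    kind: str     # "AND" or "OR"
    inputs: list  # child Gate or Event nodes


def occurs(node, active: set) -> bool:
    """Return True if `node` occurs given the set of active leaf event names."""
    if isinstance(node, Event):
        return node.name in active
    results = [occurs(child, active) for child in node.inputs]
    if node.kind == "AND":
        return all(results)   # all inputs must occur
    if node.kind == "OR":
        return any(results)   # any input causes the output
    raise ValueError(f"unknown gate kind: {node.kind}")


top = Gate("User cannot authenticate", "OR", [
    Gate("Invalid credentials", "OR", [
        Event("Wrong Password"), Event("Expired Token"), Event("Token Invalid"),
    ]),
    Gate("Auth service down", "OR", [
        Event("DB Down"), Event("Redis Down"), Event("External Auth"),
    ]),
    Event("Account Locked"),
])

print(occurs(top, {"Redis Down"}))  # a single leaf is enough under OR gates
print(occurs(top, set()))           # no active events, no failure
```

Because every gate in this example is an OR gate, any single leaf event is enough to trigger the top event. An AND gate would require all of its inputs.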

When to Use FTA

  • Safety-critical systems
  • Complex failure modes
  • Need to identify all paths to failure
  • Regulatory compliance requirements
  • Post-incident analysis for serious outages

Timeline Analysis

Reconstruct the sequence of events to identify causation.

Timeline Template

Incident Timeline: [Incident Name]

Summary

  • Incident Start: [Timestamp]
  • Incident Detected: [Timestamp]
  • Incident Resolved: [Timestamp]
  • Total Duration: [X hours Y minutes]
  • Time to Detect: [X minutes]
  • Time to Resolve: [X hours Y minutes]

Detailed Timeline

| Time (UTC) | Event | Source | Actor |
| --- | --- | --- | --- |
| 14:00 | Deployment started | CI/CD | automated |
| 14:05 | Deployment completed | CI/CD | automated |
| 14:15 | Error rate increased 10x | Monitoring | - |
| 14:22 | Alert fired | PagerDuty | - |
| 14:25 | On-call acknowledged | PagerDuty | @alice |
| 14:30 | Root cause identified | Investigation | @alice |
| 14:35 | Rollback initiated | Manual | @alice |
| 14:40 | Services recovered | Monitoring | - |
| 14:45 | Incident resolved | Manual | @alice |
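
The summary fields of the template can be derived from the detailed timeline. The sketch below works through the sample rows above under a few assumptions that the template itself does not state: the incident starts at the first error-rate increase, detection is the moment the alert fired, time to resolve is measured from detection, and all events fall on the same day. The `at` helper is a hypothetical name.

```python
from datetime import datetime, timedelta

# (time, event) pairs copied from the sample timeline above.
timeline = [
    ("14:00", "Deployment started"),
    ("14:05", "Deployment completed"),
    ("14:15", "Error rate increased 10x"),
    ("14:22", "Alert fired"),
    ("14:25", "On-call acknowledged"),
    ("14:30", "Root cause identified"),
    ("14:35", "Rollback initiated"),
    ("14:40", "Services recovered"),
    ("14:45", "Incident resolved"),
]


def at(event_name: str) -> datetime:
    """Return the timestamp of the first timeline row with this event name."""
    for time_text, event in timeline:
        if event == event_name:
            return datetime.strptime(time_text, "%H:%M")
    raise KeyError(event_name)


started = at("Error rate increased 10x")   # assumed incident start
detected = at("Alert fired")               # assumed detection point
resolved = at("Incident resolved")

time_to_detect = detected - started
time_to_resolve = resolved - detected      # assumed to run from detection
total_duration = resolved - started

print(f"Time to Detect:  {time_to_detect}")
print(f"Time to Resolve: {time_to_resolve}")
print(f"Total Duration:  {total_duration}")
```

If a team counts the incident from the deployment rather than from the first symptom, only the `started` line changes.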

Analysis

Contributing Factors:
  1. [Factor 1]
  2. [Factor 2]
What Went Well:
  1. [Positive observation]
What Could Improve:
  1. [Improvement area]

Action Items

| Action | Owner | Due Date | Status |
| --- | --- | --- | --- |

Debugging Decision Tree

                    Problem Reported
                          │
                          ▼
               Can you reproduce it?
                    │           │
                   Yes          No
                    │           │
                    ▼           ▼
            Isolate the      Gather more
            conditions       information
                    │           │
                    ▼           ▼
            Recent changes?  Check logs,
                    │        monitoring
                   Yes          │
                    │           │
                    ▼           ▼
            Review diffs    Correlation
            & deploys       analysis
                    │           │
                    └─────┬─────┘
                          │
                          ▼
                   Form hypothesis
                          │
                          ▼
                    Test hypothesis
                          │
                    ┌─────┴─────┐
                    │           │
               Confirmed     Rejected
                    │           │
                    ▼           ▼
               Fix and      Next hypothesis
               verify
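
The decision tree above reduces to two parts: a branch on reproducibility that decides how evidence is gathered, and a loop that tests hypotheses until one is confirmed. The following is a rough sketch of that flow. The function names are hypothetical, and the `test` callable stands in for whatever experiment confirms or rejects a hypothesis.

```python
def first_steps(reproducible: bool) -> list:
    """Return the evidence-gathering steps for each branch of the tree."""
    if reproducible:
        return ["Isolate the conditions",
                "Check for recent changes",
                "Review diffs & deploys"]
    return ["Gather more information",
            "Check logs, monitoring",
            "Correlation analysis"]


def investigate(hypotheses, test):
    """Test hypotheses in order; return the first confirmed one, or None."""
    for hypothesis in hypotheses:
        if test(hypothesis):
            return hypothesis   # confirmed: fix and verify
        # rejected: move on to the next hypothesis
    return None


# Illustrative run with a stand-in test function.
candidates = ["Bad config value", "Cleanup bug in last release", "Slow dependency"]
confirmed = investigate(candidates, lambda h: h == "Cleanup bug in last release")

print(first_steps(reproducible=True))
print(confirmed)
```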

RCA Documentation Template

Root Cause Analysis: [Issue Title]

Issue Summary

Reported: [Date]
Severity: P0 / P1 / P2 / P3
Impact: [Description of impact]

Problem Statement

[Clear, specific description of what went wrong]

Investigation

Timeline

[Key events in sequence]

Analysis Method Used

[ ] 5 Whys [ ] Fishbone [ ] Fault Tree [ ] Timeline Analysis

Findings

[Detailed analysis results]

Root Cause(s)

  1. Primary: [Main root cause]
  2. Contributing: [Secondary factors]

Immediate Fix

[What was done to resolve the immediate issue]

Preventive Actions

| Action | Owner | Due | Status |
| --- | --- | --- | --- |

Lessons Learned

  1. [Key takeaway]
  2. [Process improvement]

Appendix

  • [Links to logs, graphs, related tickets]

Best Practices

  • Blameless postmortems: Focus on systems, not individuals
  • Automated correlation: Use AI to correlate signals across systems
  • Proactive RCA: Analyze near-misses, not just incidents
  • Knowledge sharing: Document and share RCA findings
  • Metrics-driven: Track time-to-detect, time-to-resolve trends

Related Skills

  • observability-monitoring
    - Gathering data for RCA
  • errors
    - Error pattern analysis
  • resilience-patterns
    - Preventing future incidents

References

  • 5 Whys Workshop Guide
  • Fishbone Template
Version: 1.0.0 (January )