root-cause-analysis
Root Cause Analysis
Systematic approaches for identifying the true source of problems, not just symptoms.
RCA Methods Overview
| Method | Best For | Complexity | Time |
|---|---|---|---|
| 5 Whys | Simple, linear problems | Low | 15-30 min |
| Fishbone | Multi-factor problems | Medium | 30-60 min |
| Fault Tree | Critical systems, safety | High | 1-4 hours |
| Timeline Analysis | Incident investigation | Medium | 30-90 min |
5 Whys Method
Iteratively ask "why" to drill down from symptom to root cause.
Process
Problem Statement: [Clear description of the issue]
│
▼
Why #1: [First level cause]
│
▼
Why #2: [Deeper cause]
│
▼
Why #3: [Even deeper]
│
▼
Why #4: [Getting to root]
│
▼
Why #5: [Root cause identified]
│
▼
Action: [Fix that addresses root cause]
Example: Production Outage
**Problem:** Website was down for 2 hours
**Why 1:** Why was the website down?
→ The application server ran out of memory and crashed.
**Why 2:** Why did the server run out of memory?
→ A memory leak in the image processing service accumulated over time.
**Why 3:** Why was there a memory leak?
→ The service wasn't releasing image buffers after processing.
**Why 4:** Why weren't buffers being released?
→ The cleanup code had a bug introduced in last week's release.
**Why 5:** Why wasn't the bug caught before release?
→ We don't have automated memory leak detection in our test suite.
**Root Cause:** Missing automated memory leak testing
**Action:** Add memory profiling to CI pipeline, add cleanup tests
5 Whys Best Practices
| Do | Don't |
|---|---|
| Base answers on evidence | Guess or assume |
| Stay focused on one causal chain | Branch too early |
| Keep asking until actionable | Stop at symptoms |
| Involve people closest to issue | Assign blame |
| Document your reasoning | Skip steps |
When 5 Whys Falls Short
- Multiple contributing factors (use Fishbone)
- Complex system interactions (use Fault Tree)
- Organizational/process issues (need broader analysis)
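The outage example above can also be recorded as data, which makes it easier to check that each answer cites evidence and that the chain ends in an action. The following is a minimal Python sketch under stated assumptions: the `Why` and `FiveWhys` classes and the `evidence` field are illustrative names, not part of any standard tooling, and the evidence value is a placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class Why:
    question: str
    answer: str
    evidence: str = ""  # hypothetical field: a log, graph, or diff reference


@dataclass
class FiveWhys:
    problem: str
    chain: list[Why] = field(default_factory=list)
    action: str = ""

    def root_cause(self) -> str:
        """Treat the last answer in the chain as the root cause."""
        return self.chain[-1].answer if self.chain else ""

    def gaps(self) -> list[str]:
        """Flag answers without evidence and a missing action."""
        issues = [
            f"Why {i}: no evidence recorded"
            for i, why in enumerate(self.chain, start=1)
            if not why.evidence
        ]
        if not self.action:
            issues.append("No action recorded for the root cause")
        return issues


rca = FiveWhys(
    problem="Website was down for 2 hours",
    chain=[
        Why("Why was the website down?",
            "The application server ran out of memory and crashed.",
            evidence="crash log"),  # placeholder evidence
        Why("Why did the server run out of memory?",
            "A memory leak in the image processing service accumulated over time."),
    ],
)
print(rca.root_cause())
print(rca.gaps())
```

Running `gaps()` before closing an analysis is one way to enforce the "base answers on evidence" and "keep asking until actionable" rows of the table above.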
Fishbone Diagram (Ishikawa)
Visualize multiple potential causes organized by category.
Standard Categories (6 M's)
                 ┌─────────────┐
    Methods ─────┤             │
                 │             │
   Machines ─────┤             │
                 │             ├──── PROBLEM
  Materials ─────┤             │
                 │             │
Measurement ─────┤             │
                 │             │
Environment ─────┤             │
                 │             │
     People ─────┤             │
                 └─────────────┘
Software-Specific Categories
                    ┌─────────────┐
          Code ─────┤             │
                    │             │
Infrastructure ─────┤             │
                    │             ├──── BUG/INCIDENT
  Dependencies ─────┤             │
                    │             │
 Configuration ─────┤             │
                    │             │
       Process ─────┤             │
                    │             │
        People ─────┤             │
                    └─────────────┘
Fishbone Example: API Latency Spike
                              ┌─────────────────┐
                              │                 │
Code ─────────────────────────┤                 │
  │                           │                 │
  ├─ N+1 query issue          │                 │
  ├─ Missing index            │   API LATENCY   │
  └─ Sync blocking call       │      SPIKE      │
                              │                 │
Infrastructure ───────────────┤                 │
  │                           │                 │
  ├─ DB connection pool       │                 │
  ├─ Network saturation       │                 │
  └─ Insufficient RAM         │                 │
                              │                 │
Dependencies ─────────────────┤                 │
  │                           │                 │
  ├─ External API slow        │                 │
  ├─ Redis timeout            │                 │
  └─ CDN cache miss           │                 │
                              └─────────────────┘
Fishbone Process
- Define the problem clearly (the fish head)
- Identify major categories (the bones)
- Brainstorm causes for each category
- Analyze relationships between causes
- Prioritize most likely root causes
- Verify with data/testing
- Take action on confirmed causes
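The brainstorm, prioritize, and verify steps above can be sketched in code. This is a rough Python illustration, not a prescribed tool: the `Cause` class, the `likelihood` score, and the `prioritize` helper are assumptions made for the example, and the scores are invented. The cause names come from the API latency example.

```python
from dataclasses import dataclass


@dataclass
class Cause:
    category: str
    description: str
    likelihood: int = 0      # hypothetical 1-5 score assigned by the team
    verified: bool = False   # set to True once data or testing confirms it


def prioritize(causes: list[Cause], top: int = 3) -> list[Cause]:
    """Return the unverified causes, highest likelihood first."""
    pending = [c for c in causes if not c.verified]
    return sorted(pending, key=lambda c: c.likelihood, reverse=True)[:top]


# Causes from the API latency example; the scores are made up.
causes = [
    Cause("Code", "N+1 query issue", likelihood=4),
    Cause("Code", "Missing index", likelihood=5),
    Cause("Infrastructure", "DB connection pool", likelihood=3),
    Cause("Dependencies", "Redis timeout", likelihood=2),
]

for cause in prioritize(causes, top=2):
    print(f"{cause.category}: {cause.description} ({cause.likelihood})")
```

Keeping `verified` separate from `likelihood` reflects the distinction in the process between a cause the team suspects and one that data has confirmed.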
Fault Tree Analysis (FTA)
Top-down, deductive analysis for critical systems.
FTA Symbols
┌─────┐
│ TOP │   Top Event (the failure being analyzed)
└──┬──┘
   │
┌──┴──┐
│ AND │   All inputs must occur for output
└─────┘
┌──┴──┐
│ OR  │   Any input causes output
└─────┘
┌─────┐
│  ○  │   Basic Event (root cause)
└─────┘
┌─────┐
│  ◇  │   Undeveloped Event (needs more analysis)
└─────┘
FTA Example: Authentication Failure
                    ┌────────────────────┐
                    │    USER CANNOT     │
                    │    AUTHENTICATE    │
                    └─────────┬──────────┘
                              │
                          ┌───┴───┐
                          │  OR   │
                          └───┬───┘
           ┌──────────────────┼──────────────────┐
           │                  │                  │
    ┌──────┴──────┐    ┌──────┴──────┐    ┌──────┴──────┐
    │   Invalid   │    │    Auth     │    │   Account   │
    │ Credentials │    │   Service   │    │   Locked    │
    │             │    │    Down     │    │             │
    └──────┬──────┘    └──────┬──────┘    └─────────────┘
           │                  │
       ┌───┴───┐          ┌───┴───┐
       │  OR   │          │  OR   │
       └───┬───┘          └───┬───┘
    ┌──────┼──────┐    ┌──────┼──────┐
    │      │      │    │      │      │
    ○      ○      ○    ○      ○      ◇
  Wrong Expired Token  DB   Redis External
Password Token Invalid Down  Down   Auth
When to Use FTA
- Safety-critical systems
- Complex failure modes
- Need to identify all paths to failure
- Regulatory compliance requirements
- Post-incident analysis for serious outages
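The AND/OR gate semantics from the symbols section can be made concrete by evaluating the authentication fault tree in code. The sketch below is a simplified Python model: the `Node` class and the `occurs` and `basic_events` helpers are illustrative names. It also assumes that "Account locked" and the undeveloped "External auth" event can be treated as plain leaf events.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    gate: str = "BASIC"  # "AND", "OR", or "BASIC" for leaf events
    children: list["Node"] = field(default_factory=list)


def occurs(node: Node, active: set[str]) -> bool:
    """Return True if the event at `node` occurs given the active leaf events."""
    if node.gate == "BASIC":
        return node.name in active
    results = [occurs(child, active) for child in node.children]
    return all(results) if node.gate == "AND" else any(results)


def basic_events(node: Node) -> list[str]:
    """Collect the leaf events, i.e. the candidate root causes."""
    if node.gate == "BASIC":
        return [node.name]
    return [name for child in node.children for name in basic_events(child)]


tree = Node("User cannot authenticate", "OR", [
    Node("Invalid credentials", "OR", [
        Node("Wrong password"), Node("Expired token"), Node("Token invalid"),
    ]),
    Node("Auth service down", "OR", [
        Node("DB down"), Node("Redis down"),
        Node("External auth"),  # undeveloped in the diagram; a leaf here
    ]),
    Node("Account locked"),  # assumption: treated as a basic event
])

print(occurs(tree, {"Redis down"}))  # True: one OR input is enough
print(occurs(tree, set()))           # False: no leaf event is active
print(len(basic_events(tree)))       # 7
```

Because every gate in this tree is an OR, any single leaf event produces the top event. Changing a gate to "AND" would require all of its inputs to occur at once.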
Timeline Analysis
Reconstruct the sequence of events to identify causation.
Timeline Template
Incident Timeline: [Incident Name]
Summary
- Incident Start: [Timestamp]
- Incident Detected: [Timestamp]
- Incident Resolved: [Timestamp]
- Total Duration: [X hours Y minutes]
- Time to Detect: [X minutes]
- Time to Resolve: [X hours Y minutes]
Detailed Timeline
| Time (UTC) | Event | Source | Actor |
|---|---|---|---|
| 14:00 | Deployment started | CI/CD | automated |
| 14:05 | Deployment completed | CI/CD | automated |
| 14:15 | Error rate increased 10x | Monitoring | - |
| 14:22 | Alert fired | PagerDuty | - |
| 14:25 | On-call acknowledged | PagerDuty | @alice |
| 14:30 | Root cause identified | Investigation | @alice |
| 14:35 | Rollback initiated | Manual | @alice |
| 14:40 | Services recovered | Monitoring | - |
| 14:45 | Incident resolved | Manual | @alice |
Analysis
Contributing Factors:
- [Factor 1]
- [Factor 2]
What Went Well:
- [Positive observation]
What Could Improve:
- [Improvement area]
Action Items
| Action | Owner | Due Date | Status |
|---|---|---|---|
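The duration fields in the template's Summary section can be derived from the detailed timeline instead of being filled in by hand. The Python sketch below uses the sample rows from the table. Which events mark the start, detection, and resolution of the incident is an assumption stated in the comments, and `time_of` and `minutes` are hypothetical helpers.

```python
from datetime import datetime

# Rows from the detailed timeline above (UTC, same day).
timeline = [
    ("14:00", "Deployment started"),
    ("14:05", "Deployment completed"),
    ("14:15", "Error rate increased 10x"),
    ("14:22", "Alert fired"),
    ("14:25", "On-call acknowledged"),
    ("14:30", "Root cause identified"),
    ("14:35", "Rollback initiated"),
    ("14:40", "Services recovered"),
    ("14:45", "Incident resolved"),
]


def time_of(event: str) -> datetime:
    """Look up an event by name and parse its HH:MM timestamp."""
    for stamp, name in timeline:
        if name == event:
            return datetime.strptime(stamp, "%H:%M")
    raise KeyError(event)


def minutes(later: datetime, earlier: datetime) -> int:
    """Whole minutes between two timestamps."""
    return int((later - earlier).total_seconds() // 60)


# Assumptions: the incident starts at the first visible symptom, is
# detected when the alert fires, and ends at "Incident resolved".
# Time to resolve is measured from detection, not from the start.
start = time_of("Error rate increased 10x")
detected = time_of("Alert fired")
resolved = time_of("Incident resolved")

time_to_detect = minutes(detected, start)
time_to_resolve = minutes(resolved, detected)
total_duration = minutes(resolved, start)

print(time_to_detect, time_to_resolve, total_duration)
```

Teams define these boundaries differently (for example, starting the clock at the deployment rather than at the first symptom), so the chosen definition is worth stating alongside the numbers.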
Debugging Decision Tree
          Problem Reported
                 │
                 ▼
        Can you reproduce it?
          │             │
         Yes            No
          │             │
          ▼             ▼
    Isolate the     Gather more
    conditions      information
          │             │
          ▼             ▼
   Recent changes?  Check logs,
          │         monitoring
         Yes            │
          │             │
          ▼             ▼
    Review diffs    Correlation
    & deploys       analysis
          │             │
          └──────┬──────┘
                 │
                 ▼
          Form hypothesis
                 │
                 ▼
          Test hypothesis
                 │
          ┌──────┴──────┐
          │             │
      Confirmed      Rejected
          │             │
          ▼             ▼
       Fix and    Next hypothesis
       verify
RCA Documentation Template
Root Cause Analysis: [Issue Title]
Issue Summary
Reported: [Date]
Severity: P0 / P1 / P2 / P3
Impact: [Description of impact]
Problem Statement
[Clear, specific description of what went wrong]
Investigation
Timeline
[Key events in sequence]
Analysis Method Used
[ ] 5 Whys
[ ] Fishbone
[ ] Fault Tree
[ ] Timeline Analysis
Findings
[Detailed analysis results]
Root Cause(s)
- Primary: [Main root cause]
- Contributing: [Secondary factors]
Immediate Fix
[What was done to resolve the immediate issue]
Preventive Actions
| Action | Owner | Due | Status |
|---|---|---|---|
Lessons Learned
- [Key takeaway]
- [Process improvement]
Appendix
- [Links to logs, graphs, related tickets]
Best Practices
- Blameless postmortems: Focus on systems, not individuals
- Automated correlation: Use AI to correlate signals across systems
- Proactive RCA: Analyze near-misses, not just incidents
- Knowledge sharing: Document and share RCA findings
- Metrics-driven: Track time-to-detect, time-to-resolve trends
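As an illustration of the metrics-driven practice, the sketch below averages time-to-detect and time-to-resolve per month so that a trend becomes visible. The incident records are hypothetical sample data, and `monthly_means` is an illustrative helper, not an existing tool.

```python
from statistics import mean

# Hypothetical records: (month, minutes to detect, minutes to resolve).
incidents = [
    ("2024-01", 12, 95),
    ("2024-01", 8, 40),
    ("2024-02", 7, 30),
    ("2024-02", 5, 50),
]


def monthly_means(records):
    """Group incidents by month and average each metric."""
    by_month = {}
    for month, ttd, ttr in records:
        by_month.setdefault(month, []).append((ttd, ttr))
    return {
        month: (mean(t for t, _ in rows), mean(r for _, r in rows))
        for month, rows in sorted(by_month.items())
    }


for month, (ttd, ttr) in monthly_means(incidents).items():
    print(f"{month}: mean time-to-detect {ttd} min, mean time-to-resolve {ttr} min")
```

With only a few incidents per month, the mean is sensitive to a single long outage, so a median or a longer window may give a steadier picture.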
Related Skills
- observability-monitoring - Gathering data for RCA
- errors - Error pattern analysis
- resilience-patterns - Preventing future incidents
Version: 1.0.0 (January )