Debugging
Systematic methodology for finding and fixing bugs. Prioritizes root cause analysis over symptom treatment, evidence over intuition, and prevention over recurrence.
Iron Law
No fix without root cause. Never apply a fix until you can explain WHY the bug exists, not just WHERE it manifests. Symptom-level fixes create new bugs.
When to Use
- Bug report from QA or production alert
- Test failure with unclear cause
- Intermittent/flaky behavior
- Performance degradation
- Unexpected behavior that "used to work"
- Integration failures between components
Workflow
Phase 1: Reproduce
Establish a reliable reproduction before investigating.
- Collect all evidence — error messages, stack traces, logs, screenshots, user steps
- Identify the exact conditions: environment, data state, user actions, timing
- Create a minimal reproduction — strip away everything that isn't needed to trigger the bug
- Confirm reproduction is consistent (if intermittent, note frequency and conditions)
- Write down the reproduction steps precisely — someone else should be able to follow them
Output: Documented reproduction steps, minimal test case
If you cannot reproduce: Document what you tried, check environment differences, add instrumentation and wait for next occurrence. Do not proceed to Phase 2 on guesswork — unreproducible bugs get logged, not "fixed."
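For intermittent bugs, "note frequency and conditions" can be made concrete by running the minimal reproduction many times and counting failures. The following is a minimal sketch, not a prescribed tool: `measure_failure_rate` and `make_flaky_parse` are hypothetical names, and the flaky function is a stand-in that simulates a failure on roughly one input in ten.

```python
import random
from typing import Callable


def measure_failure_rate(action: Callable[[], None], runs: int = 200) -> float:
    """Run `action` repeatedly and return the fraction of runs that raised."""
    failures = 0
    for _ in range(runs):
        try:
            action()
        except Exception:
            failures += 1
    return failures / runs


def make_flaky_parse(rng: random.Random) -> Callable[[], None]:
    """Hypothetical stand-in for the code under investigation.

    It fails whenever the simulated input is empty, which is assumed
    to happen for roughly one input in ten.
    """
    def flaky_parse() -> None:
        payload = "" if rng.random() < 0.1 else "42"
        int(payload)  # raises ValueError on the empty payload

    return flaky_parse


if __name__ == "__main__":
    # A seeded generator keeps the measurement repeatable between runs.
    rate = measure_failure_rate(make_flaky_parse(random.Random(0)), runs=1000)
    print(f"failed in {rate:.1%} of runs")
```

Recording the measured rate next to the reproduction steps gives later phases a baseline: a fix that only lowers the rate has not removed the cause.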
Phase 2: Investigate
Gather evidence systematically. Do NOT form hypotheses yet — this phase is about observation, not explanation.
- Read the full error message and stack trace — every line, not just the first one
- Check git history: what changed recently? (`git log --since="2 weeks ago"`, `git bisect`)
- Trace the data flow: follow the input from entry point to failure point
- Check boundaries — where does data cross component/service/layer boundaries?
- Collect environmental context — versions, configuration, dependencies, resource state
- Map the blast radius — what else is affected? Is this an isolated failure or systemic?
Production vs development debugging:
- Production: Prioritize impact assessment and mitigation first. Can you reduce blast radius before investigating? Read-only access only — never debug by modifying production state.
- Development: You have full control. Use breakpoints, modify state, add temporary logging freely.
Output: Evidence log (what you found, where, timestamps), affected component map
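One way to keep the evidence log structured is a small append-only record of what was found, where, and when. This is only a sketch under assumed names (`Evidence`, `EvidenceLog`); a shared document or ticket comment serves the same purpose.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Evidence:
    observation: str  # what was found
    location: str     # where: file, service, log stream, ...
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


@dataclass
class EvidenceLog:
    entries: list[Evidence] = field(default_factory=list)

    def record(self, observation: str, location: str) -> Evidence:
        """Append one observation; entries are never edited afterwards."""
        entry = Evidence(observation, location)
        self.entries.append(entry)
        return entry

    def affected_components(self) -> set[str]:
        """Distinct locations seen so far: a rough affected-component map."""
        return {entry.location for entry in self.entries}
```

Entries hold observations only. Explanations belong to Phase 3, which keeps the log usable when the first hypothesis turns out to be wrong.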
Phase 3: Hypothesize
Form competing hypotheses ranked by evidence strength.
- List ALL plausible causes — do not anchor on the first idea
- Classify each hypothesis by bug category (see bug categories reference)
- Rate each: evidence strength (strong/medium/weak), testability (easy/medium/hard), likelihood
- Pick the most likely AND most testable hypothesis first
- Define what would CONFIRM and what would FALSIFY each hypothesis
Example hypothesis table:
| # | Hypothesis | Category | Evidence | Testability | Test Plan |
|---|---|---|---|---|---|
| 1 | Cache returns stale data after update | State | Log shows old value 2s after write | Easy | Bypass cache and compare |
| 2 | Race condition between two workers | Race condition | Intermittent, high load correlation | Medium | Add locking, stress test |
| 3 | Upstream API returns unexpected format | Integration | No evidence yet | Easy | Log raw response |
Output: Ranked hypothesis list with evidence and test plan
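The ranking step can be sketched as a sort over rated hypotheses. The numeric scores, the `none` evidence level, and the ratings assigned to the three example rows are assumptions made for illustration; the guide itself does not define a scoring scheme.

```python
from dataclasses import dataclass

# Assumed scoring: higher means "test this sooner".
EVIDENCE_SCORE = {"strong": 3, "medium": 2, "weak": 1, "none": 0}
TESTABILITY_SCORE = {"easy": 3, "medium": 2, "hard": 1}


@dataclass(frozen=True)
class Hypothesis:
    summary: str
    category: str
    evidence: str      # key of EVIDENCE_SCORE
    testability: str   # key of TESTABILITY_SCORE
    confirms_if: str   # observation that would confirm it
    falsifies_if: str  # observation that would falsify it


def rank(hypotheses: list[Hypothesis]) -> list[Hypothesis]:
    """Order by evidence strength first, then by how easy the test is."""
    return sorted(
        hypotheses,
        key=lambda h: (EVIDENCE_SCORE[h.evidence], TESTABILITY_SCORE[h.testability]),
        reverse=True,
    )


candidates = [
    Hypothesis("Upstream API returns unexpected format", "Integration",
               "none", "easy",
               "raw response differs from contract", "raw response matches contract"),
    Hypothesis("Race condition between two workers", "Race condition",
               "medium", "medium",
               "failure rate rises with worker count", "fails with a single worker"),
    Hypothesis("Cache returns stale data after update", "State",
               "strong", "easy",
               "bypassing cache returns fresh value", "bypassing cache still returns old value"),
]

for position, hypothesis in enumerate(rank(candidates), start=1):
    print(position, hypothesis.summary)
```

Making `confirms_if` and `falsifies_if` required fields forces each hypothesis to state its test outcome before any test is run.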
Phase 4: Test
Validate one hypothesis at a time. Single-variable changes only.
- Change ONE thing and observe the result
- If confirmed: proceed to Phase 5
- If falsified: update the evidence log, then move on to the next hypothesis
- If inconclusive: add more instrumentation and gather more evidence
- After 3 failed hypotheses: STOP. Re-examine your assumptions. The bug model may be wrong.
Red flags (return to Phase 2 immediately):
- "Quick fix for now, investigate later"
- Changing multiple things at once
- Fixing without understanding
- Copy-pasting a fix from the internet without understanding why it works
Output: Confirmed root cause with evidence chain
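Single-variable testing can be sketched as a loop that applies exactly one change to a known-bad baseline per trial. The function name, the configuration keys, and the `reproduces_bug` callback are hypothetical; the callback stands for whatever reproduction was built in Phase 1.

```python
from typing import Any, Callable


def single_variable_trials(
    baseline: dict[str, Any],
    candidates: dict[str, Any],
    reproduces_bug: Callable[[dict[str, Any]], bool],
) -> dict[str, bool]:
    """Apply each candidate change alone and report whether the bug remains.

    Every trial starts from a fresh copy of `baseline`, so no two
    changes are ever active at the same time.
    """
    results: dict[str, bool] = {}
    for key, value in candidates.items():
        trial = {**baseline, key: value}  # exactly one key differs
        results[key] = reproduces_bug(trial)
    return results


if __name__ == "__main__":
    baseline = {"cache_enabled": True, "workers": 4}
    changes = {"cache_enabled": False, "workers": 1}
    # Assumed stand-in: here the bug appears only while the cache is on.
    outcome = single_variable_trials(baseline, changes, lambda cfg: cfg["cache_enabled"])
    print(outcome)
```

A `False` entry means the bug disappeared when only that variable changed, which points at that variable. A `True` entry falsifies the hypothesis behind it.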
Phase 5: Fix
Implement the fix at the source, not at the symptom.
- Write a failing test that reproduces the bug FIRST
- Implement the fix — single, focused change addressing the root cause
- Verify the failing test now passes
- Run the full test suite — ensure no regressions
- Review your own fix: is this the simplest correct solution?
Fix principles:
- Fix at the SOURCE where bad data/state originates, not where the error appears
- Add defense-in-depth: validate at boundaries even after fixing the source
- Prefer making invalid states unrepresentable over runtime validation
- One bug = one fix = one commit = one test
Output: Fix with regression test, clean test suite
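The test-first order can be illustrated with a small, made-up bug. Assume a hypothetical `parse_quantity` that used to call `int(text)` directly and crashed on input such as `"1,000"`. The regression test is written first and fails, then the one-line fix makes it pass.

```python
import unittest


def parse_quantity(text: str) -> int:
    """Hypothetical function under repair.

    Assumed root cause: upstream sends thousands separators, and the
    original body was `return int(text)`, which raises ValueError on them.
    """
    return int(text.replace(",", ""))


class ParseQuantityRegressionTest(unittest.TestCase):
    def test_accepts_thousands_separator(self):
        # Written before the fix; it failed with ValueError at that point.
        self.assertEqual(parse_quantity("1,000"), 1000)

    def test_plain_number_still_parses(self):
        # Guards against the fix breaking the previously working path.
        self.assertEqual(parse_quantity("42"), 42)
```

Running the file with `python -m unittest` before and after the change gives the evidence that the test fails without the fix and passes with it.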
Phase 6: Prevent
Ensure this class of bug cannot recur.
- Add defensive validation at the boundary where bad data entered
- Improve error messages — would future-you understand this error immediately?
- Update monitoring/alerting if this was a production issue
- Write a post-mortem if the bug was significant (see post-mortem template)
- Share findings with the team — this is how institutional knowledge grows
Output: Prevention measures, post-mortem (if significant)
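Defensive validation and better error messages often go together: the boundary check rejects bad data early, and its message names the record, the field, and the offending value. The sketch below assumes a hypothetical order payload with `order_id` and `quantity` fields.

```python
class InvalidOrderError(ValueError):
    """Raised at the boundary when an incoming order payload is malformed."""


def validate_order(payload: dict) -> dict:
    """Reject bad data where it enters, with a message that explains itself."""
    order_id = payload.get("order_id", "<missing>")
    quantity = payload.get("quantity")

    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(quantity, int) or isinstance(quantity, bool):
        raise InvalidOrderError(
            f"order {order_id}: quantity must be an integer, "
            f"got {quantity!r} ({type(quantity).__name__})"
        )
    if quantity <= 0:
        raise InvalidOrderError(
            f"order {order_id}: quantity must be positive, got {quantity}"
        )
    return payload
```

A message such as `order A-17: quantity must be an integer, got '3' (str)` tells the next reader which record failed and why, without a debugger.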
Bug Category Strategies
Different bug types need different investigation approaches. See bug categories reference for the full guide.
| Category | First Move | Key Technique |
|---|---|---|
| Logic error | Read the code, trace conditions | Rubber duck walkthrough, truth tables |
| Data issue | Inspect actual vs expected data at each boundary | Boundary logging, data flow trace |
| State/race condition | Add timestamps to all state mutations | Sequence diagram, concurrency analysis |
| Integration failure | Check API contract compliance | Request/response logging, contract tests |
| Performance | Profile before guessing | Profiler, flame graphs, query analysis |
| Environment | Compare working vs broken env | Differential analysis, config audit |
| Intermittent/flaky | Increase observability first | Statistical logging, stress testing |
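As an illustration of the first move for state and race-condition bugs, state can be wrapped so that every mutation is recorded with a timestamp and the actor that made it. `TracedState` and `Mutation` are hypothetical names, and the sketch is not thread-safe by itself; real concurrent code would need a lock around `set`.

```python
import time
from typing import Any, NamedTuple


class Mutation(NamedTuple):
    at_ns: int   # monotonic timestamp, comparable within one process
    actor: str   # who changed the value (worker name, request id, ...)
    key: str
    old: Any
    new: Any


class TracedState:
    """Key-value state that records every write for later sequencing."""

    def __init__(self) -> None:
        self._values: dict[str, Any] = {}
        self.history: list[Mutation] = []

    def set(self, key: str, value: Any, actor: str) -> None:
        self.history.append(
            Mutation(time.monotonic_ns(), actor, key, self._values.get(key), value)
        )
        self._values[key] = value

    def get(self, key: str) -> Any:
        return self._values.get(key)
```

The recorded history can be turned into a sequence diagram: each entry shows which actor overwrote which value, and in what order.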
Escalation Criteria
Stop debugging and escalate when:
- You have spent more than 2x your initial time estimate without meaningful progress
- The fix requires architectural changes beyond your component
- The root cause is in a dependency you do not control
- You have found 3+ bugs in the same area — the code needs redesign, not more patches
- The bug exposes a fundamental design flaw
- Production impact is growing and a workaround/rollback is faster than a fix
Escalate to:
| Situation | Escalate To |
|---|---|
| Design or architecture issues | Architect |
| Cannot reproduce, need more info | QA team |
| Scope, priority, or trade-off questions | PM / Product Owner |
| Dependency or infrastructure issues | Platform / DevOps team |
| Security implications discovered | Security team immediately |
Decision Framework
Fix depth
- Fix at the SOURCE where bad data/state originates, not where the error appears
- Add defense-in-depth: validate at boundaries even after fixing the source
- Prefer making invalid states unrepresentable over runtime validation
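"Making invalid states unrepresentable" can be sketched with one type per state instead of a bag of optional fields. The payment example and its class names are hypothetical. The point is that a failed payment cannot carry a receipt, and a successful one cannot carry a failure reason, because those fields do not exist on the other types.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Pending:
    pass


@dataclass(frozen=True)
class Succeeded:
    receipt_id: str


@dataclass(frozen=True)
class Failed:
    reason: str


# Exactly one of these at a time; there is no "succeeded with an error" value.
PaymentState = Union[Pending, Succeeded, Failed]


def describe(state: PaymentState) -> str:
    if isinstance(state, Pending):
        return "payment pending"
    if isinstance(state, Succeeded):
        return f"paid, receipt {state.receipt_id}"
    if isinstance(state, Failed):
        return f"failed: {state.reason}"
    raise TypeError(f"unknown payment state: {state!r}")
```

Compared with a single record holding `status`, `receipt_id`, and `error` together, this design leaves fewer combinations for runtime validation to check.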
Scope of fix
- Fix the specific bug, not the surrounding code
- If you see other issues nearby, file them separately — do not scope-creep a bug fix
- One bug = one fix = one commit = one test
When to rewrite vs patch
- Patch: isolated bug, clear root cause, code is otherwise sound
- Rewrite: 3+ bugs in the same module, root cause is structural, or the fix would be more complex than a rewrite
- Rollback: production is burning and the previous version worked — roll back first, debug second
Integration with Team Roles
This debugging workflow connects to broader team processes:
| Phase | Team Integration |
|---|---|
| Reproduce | QA provides bug reports with reproduction steps; request more detail if insufficient |
| Investigate | Architect can help map component dependencies and blast radius |
| Fix | Code review by a peer before merge — a second pair of eyes catches fix-induced regressions |
| Prevent | Post-mortem shared with the team; action items tracked in the backlog |
When using other code-virtuoso skills:
| Situation | Recommended Skill |
|---|---|
| Bug fix reveals design problems | Install |
| Fix involves refactoring | Install |
| SOLID violation is root cause | Install |
| PR for the fix | Use |
Quality Checklist
Before marking a bug fix done:
- Root cause is identified and documented
- Failing test existed before the fix
- Fix addresses root cause, not symptom
- Full test suite passes
- Fix is the simplest correct solution
- Error messages improved where relevant
- Post-mortem written for significant bugs
- Team notified if the bug affects shared components
Critical Rules
- No fix without root cause. This is the iron law. If you cannot explain why the bug exists, you are not done investigating.
- Reproduce first. Do not investigate what you cannot reproduce. If reproduction fails, add observability and wait.
- Single-variable testing. Change one thing at a time during hypothesis testing. Changing multiple variables makes results uninterpretable.
- Evidence over intuition. Log your evidence. "I think it might be X" is not a hypothesis; "Log line Y shows value Z when it should show W" is.
- Test before and after. A fix without a regression test is a fix that will break again.
- Escalate without ego. Knowing when to stop and ask for help is a skill, not a weakness. See the escalation criteria above.
- Document for the next person. The next person debugging this area might be you in six months. Leave the codebase more observable than you found it.
- Never debug production by modifying production. Read-only investigation. Fixes go through the normal deployment pipeline.
- Scope discipline. Fix the bug. Only the bug. Other improvements are separate tickets.
- Share what you learn. Every significant bug is a learning opportunity for the team. Post-mortems are not blame; they are institutional memory.