debugging

Debugging

Systematic methodology for finding and fixing bugs. Prioritizes root cause analysis over symptom treatment, evidence over intuition, and prevention over recurrence.

Iron Law

No fix without root cause. Never apply a fix until you can explain WHY the bug exists, not just WHERE it manifests. Symptom-level fixes create new bugs.

When to Use

  • Bug report from QA or production alert
  • Test failure with unclear cause
  • Intermittent/flaky behavior
  • Performance degradation
  • Unexpected behavior that "used to work"
  • Integration failures between components

Workflow

Phase 1: Reproduce

Establish a reliable reproduction before investigating.
  1. Collect all evidence — error messages, stack traces, logs, screenshots, user steps
  2. Identify the exact conditions: environment, data state, user actions, timing
  3. Create a minimal reproduction — strip away everything that isn't needed to trigger the bug
  4. Confirm reproduction is consistent (if intermittent, note frequency and conditions)
  5. Write down the reproduction steps precisely — someone else should be able to follow them
Output: Documented reproduction steps, minimal test case
If you cannot reproduce: Document what you tried, check environment differences, add instrumentation and wait for next occurrence. Do not proceed to Phase 2 on guesswork — unreproducible bugs get logged, not "fixed."
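For intermittent bugs, step 4 asks for a frequency rather than a yes/no answer. The following is a minimal sketch of how that number could be measured; `reproduction_rate` and `make_flaky_operation` are hypothetical names, and the random stand-in would be replaced by the real reproduction steps.

```python
import random
from typing import Callable


def reproduction_rate(attempt: Callable[[], bool], runs: int = 200) -> float:
    """Run the reproduction `runs` times; return the fraction of runs that hit the bug."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    failures = sum(1 for _ in range(runs) if attempt())
    return failures / runs


def make_flaky_operation(seed: int, failure_probability: float) -> Callable[[], bool]:
    """Hypothetical stand-in for the real reproduction; returns True when the bug shows up."""
    rng = random.Random(seed)  # seeded so the measurement itself is repeatable
    return lambda: rng.random() < failure_probability


rate = reproduction_rate(make_flaky_operation(seed=42, failure_probability=0.1), runs=500)
print(f"bug reproduced in {rate:.1%} of runs")
```

Recording the rate alongside the conditions (load, data set, environment) gives a baseline to compare against once a fix is in place.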

Phase 2: Investigate

Gather evidence systematically. Do NOT form hypotheses yet — this phase is about observation, not explanation.
  1. Read the full error message and stack trace — every line, not just the first one
  2. Check git history: what changed recently? (git log --since="2 weeks ago", git bisect)
  3. Trace the data flow — follow the input from entry point to failure point
  4. Check boundaries — where does data cross component/service/layer boundaries?
  5. Collect environmental context — versions, configuration, dependencies, resource state
  6. Map the blast radius — what else is affected? Is this an isolated failure or systemic?
Production vs development debugging:
  • Production: Prioritize impact assessment and mitigation first. Can you reduce blast radius before investigating? Read-only access only — never debug by modifying production state.
  • Development: You have full control. Use breakpoints, modify state, add temporary logging freely.
Output: Evidence log (what you found, where, timestamps), affected component map
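The git bisect step can be automated with `git bisect run`, which treats exit code 0 as "good", 125 as "skip this commit", and other codes from 1 to 127 as "bad". Below is a sketch of a small wrapper that maps a reproduction command onto those codes. The function names, the 300-second timeout, and the rule "non-zero exit means the bug is present" are assumptions for illustration.

```python
import subprocess
import sys
from typing import Optional, Sequence

GOOD, BAD, SKIP = 0, 1, 125  # exit codes that `git bisect run` interprets


def bisect_exit_code(bug_present: Optional[bool]) -> int:
    """Map a check result to an exit code; None means the commit could not be tested."""
    if bug_present is None:
        return SKIP
    return BAD if bug_present else GOOD


def check_commit(test_command: Sequence[str]) -> Optional[bool]:
    """Run the reproduction command; a non-zero exit is taken to mean the bug is present."""
    try:
        result = subprocess.run(test_command, capture_output=True, timeout=300)
    except (OSError, subprocess.TimeoutExpired):
        return None  # the check itself could not run here, so ask bisect to skip
    return result.returncode != 0


def main(argv: Sequence[str]) -> int:
    """Entry point: argv is the reproduction command and its arguments."""
    return bisect_exit_code(check_commit(argv))
```

To use it as a script, end the file with `sys.exit(main(sys.argv[1:]))`, mark one known-bad and one known-good commit with `git bisect start`, and pass the script plus the reproduction command to `git bisect run`.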

Phase 3: Hypothesize

Form competing hypotheses ranked by evidence strength.
  1. List ALL plausible causes — do not anchor on the first idea
  2. Classify each hypothesis by bug category (see bug categories reference)
  3. Rate each: evidence strength (strong/medium/weak), testability (easy/medium/hard), likelihood
  4. Pick the most likely AND most testable hypothesis first
  5. Define what would CONFIRM and what would FALSIFY each hypothesis
Example hypothesis table:
| # | Hypothesis | Category | Evidence | Testability | Test Plan |
| --- | --- | --- | --- | --- | --- |
| 1 | Cache returns stale data after update | State | Log shows old value 2s after write | Easy | Bypass cache and compare |
| 2 | Race condition between two workers | Race condition | Intermittent, high load correlation | Medium | Add locking, stress test |
| 3 | Upstream API returns unexpected format | Integration | No evidence yet | Easy | Log raw response |
Output: Ranked hypothesis list with evidence and test plan
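One way to keep the ranking honest is to store hypotheses as data and sort them rather than reorder them by feel. The sketch below encodes the three example hypotheses. The evidence ratings ("strong", "medium", "none") and the confirm/falsify wording are illustrative assumptions, since the example table gives evidence as free text.

```python
from dataclasses import dataclass
from typing import List

# Lower rank sorts first. These scales are assumptions, not part of the workflow itself.
EVIDENCE_RANK = {"strong": 0, "medium": 1, "weak": 2, "none": 3}
TESTABILITY_RANK = {"easy": 0, "medium": 1, "hard": 2}


@dataclass(frozen=True)
class Hypothesis:
    summary: str
    category: str
    evidence: str      # key of EVIDENCE_RANK
    testability: str   # key of TESTABILITY_RANK
    confirms_if: str   # observation that would confirm the hypothesis
    falsifies_if: str  # observation that would rule it out


def rank(hypotheses: List[Hypothesis]) -> List[Hypothesis]:
    """Order by evidence strength first, then by how cheap the test is."""
    return sorted(
        hypotheses,
        key=lambda h: (EVIDENCE_RANK[h.evidence], TESTABILITY_RANK[h.testability]),
    )


candidates = [
    Hypothesis("Upstream API returns unexpected format", "Integration", "none", "easy",
               "raw response differs from the contract", "raw response matches the contract"),
    Hypothesis("Race condition between two workers", "Race condition", "medium", "medium",
               "failure disappears with locking under stress", "failure persists with locking"),
    Hypothesis("Cache returns stale data after update", "State", "strong", "easy",
               "bypassing the cache returns the new value", "bypassing the cache still returns the old value"),
]

for position, hypothesis in enumerate(rank(candidates), start=1):
    print(position, hypothesis.summary)
```

Requiring `confirms_if` and `falsifies_if` as fields makes step 5 hard to skip: a hypothesis cannot be recorded without stating what would disprove it.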

Phase 4: Test

Validate one hypothesis at a time. Single-variable changes only.
  1. Change ONE thing and observe the result
  2. If confirmed — proceed to Phase 5
  3. If falsified, update the evidence log and move on to the next hypothesis
  4. If inconclusive — add more instrumentation, gather more evidence
  5. After 3 failed hypotheses — STOP. Re-examine your assumptions. The bug model may be wrong.
Red flags (return to Phase 2 immediately):
  • "Quick fix for now, investigate later"
  • Changing multiple things at once
  • Fixing without understanding
  • Copy-pasting a fix from the internet without understanding why it works
Output: Confirmed root cause with evidence chain
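The single-variable rule can be checked mechanically when an experiment is expressed as a configuration. This is a rough sketch, assuming the baseline and the experiment are both plain dictionaries; the function names and the example keys are hypothetical.

```python
from typing import Any, Mapping, Set


def changed_keys(baseline: Mapping[str, Any], experiment: Mapping[str, Any]) -> Set[str]:
    """Return every key whose value differs, including keys added or removed."""
    missing = object()  # sentinel so a missing key never equals a real value
    keys = set(baseline) | set(experiment)
    return {k for k in keys if baseline.get(k, missing) != experiment.get(k, missing)}


def assert_single_variable(baseline: Mapping[str, Any], experiment: Mapping[str, Any]) -> str:
    """Raise unless exactly one variable changed; return the name of that variable."""
    diff = changed_keys(baseline, experiment)
    if len(diff) != 1:
        raise ValueError(f"expected exactly one changed variable, found {sorted(diff)}")
    return next(iter(diff))


baseline = {"cache_enabled": True, "workers": 4, "timeout_s": 30}
experiment = {**baseline, "cache_enabled": False}
print("testing variable:", assert_single_variable(baseline, experiment))
```

Logging the returned variable name next to the observed result keeps the evidence chain readable later.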

Phase 5: Fix

Implement the fix at the source, not at the symptom.
  1. Write a failing test that reproduces the bug FIRST
  2. Implement the fix — single, focused change addressing the root cause
  3. Verify the failing test now passes
  4. Run the full test suite — ensure no regressions
  5. Review your own fix: is this the simplest correct solution?
Fix principles:
  • Fix at the SOURCE where bad data/state originates, not where the error appears
  • Add defense-in-depth: validate at boundaries even after fixing the source
  • Prefer making invalid states unrepresentable over runtime validation
  • One bug = one fix = one commit = one test
Output: Fix with regression test, clean test suite
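As an illustration of "failing test first, then a single focused fix", the sketch below uses the stale-cache hypothesis from Phase 3. `ProfileStore` is a hypothetical component invented for the example, not code from any real project.

```python
import unittest


class ProfileStore:
    """Hypothetical component: a backing dict with a read cache in front of it."""

    def __init__(self) -> None:
        self._db = {}
        self._cache = {}

    def read(self, key):
        if key not in self._cache:
            self._cache[key] = self._db.get(key)
        return self._cache[key]

    def write(self, key, value) -> None:
        self._db[key] = value
        # The fix, applied at the source: invalidate the cached entry on write.
        # Removing this line makes the regression test below fail again.
        self._cache.pop(key, None)


class StaleCacheRegressionTest(unittest.TestCase):
    def test_read_after_write_returns_new_value(self) -> None:
        store = ProfileStore()
        store.write("name", "old")
        self.assertEqual(store.read("name"), "old")  # populates the cache
        store.write("name", "new")
        self.assertEqual(store.read("name"), "new")  # was "old" before the fix
```

The test is written against observable behavior (read after write), so it keeps guarding the bug even if the caching strategy is later replaced. It can be run with `python -m unittest`.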

Phase 6: Prevent

Ensure this class of bug cannot recur.
  1. Add defensive validation at the boundary where bad data entered
  2. Improve error messages — would future-you understand this error immediately?
  3. Update monitoring/alerting if this was a production issue
  4. Write a post-mortem if the bug was significant (see post-mortem template)
  5. Share findings with the team — this is how institutional knowledge grows
Output: Prevention measures, post-mortem (if significant)
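Steps 1 and 2 often land in the same place: a validation function at the boundary whose error message states what was expected and what actually arrived. A minimal sketch follows, assuming a hypothetical order payload with `order_id` and `quantity` fields.

```python
from typing import Any, Dict, Mapping


class BoundaryValidationError(ValueError):
    """Raised when data entering the component violates the expected contract."""


def parse_order_payload(payload: Mapping[str, Any]) -> Dict[str, Any]:
    """Validate a hypothetical upstream payload at the point where it enters the component."""
    if "order_id" not in payload:
        raise BoundaryValidationError(
            f"order payload is missing 'order_id'; received keys: {sorted(payload)}"
        )
    quantity = payload.get("quantity")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(quantity, bool) or not isinstance(quantity, int) or quantity <= 0:
        raise BoundaryValidationError(
            f"order {payload['order_id']!r}: 'quantity' must be a positive integer, "
            f"got {quantity!r} ({type(quantity).__name__})"
        )
    return {"order_id": str(payload["order_id"]), "quantity": quantity}
```

The message includes the offending value and its type, so whoever reads the log next can tell a string "3" from a missing field without attaching a debugger.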

Bug Category Strategies

Different bug types need different investigation approaches. See bug categories reference for the full guide.
| Category | First Move | Key Technique |
| --- | --- | --- |
| Logic error | Read the code, trace conditions | Rubber duck walkthrough, truth tables |
| Data issue | Inspect actual vs expected data at each boundary | Boundary logging, data flow trace |
| State/race condition | Add timestamps to all state mutations | Sequence diagram, concurrency analysis |
| Integration failure | Check API contract compliance | Request/response logging, contract tests |
| Performance | Profile before guessing | Profiler, flame graphs, query analysis |
| Environment | Compare working vs broken env | Differential analysis, config audit |
| Intermittent/flaky | Increase observability first | Statistical logging, stress testing |
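For the state/race condition row, "add timestamps to all state mutations" could look like the wrapper below. It is a sketch under assumptions: `TracedState` and `Mutation` are hypothetical names, and the lock only keeps the log consistent, so it may also change the timing of the code under investigation.

```python
import threading
import time
from typing import Any, Dict, List, NamedTuple


class Mutation(NamedTuple):
    monotonic_ns: int  # monotonic clock, safe for ordering events within one process
    thread: str
    key: str
    old: Any
    new: Any


class TracedState:
    """Key-value state that records who changed what, and when."""

    def __init__(self) -> None:
        self._values: Dict[str, Any] = {}
        self._log: List[Mutation] = []
        self._lock = threading.Lock()

    def set(self, key: str, value: Any) -> None:
        with self._lock:
            old = self._values.get(key)
            self._values[key] = value
            self._log.append(
                Mutation(time.monotonic_ns(), threading.current_thread().name, key, old, value)
            )

    def get(self, key: str) -> Any:
        with self._lock:
            return self._values.get(key)

    def history(self) -> List[Mutation]:
        with self._lock:
            return list(self._log)  # copy, so callers cannot alter the log


state = TracedState()
state.set("status", "pending")
state.set("status", "shipped")
for m in state.history():
    print(m.monotonic_ns, m.thread, m.key, m.old, "->", m.new)
```

The recorded history is the raw material for a sequence diagram: sort by timestamp, group by thread, and look for writes that interleave in an order the design did not anticipate.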

Escalation Criteria

Stop debugging and escalate when:
  • You have spent more than 2x your initial time estimate without meaningful progress
  • The fix requires architectural changes beyond your component
  • The root cause is in a dependency you do not control
  • You have found 3+ bugs in the same area — the code needs redesign, not more patches
  • The bug exposes a fundamental design flaw
  • Production impact is growing and a workaround/rollback is faster than a fix
Escalate to:
| Situation | Escalate To |
| --- | --- |
| Design or architecture issues | Architect |
| Cannot reproduce, need more info | QA team |
| Scope, priority, or trade-off questions | PM / Product Owner |
| Dependency or infrastructure issues | Platform / DevOps team |
| Security implications discovered | Security team immediately |

Decision Framework

Fix depth

  • Fix at the SOURCE where bad data/state originates, not where the error appears
  • Add defense-in-depth: validate at boundaries even after fixing the source
  • Prefer making invalid states unrepresentable over runtime validation
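"Making invalid states unrepresentable" is easiest to see with a small example. Instead of a `connected` flag plus an optional session ID, which allows "connected with no session", the sketch below models each state as its own type. The connection scenario and all names are assumptions chosen for illustration. Python enforces this only at construction time and through a type checker, not at compile time.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Disconnected:
    """No session exists, so there is no session_id field to get wrong."""


@dataclass(frozen=True)
class Connected:
    session_id: str

    def __post_init__(self) -> None:
        # Runtime validation remains as the defense-in-depth layer.
        if not self.session_id:
            raise ValueError("Connected requires a non-empty session_id")


ConnectionState = Union[Disconnected, Connected]


def describe(state: ConnectionState) -> str:
    """Each branch can rely on the fields its state type guarantees."""
    if isinstance(state, Connected):
        return f"connected (session {state.session_id})"
    return "disconnected"


print(describe(Connected(session_id="abc123")))
print(describe(Disconnected()))
```

With this shape, code that handles `Connected` never needs a "session is None" check, which removes a whole branch where a bug of that kind could hide.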

Scope of fix

  • Fix the specific bug, not the surrounding code
  • If you see other issues nearby, file them separately — do not scope-creep a bug fix
  • One bug = one fix = one commit = one test

When to rewrite vs patch

  • Patch: isolated bug, clear root cause, code is otherwise sound
  • Rewrite: 3+ bugs in same module, root cause is structural, fix would be more complex than rewrite
  • Rollback: production is burning and the previous version worked — roll back first, debug second

Integration with Team Roles

This debugging workflow connects to broader team processes:
| Phase | Team Integration |
| --- | --- |
| Reproduce | QA provides bug reports with reproduction steps; request more detail if insufficient |
| Investigate | Architect can help map component dependencies and blast radius |
| Fix | Code review by a peer before merge: a second pair of eyes catches fix-induced regressions |
| Prevent | Post-mortem shared with the team; action items tracked in the backlog |
When using other code-virtuoso skills:
| Situation | Recommended Skill |
| --- | --- |
| Bug fix reveals design problems | Install design-patterns-virtuoso from krzysztofsurdy/code-virtuoso |
| Fix involves refactoring | Install refactoring-virtuoso from krzysztofsurdy/code-virtuoso |
| SOLID violation is root cause | Install solid-virtuoso from krzysztofsurdy/code-virtuoso |
| PR for the fix | Use pr-message-writer from krzysztofsurdy/code-virtuoso |

Quality Checklist

Before marking a bug fix done:
  • Root cause is identified and documented
  • Failing test existed before the fix
  • Fix addresses root cause, not symptom
  • Full test suite passes
  • Fix is the simplest correct solution
  • Error messages improved where relevant
  • Post-mortem written for significant bugs
  • Team notified if the bug affects shared components

Critical Rules

  1. No fix without root cause. This is the iron law. If you cannot explain why the bug exists, you are not done investigating.
  2. Reproduce first. Do not investigate what you cannot reproduce. If reproduction fails, add observability and wait.
  3. Single-variable testing. Change one thing at a time during hypothesis testing. Changing multiple variables makes results uninterpretable.
  4. Evidence over intuition. Log your evidence. "I think it might be X" is not a hypothesis — "Log line Y shows value Z when it should show W" is.
  5. Test before and after. A fix without a regression test is a fix that will break again.
  6. Escalate without ego. Knowing when to stop and ask for help is a skill, not a weakness. See the escalation criteria above.
  7. Document for the next person. The next person debugging this area might be you in six months. Leave the codebase more observable than you found it.
  8. Never debug production by modifying production. Read-only investigation. Fixes go through the normal deployment pipeline.
  9. Scope discipline. Fix the bug. Only the bug. Other improvements are separate tickets.
  10. Share what you learn. Every significant bug is a learning opportunity for the team. Post-mortems are not blame — they are institutional memory.