debugging

Debugging

Systematic methodology for finding and fixing bugs. Prioritizes root cause analysis over symptom treatment, evidence over intuition, and prevention over recurrence.

Iron Law

No fix without root cause. Never apply a fix until you can explain WHY the bug exists, not just WHERE it manifests. Symptom-level fixes create new bugs.

When to Use

Bug report from QA or production alert
Test failure with unclear cause
Intermittent/flaky behavior
Performance degradation
Unexpected behavior that "used to work"
Integration failures between components

Workflow

Phase 1: Reproduce

Establish a reliable reproduction before investigating.

Collect all evidence — error messages, stack traces, logs, screenshots, user steps
Identify the exact conditions: environment, data state, user actions, timing
Create a minimal reproduction — strip away everything that isn't needed to trigger the bug
Confirm reproduction is consistent (if intermittent, note frequency and conditions)
Write down the reproduction steps precisely — someone else should be able to follow them

Output: Documented reproduction steps, minimal test case

If you cannot reproduce: Document what you tried, check environment differences, add instrumentation and wait for next occurrence. Do not proceed to Phase 2 on guesswork — unreproducible bugs get logged, not "fixed."
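
For intermittent bugs, "note frequency" can be made concrete by running the minimal reproduction many times and counting failures. The sketch below assumes the reproduction can be wrapped in a Python callable that raises on failure; `measure_failure_rate` and `flaky_repro` are hypothetical names used only for illustration.

```python
import random
from typing import Callable


def measure_failure_rate(repro: Callable[[], None], runs: int = 200) -> float:
    """Run a reproduction callable repeatedly and return the fraction of failing runs."""
    failures = 0
    for _ in range(runs):
        try:
            repro()
        except Exception:
            failures += 1
    return failures / runs


def flaky_repro(rng: random.Random) -> None:
    """Hypothetical stand-in for a minimal reproduction that fails intermittently."""
    if rng.random() < 0.25:
        raise RuntimeError("simulated intermittent failure")


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the measurement itself is repeatable
    rate = measure_failure_rate(lambda: flaky_repro(rng), runs=1000)
    print(f"failure rate: {rate:.1%}")
```

Recording the measured rate next to the reproduction steps gives a baseline to compare against once a fix is in place.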

Phase 2: Investigate

Gather evidence systematically. Do NOT form hypotheses yet — this phase is about observation, not explanation.

Read the full error message and stack trace — every line, not just the first one
Check git history: what changed recently? (`git log --since="2 weeks ago"`, `git bisect`)
Trace the data flow — follow the input from entry point to failure point
Check boundaries — where does data cross component/service/layer boundaries?
Collect environmental context — versions, configuration, dependencies, resource state
Map the blast radius — what else is affected? Is this an isolated failure or systemic?

Production vs development debugging:

Production: Prioritize impact assessment and mitigation first. Can you reduce blast radius before investigating? Read-only access only — never debug by modifying production state.
Development: You have full control. Use breakpoints, modify state, add temporary logging freely.

Output: Evidence log (what you found, where, timestamps), affected component map
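
One possible shape for the evidence log is a list of timestamped, source-tagged observations. This is a minimal sketch, not a prescribed format; the `Evidence` and `EvidenceLog` names and their fields are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Evidence:
    observed_at: datetime  # when the observation was recorded (UTC)
    source: str            # where it was found, e.g. a log file or a commit
    observation: str       # what was seen, stated without interpretation


@dataclass
class EvidenceLog:
    entries: list[Evidence] = field(default_factory=list)

    def record(self, source: str, observation: str) -> Evidence:
        """Append an observation stamped with the current UTC time."""
        entry = Evidence(datetime.now(timezone.utc), source, observation)
        self.entries.append(entry)
        return entry

    def from_source(self, source: str) -> list[Evidence]:
        """Return all observations that came from one source."""
        return [e for e in self.entries if e.source == source]


log = EvidenceLog()
log.record("app.log", "old value returned 2s after write")
log.record("git log", "cache configuration changed in a recent commit")
for entry in log.entries:
    print(entry.observed_at.isoformat(), entry.source, entry.observation)
```

Keeping observations free of interpretation at this stage matches the rule above: explanations belong to Phase 3.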

Phase 3: Hypothesize

Form competing hypotheses ranked by evidence strength.

List ALL plausible causes — do not anchor on the first idea
Classify each hypothesis by bug category (see bug categories reference)
Rate each: evidence strength (strong/medium/weak), testability (easy/medium/hard), likelihood
Pick the most likely AND most testable hypothesis first
Define what would CONFIRM and what would FALSIFY each hypothesis

Example hypothesis table:

#	Hypothesis	Category	Evidence	Testability	Test Plan
1	Cache returns stale data after update	State	Log shows old value 2s after write	Easy	Bypass cache and compare
2	Race condition between two workers	Race condition	Intermittent, high load correlation	Medium	Add locking, stress test
3	Upstream API returns unexpected format	Integration	No evidence yet	Easy	Log raw response

Output: Ranked hypothesis list with evidence and test plan
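
The ranking step can be sketched as a sort over evidence strength and testability. The numeric weights, the extra "none" evidence level, and the ratings assigned to the three example hypotheses are assumptions for illustration, not part of the methodology itself.

```python
from dataclasses import dataclass

# Assumed weights; "none" is added here for hypotheses with no evidence yet.
EVIDENCE_SCORE = {"strong": 3, "medium": 2, "weak": 1, "none": 0}
TESTABILITY_SCORE = {"easy": 3, "medium": 2, "hard": 1}


@dataclass(frozen=True)
class Hypothesis:
    statement: str
    category: str
    evidence: str     # a key of EVIDENCE_SCORE
    testability: str  # a key of TESTABILITY_SCORE


def rank(hypotheses: list[Hypothesis]) -> list[Hypothesis]:
    """Order hypotheses by evidence strength first, then by ease of testing."""
    return sorted(
        hypotheses,
        key=lambda h: (EVIDENCE_SCORE[h.evidence], TESTABILITY_SCORE[h.testability]),
        reverse=True,
    )


hypotheses = [
    Hypothesis("Upstream API returns unexpected format", "Integration", "none", "easy"),
    Hypothesis("Race condition between two workers", "Race condition", "medium", "medium"),
    Hypothesis("Cache returns stale data after update", "State", "strong", "easy"),
]
for position, h in enumerate(rank(hypotheses), start=1):
    print(position, h.statement)
```

With these assumed ratings the order matches the example table: cache first, race condition second, upstream API last.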

Phase 4: Test

Validate one hypothesis at a time. Single-variable changes only.

Change ONE thing and observe the result
If confirmed: proceed to Phase 5
If falsified: update the evidence log and move on to the next hypothesis
If inconclusive: add more instrumentation and gather more evidence
After 3 failed hypotheses — STOP. Re-examine your assumptions. The bug model may be wrong.

Red flags (return to Phase 2 immediately):

"Quick fix for now, investigate later"
Changing multiple things at once
Fixing without understanding
Copy-pasting a fix from the internet without understanding why it works

Output: Confirmed root cause with evidence chain
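
Single-variable testing can be enforced mechanically when the thing being varied is configuration. The sketch below generates test variants that each differ from a known baseline in exactly one key; the function name and the example keys (`cache_enabled`, `workers`) are hypothetical.

```python
from typing import Any, Iterator


def single_variable_variants(
    baseline: dict[str, Any], candidates: dict[str, Any]
) -> Iterator[tuple[str, dict[str, Any]]]:
    """Yield (changed_key, config) pairs, each differing from baseline in one key."""
    for key, value in candidates.items():
        if key in baseline and baseline[key] == value:
            continue  # same as baseline, so running it would teach nothing
        variant = dict(baseline)  # copy so the baseline itself is never mutated
        variant[key] = value
        yield key, variant


baseline = {"cache_enabled": True, "workers": 4}
candidates = {"cache_enabled": False, "workers": 1}
for changed_key, config in single_variable_variants(baseline, candidates):
    print(f"run reproduction with only {changed_key!r} changed: {config}")
```

If a variant makes the bug disappear, the changed key points at the hypothesis to confirm; it is not yet the fix.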

Phase 5: Fix

Implement the fix at the source, not at the symptom.

Write a failing test that reproduces the bug FIRST
Implement the fix — single, focused change addressing the root cause
Verify the failing test now passes
Run the full test suite — ensure no regressions
Review your own fix: is this the simplest correct solution?

Fix principles:

Fix at the SOURCE where bad data/state originates, not where the error appears
Add defense-in-depth: validate at boundaries even after fixing the source
Prefer making invalid states unrepresentable over runtime validation
One bug = one fix = one commit = one test

Output: Fix with regression test, clean test suite
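
A test-first fix might look like the sketch below, using Python's standard `unittest` module. The bug is hypothetical: assume an earlier `parse_quantity` rejected input with surrounding whitespace, and that the root cause was missing normalization at the point of parsing.

```python
import unittest


def parse_quantity(text: str) -> int:
    """Hypothetical function under repair: parse a quantity field from user input.

    The fix is the single call to strip(); everything else is assumed to be
    the behavior the function already had.
    """
    cleaned = text.strip()
    if not cleaned.isdigit():
        raise ValueError(f"quantity must be a non-negative integer, got {text!r}")
    return int(cleaned)


class ParseQuantityRegressionTest(unittest.TestCase):
    def test_accepts_surrounding_whitespace(self):
        # Written first: this test failed against the assumed pre-fix code.
        self.assertEqual(parse_quantity(" 42\n"), 42)

    def test_still_rejects_non_numeric_input(self):
        # Guards against the fix loosening validation more than intended.
        with self.assertRaises(ValueError):
            parse_quantity("4 2")
```

Running the file with `python -m unittest` executes both tests. The second test is there because a fix that merely loosens validation would pass the first test while introducing a new bug.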

Phase 6: Prevent

Ensure this class of bug cannot recur.

Add defensive validation at the boundary where bad data entered
Improve error messages — would future-you understand this error immediately?
Update monitoring/alerting if this was a production issue
Write a post-mortem if the bug was significant (see post-mortem template)
Share findings with the team — this is how institutional knowledge grows

Output: Prevention measures, post-mortem (if significant)
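
Defensive validation and better error messages often go together. The sketch below validates a raw payload at a component boundary and says what was expected and what was received; the order-service domain and all names in it are hypothetical.

```python
from dataclasses import dataclass


class BoundaryValidationError(ValueError):
    """Raised when data entering the component violates its contract."""


@dataclass(frozen=True)
class OrderRequest:
    order_id: str
    quantity: int


def parse_order_request(payload: dict) -> OrderRequest:
    """Validate a raw payload at the boundary before it reaches business logic."""
    order_id = payload.get("order_id")
    if not isinstance(order_id, str) or not order_id:
        raise BoundaryValidationError(
            f"order_id must be a non-empty string, got {order_id!r}; "
            f"payload keys: {sorted(map(str, payload))}"
        )
    quantity = payload.get("quantity")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(quantity, bool) or not isinstance(quantity, int) or quantity <= 0:
        raise BoundaryValidationError(
            f"quantity must be a positive integer, got {quantity!r} "
            f"for order_id {order_id!r}"
        )
    return OrderRequest(order_id, quantity)
```

An error such as `quantity must be a positive integer, got '2' for order_id 'A1'` tells the next reader which field, which value, and which record, without needing a debugger.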

Bug Category Strategies

Different bug types need different investigation approaches. See bug categories reference for the full guide.

Category	First Move	Key Technique
Logic error	Read the code, trace conditions	Rubber duck walkthrough, truth tables
Data issue	Inspect actual vs expected data at each boundary	Boundary logging, data flow trace
State/race condition	Add timestamps to all state mutations	Sequence diagram, concurrency analysis
Integration failure	Check API contract compliance	Request/response logging, contract tests
Performance	Profile before guessing	Profiler, flame graphs, query analysis
Environment	Compare working vs broken env	Differential analysis, config audit
Intermittent/flaky	Increase observability first	Statistical logging, stress testing
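
As one illustration of the state/race condition row, "add timestamps to all state mutations" could be sketched as a small wrapper that records every write. The `TracedState` class is hypothetical and intended for development use only; note that its lock changes timing and may itself hide the race being investigated.

```python
import threading
import time
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Mutation:
    at: float    # time.monotonic() reading, comparable only within one process
    thread: str  # name of the thread that performed the write
    key: str
    old: Any
    new: Any


class TracedState:
    """Key-value state holder that records every write (illustrative only)."""

    def __init__(self) -> None:
        self._data: dict[str, Any] = {}
        self._lock = threading.Lock()
        self.history: list[Mutation] = []

    def set(self, key: str, value: Any) -> None:
        with self._lock:
            old = self._data.get(key)
            self._data[key] = value
            self.history.append(
                Mutation(time.monotonic(), threading.current_thread().name, key, old, value)
            )

    def get(self, key: str, default: Any = None) -> Any:
        with self._lock:
            return self._data.get(key, default)


state = TracedState()
state.set("balance", 100)
state.set("balance", 80)
for m in state.history:
    print(f"{m.at:.6f} {m.thread} {m.key}: {m.old!r} -> {m.new!r}")
```

The recorded history is the raw material for the sequence diagram named in the table.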

Escalation Criteria

Stop debugging and escalate when:

You have spent more than 2x your initial time estimate without meaningful progress
The fix requires architectural changes beyond your component
The root cause is in a dependency you do not control
You have found 3+ bugs in the same area — the code needs redesign, not more patches
The bug exposes a fundamental design flaw
Production impact is growing and a workaround/rollback is faster than a fix

Escalate to:

Situation	Escalate To
Design or architecture issues	Architect
Cannot reproduce, need more info	QA team
Scope, priority, or trade-off questions	PM / Product Owner
Dependency or infrastructure issues	Platform / DevOps team
Security implications discovered	Security team immediately

Decision Framework

Fix depth

Fix at the SOURCE where bad data/state originates, not where the error appears
Add defense-in-depth: validate at boundaries even after fixing the source
Prefer making invalid states unrepresentable over runtime validation
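
"Making invalid states unrepresentable" can be approximated in Python by replacing loosely coupled flags with one type per state. In this sketch, a fetch result is modeled as three separate classes rather than, say, a `success` flag plus optional `payload` and `error` fields that could contradict each other. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Pending:
    pass


@dataclass(frozen=True)
class Succeeded:
    payload: bytes  # only a successful result can carry a payload


@dataclass(frozen=True)
class Failed:
    reason: str  # only a failed result can carry a reason


FetchResult = Union[Pending, Succeeded, Failed]


def describe(result: FetchResult) -> str:
    """Handle each state explicitly; no combination of fields is contradictory."""
    if isinstance(result, Succeeded):
        return f"ok ({len(result.payload)} bytes)"
    if isinstance(result, Failed):
        return f"failed: {result.reason}"
    return "pending"


print(describe(Succeeded(b"abc")))
print(describe(Failed("timeout")))
print(describe(Pending()))
```

Python does not enforce these types at runtime, so this design reduces the need for runtime validation rather than replacing it; a static type checker is assumed for the full benefit.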

Scope of fix

Fix the specific bug, not the surrounding code
If you see other issues nearby, file them separately — do not scope-creep a bug fix
One bug = one fix = one commit = one test

When to rewrite vs patch

Patch: isolated bug, clear root cause, code is otherwise sound
Rewrite: 3+ bugs in same module, root cause is structural, fix would be more complex than rewrite
Rollback: production is burning and the previous version worked — roll back first, debug second

Integration with Team Roles

This debugging workflow connects to broader team processes:

Phase	Team Integration
Reproduce	QA provides bug reports with reproduction steps; request more detail if insufficient
Investigate	Architect can help map component dependencies and blast radius
Fix	Code review by a peer before merge — a second pair of eyes catches fix-induced regressions
Prevent	Post-mortem shared with the team; action items tracked in the backlog

When using other code-virtuoso skills:

Situation	Recommended Skill
Bug fix reveals design problems	Install `design-patterns-virtuoso` from `krzysztofsurdy/code-virtuoso`
Fix involves refactoring	Install `refactoring-virtuoso` from `krzysztofsurdy/code-virtuoso`
SOLID violation is root cause	Install `solid-virtuoso` from `krzysztofsurdy/code-virtuoso`
PR for the fix	Use `pr-message-writer` from `krzysztofsurdy/code-virtuoso`

Quality Checklist

Critical Rules

No fix without root cause. This is the iron law. If you cannot explain why the bug exists, you are not done investigating.
Reproduce first. Do not investigate what you cannot reproduce. If reproduction fails, add observability and wait.
Single-variable testing. Change one thing at a time during hypothesis testing. Changing multiple variables makes results uninterpretable.
Evidence over intuition. Log your evidence. "I think it might be X" is not a hypothesis — "Log line Y shows value Z when it should show W" is.
Test before and after. A fix without a regression test is a fix that will break again.
Escalate without ego. Knowing when to stop and ask for help is a skill, not a weakness. See the escalation criteria above.
Document for the next person. The next person debugging this area might be you in six months. Leave the codebase more observable than you found it.
Never debug production by modifying production. Read-only investigation. Fixes go through the normal deployment pipeline.
Scope discipline. Fix the bug. Only the bug. Other improvements are separate tickets.
Share what you learn. Every significant bug is a learning opportunity for the team. Post-mortems are not blame — they are institutional memory.
