systematic-debugging

Systematic Debugging

Random fixes waste time and create new bugs. Quick patches mask underlying issues.
Core principle: ALWAYS find root cause before attempting fixes. Symptom fixes are failure.

The Iron Law

NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST
If you haven't completed Phase 1, you cannot propose fixes.

The Four Phases

Phase 1: Root Cause Investigation

BEFORE attempting ANY fix:
  1. Read Error Messages Carefully
    • Don't skip past errors or warnings
    • Read stack traces completely
    • Note line numbers, file paths, error codes
  2. Reproduce Consistently
    • Can you trigger it reliably?
    • What are the exact steps?
    • If not reproducible, gather more data - don't guess
  3. Check Recent Changes
    • Git diff, recent commits
    • New dependencies, config changes
    • Environmental differences
  4. Gather Evidence in Multi-Component Systems
    When the system has multiple components (CI -> build -> signing, API -> service -> database):
    For EACH component boundary:
      - Log what data enters the component
      - Log what data exits the component
      - Verify environment/config propagation
    
    Run once to gather evidence showing WHERE it breaks
    THEN analyze to identify the failing component
  5. Trace Data Flow
    See references/root-cause-tracing.md for the backward tracing technique.
    Quick version: Where does the bad value originate? Keep tracing up until you find the source. Fix at the source, not the symptom.
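
The boundary logging in step 4 could be sketched as a small decorator. This is only an illustration under stated assumptions: `log_boundary`, `build`, and `sign` are hypothetical names, and `SIGNING_KEY` stands in for whatever environment or config value has to propagate between your components.

```python
import functools
import logging
import os

logging.basicConfig(level=logging.DEBUG, format="%(message)s")
log = logging.getLogger("boundary")


def log_boundary(component):
    """Log what enters and exits a component, plus one environment check."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            log.debug("[%s] IN  args=%r kwargs=%r", component, args, kwargs)
            # Hypothetical propagation check: is the expected variable visible here?
            log.debug("[%s] ENV SIGNING_KEY set=%s", component, "SIGNING_KEY" in os.environ)
            result = func(*args, **kwargs)
            log.debug("[%s] OUT %r", component, result)
            return result
        return wrapper
    return decorator


@log_boundary("build")
def build(source):
    # Stand-in for a real build step
    return {"artifact": source + ".bin"}


@log_boundary("sign")
def sign(build_output):
    # Stand-in for a real signing step; records whether a key was available
    key = os.environ.get("SIGNING_KEY")
    return {**build_output, "signed": key is not None}


result = sign(build("app"))
```

One run prints an IN/ENV/OUT triple per component, so the first boundary where the data or environment looks wrong points at the failing component.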

Phase 2: Pattern Analysis

  1. Find Working Examples - Locate similar working code in same codebase
  2. Compare Against References - Read reference implementations COMPLETELY, don't skim
  3. Identify Differences - List every difference between working and broken
  4. Understand Dependencies - What settings, config, environment assumptions?
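
As a rough sketch of step 3, listing every difference can be mechanical when the working and broken cases are described by settings. The helper name `diff_settings` and both dictionaries are hypothetical, chosen only for illustration.

```python
def diff_settings(working, broken):
    """Return every key whose value differs, including keys present on one side only."""
    missing = object()  # sentinel so that None values are still compared
    differences = {}
    for key in sorted(set(working) | set(broken)):
        w = working.get(key, missing)
        b = broken.get(key, missing)
        if w != b:
            differences[key] = (
                "<absent>" if w is missing else w,
                "<absent>" if b is missing else b,
            )
    return differences


# Hypothetical settings for illustration only
working = {"timeout": 30, "retries": 3, "region": "us-east-1"}
broken = {"timeout": 30, "retries": 0, "tls": True}

for key, (w, b) in diff_settings(working, broken).items():
    print(f"{key}: working={w!r} broken={b!r}")
```

Keys that exist on only one side are reported too, since a missing setting is as much a difference as a changed one.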

Phase 3: Hypothesis and Testing

  1. Form Single Hypothesis - "I think X is the root cause because Y"
  2. Test Minimally - SMALLEST possible change, one variable at a time
  3. Verify Before Continuing - Worked? Phase 4. Didn't? NEW hypothesis, don't stack fixes
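
One way to picture "one variable at a time" is the sketch below, which assumes the difference between the working and broken cases can be expressed as settings. `find_breaking_change` and `check` are hypothetical names; `check` stands in for whatever reproduces the failure in your system.

```python
def find_breaking_change(working, broken, check):
    """Apply each differing setting to the working config alone and report which ones fail."""
    culprits = []
    for key in sorted(broken):
        if working.get(key) == broken[key]:
            continue  # not a difference, nothing to test
        candidate = {**working, key: broken[key]}  # exactly one variable changed
        if not check(candidate):
            culprits.append(key)
    return culprits


# Hypothetical check: the system under test fails when retries are disabled
def check(config):
    return config.get("retries", 0) > 0


working = {"timeout": 30, "retries": 3}
broken = {"timeout": 5, "retries": 0}

print(find_breaking_change(working, broken, check))
```

Each iteration tests a single hypothesis ("this setting is the root cause") against a known-good baseline, so a failure can be attributed to that one change. The sketch does not cover settings that exist only in the working config, nor failures caused by an interaction between two settings.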

Phase 4: Implementation

  1. Create Failing Test Case - Simplest reproduction, automated if possible
  2. Implement Single Fix - ONE change, no "while I'm here" improvements
  3. Verify Fix - Test passes? No regressions?
  4. If Fix Doesn't Work:
    • Count: How many fixes have you tried?
    • If < 3: Return to Phase 1, re-analyze
    • If >= 3: STOP and question the architecture
  5. If 3+ Fixes Failed: Question Architecture
    Pattern indicating architectural problem:
    • Each fix reveals new shared state/coupling
    • Fixes require "massive refactoring"
    • Each fix creates new symptoms elsewhere
    STOP. Discuss with user before attempting more fixes.
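
Steps 1 to 3 might look like the following in practice. The bug is invented for illustration: a hypothetical `average` function that used to raise `ZeroDivisionError` on empty input. The test is written first and fails, then the single fix is applied, then both tests are run to confirm the fix and check for regressions.

```python
import unittest


def average(values):
    """Hypothetical function under repair."""
    if not values:  # the single fix: empty input previously raised ZeroDivisionError
        return 0.0
    return sum(values) / len(values)


class AverageRegression(unittest.TestCase):
    def test_empty_input(self):
        # Simplest automated reproduction of the reported failure
        self.assertEqual(average([]), 0.0)

    def test_existing_behavior_unchanged(self):
        # Guards against a regression introduced by the fix
        self.assertEqual(average([2, 4]), 3.0)


# Run the suite explicitly so the outcome can be inspected programmatically
suite = unittest.defaultTestLoader.loadTestsFromTestCase(AverageRegression)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

Returning `0.0` for empty input is itself an assumption about the desired behavior; in a real fix that decision should come out of the root cause investigation, not the test.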

Red Flags - STOP and Follow Process

If you catch yourself thinking:
  • "Quick fix for now, investigate later"
  • "Just try changing X and see"
  • "Add multiple changes, run tests"
  • "I'm confident it's X, let me fix that"
  • "One more fix attempt" (when already tried 2+)
  • Proposing solutions before tracing data flow
ALL of these mean: STOP. Return to Phase 1.

Supporting Techniques

Defense-in-Depth

When you fix a bug, validate at EVERY layer:
| Layer | Purpose | Example |
|-------|---------|---------|
| Entry Point | Reject invalid input at API boundary | `if (!dir) throw new Error('dir required')` |
| Business Logic | Ensure data makes sense for operation | Validate before processing |
| Environment Guards | Prevent dangerous ops in specific contexts | Refuse `git init` outside tmpdir in tests |
| Debug Instrumentation | Capture context for forensics | Log with stack trace before dangerous ops |
Single validation feels sufficient, but different code paths bypass it. Make bugs structurally impossible.
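
The four layers could be combined in one function along these lines. This is a sketch, not the original implementation: `init_repo` and its `in_tests` flag are hypothetical, and the actual `git init` call is stubbed out.

```python
import logging
import os
import tempfile
import traceback

log = logging.getLogger("repo")


def init_repo(directory, *, in_tests=False):
    """Hypothetical operation guarded at four layers; the real work is stubbed out."""
    # Layer 1, entry point: reject invalid input at the boundary
    if not directory:
        raise ValueError("directory required")

    # Layer 2, business logic: the data must make sense for this operation
    path = os.path.realpath(directory)
    if not os.path.isdir(path):
        raise NotADirectoryError(path)

    # Layer 3, environment guard: in tests, refuse to act outside the temp dir
    tmp_root = os.path.realpath(tempfile.gettempdir())
    if in_tests and os.path.commonpath([path, tmp_root]) != tmp_root:
        raise PermissionError(f"refusing to init outside {tmp_root}: {path}")

    # Layer 4, debug instrumentation: record the caller before the risky step
    log.debug("init_repo(%s) called from:\n%s", path, "".join(traceback.format_stack()))

    return path  # stand-in for actually running `git init`
```

Each check is cheap on its own, and a caller that bypasses one layer (for example by passing a valid but dangerous path) is still stopped by the next. Note that `os.path.commonpath` raises `ValueError` for paths on different drives on Windows, which this sketch does not handle.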

Condition-Based Waiting

Flaky tests guess at timing. Wait for actual conditions instead:
```python
# BAD: Guessing at timing
await asyncio.sleep(0.05)
result = get_result()

# GOOD: Wait for condition
await wait_for(lambda: get_result() is not None)
result = get_result()
```

Pattern:
```python
import asyncio
import time


async def wait_for(condition, timeout_ms=5000):
    """Poll condition() until it returns a truthy value or timeout_ms elapses."""
    start = time.monotonic()  # monotonic clock: unaffected by system clock changes
    while True:
        if condition():
            return
        if (time.monotonic() - start) * 1000 > timeout_ms:
            raise TimeoutError("Condition not met")
        await asyncio.sleep(0.01)  # Poll every 10ms
```

Common Rationalizations

| Excuse | Reality |
|--------|---------|
| "Issue is simple, don't need process" | Simple issues have root causes too. Process is fast for simple bugs. |
| "Emergency, no time for process" | Systematic debugging is FASTER than guess-and-check thrashing. |
| "Just try this first, then investigate" | First fix sets the pattern. Do it right from the start. |
| "I see the problem, let me fix it" | Seeing symptoms != understanding root cause. |
| "One more fix attempt" (after 2+ failures) | 3+ failures = architectural problem. Question pattern, don't fix again. |

Verification

Run:
python scripts/verify.py

References

参考资料

  • references/root-cause-tracing.md - Trace bugs backward through call stack