systematic-debugging

Systematic Debugging

Overview

Random fixes waste time and create new bugs. Quick patches mask underlying issues.

Core principle: ALWAYS find root cause before attempting fixes. Symptom fixes are failure.

Violating the letter of this process is violating the spirit of debugging.

The Iron Law

NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST

If you haven't completed Phase 1, you cannot propose fixes.

When to Use

Use for ANY technical issue:

Test failures
Bugs in production
Unexpected behavior
Performance problems
Build failures
Integration issues

Use this ESPECIALLY when:

Under time pressure (emergencies make guessing tempting)
"Just one quick fix" seems obvious
You've already tried multiple fixes
Previous fix didn't work
You don't fully understand the issue

Don't skip when:

Issue seems simple (simple bugs have root causes too)
You're in a hurry (rushing guarantees rework)
Manager wants it fixed NOW (systematic is faster than thrashing)

The Four Phases

You MUST complete each phase before proceeding to the next.

Copy this checklist and track your progress:

Debugging Progress:
- [ ] Phase 1: Root Cause Investigation
  - [ ] Read error messages carefully
  - [ ] Reproduce consistently
  - [ ] Check recent changes
  - [ ] Gather evidence at component boundaries
  - [ ] Trace data flow backward to source
- [ ] Phase 2: Pattern Analysis
  - [ ] Find working examples
  - [ ] Compare against references
  - [ ] Identify differences
- [ ] Phase 3: Hypothesis and Testing
  - [ ] Form single hypothesis
  - [ ] Test minimally (one change)
  - [ ] Verify before continuing
- [ ] Phase 4: Implementation
  - [ ] Create failing test case
  - [ ] Implement single fix at root cause
  - [ ] Apply defense-in-depth
  - [ ] Remove all // debug-shim markers
  - [ ] Verify fix and tests pass

Phase 1: Root Cause Investigation

BEFORE attempting ANY fix:

1. Read Error Messages Carefully

Don't skip past errors or warnings
They often contain the exact solution
Read stack traces completely
Note line numbers, file paths, error codes

2. Reproduce Consistently

Can you trigger it reliably?
What are the exact steps?
Does it happen every time?
If not reproducible → gather more data, don't guess
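
One way to answer "does it happen every time?" is to run the suspect operation repeatedly and count the failures. The sketch below is a minimal, synchronous illustration; `measureFailureRate` and `flakyOperation` are hypothetical names introduced here, not part of any test framework.

```typescript
// Hypothetical helper: runs an operation repeatedly and records how often it throws.
interface FailureReport {
  runs: number;
  failures: number;
  firstError: string | null; // message of the first failure, useful for the write-up
}

function measureFailureRate(operation: () => void, runs: number): FailureReport {
  let failures = 0;
  let firstError: string | null = null;
  for (let i = 0; i < runs; i++) {
    try {
      operation();
    } catch (err) {
      failures++;
      if (firstError === null) {
        firstError = err instanceof Error ? err.message : String(err);
      }
    }
  }
  return { runs, failures, firstError };
}

// Illustrative stand-in for a flaky operation: it fails on every third call.
let calls = 0;
function flakyOperation(): void {
  calls++;
  if (calls % 3 === 0) {
    throw new Error(`failed on call ${calls}`);
  }
}

const report = measureFailureRate(flakyOperation, 9);
console.log(report);
```

A failure count strictly between zero and `runs` suggests a timing or state dependency, which is a cue to gather more data rather than guess.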

3. Check Recent Changes

What changed that could cause this?
Git diff, recent commits
New dependencies, config changes
Environmental differences

4. Gather Evidence in Multi-Component Systems

WHEN system has multiple components (CI → build → signing, API → service → database):

For log-heavy investigations: When errors appear in application logs, use the `reading-logs` skill for efficient analysis. Never load entire log files into context - use targeted grep and filtering.

BEFORE proposing fixes, add diagnostic instrumentation:

For EACH component boundary:
  - Log what data enters component
  - Log what data exits component
  - Verify environment/config propagation
  - Check state at each layer

Run once to gather evidence showing WHERE it breaks
THEN analyze evidence to identify failing component
THEN investigate that specific component
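
Inside a single codebase, the same boundary logging can be sketched as a wrapper around each component. Everything named here (`withBoundaryLogging`, the `parse` and `sign` stages, the data shapes) is hypothetical and only illustrates the idea; the log calls carry the `// debug-shim` marker so they are easy to find and remove later.

```typescript
// Hypothetical wrapper: logs what enters and what exits a component boundary.
type Log = (message: string, data: unknown) => void;

function withBoundaryLogging<A, R>(
  name: string,
  component: (input: A) => R,
  log: Log = (message, data) => console.error(message, data), // debug-shim
): (input: A) => R {
  return (input: A): R => {
    log(`DEBUG ${name} IN`, input); // debug-shim
    const output = component(input);
    log(`DEBUG ${name} OUT`, output); // debug-shim
    return output;
  };
}

// Illustrative two-stage pipeline; the log is captured in memory for inspection.
const entries: string[] = [];
const record: Log = (message, data) => {
  entries.push(`${message} ${JSON.stringify(data)}`);
};

const parse = withBoundaryLogging("parse", (raw: string) => ({ identity: raw.trim() }), record);
const sign = withBoundaryLogging(
  "sign",
  (req: { identity: string }) => (req.identity ? "signed" : "missing identity"),
  record,
);

const result = sign(parse("   "));
console.log(entries.join("\n"));
console.log(result);
```

In this made-up run the log shows an empty `identity` already leaving `parse`, so the investigation would start upstream of `sign` rather than in it.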

Example (multi-layer system):

bash

# Layer 1: Workflow
# Report only whether the secret is set; never print its value.
echo "=== Secrets available in workflow: ==="
echo "IDENTITY: $([ -n "${IDENTITY:-}" ] && echo SET || echo UNSET)"

# Layer 2: Build script
echo "=== Env vars in build script: ==="
env | grep IDENTITY || echo "IDENTITY not in environment"

# Layer 3: Signing script
echo "=== Keychain state: ==="
security list-keychains
security find-identity -v

# Layer 4: Actual signing
codesign --sign "$IDENTITY" --verbose=4 "$APP"

**This reveals:** Which layer fails (secrets → workflow ✓, workflow → build ✗)

5. Trace Data Flow (Root Cause Tracing)

WHEN error is deep in call stack or unclear where invalid data originated:

Don't fix symptoms. Trace backward through the call chain to find the original trigger, then fix at the source.

Use Five Whys + Backward Tracing:

Symptom: git init creates .git in source code directory
Why? → cwd parameter is empty string, defaults to process.cwd()
Why? → projectDir variable passed to git init is ''
Why? → Session.create() received empty tempDir
Why? → Test accessed context.tempDir before beforeEach initialized it
Why? → setupCoreTest() returns object with tempDir: '' initially
Root Cause: Top-level variable initialization accessing uninitialized value

Trace the Call Chain backward:

typescript

execFileAsync('git', ['init'], { cwd: projectDir })  // Symptom
  ← WorktreeManager.createSessionWorktree(projectDir, sessionId)
  ← Session.initializeWorkspace()
  ← Session.create(tempDir)
  ← Test: Project.create('name', context.tempDir)  // Root trigger

Adding Instrumentation when call chain is unclear:

typescript

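import { execFile } from "node:child_process";
import { promisify } from "node:util";

// Assumed definition: the original snippet does not show where execFileAsync comes from.
const execFileAsync = promisify(execFile);

async function gitInit(directory: string) {
  // debug-shim
  const stack = new Error().stack;
  console.error("DEBUG:", { directory, cwd: process.cwd(), stack });
  // end debug-shim
  await execFileAsync("git", ["init"], { cwd: directory });
}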

Key points:

Use `console.error()` in tests (logger may be suppressed)
Log before the operation, not after it fails
Include context: directory, cwd, environment variables

Debug Instrumentation Markers

ALL temporary debug code MUST include the `// debug-shim` marker:

typescript

console.error("DEBUG:", { value, context }); // debug-shim

This enables reliable cleanup via grep. Before completing Phase 4:

Search: `grep -r "debug-shim" .`
Remove all marked instrumentation
Verify tests still pass
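
The grep command above is the primary tool. Where a project wants the same check inside its test suite, it could be sketched roughly as below. `findDebugShims` and the in-memory `sources` map are assumptions made for illustration; a real check would read the files from disk.

```typescript
// Hypothetical check: report every line that still carries the debug-shim marker.
interface ShimLocation {
  file: string;
  line: number; // 1-based line number within the file
  text: string;
}

function findDebugShims(sources: Record<string, string>): ShimLocation[] {
  const hits: ShimLocation[] = [];
  for (const file of Object.keys(sources)) {
    sources[file].split("\n").forEach((text, index) => {
      if (text.indexOf("debug-shim") !== -1) {
        hits.push({ file, line: index + 1, text: text.trim() });
      }
    });
  }
  return hits;
}

// Illustrative input: one file still contains temporary instrumentation.
const sources: Record<string, string> = {
  "git.ts": 'async function gitInit(dir: string) {\n  console.error("DEBUG:", dir); // debug-shim\n}',
  "session.ts": "export const ready = true;",
};

const leftovers = findDebugShims(sources);
console.log(leftovers);
```

An empty result is the condition to reach before Phase 4 is considered complete.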

For language-specific variants (Python, Bash, JSX), see references/debugging-techniques.md#debug-shim-markers.

Verify the Root Cause:

If you fix at the source, does the symptom disappear?
Does the fix prevent recurrence across all code paths?
Can you add validation to catch it early?

Tactical Debugging Techniques

When executing the four phases, use these techniques to gather evidence:

Binary Search / Code Bisection: Systematically narrow down the problem area
Minimal Reproduction: Strip away everything non-essential
Strategic Logging & Instrumentation: Add diagnostic output at key points
Runtime Assertions: Make assumptions explicit and fail fast
Differential Analysis: Compare working vs broken states
Multi-Component System Debugging: Add instrumentation at each boundary
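
As an illustration of the binary search technique, the sketch below bisects an ordered history to find the first bad entry. It assumes the first entry is known good, the last is known bad, and the bug stays present once introduced. `findFirstBad` and the commit labels are hypothetical, and the `isBad` callback stands in for "check out this commit and run the failing test".

```typescript
// Hypothetical bisection over an ordered history (oldest first).
// Precondition: history[0] is good, the last entry is bad.
function findFirstBad<T>(history: T[], isBad: (item: T) => boolean): T {
  let good = 0; // index known to be good
  let bad = history.length - 1; // index known to be bad
  while (bad - good > 1) {
    const mid = Math.floor((good + bad) / 2);
    if (isBad(history[mid])) {
      bad = mid;
    } else {
      good = mid;
    }
  }
  return history[bad];
}

// Illustrative history: the bug was introduced at "e5".
const commits = ["a1", "b2", "c3", "d4", "e5", "f6", "g7", "h8"];
const checked: string[] = [];

const firstBad = findFirstBad(commits, (commit) => {
  checked.push(commit);
  return commits.indexOf(commit) >= 4; // stand-in for running the failing test
});

console.log(firstBad, checked);
```

Eight candidates are narrowed down in three checks, which is why bisection tends to beat reading every change in order.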

Phase 2: Pattern Analysis

Find the pattern before fixing:

Find Working Examples
- Locate similar working code in same codebase
- What works that's similar to what's broken?
Compare Against References
- If implementing pattern, read reference implementation COMPLETELY
- Don't skim - read every line
- Understand the pattern fully before applying
Identify Differences
- What's different between working and broken?
- List every difference, however small
- Don't assume "that can't matter"
Understand Dependencies
- What other components does this need?
- What settings, config, environment?
- What assumptions does it make?
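
Listing every difference is easier to do mechanically than by eye. The following is a minimal sketch that compares two flat configuration objects; `listDifferences` and the sample settings are hypothetical, and nested structures would need a deeper comparison than this.

```typescript
// Hypothetical differential analysis over flat settings objects.
type Config = Record<string, string | number | boolean | undefined>;

interface Difference {
  key: string;
  working: unknown;
  broken: unknown;
}

function listDifferences(working: Config, broken: Config): Difference[] {
  // Collect the union of keys so settings missing on one side are reported too.
  const keys = Object.keys(working);
  for (const key of Object.keys(broken)) {
    if (keys.indexOf(key) === -1) {
      keys.push(key);
    }
  }
  return keys
    .sort()
    .filter((key) => working[key] !== broken[key])
    .map((key) => ({ key, working: working[key], broken: broken[key] }));
}

// Illustrative inputs: a passing setup and a failing one.
const workingSetup: Config = { NODE_ENV: "test", timeoutMs: 5000, tempDir: "/tmp/session-1", retries: 2 };
const brokenSetup: Config = { NODE_ENV: "test", timeoutMs: 5000, tempDir: "", verbose: true };

const differences = listDifferences(workingSetup, brokenSetup);
console.log(differences);
```

Every entry in the result is a candidate, including the ones that look like they "can't matter".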

Phase 3: Hypothesis and Testing

Scientific method:

Form Single Hypothesis
- State clearly: "I think X is the root cause because Y"
- Write it down
- Be specific, not vague
Test Minimally
- Make the SMALLEST possible change to test hypothesis
- One variable at a time
- Don't fix multiple things at once
Verify Before Continuing
- Did it work? Yes → Phase 4
- Didn't work? Form NEW hypothesis
- DON'T add more fixes on top
When You Don't Know
- Say "I don't understand X"
- Don't pretend to know
- Ask for help
- Research more

Phase 4: Implementation

Fix the root cause, not the symptom:

1. Create Failing Test Case

Simplest possible reproduction
Automated test if possible
One-off test script if no framework
MUST have before fixing
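
A one-off test script can be very small. The sketch below reduces the earlier `git init` example to a hypothetical directory-resolving function and shows a check that fails against the buggy behavior and passes against the intended one. `resolveBuggy`, `resolveFixed`, and `emptyDirIsRejected` are illustrative names, not code from the project described above.

```typescript
// Hypothetical unit under test: decides which directory a workspace command runs in.
type ResolveDir = (projectDir: string, fallbackCwd: string) => string;

// Buggy behavior: an empty projectDir silently falls back to the current directory.
const resolveBuggy: ResolveDir = (projectDir, fallbackCwd) => projectDir || fallbackCwd;

// Intended behavior: an empty projectDir is rejected instead of being defaulted.
const resolveFixed: ResolveDir = (projectDir) => {
  if (projectDir === "") {
    throw new Error("projectDir must not be empty");
  }
  return projectDir;
};

// One-off regression check: "PASS" only when empty input is rejected.
function emptyDirIsRejected(resolve: ResolveDir): "PASS" | "FAIL" {
  try {
    resolve("", "/repo/src");
    return "FAIL"; // no error was raised, so the command would run in the source tree
  } catch (_err) {
    return "PASS";
  }
}

const beforeFix = emptyDirIsRejected(resolveBuggy);
const afterFix = emptyDirIsRejected(resolveFixed);
console.log({ beforeFix, afterFix });
```

Seeing the check fail before the fix is what shows that it exercises the bug at all.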

2. Implement Single Fix

Address the root cause identified
ONE change at a time
No "while I'm here" improvements
No bundled refactoring

3. Apply Defense-in-Depth

Don't just fix the root cause - add validation at each layer:

Root fix: Prevent the bug at its source
Layer 1: Entry point validates inputs
Layer 2: Core logic validates preconditions
Layer 3: Environment guards (NODE_ENV checks, directory restrictions)

Result: Bug impossible to reintroduce, even with future code changes.
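
The three layers might look roughly like this in code. This is a sketch under stated assumptions: all function names are hypothetical, paths are assumed to be POSIX-style, the `/tmp/` restriction is an example policy, and the environment is passed in as a plain object so the snippet stays self-contained.

```typescript
// All names are hypothetical; the layers mirror the list above.
type Env = Record<string, string | undefined>;

// Layer 1: the entry point validates its inputs.
function createProject(workingDir: string, env: Env): string {
  if (workingDir.trim() === "") {
    throw new Error("createProject: workingDir must be a non-empty path");
  }
  return initWorkspace(workingDir, env);
}

// Layer 2: core logic validates its own preconditions (POSIX paths assumed).
function initWorkspace(workingDir: string, env: Env): string {
  if (workingDir.charAt(0) !== "/") {
    throw new Error("initWorkspace: workingDir must be an absolute path");
  }
  guardTestDirectory(workingDir, env);
  return `initialized ${workingDir}`;
}

// Layer 3: environment guard, refusing non-temporary directories during tests.
function guardTestDirectory(workingDir: string, env: Env): void {
  if (env["NODE_ENV"] === "test" && workingDir.indexOf("/tmp/") !== 0) {
    throw new Error("guard: tests may only initialize directories under /tmp/");
  }
}

// Helper for the demonstration: returns the result or the error message.
function attempt(run: () => string): string {
  try {
    return run();
  } catch (err) {
    return err instanceof Error ? err.message : String(err);
  }
}

const testEnv: Env = { NODE_ENV: "test" };
const outcomes = [
  attempt(() => createProject("", testEnv)),
  attempt(() => createProject("relative/dir", testEnv)),
  attempt(() => createProject("/home/me/src", testEnv)),
  attempt(() => createProject("/tmp/session-1", testEnv)),
];
console.log(outcomes.join("\n"));
```

Each bad input is stopped by a different layer, so a future change that removes one check still leaves the others in place.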

4. Verify Fix

Test passes now?
No other tests broken?
Issue actually resolved?

5. If Fix Doesn't Work

STOP
Count: How many fixes have you tried?
If < 3: Return to Phase 1, re-analyze with new information
If ≥ 3: STOP and question the architecture (step 6 below)
DON'T attempt Fix #4 without architectural discussion

6. If 3+ Fixes Failed: Question Architecture

Pattern indicating architectural problem:

Each fix reveals new shared state/coupling/problem in different place
Fixes require "massive refactoring" to implement
Each fix creates new symptoms elsewhere

STOP and question fundamentals:

Is this pattern fundamentally sound?
Are we "sticking with it through sheer inertia"?
Should we refactor architecture vs. continue fixing symptoms?

Discuss with your human partner before attempting more fixes

This is NOT a failed hypothesis - this is a wrong architecture.

Red Flags - STOP and Follow Process

If you catch yourself thinking:

"Quick fix for now, investigate later"
"Just try changing X and see if it works"
"Add multiple changes, run tests"
"Skip the test, I'll manually verify"
"It's probably X, let me fix that"
"I don't fully understand but this might work"
"Pattern says X but I'll adapt it differently"
"Here are the main problems: [lists fixes without investigation]"
Proposing solutions before tracing data flow
"One more fix attempt" (when already tried 2+)
Each fix reveals new problem in different place

ALL of these mean: STOP. Return to Phase 1.

If 3+ fixes failed: Question the architecture (see Phase 4.6)

Partner Signals You're Doing It Wrong

Watch for these redirections:

"Is that not happening?" - You assumed without verifying
"Will it show us...?" - You should have added evidence gathering
"Stop guessing" - You're proposing fixes without understanding
"Ultrathink this" - Question fundamentals, not just symptoms
"We're stuck?" (frustrated) - Your approach isn't working

When you see these: STOP. Return to Phase 1.

Common Rationalizations

| Excuse | Reality |
| --- | --- |
| "Issue is simple, don't need process" | Simple issues have root causes too. Process is fast for simple bugs. |
| "Emergency, no time for process" | Systematic debugging is FASTER than guess-and-check thrashing. |
| "Just try this first, then investigate" | First fix sets the pattern. Do it right from the start. |
| "I'll write test after confirming fix works" | Untested fixes don't stick. Test first proves it. |
| "Multiple fixes at once saves time" | Can't isolate what worked. Causes new bugs. |
| "Reference too long, I'll adapt the pattern" | Partial understanding guarantees bugs. Read it completely. |
| "I see the problem, let me fix it" | Seeing symptoms ≠ understanding root cause. |
| "One more fix attempt" (after 2+ failures) | 3+ failures = architectural problem. Question pattern, don't fix again. |

Quick Reference

| Phase | Key Activities | Success Criteria |
| --- | --- | --- |
| 1. Root Cause | Read errors, reproduce, check changes, trace data flow | Understand WHAT and WHY |
| 2. Pattern | Find working examples, compare | Identify differences |
| 3. Hypothesis | Form theory, test minimally | Confirmed or new hypothesis |
| 4. Implementation | Create test, fix with defense-in-depth, verify | Bug resolved, tests pass |

Reporting Your Findings

After completing the debugging process:

markdown

## Root Cause

[Explain the underlying issue in 1-3 sentences]
Located in: `file.ts:123`

## What Was Wrong

[Describe the specific problem - mutation, race condition, missing validation, incorrect assumption, etc. Be technical and specific.]

## The Fix

[Describe the changes made and why they address the root cause]

Changes in:
- `file.ts:123-125` - [what changed and why]
- `test.ts:45` - [added regression test]

## Verification

- Bug reproduced and confirmed fixed
- Existing tests pass
- Added regression test
- Checked for similar issues in related code
- No new errors or warnings introduced

When Process Reveals "No Root Cause"

If systematic investigation reveals issue is truly environmental, timing-dependent, or external:

You've completed the process
Document what you investigated
Implement appropriate handling (retry, timeout, error message)
Add monitoring/logging for future investigation
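
For a genuinely external failure, "appropriate handling" could be a bounded retry that records each failed attempt for later investigation. The sketch below is deliberately synchronous to keep it short; real handling would likely be asynchronous with delays between attempts. `retry` and `fetchStatus` are hypothetical names.

```typescript
// Hypothetical bounded retry that keeps a record of every failed attempt.
interface RetryResult<T> {
  value: T;
  attempts: number;
  errors: string[]; // one entry per failed attempt, for monitoring or logs
}

function retry<T>(operation: () => T, maxAttempts: number): RetryResult<T> {
  const errors: string[] = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return { value: operation(), attempts: attempt, errors };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      errors.push(`attempt ${attempt}: ${message}`);
    }
  }
  // Surface the full history instead of hiding the failure.
  throw new Error(`gave up after ${maxAttempts} attempts: ${errors.join("; ")}`);
}

// Illustrative external call: fails twice, then succeeds.
let remainingFailures = 2;
function fetchStatus(): string {
  if (remainingFailures > 0) {
    remainingFailures--;
    throw new Error("connection reset");
  }
  return "ok";
}

const outcome = retry(fetchStatus, 5);
console.log(outcome);
```

Keeping the per-attempt errors means the retry does not erase the evidence a later investigation would need.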

But: 95% of "no root cause" cases are incomplete investigation.

Integration

Complementary skills:

- `writing-tests` - For creating failing test case in Phase 4
- `condition-based-waiting` - Replace arbitrary timeouts identified in Phase 2
- `verification-before-completion` - Verify fix worked before claiming success
- `reading-logs` - Efficient log analysis for evidence gathering in Phases 1-2

Real-World Impact

From debugging sessions:

Systematic approach: 15-30 minutes to fix
Random fixes approach: 2-3 hours of thrashing
First-time fix rate: 95% vs 40%
New bugs introduced: Near zero vs common

Remember: Fixing symptoms creates technical debt. Finding root causes eliminates entire classes of bugs.
