root-cause-tracing

Root Cause Tracing

Overview

Bugs often manifest deep in the call stack (git init in wrong directory, file created in wrong location, database opened with wrong path). Your instinct is to fix where the error appears, but that's treating a symptom.
Core principle: Trace backward through the call chain until you find the original trigger, then fix at the source.

When to Use

dot
digraph when_to_use {
    "Bug appears deep in stack?" [shape=diamond];
    "Can trace backwards?" [shape=diamond];
    "Fix at symptom point" [shape=box];
    "Trace to original trigger" [shape=box];
    "BETTER: Also add defense-in-depth" [shape=box];

    "Bug appears deep in stack?" -> "Can trace backwards?" [label="yes"];
    "Can trace backwards?" -> "Trace to original trigger" [label="yes"];
    "Can trace backwards?" -> "Fix at symptom point" [label="no - dead end"];
    "Trace to original trigger" -> "BETTER: Also add defense-in-depth";
}
Use when:
  • Error happens deep in execution (not at entry point)
  • Stack trace shows long call chain
  • Unclear where invalid data originated
  • Need to find which test/code triggers the problem

The Tracing Process

1. Observe the Symptom

Error: git init failed in /Users/jesse/project/packages/core

2. Find Immediate Cause

What code directly causes this?
typescript
await execFileAsync('git', ['init'], { cwd: projectDir });

3. Ask: What Called This?

typescript
WorktreeManager.createSessionWorktree(projectDir, sessionId)
  → called by Session.initializeWorkspace()
  → called by Session.create()
  → called by test at Project.create()

4. Keep Tracing Up

What value was passed?
  • projectDir = '' (empty string!)
  • Empty string as cwd resolves to process.cwd()
  • That's the source code directory!

5. Find Original Trigger

Where did empty string come from?
typescript
const context = setupCoreTest(); // Returns { tempDir: '' }
Project.create('name', context.tempDir); // Accessed before beforeEach!
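
The failure mode above can be reproduced in isolation. The following is a minimal sketch, not the project's code: `registerBeforeEach` is a hand-rolled stand-in for a test runner's `beforeEach`, `setupCoreTestSketch` is a hypothetical stand-in for `setupCoreTest()` (whose real implementation is not shown here), and the temp path is made up.

```typescript
// Hypothetical stand-in for a test runner's hook registry.
const hooks: Array<() => void> = [];
function registerBeforeEach(fn: () => void): void {
  hooks.push(fn);
}

// Hypothetical stand-in for setupCoreTest(): tempDir starts empty and is
// only assigned when the registered hook runs.
function setupCoreTestSketch(): { tempDir: string } {
  const ctx = { tempDir: '' };
  registerBeforeEach(() => {
    ctx.tempDir = '/tmp/core-test-example'; // made-up path for illustration
  });
  return ctx;
}

const ctx = setupCoreTestSketch();

// Read at module load time, before any hook has run: captures the empty string.
const capturedTooEarly = ctx.tempDir;

// A runner executes the hooks later, right before each test body.
for (const hook of hooks) {
  hook();
}

// Read after the hooks have run: the assigned directory.
const readInsideTest = ctx.tempDir;

console.error({ capturedTooEarly, readInsideTest });
```

The early read copies the empty string by value, so the later assignment inside the hook never reaches code that already captured it.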

Adding Stack Traces

When you can't trace manually, add instrumentation:
typescript
// Before the problematic operation
async function gitInit(directory: string) {
  const stack = new Error().stack;
  console.error('DEBUG git init:', {
    directory,
    cwd: process.cwd(),
    nodeEnv: process.env.NODE_ENV,
    stack,
  });

  await execFileAsync('git', ['init'], { cwd: directory });
}
Critical: Use console.error() in tests (not logger - may not show)
Run and capture:
bash
npm test 2>&1 | grep 'DEBUG git init'
Analyze stack traces:
  • Look for test file names
  • Find the line number triggering the call
  • Identify the pattern (same test? same parameter?)
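
When the captured output is long, picking out the test-file frames can be automated. This is a small sketch under assumptions: `findTestFrames` is a hypothetical helper, the `*.test.ts` / `*.test.js` naming convention is assumed from the glob used elsewhere in this document, and the sample stack text is invented for illustration.

```typescript
// Return the stack frames that point into test files (*.test.ts, *.test.js, ...).
function findTestFrames(stack: string): string[] {
  return stack
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => /\.test\.[jt]sx?\b/.test(line));
}

// Example stack text, made up for illustration.
const sampleStack = [
  'Error',
  '    at gitInit (/repo/src/git.ts:12:17)',
  '    at WorktreeManager.createSessionWorktree (/repo/src/worktree.ts:40:11)',
  '    at Object.<anonymous> (/repo/src/project.test.ts:8:3)',
].join('\n');

const testFrames = findTestFrames(sampleStack);
console.error(testFrames);
```

Each returned frame carries the file name and line number, which is usually enough to spot whether the same test or the same parameter keeps showing up.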

Finding Which Test Causes Pollution

If an unwanted file or directory appears during tests but you don't know which test creates it:
Use the polluter-finding script: @find-polluter.sh
bash
./find-polluter.sh '.git' 'src/**/*.test.ts'
Runs tests one-by-one, stops at first polluter. See script for usage.
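
The script itself is not reproduced here. As a rough illustration of the one-by-one search it is described as performing, here is a sketch in TypeScript. It is not the script's actual contents: `findPolluter` and both callbacks are hypothetical, and the fake runner below only simulates a test file that leaves `.git` behind.

```typescript
// Run each test file in order and return the first one after which the
// unwanted artifact exists. Both callbacks are supplied by the caller.
function findPolluter(
  testFiles: string[],
  runTestFile: (file: string) => void,
  pollutionExists: () => boolean,
): string | null {
  if (pollutionExists()) {
    throw new Error('Pollution already present before running any test');
  }
  for (const file of testFiles) {
    runTestFile(file);
    if (pollutionExists()) {
      return file; // first polluter found, stop here
    }
  }
  return null; // no test file produced the artifact
}

// Simulated run: pretend b.test.ts is the file that creates .git.
let polluted = false;
const polluter = findPolluter(
  ['a.test.ts', 'b.test.ts', 'c.test.ts'],
  (file) => {
    if (file === 'b.test.ts') polluted = true;
  },
  () => polluted,
);
console.error('polluter:', polluter);
```

Checking for the artifact before the first run matters: if it already exists, every test would look guilty.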

Real Example: Empty projectDir

Symptom: .git created in packages/core/ (source code)
Trace chain:
  1. git init runs in process.cwd() ← empty cwd parameter
  2. WorktreeManager called with empty projectDir
  3. Session.create() passed empty string
  4. Test accessed context.tempDir before beforeEach
  5. setupCoreTest() returns { tempDir: '' } initially
Root cause: Top-level variable initialization accessing empty value
Fix: Made tempDir a getter that throws if accessed before beforeEach
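
The actual fix is not shown in this document. One way such a getter could look is sketched below; `createTestContext`, `assign`, the error message, and the temp path are all hypothetical.

```typescript
// Hypothetical sketch: tempDir is exposed through a getter that refuses to
// return a value until the setup hook has assigned one.
function createTestContext(): { readonly tempDir: string; assign(dir: string): void } {
  let dir: string | undefined;
  return {
    get tempDir(): string {
      if (dir === undefined) {
        throw new Error('tempDir accessed before beforeEach ran');
      }
      return dir;
    },
    // In a real suite this would be called from inside beforeEach.
    assign(value: string): void {
      dir = value;
    },
  };
}

const guarded = createTestContext();

// Reading too early now fails loudly instead of yielding ''.
let earlyError = '';
try {
  void guarded.tempDir;
} catch (err) {
  earlyError = err instanceof Error ? err.message : String(err);
}

// After setup has assigned a directory, the read succeeds.
guarded.assign('/tmp/core-test-example'); // made-up path for illustration
const lateValue = guarded.tempDir;

console.error({ earlyError, lateValue });
```

The point of the design is that the mistake surfaces at the line that makes it, rather than five calls later as a stray .git directory.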
Also added defense-in-depth:
  • Layer 1: Project.create() validates directory
  • Layer 2: WorkspaceManager validates not empty
  • Layer 3: NODE_ENV guard refuses git init outside tmpdir
  • Layer 4: Stack trace logging before git init
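
The layers themselves are not shown in this document. As an illustration only, the sketch below combines two of them (reject an empty directory; during tests, refuse anything outside the OS temp directory) into one hypothetical guard, `assertSafeGitInitDir`. Passing `nodeEnv` as a parameter instead of reading `process.env.NODE_ENV` directly is a choice made here to keep the example easy to exercise.

```typescript
import * as os from 'node:os';
import * as path from 'node:path';

// Hypothetical guard: throws if the directory is empty, or if nodeEnv is
// 'test' and the directory resolves to a location outside os.tmpdir().
function assertSafeGitInitDir(directory: string, nodeEnv: string | undefined): string {
  if (directory.trim() === '') {
    throw new Error('git init refused: directory is empty');
  }
  const resolved = path.resolve(directory);
  if (nodeEnv === 'test') {
    const tmpRoot = path.resolve(os.tmpdir());
    const relative = path.relative(tmpRoot, resolved);
    // Outside tmpRoot if the relative path climbs upward or is absolute
    // (the latter happens on Windows when the drives differ).
    const insideTmp = relative.split(path.sep)[0] !== '..' && !path.isAbsolute(relative);
    if (!insideTmp) {
      throw new Error(`git init refused during tests: ${resolved} is outside ${tmpRoot}`);
    }
  }
  return resolved;
}

// Helper for the demonstration: returns the error message, or '' if allowed.
function tryGuard(directory: string, nodeEnv: string | undefined): string {
  try {
    assertSafeGitInitDir(directory, nodeEnv);
    return '';
  } catch (err) {
    return err instanceof Error ? err.message : String(err);
  }
}

const emptyResult = tryGuard('', 'test');
const insideResult = tryGuard(path.join(os.tmpdir(), 'session-1'), 'test');
const outsideResult = tryGuard(path.resolve(os.tmpdir(), '..'), 'test');

console.error({ emptyResult, insideResult, outsideResult });
```

A guard like this would sit directly in front of the git init call, so even if an empty value slips past the earlier layers, the command never runs in the source tree.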

Key Principle

dot
digraph principle {
    "Found immediate cause" [shape=ellipse];
    "Can trace one level up?" [shape=diamond];
    "Trace backwards" [shape=box];
    "Is this the source?" [shape=diamond];
    "Fix at source" [shape=box];
    "Add validation at each layer" [shape=box];
    "Bug impossible" [shape=doublecircle];
    "NEVER fix just the symptom" [shape=octagon, style=filled, fillcolor=red, fontcolor=white];

    "Found immediate cause" -> "Can trace one level up?";
    "Can trace one level up?" -> "Trace backwards" [label="yes"];
    "Can trace one level up?" -> "NEVER fix just the symptom" [label="no"];
    "Trace backwards" -> "Is this the source?";
    "Is this the source?" -> "Trace backwards" [label="no - keeps going"];
    "Is this the source?" -> "Fix at source" [label="yes"];
    "Fix at source" -> "Add validation at each layer";
    "Add validation at each layer" -> "Bug impossible";
}
NEVER fix just where the error appears. Trace back to find the original trigger.

Stack Trace Tips

  • In tests: Use console.error(), not logger - logger may be suppressed
  • Before operation: Log before the dangerous operation, not after it fails
  • Include context: Directory, cwd, environment variables, timestamps
  • Capture stack: new Error().stack shows complete call chain

Real-World Impact

From debugging session (2025-10-03):
  • Found root cause through 5-level trace
  • Fixed at source (getter validation)
  • Added 4 layers of defense
  • 1847 tests passed, zero pollution