debugging
Mode: Cognitive/Prompt-Driven. No standalone utility script; use via agent context.
Systematic Debugging
Overview
Random fixes waste time and create new bugs. Quick patches mask underlying issues.
Core principle: ALWAYS find root cause before attempting fixes. Symptom fixes are failure.
Violating the letter of this process is violating the spirit of debugging.
The Iron Law
NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST
If you haven't completed Phase 1, you cannot propose fixes.
When to Use
Use for ANY technical issue:
- Test failures
- Bugs in production
- Unexpected behavior
- Performance problems
- Build failures
- Integration issues
Use this ESPECIALLY when:
- Under time pressure (emergencies make guessing tempting)
- "Just one quick fix" seems obvious
- You've already tried multiple fixes
- Previous fix didn't work
- You don't fully understand the issue
Don't skip when:
- Issue seems simple (simple bugs have root causes too)
- You're in a hurry (rushing guarantees rework)
- Manager wants it fixed NOW (systematic is faster than thrashing)
The Four Phases
You MUST complete each phase before proceeding to the next.
Phase 1: Root Cause Investigation
BEFORE attempting ANY fix:
1. Read Error Messages Carefully
   - Don't skip past errors or warnings
   - They often contain the exact solution
   - Read stack traces completely
   - Note line numbers, file paths, error codes
2. Reproduce Consistently
   - Can you trigger it reliably?
   - What are the exact steps?
   - Does it happen every time?
   - If not reproducible, gather more data; don't guess
3. Check Recent Changes
   - What changed that could cause this?
   - Git diff, recent commits
   - New dependencies, config changes
   - Environmental differences
4. Gather Evidence in Multi-Component Systems

   WHEN the system has multiple components (CI -> build -> signing, API -> service -> database):

   BEFORE proposing fixes, add diagnostic instrumentation.

   For EACH component boundary:
   - Log what data enters the component
   - Log what data exits the component
   - Verify environment/config propagation
   - Check state at each layer

   Run once to gather evidence showing WHERE it breaks,
   THEN analyze the evidence to identify the failing component,
   THEN investigate that specific component.

   Example (multi-layer system):

   ```bash
   # Layer 1: Workflow
   echo "=== Secrets available in workflow: ==="
   # Report only whether the secret is set; never print its value.
   identity_status="${IDENTITY:+SET}"
   echo "IDENTITY: ${identity_status:-UNSET}"

   # Layer 2: Build script
   echo "=== Env vars in build script: ==="
   env | grep IDENTITY || echo "IDENTITY not in environment"

   # Layer 3: Signing script
   echo "=== Keychain state: ==="
   security list-keychains
   security find-identity -v

   # Layer 4: Actual signing
   codesign --sign "$IDENTITY" --verbose=4 "$APP"
   ```

   This reveals which layer fails (for example: secrets -> workflow OK, workflow -> build FAIL).
5. Trace Data Flow

   WHEN the error is deep in the call stack:

   See root-cause-tracing.md in this directory for the complete backward tracing technique.

   Quick version:
   - Where does the bad value originate?
   - What called this with the bad value?
   - Keep tracing up until you find the source
   - Fix at source, not at symptom
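The "Reproduce Consistently" step can be made measurable with a small loop. The following is a minimal sketch, assuming the failure shows up as a non-zero exit status; `repro_rate` and the `./run_tests.sh` command in the usage comment are hypothetical names, not part of this skill.

```bash
# Sketch: run a command N times and report how often it fails.
# `repro_rate` is a hypothetical helper introduced for illustration.
repro_rate() {
  local runs="$1"
  shift
  local failures=0 attempt=0
  while [ "$attempt" -lt "$runs" ]; do
    attempt=$((attempt + 1))
    # Output is discarded here; rerun a failing case by hand to read the full error.
    if ! "$@" >/dev/null 2>&1; then
      failures=$((failures + 1))
    fi
  done
  echo "${failures}/${runs} runs failed"
}

# Usage (hypothetical test command):
#   repro_rate 20 ./run_tests.sh integration
```

A result such as "20/20 runs failed" suggests a reliable reproduction; anything in between suggests gathering more data (timing, environment, ordering) before forming a hypothesis.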
Phase 2: Pattern Analysis
Find the pattern before fixing:
1. Find Working Examples
   - Locate similar working code in same codebase
   - What works that's similar to what's broken?
2. Compare Against References
   - If implementing a pattern, read the reference implementation COMPLETELY
   - Don't skim; read every line
   - Understand the pattern fully before applying
3. Identify Differences
   - What's different between working and broken?
   - List every difference, however small
   - Don't assume "that can't matter"
4. Understand Dependencies
   - What other components does this need?
   - What settings, config, environment?
   - What assumptions does it make?
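For "Identify Differences", a mechanical diff is often more complete than reading two files by eye. This is a minimal sketch, assuming the working and broken variants exist as two text files; `list_differences` and the file names in the usage comment are hypothetical.

```bash
# Sketch: list every difference between a working and a broken file.
# `list_differences` is a hypothetical helper introduced for illustration.
list_differences() {
  local status=0
  diff -u "$1" "$2" || status=$?
  # diff exits 0 when identical and 1 when differences were found;
  # anything higher means diff itself could not run the comparison.
  [ "$status" -le 1 ]
}

# Usage (hypothetical file names):
#   list_differences config/working.conf config/broken.conf
```

Every `-` and `+` line in the output is a candidate difference to write down, including the ones that look irrelevant.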
Phase 3: Hypothesis and Testing
Scientific method:
1. Form Single Hypothesis
   - State clearly: "I think X is the root cause because Y"
   - Write it down
   - Be specific, not vague
2. Test Minimally
   - Make the SMALLEST possible change to test hypothesis
   - One variable at a time
   - Don't fix multiple things at once
3. Verify Before Continuing
   - Did it work? Yes -> Phase 4
   - Didn't work? Form NEW hypothesis
   - DON'T add more fixes on top
4. When You Don't Know
   - Say "I don't understand X"
   - Don't pretend to know
   - Ask for help
   - Research more
Phase 4: Implementation
Fix the root cause, not the symptom:
1. Create Failing Test Case
   - Simplest possible reproduction
   - Automated test if possible
   - One-off test script if no framework
   - MUST have before fixing
   - Use the tdd skill for writing proper failing tests
2. Implement Single Fix
   - Address the root cause identified
   - ONE change at a time
   - No "while I'm here" improvements
   - No bundled refactoring
3. Verify Fix
   - Test passes now?
   - No other tests broken?
   - Issue actually resolved?
4. If Fix Doesn't Work
   - STOP
   - Count: How many fixes have you tried?
   - If < 3: Return to Phase 1, re-analyze with new information
   - If >= 3: STOP and question the architecture (step 5 below)
   - DON'T attempt Fix #4 without architectural discussion
5. If 3+ Fixes Failed: Question Architecture

   Pattern indicating architectural problem:
   - Each fix reveals new shared state/coupling/problem in different place
   - Fixes require "massive refactoring" to implement
   - Each fix creates new symptoms elsewhere

   STOP and question fundamentals:
   - Is this pattern fundamentally sound?
   - Are we "sticking with it through sheer inertia"?
   - Should we refactor architecture vs. continue fixing symptoms?

   Discuss with your human partner before attempting more fixes.

   This is NOT a failed hypothesis; this is a wrong architecture.
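Where no test framework is available, the "one-off test script" from step 1 can be very small. The sketch below assumes a shell codebase; `strip_trailing_slash` is a hypothetical stand-in for the code under investigation and is deliberately buggy, and `assert_eq` is an illustrative helper rather than part of the tdd skill.

```bash
# One-off reproduction script: no test framework needed.
# `strip_trailing_slash` is a hypothetical stand-in for the code under
# investigation; it is deliberately buggy (it leaves the slash in place).
strip_trailing_slash() {
  printf '%s\n' "$1"
}

# Minimal assertion helper: prints the mismatch and returns non-zero on failure.
assert_eq() {
  local expected="$1" actual="$2" label="$3"
  if [ "$expected" = "$actual" ]; then
    echo "PASS: $label"
  else
    echo "FAIL: $label (expected '$expected', got '$actual')"
    return 1
  fi
}

# Simplest reproduction of the reported bug. It should FAIL before the fix
# and PASS after it, with no change to the test itself.
test_trailing_slash_is_removed() {
  assert_eq "/var/data" "$(strip_trailing_slash "/var/data/")" "trailing slash removed"
}

# Usage: call test_trailing_slash_is_removed and check its exit status.
```

Keeping the test unchanged across the fix is the point: if the test has to be edited to pass, the fix has not been verified against the original reproduction.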
Red Flags - STOP and Follow Process
If you catch yourself thinking:
- "Quick fix for now, investigate later"
- "Just try changing X and see if it works"
- "Add multiple changes, run tests"
- "Skip the test, I'll manually verify"
- "It's probably X, let me fix that"
- "I don't fully understand but this might work"
- "Pattern says X but I'll adapt it differently"
- "Here are the main problems: [lists fixes without investigation]"
- Proposing solutions before tracing data flow
- "One more fix attempt" (when already tried 2+)
- Each fix reveals new problem in different place
ALL of these mean: STOP. Return to Phase 1.
If 3+ fixes failed: Question the architecture (see Phase 4, step 5)
Your Human Partner's Signals You're Doing It Wrong
Watch for these redirections:
- "Is that not happening?" - You assumed without verifying
- "Will it show us...?" - You should have added evidence gathering
- "Stop guessing" - You're proposing fixes without understanding
- "Ultrathink this" - Question fundamentals, not just symptoms
- "We're stuck?" (frustrated) - Your approach isn't working
When you see these: STOP. Return to Phase 1.
Common Rationalizations
| Excuse | Reality |
|---|---|
| "Issue is simple, don't need process" | Simple issues have root causes too. Process is fast for simple bugs. |
| "Emergency, no time for process" | Systematic debugging is FASTER than guess-and-check thrashing. |
| "Just try this first, then investigate" | First fix sets the pattern. Do it right from the start. |
| "I'll write test after confirming fix works" | Untested fixes don't stick. Test first proves it. |
| "Multiple fixes at once saves time" | Can't isolate what worked. Causes new bugs. |
| "Reference too long, I'll adapt the pattern" | Partial understanding guarantees bugs. Read it completely. |
| "I see the problem, let me fix it" | Seeing symptoms does not equal understanding root cause. |
| "One more fix attempt" (after 2+ failures) | 3+ failures = architectural problem. Question pattern, don't fix again. |
Quick Reference
| Phase | Key Activities | Success Criteria |
|---|---|---|
| 1. Root Cause | Read errors, reproduce, check changes, gather evidence | Understand WHAT and WHY |
| 2. Pattern | Find working examples, compare | Identify differences |
| 3. Hypothesis | Form theory, test minimally | Confirmed or new hypothesis |
| 4. Implementation | Create test, fix, verify | Bug resolved, tests pass |
When Process Reveals "No Root Cause"
If systematic investigation reveals issue is truly environmental, timing-dependent, or external:
- You've completed the process
- Document what you investigated
- Implement appropriate handling (retry, timeout, error message)
- Add monitoring/logging for future investigation
But: 95% of "no root cause" cases are incomplete investigation.
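When the cause really is external, "appropriate handling" plus logging might look like a bounded retry that records every failed attempt. This is a minimal sketch under that assumption; `retry_with_log`, its fixed delay, and the `curl` call in the usage comment are illustrative choices, not something this skill prescribes.

```bash
# Sketch: bounded retry that logs each failed attempt to stderr.
# `retry_with_log` and the fixed delay are assumptions for illustration.
retry_with_log() {
  local max_attempts="$1" delay_seconds="$2"
  shift 2
  local attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    echo "attempt ${attempt}/${max_attempts} failed: $*" >&2
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${max_attempts} attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay_seconds"
  done
}

# Usage (hypothetical endpoint):
#   retry_with_log 3 2 curl --fail --silent https://example.invalid/health
```

The per-attempt log lines are what make a later investigation possible; a silent retry would hide how often the external failure actually occurs.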
Supporting Techniques
These techniques are part of systematic debugging and available in this directory:
- root-cause-tracing.md - Trace bugs backward through call stack to find original trigger
- defense-in-depth.md - Add validation at multiple layers after finding root cause
- condition-based-waiting.md - Replace arbitrary timeouts with condition polling
- find-polluter - For test pollution bisection (flaky tests due to shared state): run .claude/tools/analysis/find-polluter/find-polluter.sh (or find-polluter.ps1 on Windows) from the project root to isolate which test pollutes the suite.
Related skills:
- tdd - For creating failing test case (Phase 4, Step 1)
- verification-before-completion - Verify fix worked before claiming success
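The idea behind replacing arbitrary timeouts with condition polling, listed under Supporting Techniques, can be sketched as follows. This is not the content of condition-based-waiting.md; `wait_for` and the file path in the usage comment are hypothetical, and the sketch assumes a `date` that supports `+%s`.

```bash
# Sketch: poll a condition until it holds or a deadline passes,
# instead of sleeping for a fixed, guessed duration.
# `wait_for` is a hypothetical helper introduced for illustration.
wait_for() {
  local timeout_seconds="$1" interval_seconds="$2"
  shift 2
  local deadline
  deadline=$(( $(date +%s) + timeout_seconds ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out after ${timeout_seconds}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval_seconds"
  done
}

# Usage (hypothetical marker file):
#   wait_for 30 1 test -f build/output.ready
```

Compared with a bare `sleep 10`, the wait ends as soon as the condition is true, and a timeout produces an explicit message naming the condition that never held.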
Real-World Impact
From debugging sessions:
- Systematic approach: 15-30 minutes to fix
- Random fixes approach: 2-3 hours of thrashing
- First-time fix rate: 95% vs 40%
- New bugs introduced: Near zero vs common
Memory Protocol (MANDATORY)
Before starting:
Read .claude/context/memory/learnings.md
After completing:
- New pattern -> .claude/context/memory/learnings.md
- Issue found -> .claude/context/memory/issues.md
- Decision made -> .claude/context/memory/decisions.md
ASSUME INTERRUPTION: If it's not in memory, it didn't happen.
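Writing to memory can be a one-line habit. The sketch below assumes the three files accept plain appended list entries; the entry format, the `MEMORY_DIR` override, and the `record_memory` helper are all assumptions for illustration, since the protocol above only names the files.

```bash
# Sketch: append a dated entry to one of the memory files named above.
# The entry format, MEMORY_DIR override, and `record_memory` are assumptions.
MEMORY_DIR="${MEMORY_DIR:-.claude/context/memory}"

record_memory() {
  local kind="$1" note="$2" file
  case "$kind" in
    pattern)  file="learnings.md" ;;
    issue)    file="issues.md" ;;
    decision) file="decisions.md" ;;
    *)
      echo "unknown kind: $kind (use pattern, issue, or decision)" >&2
      return 1
      ;;
  esac
  mkdir -p "$MEMORY_DIR"
  printf '%s %s: %s\n' '-' "$(date +%Y-%m-%d)" "$note" >> "$MEMORY_DIR/$file"
}

# Usage:
#   record_memory issue "signing fails when IDENTITY is unset in build step"
```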