debugging
Systematic Debugging
Evidence-based investigation -> root cause -> verified fix.
Steps
- Load the outfitter:maintain-tasks skill for stage tracking
- Collect evidence (reproduce, gather symptoms)
- Isolate variables (narrow scope)
- Formulate and test hypotheses
- Implement fix with failing test first
- Verify fix resolves the issue
For formal incident investigation requiring RCA documentation, use the find-root-causes skill instead (it loads this skill and adds formal RCA methodology).
<when_to_use>
- Bugs, errors, exceptions, crashes
- Unexpected behavior or wrong results
- Failing tests (unit, integration, e2e)
- Intermittent or timing-dependent failures
- Performance issues (slow, memory leaks, high CPU)
- Integration failures (API, database, external services)
NOT for: obvious fixes, feature requests, architecture planning
</when_to_use>
<iron_law>
NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST
Never propose solutions or "try this" without understanding root cause through systematic investigation.
</iron_law>
<stages>
See Steps section for skill dependencies. Stages advance forward only.
| Stage | Trigger | activeForm |
|---|---|---|
| Collect Evidence | Session start | "Collecting evidence" |
| Isolate Variables | Evidence gathered | "Isolating variables" |
| Formulate Hypotheses | Problem isolated | "Formulating hypotheses" |
| Test Hypothesis | Hypothesis formed | "Testing hypothesis" |
| Verify Fix | Fix identified | "Verifying fix" |
Situational (insert when triggered):
- Iterate -> Hypothesis disproven, loops back with new hypothesis
Workflow:
- Start: "Collect Evidence" as in_progress
- Transition: Mark current completed, add next in_progress
- Failed hypothesis: Add "Iterate" task
- Quick fixes: If root cause obvious from error, skip to "Verify Fix" (still create failing test)
- Need more evidence: Add new evidence task (don't regress stages)
- Circuit breaker: After 3 failed hypotheses -> escalate
</stages>
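The forward-only stage flow and circuit breaker above can be sketched as a small state machine. This is an illustrative sketch only; the `DebugTracker` class and its methods are hypothetical and not part of the outfitter:maintain-tasks skill.

```python
# Hypothetical sketch of the forward-only stage flow; not a real API.
STAGES = [
    "Collect Evidence",
    "Isolate Variables",
    "Formulate Hypotheses",
    "Test Hypothesis",
    "Verify Fix",
]

class DebugTracker:
    MAX_FAILED_HYPOTHESES = 3  # circuit breaker threshold

    def __init__(self):
        self.index = 0  # session starts in "Collect Evidence"
        self.failed_hypotheses = 0

    @property
    def stage(self):
        return STAGES[self.index]

    def advance(self):
        """Mark the current stage completed and move to the next one."""
        if self.index < len(STAGES) - 1:
            self.index += 1

    def hypothesis_failed(self):
        """Iterate: stay in place with a new hypothesis; escalate after 3 failures."""
        self.failed_hypotheses += 1
        if self.failed_hypotheses >= self.MAX_FAILED_HYPOTHESES:
            raise RuntimeError("Circuit breaker: escalate, likely architectural")

tracker = DebugTracker()
tracker.advance()     # evidence gathered
print(tracker.stage)  # Isolate Variables
```

Note that `advance` only ever moves forward; needing more evidence is modeled by adding a new task, not by decrementing the index.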
<quick_start>
- Create "Collect Evidence" todo as in_progress
- Reproduce - exact steps to trigger consistently
- Investigate - gather evidence about what's happening
- Analyze - compare working vs broken, find differences
- Test hypothesis - single specific hypothesis, minimal test
- Implement - failing test first, then fix
- Update todos on stage transitions
</quick_start>
<stage_1_root_cause>
Goal: Understand what's actually happening.
Transition: Mark complete when you have reproduction steps and initial evidence.
Read error messages completely
- Stack traces top to bottom
- Note file paths, line numbers, variable names
- Look for "caused by" chains
Reproduce consistently
- Document exact trigger steps
- Note inputs that cause vs don't cause
- Check if intermittent (timing, race conditions)
- Verify in clean environment
Check recent changes
- git diff - what changed?
- git log --since="yesterday" - recent commits
- Dependency updates
- Config/environment changes
Gather evidence
- Add logging at key points
- Print variable values at transformations
- Log function entry/exit with parameters
- Capture timestamps for timing issues
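One lightweight way to get the entry/exit logging described above is a tracing decorator. This is a minimal sketch; `normalize` is a hypothetical stand-in for whatever function is under suspicion.

```python
import functools
import logging
import time

logging.basicConfig(level=logging.DEBUG, format="%(asctime)s %(message)s")

def trace(func):
    """Log entry/exit with parameters and elapsed time - evidence, not a fix."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logging.debug("enter %s args=%r kwargs=%r", func.__name__, args, kwargs)
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
            logging.debug("exit %s -> %r", func.__name__, result)
            return result
        except Exception:
            logging.exception("raised in %s", func.__name__)
            raise
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            logging.debug("%s took %.3fms", func.__name__, elapsed_ms)
    return wrapper

@trace
def normalize(value):
    # Hypothetical function under investigation.
    return value.strip().lower()

normalize("  Hello ")
```

Remove the decorator once the evidence is gathered; it exists to observe, not to change behavior.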
Trace data flow backward
- Where does bad value come from?
- Track through transformations
- Find first place it becomes wrong
Red flags (return to evidence gathering):
- "I think maybe X is the problem"
- "Let's try changing Y"
- "It might be related to Z"
- Starting to write code before understanding
</stage_1_root_cause>
<stage_2_pattern_analysis>
Goal: Learn from working code to understand broken code.
Transition: Mark complete when key differences identified.
Find working examples
- Search for similar functionality that works
- rg "pattern" for similar patterns
- Look for passing vs failing tests
- Check git history for when it worked
Read references completely
- Every line, not skimming
- Full context
- All dependencies/imports
- Configuration and setup
Identify every difference
- Line by line working vs broken
- Different imports?
- Different function signatures?
- Different error handling?
- Different data flow?
- Different configuration?
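The line-by-line comparison above can be mechanized with a unified diff; the `working`/`broken` snippets and the `make_client` call below are hypothetical stand-ins for your own two versions.

```python
import difflib

# Hypothetical snippets standing in for the working and broken versions.
working = """client = make_client(timeout=30)
client.connect()
data = client.fetch()
""".splitlines()

broken = """client = make_client()
data = client.fetch()
""".splitlines()

# Every "-"/"+" line is a candidate difference to explain, not to guess at.
for line in difflib.unified_diff(working, broken,
                                 fromfile="working.py", tofile="broken.py",
                                 lineterm=""):
    print(line)
```

Here the diff surfaces two concrete differences (a dropped timeout and a missing connect call) that the hypothesis stage must account for.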
Understand dependencies
- Libraries/packages involved
- Versions in use
- External services
- Shared state
- Assumptions made
Questions to answer:
- Why does working version work?
- What's fundamentally different?
- Edge cases working version handles?
- Invariants working version maintains?
</stage_2_pattern_analysis>
<stage_3_hypothesis_testing>
Goal: Test one specific idea with minimal change.
Transition: Mark complete when specific, evidence-based hypothesis formed.
Form single hypothesis
- Template: "X is root cause because Y"
- Must explain all symptoms
- Must be testable with small change
- Must be based on evidence from stages 1-2
Design minimal test
- Smallest change to test hypothesis
- Change ONE variable
- Preserve everything else
- Make reversible
Execute and verify
- Apply change
- Run reproduction steps
- Observe carefully
- Document results
Outcomes:
- Fixed: Confirm across all cases, proceed to Verify Fix
- Not fixed: Mark complete, add "Iterate", form NEW hypothesis
- Partially fixed: Add "Iterate" for remaining issues
- Never: Random variations hoping one works
Bad hypotheses (too vague):
- "Maybe it's a race condition"
- "Could be caching or permissions"
- "Probably something with the database"
Good hypotheses (specific, testable):
- "Fails because expects number but receives string when API returns empty"
- "Race condition: fetchData() called before initializeClient() completes"
- "Memory leak: event listeners in useEffect never removed in cleanup"
</stage_3_hypothesis_testing>
<stage_4_implementation>
Goal: Fix root cause permanently with verification.
Transition: Root cause confirmed, ready for permanent fix.
Create failing test
- Write test reproducing bug
- Verify fails before fix
- Should pass after fix
- Captures exact broken scenario
Implement single fix
- Address identified root cause
- No additional "improvements"
- No refactoring "while you're there"
- Just fix the problem
Verify fix
- Failing test now passes
- Existing tests still pass
- Manual reproduction no longer triggers bug
- No new errors/warnings
Circuit breaker
If 3+ fixes tried without success: STOP
- Problem isn't hypothesis - problem is architecture
- May be using wrong pattern entirely
- Escalate or redesign
After fixing:
- Mark "Verify Fix" completed
- Add defensive validation
- Document root cause
- Consider similar bugs elsewhere
</stage_4_implementation>
<red_flags>
STOP and return to Stage 1 if you catch yourself:
- "Quick fix for now, investigate later"
- "Just try changing X and see"
- "I don't fully understand but this might work"
- "One more fix attempt" (already tried 2+)
- "Let me try a few different things"
- Proposing solutions before gathering evidence
- Skipping failing test case
- Fixing symptoms instead of root cause
ALL mean: STOP. Add new "Collect Evidence" task.
</red_flags>
<escalation>
When to escalate:
- After 3 failed fix attempts - architecture may be wrong
- No clear reproduction - need more context/access
- External system issues - need vendor/team involvement
- Security implications - need security expertise
- Data corruption risks - need backup/recovery planning
</escalation>
<completion>
Before claiming "fixed":
- Root cause identified with evidence
- Failing test case created
- Fix addresses root cause only
- Test now passes
- All existing tests pass
- Manual reproduction no longer triggers bug
- No new warnings/errors
- Root cause documented
- Prevention measures considered
- "Verify Fix" marked completed
Understanding the bug is more valuable than fixing it quickly.
</completion>
<rules>
ALWAYS:
- Create "Collect Evidence" todo at session start
- Follow four-stage framework
- Update todos on stage transitions
- Create failing test before fix
- Test single hypothesis at a time
- Document root cause after fix
- Mark "Verify Fix" complete only after tests pass
NEVER:
- Propose fixes without understanding root cause
- Skip evidence gathering
- Test multiple hypotheses simultaneously
- Skip failing test case
- Fix symptoms instead of root cause
- Continue after 3 failed fixes without escalation
- Regress stages - add new tasks if needed
</rules>
<references>
- playbooks.md - bug-type specific investigations
- evidence-patterns.md - diagnostic techniques
- reproduction.md - reproduction techniques
- integration.md - workflow integration, anti-patterns
</references>