
Systematic Debugging


Evidence-based investigation -> root cause -> verified fix.

Steps


  1. Load the outfitter:maintain-tasks skill for stage tracking
  2. Collect evidence (reproduce, gather symptoms)
  3. Isolate variables (narrow scope)
  4. Formulate and test hypotheses
  5. Implement fix with failing test first
  6. Verify fix resolves the issue
For formal incident investigation requiring RCA documentation, use the find-root-causes skill instead (it loads this skill and adds formal RCA methodology).
<when_to_use>
  • Bugs, errors, exceptions, crashes
  • Unexpected behavior or wrong results
  • Failing tests (unit, integration, e2e)
  • Intermittent or timing-dependent failures
  • Performance issues (slow, memory leaks, high CPU)
  • Integration failures (API, database, external services)
NOT for: obvious fixes, feature requests, architecture planning
</when_to_use>
<iron_law>
NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST
Never propose solutions or "try this" without understanding root cause through systematic investigation.
</iron_law>
<stages>
See Steps section for skill dependencies. Stages advance forward only.
Stage                | Trigger           | activeForm
---------------------|-------------------|--------------------------
Collect Evidence     | Session start     | "Collecting evidence"
Isolate Variables    | Evidence gathered | "Isolating variables"
Formulate Hypotheses | Problem isolated  | "Formulating hypotheses"
Test Hypothesis      | Hypothesis formed | "Testing hypothesis"
Verify Fix           | Fix identified    | "Verifying fix"
Situational (insert when triggered):
  • Iterate -> Hypothesis disproven, loops back with new hypothesis
Workflow:
  • Start: "Collect Evidence" as in_progress
  • Transition: Mark current completed, add next in_progress
  • Failed hypothesis: Add "Iterate" task
  • Quick fixes: If root cause obvious from error, skip to "Verify Fix" (still create failing test)
  • Need more evidence: Add new evidence task (don't regress stages)
  • Circuit breaker: After 3 failed hypotheses -> escalate
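The forward-only workflow above can be sketched as a small state machine. This is an illustrative sketch only; `StageTracker` and its method names are hypothetical, not part of the skill's API:

```python
# Illustrative sketch of the forward-only stage workflow.
# StageTracker is a hypothetical helper, not a real API.
STAGES = [
    "Collect Evidence",
    "Isolate Variables",
    "Formulate Hypotheses",
    "Test Hypothesis",
    "Verify Fix",
]

class StageTracker:
    def __init__(self):
        self.index = 0
        self.failed_hypotheses = 0
        # Start: "Collect Evidence" as in_progress
        self.tasks = [(STAGES[0], "in_progress")]

    def advance(self):
        """Mark the current stage completed and open the next one."""
        self.tasks[-1] = (self.tasks[-1][0], "completed")
        self.index += 1
        self.tasks.append((STAGES[self.index], "in_progress"))

    def hypothesis_failed(self):
        """Add an Iterate task; trip the circuit breaker after 3 failures."""
        self.failed_hypotheses += 1
        if self.failed_hypotheses >= 3:
            raise RuntimeError("Circuit breaker: escalate after 3 failed hypotheses")
        self.tasks.append(("Iterate", "in_progress"))
```

Note that stages never regress: new evidence or a disproven hypothesis adds a task rather than moving `index` backward.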
</stages>
<quick_start>
  1. Create "Collect Evidence" todo as in_progress
  2. Reproduce - exact steps to trigger consistently
  3. Investigate - gather evidence about what's happening
  4. Analyze - compare working vs broken, find differences
  5. Test hypothesis - single specific hypothesis, minimal test
  6. Implement - failing test first, then fix
  7. Update todos on stage transitions
</quick_start>
<stage_1_root_cause>
Goal: Understand what's actually happening.
Transition: Mark complete when you have reproduction steps and initial evidence.
Read error messages completely
  • Stack traces top to bottom
  • Note file paths, line numbers, variable names
  • Look for "caused by" chains
Reproduce consistently
  • Document exact trigger steps
  • Note inputs that cause vs don't cause
  • Check if intermittent (timing, race conditions)
  • Verify in clean environment
Check recent changes
  • git diff - what changed?
  • git log --since="yesterday" - recent commits
  • Dependency updates
  • Config/environment changes
Gather evidence
  • Add logging at key points
  • Print variable values at transformations
  • Log function entry/exit with parameters
  • Capture timestamps for timing issues
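One way to get the entry/exit logging and timestamps described above is a tracing decorator. A minimal sketch; `traced` and `transform` are hypothetical names, not part of the skill:

```python
import functools
import logging
import time

# Timestamps in the log format help with timing-dependent failures.
logging.basicConfig(level=logging.DEBUG, format="%(asctime)s %(message)s")

def traced(fn):
    """Log function entry/exit with parameters and elapsed time."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        logging.debug("enter %s args=%r kwargs=%r", fn.__name__, args, kwargs)
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        logging.debug("exit %s -> %r (%.3fs)", fn.__name__, result, elapsed)
        return result
    return wrapper

@traced
def transform(value):
    # Example transformation point where a variable value is worth printing.
    return value.strip().lower()
```

Decorating only the suspect functions keeps the noise down while still capturing values at each transformation.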
Trace data flow backward
  • Where does bad value come from?
  • Track through transformations
  • Find first place it becomes wrong
Red flags (return to evidence gathering):
  • "I think maybe X is the problem"
  • "Let's try changing Y"
  • "It might be related to Z"
  • Starting to write code before understanding
</stage_1_root_cause>
<stage_2_pattern_analysis>
Goal: Learn from working code to understand broken code.
Transition: Mark complete when key differences identified.
Find working examples
  • Search for similar functionality that works
  • rg "pattern" for similar patterns
  • Look for passing vs failing tests
  • Check git history for when it worked
Read references completely
  • Every line, not skimming
  • Full context
  • All dependencies/imports
  • Configuration and setup
Identify every difference
  • Line by line working vs broken
  • Different imports?
  • Different function signatures?
  • Different error handling?
  • Different data flow?
  • Different configuration?
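The line-by-line comparison above can be mechanized with Python's difflib. A sketch, with stand-in snippets for the working and broken versions:

```python
import difflib

# Stand-ins for real working/broken versions of the same function.
working = """def total(items):
    return sum(i.price for i in items)
""".splitlines(keepends=True)

broken = """def total(items):
    return sum(i.price for i in items if i.price)
""".splitlines(keepends=True)

# Unified diff surfaces every changed line between the two versions.
diff = list(difflib.unified_diff(working, broken, "working.py", "broken.py"))
print("".join(diff))
```

Each `-`/`+` pair in the output is a candidate difference to explain before forming a hypothesis.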
Understand dependencies
  • Libraries/packages involved
  • Versions in use
  • External services
  • Shared state
  • Assumptions made
Questions to answer:
  • Why does working version work?
  • What's fundamentally different?
  • Edge cases working version handles?
  • Invariants working version maintains?
</stage_2_pattern_analysis>
<stage_3_hypothesis_testing>
Goal: Test one specific idea with minimal change.
Transition: Mark complete when specific, evidence-based hypothesis formed.
Form single hypothesis
  • Template: "X is root cause because Y"
  • Must explain all symptoms
  • Must be testable with small change
  • Must be based on evidence from stages 1-2
Design minimal test
  • Smallest change to test hypothesis
  • Change ONE variable
  • Preserve everything else
  • Make reversible
Execute and verify
  • Apply change
  • Run reproduction steps
  • Observe carefully
  • Document results
Outcomes:
  • Fixed: Confirm across all cases, proceed to Verify Fix
  • Not fixed: Mark complete, add "Iterate", form NEW hypothesis
  • Partially fixed: Add "Iterate" for remaining issues
  • Never: Random variations hoping one works
Bad hypotheses (too vague):
  • "Maybe it's a race condition"
  • "Could be caching or permissions"
  • "Probably something with the database"
Good hypotheses (specific, testable):
  • "Fails because expects number but receives string when API returns empty"
  • "Race condition: fetchData() called before initializeClient() completes"
  • "Memory leak: event listeners in useEffect never removed in cleanup"
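A minimal test of the first good hypothesis above might look like this. The `parse_total` function and the empty-response shape are hypothetical illustrations, not code from any real system:

```python
# Hypothesis: totals fail because the API returns "" (a string) for an
# empty response, while downstream code expects a number.
# parse_total is a hypothetical stand-in for the code under suspicion.
def parse_total(api_value):
    # Candidate fix under test: coerce the empty payload to a number.
    if api_value in ("", None):
        return 0.0
    return float(api_value)

# One variable changed, both cases reproduced minimally:
assert parse_total("") == 0.0       # the empty-response case that failed
assert parse_total("19.99") == 19.99
```

If the empty-response assertion captures the real failure and the fix makes both pass, the hypothesis is confirmed; otherwise mark it disproven and iterate.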
</stage_3_hypothesis_testing>
<stage_4_implementation>
Goal: Fix root cause permanently with verification.
Transition: Root cause confirmed, ready for permanent fix.
Create failing test
  • Write test reproducing bug
  • Verify fails before fix
  • Should pass after fix
  • Captures exact broken scenario
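A failing-test-first regression test can look like this with pytest-style assertions. `slugify` and its None-handling bug are hypothetical examples, not the skill's code:

```python
# Regression test written BEFORE the fix: it reproduces the exact broken
# scenario, fails against the unfixed code, and passes once the fix lands.
def slugify(title):
    # The fix: earlier code crashed on None instead of returning "".
    if title is None:
        return ""
    return title.strip().lower().replace(" ", "-")

def test_slugify_handles_none_title():
    # Captures the exact broken scenario from the bug report.
    assert slugify(None) == ""

def test_slugify_still_handles_normal_titles():
    # Guards against the fix regressing existing behavior.
    assert slugify("Hello World") == "hello-world"

test_slugify_handles_none_title()
test_slugify_still_handles_normal_titles()
```

Running the test before applying the fix and watching it fail is what proves it actually captures the bug.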
Implement single fix
  • Address identified root cause
  • No additional "improvements"
  • No refactoring "while you're there"
  • Just fix the problem
Verify fix
  • Failing test now passes
  • Existing tests still pass
  • Manual reproduction no longer triggers bug
  • No new errors/warnings
Circuit breaker: If 3+ fixes tried without success, STOP.
  • Problem isn't hypothesis - problem is architecture
  • May be using wrong pattern entirely
  • Escalate or redesign
After fixing:
  • Mark "Verify Fix" completed
  • Add defensive validation
  • Document root cause
  • Consider similar bugs elsewhere
</stage_4_implementation>
<red_flags>
STOP and return to Stage 1 if you catch yourself:
  • "Quick fix for now, investigate later"
  • "Just try changing X and see"
  • "I don't fully understand but this might work"
  • "One more fix attempt" (already tried 2+)
  • "Let me try a few different things"
  • Proposing solutions before gathering evidence
  • Skipping failing test case
  • Fixing symptoms instead of root cause
ALL mean: STOP. Add new "Collect Evidence" task.
</red_flags>
<escalation>
When to escalate:
  1. After 3 failed fix attempts - architecture may be wrong
  2. No clear reproduction - need more context/access
  3. External system issues - need vendor/team involvement
  4. Security implications - need security expertise
  5. Data corruption risks - need backup/recovery planning
</escalation> <completion>
Before claiming "fixed":
  • Root cause identified with evidence
  • Failing test case created
  • Fix addresses root cause only
  • Test now passes
  • All existing tests pass
  • Manual reproduction no longer triggers bug
  • No new warnings/errors
  • Root cause documented
  • Prevention measures considered
  • "Verify Fix" marked completed
Understanding the bug is more valuable than fixing it quickly.
</completion> <rules>
ALWAYS:
  • Create "Collect Evidence" todo at session start
  • Follow four-stage framework
  • Update todos on stage transitions
  • Create failing test before fix
  • Test single hypothesis at a time
  • Document root cause after fix
  • Mark "Verify Fix" complete only after tests pass
NEVER:
  • Propose fixes without understanding root cause
  • Skip evidence gathering
  • Test multiple hypotheses simultaneously
  • Skip failing test case
  • Fix symptoms instead of root cause
  • Continue after 3 failed fixes without escalation
  • Regress stages - add new tasks if needed
</rules> <references>
  • playbooks.md - bug-type specific investigations
  • evidence-patterns.md - diagnostic techniques
  • reproduction.md - reproduction techniques
  • integration.md - workflow integration, anti-patterns
</references>