Diagnosing Bugs

A discipline for hard bugs. Skip phases only when explicitly justified.
When exploring the codebase, read CONTEXT.md (if it exists) to get a clear mental model of the relevant modules, and check ADRs in the area you're touching.

Phase 1 — Build a feedback loop

This is the skill. Everything else is mechanical. If you have a tight pass/fail signal for the bug — one that goes red on this bug — you will find the cause; bisection, hypothesis-testing, and instrumentation all just consume it. If you don't have one, no amount of staring at code will save you.
Spend disproportionate effort here. Be aggressive. Be creative. Refuse to give up.

Ways to construct one — try them in roughly this order

  1. Failing test at whatever seam reaches the bug — unit, integration, e2e.
  2. Curl / HTTP script against a running dev server.
  3. CLI invocation with a fixture input, diffing stdout against a known-good snapshot.
  4. Headless browser script (Playwright / Puppeteer) — drives the UI, asserts on DOM/console/network.
  5. Replay a captured trace. Save a real network request / payload / event log to disk; replay it through the code path in isolation.
  6. Throwaway harness. Spin up a minimal subset of the system (one service, mocked deps) that exercises the bug code path with a single function call.
  7. Property / fuzz loop. If the bug is "sometimes wrong output", run 1000 random inputs and look for the failure mode.
  8. Bisection harness. If the bug appeared between two known states (commit, dataset, version), automate "boot at state X, check, repeat" so you can git bisect run it.
  9. Differential loop. Run the same input through old-version vs new-version (or two configs) and diff outputs.
  10. HITL bash script. Last resort. If a human must click, drive them with scripts/hitl-loop.template.sh so the loop is still structured. Captured output feeds back to you.
Build the right feedback loop, and the bug is 90% fixed.
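As an illustration of option 7 (property / fuzz loop), here is a minimal sketch. Both functions are hypothetical stand-ins: suspect_sort_unique plays the code under suspicion and reference_sort_unique plays a slow but obviously correct oracle. The RNG is seeded so the loop gives the same verdict on every run.

```python
import random


def reference_sort_unique(xs):
    """Oracle (assumed trustworthy): sorted distinct values."""
    return sorted(set(xs))


def suspect_sort_unique(xs):
    """Hypothetical code under suspicion: fails to drop repeated zeros."""
    out, prev = [], None
    for x in sorted(xs):
        if prev and x == prev:  # bug: the check is skipped when prev == 0
            continue
        out.append(x)
        prev = x
    return out


def fuzz(runs=1000, seed=0):
    """Return (run index, input) for the first mismatch, or None if all agree."""
    rng = random.Random(seed)  # seeded so every run of the loop is identical
    for i in range(runs):
        xs = [rng.randint(-3, 3) for _ in range(rng.randint(0, 6))]
        if suspect_sort_unique(xs) != reference_sort_unique(xs):
            return i, xs
    return None


if __name__ == "__main__":
    print(fuzz())
```

The small value range (-3 to 3) is a deliberate choice: it makes collisions, and therefore the failure mode, likely within a modest number of runs.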

Tighten the loop

Treat the loop as a product. Once you have a loop, tighten it:
  • Can I make it faster? (Cache setup, skip unrelated init, narrow the test scope.)
  • Can I make the signal sharper? (Assert on the specific symptom, not "didn't crash".)
  • Can I make it more deterministic? (Pin time, seed RNG, isolate filesystem, freeze network.)
A 30-second flaky loop is barely better than no loop; a tight, 2-second deterministic one is a debugging superpower.
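One way to get determinism is to pass the clock and the RNG into the code instead of letting it read globals. The sketch below assumes a hypothetical make_token function that accepts both as parameters; real code may need a seam added before this is possible.

```python
import random
from datetime import datetime, timezone


def make_token(now, rng):
    """Hypothetical code under test: depends on the current time and on randomness."""
    return f"{now:%Y%m%d}-{rng.randrange(10_000):04d}"


def run_once():
    """One loop iteration with time pinned and the RNG seeded."""
    pinned_now = datetime(2024, 1, 1, tzinfo=timezone.utc)  # pinned time (arbitrary date)
    rng = random.Random(1234)  # seeded RNG (arbitrary seed)
    return make_token(pinned_now, rng)


if __name__ == "__main__":
    print(run_once() == run_once())
```

With both inputs fixed, two iterations of the loop produce the same token, so any change in the verdict comes from a code change rather than from the environment.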

Non-deterministic bugs

The goal is not a clean repro but a higher reproduction rate. Loop the trigger 100×, parallelise, add stress, narrow timing windows, inject sleeps. A 50%-flake bug is debuggable; 1% is not — keep raising the rate until it's debuggable.
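Raising the rate presupposes measuring it. A minimal sketch, assuming a hypothetical trigger callable that raises AssertionError whenever the symptom appears; make_flaky_trigger is a stand-in used only to demonstrate the measurement.

```python
import itertools


def reproduction_rate(trigger, runs=100):
    """Run the trigger `runs` times and return the fraction that showed the symptom."""
    failures = 0
    for _ in range(runs):
        try:
            trigger()
        except AssertionError:
            failures += 1
    return failures / runs


def make_flaky_trigger(fail_every):
    """Stand-in for a flaky bug: the symptom appears on every Nth call."""
    calls = itertools.count(1)

    def trigger():
        if next(calls) % fail_every == 0:
            raise AssertionError("symptom observed")

    return trigger


if __name__ == "__main__":
    print(reproduction_rate(make_flaky_trigger(fail_every=4)))
```

Re-measure after each change (added stress, injected sleep, narrower timing window) and keep only the changes that move the number up.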

When you genuinely cannot build a loop

Stop and say so explicitly. List what you tried. Ask the user for: (a) access to whatever environment reproduces it, (b) a captured artifact (HAR file, log dump, core dump, screen recording with timestamps), or (c) permission to add temporary production instrumentation. Do not proceed to hypothesise without a loop.

Completion criterion — a tight loop that goes red

Phase 1 is done when the loop is tight and red-capable: you can name one command — a script path, a test invocation, a curl — that you have already run at least once (paste the invocation and its output), and that is:
  • Red-capable — it drives the actual bug code path and asserts the user's exact symptom, so it can go red on this bug and green once fixed. Not "runs without erroring" — it must be able to catch this specific bug.
  • Deterministic — same verdict every run (flaky bugs: a pinned, high reproduction rate, per above).
  • Fast — seconds, not minutes.
  • Agent-runnable — you can run it unattended; a human in the loop only via scripts/hitl-loop.template.sh.
If you catch yourself reading code to build a theory before this command exists, stop — jumping straight to a hypothesis is the exact failure this skill prevents. No red-capable command, no Phase 2.
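The difference between "runs without erroring" and red-capable can be sketched as two checks against the same code. Everything here is hypothetical: slugify stands in for the buggy code, and the reported symptom is assumed to be "two spaces in a title produce a double hyphen". Each check returns a process-style exit code (0 for green, 1 for red).

```python
def slugify(title):
    """Hypothetical buggy code: consecutive spaces become consecutive hyphens."""
    return title.strip().lower().replace(" ", "-")


def weak_check():
    """Not red-capable: only proves the call does not raise."""
    slugify("Hello  World")
    return 0


def red_capable_check():
    """Red-capable: asserts the exact symptom the user reported."""
    got = slugify("Hello  World")
    return 0 if got == "hello-world" else 1


if __name__ == "__main__":
    print("weak:", weak_check(), "red-capable:", red_capable_check())
```

The weak check stays green while the bug is present, so it can never confirm a fix; the red-capable one is red now and can only turn green once the double hyphen is gone.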

Phase 2 — Reproduce + minimise

Run the loop. Watch it go red — the bug appears.
Confirm:
  • The loop produces the failure mode the user described — not a different failure that happens to be nearby. Wrong bug = wrong fix.
  • The failure is reproducible across multiple runs (or, for non-deterministic bugs, reproducible at a high enough rate to debug against).
  • You have captured the exact symptom (error message, wrong output, slow timing) so later phases can verify the fix actually addresses it.

Minimise

Once it's red, shrink the repro to the smallest scenario that still goes red. Cut inputs, callers, config, data, and steps one at a time, re-running the loop after each cut — keep only what's load-bearing for the failure.
Why bother: a minimal repro shrinks the hypothesis space in Phase 3 (fewer moving parts left to suspect) and becomes the clean regression test in Phase 5.
Done when every remaining element is load-bearing — removing any one of them makes the loop go green.
Do not proceed until you have reproduced and minimised.
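The cut-one-thing-and-re-run procedure can be automated when the repro is expressible as a list of elements. This is a simple one-at-a-time sketch, not a full delta-debugging implementation; still_red is a hypothetical wrapper around the Phase 1 loop that returns True when the loop goes red for the given elements.

```python
def minimise(elements, still_red):
    """Drop elements one at a time, keeping a cut only if the loop stays red."""
    kept = list(elements)
    changed = True
    while changed:
        changed = False
        for i in range(len(kept)):
            candidate = kept[:i] + kept[i + 1:]
            if still_red(candidate):  # the failure survives without this element
                kept = candidate
                changed = True
                break
    return kept


def still_red(steps):
    """Stand-in for the real loop: this bug needs both of these steps."""
    return "login" in steps and "expire_session" in steps


if __name__ == "__main__":
    repro = ["open", "login", "scroll", "expire_session", "click"]
    print(minimise(repro, still_red))
```

When the function returns, every remaining element is load-bearing: removing any single one makes still_red return False, which matches the done criterion above.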

Phase 3 — Hypothesise

Generate 3–5 ranked hypotheses before testing any of them. Single-hypothesis generation anchors on the first plausible idea.
Each hypothesis must be falsifiable: state the prediction it makes.
Format: "If <X> is the cause, then <changing Y> will make the bug disappear / <changing Z> will make it worse."
If you cannot state the prediction, the hypothesis is a vibe — discard or sharpen it.
Show the ranked list to the user before testing. They often have domain knowledge that re-ranks instantly ("we just deployed a change to #3"), or know hypotheses they've already ruled out. Cheap checkpoint, big time saver. Don't block on it — proceed with your ranking if the user is AFK.
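If it helps to keep the list honest, the hypotheses can be held in a small structure that refuses entries without a prediction. This is only a sketch: the Hypothesis fields, the likelihood scores, and the example causes are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    cause: str
    prediction: str    # empty string means no falsifiable prediction was stated
    likelihood: float  # rough subjective estimate between 0 and 1


def ranked(hypotheses):
    """Drop hypotheses with no prediction, then sort most likely first."""
    testable = [h for h in hypotheses if h.prediction.strip()]
    return sorted(testable, key=lambda h: h.likelihood, reverse=True)


candidates = [
    Hypothesis("stale cache entry", "disabling the cache makes the bug disappear", 0.5),
    Hypothesis("something odd in auth", "", 0.3),  # a vibe: no prediction
    Hypothesis("timezone conversion", "pinning the clock to UTC makes the bug disappear", 0.2),
]

if __name__ == "__main__":
    for h in ranked(candidates):
        print(f"{h.likelihood:.1f}  {h.cause}: {h.prediction}")
```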

Phase 4 — Instrument

Each probe must map to a specific prediction from Phase 3. Change one variable at a time.
Tool preference:
  1. Debugger / REPL inspection if the env supports it. One breakpoint beats ten logs.
  2. Targeted logs at the boundaries that distinguish hypotheses.
  3. Never "log everything and grep".
Tag every debug log with a unique prefix, e.g. [DEBUG-a4f2]. Cleanup at the end becomes a single grep. Untagged logs survive; tagged logs die.
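A tagged probe can be a thin helper over the standard logging module so the prefix is never typed by hand. The helper name debug_probe, the logger name, and the logged values are assumptions for illustration; the tag is the example from the text.

```python
import io
import logging

DEBUG_TAG = "[DEBUG-a4f2]"  # one unique tag per debugging session


def debug_probe(logger, message, **values):
    """Emit one tagged debug line with the values that distinguish hypotheses."""
    details = " ".join(f"{name}={value!r}" for name, value in values.items())
    logger.debug("%s %s %s", DEBUG_TAG, message, details)


# Demo: capture the output in memory instead of writing to stderr.
stream = io.StringIO()
logger = logging.getLogger("debug_probe_demo")
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(stream))
logger.propagate = False

debug_probe(logger, "cache lookup", key="user:42", hit=False)

if __name__ == "__main__":
    print(stream.getvalue(), end="")
```

Because every probe goes through one helper with one constant, both the call sites and the emitted lines can be found by searching for the tag.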
Perf branch. For performance regressions, logs are usually the wrong tool. Instead: establish a baseline measurement (timing harness, performance.now(), profiler, query plan), then bisect. Measure first, fix second.

Phase 5 — Fix + regression test

Write the regression test before the fix — but only if there is a correct seam for it.
A correct seam is one where the test exercises the real bug pattern as it occurs at the call site. If the only available seam is too shallow (single-caller test when the bug needs multiple callers, unit test that can't replicate the chain that triggered the bug), a regression test there gives false confidence.
If no correct seam exists, that itself is the finding. Note it. The codebase architecture is preventing the bug from being locked down. Flag this for the next phase.
If a correct seam exists:
  1. Turn the minimised repro into a failing test at that seam.
  2. Watch it fail.
  3. Apply the fix.
  4. Watch it pass.
  5. Re-run the Phase 1 feedback loop against the original (un-minimised) scenario.
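Steps 1 to 4 can be sketched with the standard unittest module. The bug here is hypothetical: total_cents_buggy truncates where it should round, and total_cents_fixed is the assumed fix. The same test is run against both so the red-then-green transition is visible in one place.

```python
import unittest


def total_cents_buggy(prices):
    """Hypothetical bug: int() truncates, so 0.29 * 100 becomes 28."""
    return sum(int(p * 100) for p in prices)


def total_cents_fixed(prices):
    """Assumed fix: round to the nearest cent instead of truncating."""
    return sum(round(p * 100) for p in prices)


def make_regression_test(impl):
    """Build the regression test for a given implementation."""

    class TotalCentsRegression(unittest.TestCase):
        def test_minimised_repro(self):
            self.assertEqual(impl([0.29]), 29)

    return TotalCentsRegression


def passes(impl):
    """Run the regression test against impl and report whether it is green."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(make_regression_test(impl))
    result = unittest.TestResult()
    suite.run(result)
    return result.wasSuccessful()


if __name__ == "__main__":
    print("before fix:", passes(total_cents_buggy))
    print("after fix:", passes(total_cents_fixed))
```

In a real codebase the test would target the single production function and the two states would be two commits; parameterising over implementations is only a device to show both outcomes here.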

Phase 6 — Cleanup + post-mortem

Required before declaring done:
  • Original repro no longer reproduces (re-run the Phase 1 loop)
  • Regression test passes (or absence of seam is documented)
  • All [DEBUG-...] instrumentation removed (grep the prefix)
  • Throwaway prototypes deleted (or moved to a clearly-marked debug location)
  • The hypothesis that turned out correct is stated in the commit / PR message — so the next debugger learns
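The instrumentation check in the list above is normally a single grep; where a scripted check is preferred, it might look like the sketch below. The tag pattern assumes prefixes shaped like [DEBUG-a4f2], and the file suffixes and the demo file are made up for illustration.

```python
import re
import tempfile
from pathlib import Path

TAG_PATTERN = re.compile(r"\[DEBUG-[0-9a-z]+\]")  # assumed tag shape


def leftover_debug_lines(root, suffixes=(".py",)):
    """Return (path, line number, text) for every line still carrying a debug tag."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            lines = path.read_text(encoding="utf-8").splitlines()
            for lineno, line in enumerate(lines, start=1):
                if TAG_PATTERN.search(line):
                    hits.append((str(path), lineno, line.strip()))
    return hits


# Demo on a throwaway directory containing one forgotten probe.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "service.py"
    source.write_text(
        "def handler():\n"
        "    print('[DEBUG-a4f2] entered handler')\n"
        "    return 1\n",
        encoding="utf-8",
    )
    hits = leftover_debug_lines(tmp)

if __name__ == "__main__":
    for path, lineno, text in hits:
        print(f"{path}:{lineno}: {text}")
```

An empty result is the condition for ticking the item off; any hit names the file and line that still needs cleaning.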
Then ask: what would have prevented this bug? If the answer involves architectural change (no good test seam, tangled callers, hidden coupling), hand off to the /improve-codebase-architecture skill with the specifics. Make the recommendation after the fix is in, not before — you have more information now than when you started.