Hunt: Diagnose Before You Fix


A patch applied to a symptom creates a new bug somewhere else. Find the origin first.
Do not touch code until you can state the root cause in one sentence.

Orientation


Start by building a complete picture of what happened:
  • Get the exact error, stack trace, and steps to reproduce. If anything is missing, ask one specific question.
  • Run git log --oneline -20 on the files named in the error or stack trace. If no specific files are mentioned, run it on the whole repo. Regressions almost always live in recent changes.
  • Trace the execution path from the symptom backward: follow the data, not intuition.
  • Reproduce it yourself. If you cannot reproduce it reliably, you do not understand it yet.
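The git log step in the list above can be scripted once a stack trace is in hand. The following is a minimal sketch, assuming a Python-style traceback; the helper names `files_in_trace` and `git_log_commands` and the sample trace are hypothetical, and other runtimes would need a different frame pattern.

```python
import re
import shlex

# Matches the 'File "path", line N' frames of a Python traceback
# (an assumption; adapt the pattern for other runtimes).
FRAME = re.compile(r'File "([^"]+)", line \d+')

def files_in_trace(trace: str) -> list[str]:
    """Return the distinct file paths named in the trace, in first-seen order."""
    seen: dict[str, None] = {}
    for path in FRAME.findall(trace):
        seen.setdefault(path, None)
    return list(seen)

def git_log_commands(trace: str) -> list[str]:
    """Build one git log command per named file, or one for the whole repo."""
    paths = files_in_trace(trace)
    if not paths:
        return ["git log --oneline -20"]
    return [f"git log --oneline -20 -- {shlex.quote(p)}" for p in paths]

# Hypothetical trace used only for illustration.
trace = '''Traceback (most recent call last):
  File "app/server.py", line 88, in handle
    user = load_user(request)
  File "app/users.py", line 42, in load_user
    return cache[user_id]
KeyError: 'u-17'
'''
for command in git_log_commands(trace):
    print(command)
```

The sketch only builds the commands; running them and reading the recent commits is still a manual step.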
Before going further, commit to a testable claim:
"I believe the root cause is [X] because [evidence]."
The claim must name a specific file, function, line, or condition. "A state management issue" is not testable. "Stale cache in useUser at src/hooks/user.ts:42 because the dependency array is missing userId" is testable. If you cannot be that specific, you do not have a hypothesis yet.

Known Failure Shapes


When a hypothesis is hard to form, match the symptom to a known shape:
| Shape | Clues | Where to look |
| --- | --- | --- |
| Timing problem | Intermittent, load-dependent | Concurrent access to shared state |
| Missing guard | Crash on field or index access | Optional values, empty collections |
| Ordering bug | Works in isolation, fails in sequence | Event callbacks, transaction scope |
| Boundary failure | Timeout, wrong response shape | External APIs, service edges |
| Environment mismatch | Local pass, CI fail | Env vars, feature flags, seeded data |
| Stale value | Old data shown, refreshes on restart | In-memory cache, memoized result |
Also worth checking: existing TODOs near the failure site, and whether this area has been patched before. Recurring fixes in the same place mean the abstraction is wrong.
Pay attention to deflection. When a developer or user says "that part doesn't matter" or "don't worry about that area," treat it as a signal rather than a clearance. The area someone is reluctant to examine is often where the actual problem lives.
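To make one of the shapes above concrete, here is a small sketch of the "Stale value" shape using a memoized function. The `settings` dictionary and `greeting_for` function are hypothetical stand-ins for whatever the real code caches.

```python
from functools import lru_cache

settings = {"greeting": "hello"}  # hypothetical mutable source of truth

@lru_cache(maxsize=None)
def greeting_for(name: str) -> str:
    """Memoized, so the result is frozen at the first call for each name."""
    return f"{settings['greeting']}, {name}"

first = greeting_for("ada")
settings["greeting"] = "welcome"   # the underlying data changes
stale = greeting_for("ada")        # still served from the cache
greeting_for.cache_clear()         # what a process restart does implicitly
fresh = greeting_for("ada")

print(first, "|", stale, "|", fresh)
# prints: hello, ada | hello, ada | welcome, ada
```

The clue from the table shows up directly: the old value persists until the cache is cleared, which in production usually means a restart.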

Confirm or Discard the Hypothesis


Add one targeted instrument: a log line, a failing assertion, or the smallest possible test that would fail if the hypothesis is correct. Run it.
If the evidence contradicts the hypothesis, discard it completely and re-orient with what was just learned. Do not preserve a hypothesis that the evidence disproves.
After three failed hypotheses, stop. Do not guess a fourth time. Instead, surface the situation to the user: what was checked, what was ruled out, what is still unknown. Ask whether to add more instrumentation, escalate, or approach the problem differently.
Same symptom after a fix is a hard stop. If the user reports the same symptom after a patch was applied, do not patch again. Treat it as a new investigation: the previous hypothesis was wrong. Re-read the execution path from scratch. Three rounds of "fixed but still broken" in the same area means the abstraction is wrong, not the specific line.
Never state environment details from memory. Before diagnosing OS, compiler, SDK, or tool version issues, run the detection command first: sw_vers, xcodebuild -version, node --version, rustc --version, etc. State the actual output. A diagnosis built on an assumed version is not a diagnosis.
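The detection step can be wrapped so that the real output is always captured, including the case where a tool is not installed. This is a minimal sketch; the function name `detect_versions` is hypothetical, and the command list is just the four commands named above.

```python
import shutil
import subprocess

# The detection commands named in the text; extend for other toolchains.
COMMANDS = [
    ["sw_vers"],
    ["xcodebuild", "-version"],
    ["node", "--version"],
    ["rustc", "--version"],
]

def detect_versions(commands=COMMANDS) -> dict[str, str]:
    """Run each command and record its real output, or why it could not run."""
    report: dict[str, str] = {}
    for argv in commands:
        name = " ".join(argv)
        if shutil.which(argv[0]) is None:
            report[name] = "not found on PATH"
            continue
        try:
            done = subprocess.run(
                argv, capture_output=True, text=True,
                stdin=subprocess.DEVNULL, timeout=30,
            )
        except (OSError, subprocess.TimeoutExpired) as exc:
            report[name] = f"failed to run: {exc}"
            continue
        output = (done.stdout or done.stderr).strip()
        report[name] = output if done.returncode == 0 else f"exit {done.returncode}: {output}"
    return report

for name, output in detect_versions().items():
    print(f"$ {name}\n{output}\n")
```

"not found on PATH" is itself a finding worth stating: it rules out a version mismatch and points at the environment instead.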
External tool or MCP failure: diagnose before switching. When an MCP tool, CLI dependency, or external API is unavailable or returning errors, do not immediately try an alternative method. First determine why it failed: is the server running? Is the API key valid or expired? Is the config pointing to the right endpoint? Is a proxy needed? Switching to a workaround without diagnosing the root cause leaves the original problem intact and wastes the next session too.
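The questions above can be asked in a fixed order before any workaround is considered. The sketch below is an illustration under assumptions: the function names, the endpoint URL, and the `EXAMPLE_API_KEY` variable are hypothetical, and it can only tell whether a key is present, not whether it has expired.

```python
import os
import socket
from urllib.parse import urlsplit

def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def diagnose(endpoint: str, key_env: str, probe=tcp_reachable, env=os.environ) -> list[str]:
    """Return findings for: endpoint config, server, credentials, proxy."""
    url = urlsplit(endpoint)
    if not url.hostname:
        return [f"config: endpoint {endpoint!r} has no host"]
    port = url.port or (443 if url.scheme == "https" else 80)
    findings = []
    if not probe(url.hostname, port):
        findings.append(f"server: nothing reachable at {url.hostname}:{port}")
    if not env.get(key_env):
        findings.append(f"credentials: {key_env} is unset or empty")
    if env.get("HTTPS_PROXY") or env.get("HTTP_PROXY"):
        findings.append("proxy: a proxy variable is set; confirm the tool honors it")
    return findings or ["no obvious cause; inspect the tool's own logs next"]

# Stubbed probe and environment so the example is deterministic.
print(diagnose("https://mcp.example.test/api", "EXAMPLE_API_KEY",
               probe=lambda host, port: False, env={}))
```

The probe and environment are injectable so the same checks can run against the real network in a session and against stubs in a test.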
Stop and reassess if you catch yourself:
  • Writing a fix before you have finished tracing the flow
  • Thinking "let me just try this"
  • Finding that each fix surfaces a new problem in a different module

Apply the Fix


Once the root cause is confirmed:
  • Fix the cause, not the symptom it produces
  • Keep the diff small: fewest files, fewest lines
  • Write one regression test that fails on the unfixed code and passes after the fix. If the bug is non-testable (timing, environment-specific, UI rendering), document why and add the best available guard instead.
  • For large projects, run the targeted subset first (tests in the affected module). Run the full suite only after the targeted tests pass. Paste the full output, no summaries.
  • If the change touches more than 5 files, pause and confirm the scope with the user
Self-regulation: track how the fix is going. If you have reverted the same area twice, or if the current fix touches more than 3 files for what started as a single bug, stop. Do not keep patching. Describe what is known and unknown, and ask the user how to proceed. Continued patching past this point means the abstraction is wrong, not the code.
After the fix lands, consider whether a second layer of defense makes sense: validate the same condition at the call site, the service boundary, or in a test. A bug that cannot be introduced again is better than a bug that was fixed once.
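A second layer of defense might look like the following sketch, which validates the same condition at the service boundary and again at the call site. The payload fields and function names are hypothetical.

```python
def require_user_payload(payload: object) -> dict:
    """Boundary guard: reject a malformed response before callers index into it."""
    if not isinstance(payload, dict):
        raise ValueError(f"expected an object, got {type(payload).__name__}")
    missing = [field for field in ("id", "email") if field not in payload]
    if missing:
        raise ValueError(f"response is missing fields: {', '.join(missing)}")
    return payload

def display_name(payload: object) -> str:
    user = require_user_payload(payload)       # guard at the service edge
    return user.get("name") or user["email"]   # call site tolerates a missing name

print(display_name({"id": 1, "email": "ada@example.test"}))
# prints: ada@example.test
```

With the guard in place, a wrong response shape fails loudly at the edge with a specific message instead of crashing later on a field access.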

Outcome


End with a short summary:
Root cause:  [what was wrong, file:line]
Fix:         [what changed, file:line]
Confirmed:   [evidence or test that proves the fix]
Tests:       [pass/fail count, regression test location]
Status is one of: resolved, resolved with caveats (state them), or blocked (state what is unknown and why).
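If the summary is produced by a script, the status rule can be enforced when the summary is rendered. This is a sketch only: the `Outcome` class is hypothetical, the explicit "Status:" row is an assumption about where the status should appear, and the field values below are placeholders.

```python
from dataclasses import dataclass

STATUSES = ("resolved", "resolved with caveats", "blocked")

@dataclass
class Outcome:
    root_cause: str
    fix: str
    confirmed: str
    tests: str
    status: str
    note: str = ""  # caveats, or what is unknown and why

    def render(self) -> str:
        if self.status not in STATUSES:
            raise ValueError(f"status must be one of {STATUSES}")
        if self.status != "resolved" and not self.note:
            raise ValueError(f"status {self.status!r} needs an explanatory note")
        status = self.status + (f" ({self.note})" if self.note else "")
        rows = [("Root cause:", self.root_cause), ("Fix:", self.fix),
                ("Confirmed:", self.confirmed), ("Tests:", self.tests),
                ("Status:", status)]
        # Pad labels to 13 columns to match the template's alignment.
        return "\n".join(f"{label:<13}{value}" for label, value in rows)

print(Outcome(
    root_cause="stale cache in useUser, src/hooks/user.ts:42",
    fix="[what changed, file:line]",
    confirmed="[evidence or test that proves the fix]",
    tests="[pass/fail count, regression test location]",
    status="resolved",
).render())
```

Rejecting "resolved with caveats" or "blocked" without a note keeps the summary from hiding the part the reader most needs.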