debugging-network-issues

Debugging Network Issues

Evidence-driven investigation methodology for incidents where the obvious cause is probably wrong. Built from a real 5-hour production case (see references/case-sse-rst-130s.md) where assumption-stacking wasted hours that a 10-minute layered experiment would have resolved.

Apply this skill when the user reports a network/streaming/protocol symptom and the investigator feels tempted to diagnose from one log line or one circumstantial data point. The skill's job is to slow that reflex down.

Triage first — is this a known domain?

Before applying the general methodology below, check whether the symptom points at a stack that already has a dedicated skill in this repo. Those carry the domain-specific symptom→cause→fix tables this skill deliberately stays general about — start there, and come back here for methodology if the root cause turns out to be elsewhere.

| If the symptom is… | Start with |
| --- | --- |
| macOS Tailscale ⨯ proxy/VPN conflict (Shadowrocket / Clash / Surge): `tailscale ping` works but SSH/curl/git fails, `Connection closed by 198.18.x.x`, TUN DNS hijack, ~60s `getaddrinfo` resolver stall | tunnel-doctor |
| Cloudflare config: `ERR_TOO_MANY_REDIRECTS`, SSL-mode mismatch, DNS / proxy-status issues behind the orange cloud | cloudflare-troubleshooting |
| Windows App / AVD / W365 RDP connection quality: WebSocket instead of UDP Shortpath, high RTT, STUN/TURN interference | windows-remote-desktop-connection-doctor |

If none match — or you tried a domain skill and the evidence points elsewhere — continue below. The methodology generalizes to any multi-layer system.

Core principles

1. Evidence over assumption

If you cannot point to a concrete artifact — log line, pcap frame, probe output, metric sample — you are guessing, not diagnosing. Before stating "X is the cause", require yourself to name the direct evidence. If it does not exist yet, add instrumentation (see references/instrumentation-patterns.md) or capture it (see references/packet-capture-recipes.md) before continuing.

2. Falsification over confirmation

N independent sources "confirming" a hypothesis does not make it true. One falsifying observation rules it out. Before acting on a hypothesis, answer:

"What observation would make me abandon this hypothesis?"

If the answer is "nothing" or "I cannot think of one", the hypothesis is unfalsifiable and must not drive the investigation. If the answer is concrete, go look for that observation before committing to action.

3. Layered isolation

Multi-hop systems (client → CDN → LB → reverse proxy → app → upstream) concentrate bugs at the seams between layers. When a symptom could plausibly come from several layers, do not reason about which layer; test. The canonical technique: run the same logical request through three or more paths that differ by exactly one hop, then compare where the symptom appears. This resolves in minutes what stacking hypotheses cannot resolve in hours. See references/layered-isolation-experiment.md.

4. Counter-review before committing

Before committing to a root cause or shipping a fix, have independent reviewers challenge the conclusion — not confirm it. Agents are good at surfacing risks a single investigator did not think of; they are bad at weighing them. Apply the four-question filter (see references/counter-review-pattern.md) to every finding before it shapes action.

Workflow

Copy this checklist into the investigation notes and check items off:

Investigation Progress:
- [ ] Step 0:   Scope the symptom (exact error, exact times, who, who-not, what changed)
- [ ] Step 0.5: Verify the premise — does direct evidence show the symptom is actually happening?
- [ ] Step 1:   Gather direct evidence at every hop before hypothesizing
- [ ] Step 2:   Frame ≥3 hypotheses; for each, name (a) what falsifies it, (b) which layer boundary the intervention would target
- [ ] Step 3:   Design a decisive experiment (for network: layered isolation)
- [ ] Step 4:   Add instrumentation if evidence gaps block direct observation
- [ ] Step 5:   Execute, record actual vs predicted
- [ ] Step 6:   Counter-review before acting
- [ ] Step 7:   Fix + re-run the same experiment to verify
- [ ] Step 8:   Document wrong turns as teaching material

Step 0: Scope

A tight scope is the difference between a 20-minute investigation and a 5-hour one. Before looking at anything, extract:

Exact error string (copy-paste, not paraphrase). `socket closed` is not the same as `ECONNRESET`, which is not the same as `HTTP/2 RST_STREAM INTERNAL_ERROR (err 2)`.
Exact timestamps (ISO-8601 with timezone, not "yesterday evening")
Reproducibility (every time / intermittent / only specific users)
Who is affected, who is not (differential observations narrow the search)
What changed recently (deploys, config, upstream dependencies, client versions)

Distinguish symptom from diagnosis. "Slow" is not a symptom. "Request took 130.898s then returned HTTP/2 INTERNAL_ERROR" is.
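The scope fields above can be captured in a small record that points out what is still too loose. This is a minimal sketch, not part of the skill itself: the `SymptomScope` class, its field names, and the `gaps` checks are hypothetical, and the sample values in the usage note are invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class SymptomScope:
    """Hypothetical container for the Step 0 scope fields."""
    error_string: str        # copied verbatim from the log or client
    observed_at: list[str]   # ISO-8601 timestamps, timezone required
    reproducibility: str     # "every time", "intermittent", "specific users"
    affected: list[str] = field(default_factory=list)
    not_affected: list[str] = field(default_factory=list)
    recent_changes: list[str] = field(default_factory=list)

    def gaps(self) -> list[str]:
        """Describe every scope field that is still too loose to act on."""
        problems = []
        if not self.error_string.strip():
            problems.append("error string is empty")
        if not self.observed_at:
            problems.append("no timestamps recorded")
        for raw in self.observed_at:
            try:
                parsed = datetime.fromisoformat(raw)
            except ValueError:
                problems.append(f"not ISO-8601: {raw!r}")
                continue
            if parsed.tzinfo is None:
                problems.append(f"timestamp has no timezone: {raw!r}")
        if not self.not_affected:
            problems.append("no unaffected population recorded")
        return problems
```

For example, a scope built with `observed_at=["yesterday evening"]` and no `not_affected` entries reports two gaps, while one with `"2024-05-01T20:15:00+08:00"` and a named unaffected group reports none.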

Step 0.5: Verify the premise

Before investing in a full investigation, confirm the reported symptom is actually happening — not just inferred from downstream effects or user frustration. One cheap direct observation beats hours spent investigating a non-problem.

Ask: "What direct evidence shows this symptom is real?"

If the user reports "timeout at 130s": is that from a timestamped log, a browser network panel, or a recollection?
If the user reports "connection reset": did they see the packet or is it inferred from a retry spike?
If the user reports "fails for some but not others": has it been reproduced in a controlled test, or is it anecdotal?

Acceptable premises:

Log line with timestamp and error string
Browser DevTools Network screenshot showing the failure
Reproduction command that shows the symptom on demand
Metrics chart showing the specific error count rising

Not sufficient as premise:

"Users are saying it feels slow"
"The alert fired but I did not check what actually failed"
"Last week someone mentioned..."

If the premise fails verification, the fix is observation — not investigation. Add the missing telemetry, wait for the next occurrence with instrumentation in place, and return when you have real data. Resist the sunk-cost instinct to investigate anyway "since we are already here".

Step 1: Gather direct evidence at every hop

Before framing hypotheses, collect:

Server-side logs at every hop in the request path
Client-side logs (browser devtools HAR, CLI debug log, SDK traces)
Metrics over the incident window (RPS, latency, error rate, connection count, CPU/mem)
Distributed trace if available
Packet capture if the symptom is at the wire level (see references/packet-capture-recipes.md)

If any of these is missing and relevant, fill the gap before guessing. Adding a `TRACE_*` env flag and restarting a container beats an hour of hypothesis-stacking. The instrumentation patterns in references/instrumentation-patterns.md are low-risk, env-gated, and safe to ship into production permanently.

Step 2: Hypotheses with falsifiers and threat-model boundaries

List three or more plausible causes. For each, answer three questions in one sentence apiece:

What would confirm it? (easy and often misleading)
What would refute it? (the falsifier — this is what matters)
Which layer boundary would the intervention target? (the threat-model question — forces you to be precise about where the fix would apply)

The third question prevents a common anti-pattern: proposing a fix that operates on the wrong hop. For example, a "keepalive" fix that writes bytes downstream to the client is useless for an upstream idle timeout — the intervention targets a different boundary than the problem. Naming the boundary up-front surfaces this mismatch before coding starts.

If you cannot state a concrete refuter, the hypothesis is unfalsifiable. Flag it, but do not act on it. If you cannot state which boundary a proposed fix targets, you do not yet understand what the fix actually does.

Step 3: Decisive experiment

For network-layer problems, the default is layered isolation: three paths differing by exactly one hop. Example for a CDN-fronted service:

| Path | Route | Rules out if it passes |
| --- | --- | --- |
| A | Full path via CDN | Nothing (this is the failing baseline) |
| B | `--resolve` to origin IP (bypass CDN) | CDN layer |
| C | Server loopback (bypass CDN + LB) | CDN + LB |

If only A fails, the CDN is the cause. If A and B fail but C passes, the LB is. Compose more variants as needed. See references/layered-isolation-experiment.md for a runnable template using a mock idle upstream. With a mock upstream the experiment does not need a cooperating production request to trigger it, and the idle interval can be controlled precisely.
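The reading of the A/B/C results can be written down as a small decision function. This is a sketch of the reasoning only: it does not run any probes, and the `PATHS` table and `verdict` function are hypothetical names, not the contents of scripts/layered-isolation-probe.sh.

```python
# Paths ordered from the full route to the most-bypassed route.
# Each entry: (path name, layers that path bypasses).
PATHS = [
    ("A", []),             # full path via CDN (failing baseline)
    ("B", ["CDN"]),        # --resolve to origin IP
    ("C", ["CDN", "LB"]),  # server loopback
]


def verdict(passed: dict[str, bool]) -> str:
    """Map pass/fail per path to the layer the evidence points at."""
    outcomes = [passed[name] for name, _ in PATHS]
    if outcomes[0]:
        return "baseline passes: symptom not reproduced, fix the repro first"
    if True not in outcomes:
        return "all paths fail: cause sits behind every bypassed layer"
    first_pass = outcomes.index(True)
    if not all(outcomes[first_pass:]):
        return "inconsistent: a longer bypass fails where a shorter one passes, rerun"
    # The suspect is whatever the first passing path bypasses that the
    # last failing path did not.
    before = set(PATHS[first_pass - 1][1])
    newly_bypassed = [layer for layer in PATHS[first_pass][1] if layer not in before]
    return "suspect: " + ", ".join(newly_bypassed)
```

With `{"A": False, "B": True, "C": True}` the function names the CDN; with `{"A": False, "B": False, "C": True}` it names the LB. Adding a hop to the experiment means adding one row to `PATHS`.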

For non-network domains:

Performance: controlled benchmark with one variable changed
Correctness bug: failing test case that reproduces
Intermittent: sampled tracing + wait for recurrence

Step 4: Instrumentation when needed

If the decisive experiment requires an observation that cannot currently be made, add it — do not skip it. The canonical pattern is env-gated instrumentation that:

Defaults off (zero runtime cost in steady state)
Turns on via one environment variable, without code changes
Writes greppable log tags (`[SSE-CHUNK] ts=... req=... bytes=...`)
Ships into production permanently — future incidents reuse it

See references/instrumentation-patterns.md for the exact template used to diagnose the Qiniu 125-second upstream silence in this incident.
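As a rough illustration of the pattern, the sketch below builds a tracer that is a no-op unless one environment variable is set. It is not the template from references/instrumentation-patterns.md: the flag name `TRACE_SSE`, the function `make_chunk_tracer`, and the exact field formatting are assumptions made for this example.

```python
import os
import sys
import time


def make_chunk_tracer(env=os.environ, out=sys.stderr):
    """Return a tracer that is a no-op unless TRACE_SSE=1 (hypothetical flag)."""
    if env.get("TRACE_SSE") != "1":
        # Default off: the hot path pays only for an empty function call.
        return lambda req_id, chunk: None

    def trace(req_id: str, chunk: bytes) -> None:
        # One greppable line per chunk; gaps between ts values expose idle periods.
        print(
            f"[SSE-CHUNK] ts={time.time():.3f} req={req_id} bytes={len(chunk)}",
            file=out,
            flush=True,
        )

    return trace


# Decide once at startup, then call from the streaming loop.
trace_chunk = make_chunk_tracer()
```

The flag is read once when the tracer is built, so switching it on takes a restart with the variable set and no code change, which matches the "restart a container" workflow from Step 1.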

Step 5: Execute and record

Run the experiment once, fully documented: command, environment, inputs, observed outputs, wall-clock timestamps. Compare against the prediction made in Step 2. If actual matches predicted, the hypothesis is calibrated. If not, the hypothesis is wrong — do not rescue it with ad-hoc auxiliary hypotheses ("oh, but maybe X also interferes..."). Return to Step 2 and write new hypotheses from scratch.
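The record-then-compare discipline can be sketched as a wrapper that timestamps a single run and refuses to soften a failed prediction. The `ExperimentRecord` class, the `run_once` helper, and the plain string comparison are simplifying assumptions for illustration; a real comparison would usually be more structured than string equality.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable


@dataclass
class ExperimentRecord:
    command: str
    environment: str
    inputs: str
    predicted: str
    observed: str
    started_at: str   # wall-clock, ISO-8601 with timezone
    finished_at: str

    def next_step(self) -> str:
        if self.observed == self.predicted:
            return "prediction held: proceed to counter-review (Step 6)"
        # No auxiliary hypotheses: a miss sends the investigation back.
        return "prediction failed: return to Step 2 and write new hypotheses"


def run_once(command: str, environment: str, inputs: str,
             predicted: str, run: Callable[[], str]) -> ExperimentRecord:
    """Execute `run` exactly once and capture everything needed to replay it."""
    started = datetime.now(timezone.utc).isoformat()
    observed = run()
    finished = datetime.now(timezone.utc).isoformat()
    return ExperimentRecord(command, environment, inputs,
                            predicted, observed, started, finished)
```

The prediction is passed in before the run happens, so it cannot be adjusted after the result is known.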

Step 6: Counter-review

Before committing to a root cause or shipping a fix, spawn independent reviewers to challenge the conclusion. Give them the same evidence, ask them to falsify, not confirm. Apply the four-question filter to each finding they raise:

Probability — will this actually happen?
Cost — what is the cost of fixing versus ignoring?
Realistic scenario — does this apply to the user's actual business case?
Verification — can I cheaply confirm or refute this?

Classify every finding: real issue / partly right / unlikely / actively harmful. Never paste raw agent output to the user; filter first. See references/counter-review-pattern.md.

Step 7: Fix and verify

Apply the fix. Rerun the same decisive experiment from Step 3. Confirm the symptom no longer reproduces with the same setup that was reliably producing it. If the failure also stops reproducing with the fix removed, the fix is unproven: figure out why the repro was lost before declaring victory.

Step 8: Document wrong turns

The wrong turns in the investigation are more valuable than the right answer. Write an incident report capturing:

Symptom + direct evidence
Each hypothesis tried + how it was falsified
Decisive experiment design + result
Fix + verification
New monitoring or instrumentation added

Future investigators, including your future self, will read this to avoid the same cognitive traps.

Common cognitive traps

Circumstantial evidence convergence. Five indirect clues all pointing the same direction feel like proof. They are not. If a direct probe is cheap, run it.
Field-semantic confusion. `duration=5.95s` can mean total wall time (one tool), handler execution phase (another tool), or TTFB (a third). Never cite a numeric field without verifying its semantics against documentation or code.
Single-cause bias. Multi-layer systems fail from multi-layer defect compositions. Fix the direct cause but document the amplifying factors so the next layer of defense can also be hardened.
Naming assumption. A resource labeled `spot-instance` may not actually be a spot instance. Verify attributes via API, not metadata names.
Probe self-verification. A diagnostic that runs through the broken connection to test the broken connection yields uninterpretable results. Always cross-verify with an independent probe.
Assumption-rescue cycle. When evidence contradicts a hypothesis, the temptation is to add a modifier ("yes, but only in case X"). Resist. If the first falsifier fires, scrap the hypothesis.
Unverified premise. Investigating a symptom that was never directly observed — inferred from user frustration, alert titles, or downstream effects. Verify first (Step 0.5). Do not investigate anecdotes.
Threat-model mismatch. Proposing a fix that targets the wrong layer — writing bytes downstream to solve an upstream problem, tuning a timeout on a hop that never fires it. Naming the boundary each hypothesis targets (Step 2) surfaces this.

See references/cognitive-traps.md for extended examples including this case study.

Anti-patterns — things to explicitly avoid

Jumping to a fix before a falsifier is found. "Probably it is X, let me restart / tweak / upgrade." This converts learning opportunities into mystery fixes that do not prevent recurrence.
Accepting agent counter-review findings wholesale. Agents over-produce risk findings. Filter before acting (see four-question filter above).
Ad-hoc production edits that bypass IaC. If the investigation requires changing production, change the source-of-truth first, then apply — otherwise the "fix" evaporates on the next deploy and the drift hides the real state.
Declaring root cause from a single observation. Demand a falsifier attempt first.
Writing "should work now" without re-running the failing experiment. Re-verify.

Case study

The case study in references/case-sse-rst-130s.md walks through a full 5-hour investigation where the assistant repeatedly jumped to the wrong conclusion. The right answer (a Cloudflare edge HTTP/2 stream idle timeout at 126 seconds, amplified by Qiniu not emitting SSE pings during Sonnet 4.6 tool_use generation) surfaced in 10 minutes once a subagent designed a 3-path layered isolation experiment with a mock idle upstream. Read the case study before applying this skill to an unfamiliar problem domain; the wrong-turn anatomy is the teaching.

Reference files

references/layered-isolation-experiment.md — 3-path technique, mock upstream template, result matrix
references/instrumentation-patterns.md — env-gated TRACE_*, greppable log tags, deployment checklist
references/packet-capture-recipes.md — tcpdump filters for RST isolation, interface selection on Docker, HTTP/2 decoding
references/counter-review-pattern.md — 4-agent team composition, 4-question filter, integration workflow
references/cognitive-traps.md — extended examples, rescue-cycle warnings
references/case-sse-rst-130s.md — canonical case study with wrong-turn timeline

Scripts

scripts/mock-idle-upstream.py — SSE server that emits one frame then idles N seconds. Use as the upstream in layered isolation experiments to precisely control the idle interval.
scripts/layered-isolation-probe.sh — Runs the 3-path A/B/C comparison and prints a diagnostic matrix.
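To give a feel for what a mock idle upstream involves, here is a minimal sketch built on the standard library. It is not the contents of scripts/mock-idle-upstream.py: the names `make_handler` and `serve` are hypothetical, and sending a second frame after the idle period (so the client can tell whether the connection survived) is an assumption of this example.

```python
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def make_handler(idle_seconds: float):
    """Build a handler that sends one SSE frame, stays silent, then sends another."""

    class IdleSSEHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(200)
            self.send_header("Content-Type", "text/event-stream")
            self.send_header("Cache-Control", "no-cache")
            self.end_headers()
            self.wfile.write(b"data: first\n\n")
            self.wfile.flush()
            time.sleep(idle_seconds)  # the controlled silence under test
            # Assumption: a trailing frame proves the stream outlived the idle gap.
            self.wfile.write(b"data: after-idle\n\n")
            self.wfile.flush()

        def log_message(self, fmt, *args):
            pass  # keep experiment output free of access-log noise

    return IdleSSEHandler


def serve(port: int, idle_seconds: float) -> ThreadingHTTPServer:
    """Create (but do not start) the mock upstream on 127.0.0.1."""
    return ThreadingHTTPServer(("127.0.0.1", port), make_handler(idle_seconds))
```

Calling `serve(8080, 130).serve_forever()` would hold each stream silent for 130 seconds; the port and interval here are illustrative. A client that receives `after-idle` on one path but a reset on another has localized the idle timeout to the hop that differs.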
