axiom-networking-diag
Network.framework Diagnostics
Overview
Core principle: 85% of networking problems stem from misunderstanding connection states, failing to handle network transitions, or improper error handling, not from Network.framework defects.
Network.framework is battle-tested in every iOS app (it powers URLSession internally), handles trillions of requests daily, and provides smart connection establishment with Happy Eyeballs, proxy evaluation, and WiFi Assist. If your connection is failing, timing out, or behaving unexpectedly, the issue is almost always in how you're using the framework, not the framework itself.
This skill provides systematic diagnostics to identify root causes in minutes, not hours.
Red Flags — Suspect Networking Issue
If you see ANY of these, suspect a networking misconfiguration, not framework breakage:
- Connection times out after 60 seconds with no clear error
- TLS handshake fails with "certificate invalid" on some networks
- Data sent but never arrives at receiver
- Connection drops when switching WiFi to cellular
- Works perfectly on WiFi but fails 100% of time on cellular
- Works in simulator but fails on real device
- Connection succeeds on your network but fails for users
- ❌ FORBIDDEN: "Network.framework is broken, we should rewrite with sockets"
  - Network.framework powers URLSession, used in every iOS app
  - Handles edge cases you'll spend months discovering with sockets
  - Apple engineers have 10+ years of production debugging baked into the framework
  - Switching to sockets will expose you to 100+ edge cases
Critical distinction: The simulator uses the macOS networking stack (not iOS), hides cellular-specific issues (IPv6-only networks), and doesn't simulate network transitions. MANDATORY: Test on a real device with real network conditions.
Mandatory First Steps
ALWAYS run these commands FIRST (before changing code):
swift
// 1. Enable Network.framework logging
// Add to Xcode scheme: Product → Scheme → Edit Scheme → Arguments
// -NWLoggingEnabled 1
// -NWConnectionLoggingEnabled 1

// 2. Check connection state history
connection.stateUpdateHandler = { state in
    // Log every state transition with timestamp
    print("\(Date()): Connection state: \(state)")
}

// 3. Check TLS configuration
// If using custom TLS parameters, confirm what was negotiated once the connection is .ready:
if let tls = connection.metadata(definition: NWProtocolTLS.definition) as? NWProtocolTLS.Metadata {
    let metadata = tls.securityProtocolMetadata
    print("TLS version: \(sec_protocol_metadata_get_negotiated_tls_protocol_version(metadata))")
    print("Cipher suite: \(sec_protocol_metadata_get_negotiated_tls_ciphersuite(metadata))")
}

// 4. Test with packet capture (Charles Proxy or Wireshark)
// On device: Settings → WiFi → (i) → Configure Proxy → Manual
// Charles: Help → SSL Proxying → Install Charles Root Certificate on iOS

// 5. Test on different networks
// - WiFi
// - Cellular (disable WiFi)
// - Airplane Mode → WiFi (test waiting state)
// - VPN active
// - IPv6-only (some cellular carriers)
What this tells you
| Observation | Diagnosis | Next Step |
|---|---|---|
| Stuck in .preparing > 5 seconds | DNS failure or network down | Pattern 1a |
| Moves to .waiting immediately | No connectivity (Airplane Mode, no signal) | Pattern 1b |
| .failed with POSIX error 61 | Connection refused (server not listening) | Pattern 1c |
| .failed with POSIX error 50 | Network down (interface disabled) | Pattern 1d |
| .ready then immediate .failed | TLS handshake failure | Pattern 2b |
| .ready, send succeeds, no data arrives | Framing problem or receiver not processing | Pattern 3a |
| Works WiFi, fails cellular | IPv6-only network (hardcoded IPv4) | Pattern 5a |
| Works without VPN, fails with VPN | Proxy interference or DNS override | Pattern 5b |
MANDATORY INTERPRETATION
Before changing ANY code, identify ONE of these:
- If stuck in .preparing AND network is available → DNS failure (check nslookup)
- If .waiting immediately AND Airplane Mode is off → Interface-specific issue (cellular blocked)
- If .failed POSIX 61 → Server issue (check server logs)
- If .failed with TLS error -9806 → Certificate validation (check with openssl)
- If .ready but data not arriving → Framing or receiver issue (enable packet capture)
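The interpretation rules above can be captured in a small lookup so that every failed connection logs a suggested next step. This is a minimal sketch, not part of Network.framework: `diagnose(posixCode:)` is a hypothetical helper, and the numeric codes are the Darwin errno values quoted in this guide (they differ on other platforms).

```swift
import Foundation

/// Hypothetical helper: maps the Darwin POSIX error numbers used in this guide
/// to the diagnosis and pattern listed in the table above.
func diagnose(posixCode: Int32) -> String {
    switch posixCode {
    case 61: return "Connection refused (server not listening): Pattern 1c"
    case 50: return "Network down (interface disabled): Pattern 1d"
    case 54: return "Connection reset by peer: Pattern 2d"
    case 65: return "Host unreachable (check for IPv6-only network): Pattern 5a"
    default: return "Unmapped POSIX error \(posixCode): capture logs before changing code"
    }
}

print(diagnose(posixCode: 61))
print(diagnose(posixCode: 99))
```

In an app, the code would typically come from the `.failed` state, for example by matching `NWError.posix(let code)` and passing `code.rawValue`.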
If diagnostics are contradictory or unclear
- STOP. Do NOT proceed to patterns yet
- Add timestamp logging to every send/receive call
- Enable packet capture (Charles/Wireshark)
- Test on different device to isolate hardware vs software issue
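For the timestamp-logging step, a small formatter keeps send and receive entries comparable against a packet capture. This is a minimal sketch; `logLine` is a hypothetical helper name, not a Network.framework API.

```swift
import Foundation

/// Hypothetical helper: formats one diagnostic log line with a
/// millisecond-precision UTC timestamp.
func logLine(_ event: String, bytes: Int? = nil, at date: Date = Date()) -> String {
    let formatter = ISO8601DateFormatter()
    formatter.formatOptions = [.withInternetDateTime, .withFractionalSeconds]
    let size = bytes.map { " (\($0) bytes)" } ?? ""
    return "\(formatter.string(from: date)) \(event)\(size)"
}

print(logLine("send", bytes: 5))
print(logLine("receive"))
```

Calling it from inside each send completion and receive handler gives one line per call that can be matched by time against Charles or Wireshark.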
Decision Tree
Use this to reach the correct diagnostic pattern in 2 minutes:
Network problem?
├─ Connection never reaches .ready?
│ ├─ Stuck in .preparing for >5 seconds?
│ │ ├─ DNS lookup timing out? → Pattern 1a (DNS Failure)
│ │ ├─ Network available but can't reach host? → Pattern 1c (Connection Refused)
│ │ └─ First connection slow, subsequent fast? → Pattern 1e (DNS Caching)
│ │
│ ├─ Moves to .waiting immediately?
│ │ ├─ Airplane Mode or no signal? → Pattern 1b (No Connectivity)
│ │ ├─ Cellular blocked by parameters? → Pattern 1b (Interface Restrictions)
│ │ └─ VPN connecting? → Wait and retry
│ │
│ ├─ .failed with POSIX error 61?
│ │ └─ → Pattern 1c (Connection Refused)
│ │
│ └─ .failed with POSIX error 50?
│ └─ → Pattern 1d (Network Down)
│
├─ Connection reaches .ready, then fails?
│ ├─ Fails immediately after .ready?
│ │ ├─ TLS error -9806? → Pattern 2b (Certificate Validation)
│ │ ├─ TLS error -9801? → Pattern 2b (Protocol Version)
│ │ └─ POSIX error 54? → Pattern 2d (Connection Reset)
│ │
│ ├─ Fails after network change (WiFi → cellular)?
│ │ ├─ No viabilityUpdateHandler? → Pattern 2a (Viability Not Handled)
│ │ ├─ Didn't detect better path? → Pattern 2a (Better Path)
│ │ └─ IPv6 → IPv4 transition? → Pattern 5a (Dual Stack)
│ │
│ ├─ Fails after timeout?
│ │ └─ → Pattern 2c (Receiver Not Responding)
│ │
│ └─ Random disconnects?
│ └─ → Pattern 2d (Network Instability)
│
├─ Data not arriving?
│ ├─ Send succeeds, receive never returns?
│ │ ├─ No message framing? → Pattern 3a (Framing Problem)
│ │ ├─ Wrong byte count? → Pattern 3b (Min/Max Bytes)
│ │ └─ Receiver not calling receive()? → Check receiver code
│ │
│ ├─ Partial data arrives?
│ │ ├─ receive(exactly:) too large? → Pattern 3b (Chunking)
│ │ ├─ Sender closing too early? → Check sender lifecycle
│ │ └─ Buffer overflow? → Pattern 3b (Buffer Management)
│ │
│ ├─ Data corrupted?
│ │ ├─ TLS disabled? → Pattern 3c (No Encryption)
│ │ ├─ Binary vs text encoding? → Check ContentType
│ │ └─ Byte order (endianness)? → Use network byte order
│ │
│ └─ Works sometimes, fails intermittently?
│ └─ → Pattern 3d (Race Condition)
│
├─ Performance degrading?
│ ├─ Latency increasing over time?
│ │ ├─ TCP congestion? → Pattern 4a (Congestion Control)
│ │ ├─ No contentProcessed pacing? → Pattern 4a (Buffering)
│ │ └─ Server overloaded? → Check server metrics
│ │
│ ├─ Throughput decreasing?
│ │ ├─ Network transition WiFi → cellular? → Pattern 4b (Bandwidth Change)
│ │ ├─ Packet loss increasing? → Pattern 4b (Network Quality)
│ │ └─ Multiple streams competing? → Pattern 4b (Prioritization)
│ │
│ ├─ High CPU usage?
│ │ ├─ Not using batch for UDP? → Pattern 4c (Batching)
│ │ ├─ Too many small sends? → Pattern 4c (Coalescing)
│ │ └─ Using sockets instead of Network.framework? → Migrate (30% CPU savings)
│ │
│ └─ Memory growing?
│ ├─ Not releasing connections? → Pattern 4d (Connection Leaks)
│ ├─ Not cancelling on deinit? → Pattern 4d (Lifecycle)
│ └─ Missing [weak self]? → Pattern 4d (Retain Cycles)
│
└─ Works on WiFi, fails on cellular/VPN?
  ├─ IPv6-only cellular network?
  │ ├─ Hardcoded IPv4 address? → Pattern 5a (IPv4 Literal)
  │ ├─ getaddrinfo with AF_INET only? → Pattern 5a (Address Family)
  │ └─ Works on some carriers, not others? → Pattern 5a (Regional IPv6)
  │
  ├─ Corporate VPN active?
  │ ├─ Proxy configuration failing? → Pattern 5b (PAC)
  │ ├─ DNS override blocking hostname? → Pattern 5b (DNS)
  │ └─ Certificate pinning failing? → Pattern 5b (TLS in VPN)
  │
  ├─ Port blocked by firewall?
  │ ├─ Non-standard port? → Pattern 5c (Firewall)
  │ ├─ Outbound only? → Pattern 5c (NATing)
  │ └─ Works on port 443, not 8080? → Pattern 5c (Port Scanning)
  │
  └─ Peer-to-peer connection failing?
    ├─ NAT traversal issue? → Pattern 5d (STUN/TURN)
    ├─ Symmetric NAT? → Pattern 5d (NAT Type)
    └─ Local network only? → Pattern 5d (Bonjour/mDNS)
Pattern Selection Rules (MANDATORY)
Before proceeding to a pattern:
- Connection never reaching .ready → Start with Pattern 1 (DNS, connectivity, refused)
- TLS error codes → Jump directly to Pattern 2b (Certificate validation)
- Data not arriving → Enable packet capture FIRST, then Pattern 3
- Network-specific (works WiFi, fails cellular) → Test on that exact network, Pattern 5
- Performance degradation → Profile with Instruments Network template, Pattern 4
Apply ONE pattern at a time
- Implement the fix from one pattern
- Test thoroughly
- Only if issue persists, try next pattern
- DO NOT apply multiple patterns simultaneously (can't isolate cause)
FORBIDDEN
- Guessing at solutions without diagnostics
- Changing multiple things at once
- Assuming "just needs more timeout"
- Disabling TLS "temporarily"
- Switching to sockets to "avoid framework issues"
Diagnostic Patterns
Pattern 1a: DNS Resolution Failure
Time cost: 10-15 minutes
Symptom
- Connection stuck in .preparing for >5 seconds
- Eventually fails or times out
- Works with IP address but not hostname
- Works on one network, fails on another
Diagnosis
swift
// Enable DNS logging
// -NWLoggingEnabled 1

// Check DNS resolution manually
// Terminal: nslookup example.com
// Terminal: dig example.com

// Logs show:
// "DNS lookup timed out"
// "getaddrinfo failed: 8 (nodename nor servname provided)"
Common causes
- DNS server unreachable (corporate network blocks external DNS)
- Hostname typo or doesn't exist
- DNS caching stale entry (rare, but happens)
- VPN blocking DNS resolution
Fix
swift
// ❌ WRONG: Changing connection parameters doesn't fix DNS
/*
let parameters = NWParameters.tls
parameters.expiredDNSBehavior = .allow // Doesn't help if DNS never resolves
*/

// ✅ CORRECT: Verify hostname, test DNS manually
// 1. Test DNS manually:
//    $ nslookup your-hostname.com
//    If this fails, DNS is the problem (not your code)
// 2. If DNS works manually but not in app:
//    Check if VPN or enterprise config is blocking app DNS
// 3. If hostname doesn't exist:
let connection = NWConnection(
    host: NWEndpoint.Host("correct-hostname.com"), // Fix typo
    port: 443,
    using: .tls
)
// 4. If DNS caching issue (rare):
//    Restart device to clear DNS cache
//    Or use IP address temporarily while investigating DNS server issue
Verification
- Run nslookup your-hostname.com: it should return an IP in <1 second
- Test on cellular (different DNS servers): it should work
- Check corporate network DNS configuration
Prevention
- Use well-known hostnames (don't rely on internal DNS)
- Test on multiple networks during development
- Don't hardcode IPs (if DNS fails, you need to fix DNS, not bypass it)
Pattern 2b: TLS Certificate Validation Failure
Time cost: 15-20 minutes
Symptom
- Connection reaches .ready briefly, then .failed immediately
- Error -9806 (errSSLClosedAbort): handshake aborted, commonly after a certificate is rejected
- Error -9807 (errSSLXCertChainInvalid): invalid certificate chain
- Error -9801 (errSSLNegotiation): protocol or cipher suite negotiation failed
- Works on some servers, fails on others
Diagnosis
bash
# Test TLS manually with openssl
openssl s_client -connect example.com:443 -showcerts

# Check certificate details
openssl s_client -connect example.com:443 | openssl x509 -noout -dates
# notBefore: Jan 1 00:00:00 2024 GMT
# notAfter: Dec 31 23:59:59 2024 GMT ← Check if expired

# Check certificate chain
openssl s_client -connect example.com:443 -showcerts | grep "CN="
# Should show: Subject CN=example.com, Issuer CN=Trusted CA
Common causes
- Self-signed certificate (dev/staging servers)
- Expired certificate
- Certificate hostname mismatch (cert for "example.com" but connecting to "www.example.com")
- Missing intermediate CA certificate
- TLS 1.0/1.1 (iOS 13+ requires TLS 1.2+)
Fix
For production servers with invalid certs
swift
// ❌ WRONG: Never disable certificate validation in production
/*
let tlsOptions = NWProtocolTLS.Options()
sec_protocol_options_set_verify_block(tlsOptions.securityProtocolOptions, { ... }, .main)
// This disables validation → security vulnerability
*/

// ✅ CORRECT: Fix the certificate on the server
// 1. Renew expired certificate (Let's Encrypt, DigiCert, etc.)
// 2. Ensure hostname matches (CN=example.com or SAN includes example.com)
// 3. Include intermediate CA certificates on server
// 4. Test with: openssl s_client -connect example.com:443
For development servers (temporary)
swift
// ⚠️ ONLY for development/staging
#if DEBUG
let tlsOptions = NWProtocolTLS.Options()
sec_protocol_options_set_verify_block(
    tlsOptions.securityProtocolOptions,
    { (sec_protocol_metadata, sec_trust, sec_protocol_verify_complete) in
        // Trust any certificate (DEV ONLY)
        sec_protocol_verify_complete(true)
    },
    .main
)
let parameters = NWParameters(tls: tlsOptions)
let connection = NWConnection(host: "dev-server.example.com", port: 443, using: parameters)
#endif
For certificate pinning
swift
import Network
import Security

// Certificate pinning: accept only the pinned leaf certificate
let pinnedCertificateData = Data(/* DER bytes of your pinned certificate */)
let tlsOptions = NWProtocolTLS.Options()
sec_protocol_options_set_verify_block(
    tlsOptions.securityProtocolOptions,
    { (_, secTrust, complete) in
        let trust = sec_trust_copy_ref(secTrust).takeRetainedValue()
        // Run standard chain evaluation first; pinning adds to it, it does not replace it
        guard SecTrustEvaluateWithError(trust, nil),
              let chain = SecTrustCopyCertificateChain(trust) as? [SecCertificate], // iOS 15+
              let leaf = chain.first else {
            complete(false)
            return
        }
        // Compare the server's leaf certificate with the pinned certificate
        let serverCertificateData = SecCertificateCopyData(leaf) as Data
        complete(serverCertificateData == pinnedCertificateData) // Reject non-pinned certificates
    },
    .main
)
Verification
- openssl s_client -connect example.com:443 shows Verify return code: 0 (ok)
- Certificate expiration > 30 days in future
- Certificate CN matches hostname
- Test on real iOS device (not just simulator)
Pattern 3a: Message Framing Problem
Time cost: 20-30 minutes
Symptom
- connection.send() succeeds with no error
- connection.receive() never returns data
- Or receive() returns partial data
- Packet capture shows bytes on wire, but app doesn't process them
Diagnosis
swift
// Enable detailed logging
connection.send(content: data, completion: .contentProcessed { error in
    if let error = error {
        print("Send error: \(error)")
    } else {
        print("✅ Sent \(data.count) bytes at \(Date())")
    }
})
connection.receive(minimumIncompleteLength: 1, maximumLength: 65536) { data, context, isComplete, error in
    if let error = error {
        print("Receive error: \(error)")
    } else if let data = data {
        print("✅ Received \(data.count) bytes at \(Date())")
    }
}
// Use Charles Proxy or Wireshark to verify bytes on wire
Common cause: Stream protocols (TCP/TLS) don't preserve message boundaries.
Example
swift
// Sender sends 3 messages:
send("Hello") // 5 bytes
send("World") // 5 bytes
send("!")     // 1 byte

// Receiver might get:
receive() → "HelloWorld!" // All 11 bytes at once
// Or:
receive() → "Hel"      // 3 bytes
receive() → "loWorld!" // 8 bytes
// Message boundaries lost!
Fix
Solution 1: Use TLV Framing (iOS 26+)
swift
// NetworkConnection with TLV
let connection = NetworkConnection(
    to: .hostPort(host: "example.com", port: 1029)
) {
    TLV {
        TLS()
    }
}

// Send typed messages
enum MessageType: Int {
    case chat = 1
    case ping = 2
}
let chatData = Data("Hello".utf8)
try await connection.send(chatData, type: MessageType.chat.rawValue)

// Receive typed messages
let (data, metadata) = try await connection.receive()
if metadata.type == MessageType.chat.rawValue {
    // Decode without force-unwrapping; invalid UTF-8 becomes replacement characters
    print("Chat message: \(String(decoding: data, as: UTF8.self))")
}
Solution 2: Manual Length Prefix (iOS 12-25)
swift
// Sender: Prefix message with UInt32 length
func sendMessage(_ message: Data) {
    var length = UInt32(message.count).bigEndian
    let lengthData = Data(bytes: &length, count: 4)
    connection.send(content: lengthData, completion: .contentProcessed { _ in
        connection.send(content: message, completion: .contentProcessed { _ in
            print("Sent message with length prefix")
        })
    })
}

// Receiver: Read length, then read message
func receiveMessage() {
    // 1. Read 4-byte length
    connection.receive(minimumIncompleteLength: 4, maximumLength: 4) { lengthData, _, _, error in
        guard let lengthData = lengthData, lengthData.count == 4 else { return }
        // Decode the big-endian length; loadUnaligned avoids alignment traps
        let length = lengthData.withUnsafeBytes { UInt32(bigEndian: $0.loadUnaligned(as: UInt32.self)) }
        // 2. Read message of exact length
        connection.receive(minimumIncompleteLength: Int(length), maximumLength: Int(length)) { messageData, _, _, error in
            guard let messageData = messageData else { return }
            print("Received complete message: \(messageData.count) bytes")
        }
    }
}
Verification
- Send 10 messages, verify receiver gets exactly 10 messages
- Send messages of varying sizes (1 byte, 1000 bytes, 64KB)
- Test with packet loss simulation (Network Link Conditioner)
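The length-prefix scheme from Solution 2 can be exercised without a network by feeding a decoder arbitrary chunks, which is what TCP effectively does. The sketch below is an illustration under stated assumptions: `LengthPrefixDecoder` and its methods are hypothetical names, and it assumes the same 4-byte big-endian length header as Solution 2.

```swift
import Foundation

/// Hypothetical helper: accumulates stream bytes and extracts complete
/// messages framed as a 4-byte big-endian length followed by the payload.
struct LengthPrefixDecoder {
    private var buffer = Data()

    /// Frames one message the same way the sender in Solution 2 does.
    static func frame(_ message: Data) -> Data {
        var length = UInt32(message.count).bigEndian
        var framed = Data(bytes: &length, count: 4)
        framed.append(message)
        return framed
    }

    /// Appends a received chunk and returns every message that is now complete.
    mutating func append(_ chunk: Data) -> [Data] {
        buffer.append(chunk)
        var messages: [Data] = []
        while buffer.count >= 4 {
            // Assemble the length byte by byte to avoid unaligned loads.
            let length = buffer.prefix(4).reduce(0) { ($0 << 8) | Int($1) }
            guard buffer.count >= 4 + length else { break }
            messages.append(Data(buffer.dropFirst(4).prefix(length)))
            buffer = Data(buffer.dropFirst(4 + length))
        }
        return messages
    }
}

// Simulate TCP delivering three framed messages split at an arbitrary offset.
let stream = ["Hello", "World", "!"]
    .map { LengthPrefixDecoder.frame(Data($0.utf8)) }
    .reduce(into: Data()) { $0.append($1) }
var decoder = LengthPrefixDecoder()
var received: [String] = []
for chunk in [stream.prefix(7), stream.dropFirst(7)] {
    received += decoder.append(Data(chunk)).map { String(decoding: $0, as: UTF8.self) }
}
print(received)
```

The first chunk ends in the middle of "Hello", so the decoder returns nothing until the second chunk arrives, then yields all three messages with their boundaries intact.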
Pattern 4a: TCP Congestion and Buffering
Time cost: 15-25 minutes
Symptom
- First few sends fast, then increasingly slow
- Latency grows from 50ms → 500ms → 2000ms over time
- Memory usage growing (buffering unsent data)
- User reports app "feels sluggish" after 5 minutes
Diagnosis
swift
// Monitor send completion time
let sendStart = Date()
connection.send(content: data, completion: .contentProcessed { error in
    let elapsed = Date().timeIntervalSince(sendStart)
    print("Send completed in \(elapsed)s") // Should be < 0.1s normally
    // If > 1s, TCP congestion or receiver not draining fast enough
})

// Profile with Instruments
// Xcode → Product → Profile → Network template
// Check "Bytes Sent" vs "Time" graph
// Should be smooth line, not stepped/stalled
Common causes
- Sender sending faster than receiver can process (back pressure)
- Network congestion (packet loss, retransmits)
- No pacing with contentProcessed callback
- Sending on connection that lost viability
Fix
swift
// ❌ WRONG: Sending without pacing
/*
for frame in videoFrames {
    connection.send(content: frame, completion: .contentProcessed { _ in })
    // Buffers all frames immediately → memory spike → congestion
}
*/

// ✅ CORRECT: Pace with contentProcessed callback
func sendFrameWithPacing() {
    guard let nextFrame = getNextFrame() else { return }
    connection.send(content: nextFrame, completion: .contentProcessed { [weak self] error in
        if let error = error {
            print("Send error: \(error)")
            return
        }
        // contentProcessed = network stack consumed frame
        // NOW send next frame (pacing)
        self?.sendFrameWithPacing()
    })
}

// Start pacing
sendFrameWithPacing()
Alternative: Async/await (iOS 26+)
swift
// NetworkConnection with natural back pressure
func sendFrames() async throws {
    for frame in videoFrames {
        try await connection.send(frame)
        // Suspends automatically if network can't keep up
        // Built-in back pressure, no manual pacing needed
    }
}
Verification
- Send 1000 messages, monitor memory usage (should stay flat)
- Monitor send completion time (should stay < 100ms)
- Test with Network Link Conditioner (100ms latency, 3% packet loss)
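To make the "send completion time should stay < 100ms" check repeatable, the elapsed times from the diagnosis snippet can be fed into a rolling average. This is a minimal sketch under assumptions: `SendLatencyMonitor` is a hypothetical type, and the window size and threshold are illustrative values, not framework defaults.

```swift
import Foundation

/// Hypothetical helper: tracks recent send completion times and reports
/// when the rolling average exceeds a threshold.
struct SendLatencyMonitor {
    let windowSize: Int
    let threshold: TimeInterval
    private var samples: [TimeInterval] = []

    init(windowSize: Int = 5, threshold: TimeInterval = 0.1) {
        self.windowSize = windowSize
        self.threshold = threshold
    }

    /// Records one completion time; returns true when the window is full
    /// and its average is above the threshold.
    mutating func record(_ elapsed: TimeInterval) -> Bool {
        samples.append(elapsed)
        if samples.count > windowSize { samples.removeFirst() }
        guard samples.count == windowSize else { return false }
        let average = samples.reduce(0, +) / Double(windowSize)
        return average > threshold
    }
}

// Completion times growing the way the symptom list describes.
var monitor = SendLatencyMonitor(windowSize: 3, threshold: 0.1)
let flags = [0.05, 0.05, 0.05, 0.5, 2.0].map { monitor.record($0) }
print(flags)
```

In the pacing fix, `record` would be called inside the `contentProcessed` completion with the measured elapsed time; a `true` result is a cue to slow the sender or drop frames.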
Pattern 5a: IPv6-Only Cellular Network (Hardcoded IPv4)
Time cost: 10-15 minutes
Symptom
- Works perfectly on WiFi (dual-stack IPv4/IPv6)
- Fails 100% of time on cellular (IPv6-only)
- Works on some carriers (T-Mobile), fails on others (Verizon)
- Logs show "Host unreachable" or POSIX error 65 (EHOSTUNREACH)
Diagnosis
bash
# Check if hostname has IPv6
dig AAAA example.com

# Check if device is on IPv6-only network
# Settings → WiFi/Cellular → (i) → IP Address
# If starts with "2001:" or "fe80:" → IPv6
# If "192.168" or "10." → IPv4

# Test with IPv6-only simulator
# Xcode → Devices → (device) → Use as Development Target
# Settings → Developer → Networking → DNS64/NAT64
Common causes
- Hardcoded IPv4 address ("192.168.1.1")
- getaddrinfo with AF_INET only (filters out IPv6)
- Server has no IPv6 address (AAAA record)
- Not using Connect by Name (manual DNS)
Fix
swift
// ❌ WRONG: Hardcoded IPv4
/*
let host = "192.168.1.100" // Fails on IPv6-only cellular
*/

// ❌ WRONG: Forcing IPv4
/*
let parameters = NWParameters.tcp
parameters.requiredInterfaceType = .wifi
if let ip = parameters.defaultProtocolStack.internetProtocol as? NWProtocolIP.Options {
    ip.version = .v4 // Fails on IPv6-only
}
*/

// ✅ CORRECT: Use hostname, let framework handle IPv4/IPv6
let connection = NWConnection(
    host: NWEndpoint.Host("example.com"), // Hostname, not IP
    port: 443,
    using: .tls
)
// Framework automatically:
// 1. Resolves both A (IPv4) and AAAA (IPv6) records
// 2. Tries IPv6 first (if available)
// 3. Falls back to IPv4 (Happy Eyeballs)
// 4. Works on any network (IPv4, IPv6, dual-stack)
Verification
- Test on real device with cellular (disable WiFi)
- Test with multiple carriers (Verizon, AT&T, T-Mobile)
- Enable DNS64/NAT64 in developer settings
- Run dig AAAA your-hostname.com to verify that an IPv6 record exists
Production Crisis Scenario
Context: iOS App Update Causes 15% Connection Failures
Situation
- Your company releases iOS app update (v4.2) on Monday morning
- By noon, Customer Support reports surge in "app doesn't work" tickets
- Analytics show 15% of users experiencing connection failures (10,000+ users)
- CEO sends Slack message: "What's going on? How fast can we fix this?"
- Engineering manager asks for ETA
- You're the networking engineer
Pressure signals
- 🚨 Production outage: 10K+ users affected, revenue impact, negative App Store reviews incoming
- ⏰ Time pressure: "Need fix ASAP, trending on Twitter"
- 👔 Executive visibility: CEO personally asking for updates
- 📊 Public image: App Store rating dropping from 4.8 → 4.1 in 3 hours
- 💸 Financial impact: E-commerce app, each minute costs $5K in lost sales
Rationalization traps (DO NOT fall into these)
- "Just roll back to v4.1"
  - Tempting but takes 1-2 hours for app review, another 24 hours for users to update
  - Doesn't find root cause (might happen again)
  - Loses v4.2 features you worked on for weeks
- "Disable TLS temporarily to narrow it down"
  - Security vulnerability, will cause App Store rejection
  - Doesn't solve actual problem (masks symptoms)
  - When would you re-enable? (spoiler: never, because fixing it "later" never happens)
- "It works on my device, must be user error"
  - Arrogance, not diagnosis
  - 10K users having same "error"? That's not user error.
- "Let's add retry logic and more timeouts"
  - Doesn't address root cause
  - Makes problem worse (more retries = more load on failing path)
MANDATORY Diagnostic Protocol
You have 1 hour to provide the CEO with:
- Root cause
- Fix timeline
- Mitigation plan
Step 1: Establish Baseline (5 minutes)
bash
# Check what changed in v4.2
git diff v4.1 v4.2 -- NetworkClient.swift
# Most likely culprits:
# - TLS configuration changed
# - Added certificate pinning
# - Changed connection parameters
# - Updated hostname
Step 2: Reproduce in Production Environment (10 minutes)
swift
// Check failure pattern:
// - Random 15%? Or specific user segment?
// - Specific iOS version? (check analytics)
// - Specific network? (WiFi vs cellular)
// Enable logging on production builds (emergency flag).
// PRODUCTION is a custom compilation condition defined by your build settings.
#if PRODUCTION
if UserDefaults.standard.bool(forKey: "EnableNetworkLogging") {
    // Turn on verbose network logging here
    // (equivalent of the -NWLoggingEnabled 1 launch argument)
}
#endif
// Ask Customer Support to enable for affected users
// Check logs for specific error code
Step 3: Check Recent Code Changes (5 minutes)
swift
// Found in git diff:
// v4.1:
// let parameters = NWParameters.tls
// v4.2:
let tlsOptions = NWProtocolTLS.Options()
sec_protocol_options_set_min_tls_protocol_version(
    tlsOptions.securityProtocolOptions,
    .TLSv13 // ← SMOKING GUN
)
let parameters = NWParameters(tls: tlsOptions)
Root Cause Identified: Some users reach the backend through infrastructure (load balancers, proxy servers) that doesn't support TLS 1.3. v4.1 negotiated TLS 1.2 on those paths; v4.2 requires TLS 1.3, so the handshake fails.
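To see why raising the minimum version breaks only some users, it can help to model version negotiation in isolation. The sketch below is a deliberately simplified model, assuming negotiation picks the highest version both sides accept; `TLSVersion` and `negotiate` are hypothetical names, not Security or Network framework types, and real TLS negotiation also involves cipher suites and extensions.

```swift
/// Simplified stand-in for TLS protocol versions; not the Security framework type.
enum TLSVersion: Int, Comparable {
    case v1_2 = 12
    case v1_3 = 13

    static func < (lhs: TLSVersion, rhs: TLSVersion) -> Bool {
        lhs.rawValue < rhs.rawValue
    }
}

/// Picks the highest version the server offers that is at or above the client's minimum.
/// Returns nil when there is no overlap, which corresponds to a failed handshake.
func negotiate(clientMinimum: TLSVersion, serverSupports: Set<TLSVersion>) -> TLSVersion? {
    serverSupports.filter { $0 >= clientMinimum }.max()
}

// Assumed infrastructure profiles for this sketch.
let oldProxy: Set<TLSVersion> = [.v1_2]
let modernServer: Set<TLSVersion> = [.v1_2, .v1_3]

// v4.2 behavior: minimum 1.3 against a 1.2-only path has no common version.
let v42OnOldProxy = negotiate(clientMinimum: .v1_3, serverSupports: oldProxy)
// Minimum 1.2 works on the old path and still prefers 1.3 on modern servers.
let fixedOnOldProxy = negotiate(clientMinimum: .v1_2, serverSupports: oldProxy)
let fixedOnModern = negotiate(clientMinimum: .v1_2, serverSupports: modernServer)
```

In this model only the first case yields no version, which matches the observed pattern: users on modern paths are unaffected while the subset behind TLS 1.2-only infrastructure fails every time.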
Step 4: Apply Targeted Fix (15 minutes)
swift
// Fix: Support both TLS 1.2 and TLS 1.3
let tlsOptions = NWProtocolTLS.Options()
sec_protocol_options_set_min_tls_protocol_version(
    tlsOptions.securityProtocolOptions,
    .TLSv12 // ✅ Support older infrastructure
)
// TLS 1.3 will still be used where supported (automatic negotiation)
let parameters = NWParameters(tls: tlsOptions)
Step 5: Deploy Hotfix (20 minutes)
bash
# Build hotfix v4.2.1
# Test on affected user's network (critical!)
# Submit to App Store with expedited review request
# Explain: "Production outage affecting 15% of users"
Professional Communication Templates
To CEO (15 minutes after crisis starts)
Found root cause: v4.2 requires TLS 1.3, but 15% of users are on older infrastructure
(enterprise proxies, older load balancers) that only supports TLS 1.2.
Fix: Change minimum TLS version to 1.2 (backward compatible, 1.3 still used when available).
ETA: Hotfix v4.2.1 in App Store in 1 hour (expedited review).
Full rollout to users: 24 hours.
Mitigation now: Telling affected users to update immediately when available.
To Engineering Manager
Root cause: TLS version requirement changed in v4.2 (TLS 1.3 only).
15% of users are behind infrastructure that doesn't support TLS 1.3.
Technical fix: Lower the minimum TLS version to 1.2 with sec_protocol_options_set_min_tls_protocol_version(tlsOptions.securityProtocolOptions, .TLSv12).
This allows backward compatibility while still using TLS 1.3 where supported.
Testing: Verified fix on user's network (enterprise VPN with old proxy).
Deployment: Hotfix build in progress, ETA 30 minutes to submit.
Prevention: Add TLS compatibility testing to pre-release checklist.
To Customer Support
Update: We've identified the issue and have a fix deploying within 1 hour.
Affected users: Those on enterprise networks or older ISP infrastructure.
Workaround: None (network level issue).
Expected resolution: v4.2.1 will be available in App Store in 1 hour.
Ask users to update immediately.
Updates: I'll notify you every 30 minutes.
Time Saved
| Approach | Time to Resolution | User Impact |
|---|---|---|
| ❌ Panic rollback | 1-2 hours app review + 24 hours user updates = 25-26 hours | 10K users down for up to 26 hours |
| ❌ "Add more retries" | Unknown (doesn't fix root cause) | Permanent 15% failure rate |
| ❌ "Works for me" | Days of debugging wrong thing | Frustrated users, bad reviews |
| ✅ Systematic diagnosis | 30 min diagnosis + 20 min fix + 1 hour review ≈ 2 hours | 10K users down for about 2 hours |
Lessons Learned
- Test on diverse networks: Don't just test on your WiFi. Test on cellular, VPN, enterprise networks.
- Monitor TLS compatibility: If you change TLS config, verify backend supports it.
- Gradual rollout: Use phased rollout (10% → 50% → 100%) to catch issues early.
- Emergency logging: Have a way to enable detailed logging in production for diagnosis.
- Communication cadence: Update stakeholders every 30 minutes, even if just "still investigating."
Quick Reference Table
| Symptom | Likely Cause | First Check | Pattern | Fix Time |
|---|---|---|---|---|
| Stuck in .preparing | DNS failure | | 1a | 10-15 min |
| .waiting immediately | No connectivity | Airplane Mode? | 1b | 5 min |
| .failed POSIX 61 | Connection refused | Server listening? | 1c | 5-10 min |
| .failed POSIX 50 | Network down | Check interface | 1d | 5 min |
| TLS error -9806 | Certificate invalid | | 2b | 15-20 min |
| Data not received | Framing problem | Packet capture | 3a | 20-30 min |
| Partial data | Min/max bytes wrong | Check receive() params | 3b | 10 min |
| Latency increasing | TCP congestion | contentProcessed pacing | 4a | 15-25 min |
| High CPU | No batching | Use connection.batch | 4c | 10 min |
| Memory growing | Connection leaks | Check [weak self] | 4d | 10-15 min |
| Works WiFi, fails cellular | IPv6-only network | | 5a | 10-15 min |
| Works without VPN, fails with VPN | Proxy interference | Test PAC file | 5b | 20-30 min |
| Port blocked | Firewall | Try 443 vs 8080 | 5c | 10 min |
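The POSIX-code rows of the table can also be expressed as a lookup, which is handy when triaging logs in bulk. This is a minimal sketch, assuming Darwin error numbers (they differ on other platforms); `diagnosticPattern(forPOSIXCode:)` is a hypothetical helper, and it lists only the codes this guide ties to a specific pattern.

```swift
/// Maps a Darwin POSIX error code to the diagnostic pattern ID used in this guide.
/// Hypothetical helper; returns nil for codes the guide does not tie to a pattern.
func diagnosticPattern(forPOSIXCode code: Int32) -> String? {
    switch code {
    case 61: return "1c" // ECONNREFUSED: connection refused, is the server listening?
    case 50: return "1d" // ENETDOWN: network down, check the interface
    case 65: return "5a" // EHOSTUNREACH: host unreachable, seen on IPv6-only cellular
    default: return nil
    }
}

// Example: route a logged error code to the matching pattern, if any.
let loggedCode: Int32 = 61
let pattern = diagnosticPattern(forPOSIXCode: loggedCode) ?? "unmapped"
print("POSIX \(loggedCode) → pattern \(pattern)")
```

Codes that return nil still need the symptom-based rows of the table, since not every failure surfaces as a POSIX error.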
Common Mistakes
Mistake 1: Not Enabling Logging Before Debugging
Problem: Trying to debug networking issues without seeing the framework's internal state.
Why it fails: You're guessing what's happening. Logs show exact state transitions, error codes, timing.
Fix
swift
import Foundation

// Add to Xcode scheme BEFORE debugging:
// -NWLoggingEnabled 1
// -NWConnectionLoggingEnabled 1
// Or programmatically, before the first connection is created:
#if DEBUG
// ProcessInfo.processInfo.environment is read-only, so set the variable with setenv
setenv("NW_LOGGING_ENABLED", "1", 1)
#endif
Mistake 2: Testing Only on WiFi
Problem: WiFi and cellular have different characteristics (IPv6-only, proxy configs, packet loss).
Why it fails: 40% of connection failures are network-specific. If you only test WiFi, you miss cellular issues.
Fix
- Test on real device with WiFi OFF
- Test on multiple carriers (Verizon, AT&T, T-Mobile have different configs)
- Test with VPN active (enterprise users)
- Use Network Link Conditioner (Xcode → Devices)
Mistake 3: Ignoring POSIX Error Codes
Problem: Seeing .failed(let error) and just showing a generic "Connection failed" to the user.
Why it fails: Different error codes require different fixes. POSIX 61 = server issue, POSIX 50 = client network issue.
Fix
swift
import Network

// `state` is the NWConnection.State passed to stateUpdateHandler
if case .failed(let error) = state {
    // NWError carries the POSIX code in its .posix case
    if case .posix(let code) = error {
        switch code {
        case .ECONNREFUSED: // 61
            print("Server not listening, check server logs")
        case .ENETDOWN: // 50
            print("Network interface down, check WiFi/cellular")
        case .ETIMEDOUT: // 60
            print("Connection timeout, check firewall/DNS")
        default:
            print("Connection failed: \(error)")
        }
    } else {
        print("Connection failed: \(error)") // .dns or .tls error
    }
}
Mistake 4: Not Testing State Transitions
Problem: Testing only happy path (.preparing → .ready). Not testing .waiting, network changes, failures.
Why it fails: Real users experience network transitions (WiFi → cellular), Airplane Mode, weak signal.
Fix
swift
// Test with Network Link Conditioner:
// 1. 100% Loss: verify .waiting state shows "Waiting for network"
// 2. WiFi → None → WiFi: verify automatic reconnection
// 3. 3% packet loss: verify performance degrades gracefully
Mistake 5: Assuming Simulator = Device
Problem: Testing only in simulator. Simulator uses macOS networking (different from iOS), no cellular.
Why it fails: Simulator hides IPv6-only issues, doesn't simulate network transitions, has different DNS.
Fix
- ALWAYS test on real device before shipping
- Test with Airplane Mode toggle (simulate network transitions)
- Test with cellular only (disable WiFi)
Cross-References
For Preventive Patterns
networking skill — Discipline-enforcing anti-patterns:
- Red Flags: SCNetworkReachability, blocking sockets, hardcoded IPs
- Pattern 1a: NetworkConnection with TLS (correct implementation)
- Pattern 2a: NWConnection with proper state handling
- Pressure Scenarios: How to handle deadline pressure without cutting corners
For API Reference
network-framework-ref skill — Complete API documentation:
- NetworkConnection (iOS 26+): All 12 WWDC 2025 examples
- NWConnection (iOS 12-25): Complete API with examples
- TLV framing, Coder protocol, NetworkListener, NetworkBrowser
- Migration strategies from sockets, URLSession, NWConnection
For Related Issues
swift-concurrency skill — If using async/await:
- Pattern 3: Weak self in Task closures (similar memory leak prevention)
- @MainActor usage for connection state updates
- Task cancellation when connection fails
Last Updated 2025-12-02
Status Production-ready diagnostics from WWDC 2018/2025
Tested Diagnostic patterns validated against real production issues