debug-master

🕵️‍♂️ Skill: Debug Master (v1.1.0)

Executive Summary

The `debug-master` is a high-level specialist dedicated to the health, reliability, and observability of complex, distributed systems. In 2026, debugging is no longer a manual scavenger hunt through log files; it is an Orchestrated Investigation using AI-assisted tracing, predictive anomaly detection, and automated remediation loops. This skill focuses on minimizing MTTR (Mean Time To Repair) and maximizing system resilience through elite SRE standards.

📋 Table of Contents

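Incident Resolution Protocol
The "Do Not" List (Anti-Patterns)
Distributed Tracing (OpenTelemetry)
Autonomous Remediation (Agentic Loop)
Predictive Observability
Fullstack Troubleshooting Layers
Reference Library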

🛠️ Incident Resolution Protocol

Every incident follows the Elite SRE Loop:

Evidence Collection: Correlate metrics, logs, and traces. Read the "Observability Graph" to find the service in red.
Impact Analysis: Determine the blast radius. Is it a single user, a region, or the entire tenant base?
Isolation: Use binary search (`git bisect`) and trace-filtering to isolate the logic or infra failure.
Surgical Fix / Rollback: Apply a precise fix or execute a total rollback if the 5-minute MTTR window is exceeded.
Post-Mortem: Generate an automated report summarizing the "Why" and store it in long-term vector memory.
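
The Isolation and Rollback steps can be sketched in a few lines. This is a minimal illustration, not the skill's actual tooling: `first_bad_index` mimics the binary search that `git bisect` performs, and `should_roll_back` encodes the 5-minute window. The function names, and the assumption that the window is measured from detection time, are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Callable, Sequence

# Assumption: the 5-minute MTTR window is measured from the detection time.
MTTR_WINDOW = timedelta(minutes=5)


def first_bad_index(revisions: Sequence[str], is_bad: Callable[[str], bool]) -> int:
    """Binary search for the first failing revision, as `git bisect` does.

    Assumes revisions are ordered oldest to newest, the first one is known
    good, the last one is known bad, and every revision after the first bad
    one is also bad.
    """
    lo, hi = 0, len(revisions) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(revisions[mid]):
            hi = mid  # failure was introduced at or before mid
        else:
            lo = mid  # failure was introduced after mid
    return hi


def should_roll_back(detected_at: datetime, now: datetime) -> bool:
    """Return True once the incident has outlived the MTTR window."""
    return now - detected_at > MTTR_WINDOW
```

The search needs roughly log2(n) checks for n revisions, which is why bisecting usually fits inside a short repair window where a linear scan would not.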

🚫 The "Do Not" List (Anti-Patterns)

| Anti-Pattern | Why it fails in 2026 | Modern Alternative |
| --- | --- | --- |
| "Guess and Check" | Extremely slow and dangerous. | Use Distributed Tracing. |
| Ignoring Warnings | Leads to "Alert Fatigue" and outages. | Use Dynamic SLO Tracking. |
| Manual Log Scraping | Inefficient for large datasets. | Use AI-Assisted Querying (o3). |
| Hotfixing Production | Bypasses CI/CD and causes drift. | Fix in Feature Branch + Deploy. |
| Disabling RLS/Security | Huge security risk for a "quick fix." | Fix the Capability Scope. |

🕸️ Distributed Tracing (OpenTelemetry)

We use OTel as our source of truth.

Standard Spans: Every operation must have a traceable span ID.
Adaptive Sampling: 100% errors, 1% healthy traffic.
Context Propagation: Mandatory headers for cross-service calls.

See References: Distributed Tracing for setup.
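
To make the sampling and propagation rules concrete, here is a small sketch in plain Python rather than the OpenTelemetry SDK. `keep_trace` applies the 100% / 1% policy and assumes the decision is made once the outcome of the request is known. `make_traceparent` and `parse_traceparent` follow the W3C Trace Context `traceparent` header layout. All function names are illustrative; the skill itself does not name a specific header.

```python
import random
import re
import secrets
from typing import Optional, Tuple

HEALTHY_SAMPLE_RATE = 0.01  # 1% of healthy traffic, per the policy above

# W3C Trace Context `traceparent`: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def keep_trace(is_error: bool, rng: Optional[random.Random] = None) -> bool:
    """Keep every errored trace and roughly 1% of healthy ones."""
    if is_error:
        return True
    draw = rng.random() if rng is not None else random.random()
    return draw < HEALTHY_SAMPLE_RATE


def make_traceparent(sampled: bool) -> str:
    """Build a version-00 traceparent value with fresh random IDs."""
    trace_id = secrets.token_hex(16)  # 16 bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 bytes -> 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str) -> Optional[Tuple[str, str, bool]]:
    """Return (trace_id, span_id, sampled), or None if the header is invalid."""
    match = _TRACEPARENT.match(header)
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid in the W3C format.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return trace_id, span_id, bool(int(flags, 16) & 0x01)
```

A service receiving a cross-service call would parse the incoming header, reuse the trace ID, and send a new span ID downstream; a call that arrives without a valid header starts a new trace.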

🤖 Autonomous Remediation

In 2026, AI agents handle the triage.

Detection: Automatic anomaly triggers.
Remediation: Agents execute safe actions (scale up, cache clear).
HITL Gate: Humans approve destructive actions.

See References: Agentic Response for patterns.
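
One way to picture the HITL gate is an allowlist: actions on the list run immediately, and everything else waits for a human. The sketch below is hypothetical. The `Action` type, the action names, and the `approve` callback are assumptions for illustration; only "scale up" and "cache clear" come from the text above.

```python
from dataclasses import dataclass
from typing import Callable

# Assumption: identifiers for the two safe actions named in the text.
SAFE_ACTIONS = {"scale_up", "clear_cache"}


@dataclass
class Action:
    name: str
    target: str


def run_with_gate(
    action: Action,
    execute: Callable[[Action], None],
    approve: Callable[[Action], bool],
) -> str:
    """Run safe actions directly; route everything else through a human."""
    if action.name in SAFE_ACTIONS:
        execute(action)
        return "executed"
    # Anything not on the allowlist is treated as destructive.
    if approve(action):
        execute(action)
        return "executed_after_approval"
    return "rejected"
```

Using an allowlist rather than a blocklist means an action the agent invents, or one added later without review, defaults to requiring approval.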

📈 Predictive Observability

Identify failures before they occur.

Anomaly Detection: Spotting memory leaks or CPU creep.
Chaos Engineering: Running agentic "stress tests" weekly.
Dynamic SLOs: Thresholds that adjust based on business importance.
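
As a rough illustration of spotting memory leaks or CPU creep, the sketch below fits a straight line to evenly spaced samples and flags a sustained upward slope. It is a deliberately simple stand-in, not the detector the skill uses; the function names and the `min_slope` threshold are assumptions.

```python
from typing import Sequence


def slope_per_sample(values: Sequence[float]) -> float:
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def looks_like_creep(values: Sequence[float], min_slope: float) -> bool:
    """Flag a series whose fitted growth per sample exceeds min_slope."""
    return slope_per_sample(values) > min_slope
```

For example, memory readings of 100, 102, 104, 106 MB give a slope of 2 MB per sample, while a series that oscillates between 100 and 101 gives a slope near zero. A real detector would also need to handle seasonality and restarts, which this sketch ignores.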

📖 Reference Library

Detailed deep-dives into SRE excellence:

Distributed Tracing (OTel): Standardizing your observability.
Agentic Incident Response: The autonomous remediation loop.
Predictive Observability: Hardening systems for the future.
Fullstack Troubleshooting: Layers of defense.

Updated: January 22, 2026 - 18:30
