debug-master
🕵️♂️ Skill: Debug Master (v1.1.0)
Executive Summary
The debug-master is a high-level specialist dedicated to the health, reliability, and observability of complex, distributed systems. In 2026, debugging is no longer a manual scavenger hunt through log files; it is an Orchestrated Investigation using AI-assisted tracing, predictive anomaly detection, and automated remediation loops. This skill focuses on minimizing MTTR (Mean Time To Repair) and maximizing system resilience through elite SRE standards.
🛠️ Incident Resolution Protocol
Every incident follows the Elite SRE Loop:
- Evidence Collection: Correlate metrics, logs, and traces. Read the "Observability Graph" to find the service in red.
- Impact Analysis: Determine the blast radius. Is it a single user, a region, or the entire tenant base?
- Isolation: Use binary search (git bisect) and trace filtering to isolate the logic or infra failure.
- Surgical Fix / Rollback: Apply a precise fix or execute a total rollback if the 5-minute MTTR window is exceeded.
- Post-Mortem: Generate an automated report summarizing the "Why" and store it in long-term vector memory.
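
The Isolation and Rollback steps above can be sketched in a few lines of Python. This is a minimal illustration of the binary search that git bisect performs, not the tool itself: the commit names, the `is_bad` check, and the `should_rollback` helper are all hypothetical, and the 300-second default simply mirrors the 5-minute window mentioned in the loop.

```python
from typing import Callable, Sequence

MTTR_WINDOW_SECONDS = 300  # the 5-minute window from the loop above


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest bad commit using binary search.

    Assumptions: commits are ordered oldest to newest, the first one is
    known good, the last one is known bad, and once a commit is bad every
    later commit is bad too (the same premise git bisect relies on).
    """
    if len(commits) < 2:
        raise ValueError("need at least one known-good and one known-bad commit")
    lo, hi = 0, len(commits) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # failure was introduced at mid or earlier
        else:
            lo = mid  # failure was introduced after mid
    return commits[hi]


def should_rollback(elapsed_seconds: float, window: float = MTTR_WINDOW_SECONDS) -> bool:
    """Hypothetical helper: prefer a total rollback once the window is exceeded."""
    return elapsed_seconds > window


# Hypothetical history in which everything from "d4" onward is broken.
history = ["a1", "b2", "c3", "d4", "e5", "f6"]
broken_from = history.index("d4")
culprit = first_bad_commit(history, lambda c: history.index(c) >= broken_from)
print(culprit)
```

In a real investigation the `is_bad` callback would run a reproduction test against each checked-out commit, which is the role a test script plays when passed to `git bisect run`.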
🚫 The "Do Not" List (Anti-Patterns)
| Anti-Pattern | Why it fails in 2026 | Modern Alternative |
|---|---|---|
| "Guess and Check" | Extremely slow and dangerous. | Use Distributed Tracing. |
| Ignoring Warnings | Leads to "Alert Fatigue" and outages. | Use Dynamic SLO Tracking. |
| Manual Log Scraping | Inefficient for large datasets. | Use AI-Assisted Querying (o3). |
| Hotfixing Production | Bypasses CI/CD and causes drift. | Fix in Feature Branch + Deploy. |
| Disabling RLS/Security | Huge security risk for a "quick fix." | Fix the Capability Scope. |
🕸️ Distributed Tracing (OpenTelemetry)
We use OTel as our source of truth.
- Standard Spans: Every operation must have a traceable span ID.
- Adaptive Sampling: 100% errors, 1% healthy traffic.
- Context Propagation: Mandatory headers for cross-service calls.
See References: Distributed Tracing for setup.
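
As a rough illustration of the sampling and propagation rules above, the sketch below uses only the Python standard library rather than the OpenTelemetry SDK. The function names are hypothetical. The header layout follows the W3C Trace Context `traceparent` format (version, 32-hex trace ID, 16-hex span ID, 2-hex flags). Note that keeping 100% of errors implies the decision is made after the outcome is known, so `should_keep` is written as a tail-style decision.

```python
HEALTHY_SAMPLE_RATE = 0.01  # 1% of healthy traffic, per the rule above


def should_keep(trace_id: str, has_error: bool, rate: float = HEALTHY_SAMPLE_RATE) -> bool:
    """Keep every errored trace and a deterministic fraction of healthy ones.

    The low 64 bits of the trace ID are mapped to [0, 1) so that every
    service evaluating the same trace ID reaches the same decision.
    """
    if has_error:
        return True
    bucket = int(trace_id[-16:], 16) / 2**64
    return bucket < rate


def make_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    """Build a W3C traceparent header value for an outgoing call."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str) -> dict:
    """Parse an incoming traceparent header, rejecting malformed values."""
    parts = header.split("-")
    if len(parts) != 4:
        raise ValueError("traceparent must have four dash-separated fields")
    version, trace_id, span_id, flags = parts
    if len(trace_id) != 32 or len(span_id) != 16 or len(flags) != 2:
        raise ValueError("unexpected field length in traceparent")
    int(trace_id, 16)  # raises ValueError if not hexadecimal
    int(span_id, 16)
    return {
        "version": version,
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": int(flags, 16) & 0x01 == 0x01,
    }


# Hypothetical IDs for illustration only.
header = make_traceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", True)
context = parse_traceparent(header)
```

A downstream service would call `parse_traceparent` on the incoming request, reuse the `trace_id` for its own spans, and forward a new header carrying its own span ID.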
🤖 Autonomous Remediation
In 2026, AI agents handle the triage.
- Detection: Automatic anomaly triggers.
- Remediation: Agents execute safe actions (scale up, cache clear).
- HITL Gate: Humans approve destructive actions.
See References: Agentic Response for patterns.
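
One way to picture the HITL gate is an allowlist check in front of every agent action. The sketch below is an assumption about how such a gate might look, not the referenced Agentic Response pattern: the action names, the `Decision` type, and the `approve` callback are hypothetical, and anything not on the allowlist is treated as destructive.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical allowlist based on the safe actions named above.
SAFE_ACTIONS = frozenset({"scale_up", "cache_clear"})


@dataclass(frozen=True)
class Decision:
    action: str
    executed: bool
    reason: str


def remediate(action: str, approve: Callable[[str], bool]) -> Decision:
    """Auto-run allowlisted actions; route everything else to a human.

    `approve` stands in for whatever approval channel is used (chat prompt,
    ticket, pager acknowledgement) and returns True only on explicit consent.
    """
    if action in SAFE_ACTIONS:
        return Decision(action, True, "safe action, executed automatically")
    if approve(action):
        return Decision(action, True, "destructive action, approved by a human")
    return Decision(action, False, "destructive action, not approved")


# Example: the agent may scale up on its own, but a database drop is blocked
# because the (hypothetical) approver declines it.
auto = remediate("scale_up", approve=lambda a: False)
blocked = remediate("drop_database", approve=lambda a: False)
```

Defaulting unknown actions to the human path keeps the gate fail-closed: a newly added agent capability cannot run unattended until someone deliberately adds it to the allowlist.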
📈 Predictive Observability
Identify failures before they occur.
- Anomaly Detection: Spotting memory leaks or CPU creep.
- Chaos Engineering: Running agentic "stress tests" weekly.
- Dynamic SLOs: Thresholds that adjust based on business importance.
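
To make two of these ideas concrete, here is a small standard-library sketch. It assumes a memory leak shows up as a sustained upward trend in evenly spaced samples, and that "business importance" maps to an allowed failure rate. The tier names, the parts-per-million budgets, and the growth threshold are all hypothetical values chosen for illustration.

```python
from typing import Sequence


def trend_slope(samples: Sequence[float]) -> float:
    """Least-squares slope of evenly spaced samples (units per sample)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    numerator = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def looks_like_leak(memory_mb: Sequence[float], min_growth_mb: float = 1.0) -> bool:
    """Flag a series whose average growth per sample meets the threshold."""
    return trend_slope(memory_mb) >= min_growth_mb


# Hypothetical tiers: allowed failures per million requests.
# 1_000 ppm corresponds to a 99.9% target, 10_000 ppm to 99%.
ERROR_BUDGET_PPM = {"critical": 1_000, "standard": 10_000}


def budget_exhausted(failed: int, total: int, tier: str) -> bool:
    """True when the observed failure rate exceeds the tier's budget.

    Integer arithmetic avoids floating-point edge cases at the boundary.
    """
    return failed * 1_000_000 > ERROR_BUDGET_PPM[tier] * total


growing = [100.0, 102.0, 104.0, 106.0]  # steady 2 MB per sample
noisy = [100.0, 101.0, 100.0, 101.0]    # oscillating, little net growth
```

The same 150 failures out of 100,000 requests would exhaust the budget of a `critical` service but not of a `standard` one, which is the sense in which the threshold adjusts with business importance.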
📖 Reference Library
Detailed deep-dives into SRE excellence:
- Distributed Tracing (OTel): Standardizing your observability.
- Agentic Incident Response: The autonomous remediation loop.
- Predictive Observability: Hardening systems for the future.
- Fullstack Troubleshooting: Layers of defense.
Updated: January 22, 2026 - 18:30