bmad-observability-readiness

BMAD Observability Readiness Skill

When to Invoke

Use this skill when the user:
  • Mentions missing or low-quality logging, metrics, or tracing.
  • Requests monitoring/alerting setup before a launch or major release.
  • Needs SLOs, dashboards, or on-call runbooks.
  • Reports alert fatigue or noise that needs rationalization.
  • Wants to ensure performance and reliability work has data coverage.
If instrumentation already exists and only specific bug fixes are required, hand over to bmad-development-execution with the backlog produced here.

Mission

Deliver a comprehensive observability plan that enables diagnosis, alerting, and measurement across the system. Ensure downstream performance, reliability, and security work has trustworthy telemetry.

Inputs Required

  • Architecture diagrams and component inventory.
  • Existing logging/monitoring/tracing configuration (if any).
  • Current incidents, outages, or blind spots experienced by the team.
  • SLAs/SLOs, business KPIs, or compliance reporting requirements.

Outputs

  • Observability plan detailing metrics, logs, traces, dashboards, and retention policies.
  • Instrumentation backlog with implementation tasks, owners, and acceptance criteria.
  • SLO dashboard specification covering golden signals, alert thresholds, and runbook links.
  • Updated runbook or escalation paths if gaps were discovered.
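One row of the SLO dashboard specification could be captured as a small data structure. The following is a minimal sketch, not a prescribed format: the class name `SloPanel`, its fields, the burn-rate threshold, and the example runbook URL are all hypothetical, and burn rate is used here only as one common way to express an alert threshold against an SLO.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloPanel:
    """One row of a hypothetical SLO dashboard specification."""

    journey: str            # user journey the panel covers
    signal: str             # golden signal: latency, errors, saturation, or traffic
    target: float           # SLO target as a fraction, e.g. 0.99
    burn_rate_alert: float  # alert when the error budget burns this many times too fast
    runbook_url: str        # link responders follow when the alert fires

    def burn_rate(self, good: int, total: int) -> float:
        """Observed error rate divided by the error budget (1.0 means on budget)."""
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")
        if total == 0:
            return 0.0  # no traffic in the window, so no budget was spent
        error_rate = 1.0 - good / total
        return error_rate / (1.0 - self.target)

    def should_alert(self, good: int, total: int) -> bool:
        """True when the window burns budget at or above the alert threshold."""
        return self.burn_rate(good, total) >= self.burn_rate_alert


# Hypothetical example entry; the URL is a placeholder, not a real runbook.
checkout_errors = SloPanel(
    journey="checkout",
    signal="errors",
    target=0.99,
    burn_rate_alert=2.0,
    runbook_url="https://runbooks.example.com/checkout-errors",
)
```

Keeping the runbook link as a required field means a panel cannot be specified without one, which matches the intent that every alert threshold points responders somewhere.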

Process

  1. Audit current telemetry coverage, tooling, and data retention. Document gaps.
  2. Define observability objectives aligned with user journeys and business KPIs.
  3. Design instrumentation strategy: metrics taxonomy, structured logging, trace spans, event schemas.
  4. Establish SLOs, SLIs, and alerting strategy with on-call expectations and noise controls.
  5. Produce dashboard and reporting requirements, plus data governance notes.
  6. Create a backlog of prioritized instrumentation tasks and a verification approach.
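To make the structured logging part of step 3 concrete, here is a minimal sketch using Python's standard `logging` module. The JSON key names, the `PII_FIELDS` set, and the `trace_id` and `context` attributes passed through `extra` are assumptions chosen for illustration; a real logging standard would define its own schema and redaction rules.

```python
import json
import logging

# Hypothetical field names treated as PII; a real list would come from the logging standard.
PII_FIELDS = {"email", "phone", "ip_address"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a fixed set of top-level keys."""

    def format(self, record: logging.LogRecord) -> str:
        # "context" and "trace_id" are optional attributes supplied via `extra=`.
        context = getattr(record, "context", {})
        safe = {k: ("[REDACTED]" if k in PII_FIELDS else v) for k, v in context.items()}
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
            "context": safe,
        }
        return json.dumps(payload, default=str)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("checkout")
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.info(
        "payment authorized",
        extra={"trace_id": "abc123", "context": {"order_id": 42, "email": "a@example.com"}},
    )
```

Carrying a trace identifier on every log line is what lets logs be joined to trace spans later; redacting at the formatter keeps PII handling in one place instead of at each call site.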

Quality Gates

  • Every critical user journey has metrics and alerts defined (latency, errors, saturation, traffic).
  • Logging standards specify structure, PII handling, and retention.
  • Alert runbooks are documented or flagged for creation.
  • Observability plan references integration with performance, security, and incident workflows.
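The first gate lends itself to an automated check. The sketch below assumes a hypothetical inventory format in which each journey maps to the sets of golden signals that already have metrics and alerts; the function name `coverage_gaps` and the journey names are illustrative only.

```python
GOLDEN_SIGNALS = ("latency", "errors", "saturation", "traffic")


def coverage_gaps(journeys: dict[str, dict[str, set[str]]]) -> dict[str, dict[str, list[str]]]:
    """Return, per journey, the golden signals that lack a metric or an alert."""
    gaps: dict[str, dict[str, list[str]]] = {}
    for name, defined in journeys.items():
        # For each kind, list the signals missing from the journey's inventory.
        missing = {
            kind: [s for s in GOLDEN_SIGNALS if s not in defined.get(kind, set())]
            for kind in ("metrics", "alerts")
        }
        # Keep only the kinds that actually have gaps.
        missing = {kind: signals for kind, signals in missing.items() if signals}
        if missing:
            gaps[name] = missing
    return gaps


# Hypothetical inventory: checkout has all metrics but only two alerts.
inventory = {
    "checkout": {"metrics": set(GOLDEN_SIGNALS), "alerts": {"latency", "errors"}},
    "login": {"metrics": set(GOLDEN_SIGNALS), "alerts": set(GOLDEN_SIGNALS)},
}
print(coverage_gaps(inventory))
```

An empty result means the gate passes; any journey that appears in the output can be turned directly into instrumentation backlog items.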

Error Handling

  • If telemetry tooling is undecided, present comparative options with trade-offs.
  • Highlight dependencies on platform teams or infrastructure before finalizing the timeline.
  • Escalate when observability requirements conflict with compliance or privacy constraints.