bmad-observability-readiness

BMAD Observability Readiness Skill

When to Invoke

Use this skill when the user:

Mentions missing or low-quality logging, metrics, or tracing.
Requests monitoring/alerting setup before a launch or major release.
Needs SLOs, dashboards, or on-call runbooks.
Reports alert fatigue or noise that needs rationalization.
Wants to ensure performance and reliability work has data coverage.

If instrumentation already exists and only specific bug fixes are required, hand over to bmad-development-execution with the backlog produced here.

Mission

Deliver a comprehensive observability plan that enables diagnosis, alerting, and measurement across the system. Ensure downstream performance, reliability, and security work has trustworthy telemetry.

Inputs Required

Architecture diagrams and component inventory.
Existing logging/monitoring/tracing configuration (if any).
Current incidents, outages, or blind spots experienced by the team.
SLAs/SLOs, business KPIs, or compliance reporting requirements.

Outputs

Observability plan detailing metrics, logs, traces, dashboards, and retention policies.
Instrumentation backlog with implementation tasks, owners, and acceptance criteria.
SLO dashboard specification covering golden signals, alert thresholds, and runbook links.
Updated runbook or escalation paths if gaps were discovered.
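
One way an SLO dashboard specification entry might tie an objective to an alert threshold is through error-budget burn rate. The sketch below is illustrative only: the `SloSpec` class, its field names, the 99.9% target, the burn-rate threshold of 10, and the runbook URL are all assumptions made for the example, not values prescribed by this skill.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloSpec:
    """Hypothetical SLO entry for a dashboard specification."""

    name: str
    target: float           # e.g. 0.999 means 99.9% of requests succeed
    burn_rate_alert: float  # alert when the budget burns this many times too fast
    runbook_url: str        # placeholder link, not a real runbook

    @property
    def error_budget(self) -> float:
        """Fraction of requests allowed to fail over the SLO window."""
        return 1.0 - self.target

    def burn_rate(self, failed: int, total: int) -> float:
        """Observed error rate divided by the error budget."""
        if total == 0:
            return 0.0
        return (failed / total) / self.error_budget

    def should_alert(self, failed: int, total: int) -> bool:
        """True when the observed burn rate reaches the alert threshold."""
        return self.burn_rate(failed, total) >= self.burn_rate_alert


# Assumed example values for a checkout journey.
checkout = SloSpec(
    name="checkout-availability",
    target=0.999,
    burn_rate_alert=10.0,
    runbook_url="https://example.com/runbooks/checkout",
)
```

Alerting on burn rate instead of raw error counts is one common noise control: a brief blip that barely touches the budget stays quiet, while a sustained failure rate pages quickly.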

Process

Audit current telemetry coverage, tooling, and data retention. Document gaps.
Define observability objectives aligned with user journeys and business KPIs.
Design instrumentation strategy: metrics taxonomy, structured logging, trace spans, event schemas.
Establish SLOs, SLIs, and alerting strategy with on-call expectations and noise controls.
Produce dashboards/reporting requirements and data governance notes.
Create backlog with prioritized instrumentation tasks and verification approach.
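
The structured-logging part of the instrumentation strategy could look roughly like the following sketch, which uses only Python's standard `logging` and `json` modules. The context field names (`trace_id`, `span_id`, `journey`) and the `build_logger` helper are hypothetical; a real schema would come from the observability plan itself.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    # Hypothetical context fields that tie a log line to a trace and journey.
    CONTEXT_FIELDS = ("trace_id", "span_id", "journey")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Copy only the context fields that were actually supplied via `extra`.
        for field in self.CONTEXT_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                payload[field] = value
        return json.dumps(payload)


def build_logger(name: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = build_logger("checkout")
log.info("payment authorized", extra={"trace_id": "abc123", "journey": "checkout"})
```

Keeping the trace identifier in every log line is what lets a responder move from an alert to the matching trace span without guessing at timestamps.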

Quality Gates

Every critical user journey has metrics and alerts defined (latency, errors, saturation, traffic).
Logging standards specify structure, PII handling, and retention.
Alert runbooks are documented or flagged for creation.
Observability plan references integration with performance, security, and incident workflows.
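
The first gate above can be checked mechanically once journeys and their covered signals are listed somewhere. The following is a minimal sketch under that assumption; the `missing_signals` function and the example inventory are hypothetical and only illustrate the idea.

```python
# The four signals named in the first quality gate.
GOLDEN_SIGNALS = frozenset({"latency", "errors", "saturation", "traffic"})


def missing_signals(journeys: dict[str, set[str]]) -> dict[str, list[str]]:
    """Return, per journey, the golden signals that have no coverage listed."""
    gaps = {}
    for journey, covered in journeys.items():
        missing = sorted(GOLDEN_SIGNALS - covered)
        if missing:
            gaps[journey] = missing
    return gaps


# Hypothetical inventory: journey name -> signals with metrics and alerts defined.
inventory = {
    "checkout": {"latency", "errors", "saturation", "traffic"},
    "signup": {"latency", "errors"},
}

print(missing_signals(inventory))  # {'signup': ['saturation', 'traffic']}
```

Any journey that appears in the result would feed directly into the instrumentation backlog as a gap to close.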

Error Handling

If telemetry tooling is undecided, present comparative options with trade-offs.
Highlight dependencies on platform teams or infrastructure before finalizing timeline.
Escalate when observability requirements conflict with compliance or privacy constraints.
