datadog-review-dashboard
Review Datadog Dashboard
Audit an existing Datadog dashboard against operational readiness principles. The core principles are: graphs should earn their place with alert thresholds, thresholds should sit close to normal traffic, a customer-facing section should exist, and the dashboard should be readable by someone with zero service knowledge.
These are guiding principles — not a rigid checklist. Apply judgment based on the product and business context. A context-providing metric (like deployment events) may earn its place without a threshold. A service with unusual traffic patterns may need different proximity rules.
Interview Phase
Skip interview if ALL of these are already specified:
- Dashboard ID or URL
- Service name or team context
Always interview if: No dashboard ID is provided or multiple dashboards may be relevant.
Questions
- Dashboard: "Which dashboard should I review? Provide a dashboard ID, URL, or service name to search for."
  - Impact: Determines which dashboard definition to fetch
- Business Context: "Can you tell me what this service does for customers? Are there codebases or docs I can read to understand the product?"
  - Impact: Understanding the domain lets the review focus on whether the right metrics are being tracked, not just whether generic rules are followed
- Focus: "Is there anything specific you want me to focus on? (A) Full review, (B) Alert thresholds only, (C) Customer-facing section, (D) Layout and readability"
  - Impact: Determines review scope; default to full review if unspecified
Workflow
1. Fetch Dashboard Definition
```bash
# If given a service name, search for matching dashboards
pup dashboards list --filter="<service-name>"

# If given a URL, extract the dashboard ID from the path (e.g., /dashboard/abc-def-ghi/...)

# Get the full dashboard definition
pup dashboards get <dashboard-id>

# Get the dashboard URL for reference
pup dashboards url <dashboard-id>
```

Parse the response to build an inventory of all widgets, groups, and their configurations.
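The URL case can be handled with a few lines of parsing. The following is a minimal Python sketch, assuming the ID is the path segment immediately after `/dashboard/`, as in the example path above. The function name `extract_dashboard_id` and the sample URL are hypothetical, chosen only for illustration.

```python
from typing import Optional
from urllib.parse import urlparse


def extract_dashboard_id(url: str) -> Optional[str]:
    """Return the path segment that follows 'dashboard', or None if absent."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    for i, segment in enumerate(segments[:-1]):
        if segment == "dashboard":
            return segments[i + 1]
    return None


if __name__ == "__main__":
    # Hypothetical URL, used only to show the expected shape of the input.
    print(extract_dashboard_id(
        "https://app.datadoghq.com/dashboard/abc-def-ghi/my-service?from_ts=1"
    ))
```

The returned ID can then be passed to `pup dashboards get`.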
2. Build Widget Inventory
Catalog every widget in the dashboard:
| Widget Title | Prefix | Type | Group | Has Alert Threshold | Threshold Value | Notes |
|---|---|---|---|---|---|---|
| ... | I0/P1/D0/— | ... | ... | ... | ... | ... |
Check that every widget title uses the layer-priority prefix system:
- `I0-N:` for infrastructure (load balancers, databases, networks)
- `P0-N:` for platform (service-specific components from the codebase)
- `D0-N:` for domain (business metrics)
- The number indicates priority within the layer (`0` = most critical)
Focus on timeseries and query value widgets — these are the primary candidates for alert threshold markers.
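As an illustration of the inventory step, here is a rough Python sketch that flattens a dashboard definition into one row per widget. It assumes the definition has already been loaded as a dict in the shape used by the Datadog dashboards API (top-level `widgets`, each with a `definition`, group widgets nesting their own `widgets`, timeseries thresholds stored under `markers`). Whether `pup dashboards get` emits exactly this shape is an assumption, and `build_inventory` is a hypothetical helper name.

```python
import re

# Matches layer-priority prefixes such as "I0:", "P1:", "D2:" at the start of a title.
PREFIX_RE = re.compile(r"^([IPD])(\d+):\s*")


def build_inventory(dashboard: dict) -> list:
    """Flatten top-level and grouped widgets into one row per non-group widget."""
    rows = []

    def visit(widgets, group):
        for widget in widgets:
            definition = widget.get("definition", {})
            if definition.get("type") == "group":
                # Recurse into the group, remembering its title for the Group column.
                visit(definition.get("widgets", []), definition.get("title", ""))
                continue
            title = definition.get("title", "")
            match = PREFIX_RE.match(title)
            markers = definition.get("markers", [])
            rows.append({
                "title": title,
                "prefix": match.group(1) + match.group(2) if match else None,
                "type": definition.get("type"),
                "group": group,
                "has_threshold": bool(markers),
                "threshold_values": [m.get("value") for m in markers],
            })

    visit(dashboard.get("widgets", []), None)
    return rows
```

Rows with `prefix` set to `None` are the ones to flag as missing the layer-priority prefix.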
3. Audit Alert Thresholds
Principle: Timeseries graphs should generally have an alert threshold (red line). If a metric doesn't warrant an alert, question whether it belongs — but use judgment. Some metrics provide valuable context (deployment markers, dependency traffic patterns) without needing a threshold.
For each timeseries widget, check:
- Does it have a marker/threshold line configured?
- Is the marker colored red for visibility?
- Does the threshold correspond to an actual monitor/alert?
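The first two checks can be approximated mechanically. The sketch below is illustrative only: it assumes each timeseries definition stores thresholds in a `markers` list and that a red line is one whose `display_type` starts with `error` (for example `error dashed`), which should be confirmed against the actual dashboard JSON. The helper name `audit_threshold` and the `NOT RED` status are hypothetical. The third check, whether a real monitor backs the line, cannot be answered from the dashboard definition alone.

```python
def audit_threshold(definition: dict) -> tuple:
    """Return (status, finding) for one timeseries widget definition."""
    markers = definition.get("markers", [])
    if not markers:
        return "MISSING", "No threshold marker: add an alert line or question the widget"
    # Assumption: "error ..." display types render as red lines.
    red = [m for m in markers if m.get("display_type", "").startswith("error")]
    if not red:
        return "NOT RED", "Marker exists but is not styled as an error (red) line"
    return "OK", "Red line at " + str(red[0].get("value"))
```

Widgets that come back `MISSING` still need the judgment call described in the principle: context metrics may legitimately have no threshold.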
Findings format:
```markdown
Alert Threshold Audit

| Widget | Group | Status | Finding |
|---|---|---|---|
| Requests/s | Rate | MISSING | No threshold marker: add alert line or remove widget |
| Error rate | Errors | OK | Red line at 5% |
| CPU usage | Infra | MISSING | No threshold: is this metric alertable? |
```
4. Audit Threshold Proximity
Principle: Alert thresholds must be close to normal traffic. Large gaps between normal values and the alert line create blind spots where anomalies go unnoticed.
For each widget with a threshold:
- What is the typical (normal) value range?
- Where is the threshold set?
- Is there excessive whitespace between the normal line and the alert line?
- Is the Y-axis auto-scaled or explicitly set? Auto-scaled Y-axes compress normal traffic into a flat band when the threshold is far above normal — the Y-axis max should be set to slightly above the alert threshold
Bad example: Normal CPU is 20%, alert threshold at 95% — the graph is mostly empty space and a slow climb from 20% to 80% looks flat.
Good example: Normal CPU is 20%, alert threshold at 45% — anomalies visually stand out immediately.
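The two CPU examples can be turned into a quick numeric check. This is a sketch under stated assumptions, not a rule from the source: the 3x cutoff (`max_ratio`) and the 10% headroom used for the suggested Y-axis maximum are hypothetical defaults, and the proximity rule should be adjusted per service as the guiding principles allow.

```python
def proximity_check(normal: float, threshold: float,
                    max_ratio: float = 3.0, headroom: float = 0.10) -> dict:
    """Compare an alert threshold with the typical value of a metric."""
    if normal <= 0:
        raise ValueError("normal must be positive to compute a ratio")
    ratio = threshold / normal
    return {
        "ratio": round(ratio, 2),
        # Assumption: more than max_ratio times normal counts as too far.
        "status": "TOO FAR" if ratio > max_ratio else "OK",
        # Y-axis max slightly above the alert threshold.
        "suggested_y_max": round(threshold * (1 + headroom), 2),
    }


if __name__ == "__main__":
    print(proximity_check(20, 95))  # the bad CPU example
    print(proximity_check(20, 45))  # the good CPU example
```

With these defaults the 95% threshold is flagged and the 45% threshold passes, matching the two examples.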
Findings format:
```markdown
Threshold Proximity Audit

| Widget | Normal Range | Threshold | Gap | Y-Axis | Status |
|---|---|---|---|---|---|
| CPU usage | ~20% | 95% | 75% | auto | TOO FAR: lower to 40-50%, set Y-max to 55% |
| Error rate | ~0.1% | 5% | ~5% | auto | OK gap, but set Y-max to 6% |
| p99 latency | ~50ms | 500ms | 10x | auto | TOO FAR: lower to 100-150ms, set Y-max to 175ms |
```
5. Audit Customer-Facing Section
Principle: A dedicated "Customer-Facing" group should exist at the top of the dashboard with 5-8 key metrics for immediate outage identification. The specific metrics should reflect the product's business — not just generic traffic and error rates.
Check:
- Does a "Customer-Facing" group exist?
- Is it the first group on the dashboard?
- Does it contain 5-8 metrics covering: traffic volume, API latency, error rates, key business transactions, and database health?
- Can someone determine "are customers affected?" within 5 seconds of opening the dashboard?
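The structural parts of this check (group exists, comes first, holds 5-8 widgets) can be sketched in Python. The assumptions here are that the group is recognizable by "customer-facing" in its title and that the order of the top-level `widgets` list reflects on-screen order, which may not hold for free-layout dashboards. `audit_customer_facing` is a hypothetical helper. Whether the metrics reflect the product's business still needs a human read.

```python
def audit_customer_facing(dashboard: dict, min_metrics: int = 5,
                          max_metrics: int = 8) -> str:
    """Return MISSING, INCOMPLETE, or OK for the Customer-Facing group."""
    for index, widget in enumerate(dashboard.get("widgets", [])):
        definition = widget.get("definition", {})
        is_group = definition.get("type") == "group"
        # Assumption: the group is identified by its title.
        if is_group and "customer-facing" in definition.get("title", "").lower():
            count = len(definition.get("widgets", []))
            if index == 0 and min_metrics <= count <= max_metrics:
                return "OK"
            # Present, but not first or not holding 5-8 widgets.
            return "INCOMPLETE"
    return "MISSING"
```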
Findings format:
```markdown
Customer-Facing Section Audit

Status: MISSING / INCOMPLETE / OK

Current state: [Description of what exists]

Recommended metrics (if missing or incomplete):
- Total request rate (are we receiving traffic?)
- Customer-facing error rate (are requests failing?)
- API p99 latency (are responses slow?)
- Key transaction success rate (are critical flows working?)
- Database connection pool usage (is the data layer healthy?)
- Queue depth or processing lag (is async work backing up?)
```
6. Apply Zero-Knowledge Viewer Test
Principle: Someone with zero knowledge of the service should be able to spot problems by looking for red indicators.
Evaluate:
- Can you identify a problem in under 10 seconds without reading widget titles?
- Are thresholds visible as red lines on every graph?
- Is conditional formatting applied to query value widgets (green/yellow/red)?
- Are group names self-explanatory?
- Is there a note widget with runbook links or team ownership?
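Two of these checks lend themselves to a simple count. The sketch below assumes a flat list of widget definitions (groups already expanded), that query value widgets carry `conditional_formats` on their `requests` entries, and that note widgets keep their text in `content`. Treating any note containing "http" as having a runbook link is a crude hypothetical heuristic, and `audit_readability` is an illustrative name.

```python
def audit_readability(definitions: list) -> dict:
    """Count query value formatting coverage and notes that contain links."""
    query_values = [d for d in definitions if d.get("type") == "query_value"]
    formatted = [
        d for d in query_values
        if any(r.get("conditional_formats") for r in d.get("requests", []))
    ]
    # Heuristic assumption: a note with a URL probably links a runbook or owner page.
    notes_with_links = [
        d for d in definitions
        if d.get("type") == "note" and "http" in d.get("content", "")
    ]
    return {
        "query_value_total": len(query_values),
        "query_value_formatted": len(formatted),
        "has_note_with_link": bool(notes_with_links),
    }
```

The 10-second test and the clarity of group names remain visual judgments that this count does not replace.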
Findings format:
```markdown
Zero-Knowledge Readability Audit

| Check | Status | Finding |
|---|---|---|
| Problems visible in <10s | FAIL | No red lines on 8 of 12 graphs |
| Conditional formatting on QV widgets | PARTIAL | 2 of 4 QV widgets have thresholds |
| Group names self-explanatory | OK | All groups use clear names |
| Runbook/ownership note | MISSING | No note widget with team info |
```
7. Generate Review Report
Compile all findings into a structured report:
```markdown
Dashboard Review: [Dashboard Title]

Dashboard ID: [id]
URL: [url]
Review date: [date]

Summary
[2-3 sentence summary: overall health of the dashboard, critical issues count]

Critical Issues
[List issues that must be fixed before the dashboard is production-ready]

Alert Threshold Audit
[From step 3]

Threshold Proximity Audit
[From step 4]

Customer-Facing Section Audit
[From step 5]

Zero-Knowledge Readability Audit
[From step 6]

Recommended Actions

Must Fix
- [Action item with specific widget and group reference]

Should Fix
- [Action item]

Nice to Have
- [Action item]
```

---

Quality Checklist
- Every widget title uses the layer-priority prefix (`I0:`, `P1:`, `D0:`, etc.)
- Every timeseries widget audited for alert threshold markers
- Threshold proximity checked (no large gaps between normal values and alert lines)
- Customer-Facing group exists with 5-8 key metrics at the top
- Zero-knowledge viewer test applied (red indicators visible without context)
- Query Value widgets checked for conditional formatting (green/yellow/red)
- All findings include specific widget names and group references
- Recommended actions categorized by priority (must/should/nice-to-have)
- Dashboard URL included in report for easy reference