datadog-review-dashboard

Review Datadog Dashboard

Audit an existing Datadog dashboard against operational readiness principles. The core principles are: graphs should earn their place with alert thresholds, thresholds should sit close to normal traffic, a customer-facing section should exist, and the dashboard should be readable by someone with zero service knowledge.
These are guiding principles — not a rigid checklist. Apply judgment based on the product and business context. A context-providing metric (like deployment events) may earn its place without a threshold. A service with unusual traffic patterns may need different proximity rules.

Interview Phase

Skip interview if ALL of these are already specified:
  • Dashboard ID or URL
  • Service name or team context
Always interview if: No dashboard ID is provided or multiple dashboards may be relevant.

Questions

  1. Dashboard — "Which dashboard should I review? Provide a dashboard ID, URL, or service name to search for."
    • Impact: Determines which dashboard definition to fetch
  2. Business Context — "Can you tell me what this service does for customers? Are there codebases or docs I can read to understand the product?"
    • Impact: Understanding the domain lets the review focus on whether the right metrics are being tracked, not just whether generic rules are followed
  3. Focus — "Is there anything specific you want me to focus on? (A) Full review, (B) Alert thresholds only, (C) Customer-facing section, (D) Layout and readability"
    • Impact: Determines review scope — default to full review if unspecified

Workflow

1. Fetch Dashboard Definition

```bash
# If given a service name, search for matching dashboards
pup dashboards list --filter="<service-name>"

# If given a URL, extract the dashboard ID from the path (e.g., /dashboard/abc-def-ghi/...)

# Get the full dashboard definition
pup dashboards get <dashboard-id>

# Get the dashboard URL for reference
pup dashboards url <dashboard-id>
```

Parse the response to build an inventory of all widgets, groups, and their configurations.
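
The URL step above can be sketched in Python. This is a minimal illustration, not part of the `pup` CLI: the function name `extract_dashboard_id` is hypothetical, and the pattern assumes IDs are three hyphen-separated alphanumeric chunks, as in the `/dashboard/abc-def-ghi/...` example.

```python
import re
from typing import Optional

# Assumption: dashboard IDs follow the three-chunk shape shown in the
# example path (/dashboard/abc-def-ghi/...).
_DASHBOARD_ID = re.compile(
    r"/dashboard/([a-z0-9]+-[a-z0-9]+-[a-z0-9]+)", re.IGNORECASE
)


def extract_dashboard_id(url: str) -> Optional[str]:
    """Return the dashboard ID embedded in a dashboard URL, or None."""
    match = _DASHBOARD_ID.search(url)
    return match.group(1) if match else None
```

For a URL such as `https://app.datadoghq.com/dashboard/abc-def-ghi/my-service`, this returns `abc-def-ghi`; for a URL without a `/dashboard/` path segment it returns `None`, which is a reasonable cue to fall back to the interview questions.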

2. Build Widget Inventory

Catalog every widget in the dashboard:

| Widget Title | Prefix | Type | Group | Has Alert Threshold | Threshold Value | Notes |
|---|---|---|---|---|---|---|
| ... | I0/P1/D0/— | ... | ... | ... | ... | ... |

Check that every widget title uses the layer-priority prefix system:
  • I0-N: for infrastructure (load balancers, databases, networks)
  • P0-N: for platform (service-specific components from the codebase)
  • D0-N: for domain (business metrics)
  • The number indicates priority within the layer (0 = most critical)
Focus on timeseries and query value widgets — these are the primary candidates for alert threshold markers.
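
The inventory can be built mechanically from the dashboard definition. The sketch below assumes the JSON follows the Datadog dashboard schema (top-level `widgets`, each with a `definition`; groups nest their own `widgets`; timeseries thresholds live in `markers`). The helper name `walk_widgets` and the row keys are illustrative choices, not something the `pup` CLI provides.

```python
import re
from typing import Any, Dict, Iterator, List, Optional

# Layer-priority prefix at the start of a title, e.g. "I0", "P1", "D0".
PREFIX = re.compile(r"^([IPD])(\d+)\b")


def walk_widgets(
    widgets: List[Dict[str, Any]], group: Optional[str] = None
) -> Iterator[Dict[str, Any]]:
    """Yield one inventory row per non-group widget, descending into groups."""
    for widget in widgets:
        definition = widget.get("definition", {})
        if definition.get("type") == "group":
            yield from walk_widgets(
                definition.get("widgets", []), definition.get("title")
            )
            continue
        title = definition.get("title") or ""
        prefix = PREFIX.match(title)
        markers = definition.get("markers") or []
        yield {
            "title": title,
            "prefix": prefix.group(0) if prefix else None,
            "type": definition.get("type"),
            "group": group,
            "has_threshold": bool(markers),
            "threshold_value": markers[0].get("value") if markers else None,
        }
```

Rows with `prefix` set to `None` are the ones to flag for a missing layer-priority prefix; the Notes column is left for the reviewer to fill in by hand.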

3. Audit Alert Thresholds

Principle: Timeseries graphs should generally have an alert threshold (red line). If a metric doesn't warrant an alert, question whether it belongs — but use judgment. Some metrics provide valuable context (deployment markers, dependency traffic patterns) without needing a threshold.
For each timeseries widget, check:
  • Does it have a marker/threshold line configured?
  • Is the marker colored red for visibility?
  • Does the threshold correspond to an actual monitor/alert?
Findings format:

```markdown
## Alert Threshold Audit

| Widget | Group | Status | Finding |
|---|---|---|---|
| Requests/s | Rate | MISSING | No threshold marker: add alert line or remove widget |
| Error rate | Errors | OK | Red line at 5% |
| CPU usage | Infra | MISSING | No threshold: is this metric alertable? |
```
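
The first two checks can be approximated from the widget JSON alone. This sketch assumes timeseries thresholds are stored in the definition's `markers` list and that a `display_type` containing `error` (for example `error dashed`) is the severity rendered in red. The `NOT RED` status is a hypothetical addition for markers that exist but use another severity. Whether a threshold matches a real monitor cannot be read from the dashboard definition and still needs a manual check.

```python
from typing import Any, Dict, List


def audit_threshold(definition: Dict[str, Any]) -> str:
    """Classify a timeseries widget definition as OK, NOT RED, or MISSING."""
    markers: List[Dict[str, Any]] = definition.get("markers") or []
    if not markers:
        return "MISSING"
    # Assumption: display_type combines severity and line style, and the
    # "error" severity is the one drawn as a red line.
    if any("error" in (m.get("display_type") or "") for m in markers):
        return "OK"
    return "NOT RED"
```

A widget returning `MISSING` is a prompt for judgment rather than an automatic failure, since context metrics such as deployment markers may belong on the dashboard without a threshold.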

4. Audit Threshold Proximity

Principle: Alert thresholds must be close to normal traffic. Large gaps between normal values and the alert line create blind spots where anomalies go unnoticed.
For each widget with a threshold:
  • What is the typical (normal) value range?
  • Where is the threshold set?
  • Is there excessive whitespace between the normal line and the alert line?
  • Is the Y-axis auto-scaled or explicitly set? Auto-scaled Y-axes compress normal traffic into a flat band when the threshold is far above normal — the Y-axis max should be set to slightly above the alert threshold
Bad example: Normal CPU is 20%, alert threshold at 95% — the graph is mostly empty space and a slow climb from 20% to 80% looks flat.
Good example: Normal CPU is 20%, alert threshold at 45% — anomalies visually stand out immediately.
Findings format:

```markdown
## Threshold Proximity Audit

| Widget | Normal Range | Threshold | Gap | Y-Axis | Status |
|---|---|---|---|---|---|
| CPU usage | ~20% | 95% | 75% | auto | TOO FAR: lower to 40-50%, set Y-max to 55% |
| Error rate | ~0.1% | 5% | ~5% | auto | OK gap, but set Y-max to 6% |
| p99 latency | ~50ms | 500ms | 10x | auto | TOO FAR: lower to 100-150ms, set Y-max to 175ms |
```
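
The numeric columns of this table can be derived once a normal value is known. The sketch below makes several assumptions: markers use the single-value form `y = <number>`, the Y-axis setting lives under `yaxis.max`, the normal value is supplied by the reviewer (it is not stored in the dashboard definition), and `HEADROOM = 1.2` is an arbitrary reading of "slightly above the alert threshold". It reports the gap ratio without a verdict, because whether a gap is too large depends on the service.

```python
import re
from typing import Any, Dict, Optional

HEADROOM = 1.2  # assumed: Y-axis max sits 20% above the alert threshold


def marker_threshold(value: str) -> Optional[float]:
    """Parse a single-value marker such as "y = 95" into 95.0."""
    match = re.fullmatch(r"\s*y\s*=\s*(-?\d+(?:\.\d+)?)\s*", value)
    return float(match.group(1)) if match else None


def proximity_row(definition: Dict[str, Any], normal: float) -> Dict[str, Any]:
    """Summarize threshold, gap ratio, and Y-axis mode for one widget."""
    markers = definition.get("markers") or []
    threshold = marker_threshold(markers[0].get("value", "")) if markers else None
    y_max = (definition.get("yaxis") or {}).get("max", "auto")
    has_gap = threshold is not None and normal > 0
    return {
        "threshold": threshold,
        "gap_ratio": round(threshold / normal, 1) if has_gap else None,
        "y_axis": "auto" if y_max in (None, "auto") else "fixed",
        "suggested_y_max": (
            round(threshold * HEADROOM, 2) if threshold is not None else None
        ),
    }
```

With a marker of `y = 5` and a normal value of 0.1, the suggested Y-axis max is 6.0, which lines up with the error-rate row above. Range markers (such as `10 < y < 999`) are not parsed and yield `None`.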

5. Audit Customer-Facing Section

Principle: A dedicated "Customer-Facing" group should exist at the top of the dashboard with 5-8 key metrics for immediate outage identification. The specific metrics should reflect the product's business — not just generic traffic and error rates.
Check:
  • Does a "Customer-Facing" group exist?
  • Is it the first group on the dashboard?
  • Does it contain 5-8 metrics covering: traffic volume, API latency, error rates, key business transactions, and database health?
  • Can someone determine "are customers affected?" within 5 seconds of opening the dashboard?
Findings format:

```markdown
## Customer-Facing Section Audit

Status: MISSING / INCOMPLETE / OK
Current state: [Description of what exists]

Recommended metrics (if missing or incomplete):
1. Total request rate (are we receiving traffic?)
2. Customer-facing error rate (are requests failing?)
3. API p99 latency (are responses slow?)
4. Key transaction success rate (are critical flows working?)
5. Database connection pool usage (is the data layer healthy?)
6. Queue depth or processing lag (is async work backing up?)
```
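
The structural checks (group exists, is first, holds 5-8 metrics) can be sketched as below. Matching on the substring `customer-facing` in the group title and excluding note widgets from the metric count are assumptions of this example. Whether the metrics actually cover the product's business still requires reading them.

```python
from typing import Any, Dict


def audit_customer_facing(dashboard: Dict[str, Any]) -> Dict[str, Any]:
    """Return MISSING / INCOMPLETE / OK for the Customer-Facing group."""
    groups = [
        w["definition"]
        for w in dashboard.get("widgets", [])
        if w.get("definition", {}).get("type") == "group"
    ]
    for position, group in enumerate(groups):
        if "customer-facing" not in (group.get("title") or "").lower():
            continue
        # Count child widgets other than notes as metrics.
        metrics = [
            child
            for child in group.get("widgets", [])
            if child.get("definition", {}).get("type") != "note"
        ]
        is_first = position == 0
        complete = is_first and 5 <= len(metrics) <= 8
        return {
            "status": "OK" if complete else "INCOMPLETE",
            "is_first": is_first,
            "metric_count": len(metrics),
        }
    return {"status": "MISSING", "is_first": False, "metric_count": 0}
```

An `INCOMPLETE` result means the group exists but is either not the first group or falls outside the 5-8 metric range; the `is_first` and `metric_count` fields show which.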

6. Apply Zero-Knowledge Viewer Test

Principle: Someone with zero knowledge of the service should be able to spot problems by looking for red indicators.
Evaluate:
  • Can you identify a problem in under 10 seconds without reading widget titles?
  • Are thresholds visible as red lines on every graph?
  • Is conditional formatting applied to query value widgets (green/yellow/red)?
  • Are group names self-explanatory?
  • Is there a note widget with runbook links or team ownership?
Findings format:

```markdown
## Zero-Knowledge Readability Audit

| Check | Status | Finding |
|---|---|---|
| Problems visible in <10s | FAIL | No red lines on 8 of 12 graphs |
| Conditional formatting on QV widgets | PARTIAL | 2 of 4 QV widgets have thresholds |
| Group names self-explanatory | OK | All groups use clear names |
| Runbook/ownership note | MISSING | No note widget with team info |
```
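
Two of these rows can be pre-filled from the definition. The sketch assumes query value widgets carry their green/yellow/red rules in `requests[].conditional_formats` and that ownership information, when present, sits in a widget of type `note`. It only detects that a note exists, not whether it contains runbook links, and the 10-second visibility test remains a human judgment.

```python
from typing import Any, Dict, Iterator, List


def iter_definitions(widgets: List[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield every widget definition, including those nested in groups."""
    for widget in widgets:
        definition = widget.get("definition", {})
        yield definition
        if definition.get("type") == "group":
            yield from iter_definitions(definition.get("widgets", []))


def readability_checks(dashboard: Dict[str, Any]) -> Dict[str, str]:
    """Pre-fill the conditional-formatting and ownership-note rows."""
    definitions = list(iter_definitions(dashboard.get("widgets", [])))
    query_values = [d for d in definitions if d.get("type") == "query_value"]
    formatted = [
        d
        for d in query_values
        if any(r.get("conditional_formats") for r in d.get("requests", []))
    ]
    if len(formatted) == len(query_values):
        qv_status = "OK"
    elif formatted:
        qv_status = "PARTIAL"
    else:
        qv_status = "FAIL"
    has_note = any(d.get("type") == "note" for d in definitions)
    return {
        "qv_conditional_formatting": qv_status,
        "qv_detail": f"{len(formatted)} of {len(query_values)} QV widgets have conditional formats",
        "ownership_note": "OK" if has_note else "MISSING",
    }
```

A dashboard with no query value widgets reports `OK` for conditional formatting, since there is nothing to format.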

7. Generate Review Report

Compile all findings into a structured report:

```markdown
# Dashboard Review: [Dashboard Title]

Dashboard ID: [id]
URL: [url]
Review date: [date]

## Summary
[2-3 sentence summary: overall health of the dashboard, critical issues count]

## Critical Issues
[List issues that must be fixed before the dashboard is production-ready]

## Alert Threshold Audit
[From step 3]

## Threshold Proximity Audit
[From step 4]

## Customer-Facing Section Audit
[From step 5]

## Zero-Knowledge Readability Audit
[From step 6]

## Recommended Actions

### Must Fix
1. [Action item with specific widget and group reference]

### Should Fix
1. [Action item]

### Nice to Have
1. [Action item]
```

---
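
When findings are collected as rows, the audit tables in the report can be rendered with a small helper. This is an illustrative sketch: the name `markdown_table` is hypothetical, and it does not escape pipe characters inside cells.

```python
from typing import List, Sequence


def markdown_table(headers: Sequence[str], rows: List[Sequence[str]]) -> str:
    """Render headers and rows as a pipe-delimited markdown table."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "|" + "|".join(["---"] * len(headers)) + "|",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)
```

Calling `markdown_table(["Widget", "Status"], [["CPU usage", "MISSING"]])` produces a header line, a separator line, and one data row, ready to paste under the matching report heading.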

Quality Checklist

  • Every widget title uses the layer-priority prefix (I0:, P1:, D0:, etc.)
  • Every timeseries widget audited for alert threshold markers
  • Threshold proximity checked (no large gaps between normal values and alert lines)
  • Customer-Facing group exists with 5-8 key metrics at the top
  • Zero-knowledge viewer test applied (red indicators visible without context)
  • Query Value widgets checked for conditional formatting (green/yellow/red)
  • All findings include specific widget names and group references
  • Recommended actions categorized by priority (must/should/nice-to-have)
  • Dashboard URL included in report for easy reference