grafana-dashboards
Grafana Dashboards
Create and manage production-ready Grafana dashboards for comprehensive system observability.
Purpose
Design effective Grafana dashboards for monitoring applications, infrastructure, and business metrics.
When to Use
- Visualize Prometheus metrics
- Create custom dashboards
- Implement SLO dashboards
- Monitor infrastructure
- Track business KPIs
Dashboard Design Principles
1. Hierarchy of Information
┌─────────────────────────────────────┐
│ Critical Metrics (Big Numbers) │
├─────────────────────────────────────┤
│ Key Trends (Time Series) │
├─────────────────────────────────────┤
│ Detailed Metrics (Tables/Heatmaps) │
└─────────────────────────────────────┘
2. RED Method (Services)
- Rate - Requests per second
- Errors - Error rate
- Duration - Latency/response time
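As a rough illustration of how the three RED signals map onto PromQL, here is a minimal Python sketch that builds one query per signal. The metric and label names (`http_requests_total`, `http_request_duration_seconds_bucket`, `service`, `status`) are the ones used in the dashboard JSON in this document; the function name `red_queries` and the choice of the 95th percentile for Duration are assumptions made for the example.

```python
def red_queries(service: str, window: str = "5m") -> dict[str, str]:
    """Return one PromQL expression per RED signal for a single service."""
    selector = f'service="{service}"'
    requests = f"http_requests_total{{{selector}}}"
    errors = f'http_requests_total{{{selector}, status=~"5.."}}'
    buckets = f"http_request_duration_seconds_bucket{{{selector}}}"
    return {
        # Rate: requests per second
        "rate": f"sum(rate({requests}[{window}]))",
        # Errors: fraction of requests that returned a 5xx status
        "errors": f"sum(rate({errors}[{window}])) / sum(rate({requests}[{window}]))",
        # Duration: 95th percentile latency estimated from histogram buckets
        "duration": f"histogram_quantile(0.95, sum(rate({buckets}[{window}])) by (le))",
    }


for signal, expr in red_queries("checkout").items():
    print(f"{signal}: {expr}")
```

Generating the three expressions from one selector keeps the Rate, Errors and Duration panels filtered identically, which is easy to get wrong when the queries are edited by hand.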
3. USE Method (Resources)
- Utilization - % time resource is busy
- Saturation - Queue length/wait time
- Errors - Error count
Dashboard Structure
API Monitoring Dashboard
```json
{
  "dashboard": {
    "title": "API Monitoring",
    "tags": ["api", "production"],
    "timezone": "browser",
    "refresh": "30s",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{service}}"
          }
        ],
        "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 }
      },
      {
        "title": "Error Rate %",
        "type": "graph",
        "targets": [
          {
            "expr": "(sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))) * 100",
            "legendFormat": "Error Rate"
          }
        ],
        "alert": {
          "conditions": [
            {
              "evaluator": { "params": [5], "type": "gt" },
              "operator": { "type": "and" },
              "query": { "params": ["A", "5m", "now"] },
              "type": "query"
            }
          ]
        },
        "gridPos": { "x": 12, "y": 0, "w": 12, "h": 8 }
      },
      {
        "title": "P95 Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
            "legendFormat": "{{service}}"
          }
        ],
        "gridPos": { "x": 0, "y": 8, "w": 24, "h": 8 }
      }
    ]
  }
}
```
Reference: See assets/api-dashboard.json
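The P95 Latency panel depends on `histogram_quantile`, which estimates a percentile from cumulative `le` buckets by interpolating linearly inside the bucket that contains the target rank. The Python sketch below is a simplified re-implementation for illustration only, not Prometheus's own code. The function name and the sample bucket counts are made up, and it assumes the input is sorted, has at least one finite bucket, and ends with the `+Inf` bucket.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper = buckets[idx][0]
    lower = buckets[idx - 1][0] if idx > 0 else 0.0
    below = buckets[idx - 1][1] if idx > 0 else 0.0
    in_bucket = buckets[idx][1] - below
    # Linear interpolation inside the selected bucket.
    return lower + (upper - lower) * (rank - below) / in_bucket


# Hypothetical counts: 100 requests, bucket bounds in seconds.
sample = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, sample))  # lands in the (0.5, 1.0] bucket, about 0.81
```

Because the result is interpolated, its precision depends on how close the bucket boundaries are to the latencies you care about. A percentile that falls in the `+Inf` bucket is capped at the highest finite bound.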
Panel Types
1. Stat Panel (Single Value)
```json
{
  "type": "stat",
  "title": "Total Requests",
  "targets": [
    {
      "expr": "sum(http_requests_total)"
    }
  ],
  "options": {
    "reduceOptions": {
      "values": false,
      "calcs": ["lastNotNull"]
    },
    "orientation": "auto",
    "textMode": "auto",
    "colorMode": "value"
  },
  "fieldConfig": {
    "defaults": {
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "value": 0, "color": "green" },
          { "value": 80, "color": "yellow" },
          { "value": 90, "color": "red" }
        ]
      }
    }
  }
}
```
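To make the `thresholds` block concrete: in `absolute` mode a value takes the color of the highest step it has reached. The following Python sketch mirrors that rule under the assumption that the lowest step acts as the base color. The function name is hypothetical, and this is an illustration of the rule rather than Grafana's implementation.

```python
def threshold_color(value: float, steps: list[dict]) -> str:
    """Return the color of the highest step whose value is <= `value`."""
    ordered = sorted(steps, key=lambda step: step["value"])
    color = ordered[0]["color"]  # the lowest step acts as the base color
    for step in ordered[1:]:
        if value >= step["value"]:
            color = step["color"]
    return color


# Steps copied from the stat panel definition above.
steps = [
    {"value": 0, "color": "green"},
    {"value": 80, "color": "yellow"},
    {"value": 90, "color": "red"},
]

for reading in (42, 85, 90):
    print(reading, threshold_color(reading, steps))
```

Checking a few boundary values this way before saving a panel helps confirm that a step value such as 90 already counts as red rather than yellow.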
2. Time Series Graph
```json
{
  "type": "graph",
  "title": "CPU Usage",
  "targets": [
    {
      "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
    }
  ],
  "yaxes": [
    { "format": "percent", "max": 100, "min": 0 },
    { "format": "short" }
  ]
}
```
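The CPU Usage expression works backwards from idle time: `node_cpu_seconds_total{mode="idle"}` is a per-core counter of idle seconds, so its per-second rate is the idle fraction, and 100 minus the averaged idle percentage is the busy percentage. A small Python sketch of the same arithmetic, using made-up counter readings and ignoring the counter-reset handling and extrapolation that Prometheus's `rate` performs:

```python
def cpu_usage_percent(idle_samples: list[tuple[float, float]], window_seconds: float) -> float:
    """Compute busy CPU % from per-core idle-second counters.

    `idle_samples` holds one (start, end) counter reading pair per core,
    taken `window_seconds` apart.
    """
    # rate(...): idle seconds gained per second of wall-clock time, per core
    idle_rates = [(end - start) / window_seconds for start, end in idle_samples]
    # avg by (instance): average the idle fraction across cores
    avg_idle = sum(idle_rates) / len(idle_rates)
    # 100 - idle%: what remains is time spent in every non-idle mode
    return 100 - avg_idle * 100


# Hypothetical two-core host sampled over a 5 minute (300 s) window:
# core 0 was idle for 240 s, core 1 for 180 s.
usage = cpu_usage_percent([(1000.0, 1240.0), (2000.0, 2180.0)], 300)
print(f"{usage:.1f}%")
```

With these readings the cores are idle 80% and 60% of the time, so the host averages roughly 30% busy.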
3. Table Panel
```json
{
  "type": "table",
  "title": "Service Status",
  "targets": [
    {
      "expr": "up",
      "format": "table",
      "instant": true
    }
  ],
  "transformations": [
    {
      "id": "organize",
      "options": {
        "excludeByName": { "Time": true },
        "indexByName": {},
        "renameByName": {
          "instance": "Instance",
          "job": "Service",
          "Value": "Status"
        }
      }
    }
  ]
}
```
4. Heatmap
```json
{
  "type": "heatmap",
  "title": "Latency Heatmap",
  "targets": [
    {
      "expr": "sum(rate(http_request_duration_seconds_bucket[5m])) by (le)",
      "format": "heatmap"
    }
  ],
  "dataFormat": "tsbuckets",
  "yAxis": {
    "format": "s"
  }
}
```
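Prometheus histogram buckets are cumulative: each `le` series counts every observation at or below its bound. A heatmap needs the count that falls into each individual band, which is the difference between neighbouring buckets. The Python sketch below shows that subtraction on invented counts. It is an illustration of the idea, and the function name is hypothetical.

```python
def per_bucket_counts(cumulative: dict[str, float]) -> dict[str, float]:
    """Convert cumulative `le` bucket counts into per-bucket counts."""
    # Sort by numeric upper bound; float() understands "+Inf".
    ordered = sorted(cumulative.items(), key=lambda item: float(item[0]))
    result: dict[str, float] = {}
    previous = 0.0
    for le, count in ordered:
        result[le] = count - previous  # observations in (previous bound, le]
        previous = count
    return result


# Hypothetical cumulative counts keyed by the `le` label value.
cumulative = {"0.1": 50, "0.5": 90, "1": 98, "+Inf": 100}
print(per_bucket_counts(cumulative))
```

Here 50 requests finished within 0.1 s, 40 took between 0.1 s and 0.5 s, 8 between 0.5 s and 1 s, and 2 took longer than 1 s. Those four numbers are what a single heatmap column would shade.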
Variables
Query Variables
```json
{
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(kube_pod_info, namespace)",
        "refresh": 1,
        "multi": false
      },
      {
        "name": "service",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(kube_service_info{namespace=\"$namespace\"}, service)",
        "refresh": 1,
        "multi": true
      }
    ]
  }
}
```
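Because `service` is declared with `"multi": true`, a query that uses it should match with the regex operator `=~`: when several values are selected, the variable expands to an alternation such as `(a|b)`. The Python sketch below imitates that substitution to show the final query text. It is a simplification under stated assumptions: the function names are hypothetical, and the values are assumed to be plain names without regex metacharacters, so no escaping is done here.

```python
def interpolate_multi(values: list[str]) -> str:
    """Render selected variable values as a regex alternation."""
    if len(values) == 1:
        return values[0]
    return "(" + "|".join(values) + ")"


def build_query(namespace: str, services: list[str]) -> str:
    """Substitute $namespace and $service into the request-rate query."""
    return (
        "sum(rate(http_requests_total{"
        f'namespace="{namespace}", service=~"{interpolate_multi(services)}"'
        "}[5m]))"
    )


print(build_query("prod", ["checkout", "payments"]))
```

With an exact match (`=`) instead of `=~`, the expanded `(checkout|payments)` would be compared as a literal string and the panel would return no data.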
Use Variables in Queries
```
sum(rate(http_requests_total{namespace="$namespace", service=~"$service"}[5m]))
```
Alerts in Dashboards
```json
{
  "alert": {
    "name": "High Error Rate",
    "conditions": [
      {
        "evaluator": {
          "params": [5],
          "type": "gt"
        },
        "operator": { "type": "and" },
        "query": {
          "params": ["A", "5m", "now"]
        },
        "reducer": { "type": "avg" },
        "type": "query"
      }
    ],
    "executionErrorState": "alerting",
    "for": "5m",
    "frequency": "1m",
    "message": "Error rate is above 5%",
    "noDataState": "no_data",
    "notifications": [{ "uid": "slack-channel" }]
  }
}
```
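The alert definition combines three pieces: a reducer (`avg`) that collapses the last 5 minutes of query `A` into one number, an evaluator (`gt` 5) that compares it with the threshold, and a `for` duration that delays firing until the breach has persisted. The Python sketch below walks through that logic. It is a simplified model, not Grafana's alerting engine; the function names are hypothetical, and the exact boundary at which a pending alert starts firing is an assumption.

```python
def condition_met(samples: list[float], threshold: float = 5.0) -> bool:
    """Apply the `avg` reducer, then the `gt` evaluator."""
    if not samples:
        return False  # no data is handled separately via noDataState
    return sum(samples) / len(samples) > threshold


def alert_state(breaches: list[bool], frequency_s: int = 60, for_s: int = 300) -> str:
    """Derive the state from evaluation results, oldest first.

    Assumption: the alert fires once the condition has held for at
    least `for_s` seconds since the first breaching evaluation.
    """
    streak = 0
    for breached in reversed(breaches):
        if not breached:
            break
        streak += 1
    if streak == 0:
        return "ok"
    held_for = (streak - 1) * frequency_s
    return "alerting" if held_for >= for_s else "pending"


print(condition_met([4.0, 6.0, 8.0]))          # average 6 is above 5
print(alert_state([False, True, True]))        # breach is 1 minute old
print(alert_state([True] * 6))                 # breach has held for 5 minutes
```

The `for` setting trades detection speed for fewer false alarms: a one-minute spike moves the rule to pending and back to ok without notifying anyone.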
Dashboard Provisioning
dashboards.yml:
```yaml
apiVersion: 1

providers:
  - name: "default"
    orgId: 1
    folder: "General"
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/dashboards
```
Common Dashboard Patterns
Infrastructure Dashboard
Key Panels:
- CPU utilization per node
- Memory usage per node
- Disk I/O
- Network traffic
- Pod count by namespace
- Node status
Reference: See assets/infrastructure-dashboard.json
Database Dashboard
Key Panels:
- Queries per second
- Connection pool usage
- Query latency (P50, P95, P99)
- Active connections
- Database size
- Replication lag
- Slow queries
Reference: See assets/database-dashboard.json
Application Dashboard
Key Panels:
- Request rate
- Error rate
- Response time (percentiles)
- Active users/sessions
- Cache hit rate
- Queue length
Best Practices
- Start with templates (Grafana community dashboards)
- Use consistent naming for panels and variables
- Group related metrics in rows
- Set appropriate time ranges (default: Last 6 hours)
- Use variables for flexibility
- Add panel descriptions for context
- Configure units correctly
- Set meaningful thresholds for colors
- Use consistent colors across dashboards
- Test with different time ranges
Dashboard as Code
Terraform Provisioning
```hcl
resource "grafana_dashboard" "api_monitoring" {
  config_json = file("${path.module}/dashboards/api-monitoring.json")
  folder      = grafana_folder.monitoring.id
}

resource "grafana_folder" "monitoring" {
  title = "Production Monitoring"
}
```
Ansible Provisioning
```yaml
- name: Deploy Grafana dashboards
  copy:
    src: "{{ item }}"
    dest: /etc/grafana/dashboards/
  with_fileglob:
    - "dashboards/*.json"
  notify: restart grafana
```
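When dashboards are deployed as files, a broken JSON file is easier to catch before it reaches the server. Below is a minimal Python sketch of a pre-deployment check that could run in CI ahead of the Terraform or Ansible step. The function names and the specific checks are assumptions chosen for illustration. It accepts dashboard JSON either bare or wrapped in a top-level `dashboard` key, and relies on Grafana's panel grid being 24 columns wide.

```python
import json
from pathlib import Path

GRID_COLUMNS = 24  # width of Grafana's dashboard grid


def lint_dashboard(doc: dict) -> list[str]:
    """Return a list of problems found in one dashboard document."""
    problems = []
    dashboard = doc.get("dashboard", doc)  # accept wrapped or bare JSON
    if not dashboard.get("title"):
        problems.append("missing title")
    for index, panel in enumerate(dashboard.get("panels", [])):
        name = panel.get("title") or f"panel #{index}"
        pos = panel.get("gridPos")
        if pos is None:
            problems.append(f"{name}: missing gridPos")
        elif pos["x"] + pos["w"] > GRID_COLUMNS:
            problems.append(f"{name}: extends past column {GRID_COLUMNS}")
    return problems


def lint_directory(path: str) -> dict[str, list[str]]:
    """Lint every *.json file in `path`; only files with problems are reported."""
    report = {}
    for file in sorted(Path(path).glob("*.json")):
        try:
            doc = json.loads(file.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            report[file.name] = [f"invalid JSON: {exc}"]
            continue
        problems = lint_dashboard(doc)
        if problems:
            report[file.name] = problems
    return report


if __name__ == "__main__":
    for name, problems in lint_directory("dashboards").items():
        for problem in problems:
            print(f"{name}: {problem}")
```

Run from the repository root, the script prints one line per problem and nothing when every file passes, so a CI job can fail the build whenever the output is non-empty.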
Reference Files
- assets/api-dashboard.json - API monitoring dashboard
- assets/infrastructure-dashboard.json - Infrastructure dashboard
- assets/database-dashboard.json - Database monitoring dashboard
- references/dashboard-design.md - Dashboard design guide
Related Skills
- prometheus-configuration - For metric collection
- slo-implementation - For SLO dashboards