alerting-dashboard-builder
Alerting & Dashboard Builder
Build effective alerts and dashboards based on SLOs.
SLO Definition
```yaml
slos:
  - name: api_availability
    objective: 99.9%
    window: 30d
    # Fraction of requests that did not return a 5xx status
    sli: |
      sum(rate(http_requests_total{status_code!~"5.."}[5m])) /
      sum(rate(http_requests_total[5m]))
  - name: api_latency
    objective: 95% # 95% of requests under 500ms
    window: 30d
    # Fraction of requests served within 500ms.
    # Assumes the histogram exposes a le="0.5" bucket.
    sli: |
      sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m])) /
      sum(rate(http_request_duration_seconds_count[5m]))
```
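To make an objective concrete, it can help to convert it into an error budget. The sketch below is illustrative only: the function name `error_budget_minutes` is hypothetical, and it assumes the budget is spent as full outage time, which is a simplification of request-based SLIs.

```python
def error_budget_minutes(objective: float, window_days: int) -> float:
    """Minutes of total outage an objective tolerates over its window.

    objective is a fraction (0.999 for 99.9%); window_days is the SLO window.
    """
    window_minutes = window_days * 24 * 60
    return (1.0 - objective) * window_minutes


if __name__ == "__main__":
    # 99.9% over 30 days leaves roughly 43 minutes of budget.
    print(round(error_budget_minutes(0.999, 30), 1))
```

Under these assumptions, the `api_availability` objective of 99.9% over 30 days leaves about 43 minutes of error budget.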
Alert Rules
```yaml
groups:
  - name: slo_alerts
    rules:
      # Fast burn: 1% of the 30d budget in 1h is a burn rate of 7.2 (0.01 * 720h / 1h).
      # Threshold = burn rate * allowed error rate (1 - 0.999).
      - alert: AvailabilitySLOFastBurn
        expr: |
          (1 - (sum(rate(http_requests_total{status_code!~"5.."}[1h])) /
          sum(rate(http_requests_total[1h])))) > (7.2 * 0.001)
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Burning at least 1% of the error budget per hour"
          runbook: "https://runbooks.example.com/availability-fast-burn"
      # Slow burn: 10% of the 30d budget in 24h is a burn rate of 3 (0.10 * 720h / 24h).
      - alert: AvailabilitySLOSlowBurn
        expr: |
          (1 - (sum(rate(http_requests_total{status_code!~"5.."}[24h])) /
          sum(rate(http_requests_total[24h])))) > (3 * 0.001)
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Burning at least 10% of the error budget per day"
```
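A burn-rate alert pairs a budget fraction ("1% of the budget") with an alert window ("in 1h"), and the error-rate threshold follows from those two numbers plus the objective. The sketch below shows that arithmetic; the helper names `burn_rate` and `error_rate_threshold` are hypothetical, and a 30-day (720-hour) SLO window is assumed as the default.

```python
def burn_rate(budget_fraction: float, alert_window_hours: float,
              slo_window_hours: float = 30 * 24) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return budget_fraction * slo_window_hours / alert_window_hours


def error_rate_threshold(objective: float, budget_fraction: float,
                         alert_window_hours: float,
                         slo_window_hours: float = 30 * 24) -> float:
    """Error ratio that, sustained over the alert window, spends budget_fraction."""
    allowed_error_rate = 1.0 - objective
    return burn_rate(budget_fraction, alert_window_hours, slo_window_hours) * allowed_error_rate


if __name__ == "__main__":
    # Fast burn: 1% of a 30d budget in 1h, for a 99.9% objective.
    print(error_rate_threshold(0.999, 0.01, 1))
    # Slow burn: 10% of a 30d budget in 24h, for a 99.9% objective.
    print(error_rate_threshold(0.999, 0.10, 24))
```

For a 99.9% objective over 30 days, spending 1% of the budget in 1h corresponds to an error ratio of about 0.0072, and spending 10% in 24h to about 0.003.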
Dashboard Template
```json
{
  "title": "Service Health Dashboard",
  "rows": [
    {
      "title": "Golden Signals",
      "panels": [
        {
          "title": "Request Rate",
          "query": "sum(rate(http_requests_total[5m]))",
          "type": "graph"
        },
        {
          "title": "Error Rate",
          "query": "sum(rate(http_requests_total{status_code=~\"5..\"}[5m]))",
          "type": "graph"
        },
        {
          "title": "Latency (p50, p95, p99)",
          "queries": [
            "histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
            "histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
            "histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))"
          ]
        },
        {
          "title": "Saturation (CPU, Memory)",
          "queries": [
            "rate(process_cpu_seconds_total[5m])",
            "process_resident_memory_bytes"
          ]
        }
      ]
    },
    {
      "title": "SLO Tracking",
      "panels": [
        {
          "title": "Error Budget Remaining",
          "query": "1 - ((1 - slo_availability) / (1 - 0.999))"
        }
      ]
    }
  ]
}
```
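The "Error Budget Remaining" panel is easiest to sanity-check with a few hand-computed values. A minimal sketch, assuming the remaining budget is defined as one minus the observed error ratio divided by the allowed error ratio; the function name `error_budget_remaining` is hypothetical.

```python
def error_budget_remaining(availability: float, objective: float) -> float:
    """Fraction of the error budget still unspent.

    1.0 means nothing has been spent, 0.0 means the budget is exhausted,
    and a negative value means the objective has been violated.
    """
    budget = 1.0 - objective
    if budget <= 0:
        raise ValueError("objective must be below 1.0 to leave an error budget")
    return 1.0 - (1.0 - availability) / budget


if __name__ == "__main__":
    # 99.95% measured against a 99.9% objective: about half the budget is left.
    print(round(error_budget_remaining(0.9995, 0.999), 3))
```

With this definition, a measured availability of 99.95% against a 99.9% objective leaves about half of the budget, and anything below 99.9% yields a negative value.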
What to Do When an Alert Fires
Alert Response Guide
HighErrorRate
What it means: More than 5% of requests are failing
First steps:
- Check recent deployments (rollback if needed)
- Review error logs for patterns
- Check dependent services health
- Verify database connectivity
Escalation: If not resolved in 15 min, page on-call lead
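The 15-minute escalation rule can be expressed as a small check that tooling or a chat bot could run. This is only a sketch: `should_escalate` and `ESCALATION_AFTER` are hypothetical names, and paging itself is left out because it depends on the on-call system in use.

```python
from datetime import datetime, timedelta

# Escalation deadline taken from the guide: 15 minutes after the alert fires.
ESCALATION_AFTER = timedelta(minutes=15)


def should_escalate(fired_at: datetime, now: datetime, resolved: bool) -> bool:
    """True when an unresolved alert has been firing for 15 minutes or longer."""
    return (not resolved) and (now - fired_at) >= ESCALATION_AFTER


if __name__ == "__main__":
    fired = datetime(2024, 1, 1, 12, 0)
    print(should_escalate(fired, fired + timedelta(minutes=16), resolved=False))
```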
HighLatency
What it means: p95 latency above 2 seconds
First steps:
- Check database query performance
- Review recent code changes
- Check cache hit rates
- Look for slow external API calls
Temporary mitigation:
- Scale up instances
- Enable aggressive caching
LowAvailability
What it means: Availability below 99.5%
First steps:
- Check infrastructure (AWS status page)
- Review load balancer health checks
- Check for DDoS activity
- Verify auto-scaling is functioning
Output Checklist
- SLOs defined
- Alert rules configured
- Dashboards created
- Runbooks linked
- Response guides documented
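The "Runbooks linked" item can be checked mechanically. The sketch below walks a rule-group structure shaped like the alert rules in this document and lists alerts without a `runbook` annotation. The function name `alerts_missing_runbook` is hypothetical, and `EXAMPLE_RULES` is a hand-written dict standing in for parsed rule YAML, with shortened summaries.

```python
def alerts_missing_runbook(rule_groups: dict) -> list:
    """Return the names of alerting rules that have no 'runbook' annotation."""
    missing = []
    for group in rule_groups.get("groups", []):
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules have no alert name; skip them
            annotations = rule.get("annotations") or {}
            if not annotations.get("runbook"):
                missing.append(rule["alert"])
    return missing


# Stand-in for the parsed rule file; only the fields the check needs.
EXAMPLE_RULES = {
    "groups": [
        {
            "name": "slo_alerts",
            "rules": [
                {
                    "alert": "AvailabilitySLOFastBurn",
                    "annotations": {
                        "summary": "fast burn",
                        "runbook": "https://runbooks.example.com/availability-fast-burn",
                    },
                },
                {
                    "alert": "AvailabilitySLOSlowBurn",
                    "annotations": {"summary": "slow burn"},
                },
            ],
        }
    ]
}

if __name__ == "__main__":
    print(alerts_missing_runbook(EXAMPLE_RULES))
```

Run against this stand-in, the check reports `AvailabilitySLOSlowBurn`, which has a summary but no runbook link.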