alerting-dashboard-builder

Alerting & Dashboard Builder

Build effective alerts and dashboards based on SLOs.

SLO Definition

yaml
slos:
  - name: api_availability
    objective: 99.9%
    window: 30d
    sli: |
      sum(rate(http_requests_total{status_code!~"5.."}[5m])) /
      sum(rate(http_requests_total[5m]))

  - name: api_latency
    objective: 95% # 95% of requests under 500ms
    window: 30d
    sli: |
      histogram_quantile(0.95,
        sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
      ) < 0.5
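
To make the objectives above concrete, here is a minimal sketch that converts an objective and window into an error budget. The function name `error_budget_minutes` is hypothetical and not part of any tool, and treating the budget as minutes of full outage is a simplifying assumption.

```python
def error_budget_minutes(objective: float, window_days: int) -> float:
    """Return the allowed "bad" minutes for an SLO over its window.

    objective is a fraction (0.999 for 99.9%). Expressing the budget as
    full-outage minutes is an assumption made for illustration.
    """
    if not 0.0 < objective < 1.0:
        raise ValueError("objective must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - objective) * window_minutes


if __name__ == "__main__":
    # api_availability: 99.9% over a 30-day window
    print(f"{error_budget_minutes(0.999, 30):.1f} minutes")
```

For the `api_availability` SLO this works out to roughly 43 minutes of total unavailability per 30 days.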

Alert Rules

yaml
groups:
  - name: slo_alerts
    rules:
      # Fast burn: 1% of the 30d budget in 1h (burn rate 7.2, budget 0.001)
      - alert: AvailabilitySLOFastBurn
        expr: |
          (1 - (sum(rate(http_requests_total{status_code!~"5.."}[1h])) /
          sum(rate(http_requests_total[1h])))) > (7.2 * 0.001)
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Burning more than 1% of the error budget per hour"
          runbook: "https://runbooks.example.com/availability-fast-burn"

      # Slow burn: 10% of the 30d budget in 24h (burn rate 3, budget 0.001)
      - alert: AvailabilitySLOSlowBurn
        expr: |
          (1 - (sum(rate(http_requests_total{status_code!~"5.."}[24h])) /
          sum(rate(http_requests_total[24h])))) > (3 * 0.001)
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Burning more than 10% of the error budget per day"
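
The relationship between "X% of the budget in Y hours" and an error-ratio threshold can be sketched as follows. This assumes the common burn-rate formulation (burn rate 1 means the budget lasts exactly the SLO window) and the 99.9% / 30d availability SLO defined earlier. The function names are hypothetical.

```python
def burn_rate(budget_fraction: float, alert_window_hours: float,
              slo_window_days: int = 30) -> float:
    """Burn rate that consumes budget_fraction of the budget in the alert window."""
    slo_window_hours = slo_window_days * 24
    return budget_fraction * slo_window_hours / alert_window_hours


def error_ratio_threshold(objective: float, rate: float) -> float:
    """Error ratio at which the budget burns at the given burn rate."""
    return rate * (1.0 - objective)


if __name__ == "__main__":
    # (label, budget fraction, alert window in hours)
    cases = [("fast burn", 0.01, 1), ("slow burn", 0.10, 24)]
    for label, fraction, hours in cases:
        rate = burn_rate(fraction, hours)
        threshold = error_ratio_threshold(0.999, rate)
        print(f"{label}: burn rate {rate:.1f}, error ratio > {threshold:.4f}")
```

Under these assumptions, 1% of the budget in 1h corresponds to a burn rate of 7.2 and 10% in 24h to a burn rate of 3.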

Dashboard Template

json
{
  "title": "Service Health Dashboard",
  "rows": [
    {
      "title": "Golden Signals",
      "panels": [
        {
          "title": "Request Rate",
          "query": "sum(rate(http_requests_total[5m]))",
          "type": "graph"
        },
        {
          "title": "Error Rate",
          "query": "sum(rate(http_requests_total{status_code=~\"5..\"}[5m]))",
          "type": "graph"
        },
        {
          "title": "Latency (p50, p95, p99)",
          "queries": [
            "histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
            "histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
            "histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))"
          ]
        },
        {
          "title": "Saturation (CPU, Memory)",
          "queries": [
            "rate(process_cpu_seconds_total[5m])",
            "process_resident_memory_bytes"
          ]
        }
      ]
    },
    {
      "title": "SLO Tracking",
      "panels": [
        {
          "title": "Error Budget Remaining",
          "query": "1 - ((1 - slo_availability) / (1 - 0.999))"
        }
      ]
    }
  ]
}
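
The "Error Budget Remaining" panel is easier to sanity-check with a few numbers. The sketch below assumes that `slo_availability` is the measured success ratio over the SLO window and that the remaining budget is one minus the consumed fraction. The function name is hypothetical.

```python
def error_budget_remaining(availability: float, objective: float = 0.999) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = 1.0 - objective                    # allowed failure ratio
    consumed = (1.0 - availability) / budget    # share of the budget used
    return 1.0 - consumed


if __name__ == "__main__":
    for availability in (1.0, 0.9995, 0.999, 0.998):
        remaining = error_budget_remaining(availability)
        print(f"availability={availability}: {remaining:.0%} of budget left")
```

A value of 0 means the budget is exactly spent, and a negative value means the SLO has been violated for the window.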

What to Do When an Alert Fires

Alert Response Guide

HighErrorRate

What it means: More than 5% of requests are failing
First steps:
  1. Check recent deployments (rollback if needed)
  2. Review error logs for patterns
  3. Check the health of dependent services
  4. Verify database connectivity
Escalation: If not resolved in 15 min, page on-call lead
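
The escalation rule above can be expressed as a small check. This is a sketch only: the function name and the idea of tracking a `resolved` flag are assumptions for illustration, not part of any alerting tool.

```python
from datetime import datetime, timedelta

# Escalation delay taken from the HighErrorRate guide above.
ESCALATION_AFTER = timedelta(minutes=15)


def should_escalate(fired_at: datetime, now: datetime, resolved: bool) -> bool:
    """Return True when an unresolved alert has been open for 15 minutes or more."""
    return not resolved and (now - fired_at) >= ESCALATION_AFTER


if __name__ == "__main__":
    fired = datetime(2024, 1, 1, 12, 0)
    print(should_escalate(fired, datetime(2024, 1, 1, 12, 10), resolved=False))
    print(should_escalate(fired, datetime(2024, 1, 1, 12, 20), resolved=False))
```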

HighLatency

What it means: p95 latency above 2 seconds
First steps:
  1. Check database query performance
  2. Review recent code changes
  3. Check cache hit rates
  4. Look for slow external API calls
Temporary mitigation:
  • Scale up instances
  • Enable aggressive caching

LowAvailability

What it means: Availability below 99.5%
First steps:
  1. Check infrastructure (AWS status page)
  2. Review load balancer health checks
  3. Check for DDoS activity
  4. Verify that auto-scaling is functioning

Output Checklist

  • SLOs defined
  • Alert rules configured
  • Dashboards created
  • Runbooks linked
  • Response guides documented