Grafana Alerting & IRM


Alert Rules

Grafana-Managed Alert Rule (YAML provisioning)

yaml
# provisioning/alerting/rules.yaml
apiVersion: 1
groups:
  - orgId: 1
    name: MyAlertGroup
    folder: MyFolder
    interval: 1m
    rules:
      - uid: high-error-rate
        title: High Error Rate
        condition: C
        data:
          - refId: A
            datasourceUid: prometheus
            relativeTimeRange:
              from: 300  # 5 minutes
              to: 0
            model:
              expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          - refId: B
            datasourceUid: __expr__
            model:
              type: reduce
              refId: B
              expression: A
              reducer: last
          - refId: C
            datasourceUid: __expr__
            model:
              type: math
              refId: C
              expression: $B > 0.05
        noDataState: NoData
        execErrState: Alerting
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $values.B }}%"
          runbook_url: "https://runbooks.example.com/high-error-rate"
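
The rule's three-step pipeline (query A, reduce B, math C) can be sketched in plain Python. This is an illustrative stand-in, not Grafana code: the 0.05 threshold matches the rule above, while the service names and rate values are made-up substitutes for real query results:

```python
# A: per-service 5xx rate series (made-up values standing in for PromQL results)
a_series = {"checkout": [0.08, 0.12], "search": [0.02, 0.01]}

# B: reduce each series to a single number (reducer: last)
b_values = {svc: pts[-1] for svc, pts in a_series.items()}

# C: the math expression "$B > 0.05" -- the alert condition, per service
c_firing = {svc: v > 0.05 for svc, v in b_values.items()}

print(c_firing)  # {'checkout': True, 'search': False}
```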

Prometheus/Mimir Alert Rule (ruler)

yaml
groups:
  - name: service-alerts
    interval: 1m
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate: {{ $labels.service }}"

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency > 1s on {{ $labels.service }}"

      # Recording rule
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)

Loki Alert Rule (LogQL)

yaml
groups:
  - name: log-alerts
    rules:
      - alert: HighErrorLogs
        expr: |
          sum(rate({app="myapp"} |= "error" [5m])) by (app)
          /
          sum(rate({app="myapp"}[5m])) by (app)
          > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "High error log rate for {{ $labels.app }}"

      - alert: CredentialsLeak
        expr: |
          sum by (cluster, job, pod) (
            count_over_time({namespace="prod"} |~ "https?://(\\w+):(\\w+)@" [5m]) > 0
          )
        for: 5m
        labels:
          severity: critical

Contact Points (YAML provisioning)

yaml
# provisioning/alerting/contact_points.yaml
apiVersion: 1
contactPoints:
  - orgId: 1
    name: pagerduty-critical
    receivers:
      - uid: pd-receiver
        type: pagerduty
        settings:
          integrationKey: YOUR_PAGERDUTY_KEY
          severity: critical
  - orgId: 1
    name: slack-alerts
    receivers:
      - uid: slack-receiver
        type: slack
        settings:
          url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
          channel: '#alerts'
          username: Grafana
          icon_emoji: ':grafana:'
          title: '{{ template "slack.default.title" . }}'
          text: '{{ template "slack.default.text" . }}'
  - orgId: 1
    name: email-alerts
    receivers:
      - uid: email-receiver
        type: email
        settings:
          addresses: 'oncall@example.com;alerts@example.com'
  - orgId: 1
    name: webhook-alerts
    receivers:

Notification Policies (YAML provisioning)

yaml
# provisioning/alerting/policies.yaml
apiVersion: 1
policies:
  - orgId: 1
    receiver: default-receiver
    group_by: ['alertname', 'cluster', 'service']
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 12h
    routes:
      # Critical alerts → PagerDuty
      - receiver: pagerduty-critical
        matchers:
          - severity = critical
        group_wait: 10s
        group_interval: 1m
        repeat_interval: 4h
      # Platform team alerts → Slack
      - receiver: slack-alerts
        matchers:
          - team = platform
        routes:
          # Critical platform → page immediately
          - receiver: pagerduty-critical
            matchers:
              - severity = critical
      # Everything else → email
      - receiver: email-alerts
        matchers:
          - severity =~ "warning|info"
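
Routing through the tree is depth-first and first-match-wins, with child routes refining their parent. A toy Python walk over the policy above (receivers and labels are copied from the policy; the `severity =~ "warning|info"` regex matcher is simplified to a catch-all, so this is a sketch, not Alertmanager semantics in full):

```python
# Toy notification-policy tree: each route has a receiver, equality
# matchers, and optional nested routes that refine the parent.
TREE = {
    "receiver": "default-receiver",
    "routes": [
        {"receiver": "pagerduty-critical", "match": {"severity": "critical"}},
        {"receiver": "slack-alerts", "match": {"team": "platform"},
         "routes": [
             {"receiver": "pagerduty-critical", "match": {"severity": "critical"}},
         ]},
        {"receiver": "email-alerts", "match": {}},  # stand-in for severity =~ "warning|info"
    ],
}

def route(labels, node=TREE):
    """Return the receiver for an alert's label set: first matching child wins."""
    for child in node.get("routes", []):
        if all(labels.get(k) == v for k, v in child["match"].items()):
            return route(labels, child)  # recurse: nested routes refine
    return node["receiver"]

print(route({"team": "platform", "severity": "critical"}))  # pagerduty-critical
print(route({"severity": "warning"}))                       # email-alerts
```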

Silences

Silences suppress notifications for matching alerts without stopping evaluation.

bash
# Via API - create a silence
curl -X POST https://grafana.example.com/api/alertmanager/grafana/api/v2/silences \
  -H 'Content-Type: application/json' \
  -d '{
    "matchers": [
      {"name": "alertname", "value": "HighErrorRate", "isRegex": false},
      {"name": "env", "value": "staging", "isRegex": false}
    ],
    "startsAt": "2024-01-01T00:00:00Z",
    "endsAt": "2024-01-01T02:00:00Z",
    "comment": "Maintenance window",
    "createdBy": "admin"
  }'
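
The JSON body can be generated instead of hand-edited. A small illustrative helper (the helper name and the fixed start time are inventions for the example):

```python
import json
from datetime import datetime, timedelta, timezone

def make_silence(matchers, hours, comment, created_by="admin"):
    """Build the POST body for the Alertmanager v2 silences endpoint."""
    start = datetime(2024, 1, 1, tzinfo=timezone.utc)  # fixed start, for the example
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return json.dumps({
        "matchers": [{"name": k, "value": v, "isRegex": False} for k, v in matchers],
        "startsAt": start.strftime(fmt),
        "endsAt": (start + timedelta(hours=hours)).strftime(fmt),
        "comment": comment,
        "createdBy": created_by,
    })

body = make_silence([("alertname", "HighErrorRate"), ("env", "staging")],
                    hours=2, comment="Maintenance window")
```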

Alert Rule States

| State | Description |
|-------|-------------|
| Normal | Condition not met |
| Pending | Condition met, waiting out the `for` duration |
| Firing | Condition met for the full `for` duration |
| NoData | Query returned no data |
| Error | Query/evaluation error |
| Recovering | Was firing; condition no longer met |
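
The Pending→Firing handoff is just the condition holding for the whole `for` window. A toy evaluator of that transition, assuming a 1m evaluation interval (this is illustrative, not Grafana's actual scheduler):

```python
def walk_states(condition_met, for_minutes=5):
    """Map a sequence of per-interval condition results to rule states.

    The condition must hold for `for_minutes` consecutive 1m evaluations
    before Pending becomes Firing; any miss resets the streak to Normal.
    """
    states, streak = [], 0
    for met in condition_met:
        streak = streak + 1 if met else 0
        if not met:
            states.append("Normal")
        elif streak >= for_minutes:
            states.append("Firing")
        else:
            states.append("Pending")
    return states

print(walk_states([True] * 6 + [False]))
```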

SLOs

yaml
# SLO configuration (via Grafana UI or API)
# Grafana auto-generates recording rules, dashboards, and burn-rate alerts

# Generated recording rules example:
groups:
  - name: slo_availability
    interval: 1m
    rules:
      - record: slo:availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
      - record: slo:error_budget:remaining
        expr: |
          (slo:availability:ratio_rate30d - 0.999) / (1 - 0.999)
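
The second recording rule's arithmetic, applied to assumed availability values for the 99.9% target:

```python
SLO_TARGET = 0.999

def budget_remaining(availability_30d):
    # Mirrors: (slo:availability:ratio_rate30d - 0.999) / (1 - 0.999)
    # 1.0 means the full error budget is intact; 0 means it is exhausted.
    return (availability_30d - SLO_TARGET) / (1 - SLO_TARGET)

print(round(budget_remaining(0.9995), 3))  # 0.5 -> half the budget left
```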

Burn rate alerts (auto-generated by Grafana SLO)

yaml
  - alert: SLOBurnRateHigh
    expr: slo:burn_rate:ratio_rate1h > 14.4  # fast burn: ~2% of a 30d budget per hour
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "SLO burn rate critical for {{ $labels.service }}"
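
The 14.4 threshold comes from window arithmetic: a sustained burn rate b consumes b × window ÷ SLO period of the error budget, so over the common 30-day period a rate of 14.4 for 1h spends 14.4 × 1/720 ≈ 2% of the budget. A quick check (the 30-day period is an assumption; verify it against your SLO's configured window):

```python
SLO_PERIOD_H = 30 * 24  # 720h, assuming a 30-day SLO window

def budget_consumed(burn_rate, window_h):
    # Fraction of the whole error budget spent at this burn rate
    # sustained over the given window.
    return burn_rate * window_h / SLO_PERIOD_H

# fast burn: 14.4 sustained for 1h spends ~2% of the budget
# slow burn:  6.0 sustained for 6h spends ~5% of the budget
```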

IRM - On-Call and Incidents

Key IRM Capabilities

  • On-Call Schedules: Rotating shifts, overrides, escalation policies
  • Alert Routing: Route from Grafana Alerting, Prometheus, Datadog, PagerDuty, etc.
  • Incident Management: Declare incidents, add participants, track tasks/timeline
  • Escalation Chains: Auto-escalate if no response after N minutes
  • Integrations: Slack, Teams, Telegram, GitHub, Jira, StatusPage
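
An escalation chain is essentially a timer over an ordered list of steps, cut short by an acknowledgement. A toy sketch of that "auto-escalate after N minutes" behaviour (the step names and timings are invented):

```python
# Toy escalation chain: (minutes after firing, action). Illustrative only.
CHAIN = [(0, "notify on-call"), (5, "notify secondary"), (15, "page team lead")]

def steps_fired(minutes_since_alert, acked_at=None):
    """Return the escalation steps that have run by now.

    Steps stop escalating once the alert is acknowledged: anything
    scheduled after `acked_at` never fires.
    """
    cutoff = acked_at if acked_at is not None else minutes_since_alert
    return [step for t, step in CHAIN
            if t <= cutoff and t <= minutes_since_alert]

print(steps_fired(10))              # first two steps have run
print(steps_fired(20, acked_at=3))  # ack at 3m stopped the escalation
```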

IRM Integration Sources

| Source | Setup |
|--------|-------|
| Grafana Alerting | Native; configure in contact points |
| Prometheus Alertmanager | Webhook URL from IRM |
| Datadog | Webhook integration |
| PagerDuty | Event integration |
| Jira | Issue alerts |
| Custom | Generic webhook |

Provisioning Directory Structure

provisioning/alerting/
├── alert_rules.yaml        # Alert and recording rules
├── contact_points.yaml     # Notification destinations
├── notification_policies.yaml  # Routing tree
├── templates.yaml          # Message templates
└── mute_timings.yaml       # Recurring mute windows

API Provisioning (Keeps UI Editable)

bash
# Get current notification policy
curl https://grafana.example.com/api/v1/provisioning/policies \
  -H 'Authorization: Bearer <token>'

# Update (add X-Disable-Provenance to keep it editable in the UI)
curl -X PUT https://grafana.example.com/api/v1/provisioning/policies \
  -H 'Authorization: Bearer <token>' \
  -H 'X-Disable-Provenance: true' \
  -H 'Content-Type: application/json' \
  -d @policy.json

# Create alert rule
curl -X POST https://grafana.example.com/api/v1/provisioning/alert-rules \
  -H 'Authorization: Bearer <token>' \
  -H 'Content-Type: application/json' \
  -d @rule.json

Notification Templates


Custom Slack template

{{ define "slack.custom.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
{{ end }}

{{ define "slack.custom.text" }}
{{ range .Alerts }}
Alert: {{ .Annotations.summary }}
Severity: {{ .Labels.severity }}
Service: {{ .Labels.service }}
Details: {{ .Annotations.description }}
{{ if .Annotations.runbook_url }}Runbook: {{ .Annotations.runbook_url }}{{ end }}
{{ end }}
{{ end }}

References

  • Alerting Rules
  • IRM & On-Call
  • SLOs