Grafana Alerting & IRM


Alert Rules

Grafana-Managed Alert Rule (YAML provisioning)

yaml
# provisioning/alerting/rules.yaml
apiVersion: 1
groups:
  - orgId: 1
    name: MyAlertGroup
    folder: MyFolder
    interval: 1m
    rules:
      - uid: high-error-rate
        title: High Error Rate
        condition: C
        data:
          - refId: A
            datasourceUid: prometheus
            relativeTimeRange:
              from: 300  # 5 minutes
              to: 0
            model:
              expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          - refId: B
            datasourceUid: __expr__
            model:
              type: reduce
              refId: B
              expression: A
              reducer: last
          - refId: C
            datasourceUid: __expr__
            model:
              type: math
              refId: C
              expression: $B > 0.05
        noDataState: NoData
        execErrState: Alerting
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $values.B }}%"
          runbook_url: "https://runbooks.example.com/high-error-rate"
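
The rule's three-step pipeline (query A, reduce B, math C) can be sketched in plain Python. This is an illustrative stand-in, not Grafana code: the 0.05 threshold matches the rule above, while the service names and rate values are made-up substitutes for real query results:

```python
# A: per-service 5xx rate series (made-up values standing in for PromQL results)
a_series = {"checkout": [0.08, 0.12], "search": [0.02, 0.01]}

# B: reduce each series to a single number (reducer: last)
b_values = {svc: pts[-1] for svc, pts in a_series.items()}

# C: the math expression "$B > 0.05" -- the alert condition, per service
c_firing = {svc: v > 0.05 for svc, v in b_values.items()}

print(c_firing)  # {'checkout': True, 'search': False}
```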

Prometheus/Mimir Alert Rule (ruler)

yaml
groups:
  - name: service-alerts
    interval: 1m
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate: {{ $labels.service }}"

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency > 1s on {{ $labels.service }}"

      # Recording rule
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)

Loki Alert Rule (LogQL)

yaml
groups:
  - name: log-alerts
    rules:
      - alert: HighErrorLogs
        expr: |
          sum(rate({app="myapp"} |= "error" [5m])) by (app)
          /
          sum(rate({app="myapp"}[5m])) by (app)
          > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "High error log rate for {{ $labels.app }}"

      - alert: CredentialsLeak
        expr: |
          sum by (cluster, job, pod) (
            count_over_time({namespace="prod"} |~ "https?://(\\w+):(\\w+)@" [5m]) > 0
          )
        for: 5m
        labels:
          severity: critical

Contact Points (YAML provisioning)

yaml
# provisioning/alerting/contact_points.yaml
apiVersion: 1
contactPoints:
  - orgId: 1
    name: pagerduty-critical
    receivers:
      - uid: pd-receiver
        type: pagerduty
        settings:
          integrationKey: YOUR_PAGERDUTY_KEY
          severity: critical
  - orgId: 1
    name: slack-alerts
    receivers:
      - uid: slack-receiver
        type: slack
        settings:
          url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
          channel: '#alerts'
          username: Grafana
          icon_emoji: ':grafana:'
          title: '{{ template "slack.default.title" . }}'
          text: '{{ template "slack.default.text" . }}'
  - orgId: 1
    name: email-alerts
    receivers:
      - uid: email-receiver
        type: email
        settings:
          addresses: 'oncall@example.com;alerts@example.com'
  - orgId: 1
    name: webhook-alerts
    receivers:

Notification Policies (YAML provisioning)

yaml
# provisioning/alerting/policies.yaml
apiVersion: 1
policies:
  - orgId: 1
    receiver: default-receiver
    group_by: ['alertname', 'cluster', 'service']
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 12h
    routes:
      # Critical alerts → PagerDuty
      - receiver: pagerduty-critical
        matchers:
          - severity = critical
        group_wait: 10s
        group_interval: 1m
        repeat_interval: 4h
      # Platform team alerts → Slack
      - receiver: slack-alerts
        matchers:
          - team = platform
        routes:
          # Critical platform → page immediately
          - receiver: pagerduty-critical
            matchers:
              - severity = critical
      # Everything else → email
      - receiver: email-alerts
        matchers:
          - severity =~ "warning|info"
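
Routing through the tree is depth-first and first-match-wins, with child routes refining their parent. A toy Python walk over the policy above (receivers and labels are copied from the policy; the `severity =~ "warning|info"` regex matcher is simplified to a catch-all, so this is a sketch, not Alertmanager semantics in full):

```python
# Toy notification-policy tree: each route has a receiver, equality
# matchers, and optional nested routes that refine the parent.
TREE = {
    "receiver": "default-receiver",
    "routes": [
        {"receiver": "pagerduty-critical", "match": {"severity": "critical"}},
        {"receiver": "slack-alerts", "match": {"team": "platform"},
         "routes": [
             {"receiver": "pagerduty-critical", "match": {"severity": "critical"}},
         ]},
        {"receiver": "email-alerts", "match": {}},  # stand-in for severity =~ "warning|info"
    ],
}

def route(labels, node=TREE):
    """Return the receiver for an alert's label set: first matching child wins."""
    for child in node.get("routes", []):
        if all(labels.get(k) == v for k, v in child["match"].items()):
            return route(labels, child)  # recurse: nested routes refine
    return node["receiver"]

print(route({"team": "platform", "severity": "critical"}))  # pagerduty-critical
print(route({"severity": "warning"}))                       # email-alerts
```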

Silences

Silences suppress notifications for matching alerts without stopping evaluation.

bash
# Via API - create a silence
curl -X POST https://grafana.example.com/api/alertmanager/grafana/api/v2/silences \
  -H 'Content-Type: application/json' \
  -d '{
    "matchers": [
      {"name": "alertname", "value": "HighErrorRate", "isRegex": false},
      {"name": "env", "value": "staging", "isRegex": false}
    ],
    "startsAt": "2024-01-01T00:00:00Z",
    "endsAt": "2024-01-01T02:00:00Z",
    "comment": "Maintenance window",
    "createdBy": "admin"
  }'
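
The JSON body can be generated instead of hand-edited. A small illustrative helper (the helper name and the fixed start time are inventions for the example):

```python
import json
from datetime import datetime, timedelta, timezone

def make_silence(matchers, hours, comment, created_by="admin"):
    """Build the POST body for the Alertmanager v2 silences endpoint."""
    start = datetime(2024, 1, 1, tzinfo=timezone.utc)  # fixed start, for the example
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return json.dumps({
        "matchers": [{"name": k, "value": v, "isRegex": False} for k, v in matchers],
        "startsAt": start.strftime(fmt),
        "endsAt": (start + timedelta(hours=hours)).strftime(fmt),
        "comment": comment,
        "createdBy": created_by,
    })

body = make_silence([("alertname", "HighErrorRate"), ("env", "staging")],
                    hours=2, comment="Maintenance window")
```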

Alert Rule States

| State | Description |
|-------|-------------|
| Normal | Condition not met |
| Pending | Condition met, waiting out the `for` duration |
| Firing | Condition met for the full `for` duration |
| NoData | Query returned no data |
| Error | Query/evaluation error |
| Recovering | Was firing; condition no longer met |
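
The Pending→Firing handoff is just the condition holding for the whole `for` window. A toy evaluator of that transition, assuming a 1m evaluation interval (this is illustrative, not Grafana's actual scheduler):

```python
def walk_states(condition_met, for_minutes=5):
    """Map a sequence of per-interval condition results to rule states.

    The condition must hold for `for_minutes` consecutive 1m evaluations
    before Pending becomes Firing; any miss resets the streak to Normal.
    """
    states, streak = [], 0
    for met in condition_met:
        streak = streak + 1 if met else 0
        if not met:
            states.append("Normal")
        elif streak >= for_minutes:
            states.append("Firing")
        else:
            states.append("Pending")
    return states

print(walk_states([True] * 6 + [False]))
```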

SLOs

yaml
# SLO configuration (via Grafana UI or API)
# Grafana auto-generates recording rules, dashboards, and burn-rate alerts

# Generated recording rules example:
groups:
  - name: slo_availability
    interval: 1m
    rules:
      - record: slo:availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
      - record: slo:error_budget:remaining
        expr: |
          (slo:availability:ratio_rate30d - 0.999) / (1 - 0.999)
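
The second recording rule's arithmetic, applied to assumed availability values for the 99.9% target:

```python
SLO_TARGET = 0.999

def budget_remaining(availability_30d):
    # Mirrors: (slo:availability:ratio_rate30d - 0.999) / (1 - 0.999)
    # 1.0 means the full error budget is intact; 0 means it is exhausted.
    return (availability_30d - SLO_TARGET) / (1 - SLO_TARGET)

print(round(budget_remaining(0.9995), 3))  # 0.5 -> half the budget left
```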

Burn rate alerts (auto-generated by Grafana SLO)

yaml
  - alert: SLOBurnRateHigh
    expr: slo:burn_rate:ratio_rate1h > 14.4  # fast burn: ~2% of a 30d budget per hour
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "SLO burn rate critical for {{ $labels.service }}"
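
The 14.4 threshold comes from window arithmetic: a sustained burn rate b consumes b × window ÷ SLO period of the error budget, so over the common 30-day period a rate of 14.4 for 1h spends 14.4 × 1/720 ≈ 2% of the budget. A quick check (the 30-day period is an assumption; verify it against your SLO's configured window):

```python
SLO_PERIOD_H = 30 * 24  # 720h, assuming a 30-day SLO window

def budget_consumed(burn_rate, window_h):
    # Fraction of the whole error budget spent at this burn rate
    # sustained over the given window.
    return burn_rate * window_h / SLO_PERIOD_H

# fast burn: 14.4 sustained for 1h spends ~2% of the budget
# slow burn:  6.0 sustained for 6h spends ~5% of the budget
```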

IRM - On-Call and Incidents

Key IRM Capabilities

  • On-Call Schedules: Rotating shifts, overrides, escalation policies
  • Alert Routing: Route from Grafana Alerting, Prometheus, Datadog, PagerDuty, etc.
  • Incident Management: Declare incidents, add participants, track tasks/timeline
  • Escalation Chains: Auto-escalate if no response after N minutes
  • Integrations: Slack, Teams, Telegram, GitHub, Jira, StatusPage
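
An escalation chain is essentially a timer over an ordered list of steps, cut short by an acknowledgement. A toy sketch of that "auto-escalate after N minutes" behaviour (the step names and timings are invented):

```python
# Toy escalation chain: (minutes after firing, action). Illustrative only.
CHAIN = [(0, "notify on-call"), (5, "notify secondary"), (15, "page team lead")]

def steps_fired(minutes_since_alert, acked_at=None):
    """Return the escalation steps that have run by now.

    Steps stop escalating once the alert is acknowledged: anything
    scheduled after `acked_at` never fires.
    """
    cutoff = acked_at if acked_at is not None else minutes_since_alert
    return [step for t, step in CHAIN
            if t <= cutoff and t <= minutes_since_alert]

print(steps_fired(10))              # first two steps have run
print(steps_fired(20, acked_at=3))  # ack at 3m stopped the escalation
```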

IRM Integration Sources

| Source | Setup |
|--------|-------|
| Grafana Alerting | Native; configure in contact points |
| Prometheus Alertmanager | Webhook URL from IRM |
| Datadog | Webhook integration |
| PagerDuty | Event integration |
| Jira | Issue alerts |
| Custom | Generic webhook |

Provisioning Directory Structure

provisioning/alerting/
├── alert_rules.yaml        # Alert and recording rules
├── contact_points.yaml     # Notification destinations
├── notification_policies.yaml  # Routing tree
├── templates.yaml          # Message templates
└── mute_timings.yaml       # Recurring mute windows

API Provisioning (Keeps UI Editable)

bash
# Get current notification policy
curl https://grafana.example.com/api/v1/provisioning/policies \
  -H 'Authorization: Bearer <token>'

# Update (add X-Disable-Provenance to keep it editable in the UI)
curl -X PUT https://grafana.example.com/api/v1/provisioning/policies \
  -H 'Authorization: Bearer <token>' \
  -H 'X-Disable-Provenance: true' \
  -H 'Content-Type: application/json' \
  -d @policy.json

# Create alert rule
curl -X POST https://grafana.example.com/api/v1/provisioning/alert-rules \
  -H 'Authorization: Bearer <token>' \
  -H 'Content-Type: application/json' \
  -d @rule.json

Notification Templates


Custom Slack template

{{ define "slack.custom.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
{{ end }}

{{ define "slack.custom.text" }}
{{ range .Alerts }}
Alert: {{ .Annotations.summary }}
Severity: {{ .Labels.severity }}
Service: {{ .Labels.service }}
Details: {{ .Annotations.description }}
{{ if .Annotations.runbook_url }}Runbook: {{ .Annotations.runbook_url }}{{ end }}
{{ end }}
{{ end }}

References

  • Alerting Rules
  • IRM & On-Call
  • SLOs