# Grafana Alerting & IRM
## Alert Rules
### Grafana-Managed Alert Rule (YAML provisioning)

```yaml
# provisioning/alerting/rules.yaml
apiVersion: 1
groups:
  - orgId: 1
    name: MyAlertGroup
    folder: MyFolder
    interval: 1m
    rules:
      - uid: high-error-rate
        title: High Error Rate
        condition: C
        data:
          - refId: A
            datasourceUid: prometheus
            relativeTimeRange:
              from: 300  # 5 minutes
              to: 0
            model:
              expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          - refId: B
            datasourceUid: __expr__
            model:
              type: reduce
              refId: B
              expression: A
              reducer: last
          - refId: C
            datasourceUid: __expr__
            model:
              type: math
              refId: C
              expression: $B > 0.05
        noDataState: NoData
        execErrState: Alerting
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $values.B }}%"
          runbook_url: "https://runbooks.example.com/high-error-rate"
```
### Prometheus/Mimir Alert Rule (ruler)

```yaml
groups:
  - name: service-alerts
    interval: 1m
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            /
          sum(rate(http_requests_total[5m])) by (service)
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate: {{ $labels.service }}"
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency > 1s on {{ $labels.service }}"
      # Recording rule
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
```
### Loki Alert Rule (LogQL)
```yaml
groups:
  - name: log-alerts
    rules:
      - alert: HighErrorLogs
        expr: |
          sum(rate({app="myapp"} |= "error" [5m])) by (app)
            /
          sum(rate({app="myapp"}[5m])) by (app)
          > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "High error log rate for {{ $labels.app }}"
      - alert: CredentialsLeak
        expr: |
          sum by (cluster, job, pod) (
            count_over_time({namespace="prod"} |~ "https?://(\\w+):(\\w+)@" [5m]) > 0
          )
        for: 5m
        labels:
          severity: critical
```
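The CredentialsLeak pattern above can be sanity-checked outside Loki. Python's `re` accepts the same expression (Loki uses RE2, which agrees with `re` for this simple pattern); the sample log lines are invented for illustration:

```python
import re

# Same pattern as the CredentialsLeak rule, single-escaped outside YAML.
pat = re.compile(r"https?://(\w+):(\w+)@")

leak = "GET https://admin:hunter2@db.internal/users"
clean = "GET https://db.internal/users"

print(bool(pat.search(leak)))   # True  - credentials embedded in the URL
print(bool(pat.search(clean)))  # False
```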
### Contact Points (YAML provisioning)
```yaml
# provisioning/alerting/contact_points.yaml
apiVersion: 1
contactPoints:
  - orgId: 1
    name: pagerduty-critical
    receivers:
      - uid: pd-receiver
        type: pagerduty
        settings:
          integrationKey: YOUR_PAGERDUTY_KEY
          severity: critical
  - orgId: 1
    name: slack-alerts
    receivers:
      - uid: slack-receiver
        type: slack
        settings:
          url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
          channel: '#alerts'
          username: Grafana
          icon_emoji: ':grafana:'
          title: '{{ template "slack.default.title" . }}'
          text: '{{ template "slack.default.text" . }}'
  - orgId: 1
    name: email-alerts
    receivers:
      - uid: email-receiver
        type: email
        settings:
          addresses: 'oncall@example.com;alerts@example.com'
  - orgId: 1
    name: webhook-alerts
    receivers:
      - uid: webhook-receiver
        type: webhook
        settings:
          url: https://your-endpoint.com/grafana-alerts
          httpMethod: POST
```
### Notification Policies (YAML provisioning)

```yaml
# provisioning/alerting/policies.yaml
apiVersion: 1
policies:
  - orgId: 1
    receiver: default-receiver
    group_by: ['alertname', 'cluster', 'service']
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 12h
    routes:
      # Critical alerts → PagerDuty
      - receiver: pagerduty-critical
        matchers:
          - severity = critical
        group_wait: 10s
        group_interval: 1m
        repeat_interval: 4h
      # Platform team alerts → Slack
      - receiver: slack-alerts
        matchers:
          - team = platform
        routes:
          # Critical platform → page immediately
          - receiver: pagerduty-critical
            matchers:
              - severity = critical
      # Everything else → email
      - receiver: email-alerts
        matchers:
          - severity =~ "warning|info"
```
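As a mental model for how a routing tree picks a receiver, here is a toy matcher walk: the first child route whose matchers all hold wins, recursing into its own children, otherwise the parent's receiver applies. This is an illustration only, not Grafana's implementation — real policies also support `continue`, regex matchers, and inherited grouping options, and matchers are `key = value` strings rather than dicts:

```python
# Toy routing-tree walk. Matchers are simplified to equality dicts.
def route(policy: dict, labels: dict) -> str:
    for child in policy.get("routes", []):
        if all(labels.get(k) == v for k, v in child["matchers"].items()):
            return route(child, labels)  # first match wins, recurse deeper
    return policy["receiver"]            # no child matched: parent receiver

policy = {
    "receiver": "default-receiver",
    "routes": [
        {"receiver": "pagerduty-critical",
         "matchers": {"severity": "critical"}},
        {"receiver": "slack-alerts",
         "matchers": {"team": "platform"},
         "routes": [
             {"receiver": "pagerduty-critical",
              "matchers": {"severity": "critical"}},
         ]},
    ],
}

print(route(policy, {"team": "platform", "severity": "critical"}))
# -> pagerduty-critical (caught by the first child route)
print(route(policy, {"team": "platform", "severity": "warning"}))
# -> slack-alerts
print(route(policy, {"severity": "info"}))
# -> default-receiver
```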
### Silences

Silences suppress notifications for matching alerts without stopping rule evaluation.

```bash
# Via API - create a silence
curl -X POST https://grafana.example.com/api/alertmanager/grafana/api/v2/silences \
  -H 'Content-Type: application/json' \
  -d '{
    "matchers": [
      {"name": "alertname", "value": "HighErrorRate", "isRegex": false},
      {"name": "env", "value": "staging", "isRegex": false}
    ],
    "startsAt": "2024-01-01T00:00:00Z",
    "endsAt": "2024-01-01T02:00:00Z",
    "comment": "Maintenance window",
    "createdBy": "admin"
  }'
```
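To avoid hand-writing that JSON for every maintenance window, a small helper can build the body. The field names follow the Alertmanager v2 silences payload shown in the curl example; the `silence_payload` function itself is our sketch, not a Grafana API:

```python
import json
from datetime import datetime, timedelta, timezone

def silence_payload(matchers: dict, hours: float,
                    comment: str, created_by: str) -> str:
    """Build the JSON body for POST .../api/v2/silences."""
    start = datetime.now(timezone.utc)
    return json.dumps({
        "matchers": [
            {"name": k, "value": v, "isRegex": False}
            for k, v in matchers.items()
        ],
        "startsAt": start.isoformat(),                            # RFC 3339
        "endsAt": (start + timedelta(hours=hours)).isoformat(),
        "comment": comment,
        "createdBy": created_by,
    })

body = silence_payload(
    {"alertname": "HighErrorRate", "env": "staging"},
    hours=2, comment="Maintenance window", created_by="admin",
)
```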
### Alert Rule States

| State | Description |
|---|---|
| Normal | Condition not met |
| Pending | Condition met, waiting for the `for` duration |
| Firing | Condition met for the full `for` duration |
| NoData | Query returned no data |
| Error | Query/evaluation error |
| Recovering | Was firing, condition no longer met |
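The Pending → Firing handoff is gated by the rule's `for` duration; a toy evaluation loop (illustrative only, not Grafana's scheduler, which also handles NoData and Error) makes the timing concrete for `for: 5m`:

```python
# Minimal Normal/Pending/Firing transition driven by `for: 5m` (300 s).
def next_state(breached: bool, pending_since, now: int, for_secs: int = 300):
    if not breached:
        return "Normal", None
    if pending_since is None:            # condition just started breaching
        return "Pending", now
    if now - pending_since >= for_secs:  # breached for the full `for` window
        return "Firing", pending_since
    return "Pending", pending_since

state, since = "Normal", None
history = []
for now, breached in [(0, True), (60, True), (300, True), (360, False)]:
    state, since = next_state(breached, since, now)
    history.append((now, state))

print(history)
# [(0, 'Pending'), (60, 'Pending'), (300, 'Firing'), (360, 'Normal')]
```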
## SLOs

```yaml
# SLO configuration (via Grafana UI or API)
# Grafana auto-generates recording rules, dashboards, and burn-rate alerts

# Generated recording rules example:
groups:
  - name: slo_availability
    interval: 1m
    rules:
      - record: slo:availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
      - record: slo:error_budget:remaining
        expr: |
          (slo:availability:ratio_rate30d - 0.999) / (1 - 0.999)

      # Burn rate alerts (auto-generated by Grafana SLO)
      - alert: SLOBurnRateHigh
        expr: |
          slo:burn_rate:ratio_rate1h > 14.4  # 1h window: ~2% of a 30d budget per hour
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "SLO burn rate critical for {{ $labels.service }}"
```
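The 14.4 threshold comes from multiwindow burn-rate arithmetic: a burn rate of B exhausts a 30-day budget in 30d/B, so spending fraction f of the budget within h hours needs B = f × 720 / h. A quick check of the classic alerting tiers (the function name is ours):

```python
HOURS_IN_30D = 30 * 24  # 720

def burn_rate(budget_fraction: float, hours: float) -> float:
    """Burn-rate multiplier that spends `budget_fraction` of a
    30-day error budget within `hours`."""
    return budget_fraction * HOURS_IN_30D / hours

print(burn_rate(0.02, 1))   # 14.4 -> page   (2% of budget in 1h)
print(burn_rate(0.05, 6))   # 6.0  -> page   (5% of budget in 6h)
print(burn_rate(0.10, 72))  # 1.0  -> ticket (10% of budget in 3d)
```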
## IRM - On-Call and Incidents

### Key IRM Capabilities

- On-Call Schedules: Rotating shifts, overrides, escalation policies
- Alert Routing: Route from Grafana Alerting, Prometheus, Datadog, PagerDuty, etc.
- Incident Management: Declare incidents, add participants, track tasks/timeline
- Escalation Chains: Auto-escalate if no response after N minutes
- Integrations: Slack, Teams, Telegram, GitHub, Jira, StatusPage
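The escalation-chain behavior above (notify, wait N minutes, escalate if unacknowledged) can be pictured with a toy walk over the chain. Purely illustrative: Grafana IRM evaluates chains server-side, and the step field names here are invented:

```python
# Each step notifies someone, then waits before escalating; an ack
# before the next step fires stops the chain.
def pages_sent(chain, acked_after_minutes=None):
    """Who gets paged, given an ack time in minutes (None = never acked)."""
    paged, elapsed = [], 0
    for step in chain:
        if acked_after_minutes is not None and acked_after_minutes <= elapsed:
            break
        paged.append(step["notify"])
        elapsed += step["wait_minutes"]
    return paged

chain = [
    {"notify": "on-call-primary",   "wait_minutes": 5},
    {"notify": "on-call-secondary", "wait_minutes": 10},
    {"notify": "team-lead",         "wait_minutes": 15},
]

print(pages_sent(chain, acked_after_minutes=3))   # ['on-call-primary']
print(pages_sent(chain, acked_after_minutes=12))  # primary + secondary
print(pages_sent(chain))                          # all three steps fire
```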
### IRM Integration Sources

| Source | Setup |
|---|---|
| Grafana Alerting | Native - configure in contact points |
| Prometheus Alertmanager | Webhook URL from IRM |
| Datadog | Webhook integration |
| PagerDuty | Event integration |
| Jira | Issue alerts |
| Custom | Generic webhook |
### Provisioning Directory Structure

```
provisioning/alerting/
├── alert_rules.yaml             # Alert and recording rules
├── contact_points.yaml          # Notification destinations
├── notification_policies.yaml   # Routing tree
├── templates.yaml               # Message templates
└── mute_timings.yaml            # Recurring mute windows
```
### API Provisioning (Keeps UI Editable)
```bash
# Get current notification policy
curl https://grafana.example.com/api/v1/provisioning/policies \
  -H 'Authorization: Bearer <token>'

# Update (add X-Disable-Provenance to keep editable in UI)
curl -X PUT https://grafana.example.com/api/v1/provisioning/policies \
  -H 'Authorization: Bearer <token>' \
  -H 'X-Disable-Provenance: true' \
  -H 'Content-Type: application/json' \
  -d @policy.json

# Create alert rule
curl -X POST https://grafana.example.com/api/v1/provisioning/alert-rules \
  -H 'Authorization: Bearer <token>' \
  -H 'Content-Type: application/json' \
  -d @rule.json
```
undefinedNotification Templates
通知模板
undefinedundefinedCustom Slack template
自定义Slack模板
{{ define "slack.custom.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]
{{ .CommonLabels.alertname }}
{{ end }}
{{ define "slack.custom.text" }}
{{ range .Alerts }}
Alert: {{ .Annotations.summary }}
Severity: {{ .Labels.severity }}
Service: {{ .Labels.service }}
Details: {{ .Annotations.description }}
{{ if .Annotations.runbook_url }}Runbook: {{ .Annotations.runbook_url }}{{ end }}
{{ end }}
{{ end }}
undefined{{ define "slack.custom.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]
{{ .CommonLabels.alertname }}
{{ end }}
{{ define "slack.custom.text" }}
{{ range .Alerts }}
Alert: {{ .Annotations.summary }}
Severity: {{ .Labels.severity }}
Service: {{ .Labels.service }}
Details: {{ .Annotations.description }}
{{ if .Annotations.runbook_url }}Runbook: {{ .Annotations.runbook_url }}{{ end }}
{{ end }}
{{ end }}
## References

- Alerting Rules
- IRM & On-Call
- SLOs