oncall-irm
Grafana OnCall & IRM
OnCall docs: https://grafana.com/docs/oncall/latest/
IRM docs: https://grafana.com/docs/grafana-cloud/alerting-and-irm/
Note: Grafana OnCall OSS is in maintenance mode (archived March 2026). Grafana Cloud users
should use IRM, which unifies OnCall + Incident management. The concepts (escalation chains,
schedules, integrations) are identical.
Core Concepts
| Concept | Description |
|---|---|
| Integration | Entry point for alerts (HTTP POST URL); one per alert source |
| Route | Jinja2 condition that maps alerts to an escalation chain (first True wins) |
| Escalation Chain | Ordered notification steps: wait, notify schedule, notify team, etc. |
| Schedule | Calendar-based on-call rotation (web, iCal import, or Terraform) |
| Alert Group | Aggregated related alerts (grouped by Grouping ID template) |
| Notification Policy | Per-user delivery channels (Slack, mobile push, SMS, phone, email) |
Alert Processing Flow
Alert arrives at Integration URL
→ Routing template (Jinja2, first True wins) selects escalation chain
→ Grouping ID template consolidates related alerts
→ Escalation chain fires: wait → notify schedule → wait → notify team lead
→ Users: acknowledge / resolve / silence from Slack, mobile, or web
Integrations
Alertmanager / Prometheus Alertmanager
```yaml
# alertmanager.yml
receivers:
  - name: grafana-oncall
    webhook_configs:
      - url: https://your-oncall.grafana.net/integrations/v1/alertmanager/[id]/
        send_resolved: true
        max_alerts: 100  # prevent oversized payloads
route:
  receiver: grafana-oncall
  group_by: [alertname, cluster]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
```
Grafana Alerting (same instance)
- In OnCall → Integrations → New Integration → Grafana Alerting
- Click Quick Connect on the integration tile — auto-creates a contact point
- Link the contact point to a notification policy in Grafana Alerting
Webhook (custom/generic)
```bash
# Send alert via formatted webhook
curl -X POST "https://your-oncall.grafana.net/integrations/v1/formatted_webhook/[id]/" \
  -H "Content-Type: application/json" \
  -d '{
    "alert_uid": "incident-123",
    "title": "Database CPU High",
    "state": "alerting",
    "message": "db-prod-01 CPU at 95% for 10 minutes",
    "link_to_upstream_details": "https://grafana.example.com/d/abc123"
  }'

# Resolve the alert
curl -X POST "https://your-oncall.grafana.net/integrations/v1/formatted_webhook/[id]/" \
  -H "Content-Type: application/json" \
  -d '{"alert_uid": "incident-123", "state": "ok"}'
```
Recognized fields: `alert_uid`, `title`, `state` (`alerting`/`ok`), `message`, `image_url`, `link_to_upstream_details`
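The same alert and resolve bodies can also be built programmatically. The sketch below is a minimal illustration in Python using only the standard library; the helper names `build_alert` and `build_resolve` are hypothetical, and the field names are taken from the recognized-fields list. It only constructs the JSON bodies and leaves sending them to the integration URL to whatever HTTP client is in use.

```python
import json

# Fields the formatted webhook recognizes, per the list in this section.
RECOGNIZED_FIELDS = {
    "alert_uid", "title", "state", "message",
    "image_url", "link_to_upstream_details",
}


def build_alert(alert_uid, title, message, **extra):
    """Return the JSON body for a firing alert (hypothetical helper)."""
    unknown = set(extra) - RECOGNIZED_FIELDS
    if unknown:
        raise ValueError(f"unrecognized fields: {sorted(unknown)}")
    payload = {"alert_uid": alert_uid, "title": title,
               "state": "alerting", "message": message}
    payload.update(extra)
    return json.dumps(payload)


def build_resolve(alert_uid):
    """Return the JSON body that resolves an alert, as in the curl example."""
    # The curl example reuses the alert_uid of the firing alert, so we do too.
    return json.dumps({"alert_uid": alert_uid, "state": "ok"})


firing = build_alert(
    "incident-123",
    "Database CPU High",
    "db-prod-01 CPU at 95% for 10 minutes",
    link_to_upstream_details="https://grafana.example.com/d/abc123",
)
resolved = build_resolve("incident-123")
```

Rejecting unknown keys early is a design choice of this sketch, made so that typos in field names surface before anything is sent.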
Routing Templates (Jinja2)
Routing templates return `True` or `False` to select the escalation chain. The first matching route wins.
```jinja2
{# Route critical alerts to PagerDuty escalation #}
{{ payload.labels.severity == "critical" }}
{# Route by team label #}
{{ payload.labels.team == "platform" }}
{# Route database alerts to DBA on-call #}
{{ "database" in payload.labels.get("component", "") }}
{# Default catch-all (always True) #}
{{ true }}
```
Grouping ID (consolidates related alerts into one alert group):
```jinja2
{{ payload.labels.alertname }}-{{ payload.labels.instance }}
```
Advanced template functions:
```jinja2
{{ payload.field | b64decode }}  {# Decode base64 #}
{{ payload.message | regex_match("pattern") }}  {# Regex matching #}
{{ datetimeformat_as_timezone(payload.startsAt, "UTC") }}  {# Timezone display #}
{{ payload.values | tojson_pretty }}  {# Pretty-print JSON #}
```
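To make the first-match rule and the grouping ID concrete, the following sketch mimics both in plain Python rather than Jinja2. The chain names, the `select_route` and `grouping_id` helpers, and the sample payloads are all hypothetical; OnCall evaluates the real templates itself.

```python
from collections import Counter

# Ordered routes: (hypothetical chain name, predicate mirroring a routing template).
ROUTES = [
    ("critical-chain", lambda p: p["labels"].get("severity") == "critical"),
    ("platform-chain", lambda p: p["labels"].get("team") == "platform"),
    ("dba-chain", lambda p: "database" in p["labels"].get("component", "")),
    ("default-chain", lambda p: True),  # catch-all, like {{ true }}
]


def select_route(payload):
    """Return the chain of the first route whose predicate is True."""
    for chain, matches in ROUTES:
        if matches(payload):
            return chain
    return None


def grouping_id(payload):
    """Mirror the alertname-instance grouping template."""
    labels = payload["labels"]
    return f"{labels['alertname']}-{labels['instance']}"


alerts = [
    # Matches both the severity and team routes; the earlier route wins.
    {"labels": {"alertname": "HighCPU", "instance": "db-1",
                "severity": "critical", "team": "platform"}},
    {"labels": {"alertname": "HighCPU", "instance": "db-1",
                "severity": "critical"}},
    {"labels": {"alertname": "SlowQuery", "instance": "db-2",
                "component": "database-primary"}},
    {"labels": {"alertname": "DiskFull", "instance": "web-2"}},
]

routed = [select_route(alert) for alert in alerts]
group_sizes = Counter(grouping_id(alert) for alert in alerts)
print(routed)
print(dict(group_sizes))
```

The first two alerts share a grouping ID, so they would land in one alert group, while the last alert only matches the catch-all route.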
Escalation Chains
Configure at OnCall → Escalation Chains → Create:
Step 1: Notify users from schedule "Primary On-Call" (Important Notifications)
Step 2: Wait 5 minutes
Step 3: Notify users from schedule "Primary On-Call" (Default Notifications)
Step 4: Wait 10 minutes
Step 5: Notify whole team "Platform"
Step 6: Trigger webhook (PagerDuty, ticket system, etc.)
Step types:
- Wait — pause N minutes before next step
- Notify users from schedule — alerts whoever is currently on-call
- Notify team — alerts all members of a team
- Notify users — alerts specific named users
- Trigger outgoing webhook — call external system
- Auto-resolve — mark alert group resolved after N minutes
- Round-robin — rotate through a list of users
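As a rough illustration of how waits and notifications interleave, the sketch below walks the six-step example chain and computes the minute offset at which each non-wait step would fire if nobody acknowledges. The tuple-based step encoding and the `timeline` helper are hypothetical and not an OnCall API; the sketch assumes steps run strictly in order and that notification steps take no time.

```python
# Hypothetical encoding of the example chain: ("wait", minutes) or ("notify", target).
CHAIN = [
    ("notify", 'schedule "Primary On-Call" (important)'),
    ("wait", 5),
    ("notify", 'schedule "Primary On-Call" (default)'),
    ("wait", 10),
    ("notify", 'team "Platform"'),
    ("notify", "outgoing webhook"),
]


def timeline(chain):
    """Return (minute_offset, target) for every non-wait step, in order."""
    elapsed = 0
    fired = []
    for kind, value in chain:
        if kind == "wait":
            elapsed += value  # waits only push later steps back
        else:
            fired.append((elapsed, value))
    return fired


for minute, target in timeline(CHAIN):
    print(f"t+{minute:>2} min: {target}")
```

Under these assumptions the whole team and the webhook are reached 15 minutes after the alert group is created, which is a quick way to sanity-check whether a chain escalates fast enough.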
On-Call Schedules
Web-based (UI)
Create rotations with shifts, overrides, and gaps directly in the OnCall/IRM UI.
iCal Import
```bash
# API: create schedule from iCal
curl -X POST "https://your-oncall.grafana.net/api/v1/schedules/" \
  -H "Authorization: your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Platform On-Call",
    "type": "ical",
    "ical_url_primary": "https://calendar.example.com/platform-oncall.ics",
    "ical_url_overrides": "https://calendar.example.com/overrides.ics",
    "slack": {
      "channel_id": "C123456ABC",
      "user_group_id": "S123456ABC"
    }
  }'
```
Terraform
```hcl
resource "grafana_oncall_schedule" "platform" {
  name = "Platform On-Call"
  type = "calendar"
  shifts = [
    grafana_oncall_on_call_shift.weekday.id,
    # The weekend shift is not shown here; define it like "weekday" below.
    grafana_oncall_on_call_shift.weekend.id,
  ]
}

resource "grafana_oncall_on_call_shift" "weekday" {
  name           = "Weekday"
  type           = "rolling_users"
  start          = "2024-01-01T09:00:00"
  duration       = 3600 * 8 # 8 hours
  frequency      = "weekly"
  users_per_slot = 1
  rolling_users  = [["user-id-1"], ["user-id-2"], ["user-id-3"]]
}
```
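One way to reason about a `rolling_users` shift is as a queue that advances once per period. The sketch below assumes, for illustration only, that a weekly rotation moves to the next slot every seven days counted from `start`; it ignores the 8-hour shift duration and time zones. The `on_call_slot` helper is hypothetical, and the provider's actual behaviour should be checked against its documentation.

```python
from datetime import datetime, timedelta

ROLLING_USERS = [["user-id-1"], ["user-id-2"], ["user-id-3"]]
START = datetime(2024, 1, 1, 9, 0, 0)  # same start as the Terraform example


def on_call_slot(when, start=START, slots=ROLLING_USERS,
                 period=timedelta(weeks=1)):
    """Return the user slot for `when`, one slot per full period since start."""
    if when < start:
        raise ValueError("rotation has not started yet")
    periods_elapsed = (when - start) // period  # whole weeks since start
    return slots[periods_elapsed % len(slots)]


print(on_call_slot(datetime(2024, 1, 8, 9, 0, 0)))
```

With three slots the rotation wraps around in the fourth week, so the first user is back on call three weeks after `start`.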
Slack Integration
- Install: OnCall Settings → Chat Ops → Slack → Install Slack Integration
- Connect users: each user opens Profile → Connect to Slack
- Set default channel: choose the channel used for alert routing
- Add to escalation: "Notify by Slack mentions" step in escalation chain
Slack actions on alert messages: Acknowledge, Resolve, Silence, Add responders, Add note
Slash commands: `/escalate`, `/oncall`
API Reference
Base URL: `https://your-oncall.grafana.net/api/v1/`
```bash
BASE=https://your-oncall.grafana.net/api/v1
TOKEN=your-api-key

# List integrations
curl "$BASE/integrations/" -H "Authorization: $TOKEN"

# Create escalation chain
curl -X POST "$BASE/escalation_chains/" \
  -H "Authorization: $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "Platform Critical", "team_id": "team-id"}'

# List schedules
curl "$BASE/schedules/" -H "Authorization: $TOKEN"

# List alert groups
curl "$BASE/alert_groups/?page=1&perpage=25" -H "Authorization: $TOKEN"

# Who is on-call right now
curl "$BASE/schedules/{schedule_id}/next_shifts/" -H "Authorization: $TOKEN"
```
**Rate limits:** 300 alerts/integration per 5 min, 500 alerts/org per 5 min, 300 API requests/key per 5 min
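A client that calls the API in bulk may want to stay under the per-key limit on its own side. The following is a minimal sliding-window throttle, assuming the figure of 300 requests per 5 minutes quoted in the rate limits; the `SlidingWindowLimiter` class is hypothetical, and how the server actually counts requests is not specified here.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` calls in any `window` seconds (client-side sketch)."""

    def __init__(self, limit=300, window=300.0):
        self.limit = limit
        self.window = window
        self._stamps = deque()  # timestamps of accepted calls, oldest first

    def allow(self, now):
        """Record and allow a call at time `now` (seconds), or return False."""
        # Drop timestamps that have fallen out of the window.
        while self._stamps and now - self._stamps[0] >= self.window:
            self._stamps.popleft()
        if len(self._stamps) < self.limit:
            self._stamps.append(now)
            return True
        return False


limiter = SlidingWindowLimiter()
accepted = sum(limiter.allow(now=0.0) for _ in range(301))
print(accepted)  # 300: the 301st call inside the window is refused
```

In real use `now` would come from a monotonic clock such as `time.monotonic()`; passing it in explicitly keeps the sketch deterministic.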
Incident Management (IRM)
When an alert group becomes an incident:
- Declare incident: From alert group → "Declare Incident" or via Slack `/incident declare`
- Set severity: P1–P4
- Add responders: Page additional team members
- Update status: Investigating → Identified → Monitoring → Resolved
- Timeline: Auto-tracks all actions; add manual notes
- Postmortem: Auto-generated draft from timeline on resolution
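The status progression can be modelled as a small ordered state machine. This sketch assumes, purely for illustration, that statuses only move forward one step at a time; the `advance` helper is hypothetical, and IRM itself may allow other transitions.

```python
STATUSES = ["Investigating", "Identified", "Monitoring", "Resolved"]


def advance(current):
    """Return the status that follows `current` in the listed order."""
    index = STATUSES.index(current)  # raises ValueError for unknown statuses
    if index == len(STATUSES) - 1:
        raise ValueError("incident is already resolved")
    return STATUSES[index + 1]


status = "Investigating"
history = [status]
while status != "Resolved":
    status = advance(status)
    history.append(status)

print(" → ".join(history))
```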
RBAC Roles
| Role | Access |
|---|---|
|  | Full access to all OnCall resources |
|  | Create/edit integrations, schedules, escalation chains |
|  | Read-only |
|  | Receive alerts; cannot modify configuration |
Rate Limits & Best Practices
- Keep escalation chains short (≤4 levels) with a definitive final step
- Set `send_resolved: true` in Alertmanager for auto-resolution
- Use `max_alerts: 100` in Alertmanager webhook config
- Test routes with the template editor before going live
- Combine Slack + mobile push for notification reliability
- Assign integrations/schedules to teams for access control