oncall-irm

Grafana OnCall & IRM

Note: Grafana OnCall OSS is in maintenance mode (archived March 2026). Grafana Cloud users should use IRM, which unifies OnCall + Incident management. The concepts (escalation chains, schedules, integrations) are identical.

Core Concepts

| Concept | Description |
|---|---|
| Integration | Entry point for alerts (HTTP POST URL); one per alert source |
| Route | Jinja2 condition that maps alerts to an escalation chain (first True wins) |
| Escalation Chain | Ordered notification steps: wait, notify schedule, notify team, etc. |
| Schedule | Calendar-based on-call rotation (web, iCal import, or Terraform) |
| Alert Group | Aggregated related alerts (grouped by Grouping ID template) |
| Notification Policy | Per-user delivery channels (Slack, mobile push, SMS, phone, email) |

Alert Processing Flow

Alert arrives at Integration URL
  → Routing template (Jinja2, first True wins) selects escalation chain
  → Grouping ID template consolidates related alerts
  → Escalation chain fires: wait → notify schedule → wait → notify team lead
  → Users: acknowledge / resolve / silence from Slack, mobile, or web

Integrations

Alertmanager / Prometheus Alertmanager

yaml
# alertmanager.yml
receivers:
  # The receiver entries are missing from the source; the route below expects one named grafana-oncall.
route:
  receiver: grafana-oncall
  group_by: [alertname, cluster]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

Grafana Alerting (same instance)

  1. In OnCall → Integrations → New Integration → Grafana Alerting
  2. Click Quick Connect on the integration tile — auto-creates a contact point
  3. Link the contact point to a notification policy in Grafana Alerting

Webhook (custom/generic)

bash
# Send alert via formatted webhook
curl -X POST https://your-oncall.grafana.net/integrations/v1/formatted_webhook/[id]/ \
  -H "Content-Type: application/json" \
  -d '{
    "alert_uid": "incident-123",
    "title": "Database CPU High",
    "state": "alerting",
    "message": "db-prod-01 CPU at 95% for 10 minutes",
    "link_to_upstream_details": "https://grafana.example.com/d/abc123"
  }'

# Resolve the alert
curl -X POST https://your-oncall.grafana.net/integrations/v1/formatted_webhook/[id]/ \
  -H "Content-Type: application/json" \
  -d '{"alert_uid": "incident-123", "state": "ok"}'

Recognized fields: `alert_uid`, `title`, `state` (`alerting`/`ok`), `message`, `image_url`, `link_to_upstream_details`
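The same formatted-webhook payload can be assembled programmatically. The following is a minimal sketch using only the Python standard library; the helper name `build_alert_request` is hypothetical, and the request is only constructed here, not sent. Only the fields listed above are used.

```python
import json
from urllib.request import Request

# Placeholder URL copied from the curl example; [id] stands for the integration ID.
WEBHOOK_URL = "https://your-oncall.grafana.net/integrations/v1/formatted_webhook/[id]/"


def build_alert_request(url, alert_uid, state, title=None, message=None,
                        image_url=None, link_to_upstream_details=None):
    """Build (but do not send) a POST request for the formatted webhook."""
    if state not in ("alerting", "ok"):
        raise ValueError("state must be 'alerting' or 'ok'")
    body = {"alert_uid": alert_uid, "state": state}
    optional = {
        "title": title,
        "message": message,
        "image_url": image_url,
        "link_to_upstream_details": link_to_upstream_details,
    }
    # Include only the optional fields that were actually provided.
    body.update({key: value for key, value in optional.items() if value is not None})
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


fire = build_alert_request(WEBHOOK_URL, "incident-123", "alerting",
                           title="Database CPU High")
resolve = build_alert_request(WEBHOOK_URL, "incident-123", "ok")
```

Passing either request to `urllib.request.urlopen` would send it. Reusing the same `alert_uid` with `state` set to `ok` mirrors the resolve call shown in the curl example.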

Routing Templates (Jinja2)

Routing templates return True or False to select the escalation chain. The first matching route wins.
jinja2
{# Route critical alerts to PagerDuty escalation #}
{{ payload.labels.severity == "critical" }}

{# Route by team label #}
{{ payload.labels.team == "platform" }}

{# Route database alerts to DBA on-call #}
{{ "database" in payload.labels.get("component", "") }}

{# Default catch-all (always True) #}
{{ true }}
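To illustrate the first-True-wins rule, here is a small sketch that evaluates routes in order. Plain Python predicates stand in for the Jinja2 templates, and the chain names are hypothetical; this is not how OnCall itself is implemented.

```python
def select_chain(routes, payload):
    """Return the chain of the first route whose predicate is True, else None."""
    for predicate, chain in routes:
        if predicate(payload):
            return chain
    return None


def label(payload, name, default=""):
    """Read a label from the payload, tolerating a missing labels dict."""
    return payload.get("labels", {}).get(name, default)


# Ordered like the templates above; the catch-all must come last.
ROUTES = [
    (lambda p: label(p, "severity") == "critical", "critical-chain"),
    (lambda p: label(p, "team") == "platform", "platform-chain"),
    (lambda p: "database" in label(p, "component"), "dba-chain"),
    (lambda p: True, "default-chain"),
]

chosen = select_chain(ROUTES, {"labels": {"severity": "critical", "team": "platform"}})
```

Because evaluation stops at the first match, an alert that is both critical and owned by the platform team goes to the critical chain, so route order matters.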
Grouping ID (consolidates related alerts into one alert group):
jinja2
{{ payload.labels.alertname }}-{{ payload.labels.instance }}
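The effect of this grouping template can be sketched as follows: alerts that render to the same Grouping ID land in the same alert group. The function names are hypothetical and the sketch only mimics the template with string formatting.

```python
from collections import defaultdict


def grouping_id(payload):
    """Mimic the template: '<alertname>-<instance>'."""
    labels = payload.get("labels", {})
    return f"{labels.get('alertname', '')}-{labels.get('instance', '')}"


def group_alerts(payloads):
    """Bucket alert payloads by their rendered Grouping ID."""
    groups = defaultdict(list)
    for payload in payloads:
        groups[grouping_id(payload)].append(payload)
    return dict(groups)


alerts = [
    {"labels": {"alertname": "HighCPU", "instance": "db-prod-01"}},
    {"labels": {"alertname": "HighCPU", "instance": "db-prod-01"}},
    {"labels": {"alertname": "HighCPU", "instance": "db-prod-02"}},
]
groups = group_alerts(alerts)
```

Here the two alerts for `db-prod-01` share one group, while `db-prod-02` gets its own.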
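Advanced template functions:
jinja2
{# Decode base64 #}
{{ payload.field | b64decode }}
{# Regex matching: the filter is applied to the value, with the pattern as its argument #}
{{ payload.message | regex_match("pattern") }}
{# Timezone display: parse the ISO 8601 string first, then format it in the target timezone #}
{{ payload.startsAt | iso8601_to_time | datetimeformat_as_timezone("%Y-%m-%d %H:%M", "UTC") }}
{# Pretty-print JSON #}
{{ payload.values | tojson_pretty }}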

Escalation Chains

Configure at OnCall → Escalation Chains → Create:
Step 1: Notify users from schedule "Primary On-Call" (Important Notifications)
Step 2: Wait 5 minutes
Step 3: Notify users from schedule "Primary On-Call" (Default Notifications)
Step 4: Wait 10 minutes
Step 5: Notify whole team "Platform"
Step 6: Trigger webhook (PagerDuty, ticket system, etc.)
Step types:
  • Wait — pause N minutes before next step
  • Notify users from schedule — alerts whoever is currently on-call
  • Notify team — alerts all members of a team
  • Notify users — alerts specific named users
  • Trigger outgoing webhook — call external system
  • Auto-resolve — mark alert group resolved after N minutes
  • Round-robin — rotate through a list of users
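The timing of the six-step example chain can be worked out with a short sketch. It assumes that wait steps simply add to the elapsed time and that every other step fires immediately when reached; the `Step` type and `timeline` function are hypothetical illustrations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    kind: str         # "wait" or an action name such as "notify_schedule"
    detail: str = ""  # target of the action (schedule, team, webhook)
    minutes: int = 0  # only meaningful for wait steps


def timeline(steps):
    """Return (minutes_since_start, kind, detail) for each non-wait step."""
    elapsed = 0
    fired = []
    for step in steps:
        if step.kind == "wait":
            elapsed += step.minutes
        else:
            fired.append((elapsed, step.kind, step.detail))
    return fired


CHAIN = [
    Step("notify_schedule", "Primary On-Call (Important)"),
    Step("wait", minutes=5),
    Step("notify_schedule", "Primary On-Call (Default)"),
    Step("wait", minutes=10),
    Step("notify_team", "Platform"),
    Step("trigger_webhook", "PagerDuty"),
]
schedule = timeline(CHAIN)
```

Under these assumptions the whole team is paged 15 minutes after the alert group is created, at the same moment the outgoing webhook fires.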

On-Call Schedules

Web-based (UI)

Create rotations with shifts, overrides, and gaps directly in the OnCall/IRM UI.

iCal Import

bash
# API: create schedule from iCal
curl -X POST https://your-oncall.grafana.net/api/v1/schedules/ \
  -H "Authorization: your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Platform On-Call",
    "ical_url_primary": "https://calendar.example.com/platform-oncall.ics",
    "ical_url_overrides": "https://calendar.example.com/overrides.ics",
    "slack": {
      "channel_id": "C123456ABC",
      "user_group_id": "S123456ABC"
    }
  }'

Terraform

hcl
resource "grafana_oncall_schedule" "platform" {
  name = "Platform On-Call"
  type = "calendar"

  shifts = [
    grafana_oncall_on_call_shift.weekday.id,
    # The weekend shift resource is not shown here; define it alongside the weekday shift.
    grafana_oncall_on_call_shift.weekend.id,
  ]
}

resource "grafana_oncall_on_call_shift" "weekday" {
  name       = "Weekday"
  type       = "rolling_users"
  start      = "2024-01-01T09:00:00"
  duration   = 3600 * 8    # 8 hours
  frequency  = "weekly"
  users_per_slot = 1
  rolling_users  = [["user-id-1"], ["user-id-2"], ["user-id-3"]]
}
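To reason about who a rolling shift like this selects, the rotation can be sketched in a few lines. This assumes the rotation advances one group per week from `start` and that each occurrence lasts `duration` (8 hours here); the real scheduler may apply further rules, such as time zones or weekday filters, that are ignored in this sketch. The function name is hypothetical.

```python
from datetime import datetime, timedelta

ROLLING_USERS = [["user-id-1"], ["user-id-2"], ["user-id-3"]]
START = datetime(2024, 1, 1, 9, 0, 0)
DURATION = timedelta(seconds=3600 * 8)  # 8 hours, as in the Terraform example
INTERVAL = timedelta(weeks=1)           # frequency = "weekly"


def on_call_group(when, rolling_users=ROLLING_USERS, start=START,
                  duration=DURATION, interval=INTERVAL):
    """Return the user group on shift at `when`, or None outside the shift."""
    if when < start:
        return None
    rotations = (when - start) // interval       # completed weekly rotations
    offset = (when - start) % interval           # time since this week's start
    if offset >= duration:
        return None                              # shift already ended this week
    return rolling_users[rotations % len(rolling_users)]


first_week = on_call_group(datetime(2024, 1, 1, 10, 0))
second_week = on_call_group(datetime(2024, 1, 8, 9, 0))
```

With three groups the rotation wraps around in the fourth week, so `user-id-1` would be selected again on 2024-01-22 under these assumptions.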

Slack Integration

  1. Install: OnCall Settings → Chat Ops → Slack → Install Slack Integration
  2. Connect users: each user goes to Profile → Connect to Slack
  3. Set default channel: choose the channel used for alert routing
  4. Add to escalation: "Notify by Slack mentions" step in escalation chain
Slack actions on alert messages: Acknowledge, Resolve, Silence, Add responders, Add note
Slash commands: /escalate, /oncall

API Reference

Base URL: https://your-oncall.grafana.net/api/v1/
bash
# BASE has no trailing slash because the paths below add it
BASE=https://your-oncall.grafana.net/api/v1
TOKEN=your-api-key

# List integrations
curl "$BASE/integrations/" -H "Authorization: $TOKEN"

# Create escalation chain
curl -X POST "$BASE/escalation_chains/" \
  -H "Authorization: $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "Platform Critical", "team_id": "team-id"}'

# List schedules
curl "$BASE/schedules/" -H "Authorization: $TOKEN"

# List alert groups
curl "$BASE/alert_groups/?page=1&perpage=25" -H "Authorization: $TOKEN"

# Who is on-call right now
curl "$BASE/schedules/{schedule_id}/next_shifts/" -H "Authorization: $TOKEN"

**Rate limits:** 300 alerts/integration per 5 min, 500 alerts/org per 5 min, 300 API requests/key per 5 min
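For scripted access, the same calls can be wrapped in a small helper. This is a sketch using only the Python standard library; `api_request` is a hypothetical name, the base URL and token are the placeholders used above, and requests are built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE = "https://your-oncall.grafana.net/api/v1"  # placeholder, no trailing slash


def api_request(path, token, params=None, base=BASE):
    """Build a GET request for an API path, with optional query parameters."""
    url = f"{base}/{path.strip('/')}/"
    if params:
        url += "?" + urlencode(params)
    # The token is passed as the raw Authorization header value, as in the curl examples.
    return Request(url, headers={"Authorization": token})


list_groups = api_request("alert_groups", "your-api-key", {"page": 1, "perpage": 25})
list_schedules = api_request("/schedules/", "your-api-key")
```

Passing a request to `urllib.request.urlopen` would execute it; the helper normalizes leading and trailing slashes so each path ends with exactly one.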

Incident Management (IRM)

When an alert group becomes an incident:
  1. Declare incident: From alert group → "Declare Incident" or via Slack with /incident declare
  2. Set severity: P1–P4
  3. Add responders: Page additional team members
  4. Update status: Investigating → Identified → Monitoring → Resolved
  5. Timeline: Auto-tracks all actions; add manual notes
  6. Postmortem: Auto-generated draft from timeline on resolution

RBAC Roles

| Role | Access |
|---|---|
| oncall-admin | Full access to all OnCall resources |
| oncall-editor | Create/edit integrations, schedules, escalation chains |
| oncall-viewer | Read-only |
| oncall-notifications-receiver | Receive alerts; cannot modify configuration |

Rate Limits & Best Practices

  • Keep escalation chains short (≤4 levels) with a definitive final step
  • Set send_resolved: true in Alertmanager for auto-resolution
  • Use max_alerts: 100 in Alertmanager webhook config
  • Test routes with the template editor before going live
  • Combine Slack + mobile push for notification reliability
  • Assign integrations/schedules to teams for access control
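A client that calls the API in bulk may want to stay under the limit of 300 API requests per key per 5 minutes quoted in the API reference. The following is a minimal client-side sketch using a sliding window; whether the service itself counts with sliding or fixed windows is not stated in this document, so treat this as a conservative assumption. The class name is hypothetical.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_events within any window_seconds-long window."""

    def __init__(self, max_events=300, window_seconds=300.0):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._stamps = deque()  # timestamps of accepted events, oldest first

    def allow(self, now):
        """Record and allow an event at time `now` (seconds), or refuse it."""
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_events:
            return False
        self._stamps.append(now)
        return True


# Small numbers for illustration: 3 requests per 10 seconds.
limiter = SlidingWindowLimiter(max_events=3, window_seconds=10.0)
results = [limiter.allow(t) for t in (0.0, 1.0, 2.0, 3.0, 10.0, 10.5)]
```

In real use `now` would come from `time.monotonic()`, and a refused call would be delayed and retried rather than dropped.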