cx-incident-management

Incident Management Skill

Use this skill as the gateway for incident triage, SLO monitoring, and notification verification. It orchestrates the full triage workflow, from detection through resolution, and cross-references cx-alerts for deep alert management and cx-telemetry-querying for root cause investigation.

CLI Commands

| Command | Subcommands | Purpose |
|---|---|---|
| cx incidents | list, get, acknowledge, resolve, close, assign, unassign, events, aggregations | Manage and triage incidents |
| cx slos | list, get, create, update, delete | Monitor and manage SLO definitions |
| cx alerts | list, get | Check which alerts are firing (see cx-alerts skill for full alert management) |
| cx notifications connectors | list, get | Verify notification connector configuration |
| cx notifications routers | list, get | Verify notification routing rules |
| cx notifications presets | list, get | Check notification preset templates |
| cx notifications test | connector, destination, preset, routing-condition, template-render | Test notification delivery |

Key flags:
  • cx incidents list supports --status (TRIGGERED, ACKNOWLEDGED, RESOLVED), --severity (CRITICAL, WARNING, INFO), and --assignee
  • All commands support -o json for structured output and -p <profile> for profile selection
  • cx slos create/update use --from-file <path> (or - for stdin)
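
Because every command accepts -p <profile>, the same check can be repeated across environments. The sketch below is one way to do that; the function name triggered_across_profiles is hypothetical, and the profile names are whatever you pass in (none are defined by this document).

```bash
# Sketch: list TRIGGERED incidents for each named profile in turn.
# Assumes --status and -p can be combined on one `cx incidents list` call.
triggered_across_profiles() {
  local profile
  for profile in "$@"; do
    # Print a header so the JSON that follows can be attributed to a profile
    printf '== %s ==\n' "$profile"
    cx incidents list --status TRIGGERED -o json -p "$profile" || return 1
  done
}
```

Usage might look like: triggered_across_profiles staging production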

Incident Triage Workflow

Step 1: Check Active Incidents

```bash
cx incidents list -o json
cx incidents list --status TRIGGERED -o json
cx incidents list --severity CRITICAL -o json
```
Get an overview of what's happening. Filter by severity for immediate priorities:
```bash
cx incidents list -o json | jq '[.[] | select(.severity == "CRITICAL") | {id, name, status, severity, started_at}]'
```
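
For a quick sense of where the load sits, the list output can also be grouped by severity. This is a minimal sketch, assuming the output is a JSON array whose objects carry a severity field (as the jq filter above implies); count_by_severity is a hypothetical helper name.

```bash
# Sketch: count incidents per severity from JSON read on stdin.
# Assumes a JSON array of objects with a `severity` field.
count_by_severity() {
  # group_by sorts groups by key; each output line is "<severity><TAB><count>"
  jq -r 'group_by(.severity) | map("\(.[0].severity)\t\(length)") | .[]'
}
```

Usage might look like: cx incidents list --status TRIGGERED -o json | count_by_severity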

Step 2: Get Incident Details

```bash
cx incidents get <incident-id> -o json
cx incidents events --incident-id <incident-id> -o json
```
Review the incident timeline and related events to understand scope and progression.
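
When an incident is likely to need a write-up later, it can help to save both outputs at triage time. The following is a sketch only; snapshot_incident and the file naming scheme are hypothetical, and it uses just the two commands shown above.

```bash
# Sketch: save incident details and events to files for later review.
# File names (incident-<id>.json, incident-<id>-events.json) are illustrative.
snapshot_incident() {
  local id="$1" dir="${2:-.}"
  if [ -z "$id" ]; then
    echo "usage: snapshot_incident <incident-id> [dir]" >&2
    return 2
  fi
  mkdir -p "$dir" || return 1
  # Stop if the details call fails so a partial snapshot is not mistaken for a full one
  cx incidents get "$id" -o json > "$dir/incident-$id.json" || return 1
  cx incidents events --incident-id "$id" -o json > "$dir/incident-$id-events.json"
}
```

Usage might look like: snapshot_incident <incident-id> ./triage-notes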

Step 3: Check Related Alerts

```bash
cx alerts list -o json
```
Find which alerts are currently firing. For deep alert inspection, switch to the cx-alerts skill.
```bash
cx alerts list -o json | jq '[.[] | select(.is_active == true) | {id, name, severity, last_triggered}]'
```

Step 4: Review SLO Status

```bash
cx slos list -o json
cx slos get <slo-id> -o json
```
Check if SLOs are breaching or error budgets are burned:
```bash
cx slos list -o json | jq '[.[] | {name, status, remaining_budget_percentage}]'
```
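
To catch SLOs that are still passing but running low on budget, the same output can be filtered by a threshold. This is a sketch under assumptions: remaining_budget_percentage is taken to be a numeric field (the name comes from the filter above), and both slos_below_budget and the default threshold of 25 are illustrative choices.

```bash
# Sketch: print SLOs whose remaining error budget is below a threshold.
# Reads JSON from stdin; assumes a numeric `remaining_budget_percentage` field.
slos_below_budget() {
  local threshold="${1:-25}"   # percent; 25 is an arbitrary illustrative default
  jq -r --argjson t "$threshold" '
    .[]
    | select(.remaining_budget_percentage != null
             and .remaining_budget_percentage < $t)
    | "\(.name)\t\(.remaining_budget_percentage)"'
}
```

Usage might look like: cx slos list -o json | slos_below_budget 10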

Step 5: Verify Notifications

```bash
cx notifications connectors list -o json
cx notifications routers list -o json
cx notifications presets list -o json
```
Confirm the right people were notified through the correct channels.

Step 6: Pivot to Root Cause

Switch to the cx-telemetry-querying skill to investigate the underlying cause using logs, traces, and metrics.

Incident Actions

Acknowledge

```bash
cx incidents acknowledge <incident-id>
cx incidents acknowledge <id1> <id2> <id3>
```
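
Since acknowledge accepts several IDs at once, the IDs can be taken straight from a filtered list. The sketch below assumes that --status and --severity can be combined, that each list entry has an id field, and that IDs contain no whitespace; ack_triggered_critical is a hypothetical name. Review the list output before running anything like this against real incidents.

```bash
# Sketch: acknowledge every TRIGGERED incident of CRITICAL severity in one call.
ack_triggered_critical() {
  local ids
  ids=$(cx incidents list --status TRIGGERED --severity CRITICAL -o json \
        | jq -r '.[].id') || return 1
  if [ -z "$ids" ]; then
    echo "nothing to acknowledge"
    return 0
  fi
  # Word splitting is intended here: each ID becomes its own argument.
  # shellcheck disable=SC2086
  cx incidents acknowledge $ids
}
```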

Resolve

```bash
cx incidents resolve <incident-id>
cx incidents resolve <id1> <id2> <id3>
```

Assign

```bash
cx incidents assign <incident-id> --user-id <user-id>
```
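
Acknowledging and assigning often happen together when someone takes ownership. A minimal sketch of that pairing follows; claim_incident is a hypothetical wrapper, and it assumes both commands exit non-zero on failure.

```bash
# Sketch: acknowledge an incident, then assign it to a user.
claim_incident() {
  local id="$1" user="$2"
  if [ -z "$id" ] || [ -z "$user" ]; then
    echo "usage: claim_incident <incident-id> <user-id>" >&2
    return 2
  fi
  # Assign only if the acknowledge step succeeded
  cx incidents acknowledge "$id" && cx incidents assign "$id" --user-id "$user"
}
```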

Close

```bash
cx incidents close <incident-id>
```

SLO Management

Creating SLOs

Template from an existing SLO:
```bash
cx slos get <existing-slo-id> -o json > slo-template.json
# Edit slo-template.json with new service/threshold
cx slos create --from-file slo-template.json
```
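
The edit step can also be scripted and piped to create through stdin, since --from-file accepts -. The sketch below is illustrative only: it assumes the SLO JSON has top-level id and name fields and that dropping id is appropriate for a new SLO, neither of which this document confirms. retarget_slo_template is a hypothetical name.

```bash
# Sketch: rename an SLO definition read from stdin and drop its existing id.
# ASSUMPTION: `id` and `name` are top-level fields in the SLO JSON.
retarget_slo_template() {
  local new_name="$1"
  if [ -z "$new_name" ]; then
    echo "usage: retarget_slo_template <new-name>" >&2
    return 2
  fi
  jq --arg name "$new_name" 'del(.id) | .name = $name'
}
```

Usage might look like: cx slos get <existing-slo-id> -o json | retarget_slo_template <new-name> | cx slos create --from-file -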

Monitoring SLO Health

```bash
# All SLOs with their status
cx slos list -o json | jq '[.[] | {name, status, target_percentage, remaining_budget}]'

# SLOs that are breaching
cx slos list -o json | jq '[.[] | select(.status != "OK")]'
```

---

Notification Debugging

When notifications aren't reaching the right people:

1. Check Connectors

```bash
cx notifications connectors list -o json | jq '[.[] | {id, name, type}]'
```
Verify the expected channels (Slack, PagerDuty, email) exist and are configured.

2. Check Routers

```bash
cx notifications routers list -o json | jq '[.[] | {id, name, entity_type}]'
```
Verify routing rules map the right alert types to the right connectors.

3. Test Notification Delivery

```bash
cx notifications test connector --from-file test-connector.json
cx notifications test destination --from-file test-destination.json
cx notifications test preset --from-file test-preset.json
cx notifications test routing-condition --from-file test-condition.json
```
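
Before sending test notifications, a quick sanity check can confirm that connectors and routers exist at all. This is a sketch, assuming both list commands return a JSON array (as the jq filters above imply); check_notification_chain is a hypothetical name, and a non-zero count says nothing about whether the entries are configured correctly.

```bash
# Sketch: report connector and router counts; fail if either list is empty.
check_notification_chain() {
  local connectors routers
  connectors=$(cx notifications connectors list -o json | jq 'length') || return 1
  routers=$(cx notifications routers list -o json | jq 'length') || return 1
  echo "connectors=$connectors routers=$routers"
  # Succeeds only when at least one connector and one router exist
  [ "$connectors" -gt 0 ] && [ "$routers" -gt 0 ]
}
```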

Incident Aggregations

Get a high-level view of incident patterns:
```bash
cx incidents aggregations -o json
```
Use this to understand incident frequency, MTTR trends, and severity distribution.

Key Principles

  • Triage before deep-dive - check incidents, alerts, and SLOs before querying telemetry data
  • Check SLO burn rate, not just status - a slowly burning SLO needs attention before it breaches
  • Verify notification chain end-to-end - connector exists → router maps correctly → test delivery works
  • Cross-reference with telemetry - use the cx-telemetry-querying skill for root cause after triage
  • Acknowledge promptly - acknowledge incidents to signal ownership and stop re-notifications
  • Use incident events for timeline - cx incidents events shows the full incident lifecycle
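
The "triage before deep-dive" principle can be sketched as a single summary that checks incidents, alerts, and SLOs in that order. The field names is_active and status are taken from the jq filters earlier in this document; triage_summary and its output format are hypothetical.

```bash
# Sketch: one-line triage summary across incidents, alerts, and SLOs.
# Assumes each list command returns a JSON array.
triage_summary() {
  local incidents alerts slos
  incidents=$(cx incidents list --status TRIGGERED -o json | jq 'length') || return 1
  alerts=$(cx alerts list -o json \
           | jq '[.[] | select(.is_active == true)] | length') || return 1
  slos=$(cx slos list -o json \
         | jq '[.[] | select(.status != "OK")] | length') || return 1
  printf 'triggered_incidents=%s active_alerts=%s slos_not_ok=%s\n' \
    "$incidents" "$alerts" "$slos"
}
```

If all three counts are zero there may be little to triage; otherwise the non-zero count suggests which step of the workflow to start with.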

Related Skills

  • cx-alerts - deep alert management: creating, updating, and inspecting alert definitions
  • cx-telemetry-querying - root cause investigation using logs, metrics, traces, and RUM
  • cx-observability-setup - configure notification channels and routing for alerts