cx-incident-management
Incident Management Skill
Use this skill as the gateway for incident triage, SLO monitoring, and notification verification. It orchestrates the full triage workflow, from detection through resolution, and cross-references `cx-alerts` for deep alert management and `cx-telemetry-querying` for root cause investigation.
CLI Commands
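| Command | Subcommands | Purpose |
|---|---|---|
| `cx incidents` | `list`, `get`, `events`, `acknowledge`, `resolve`, `assign`, `close`, `aggregations` | Manage and triage incidents |
| `cx slos` | `list`, `get`, `create`, `update` | Monitor and manage SLO definitions |
| `cx alerts` | `list` | Check which alerts are firing (see `cx-alerts` for full alert management) |
| `cx notifications connectors` | `list` | Verify notification connector configuration |
| `cx notifications routers` | `list` | Verify notification routing rules |
| `cx notifications presets` | `list` | Check notification preset templates |
| `cx notifications test` | `connector`, `destination`, `preset`, `routing-condition` | Test notification delivery |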
Key flags:
- `cx incidents list` supports `--status` (TRIGGERED, ACKNOWLEDGED, RESOLVED), `--severity` (CRITICAL, WARNING, INFO), and `--assignee`
- All commands support `-o json` for structured output and `-p <profile>` for profile selection
- `cx slos create/update` use `--from-file <path>` (or `-` for stdin)
Incident Triage Workflow
Step 1: Check Active Incidents
bash
cx incidents list -o json
cx incidents list --status TRIGGERED -o json
cx incidents list --severity CRITICAL -o json

Get an overview of what's happening. Filter by severity for immediate priorities:
bash
cx incidents list -o json | jq '[.[] | select(.severity == "CRITICAL") | {id, name, status, severity, started_at}]'

Step 2: Get Incident Details
bash
cx incidents get <incident-id> -o json
cx incidents events --incident-id <incident-id> -o json

Review the incident timeline and related events to understand scope and progression.
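Steps 1 and 2 can be chained so that the most urgent open incident is inspected first. The following is a minimal sketch, not part of the CLI: it assumes `cx incidents list -o json` returns a JSON array whose objects carry the `id`, `status`, and `severity` fields used in the filters above, and `triage_order` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch only: assumes the list output is a JSON array of objects with
# id, status, and severity fields, as the jq filters in this skill imply.

# triage_order (hypothetical helper): reads an incident list on stdin and
# prints the ids of TRIGGERED incidents, most severe first.
triage_order() {
  jq -r '
    # Map each severity to a sort rank; unknown values sort last.
    def rank:
      if . == "CRITICAL" then 0
      elif . == "WARNING" then 1
      elif . == "INFO" then 2
      else 3 end;
    [ .[] | select(.status == "TRIGGERED") ]
    | sort_by(.severity | rank)
    | .[].id
  '
}

# Usage against the live CLI (not executed here):
#   cx incidents list -o json | triage_order | head -n 1 \
#     | xargs -I{} cx incidents get {} -o json
```

Keeping the helper as a stdin filter means it can be checked against a saved JSON file before it is pointed at live data.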
Step 3: Check Related Alerts
bash
cx alerts list -o json

Find which alerts are currently firing. For deep alert inspection, switch to the `cx-alerts` skill.
bash
cx alerts list -o json | jq '[.[] | select(.is_active == true) | {id, name, severity, last_triggered}]'

Step 4: Review SLO Status
bash
cx slos list -o json
cx slos get <slo-id> -o json

Check if SLOs are breaching or error budgets are burned:
bash
cx slos list -o json | jq '[.[] | {name, status, remaining_budget_percentage}]'

Step 5: Verify Notifications
bash
cx notifications connectors list -o json
cx notifications routers list -o json
cx notifications presets list -o json

Confirm the right people were notified through the correct channels.
Step 6: Pivot to Root Cause
Switch to the `cx-telemetry-querying` skill to investigate the underlying cause using logs, traces, and metrics.
Incident Actions
Acknowledge
bash
cx incidents acknowledge <incident-id>
cx incidents acknowledge <id1> <id2> <id3>

Resolve
bash
cx incidents resolve <incident-id>
cx incidents resolve <id1> <id2> <id3>

Assign
bash
cx incidents assign <incident-id> --user-id <user-id>

Close
bash
cx incidents close <incident-id>

SLO Management
Creating SLOs
Template from an existing SLO:
bash
cx slos get <existing-slo-id> -o json > slo-template.json
# Edit slo-template.json with new service/threshold
cx slos create --from-file slo-template.json
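The manual edit step can also be scripted and piped through stdin. This is a hedged sketch under several assumptions: that the exported SLO JSON has top-level `name` and `target_percentage` fields (both appear in the list filters in this skill, but their presence in `get` output is not confirmed), and that a server-assigned `id` should be dropped before creating. `retarget_slo` and the example SLO name are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: field names (id, name, target_percentage) are assumptions
# about the SLO JSON shape, not confirmed CLI output.

# retarget_slo (hypothetical helper): reads one SLO definition on stdin and
# prints a copy with a new name and target, without the original id.
#   $1 = new SLO name
#   $2 = new target percentage (a JSON number, e.g. 99.5)
retarget_slo() {
  local new_name="$1" new_target="$2"
  jq --arg name "$new_name" --argjson target "$new_target" '
    del(.id)
    | .name = $name
    | .target_percentage = $target
  '
}

# Usage against the live CLI (not executed here); "-" reads from stdin:
#   cx slos get <existing-slo-id> -o json \
#     | retarget_slo "checkout-availability" 99.5 \
#     | cx slos create --from-file -
```

Review the generated JSON before creating anything; real SLO definitions may contain other fields that need to change.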
Monitoring SLO Health
bash
# All SLOs with their status
cx slos list -o json | jq '[.[] | {name, status, target_percentage, remaining_budget}]'
# SLOs that are breaching
cx slos list -o json | jq '[.[] | select(.status != "OK")]'
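The status check and the budget check can be combined into one "at risk" report. A minimal sketch, assuming `cx slos list -o json` returns a JSON array; because this skill's filters use both `remaining_budget_percentage` and `remaining_budget`, the helper accepts either. `slos_at_risk`, the default threshold of 25, and any status value other than `OK` are illustrative assumptions.

```bash
#!/usr/bin/env bash
# Sketch only: the budget field name and the status values are assumptions
# based on the jq filters shown in this skill.

# slos_at_risk (hypothetical helper): reads an SLO list on stdin and prints
# "name<TAB>status<TAB>budget" for every SLO that is not OK or whose
# remaining budget is below the threshold.
#   $1 = remaining-budget threshold in percent (default: 25)
slos_at_risk() {
  local threshold="${1:-25}"
  jq -r --argjson threshold "$threshold" '
    .[]
    | (.remaining_budget_percentage // .remaining_budget) as $budget
    | select(.status != "OK" or ($budget != null and $budget < $threshold))
    | "\(.name)\t\(.status)\t\($budget)"
  '
}

# Usage against the live CLI (not executed here):
#   cx slos list -o json | slos_at_risk 25
```

Including SLOs that are still `OK` but low on budget surfaces slow burns before they turn into breaches.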
---
Notification Debugging
When notifications aren't reaching the right people:
1. Check Connectors
bash
cx notifications connectors list -o json | jq '[.[] | {id, name, type}]'

Verify the expected channels (Slack, PagerDuty, email) exist and are configured.
2. Check Routers
bash
cx notifications routers list -o json | jq '[.[] | {id, name, entity_type}]'

Verify routing rules map the right alert types to the right connectors.
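The connector check can be turned into a quick gap report. This is a sketch under stated assumptions: that the connector list is a JSON array with a string `type` field (as the filter in step 1 implies), and that type values such as `slack` or `pagerduty` look like the channel names; the actual values may differ. `missing_connector_types` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch only: the "type" values passed as arguments are guesses at how the
# CLI labels channels; adjust them to match real output.

# missing_connector_types (hypothetical helper): reads a connector list on
# stdin and prints each expected type (given as arguments) that has no
# matching connector. Matching is case-insensitive.
missing_connector_types() {
  local connectors expected
  connectors="$(cat)"
  for expected in "$@"; do
    if ! printf '%s' "$connectors" \
      | jq -e --arg t "$expected" \
          'any(.[]; ((.type // "") | ascii_downcase) == ($t | ascii_downcase))' \
          >/dev/null; then
      printf '%s\n' "$expected"
    fi
  done
}

# Usage against the live CLI (not executed here):
#   cx notifications connectors list -o json \
#     | missing_connector_types slack pagerduty email
```

Empty output means every expected channel type has at least one connector; it does not confirm that routing or delivery works, which is what the router check and the test commands are for.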
3. Test Notification Delivery
bash
cx notifications test connector --from-file test-connector.json
cx notifications test destination --from-file test-destination.json
cx notifications test preset --from-file test-preset.json
cx notifications test routing-condition --from-file test-condition.json

Incident Aggregations
Get a high-level view of incident patterns:
bash
cx incidents aggregations -o json

Use this to understand incident frequency, MTTR trends, and severity distribution.
Key Principles
- Triage before deep-dive - check incidents, alerts, and SLOs before querying telemetry data
- Check SLO burn rate, not just status - a slowly burning SLO needs attention before it breaches
- Verify notification chain end-to-end - connector exists → router maps correctly → test delivery works
- Cross-reference with telemetry - use the `cx-telemetry-querying` skill for root cause after triage
- Acknowledge promptly - acknowledge incidents to signal ownership and stop re-notifications
- Use incident events for timeline - `cx incidents events` shows the full incident lifecycle
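To make the burn-rate principle concrete: burn rate is commonly defined as the observed error percentage divided by the error budget (100 minus the SLO target). A value above 1 means the budget will run out before the SLO window ends. The sketch below only does the arithmetic; where the observed error percentage comes from (for example a telemetry query) is outside this skill, and `burn_rate` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch only: both inputs are supplied by the caller; nothing here calls
# the cx CLI.

# burn_rate (hypothetical helper): prints the burn rate to two decimals.
#   $1 = SLO target percentage (e.g. 99.9)
#   $2 = observed error percentage over the same period (e.g. 0.5)
burn_rate() {
  awk -v target="$1" -v errors="$2" 'BEGIN {
    budget = 100 - target            # allowed error percentage
    if (budget <= 0) {               # a 100% target leaves no budget
      print "n/a"
      exit
    }
    printf "%.2f\n", errors / budget
  }'
}

# Example: a 99.9% target with 0.5% observed errors burns budget at
# five times the sustainable rate.
#   burn_rate 99.9 0.5   # prints 5.00
```

An SLO can still report an acceptable status while burning at several times the sustainable rate, which is why the rate deserves its own check.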
Related Skills
- `cx-alerts` - deep alert management: creating, updating, and inspecting alert definitions
- `cx-telemetry-querying` - root cause investigation using logs, metrics, traces, and RUM
- `cx-observability-setup` - configure notification channels and routing for alerts