Found 31 Skills
Create operational runbooks, playbooks, standard operating procedures (SOPs), and incident response guides. Use when documenting operational procedures, on-call guides, or incident response processes.
An engineering runbook — service overview, alerts table, dashboards links, common procedures with copy-pasteable commands, on-call rotation, and an incident-response checklist. Use when the brief mentions "runbook", "ops doc", "on-call guide", "SRE doc", or "运维手册".
Grafana OnCall and Incident Response Management (IRM) — alert routing, escalation chains, on-call schedules, Jinja2 routing templates, Slack/mobile notifications, integrations (Alertmanager, Grafana Alerting, webhooks, PagerDuty), and incident lifecycle management. Use when setting up on-call rotations, configuring escalation policies, routing alerts to the right team, declaring and managing incidents, integrating with Alertmanager or Grafana Alerting, or configuring Slack-based alert workflows.
Automate PagerDuty tasks via Rube MCP (Composio): manage incidents, services, schedules, escalation policies, and on-call rotations. Always search tools first for current schemas.
Guide incident response from detection to post-mortem using SRE principles, severity classification, on-call management, blameless culture, and communication protocols. Use when setting up incident processes, designing escalation policies, or conducting post-mortems.
Use this skill when managing production incidents, designing on-call rotations, writing runbooks, conducting post-mortems, setting up status pages, or running war rooms. Triggers on incident response, incident commander, on-call schedule, pager escalation, runbook authoring, post-incident review, blameless retro, status page updates, war room coordination, severity classification, and any task requiring structured incident lifecycle management.
Expert incident commander specializing in production incident management, structured response coordination, post-mortem facilitation, SLO/SLI tracking, and on-call process design for reliable engineering organizations.
Design and run a monitoring system for a website or web app. Use this skill when setting up uptime checks, defining SLOs, configuring error tracking, choosing what to alert on, designing on-call rotations, or fixing alert fatigue. Triggers on monitoring, alerts, uptime, SLO, SLA, error rate, on-call, pager, alert fatigue, observability, dashboards, what should we monitor. Also triggers when an incident reveals a gap in monitoring.
Service overview + alerts table + dashboards + operational commands + on-call + incident checklist
Investigate PagerDuty incidents for Envato on-call escalation policies. Use when asked about incidents, on-call status, incident analysis, or PagerDuty investigation.
Execute Instantly incident response procedures with triage, mitigation, and postmortem. Use when responding to Instantly-related outages, investigating errors, or running post-incident reviews for Instantly integration failures. Trigger with phrases like "instantly incident", "instantly outage", "instantly down", "instantly on-call", "instantly emergency", "instantly broken".
Implement incident management processes and escalation procedures. Configure on-call schedules and post-incident reviews. Use when managing production incidents.