Found 39 Skills
Use this skill when managing production incidents, designing on-call rotations, writing runbooks, conducting post-mortems, setting up status pages, or running war rooms. Triggers on incident response, incident commander, on-call schedule, pager escalation, runbook authoring, post-incident review, blameless retro, status page updates, war room coordination, severity classification, and any task requiring structured incident lifecycle management.
Use this skill when the user asks to "investigate incident", "triage this alert", "what's firing", "who got paged", "incident response", "check incident status", "SLO breaching", "error budget burned", "check service level", "SLI status", "who was notified", "check notification delivery", "verify alert routing", "MTTR", "incident severity", "error budget", "burn rate", "acknowledge incident", "resolve incident", "production incident", "what alerts are active", "incident timeline", "on-call triage", or wants to triage, manage, or respond to incidents using alerts, SLOs, and notifications.
Implement incident management processes and escalation procedures. Configure on-call schedules and post-incident reviews. Use when managing production incidents.
Guides defensive security analysis: alert triage, log and SIEM investigation, threat hunting, detection engineering basics, MITRE ATT&CK mapping, incident scoping, containment recommendations, and DFIR evidence handling for SOC and blue-team analysts. Use when investigating security alerts, writing detection rules, tuning false positives, analyzing EDR/network/auth logs, building timelines of suspicious activity, recommending containment steps, or documenting findings for incident command. Not for enterprise security strategy (cybersecurity), CI/CD pipeline hardening (devsecops), offensive pentest execution (authorize red team separately), LLM adversarial testing (ai-redteam), or designing on-call rotations and postmortem programs (incident-management-engineer).
Guides technical support engineering: customer ticket investigation, reproduction, log and API analysis, root-cause isolation, workaround communication, engineering escalation with evidence, and knowledge-base fixes for product bugs and integration issues. Use when debugging a customer-reported issue, writing a repro for engineering, analyzing API errors, drafting technical replies, or improving support runbooks. Not for CS program design, renewals, or billing ops (customer-ops-specialist), production incident command (incident-management-engineer), building product features (fullstack-software-engineer), company-wide crisis statements and launch announcements (communication-lead), or exec/VIP and community escalation program design (community-executive-escalations-program-manager). For product how-to, macros, and ticket triage without deep debugging, use product-support-specialist.
Guides failure-prevention culture and operational excellence for mission-critical engineering: zero-defect aspiration vs error budgets; HRO principles; defense-in-depth; fail-safe/fail-closed; verification gates and independent checks; redundancy and graceful degradation; pre-mortems and FMEA; stop-the-line; defect escape, near-miss, and repeat-incident metrics; leadership against normalization of deviance rather than blame culture. Use for failure-prevention programs, HRO practices, verification gates, fail-safe design, pre-mortem/FMEA, stop-the-line, near-miss reporting, or defect-escape metrics. Not for SRE error budgets only (site-reliability-engineer), incident command only (incident-management-engineer), backup/restore only (cyber-resilience-engineer), CI lint only (build-validator), agile coaching, HR discipline, or classified ATO without ops-excellence lens (classified-cyber-security-senior-manager).
Guides SOC operations—alert triage, SIEM/EDR investigation, enrichment, playbook execution, false-positive closure, escalation decisions, and detection tuning feedback. Use when working SOC queues, investigating suspicious alerts, correlating events, documenting analyst notes, or deciding escalate vs close—not for declared incident command, timelines, evidence preservation, or regulatory comms (incident-responder), incident program design (incident-management-engineer), binary/firmware RE (reverse-engineer), red team operations (red-team-specialist), or enterprise security strategy (cybersecurity).
Guides technical program management—multi-team initiatives with dependencies, milestones, RAID tracking, launch readiness, stakeholder status, and cross-functional coordination across engineering, product, and infrastructure (not application code or BRDs). Use when running a technical program, dependency maps, milestones, exec status, or unblocking cross-team delivery—not for requirements (business-analyst), rollout (deployment-strategist), CI/CD (devops), data roadmaps (data-manager), or single-team delivery (fullstack-software-engineer). Incidents: incident-management-engineer. Architecture: senior-system-architecture. Strategy: business-consultant. Comms: communication-lead. DC site build: data-center-design-execution-lead. DC portfolio: data-center-portfolio-planning-execution-lead. M&A/financing deal execution and closing matrix: transaction-manager. Exec/VIP and community customer escalations: community-executive-escalations-program-manager. CVD/disclosure: technical-program-manager-security-cvd.
Guides Site Reliability Engineering—SLI/SLO and error budgets, reliability dashboards and burn-rate alerting, production readiness reviews, capacity planning for availability, toil reduction, dependency and failure-mode analysis, release reliability (canaries, rollback criteria), and service-owner incident mitigation tied to customer impact. Use when defining or operating SLOs, measuring error budget burn, improving service reliability, running PRRs before launch, planning scalable resilient capacity, or leading technical mitigation during outages—not for CI/CD pipeline implementation (devops), incident program and paging policy design (incident-management-engineer), cloud access and patch tickets (cloud-system-administrator), load-test profiling (performance-engineer), rollout cutover strategy (deployment-strategist), or greenfield cloud build-out (cloud-engineer).
Use when defining SLIs/SLOs, managing error budgets, or building reliable systems at scale. Invoke for incident management, chaos engineering, toil reduction, capacity planning.
Problem entities, root cause analysis (RCA), impact assessment, and problem correlation. Query and analyze Dynatrace-detected problems and incidents.
Conduct systematic root cause analysis to identify underlying problems. Use structured methodologies to prevent recurring issues and drive improvements.