Found 31 Skills
Implement comprehensive alert management with PagerDuty, escalation policies, and incident coordination. Use when setting up alerting systems, managing on-call schedules, or coordinating incident response.
Creates Dockerfiles, configures CI/CD pipelines, writes Kubernetes manifests, and generates Terraform/Pulumi infrastructure templates. Handles deployment automation, GitOps configuration, incident response runbooks, and internal developer platform tooling. Use when setting up CI/CD pipelines, containerizing applications, managing infrastructure as code, deploying to Kubernetes clusters, configuring cloud platforms, automating releases, or responding to production incidents. Invoke for pipelines, Docker, Kubernetes, GitOps, Terraform, GitHub Actions, on-call, or platform engineering.
Guides defensive security analysis for SOC and blue-team analysts: alert triage, log and SIEM investigation, threat hunting, detection engineering basics, MITRE ATT&CK mapping, incident scoping, containment recommendations, and DFIR evidence handling. Use when investigating security alerts, writing detection rules, tuning false positives, analyzing EDR/network/auth logs, building timelines of suspicious activity, recommending containment steps, or documenting findings for incident command. Not for enterprise security strategy (cybersecurity), CI/CD pipeline hardening (devsecops), offensive pentest execution (authorize red team separately), LLM adversarial testing (ai-redteam), or designing on-call rotations and postmortem programs (incident-management-engineer).
Design and optimize systems for high concurrency, throughput, scalability, and elastic scale—concurrency models (threads, async/await, actors), lock-free patterns, connection pooling, caching stampede mitigation, horizontal scaling, load balancing, backpressure, queueing, rate limiting, bulkheads, read replicas, sharding, pool tuning, profiling, capacity planning, SLO-driven autoscaling, multi-region and CDN edge architecture. Use when the user asks about high concurrency, scalability, throughput, horizontal scaling, connection pooling, backpressure, rate limiting, caching stampede, read replica, sharding, autoscaling, capacity planning, lock contention, async scalability, or load balancing—not service decomposition (microservices-developer), event buses only (event-driven-architecture), generic CRUD (senior-software-engineer), SRE on-call only (site-reliability-engineer), load tests without architecture (performance-engineer), or cost-only FinOps (cloud-economist).
Create structured incident runbooks with diagnostic steps, resolution procedures, escalation paths, and communication templates for effective incident response. Use when documenting response procedures for recurring alerts, standardizing incident response across an on-call rotation, reducing MTTR with clear diagnostic steps, creating training materials for new team members, or linking alert annotations directly to resolution procedures.
Execute FireCrawl incident response procedures with triage, mitigation, and postmortem. Use when responding to FireCrawl-related outages, investigating errors, or running post-incident reviews for FireCrawl integration failures. Trigger with phrases like "firecrawl incident", "firecrawl outage", "firecrawl down", "firecrawl on-call", "firecrawl emergency", "firecrawl broken".
Create or update an operational runbook for a recurring task or procedure. Use when documenting a task that on-call or ops needs to run repeatably, turning tribal knowledge into exact step-by-step commands, adding troubleshooting and rollback steps to an existing procedure, or writing escalation paths for when things go wrong.
Master on-call shift handoffs with context transfer, escalation procedures, and documentation. Use when transitioning on-call responsibilities, documenting shift summaries, or improving on-call processes.
Review existing Datadog dashboards for operational readiness. Audits alert threshold markers, threshold proximity to normal traffic, customer-facing section completeness, and zero-knowledge readability. Uses the pup CLI to fetch dashboard definitions. Use when auditing dashboards before on-call handoff, after dashboard changes, or during operational reviews. Do not use for: (1) designing new dashboards from scratch, (2) monitor/alert rule design, (3) APM instrumentation or tracing setup, (4) log pipeline configuration.
Generates comprehensive operational runbooks for any system or process. Reads codebase, infrastructure config, and deployment scripts to produce structured runbook.md files formatted for on-call engineers. Use when you need operations documentation, incident response guides, deployment procedures, or disaster recovery plans.
Grafana Alerting, Incident Response Management (IRM), and SLOs. Covers Grafana-managed and data source-managed alert rules, notification policies, contact points (Slack/PagerDuty/email/webhook), silences, muting, on-call scheduling, incident management workflows, and SLO configuration with burn-rate alerts. Use when configuring alerts, debugging notification routing, setting up on-call rotations, managing incidents, defining SLOs, or provisioning alerting via YAML/API.
Execute Clay incident response procedures with triage, mitigation, and postmortem. Use when responding to Clay-related outages, investigating errors, or running post-incident reviews for Clay integration failures. Trigger with phrases like "clay incident", "clay outage", "clay down", "clay on-call", "clay emergency", "clay broken".