Found 13 Skills
Use when defining, reviewing, or operating SLOs/SLIs/error budgets. Triggers on "define an SLO", "what should our SLO be", "error budget", "burn rate", "SLI", "service level objective", "Google SRE workbook", "multi-window burn-rate alert", or any reliability-target question. Ships an SLO designer, an error-budget calculator with multi-window burn-rate thresholds, and an SLO reviewer that catches common mistakes (target too aggressive, window too short, conflicting SLOs, no SLI definition). Includes 4 references covering SLO principles, SLI design, error budget math, and composition with feature-flags-architect/chaos-engineering/kubernetes-operator. NOT a generic observability skill: it covers the SLO discipline specifically.
Guides CI/CD for agent skills repositories and skill packages—pipeline design (build, test, validate, package), GitHub Actions for PR checks and release promotion, environment gates, secrets hygiene (no secrets in repo), skill-creator integration (quick_validate.py, package_skill.py), .skill artifact strategy, rollback, and operational runbooks for skill releases. Use when the user mentions CI/CD, CI/CD engineer, pipeline design, GitHub Actions, skill validation CI, package skills, release pipeline, deploy skills, PR checks, continuous integration, or skill release workflow—not application-only CI without skill packaging (devops), pre-flight plan go/no-go (build-validator), IDP or golden paths (platform-engineer), org-wide SLO and error-budget programs without pipeline ownership (site-reliability-engineer), or portfolio catalog governance without pipeline YAML (ai-skill-manager).
Use when defining SLIs/SLOs, managing error budgets, or building reliable systems at scale. Invoke for incident management, chaos engineering, toil reduction, capacity planning.
You are an SLO (Service Level Objective) expert specializing in implementing reliability standards and error budget-based practices. Design SLO frameworks, define SLIs, and build monitoring that balances reliability with delivery velocity.
Define and implement Service Level Indicators (SLIs) and Service Level Objectives (SLOs) with error budgets and alerting. Use when establishing reliability targets, implementing SRE practices, or measuring service performance.
Expert SRE incident responder specializing in rapid problem resolution, modern observability, and comprehensive incident management. Masters incident command, blameless post-mortems, error budget management, and system reliability patterns. Handles critical outages, communication strategies, and continuous improvement. Use IMMEDIATELY for production incidents or SRE practices.
Use when building reliable and scalable distributed systems.
Creates SLO-based alerts and operational dashboards with key charts, alert thresholds, and runbook links. Use for "alerting", "dashboards", "SLO", or "monitoring".
Create and manage SLOs in Elastic Observability using the Kibana API. Use when defining SLIs, setting error budgets, or managing SLO lifecycle.
Use this skill when implementing SRE practices, defining error budgets, reducing toil, planning capacity, or improving service reliability. Triggers on SRE, error budgets, SLOs, SLAs, toil automation, incident management, postmortems, on-call rotation, capacity planning, chaos engineering, and any task requiring reliability engineering decisions.
Expert Site Reliability Engineer specializing in SLOs, error budgets, and reliability engineering practices. Proficient in incident management, post-mortems, capacity planning, and building scalable, resilient systems with focus on reliability, availability, and performance.
Guides Site Reliability Engineering—SLI/SLO and error budgets, reliability dashboards and burn-rate alerting, production readiness reviews, capacity planning for availability, toil reduction, dependency and failure-mode analysis, release reliability (canaries, rollback criteria), and service-owner incident mitigation tied to customer impact. Use when defining or operating SLOs, measuring error budget burn, improving service reliability, running PRRs before launch, planning scalable resilient capacity, or leading technical mitigation during outages—not for CI/CD pipeline implementation (devops), incident program and paging policy design (incident-management-engineer), cloud access and patch tickets (cloud-system-administrator), load-test profiling (performance-engineer), rollout cutover strategy (deployment-strategist), or greenfield cloud build-out (cloud-engineer).