Found 11 Skills
Expert-level site reliability engineering, SLOs, incident management, and operational excellence
Use when defining SLIs/SLOs, managing error budgets, or building reliable systems at scale. Invoke for incident management, chaos engineering, toil reduction, capacity planning.
Build production-ready systems with stability patterns: circuit breakers, bulkheads, timeouts, and retry logic. Use when the user mentions "production outage", "circuit breaker", "timeout strategy", "deployment pipeline", or "chaos engineering". Covers capacity planning, health checks, and anti-fragility patterns. For data systems, see ddia-systems. For system architecture, see system-design.
Use when building reliable and scalable distributed systems.
Use when the user wants to deploy and run a prepared AWS FIS experiment. Triggers on "execute FIS experiment", "run FIS experiment", "start chaos experiment", "deploy FIS template", "启动 FIS 实验", "运行混沌实验", "执行故障注入实验", "deploy and run the experiment in [directory]". Expects a prepared experiment directory (from aws-fis-experiment-prepare or manually created) containing experiment-template.json, iam-policy.json, cfn-template.yaml, and alarm configs. Deploys resources via CLI or CloudFormation, starts the experiment with strict user confirmation, monitors progress, and generates a results report.
Expert knowledge for Chaos Studio development, including troubleshooting, limits & quotas, security, configuration, and integrations & coding patterns. Use when defining ARM/Bicep experiments, deploying Chaos Agents, using CLI/REST, integrating with Azure Monitor, or handling other Chaos Studio-related development tasks. Not for Azure Monitor (use azure-monitor), Azure Resiliency (use azure-resiliency), Azure Reliability (use azure-reliability), Azure Site Recovery (use azure-site-recovery).
Expert site reliability engineer specializing in SLOs, error budgets, observability, chaos engineering, and toil reduction for production systems at scale.
Design and implement disaster recovery strategies with RTO/RPO planning, database backups, Kubernetes DR, cross-region replication, and chaos engineering testing. Use when implementing backup systems, configuring point-in-time recovery, setting up multi-region failover, or validating DR procedures.
Expert Site Reliability Engineer specializing in SLOs, error budgets, and reliability engineering practices. Proficient in incident management, post-mortems, capacity planning, and building scalable, resilient systems with a focus on reliability, availability, and performance.
Testing in production with feature flags, canary deployments, synthetic monitoring, and chaos engineering. Use when implementing production observability or progressive delivery.
Advanced testing strategies and methodologies. Use when the user asks to "design tests" or mentions "test coverage", "property-based testing", "mutation testing", "contract testing", "chaos engineering", "test pyramid", "testing strategy", "behavior-driven development", "acceptance testing", or other comprehensive testing approaches.