Use when planning, running, or learning from chaos engineering experiments. Triggers on "chaos experiment", "fault injection", "gameday", "resilience test", "blast radius", "steady state", "abort criteria", "Chaos Toolkit", "Chaos Mesh", "Litmus", "Gremlin", "AWS FIS", or any deliberate failure-injection question. Ships experiment designer, blast-radius calculator, and postmortem generator (all stdlib Python), 4 references on chaos principles + experiment design + attack taxonomy + tooling landscape, and a /chaos-experiment slash command. Composes with feature-flags-architect (kill switches as abort triggers) and kubernetes-operator (common chaos targets).
npx skill4agent add alirezarezvani/claude-skills chaos-engineering

Tags: incident-response, red-team, threat-detection

SKILL=engineering/chaos-engineering/skills/chaos-engineering
# 1. Design an experiment
python "$SKILL/scripts/experiment_designer.py" --target "checkout-svc" --hypothesis "p99 latency stays <500ms" --attack latency --duration-min 15
# 2. Calculate blast radius
python "$SKILL/scripts/blast_radius_calculator.py" --traffic-share 0.05 --user-pop 1000000 --duration-min 15
# 3. Generate postmortem after the experiment
python "$SKILL/scripts/experiment_postmortem.py" --plan experiment.json --result-log results.txt

Run any script with --help to list its options.

experiment_designer.py

python scripts/experiment_designer.py \
--target "checkout-svc" \
--hypothesis "p99 latency stays <500ms when payment-svc is slow" \
--attack latency \
--magnitude "+200ms" \
--duration-min 15 \
--blast-radius "5% of US traffic" \
--abort-if "p99 > 1000ms OR error_rate > baseline + 1pp"

blast_radius_calculator.py

python scripts/blast_radius_calculator.py \
--traffic-share 0.05 \
--user-pop 1000000 \
--duration-min 15 \
--baseline-availability 0.999 \
--expected-impact-availability 0.95

experiment_postmortem.py

python scripts/experiment_postmortem.py --plan experiment.json --result-log results.txt

references/attack_taxonomy.md

| Attack | What it tests | Tooling |
|---|---|---|
| Latency | Timeouts, retries, circuit breakers | tc, Chaos Mesh |
| Error | Error handling, fallback paths | Chaos Mesh |
| Resource (CPU, memory, disk) | Saturation handling, autoscaling | Chaos Mesh |
| Network partition | Split-brain, consensus, failover | Chaos Mesh |
| Dependency failure | Graceful degradation, fallback | Service mesh fault injection |
| Time | Clock skew, NTP issues | libfaketime, Chaos Mesh |
| Infrastructure (kill instance) | Auto-recovery, failover | AWS FIS, Chaos Monkey |
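
As an illustration of the latency row in the table above, the following is a minimal in-process sketch of a latency attack. It is not how tc or Chaos Mesh inject faults (those act at the network or cluster level). The names `inject_latency`, `call_with_timeout_budget`, and `fetch_payment_status` are hypothetical and chosen only for this example.

```python
import random
import time
from functools import wraps


def inject_latency(delay_s, probability=1.0, rng=random.random, sleep=time.sleep):
    """Decorator that delays a call by delay_s seconds with the given probability.

    rng and sleep are injectable so the behaviour can be tested without waiting.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if rng() < probability:
                sleep(delay_s)  # the injected fault
            return func(*args, **kwargs)
        return wrapper
    return decorator


def call_with_timeout_budget(func, budget_s, clock=time.monotonic):
    """Run func and report whether it finished within budget_s seconds.

    This only measures elapsed time after the fact; it does not cancel the call.
    """
    start = clock()
    result = func()
    elapsed = clock() - start
    return result, elapsed <= budget_s


@inject_latency(delay_s=0.2)  # mirrors the "+200ms" magnitude from the designer example
def fetch_payment_status():
    # Hypothetical stand-in for a call to payment-svc.
    return "ok"


if __name__ == "__main__":
    result, within_budget = call_with_timeout_budget(fetch_payment_status, budget_s=0.5)
    print(result, within_budget)
```

Setting `probability` below 1.0 affects only a fraction of calls, which is one simple way to keep the blast radius small while still exercising timeout and retry paths.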

| Tool | Best for | Pricing | Stack |
|---|---|---|---|
| Chaos Toolkit | Lightweight, language-agnostic, JSON experiments | OSS | Any |
| Chaos Mesh | Kubernetes-native, rich CRDs, in-cluster | OSS | Kubernetes |
| Litmus | Kubernetes, Argo-integrated, large library | OSS + Enterprise | Kubernetes |
| Gremlin | Enterprise SaaS, multi-cloud, audit | Paid | Any |
| AWS FIS | AWS-native, IAM-integrated, EC2/ECS/EKS | Paid (AWS) | AWS |
| Custom | Niche needs, single-cloud, low budget | None | Any |
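
To make the "JSON experiments" entry for Chaos Toolkit more concrete, here is a rough sketch that builds an experiment document in Python and prints it as JSON. The top-level keys (`version`, `title`, `description`, `steady-state-hypothesis`, `method`, `rollbacks`) follow the Chaos Toolkit experiment format as commonly documented; check the field names against the current Chaos Toolkit reference before relying on them. The helper `build_experiment`, the health URL, and the interface name `eth0` are assumptions made for this example.

```python
import json


def build_experiment(title, probe_url, expected_status,
                     fault_command, fault_args, rollback_args):
    """Return a dict shaped like a Chaos Toolkit experiment document."""
    return {
        "version": "1.0.0",
        "title": title,
        "description": "Sketch only; all values are placeholders.",
        "steady-state-hypothesis": {
            "title": "Service responds normally",
            "probes": [
                {
                    "type": "probe",
                    "name": "health-endpoint-responds",
                    # The probe passes when the HTTP status equals this value.
                    "tolerance": expected_status,
                    "provider": {"type": "http", "url": probe_url, "timeout": 3},
                }
            ],
        },
        "method": [
            {
                "type": "action",
                "name": "inject-fault",
                "provider": {
                    "type": "process",
                    "path": fault_command,
                    "arguments": fault_args,
                },
            }
        ],
        "rollbacks": [
            {
                "type": "action",
                "name": "remove-fault",
                "provider": {
                    "type": "process",
                    "path": fault_command,
                    "arguments": rollback_args,
                },
            }
        ],
    }


if __name__ == "__main__":
    experiment = build_experiment(
        title="checkout-svc tolerates +200ms on its network interface",
        probe_url="http://localhost:8080/health",  # hypothetical endpoint
        expected_status=200,
        fault_command="tc",
        fault_args="qdisc add dev eth0 root netem delay 200ms",
        rollback_args="qdisc del dev eth0 root netem",
    )
    print(json.dumps(experiment, indent=2))
```

Saved to a file, a document like this would typically be executed with `chaos run`. The `tc` commands need root privileges on the target host, so treat them as an illustration of the shape, not a ready-to-run attack.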
references/tooling_landscape.md

Single-experiment workflow:

1. State a hypothesis: "When [fault], steady-state metric X stays within Y."
2. Identify the steady-state metric — must be measurable BEFORE the experiment.
3. Run blast_radius_calculator.py — confirm GREEN before proceeding.
4. Run experiment_designer.py to produce the plan.
5. Get a peer review of the plan; confirm abort criteria are concrete.
6. Notify the on-call team in #incidents (or your team's equivalent channel).
7. Run the experiment with monitoring open.
8. If abort criteria are hit, abort immediately; record what happened.
9. Run experiment_postmortem.py to capture learnings.
10. File follow-up actions; link to next experiment.

Game Day workflow:

1. Pick a scenario (e.g., "primary database fails over").
2. Identify all dependent services that should keep working.
3. Build a multi-experiment plan covering each layer.
4. Schedule with stakeholders; on-call coverage required.
5. Run with a facilitator who manages the scenario.
6. Capture observations in a shared doc as they happen.
7. Write a single combined postmortem covering all observations.
8. Track follow-up actions in a board with owners.

Path to continuous chaos:

1. Start: weekly Game Day in staging.
2. Move to: weekly Game Day in production with limited blast radius.
3. Mature to: continuous chaos via scheduled experiments (Litmus chaos schedule, Gremlin scenarios).
4. Wire to deployment: every prod deploy triggers a baseline chaos sweep.
5. Track: experiments per week, weaknesses discovered, MTTR trend.

| Skill | Composition |
|---|---|
| feature-flags-architect | Kill switches defined there are the abort triggers here |
| kubernetes-operator | Operators are common chaos targets (test reconcile under fault) |
| | Chaos experiments that escalate become incidents |
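
The link between abort criteria and kill switches can be sketched as follows. The thresholds mirror the `--abort-if "p99 > 1000ms OR error_rate > baseline + 1pp"` example from the designer command. `AbortCriteria`, `should_abort`, and `KillSwitch` are hypothetical names for this illustration; they are not part of the shipped scripts or of any feature-flag library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbortCriteria:
    max_p99_ms: float            # abort when p99 latency exceeds this value
    max_error_rate_delta: float  # allowed rise over baseline (0.01 == 1 percentage point)


def should_abort(criteria, p99_ms, error_rate, baseline_error_rate):
    """Evaluate one metrics sample; return (abort, reasons)."""
    reasons = []
    if p99_ms > criteria.max_p99_ms:
        reasons.append(f"p99 {p99_ms:.0f}ms > {criteria.max_p99_ms:.0f}ms")
    limit = baseline_error_rate + criteria.max_error_rate_delta
    if error_rate > limit:
        reasons.append(f"error rate {error_rate:.4f} > {limit:.4f}")
    return bool(reasons), reasons


class KillSwitch:
    """Hypothetical in-memory stand-in for a feature-flag kill switch."""

    def __init__(self):
        self.fault_enabled = True

    def trip(self):
        # Turning the flag off is what stops the fault injection.
        self.fault_enabled = False


if __name__ == "__main__":
    criteria = AbortCriteria(max_p99_ms=1000, max_error_rate_delta=0.01)
    switch = KillSwitch()
    # One made-up metrics sample: latency is fine, error rate is 2pp over baseline.
    abort, reasons = should_abort(criteria, p99_ms=640, error_rate=0.025,
                                  baseline_error_rate=0.005)
    if abort:
        switch.trip()
    print(abort, reasons, switch.fault_enabled)
```

Keeping the criteria as data and the evaluation as a pure function makes it easy to review the thresholds with a peer before the experiment and to log the exact reason for an abort afterwards.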
References: references/chaos_principles.md, references/experiment_design.md, references/attack_taxonomy.md, references/tooling_landscape.md

Slash command: /chaos-experiment

Assets: assets/experiment_template.md, assets/postmortem_template.md
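
For a feel of the arithmetic behind a blast-radius estimate, the sketch below works through the inputs used in the blast_radius_calculator.py example (5% traffic share, 1,000,000 users, 15 minutes, availability 0.999 at baseline and 0.95 under fault). It is not the logic of the shipped script, whose formulas and GREEN threshold are not shown here. The function `estimate_blast_radius` and its output fields are hypothetical, and treating availability as a per-user success fraction is a simplifying assumption.

```python
def estimate_blast_radius(traffic_share, user_pop, duration_min,
                          baseline_availability, expected_impact_availability):
    """Rough blast-radius arithmetic under simplifying assumptions."""
    # Users routed through the faulted path.
    exposed_users = round(traffic_share * user_pop)
    # Extra failure fraction the fault is expected to add for exposed users.
    extra_failure = baseline_availability - expected_impact_availability
    # Exposed users expected to see a failure they would not otherwise see.
    extra_affected_users = round(exposed_users * extra_failure)
    # Blended availability across all traffic while the fault is active.
    overall_availability = (
        (1 - traffic_share) * baseline_availability
        + traffic_share * expected_impact_availability
    )
    return {
        "exposed_users": exposed_users,
        "extra_affected_users": extra_affected_users,
        "exposed_user_minutes": exposed_users * duration_min,
        "overall_availability": overall_availability,
    }


if __name__ == "__main__":
    estimate = estimate_blast_radius(
        traffic_share=0.05,
        user_pop=1_000_000,
        duration_min=15,
        baseline_availability=0.999,
        expected_impact_availability=0.95,
    )
    for name, value in estimate.items():
        print(f"{name}: {value}")
```

With these inputs, about 50,000 users are exposed and roughly 2,450 of them would be expected to see an additional failure, while blended availability across all traffic stays near 0.9966 for the duration of the fault.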