Identify single points of failure, assess recovery capabilities, and produce a prioritized remediation plan aligned with the Well-Architected Reliability pillar.
npx skill4agent add aws-samples/sample-well-architected-skills-and-steering reliability-improvement-plan

What workload would you like me to assess for reliability? Please share:
- Architecture overview (services, regions, AZs, dependencies)
- Availability target (99.9%, 99.95%, 99.99%, etc.)
- Recovery objectives (RTO and RPO if defined)
- Past incidents (optional — recent outages or near-misses)
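
An availability target is easier to reason about as a downtime budget. The sketch below is illustrative only: the function name `downtime_budget_minutes` and the 365-day year are assumptions, not part of the skill. It converts a target percentage into allowed minutes of downtime per year; under these assumptions 99.9% works out to roughly 525.6 minutes (about 8.8 hours) per year.

```python
# Minutes in a non-leap year (assumption: leap years are ignored).
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_budget_minutes(availability_pct: float,
                            period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Return the allowed downtime, in minutes, for an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    # The three example targets listed in the prompt above.
    for target in (99.9, 99.95, 99.99):
        budget = downtime_budget_minutes(target)
        print(f"{target}% -> {budget:.1f} minutes of downtime per year")
```

Comparing this budget with the stated RTO gives a quick sanity check: if a single recovery is expected to take longer than the yearly budget, the target and the recovery objective are inconsistent.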
# Reliability Improvement Plan: {Workload Name}
## Summary
- **Date**: {date}
- **Availability target**: {target}
- **Estimated current availability**: {estimate}
- **RTO**: {current} → {target}
- **RPO**: {current} → {target}
- **Findings**: {X} High Risk, {Y} Medium Risk, {Z} Low Risk
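
One rough way to arrive at the estimated current availability is to combine per-component figures. The following is a minimal sketch, assuming component failures are independent (often not true in practice, for example when components share an AZ). The helper names `serial` and `redundant` and the example figures are hypothetical.

```python
from math import prod


def serial(*availabilities: float) -> float:
    """Availability of a chain in which every component must be up."""
    return prod(availabilities)


def redundant(availability: float, copies: int) -> float:
    """Availability of N independent copies where any one copy suffices."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1 - (1 - availability) ** copies


if __name__ == "__main__":
    # Hypothetical figures, for illustration only.
    load_balancer = 0.9999
    app_tier = redundant(0.995, copies=2)  # same service deployed in two AZs
    database = 0.9995                      # single instance, no standby
    estimate = serial(load_balancer, app_tier, database)
    print(f"Estimated availability: {estimate:.4%}")
```

In this shape of calculation the least available serial component dominates the result, which is one reason single points of failure tend to rank first in the findings.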
## Reliability Scorecard
| Domain | Score (1-5) | Key Gap |
|--------|-------------|---------|
| Fault Tolerance | {score} | {gap} |
| Recovery & Backup | {score} | {gap} |
| Scaling & Capacity | {score} | {gap} |
| Change Management | {score} | {gap} |
| Testing & Validation | {score} | {gap} |
## Single Points of Failure
| Component | Severity | Failure Impact | Current Mitigation | Gap | AWS Service to Fix |
|-----------|----------|---------------|-------------------|-----|-------------------|
| {component} | 🔴/🟡/🟢 | {impact} | {mitigation or "None"} | {what's missing} | {service} |
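
A first pass at filling in this table can be mechanical. The sketch below is an illustration, not the skill's actual logic: it flags components that run as a single instance or in a single AZ. The `Component` type, its fields, and the two rules are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """Hypothetical inventory record for one workload component."""
    name: str
    instance_count: int
    availability_zones: tuple[str, ...]


def find_spofs(components: list[Component]) -> list[tuple[str, list[str]]]:
    """Return (component name, reasons) for each likely single point of failure."""
    findings = []
    for component in components:
        reasons = []
        if component.instance_count < 2:
            reasons.append("single instance")
        if len(set(component.availability_zones)) < 2:
            reasons.append("single AZ")
        if reasons:
            findings.append((component.name, reasons))
    return findings


if __name__ == "__main__":
    inventory = [
        Component("api", 3, ("us-east-1a", "us-east-1b")),
        Component("db", 1, ("us-east-1a",)),
    ]
    for name, reasons in find_spofs(inventory):
        print(f"{name}: {', '.join(reasons)}")
```

Rules like these only catch structural gaps. Failure impact and severity still need judgment about what each component serves.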
## High Risk Findings
{Each: SPOF description, blast radius, recommendation, AWS services, effort}
## Remediation Plan
### Quick Wins (< 1 week)
{Low-effort, high-impact: enable backups, turn on Multi-AZ, add health checks}
### Foundation (1-4 weeks)
{Multi-AZ compute, auto-scaling, circuit breakers, deployment safety}
### Advanced (1-3 months)
{Multi-region, chaos engineering, automated failover drills}
## Architecture Recommendations
{Specific changes: multi-AZ, read replicas, circuit breakers, async patterns, etc.}
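
Of the patterns listed, the circuit breaker is the one most often implemented in application code. The following is a minimal, single-threaded sketch of the idea, assuming a fixed failure threshold and reset timeout. The class name, parameters, and defaults are illustrative, and a production version would also need thread safety and metrics.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock          # injectable so tests can control time
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout_s:
            return "half-open"       # allow one trial call through
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("dependency call skipped; circuit is open")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call re-opens the circuit immediately;
            # otherwise the circuit trips once the threshold is reached.
            if (self._opened_at is not None
                    or self._failures >= self.failure_threshold):
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each dependency call, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is a hypothetical function, and treat `CircuitOpenError` as the signal to serve a fallback instead of waiting on a failing dependency.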
## Testing Plan
| Test | What it validates | Frequency | AWS Service |
|------|-------------------|-----------|-------------|
| AZ failover drill | Compute continues in remaining AZs | Monthly | FIS |
| Database failover | RDS/Aurora failover completes in < 60s | Quarterly | FIS |
| Load test | Capacity handles 2x peak | Before releases | Distributed Load Testing |
| Backup restore | RPO is met, data is recoverable | Monthly | AWS Backup |
| Deployment rollback | Bad deploy is reverted in < 5 min | Every deploy | CodeDeploy |
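
The numeric thresholds in this table can be checked automatically once a drill produces measurements. The sketch below assumes measurements arrive as plain numbers keyed by test name. The key names and function names are hypothetical, and the limits mirror the example values in the table (60 seconds, 5 minutes, 2x peak).

```python
# Time limits in seconds, taken from the example rows in the table above.
TIME_LIMITS_S = {
    "database_failover": 60,
    "deployment_rollback": 5 * 60,
}


def evaluate_timings(measured_s: dict[str, float]) -> dict[str, bool]:
    """Return pass/fail for each timed test that has a measurement."""
    return {
        name: measured_s[name] < limit
        for name, limit in TIME_LIMITS_S.items()
        if name in measured_s
    }


def load_test_passed(sustained_rps: float, peak_rps: float,
                     headroom: float = 2.0) -> bool:
    """Check that sustained throughput reached `headroom` times peak."""
    return sustained_rps >= headroom * peak_rps


if __name__ == "__main__":
    # Hypothetical drill results, for illustration only.
    results = evaluate_timings({"database_failover": 42.0,
                                "deployment_rollback": 410.0})
    for name, passed in results.items():
        print(f"{name}: {'PASS' if passed else 'FAIL'}")
    print("load_test:", "PASS" if load_test_passed(2100, 1000) else "FAIL")
```

Tests without a measurement are left out of the result rather than reported as passing, so a skipped drill does not look like a success.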
## Next Steps
{Concrete actions the team should take this week}

Would you like me to:
- Design the multi-AZ architecture in detail?
- Create a chaos engineering experiment plan using AWS FIS?
- Build a failover testing runbook?
- Estimate the cost of the reliability improvements?
- Design circuit breaker patterns for your service dependencies?