reliability-improvement-plan
Reliability Improvement Plan
Step 1: Gather context
Ask the user:
What workload would you like me to assess for reliability? Please share:
- Architecture overview (services, regions, AZs, dependencies)
- Availability target (99.9%, 99.95%, 99.99%, etc.)
- Recovery objectives (RTO and RPO if defined)
- Past incidents (optional — recent outages or near-misses)
If the user has already provided this context, proceed directly to Step 2.
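An availability target is easier to discuss once it is converted into a downtime budget. The following is a minimal sketch of that arithmetic; the function name `downtime_budget_minutes` is illustrative and not part of the original plan, and the year is assumed to be 365 days.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def downtime_budget_minutes(availability_percent: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime (in minutes) that an availability target allows."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.95, 99.99):
        minutes = downtime_budget_minutes(target)
        print(f"{target}% -> {minutes:.1f} min/year")
```

Under these assumptions, 99.9% allows roughly 8.8 hours of downtime per year, 99.95% roughly 4.4 hours, and 99.99% roughly 53 minutes.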
Step 2: Identify single points of failure
For each component, ask: "What happens if this fails?"
Classify each SPOF by severity:
- 🔴 High Risk — total outage or data loss if this component fails
- 🟡 Medium Risk — degraded experience or partial outage
- 🟢 Low Risk — minimal impact, graceful degradation
Check for:
- Single-AZ deployments (databases, compute, caches)
- Single-region dependencies with no failover
- Unreplicated data stores (no backups, no read replicas)
- Hard dependencies on third-party services without fallback
- Single NAT Gateway, single bastion, single load balancer
- Bottlenecks created by shared components (shared-everything) where a shared-nothing design would isolate failures
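To illustrate why hard dependencies and single instances matter, here is a rough sketch of the standard availability arithmetic. It assumes component failures are independent, which is optimistic for real systems (shared AZs, shared deployments), so treat the results as an upper bound. The function names are hypothetical.

```python
from math import prod


def serial_availability(components: list[float]) -> float:
    """Availability when every component is a hard dependency (all must be up)."""
    if any(not 0.0 <= a <= 1.0 for a in components):
        raise ValueError("availabilities must be fractions between 0 and 1")
    return prod(components)


def redundant_availability(single: float, copies: int) -> float:
    """Availability when at least one of `copies` independent replicas must be up."""
    if not 0.0 <= single <= 1.0 or copies < 1:
        raise ValueError("single must be in [0, 1] and copies must be >= 1")
    return 1 - (1 - single) ** copies


if __name__ == "__main__":
    # Three hard dependencies at 99.9% each drag the chain below 99.9%.
    print(f"serial:    {serial_availability([0.999, 0.999, 0.999]):.6f}")
    # Two independent replicas at 99% each behave like a 99.99% component.
    print(f"redundant: {redundant_availability(0.99, 2):.6f}")
```

The comparison is the useful part: chaining dependencies multiplies availabilities down, while adding independent replicas multiplies failure probabilities down.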
If the workload is a data pipeline (S3/Lambda/Step Functions/Glue/EMR/Kinesis/Kafka/Redshift):
- Prioritize data durability over compute availability — message/data loss is worse than temporary processing delay
- Check: a dead-letter queue (DLQ) on every async invocation (Lambda, SQS, EventBridge)
- Check: retry policies with exponential backoff and max attempts
- Check: idempotency guarantees (duplicate processing must be safe)
- Check: poison pill handling (malformed messages must not block the pipeline)
- Check: single-cluster SPOFs in data sinks (Redshift, OpenSearch, RDS) — a single cluster with no failover is a 🔴 High Risk
- Check: timeout configuration on all processing steps (Glue, Lambda, Step Functions)
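The retry, idempotency, poison-pill, and DLQ checks above can be sketched together in one small consumer loop. This is an illustration under stated assumptions, not an AWS API: `process_with_retry`, `PermanentError`, the in-memory `seen_ids` set, and the `dead_letters` list are hypothetical stand-ins for a durable idempotency store and a real DLQ, and the "full jitter" delay is one common choice rather than something the plan prescribes.

```python
import random
import time
from typing import Callable


class PermanentError(Exception):
    """Raised by a handler for input that can never succeed (a poison pill)."""


def process_with_retry(message: dict,
                       handler: Callable[[dict], None],
                       seen_ids: set,
                       dead_letters: list,
                       max_attempts: int = 5,
                       base_delay: float = 0.1,
                       max_delay: float = 5.0,
                       sleep: Callable[[float], None] = time.sleep) -> str:
    """Process one message with bounded retries, idempotency, and a DLQ."""
    msg_id = message["id"]
    if msg_id in seen_ids:
        return "duplicate"  # idempotency: reprocessing is skipped safely

    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            seen_ids.add(msg_id)
            return "processed"
        except PermanentError:
            break  # retrying malformed input cannot help
        except Exception:
            if attempt == max_attempts:
                break  # retry budget exhausted
            # Exponential backoff capped at max_delay, with full jitter.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))

    # Park the message so it does not block the rest of the pipeline.
    dead_letters.append(message)
    return "dead-lettered"
```

A handler signals a poison pill by raising `PermanentError`; any other exception is treated as transient and retried until `max_attempts` is reached.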
Step 3: Assess recovery capabilities
Evaluate:
- Backup strategy — Are backups automated, tested, and cross-region?
- Failover mechanisms — Is failover automatic or manual? How long does it take?
- Health checks — Are they deep enough to detect real failures?
- Deployment rollback — Can a bad deploy be reverted in minutes?
- Dependency isolation — Does one service failure cascade?
- Chaos engineering — Are failure scenarios tested proactively?
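For the dependency isolation question, a circuit breaker is one common way to stop a failing dependency from cascading. The sketch below is a simplified, single-threaded illustration; the class name, thresholds, and injectable `clock` are assumptions made for this example, and a production service would more likely use an established library.

```python
import time
from typing import Callable


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of waiting on a dependency known to be down.
                raise CircuitOpenError("dependency call skipped")
            # Timeout elapsed: half-open, so one trial call is let through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers would typically catch `CircuitOpenError` and serve a fallback (cached data, a default response) so the failure degrades gracefully rather than cascading.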
Step 4: Evaluate scaling and capacity
Assess:
- Is auto-scaling configured with appropriate min/max/cooldown?
- Are service quotas monitored and increased proactively?
- Is there load shedding or throttling for overload scenarios?
- Are queues used to absorb traffic spikes?
- Is capacity tested under peak load?
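As one illustration of throttling for overload scenarios, a token bucket admits requests at a steady rate while allowing short bursts. This is a minimal single-process sketch; the class name, parameters, and injectable `clock` are assumptions for the example, and managed throttling (for instance at an API gateway) may be the better fit in practice.

```python
import time
from typing import Callable


class TokenBucket:
    """Admit up to `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity      # start full so an initial burst is allowed
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True if the request is admitted, False if it should be shed."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A request handler would call `allow()` first and return an overload response (for example HTTP 429) when it yields `False`, which keeps the service responsive for the traffic it does accept.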
Step 5: Assess change management
Evaluate:
- Are deployments canary or blue/green?
- Are database migrations backward-compatible?
- Is there automated rollback on health check failure?
- Are changes tested in a staging environment that mirrors production?
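Automated rollback needs an explicit decision rule. The sketch below shows one possible rule for a canary deployment: roll back when the canary's error rate clearly exceeds the baseline's. The function name and all thresholds (`max_ratio`, `min_requests`, `absolute_floor`) are hypothetical values chosen for illustration and would need tuning per workload.

```python
def should_rollback(baseline_errors: int, baseline_requests: int,
                    canary_errors: int, canary_requests: int,
                    max_ratio: float = 2.0,
                    min_requests: int = 100,
                    absolute_floor: float = 0.01) -> bool:
    """Decide whether a canary should be rolled back based on error rates."""
    if canary_requests < min_requests:
        return False  # too little traffic to judge; keep observing

    canary_rate = canary_errors / canary_requests
    baseline_rate = (baseline_errors / baseline_requests
                     if baseline_requests else 0.0)

    # Roll back only if the canary is worse than both an absolute floor
    # and a multiple of the baseline, to avoid reacting to noise.
    threshold = max(absolute_floor, baseline_rate * max_ratio)
    return canary_rate > threshold


if __name__ == "__main__":
    # Baseline: 0.1% errors. Canary: 2.5% errors over 200 requests.
    print(should_rollback(10, 10_000, 5, 200))
```

In the example call, the canary rate (2.5%) exceeds the threshold (1%), so the function returns `True`.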
Step 6: Produce the plan
Output:
Reliability Improvement Plan: {Workload Name}
Summary
- Date: {date}
- Availability target: {target}
- Estimated current availability: {estimate}
- RTO: {current} → {target}
- RPO: {current} → {target}
- Findings: {X} High Risk, {Y} Medium Risk, {Z} Low Risk
Reliability Scorecard
| Domain | Score (1-5) | Key Gap |
|---|---|---|
| Fault Tolerance | {score} | {gap} |
| Recovery & Backup | {score} | {gap} |
| Scaling & Capacity | {score} | {gap} |
| Change Management | {score} | {gap} |
| Testing & Validation | {score} | {gap} |
Single Points of Failure
| Component | Severity | Failure Impact | Current Mitigation | Gap | AWS Service to Fix |
|---|---|---|---|---|---|
| {component} | 🔴/🟡/🟢 | {impact} | {mitigation or "None"} | {what's missing} | {service} |
High Risk Findings
{Each: SPOF description, blast radius, recommendation, AWS services, effort}
Remediation Plan
Quick Wins (< 1 week)
{Low-effort high-impact: enable backups, turn on Multi-AZ, add health checks}
Foundation (1-4 weeks)
{Multi-AZ compute, auto-scaling, circuit breakers, deployment safety}
Advanced (1-3 months)
{Multi-region, chaos engineering, automated failover drills}
Architecture Recommendations
{Specific changes: multi-AZ, read replicas, circuit breakers, async patterns, etc.}
Testing Plan
| Test | What it validates | Frequency | AWS Service |
|---|---|---|---|
| AZ failover drill | Compute continues in remaining AZs | Monthly | FIS |
| Database failover | RDS/Aurora failover < 60s | Quarterly | FIS |
| Load test | Capacity handles 2x peak | Before releases | Distributed Load Testing |
| Backup restore | RPO is met, data is recoverable | Monthly | AWS Backup |
| Deployment rollback | Bad deploy is reverted < 5 min | Every deploy | CodeDeploy |
Next Steps
{Concrete actions the team should take this week}
Step 7: Offer follow-up
After delivering the plan, offer:
Would you like me to:
- Design the multi-AZ architecture in detail?
- Create a chaos engineering experiment plan using AWS FIS?
- Build a failover testing runbook?
- Estimate the cost of the reliability improvements?
- Design circuit breaker patterns for your service dependencies?