reliability-improvement-plan

Reliability Improvement Plan

Step 1: Gather context

Ask the user:
What workload would you like me to assess for reliability? Please share:
  • Architecture overview (services, regions, AZs, dependencies)
  • Availability target (99.9%, 99.95%, 99.99%, etc.)
  • Recovery objectives (RTO and RPO if defined)
  • Past incidents (optional — recent outages or near-misses)
If context is already provided, proceed directly.
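
An availability target is easier to discuss once it is expressed as a downtime budget. The sketch below does that arithmetic, assuming a 365-day year and treating all downtime as equal; the function name `downtime_budget_minutes` is illustrative and not part of the plan.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_budget_minutes(availability_pct: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime allowed in the period for a target given in percent."""
    if not 0 < availability_pct <= 100:
        raise ValueError("availability_pct must be in (0, 100]")
    return period_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    for target in (99.9, 99.95, 99.99):
        budget = downtime_budget_minutes(target)
        print(f"{target}% allows about {budget:.1f} minutes of downtime per year")
```

Under these assumptions, 99.9% leaves roughly 526 minutes a year, 99.95% roughly 263, and 99.99% roughly 53, which helps check whether a stated RTO is compatible with the target.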

Step 2: Identify single points of failure (SPOFs)

For each component, ask: "What happens if this fails?"
Classify each SPOF by severity:
  • 🔴 High Risk — total outage or data loss if this component fails
  • 🟡 Medium Risk — degraded experience or partial outage
  • 🟢 Low Risk — minimal impact, graceful degradation
Check for:
  • Single-AZ deployments (databases, compute, caches)
  • Single-region dependencies with no failover
  • Unreplicated data stores (no backups, no read replicas)
  • Hard dependencies on third-party services without fallback
  • Single NAT Gateway, single bastion, single load balancer
  • Shared-everything bottlenecks (a resource that every component depends on) where a shared-nothing design would limit the blast radius
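
The checks above can be roughed out as a first-pass triage over a component inventory. This is a minimal sketch under stated assumptions: the `Component` record and the rules in `classify` are hypothetical, and the mapping to the three severity levels is a simplification that a real assessment would refine per component.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical inventory record; the fields are assumptions for illustration."""
    name: str
    kind: str                 # e.g. "database", "nat_gateway", "compute"
    azs: tuple                # Availability Zones the component runs in
    stateful: bool = False    # True if losing it can lose data
    has_backup: bool = False


def classify(component: Component) -> str:
    """Return 'high', 'medium', or 'low' using a deliberately simple rule set."""
    single_az = len(set(component.azs)) < 2
    if component.stateful and (single_az or not component.has_backup):
        return "high"      # failure risks a total outage or data loss
    if single_az:
        return "medium"    # stateless, but still a single-AZ dependency
    return "low"           # redundant across AZs


if __name__ == "__main__":
    inventory = [
        Component("orders-db", "database", ("us-east-1a",), stateful=True, has_backup=True),
        Component("nat", "nat_gateway", ("us-east-1a",)),
        Component("web", "compute", ("us-east-1a", "us-east-1b")),
    ]
    for c in inventory:
        print(f"{c.name}: {classify(c)}")
```

The output of such a triage is only a starting list; the question "What happens if this fails?" still has to be answered for each flagged component.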
If the workload is a data pipeline (S3/Lambda/Step Functions/Glue/EMR/Kinesis/Kafka/Redshift):
  • Prioritize data durability over compute availability — message/data loss is worse than temporary processing delay
  • Check: DLQ on every async invocation (Lambda, SQS, EventBridge)
  • Check: retry policies with exponential backoff and max attempts
  • Check: idempotency guarantees (duplicate processing must be safe)
  • Check: poison pill handling (malformed messages must not block the pipeline)
  • Check: single-cluster SPOFs in data sinks (Redshift, OpenSearch, RDS) — a single cluster with no failover is a 🔴 High Risk
  • Check: timeout configuration on all processing steps (Glue, Lambda, Step Functions)
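
Several of the pipeline checks above (bounded retries with exponential backoff, idempotency, poison pill handling, and a dead-letter destination) fit together in one consumer loop. The sketch below is illustrative only: `process_with_retry`, `PoisonMessage`, the in-memory `seen_ids` set, and the list used as a dead-letter queue are all assumptions, and a real pipeline would use a durable idempotency store and a managed DLQ.

```python
import random
import time


class PoisonMessage(Exception):
    """Raised by a handler when a message is malformed and retrying cannot help."""


def process_with_retry(message, handler, dead_letters, seen_ids,
                       max_attempts=4, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep):
    """Process one message; return 'processed', 'duplicate', or 'dead-lettered'."""
    msg_id = message["id"]
    if msg_id in seen_ids:
        return "duplicate"            # idempotency: redelivery is a safe no-op
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            seen_ids.add(msg_id)
            return "processed"
        except PoisonMessage:
            break                     # malformed input: park it, do not retry
        except Exception:
            if attempt == max_attempts:
                break                 # retry budget exhausted
            # Exponential backoff capped at max_delay, with full jitter.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
    dead_letters.append(message)      # keep the data; never drop it silently
    return "dead-lettered"
```

Parking failed messages instead of discarding them reflects the priority stated above: a processing delay is recoverable, lost data is not.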

Step 3: Assess recovery capabilities

Evaluate:
  • Backup strategy — Are backups automated, tested, and cross-region?
  • Failover mechanisms — Is failover automatic or manual? How long does it take?
  • Health checks — Are they deep enough to detect real failures?
  • Deployment rollback — Can a bad deploy be reverted in minutes?
  • Dependency isolation — Does one service failure cascade?
  • Chaos engineering — Are failure scenarios tested proactively?
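
The dependency isolation question can be made quantitative with a simple availability model. The sketch below assumes component failures are independent, which real systems often violate (shared AZs, shared deployments), so the numbers are an optimistic estimate; the function names are illustrative.

```python
from math import prod


def serial_availability(availabilities):
    """Hard dependencies in series: a request fails if any one component fails."""
    return prod(availabilities)


def redundant_availability(availability, copies):
    """Independent redundant copies: a request fails only if every copy fails."""
    return 1 - (1 - availability) ** copies


if __name__ == "__main__":
    chain = [0.999, 0.999, 0.999]   # three hard dependencies at 99.9% each
    print(f"serial chain: {serial_availability(chain):.6f}")
    print(f"two redundant copies at 99%: {redundant_availability(0.99, 2):.6f}")
```

Three 99.9% hard dependencies in series yield about 99.7% under this model, which is why a cascading dependency chain can miss a target that each individual service meets.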

Step 4: Evaluate scaling and capacity

Assess:
  • Is auto-scaling configured with appropriate min/max/cooldown?
  • Are service quotas monitored and increased proactively?
  • Is there load shedding or throttling for overload scenarios?
  • Are queues used to absorb traffic spikes?
  • Is capacity tested under peak load?
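
For the load shedding and throttling question, one common technique is a token bucket that rejects requests once a burst allowance is spent. This is a minimal single-threaded sketch, not a description of any AWS service's throttling; the `TokenBucket` class and its parameters are assumptions for illustration.

```python
import time


class TokenBucket:
    """Admit requests while tokens remain; refill at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # sustained requests per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity        # start with a full bucket
        self.clock = clock            # injectable so tests can control time
        self.last = clock()

    def allow(self, cost=1.0):
        """Return True to admit the request, False to shed it."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A caller would typically answer a shed request with an explicit overload response, for example HTTP 429, so that clients back off rather than retry immediately.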

Step 5: Assess change management

Evaluate:
  • Are deployments canary or blue/green?
  • Are database migrations backward-compatible?
  • Is there automated rollback on health check failure?
  • Are changes tested in a staging environment that mirrors production?
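
Automated rollback needs an explicit decision rule. The sketch below compares a canary's error rate with the baseline fleet; the function `should_roll_back` and every threshold in it are hypothetical defaults chosen for illustration, not values taken from CodeDeploy or any other tool.

```python
def should_roll_back(baseline_errors, baseline_requests,
                     canary_errors, canary_requests,
                     min_requests=100, max_ratio=2.0,
                     floor_rate=0.01, max_abs_rate=0.05):
    """Return True to roll back, False to continue, None if traffic is too low to judge."""
    if canary_requests < min_requests:
        return None                   # not enough canary traffic yet
    canary_rate = canary_errors / canary_requests
    baseline_rate = baseline_errors / baseline_requests if baseline_requests else 0.0
    if canary_rate > max_abs_rate:
        return True                   # unhealthy regardless of the baseline
    # Tolerate noise below floor_rate; otherwise require the canary to stay
    # within max_ratio times the baseline error rate.
    return canary_rate > max(floor_rate, max_ratio * baseline_rate)


if __name__ == "__main__":
    print(should_roll_back(10, 10_000, 0, 50))      # too little traffic
    print(should_roll_back(10, 10_000, 1, 1_000))   # within tolerance
    print(should_roll_back(10, 10_000, 30, 1_000))  # clearly worse than baseline
```

Returning a third "not enough data" state keeps the deployment from promoting or reverting on a handful of requests.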

Step 6: Produce the plan

Output (markdown):

Reliability Improvement Plan: {Workload Name}

Summary

  • Date: {date}
  • Availability target: {target}
  • Estimated current availability: {estimate}
  • RTO: {current} → {target}
  • RPO: {current} → {target}
  • Findings: {X} High Risk, {Y} Medium Risk, {Z} Low Risk

Reliability Scorecard

| Domain | Score (1-5) | Key Gap |
| --- | --- | --- |
| Fault Tolerance | {score} | {gap} |
| Recovery & Backup | {score} | {gap} |
| Scaling & Capacity | {score} | {gap} |
| Change Management | {score} | {gap} |
| Testing & Validation | {score} | {gap} |

Single Points of Failure

| Component | Severity | Failure Impact | Current Mitigation | Gap | AWS Service to Fix |
| --- | --- | --- | --- | --- | --- |
| {component} | 🔴/🟡/🟢 | {impact} | {mitigation or "None"} | {what's missing} | {service} |

High Risk Findings

{Each: SPOF description, blast radius, recommendation, AWS services, effort}

Remediation Plan

Quick Wins (< 1 week)

{Low-effort high-impact: enable backups, turn on Multi-AZ, add health checks}

Foundation (1-4 weeks)

{Multi-AZ compute, auto-scaling, circuit breakers, deployment safety}

Advanced (1-3 months)

{Multi-region, chaos engineering, automated failover drills}

Architecture Recommendations

{Specific changes: multi-AZ, read replicas, circuit breakers, async patterns, etc.}

Testing Plan

| Test | What it validates | Frequency | AWS Service |
| --- | --- | --- | --- |
| AZ failover drill | Compute continues in remaining AZs | Monthly | FIS |
| Database failover | RDS/Aurora failover < 60s | Quarterly | FIS |
| Load test | Capacity handles 2x peak | Before releases | Distributed Load Testing |
| Backup restore | RPO is met, data is recoverable | Monthly | AWS Backup |
| Deployment rollback | Bad deploy is reverted < 5 min | Every deploy | CodeDeploy |

Next Steps

{Concrete actions the team should take this week}

Step 7: Offer follow-up

After delivering the plan, offer:
Would you like me to:
  • Design the multi-AZ architecture in detail?
  • Create a chaos engineering experiment plan using AWS FIS?
  • Build a failover testing runbook?
  • Estimate the cost of the reliability improvements?
  • Design circuit breaker patterns for your service dependencies?