reliability-improvement-plan

Reliability Improvement Plan

Step 1: Gather context

Ask the user:

What workload would you like me to assess for reliability? Please share:

Architecture overview (services, regions, AZs, dependencies)

Availability target (99.9%, 99.95%, 99.99%, etc.)

Recovery objectives (RTO and RPO if defined)

Past incidents (optional — recent outages or near-misses)

If context is already provided, proceed directly.
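To make the availability target concrete, it can help to translate the percentage into a downtime budget. The following is a minimal sketch, assuming a 365-day year; the function name `downtime_budget_minutes` is illustrative and not part of any AWS tooling.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability_pct: float, period_days: float = 365.0) -> float:
    """Return the allowed downtime, in minutes, for an availability target.

    availability_pct is a percentage such as 99.9; period_days is the
    measurement window (assumed here to be a 365-day year by default).
    """
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return unavailable_fraction * period_days * MINUTES_PER_DAY


if __name__ == "__main__":
    # Print yearly and 30-day budgets for the targets listed above.
    for target in (99.9, 99.95, 99.99):
        yearly = downtime_budget_minutes(target)
        monthly = downtime_budget_minutes(target, period_days=30)
        print(f"{target}%: {yearly:.1f} min/year, {monthly:.1f} min/30 days")
```

Under these assumptions, 99.9% allows roughly 525.6 minutes of downtime per year and 99.99% roughly 52.6 minutes, which is a useful sanity check against the stated RTO.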

Step 2: Identify single points of failure (SPOFs)

For each component, ask: "What happens if this fails?"

Classify each SPOF by severity:

🔴 High Risk — total outage or data loss if this component fails
🟡 Medium Risk — degraded experience or partial outage
🟢 Low Risk — minimal impact, graceful degradation

Check for:

Single-AZ deployments (databases, compute, caches)
Single-region dependencies with no failover
Unreplicated data stores (no backups, no read replicas)
Hard dependencies on third-party services without fallback
Single NAT Gateway, single bastion, single load balancer
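Shared-everything bottlenecks (a single database, cache, or queue that all services depend on), as opposed to shared-nothing designs that isolate failures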
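The effect of a SPOF can be estimated with simple availability arithmetic: components that must all be up multiply their availabilities, while redundant copies multiply their failure probabilities. The sketch below assumes failures are independent, which is often optimistic (copies in the same AZ tend to fail together). The function names and the example figures are hypothetical.

```python
from math import prod
from typing import Iterable


def serial_availability(availabilities: Iterable[float]) -> float:
    """Availability of a chain in which every component must be up."""
    return prod(availabilities)


def redundant_availability(availability: float, copies: int) -> float:
    """Availability of N independent copies where one surviving copy suffices."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - availability) ** copies


if __name__ == "__main__":
    # Hypothetical figures: load balancer, app tier, single-AZ database.
    single = serial_availability([0.9999, 0.999, 0.995])
    # Same chain, but the database now has two independent copies.
    paired = serial_availability([0.9999, 0.999, redundant_availability(0.995, 2)])
    print(f"single database: {single:.6f}")
    print(f"two databases:   {paired:.6f}")
```

The chain can never be more available than its weakest serial component, which is why a single-AZ data store tends to dominate the estimate.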

If the workload is a data pipeline (S3/Lambda/Step Functions/Glue/EMR/Kinesis/Kafka/Redshift):

Prioritize data durability over compute availability — message/data loss is worse than temporary processing delay
Check: a dead-letter queue (DLQ) on every async invocation path (Lambda, SQS, EventBridge)
Check: retry policies with exponential backoff and max attempts
Check: idempotency guarantees (duplicate processing must be safe)
Check: poison pill handling (malformed messages must not block the pipeline)
Check: single-cluster SPOFs in data sinks (Redshift, OpenSearch, RDS) — a single cluster with no failover is a 🔴 High Risk
Check: timeout configuration on all processing steps (Glue, Lambda, Step Functions)
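Several of the pipeline checks above (bounded retries with exponential backoff, poison pill handling, and a DLQ) can be illustrated together. This is a minimal sketch, not an AWS API: an in-memory list stands in for a real dead-letter queue, and `PoisonPillError`, `backoff_delay`, and `process_with_retry` are hypothetical names. Because the handler may run more than once per message, it is assumed to be idempotent.

```python
import random
import time
from typing import Any, Callable, List


class PoisonPillError(Exception):
    """Raised by a handler when a message is malformed and retrying cannot help."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter exponential backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def process_with_retry(
    message: Any,
    handler: Callable[[Any], None],
    dead_letters: List[Any],
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run handler with bounded retries; park the message in dead_letters on failure."""
    for attempt in range(max_attempts):
        try:
            handler(message)
            return True
        except PoisonPillError:
            break  # Malformed input: skip retries so it cannot block the pipeline.
        except Exception:
            if attempt < max_attempts - 1:
                sleep(backoff_delay(attempt))  # Wait before the next attempt.
    dead_letters.append(message)  # Keep the message for inspection and replay.
    return False
```

The key property is that a failed message is never silently dropped: it either succeeds within `max_attempts` or ends up in the dead-letter store, which matches the priority on data durability over processing speed.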

Step 3: Assess recovery capabilities

Evaluate:

Backup strategy — Are backups automated, tested, and cross-region?
Failover mechanisms — Is failover automatic or manual? How long does it take?
Health checks — Are they deep enough to detect real failures?
Deployment rollback — Can a bad deploy be reverted in minutes?
Dependency isolation — Does one service failure cascade?
Chaos engineering — Are failure scenarios tested proactively?
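For the dependency isolation question, a circuit breaker is one common way to stop a failing dependency from cascading into its callers. The sketch below is a simplified, single-threaded illustration; the class name, thresholds, and injectable `clock` parameter are assumptions made for the example, and production code would more likely use an established resilience library.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Opens after consecutive failures, then allows a trial call after a timeout."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency call skipped: circuit is open")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers fail fast with `CircuitOpenError` and can serve a fallback instead of queuing up behind a dependency that is already struggling.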

Step 4: Evaluate scaling and capacity

Assess:

Is auto-scaling configured with appropriate min/max/cooldown?
Are service quotas monitored and increased proactively?
Is there load shedding or throttling for overload scenarios?
Are queues used to absorb traffic spikes?
Is capacity tested under peak load?
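As an illustration of the load shedding and throttling question, a token bucket is one simple way to reject excess requests instead of letting them overload the service. This is a minimal single-threaded sketch; the class name, the `rate` and `capacity` parameters, and the injectable `clock` are assumptions for the example rather than a reference to any specific AWS feature.

```python
import time
from typing import Callable


class TokenBucket:
    """Admits requests at a sustained rate while allowing short bursts."""

    def __init__(
        self,
        rate: float,
        capacity: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate            # Tokens added per second.
        self.capacity = capacity    # Maximum burst size.
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self) -> bool:
        """Return True if the request may proceed, False if it should be shed."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(float(self.capacity), self._tokens + elapsed * self.rate)
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

A caller would typically return an HTTP 429 or 503 when `allow()` is False, so that clients back off rather than pile on during a spike.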

Step 5: Assess change management

Evaluate:

Are deployments canary or blue/green?
Are database migrations backward-compatible?
Is there automated rollback on health check failure?
Are changes tested in a staging environment that mirrors production?
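Automated rollback during a canary deployment usually comes down to a comparison between the canary and the baseline fleet. The sketch below shows one possible decision rule; the `WindowStats` type, the `should_roll_back` function, and the default thresholds are hypothetical and would need tuning for a real workload.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    """Request and error counts observed over one evaluation window."""

    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Treat an empty window as error-free to avoid dividing by zero.
        return self.errors / self.requests if self.requests else 0.0


def should_roll_back(
    baseline: WindowStats,
    canary: WindowStats,
    max_abs_increase: float = 0.01,
    min_requests: int = 100,
) -> bool:
    """Return True when the canary error rate exceeds the baseline by too much."""
    if canary.requests < min_requests:
        return False  # Not enough canary traffic to judge yet.
    return canary.error_rate - baseline.error_rate > max_abs_increase
```

The `min_requests` guard is a design choice: it avoids rolling back on a handful of unlucky requests, at the cost of reacting more slowly to a genuinely bad deploy.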

Step 6: Produce the plan

Output:

Reliability Improvement Plan: {Workload Name}

Summary

Date: {date}
Availability target: {target}
Estimated current availability: {estimate}
RTO: {current} → {target}
RPO: {current} → {target}
Findings: {X} High Risk, {Y} Medium Risk, {Z} Low Risk

Reliability Scorecard

| Domain | Score (1-5) | Key Gap |
| --- | --- | --- |
| Fault Tolerance | {score} | {gap} |
| Recovery & Backup | {score} | {gap} |
| Scaling & Capacity | {score} | {gap} |
| Change Management | {score} | {gap} |
| Testing & Validation | {score} | {gap} |

Single Points of Failure

| Component | Severity | Failure Impact | Current Mitigation | Gap | AWS Service to Fix |
| --- | --- | --- | --- | --- | --- |
| {component} | 🔴/🟡/🟢 | {impact} | {mitigation or "None"} | {what's missing} | {service} |

High Risk Findings

{Each: SPOF description, blast radius, recommendation, AWS services, effort}

Remediation Plan

Quick Wins (< 1 week)

{Low-effort high-impact: enable backups, turn on Multi-AZ, add health checks}

Foundation (1-4 weeks)

{Multi-AZ compute, auto-scaling, circuit breakers, deployment safety}

Advanced (1-3 months)

{Multi-region, chaos engineering, automated failover drills}

Architecture Recommendations

{Specific changes: multi-AZ, read replicas, circuit breakers, async patterns, etc.}

Testing Plan

| Test | What it validates | Frequency | AWS Service |
| --- | --- | --- | --- |
| AZ failover drill | Compute continues in remaining AZs | Monthly | FIS |
| Database failover | RDS/Aurora failover < 60s | Quarterly | FIS |
| Load test | Capacity handles 2x peak | Before releases | Distributed Load Testing |
| Backup restore | RPO is met, data is recoverable | Monthly | AWS Backup |
| Deployment rollback | Bad deploy is reverted < 5 min | Every deploy | CodeDeploy |

Next Steps

{Concrete actions the team should take this week}

Step 7: Offer follow-up

After delivering the plan, offer:

Would you like me to:

Design the multi-AZ architecture in detail?

Create a chaos engineering experiment plan using AWS FIS?

Build a failover testing runbook?

Estimate the cost of the reliability improvements?

Design circuit breaker patterns for your service dependencies?
