AWS Well-Architected Framework

Expert guidance for designing, reviewing, and improving AWS architectures using the six pillars of the Well-Architected Framework.

When to Use

Use this skill when:

Reviewing existing AWS architecture for best practices
Designing new cloud systems or applications
Troubleshooting operational issues, security vulnerabilities, or reliability problems
Optimizing costs or improving performance
Preparing for architecture reviews or audits
Migrating workloads to AWS
Addressing compliance or sustainability requirements
User asks "is my architecture good?" or "how can I improve my AWS setup?"

Core Principle

Systematic architecture evaluation across 6 pillars ensures balanced, well-designed systems that meet business objectives.

The AWS Well-Architected Framework provides a consistent approach for evaluating cloud architectures and implementing scalable designs.

The Six Pillars

Pillar	Focus	Key Question
Operational Excellence	Run and monitor systems	How do we operate effectively?
Security	Protect information and systems	How do we protect data and resources?
Reliability	Recover from failures	How do we ensure workload availability?
Performance Efficiency	Use resources effectively	How do we meet performance requirements?
Cost Optimization	Avoid unnecessary costs	How do we achieve cost-effective outcomes?
Sustainability	Minimize environmental impact	How do we reduce carbon footprint?

Architecture Review Workflow

CRITICAL: You MUST review ALL 6 pillars systematically. Never skip a pillar because it "seems not applicable" - every workload has considerations across all pillars.

dot

digraph review_flow {
    "Architecture review needed" [shape=doublecircle];
    "Identify workload scope" [shape=box];
    "Review each pillar systematically" [shape=box];
    "Document findings per pillar" [shape=box];
    "Prioritize improvements" [shape=box];
    "Create action plan" [shape=box];
    "All pillars reviewed?" [shape=diamond];
    "Complete" [shape=doublecircle];

    "Architecture review needed" -> "Identify workload scope";
    "Identify workload scope" -> "Review each pillar systematically";
    "Review each pillar systematically" -> "Document findings per pillar";
    "Document findings per pillar" -> "All pillars reviewed?";
    "All pillars reviewed?" -> "Review each pillar systematically" [label="no"];
    "All pillars reviewed?" -> "Prioritize improvements" [label="yes"];
    "Prioritize improvements" -> "Create action plan";
    "Create action plan" -> "Complete";
}

Red Flags - You're Skipping the Framework:

"This pillar doesn't apply to this workload" - WRONG, every pillar applies
Jumping straight to recommendations without documenting current state
Only reviewing 3-4 pillars instead of all 6
Providing generic advice instead of workload-specific assessment
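
The "All pillars reviewed?" check in the flow above can be made mechanical. The following is a minimal sketch, assuming findings are collected as plain objects; the `PillarFinding` shape and the function names are hypothetical and not part of any AWS tool.

```typescript
// The six pillar names as listed in the pillar table.
const PILLARS = [
  "Operational Excellence",
  "Security",
  "Reliability",
  "Performance Efficiency",
  "Cost Optimization",
  "Sustainability",
] as const;

type Pillar = (typeof PILLARS)[number];

// Hypothetical shape for one documented finding (an assumption for this sketch).
interface PillarFinding {
  pillar: Pillar;
  currentState: string; // what exists today, recorded before any recommendation
  recommendations: string[];
}

// Returns the pillars that have no documented current state yet.
function missingPillars(findings: PillarFinding[]): Pillar[] {
  const documented = new Set<Pillar>();
  for (const finding of findings) {
    if (finding.currentState.trim().length > 0) {
      documented.add(finding.pillar);
    }
  }
  return PILLARS.filter((pillar) => !documented.has(pillar));
}

// Mirrors the "All pillars reviewed?" decision node in the review flow.
function isReviewComplete(findings: PillarFinding[]): boolean {
  return missingPillars(findings).length === 0;
}
```

A review that only fills in Security would get five pillar names back from `missingPillars`, which makes a skipped pillar visible before any recommendations are written.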

Pillar 1: Operational Excellence

Goal: Support development and run workloads effectively, gain insight into operations, and continuously improve processes.

Design Principles

Perform operations as code (IaC)
Make frequent, small, reversible changes
Refine operations procedures frequently
Anticipate failure
Learn from operational events and failures

Key Areas

Organization:

How do teams share architecture knowledge?
Are there clear ownership and accountability models?

Prepare:

How do you design workloads for observability?
Infrastructure as code implementation?
Deployment practices (CI/CD)?

Operate:

What's the runbook for common operations?
How do you understand workload health?
How do you respond to events?

Evolve:

How do you learn from operational events?
Process for continuous improvement?

Common Issues & Solutions

Issue	Solution
Manual deployments	Implement CI/CD with CloudFormation/CDK/Terraform
No visibility into system health	Add CloudWatch dashboards, metrics, alarms
Operational procedures outdated	Regular runbook reviews, post-incident learning
Slow incident response	Create automated remediation with Lambda/Systems Manager

Quick Implementation Checklist

Infrastructure defined as code (CloudFormation/CDK/Terraform)
CI/CD pipeline implemented
CloudWatch dashboards for key metrics
Alarms for critical thresholds
Runbooks documented and accessible
Regular game days to test procedures
Post-incident review process
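
To make "alarms for critical thresholds" concrete, here is a simplified model of an "M out of N datapoints" threshold alarm. It is a sketch for reasoning about alarm sensitivity, not the CloudWatch API: the field names echo the CloudWatch alarm parameters `EvaluationPeriods` and `DatapointsToAlarm`, but the evaluation logic is an assumption that ignores missing-data handling.

```typescript
// Simplified alarm rule; a stand-in for illustration only.
interface AlarmRule {
  threshold: number;         // e.g. 80 for 80% CPU utilization
  evaluationPeriods: number; // N: how many of the most recent datapoints to inspect
  datapointsToAlarm: number; // M: how many of those must breach the threshold
}

type AlarmState = "OK" | "ALARM" | "INSUFFICIENT_DATA";

// Evaluates the most recent N datapoints (oldest first) against the rule.
function evaluateAlarm(rule: AlarmRule, datapoints: number[]): AlarmState {
  if (datapoints.length < rule.evaluationPeriods) {
    return "INSUFFICIENT_DATA";
  }
  const recent = datapoints.slice(-rule.evaluationPeriods);
  const breaching = recent.filter((value) => value > rule.threshold).length;
  return breaching >= rule.datapointsToAlarm ? "ALARM" : "OK";
}
```

Requiring several breaching datapoints (for example 3 out of 5) trades detection speed for fewer false alarms on short spikes, which is worth deciding explicitly per metric.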

Pillar 2: Security

Goal: Protect data, systems, and assets through cloud security practices.

Design Principles

Implement strong identity foundation
Enable traceability
Apply security at all layers
Automate security best practices
Protect data in transit and at rest
Keep people away from data
Prepare for security events

Key Areas

Security Foundations:

How do you manage credentials and authentication?
IAM roles and policies following least privilege?

Identity and Access Management:

How do you manage identities for people and machines?
MFA enabled for all human access?

Detection:

How do you detect and investigate security events?
CloudTrail, GuardDuty, Security Hub configured?

Infrastructure Protection:

How do you protect networks and compute?
VPC configuration, security groups, NACLs?

Data Protection:

How do you classify and protect data?
Encryption at rest and in transit?

Incident Response:

How do you respond to security incidents?
Incident response plan tested?
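
The least-privilege question above can be partly checked in code. The sketch below models a small subset of the IAM JSON policy format and flags `Allow` statements that use a bare `*` or a service-wide `service:*` action, or a bare `*` resource. The function name and the bucket ARN are hypothetical, and the check is a heuristic rather than a replacement for tools such as IAM Access Analyzer.

```typescript
// Minimal subset of the IAM policy document structure, enough for this sketch.
interface PolicyStatement {
  Effect: "Allow" | "Deny";
  Action: string | string[];
  Resource: string | string[];
}

interface PolicyDocument {
  Version: "2012-10-17";
  Statement: PolicyStatement[];
}

// IAM allows a single string or a list; normalize to a list.
function toList(value: string | string[]): string[] {
  return Array.isArray(value) ? value : [value];
}

// Returns Allow statements that grant wildcard actions or a wildcard resource.
function findOverlyBroadStatements(policy: PolicyDocument): PolicyStatement[] {
  return policy.Statement.filter((statement) => {
    if (statement.Effect !== "Allow") {
      return false;
    }
    const broadAction = toList(statement.Action).some(
      (action) => action === "*" || /:\*$/.test(action),
    );
    const broadResource = toList(statement.Resource).some((resource) => resource === "*");
    return broadAction || broadResource;
  });
}
```

Some actions only support `"Resource": "*"`, so a flagged statement is a prompt for review rather than proof of a misconfiguration.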

Critical Security Patterns

Never Do:

typescript

// ❌ DANGEROUS: Hardcoded credentials
const AWS = require('aws-sdk');
const s3 = new AWS.S3({
  accessKeyId: 'AKIAIOSFODNN7EXAMPLE',
  secretAccessKey: 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
});

Always Do:

typescript

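// ✅ CORRECT: Use IAM roles
// --- Application code: no keys in source ---
const AWS = require('aws-sdk');
const s3 = new AWS.S3(); // Credentials are resolved from the attached IAM role

// --- CDK infrastructure code (separate file, inside a Stack or Construct) ---
import * as lambda from 'aws-cdk-lib/aws-lambda';

// Named `fn` so the variable does not shadow the imported `lambda` module
const fn = new lambda.Function(this, 'MyFunction', {
  // IAM role with least privilege (myRole is defined elsewhere)
  role: myRole,
  // ...
});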

Security Checklist

Pillar 3: Reliability

Goal: Ensure the workload performs its intended function correctly and consistently.

Design Principles

Automatically recover from failure
Test recovery procedures
Scale horizontally
Stop guessing capacity
Manage change through automation

Key Areas

Foundations:

How do you manage service quotas and constraints?
Network topology designed for HA?

Workload Architecture:

How do you design workload service architecture?
Microservices vs monolith considerations?

Change Management:

How do you monitor workload resources?
How are changes deployed safely?

Failure Management:

How do you back up data?
How do you design for resilience?
DR plan and RTO/RPO defined?

High Availability Patterns

Multi-AZ Deployment:

Region
├── AZ-1: Application + Database
├── AZ-2: Application + Database (standby)
└── AZ-3: Application + Database (standby)

Multi-Region Deployment:

Primary Region          Secondary Region
├── Active workload    ├── Standby/Active
├── Database (primary) ├── Database (replica)
└── Route 53 health check monitoring
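
The benefit of the redundant layouts above can be estimated with basic availability arithmetic. This is a rough sketch that assumes copies fail independently, which real Availability Zones and Regions only approximate; the function names are illustrative.

```typescript
// Availability of n redundant copies where any single healthy copy is enough.
// Assumes independent failures: A = 1 - (1 - a)^n.
function redundantAvailability(perCopy: number, copies: number): number {
  if (perCopy < 0 || perCopy > 1) {
    throw new RangeError("perCopy must be between 0 and 1");
  }
  if (copies < 1 || Math.floor(copies) !== copies) {
    throw new RangeError("copies must be a positive integer");
  }
  return 1 - Math.pow(1 - perCopy, copies);
}

// Availability of a chain of components that must all work: A = a1 * a2 * ...
function serialAvailability(parts: number[]): number {
  return parts.reduce((product, availability) => product * availability, 1);
}

// Expected downtime per (non-leap) year, in minutes.
function downtimeMinutesPerYear(availability: number): number {
  return (1 - availability) * 365 * 24 * 60;
}
```

Under these assumptions, two copies at 99% each give roughly 99.99%, while two 99% components in series drop to about 98%. This is why adding hard dependencies costs availability and adding redundancy buys it back.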

Backup Strategy

Data Type	Solution	RPO	RTO
RDS	Automated backups + snapshots	< 5 min	< 30 min
DynamoDB	Point-in-time recovery	Seconds	Minutes
S3	Versioning + cross-region replication	Minutes (replication is asynchronous)	Immediate
EBS	Snapshots via AWS Backup	Hours	Hours
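
The RPO and RTO figures in the table are indicative; whether a given plan meets them depends on the backup interval and on restore times measured in a recovery test. A small sketch of that comparison follows, with hypothetical type and function names.

```typescript
// Targets agreed with the business (assumed inputs, not values from an AWS API).
interface RecoveryTarget {
  rpoMinutes: number; // maximum tolerable data loss
  rtoMinutes: number; // maximum tolerable time to restore service
}

interface BackupPlan {
  backupIntervalMinutes: number;  // time between consecutive recovery points
  measuredRestoreMinutes: number; // restore duration observed in a recovery test
}

interface TargetCheck {
  worstCaseDataLossMinutes: number;
  rpoMet: boolean;
  rtoMet: boolean;
}

// Worst case, a failure happens just before the next recovery point is taken,
// so the data at risk equals one full backup interval.
function checkPlan(plan: BackupPlan, target: RecoveryTarget): TargetCheck {
  return {
    worstCaseDataLossMinutes: plan.backupIntervalMinutes,
    rpoMet: plan.backupIntervalMinutes <= target.rpoMinutes,
    rtoMet: plan.measuredRestoreMinutes <= target.rtoMinutes,
  };
}
```

For example, a daily snapshot checked against a one-hour RPO fails the RPO test no matter how fast the restore is, which points toward continuous backup or point-in-time recovery instead.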

Reliability Checklist

Pillar 4: Performance Efficiency

Goal: Use computing resources efficiently to meet requirements and maintain efficiency as demand changes.

Design Principles

Democratize advanced technologies
Go global in minutes
Use serverless architectures
Experiment more often
Consider mechanical sympathy

Key Areas

Selection:

How do you select appropriate resource types and sizes?
Compute: EC2, Lambda, Fargate, ECS, EKS?
Database: RDS, DynamoDB, Aurora, ElastiCache?
Storage: S3, EFS, EBS, Glacier?

Review:

How do you evolve the workload to take advantage of new resources?
Regular review of new AWS features?

Monitoring:

How do you monitor resources?
CloudWatch, X-Ray for distributed tracing?

Trade-offs:

How do you use trade-offs to improve performance?
Caching, consistency models, compression?

Performance Patterns

Caching Strategy:

Client → CloudFront (edge cache)
  → API Gateway
    → Lambda
      → ElastiCache (data cache)
        → DynamoDB/RDS
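
The data-cache layer in the chain above is commonly used in a cache-aside style: read from the cache first and fall back to the database on a miss. The sketch below illustrates that pattern with an in-memory `Map` standing in for ElastiCache and a `load` callback standing in for a DynamoDB or RDS query. Both are placeholders, and the loader is synchronous only to keep the example short; a real one would return a Promise.

```typescript
interface CacheEntry<V> {
  value: V;
  expiresAt: number; // epoch milliseconds
}

// Cache-aside helper: the application, not the cache, decides when to load.
class CacheAside<V> {
  private readonly entries = new Map<string, CacheEntry<V>>();
  private readonly load: (key: string) => V;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(load: (key: string) => V, ttlMs: number, now: () => number = Date.now) {
    this.load = load;
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, useful for tests
  }

  get(key: string): V {
    const hit = this.entries.get(key);
    if (hit !== undefined && hit.expiresAt > this.now()) {
      return hit.value; // cache hit: the data store is not touched
    }
    const value = this.load(key); // miss or expired entry: read from the source
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    return value;
  }

  // Call after a write so the next read does not serve stale data.
  invalidate(key: string): void {
    this.entries.delete(key);
  }
}
```

The TTL is the main trade-off knob: a longer TTL lowers database load and latency but widens the window in which readers can see stale data.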

Database Selection:

Use Case	Recommended Service
Relational, complex queries	RDS (PostgreSQL/MySQL)
High throughput, simple queries	DynamoDB
Graph relationships	Neptune
Search and analytics	OpenSearch
Time-series data	Timestream
In-memory cache	ElastiCache (Redis/Memcached)

Performance Checklist

Right-sized compute instances (not over-provisioned)
Content delivery through CloudFront
Database read replicas for read-heavy workloads
Caching layer (ElastiCache, DAX, CloudFront)
Asynchronous processing with SQS/SNS/EventBridge
Auto Scaling configured appropriately
Database indexes optimized
Monitoring with CloudWatch and X-Ray
Regular performance testing under load

Pillar 5: Cost Optimization

Goal: Run systems that deliver business value at the lowest price point.

Design Principles

Implement cloud financial management
Adopt consumption model
Measure overall efficiency
Stop spending on undifferentiated heavy lifting
Analyze and attribute expenditure

Key Areas

Practice Cloud Financial Management:

Cost allocation tags implemented?
Budgets and alerts configured?

Expenditure and Usage Awareness:

How do you govern usage?
Cost Explorer and AWS Budgets configured?

Cost-Effective Resources:

How do you evaluate cost when selecting services?
Reserved Instances or Savings Plans for predictable workloads?

Manage Demand:

How do you manage demand and supply resources?
Throttling, caching to reduce demand?

Optimize Over Time:

How do you evaluate new services?
Regular review of cost optimization opportunities?

Cost Optimization Strategies

Strategy	Implementation	Potential Savings
Right-sizing	Use Compute Optimizer recommendations	20-40%
Reserved Instances	1-year or 3-year commitments	Up to 72%
Savings Plans	Flexible compute commitments	30-70%
Spot Instances	Fault-tolerant workloads	50-90%
S3 Intelligent-Tiering	Automatic storage class optimization	40-60%
Auto Scaling	Scale resources with demand	30-50%
Lambda instead of EC2	For appropriate workloads	Varies
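
Commitment discounts only pay off above a certain utilization, because the committed rate is billed for every hour whether or not the resource runs. The sketch below compares the two options for one resource. The rate, the discount, and the 730-hour month are illustrative assumptions, not AWS prices.

```typescript
interface CommitmentInput {
  onDemandHourlyUsd: number;  // hypothetical on-demand rate
  commitmentDiscount: number; // 0.4 means the committed rate is 40% below on-demand
  hoursUsedPerMonth: number;  // hours the resource actually runs
  hoursInMonth?: number;      // defaults to 730, a common monthly average
}

interface CommitmentComparison {
  onDemandUsd: number;
  committedUsd: number;
  breakEvenHours: number; // usage above this makes the commitment cheaper
  cheaper: "on-demand" | "commitment";
}

function compareCommitment(input: CommitmentInput): CommitmentComparison {
  const hoursInMonth = input.hoursInMonth ?? 730;
  const committedRate = input.onDemandHourlyUsd * (1 - input.commitmentDiscount);
  // On-demand is billed only for hours used.
  const onDemandUsd = input.onDemandHourlyUsd * input.hoursUsedPerMonth;
  // The commitment is billed for every hour in the month.
  const committedUsd = committedRate * hoursInMonth;
  const breakEvenHours = (1 - input.commitmentDiscount) * hoursInMonth;
  return {
    onDemandUsd,
    committedUsd,
    breakEvenHours,
    cheaper: committedUsd < onDemandUsd ? "commitment" : "on-demand",
  };
}
```

With a 40% discount the break-even is 438 of 730 hours, so a workload that runs only during business hours may be cheaper on demand (or on a schedule) than under a commitment.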

Cost Monitoring

typescript

// CDK Example: Set up budget alerts
import * as budgets from 'aws-cdk-lib/aws-budgets';

new budgets.CfnBudget(this, 'MonthlyBudget', {
  budget: {
    budgetType: 'COST',
    timeUnit: 'MONTHLY',
    budgetLimit: {
      amount: 1000,
      unit: 'USD',
    },
  },
  notificationsWithSubscribers: [{
    notification: {
      notificationType: 'ACTUAL',
      comparisonOperator: 'GREATER_THAN',
      threshold: 80, // Alert at 80%
    },
    subscribers: [{
      subscriptionType: 'EMAIL',
      address: 'team@example.com',
    }],
  }],
});

Cost Optimization Checklist

Pillar 6: Sustainability

Goal: Minimize the environmental impact of running cloud workloads.

Design Principles

Understand your impact
Establish sustainability goals
Maximize utilization
Anticipate and adopt new, more efficient offerings
Use managed services
Reduce downstream impact

Key Areas

Region Selection:

Choose regions with renewable energy
AWS regions with lower carbon intensity

User Behavior Patterns:

Scale resources with demand
Remove unused resources

Software and Architecture:

Optimize code for efficiency
Use appropriate services (serverless over provisioned)

Data Patterns:

Minimize data movement
Use data compression
Implement lifecycle policies

Hardware Patterns:

Use minimum necessary hardware
Use instance types with best performance per watt

Development Process:

Test sustainability improvements
Measure and report carbon footprint

Sustainability Checklist

Workloads in regions with renewable energy
Auto Scaling to match demand (no idle resources)
Unused resources regularly cleaned up
Graviton processors considered for better efficiency
Managed services used where appropriate
Data lifecycle policies to reduce storage
Efficient code (async processing, optimized queries)
Monitoring resource utilization
Carbon footprint tracked (AWS Customer Carbon Footprint Tool)

Review Process

1. Scoping Phase

Questions to ask:

What is the workload scope? (entire system vs specific component)
What are the business objectives?
What are the compliance requirements?
What are the current pain points?

2. Review Each Pillar

For each pillar, use this template:

Current State:

Document what exists today

Gaps:

What's missing or needs improvement?

Risks:

What are the high/medium/low priority risks?

Recommendations:

Specific, actionable improvements

3. Prioritization Matrix

Priority	Criteria
High	Security vulnerabilities, critical availability risks, major cost waste
Medium	Performance issues, moderate cost optimization, operational improvements
Low	Nice-to-haves, future considerations, minor optimizations
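
The matrix can be applied as a simple sort once each finding carries a risk level. The sketch below orders findings by risk and, as an added assumption not stated in the matrix, breaks ties by putting lower-effort items first. The `ActionItem` shape is hypothetical and loosely follows the action plan template fields.

```typescript
type Level = "High" | "Medium" | "Low";

// Hypothetical shape for one finding.
interface ActionItem {
  pillar: string;
  issue: string;
  risk: Level;
  effort: Level;
}

// Lower number sorts first: high risk before low risk, low effort before high effort.
const RISK_ORDER: Record<Level, number> = { High: 0, Medium: 1, Low: 2 };
const EFFORT_ORDER: Record<Level, number> = { Low: 0, Medium: 1, High: 2 };

// Returns a new array; the input is left untouched.
function prioritize(items: ActionItem[]): ActionItem[] {
  return items.slice().sort((a, b) => {
    const byRisk = RISK_ORDER[a.risk] - RISK_ORDER[b.risk];
    if (byRisk !== 0) {
      return byRisk;
    }
    return EFFORT_ORDER[a.effort] - EFFORT_ORDER[b.effort];
  });
}
```

Sorting this way keeps security and availability risks at the top of the plan while surfacing quick wins within each risk level.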

4. Action Plan Template

markdown

Pillar: [Name]

Issue: [Description]

Risk Level: High/Medium/Low
Impact: [Business impact]
Effort: Low/Medium/High

Recommendation:

[Specific actions]

Implementation Steps:

[Step 1]
[Step 2]
[Step 3]

Success Criteria:

[Measurable outcome 1]
[Measurable outcome 2]

Resources:

[AWS documentation links]
[Blog posts or examples]

Common Anti-Patterns

Anti-Pattern	Issue	Better Approach
Single AZ deployment	No fault tolerance	Multi-AZ architecture
No IaC	Manual config, drift	CloudFormation/CDK/Terraform
Hardcoded secrets	Security vulnerability	Secrets Manager/Parameter Store
No monitoring	Blind operation	CloudWatch dashboards + alarms
No backups	Data loss risk	Automated backup strategy
Over-provisioning	Cost waste	Right-sizing + Auto Scaling
No cost tracking	Budget overruns	Tags + Budgets + Cost Explorer
Monolithic architecture	Hard to scale	Microservices or serverless

Real-World Example

Scenario: Serverless API with authentication

Architecture Review:

Operational Excellence:

✅ Lambda functions deployed via CDK
✅ CloudWatch logs enabled
❌ Missing: Distributed tracing (X-Ray), dashboards

Security:

❌ CRITICAL: Hardcoded API keys in Lambda environment variables
✅ API Gateway with IAM authorization
❌ Missing: Secrets Manager, encryption at rest

Reliability:

✅ Multi-AZ DynamoDB table
❌ Single region deployment
❌ Missing: Backup strategy, DR plan

Performance:

✅ CloudFront for static assets
❌ No caching for API responses
❌ Lambda cold starts not optimized

Cost:

❌ DynamoDB provisioned capacity, but traffic is spiky
✅ Lambda usage-based pricing
❌ Missing: Budget alerts, cost allocation tags

Sustainability:

✅ Serverless architecture (good utilization)
❌ Unused dev/test resources running 24/7

Priority Actions:

HIGH: Move API keys to Secrets Manager (Security)
HIGH: Implement DynamoDB backups (Reliability)
MEDIUM: Add X-Ray tracing (Operational Excellence)
MEDIUM: Switch DynamoDB to on-demand (Cost)
LOW: Add API Gateway caching (Performance)

Resources

AWS Well-Architected Framework Whitepaper
AWS Well-Architected Tool (Interactive reviews)
Well-Architected Labs
AWS Architecture Center
Sustainability Pillar Whitepaper

Common Mistakes When Using This Framework

Mistake	Why It's Wrong	Correct Approach
"Sustainability doesn't apply to this workload"	Every workload consumes resources and energy	Review all 6 pillars, even if findings are minimal
Skipping current state documentation	Can't measure improvement without baseline	Always document "Current State" before recommendations
Generic recommendations	Not actionable or specific to this workload	Provide specific AWS services, code examples, priorities
No prioritization	Everything seems equally important	Use HIGH/MEDIUM/LOW risk levels, create phased plan
Forgetting about trade-offs	Optimizing one pillar at expense of others	Explicitly call out trade-offs (e.g., multi-region cost vs reliability)

Using This Skill

When conducting architecture reviews:

Start with context - understand business objectives and constraints
Review systematically - go through all 6 pillars, don't skip ANY
Document findings - use consistent format per pillar (Current State → Gaps → Risks → Recommendations)
Prioritize ruthlessly - security and availability issues first
Be specific - actionable recommendations with examples and AWS service names
Provide resources - link to AWS docs and examples
Create action plan - clear next steps with success criteria and effort estimates
Call out trade-offs - be explicit about costs and benefits of each recommendation

Remember: Architecture is about trade-offs. A perfect architecture doesn't exist - aim for a well-balanced one that meets business needs.

No exceptions to reviewing all 6 pillars - even if a pillar seems "not applicable", document why and what the current state is.
