aws-well-architected-framework
AWS Well-Architected Framework
Expert guidance for designing, reviewing, and improving AWS architectures using the six pillars of the Well-Architected Framework.
When to Use
Use this skill when:
- Reviewing existing AWS architecture for best practices
- Designing new cloud systems or applications
- Troubleshooting operational issues, security vulnerabilities, or reliability problems
- Optimizing costs or improving performance
- Preparing for architecture reviews or audits
- Migrating workloads to AWS
- Addressing compliance or sustainability requirements
- User asks "is my architecture good?" or "how can I improve my AWS setup?"
Core Principle
Systematic architecture evaluation across 6 pillars ensures balanced, well-designed systems that meet business objectives.
The AWS Well-Architected Framework provides a consistent approach for evaluating cloud architectures and implementing scalable designs.
The Six Pillars
| Pillar | Focus | Key Question |
|---|---|---|
| Operational Excellence | Run and monitor systems | How do we operate effectively? |
| Security | Protect information and systems | How do we protect data and resources? |
| Reliability | Recover from failures | How do we ensure workload availability? |
| Performance Efficiency | Use resources effectively | How do we meet performance requirements? |
| Cost Optimization | Avoid unnecessary costs | How do we achieve cost-effective outcomes? |
| Sustainability | Minimize environmental impact | How do we reduce carbon footprint? |
Architecture Review Workflow
CRITICAL: You MUST review ALL 6 pillars systematically. Never skip a pillar because it "seems not applicable" - every workload has considerations across all pillars.
```dot
digraph review_flow {
    "Architecture review needed" [shape=doublecircle];
    "Identify workload scope" [shape=box];
    "Review each pillar systematically" [shape=box];
    "Document findings per pillar" [shape=box];
    "Prioritize improvements" [shape=box];
    "Create action plan" [shape=box];
    "All pillars reviewed?" [shape=diamond];
    "Complete" [shape=doublecircle];

    "Architecture review needed" -> "Identify workload scope";
    "Identify workload scope" -> "Review each pillar systematically";
    "Review each pillar systematically" -> "Document findings per pillar";
    "Document findings per pillar" -> "All pillars reviewed?";
    "All pillars reviewed?" -> "Review each pillar systematically" [label="no"];
    "All pillars reviewed?" -> "Prioritize improvements" [label="yes"];
    "Prioritize improvements" -> "Create action plan";
    "Create action plan" -> "Complete";
}
```
Red Flags - You're Skipping the Framework:
- "This pillar doesn't apply to this workload" - WRONG, every pillar applies
- Jumping straight to recommendations without documenting current state
- Only reviewing 3-4 pillars instead of all 6
- Providing generic advice instead of workload-specific assessment
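The "All pillars reviewed?" gate in the workflow can be made mechanical. The following is a minimal sketch, assuming findings are collected as plain objects; the `PillarFinding` shape and both function names are hypothetical and not part of any AWS tooling.

```typescript
// The six pillar names, as listed in the pillar table.
const PILLARS = [
  "Operational Excellence",
  "Security",
  "Reliability",
  "Performance Efficiency",
  "Cost Optimization",
  "Sustainability",
] as const;

type Pillar = (typeof PILLARS)[number];

// Hypothetical record of what was documented for one pillar.
interface PillarFinding {
  pillar: Pillar;
  currentState: string; // what exists today
  gaps: string[];       // what is missing or needs improvement
}

// Returns the pillars that have no documented finding yet.
function missingPillars(findings: PillarFinding[]): Pillar[] {
  return PILLARS.filter((p) => !findings.some((f) => f.pillar === p));
}

// The review passes the gate only when every pillar is covered and
// each finding has a non-empty current-state description.
function isReviewComplete(findings: PillarFinding[]): boolean {
  return (
    missingPillars(findings).length === 0 &&
    findings.every((f) => f.currentState.trim().length > 0)
  );
}
```

A review that covers only Security would report the other five pillars as missing, which maps to the "no" edge in the diagram.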
Pillar 1: Operational Excellence
Goal: Support development and run workloads effectively, gain insight into operations, and continuously improve processes.
Design Principles
- Perform operations as code (IaC)
- Make frequent, small, reversible changes
- Refine operations procedures frequently
- Anticipate failure
- Learn from operational events and failures
Key Areas
Organization:
- How do teams share architecture knowledge?
- Are there clear ownership and accountability models?
Prepare:
- How do you design workloads for observability?
- Infrastructure as code implementation?
- Deployment practices (CI/CD)?
Operate:
- What's the runbook for common operations?
- How do you understand workload health?
- How do you respond to events?
Evolve:
- How do you learn from operational events?
- Process for continuous improvement?
Common Issues & Solutions
| Issue | Solution |
|---|---|
| Manual deployments | Implement CI/CD with CloudFormation/CDK/Terraform |
| No visibility into system health | Add CloudWatch dashboards, metrics, alarms |
| Operational procedures outdated | Regular runbook reviews, post-incident learning |
| Slow incident response | Create automated remediation with Lambda/Systems Manager |
Quick Implementation Checklist
- Infrastructure defined as code (CloudFormation/CDK/Terraform)
- CI/CD pipeline implemented
- CloudWatch dashboards for key metrics
- Alarms for critical thresholds
- Runbooks documented and accessible
- Regular game days to test procedures
- Post-incident review process
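To reason about the "Alarms for critical thresholds" item, it can help to model how an "M out of N datapoints" rule behaves before configuring a real alarm. This is a simplified sketch and not how CloudWatch is implemented; it ignores missing-data handling, and the function and parameter names are assumptions chosen for illustration.

```typescript
// Returns true when at least `datapointsToAlarm` of the most recent
// `evaluationPeriods` datapoints are strictly above the threshold.
function breachesThreshold(
  datapoints: number[],
  threshold: number,
  evaluationPeriods: number,
  datapointsToAlarm: number,
): boolean {
  if (evaluationPeriods < 1 || datapointsToAlarm < 1) {
    throw new Error("evaluationPeriods and datapointsToAlarm must be at least 1");
  }
  // Look only at the trailing evaluation window.
  const recent = datapoints.slice(-evaluationPeriods);
  const breaching = recent.filter((value) => value > threshold).length;
  return breaching >= datapointsToAlarm;
}
```

For CPU samples `[50, 60, 95, 97, 70]` and a threshold of 90, a "2 out of 3" rule fires because two of the last three samples breach, while a "3 out of 3" rule does not.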
Pillar 2: Security
Goal: Protect data, systems, and assets through cloud security practices.
Design Principles
- Implement strong identity foundation
- Enable traceability
- Apply security at all layers
- Automate security best practices
- Protect data in transit and at rest
- Keep people away from data
- Prepare for security events
Key Areas
Security Foundations:
- How do you manage credentials and authentication?
- IAM roles and policies following least privilege?
Identity and Access Management:
- How do you manage identities for people and machines?
- MFA enabled for all human access?
Detection:
- How do you detect and investigate security events?
- CloudTrail, GuardDuty, Security Hub configured?
Infrastructure Protection:
- How do you protect networks and compute?
- VPC configuration, security groups, NACLs?
Data Protection:
- How do you classify and protect data?
- Encryption at rest and in transit?
Incident Response:
- How do you respond to security incidents?
- Incident response plan tested?
Critical Security Patterns
Never Do:
```typescript
// ❌ DANGEROUS: Hardcoded credentials
const AWS = require('aws-sdk');
const s3 = new AWS.S3({
  accessKeyId: 'AKIAIOSFODNN7EXAMPLE',
  secretAccessKey: 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
});
```
Always Do:
```typescript
// ✅ CORRECT: Use IAM roles
import * as lambda from 'aws-cdk-lib/aws-lambda';

// Application code: no keys in source, credentials come from the IAM role
const AWS = require('aws-sdk');
const s3 = new AWS.S3(); // Credentials from IAM role

// CDK code (inside a Stack): Lambda function with IAM role.
// Named `fn` so it does not shadow the imported `lambda` module.
const fn = new lambda.Function(this, 'MyFunction', {
  // IAM role with least privilege (myRole is defined elsewhere)
  role: myRole,
  // ...
});
```
Security Checklist
- No hardcoded credentials anywhere (check git history!)
- IAM roles follow least privilege principle
- MFA enabled for root and privileged accounts
- CloudTrail enabled in all regions
- VPC with proper public/private subnet architecture
- Security groups with minimal inbound rules
- Encryption at rest for all data stores
- HTTPS/TLS for all data in transit
- Secrets Manager or Parameter Store for secrets
- Regular security patching process
- AWS Config for compliance monitoring
- GuardDuty for threat detection
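The first checklist item can be partly automated with a simple text scan. The sketch below is a heuristic only: it flags strings shaped like AWS access key IDs (an `AKIA` or `ASIA` prefix followed by 16 uppercase letters or digits) in a block of text. The function name and result shape are hypothetical, it does not inspect git history, and it is not a substitute for a dedicated secret scanner.

```typescript
// Strings shaped like AWS access key IDs: AKIA (long-term) or ASIA
// (temporary) followed by 16 uppercase letters or digits.
const ACCESS_KEY_ID_PATTERN = /\b(?:AKIA|ASIA)[0-9A-Z]{16}\b/g;

interface KeyMatch {
  line: number;  // 1-based line number within the scanned text
  match: string; // the suspicious string that was found
}

// Scans text line by line and reports every access-key-shaped string.
function findAccessKeyIds(source: string): KeyMatch[] {
  const results: KeyMatch[] = [];
  source.split("\n").forEach((text, index) => {
    const found = text.match(ACCESS_KEY_ID_PATTERN);
    if (found !== null) {
      found.forEach((m) => {
        results.push({ line: index + 1, match: m });
      });
    }
  });
  return results;
}
```

Run against the "Never Do" snippet above, this would flag the `accessKeyId` line. Secret access keys have no fixed prefix, so catching them reliably needs entropy-based or context-based checks that are out of scope here.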
Pillar 3: Reliability
Goal: Ensure the workload performs its intended function correctly and consistently.
Design Principles
- Automatically recover from failure
- Test recovery procedures
- Scale horizontally
- Stop guessing capacity
- Manage change through automation
Key Areas
Foundations:
- How do you manage service quotas and constraints?
- Network topology designed for HA?
Workload Architecture:
- How do you design workload service architecture?
- Microservices vs monolith considerations?
Change Management:
- How do you monitor workload resources?
- How are changes deployed safely?
Failure Management:
- How do you back up data?
- How do you design for resilience?
- DR plan and RTO/RPO defined?
High Availability Patterns
Multi-AZ Deployment:
```
Region
├── AZ-1: Application + Database
├── AZ-2: Application + Database (standby)
└── AZ-3: Application + Database (standby)
```
Multi-Region Deployment:
```
Primary Region              Secondary Region
├── Active workload         ├── Standby/Active
├── Database (primary)      ├── Database (replica)
└── Route 53 health check monitoring
```
Backup Strategy
| Data Type | Solution | Typical RPO | Typical RTO |
|---|---|---|---|
| RDS | Automated backups + snapshots | < 5 min | < 30 min |
| DynamoDB | Point-in-time recovery | About 5 min | Minutes |
| S3 | Versioning + cross-region replication | Minutes (replication is asynchronous) | Immediate |
| EBS | Snapshots via AWS Backup | Hours | Hours |
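Figures like those in the table only matter relative to the targets the business has agreed to. As a rough sketch, the comparison can be written down explicitly; the `BackupPlan` and `RecoveryObjectives` shapes below are hypothetical, and the numbers should come from your own backup schedule and restore tests rather than from the table.

```typescript
// Targets agreed with the business, in minutes.
interface RecoveryObjectives {
  rpoMinutes: number; // maximum tolerable data loss
  rtoMinutes: number; // maximum tolerable downtime
}

// Hypothetical description of one backup approach.
interface BackupPlan {
  name: string;
  backupIntervalMinutes: number;  // worst-case data loss is roughly one interval
  measuredRestoreMinutes: number; // taken from an actual restore test
}

// Lists every objective the plan fails to meet; an empty list means it fits.
function recoveryGaps(plan: BackupPlan, target: RecoveryObjectives): string[] {
  const gaps: string[] = [];
  if (plan.backupIntervalMinutes > target.rpoMinutes) {
    gaps.push(
      `${plan.name}: backup interval ${plan.backupIntervalMinutes} min exceeds RPO ${target.rpoMinutes} min`,
    );
  }
  if (plan.measuredRestoreMinutes > target.rtoMinutes) {
    gaps.push(
      `${plan.name}: restore time ${plan.measuredRestoreMinutes} min exceeds RTO ${target.rtoMinutes} min`,
    );
  }
  return gaps;
}
```

A plan that snapshots every four hours against a one-hour RPO produces one gap, which belongs in the Reliability findings of the review.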
Reliability Checklist
- Multi-AZ deployment for critical components
- Health checks configured (ELB, Route 53)
- Auto Scaling groups with proper sizing
- RDS automated backups enabled
- DynamoDB point-in-time recovery enabled
- S3 versioning for critical buckets
- Disaster recovery plan documented and tested
- Chaos engineering tests (failure injection)
- Graceful degradation strategies
- Circuit breaker patterns implemented
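The last two checklist items, graceful degradation and circuit breakers, are easiest to judge with a concrete shape in mind. Below is a minimal, synchronous sketch of a circuit breaker with a fallback. The class, its parameters, and the injected clock are illustrative assumptions; real downstream calls would be asynchronous and would usually rely on an established resilience library.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = "closed";

  constructor(
    private readonly failureThreshold: number, // consecutive failures before opening
    private readonly resetTimeoutMs: number,   // how long to stay open before a trial call
    private readonly now: () => number = () => Date.now(),
  ) {}

  // An open breaker becomes half-open once the reset timeout has elapsed.
  getState(): BreakerState {
    if (this.state === "open" && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = "half-open";
    }
    return this.state;
  }

  // Runs the operation unless the breaker is open; degrades to the fallback otherwise.
  call<T>(operation: () => T, fallback: () => T): T {
    if (this.getState() === "open") {
      return fallback(); // fail fast without touching the dependency
    }
    try {
      const result = operation();
      this.failures = 0;
      this.state = "closed";
      return result;
    } catch {
      this.failures += 1;
      // A failed trial call, or too many consecutive failures, opens the breaker.
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = this.now();
      }
      return fallback();
    }
  }
}
```

The fallback is where graceful degradation lives, for example returning cached or default data while the dependency recovers.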
Pillar 4: Performance Efficiency
Goal: Use computing resources efficiently to meet requirements and maintain efficiency as demand changes.
Design Principles
- Democratize advanced technologies
- Go global in minutes
- Use serverless architectures
- Experiment more often
- Consider mechanical sympathy
Key Areas
Selection:
- How do you select appropriate resource types and sizes?
- Compute: EC2, Lambda, Fargate, ECS, EKS?
- Database: RDS, DynamoDB, Aurora, ElastiCache?
- Storage: S3, EFS, EBS, Glacier?
Review:
- How do you evolve the workload to use new resources?
- Regular review of new AWS features?
Monitoring:
- How do you monitor resources?
- CloudWatch, X-Ray for distributed tracing?
Trade-offs:
- How do you use trade-offs to improve performance?
- Caching, consistency models, compression?
Performance Patterns
Caching Strategy:
```
Client → CloudFront (edge cache)
       → API Gateway
       → Lambda
       → ElastiCache (data cache)
       → DynamoDB/RDS
```
Database Selection:
| Use Case | Recommended Service |
|---|---|
| Relational, complex queries | RDS (PostgreSQL/MySQL) |
| High throughput, simple queries | DynamoDB |
| Graph relationships | Neptune |
| Search and analytics | OpenSearch |
| Time-series data | Timestream |
| In-memory cache | ElastiCache (Redis/Memcached) |
Performance Checklist
- Right-sized compute instances (not over-provisioned)
- Content delivery through CloudFront
- Database read replicas for read-heavy workloads
- Caching layer (ElastiCache, DAX, CloudFront)
- Asynchronous processing with SQS/SNS/EventBridge
- Auto Scaling configured appropriately
- Database indexes optimized
- Monitoring with CloudWatch and X-Ray
- Regular performance testing under load
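The caching-layer item usually means a cache-aside read path: check the cache, fall back to the data store on a miss, then populate the cache. Here is a minimal in-memory sketch, assuming a plain object stands in for ElastiCache and a synchronous loader stands in for the database call; the `TtlCache` class and its method names are hypothetical.

```typescript
interface CacheEntry<V> {
  value: V;
  expiresAt: number; // epoch milliseconds after which the entry is stale
}

class TtlCache<V> {
  private readonly entries: Record<string, CacheEntry<V>> = {};

  constructor(
    private readonly ttlMs: number,
    private readonly now: () => number = () => Date.now(),
  ) {}

  // Cache-aside read: return a fresh hit, otherwise load, store, and return.
  getOrLoad(key: string, load: (key: string) => V): V {
    const hit = this.entries[key];
    if (hit !== undefined && hit.expiresAt > this.now()) {
      return hit.value;
    }
    const value = load(key);
    this.entries[key] = { value, expiresAt: this.now() + this.ttlMs };
    return value;
  }

  // Call after a write to the data store so readers do not see stale data.
  invalidate(key: string): void {
    delete this.entries[key];
  }
}
```

The TTL is the trade-off knob named under Key Areas: a longer TTL reduces load on DynamoDB or RDS but widens the window in which readers may see stale data.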
Pillar 5: Cost Optimization
Goal: Run systems to deliver business value at the lowest price point.
Design Principles
- Implement cloud financial management
- Adopt consumption model
- Measure overall efficiency
- Stop spending on undifferentiated heavy lifting
- Analyze and attribute expenditure
Key Areas
Practice Cloud Financial Management:
- Cost allocation tags implemented?
- Budgets and alerts configured?
Expenditure and Usage Awareness:
- How do you govern usage?
- Cost Explorer and AWS Budgets configured?
Cost-Effective Resources:
- How do you evaluate cost when selecting services?
- Reserved Instances or Savings Plans for predictable workloads?
Manage Demand:
- How do you manage demand and supply resources?
- Throttling, caching to reduce demand?
Optimize Over Time:
- How do you evaluate new services?
- Regular review of cost optimization opportunities?
Cost Optimization Strategies
| Strategy | Implementation | Potential Savings |
|---|---|---|
| Right-sizing | Use Compute Optimizer recommendations | 20-40% |
| Reserved Instances | 1-year or 3-year commitments | Up to 72% |
| Savings Plans | Flexible compute commitments | 30-70% |
| Spot Instances | Fault-tolerant workloads | 50-90% |
| S3 Intelligent-Tiering | Automatic storage class optimization | 40-60% |
| Auto Scaling | Scale resources with demand | 30-50% |
| Lambda instead of EC2 | For appropriate workloads | Varies |
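To turn a percentage range from the table into numbers for a review, a small calculation is enough. The sketch below is plain arithmetic under the assumption that the range applies to the whole monthly spend being considered; the type and function names are hypothetical, and actual savings depend on the workload.

```typescript
// A savings range in percent, for example 20-40 for right-sizing.
interface SavingsRange {
  minPercent: number;
  maxPercent: number;
}

interface ProjectedCost {
  bestCaseUsd: number;  // cost if the maximum saving is achieved
  worstCaseUsd: number; // cost if only the minimum saving is achieved
}

// Applies a savings range to the current monthly spend.
function projectMonthlyCost(currentMonthlyUsd: number, range: SavingsRange): ProjectedCost {
  if (range.minPercent < 0 || range.maxPercent > 100 || range.minPercent > range.maxPercent) {
    throw new Error("savings range must satisfy 0 <= min <= max <= 100");
  }
  return {
    bestCaseUsd: (currentMonthlyUsd * (100 - range.maxPercent)) / 100,
    worstCaseUsd: (currentMonthlyUsd * (100 - range.minPercent)) / 100,
  };
}
```

For example, 1,000 USD of monthly EC2 spend with the 20-40% right-sizing range projects to between 600 and 800 USD. The ranges should not be added together across strategies, since each one applies to the spend left after the previous one.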
Cost Monitoring
```typescript
// CDK Example: Set up budget alerts
import * as budgets from 'aws-cdk-lib/aws-budgets';

new budgets.CfnBudget(this, 'MonthlyBudget', {
  budget: {
    budgetType: 'COST',
    timeUnit: 'MONTHLY',
    budgetLimit: {
      amount: 1000,
      unit: 'USD',
    },
  },
  notificationsWithSubscribers: [{
    notification: {
      notificationType: 'ACTUAL',
      comparisonOperator: 'GREATER_THAN',
      threshold: 80, // Alert at 80%
    },
    subscribers: [{
      subscriptionType: 'EMAIL',
      address: 'team@example.com',
    }],
  }],
});
```
Cost Optimization Checklist
- Cost allocation tags applied consistently
- AWS Budgets configured with alerts
- Cost Explorer reviewed monthly
- Reserved Instances or Savings Plans for stable workloads
- Spot Instances for fault-tolerant workloads
- Unused resources identified and terminated
- S3 lifecycle policies for data management
- Right-sized instances (not over-provisioned)
- Lambda memory optimization
- DynamoDB on-demand vs provisioned analysis
- Data transfer costs analyzed and optimized
Pillar 6: Sustainability
Goal: Minimize the environmental impact of running cloud workloads.
Design Principles
- Understand your impact
- Establish sustainability goals
- Maximize utilization
- Anticipate and adopt new, more efficient offerings
- Use managed services
- Reduce downstream impact
Key Areas
Region Selection:
- Choose regions with renewable energy
- AWS regions with lower carbon intensity
User Behavior Patterns:
- Scale resources with demand
- Remove unused resources
Software and Architecture:
- Optimize code for efficiency
- Use appropriate services (serverless over provisioned)
Data Patterns:
- Minimize data movement
- Use data compression
- Implement lifecycle policies
Hardware Patterns:
- Use minimum necessary hardware
- Use instance types with best performance per watt
Development Process:
- Test sustainability improvements
- Measure and report carbon footprint
Sustainability Checklist
- Workloads in regions with renewable energy
- Auto Scaling to match demand (no idle resources)
- Unused resources regularly cleaned up
- Graviton processors considered for better efficiency
- Managed services used where appropriate
- Data lifecycle policies to reduce storage
- Efficient code (async processing, optimized queries)
- Monitoring resource utilization
- Carbon footprint tracked (AWS Customer Carbon Footprint Tool)
Review Process
1. Scoping Phase
Questions to ask:
- What is the workload scope? (entire system vs specific component)
- What are the business objectives?
- What are the compliance requirements?
- What are the current pain points?
2. Review Each Pillar
For each pillar, use this template:
Current State:
- Document what exists today
Gaps:
- What's missing or needs improvement?
Risks:
- What are the high/medium/low priority risks?
Recommendations:
- Specific, actionable improvements
3. Prioritization Matrix
| Priority | Criteria |
|---|---|
| High | Security vulnerabilities, critical availability risks, major cost waste |
| Medium | Performance issues, moderate cost optimization, operational improvements |
| Low | Nice-to-haves, future considerations, minor optimizations |
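Once findings carry a risk level, ordering them is a simple sort. The sketch below ranks by risk first and, as an assumption not stated in the matrix, breaks ties by preferring lower effort so that quick wins surface earlier. The `Finding` shape and function name are hypothetical.

```typescript
type Level = "High" | "Medium" | "Low";

// Hypothetical shape for one review finding.
interface Finding {
  pillar: string;
  issue: string;
  risk: Level;
  effort: Level;
}

// Lower rank sorts first: high risk before low risk, low effort before high effort.
const RISK_RANK: Record<Level, number> = { High: 0, Medium: 1, Low: 2 };
const EFFORT_RANK: Record<Level, number> = { Low: 0, Medium: 1, High: 2 };

// Returns a new array ordered by risk, then by effort; the input is not modified.
function prioritize(findings: Finding[]): Finding[] {
  return findings.slice().sort(
    (a, b) =>
      RISK_RANK[a.risk] - RISK_RANK[b.risk] ||
      EFFORT_RANK[a.effort] - EFFORT_RANK[b.effort],
  );
}
```

The sorted list can feed directly into the action plan, with the High items forming the first phase.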
4. Action Plan Template
```markdown
Pillar: [Name]
Issue: [Description]
- Risk Level: High/Medium/Low
- Impact: [Business impact]
- Effort: Low/Medium/High

Recommendation:
[Specific actions]

Implementation Steps:
- [Step 1]
- [Step 2]
- [Step 3]

Success Criteria:
- [Measurable outcome 1]
- [Measurable outcome 2]

Resources:
- [AWS documentation links]
- [Blog posts or examples]
```
Common Anti-Patterns
| Anti-Pattern | Issue | Better Approach |
|---|---|---|
| Single AZ deployment | No fault tolerance | Multi-AZ architecture |
| No IaC | Manual config, drift | CloudFormation/CDK/Terraform |
| Hardcoded secrets | Security vulnerability | Secrets Manager/Parameter Store |
| No monitoring | Blind operation | CloudWatch dashboards + alarms |
| No backups | Data loss risk | Automated backup strategy |
| Over-provisioning | Cost waste | Right-sizing + Auto Scaling |
| No cost tracking | Budget overruns | Tags + Budgets + Cost Explorer |
| Monolithic architecture | Hard to scale | Microservices or serverless |
Real-World Example
Scenario: Serverless API with authentication
Architecture Review:
Operational Excellence:
- ✅ Lambda functions deployed via CDK
- ✅ CloudWatch logs enabled
- ❌ Missing: Distributed tracing (X-Ray), dashboards
Security:
- ❌ CRITICAL: Hardcoded API keys in Lambda environment variables
- ✅ API Gateway with IAM authorization
- ❌ Missing: Secrets Manager, encryption at rest
Reliability:
- ✅ Multi-AZ DynamoDB table
- ❌ Single region deployment
- ❌ Missing: Backup strategy, DR plan
Performance:
- ✅ CloudFront for static assets
- ❌ No caching for API responses
- ❌ Lambda cold starts not optimized
Cost:
- ❌ DynamoDB provisioned capacity, but traffic is spiky
- ✅ Lambda usage-based pricing
- ❌ Missing: Budget alerts, cost allocation tags
Sustainability:
- ✅ Serverless architecture (good utilization)
- ❌ Unused dev/test resources running 24/7
Priority Actions:
- HIGH: Move API keys to Secrets Manager (Security)
- HIGH: Implement DynamoDB backups (Reliability)
- MEDIUM: Add X-Ray tracing (Operational Excellence)
- MEDIUM: Switch DynamoDB to on-demand (Cost)
- LOW: Add API Gateway caching (Performance)
Resources
Common Mistakes When Using This Framework
| Mistake | Why It's Wrong | Correct Approach |
|---|---|---|
| "Sustainability doesn't apply to this workload" | Every workload consumes resources and energy | Review all 6 pillars, even if findings are minimal |
| Skipping current state documentation | Can't measure improvement without baseline | Always document "Current State" before recommendations |
| Generic recommendations | Not actionable or specific to this workload | Provide specific AWS services, code examples, priorities |
| No prioritization | Everything seems equally important | Use HIGH/MEDIUM/LOW risk levels, create phased plan |
| Forgetting about trade-offs | Optimizing one pillar at expense of others | Explicitly call out trade-offs (e.g., multi-region cost vs reliability) |
Using This Skill
When conducting architecture reviews:
- Start with context - understand business objectives and constraints
- Review systematically - go through all 6 pillars, don't skip ANY
- Document findings - use consistent format per pillar (Current State → Gaps → Recommendations)
- Prioritize ruthlessly - security and availability issues first
- Be specific - actionable recommendations with examples and AWS service names
- Provide resources - link to AWS docs and examples
- Create action plan - clear next steps with success criteria and effort estimates
- Call out trade-offs - be explicit about costs and benefits of each recommendation
Remember: Architecture is about trade-offs. A perfect architecture doesn't exist - aim for a well-balanced one that meets business needs.
No exceptions to reviewing all 6 pillars - even if a pillar seems "not applicable", document why and what the current state is.