aws-well-architected-framework

AWS Well-Architected Framework

Expert guidance for designing, reviewing, and improving AWS architectures using the six pillars of the Well-Architected Framework.

When to Use

Use this skill when:
  • Reviewing existing AWS architecture for best practices
  • Designing new cloud systems or applications
  • Troubleshooting operational issues, security vulnerabilities, or reliability problems
  • Optimizing costs or improving performance
  • Preparing for architecture reviews or audits
  • Migrating workloads to AWS
  • Addressing compliance or sustainability requirements
  • User asks "is my architecture good?" or "how can I improve my AWS setup?"

Core Principle

Systematic architecture evaluation across 6 pillars ensures balanced, well-designed systems that meet business objectives.
The AWS Well-Architected Framework provides a consistent approach for evaluating cloud architectures and implementing scalable designs.

The Six Pillars

| Pillar | Focus | Key Question |
| --- | --- | --- |
| Operational Excellence | Run and monitor systems | How do we operate effectively? |
| Security | Protect information and systems | How do we protect data and resources? |
| Reliability | Recover from failures | How do we ensure workload availability? |
| Performance Efficiency | Use resources effectively | How do we meet performance requirements? |
| Cost Optimization | Avoid unnecessary costs | How do we achieve cost-effective outcomes? |
| Sustainability | Minimize environmental impact | How do we reduce carbon footprint? |

Architecture Review Workflow

CRITICAL: You MUST review ALL 6 pillars systematically. Never skip a pillar because it "seems not applicable" - every workload has considerations across all pillars.
```dot
digraph review_flow {
    "Architecture review needed" [shape=doublecircle];
    "Identify workload scope" [shape=box];
    "Review each pillar systematically" [shape=box];
    "Document findings per pillar" [shape=box];
    "Prioritize improvements" [shape=box];
    "Create action plan" [shape=box];
    "All pillars reviewed?" [shape=diamond];
    "Complete" [shape=doublecircle];

    "Architecture review needed" -> "Identify workload scope";
    "Identify workload scope" -> "Review each pillar systematically";
    "Review each pillar systematically" -> "Document findings per pillar";
    "Document findings per pillar" -> "All pillars reviewed?";
    "All pillars reviewed?" -> "Review each pillar systematically" [label="no"];
    "All pillars reviewed?" -> "Prioritize improvements" [label="yes"];
    "Prioritize improvements" -> "Create action plan";
    "Create action plan" -> "Complete";
}
```
Red Flags - You're Skipping the Framework:
  • "This pillar doesn't apply to this workload" - WRONG, every pillar applies
  • Jumping straight to recommendations without documenting current state
  • Only reviewing 3-4 pillars instead of all 6
  • Providing generic advice instead of workload-specific assessment
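The "All pillars reviewed?" decision in the workflow can be made mechanical. The following is a minimal TypeScript sketch, not part of the framework itself: the `ReviewNotes` shape and the function names are hypothetical and exist only for illustration.

```typescript
// The six pillars named in this document.
const PILLARS = [
  "Operational Excellence",
  "Security",
  "Reliability",
  "Performance Efficiency",
  "Cost Optimization",
  "Sustainability",
] as const;

type Pillar = (typeof PILLARS)[number];

// Hypothetical shape: documented findings, keyed by pillar.
type ReviewNotes = Partial<Record<Pillar, string[]>>;

// Returns the pillars that still have no documented findings.
function missingPillars(notes: ReviewNotes): Pillar[] {
  return PILLARS.filter((p) => (notes[p]?.length ?? 0) === 0);
}

// Mirrors the "All pillars reviewed?" decision in the review flow.
function reviewComplete(notes: ReviewNotes): boolean {
  return missingPillars(notes).length === 0;
}
```

A review that documents only Security and Reliability would report the other four pillars as missing, which corresponds to the "Only reviewing 3-4 pillars" red flag.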

Pillar 1: Operational Excellence

Goal: Support development and run workloads effectively, gain insight into operations, and continuously improve processes.

Design Principles

  • Perform operations as code (IaC)
  • Make frequent, small, reversible changes
  • Refine operations procedures frequently
  • Anticipate failure
  • Learn from operational events and failures

Key Areas

Organization:
  • How do teams share architecture knowledge?
  • Are there clear ownership and accountability models?
Prepare:
  • How do you design workloads for observability?
  • Infrastructure as code implementation?
  • Deployment practices (CI/CD)?
Operate:
  • What's the runbook for common operations?
  • How do you understand workload health?
  • How do you respond to events?
Evolve:
  • How do you learn from operational events?
  • Process for continuous improvement?

Common Issues & Solutions

| Issue | Solution |
| --- | --- |
| Manual deployments | Implement CI/CD with CloudFormation/CDK/Terraform |
| No visibility into system health | Add CloudWatch dashboards, metrics, alarms |
| Operational procedures outdated | Regular runbook reviews, post-incident learning |
| Slow incident response | Create automated remediation with Lambda/Systems Manager |

Quick Implementation Checklist

  • Infrastructure defined as code (CloudFormation/CDK/Terraform)
  • CI/CD pipeline implemented
  • CloudWatch dashboards for key metrics
  • Alarms for critical thresholds
  • Runbooks documented and accessible
  • Regular game days to test procedures
  • Post-incident review process

Pillar 2: Security

Goal: Protect data, systems, and assets through cloud security practices.

Design Principles

  • Implement a strong identity foundation
  • Enable traceability
  • Apply security at all layers
  • Automate security best practices
  • Protect data in transit and at rest
  • Keep people away from data
  • Prepare for security events

Key Areas

Security Foundations:
  • How do you manage credentials and authentication?
  • IAM roles and policies following least privilege?
Identity and Access Management:
  • How do you manage identities for people and machines?
  • MFA enabled for all human access?
Detection:
  • How do you detect and investigate security events?
  • CloudTrail, GuardDuty, Security Hub configured?
Infrastructure Protection:
  • How do you protect networks and compute?
  • VPC configuration, security groups, NACLs?
Data Protection:
  • How do you classify and protect data?
  • Encryption at rest and in transit?
Incident Response:
  • How do you respond to security incidents?
  • Incident response plan tested?

Critical Security Patterns

Never Do:
```typescript
// ❌ DANGEROUS: Hardcoded credentials
const AWS = require('aws-sdk');
const s3 = new AWS.S3({
  accessKeyId: 'AKIAIOSFODNN7EXAMPLE',
  secretAccessKey: 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
});
```
Always Do:
```typescript
// ✅ CORRECT: Use IAM roles
const AWS = require('aws-sdk');
const s3 = new AWS.S3(); // Credentials come from the attached IAM role

// CDK stack code (separate file): Lambda function with an explicit IAM role
import * as lambda from 'aws-cdk-lib/aws-lambda';

// Named `fn` so the variable does not shadow the imported `lambda` module
const fn = new lambda.Function(this, 'MyFunction', {
  // IAM role with least privilege (myRole is defined elsewhere in the stack)
  role: myRole,
  // ...
});
```

Security Checklist

  • No hardcoded credentials anywhere (check git history!)
  • IAM roles follow least privilege principle
  • MFA enabled for root and privileged accounts
  • CloudTrail enabled in all regions
  • VPC with proper public/private subnet architecture
  • Security groups with minimal inbound rules
  • Encryption at rest for all data stores
  • HTTPS/TLS for all data in transit
  • Secrets Manager or Parameter Store for secrets
  • Regular security patching process
  • AWS Config for compliance monitoring
  • GuardDuty for threat detection
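The first checklist item (no hardcoded credentials) can be partly automated. The sketch below is a heuristic only, assuming access key IDs follow the commonly seen shape of a four-letter prefix (`AKIA` for long-term keys, `ASIA` for temporary ones) followed by 16 uppercase letters or digits. The function name `findAccessKeyIds` is hypothetical, and a match is a lead to investigate, not proof of a leak.

```typescript
// Heuristic pattern for AWS access key IDs (assumed shape: prefix + 16 chars).
const ACCESS_KEY_ID = /\b(?:AKIA|ASIA)[0-9A-Z]{16}\b/g;

// Returns every substring of `text` that looks like an access key ID.
function findAccessKeyIds(text: string): string[] {
  return text.match(ACCESS_KEY_ID) ?? [];
}
```

Running a check like this over file contents and over the output of a git history dump would cover the "check git history" part of the item. It does not detect secret access keys or other kinds of secrets.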

Pillar 3: Reliability

Goal: Ensure the workload performs its intended function correctly and consistently.

Design Principles

  • Automatically recover from failure
  • Test recovery procedures
  • Scale horizontally
  • Stop guessing capacity
  • Manage change through automation

Key Areas

Foundations:
  • How do you manage service quotas and constraints?
  • Network topology designed for HA?
Workload Architecture:
  • How do you design workload service architecture?
  • Microservices vs monolith considerations?
Change Management:
  • How do you monitor workload resources?
  • How are changes deployed safely?
Failure Management:
  • How do you back up data?
  • How do you design for resilience?
  • DR plan and RTO/RPO defined?

High Availability Patterns

Multi-AZ Deployment:
```
Region
├── AZ-1: Application + Database
├── AZ-2: Application + Database (standby)
└── AZ-3: Application + Database (standby)
```
Multi-Region Deployment:
```
Primary Region          Secondary Region
├── Active workload     ├── Standby/Active
├── Database (primary)  ├── Database (replica)
└── Route 53 health check monitoring
```
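The benefit of redundant deployments can be estimated with simple arithmetic. This is a rough sketch that assumes failures are independent, which real Availability Zones and Regions only approximate. The function names and the example figures are illustrative and are not AWS-published numbers.

```typescript
// Availability of `copies` redundant deployments, each with availability
// `single` (a fraction such as 0.99), assuming independent failures:
// the system is down only when every copy is down at once.
function redundantAvailability(single: number, copies: number): number {
  return 1 - Math.pow(1 - single, copies);
}

// Availability of a chain of components that must all work.
function serialAvailability(parts: number[]): number {
  return parts.reduce((acc, a) => acc * a, 1);
}
```

Under these assumptions, two copies at 99% each give roughly 99.99%, while chaining dependencies in series lowers the overall figure. That is why each critical tier in the diagrams is replicated.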

Backup Strategy

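| Data Type | Solution | RPO | RTO |
| --- | --- | --- | --- |
| RDS | Automated backups + snapshots | < 5 min | < 30 min |
| DynamoDB | Point-in-time recovery | Seconds | Minutes |
| S3 | Versioning + cross-region replication | Minutes (replication is asynchronous) | Immediate |
| EBS | Snapshots via AWS Backup | Hours | Hours |

RPO and RTO values are typical figures, not guarantees; they depend on configuration, data volume, and backup schedule.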
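Comparing a backup strategy against business targets is a simple check once RPO and RTO are expressed in the same unit. The sketch below uses minutes. The `RecoveryObjective` type and the `meetsObjective` function are hypothetical names, and any target values are assumptions to be replaced with the workload's real requirements.

```typescript
// Recovery objectives expressed in minutes.
interface RecoveryObjective {
  rpoMinutes: number; // maximum tolerable data loss
  rtoMinutes: number; // maximum tolerable time to restore service
}

// True when the strategy's worst-case data loss and recovery time
// are both within the target.
function meetsObjective(strategy: RecoveryObjective, target: RecoveryObjective): boolean {
  return strategy.rpoMinutes <= target.rpoMinutes && strategy.rtoMinutes <= target.rtoMinutes;
}
```

For example, the RDS row in the table (RPO under 5 minutes, RTO under 30 minutes) would satisfy a target of 15 and 60 minutes but not a target of 1 and 10 minutes.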

Reliability Checklist

  • Multi-AZ deployment for critical components
  • Health checks configured (ELB, Route 53)
  • Auto Scaling groups with proper sizing
  • RDS automated backups enabled
  • DynamoDB point-in-time recovery enabled
  • S3 versioning for critical buckets
  • Disaster recovery plan documented and tested
  • Chaos engineering tests (failure injection)
  • Graceful degradation strategies
  • Circuit breaker patterns implemented
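The circuit breaker item can be illustrated with a small state machine. This is a minimal sketch, kept synchronous for brevity (a production version would normally wrap asynchronous calls). The class name, the threshold and timeout parameters, and the injectable `now` clock are all assumptions made for illustration.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private failureThreshold: number;
  private resetTimeoutMs: number;
  private now: () => number;

  constructor(failureThreshold: number, resetTimeoutMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
  }

  call<T>(fn: () => T): T {
    if (this.state === "open") {
      if (this.now() - this.openedAt < this.resetTimeoutMs) {
        throw new Error("circuit open: call rejected"); // fail fast
      }
      this.state = "half-open"; // timeout elapsed: allow one trial call
    }
    try {
      const result = fn();
      this.state = "closed"; // success closes the circuit
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures += 1;
      // A failed trial call, or too many failures, opens the circuit.
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = this.now();
      }
      throw err;
    }
  }

  currentState(): BreakerState {
    return this.state;
  }
}
```

While the circuit is open, callers get an immediate error instead of waiting on a failing dependency, which pairs naturally with the graceful degradation item above.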

Pillar 4: Performance Efficiency

Goal: Use computing resources efficiently to meet requirements and maintain efficiency as demand changes.

Design Principles

  • Democratize advanced technologies
  • Go global in minutes
  • Use serverless architectures
  • Experiment more often
  • Consider mechanical sympathy

Key Areas

Selection:
  • How do you select appropriate resource types and sizes?
  • Compute: EC2, Lambda, Fargate, ECS, EKS?
  • Database: RDS, DynamoDB, Aurora, ElastiCache?
  • Storage: S3, EFS, EBS, Glacier?
Review:
  • How do you evolve workload to use new resources?
  • Regular review of AWS new features?
Monitoring:
  • How do you monitor resources?
  • CloudWatch, X-Ray for distributed tracing?
Trade-offs:
  • How do you use trade-offs to improve performance?
  • Caching, consistency models, compression?

Performance Patterns

Caching Strategy:
```
Client → CloudFront (edge cache)
  → API Gateway
    → Lambda
      → ElastiCache (data cache)
        → DynamoDB/RDS
```
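The data-cache step in the chain is commonly implemented as a cache-aside read. The sketch below shows the idea with stand-ins: an in-memory `Map` plays the role of ElastiCache and the `load` callback plays the role of a DynamoDB/RDS query. Both stand-ins, the class name, and the TTL handling are illustrative assumptions, and the code is synchronous for brevity.

```typescript
// Cache-aside read path: check the cache first, fall back to the data
// store on a miss, then populate the cache for later reads.
class CacheAside<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private load: (key: string) => V;
  private ttlMs: number;
  private now: () => number;

  constructor(load: (key: string) => V, ttlMs: number, now: () => number = Date.now) {
    this.load = load;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V {
    const hit = this.entries.get(key);
    if (hit !== undefined && hit.expiresAt > this.now()) {
      return hit.value; // cache hit: skip the data store
    }
    const value = this.load(key); // miss or expired entry: read the data store
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    return value;
  }
}
```

The TTL is the main trade-off knob: a longer TTL reduces load on the database but lets readers see older data, which ties into the consistency-model question under Trade-offs.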
Database Selection:

| Use Case | Recommended Service |
| --- | --- |
| Relational, complex queries | RDS (PostgreSQL/MySQL) |
| High throughput, simple queries | DynamoDB |
| Graph relationships | Neptune |
| Search and analytics | OpenSearch |
| Time-series data | Timestream |
| In-memory cache | ElastiCache (Redis/Memcached) |

Performance Checklist

  • Right-sized compute instances (not over-provisioned)
  • Content delivery through CloudFront
  • Database read replicas for read-heavy workloads
  • Caching layer (ElastiCache, DAX, CloudFront)
  • Asynchronous processing with SQS/SNS/EventBridge
  • Auto Scaling configured appropriately
  • Database indexes optimized
  • Monitoring with CloudWatch and X-Ray
  • Regular performance testing under load

Pillar 5: Cost Optimization

Goal: Run systems to deliver business value at the lowest price point.

Design Principles

  • Implement cloud financial management
  • Adopt a consumption model
  • Measure overall efficiency
  • Stop spending on undifferentiated heavy lifting
  • Analyze and attribute expenditure

Key Areas

Practice Cloud Financial Management:
  • Cost allocation tags implemented?
  • Budgets and alerts configured?
Expenditure and Usage Awareness:
  • How do you govern usage?
  • Cost Explorer and AWS Budgets configured?
Cost-Effective Resources:
  • How do you evaluate cost when selecting services?
  • Reserved Instances or Savings Plans for predictable workloads?
Manage Demand:
  • How do you manage demand and supply resources?
  • Throttling, caching to reduce demand?
Optimize Over Time:
  • How do you evaluate new services?
  • Regular review of cost optimization opportunities?

Cost Optimization Strategies

| Strategy | Implementation | Potential Savings |
| --- | --- | --- |
| Right-sizing | Use Compute Optimizer recommendations | 20-40% |
| Reserved Instances | 1-year or 3-year commitments | 30-75% |
| Savings Plans | Flexible compute commitments | 30-70% |
| Spot Instances | Fault-tolerant workloads | 50-90% |
| S3 Intelligent-Tiering | Automatic storage class optimization | 40-60% |
| Auto Scaling | Scale resources with demand | 30-50% |
| Lambda instead of EC2 | For appropriate workloads | Varies |

Cost Monitoring

```typescript
// CDK Example: Set up budget alerts
import * as budgets from 'aws-cdk-lib/aws-budgets';

new budgets.CfnBudget(this, 'MonthlyBudget', {
  budget: {
    budgetType: 'COST',
    timeUnit: 'MONTHLY',
    budgetLimit: {
      amount: 1000,
      unit: 'USD',
    },
  },
  notificationsWithSubscribers: [{
    notification: {
      notificationType: 'ACTUAL',
      comparisonOperator: 'GREATER_THAN',
      threshold: 80, // Alert at 80%
    },
    subscribers: [{
      subscriptionType: 'EMAIL',
      address: 'team@example.com',
    }],
  }],
});
```
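To make the alert condition concrete, the comparison can be written out as plain arithmetic. This sketch assumes the threshold is read as a percentage of the budget limit, as the comment in the example suggests. The function name `exceedsThreshold` is hypothetical and is not part of the CDK or the Budgets API.

```typescript
// True when actual spend is strictly greater than `thresholdPct` percent of
// the budget limit (a GREATER_THAN comparison on a percentage threshold).
function exceedsThreshold(actualUsd: number, limitUsd: number, thresholdPct: number): boolean {
  // Cross-multiplied so the comparison avoids dividing by 100 first.
  return actualUsd * 100 > limitUsd * thresholdPct;
}
```

With the values from the example (limit 1000 USD, threshold 80), the alert condition becomes true once actual spend passes 800 USD, and spend of exactly 800 USD does not trigger it.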

Cost Optimization Checklist

  • Cost allocation tags applied consistently
  • AWS Budgets configured with alerts
  • Cost Explorer reviewed monthly
  • Reserved Instances or Savings Plans for stable workloads
  • Spot Instances for fault-tolerant workloads
  • Unused resources identified and terminated
  • S3 lifecycle policies for data management
  • Right-sized instances (not over-provisioned)
  • Lambda memory optimization
  • DynamoDB on-demand vs provisioned analysis
  • Data transfer costs analyzed and optimized

Pillar 6: Sustainability

Goal: Minimize the environmental impact of running cloud workloads.

Design Principles

  • Understand your impact
  • Establish sustainability goals
  • Maximize utilization
  • Anticipate and adopt new, more efficient offerings
  • Use managed services
  • Reduce downstream impact

Key Areas

Region Selection:
  • Choose regions with renewable energy
  • AWS regions with lower carbon intensity
User Behavior Patterns:
  • Scale resources with demand
  • Remove unused resources
Software and Architecture:
  • Optimize code for efficiency
  • Use appropriate services (serverless over provisioned)
Data Patterns:
  • Minimize data movement
  • Use data compression
  • Implement lifecycle policies
Hardware Patterns:
  • Use minimum necessary hardware
  • Use instance types with best performance per watt
Development Process:
  • Test sustainability improvements
  • Measure and report carbon footprint

Sustainability Checklist

  • Workloads in regions with renewable energy
  • Auto Scaling to match demand (no idle resources)
  • Unused resources regularly cleaned up
  • Graviton processors considered for better efficiency
  • Managed services used where appropriate
  • Data lifecycle policies to reduce storage
  • Efficient code (async processing, optimized queries)
  • Monitoring resource utilization
  • Carbon footprint tracked (AWS Customer Carbon Footprint Tool)

Review Process

1. Scoping Phase

Questions to ask:
  • What is the workload scope? (entire system vs specific component)
  • What are the business objectives?
  • What are the compliance requirements?
  • What are the current pain points?

2. Review Each Pillar

For each pillar, use this template:
Current State:
  • Document what exists today
Gaps:
  • What's missing or needs improvement?
Risks:
  • What are the high/medium/low priority risks?
Recommendations:
  • Specific, actionable improvements
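Keeping the template consistent across pillars is easier when it is captured as a data structure. The following is a minimal sketch: the `PillarReview` interface and the `renderReview` function are hypothetical names that mirror the four template sections and are not part of any AWS tooling.

```typescript
type RiskLevel = "High" | "Medium" | "Low";

// One filled-in copy of the per-pillar template.
interface PillarReview {
  pillar: string;
  currentState: string[];
  gaps: string[];
  risks: { level: RiskLevel; description: string }[];
  recommendations: string[];
}

// Renders a review using the same headings and bullet style as the template.
function renderReview(r: PillarReview): string {
  const section = (title: string, items: string[]): string =>
    [`${title}:`, ...items.map((item) => `  • ${item}`)].join("\n");
  return [
    r.pillar,
    section("Current State", r.currentState),
    section("Gaps", r.gaps),
    section("Risks", r.risks.map((x) => `${x.level}: ${x.description}`)),
    section("Recommendations", r.recommendations),
  ].join("\n");
}
```

Because every field is required by the type, a review that skips "Current State" fails to compile rather than slipping through unnoticed.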

3. Prioritization Matrix

| Priority | Criteria |
| --- | --- |
| High | Security vulnerabilities, critical availability risks, major cost waste |
| Medium | Performance issues, moderate cost optimization, operational improvements |
| Low | Nice-to-haves, future considerations, minor optimizations |
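Once findings carry a priority, ordering them is a small sorting step. This sketch assumes a hypothetical `Action` shape and a `prioritize` helper. It sorts High before Medium before Low and keeps the original order within each level, since `Array.prototype.sort` is stable in current JavaScript engines.

```typescript
type Priority = "High" | "Medium" | "Low";

interface Action {
  priority: Priority;
  pillar: string;
  summary: string;
}

// Lower rank sorts first.
const RANK: Record<Priority, number> = { High: 0, Medium: 1, Low: 2 };

// Returns a new array ordered by priority; the input is not modified.
function prioritize(actions: Action[]): Action[] {
  return [...actions].sort((a, b) => RANK[a.priority] - RANK[b.priority]);
}
```

The matrix only fixes the ordering between levels. Ranking items within a level (for example by effort) remains a judgment call for the reviewer.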

4. Action Plan Template

```markdown
## Pillar: [Name]

### Issue: [Description]

- Risk Level: High/Medium/Low
- Impact: [Business impact]
- Effort: Low/Medium/High

### Recommendation:

[Specific actions]

### Implementation Steps:

1. [Step 1]
2. [Step 2]
3. [Step 3]

### Success Criteria:

- [Measurable outcome 1]
- [Measurable outcome 2]

### Resources:

- [AWS documentation links]
- [Blog posts or examples]
```

Common Anti-Patterns

| Anti-Pattern | Issue | Better Approach |
| --- | --- | --- |
| Single AZ deployment | No fault tolerance | Multi-AZ architecture |
| No IaC | Manual config, drift | CloudFormation/CDK/Terraform |
| Hardcoded secrets | Security vulnerability | Secrets Manager/Parameter Store |
| No monitoring | Blind operation | CloudWatch dashboards + alarms |
| No backups | Data loss risk | Automated backup strategy |
| Over-provisioning | Cost waste | Right-sizing + Auto Scaling |
| No cost tracking | Budget overruns | Tags + Budgets + Cost Explorer |
| Monolithic architecture | Hard to scale | Microservices or serverless |

Real-World Example

Scenario: Serverless API with authentication
Architecture Review:
Operational Excellence:
  • ✅ Lambda functions deployed via CDK
  • ✅ CloudWatch logs enabled
  • ❌ Missing: Distributed tracing (X-Ray), dashboards
Security:
  • ❌ CRITICAL: Hardcoded API keys in Lambda environment variables
  • ✅ API Gateway with IAM authorization
  • ❌ Missing: Secrets Manager, encryption at rest
Reliability:
  • ✅ Multi-AZ DynamoDB table
  • ❌ Single region deployment
  • ❌ Missing: Backup strategy, DR plan
Performance:
  • ✅ CloudFront for static assets
  • ❌ No caching for API responses
  • ❌ Lambda cold starts not optimized
Cost:
  • ❌ DynamoDB provisioned capacity, but traffic is spiky
  • ✅ Lambda usage-based pricing
  • ❌ Missing: Budget alerts, cost allocation tags
Sustainability:
  • ✅ Serverless architecture (good utilization)
  • ❌ Unused dev/test resources running 24/7
Priority Actions:
  1. HIGH: Move API keys to Secrets Manager (Security)
  2. HIGH: Implement DynamoDB backups (Reliability)
  3. MEDIUM: Add X-Ray tracing (Operational Excellence)
  4. MEDIUM: Switch DynamoDB to on-demand (Cost)
  5. LOW: Add API Gateway caching (Performance)

Common Mistakes When Using This Framework

| Mistake | Why It's Wrong | Correct Approach |
| --- | --- | --- |
| "Sustainability doesn't apply to this workload" | Every workload consumes resources and energy | Review all 6 pillars, even if findings are minimal |
| Skipping current state documentation | Can't measure improvement without baseline | Always document "Current State" before recommendations |
| Generic recommendations | Not actionable or specific to this workload | Provide specific AWS services, code examples, priorities |
| No prioritization | Everything seems equally important | Use HIGH/MEDIUM/LOW risk levels, create phased plan |
| Forgetting about trade-offs | Optimizing one pillar at expense of others | Explicitly call out trade-offs (e.g., multi-region cost vs reliability) |

Using This Skill

When conducting architecture reviews:
  1. Start with context - understand business objectives and constraints
  2. Review systematically - go through all 6 pillars, don't skip ANY
  3. Document findings - use consistent format per pillar (Current State → Gaps → Risks → Recommendations)
  4. Prioritize ruthlessly - security and availability issues first
  5. Be specific - actionable recommendations with examples and AWS service names
  6. Provide resources - link to AWS docs and examples
  7. Create action plan - clear next steps with success criteria and effort estimates
  8. Call out trade-offs - be explicit about costs and benefits of each recommendation
Remember: Architecture is about trade-offs. A perfect architecture doesn't exist - aim for a well-balanced one that meets business needs.
No exceptions to reviewing all 6 pillars - even if a pillar seems "not applicable", document why and what the current state is.