aws-cloudformation-cloudwatch

AWS CloudFormation CloudWatch Monitoring

Overview

Create production-ready monitoring and observability infrastructure with AWS CloudFormation templates. This skill covers CloudWatch metrics, alarms, dashboards, log groups, Logs Insights, anomaly detection, Synthetics canaries, and Application Signals, along with best practices for parameters, outputs, and cross-stack references.

When to Use

Use this skill when:
  • Creating custom CloudWatch metrics
  • Configuring CloudWatch alarms for thresholds and anomaly detection
  • Creating CloudWatch dashboards for multi-region visualization
  • Implementing log groups with retention and encryption
  • Configuring log subscriptions and cross-account log aggregation
  • Implementing CloudWatch Synthetics canaries for synthetic monitoring
  • Enabling Application Signals for application performance monitoring
  • Organizing templates with Parameters, Outputs, Mappings, Conditions
  • Implementing cross-stack references with export/import
  • Using Transform for macros and reuse

CloudFormation Template Structure

Base Template with Standard Format

yaml
AWSTemplateFormatVersion: 2010-09-09
Description: CloudWatch monitoring and observability stack

Metadata:
  AWS::CloudFormation::Interface:
    ParameterGroups:
      - Label:
          default: Monitoring Configuration
        Parameters:
          - Environment
          - LogRetentionDays
          - EnableAnomalyDetection
      - Label:
          default: Alarm Thresholds
        Parameters:
          - ErrorRateThreshold
          - LatencyThreshold
          - CpuUtilizationThreshold

Parameters:
  Environment:
    Type: String
    Default: dev
    AllowedValues:
      - dev
      - staging
      - production
    Description: Deployment environment

  LogRetentionDays:
    Type: Number
    Default: 30
    AllowedValues:
      - 1
      - 3
      - 5
      - 7
      - 14
      - 30
      - 60
      - 90
      - 120
      - 150
      - 180
      - 365
      - 400
      - 545
      - 731
      - 1095
      - 1827
      - 2190
      - 2555
      - 2922
      - 3285
      - 3650
    Description: Number of days to retain log events

  EnableAnomalyDetection:
    Type: String
    Default: false
    AllowedValues:
      - true
      - false
    Description: Enable CloudWatch anomaly detection

  ErrorRateThreshold:
    Type: Number
    Default: 5
    Description: Error rate threshold for alarms (percentage)

  LatencyThreshold:
    Type: Number
    Default: 1000
    Description: Latency threshold in milliseconds

  CpuUtilizationThreshold:
    Type: Number
    Default: 80
    Description: CPU utilization threshold (percentage)

Mappings:
  EnvironmentConfig:
    dev:
      LogRetentionDays: 7
      ErrorRateThreshold: 10
      LatencyThreshold: 2000
      CpuUtilizationThreshold: 90
    staging:
      LogRetentionDays: 14
      ErrorRateThreshold: 5
      LatencyThreshold: 1500
      CpuUtilizationThreshold: 85
    production:
      LogRetentionDays: 30
      ErrorRateThreshold: 1
      LatencyThreshold: 500
      CpuUtilizationThreshold: 80

Conditions:
  IsProduction: !Equals [!Ref Environment, production]
  IsStaging: !Equals [!Ref Environment, staging]
  EnableAnomaly: !Equals [!Ref EnableAnomalyDetection, true]

Transform:
  - AWS::Serverless-2016-10-31

Resources:
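  # Log group for the application. LogKmsKey and ApplicationName are assumed to be
  # defined elsewhere in the template; KmsKeyId expects a KMS key ARN.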
  ApplicationLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub "/aws/applications/${Environment}/application"
      RetentionInDays: !Ref LogRetentionDays
      KmsKeyId: !Ref LogKmsKey
      Tags:
        - Key: Environment
          Value: !Ref Environment
        - Key: Application
          Value: !Ref ApplicationName

Outputs:
  LogGroupName:
    Description: Name of the application log group
    Value: !Ref ApplicationLogGroup
    Export:
      Name: !Sub "${AWS::StackName}-LogGroupName"
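
The base template declares an EnvironmentConfig mapping and an EnableAnomaly condition, but its one resource does not reference either. A minimal sketch of how they might be consumed, assuming the Parameters, Mappings, and Conditions sections above are kept unchanged; the `MyApp` namespace, the `RequestCount` metric, and the `RequestCountAnomalyDetector` resource are hypothetical:

```yaml
Resources:
  ApplicationLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub "/aws/applications/${Environment}/application"
      # Look up the per-environment retention in the mapping
      # instead of reading the LogRetentionDays parameter.
      RetentionInDays: !FindInMap [EnvironmentConfig, !Ref Environment, LogRetentionDays]

  # Created only when the EnableAnomaly condition evaluates to true.
  RequestCountAnomalyDetector:
    Type: AWS::CloudWatch::AnomalyDetector
    Condition: EnableAnomaly
    Properties:
      Namespace: MyApp            # hypothetical namespace
      MetricName: RequestCount    # hypothetical metric
      Stat: Sum
```

With this wiring the mapping decides retention per environment, so the LogRetentionDays parameter would only be needed as an explicit override.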

Parameters Best Practices

AWS-Specific Parameter Types

yaml
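Parameters:
  # CloudFormation has no AWS-specific parameter types for CloudWatch, SNS, or KMS
  # (the built-in ones mainly cover EC2, Route 53, and SSM), so these values are
  # declared as strings and validated with length and pattern constraints.
  CloudWatchNamespace:
    Type: String
    MinLength: 1
    MaxLength: 255
    Description: CloudWatch metric namespace

  AlarmActionArn:
    Type: String
    AllowedPattern: "arn:aws[a-z-]*:sns:[a-z0-9-]+:[0-9]{12}:.+"
    ConstraintDescription: Must be a valid SNS topic ARN
    Description: SNS topic ARN for alarm actions

  LogKmsKeyArn:
    Type: String
    AllowedPattern: "arn:aws[a-z-]*:kms:[a-z0-9-]+:[0-9]{12}:key/.+"
    ConstraintDescription: Must be a valid KMS key ARN
    Description: KMS key ARN for log encryption

  DashboardArn:
    Type: String
    AllowedPattern: "arn:aws[a-z-]*:cloudwatch::[0-9]{12}:dashboard/.+"
    ConstraintDescription: Must be a valid CloudWatch dashboard ARN
    Description: Existing dashboard ARN to import

  AnomalyDetectorArn:
    Type: String
    Description: Existing anomaly detector ARN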

Parameter Constraints

yaml
Parameters:
  MetricName:
    Type: String
    Description: CloudWatch metric name
    ConstraintDescription: Must be 1-255 characters (letters, digits, underscore, period, or dash)
    MinLength: 1
    MaxLength: 255
    AllowedPattern: "[a-zA-Z0-9._-]+"

  ThresholdValue:
    Type: Number
    Description: Alarm threshold value
    MinValue: 0
    MaxValue: 1000000000

  EvaluationPeriods:
    Type: Number
    Description: Number of evaluation periods
    Default: 5
    MinValue: 1
    MaxValue: 100
    ConstraintDescription: Must be between 1 and 100

  DatapointsToAlarm:
    Type: Number
    Description: Datapoints that must breach to trigger alarm
    Default: 5
    MinValue: 1
    MaxValue: 10

  Period:
    Type: Number
    Description: Metric period in seconds
    Default: 300
    AllowedValues:
      - 10
      - 30
      - 60
      - 300
      - 900
      - 3600
      - 21600
      - 86400

  ComparisonOperator:
    Type: String
    Description: Alarm comparison operator
    Default: GreaterThanThreshold
    AllowedValues:
      - GreaterThanThreshold
      - GreaterThanOrEqualToThreshold
      - LessThanThreshold
      - LessThanOrEqualToThreshold
      - GreaterThanUpperThreshold
      - LessThanLowerThreshold
      - LessThanLowerOrGreaterThanUpperThreshold
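
As a rough sketch of how the constrained parameters above might feed an alarm, the fragment below wires each one into a single AWS::CloudWatch::Alarm. The `ParameterizedAlarm` resource and the `MyApp` namespace are hypothetical. Two things the parameter constraints cannot enforce on their own: DatapointsToAlarm must not exceed EvaluationPeriods, and the anomaly-band comparison operators only work on alarms that use ThresholdMetricId rather than a static Threshold.

```yaml
Resources:
  ParameterizedAlarm:            # hypothetical resource name
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Static-threshold alarm driven entirely by parameters
      Namespace: MyApp           # hypothetical namespace
      MetricName: !Ref MetricName
      Statistic: Average
      Period: !Ref Period
      # The alarm fires when DatapointsToAlarm of the last
      # EvaluationPeriods datapoints breach the threshold.
      EvaluationPeriods: !Ref EvaluationPeriods
      DatapointsToAlarm: !Ref DatapointsToAlarm
      Threshold: !Ref ThresholdValue
      # Use only the four static-threshold operators with this alarm shape.
      ComparisonOperator: !Ref ComparisonOperator
```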

SSM Parameter References

yaml
Parameters:
  AlarmTopicArn:
    Type: AWS::SSM::Parameter::Value<String>
    Default: /monitoring/alarms/topic-arn
    Description: SNS topic ARN from SSM Parameter Store

  DashboardConfig:
    Type: AWS::SSM::Parameter::Value<String>
    Default: /monitoring/dashboards/config
    Description: Dashboard configuration from SSM
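
For an SSM-backed parameter, `!Ref` returns the value stored in Parameter Store, not the parameter key. A small sketch of that behavior, where the `QueueDepthAlarm` resource and the `my-queue` queue name are hypothetical:

```yaml
Parameters:
  AlarmTopicArn:
    Type: AWS::SSM::Parameter::Value<String>
    Default: /monitoring/alarms/topic-arn

Resources:
  QueueDepthAlarm:               # hypothetical resource name
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Alert when the queue backlog grows
      Namespace: AWS/SQS
      MetricName: ApproximateNumberOfMessagesVisible
      Dimensions:
        - Name: QueueName
          Value: my-queue        # hypothetical queue name
      Statistic: Maximum
      Period: 300
      EvaluationPeriods: 1
      Threshold: 100
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        # Resolves to the topic ARN stored under /monitoring/alarms/topic-arn
        - !Ref AlarmTopicArn
```

CloudFormation reads the SSM value at stack create or update time, so a change in Parameter Store only reaches the alarm on the next stack update.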

Outputs and Cross-Stack References

Export/Import Patterns

Stack A - Monitoring Stack

yaml
AWSTemplateFormatVersion: 2010-09-09
Description: Central monitoring infrastructure stack

Resources:
  AlarmTopic:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: !Sub "${AWS::StackName}-alarms"
      DisplayName: !Sub "${AWS::StackName} Alarm Notifications"

  LogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub "/aws/monitoring/${AWS::StackName}"
      RetentionInDays: 30

Outputs:
  AlarmTopicArn:
    Description: ARN of the alarm SNS topic
    Value: !Ref AlarmTopic
    Export:
      Name: !Sub "${AWS::StackName}-AlarmTopicArn"

  LogGroupName:
    Description: Name of the log group
    Value: !Ref LogGroup
    Export:
      Name: !Sub "${AWS::StackName}-LogGroupName"

  LogGroupArn:
    Description: ARN of the log group
    Value: !GetAtt LogGroup.Arn
    Export:
      Name: !Sub "${AWS::StackName}-LogGroupArn"

Stack B - Application Stack (imports from Monitoring Stack)

yaml
AWSTemplateFormatVersion: 2010-09-09
Description: Application stack with monitoring integration

Parameters:
  MonitoringStackName:
    Type: String
    Description: Name of the monitoring stack

Resources:
  # CodeBucket and LambdaExecutionRole are assumed to be defined elsewhere in this template.
  LambdaFunction:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: !Sub "${AWS::StackName}-processor"
      Runtime: python3.11
      Handler: app.handler
      Code:
        S3Bucket: !Ref CodeBucket
        S3Key: lambda/function.zip
      Role: !GetAtt LambdaExecutionRole.Arn

  ErrorAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${AWS::StackName}-errors"
      AlarmDescription: Alert on Lambda errors
      MetricName: Errors
      Namespace: AWS/Lambda
      Dimensions:
        - Name: FunctionName
          Value: !Ref LambdaFunction
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 5
      Threshold: 1
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        # Long-form Fn::ImportValue, because YAML cannot nest the short forms !ImportValue and !Sub
        - Fn::ImportValue: !Sub "${MonitoringStackName}-AlarmTopicArn"

  HighLatencyAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${AWS::StackName}-latency"
      AlarmDescription: Alert on high latency
      MetricName: Duration
      Namespace: AWS/Lambda
      Dimensions:
        - Name: FunctionName
          Value: !Ref LambdaFunction
      # Percentiles go in ExtendedStatistic; Statistic accepts only the standard statistics
      ExtendedStatistic: p99
      Period: 60
      EvaluationPeriods: 3
      Threshold: 5000
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        - Fn::ImportValue: !Sub "${MonitoringStackName}-AlarmTopicArn"

Nested Stacks for Modularity

yaml
AWSTemplateFormatVersion: 2010-09-09
Description: Main stack with nested monitoring stacks

Resources:
  # Nested stack for alarms
  AlarmsStack:
    Type: AWS::CloudFormation::Stack
    Properties:
      TemplateURL: https://s3.amazonaws.com/bucket/monitoring/alarms.yaml
      TimeoutInMinutes: 15
      Parameters:
        Environment: !Ref Environment
        AlarmTopicArn: !Ref AlarmTopicArn

  # Nested stack for dashboards
  DashboardsStack:
    Type: AWS::CloudFormation::Stack
    Properties:
      TemplateURL: https://s3.amazonaws.com/bucket/monitoring/dashboards.yaml
      TimeoutInMinutes: 15
      Parameters:
        Environment: !Ref Environment
        LogGroupNames: !Join [",", [!GetAtt AlarmsStack.Outputs.LogGroupName]]

  # Nested stack for log insights
  LogInsightsStack:
    Type: AWS::CloudFormation::Stack
    Properties:
      TemplateURL: https://s3.amazonaws.com/bucket/monitoring/log-insights.yaml
      TimeoutInMinutes: 15
      Parameters:
        Environment: !Ref Environment
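
The parent reads `!GetAtt AlarmsStack.Outputs.LogGroupName`, which only resolves if the child template declares an output with that exact name. A minimal sketch of what the child `alarms.yaml` might look like, assuming it accepts the two parameters the parent passes in; the resource layout and log group name are hypothetical:

```yaml
# monitoring/alarms.yaml (hypothetical child template)
AWSTemplateFormatVersion: 2010-09-09
Description: Alarms nested stack

Parameters:
  Environment:
    Type: String
  AlarmTopicArn:
    Type: String

Resources:
  LogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub "/aws/monitoring/${Environment}/alarms"
      RetentionInDays: 30

Outputs:
  # Read by the parent stack as !GetAtt AlarmsStack.Outputs.LogGroupName
  LogGroupName:
    Description: Name of the log group created by this nested stack
    Value: !Ref LogGroup
```

Nested stack outputs do not need an Export; the parent reaches them directly through `Outputs.<Name>`, which also makes DashboardsStack wait for AlarmsStack to finish.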

CloudWatch Metrics and Alarms

Base Metric Alarm

yaml
AWSTemplateFormatVersion: 2010-09-09
Description: CloudWatch metric alarms

Resources:
  # Error rate alarm
  ErrorRateAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${AWS::StackName}-error-rate"
      AlarmDescription: Alert when error rate exceeds threshold
      MetricName: ErrorRate
      Namespace: !Ref CustomNamespace
      Dimensions:
        - Name: Service
          Value: !Ref ServiceName
        - Name: Environment
          Value: !Ref Environment
      Statistic: Average
      Period: 60
      EvaluationPeriods: 5
      DatapointsToAlarm: 3
      Threshold: !Ref ErrorRateThreshold
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        - !Ref AlarmTopic
      InsufficientDataActions:
        - !Ref AlarmTopic
      OKActions:
        - !Ref AlarmTopic
      Tags:
        - Key: Environment
          Value: !Ref Environment
        - Key: Severity
          Value: high

  # P99 latency alarm
  LatencyAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${AWS::StackName}-p99-latency"
      AlarmDescription: Alert when P99 latency exceeds threshold
      MetricName: Latency
      Namespace: !Ref CustomNamespace
      Dimensions:
        - Name: Service
          Value: !Ref ServiceName
      # Percentile statistics use ExtendedStatistic; it cannot be combined with Statistic
      ExtendedStatistic: p99
      Period: 60
      EvaluationPeriods: 3
      Threshold: !Ref LatencyThreshold
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions:
        - !Ref AlarmTopic

  # 4xx errors alarm
  ClientErrorAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${AWS::StackName}-4xx-errors"
      AlarmDescription: Alert on high 4xx error rate
      MetricName: 4XXError
      Namespace: AWS/ApiGateway
      Dimensions:
        - Name: ApiName
          Value: !Ref ApiName
        - Name: Stage
          Value: !Ref StageName
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 2
      Threshold: 100
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        - !Ref AlarmTopic

  # 5xx errors alarm
  ServerErrorAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${AWS::StackName}-5xx-errors"
      AlarmDescription: Alert on high 5xx error rate
      MetricName: 5XXError
      Namespace: AWS/ApiGateway
      Dimensions:
        - Name: ApiName
          Value: !Ref ApiName
        - Name: Stage
          Value: !Ref StageName
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 10
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        - !Ref AlarmTopic
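
The alarms above watch custom metrics in CustomNamespace, but nothing in the template publishes them. One possible source is a metric filter on a log group. This is a sketch under assumptions: `ApplicationLogGroup` is a log group defined elsewhere, and the filter emits a hypothetical `ErrorCount` metric without dimensions, which is not the `ErrorRate` metric used above.

```yaml
Resources:
  ErrorCountMetricFilter:        # hypothetical resource name
    Type: AWS::Logs::MetricFilter
    Properties:
      LogGroupName: !Ref ApplicationLogGroup   # assumed to exist in this template
      # Match any log event containing the term ERROR
      FilterPattern: "ERROR"
      MetricTransformations:
        - MetricNamespace: !Ref CustomNamespace
          MetricName: ErrorCount               # hypothetical metric name
          MetricValue: "1"                     # add 1 per matching event
          DefaultValue: 0                      # report 0 when nothing matches
          Unit: Count
```

An alarm on this metric would set `MetricName: ErrorCount` with `Statistic: Sum` and omit Dimensions, because the filter publishes the metric without any.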

Composite Alarm

yaml
AWSTemplateFormatVersion: 2010-09-09
Description: CloudWatch composite alarms

Resources:
  # Base alarm for Lambda errors
  LambdaErrorAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${AWS::StackName}-lambda-errors"
      MetricName: Errors
      Namespace: AWS/Lambda
      Dimensions:
        - Name: FunctionName
          Value: !Ref LambdaFunction
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 5
      Threshold: 5
      ComparisonOperator: GreaterThanThreshold

  # Base alarm for Lambda throttles
  LambdaThrottleAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${AWS::StackName}-lambda-throttles"
      MetricName: Throttles
      Namespace: AWS/Lambda
      Dimensions:
        - Name: FunctionName
          Value: !Ref LambdaFunction
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 5
      Threshold: 3
      ComparisonOperator: GreaterThanThreshold

  # Composite alarm combining both
  LambdaHealthCompositeAlarm:
    Type: AWS::CloudWatch::CompositeAlarm
    Properties:
      AlarmName: !Sub "${AWS::StackName}-lambda-health"
      AlarmDescription: Composite alarm for Lambda function health
      # AlarmRule is a string expression over alarm names, not an intrinsic function
      AlarmRule: !Sub "ALARM(${LambdaErrorAlarm}) OR ALARM(${LambdaThrottleAlarm})"
      ActionsEnabled: true
      AlarmActions:
        - !Ref AlarmTopic
      Tags:
        - Key: Service
          Value: lambda
        - Key: Tier
          Value: application

Anomaly Detection Alarm

yaml
AWSTemplateFormatVersion: 2010-09-09
Description: CloudWatch anomaly detection

Resources:
  # Anomaly detector for metric
  RequestRateAnomalyDetector:
    Type: AWS::CloudWatch::AnomalyDetector
    Properties:
      MetricName: RequestCount
      Namespace: !Ref CustomNamespace
      Dimensions:
        - Name: Service
          Value: !Ref ServiceName
        - Name: Environment
          Value: !Ref Environment
      Stat: Sum
      Configuration:
        ExcludedTimeRanges:
          - StartTime: "2023-12-25T00:00:00"
            EndTime: "2023-12-26T00:00:00"
        MetricTimeZone: UTC

  # Alarm based on anomaly detection
  AnomalyAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${AWS::StackName}-anomaly-detection"
      AlarmDescription: Alert on anomalous metric behavior
      # Anomaly alarms compare the metric with a band expression
      # (ThresholdMetricId) instead of a static Threshold.
      Metrics:
        - Id: m1
          MetricStat:
            Metric:
              MetricName: RequestCount
              Namespace: !Ref CustomNamespace
              Dimensions:
                - Name: Service
                  Value: !Ref ServiceName
            Period: 300
            Stat: Sum
        - Id: band
          # Second argument is the band width in standard deviations
          Expression: ANOMALY_DETECTION_BAND(m1, 2)
      ThresholdMetricId: band
      EvaluationPeriods: 2
      ComparisonOperator: GreaterThanUpperThreshold
      AlarmActions:
        - !Ref AlarmTopic

  # Alarm for low anomalous value
  LowTrafficAnomalyAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${AWS::StackName}-low-traffic"
      AlarmDescription: Alert on unusually low traffic
      Metrics:
        - Id: m1
          MetricStat:
            Metric:
              MetricName: RequestCount
              Namespace: !Ref CustomNamespace
              Dimensions:
                - Name: Service
                  Value: !Ref ServiceName
            Period: 300
            Stat: Sum
        - Id: band
          Expression: ANOMALY_DETECTION_BAND(m1, 0.5)
      ThresholdMetricId: band
      EvaluationPeriods: 3
      # Fires only when the metric drops below the lower edge of the band
      ComparisonOperator: LessThanLowerThreshold
      AlarmActions:
        - !Ref AlarmTopic

CloudWatch Dashboards

Dashboard Base

yaml
AWSTemplateFormatVersion: 2010-09-09
Description: CloudWatch dashboard

Resources:
  # Main dashboard
  MainDashboard:
    Type: AWS::CloudWatch::Dashboard
    Properties:
      DashboardName: !Sub "${AWS::StackName}-main"
      DashboardBody: !Sub |
        {
          "widgets": [
            {
              "type": "metric",
              "x": 0,
              "y": 0,
              "width": 12,
              "height": 6,
              "properties": {
                "title": "API Gateway Requests",
                "view": "timeSeries",
                "stacked": false,
                "region": "${AWS::Region}",
                "metrics": [
                  ["AWS/ApiGateway", "Count", "ApiName", "${ApiName}", "Stage", "${StageName}"],
                  [".", "4XXError", ".", ".", ".", "."],
                  [".", "5XXError", ".", ".", ".", "."]
                ],
                "period": 300,
                "stat": "Sum"
              }
            },
            {
              "type": "metric",
              "x": 12,
              "y": 0,
              "width": 12,
              "height": 6,
              "properties": {
                "title": "API Gateway Latency",
                "view": "timeSeries",
                "region": "${AWS::Region}",
                "metrics": [
                  ["AWS/ApiGateway", "Latency", "ApiName", "${ApiName}", "Stage", "${StageName}", {"stat": "p99"}],
                  [".", ".", ".", ".", ".", ".", {"stat": "Average"}]
                ],
                "period": 300
              }
            },
            {
              "type": "metric",
              "x": 0,
              "y": 6,
              "width": 12,
              "height": 6,
              "properties": {
                "title": "Lambda Invocations",
                "view": "timeSeries",
                "region": "${AWS::Region}",
                "metrics": [
                  ["AWS/Lambda", "Invocations", "FunctionName", "${LambdaFunction}"],
                  [".", "Errors", ".", "."],
                  [".", "Throttles", ".", "."]
                ],
                "period": 60,
                "stat": "Sum"
              }
            },
            {
              "type": "metric",
              "x": 12,
              "y": 6,
              "width": 12,
              "height": 6,
              "properties": {
                "title": "Lambda Duration",
                "view": "timeSeries",
                "region": "${AWS::Region}",
                "metrics": [
                  ["AWS/Lambda", "Duration", "FunctionName", "${LambdaFunction}", {"stat": "p99"}],
                  [".", ".", ".", ".", {"stat": "Average"}],
                  [".", ".", ".", ".", {"stat": "Maximum"}]
                ],
                "period": 60
              }
            },
            {
              "type": "log",
              "x": 0,
              "y": 12,
              "width": 24,
              "height": 6,
              "properties": {
                "title": "Application Logs",
                "view": "table",
                "region": "${AWS::Region}",
                "query": "SOURCE '${ApplicationLogGroup}' | fields @timestamp, @message | filter @message like /ERROR|WARN/ | sort @timestamp desc | limit 100"
              }
            }
          ]
        }

  # Dashboard for specific service
  ServiceDashboard:
    Type: AWS::CloudWatch::Dashboard
    Properties:
      DashboardName: !Sub "${AWS::StackName}-${ServiceName}"
      DashboardBody: !Sub |
        {
          "start": "-PT6H",
          "widgets": [
            {
              "type": "text",
              "x": 0,
              "y": 0,
              "width": 24,
              "height": 1,
              "properties": {
                "markdown": "# ${ServiceName} - ${Environment} Dashboard"
              }
            },
            {
              "type": "metric",
              "x": 0,
              "y": 1,
              "width": 8,
              "height": 6,
              "properties": {
                "title": "Request Rate",
                "view": "timeSeries",
                "stacked": false,
                "region": "${AWS::Region}",
                "metrics": [
                  ["${CustomNamespace}", "RequestCount", "Service", "${ServiceName}", "Environment", "${Environment}"]
                ],
                "period": 60,
                "stat": "Sum"
              }
            },
            {
              "type": "metric",
              "x": 8,
              "y": 1,
              "width": 8,
              "height": 6,
              "properties": {
                "title": "Error Rate %",
                "view": "timeSeries",
                "region": "${AWS::Region}",
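                "metrics": [
                  [{"expression": "100 * errors / requests", "label": "Error Rate %", "id": "errorRate"}],
                  ["${CustomNamespace}", "ErrorCount", "Service", "${ServiceName}", {"id": "errors", "visible": false}],
                  [".", "RequestCount", ".", ".", {"id": "requests", "visible": false}]
                ],
                "period": 60,
                "stat": "Sum"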
              }
            },
            {
              "type": "metric",
              "x": 16,
              "y": 1,
              "width": 8,
              "height": 6,
              "properties": {
                "title": "P99 Latency",
                "view": "timeSeries",
                "region": "${AWS::Region}",
                "metrics": [
                  ["${CustomNamespace}", "Latency", "Service", "${ServiceName}"]
                ],
                "period": 60,
                "stat": "p99"
              }
            }
          ]
        }
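
The dashboard bodies above substitute several names through `!Sub` that the snippet itself does not declare. A minimal sketch of a matching `Parameters` section follows, assuming these values arrive as stack parameters; the descriptions are illustrative, and `LambdaFunction` and `ApplicationLogGroup` are assumed to be resources (or parameters) defined elsewhere in the same template.

```yaml
# Hypothetical parameter declarations for the names used in DashboardBody.
# None of these definitions appear in the original template.
Parameters:
  ApiName:
    Type: String
    Description: API Gateway API name used as the ApiName dimension
  StageName:
    Type: String
    Description: API Gateway stage used as the Stage dimension
  ServiceName:
    Type: String
    Description: Service name used in the dashboard name and metric dimensions
  Environment:
    Type: String
    AllowedValues:
      - dev
      - staging
      - production
    Description: Deployment environment shown in the dashboard title
  CustomNamespace:
    Type: String
    Description: CloudWatch namespace that holds the custom service metrics
```

Because `!Sub` resolves every `${...}` token, any name that is neither a parameter, a resource, nor a pseudo parameter makes template validation fail, so declaring them up front keeps the dashboard template self-contained.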

CloudWatch Logs

Log Group Configurations

yaml
AWSTemplateFormatVersion: 2010-09-09
Description: CloudWatch log groups configuration

Parameters:
  LogRetentionDays:
    Type: Number
    Default: 30
    AllowedValues:
      - 1
      - 3
      - 5
      - 7
      - 14
      - 30
      - 60
      - 90
      - 120
      - 150
      - 180
      - 365
      - 400
      - 545
      - 731
      - 1096
      - 1827
      - 2192
      - 2557
      - 2922
      - 3288
      - 3653

Resources:
  # Application log group
  ApplicationLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub "/aws/applications/${Environment}/${ApplicationName}"
      RetentionInDays: !Ref LogRetentionDays
      KmsKeyId: !Ref LogKmsKeyArn
      Tags:
        - Key: Environment
          Value: !Ref Environment
        - Key: Application
          Value: !Ref ApplicationName
        - Key: Service
          Value: !Ref ServiceName

  # Lambda log group
  LambdaLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub "/aws/lambda/${LambdaFunctionName}"
      RetentionInDays: !Ref LogRetentionDays
      KmsKeyId: !Ref LogKmsKeyArn

  # Subscription filter that streams matching events to a destination (for example Kinesis, Firehose, or Lambda)
  LogSubscriptionFilter:
    Type: AWS::Logs::SubscriptionFilter
    Properties:
      DestinationArn: !GetAtt LogDestination.Arn
      FilterPattern: '[timestamp=*Z, request_id, level, message]'
      LogGroupName: !Ref ApplicationLogGroup
      RoleArn: !GetAtt LogSubscriptionRole.Arn

  # Metric filter for errors
  ErrorMetricFilter:
    Type: AWS::Logs::MetricFilter
    Properties:
      FilterPattern: '[level="ERROR", msg]'
      LogGroupName: !Ref ApplicationLogGroup
      # A metric filter accepts exactly one transformation, and MetricValue
      # must be numeric, so only the error counter is emitted here.
      MetricTransformations:
        - MetricValue: "1"
          MetricNamespace: !Sub "${AWS::StackName}/Application"
          MetricName: ErrorCount

  # Metric filter for warnings
  WarningMetricFilter:
    Type: AWS::Logs::MetricFilter
    Properties:
      FilterPattern: '[level="WARN", msg]'
      LogGroupName: !Ref ApplicationLogGroup
      MetricTransformations:
        - MetricValue: "1"
          MetricNamespace: !Sub "${AWS::StackName}/Application"
          MetricName: WarningCount

  # Log group with custom retention
  AuditLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub "/aws/audit/${Environment}/${ApplicationName}"
      RetentionInDays: 365
      KmsKeyId: !Ref LogKmsKeyArn
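
A metric filter only publishes data points; something still has to watch them. The sketch below shows one way to alarm on the `ErrorCount` metric produced by `ErrorMetricFilter`. The `ErrorAlarmThreshold` parameter and its default are placeholders, not values from the original, and `AlarmTopic` is assumed to be an SNS topic defined elsewhere in the stack.

```yaml
Parameters:
  # Hypothetical parameter; tune the default to the application's error budget
  ErrorAlarmThreshold:
    Type: Number
    Default: 5
    Description: Errors per 5 minutes that trigger the alarm

Resources:
  ApplicationErrorAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${AWS::StackName}-application-errors"
      AlarmDescription: Errors counted by the ErrorCount metric filter
      # Namespace and metric name must match the metric filter exactly
      Namespace: !Sub "${AWS::StackName}/Application"
      MetricName: ErrorCount
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 1
      Threshold: !Ref ErrorAlarmThreshold
      ComparisonOperator: GreaterThanOrEqualToThreshold
      # The filter emits nothing when no errors are logged, so treat gaps as healthy
      TreatMissingData: notBreaching
      AlarmActions:
        - !Ref AlarmTopic
```

Setting `TreatMissingData` to `notBreaching` matters here because a quiet log group produces no data points at all, which would otherwise leave the alarm in `INSUFFICIENT_DATA`.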

Log Insights Query

yaml
AWSTemplateFormatVersion: 2010-09-09
Description: CloudWatch Logs Insights queries

Resources:
  # Query definition for recent errors
  RecentErrorsQuery:
    Type: AWS::Logs::QueryDefinition
    Properties:
      Name: !Sub "${AWS::StackName}-recent-errors"
      QueryString: |
        fields @timestamp, @message, @logStream
        | filter @message like /ERROR/
        | sort @timestamp desc
        | limit 100
        | display @timestamp, @message, @logStream

  # Query for performance analysis
  PerformanceQuery:
    Type: AWS::Logs::QueryDefinition
    Properties:
      Name: !Sub "${AWS::StackName}-performance"
      QueryString: |
        fields @timestamp, @message, @duration
        | filter @duration > 1000
        | sort @duration desc
        | limit 50
        | display @timestamp, @duration, @message
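
The query definitions above are not tied to a log group, so whoever runs them must pick one in the console. As a sketch, a saved query can be scoped with `LogGroupNames` and can aggregate rather than list events; the resource name `ErrorTrendQuery` and the 5-minute bin are illustrative choices, and `ApplicationLogGroup` is assumed to exist in the same template.

```yaml
Resources:
  # Hypothetical saved query: error count per 5-minute bucket
  ErrorTrendQuery:
    Type: AWS::Logs::QueryDefinition
    Properties:
      Name: !Sub "${AWS::StackName}-error-trend"
      # Pre-selects the log group when the saved query is opened
      LogGroupNames:
        - !Ref ApplicationLogGroup
      QueryString: |
        filter @message like /ERROR/
        | stats count(*) as errorCount by bin(5m)
        | sort errorCount desc
```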

CloudWatch Synthetics Canaries

yaml
AWSTemplateFormatVersion: 2010-09-09
Description: CloudWatch Synthetics canaries

Parameters:
  CanarySchedule:
    Type: String
    Default: rate(5 minutes)
    Description: Schedule expression for canary

Resources:
  # Canary for API endpoint
  ApiCanary:
    Type: AWS::Synthetics::Canary
    Properties:
      Name: !Sub "${AWS::StackName}-api-check"
      ArtifactS3Location: !Sub "s3://${ArtifactBucket}/canary/${AWS::StackName}"
      Code:
        S3Bucket: !Ref CanariesCodeBucket
        S3Key: canary/api-check.zip
        Handler: apiCheck.handler
      ExecutionRoleArn: !GetAtt CanaryRole.Arn
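      # syn-python-selenium-1.1 is an old runtime; check the Synthetics runtime
      # list and substitute a currently supported version before deploying.
      RuntimeVersion: syn-python-selenium-1.1
      Schedule:
        Expression: !Ref CanarySchedule
        # DurationInSeconds is omitted: it limits how long the schedule stays
        # active, and leaving it out keeps the canary running until stopped.
      RunConfig:
        # Per-run timeout (assumed to be the intent of the original 120)
        TimeoutInSeconds: 120
      StartCanaryAfterCreation: true
      SuccessRetentionPeriod: 31
      FailureRetentionPeriod: 31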
      Tags:
        - Key: Environment
          Value: !Ref Environment
        - Key: Service
          Value: api

  # Alarm for canary failure
  CanaryFailureAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${AWS::StackName}-canary-failed"
      AlarmDescription: Alert when the Synthetics canary fails
      MetricName: Failed
      Namespace: CloudWatchSynthetics
      Dimensions:
        - Name: CanaryName
          Value: !Ref ApiCanary
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 1
      ComparisonOperator: GreaterThanOrEqualToThreshold
      AlarmActions:
        - !Ref AlarmTopic

  # Alarm for canary latency
  CanaryLatencyAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${AWS::StackName}-canary-slow"
      AlarmDescription: Alert when canary latency is high
      MetricName: Duration
      Namespace: CloudWatchSynthetics
      Dimensions:
        - Name: CanaryName
          Value: !Ref ApiCanary
      # Percentile statistics go in ExtendedStatistic, not Statistic
      ExtendedStatistic: p99
      Period: 300
      EvaluationPeriods: 3
      Threshold: 5000
      ComparisonOperator: GreaterThanThreshold

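  # Execution role assumed by the Lambda function that runs the canary
  CanaryRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Sub "${AWS::StackName}-canary-role"
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: SyntheticsLeastPrivilege
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - synthetics:DescribeCanaries
                  - synthetics:DescribeCanaryRuns
                  - synthetics:GetCanary
                  - synthetics:ListTagsForResource
                Resource: "*"
              # synthetics:StartCanary and synthetics:StopCanary are omitted:
              # scoping them with !Ref ApiCanary creates a circular dependency
              # with ExecutionRoleArn, and !Ref returns a name, not an ARN.
              # Publish canary metrics only to the Synthetics namespace
              - Effect: Allow
                Action:
                  - cloudwatch:PutMetricData
                Resource: "*"
                Condition:
                  StringEquals:
                    cloudwatch:namespace: CloudWatchSynthetics
              - Effect: Allow
                Action:
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                  - logs:DescribeLogStreams
                Resource: !Sub "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/lambda/cwsyn-*"
              - Effect: Allow
                Action:
                  - s3:ListAllMyBuckets
                Resource: "*"
              - Effect: Allow
                Action:
                  - s3:GetBucketLocation
                Resource: !Sub "arn:aws:s3:::${ArtifactBucket}"
              # IAM needs an S3 ARN here, not an s3:// URI
              - Effect: Allow
                Action:
                  - s3:PutObject
                  - s3:GetObject
                Resource: !Sub "arn:aws:s3:::${ArtifactBucket}/canary/${AWS::StackName}/*"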
              - Effect: Allow
                Action:
                  - kms:Decrypt
                Resource: !Ref KmsKeyArn
                Condition:
                  StringEquals:
                    kms:ViaService: !Sub "s3.${AWS::Region}.amazonaws.com"
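
Alarming on a single failed run can be noisy for flaky endpoints. As an alternative sketch, the `SuccessPercent` metric that Synthetics publishes in the `CloudWatchSynthetics` namespace can be averaged over several runs. The `MinSuccessPercent` parameter and its default are placeholders rather than values from the original, and `AlarmTopic` is assumed to be defined elsewhere.

```yaml
Parameters:
  # Hypothetical parameter; pick a value that matches the endpoint's SLO
  MinSuccessPercent:
    Type: Number
    Default: 90
    Description: Minimum acceptable canary success rate in percent

Resources:
  CanarySuccessRateAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${AWS::StackName}-canary-success-rate"
      AlarmDescription: Alert when the canary success rate drops below the minimum
      Namespace: CloudWatchSynthetics
      MetricName: SuccessPercent
      Dimensions:
        - Name: CanaryName
          Value: !Ref ApiCanary
      Statistic: Average
      Period: 300
      EvaluationPeriods: 3
      Threshold: !Ref MinSuccessPercent
      ComparisonOperator: LessThanThreshold
      # A canary that stops reporting is treated as unhealthy
      TreatMissingData: breaching
      AlarmActions:
        - !Ref AlarmTopic
```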

CloudWatch Application Signals

yaml
AWSTemplateFormatVersion: 2010-09-09
Description: CloudWatch Application Signals for APM

# Application Signals resources live in the AWS::ApplicationSignals namespace.
# Periods, statistics, and LatencyThresholdMs below are assumed placeholders.
Resources:
  # Enables Application Signals for the account
  ApplicationSignalsDiscovery:
    Type: AWS::ApplicationSignals::Discovery

  # Service level objective for availability, based on a custom metric
  AvailabilitySLO:
    Type: AWS::ApplicationSignals::ServiceLevelObjective
    DependsOn: ApplicationSignalsDiscovery
    Properties:
      Name: !Sub "${AWS::StackName}-availability"
      Description: Service level objective for availability
      Sli:
        SliMetric:
          MetricDataQueries:
            - Id: availability
              ReturnData: true
              MetricStat:
                Period: 60
                Stat: Average
                Metric:
                  MetricName: Availability
                  Namespace: !Sub "${AWS::StackName}/Application"
                  Dimensions:
                    - Name: Service
                      Value: !Ref ServiceName
        MetricThreshold: 99.9
        ComparisonOperator: GreaterThanOrEqualTo
      Goal:
        Interval:
          RollingInterval:
            Duration: 1
            DurationUnit: HOUR
        AttainmentGoal: 99.9

  # Service level objective for latency of the GetItem operation
  LatencySLO:
    Type: AWS::ApplicationSignals::ServiceLevelObjective
    DependsOn: ApplicationSignalsDiscovery
    Properties:
      Name: !Sub "${AWS::StackName}-latency-slo"
      Sli:
        SliMetric:
          KeyAttributes:
            Type: Service
            Name: !Ref ServiceName
            Environment: !Ref Environment
          OperationName: GetItem
          MetricType: LATENCY
          Statistic: p99
          PeriodSeconds: 60
        # Hypothetical parameter: latency limit in milliseconds
        MetricThreshold: !Ref LatencyThresholdMs
        ComparisonOperator: LessThanOrEqualTo

Conditions and Transform

Conditions for Environment-Specific Resources

yaml
AWSTemplateFormatVersion: 2010-09-09
Description: CloudWatch with conditional resources

Parameters:
  Environment:
    Type: String
    Default: dev
    AllowedValues:
      - dev
      - staging
      - production
    Description: Deployment environment

Conditions:
  IsProduction: !Equals [!Ref Environment, production]
  IsStaging: !Equals [!Ref Environment, staging]
  CreateAnomalyDetection: !Or [!Equals [!Ref Environment, staging], !Equals [!Ref Environment, production]]
  CreateSLI: !Equals [!Ref Environment, production]

Resources:
  # Base alarm for all environments
  BaseAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${AWS::StackName}-errors"
      MetricName: Errors
      Namespace: !Ref CustomNamespace
      Dimensions:
        - Name: Service
          Value: !Ref ServiceName
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 5
      Threshold: 10
      ComparisonOperator: GreaterThanThreshold

  # Alarm with different thresholds for production
  ProductionAlarm:
    Type: AWS::CloudWatch::Alarm
    Condition: IsProduction
    Properties:
      AlarmName: !Sub "${AWS::StackName}-errors-production"
      MetricName: Errors
      Namespace: !Ref CustomNamespace
      Dimensions:
        - Name: Service
          Value: !Ref ServiceName
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 3
      Threshold: 1
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        - !Ref ProductionAlarmTopic

  # Anomaly detector only for staging and production
  AnomalyDetector:
    Type: AWS::CloudWatch::AnomalyDetector
    Condition: CreateAnomalyDetection
    Properties:
      MetricName: RequestCount
      Namespace: !Ref CustomNamespace
      Dimensions:
        - Name: Service
          Value: !Ref ServiceName
      Stat: Sum

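  # SLO only for production (Application Signals must be enabled in the account)
  AvailabilitySLO:
    Type: AWS::ApplicationSignals::ServiceLevelObjective
    Condition: CreateSLI
    Properties:
      Name: !Sub "${AWS::StackName}-slo"
      Sli:
        SliMetric:
          MetricDataQueries:
            - Id: availability
              ReturnData: true
              MetricStat:
                Period: 60        # assumed period
                Stat: Average     # assumed statistic
                Metric:
                  MetricName: Availability
                  Namespace: !Sub "${AWS::StackName}/Application"
        MetricThreshold: 99.9     # assumed target
        ComparisonOperator: GreaterThanOrEqualTo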
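
Conditions can also switch individual property values, which avoids maintaining two near-identical alarms. The sketch below folds the base and production thresholds from the template above into one resource with `Fn::If`; the resource name `ErrorAlarm` is illustrative, and `IsProduction`, `CustomNamespace`, `ServiceName`, and `ProductionAlarmTopic` are assumed to be defined as in that template.

```yaml
Resources:
  # Single alarm whose sensitivity depends on the environment
  ErrorAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${AWS::StackName}-errors"
      MetricName: Errors
      Namespace: !Ref CustomNamespace
      Dimensions:
        - Name: Service
          Value: !Ref ServiceName
      Statistic: Sum
      Period: 60
      # Production: 3 periods above 1 error; other environments: 5 periods above 10
      EvaluationPeriods: !If [IsProduction, 3, 5]
      Threshold: !If [IsProduction, 1, 10]
      ComparisonOperator: GreaterThanThreshold
      # AWS::NoValue removes the property entirely outside production
      AlarmActions: !If
        - IsProduction
        - [!Ref ProductionAlarmTopic]
        - !Ref AWS::NoValue
```

The trade-off is that a single resource cannot keep both alarms active in production at once; if the looser base alarm should coexist with the strict one, the two-resource layout with a resource-level `Condition` is the better fit.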

Transform for Code Reuse

yaml
AWSTemplateFormatVersion: 2010-09-09
Transform: AWS::Serverless-2016-10-31

Description: Using SAM Transform for CloudWatch resources

Globals:
  Function:
    Timeout: 30
    Runtime: python3.11
    Environment:
      Variables:
        LOG_LEVEL: INFO

Resources:
  # Log group managed in the template so that retention is explicit
  MonitoredFunctionLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub "/aws/lambda/${AWS::StackName}-monitored"
      RetentionInDays: 30

  # Lambda function that writes to the log group above
  MonitoredFunction:
    Type: AWS::Serverless::Function
    Properties:
      FunctionName: !Sub "${AWS::StackName}-monitored"
      Handler: app.handler
      CodeUri: functions/monitored/
      LoggingConfig:
        LogGroup: !Ref MonitoredFunctionLogGroup
      Policies:
        # SAM takes inline policies as plain policy documents,
        # without the PolicyName/PolicyDocument wrapper used by IAM roles
        - Version: "2012-10-17"
          Statement:
            - Effect: Allow
              Action:
                - logs:CreateLogGroup
                - logs:CreateLogStream
                - logs:PutLogEvents
                - logs:DescribeLogStreams
                - logs:GetLogEvents
                - logs:FilterLogEvents
              Resource: !Sub "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/lambda/${AWS::StackName}-*"
            - Effect: Allow
              Action:
                - logs:DescribeLogGroups
              Resource: !Sub "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:*"
      Events:
        Api:
          Type: Api
          Properties:
            Path: /health
            Method: get
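SAM templates can mix `AWS::Serverless::*` resources with plain CloudFormation resources, which is how a function and its CloudWatch alarm can live in one template. The following is a minimal sketch under stated assumptions, not a definitive setup: the resource names `HealthFunction` and `HealthFunctionErrorAlarm`, the inline handler, and the alarm thresholds are all illustrative and do not come from the original template.

```yaml
AWSTemplateFormatVersion: 2010-09-09
Transform: AWS::Serverless-2016-10-31
Description: Sketch of a SAM function with an error alarm (illustrative names and thresholds)

Resources:
  # Hypothetical function with inline code so the sketch has no external artifacts
  HealthFunction:
    Type: AWS::Serverless::Function
    Properties:
      Runtime: python3.11
      Handler: index.handler
      InlineCode: |
        def handler(event, context):
            return {"statusCode": 200, "body": "ok"}

  # Plain CloudFormation resource next to the SAM resource in the same template
  HealthFunctionErrorAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Function reported errors in at least 3 of the last 5 minutes
      Namespace: AWS/Lambda
      MetricName: Errors
      Dimensions:
        - Name: FunctionName
          Value: !Ref HealthFunction   # Ref resolves to the function name
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 5
      DatapointsToAlarm: 3
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
```

`TreatMissingData: notBreaching` is one reasonable choice here because Lambda emits no `Errors` datapoints while the function is idle; other services may call for a different setting.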

Best Practices

Security

  • Encrypt log groups with KMS keys
  • Use resource-based policies for log access
  • Implement cross-account log aggregation with proper IAM
  • Configure log retention appropriate for compliance
  • Use VPC interface endpoints for CloudWatch and CloudWatch Logs to keep monitoring traffic off the public internet
  • Implement least privilege for IAM roles

Performance

  • Use appropriate metric periods (60s for alarms, 300s for dashboards)
  • Implement composite alarms to reduce alarm fatigue
  • Use anomaly detection for metrics with seasonal or trending baselines, where a static threshold fits poorly
  • Configure dashboards with efficient widgets
  • Limit retention period for log groups
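To illustrate the composite alarm recommendation, here is a minimal sketch in which two child alarms carry no actions and only the composite alarm notifies, so a single page is sent when errors and latency degrade together. The `FunctionName` parameter, resource names, and thresholds are assumptions made for the example.

```yaml
AWSTemplateFormatVersion: 2010-09-09
Description: Sketch of a composite alarm (illustrative names and thresholds)

Parameters:
  # Hypothetical parameter: the Lambda function to monitor
  FunctionName:
    Type: String

Resources:
  NotificationTopic:
    Type: AWS::SNS::Topic

  # Child alarms have no AlarmActions, so they never notify on their own
  ErrorAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      Namespace: AWS/Lambda
      MetricName: Errors
      Dimensions:
        - Name: FunctionName
          Value: !Ref FunctionName
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 3
      Threshold: 5
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching

  LatencyAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      Namespace: AWS/Lambda
      MetricName: Duration
      Dimensions:
        - Name: FunctionName
          Value: !Ref FunctionName
      ExtendedStatistic: p95
      Period: 60
      EvaluationPeriods: 3
      Threshold: 1000   # milliseconds
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching

  # Fires only when both child alarms are in ALARM at the same time
  ServiceDegradedAlarm:
    Type: AWS::CloudWatch::CompositeAlarm
    Properties:
      AlarmName: !Sub "${AWS::StackName}-service-degraded"
      AlarmRule: !Sub 'ALARM("${ErrorAlarm}") AND ALARM("${LatencyAlarm}")'
      AlarmActions:
        - !Ref NotificationTopic
```

Using `OR` instead of `AND` in `AlarmRule` would give one notification path for either symptom; which operator reduces noise more depends on how correlated the two signals are for the service.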

Monitoring

  • Implement SLI/SLO for service health
  • Use multi-region dashboards for global applications
  • Configure alarms with proper evaluation periods
  • Implement canaries for synthetic monitoring
  • Use Application Signals for APM

Deployment

  • Use change sets before deployment
  • Test templates with cfn-lint
  • Organize stacks by ownership (network, app, data)
  • Use nested stacks for modularity
  • Implement stack policies for protection

CloudFormation Best Practices

Stack Policies

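Stack policies protect critical resources from unintentional updates. A stack policy is a separate JSON document attached to the stack, not a section of the template, so the template itself contains only the resources to protect.
yaml
AWSTemplateFormatVersion: 2010-09-09
Description: CloudWatch stack with protection policies

Resources:
  CriticalAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${AWS::StackName}-critical"
      MetricName: Errors
      Namespace: AWS/Lambda
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 5
      Threshold: 1
      ComparisonOperator: GreaterThanThreshold
Once a stack policy is set, every resource is protected by default. The policy therefore allows all updates first and then denies them for the critical alarm; an explicit Deny overrides the Allow.
json
{
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "Update:*",
      "Principal": "*",
      "Resource": "*"
    },
    {
      "Effect": "Deny",
      "Action": "Update:*",
      "Principal": "*",
      "Resource": "LogicalResourceId/CriticalAlarm"
    }
  ]
}
Attach the policy with aws cloudformation set-stack-policy --stack-name my-monitoring-stack --stack-policy-body file://stack-policy.json, or pass --stack-policy-body to create-stack.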

Termination Protection

Enable termination protection to prevent accidental stack deletion, especially for production monitoring stacks.
Via Console:
  1. Select the stack
  2. Go to Stack actions > Change termination protection
  3. Enable termination protection
Via CLI:
bash
aws cloudformation update-termination-protection \
  --stack-name my-monitoring-stack \
  --enable-termination-protection
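At stack creation:
bash
aws cloudformation create-stack \
  --stack-name my-monitoring-stack \
  --template-body file://monitoring.yaml \
  --enable-termination-protection
Termination protection is a stack-level setting rather than a template property: AWS::CloudFormation::Stack has no TerminationProtection property, and nested stacks inherit the setting from their root stack.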

Drift Detection

Detect when the actual configuration of stack resources differs from what the CloudFormation template defines.
Detect drift on a single stack:
bash
aws cloudformation detect-stack-drift \
  --stack-name my-monitoring-stack
Get drift detection status:
bash
aws cloudformation describe-stack-drift-detection-status \
  --stack-drift-detection-id <detection-id>
Get resources that have drifted:
bash
aws cloudformation describe-stack-resource-drifts \
  --stack-name my-monitoring-stack \
  --stack-resource-drift-status-filters MODIFIED DELETED
Automation with Lambda:
yaml
AWSTemplateFormatVersion: 2010-09-09
Description: Automated drift detection scheduler

Resources:
  DriftDetectionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
      Policies:
        - PolicyName: CloudWatchDrift
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
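                Action:
                  - cloudformation:DetectStackDrift
                  - cloudformation:DescribeStackDriftDetectionStatus
                  - cloudformation:DescribeStackResourceDrifts
                  - cloudformation:DescribeStacks
                  - cloudformation:ListStackResources
                  # Drift detection also reads each resource in the checked stack, so grant
                  # read access for those services as well (for example cloudwatch:DescribeAlarms)
                Resource: "*"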
              - Effect: Allow
                Action:
                  - sns:Publish
                Resource: !Ref AlertTopic

  DriftDetectionFunction:
    Type: AWS::Lambda::Function
    Properties:
      Runtime: python3.11
      Handler: drift_detector.handler
      Code:
        S3Bucket: !Ref CodeBucket
        S3Key: functions/drift-detector.zip
      Role: !GetAtt DriftDetectionRole.Arn
      Environment:
        Variables:
          SNS_TOPIC_ARN: !Ref AlertTopic

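  DriftDetectionRule:
    Type: AWS::Events::Rule
    Properties:
      ScheduleExpression: "rate(1 day)"
      Targets:
        - Id: DriftDetection
          Arn: !GetAtt DriftDetectionFunction.Arn

  # Allows the EventBridge rule to invoke the function
  DriftDetectionPermission:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !Ref DriftDetectionFunction
      Action: lambda:InvokeFunction
      Principal: events.amazonaws.com
      SourceArn: !GetAtt DriftDetectionRule.Arn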

  AlertTopic:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: !Sub "${AWS::StackName}-drift-alerts"
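The template above points at a packaged handler (`functions/drift-detector.zip`) whose code is not shown. As a rough sketch of what such a handler might do, the variant below inlines the code with `ZipFile` so it is self-contained. Everything specific here is an assumption for illustration: the `MonitoredStackName` parameter, the `STACK_NAME` environment variable, the polling interval, and the use of the broad `ReadOnlyAccess` managed policy so that drift detection can read the stack's resources.

```yaml
AWSTemplateFormatVersion: 2010-09-09
Description: Sketch of an inline drift detection handler (illustrative)

Parameters:
  # Hypothetical parameter: the stack to check for drift
  MonitoredStackName:
    Type: String

Resources:
  AlertTopic:
    Type: AWS::SNS::Topic

  DriftDetectionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
        # Broad read access so drift detection can inspect the stack's resources;
        # narrow this to the services actually used before real deployment
        - arn:aws:iam::aws:policy/ReadOnlyAccess
      Policies:
        - PolicyName: DriftDetection
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - cloudformation:DetectStackDrift
                  - cloudformation:DescribeStackDriftDetectionStatus
                Resource: "*"
              - Effect: Allow
                Action: sns:Publish
                Resource: !Ref AlertTopic

  DriftDetectionFunction:
    Type: AWS::Lambda::Function
    Properties:
      Runtime: python3.11
      Handler: index.handler   # inline ZipFile code is written to index.py
      Timeout: 300
      Role: !GetAtt DriftDetectionRole.Arn
      Environment:
        Variables:
          SNS_TOPIC_ARN: !Ref AlertTopic
          STACK_NAME: !Ref MonitoredStackName
      Code:
        ZipFile: |
          import os
          import time
          import boto3

          cfn = boto3.client("cloudformation")
          sns = boto3.client("sns")

          def handler(event, context):
              stack = os.environ["STACK_NAME"]
              detection_id = cfn.detect_stack_drift(StackName=stack)["StackDriftDetectionId"]
              status = {}
              # Poll for up to ~4 minutes, staying inside the 300 s function timeout
              for _ in range(48):
                  status = cfn.describe_stack_drift_detection_status(
                      StackDriftDetectionId=detection_id)
                  if status["DetectionStatus"] != "DETECTION_IN_PROGRESS":
                      break
                  time.sleep(5)
              drift = status.get("StackDriftStatus", "UNKNOWN")
              if drift == "DRIFTED":
                  count = status.get("DriftedStackResourceCount", 0)
                  sns.publish(
                      TopicArn=os.environ["SNS_TOPIC_ARN"],
                      Subject=f"Drift detected in {stack}"[:100],
                      Message=f"{count} resource(s) drifted in stack {stack}.")
              return {"stack": stack, "driftStatus": drift}
```

Inline code keeps the sketch in one file but is limited in size and cannot bundle dependencies, so a packaged artifact like the one in the original template is the more likely choice once the handler grows.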

Change Sets

Use change sets to preview and review changes before applying them.
Create change set:
bash
aws cloudformation create-change-set \
  --stack-name my-monitoring-stack \
  --template-body file://updated-template.yaml \
  --change-set-name my-changeset \
  --capabilities CAPABILITY_IAM
List change sets:
bash
aws cloudformation list-change-sets \
  --stack-name my-monitoring-stack
Describe change set:
bash
aws cloudformation describe-change-set \
  --stack-name my-monitoring-stack \
  --change-set-name my-changeset
Execute change set:
bash
aws cloudformation execute-change-set \
  --stack-name my-monitoring-stack \
  --change-set-name my-changeset
Pipeline integration:
yaml
AWSTemplateFormatVersion: 2010-09-09
Description: CI/CD pipeline for CloudWatch stacks

Resources:
  Pipeline:
    Type: AWS::CodePipeline::Pipeline
    Properties:
      Name: !Sub "${AWS::StackName}-pipeline"
      RoleArn: !GetAtt PipelineRole.Arn
      ArtifactStore:
        Type: S3
        Location: !Ref ArtifactBucket
      Stages:
        - Name: Source
          Actions:
            - Name: SourceAction
              ActionTypeId:
                Category: Source
                Owner: AWS
                Provider: CodeCommit
                Version: "1"
              Configuration:
                RepositoryName: !Ref RepositoryName
                BranchName: main
              OutputArtifacts:
                - Name: SourceOutput
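        - Name: Validate
          Actions:
            - Name: ValidateTemplate
              ActionTypeId:
                Category: Test
                Owner: AWS
                Provider: CodeBuild
                Version: "1"
              Configuration:
                # The CloudFormation action has no validate-only mode, so validation runs
                # in CodeBuild. ValidationProjectName is an assumed parameter naming a
                # CodeBuild project that runs cfn-lint on template.yaml; the pipeline role
                # also needs codebuild:StartBuild and codebuild:BatchGetBuilds.
                ProjectName: !Ref ValidationProjectName
              InputArtifacts:
                - Name: SourceOutput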
        - Name: Review
          Actions:
            - Name: CreateChangeSet
              RunOrder: 1
              ActionTypeId:
                Category: Deploy
                Owner: AWS
                Provider: CloudFormation
                Version: "1"
              Configuration:
                ActionMode: CHANGE_SET_REPLACE
                StackName: !Ref StackName
                ChangeSetName: !Sub "${StackName}-changeset"
                TemplatePath: SourceOutput::template.yaml
                Capabilities: CAPABILITY_IAM,CAPABILITY_NAMED_IAM
                # Assumed parameter: ARN of the role CloudFormation assumes for stack operations
                RoleArn: !Ref CloudFormationRoleArn
              InputArtifacts:
                - Name: SourceOutput
            # RunOrder 2 makes the approval wait until the change set exists
            - Name: Approval
              RunOrder: 2
              ActionTypeId:
                Category: Approval
                Owner: AWS
                Provider: Manual
                Version: "1"
              Configuration:
                NotificationArn: !Ref ApprovalTopic
                CustomData: Review changes before deployment
        - Name: Deploy
          Actions:
            - Name: ExecuteChangeSet
              ActionTypeId:
                Category: Deploy
                Owner: AWS
                Provider: CloudFormation
                Version: "1"
              Configuration:
                ActionMode: CHANGE_SET_EXECUTE
                StackName: !Ref StackName
                ChangeSetName: !Sub "${StackName}-changeset"

  PipelineRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: codepipeline.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: PipelinePolicy
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - codecommit:Get*
                  - codecommit:List*
                  - codecommit:BatchGet*
                Resource: "*"
              - Effect: Allow
                Action:
                  - s3:GetObject
                  - s3:PutObject
                Resource: !Sub "arn:aws:s3:::${ArtifactBucket}/*"
              - Effect: Allow
                Action:
                  - cloudformation:*
                  - iam:PassRole
                Resource: "*"
              - Effect: Allow
                Action:
                  - sns:Publish
                Resource: !Ref ApprovalTopic

  ApprovalTopic:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: !Sub "${AWS::StackName}-approval"

Related Resources

Additional Files

For complete details on resources and their properties, consult:
  • REFERENCE.md - Detailed reference guide for all CloudFormation resources
  • EXAMPLES.md - Complete production-ready examples