cloudwatch-alarm-creator

CloudWatch Alarm Creator

Expert in AWS CloudWatch monitoring and alarm configuration.

Core Principles

  • Threshold selection: Base thresholds on historical data and business requirements
  • Statistics: Choose the statistic (Average, Sum, Maximum) that matches the metric's characteristics
  • Evaluation periods: Balance responsiveness against noise suppression
  • Actionable alerts: Every alarm should have a clear remediation path
  • Cost optimization: Use efficient strategies to minimize spend
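
Deriving a threshold from historical data can be sketched as below. This is a minimal illustration and not a CloudWatch feature: the helper name `suggest_threshold`, the sample values, and the choice of the 95th percentile plus a 10% margin are all assumptions made for this example.

```python
import statistics

def suggest_threshold(samples, percentile=95, headroom=1.10):
    """Return a candidate alarm threshold from historical datapoints.

    Hypothetical helper: takes the given percentile (1..99) of the samples
    and adds a safety margin so that normal peaks do not trigger the alarm.
    """
    if len(samples) < 2:
        raise ValueError("need at least two historical datapoints")
    # quantiles(n=100) returns the 1st..99th percentile cut points.
    cut_points = statistics.quantiles(samples, n=100, method="inclusive")
    return round(cut_points[percentile - 1] * headroom, 1)

# Made-up hourly CPUUtilization averages (percent), for illustration only.
history = [22, 25, 31, 28, 35, 40, 38, 45, 52, 48, 61, 58, 44, 39, 33, 30]
print(suggest_threshold(history))
```

The result is only a starting point. Business requirements, such as a latency SLO, may call for a tighter or looser value.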

EC2 Alarm

json
{
  "AlarmName": "HighCPUUtilization",
  "MetricName": "CPUUtilization",
  "Namespace": "AWS/EC2",
  "Statistic": "Average",
  "Period": 300,
  "EvaluationPeriods": 2,
  "Threshold": 80,
  "ComparisonOperator": "GreaterThanThreshold",
  "Dimensions": [
    {
      "Name": "InstanceId",
      "Value": "i-1234567890abcdef0"
    }
  ],
  "AlarmActions": ["arn:aws:sns:region:account:topic"],
  "TreatMissingData": "notBreaching"
}

ALB Alarm

json
{
  "AlarmName": "HighTargetResponseTime",
  "MetricName": "TargetResponseTime",
  "Namespace": "AWS/ApplicationELB",
  "Statistic": "Average",
  "Period": 60,
  "EvaluationPeriods": 3,
  "DatapointsToAlarm": 2,
  "Threshold": 1.0,
  "ComparisonOperator": "GreaterThanThreshold",
  "Dimensions": [
    {
      "Name": "LoadBalancer",
      "Value": "app/my-alb/1234567890"
    }
  ],
  "TreatMissingData": "ignore"
}
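
The ALB alarm above pairs `EvaluationPeriods: 3` with `DatapointsToAlarm: 2`, so it fires when 2 of the last 3 datapoints breach. The sketch below is a simplified model of that rule, not CloudWatch's actual evaluation logic. The function name `is_alarming` and the sample values are assumptions, and missing data is not modelled.

```python
def is_alarming(datapoints, threshold, evaluation_periods=3, datapoints_to_alarm=2):
    """Simplified M-out-of-N check with GreaterThanThreshold semantics.

    Looks only at the most recent `evaluation_periods` datapoints and
    returns True when at least `datapoints_to_alarm` of them exceed the
    threshold. Missing datapoints are not handled in this sketch.
    """
    window = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in window if value > threshold)
    return breaching >= datapoints_to_alarm

# TargetResponseTime in seconds, one value per 60-second period (made up).
print(is_alarming([0.4, 1.3, 0.5], threshold=1.0))  # a single breach in the window
print(is_alarming([0.4, 1.3, 1.6], threshold=1.0))  # two breaches in the window
```

A single slow period does not trigger the alarm, which is how the 2-of-3 setting filters out short spikes.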

RDS Alarm

json
{
  "AlarmName": "HighDatabaseConnections",
  "MetricName": "DatabaseConnections",
  "Namespace": "AWS/RDS",
  "Statistic": "Average",
  "Period": 300,
  "EvaluationPeriods": 2,
  "Threshold": 100,
  "ComparisonOperator": "GreaterThanThreshold",
  "Dimensions": [
    {
      "Name": "DBInstanceIdentifier",
      "Value": "my-database"
    }
  ]
}

Terraform Configuration

hcl
resource "aws_cloudwatch_metric_alarm" "ec2_cpu_high" {
  alarm_name          = "ec2-cpu-high-${var.instance_id}"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 300
  statistic           = "Average"
  threshold           = 80
  alarm_description   = "CPU utilization exceeds 80%"

  dimensions = {
    InstanceId = var.instance_id
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]

  tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

resource "aws_cloudwatch_metric_alarm" "custom_metric" {
  alarm_name          = "custom-error-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  threshold           = 5
  alarm_description   = "Error rate exceeds 5%"

  metric_query {
    id          = "error_rate"
    expression  = "errors/requests*100"
    label       = "Error Rate"
    return_data = true
  }

  metric_query {
    id = "errors"
    metric {
      metric_name = "Errors"
      namespace   = "MyApp"
      period      = 60
      stat        = "Sum"
    }
  }

  metric_query {
    id = "requests"
    metric {
      metric_name = "Requests"
      namespace   = "MyApp"
      period      = 60
      stat        = "Sum"
    }
  }
}
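
The `errors/requests*100` expression in the second resource can be checked by hand with a small sketch such as the one below. The function name and sample counts are hypothetical. Returning `None` for periods with zero requests is a choice made for this example; how CloudWatch metric math treats such periods should be verified separately.

```python
def error_rate_percent(errors, requests):
    """Per-period error rate, mirroring the expression errors/requests*100.

    Returns None for periods with zero requests, where the ratio is undefined.
    """
    return [
        (e / r * 100) if r else None
        for e, r in zip(errors, requests)
    ]

# Made-up per-minute sums of the Errors and Requests metrics.
errors = [2, 9, 0, 12]
requests = [200, 150, 0, 160]
rates = error_rate_percent(errors, requests)
print(rates)

# Periods whose rate exceeds the alarm threshold of 5 (percent).
print([rate is not None and rate > 5 for rate in rates])
```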

Composite Alarm

json
{
  "AlarmName": "CompositeSystemHealth",
  "AlarmRule": "ALARM(HighCPU) AND (ALARM(HighMemory) OR ALARM(HighDisk))",
  "AlarmActions": ["arn:aws:sns:region:account:critical-alerts"],
  "AlarmDescription": "System health degraded - multiple metrics breaching"
}
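
To see which combinations of child alarm states trigger the composite alarm, the `AlarmRule` above can be written as a plain boolean function. This is only an illustration of the rule's logic. The function name and the `states` dictionary are assumptions, and a child alarm absent from the dictionary is treated as not in ALARM.

```python
def composite_system_health(states):
    """Evaluate ALARM(HighCPU) AND (ALARM(HighMemory) OR ALARM(HighDisk)).

    `states` maps child alarm names to "OK", "ALARM" or "INSUFFICIENT_DATA".
    """
    def alarm(name):
        return states.get(name) == "ALARM"

    return alarm("HighCPU") and (alarm("HighMemory") or alarm("HighDisk"))

# High CPU alone is not enough; it must coincide with memory or disk pressure.
print(composite_system_health({"HighCPU": "ALARM", "HighMemory": "OK", "HighDisk": "OK"}))
print(composite_system_health({"HighCPU": "ALARM", "HighMemory": "OK", "HighDisk": "ALARM"}))
```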

Anomaly Detection

json
{
  "AlarmName": "AnomalyDetectionCPU",
  "ThresholdMetricId": "ad1",
  "ComparisonOperator": "GreaterThanUpperThreshold",
  "EvaluationPeriods": 2,
  "Metrics": [
    {
      "Id": "m1",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/EC2",
          "MetricName": "CPUUtilization",
          "Dimensions": [{"Name": "InstanceId", "Value": "i-123"}]
        },
        "Period": 300,
        "Stat": "Average"
      }
    },
    {
      "Id": "ad1",
      "Expression": "ANOMALY_DETECTION_BAND(m1, 2)"
    }
  ]
}

SNS Integration

hcl
resource "aws_sns_topic" "alerts" {
  name = "cloudwatch-alerts"
}

resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = "ops-team@example.com"
}

resource "aws_sns_topic_subscription" "lambda" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "lambda"
  endpoint  = aws_lambda_function.alert_handler.arn
}

TreatMissingData Options

Value          Description                                 Use case
notBreaching   Missing data counts as OK                   Standard metrics
breaching      Missing data counts as ALARM                Heartbeat monitoring
ignore         Keep the current alarm state                ALB metrics
missing        Missing data leads to INSUFFICIENT_DATA     Default
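
The effect of each option can be sketched with a simplified model. CloudWatch's real evaluation of missing data is more nuanced than this, so treat the code as an approximation. The function name `evaluate`, its parameters, and the use of `None` to mark a missing datapoint are assumptions made for the example.

```python
def evaluate(datapoints, threshold, treat_missing_data="missing",
             datapoints_to_alarm=1, current_state="OK"):
    """Simplified alarm state evaluation; None marks a missing datapoint."""
    present = [v for v in datapoints if v is not None]
    missing = len(datapoints) - len(present)
    breaching = sum(1 for v in present if v > threshold)

    if treat_missing_data == "breaching":
        breaching += missing              # missing points count as breaches
    elif treat_missing_data == "ignore" and not present:
        return current_state              # nothing to evaluate: keep the state
    elif treat_missing_data == "missing" and not present:
        return "INSUFFICIENT_DATA"
    # "notBreaching" counts missing points as within the threshold.
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"

# The same silent metric (no datapoints at all) under each option.
for option in ("notBreaching", "breaching", "ignore", "missing"):
    print(option, evaluate([None, None], threshold=80, treat_missing_data=option))
```

This is why `breaching` suits heartbeat monitoring: a metric that stops reporting moves the alarm to ALARM instead of leaving it quiet.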

Threshold Recommendations

yaml
EC2:
  CPUUtilization:
    warning: 70%
    critical: 85%
    period: 300s

  StatusCheckFailed:
    threshold: 1
    period: 60s

ALB:
  TargetResponseTime:
    p95_warning: 500ms
    p99_critical: 1000ms

  HTTPCode_ELB_5XX:
    threshold: 10
    period: 60s

RDS:
  CPUUtilization:
    warning: 70%
    critical: 85%

  FreeableMemory:
    critical: 256MB

  DiskQueueDepth:
    warning: 5
    critical: 10
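
One way to turn the warning and critical levels above into alarm definitions is sketched below. The helper name `cpu_alarm_levels` and the alarm naming scheme are hypothetical. The dictionaries follow the same field layout as the EC2 alarm JSON earlier in this document, and nothing here calls AWS.

```python
import json

def cpu_alarm_levels(instance_id, warning=70, critical=85, period=300):
    """Build one alarm definition per severity level for EC2 CPUUtilization."""
    alarms = []
    for level, threshold in (("warning", warning), ("critical", critical)):
        alarms.append({
            "AlarmName": f"ec2-cpu-{level}-{instance_id}",  # hypothetical naming scheme
            "MetricName": "CPUUtilization",
            "Namespace": "AWS/EC2",
            "Statistic": "Average",
            "Period": period,
            "EvaluationPeriods": 2,
            "Threshold": threshold,
            "ComparisonOperator": "GreaterThanThreshold",
            "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        })
    return alarms

print(json.dumps(cpu_alarm_levels("i-1234567890abcdef0"), indent=2))
```

Each level would normally point at a different notification target, for example a chat channel for warnings and a pager for critical alarms.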

Cost Optimization

  • Consolidate alarms with composite alarms
  • Use longer periods where possible
  • Delete unused alarms regularly
  • Group resources with tags

Testing Alarms

bash
# Toggle the alarm state to test notifications.
# The change is temporary: the alarm returns to its real state at the next evaluation.
aws cloudwatch set-alarm-state \
  --alarm-name "HighCPUUtilization" \
  --state-value ALARM \
  --state-reason "Testing notifications"

Best Practices

  1. 2 out of 3 datapoints: filters out transient spikes
  2. Percentile-based thresholds: for latency metrics (P95, P99)
  3. Multi-level alerts: separate Warning and Critical levels
  4. Document runbooks: one for each alarm type
  5. Regular audits: review how effective the thresholds are