cloudwatch-alarm-creator
CloudWatch Alarm Creator
An expert in AWS CloudWatch monitoring and alarm configuration.
Core principles
- Threshold selection: base thresholds on historical data and business requirements
- Statistics: choose the statistic (Average, Sum, Maximum) that fits the metric's characteristics
- Evaluation periods: balance responsiveness against noise suppression
- Actionable alerts: every alarm should have a clear remediation path
- Cost optimization: use strategies that keep alarm spend to a minimum
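The first principle, deriving thresholds from historical data, can be sketched roughly as follows. This is only an illustration: the function name `suggest_threshold`, the 95th percentile, and the 10% headroom are assumptions chosen for the example, not values from this guide.

```python
import statistics


def suggest_threshold(history, percentile=95, headroom=1.10):
    """Return a candidate alarm threshold from historical samples.

    Takes the requested percentile of the history and multiplies it by a
    headroom factor so that ordinary peaks stay below the threshold.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    # With n=100, quantiles() returns the 99 cut points p1..p99.
    cut_points = statistics.quantiles(history, n=100, method="inclusive")
    return cut_points[percentile - 1] * headroom


# Hypothetical CPU utilization samples (percent), one per 5-minute period.
cpu_history = [35, 40, 42, 38, 55, 61, 47, 44, 39, 70, 52, 48]
print(round(suggest_threshold(cpu_history), 1))
```

The result is a starting point to review against business requirements, not a value to apply blindly.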
EC2 Alarm
```json
{
  "AlarmName": "HighCPUUtilization",
  "MetricName": "CPUUtilization",
  "Namespace": "AWS/EC2",
  "Statistic": "Average",
  "Period": 300,
  "EvaluationPeriods": 2,
  "Threshold": 80,
  "ComparisonOperator": "GreaterThanThreshold",
  "Dimensions": [
    {
      "Name": "InstanceId",
      "Value": "i-1234567890abcdef0"
    }
  ],
  "AlarmActions": ["arn:aws:sns:region:account:topic"],
  "TreatMissingData": "notBreaching"
}
```
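For readers who generate alarm definitions from code, here is a minimal Python sketch that builds the same parameters as a dictionary. The helper names `build_cpu_alarm` and `seconds_to_alarm` are hypothetical; the dictionary keys mirror the JSON above and could be passed as keyword arguments to boto3's `put_metric_alarm`, which is not called here so the example runs without AWS credentials.

```python
def build_cpu_alarm(instance_id, topic_arn, threshold=80,
                    period=300, evaluation_periods=2):
    """Build PutMetricAlarm parameters for an EC2 CPU utilization alarm."""
    return {
        "AlarmName": f"HighCPUUtilization-{instance_id}",
        "MetricName": "CPUUtilization",
        "Namespace": "AWS/EC2",
        "Statistic": "Average",
        "Period": period,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "AlarmActions": [topic_arn],
        "TreatMissingData": "notBreaching",
    }


def seconds_to_alarm(params):
    """Approximate time the metric must keep breaching before the alarm fires."""
    return params["Period"] * params["EvaluationPeriods"]


# Placeholder ARN, as in the JSON example above.
params = build_cpu_alarm("i-1234567890abcdef0", "arn:aws:sns:region:account:topic")
print(seconds_to_alarm(params))
```

With a 300-second period and two evaluation periods, the CPU has to stay above the threshold for about ten minutes before anyone is notified, which is the responsiveness versus noise trade-off in practice.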
ALB Alarm
```json
{
  "AlarmName": "HighTargetResponseTime",
  "MetricName": "TargetResponseTime",
  "Namespace": "AWS/ApplicationELB",
  "Statistic": "Average",
  "Period": 60,
  "EvaluationPeriods": 3,
  "DatapointsToAlarm": 2,
  "Threshold": 1.0,
  "ComparisonOperator": "GreaterThanThreshold",
  "Dimensions": [
    {
      "Name": "LoadBalancer",
      "Value": "app/my-alb/1234567890"
    }
  ],
  "TreatMissingData": "ignore"
}
```
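The combination of `EvaluationPeriods: 3` and `DatapointsToAlarm: 2` is an "M out of N" rule. The sketch below shows the idea in simplified form; it assumes every datapoint is present and does not reproduce how CloudWatch handles missing data. The function name `m_of_n_alarm` is hypothetical.

```python
def m_of_n_alarm(datapoints, threshold, evaluation_periods=3, datapoints_to_alarm=2):
    """Return True if at least M of the last N datapoints exceed the threshold."""
    window = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in window if value > threshold)
    return breaching >= datapoints_to_alarm


# Response times in seconds, one datapoint per 60-second period.
single_spike = [0.4, 1.6, 0.5]
sustained = [0.4, 1.6, 1.2]

print(m_of_n_alarm(single_spike, threshold=1.0))
print(m_of_n_alarm(sustained, threshold=1.0))
```

A single slow minute does not trigger the alarm, while two slow minutes out of three do.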
RDS Alarm
```json
{
  "AlarmName": "HighDatabaseConnections",
  "MetricName": "DatabaseConnections",
  "Namespace": "AWS/RDS",
  "Statistic": "Average",
  "Period": 300,
  "EvaluationPeriods": 2,
  "Threshold": 100,
  "ComparisonOperator": "GreaterThanThreshold",
  "Dimensions": [
    {
      "Name": "DBInstanceIdentifier",
      "Value": "my-database"
    }
  ]
}
```
Terraform Configuration
```hcl
resource "aws_cloudwatch_metric_alarm" "ec2_cpu_high" {
  alarm_name          = "ec2-cpu-high-${var.instance_id}"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 300
  statistic           = "Average"
  threshold           = 80
  alarm_description   = "CPU utilization exceeds 80%"

  dimensions = {
    InstanceId = var.instance_id
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]

  tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

resource "aws_cloudwatch_metric_alarm" "custom_metric" {
  alarm_name          = "custom-error-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  threshold           = 5
  alarm_description   = "Error rate exceeds 5%"

  metric_query {
    id          = "error_rate"
    expression  = "errors/requests*100"
    label       = "Error Rate"
    return_data = true
  }

  metric_query {
    id = "errors"
    metric {
      metric_name = "Errors"
      namespace   = "MyApp"
      period      = 60
      stat        = "Sum"
    }
  }

  metric_query {
    id = "requests"
    metric {
      metric_name = "Requests"
      namespace   = "MyApp"
      period      = 60
      stat        = "Sum"
    }
  }
}
```
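The `errors/requests*100` expression in the `custom_metric` alarm is ordinary arithmetic over the two per-minute sums. A small Python sketch of the same calculation follows; the function names are hypothetical, and returning `None` when there are no requests is a choice made for this example rather than a statement about how CloudWatch metric math behaves.

```python
def error_rate_percent(errors, requests):
    """Compute errors/requests*100, or None when there were no requests."""
    if requests == 0:
        return None
    return errors / requests * 100


def breaches(rate, threshold=5):
    """Mirror GreaterThanThreshold: only a known rate above the threshold breaches."""
    return rate is not None and rate > threshold


# Hypothetical per-minute sums of the Errors and Requests metrics.
for errors, requests in [(1, 4), (2, 100), (0, 0)]:
    rate = error_rate_percent(errors, requests)
    print(errors, requests, rate, breaches(rate))
```

Alarming on a ratio rather than on the raw error count keeps the 5% threshold meaningful as traffic grows or shrinks.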
Composite Alarm
```json
{
  "AlarmName": "CompositeSystemHealth",
  "AlarmRule": "ALARM(HighCPU) AND (ALARM(HighMemory) OR ALARM(HighDisk))",
  "AlarmActions": ["arn:aws:sns:region:account:critical-alerts"],
  "AlarmDescription": "System health degraded - multiple metrics breaching"
}
```
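To make the `AlarmRule` logic concrete, here is a small sketch that evaluates the same boolean expression over a dictionary of child alarm states. The function name and the state dictionary are assumptions for illustration; CloudWatch evaluates the rule itself, so this is only a way to reason about when the composite alarm would fire.

```python
def composite_system_health(states):
    """Evaluate ALARM(HighCPU) AND (ALARM(HighMemory) OR ALARM(HighDisk))."""
    def in_alarm(name):
        # A child alarm that is absent or not in ALARM counts as not firing.
        return states.get(name) == "ALARM"

    return in_alarm("HighCPU") and (in_alarm("HighMemory") or in_alarm("HighDisk"))


cpu_only = {"HighCPU": "ALARM", "HighMemory": "OK", "HighDisk": "OK"}
cpu_and_disk = {"HighCPU": "ALARM", "HighMemory": "OK", "HighDisk": "ALARM"}

print(composite_system_health(cpu_only))
print(composite_system_health(cpu_and_disk))
```

High CPU alone stays quiet; the composite alarm fires only when a second resource is also under pressure.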
Anomaly Detection
```json
{
  "AlarmName": "AnomalyDetectionCPU",
  "ThresholdMetricId": "ad1",
  "ComparisonOperator": "GreaterThanUpperThreshold",
  "EvaluationPeriods": 2,
  "Metrics": [
    {
      "Id": "m1",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/EC2",
          "MetricName": "CPUUtilization",
          "Dimensions": [{"Name": "InstanceId", "Value": "i-123"}]
        },
        "Period": 300,
        "Stat": "Average"
      }
    },
    {
      "Id": "ad1",
      "Expression": "ANOMALY_DETECTION_BAND(m1, 2)"
    }
  ]
}
```
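The second argument of `ANOMALY_DETECTION_BAND(m1, 2)` sets the width of the band in standard deviations. The sketch below illustrates that idea with a plain mean and standard deviation over recent samples. This is a deliberately simplified stand-in: CloudWatch trains its own model on the metric's history, and that model is not reproduced here. The function names are hypothetical.

```python
import statistics


def simple_band(history, width=2):
    """Return (lower, upper) as mean -/+ width * sample standard deviation."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    return mean - width * stdev, mean + width * stdev


def above_band(value, history, width=2):
    """Mirror GreaterThanUpperThreshold against the simplified band."""
    _, upper = simple_band(history, width)
    return value > upper


# Hypothetical recent CPU utilization averages (percent).
history = [10, 12, 11, 13, 9, 11, 10, 12]
print(above_band(12, history))
print(above_band(20, history))
```

A wider band (a larger second argument) tolerates more variation before the alarm breaches.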
SNS Integration
```hcl
resource "aws_sns_topic" "alerts" {
  name = "cloudwatch-alerts"
}

resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = "ops-team@example.com"
}

resource "aws_sns_topic_subscription" "lambda" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "lambda"
  endpoint  = aws_lambda_function.alert_handler.arn
}
```
TreatMissingData Options
| Value | Description | Use case |
|---|---|---|
| notBreaching | Missing data is treated as OK | Standard metrics |
| breaching | Missing data is treated as ALARM | Heartbeat monitoring |
| ignore | The current alarm state is kept | ALB metrics |
| missing | Missing data leads to INSUFFICIENT_DATA | Default |
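The table can be read as a lookup from the `TreatMissingData` setting to the resulting alarm state. The sketch below covers only the simplest case, where every datapoint in the evaluation window is missing; mixed windows are evaluated by CloudWatch with more involved rules that are not modelled here. The function name is hypothetical.

```python
def state_when_all_missing(treat_missing_data, current_state):
    """Return the alarm state when every datapoint in the window is missing."""
    outcomes = {
        "notBreaching": "OK",            # missing counts as within threshold
        "breaching": "ALARM",            # missing counts as a breach
        "ignore": current_state,         # keep whatever state the alarm is in
        "missing": "INSUFFICIENT_DATA",  # the default behaviour
    }
    if treat_missing_data not in outcomes:
        raise ValueError(f"unknown TreatMissingData value: {treat_missing_data}")
    return outcomes[treat_missing_data]


for option in ("notBreaching", "breaching", "ignore", "missing"):
    print(option, state_when_all_missing(option, current_state="ALARM"))
```

This is why `breaching` suits heartbeat metrics: silence from the source is itself the failure signal.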
Threshold recommendations
```yaml
EC2:
  CPUUtilization:
    warning: 70%
    critical: 85%
    period: 300s
  StatusCheckFailed:
    threshold: 1
    period: 60s
ALB:
  TargetResponseTime:
    p95_warning: 500ms
    p99_critical: 1000ms
  HTTPCode_ELB_5XX:
    threshold: 10
    period: 60s
RDS:
  CPUUtilization:
    warning: 70%
    critical: 85%
  FreeableMemory:
    critical: 256MB
  DiskQueueDepth:
    warning: 5
    critical: 10
```
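The warning and critical pairs above amount to a two-level classification. A metric alarm carries a single threshold, so in practice each level is usually its own alarm; the sketch below only shows how a value maps onto the levels. The function name and the `higher_is_worse` flag are assumptions for this example, with the flag covering metrics such as FreeableMemory where low values are the problem.

```python
def severity(value, warning, critical, higher_is_worse=True):
    """Classify a metric value as 'ok', 'warning', or 'critical'."""
    if not higher_is_worse:
        # Flip the signs so the same comparisons work for "low is bad" metrics.
        value, warning, critical = -value, -warning, -critical
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


print(severity(65, warning=70, critical=85))  # CPU utilization, percent
print(severity(78, warning=70, critical=85))
print(severity(200, warning=512, critical=256, higher_is_worse=False))  # free memory, MB
```

The 512 MB warning level for free memory is a made-up value for the example; the recommendations above only give a critical level for that metric.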
Cost optimization
- Consolidate alarms with composite alarms
- Use longer periods where possible
- Delete unused alarms regularly
- Group resources with tags
Alarm testing
```bash
# Toggle the alarm state to test notifications
aws cloudwatch set-alarm-state \
  --alarm-name "HighCPUUtilization" \
  --state-value ALARM \
  --state-reason "Testing notifications"
```
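When the same test is scripted for many alarms, it may help to build the command in code and validate the state value first. The sketch below only constructs the argument list for `aws cloudwatch set-alarm-state` and prints it; it does not execute anything. The function name is hypothetical.

```python
import shlex

VALID_STATES = {"OK", "ALARM", "INSUFFICIENT_DATA"}


def set_alarm_state_command(alarm_name, state_value, reason):
    """Build the argv list for `aws cloudwatch set-alarm-state`."""
    if state_value not in VALID_STATES:
        raise ValueError(f"state must be one of {sorted(VALID_STATES)}")
    return [
        "aws", "cloudwatch", "set-alarm-state",
        "--alarm-name", alarm_name,
        "--state-value", state_value,
        "--state-reason", reason,
    ]


command = set_alarm_state_command("HighCPUUtilization", "ALARM", "Testing notifications")
# shlex.join quotes arguments so the line can be pasted into a shell.
print(shlex.join(command))
```

The list could be handed to `subprocess.run` in an environment where the AWS CLI is configured. A state set this way is temporary: the alarm returns to its real state at the next evaluation.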
Best practices
- 2 out of 3 datapoints: filters out transient spikes
- Percentile-based thresholds: for latency metrics (P95, P99)
- Multi-level alerts: Warning and Critical levels
- Document runbooks: one for each alarm type
- Regular audits: review how well the thresholds are working