Prometheus Monitoring and Alerting
Prometheus is a powerful open-source monitoring and alerting system designed for reliability and scalability in cloud-native environments. It is built around multi-dimensional time-series data with flexible querying via PromQL.
Architecture Components
- Prometheus Server: Core component that scrapes and stores time-series data with local TSDB
- Alertmanager: Handles alerts, deduplication, grouping, routing, and notifications to receivers
- Pushgateway: Allows ephemeral jobs to push metrics (use sparingly - prefer pull model)
- Exporters: Convert metrics from third-party systems to Prometheus format (node, blackbox, etc.)
- Client Libraries: Instrument application code (Go, Java, Python, Rust, etc.)
- Prometheus Operator: Kubernetes-native deployment and management via CRDs
- Remote Storage: Long-term storage via Thanos, Cortex, Mimir for multi-cluster federation
- Metrics: Time-series data identified by metric name and key-value labels
- Format:
  metric_name{label1="value1", label2="value2"} sample_value timestamp
- Metric Types:
  - Counter: Monotonically increasing value (requests, errors) - use rate() or increase() for querying
  - Gauge: Value that can go up/down (temperature, memory usage, queue length)
  - Histogram: Observations in configurable buckets (latency, request size) - exposes _bucket, _sum, _count series
  - Summary: Similar to histogram but calculates quantiles client-side - prefer histograms when aggregation is needed
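The exposition format above is plain text, so it is easy to see what a scraper actually receives. A minimal stdlib-only sketch that parses one sample line into name, labels, and value (it assumes label values contain no escaped quotes, which keeps the regex simple):

```python
import re

# Parse one line of the Prometheus text exposition format into
# (metric_name, labels_dict, value). Simplified: no escaped quotes,
# no timestamp handling.
LINE_RE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
                     r'(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)')

def parse_sample(line):
    m = LINE_RE.match(line)
    if not m:
        raise ValueError(f"not a sample line: {line!r}")
    labels = {}
    if m.group("labels"):
        for pair in m.group("labels").split(","):
            k, v = pair.split("=", 1)
            labels[k.strip()] = v.strip().strip('"')
    return m.group("name"), labels, float(m.group("value"))

name, labels, value = parse_sample(
    'http_requests_total{method="GET", status="200"} 1027')
```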
Setup and Configuration
Basic Prometheus Server Configuration
prometheus.yml
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
  external_labels:
    cluster: "production"
    region: "us-east-1"
Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
rule_files:
  - "alerts/*.yml"
  - "rules/*.yml"
Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Application services
  - job_name: "application"
    metrics_path: "/metrics"
    static_configs:
      - targets:
          - "app-1:8080"
          - "app-2:8080"
        labels:
          env: "production"
          team: "backend"

  # Kubernetes service discovery
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use custom metrics path if specified
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Use custom port if specified
      - source_labels:
          [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Add namespace label
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      # Add pod name label
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
      # Add app label from the pod's app label
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: replace
        target_label: app

  # Node Exporter for host metrics
  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]
Alertmanager Configuration
alertmanager.yml
# Template files for custom notifications
templates:
  - "/etc/alertmanager/templates/*.tmpl"
# Route alerts to appropriate receivers
route:
  group_by: ["alertname", "cluster", "service"]
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: "default"
  routes:
    # Critical alerts go to PagerDuty
    - match:
        severity: critical
      receiver: "pagerduty"
      continue: true
    # Database alerts to DBA team
    - match:
        team: database
      receiver: "dba-team"
      group_by: ["alertname", "instance"]
    # Development environment alerts
    - match:
        env: development
      receiver: "slack-dev"
      group_wait: 5m
      repeat_interval: 4h
# Inhibition rules (suppress related alerts)
inhibit_rules:
  # Suppress warning alerts if a critical alert is firing
  - source_match:
      severity: "critical"
    target_match:
      severity: "warning"
    equal: ["alertname", "instance"]
  # Suppress instance alerts if the entire service is down
  - source_match:
      alertname: "ServiceDown"
    target_match_re:
      alertname: ".*"
    equal: ["service"]

receivers:
  - name: "default"
    slack_configs:
      - channel: "#alerts"
        title: "Alert: {{ .GroupLabels.alertname }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}{{ end }}"

  - name: "pagerduty"
    pagerduty_configs:
      - service_key: "YOUR_PAGERDUTY_SERVICE_KEY"
        description: "{{ .GroupLabels.alertname }}"

  - name: "dba-team"
    slack_configs:
      - channel: "#database-alerts"
    email_configs:
      - to: "dba-team@example.com"
        headers:
          Subject: "Database Alert: {{ .GroupLabels.alertname }}"

  - name: "slack-dev"
    slack_configs:
      - channel: "#dev-alerts"
        send_resolved: true
Metric Naming Conventions
Follow these naming patterns for consistency:
Format: <namespace>_<subsystem>_<metric>_<unit>
Counters (always use _total suffix)
http_requests_total
http_request_errors_total
cache_hits_total

Gauges (current value, no special suffix)
memory_usage_bytes
active_connections
queue_size

Histograms (_bucket, _sum, _count suffixes are added automatically)
http_request_duration_seconds
response_size_bytes
db_query_duration_seconds

Use consistent base units:
- seconds for duration (not milliseconds)
- bytes for size (not kilobytes)
- ratio for percentages (0.0-1.0, not 0-100)
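A tiny sketch of the base-unit rule: convert at the instrumentation boundary so every exposed metric already uses seconds, bytes, and ratios. The helper names are illustrative, not from any library:

```python
# Normalize common "wrong" units into the recommended base units
# before exposing them as metric values.
def ms_to_seconds(ms):
    return ms / 1000.0

def kb_to_bytes(kb):
    return kb * 1024

def percent_to_ratio(pct):
    return pct / 100.0
```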
Label Cardinality Management

Good: Bounded cardinality
http_requests_total{method="GET", status="200", endpoint="/api/users"}

Good: Reasonable number of label values
db_queries_total{table="users", operation="select"}

Bad: Unbounded cardinality (user IDs, email addresses, timestamps)
http_requests_total{user_id="12345"}
http_requests_total{email="user@example.com"}
http_requests_total{timestamp="1234567890"}

Bad: High cardinality (full URLs, IP addresses)
http_requests_total{url="/api/users/12345/profile"}
http_requests_total{client_ip="192.168.1.100"}

Guidelines:
- Keep the number of values per label under ~10 where possible
- Keep total unique time-series per metric under ~10,000
- Use recording rules to pre-aggregate high-cardinality metrics
- Avoid labels with unbounded values (IDs, timestamps, user input)
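Why unbounded labels hurt becomes obvious with a back-of-the-envelope check: the worst-case number of time-series for a metric is the product of the possible values per label. The counts below are illustrative assumptions:

```python
from math import prod

# Worst-case series count = product of distinct values per label.
def max_series(label_value_counts):
    return prod(label_value_counts.values())

bounded = max_series({"method": 5, "status": 10, "endpoint": 50})      # 2500 series
unbounded = max_series({"method": 5, "status": 10, "user_id": 100000}) # 5 million series
```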
Recording Rules for Performance
Use recording rules to pre-compute expensive queries:
rules/recording_rules.yml
groups:
  - name: performance_rules
    interval: 30s
    rules:
      # Pre-calculate request rates
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      # Pre-calculate error rates
      - record: job:http_request_errors:rate5m
        expr: sum(rate(http_request_errors_total[5m])) by (job)
      # Pre-calculate error ratio
      - record: job:http_request_error_ratio:rate5m
        expr: |
          job:http_request_errors:rate5m
          /
          job:http_requests:rate5m
      # Pre-aggregate latency percentiles
      - record: job:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
      - record: job:http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))

  - name: aggregation_rules
    interval: 1m
    rules:
      # Multi-level aggregation for dashboards
      - record: instance:node_cpu_utilization:ratio
        expr: |
          1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)
      - record: cluster:node_cpu_utilization:ratio
        expr: avg(instance:node_cpu_utilization:ratio)
      # Memory aggregation
      - record: instance:node_memory_utilization:ratio
        expr: |
          1 - (
            node_memory_MemAvailable_bytes
            /
            node_memory_MemTotal_bytes
          )
Alert Design (Symptoms vs Causes)
Alert on symptoms (user-facing impact), not causes.
alerts/symptom_based.yml
groups:
  - name: symptom_alerts
    rules:
      # GOOD: Alert on user-facing symptoms
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
          runbook: "https://wiki.example.com/runbooks/high-error-rate"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 1
        for: 5m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High latency on {{ $labels.service }}"
          description: "P95 latency is {{ $value }}s (threshold: 1s)"
          impact: "Users experiencing slow page loads"

      # GOOD: SLO-based alerting
      - alert: SLOBudgetBurnRate
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[1h]))
              /
              sum(rate(http_requests_total[1h]))
            )
          ) > (14.4 * (1 - 0.999)) # 14.4x burn rate for 99.9% SLO
        for: 5m
        labels:
          severity: critical
          team: sre
        annotations:
          summary: "SLO budget burning too fast"
          description: "Error rate is {{ $value | humanizePercentage }}, burning the monthly error budget 14.4x faster than sustainable"
Cause-based alerts (use for debugging, not paging)
alerts/cause_based.yml
groups:
  - name: infrastructure_alerts
    rules:
      # Lower severity for infrastructure issues
      - alert: HighMemoryUsage
        expr: |
          (
            node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
          ) / node_memory_MemTotal_bytes > 0.9
        for: 10m
        labels:
          severity: warning # Not critical unless symptoms appear
          team: infrastructure
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | humanizePercentage }}"

      - alert: DiskSpaceLow
        expr: |
          (
            node_filesystem_avail_bytes{mountpoint="/"}
            /
            node_filesystem_size_bytes{mountpoint="/"}
          ) < 0.1
        for: 5m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Only {{ $value | humanizePercentage }} disk space remaining"
          action: "Clean up logs or expand disk"
Alert Best Practices
- For duration: use the for clause to avoid flapping alerts
- Meaningful annotations: include summary, description, runbook URL, impact
- Proper severity levels: critical (page immediately), warning (ticket), info (log)
- Actionable alerts: every alert should require human action
- Include context: add labels for team ownership, service, environment
PromQL Query Patterns
PromQL is the query language for Prometheus. Key concepts: instant vectors, range vectors, scalars, string literals, selectors, operators, functions, and aggregation.
Selectors and Matchers

Instant vector selector (latest sample for each time-series), filtered by label values
http_requests_total{method="GET", status="200"}

Regex matching (=~) and negative regex matching (!~)
http_requests_total{status=~"5.."} # 5xx errors
http_requests_total{endpoint!~"/admin.*"} # exclude admin endpoints

Label absence/presence
http_requests_total{job="api", status=""} # empty label
http_requests_total{job="api", status!=""} # non-empty label

Range vector selector (samples over time)
http_requests_total[5m] # last 5 minutes of samples
Request rate (requests per second) - ALWAYS use rate() for counters
rate(http_requests_total[5m])
sum(rate(http_requests_total[5m])) by (service)

Increase over time window (total count) - for alerts/dashboards showing totals
increase(http_requests_total[1h])

irate() for volatile, fast-moving counters (more sensitive to spikes)
irate(http_requests_total[5m])
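The reason rate() (and not raw subtraction) is mandatory for counters is that it handles counter resets. A simplified Python sketch of the calculation, ignoring rate()'s extrapolation to the window boundaries:

```python
# Simplified sketch of rate() for a counter: sum the increases between
# consecutive samples, treating any decrease as a counter reset (the
# process restarted and the counter began again near zero), then divide
# by the time window.
def simple_rate(samples):
    """samples: list of (timestamp_seconds, counter_value), oldest first."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            increase += curr  # counter reset: count from zero
    window = samples[-1][0] - samples[0][0]
    return increase / window

# 5 requests, then a restart, then 5 more, over 10 seconds -> 1.0 req/s
req_rate = simple_rate([(0, 100), (5, 105), (10, 5)])
```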
Error ratio (fraction of requests returning 5xx)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

Success ratio (fraction of requests returning 2xx)
sum(rate(http_requests_total{status=~"2.."}[5m]))
/
sum(rate(http_requests_total[5m]))

P95 latency from histogram buckets
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
P50, P95, P99 latency by service
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))

Average request duration
sum(rate(http_request_duration_seconds_sum[5m])) by (service)
/
sum(rate(http_request_duration_seconds_count[5m])) by (service)
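histogram_quantile() estimates a quantile by finding the bucket that contains the target rank and interpolating linearly inside it. A simplified Python sketch of that interpolation (Prometheus buckets are cumulative, as assumed here; the example bucket bounds are made up):

```python
# Simplified histogram_quantile(): buckets are (upper_bound,
# cumulative_count) pairs sorted by bound, ending with +Inf.
def histogram_quantile(q, buckets):
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # quantile falls in the +Inf bucket
            # Linear interpolation within the bucket
            return prev_bound + (bound - prev_bound) * (
                (rank - prev_count) / (count - prev_count))
        prev_bound, prev_count = bound, count

# 1000 observations: 800 under 0.1s, 950 under 0.5s, 990 under 1.0s
p95 = histogram_quantile(0.95, [(0.1, 800), (0.5, 950), (1.0, 990),
                                (float("inf"), 1000)])
```

One consequence visible in the sketch: accuracy depends entirely on bucket layout, since the estimate can never be more precise than the bucket containing the quantile.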
Aggregation Operations

Sum across all instances
sum(node_memory_MemTotal_bytes) by (cluster)

Average CPU usage per instance
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

Maximum request duration per service
max(http_request_duration_seconds) by (service)

Minimum available disk space per instance
min(node_filesystem_avail_bytes) by (instance)

Count number of instances
count(up) by (job)

Standard deviation of request duration
stddev(http_request_duration_seconds) by (service)

Top 5 services by request rate
topk(5, sum(rate(http_requests_total[5m])) by (service))

Bottom 3 instances by available memory
bottomk(3, node_memory_MemAvailable_bytes)
Predict disk full time (linear regression)
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 4 * 3600) < 0

Compare with 1 day ago
http_requests_total - http_requests_total offset 1d

Rate of change (derivative)
deriv(node_memory_MemAvailable_bytes[5m])

Absent metric detection
absent(up{job="critical-service"})

Calculate Apdex score (Application Performance Index) - satisfied within 0.1s, tolerating within 0.5s
(
  sum(rate(http_request_duration_seconds_bucket{le="0.1"}[5m]))
  +
  sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
)
/ 2
/
sum(rate(http_request_duration_seconds_count[5m]))
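The Apdex arithmetic is easy to check by hand. Since histogram buckets are cumulative, (satisfied + tolerating/2) / total reduces to (bucket(0.1) + bucket(0.5)) / 2 / total, which is what the query above computes. A sketch with illustrative counts:

```python
# Apdex from cumulative histogram bucket counts. satisfied_cum is the
# cumulative count under the target latency (0.1s), tolerating_cum the
# cumulative count under the tolerance latency (0.5s).
def apdex(satisfied_cum, tolerating_cum, total):
    tolerating = tolerating_cum - satisfied_cum
    return (satisfied_cum + tolerating / 2.0) / total

# 800 satisfied, 150 tolerating, 50 frustrated -> (800 + 75) / 1000
score = apdex(800, 950, 1000)
```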
Multi-window multi-burn-rate SLO alert (99.9% SLO, 14.4x burn rate)
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
  /
  sum(rate(http_requests_total[1h]))
  > 0.001 * 14.4
)
and
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
  > 0.001 * 14.4
)
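The 14.4 factor comes from simple budget arithmetic: burning a 30-day error budget at 14.4x the sustainable rate exhausts it in 30d / 14.4 ≈ 2 days, which justifies paging. A sketch of that arithmetic (the 30-day window is an assumption):

```python
# Burn-rate arithmetic for SLO alerting.
def error_budget(slo):
    """Allowed error ratio, e.g. 0.001 for a 99.9% SLO."""
    return 1.0 - slo

def hours_to_exhaustion(burn_rate, window_days=30):
    """Hours until the budget is gone at a given burn rate."""
    return window_days * 24 / burn_rate

# Alert threshold used in the query above: error rate > budget * burn rate
threshold = 14.4 * error_budget(0.999)  # ~0.0144, i.e. 1.44% errors
hours = hours_to_exhaustion(14.4)       # ~50 hours, about 2 days
```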
Binary Operators and Vector Matching

Arithmetic operators (+, -, *, /, %, ^)
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes

Comparison operators (==, !=, >, <, >=, <=) - filter to matching values
http_request_duration_seconds > 1

Logical operators (and, or, unless)
up{job="api"} and rate(http_requests_total[5m]) > 100

One-to-one matching (default)
method:http_requests:rate5m / method:http_requests:total

Many-to-one matching with group_left
sum(rate(http_requests_total[5m])) by (instance, method)
/ on(instance) group_left
sum(rate(http_requests_total[5m])) by (instance)

One-to-many matching with group_right
sum(rate(http_requests_total[5m])) by (instance)
/ on(instance) group_right
sum(rate(http_requests_total[5m])) by (instance, method)
Time Functions and Offsets

Compare with previous time period
rate(http_requests_total[5m]) / rate(http_requests_total[5m] offset 1h)

Day-over-day comparison
http_requests_total - http_requests_total offset 1d

Time-based filtering (on() is needed because hour() carries no labels)
http_requests_total and on() (hour() >= 9) and on() (hour() < 17) # business hours
day_of_week() == 0 or day_of_week() == 6 # weekends

Uptime
time() - process_start_time_seconds # uptime in seconds
Service Discovery
Prometheus supports multiple service discovery mechanisms for dynamic environments where targets appear and disappear.

Static Configuration
scrape_configs:
  - job_name: "static-targets"
    static_configs:
      - targets:
          - "host1:9100"
          - "host2:9100"
        labels:
          env: production
          region: us-east-1
File-based Service Discovery
scrape_configs:
  - job_name: 'file-sd'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/*.json'
          - '/etc/prometheus/targets/*.yml'
        refresh_interval: 30s
targets/webservers.json
[
  {
    "targets": ["web1:8080", "web2:8080"],
    "labels": {
      "job": "web",
      "env": "prod"
    }
  }
]
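Because Prometheus re-reads file_sd files on refresh_interval, a deploy script can drive target changes by rewriting them. A stdlib-only sketch that generates a target file like the one above, writing atomically (write then rename) so a concurrent read never sees a partial file:

```python
import json
import os
import tempfile

# Generate a file_sd target file from an inventory. The write-then-rename
# pattern is atomic on POSIX, so Prometheus never reads a half-written file.
def write_file_sd(path, targets, labels):
    groups = [{"targets": targets, "labels": labels}]
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(groups, f, indent=2)
    os.replace(tmp, path)

path = os.path.join(tempfile.gettempdir(), "webservers.json")
write_file_sd(path, ["web1:8080", "web2:8080"], {"job": "web", "env": "prod"})
```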
Kubernetes Service Discovery
scrape_configs:
  # Pod-based discovery
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - production
            - staging
    relabel_configs:
      # Keep only pods with prometheus.io/scrape=true annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Extract custom scrape path from annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Extract custom port from annotation
      - source_labels:
          [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Add standard Kubernetes labels
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: kubernetes_pod_name

  # Service-based discovery
  - job_name: "kubernetes-services"
    kubernetes_sd_configs:
      - role: service
    relabel_configs:
      - source_labels:
          [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels:
          [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

  # Node-based discovery (for node exporters)
  - job_name: "kubernetes-nodes"
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics

  # Endpoints discovery (for service endpoints)
  - job_name: "kubernetes-endpoints"
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - source_labels:
          [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        action: keep
        regex: metrics
Consul Service Discovery
scrape_configs:
  - job_name: "consul-services"
    consul_sd_configs:
      - server: "consul.example.com:8500"
        datacenter: "dc1"
        services: ["web", "api", "cache"]
        tags: ["production"]
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: service
      - source_labels: [__meta_consul_tags]
        target_label: tags
EC2 Service Discovery
scrape_configs:
  - job_name: "ec2-instances"
    ec2_sd_configs:
      - region: us-east-1
        access_key: YOUR_ACCESS_KEY
        secret_key: YOUR_SECRET_KEY
        port: 9100
        filters:
          - name: tag:Environment
            values: [production]
          - name: instance-state-name
            values: [running]
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance_name
      - source_labels: [__meta_ec2_availability_zone]
        target_label: availability_zone
      - source_labels: [__meta_ec2_instance_type]
        target_label: instance_type
DNS Service Discovery
scrape_configs:
- job_name: "dns-srv-records"
dns_sd_configs:
- names:
- "_prometheus._tcp.example.com"
type: "SRV"
refresh_interval: 30s
relabel_configs:
- source_labels: [__meta_dns_name]
target_label: instance
Relabeling Actions Reference
| Action | Description | Use Case |
|---|---|---|
| `keep` | Keep targets where regex matches source labels | Filter targets by annotation/label |
| `drop` | Drop targets where regex matches source labels | Exclude specific targets |
| `replace` | Replace target label with value from source labels | Extract custom labels/paths/ports |
| `labelmap` | Map source label names to target labels via regex | Copy all Kubernetes labels |
| `labeldrop` | Drop labels matching regex | Remove internal metadata labels |
| `labelkeep` | Keep only labels matching regex | Reduce cardinality |
| `hashmod` | Set target label to hash of source labels modulo N | Sharding/routing |
High Availability and Scalability
Prometheus High Availability Setup
Deploy multiple identical Prometheus instances scraping same targets
Use external labels to distinguish instances
global:
external_labels:
replica: prometheus-1 # Change to prometheus-2, etc.
cluster: production
Alertmanager will deduplicate alerts from multiple Prometheus instances
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager-1:9093
- alertmanager-2:9093
- alertmanager-3:9093
Alertmanager Clustering
alertmanager.yml - HA cluster configuration
global:
resolve_timeout: 5m
route:
receiver: "default"
group_by: ["alertname", "cluster"]
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
receivers:
- name: "default"
slack_configs:
Start Alertmanager cluster members
alertmanager-1: --cluster.peer=alertmanager-2:9094 --cluster.peer=alertmanager-3:9094
alertmanager-2: --cluster.peer=alertmanager-1:9094 --cluster.peer=alertmanager-3:9094
alertmanager-3: --cluster.peer=alertmanager-1:9094 --cluster.peer=alertmanager-2:9094
Federation for Hierarchical Monitoring
Global Prometheus federating from regional instances
scrape_configs:
- job_name: "federate"
scrape_interval: 15s
honor_labels: true
metrics_path: "/federate"
params:
"match[]":
# Pull aggregated metrics only
- '{job="prometheus"}'
- '{__name__=~"job:.*"}' # Recording rules
- "up"
static_configs:
  - targets: ["prometheus-us-east-1:9090"]
    labels:
      region: "us-east-1"
  - targets: ["prometheus-us-west-2:9090"]
    labels:
      region: "us-west-2"
  - targets: ["prometheus-eu-west-1:9090"]
    labels:
      region: "eu-west-1"
Remote Storage for Long-term Retention
Prometheus remote write to Thanos/Cortex/Mimir
remote_write:
- url: "http://thanos-receive:19291/api/v1/receive"
queue_config:
capacity: 10000
max_shards: 50
min_shards: 1
max_samples_per_send: 5000
batch_send_deadline: 5s
min_backoff: 30ms
max_backoff: 100ms
write_relabel_configs:
    # Drop high-cardinality metrics before remote write
    - source_labels: [__name__]
      regex: "go_.*"
      action: drop
Prometheus remote read from long-term storage
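A matching remote_read block lets PromQL queries fan out to the long-term store; a minimal sketch (the URL and matcher values are placeholders):

```yaml
remote_read:
  - url: "http://thanos-query:19192/api/v1/read"
    # Serve recent samples from the local TSDB instead of the remote store
    read_recent: false
    # Only forward queries that carry these matchers
    required_matchers:
      cluster: production
```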
Thanos Architecture for Global View
Thanos Sidecar - runs alongside Prometheus
thanos sidecar \
  --prometheus.url=http://localhost:9090 \
  --tsdb.path=/prometheus \
  --objstore.config-file=/etc/thanos/bucket.yml \
  --grpc-address=0.0.0.0:10901 \
  --http-address=0.0.0.0:10902
Thanos Store - queries object storage
thanos store \
  --data-dir=/var/thanos/store \
  --objstore.config-file=/etc/thanos/bucket.yml \
  --grpc-address=0.0.0.0:10901 \
  --http-address=0.0.0.0:10902
Thanos Query - global query interface
thanos query \
  --http-address=0.0.0.0:9090 \
  --grpc-address=0.0.0.0:10901 \
  --store=prometheus-1-sidecar:10901 \
  --store=prometheus-2-sidecar:10901 \
  --store=thanos-store:10901
Thanos Compactor - downsample and compact blocks
thanos compact \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=/etc/thanos/bucket.yml \
  --retention.resolution-raw=30d \
  --retention.resolution-5m=90d \
  --retention.resolution-1h=365d
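Each component above is pointed at /etc/thanos/bucket.yml. A minimal S3 example of that objstore config, following the Thanos objstore format (bucket name, endpoint, and credentials are placeholders):

```yaml
type: S3
config:
  bucket: "thanos-metrics"
  endpoint: "s3.us-east-1.amazonaws.com"
  region: "us-east-1"
  access_key: "YOUR_ACCESS_KEY"
  secret_key: "YOUR_SECRET_KEY"
```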
Horizontal Sharding with Hashmod
Split scrape targets across multiple Prometheus instances using hashmod
scrape_configs:
  - job_name: "kubernetes-pods-shard-0"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Hash pod name and keep only shard 0 (mod 3)
      - source_labels: [__meta_kubernetes_pod_name]
        modulus: 3
        target_label: __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: "0"
        action: keep
  - job_name: "kubernetes-pods-shard-1"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_name]
        modulus: 3
        target_label: __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: "1"
        action: keep
# shard-2: same pattern with regex "2"
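To see how hashmod spreads targets, here is a small Python sketch. It mimics the idea (a stable hash of the source label value, taken mod N); the exact hash function Prometheus uses is an internal detail, so shard assignments here will not necessarily match a real server's.

```python
import hashlib
import struct

def hashmod(value: str, modulus: int) -> int:
    # Stable hash of a label value: 64 bits of the MD5 digest, mod N
    digest = hashlib.md5(value.encode()).digest()
    (h,) = struct.unpack(">Q", digest[8:])
    return h % modulus

pods = [f"myapp-{i}" for i in range(12)]
# Each shard keeps only the targets whose hash lands on it
shards = {s: [p for p in pods if hashmod(p, 3) == s] for s in range(3)}
for shard, members in shards.items():
    print(shard, members)
```

Every target lands on exactly one shard, so the union of the shards' scrape sets covers all pods with no overlap.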
Kubernetes Integration
ServiceMonitor for Prometheus Operator
servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
  namespace: monitoring
  labels:
    app: myapp
    release: prometheus
spec:
  # Select services to monitor
  selector:
    matchLabels:
      app: myapp
  # Define namespaces to search
  namespaceSelector:
    matchNames:
      - production
      - staging
  # Endpoint configuration
  endpoints:
    - port: metrics # Service port name
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
      # Relabeling
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_name]
          targetLabel: pod
        - sourceLabels: [__meta_kubernetes_namespace]
          targetLabel: namespace
      # Metric relabeling (filter/modify metrics)
      metricRelabelings:
        - sourceLabels: [__name__]
          regex: "go_.*"
          action: drop # Drop Go runtime metrics
        - sourceLabels: [status]
          regex: "[45].."
          targetLabel: error
          replacement: "true"
      # Optional: TLS configuration
      tlsConfig:
        insecureSkipVerify: true
        ca:
          secret:
            name: prometheus-tls
            key: ca.crt
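For the ServiceMonitor above to find targets, the Service it selects must carry the app: myapp label and expose a port named metrics; a minimal sketch (names and ports are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: production
  labels:
    app: myapp          # matched by the ServiceMonitor selector
spec:
  selector:
    app: myapp
  ports:
    - name: metrics     # must match the ServiceMonitor endpoint port name
      port: 8080
      targetPort: 8080
```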
PodMonitor for Direct Pod Scraping
podmonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: app-pods
  namespace: monitoring
  labels:
    release: prometheus
spec:
  # Select pods to monitor
  selector:
    matchLabels:
      app: myapp
  # Namespace selection
  namespaceSelector:
    matchNames:
      - production
  # Pod metrics endpoints
  podMetricsEndpoints:
    - port: metrics
      path: /metrics
      interval: 15s
      # Relabeling
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_label_version]
          targetLabel: version
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          targetLabel: node
PrometheusRule for Alerts and Recording Rules
prometheusrule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-rules
  namespace: monitoring
  labels:
    release: prometheus
    role: alert-rules
spec:
  groups:
    - name: app_alerts
      interval: 30s
      rules:
        - alert: HighErrorRate
          expr: |
            (
              sum(rate(http_requests_total{status=~"5..", app="myapp"}[5m]))
              /
              sum(rate(http_requests_total{app="myapp"}[5m]))
            ) > 0.05
          for: 5m
          labels:
            severity: critical
            team: backend
          annotations:
            summary: "High error rate on {{ $labels.namespace }}/{{ $labels.pod }}"
            description: "Error rate is {{ $value | humanizePercentage }}"
            dashboard: "https://grafana.example.com/d/app-overview"
            runbook: "https://wiki.example.com/runbooks/high-error-rate"
        - alert: PodCrashLooping
          expr: |
            rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
            description: "Container {{ $labels.container }} has restarted {{ $value }} times in 15m"
    - name: app_recording_rules
      interval: 30s
      rules:
        - record: app:http_requests:rate5m
          expr: sum(rate(http_requests_total{app="myapp"}[5m])) by (namespace, pod, method, status)
        - record: app:http_request_duration_seconds:p95
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket{app="myapp"}[5m])) by (le, namespace, pod)
            )
Prometheus Custom Resource
prometheus.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 2
  version: v2.45.0
  # Service account for Kubernetes API access
  serviceAccountName: prometheus
  # Select ServiceMonitors
  serviceMonitorSelector:
    matchLabels:
      release: prometheus
  # Select PodMonitors
  podMonitorSelector:
    matchLabels:
      release: prometheus
  # Select PrometheusRules
  ruleSelector:
    matchLabels:
      release: prometheus
      role: alert-rules
  # Resource limits
  resources:
    requests:
      memory: 2Gi
      cpu: 1000m
    limits:
      memory: 4Gi
      cpu: 2000m
  # Storage
  storage:
    volumeClaimTemplate:
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 50Gi
        storageClassName: fast-ssd
  # Retention
  retention: 30d
  retentionSize: 45GB
  # Alertmanager configuration
  alerting:
    alertmanagers:
      - namespace: monitoring
        name: alertmanager
        port: web
  # External labels
  externalLabels:
    cluster: production
    region: us-east-1
  # Security context
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  # Admin API for management operations (disabled here)
  enableAdminAPI: false
  # Additional scrape configs (from Secret)
  additionalScrapeConfigs:
    name: additional-scrape-configs
    key: prometheus-additional.yaml
Application Instrumentation Examples
Go Application
// main.go
package main

import (
	"log"
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Counter for total requests
	httpRequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "endpoint", "status"},
	)
	// Histogram for request duration
	httpRequestDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request duration in seconds",
			Buckets: []float64{.001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
		},
		[]string{"method", "endpoint"},
	)
	// Gauge for active connections
	activeConnections = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "active_connections",
			Help: "Number of active connections",
		},
	)
	// Summary for response sizes
	responseSizeBytes = promauto.NewSummaryVec(
		prometheus.SummaryOpts{
			Name:       "http_response_size_bytes",
			Help:       "HTTP response size in bytes",
			Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
		},
		[]string{"endpoint"},
	)
)

// Middleware to instrument HTTP handlers
func instrumentHandler(endpoint string, handler http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		activeConnections.Inc()
		defer activeConnections.Dec()
		// Wrap response writer to capture status code
		wrapped := &responseWriter{ResponseWriter: w, statusCode: 200}
		handler(wrapped, r)
		duration := time.Since(start).Seconds()
		httpRequestDuration.WithLabelValues(r.Method, endpoint).Observe(duration)
		// Use the numeric code ("200") as the label value, not http.StatusText ("OK")
		httpRequestsTotal.WithLabelValues(r.Method, endpoint, strconv.Itoa(wrapped.statusCode)).Inc()
	}
}

type responseWriter struct {
	http.ResponseWriter
	statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
	rw.statusCode = code
	rw.ResponseWriter.WriteHeader(code)
}

func handleUsers(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	w.Write([]byte(`{"users": []}`))
}

func main() {
	// Register handlers
	http.HandleFunc("/api/users", instrumentHandler("/api/users", handleUsers))
	http.Handle("/metrics", promhttp.Handler())
	// Start server
	log.Fatal(http.ListenAndServe(":8080", nil))
}
Python Application (Flask)
from flask import Flask, request
from prometheus_client import Counter, Histogram, Gauge, generate_latest
import time

app = Flask(__name__)
request_count = Counter(
'http_requests_total',
'Total HTTP requests',
['method', 'endpoint', 'status']
)
request_duration = Histogram(
'http_request_duration_seconds',
'HTTP request duration in seconds',
['method', 'endpoint'],
buckets=[.001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)
active_requests = Gauge(
'active_requests',
'Number of active requests'
)
Middleware for instrumentation
@app.before_request
def before_request():
active_requests.inc()
request.start_time = time.time()
@app.after_request
def after_request(response):
active_requests.dec()
duration = time.time() - request.start_time
request_duration.labels(
method=request.method,
endpoint=request.endpoint or 'unknown'
).observe(duration)
request_count.labels(
method=request.method,
endpoint=request.endpoint or 'unknown',
status=response.status_code
).inc()
return response
@app.route('/metrics')
def metrics():
return generate_latest()
@app.route('/api/users')
def users():
return {'users': []}
if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
Production Deployment Checklist
Troubleshooting Commands
Check Prometheus configuration syntax
promtool check config prometheus.yml
Check rules file syntax
promtool check rules alerts/*.yml
Test PromQL queries
Check which targets are up
Query current metric values
Check service discovery
View TSDB stats
Check runtime information
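The checks above map onto promtool and the Prometheus HTTP API; a sketch, assuming a server listening on localhost:9090:

```shell
# Test PromQL queries from the command line
promtool query instant http://localhost:9090 'up'

# Check which targets are up
curl -s http://localhost:9090/api/v1/targets

# Query current metric values
curl -s 'http://localhost:9090/api/v1/query?query=up'

# Check service discovery (including dropped targets)
curl -s 'http://localhost:9090/api/v1/targets?state=dropped'

# View TSDB stats
curl -s http://localhost:9090/api/v1/status/tsdb

# Check runtime information
curl -s http://localhost:9090/api/v1/status/runtimeinfo
```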
Common PromQL Patterns
Request rate per second
rate(http_requests_total[5m])
Error ratio percentage
100 * sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
P95 latency from histogram
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
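Under the hood, histogram_quantile interpolates linearly inside the bucket where the target rank falls; a self-contained Python sketch of that calculation (an illustration of the idea, not the Prometheus source):

```python
import math

def histogram_quantile(q, buckets):
    """buckets: sorted (upper_bound, cumulative_count) pairs, ending at +Inf."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: upper bound is unknown
                return prev_bound
            # Linear interpolation within this bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# 100 observations: 50 under 0.1s, 40 more under 0.5s, 10 more under 1.0s
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.95, buckets))  # 0.75
```

Because the result is interpolated, its accuracy depends on how finely the bucket boundaries bracket the true quantile.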
Average latency from histogram
sum(rate(http_request_duration_seconds_sum[5m])) / sum(rate(http_request_duration_seconds_count[5m]))
Memory utilization percentage
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
CPU utilization (non-idle)
100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))
Disk space remaining percentage
100 * node_filesystem_avail_bytes / node_filesystem_size_bytes
Top 5 endpoints by request rate
topk(5, sum(rate(http_requests_total[5m])) by (endpoint))
Service uptime in days
(time() - process_start_time_seconds) / 86400
Request rate growth compared to 1 hour ago
rate(http_requests_total[5m]) / rate(http_requests_total[5m] offset 1h)
Alert Rule Patterns
High error rate (symptom-based)
alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "Error rate is {{ $value | humanizePercentage }}"
  runbook: "https://runbooks.example.com/high-error-rate"
alert: HighLatency
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
) > 1
for: 5m
labels:
severity: warning
alert: ServiceDown
expr: up{job="critical-service"} == 0
for: 2m
labels:
severity: critical
Disk space low (cause-based, warning only)
alert: DiskSpaceLow
expr: |
node_filesystem_avail_bytes{mountpoint="/"}
/ node_filesystem_size_bytes{mountpoint="/"} < 0.1
for: 10m
labels:
severity: warning
alert: PodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
for: 5m
labels:
severity: warning
Recording Rule Naming Convention
Format: level:metric:operations
level = aggregation level (job, instance, cluster)
metric = base metric name
operations = transformations applied (rate5m, sum, ratio)
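The convention is mechanical enough to lint. A hypothetical checker in Python (an illustration, not an official tool):

```python
import re

# level:metric:operations, e.g. job:http_requests:rate5m
RULE_NAME = re.compile(
    r"^[a-zA-Z_][a-zA-Z0-9_]*"   # level (job, instance, cluster, ...)
    r":[a-zA-Z_][a-zA-Z0-9_]*"   # base metric name
    r":[a-zA-Z_][a-zA-Z0-9_]*$"  # operations (rate5m, sum, ratio, ...)
)

def valid_rule_name(name: str) -> bool:
    return RULE_NAME.match(name) is not None

print(valid_rule_name("job:http_requests:rate5m"))  # True
print(valid_rule_name("http_requests_total"))       # False
```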
groups:
  - name: aggregation_rules
    rules:
      # Instance-level aggregation
      - record: instance:node_cpu_utilization:ratio
        expr: 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)
      # Job-level aggregation
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      # Job-level error ratio
      - record: job:http_request_errors:ratio
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          / sum(rate(http_requests_total[5m])) by (job)
      # Cluster-level aggregation
      - record: cluster:cpu_utilization:ratio
        expr: avg(instance:node_cpu_utilization:ratio)
Metric Naming Best Practices
| Pattern | Good Example | Bad Example |
|---|---|---|
| Counter suffix | `http_requests_total` | `http_requests` |
| Base units | `http_request_duration_seconds` | `http_request_duration_ms` |
| Ratio range | `error_ratio` (0.0-1.0) | `error_percent` (0-100) |
| Byte units | `response_size_bytes` | `response_size_kb` |
| Namespace prefix | `myapp_http_requests_total` | `requests_total` |
| Label naming | `{method="GET", status="200"}` | `{httpMethod="GET", statusCode="200"}` |
Label Cardinality Guidelines
| Cardinality | Examples | Recommendation |
|---|---|---|
| Low (<10) | HTTP method, status code, environment | Safe for all labels |
| Medium (10-100) | API endpoint, service name, pod name | Safe with aggregation |
| High (100-1000) | Container ID, hostname | Use only when necessary |
| Unbounded | User ID, IP address, timestamp, URL path | Never use as label |
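The guideline follows from simple arithmetic: a metric's worst-case series count is the product of each label's value count, so one unbounded label multiplies every existing series. A small Python illustration:

```python
def series_count(label_values: dict) -> int:
    # Worst case: every combination of label values becomes its own series
    n = 1
    for values in label_values.values():
        n *= len(values)
    return n

bounded = {"method": ["GET", "POST", "PUT", "DELETE"], "status": ["200", "404", "500"]}
print(series_count(bounded))  # 12 series: fine

# Adding an unbounded label (e.g. user_id) multiplies every existing series
unbounded = dict(bounded, user_id=[str(i) for i in range(10_000)])
print(series_count(unbounded))  # 120000 series: cardinality explosion
```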
Kubernetes Annotation-based Scraping
Pod annotations for automatic Prometheus scraping
apiVersion: v1
kind: Pod
metadata:
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
prometheus.io/scheme: "http"
spec:
containers:
- name: app
image: myapp:latest
ports:
- containerPort: 8080
name: metrics
Alertmanager Routing Patterns
route:
receiver: default
group_by: ["alertname", "cluster"]
routes:
# Critical alerts to PagerDuty
- match:
severity: critical
receiver: pagerduty
continue: true # Also send to default
# Team-based routing
- match:
team: database
receiver: dba-team
group_by: ["alertname", "instance"]
# Environment-based routing
- match:
env: development
receiver: slack-dev
repeat_interval: 4h
# Time-based routing (office hours only)
- match:
severity: warning
receiver: email
active_time_intervals:
- business-hours
time_intervals:
- name: business-hours
time_intervals:
- times:
- start_time: "09:00"
end_time: "17:00"
weekdays: ["monday:friday"]