prometheus-configuration

Prometheus Configuration

Complete guide to Prometheus setup, metric collection, scrape configuration, and recording rules.

Purpose

Configure Prometheus for comprehensive metric collection, alerting, and monitoring of infrastructure and applications.

When to Use

  • Set up Prometheus monitoring
  • Configure metric scraping
  • Create recording rules
  • Design alert rules
  • Implement service discovery

Prometheus Architecture

┌──────────────┐
│ Applications │ ← Instrumented with client libraries
└──────┬───────┘
       │ /metrics endpoint
┌──────────────┐
│  Prometheus  │ ← Scrapes metrics periodically
│    Server    │
└──────┬───────┘
       ├─→ Alertmanager (alerts)
       ├─→ Grafana (visualization)
       └─→ Long-term storage (Thanos/Cortex)

Installation

Kubernetes with Helm

bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi

Docker Compose

yaml
version: "3.8"
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"

volumes:
  prometheus-data:

Configuration File

prometheus.yml:
yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: "production"
    region: "us-west-2"

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rules files
rule_files:
  - /etc/prometheus/rules/*.yml

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Node exporters
  - job_name: "node-exporter"
    static_configs:
      - targets:
          - "node1:9100"
          - "node2:9100"
          - "node3:9100"
    relabel_configs:
      # Strip the port so the instance label holds only the host name
      - source_labels: [__address__]
        target_label: instance
        regex: "([^:]+)(:[0-9]+)?"
        replacement: "${1}"

  # Kubernetes pods with annotations
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Scrape only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Honor a custom metrics path from prometheus.io/path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Honor a custom port from prometheus.io/port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod

  # Application metrics
  - job_name: "my-app"
    static_configs:
      - targets:
          - "app1.example.com:9090"
          - "app2.example.com:9090"
    metrics_path: "/metrics"
    scheme: "https"
    tls_config:
      ca_file: /etc/prometheus/ca.crt
      cert_file: /etc/prometheus/client.crt
      key_file: /etc/prometheus/client.key

Reference: See assets/prometheus.yml.template
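
The address-rewriting relabel rules in the configuration above depend on two behaviors: Prometheus joins the source label values with `;`, and it matches the regex against the whole joined string. The sketch below is a rough Python emulation written for this guide, not Prometheus code; `relabel_replace` is a hypothetical helper, and Python's `re` module only approximates the RE2 syntax Prometheus uses.

```python
import re


def relabel_replace(values, regex, replacement, separator=";"):
    """Emulate a Prometheus `replace` relabel step (simplified sketch).

    Returns the new target label value, or None when the regex does not
    match, in which case Prometheus leaves the target label untouched.
    """
    joined = separator.join(values)
    # Relabel regexes are anchored at both ends, hence fullmatch.
    match = re.fullmatch(regex, joined)
    if match is None:
        return None
    # Translate $1 / ${1} group references into Python's \g<1> form.
    py_replacement = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", replacement)
    return match.expand(py_replacement)


# Pod address plus the prometheus.io/port annotation value.
print(relabel_replace(["10.0.0.5:8080", "9102"],
                      r"([^:]+)(?::\d+)?;(\d+)", "$1:$2"))  # 10.0.0.5:9102

# Node exporter address with the port stripped for the instance label.
print(relabel_replace(["node1:9100"], r"([^:]+)(:[0-9]+)?", "${1}"))  # node1
```

When the port annotation is missing, the joined string ends in `;` with no digits after it, the regex does not match, and the original address is kept.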

Scrape Configurations

Static Targets

yaml
scrape_configs:
  - job_name: "static-targets"
    static_configs:
      - targets: ["host1:9100", "host2:9100"]
        labels:
          env: "production"
          region: "us-west-2"

File-based Service Discovery

yaml
scrape_configs:
  - job_name: "file-sd"
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json
          - /etc/prometheus/targets/*.yml
        refresh_interval: 5m
targets/production.json:
json
[
  {
    "targets": ["app1:9090", "app2:9090"],
    "labels": {
      "env": "production",
      "service": "api"
    }
  }
]
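
Target files like the one above are often generated by a script or a deployment hook. The following is a minimal sketch, assuming the generator runs on the Prometheus host and can write to the watched directory; `write_file_sd` is a hypothetical helper, not part of any Prometheus tooling. It writes to a temporary file and renames it into place so that Prometheus is unlikely to read a half-written file.

```python
import json
import os
import tempfile


def write_file_sd(path, targets, labels):
    """Write one file_sd target group to `path` via temp file and rename."""
    groups = [{"targets": list(targets), "labels": dict(labels)}]
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so os.replace is a rename
    # on one filesystem; the .tmp suffix keeps it out of the *.json glob.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=2)
    # mkstemp creates the file as 0600; make it readable for Prometheus.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, path)
    return groups
```

A call such as `write_file_sd("/etc/prometheus/targets/production.json", ["app1:9090", "app2:9090"], {"env": "production", "service": "api"})` would reproduce the example file; Prometheus picks up the change without a restart.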

Kubernetes Service Discovery

yaml
scrape_configs:
  - job_name: "kubernetes-services"
    kubernetes_sd_configs:
      - role: service
    relabel_configs:
      - source_labels:
          [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels:
          [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
Reference: See references/scrape-configs.md
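
The long `__meta_kubernetes_service_annotation_*` label names come from Kubernetes annotation keys: characters that are not valid in a label name are replaced with underscores. As a sketch of that mapping and of the `keep` filter (both helpers are hypothetical and only emulate the behavior in Python):

```python
import re


def annotation_meta_label(annotation, role="service"):
    """Build the __meta_kubernetes_* label name for an annotation key."""
    sanitized = re.sub(r"[^a-zA-Z0-9_]", "_", annotation)
    return f"__meta_kubernetes_{role}_annotation_{sanitized}"


def keep(labels, source_label, regex):
    """Emulate `action: keep`: True when the anchored regex matches."""
    return re.fullmatch(regex, labels.get(source_label, "")) is not None


scrape_label = annotation_meta_label("prometheus.io/scrape")
print(scrape_label)
# __meta_kubernetes_service_annotation_prometheus_io_scrape
print(keep({scrape_label: "true"}, scrape_label, "true"))  # True
print(keep({}, scrape_label, "true"))                      # False
```

Because the match is anchored and case-sensitive, an annotation value such as `"True"` or `"yes"` would not pass the `keep` rule shown in the configuration.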

Recording Rules

Create pre-computed metrics for frequently queried expressions:
yaml
# /etc/prometheus/rules/recording_rules.yml
groups:
  - name: api_metrics
    interval: 15s
    rules:
      # HTTP request rate per service
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Error rate percentage
      - record: job:http_requests_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))

      - record: job:http_requests_error_rate:percentage
        expr: |
          (job:http_requests_errors:rate5m / job:http_requests:rate5m) * 100

      # P95 latency
      - record: job:http_request_duration:p95
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

  - name: resource_metrics
    interval: 30s
    rules:
      # CPU utilization percentage
      - record: instance:node_cpu:utilization
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

      # Memory utilization percentage
      - record: instance:node_memory:utilization
        expr: |
          100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)

      # Disk usage percentage
      - record: instance:node_disk:utilization
        expr: |
          100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)

Reference: See references/recording-rules.md
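
The P95 rule above is an estimate, not an exact percentile: `histogram_quantile` interpolates linearly inside the bucket that contains the requested rank. The sketch below is a simplified Python illustration of that interpolation, not the Prometheus implementation; the bucket boundaries and counts are made-up sample data.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite
                # bound, since there is no upper edge to interpolate toward.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# 100 requests: 50 under 0.1s, 90 under 0.5s, all under 1s.
sample = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.95, sample))  # about 0.75
```

Rank 95 falls halfway through the 0.5s to 1s bucket, so the estimate is about 0.75s regardless of where in that bucket the requests actually landed. Bucket boundaries near the alerting threshold therefore matter more than the number of buckets.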

Alert Rules

yaml
# /etc/prometheus/rules/alert_rules.yml
groups:
  - name: availability
    interval: 30s
    rules:
      - alert: ServiceDown
        expr: up{job="my-app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.job }} has been down for more than 1 minute"

      - alert: HighErrorRate
        expr: job:http_requests_error_rate:percentage > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate for {{ $labels.job }}"
          description: "Error rate is {{ $value }}% (threshold: 5%)"

      - alert: HighLatency
        expr: job:http_request_duration:p95 > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency for {{ $labels.job }}"
          description: "P95 latency is {{ $value }}s (threshold: 1s)"

  - name: resources
    interval: 1m
    rules:
      - alert: HighCPUUsage
        expr: instance:node_cpu:utilization > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"

      - alert: HighMemoryUsage
        expr: instance:node_memory:utilization > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}%"

      - alert: DiskSpaceLow
        expr: instance:node_disk:utilization > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk usage is {{ $value }}%"
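
The `for:` clause in these rules means the expression must hold at every evaluation for the whole duration before the alert fires; until then the alert is pending, and one evaluation below the threshold resets the clock. The following sketch simulates that state machine in Python. It is an illustration only: `alert_states` is a hypothetical helper, the samples are invented, and features such as `keep_firing_for` are ignored.

```python
def alert_states(samples, threshold, for_seconds):
    """Return (timestamp, state) pairs for a `value > threshold` alert.

    `samples` is a list of (timestamp_seconds, value) rule evaluations.
    """
    states = []
    active_since = None
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp
            held = timestamp - active_since
            state = "firing" if held >= for_seconds else "pending"
        else:
            # Condition no longer holds: the pending timer starts over.
            active_since = None
            state = "inactive"
        states.append((timestamp, state))
    return states


# Error rate (%) evaluated once a minute against "> 5 for 5m".
samples = [(0, 3), (60, 7), (120, 8), (180, 4), (240, 9), (300, 9),
           (360, 9), (420, 9), (480, 9), (540, 9)]
for timestamp, state in alert_states(samples, threshold=5, for_seconds=300):
    print(timestamp, state)
```

In this run the dip at 180s resets the timer, so the alert first fires at 540s, five minutes after the condition became true again at 240s.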

Validation

bash
# Validate configuration
promtool check config prometheus.yml

# Validate rules
promtool check rules /etc/prometheus/rules/*.yml

# Test query
promtool query instant http://localhost:9090 'up'

Reference: See scripts/validate-prometheus.sh

Best Practices

  1. Use consistent naming for metrics (prefix_name_unit)
  2. Set appropriate scrape intervals (15-60s typical)
  3. Use recording rules for expensive queries
  4. Implement high availability (multiple Prometheus instances)
  5. Configure retention based on storage capacity
  6. Use relabeling for metric cleanup
  7. Monitor Prometheus itself
  8. Implement federation for large deployments
  9. Use Thanos/Cortex for long-term storage
  10. Document custom metrics
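
For item 5, the Prometheus storage documentation gives a rule of thumb: needed disk space is roughly retention time in seconds times ingested samples per second times bytes per sample, with about 1 to 2 bytes per sample on average. The sketch below applies that formula; the ingestion rate of 100,000 samples per second is an assumed figure for illustration, and real usage should be checked against the server's own metrics.

```python
SECONDS_PER_DAY = 86_400


def needed_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Estimate TSDB disk usage for a retention window and ingest rate."""
    return retention_days * SECONDS_PER_DAY * samples_per_second * bytes_per_sample


def sustainable_ingest_rate(disk_bytes, retention_days, bytes_per_sample=2.0):
    """Estimate the samples/second a disk can hold for a retention window."""
    return disk_bytes / (retention_days * SECONDS_PER_DAY * bytes_per_sample)


# 30 days of retention at an assumed 100,000 samples/s, 2 bytes per sample.
print(needed_disk_bytes(30, 100_000) / 1e9)  # 518.4 (GB)

# What a 50 GiB volume could sustain over the same 30 days.
print(round(sustainable_ingest_rate(50 * 2**30, 30)))  # 10356
```

Under these assumptions, the 30d retention and 50Gi volume used in the installation examples fit a server ingesting roughly 10,000 samples per second; a larger deployment would need more disk or shorter retention.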

Troubleshooting

Check scrape targets:
bash
curl http://localhost:9090/api/v1/targets
Check configuration:
bash
curl http://localhost:9090/api/v1/status/config
Test query:
bash
curl 'http://localhost:9090/api/v1/query?query=up'
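
The targets endpoint returns a large JSON document, so it helps to filter it down to the targets that are failing. A minimal sketch using only the Python standard library follows; the helper names are hypothetical, and the base URL assumes the local server used in the curl commands above.

```python
import json
import urllib.request


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the /api/v1/targets payload (requires a reachable server)."""
    url = f"{base_url}/api/v1/targets"
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


def unhealthy_targets(payload):
    """Return (job, instance, last_error) for active targets not 'up'."""
    active = payload.get("data", {}).get("activeTargets", [])
    return [
        (
            target.get("labels", {}).get("job", ""),
            target.get("labels", {}).get("instance", ""),
            target.get("lastError", ""),
        )
        for target in active
        if target.get("health") != "up"
    ]


# Usage against a running server:
#   for job, instance, error in unhealthy_targets(fetch_targets()):
#       print(job, instance, error)
```

The `lastError` field usually points at the cause directly, for example a connection refusal, a TLS failure, or a response that is not valid exposition format.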

Reference Files

  • assets/prometheus.yml.template - Complete configuration template
  • references/scrape-configs.md - Scrape configuration patterns
  • references/recording-rules.md - Recording rule examples
  • scripts/validate-prometheus.sh - Validation script

Related Skills

  • grafana-dashboards - For visualization
  • slo-implementation - For SLO monitoring
  • distributed-tracing - For request tracing