prometheus-grafana


Prometheus & Grafana

Collect metrics and visualize system performance with the Prometheus-Grafana stack.

When to Use This Skill


Use this skill when:
  • Setting up metrics collection infrastructure
  • Creating monitoring dashboards
  • Writing PromQL queries for analysis
  • Configuring alerting rules
  • Monitoring Kubernetes clusters

Prerequisites


  • Docker or Kubernetes for deployment
  • Network access to monitored targets
  • Basic understanding of metrics concepts

Prometheus Setup


Docker Deployment


```yaml
# docker-compose.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:v2.48.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./rules:/etc/prometheus/rules
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'

  grafana:
    image: grafana/grafana:10.2.0
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin

volumes:
  prometheus-data:
  grafana-data:
```

Configuration


```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets:
          - 'node-exporter:9100'

  - job_name: 'applications'
    metrics_path: /metrics
    static_configs:
      - targets:
          - 'app1:8080'
          - 'app2:8080'
```

Kubernetes Deployment


Using Helm


```bash
# Add the Prometheus community Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

# Install kube-prometheus-stack
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=admin
```

ServiceMonitor


```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
  namespaceSelector:
    matchNames:
      - default
```
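The ServiceMonitor above only discovers Services in the `default` namespace that carry the `app: myapp` label and expose a port *named* `metrics`. As an illustration (the Service name and port number here are assumptions, not from the original), a matching Service might look like:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp          # hypothetical Service for illustration
  namespace: default   # must be in a namespace listed under namespaceSelector
  labels:
    app: myapp         # must match spec.selector.matchLabels
spec:
  selector:
    app: myapp
  ports:
    - name: metrics    # endpoints[].port refers to this port *name*
      port: 8080
      targetPort: 8080
```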

PromQL Queries


Basic Queries


```promql
# CPU time spent idle (a raw counter; wrap in rate() for a usage ratio)
node_cpu_seconds_total{mode="idle"}

# Rate of HTTP requests per second
rate(http_requests_total[5m])

# Average response time
avg(http_request_duration_seconds_sum / http_request_duration_seconds_count)

# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
```
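Behind `rate()`, Prometheus looks at the counter samples inside the range window and computes a per-second increase. A rough stdlib-Python sketch of the core idea (the real function additionally corrects for counter resets and extrapolates to the window boundaries):

```python
def simple_rate(samples):
    """Per-second rate from counter samples: (timestamp_seconds, value) pairs.

    Sketch of the core of PromQL's rate(); Prometheus also handles counter
    resets and extrapolates the increase to the full window.
    """
    (t0, v0), (tn, vn) = samples[0], samples[-1]
    increase = vn - v0          # counter only goes up, so last - first
    return increase / (tn - t0)  # normalize to per-second

# A counter scraped every 15s over a 60s window, growing by 30 per scrape:
print(simple_rate([(0, 100), (15, 130), (30, 160), (45, 190), (60, 220)]))  # 2.0
```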

Aggregations


```promql
# Sum of request rate by status code
sum by (status_code) (rate(http_requests_total[5m]))

# Average CPU usage by instance (non-idle modes)
avg by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))

# Top 5 endpoints by request rate
topk(5, sum by (endpoint) (rate(http_requests_total[5m])))

# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```
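`histogram_quantile` estimates a quantile from the cumulative `le` buckets by finding the bucket the target rank falls into and interpolating linearly inside it. A simplified Python sketch (Prometheus additionally special-cases the lowest and `+Inf` buckets):

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs, ascending, ending with
    the +Inf bucket -- the shape exposed by *_bucket{le="..."} series.
    """
    total = buckets[-1][1]
    rank = q * total  # how many observations lie at or below the quantile
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # quantile lies beyond the last finite bucket
            # linear interpolation inside the bucket containing the rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (float("inf"), 100)]
print(histogram_quantile(0.95, buckets))  # 0.75
```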

Time-Based Queries


```promql
# Compare to 1 hour ago
http_requests_total - http_requests_total offset 1h

# Predict available disk space 4 hours from now
predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600)

# Number of up/down state changes in the last 5 minutes
changes(up[5m])

# Average over 24 hours
avg_over_time(http_requests_total[24h])
```
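`predict_linear` fits a least-squares line through the samples in the range and extrapolates it forward. A stdlib-Python sketch of the same calculation:

```python
def predict_linear(samples, seconds_ahead):
    """Fit a least-squares line to (timestamp, value) samples and
    extrapolate seconds_ahead past the last sample -- the idea behind
    PromQL's predict_linear()."""
    n = len(samples)
    t0 = samples[0][0]
    xs = [t - t0 for t, _ in samples]
    ys = [v for _, v in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept + slope * (xs[-1] + seconds_ahead)

# Disk space dropping 10 units per minute; one hour out it goes negative,
# i.e. the resource would be exhausted before then.
print(predict_linear([(0, 100), (60, 90), (120, 80)], 3600))  # -520.0
```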

Alerting Rules


```yaml
# rules/alerts.yml
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"

      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | humanizePercentage }}"

      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"}
            / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
```
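The `for: 5m` clauses above mean an alert first goes *pending* and only *fires* once its expression has stayed true for the whole duration; a brief dip below the threshold resets the clock. A simplified Python model of that behavior:

```python
def alert_firing(samples, threshold, for_seconds):
    """Simplified model of Prometheus alert 'for:' semantics.

    samples: (timestamp, value) pairs evaluated in order. The alert becomes
    pending when value first exceeds threshold, fires once it has stayed
    above it for for_seconds, and resets whenever the condition clears.
    """
    pending_since = None
    for t, v in samples:
        if v > threshold:
            if pending_since is None:
                pending_since = t          # alert enters pending state
            if t - pending_since >= for_seconds:
                return True                # held long enough: firing
        else:
            pending_since = None           # dip below threshold resets the clock
    return False

# 6 evaluations, 60s apart, all above a 5% error-rate threshold -> fires
print(alert_firing([(i * 60, 0.08) for i in range(6)], 0.05, 300))  # True
```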

Alertmanager


```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/xxx'

route:
  receiver: 'slack-notifications'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'xxx'
        severity: critical
```

Grafana Dashboards


Dashboard JSON Structure


```json
{
  "dashboard": {
    "title": "Application Metrics",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (status_code)",
            "legendFormat": "{{ status_code }}"
          }
        ],
        "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8}
      },
      {
        "title": "Latency P95",
        "type": "gauge",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))"
          }
        ],
        "gridPos": {"x": 12, "y": 0, "w": 6, "h": 8}
      }
    ]
  }
}
```

Provisioning Dashboards


```yaml
# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards
```

Data Source Provisioning


```yaml
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  # Points Grafana at the Prometheus service from docker-compose.yml
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```

Recording Rules


```yaml
# rules/recording.yml
groups:
  - name: aggregations
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      - record: instance:node_cpu:avg_rate5m
        expr: |
          avg by (instance) (
            rate(node_cpu_seconds_total{mode!="idle"}[5m])
          )

      - record: job:http_latency:p95
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )
```

Application Instrumentation


Go Application


```go
package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var httpRequests = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total HTTP requests",
    },
    []string{"method", "endpoint", "status"},
)

func init() {
    prometheus.MustRegister(httpRequests)
}

func main() {
    // Expose the metrics endpoint
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8080", nil)
}
```

Node.js Application


```javascript
const express = require('express');
const client = require('prom-client');

const app = express();

const httpRequests = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'endpoint', 'status']
});

// Middleware: count every request once the response has finished
app.use((req, res, next) => {
  res.on('finish', () => {
    httpRequests.inc({
      method: req.method,
      endpoint: req.path,
      status: res.statusCode
    });
  });
  next();
});

// Expose metrics
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

app.listen(8080);
```
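Both clients above ultimately serve the same plain-text exposition format at `/metrics`. A minimal Python sketch (the function name and sample data are illustrative, not a real client-library API) of what a labeled counter looks like on the wire:

```python
def render_counter(name, help_text, samples):
    """Render a counter in the Prometheus text exposition format.

    samples maps tuples of (label, value) pairs to counter values, e.g.
    {(("method", "GET"), ("status", "200")): 42}.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

print(render_counter(
    "http_requests_total",
    "Total HTTP requests",
    {(("method", "GET"), ("status", "200")): 42},
))
# http_requests_total{method="GET",status="200"} 42
```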

Common Issues


Issue: Targets Not Discovered


Problem: Prometheus is not scraping the expected targets.
Solution: Check network connectivity from Prometheus to each target and verify the target addresses and labels in scrape_configs.

Issue: High Memory Usage


Problem: Prometheus is consuming excessive memory.
Solution: Reduce the retention period, precompute expensive queries with recording rules, and limit label cardinality.
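Cardinality is multiplicative: every distinct combination of label values becomes its own time series. A quick Python illustration of why an unbounded label (user IDs, raw URL paths) is dangerous:

```python
def worst_case_series(label_cardinalities):
    """Worst-case time series count for one metric: the product of the
    number of distinct values each label can take."""
    n = 1
    for distinct_values in label_cardinalities.values():
        n *= distinct_values
    return n

# Hypothetical label counts for http_requests_total:
print(worst_case_series({"method": 5, "endpoint": 50, "status": 10}))   # 2500
# Swap the bounded endpoint label for raw URLs and the count explodes:
print(worst_case_series({"method": 5, "url": 100000, "status": 10}))    # 5000000
```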

Issue: Slow Queries


Problem: PromQL queries are timing out.
Solution: Use recording rules for expensive expressions, narrow the query time range, and optimize the queries themselves.

Issue: Missing Data Points


Problem: There are gaps in the metrics data.
Solution: Check the scrape interval and confirm targets remained reachable during the gaps.

Best Practices


  • Use recording rules for frequently-used queries
  • Limit label cardinality to prevent memory issues
  • Set appropriate retention based on storage capacity
  • Use histogram metrics for latency measurement
  • Implement proper alerting thresholds
  • Version control dashboards as code
  • Use federation for large-scale deployments
  • Regularly review and prune unused metrics

Related Skills


  • alerting-oncall - Alert management
  • loki-logging - Log aggregation
  • kubernetes-ops - K8s monitoring