prometheus-grafana
Prometheus & Grafana
Collect metrics and visualize system performance with the Prometheus-Grafana stack.
When to Use This Skill
Use this skill when:
- Setting up metrics collection infrastructure
- Creating monitoring dashboards
- Writing PromQL queries for analysis
- Configuring alerting rules
- Monitoring Kubernetes clusters
Prerequisites
- Docker or Kubernetes for deployment
- Network access to monitored targets
- Basic understanding of metrics concepts
Prometheus Setup
Docker Deployment
```yaml
# docker-compose.yml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:v2.48.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./rules:/etc/prometheus/rules
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'
  grafana:
    image: grafana/grafana:10.2.0
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
volumes:
  prometheus-data:
  grafana-data:
```
Configuration
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets:
          - 'node-exporter:9100'

  - job_name: 'applications'
    metrics_path: /metrics
    static_configs:
      - targets:
          - 'app1:8080'
          - 'app2:8080'
```
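Scrape jobs can also attach static labels to every series they collect, which helps distinguish environments or teams. A minimal sketch, assuming the `applications` job above and a hypothetical `env` label:

```yaml
scrape_configs:
  - job_name: 'applications'
    metrics_path: /metrics
    static_configs:
      - targets:
          - 'app1:8080'
        labels:
          env: 'production'   # hypothetical label, added to every series this job scrapes
```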
Kubernetes Deployment
Using Helm
```bash
# Add Prometheus community Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

# Install kube-prometheus-stack
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=admin
```
ServiceMonitor
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
  namespaceSelector:
    matchNames:
      - default
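A ServiceMonitor selects Services by label, and its `port` field refers to a *named* port on the Service, not a number. A sketch of a Service that the ServiceMonitor above would match (names and port number are assumptions for illustration):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: default
  labels:
    app: myapp          # must match spec.selector.matchLabels in the ServiceMonitor
spec:
  selector:
    app: myapp
  ports:
    - name: metrics     # the ServiceMonitor's `port: metrics` refers to this name
      port: 8080
```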
PromQL Queries
Basic Queries
```promql
# Raw idle-CPU counter (wrap in rate() to derive utilization)
node_cpu_seconds_total{mode="idle"}

# Rate of HTTP requests per second
rate(http_requests_total[5m])

# Average response time
avg(http_request_duration_seconds_sum / http_request_duration_seconds_count)

# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
```
Aggregations
```promql
# Sum requests by status code
sum by (status_code) (rate(http_requests_total[5m]))

# Average CPU by instance
avg by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))

# Top 5 endpoints by request count
topk(5, sum by (endpoint) (rate(http_requests_total[5m])))

# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```
Time-Based Queries
```promql
# Compare to 1 hour ago
http_requests_total - http_requests_total offset 1h

# Predict disk space in 4 hours
predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600)

# Changes in last 5 minutes
changes(up[5m])

# Average over 24 hours
avg_over_time(http_requests_total[24h])
```
Alerting Rules
```yaml
# rules/alerts.yml
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"

      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | humanizePercentage }}"

      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"}
            / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
```
Alertmanager
```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/xxx'

route:
  receiver: 'slack-notifications'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'xxx'
        severity: critical
```
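Alertmanager can also suppress lower-severity notifications while a related critical alert is firing. A minimal inhibition sketch, assuming the `severity` labels used above (newer Alertmanager releases prefer the `source_matchers`/`target_matchers` syntax over `source_match`/`target_match`):

```yaml
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    # Only inhibit when these labels match between the two alerts
    equal: ['alertname', 'instance']
```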
Grafana Dashboards
Dashboard JSON Structure
```json
{
  "dashboard": {
    "title": "Application Metrics",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (status_code)",
            "legendFormat": "{{ status_code }}"
          }
        ],
        "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8}
      },
      {
        "title": "Latency P95",
        "type": "gauge",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))"
          }
        ],
        "gridPos": {"x": 12, "y": 0, "w": 6, "h": 8}
      }
    ]
  }
}
```
Provisioning Dashboards
```yaml
# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards
```
Data Source Provisioning
```yaml
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
```
Recording Rules
```yaml
# rules/recording.yml
groups:
  - name: aggregations
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      - record: instance:node_cpu:avg_rate5m
        expr: |
          avg by (instance) (
            rate(node_cpu_seconds_total{mode!="idle"}[5m])
          )

      - record: job:http_latency:p95
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )
```
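Once evaluated, recorded series are queried by their new name like any other metric, which is what makes dashboards and alerts on them cheap. For example, with the rules above (the `job="app1"` selector is illustrative):

```promql
# Instant value of the precomputed request rate for one job
job:http_requests:rate5m{job="app1"}

# Alert on the precomputed p95 instead of recomputing the histogram query
job:http_latency:p95 > 0.5
```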
Application Instrumentation
Go Application
```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var httpRequests = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Total HTTP requests",
	},
	[]string{"method", "endpoint", "status"},
)

func init() {
	prometheus.MustRegister(httpRequests)
}

func main() {
	// Expose metrics endpoint
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```
Node.js Application
```javascript
const express = require('express');
const client = require('prom-client');

const app = express();

const httpRequests = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'endpoint', 'status']
});

// Middleware: count every finished response
app.use((req, res, next) => {
  res.on('finish', () => {
    httpRequests.inc({
      method: req.method,
      endpoint: req.path,
      status: res.statusCode
    });
  });
  next();
});

// Expose metrics
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

app.listen(8080);
```
Common Issues
Issue: Targets Not Discovered
Problem: Prometheus is not scraping expected targets
Solution: Check network connectivity, verify target labels, and inspect Status → Targets in the Prometheus UI
Issue: High Memory Usage
Problem: Prometheus using excessive memory
Solution: Reduce retention, use recording rules, limit cardinality
Issue: Slow Queries
Problem: PromQL queries timing out
Solution: Use recording rules, limit time ranges, optimize queries
Issue: Missing Data Points
Problem: Gaps in metrics data
Solution: Check scrape interval, verify target availability
Best Practices
- Use recording rules for frequently-used queries
- Limit label cardinality to prevent memory issues
- Set appropriate retention based on storage capacity
- Use histogram metrics for latency measurement
- Implement proper alerting thresholds
- Version control dashboards as code
- Use federation for large-scale deployments
- Regularly review and prune unused metrics
Related Skills
- alerting-oncall - Alert management
- loki-logging - Log aggregation
- kubernetes-ops - K8s monitoring