Server Monitoring — Production Stack
Stack Overview
| Layer | Tool | Purpose |
|---|---|---|
| Metrics collection | Node Exporter | OS/hardware metrics from `/proc` and `/sys` |
| Metrics scraping | Prometheus | Pull-based time-series database |
| Visualization | Grafana | Dashboards and alerting UI |
| Alerting | Alertmanager | Route, deduplicate, silence alerts |
| Log shipping | Promtail | Tail logs → push to Loki |
| Log aggregation | Loki | Log storage with label-based indexing |
| Uptime (external) | UptimeRobot | External HTTP/TCP reachability checks |
| Cron monitoring | healthchecks.io | Detect silent cron job failures |
Node Exporter Installation (systemd)
Download the latest release from https://github.com/prometheus/node_exporter/releases.
```bash
NODE_EXPORTER_VERSION=1.8.2
wget https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
tar xvf node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
sudo cp node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64/node_exporter /usr/local/bin/
sudo useradd --no-create-home --shell /bin/false node_exporter
```

`/etc/systemd/system/node_exporter.service`:

```ini
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --collector.systemd \
  --collector.processes \
  --collector.diskstats \
  --web.listen-address=127.0.0.1:9100
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

```bash
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
sudo systemctl status node_exporter
curl -s http://127.0.0.1:9100/metrics | head -20
```

Bind Node Exporter to `127.0.0.1:9100` — never expose it directly on 0.0.0.0. Prometheus scrapes it locally; use an SSH tunnel or VPN for a remote Prometheus.
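The text format served at `/metrics` is line-oriented and easy to post-process. A minimal sketch extracting one gauge with `awk` — the sample values below are made up so the snippet runs without a live exporter:

```bash
# Extract a single metric from Prometheus text-format output.
# The sample stands in for: curl -s http://127.0.0.1:9100/metrics
metrics='# HELP node_load1 1m load average.
node_load1 0.42
node_memory_MemAvailable_bytes 8.1e+09'

printf '%s\n' "$metrics" | awk '$1 == "node_load1" { print $2 }'
# → 0.42
```

Against the live exporter, pipe `curl -s http://127.0.0.1:9100/metrics` into the same `awk` filter.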
Prometheus Installation and Configuration
```bash
PROMETHEUS_VERSION=2.53.0
wget https://github.com/prometheus/prometheus/releases/download/v${PROMETHEUS_VERSION}/prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz
tar xvf prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz
sudo cp prometheus-${PROMETHEUS_VERSION}.linux-amd64/{prometheus,promtool} /usr/local/bin/
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo cp -r prometheus-${PROMETHEUS_VERSION}.linux-amd64/{consoles,console_libraries} /etc/prometheus/
sudo useradd --no-create-home --shell /bin/false prometheus
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
```

`/etc/prometheus/prometheus.yml`:

```yaml
global:
  scrape_interval: 15s       # Default scrape interval
  evaluation_interval: 15s   # Rule evaluation interval
  scrape_timeout: 10s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  # Node Exporter — OS metrics
  - job_name: node
    static_configs:
      - targets: ['localhost:9100']
        labels:
          server: 'prod-web-01'
          env: production

  # Application metrics — assumes /metrics on port 3000
  - job_name: app
    metrics_path: /metrics
    static_configs:
      - targets: ['localhost:3000']
        labels:
          app: myapp
          env: production

  # Blackbox Exporter — probe HTTP endpoints
  - job_name: blackbox_http
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://example.com/api/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115   # Blackbox Exporter address

  # Prometheus self-monitoring
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
```

`/etc/systemd/system/prometheus.service`:

```ini
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=10GB \
  --web.listen-address=127.0.0.1:9090 \
  --web.enable-lifecycle
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

```bash
sudo systemctl daemon-reload
sudo systemctl enable --now prometheus
promtool check config /etc/prometheus/prometheus.yml
```
Alert Rules
`/etc/prometheus/rules/server.yml`:

```yaml
groups:
  - name: server_alerts
    interval: 1m
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.job }}/{{ $labels.instance }} has been unreachable for more than 2 minutes."

      - alert: HighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ printf \"%.1f\" $value }}% (threshold 85%)."

      - alert: HighMemory
        expr: |
          (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
          / node_memory_MemTotal_bytes * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ printf \"%.1f\" $value }}% (threshold 90%)."

      - alert: DiskAlmostFull
        expr: |
          (node_filesystem_size_bytes{fstype!="tmpfs"} - node_filesystem_free_bytes{fstype!="tmpfs"})
          / node_filesystem_size_bytes{fstype!="tmpfs"} * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk almost full on {{ $labels.instance }}"
          description: "Filesystem {{ $labels.mountpoint }} is {{ printf \"%.1f\" $value }}% full."

      - alert: HighLoad
        expr: node_load15 / count without(cpu, mode)(node_cpu_seconds_total{mode="idle"}) > 2.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High system load on {{ $labels.instance }}"
          description: "15-minute load average per CPU core is {{ printf \"%.2f\" $value }} (threshold 2.0)."
```

Validate rules:

```bash
promtool check rules /etc/prometheus/rules/server.yml
```
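`promtool` can also unit-test rules against synthetic series, which catches threshold and `for:` mistakes before they reach production. A minimal sketch — the file name `tests.yml` and the series values are illustrative:

```yaml
# tests.yml — run with: promtool test rules tests.yml
rule_files:
  - /etc/prometheus/rules/server.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'up{job="node", instance="localhost:9100"}'
        values: '0 0 0 0 0'   # target down for 5 consecutive minutes
    alert_rule_test:
      - eval_time: 3m         # past the 2m `for:` window, so InstanceDown must fire
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node
              instance: localhost:9100
```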
Alertmanager
```bash
ALERTMANAGER_VERSION=0.27.0
wget https://github.com/prometheus/alertmanager/releases/download/v${ALERTMANAGER_VERSION}/alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz
tar xvf alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz
sudo cp alertmanager-${ALERTMANAGER_VERSION}.linux-amd64/{alertmanager,amtool} /usr/local/bin/
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
```

`/etc/alertmanager/alertmanager.yml`:

```yaml
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'app-specific-password'  # Use App Password, not account password
  smtp_require_tls: true
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'instance']
  group_wait: 30s       # Wait before sending the first notification for a new group
  group_interval: 5m    # Wait before notifying about new alerts added to an existing group
  repeat_interval: 4h   # How often to re-send unresolved alerts
  receiver: 'team-ops'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      repeat_interval: 1h

receivers:
  - name: 'team-ops'
    email_configs:
      - to: 'ops-team@example.com'
        send_resolved: true
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/<T_ID>/<B_ID>/<WEBHOOK_TOKEN>'
        channel: '#alerts'
        send_resolved: true
        title: '{{ if eq .Status "firing" }}:red_circle:{{ else }}:white_check_mark:{{ end }} {{ .CommonAnnotations.summary }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'your-pagerduty-integration-key'
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'instance']
```

`/etc/systemd/system/alertmanager.service`:

```ini
[Unit]
Description=Alertmanager
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager \
  --web.listen-address=127.0.0.1:9093
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```
Grafana Installation and Datasource Provisioning
```bash
sudo apt-get install -y apt-transport-https software-properties-common
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt-get update && sudo apt-get install -y grafana
sudo systemctl enable --now grafana-server
```

Datasource provisioning (`/etc/grafana/provisioning/datasources/prometheus.yml`):

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
    editable: false
  - name: Loki
    type: loki
    access: proxy
    url: http://localhost:3100
    editable: false
```

Import the Node Exporter Full dashboard (ID 1860) via the Grafana UI: Dashboards → Import → enter 1860 → select the Prometheus datasource. Other useful dashboard IDs:

- 3662 — Prometheus 2.0 Stats
- 13659 — Node Exporter for Prometheus Dashboard
- 10991 — Blackbox Exporter
PLG Stack: Promtail + Loki
Loki Installation
```bash
LOKI_VERSION=3.1.0
wget https://github.com/grafana/loki/releases/download/v${LOKI_VERSION}/loki-linux-amd64.zip
unzip loki-linux-amd64.zip
sudo mv loki-linux-amd64 /usr/local/bin/loki
sudo mkdir -p /etc/loki /var/lib/loki
```

`/etc/loki/loki-config.yml`:

```yaml
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  instance_addr: 127.0.0.1
  path_prefix: /var/lib/loki
  storage:
    filesystem:
      chunks_directory: /var/lib/loki/chunks
      rules_directory: /var/lib/loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: loki_index_
        period: 24h

limits_config:
  retention_period: 30d   # Requires the compactor below

compactor:
  working_directory: /var/lib/loki/compactor
  retention_enabled: true
  delete_request_cancel_period: 24h
```
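No unit file is shown for Loki above; a sketch following the same pattern as the other services — the dedicated `loki` user is an assumption, created like the earlier service accounts:

```ini
# /etc/systemd/system/loki.service
[Unit]
Description=Loki
Wants=network-online.target
After=network-online.target

[Service]
User=loki
Group=loki
Type=simple
ExecStart=/usr/local/bin/loki -config.file=/etc/loki/loki-config.yml
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Create the account and fix ownership first: `sudo useradd --no-create-home --shell /bin/false loki && sudo chown -R loki:loki /var/lib/loki`.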
Promtail Configuration
`/etc/promtail/promtail-config.yml`:

```yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /var/lib/promtail/positions.yaml

clients:
  - url: http://localhost:3100/loki/api/v1/push

scrape_configs:
  # Systemd journal — captures all systemd unit logs
  - job_name: journal
    journal:
      max_age: 12h
      labels:
        job: systemd-journal
        host: prod-web-01
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: unit
      - source_labels: ['__journal_priority_keyword']
        target_label: level

  # Application log files
  - job_name: app_logs
    static_configs:
      - targets:
          - localhost
        labels:
          job: myapp
          host: prod-web-01
          __path__: /var/log/myapp/*.log
    pipeline_stages:
      - json:
          expressions:
            level: level
            msg: message
      - labels:
          level:
      - timestamp:
          source: timestamp
          format: RFC3339Nano

  # Nginx access logs
  - job_name: nginx
    static_configs:
      - targets:
          - localhost
        labels:
          job: nginx
          host: prod-web-01
          __path__: /var/log/nginx/access.log
```
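A matching unit file for Promtail, again following the pattern above — the `promtail` user is an assumption:

```ini
# /etc/systemd/system/promtail.service
[Unit]
Description=Promtail
Wants=network-online.target
After=network-online.target

[Service]
User=promtail
Group=promtail
Type=simple
ExecStart=/usr/local/bin/promtail -config.file=/etc/promtail/promtail-config.yml
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Promtail needs membership in `systemd-journal` to read the journal, plus read access to the tailed log paths (on Debian/Ubuntu the `adm` group typically covers `/var/log`): `sudo usermod -aG systemd-journal,adm promtail`.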
LogQL Basics
Filter by label:

```logql
{job="myapp", level="error"}
```

Filter by content:

```logql
{job="nginx"} |= "500"
```

Pattern extraction:

```logql
{job="nginx"} | pattern `<ip> - - [<_>] "<method> <path> <_>" <status> <_>`
```

Rate of error log lines per minute:

```logql
rate({job="myapp", level="error"}[1m])
```

Count errors by unit over the last hour:

```logql
sum by(unit) (count_over_time({job="systemd-journal", level="err"}[1h]))
```
---

System-Level Diagnostic Tools
| Tool | Key Usage | Notes |
|---|---|---|
| `htop` | Interactive process viewer | |
| `iotop` | I/O per process | |
| `nethogs` | Bandwidth per process | |
| `ss` | Socket statistics (replaces netstat) | |
| `vmstat` | Memory/swap/CPU overview | |
| `iostat` | Disk I/O stats | |
| `dstat` | Combined resource stats | |
| `sar` | Historical performance (sysstat) | |
| `free -h` | Memory and swap summary | Check buff/cache vs actual available |
| `df -h` | Disk space by filesystem | Add `-i` to check inode usage |
| `du` | Directory size | `du -sh /var/log/* \| sort -h` |
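The `df` check in the table scripts well for a quick triage pass. A sketch run against sample output so it works anywhere — pipe in real `df -h` in practice:

```bash
# Flag filesystems above 85% utilisation from `df -h`-style output.
# The sample replaces a live `df -h` call; awk's int() strips the trailing %.
sample='Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G   46G    4G  92% /
tmpfs           7.8G     0  7.8G   0% /dev/shm'

printf '%s\n' "$sample" | awk 'NR > 1 && int($5) > 85 { print $6, $5 }'
# → / 92%
```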
Uptime Monitoring
UptimeRobot (Free Tier)
- Create HTTP(s) monitor: URL, check interval (5 min on free tier), keyword match
- Alert contacts: email + Slack webhook
- Status page: public URL for incident communication
- TCP monitors for non-HTTP services (database ports, SMTP)
healthchecks.io (Cron Job Monitoring)
Add a curl ping at the end of every cron script:

```bash
#!/bin/bash
set -euo pipefail

# ... backup/maintenance logic here ...

# Signal success to healthchecks.io
curl -fsS --retry 3 https://hc-ping.com/YOUR-CHECK-UUID > /dev/null
```

For jobs that should ping start + finish:

```bash
curl -fsS --retry 3 https://hc-ping.com/YOUR-CHECK-UUID/start > /dev/null

# ... job logic ...

curl -fsS --retry 3 https://hc-ping.com/YOUR-CHECK-UUID > /dev/null
```
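healthchecks.io also accepts a `/fail` suffix, so a wrapper can report failures explicitly instead of relying on a missed ping. A sketch — `HC_PING` here just echoes so the example runs offline; in production replace its body with the real `curl` call shown in the comment:

```bash
#!/usr/bin/env bash
set -uo pipefail

HC_UUID="YOUR-CHECK-UUID"
# Dry-run stand-in; in production use: curl -fsS --retry 3 "$1" > /dev/null
HC_PING() { echo "ping $1"; }

run_monitored() {
  HC_PING "https://hc-ping.com/${HC_UUID}/start"
  if "$@"; then
    HC_PING "https://hc-ping.com/${HC_UUID}"
  else
    local rc=$?
    HC_PING "https://hc-ping.com/${HC_UUID}/fail"   # explicit failure signal
    return "$rc"
  fi
}

run_monitored true   # pings /start, then the success URL
```

Wrap the cron entry as `run_monitored /usr/local/bin/backup.sh` so both outcomes reach the check.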
---

Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| No alerting configured | Incidents discovered by users, not the ops team | Set up Alertmanager with email + Slack from day one |
| Scrape interval < 10s | High load on Prometheus, noisy data | Use 15s–60s depending on metric volatility |
| No retention limits | Prometheus disk fills up and crashes | Set `--storage.tsdb.retention.time` and `--storage.tsdb.retention.size` |
| Monitoring server on the same host being monitored | Single point of failure — if the server dies, so does the monitor | Run Prometheus on a dedicated monitoring host or use an external uptime service |
| Exposing Node Exporter on 0.0.0.0 | Metrics data leaked publicly | Bind to `127.0.0.1` and scrape locally or over a tunnel |
| Catching all logs without labels | Cannot filter Loki queries efficiently | Add `job`, `host`, and `level` labels in Promtail scrape configs |
| Alert rules without `for:` | Flapping alerts on transient spikes | Always add a `for:` duration suited to the metric |
| No `inhibit_rules` | Warning floods when a critical alert is also firing | Inhibit warnings when a critical alert matches the same instance |
| SMTP password in alertmanager.yml committed to git | Credential leak | Use environment variable substitution or external secret management |
| No Grafana datasource provisioning | Dashboard import fails after Grafana reinstall | Provision datasources as YAML in `/etc/grafana/provisioning/datasources/` |
Troubleshooting
| Symptom | Likely Cause | Diagnostic & Fix |
|---|---|---|
| Metrics not appearing in Prometheus | Node Exporter not running or wrong port | `systemctl status node_exporter`; `curl localhost:9100/metrics` |
| Target shown as DOWN on the /targets page | Prometheus cannot reach the scrape endpoint | Check firewall (`ufw status` / `iptables -L`), bind address, and port |
| Alert not firing despite condition met | `for:` duration not yet elapsed, or rule file not loaded | Check Status → Rules in the Prometheus UI; run `promtool check rules` |
| Alert firing but no notification | Alertmanager not configured or unreachable | Verify `localhost:9093` is reachable; inspect the config with `amtool` |
| Grafana blank dashboard after import | Wrong datasource selected | Edit dashboard → Panel → Query → verify the datasource dropdown matches the provisioned name |
| Loki not ingesting logs | Promtail cannot connect or wrong URL | `journalctl -u promtail`; verify `clients.url` points at Loki |
| Promtail not tailing journal | Missing journal permissions | Add the `promtail` user to the `systemd-journal` group |
| High cardinality error in Loki | Too many unique label combinations | Avoid high-cardinality labels (request IDs, user IDs) — use log content for those |
| Prometheus OOM killed | Too many metrics or insufficient retention pruning | Set `--storage.tsdb.retention.size`; reduce targets or scrape frequency |
| No rules or alerts visible in the Prometheus UI | Prometheus evaluating no rules | Confirm the `rule_files` glob matches and reload Prometheus |