
Server Monitoring — Production Stack


Stack Overview


| Layer | Tool | Purpose |
|---|---|---|
| Metrics collection | Node Exporter | OS/hardware metrics from `/proc` and `/sys` |
| Metrics scraping | Prometheus | Pull-based time-series database |
| Visualization | Grafana | Dashboards and alerting UI |
| Alerting | Alertmanager | Route, deduplicate, silence alerts |
| Log shipping | Promtail | Tail logs → push to Loki |
| Log aggregation | Loki | Log storage with label-based indexing |
| Uptime (external) | UptimeRobot | External HTTP/TCP reachability checks |
| Cron monitoring | healthchecks.io | Detect silent cron job failures |


Node Exporter Installation (systemd)


```bash
NODE_EXPORTER_VERSION=1.8.2
wget https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
tar xvf node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
sudo cp node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64/node_exporter /usr/local/bin/
sudo useradd --no-create-home --shell /bin/false node_exporter
```

`/etc/systemd/system/node_exporter.service`:

```ini
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --collector.systemd \
  --collector.processes \
  --collector.diskstats \
  --web.listen-address=127.0.0.1:9100
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

```bash
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
sudo systemctl status node_exporter
curl -s http://127.0.0.1:9100/metrics | head -20
```

Bind Node Exporter to `127.0.0.1:9100`; never expose it directly on 0.0.0.0. Prometheus scrapes it locally; use an SSH tunnel or VPN for a remote Prometheus.
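One way to let a remote Prometheus scrape a loopback-only exporter is a persistent SSH local forward. A minimal sketch, not the only option: the host alias `prod-web-01` and local port 19100 are illustrative assumptions, and in practice the tunnel should run under systemd or autossh so it survives disconnects.

```shell
# Forward local port 19100 to the exporter on prod-web-01's loopback.
# -N: no remote command; -T: no TTY allocation.
ssh -N -T -L 19100:127.0.0.1:9100 prod-web-01
```

The monitoring host's Prometheus then scrapes `localhost:19100` as if the exporter were local.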


Prometheus Installation and Configuration


```bash
PROMETHEUS_VERSION=2.53.0
wget https://github.com/prometheus/prometheus/releases/download/v${PROMETHEUS_VERSION}/prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz
tar xvf prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz
sudo cp prometheus-${PROMETHEUS_VERSION}.linux-amd64/{prometheus,promtool} /usr/local/bin/
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo cp -r prometheus-${PROMETHEUS_VERSION}.linux-amd64/{consoles,console_libraries} /etc/prometheus/
sudo useradd --no-create-home --shell /bin/false prometheus
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
```

`/etc/prometheus/prometheus.yml`:

```yaml
global:
  scrape_interval: 15s           # Default scrape interval
  evaluation_interval: 15s       # Rule evaluation interval
  scrape_timeout: 10s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  # Node Exporter — OS metrics
  - job_name: node
    static_configs:
      - targets: ['localhost:9100']
        labels:
          server: 'prod-web-01'
          env: production

  # Application metrics — assumes /metrics on port 3000
  - job_name: app
    metrics_path: /metrics
    static_configs:
      - targets: ['localhost:3000']
        labels:
          app: myapp
          env: production

  # Blackbox Exporter — probe HTTP endpoints
  - job_name: blackbox_http
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://example.com/api/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115   # Blackbox Exporter address

  # Prometheus self-monitoring
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
```

`/etc/systemd/system/prometheus.service`:

```ini
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=10GB \
  --web.listen-address=127.0.0.1:9090 \
  --web.enable-lifecycle
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

```bash
sudo systemctl daemon-reload
sudo systemctl enable --now prometheus
promtool check config /etc/prometheus/prometheus.yml
```
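Because the unit file passes `--web.enable-lifecycle`, Prometheus also exposes a reload endpoint, so later config edits can be applied without a restart. Validate first, then reload:

```shell
# Re-read prometheus.yml in place; only reaches a running Prometheus
# started with --web.enable-lifecycle on 127.0.0.1:9090.
promtool check config /etc/prometheus/prometheus.yml \
  && curl -X POST http://127.0.0.1:9090/-/reload
```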


Alert Rules


`/etc/prometheus/rules/server.yml`:

```yaml
groups:
  - name: server_alerts
    interval: 1m
    rules:

      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.job }}/{{ $labels.instance }} has been unreachable for more than 2 minutes."

      - alert: HighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ printf \"%.1f\" $value }}% (threshold 85%)."

      - alert: HighMemory
        expr: |
          (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
          / node_memory_MemTotal_bytes * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ printf \"%.1f\" $value }}% (threshold 90%)."

      - alert: DiskAlmostFull
        expr: |
          (node_filesystem_size_bytes{fstype!="tmpfs"} - node_filesystem_free_bytes{fstype!="tmpfs"})
          / node_filesystem_size_bytes{fstype!="tmpfs"} * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk almost full on {{ $labels.instance }}"
          description: "Filesystem {{ $labels.mountpoint }} is {{ printf \"%.1f\" $value }}% full."

      - alert: HighLoad
        expr: node_load15 / count without(cpu, mode)(node_cpu_seconds_total{mode="idle"}) > 2.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High system load on {{ $labels.instance }}"
          description: "15-minute load average per CPU core is {{ printf \"%.2f\" $value }} (threshold 2.0)."
```

Validate rules:

```bash
promtool check rules /etc/prometheus/rules/server.yml
```
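The HighMemory expression is the standard "used = MemTotal - MemAvailable" calculation, so the threshold can be sanity-checked on any Linux host straight from `/proc/meminfo` before the rule ever fires:

```shell
# Compute used-memory % the same way the HighMemory rule does,
# from the MemTotal and MemAvailable lines of /proc/meminfo (kB values).
awk '/^MemTotal:/     {t=$2}
     /^MemAvailable:/ {a=$2}
     END { printf "used: %.1f%%\n", (t - a) / t * 100 }' /proc/meminfo
```

If this prints well under 90% while the alert fires (or vice versa), suspect stale scrapes rather than the expression.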


Alertmanager


```bash
ALERTMANAGER_VERSION=0.27.0
wget https://github.com/prometheus/alertmanager/releases/download/v${ALERTMANAGER_VERSION}/alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz
tar xvf alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz
sudo cp alertmanager-${ALERTMANAGER_VERSION}.linux-amd64/{alertmanager,amtool} /usr/local/bin/
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
```

`/etc/alertmanager/alertmanager.yml`:

```yaml
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'app-specific-password'   # Use App Password, not account password
  smtp_require_tls: true
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'instance']
  group_wait: 30s        # Wait before sending first notification for a new group
  group_interval: 5m     # How long to wait before notifying about new alerts in the same group
  repeat_interval: 4h    # How often to re-send unresolved alerts
  receiver: 'team-ops'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      repeat_interval: 1h

receivers:
  - name: 'team-ops'
    email_configs:
      - to: 'ops-team@example.com'
        send_resolved: true
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/<T_ID>/<B_ID>/<WEBHOOK_TOKEN>'
        channel: '#alerts'
        send_resolved: true
        title: '{{ if eq .Status "firing" }}:red_circle:{{ else }}:white_check_mark:{{ end }} {{ .CommonAnnotations.summary }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'your-pagerduty-integration-key'
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'instance']
```

`/etc/systemd/system/alertmanager.service`:

```ini
[Unit]
Description=Alertmanager
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager \
  --web.listen-address=127.0.0.1:9093
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```
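Before trusting the routing tree with real incidents, validate the config and push a synthetic alert through it with `amtool` (installed alongside alertmanager above). The label values here are illustrative:

```shell
# Static validation of the config file.
amtool check-config /etc/alertmanager/alertmanager.yml

# Fire a test alert at the running Alertmanager; it should arrive
# at the 'team-ops' receiver (email/Slack) within group_wait.
amtool alert add --alertmanager.url=http://127.0.0.1:9093 \
  alertname=TestAlert severity=warning instance=prod-web-01
```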


Grafana Installation and Datasource Provisioning


```bash
sudo apt-get install -y apt-transport-https software-properties-common
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt-get update && sudo apt-get install -y grafana
sudo systemctl enable --now grafana-server
```

Datasource provisioning (`/etc/grafana/provisioning/datasources/prometheus.yml`):

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
    editable: false

  - name: Loki
    type: loki
    access: proxy
    url: http://localhost:3100
    editable: false
```

Import the Node Exporter Full dashboard (ID 1860) via the Grafana UI: Dashboards → Import → enter 1860 → select the Prometheus datasource. Other useful dashboard IDs:
  • 3662 — Prometheus 2.0 Stats
  • 13659 — Node Exporter for Prometheus Dashboard
  • 10991 — Blackbox Exporter
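A quick liveness check before importing dashboards; the `/api/health` endpoint and default port 3000 are standard Grafana, but adjust the port if it was changed in `grafana.ini`:

```shell
# Returns a small JSON document; "database": "ok" indicates
# Grafana is up and can reach its backing store.
curl -s http://localhost:3000/api/health
```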


PLG Stack: Promtail + Loki


Loki Installation


```bash
LOKI_VERSION=3.1.0
wget https://github.com/grafana/loki/releases/download/v${LOKI_VERSION}/loki-linux-amd64.zip
unzip loki-linux-amd64.zip
sudo mv loki-linux-amd64 /usr/local/bin/loki
sudo mkdir -p /etc/loki /var/lib/loki
```

`/etc/loki/loki-config.yml`:

```yaml
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  instance_addr: 127.0.0.1
  path_prefix: /var/lib/loki
  storage:
    filesystem:
      chunks_directory: /var/lib/loki/chunks
      rules_directory: /var/lib/loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: loki_index_
        period: 24h

limits_config:
  retention_period: 30d           # Requires compactor below

compactor:
  working_directory: /var/lib/loki/compactor
  retention_enabled: true
  delete_request_cancel_period: 24h
```
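The zip download does not ship a unit file. A minimal sketch mirroring the pattern of the other services; it assumes a dedicated `loki` user has been created (e.g. `sudo useradd --no-create-home --shell /bin/false loki`) and given ownership of `/var/lib/loki`:

```ini
# /etc/systemd/system/loki.service
[Unit]
Description=Loki
Wants=network-online.target
After=network-online.target

[Service]
User=loki
Group=loki
Type=simple
ExecStart=/usr/local/bin/loki -config.file=/etc/loki/loki-config.yml
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```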

Promtail Configuration


`/etc/promtail/promtail-config.yml`:

```yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /var/lib/promtail/positions.yaml

clients:
  - url: http://localhost:3100/loki/api/v1/push

scrape_configs:
  # Systemd journal — captures all systemd unit logs
  - job_name: journal
    journal:
      max_age: 12h
      labels:
        job: systemd-journal
        host: prod-web-01
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: unit
      - source_labels: ['__journal_priority_keyword']
        target_label: level

  # Application log files
  - job_name: app_logs
    static_configs:
      - targets:
          - localhost
        labels:
          job: myapp
          host: prod-web-01
          __path__: /var/log/myapp/*.log
    pipeline_stages:
      - json:
          expressions:
            level: level
            msg: message
      - labels:
          level:
      - timestamp:
          source: timestamp
          format: RFC3339Nano

  # Nginx access logs
  - job_name: nginx
    static_configs:
      - targets:
          - localhost
        labels:
          job: nginx
          host: prod-web-01
          __path__: /var/log/nginx/access.log
```
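The pipeline stages can be exercised before enabling the service: Promtail's `--dry-run` flag prints what it would push instead of sending it to Loki. This sketch assumes the binary is installed at `/usr/local/bin/promtail` and runs as a `promtail` system user:

```shell
# Needed for the positions file referenced in the config.
sudo mkdir -p /var/lib/promtail

# Print parsed entries and their labels to stdout; nothing reaches Loki.
promtail -config.file=/etc/promtail/promtail-config.yml --dry-run

# The journal scrape config only works if the service user can read
# the journal; add it to the systemd-journal group.
sudo usermod -aG systemd-journal promtail
```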

LogQL Basics


Filter by label:

```logql
{job="myapp", level="error"}
```

Filter by content:

```logql
{job="nginx"} |= "500"
```

Pattern extraction:

```logql
{job="nginx"} | pattern `<ip> - - [<_>] "<method> <path> <_>" <status> <_>`
```

Rate of error log lines per minute:

```logql
rate({job="myapp", level="error"}[1m])
```

Count errors by unit over the last hour:

```logql
sum by(unit) (count_over_time({job="systemd-journal", level="err"}[1h]))
```

---
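These pieces compose: a parser can feed a label filter inside a range aggregation. As a sketch, assuming nginx's default combined log format matches the pattern shown above, this breaks the 5xx rate down by status code:

```logql
sum by(status) (
  rate({job="nginx"}
    | pattern `<ip> - - [<_>] "<method> <path> <_>" <status> <_>`
    | status=~"5.."
  [5m])
)
```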

System-Level Diagnostic Tools


| Tool | Key Usage | Notes |
|---|---|---|
| `htop` | Interactive process viewer | `F5` tree view, `F6` sort, `F9` kill, `u` filter by user |
| `iotop` | I/O per process | `sudo iotop -o` (only active), `-a` accumulated |
| `nethogs` | Bandwidth per process | `sudo nethogs eth0` |
| `ss` | Socket statistics (replaces netstat) | `ss -tlnp` TCP listening, `ss -s` summary, `ss -o state time-wait` |
| `vmstat` | Memory/swap/CPU overview | `vmstat 1 10` (10 samples, 1 s interval) |
| `iostat` | Disk I/O stats | `iostat -xz 1` extended, `iostat -m` in MB/s |
| `dstat` | Combined resource stats | `dstat -cdngy` CPU/disk/net/page/sys |
| `sar` | Historical performance (sysstat) | `sar -u 1 5` CPU, `sar -r 1 5` memory, `sar -b 1 5` I/O |
| `free -h` | Memory and swap summary | Check buff/cache vs actual available |
| `df -h` | Disk space by filesystem | Add `-i` for inode usage |
| `du -sh` | Directory size | `du -sh /var/log/*` |


Uptime Monitoring


UptimeRobot (Free Tier)


  • Create an HTTP(S) monitor: URL, check interval (5 min on the free tier), keyword match
  • Alert contacts: email + Slack webhook
  • Status page: public URL for incident communication
  • TCP monitors for non-HTTP services (database ports, SMTP)

healthchecks.io (Cron Job Monitoring)


Add a curl ping at the end of every cron script:

```bash
#!/bin/bash
set -euo pipefail

# ... backup/maintenance logic here ...

# Signal success to healthchecks.io
curl -fsS --retry 3 https://hc-ping.com/YOUR-CHECK-UUID > /dev/null
```

For jobs that should ping start + finish:

```bash
curl -fsS --retry 3 https://hc-ping.com/YOUR-CHECK-UUID/start > /dev/null

# ... job logic ...

curl -fsS --retry 3 https://hc-ping.com/YOUR-CHECK-UUID > /dev/null
```

---
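For scripts where failure should also be reported explicitly, hc-ping.com accepts the job's exit status appended to the ping URL (0 marks the check up, anything else down). A sketch using a single EXIT trap, with YOUR-CHECK-UUID as a placeholder:

```shell
#!/bin/bash
set -uo pipefail

HC_URL="https://hc-ping.com/YOUR-CHECK-UUID"

# Report whatever exit status the script ends with, success or failure,
# from one place instead of a ping at the end of every branch.
trap 'curl -fsS --retry 3 "${HC_URL}/$?" > /dev/null' EXIT

# ... job logic ...
```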

Anti-Patterns


| Anti-Pattern | Problem | Fix |
|---|---|---|
| No alerting configured | Incidents discovered by users, not the ops team | Set up Alertmanager with email + Slack from day one |
| Scrape interval < 10s | High ingestion load on Prometheus, noisy data | Use 15s–60s depending on metric volatility |
| No retention limits | Prometheus disk fills up and crashes | Set `--storage.tsdb.retention.time=30d` and `--storage.tsdb.retention.size` |
| Monitoring server on the same host being monitored | Single point of failure: if the server dies, so does the monitor | Run Prometheus on a dedicated monitoring host or use an external uptime service |
| Exposing Node Exporter on 0.0.0.0 | Metrics data leaked publicly | Bind to `127.0.0.1:9100`; use an SSH tunnel or VPN for remote scraping |
| Catching all logs without labels | Cannot filter Loki queries efficiently | Add `job`, `host`, `level` labels in the Promtail config |
| Alert rules without a `for` duration | Flapping alerts on transient spikes | Always add `for: 2m` or longer to avoid noise |
| No `inhibit_rules` in Alertmanager | Warning floods when a critical alert is also firing | Inhibit warnings when a critical alert matches the same instance |
| SMTP password in alertmanager.yml committed to git | Credential leak | Use environment variable substitution or external secret management |
| No Grafana datasource provisioning | Dashboard import fails after a Grafana reinstall | Provision datasources as YAML in `/etc/grafana/provisioning/` |
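On the SMTP credential row: recent Alertmanager releases support `_file` variants for most secret fields, which keeps the password out of the committed YAML entirely. A sketch; verify the option exists in the Alertmanager version in use, and the path shown is an assumption:

```yaml
global:
  smtp_auth_username: 'alerts@example.com'
  # Read the password from a file kept out of version control
  # (e.g. mode 600, owned by the service user).
  smtp_auth_password_file: /etc/alertmanager/smtp_password
```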


Troubleshooting


| Symptom | Likely Cause | Diagnostic & Fix |
|---|---|---|
| Metrics not appearing in Prometheus | Node Exporter not running or wrong port | `curl http://127.0.0.1:9100/metrics`; check systemd status, verify the `prometheus.yml` target |
| `up == 0` for a target | Prometheus cannot reach the scrape endpoint | Check the firewall (`ss -tlnp`), test with `curl` from the Prometheus host, verify the target address |
| Alert not firing despite condition met | `for` duration not elapsed, or rule file not loaded | `promtool check rules`; verify the rule file path in `prometheus.yml`, check Prometheus logs |
| Alert firing but no notification | Alertmanager not configured or unreachable | Test Alertmanager with `amtool alert add`; check SMTP credentials and the Slack webhook URL |
| Grafana blank dashboard after import | Wrong datasource selected | Edit dashboard → Panel → Query → verify the datasource dropdown matches the provisioned name |
| Loki not ingesting logs | Promtail cannot connect or wrong URL | `curl http://localhost:3100/ready`; check Promtail logs (`journalctl -u promtail`) |
| Promtail not tailing the journal | Missing journal permissions | Add the `promtail` user to the `systemd-journal` group: `usermod -aG systemd-journal promtail` |
| High cardinality error in Loki | Too many unique label combinations | Avoid high-cardinality labels (request IDs, user IDs); use log content for those |
| Prometheus OOM-killed | Too many metrics or insufficient retention pruning | Add a `--storage.tsdb.retention.size` limit, reduce scrape targets or lengthen scrape intervals |
| `ALERTS` metric not visible | Prometheus is evaluating no rules | Confirm the `rule_files` glob matches actual files: `ls /etc/prometheus/rules/*.yml` |