
Server Monitoring — Production Stack


Stack Overview


| Layer | Tool | Purpose |
|---|---|---|
| Metrics collection | Node Exporter | OS/hardware metrics from `/proc` and `/sys` |
| Metrics scraping | Prometheus | Pull-based time-series database |
| Visualization | Grafana | Dashboards and alerting UI |
| Alerting | Alertmanager | Route, deduplicate, silence alerts |
| Log shipping | Promtail | Tail logs → push to Loki |
| Log aggregation | Loki | Log storage with label-based indexing |
| Uptime (external) | UptimeRobot | External HTTP/TCP reachability checks |
| Cron monitoring | healthchecks.io | Detect silent cron job failures |


Node Exporter Installation (systemd)


```bash
NODE_EXPORTER_VERSION=1.8.2
wget https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
tar xvf node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
sudo cp node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64/node_exporter /usr/local/bin/
sudo useradd --no-create-home --shell /bin/false node_exporter
```

`/etc/systemd/system/node_exporter.service`:

```ini
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --collector.systemd \
  --collector.processes \
  --collector.diskstats \
  --web.listen-address=127.0.0.1:9100
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

```bash
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
sudo systemctl status node_exporter
curl -s http://127.0.0.1:9100/metrics | head -20
```

Bind Node Exporter to `127.0.0.1:9100`; never expose it directly on 0.0.0.0. Prometheus scrapes it locally; use an SSH tunnel or VPN for a remote Prometheus.
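One way to let a remote Prometheus scrape a loopback-only exporter is a persistent SSH local forward. A minimal sketch, not the only option: the host alias `prod-web-01` and local port 19100 are illustrative assumptions, and in practice the tunnel should run under systemd or autossh so it survives disconnects.

```shell
# Forward local port 19100 to the exporter on prod-web-01's loopback.
# -N: no remote command; -T: no TTY allocation.
ssh -N -T -L 19100:127.0.0.1:9100 prod-web-01
```

The monitoring host's Prometheus then scrapes `localhost:19100` as if the exporter were local.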


Prometheus Installation and Configuration


```bash
PROMETHEUS_VERSION=2.53.0
wget https://github.com/prometheus/prometheus/releases/download/v${PROMETHEUS_VERSION}/prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz
tar xvf prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz
sudo cp prometheus-${PROMETHEUS_VERSION}.linux-amd64/{prometheus,promtool} /usr/local/bin/
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo cp -r prometheus-${PROMETHEUS_VERSION}.linux-amd64/{consoles,console_libraries} /etc/prometheus/
sudo useradd --no-create-home --shell /bin/false prometheus
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
```

`/etc/prometheus/prometheus.yml`:

```yaml
global:
  scrape_interval: 15s           # Default scrape interval
  evaluation_interval: 15s       # Rule evaluation interval
  scrape_timeout: 10s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  # Node Exporter — OS metrics
  - job_name: node
    static_configs:
      - targets: ['localhost:9100']
        labels:
          server: 'prod-web-01'
          env: production

  # Application metrics — assumes /metrics on port 3000
  - job_name: app
    metrics_path: /metrics
    static_configs:
      - targets: ['localhost:3000']
        labels:
          app: myapp
          env: production

  # Blackbox Exporter — probe HTTP endpoints
  - job_name: blackbox_http
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://example.com/api/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115   # Blackbox Exporter address

  # Prometheus self-monitoring
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
```

`/etc/systemd/system/prometheus.service`:

```ini
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=10GB \
  --web.listen-address=127.0.0.1:9090 \
  --web.enable-lifecycle
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

```bash
sudo systemctl daemon-reload
sudo systemctl enable --now prometheus
promtool check config /etc/prometheus/prometheus.yml
```
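Because the unit file passes `--web.enable-lifecycle`, Prometheus also exposes a reload endpoint, so later config edits can be applied without a restart. Validate first, then reload:

```shell
# Re-read prometheus.yml in place; only reaches a running Prometheus
# started with --web.enable-lifecycle on 127.0.0.1:9090.
promtool check config /etc/prometheus/prometheus.yml \
  && curl -X POST http://127.0.0.1:9090/-/reload
```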


Alert Rules


`/etc/prometheus/rules/server.yml`:

```yaml
groups:
  - name: server_alerts
    interval: 1m
    rules:

      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.job }}/{{ $labels.instance }} has been unreachable for more than 2 minutes."

      - alert: HighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ printf \"%.1f\" $value }}% (threshold 85%)."

      - alert: HighMemory
        expr: |
          (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
          / node_memory_MemTotal_bytes * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ printf \"%.1f\" $value }}% (threshold 90%)."

      - alert: DiskAlmostFull
        expr: |
          (node_filesystem_size_bytes{fstype!="tmpfs"} - node_filesystem_free_bytes{fstype!="tmpfs"})
          / node_filesystem_size_bytes{fstype!="tmpfs"} * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk almost full on {{ $labels.instance }}"
          description: "Filesystem {{ $labels.mountpoint }} is {{ printf \"%.1f\" $value }}% full."

      - alert: HighLoad
        expr: node_load15 / count without(cpu, mode)(node_cpu_seconds_total{mode="idle"}) > 2.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High system load on {{ $labels.instance }}"
          description: "15-minute load average per CPU core is {{ printf \"%.2f\" $value }} (threshold 2.0)."
```

Validate rules:

```bash
promtool check rules /etc/prometheus/rules/server.yml
```
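The HighMemory expression is the standard "used = MemTotal - MemAvailable" calculation, so the threshold can be sanity-checked on any Linux host straight from `/proc/meminfo` before the rule ever fires:

```shell
# Compute used-memory % the same way the HighMemory rule does,
# from the MemTotal and MemAvailable lines of /proc/meminfo (kB values).
awk '/^MemTotal:/     {t=$2}
     /^MemAvailable:/ {a=$2}
     END { printf "used: %.1f%%\n", (t - a) / t * 100 }' /proc/meminfo
```

If this prints well under 90% while the alert fires (or vice versa), suspect stale scrapes rather than the expression.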


Alertmanager


```bash
ALERTMANAGER_VERSION=0.27.0
wget https://github.com/prometheus/alertmanager/releases/download/v${ALERTMANAGER_VERSION}/alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz
tar xvf alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz
sudo cp alertmanager-${ALERTMANAGER_VERSION}.linux-amd64/{alertmanager,amtool} /usr/local/bin/
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
```

`/etc/alertmanager/alertmanager.yml`:

```yaml
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'app-specific-password'   # Use App Password, not account password
  smtp_require_tls: true
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'instance']
  group_wait: 30s        # Wait before sending first notification for a new group
  group_interval: 5m     # How long to wait before notifying about new alerts in the same group
  repeat_interval: 4h    # How often to re-send unresolved alerts
  receiver: 'team-ops'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      repeat_interval: 1h

receivers:
  - name: 'team-ops'
    email_configs:
      - to: 'ops-team@example.com'
        send_resolved: true
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/<T_ID>/<B_ID>/<WEBHOOK_TOKEN>'
        channel: '#alerts'
        send_resolved: true
        title: '{{ if eq .Status "firing" }}:red_circle:{{ else }}:white_check_mark:{{ end }} {{ .CommonAnnotations.summary }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'your-pagerduty-integration-key'
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'instance']
```

`/etc/systemd/system/alertmanager.service`:

```ini
[Unit]
Description=Alertmanager
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager \
  --web.listen-address=127.0.0.1:9093
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```
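Before trusting the routing tree with real incidents, validate the config and push a synthetic alert through it with `amtool` (installed alongside alertmanager above). The label values here are illustrative:

```shell
# Static validation of the config file.
amtool check-config /etc/alertmanager/alertmanager.yml

# Fire a test alert at the running Alertmanager; it should arrive
# at the 'team-ops' receiver (email/Slack) within group_wait.
amtool alert add --alertmanager.url=http://127.0.0.1:9093 \
  alertname=TestAlert severity=warning instance=prod-web-01
```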


Grafana Installation and Datasource Provisioning


```bash
sudo apt-get install -y apt-transport-https software-properties-common
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt-get update && sudo apt-get install -y grafana
sudo systemctl enable --now grafana-server
```

Datasource provisioning (`/etc/grafana/provisioning/datasources/prometheus.yml`):

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
    editable: false

  - name: Loki
    type: loki
    access: proxy
    url: http://localhost:3100
    editable: false
```

Import the Node Exporter Full dashboard (ID 1860) via the Grafana UI: Dashboards → Import → enter 1860 → select the Prometheus datasource. Other useful dashboard IDs:
  • 3662 — Prometheus 2.0 Stats
  • 13659 — Node Exporter for Prometheus Dashboard
  • 10991 — Blackbox Exporter
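A quick liveness check before importing dashboards; the `/api/health` endpoint and default port 3000 are standard Grafana, but adjust the port if it was changed in `grafana.ini`:

```shell
# Returns a small JSON document; "database": "ok" indicates
# Grafana is up and can reach its backing store.
curl -s http://localhost:3000/api/health
```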


PLG Stack: Promtail + Loki


Loki Installation


```bash
LOKI_VERSION=3.1.0
wget https://github.com/grafana/loki/releases/download/v${LOKI_VERSION}/loki-linux-amd64.zip
unzip loki-linux-amd64.zip
sudo mv loki-linux-amd64 /usr/local/bin/loki
sudo mkdir -p /etc/loki /var/lib/loki
```

`/etc/loki/loki-config.yml`:

```yaml
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  instance_addr: 127.0.0.1
  path_prefix: /var/lib/loki
  storage:
    filesystem:
      chunks_directory: /var/lib/loki/chunks
      rules_directory: /var/lib/loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: loki_index_
        period: 24h

limits_config:
  retention_period: 30d           # Requires compactor below

compactor:
  working_directory: /var/lib/loki/compactor
  retention_enabled: true
  delete_request_cancel_period: 24h
```
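The zip download does not ship a unit file. A minimal sketch mirroring the pattern of the other services; it assumes a dedicated `loki` user has been created (e.g. `sudo useradd --no-create-home --shell /bin/false loki`) and given ownership of `/var/lib/loki`:

```ini
# /etc/systemd/system/loki.service
[Unit]
Description=Loki
Wants=network-online.target
After=network-online.target

[Service]
User=loki
Group=loki
Type=simple
ExecStart=/usr/local/bin/loki -config.file=/etc/loki/loki-config.yml
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```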

Promtail Configuration


`/etc/promtail/promtail-config.yml`:

```yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /var/lib/promtail/positions.yaml

clients:
  - url: http://localhost:3100/loki/api/v1/push

scrape_configs:
  # Systemd journal — captures all systemd unit logs
  - job_name: journal
    journal:
      max_age: 12h
      labels:
        job: systemd-journal
        host: prod-web-01
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: unit
      - source_labels: ['__journal_priority_keyword']
        target_label: level

  # Application log files
  - job_name: app_logs
    static_configs:
      - targets:
          - localhost
        labels:
          job: myapp
          host: prod-web-01
          __path__: /var/log/myapp/*.log
    pipeline_stages:
      - json:
          expressions:
            level: level
            msg: message
      - labels:
          level:
      - timestamp:
          source: timestamp
          format: RFC3339Nano

  # Nginx access logs
  - job_name: nginx
    static_configs:
      - targets:
          - localhost
        labels:
          job: nginx
          host: prod-web-01
          __path__: /var/log/nginx/access.log
```
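The pipeline stages can be exercised before enabling the service: Promtail's `--dry-run` flag prints what it would push instead of sending it to Loki. This sketch assumes the binary is installed at `/usr/local/bin/promtail` and runs as a `promtail` system user:

```shell
# Needed for the positions file referenced in the config.
sudo mkdir -p /var/lib/promtail

# Print parsed entries and their labels to stdout; nothing reaches Loki.
promtail -config.file=/etc/promtail/promtail-config.yml --dry-run

# The journal scrape config only works if the service user can read
# the journal; add it to the systemd-journal group.
sudo usermod -aG systemd-journal promtail
```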

LogQL Basics


Filter by label:

```logql
{job="myapp", level="error"}
```

Filter by content:

```logql
{job="nginx"} |= "500"
```

Pattern extraction:

```logql
{job="nginx"} | pattern `<ip> - - [<_>] "<method> <path> <_>" <status> <_>`
```

Rate of error log lines per minute:

```logql
rate({job="myapp", level="error"}[1m])
```

Count errors by unit over the last hour:

```logql
sum by(unit) (count_over_time({job="systemd-journal", level="err"}[1h]))
```

---
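These pieces compose: a parser can feed a label filter inside a range aggregation. As a sketch, assuming nginx's default combined log format matches the pattern shown above, this breaks the 5xx rate down by status code:

```logql
sum by(status) (
  rate({job="nginx"}
    | pattern `<ip> - - [<_>] "<method> <path> <_>" <status> <_>`
    | status=~"5.."
  [5m])
)
```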

System-Level Diagnostic Tools


| Tool | Key Usage | Notes |
|---|---|---|
| `htop` | Interactive process viewer | `F5` tree view, `F6` sort, `F9` kill, `u` filter by user |
| `iotop` | I/O per process | `sudo iotop -o` (only active), `-a` accumulated |
| `nethogs` | Bandwidth per process | `sudo nethogs eth0` |
| `ss` | Socket statistics (replaces netstat) | `ss -tlnp` TCP listening, `ss -s` summary, `ss -o state time-wait` |
| `vmstat` | Memory/swap/CPU overview | `vmstat 1 10` (10 samples, 1 s interval) |
| `iostat` | Disk I/O stats | `iostat -xz 1` extended, `iostat -m` in MB/s |
| `dstat` | Combined resource stats | `dstat -cdngy` CPU/disk/net/page/sys |
| `sar` | Historical performance (sysstat) | `sar -u 1 5` CPU, `sar -r 1 5` memory, `sar -b 1 5` I/O |
| `free -h` | Memory and swap summary | Check buff/cache vs actual available |
| `df -h` | Disk space by filesystem | Add `-i` for inode usage |
| `du -sh` | Directory size | `du -sh /var/log/*` |


Uptime Monitoring


UptimeRobot (Free Tier)


  • Create an HTTP(S) monitor: URL, check interval (5 min on the free tier), keyword match
  • Alert contacts: email + Slack webhook
  • Status page: public URL for incident communication
  • TCP monitors for non-HTTP services (database ports, SMTP)

healthchecks.io (Cron Job Monitoring)


Add a curl ping at the end of every cron script:

```bash
#!/bin/bash
set -euo pipefail

# ... backup/maintenance logic here ...

# Signal success to healthchecks.io
curl -fsS --retry 3 https://hc-ping.com/YOUR-CHECK-UUID > /dev/null
```

For jobs that should ping start + finish:

```bash
curl -fsS --retry 3 https://hc-ping.com/YOUR-CHECK-UUID/start > /dev/null

# ... job logic ...

curl -fsS --retry 3 https://hc-ping.com/YOUR-CHECK-UUID > /dev/null
```

---
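For scripts where failure should also be reported explicitly, hc-ping.com accepts the job's exit status appended to the ping URL (0 marks the check up, anything else down). A sketch using a single EXIT trap, with YOUR-CHECK-UUID as a placeholder:

```shell
#!/bin/bash
set -uo pipefail

HC_URL="https://hc-ping.com/YOUR-CHECK-UUID"

# Report whatever exit status the script ends with, success or failure,
# from one place instead of a ping at the end of every branch.
trap 'curl -fsS --retry 3 "${HC_URL}/$?" > /dev/null' EXIT

# ... job logic ...
```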

Anti-Patterns


| Anti-Pattern | Problem | Fix |
|---|---|---|
| No alerting configured | Incidents discovered by users, not the ops team | Set up Alertmanager with email + Slack from day one |
| Scrape interval < 10s | High ingestion load on Prometheus, noisy data | Use 15s–60s depending on metric volatility |
| No retention limits | Prometheus disk fills up and crashes | Set `--storage.tsdb.retention.time=30d` and `--storage.tsdb.retention.size` |
| Monitoring server on the same host being monitored | Single point of failure: if the server dies, so does the monitor | Run Prometheus on a dedicated monitoring host or use an external uptime service |
| Exposing Node Exporter on 0.0.0.0 | Metrics data leaked publicly | Bind to `127.0.0.1:9100`; use an SSH tunnel or VPN for remote scraping |
| Catching all logs without labels | Cannot filter Loki queries efficiently | Add `job`, `host`, `level` labels in the Promtail config |
| Alert rules without a `for` duration | Flapping alerts on transient spikes | Always add `for: 2m` or longer to avoid noise |
| No `inhibit_rules` in Alertmanager | Warning floods when a critical alert is also firing | Inhibit warnings when a critical alert matches the same instance |
| SMTP password in alertmanager.yml committed to git | Credential leak | Use environment variable substitution or external secret management |
| No Grafana datasource provisioning | Dashboard import fails after a Grafana reinstall | Provision datasources as YAML in `/etc/grafana/provisioning/` |
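On the SMTP credential row: recent Alertmanager releases support `_file` variants for most secret fields, which keeps the password out of the committed YAML entirely. A sketch; verify the option exists in the Alertmanager version in use, and the path shown is an assumption:

```yaml
global:
  smtp_auth_username: 'alerts@example.com'
  # Read the password from a file kept out of version control
  # (e.g. mode 600, owned by the service user).
  smtp_auth_password_file: /etc/alertmanager/smtp_password
```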


Troubleshooting


| Symptom | Likely Cause | Diagnostic & Fix |
|---|---|---|
| Metrics not appearing in Prometheus | Node Exporter not running or wrong port | `curl http://127.0.0.1:9100/metrics`; check systemd status, verify the `prometheus.yml` target |
| `up == 0` for a target | Prometheus cannot reach the scrape endpoint | Check the firewall (`ss -tlnp`), test with `curl` from the Prometheus host, verify the target address |
| Alert not firing despite condition met | `for` duration not elapsed, or rule file not loaded | `promtool check rules`; verify the rule file path in `prometheus.yml`, check Prometheus logs |
| Alert firing but no notification | Alertmanager not configured or unreachable | Test Alertmanager with `amtool alert add`; check SMTP credentials and the Slack webhook URL |
| Grafana blank dashboard after import | Wrong datasource selected | Edit dashboard → Panel → Query → verify the datasource dropdown matches the provisioned name |
| Loki not ingesting logs | Promtail cannot connect or wrong URL | `curl http://localhost:3100/ready`; check Promtail logs (`journalctl -u promtail`) |
| Promtail not tailing the journal | Missing journal permissions | Add the `promtail` user to the `systemd-journal` group: `usermod -aG systemd-journal promtail` |
| High cardinality error in Loki | Too many unique label combinations | Avoid high-cardinality labels (request IDs, user IDs); use log content for those |
| Prometheus OOM-killed | Too many metrics or insufficient retention pruning | Add a `--storage.tsdb.retention.size` limit, reduce scrape targets or lengthen scrape intervals |
| `ALERTS` metric not visible | Prometheus is evaluating no rules | Confirm the `rule_files` glob matches actual files: `ls /etc/prometheus/rules/*.yml` |