infrastructure-monitor

Infrastructure Monitor

Set up comprehensive monitoring and observability.

Quick Start

Use Prometheus for metrics, Grafana for dashboards, and Loki for logs, then set up alerts for critical issues.

Instructions

Metrics with Prometheus

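Application instrumentation:
javascript
const express = require('express');
const prometheus = require('prom-client');

const app = express();

const httpRequestDuration = new prometheus.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code']
});

// Record the duration of every request once its response has been sent.
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    // req.route is undefined for unmatched requests; use a fixed label to bound cardinality.
    const route = req.route?.path ?? 'unmatched';
    httpRequestDuration.labels(req.method, route, res.statusCode).observe(duration);
  });
  next();
});

// Expose the metrics so Prometheus can scrape app:3000/metrics.
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', prometheus.register.contentType);
  res.end(await prometheus.register.metrics());
});

app.listen(3000);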
Prometheus config:
yaml
scrape_configs:
  - job_name: 'app'
    static_configs:
      - targets: ['app:3000']
    scrape_interval: 15s
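
To show what a histogram metric such as http_request_duration_seconds actually stores, here is a dependency-free sketch of cumulative buckets and of the interpolation that a quantile query performs over them. It is an illustration only, not prom-client's or Prometheus's implementation: the bucket bounds in BOUNDS and the function names createHistogram, observe, and estimateQuantile are hypothetical.

```javascript
// Hypothetical bucket upper bounds in seconds; the last bucket is always +Inf.
const BOUNDS = [0.1, 0.5, 1, Infinity];

function createHistogram() {
  return { counts: BOUNDS.map(() => 0), sum: 0 };
}

// Buckets are cumulative: an observation increments every bucket
// whose upper bound is greater than or equal to the observed value.
function observe(histogram, value) {
  BOUNDS.forEach((bound, i) => {
    if (value <= bound) histogram.counts[i] += 1;
  });
  histogram.sum += value;
}

// Estimate a quantile (0 <= q <= 1) by linear interpolation inside the
// bucket that contains the target rank, similar in spirit to histogram_quantile().
function estimateQuantile(histogram, q) {
  const total = histogram.counts[histogram.counts.length - 1];
  if (total === 0) return NaN;
  const rank = q * total;
  const i = histogram.counts.findIndex((count) => count >= rank);
  // The +Inf bucket has no upper bound to interpolate towards.
  if (BOUNDS[i] === Infinity) return BOUNDS[i - 1];
  const lowerBound = i === 0 ? 0 : BOUNDS[i - 1];
  const below = i === 0 ? 0 : histogram.counts[i - 1];
  const inBucket = histogram.counts[i] - below;
  if (inBucket === 0) return lowerBound;
  return lowerBound + (BOUNDS[i] - lowerBound) * ((rank - below) / inBucket);
}
```

The estimate can only be as precise as the bucket layout, so buckets are usually chosen to bracket the latencies that matter for the service.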

Dashboards with Grafana

Key metrics to monitor:
  • Request rate (requests/second)
  • Error rate (errors/total requests)
  • Response time (p50, p95, p99)
  • CPU and memory usage
  • Database query time
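
As a concrete reading of the first three metrics, the sketch below derives request rate, error rate, and latency percentiles from a list of raw request samples. The sample shape ({ durationMs, statusCode }) and the function names are assumptions made for illustration; in a real setup these values would normally come from Prometheus queries rather than from application code.

```javascript
// Nearest-rank percentile over an ascending array of numbers.
function percentile(sortedValues, p) {
  if (sortedValues.length === 0) return NaN;
  const rank = Math.ceil((p * sortedValues.length) / 100);
  return sortedValues[Math.max(rank, 1) - 1];
}

// Summarize hypothetical samples collected over a window of windowSeconds.
function summarize(samples, windowSeconds) {
  const durations = samples.map((s) => s.durationMs).sort((a, b) => a - b);
  // Count server-side failures (5xx) as errors.
  const errors = samples.filter((s) => s.statusCode >= 500).length;
  return {
    requestRate: samples.length / windowSeconds, // requests per second
    errorRate: samples.length === 0 ? 0 : errors / samples.length,
    p50: percentile(durations, 50),
    p95: percentile(durations, 95),
    p99: percentile(durations, 99),
  };
}
```

Showing p95 and p99 next to p50 matters because a healthy median can hide a slow tail.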

Logging with Loki

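Structured logging:
javascript
const winston = require('winston');

// Emit one JSON object per line on stdout, where a log agent such as Promtail can collect it for Loki.
const logger = winston.createLogger({
  format: winston.format.json(),
  transports: [
    new winston.transports.Console()
  ]
});

// Inside a request handler, where `user` and `req` are in scope:
logger.info('User logged in', { userId: user.id, ip: req.ip });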
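
Structured logs pay off at query time, because every line can be parsed into fields and filtered on them. The sketch below mimics that idea in plain JavaScript over hypothetical JSON log lines. It is not how Loki evaluates queries, and parseLogLines and filterLogs are names invented for this example.

```javascript
// Parse JSON log lines, silently skipping anything that is not a JSON object.
function parseLogLines(lines) {
  return lines
    .map((line) => {
      try {
        return JSON.parse(line);
      } catch {
        return null; // not valid JSON, e.g. a stack trace fragment
      }
    })
    .filter((entry) => entry !== null && typeof entry === 'object');
}

// Keep only the entries whose fields match every key/value pair in criteria.
function filterLogs(entries, criteria) {
  return entries.filter((entry) =>
    Object.entries(criteria).every(([key, value]) => entry[key] === value)
  );
}
```

With free-text log messages the same filter would need a fragile regular expression per field, which is the main argument for logging JSON in the first place.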

Alerting

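Alert rules:
yaml
groups:
  - name: app_alerts
    rules:
      - alert: HighErrorRate
        # Share of requests answered with a 5xx status over the last 5 minutes,
        # computed from the _count series of the histogram defined above.
        expr: |
          sum(rate(http_request_duration_seconds_count{status_code=~"5.."}[5m]))
            / sum(rate(http_request_duration_seconds_count[5m])) > 0.05
        for: 5m
        annotations:
          summary: "High error rate detected"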
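
The `for: 5m` clause means the condition has to hold on every evaluation for five minutes before the alert fires; until then the alert is only pending. The simulation below sketches that state machine under simplifying assumptions: evaluations is a hypothetical list of { time, value } points with time in seconds, and alertStates is a name invented for this example, not part of Prometheus.

```javascript
// Return the alert state after each evaluation of a rule with a `for` duration.
function alertStates(evaluations, threshold, forSeconds) {
  let activeSince = null; // time at which the condition first became true
  return evaluations.map(({ time, value }) => {
    if (value <= threshold) {
      activeSince = null; // a single false evaluation resets the timer
      return { time, state: 'inactive' };
    }
    if (activeSince === null) activeSince = time;
    const state = time - activeSince >= forSeconds ? 'firing' : 'pending';
    return { time, state };
  });
}
```

A short spike therefore never pages anyone, at the cost of a detection delay at least as long as the `for` duration.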

Best Practices

  • Monitor golden signals (latency, traffic, errors, saturation)
  • Set up actionable alerts
  • Use log aggregation
  • Implement distributed tracing
  • Create runbooks for alerts
  • Review dashboards regularly