Observability & Site Reliability Engineering


Core Principles


  • Three Pillars — Metrics, Logs, and Traces provide holistic visibility
  • Observability-First — Build systems that explain their own behavior
  • SLO-Driven — Define reliability targets that matter to users
  • Proactive Detection — Find issues before customers do
  • Blameless Culture — Learn from failures without blame
  • Automate Toil — Reduce repetitive operational work
  • Continuous Improvement — Each incident makes systems more resilient
  • Full-Stack Visibility — Monitor from infrastructure to business metrics


Hard Rules (Must Follow)


These rules are mandatory. Violating them means the skill is not working correctly.

Symptom-Based Alerts Only


Alert on user-facing symptoms, not internal infrastructure metrics.

❌ FORBIDDEN: Alerting on internal metrics


```yaml
# Users don't care about CPU; they care about latency
- alert: CPUHigh
  expr: cpu_usage > 70

# Internal metric; may not affect users
- alert: MemoryHigh
  expr: memory_usage > 80
```

✅ REQUIRED: Alert on user experience


```yaml
- alert: APILatencyHigh
  expr: slo:api_latency:p95 > 0.200
  annotations:
    summary: "Users experiencing slow response times"

- alert: ErrorRateHigh
  expr: slo:api_errors:rate5m > 0.001
  annotations:
    summary: "Users encountering errors"
```

Low Cardinality Labels


Loki/Prometheus labels must be low-cardinality: keep to a handful of label names (fewer than ~10), each with a small, bounded set of values.

❌ FORBIDDEN: High cardinality labels


```yaml
labels:
  user_id: "usr_123"     # Millions of values!
  order_id: "ord_456"    # Millions of values!
  request_id: "req_789"  # Every request is unique!
```

✅ REQUIRED: Low cardinality only


```yaml
labels:
  namespace: "production"  # Few values
  app: "api-server"        # Few values
  level: "error"           # 5-6 values
  method: "GET"            # ~10 values
```

High cardinality data goes in log body:


```typescript
logger.info({
  user_id: "usr_123",   // In the JSON body, not a label
  order_id: "ord_456",
}, "Order processed");
```
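One way to enforce this split mechanically is to route event fields through an allowlist before shipping. A minimal TypeScript sketch (a hypothetical helper, not part of any logging library; the allowlist contents are illustrative):

```typescript
// Only these low-cardinality keys may become labels; everything else
// stays in the log body where high cardinality is harmless.
const LABEL_ALLOWLIST = new Set(["namespace", "app", "level", "method"]);

function splitLabels(fields: Record<string, string>): {
  labels: Record<string, string>;
  body: Record<string, string>;
} {
  const labels: Record<string, string> = {};
  const body: Record<string, string> = {};
  for (const [key, value] of Object.entries(fields)) {
    (LABEL_ALLOWLIST.has(key) ? labels : body)[key] = value;
  }
  return { labels, body };
}
```

A guard like this keeps a stray `user_id` from ever reaching the label set, regardless of what individual call sites do.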

SLO-Based Error Budgets


Every service must have defined SLOs with error budget tracking.

❌ FORBIDDEN: No SLO definition


Just monitoring without targets


✅ REQUIRED: Explicit SLO with budget


SLO: 99.9% availability


Error Budget: 0.1% = 43.2 minutes/month downtime


```yaml
groups:
  - name: slo_tracking
    rules:
      - record: slo:api_availability:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
      - alert: ErrorBudgetBurnRate
        expr: slo:api_availability:ratio < 0.999
        for: 5m
        annotations:
          summary: "Burning error budget too fast"
```

Trace Context in Logs


All logs must include trace_id for correlation with distributed traces.
```typescript
// ❌ FORBIDDEN: Logs without trace context
logger.info("Payment processed");

// ✅ REQUIRED: Include trace_id in every log
const span = trace.getActiveSpan();
logger.info({
  trace_id: span?.spanContext().traceId,
  span_id: span?.spanContext().spanId,
  order_id: "ord_123",
}, "Payment processed");

// Output includes correlation:
// {"trace_id":"abc123","span_id":"def456","order_id":"ord_123","msg":"Payment processed"}
```


Quick Reference


When to Use What


| Scenario | Tool/Pattern | Reason |
| --- | --- | --- |
| Metrics collection | Prometheus + Grafana | Industry standard, powerful query language |
| Distributed tracing | OpenTelemetry + Tempo/Jaeger | Vendor-neutral, CNCF standard |
| Log aggregation (cost-sensitive) | Grafana Loki | Indexes only labels, 10x cheaper |
| Log aggregation (search-heavy) | ELK Stack | Full-text search, advanced analytics |
| Unified observability | Elastic/Datadog/Dynatrace | Single pane of glass for all telemetry |
| Incident management | PagerDuty/Opsgenie | Alert routing, on-call scheduling |
| Chaos engineering | Gremlin/Chaos Mesh | Controlled failure injection |
| AIOps/Anomaly detection | Dynatrace/Datadog | AI-driven root cause analysis |

The Three Pillars


| Pillar | What | When | Tools |
| --- | --- | --- | --- |
| Metrics | Numerical time-series data | Real-time monitoring, alerting | Prometheus, StatsD, CloudWatch |
| Logs | Event records with context | Debugging, audit trails | Loki, ELK, Splunk |
| Traces | Request journey across services | Performance analysis, dependencies | OpenTelemetry, Jaeger, Zipkin |

Fourth Pillar (Emerging): Continuous Profiling — Code-level performance data (CPU, memory usage at function level)

Observability Architecture


Layered Prometheus Setup



2025 Best Practice: Federated architecture


Prevents metric chaos while enabling drill-down


Layer 1: Application Prometheus

- Detailed business logic metrics
- High cardinality acceptable
- Short retention (7 days)

Layer 2: Cluster Prometheus

- Per-environment/cluster metrics
- Medium retention (30 days)
- Aggregates from application level

Layer 3: Global Prometheus

- Cross-cluster critical metrics
- Long retention (1 year)
- Federation from cluster level

Global Prometheus config


```yaml
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="kubernetes-nodes"}'
        - '{__name__=~"job:.*"}'  # Recording rules only
    static_configs:
      - targets:
          - 'cluster-prom-us-east.internal:9090'
          - 'cluster-prom-eu-west.internal:9090'
```

Recording Rules for Performance



Precompute expensive queries


```yaml
groups:
  - name: api_performance
    interval: 30s
    rules:
      # Request rate (requests per second)
      - record: job:api_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job, method, status)

      # Error rate
      - record: job:api_errors:rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)

      # P95 latency
      - record: job:api_latency:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
```
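To make the P95 rule less of a black box: histogram_quantile works on cumulative buckets and linearly interpolates inside the bucket that contains the target rank. A rough TypeScript sketch of that idea (simplified; real Prometheus also handles +Inf buckets and edge cases not shown here):

```typescript
// Buckets are cumulative and sorted by upper bound `le`.
function p95FromBuckets(buckets: Array<{ le: number; count: number }>): number {
  const total = buckets[buckets.length - 1].count;
  const rank = 0.95 * total; // position of the 95th-percentile observation
  let prevLe = 0;
  let prevCount = 0;
  for (const b of buckets) {
    if (b.count >= rank) {
      // Linear interpolation within the bucket holding the rank
      return prevLe + ((b.le - prevLe) * (rank - prevCount)) / (b.count - prevCount);
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return buckets[buckets.length - 1].le;
}
```

This is why bucket boundaries matter: the quantile is only as precise as the bucket the rank lands in.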

Resource Optimization



Increase scrape interval for high-target deployments


```yaml
scrape_interval: 30s  # Default is 15s; doubling it roughly halves scrape load
```

Use relabeling to drop unnecessary metrics


```yaml
metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'go_.*|process_.*'  # Drop Go runtime metrics
    action: drop
```

Limit sample retention


```yaml
storage:
  tsdb:
    retention.time: 15d   # Keep only 15 days locally
    retention.size: 50GB  # Or max 50GB
```

---

Distributed Tracing with OpenTelemetry


Auto-Instrumentation Setup


```typescript
// Node.js auto-instrumentation
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // Auto-instruments HTTP, Express, PostgreSQL, Redis, etc.
      '@opentelemetry/instrumentation-fs': { enabled: false }, // Too noisy
    }),
  ],
});

sdk.start();
```

Manual Instrumentation for Business Logic


```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-service', '1.0.0');

async function processPayment(orderId: string, amount: number) {
  // Create custom span for business operation
  return tracer.startActiveSpan('processPayment', async (span) => {
    try {
      // Add business context
      span.setAttributes({
        'order.id': orderId,
        'payment.amount': amount,
        'payment.currency': 'USD',
      });

      // Child span for external API call
      const paymentResult = await tracer.startActiveSpan('stripe.charge', async (childSpan) => {
        const result = await stripe.charges.create({ amount, currency: 'usd' });
        childSpan.setAttribute('stripe.charge_id', result.id);
        childSpan.setStatus({ code: SpanStatusCode.OK });
        childSpan.end();
        return result;
      });

      span.setStatus({ code: SpanStatusCode.OK });
      return paymentResult;
    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      throw error;
    } finally {
      span.end();
    }
  });
}
```

Sampling Strategies



OpenTelemetry Collector config


```yaml
processors:
  # Probabilistic sampling: keep 10% of traces
  probabilistic_sampler:
    sampling_percentage: 10

  # Tail sampling: make decisions after seeing the full trace
  tail_sampling:
    policies:
      # Always sample errors
      - name: error-traces
        type: status_code
        status_code: {status_codes: [ERROR]}

      # Always sample slow requests
      - name: slow-traces
        type: latency
        latency: {threshold_ms: 1000}

      # Sample 5% of normal traffic
      - name: normal-traces
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
```
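The decision logic those tail-sampling policies encode can be sketched in a few lines. This is an illustration of the policy order only, not the collector's actual implementation; `TraceSummary` and `keepTrace` are invented names, and `sample` is injectable so the behavior is testable:

```typescript
interface TraceSummary {
  hasError: boolean;
  durationMs: number;
}

function keepTrace(t: TraceSummary, sample: () => number = Math.random): boolean {
  if (t.hasError) return true;          // always sample errors
  if (t.durationMs > 1000) return true; // always sample slow requests
  return sample() < 0.05;               // 5% of normal traffic
}
```

The key property of tail sampling is visible here: the interesting traces (errors, slow requests) are kept deterministically, and randomness only applies to the boring majority.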

Context Propagation


```typescript
// Ensure trace context flows across services
import { propagation, context } from '@opentelemetry/api';

// Outgoing HTTP request (automatic with auto-instrumentation)
fetch('https://api.example.com/data', {
  headers: {
    // W3C Trace Context headers injected automatically:
    // traceparent: 00-<trace-id>-<span-id>-01
    // tracestate: vendor=value
  },
});

// Manual propagation for non-HTTP (e.g., message queues)
const carrier = {};
propagation.inject(context.active(), carrier);
await publishMessage(queue, { data: payload, headers: carrier });
```


Structured Logging Best Practices


JSON Logging Format

```typescript
// Use structured logging library
import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  timestamp: pino.stdTimeFunctions.isoTime,
  // Include trace context in logs
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};

    const { traceId, spanId } = span.spanContext();
    return {
      trace_id: traceId,
      span_id: spanId,
    };
  },
});

// Structured logging with context
logger.info(
  {
    user_id: '123',
    order_id: 'ord_456',
    amount: 99.99,
    payment_method: 'card',
  },
  'Payment processed successfully'
);

// Output:
// {"level":"info","time":"2025-01-15T10:30:00.000Z","trace_id":"abc123","span_id":"def456","user_id":"123","order_id":"ord_456","amount":99.99,"payment_method":"card","msg":"Payment processed successfully"}
```

Log Levels


```typescript
// Follow standard severity levels
logger.trace({ details }, 'Low-level debugging');     // Very verbose
logger.debug({ state }, 'Debug information');         // Development
logger.info({ event }, 'Normal operation');           // Production default
logger.warn({ issue }, 'Warning condition');          // Potential issues
logger.error({ error, context }, 'Error occurred');   // Errors
logger.fatal({ critical }, 'Fatal error');            // Process crash
```

---

Grafana Loki Configuration


Promtail config - ships logs to Loki


```yaml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:

scrape_configs:
  - job_name: kubernetes
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Add pod labels as Loki labels (LOW cardinality only!)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
    pipeline_stages:
      # Parse JSON logs
      - json:
          expressions:
            level: level
            trace_id: trace_id
      # Extract level as a label; trace_id stays in the log body (high cardinality)
      - labels:
          level:
```

Loki Best Practices


  • Low Cardinality Labels — Use only 5-10 labels (namespace, app, level)
  • High Cardinality in Log Body — Put user_id, order_id in JSON, not labels
  • LogQL for Filtering — Use {app="api"} | json | user_id="123"
  • Retention Policy — Keep recent logs longer, compress old logs

LogQL query examples


```logql
{namespace="production", app="api"} |= "error"                    # Text search
{app="api"} | json | level="error" | line_format "{{.msg}}"       # JSON parsing
rate({app="api"}[5m])                                             # Log rate per second
sum by (level) (count_over_time({namespace="production"}[1h]))    # Count by level
```

---

SLO/SLI/SLA Management


Definitions

  • SLI (Service Level Indicator) — Quantifiable measurement of service behavior
    • Examples: Request latency, error rate, availability, throughput
  • SLO (Service Level Objective) — Target value/range for an SLI
    • Examples: 99.9% availability, P95 latency < 200ms
  • SLA (Service Level Agreement) — Formal commitment with consequences
    • Examples: "99.9% uptime or 10% credit"
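The relationship between the terms can be shown in a couple of lines: an SLI is something you measure, and an SLO is a threshold you compare it against. A TypeScript sketch (function names are illustrative):

```typescript
// SLI: a quantifiable measurement (here, availability as good/total events)
function availabilitySli(successCount: number, totalCount: number): number {
  return totalCount === 0 ? 1 : successCount / totalCount;
}

// SLO: a target the SLI is compared against (e.g. 0.999 = 99.9%)
function meetsSlo(sli: number, target: number): boolean {
  return sli >= target;
}
```

An SLA then layers a contractual consequence (credits, penalties) on top of the same comparison.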

The Four Golden Signals


Google SRE's key metrics for any service


  1. Latency — SLI: P95 request latency. SLO: 95% of requests complete in < 200ms
  2. Traffic — SLI: Requests per second. SLO: Handle 10,000 req/s peak load
  3. Errors — SLI: Error rate (5xx / total). SLO: < 0.1% error rate
  4. Saturation — SLI: Resource utilization (CPU, memory, disk). SLO: CPU < 70%, Memory < 80%
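As a worked example of checking a snapshot against those targets, here is a small TypeScript sketch. The interface and thresholds mirror the illustrative SLOs above; none of this is a standard API:

```typescript
interface GoldenSignals {
  p95LatencyMs: number;
  requestsPerSecond: number;
  errorRate: number;       // 0..1
  cpuUtilization: number;  // 0..1
}

function sloViolations(s: GoldenSignals): string[] {
  const violations: string[] = [];
  if (s.p95LatencyMs >= 200) violations.push("latency");
  if (s.errorRate >= 0.001) violations.push("errors");
  if (s.cpuUtilization >= 0.7) violations.push("saturation");
  // Traffic is a capacity target rather than a threshold, so it is not checked here.
  return violations;
}
```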

---

Error Budget


```python
# Error budget = 1 - SLO
SLO = 99.9                # "three nines"
error_budget = 100 - SLO  # 0.1%

# Monthly calculation (30 days)
total_minutes = 30 * 24 * 60              # 43,200 minutes
allowed_downtime = total_minutes * 0.001  # 43.2 minutes

# If you've had 20 minutes of downtime this month:
budget_remaining = 43.2 - 20  # 23.2 minutes
budget_consumed = 20 / 43.2   # 46.3%

# Policy: if budget is > 90% consumed, freeze deployments
```
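The same arithmetic as a reusable TypeScript helper (the function name is illustrative):

```typescript
// Convert an availability SLO (as a fraction, e.g. 0.999) into the
// error budget in minutes for a window of `days` days.
function errorBudgetMinutes(slo: number, days: number = 30): number {
  const totalMinutes = days * 24 * 60;
  return totalMinutes * (1 - slo);
}
```

For example, 99.9% over 30 days allows about 43.2 minutes of downtime, while 99% allows about 432 minutes; each extra nine shrinks the budget tenfold.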

SLO Implementation with Prometheus


Recording rules for SLI calculation


```yaml
groups:
  - name: slo_availability
    interval: 30s
    rules:
      # Total requests
      - record: slo:api_requests:total
        expr: sum(rate(http_requests_total[5m]))

      # Successful requests (non-5xx)
      - record: slo:api_requests:success
        expr: sum(rate(http_requests_total{status!~"5.."}[5m]))

      # Availability SLI
      - record: slo:api_availability:ratio
        expr: slo:api_requests:success / slo:api_requests:total

      # 30-day availability
      - record: slo:api_availability:30d
        expr: avg_over_time(slo:api_availability:ratio[30d])

  - name: slo_latency
    interval: 30s
    rules:
      # P95 latency SLI
      - record: slo:api_latency:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```

Alerting on SLO burn rate


```yaml
- alert: HighErrorBudgetBurnRate
  expr: |
    (
      slo:api_availability:ratio < 0.999   # Below 99.9% SLO
      and
      slo:api_availability:30d > 0.999     # But 30-day average still OK
    )
  for: 5m
  annotations:
    summary: "Burning error budget too fast"
    description: "Current availability {{ $value }} is below SLO. {{ $labels.service }}"
```

---
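The burn-rate idea behind that alert, as a TypeScript sketch (illustrative names): a burn rate of 1 means the budget would be consumed exactly over the SLO window, while 10 means ten times too fast.

```typescript
// How many times faster than sustainable the error budget is being consumed:
// the observed error rate divided by the allowed error rate (1 - SLO).
function burnRate(observedErrorRate: number, slo: number): number {
  const allowedErrorRate = 1 - slo;
  return observedErrorRate / allowedErrorRate;
}
```

Multi-window alerting typically pages on a high burn rate over a short window (fast, urgent burn) and tickets on a lower burn rate over a long window (slow leak).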

Incident Response


Incident Severity Levels

| Level | Impact | Response Time | Examples |
| --- | --- | --- | --- |
| SEV-1 | Service down or major degradation | < 15 min | Complete outage, data loss, security breach |
| SEV-2 | Significant impact, partial outage | < 1 hour | Feature unavailable, high error rates |
| SEV-3 | Minor impact, workaround exists | < 4 hours | Single component degraded, slow performance |
| SEV-4 | Cosmetic, no user impact | Next business day | UI glitches, logging errors |

Incident Response Roles (IMAG Framework)


```yaml
Incident Commander (IC):
  - Overall coordination and decision-making
  - Declares incident start/end
  - Decides on escalations
  - Owns communication to leadership

Operations Lead (OL):
  - Technical investigation and mitigation
  - Coordinates engineers
  - Implements fixes
  - Reports status to IC

Communications Lead (CL):
  - Internal/external status updates
  - Customer communication
  - Stakeholder notifications
  - Status page updates
```

Incident Workflow


1. Detection (Alert fires or user reports)
2. Triage (Assess severity, assign IC)
3. Response (Assemble team, create war room)
4. Mitigation (Stop the bleeding, restore service)
5. Resolution (Fix root cause)
6. Postmortem (Blameless review, action items)
7. Follow-up (Implement improvements)

---

On-Call Best Practices


  • Rotation — 1-week shifts, balanced across timezones
  • Escalation — Primary → Secondary → Manager (15 min each)
  • Playbooks — Step-by-step debugging guides for common issues
  • Runbooks — Automated remediation scripts
  • Handoff — 15-min sync at rotation change
  • Compensation — On-call pay or comp time
  • Health — No more than 2 incidents/night target
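The escalation chain can be expressed as a tiny lookup; a sketch using the 15-minute steps from the bullet list (the function name is invented):

```typescript
// Who should be paged after N minutes without acknowledgement:
// Primary (0-15 min) → Secondary (15-30 min) → Manager (30+ min).
function escalationTier(minutesUnacked: number): "primary" | "secondary" | "manager" {
  if (minutesUnacked < 15) return "primary";
  if (minutesUnacked < 30) return "secondary";
  return "manager";
}
```

In practice this logic lives in the paging tool (PagerDuty/Opsgenie escalation policies) rather than in application code.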

Alert Fatigue Prevention



Symptoms vs Causes alerting


Alert on WHAT users experience, not WHY it's broken.

```yaml
# GOOD: Symptom-based alert
- alert: APILatencyHigh
  expr: slo:api_latency:p95 > 0.200   # User-facing metric
  annotations:
    summary: "API is slow for users"

# BAD: Cause-based alert
- alert: CPUHigh
  expr: cpu_usage > 70   # Internal metric, might not impact users
  # Don't alert unless this affects SLOs

# Use SLO-based alerting:
# alert when the error budget burn rate is too high
```

---

Blameless Postmortems


Core Principles


  • Assume Good Intentions — Everyone did their best with available information
  • Focus on Systems — Identify gaps in process/tooling, not people
  • Psychological Safety — No punishment for honest mistakes
  • Learning Culture — Incidents are opportunities to improve
  • Separate from Performance Reviews — Postmortem participation never affects evaluations

Postmortem Template



Incident Postmortem: [Title]


Date: 2025-01-15
Duration: 10:30 - 12:15 UTC (1h 45m)
Severity: SEV-2
Incident Commander: Jane Doe
Responders: John Smith, Alice Johnson

---

Impact


  • 15,000 users affected
  • 12% error rate on payment processing
  • $5,000 estimated revenue impact
  • No data loss

Timeline (UTC)


  • 10:30 - Alert: Payment error rate > 5%
  • 10:32 - IC assigned, war room created
  • 10:45 - Identified: Database connection pool exhausted
  • 11:00 - Mitigation: Increased pool size from 50 → 100
  • 11:15 - Error rate back to normal
  • 12:15 - Incident closed after monitoring

Root Cause


Database connection pool configured for average load, not peak traffic. Black Friday traffic spike (3x normal) exhausted connections.

What Went Well


  • Alert fired within 2 minutes of issue
  • Clear escalation path, IC available immediately
  • Mitigation applied quickly (30 minutes to fix)
  • No data corruption or loss

What Went Wrong


  • No load testing at 3x scale
  • No auto-scaling for connection pool
  • No alert on connection pool saturation
  • Insufficient monitoring of database metrics

Action Items


  • (@john) Add connection pool metrics to Grafana (Due: Jan 20)
  • (@alice) Implement auto-scaling based on request rate (Due: Jan 25)
  • (@jane) Add load testing to CI for 5x scale (Due: Feb 1)
  • (@jane) Add alert: connection pool > 80% (Due: Jan 18)
  • (@john) Document connection pool tuning runbook (Due: Jan 22)

Lessons Learned


  1. Black Friday load patterns need dedicated testing
  2. Database metrics were missing from standard dashboards
  3. Auto-scaling should cover ALL resources, not just pods

Follow-up


  • Review postmortem in team meeting within 1 week
  • Track action items to completion (not optional!)
  • Share learnings across teams
  • Update runbooks and playbooks
  • Celebrate successful incident response


Chaos Engineering


Principles

  1. Define Steady State — Normal system behavior (e.g., 99.9% success rate)
  2. Hypothesize — Predict system will remain stable under failure
  3. Inject Failures — Simulate real-world events
  4. Disprove Hypothesis — Look for deviations from steady state
  5. Learn and Improve — Fix weaknesses, increase resilience
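Steps 1 and 4 reduce to a steady-state predicate that a chaos run can evaluate continuously and use as an abort condition. A minimal TypeScript sketch, assuming a 99.9% success-rate steady state (names are illustrative):

```typescript
// Steady state: the success ratio stays at or above the target.
// A chaos experiment should be aborted as soon as this returns false.
function steadyStateHolds(successCount: number, totalCount: number, target: number = 0.999): boolean {
  if (totalCount === 0) return true; // no traffic yet: nothing to judge
  return successCount / totalCount >= target;
}
```

Tying the experiment's kill switch to the same SLI used for SLO alerting keeps chaos runs from consuming real error budget unnoticed.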

Failure Types


```yaml
Infrastructure:
  - Pod/node termination
  - Network latency/packet loss
  - DNS failures
  - Cloud region outage

Resources:
  - CPU stress
  - Memory exhaustion
  - Disk I/O saturation
  - File descriptor limits

Dependencies:
  - Database connection failures
  - API timeout/errors
  - Cache unavailability
  - Message queue backlog

Security:
  - DDoS simulation
  - Certificate expiration
  - Unauthorized access attempts
```

Chaos Mesh Example


Network latency injection


apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-service
  delay:
    latency: "100ms"
    correlation: "50"
    jitter: "50ms"
  duration: "5m"
  scheduler:
    cron: "@every 2h"  # Run every 2 hours


Pod kill experiment


apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill
spec:
  action: pod-kill
  mode: fixed-percent
  value: "10"  # Kill 10% of pods
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
  duration: "30s"

Best Practices


  • Start Small — Non-production first, then canary production
  • Collect Baselines — Know normal metrics before experiments
  • Define Success — Clear criteria for what "stable" means
  • Monitor Everything — Watch metrics, logs, traces during tests
  • Automate Rollback — Stop experiment if SLOs violated
  • Game Days — Scheduled chaos exercises with full team
  • Blameless Reviews — Treat chaos failures like production incidents
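The automated-rollback practice can be sketched as a guard that polls the SLI while the experiment runs and aborts on the first violation. The `query_sli` and `stop_experiment` callables are hypothetical hooks, not part of any real chaos framework:

```python
def slo_guard(query_sli, stop_experiment, slo_target=0.999, checks=5):
    """Poll the SLI during a chaos experiment; abort on the first violation."""
    for _ in range(checks):
        if query_sli() < slo_target:
            stop_experiment()
            return False  # experiment aborted: SLO violated
    return True  # SLO held for the whole experiment

# Usage with stubbed hooks: the third reading breaches the 99.9% target
readings = iter([0.9999, 0.9995, 0.9981])
ok = slo_guard(lambda: next(readings), lambda: print("rolling back"), checks=3)
print(ok)
```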


AIOps and AI in Observability


2025 Trends

  • Anomaly Detection — AI spots unusual patterns in metrics/logs
  • Root Cause Analysis — Correlate failures across services automatically
  • Predictive Alerting — Predict failures before they happen
  • Auto-Remediation — AI suggests or applies fixes autonomously
  • Natural Language Queries — Ask "Why is checkout slow?" instead of writing PromQL
  • AI Observability — Monitor AI model drift, hallucinations, token usage
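A toy version of the anomaly-detection idea — a z-score detector over a sliding metric window — shows the underlying mechanic the platforms automate (illustrative only; real AIOps systems learn seasonality and multi-signal baselines):

```python
from statistics import mean, stdev

def is_anomalous(window: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag `latest` if it deviates more than `threshold` sigmas from the window."""
    mu, sigma = mean(window), stdev(window)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold

latencies_ms = [102, 98, 101, 99, 100, 103, 97]
print(is_anomalous(latencies_ms, 250))  # latency spike -> True
print(is_anomalous(latencies_ms, 101))  # normal reading -> False
```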

AI-Driven Platforms (2025)


yaml
Dynatrace Davis AI:
  - Auto-detected 73% of incidents before customer impact
  - Reduced alert noise by 90%
  - Causal AI for root cause analysis

Datadog Watchdog:
  - Anomaly detection across metrics, logs, traces
  - Automated correlation of related issues
  - LLM-powered investigation assistant

Elastic AIOps:
  - Machine learning for log anomaly detection
  - Automated baseline learning
  - Predictive alerting

New Relic AI:
  - Natural language query interface
  - Automated incident summarization
  - Proactive capacity recommendations

Implementing AI Observability


python

Monitor AI model performance


from opentelemetry import trace, metrics
import time
import openai

tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)

Create metrics for AI model


model_latency = meter.create_histogram(
    "ai.model.latency",
    description="AI model inference latency",
    unit="ms"
)
model_tokens = meter.create_counter(
    "ai.model.tokens",
    description="Token usage"
)

client = openai.AsyncOpenAI()

async def run_ai_model(prompt: str):
    with tracer.start_as_current_span("ai.inference") as span:
        start = time.time()

        span.set_attribute("ai.model", "gpt-4")
        span.set_attribute("ai.prompt_length", len(prompt))

        response = await client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )

        latency = (time.time() - start) * 1000
        tokens = response.usage.total_tokens

        # Record metrics
        model_latency.record(latency, {"model": "gpt-4"})
        model_tokens.add(tokens, {"model": "gpt-4", "type": "total"})

        # Add to span
        span.set_attribute("ai.response_length", len(response.choices[0].message.content))
        span.set_attribute("ai.tokens_used", tokens)

        return response

---

Grafana Dashboards


3-3-3 Rule

  • 3 rows of panels per dashboard
  • 3 panels per row
  • 3 key metrics per panel
Avoid "dashboard sprawl" — Each dashboard should answer ONE question.

Dashboard Categories


yaml
RED Dashboard (for services):
  - Rate: Requests per second
  - Errors: Error rate
  - Duration: Latency (P50, P95, P99)

USE Dashboard (for resources):
  - Utilization: % of capacity used
  - Saturation: Queue depth, wait time
  - Errors: Error count

Four Golden Signals Dashboard:
  - Latency
  - Traffic
  - Errors
  - Saturation

SLO Dashboard:
  - Current SLI value
  - Error budget remaining
  - Burn rate
  - Trend (30-day)

Panel Best Practices


json
{
  "title": "API Request Rate",
  "type": "graph",
  "targets": [
    {
      "expr": "sum(rate(http_requests_total[5m])) by (method)",
      "legendFormat": "{{ method }}"
    }
  ],
  "options": {
    "tooltip": { "mode": "multi" },
    "legend": { "displayMode": "table", "calcs": ["mean", "last"] }
  },
  "fieldConfig": {
    "defaults": {
      "unit": "reqps",
      "color": { "mode": "palette-classic" },
      "custom": {
        "lineWidth": 2,
        "fillOpacity": 10
      }
    }
  }
}


---

Checklist



Metrics (Prometheus + Grafana)


  • Layered architecture (app/cluster/global)
  • Recording rules for expensive queries
  • Resource limits and retention configured
  • Dashboards follow 3-3-3 rule
  • Alerts based on SLOs, not internal metrics
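A recording rule for an expensive SLO query might look like the following (metric names mirror the `slo:` prefix used in the alert examples; label names are assumptions about your instrumentation):

```yaml
groups:
  - name: slo_rules
    interval: 30s
    rules:
      - record: slo:api_latency:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
      - record: slo:api_errors:rate5m
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
```

Precomputing these keeps dashboard loads and alert evaluations cheap, since the quantile and ratio are calculated once per interval instead of per query.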

Tracing (OpenTelemetry)


  • Auto-instrumentation enabled
  • Custom spans for business operations
  • Sampling strategy configured
  • Trace context in logs (correlation)
  • Backend connected (Tempo/Jaeger)

Logging (Loki/ELK)


  • Structured JSON logging
  • Low cardinality labels (<10)
  • Trace IDs in logs
  • Appropriate log levels
  • Retention policy defined

SLOs


  • SLIs defined for key user journeys
  • SLOs documented and tracked
  • Error budget calculated
  • Burn rate alerting configured
  • Monthly SLO review process

Incident Response


  • Severity levels defined
  • On-call rotation scheduled
  • Escalation policy documented
  • Runbooks for common issues
  • Postmortem template ready

Culture


  • Blameless postmortem process
  • Action items tracked to completion
  • Incident learnings shared
  • On-call compensation policy
  • Regular chaos engineering exercises

---

See Also


  • reference/monitoring.md — Prometheus and Grafana deep dive
  • reference/logging.md — Structured logging best practices
  • reference/tracing.md — OpenTelemetry and distributed tracing
  • reference/incident-response.md — Incident management and postmortems
  • templates/slo-template.md — SLO definition template