Observability & Site Reliability Engineering


Core Principles


  • Three Pillars — Metrics, Logs, and Traces provide holistic visibility
  • Observability-First — Build systems that explain their own behavior
  • SLO-Driven — Define reliability targets that matter to users
  • Proactive Detection — Find issues before customers do
  • Blameless Culture — Learn from failures without blame
  • Automate Toil — Reduce repetitive operational work
  • Continuous Improvement — Each incident makes systems more resilient
  • Full-Stack Visibility — Monitor from infrastructure to business metrics


Hard Rules (Must Follow)


These rules are mandatory. Violating them means the skill is not working correctly.

Symptom-Based Alerts Only


Alert on user-facing symptoms, not internal infrastructure metrics.

❌ FORBIDDEN: Alerting on internal metrics


```yaml
# Users don't care about CPU; they care about latency
- alert: CPUHigh
  expr: cpu_usage > 70

# Internal metric; may not affect users
- alert: MemoryHigh
  expr: memory_usage > 80
```

✅ REQUIRED: Alert on user experience


```yaml
- alert: APILatencyHigh
  expr: slo:api_latency:p95 > 0.200
  annotations:
    summary: "Users experiencing slow response times"

- alert: ErrorRateHigh
  expr: slo:api_errors:rate5m > 0.001
  annotations:
    summary: "Users encountering errors"
```

Low Cardinality Labels


Loki/Prometheus labels must be low-cardinality: keep to a handful of label names (fewer than ~10), each with a small, bounded set of values.

❌ FORBIDDEN: High cardinality labels


```yaml
labels:
  user_id: "usr_123"     # Millions of values!
  order_id: "ord_456"    # Millions of values!
  request_id: "req_789"  # Every request is unique!
```

✅ REQUIRED: Low cardinality only


```yaml
labels:
  namespace: "production"  # Few values
  app: "api-server"        # Few values
  level: "error"           # 5-6 values
  method: "GET"            # ~10 values
```

High cardinality data goes in log body:


```typescript
logger.info({
  user_id: "usr_123",   // In the JSON body, not a label
  order_id: "ord_456",
}, "Order processed");
```
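One way to enforce this split mechanically is to route event fields through an allowlist before shipping. A minimal TypeScript sketch (a hypothetical helper, not part of any logging library; the allowlist contents are illustrative):

```typescript
// Only these low-cardinality keys may become labels; everything else
// stays in the log body where high cardinality is harmless.
const LABEL_ALLOWLIST = new Set(["namespace", "app", "level", "method"]);

function splitLabels(fields: Record<string, string>): {
  labels: Record<string, string>;
  body: Record<string, string>;
} {
  const labels: Record<string, string> = {};
  const body: Record<string, string> = {};
  for (const [key, value] of Object.entries(fields)) {
    (LABEL_ALLOWLIST.has(key) ? labels : body)[key] = value;
  }
  return { labels, body };
}
```

A guard like this keeps a stray `user_id` from ever reaching the label set, regardless of what individual call sites do.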

SLO-Based Error Budgets


Every service must have defined SLOs with error budget tracking.

❌ FORBIDDEN: No SLO definition


Just monitoring without targets


✅ REQUIRED: Explicit SLO with budget


SLO: 99.9% availability


Error Budget: 0.1% = 43.2 minutes/month downtime


```yaml
groups:
  - name: slo_tracking
    rules:
      - record: slo:api_availability:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
      - alert: ErrorBudgetBurnRate
        expr: slo:api_availability:ratio < 0.999
        for: 5m
        annotations:
          summary: "Burning error budget too fast"
```

Trace Context in Logs


All logs must include trace_id for correlation with distributed traces.
```typescript
// ❌ FORBIDDEN: Logs without trace context
logger.info("Payment processed");

// ✅ REQUIRED: Include trace_id in every log
const span = trace.getActiveSpan();
logger.info({
  trace_id: span?.spanContext().traceId,
  span_id: span?.spanContext().spanId,
  order_id: "ord_123",
}, "Payment processed");

// Output includes correlation:
// {"trace_id":"abc123","span_id":"def456","order_id":"ord_123","msg":"Payment processed"}
```


Quick Reference


When to Use What


| Scenario | Tool/Pattern | Reason |
| --- | --- | --- |
| Metrics collection | Prometheus + Grafana | Industry standard, powerful query language |
| Distributed tracing | OpenTelemetry + Tempo/Jaeger | Vendor-neutral, CNCF standard |
| Log aggregation (cost-sensitive) | Grafana Loki | Indexes only labels, 10x cheaper |
| Log aggregation (search-heavy) | ELK Stack | Full-text search, advanced analytics |
| Unified observability | Elastic/Datadog/Dynatrace | Single pane of glass for all telemetry |
| Incident management | PagerDuty/Opsgenie | Alert routing, on-call scheduling |
| Chaos engineering | Gremlin/Chaos Mesh | Controlled failure injection |
| AIOps/Anomaly detection | Dynatrace/Datadog | AI-driven root cause analysis |

The Three Pillars


| Pillar | What | When | Tools |
| --- | --- | --- | --- |
| Metrics | Numerical time-series data | Real-time monitoring, alerting | Prometheus, StatsD, CloudWatch |
| Logs | Event records with context | Debugging, audit trails | Loki, ELK, Splunk |
| Traces | Request journey across services | Performance analysis, dependencies | OpenTelemetry, Jaeger, Zipkin |

Fourth Pillar (Emerging): Continuous Profiling — Code-level performance data (CPU, memory usage at function level)

Observability Architecture


Layered Prometheus Setup



2025 Best Practice: Federated architecture


Prevents metric chaos while enabling drill-down


Layer 1: Application Prometheus

- Detailed business logic metrics
- High cardinality acceptable
- Short retention (7 days)

Layer 2: Cluster Prometheus

- Per-environment/cluster metrics
- Medium retention (30 days)
- Aggregates from application level

Layer 3: Global Prometheus

- Cross-cluster critical metrics
- Long retention (1 year)
- Federation from cluster level

Global Prometheus config


```yaml
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="kubernetes-nodes"}'
        - '{__name__=~"job:.*"}'  # Recording rules only
    static_configs:
      - targets:
          - 'cluster-prom-us-east.internal:9090'
          - 'cluster-prom-eu-west.internal:9090'
```

Recording Rules for Performance



Precompute expensive queries


```yaml
groups:
  - name: api_performance
    interval: 30s
    rules:
      # Request rate (requests per second)
      - record: job:api_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job, method, status)

      # Error rate
      - record: job:api_errors:rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)

      # P95 latency
      - record: job:api_latency:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
```
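To make the P95 rule less of a black box: histogram_quantile works on cumulative buckets and linearly interpolates inside the bucket that contains the target rank. A rough TypeScript sketch of that idea (simplified; real Prometheus also handles +Inf buckets and edge cases not shown here):

```typescript
// Buckets are cumulative and sorted by upper bound `le`.
function p95FromBuckets(buckets: Array<{ le: number; count: number }>): number {
  const total = buckets[buckets.length - 1].count;
  const rank = 0.95 * total; // position of the 95th-percentile observation
  let prevLe = 0;
  let prevCount = 0;
  for (const b of buckets) {
    if (b.count >= rank) {
      // Linear interpolation within the bucket holding the rank
      return prevLe + ((b.le - prevLe) * (rank - prevCount)) / (b.count - prevCount);
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return buckets[buckets.length - 1].le;
}
```

This is why bucket boundaries matter: the quantile is only as precise as the bucket the rank lands in.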

Resource Optimization



Increase scrape interval for high-target deployments


```yaml
scrape_interval: 30s  # Default is 15s; doubling it roughly halves scrape load
```

Use relabeling to drop unnecessary metrics


```yaml
metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'go_.*|process_.*'  # Drop Go runtime metrics
    action: drop
```

Limit sample retention


```yaml
storage:
  tsdb:
    retention.time: 15d   # Keep only 15 days locally
    retention.size: 50GB  # Or max 50GB
```

---

Distributed Tracing with OpenTelemetry


Auto-Instrumentation Setup


```typescript
// Node.js auto-instrumentation
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // Auto-instruments HTTP, Express, PostgreSQL, Redis, etc.
      '@opentelemetry/instrumentation-fs': { enabled: false }, // Too noisy
    }),
  ],
});

sdk.start();
```

Manual Instrumentation for Business Logic


```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-service', '1.0.0');

async function processPayment(orderId: string, amount: number) {
  // Create custom span for business operation
  return tracer.startActiveSpan('processPayment', async (span) => {
    try {
      // Add business context
      span.setAttributes({
        'order.id': orderId,
        'payment.amount': amount,
        'payment.currency': 'USD',
      });

      // Child span for external API call
      const paymentResult = await tracer.startActiveSpan('stripe.charge', async (childSpan) => {
        const result = await stripe.charges.create({ amount, currency: 'usd' });
        childSpan.setAttribute('stripe.charge_id', result.id);
        childSpan.setStatus({ code: SpanStatusCode.OK });
        childSpan.end();
        return result;
      });

      span.setStatus({ code: SpanStatusCode.OK });
      return paymentResult;
    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      throw error;
    } finally {
      span.end();
    }
  });
}
```

Sampling Strategies



OpenTelemetry Collector config


```yaml
processors:
  # Probabilistic sampling: keep 10% of traces
  probabilistic_sampler:
    sampling_percentage: 10

  # Tail sampling: make decisions after seeing the full trace
  tail_sampling:
    policies:
      # Always sample errors
      - name: error-traces
        type: status_code
        status_code: {status_codes: [ERROR]}

      # Always sample slow requests
      - name: slow-traces
        type: latency
        latency: {threshold_ms: 1000}

      # Sample 5% of normal traffic
      - name: normal-traces
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
```
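The decision logic those tail-sampling policies encode can be sketched in a few lines. This is an illustration of the policy order only, not the collector's actual implementation; `TraceSummary` and `keepTrace` are invented names, and `sample` is injectable so the behavior is testable:

```typescript
interface TraceSummary {
  hasError: boolean;
  durationMs: number;
}

function keepTrace(t: TraceSummary, sample: () => number = Math.random): boolean {
  if (t.hasError) return true;          // always sample errors
  if (t.durationMs > 1000) return true; // always sample slow requests
  return sample() < 0.05;               // 5% of normal traffic
}
```

The key property of tail sampling is visible here: the interesting traces (errors, slow requests) are kept deterministically, and randomness only applies to the boring majority.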

Context Propagation


```typescript
// Ensure trace context flows across services
import { propagation, context } from '@opentelemetry/api';

// Outgoing HTTP request (automatic with auto-instrumentation)
fetch('https://api.example.com/data', {
  headers: {
    // W3C Trace Context headers injected automatically:
    // traceparent: 00-<trace-id>-<span-id>-01
    // tracestate: vendor=value
  },
});

// Manual propagation for non-HTTP (e.g., message queues)
const carrier = {};
propagation.inject(context.active(), carrier);
await publishMessage(queue, { data: payload, headers: carrier });
```


Structured Logging Best Practices


JSON Logging Format

```typescript
// Use structured logging library
import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  timestamp: pino.stdTimeFunctions.isoTime,
  // Include trace context in logs
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};

    const { traceId, spanId } = span.spanContext();
    return {
      trace_id: traceId,
      span_id: spanId,
    };
  },
});

// Structured logging with context
logger.info(
  {
    user_id: '123',
    order_id: 'ord_456',
    amount: 99.99,
    payment_method: 'card',
  },
  'Payment processed successfully'
);

// Output:
// {"level":"info","time":"2025-01-15T10:30:00.000Z","trace_id":"abc123","span_id":"def456","user_id":"123","order_id":"ord_456","amount":99.99,"payment_method":"card","msg":"Payment processed successfully"}
```

Log Levels


```typescript
// Follow standard severity levels
logger.trace({ details }, 'Low-level debugging');     // Very verbose
logger.debug({ state }, 'Debug information');         // Development
logger.info({ event }, 'Normal operation');           // Production default
logger.warn({ issue }, 'Warning condition');          // Potential issues
logger.error({ error, context }, 'Error occurred');   // Errors
logger.fatal({ critical }, 'Fatal error');            // Process crash
```

---

Grafana Loki Configuration


Promtail config - ships logs to Loki


```yaml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:

scrape_configs:
  - job_name: kubernetes
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Add pod labels as Loki labels (LOW cardinality only!)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
    pipeline_stages:
      # Parse JSON logs
      - json:
          expressions:
            level: level
            trace_id: trace_id
      # Extract level as a label; trace_id stays in the log body (high cardinality)
      - labels:
          level:
```

Loki Best Practices


  • Low Cardinality Labels — Use only 5-10 labels (namespace, app, level)
  • High Cardinality in Log Body — Put user_id, order_id in JSON, not labels
  • LogQL for Filtering — Use {app="api"} | json | user_id="123"
  • Retention Policy — Keep recent logs longer, compress old logs

LogQL query examples


```logql
{namespace="production", app="api"} |= "error"                    # Text search
{app="api"} | json | level="error" | line_format "{{.msg}}"       # JSON parsing
rate({app="api"}[5m])                                             # Log rate per second
sum by (level) (count_over_time({namespace="production"}[1h]))    # Count by level
```

---

SLO/SLI/SLA Management


Definitions

  • SLI (Service Level Indicator) — Quantifiable measurement of service behavior
    • Examples: Request latency, error rate, availability, throughput
  • SLO (Service Level Objective) — Target value/range for an SLI
    • Examples: 99.9% availability, P95 latency < 200ms
  • SLA (Service Level Agreement) — Formal commitment with consequences
    • Examples: "99.9% uptime or 10% credit"
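The relationship between the terms can be shown in a couple of lines: an SLI is something you measure, and an SLO is a threshold you compare it against. A TypeScript sketch (function names are illustrative):

```typescript
// SLI: a quantifiable measurement (here, availability as good/total events)
function availabilitySli(successCount: number, totalCount: number): number {
  return totalCount === 0 ? 1 : successCount / totalCount;
}

// SLO: a target the SLI is compared against (e.g. 0.999 = 99.9%)
function meetsSlo(sli: number, target: number): boolean {
  return sli >= target;
}
```

An SLA then layers a contractual consequence (credits, penalties) on top of the same comparison.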

The Four Golden Signals


Google SRE's key metrics for any service


  1. Latency — SLI: P95 request latency. SLO: 95% of requests complete in < 200ms
  2. Traffic — SLI: Requests per second. SLO: Handle 10,000 req/s peak load
  3. Errors — SLI: Error rate (5xx / total). SLO: < 0.1% error rate
  4. Saturation — SLI: Resource utilization (CPU, memory, disk). SLO: CPU < 70%, Memory < 80%
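As a worked example of checking a snapshot against those targets, here is a small TypeScript sketch. The interface and thresholds mirror the illustrative SLOs above; none of this is a standard API:

```typescript
interface GoldenSignals {
  p95LatencyMs: number;
  requestsPerSecond: number;
  errorRate: number;       // 0..1
  cpuUtilization: number;  // 0..1
}

function sloViolations(s: GoldenSignals): string[] {
  const violations: string[] = [];
  if (s.p95LatencyMs >= 200) violations.push("latency");
  if (s.errorRate >= 0.001) violations.push("errors");
  if (s.cpuUtilization >= 0.7) violations.push("saturation");
  // Traffic is a capacity target rather than a threshold, so it is not checked here.
  return violations;
}
```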

---

Error Budget


```python
# Error budget = 1 - SLO
SLO = 99.9                # "three nines"
error_budget = 100 - SLO  # 0.1%

# Monthly calculation (30 days)
total_minutes = 30 * 24 * 60              # 43,200 minutes
allowed_downtime = total_minutes * 0.001  # 43.2 minutes

# If you've had 20 minutes of downtime this month:
budget_remaining = 43.2 - 20  # 23.2 minutes
budget_consumed = 20 / 43.2   # 46.3%

# Policy: if budget is > 90% consumed, freeze deployments
```
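The same arithmetic as a reusable TypeScript helper (the function name is illustrative):

```typescript
// Convert an availability SLO (as a fraction, e.g. 0.999) into the
// error budget in minutes for a window of `days` days.
function errorBudgetMinutes(slo: number, days: number = 30): number {
  const totalMinutes = days * 24 * 60;
  return totalMinutes * (1 - slo);
}
```

For example, 99.9% over 30 days allows about 43.2 minutes of downtime, while 99% allows about 432 minutes; each extra nine shrinks the budget tenfold.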

SLO Implementation with Prometheus


Recording rules for SLI calculation


```yaml
groups:
  - name: slo_availability
    interval: 30s
    rules:
      # Total requests
      - record: slo:api_requests:total
        expr: sum(rate(http_requests_total[5m]))

      # Successful requests (non-5xx)
      - record: slo:api_requests:success
        expr: sum(rate(http_requests_total{status!~"5.."}[5m]))

      # Availability SLI
      - record: slo:api_availability:ratio
        expr: slo:api_requests:success / slo:api_requests:total

      # 30-day availability
      - record: slo:api_availability:30d
        expr: avg_over_time(slo:api_availability:ratio[30d])

  - name: slo_latency
    interval: 30s
    rules:
      # P95 latency SLI
      - record: slo:api_latency:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```

Alerting on SLO burn rate


```yaml
- alert: HighErrorBudgetBurnRate
  expr: |
    (
      slo:api_availability:ratio < 0.999   # Below 99.9% SLO
      and
      slo:api_availability:30d > 0.999     # But 30-day average still OK
    )
  for: 5m
  annotations:
    summary: "Burning error budget too fast"
    description: "Current availability {{ $value }} is below SLO. {{ $labels.service }}"
```

---
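The burn-rate idea behind that alert, as a TypeScript sketch (illustrative names): a burn rate of 1 means the budget would be consumed exactly over the SLO window, while 10 means ten times too fast.

```typescript
// How many times faster than sustainable the error budget is being consumed:
// the observed error rate divided by the allowed error rate (1 - SLO).
function burnRate(observedErrorRate: number, slo: number): number {
  const allowedErrorRate = 1 - slo;
  return observedErrorRate / allowedErrorRate;
}
```

Multi-window alerting typically pages on a high burn rate over a short window (fast, urgent burn) and tickets on a lower burn rate over a long window (slow leak).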

Incident Response


Incident Severity Levels

| Level | Impact | Response Time | Examples |
| --- | --- | --- | --- |
| SEV-1 | Service down or major degradation | < 15 min | Complete outage, data loss, security breach |
| SEV-2 | Significant impact, partial outage | < 1 hour | Feature unavailable, high error rates |
| SEV-3 | Minor impact, workaround exists | < 4 hours | Single component degraded, slow performance |
| SEV-4 | Cosmetic, no user impact | Next business day | UI glitches, logging errors |

Incident Response Roles (IMAG Framework)


```yaml
Incident Commander (IC):
  - Overall coordination and decision-making
  - Declares incident start/end
  - Decides on escalations
  - Owns communication to leadership

Operations Lead (OL):
  - Technical investigation and mitigation
  - Coordinates engineers
  - Implements fixes
  - Reports status to IC

Communications Lead (CL):
  - Internal/external status updates
  - Customer communication
  - Stakeholder notifications
  - Status page updates
```

Incident Workflow


1. Detection (Alert fires or user reports)
2. Triage (Assess severity, assign IC)
3. Response (Assemble team, create war room)
4. Mitigation (Stop the bleeding, restore service)
5. Resolution (Fix root cause)
6. Postmortem (Blameless review, action items)
7. Follow-up (Implement improvements)

---

On-Call Best Practices


  • Rotation — 1-week shifts, balanced across timezones
  • Escalation — Primary → Secondary → Manager (15 min each)
  • Playbooks — Step-by-step debugging guides for common issues
  • Runbooks — Automated remediation scripts
  • Handoff — 15-min sync at rotation change
  • Compensation — On-call pay or comp time
  • Health — No more than 2 incidents/night target
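The escalation chain can be expressed as a tiny lookup; a sketch using the 15-minute steps from the bullet list (the function name is invented):

```typescript
// Who should be paged after N minutes without acknowledgement:
// Primary (0-15 min) → Secondary (15-30 min) → Manager (30+ min).
function escalationTier(minutesUnacked: number): "primary" | "secondary" | "manager" {
  if (minutesUnacked < 15) return "primary";
  if (minutesUnacked < 30) return "secondary";
  return "manager";
}
```

In practice this logic lives in the paging tool (PagerDuty/Opsgenie escalation policies) rather than in application code.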

Alert Fatigue Prevention



Symptoms vs Causes alerting


Alert on WHAT users experience, not WHY it's broken.

```yaml
# GOOD: Symptom-based alert
- alert: APILatencyHigh
  expr: slo:api_latency:p95 > 0.200   # User-facing metric
  annotations:
    summary: "API is slow for users"

# BAD: Cause-based alert
- alert: CPUHigh
  expr: cpu_usage > 70   # Internal metric, might not impact users
  # Don't alert unless this affects SLOs

# Use SLO-based alerting:
# alert when the error budget burn rate is too high
```

---

Blameless Postmortems


Core Principles


  • Assume Good Intentions — Everyone did their best with available information
  • Focus on Systems — Identify gaps in process/tooling, not people
  • Psychological Safety — No punishment for honest mistakes
  • Learning Culture — Incidents are opportunities to improve
  • Separate from Performance Reviews — Postmortem participation never affects evaluations

Postmortem Template



Incident Postmortem: [Title]


Date: 2025-01-15
Duration: 10:30 - 12:15 UTC (1h 45m)
Severity: SEV-2
Incident Commander: Jane Doe
Responders: John Smith, Alice Johnson

---

Impact


  • 15,000 users affected
  • 12% error rate on payment processing
  • $5,000 estimated revenue impact
  • No data loss

Timeline (UTC)


  • 10:30 - Alert: Payment error rate > 5%
  • 10:32 - IC assigned, war room created
  • 10:45 - Identified: Database connection pool exhausted
  • 11:00 - Mitigation: Increased pool size from 50 → 100
  • 11:15 - Error rate back to normal
  • 12:15 - Incident closed after monitoring

Root Cause


Database connection pool configured for average load, not peak traffic. Black Friday traffic spike (3x normal) exhausted connections.

What Went Well


  • Alert fired within 2 minutes of issue
  • Clear escalation path, IC available immediately
  • Mitigation applied quickly (30 minutes to fix)
  • No data corruption or loss

What Went Wrong


  • No load testing at 3x scale
  • No auto-scaling for connection pool
  • No alert on connection pool saturation
  • Insufficient monitoring of database metrics

Action Items


  • (@john) Add connection pool metrics to Grafana (Due: Jan 20)
  • (@alice) Implement auto-scaling based on request rate (Due: Jan 25)
  • (@jane) Add load testing to CI for 5x scale (Due: Feb 1)
  • (@jane) Add alert: connection pool > 80% (Due: Jan 18)
  • (@john) Document connection pool tuning runbook (Due: Jan 22)

Lessons Learned


  1. Black Friday load patterns need dedicated testing
  2. Database metrics were missing from standard dashboards
  3. Auto-scaling should cover ALL resources, not just pods

Follow-up


  • Review postmortem in team meeting within 1 week
  • Track action items to completion (not optional!)
  • Share learnings across teams
  • Update runbooks and playbooks
  • Celebrate successful incident response


Chaos Engineering


Principles

  1. Define Steady State — Normal system behavior (e.g., 99.9% success rate)
  2. Hypothesize — Predict system will remain stable under failure
  3. Inject Failures — Simulate real-world events
  4. Disprove Hypothesis — Look for deviations from steady state
  5. Learn and Improve — Fix weaknesses, increase resilience
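Steps 1 and 4 reduce to a steady-state predicate that a chaos run can evaluate continuously and use as an abort condition. A minimal TypeScript sketch, assuming a 99.9% success-rate steady state (names are illustrative):

```typescript
// Steady state: the success ratio stays at or above the target.
// A chaos experiment should be aborted as soon as this returns false.
function steadyStateHolds(successCount: number, totalCount: number, target: number = 0.999): boolean {
  if (totalCount === 0) return true; // no traffic yet: nothing to judge
  return successCount / totalCount >= target;
}
```

Tying the experiment's kill switch to the same SLI used for SLO alerting keeps chaos runs from consuming real error budget unnoticed.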

Failure Types


```yaml
Infrastructure:
  - Pod/node termination
  - Network latency/packet loss
  - DNS failures
  - Cloud region outage

Resources:
  - CPU stress
  - Memory exhaustion
  - Disk I/O saturation
  - File descriptor limits

Dependencies:
  - Database connection failures
  - API timeout/errors
  - Cache unavailability
  - Message queue backlog

Security:
  - DDoS simulation
  - Certificate expiration
  - Unauthorized access attempts
```

Chaos Mesh Example


Network latency injection


apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-service
  delay:
    latency: "100ms"
    correlation: "50"
    jitter: "50ms"
  duration: "5m"
  scheduler:
    cron: "@every 2h"  # Run every 2 hours


Pod kill experiment


apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill
spec:
  action: pod-kill
  mode: fixed-percent
  value: "10"  # Kill 10% of pods
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
  duration: "30s"

Best Practices


  • Start Small — Non-production first, then canary production
  • Collect Baselines — Know normal metrics before experiments
  • Define Success — Clear criteria for what "stable" means
  • Monitor Everything — Watch metrics, logs, traces during tests
  • Automate Rollback — Stop experiment if SLOs violated
  • Game Days — Scheduled chaos exercises with full team
  • Blameless Reviews — Treat chaos failures like production incidents
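The automated-rollback practice can be sketched as a guard that polls the SLI while the experiment runs and aborts on the first violation. The `query_sli` and `stop_experiment` callables are hypothetical hooks, not part of any real chaos framework:

```python
def slo_guard(query_sli, stop_experiment, slo_target=0.999, checks=5):
    """Poll the SLI during a chaos experiment; abort on the first violation."""
    for _ in range(checks):
        if query_sli() < slo_target:
            stop_experiment()
            return False  # experiment aborted: SLO violated
    return True  # SLO held for the whole experiment

# Usage with stubbed hooks: the third reading breaches the 99.9% target
readings = iter([0.9999, 0.9995, 0.9981])
ok = slo_guard(lambda: next(readings), lambda: print("rolling back"), checks=3)
print(ok)
```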


AIOps and AI in Observability


2025 Trends

  • Anomaly Detection — AI spots unusual patterns in metrics/logs
  • Root Cause Analysis — Correlate failures across services automatically
  • Predictive Alerting — Predict failures before they happen
  • Auto-Remediation — AI suggests or applies fixes autonomously
  • Natural Language Queries — Ask "Why is checkout slow?" instead of writing PromQL
  • AI Observability — Monitor AI model drift, hallucinations, token usage
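A toy version of the anomaly-detection idea — a z-score detector over a sliding metric window — shows the underlying mechanic the platforms automate (illustrative only; real AIOps systems learn seasonality and multi-signal baselines):

```python
from statistics import mean, stdev

def is_anomalous(window: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag `latest` if it deviates more than `threshold` sigmas from the window."""
    mu, sigma = mean(window), stdev(window)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold

latencies_ms = [102, 98, 101, 99, 100, 103, 97]
print(is_anomalous(latencies_ms, 250))  # latency spike -> True
print(is_anomalous(latencies_ms, 101))  # normal reading -> False
```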

AI-Driven Platforms (2025)


yaml
Dynatrace Davis AI:
  - Auto-detected 73% of incidents before customer impact
  - Reduced alert noise by 90%
  - Causal AI for root cause analysis

Datadog Watchdog:
  - Anomaly detection across metrics, logs, traces
  - Automated correlation of related issues
  - LLM-powered investigation assistant

Elastic AIOps:
  - Machine learning for log anomaly detection
  - Automated baseline learning
  - Predictive alerting

New Relic AI:
  - Natural language query interface
  - Automated incident summarization
  - Proactive capacity recommendations

Implementing AI Observability


python

Monitor AI model performance


from opentelemetry import trace, metrics
import time
import openai

tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)

Create metrics for AI model


model_latency = meter.create_histogram(
    "ai.model.latency",
    description="AI model inference latency",
    unit="ms"
)
model_tokens = meter.create_counter(
    "ai.model.tokens",
    description="Token usage"
)

client = openai.AsyncOpenAI()

async def run_ai_model(prompt: str):
    with tracer.start_as_current_span("ai.inference") as span:
        start = time.time()

        span.set_attribute("ai.model", "gpt-4")
        span.set_attribute("ai.prompt_length", len(prompt))

        response = await client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )

        latency = (time.time() - start) * 1000
        tokens = response.usage.total_tokens

        # Record metrics
        model_latency.record(latency, {"model": "gpt-4"})
        model_tokens.add(tokens, {"model": "gpt-4", "type": "total"})

        # Add to span
        span.set_attribute("ai.response_length", len(response.choices[0].message.content))
        span.set_attribute("ai.tokens_used", tokens)

        return response

---

Grafana Dashboards


3-3-3 Rule

  • 3 rows of panels per dashboard
  • 3 panels per row
  • 3 key metrics per panel
Avoid "dashboard sprawl" — Each dashboard should answer ONE question.

Dashboard Categories


yaml
RED Dashboard (for services):
  - Rate: Requests per second
  - Errors: Error rate
  - Duration: Latency (P50, P95, P99)

USE Dashboard (for resources):
  - Utilization: % of capacity used
  - Saturation: Queue depth, wait time
  - Errors: Error count

Four Golden Signals Dashboard:
  - Latency
  - Traffic
  - Errors
  - Saturation

SLO Dashboard:
  - Current SLI value
  - Error budget remaining
  - Burn rate
  - Trend (30-day)

Panel Best Practices


json
{
  "title": "API Request Rate",
  "type": "graph",
  "targets": [
    {
      "expr": "sum(rate(http_requests_total[5m])) by (method)",
      "legendFormat": "{{ method }}"
    }
  ],
  "options": {
    "tooltip": { "mode": "multi" },
    "legend": { "displayMode": "table", "calcs": ["mean", "last"] }
  },
  "fieldConfig": {
    "defaults": {
      "unit": "reqps",
      "color": { "mode": "palette-classic" },
      "custom": {
        "lineWidth": 2,
        "fillOpacity": 10
      }
    }
  }
}


---

Checklist



Metrics (Prometheus + Grafana)


  • Layered architecture (app/cluster/global)
  • Recording rules for expensive queries
  • Resource limits and retention configured
  • Dashboards follow 3-3-3 rule
  • Alerts based on SLOs, not internal metrics
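A recording rule for an expensive SLO query might look like the following (metric names mirror the `slo:` prefix used in the alert examples; label names are assumptions about your instrumentation):

```yaml
groups:
  - name: slo_rules
    interval: 30s
    rules:
      - record: slo:api_latency:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
      - record: slo:api_errors:rate5m
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
```

Precomputing these keeps dashboard loads and alert evaluations cheap, since the quantile and ratio are calculated once per interval instead of per query.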

Tracing (OpenTelemetry)


  • Auto-instrumentation enabled
  • Custom spans for business operations
  • Sampling strategy configured
  • Trace context in logs (correlation)
  • Backend connected (Tempo/Jaeger)

Logging (Loki/ELK)


  • Structured JSON logging
  • Low cardinality labels (<10)
  • Trace IDs in logs
  • Appropriate log levels
  • Retention policy defined

SLOs


  • SLIs defined for key user journeys
  • SLOs documented and tracked
  • Error budget calculated
  • Burn rate alerting configured
  • Monthly SLO review process

Incident Response


  • Severity levels defined
  • On-call rotation scheduled
  • Escalation policy documented
  • Runbooks for common issues
  • Postmortem template ready

Culture


  • Blameless postmortem process
  • Action items tracked to completion
  • Incident learnings shared
  • On-call compensation policy
  • Regular chaos engineering exercises

---

See Also


  • reference/monitoring.md — Prometheus and Grafana deep dive
  • reference/logging.md — Structured logging best practices
  • reference/tracing.md — OpenTelemetry and distributed tracing
  • reference/incident-response.md — Incident management and postmortems
  • templates/slo-template.md — SLO definition template