slo-implementation

SLO Implementation

Framework for defining and implementing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.

Purpose

Implement measurable reliability targets using SLIs, SLOs, and error budgets to balance reliability with innovation velocity.

When to Use

  • Define service reliability targets
  • Measure user-perceived reliability
  • Implement error budgets
  • Create SLO-based alerts
  • Track reliability goals

SLI/SLO/SLA Hierarchy

SLA (Service Level Agreement)
  ↓ Contract with customers
SLO (Service Level Objective)
  ↓ Internal reliability target
SLI (Service Level Indicator)
  ↓ Actual measurement

Defining SLIs

Common SLI Types

1. Availability SLI

promql
# Successful requests / Total requests
sum(rate(http_requests_total{status!~"5.."}[28d])) / sum(rate(http_requests_total[28d]))

2. Latency SLI

promql
# Requests below latency threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d])) / sum(rate(http_request_duration_seconds_count[28d]))

3. Durability SLI

promql
# Successful writes / Total writes
sum(storage_writes_successful_total) / sum(storage_writes_total)

Reference: See references/slo-definitions.md
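All three SLI types above share one shape: good events divided by total events. The following is a minimal Python sketch of that arithmetic, not part of the original framework. The sample counts, the `ratio_sli` helper, and the choice to treat zero traffic as a ratio of 1.0 are illustrative assumptions.

```python
def ratio_sli(good_events: float, total_events: float) -> float:
    """Return good/total, the common shape of availability, latency and durability SLIs."""
    if total_events == 0:
        # Assumption: with no traffic we report 1.0; PromQL itself would yield NaN here.
        return 1.0
    return good_events / total_events


# Availability: every response that is not a 5xx counts as good (hypothetical counts).
status_counts = {"200": 9_900, "404": 60, "500": 30, "503": 10}
good = sum(n for code, n in status_counts.items() if not code.startswith("5"))
availability = ratio_sli(good, sum(status_counts.values()))

# Latency: histogram buckets are cumulative, so the le="0.5" bucket already
# counts every request that finished within 500 ms (hypothetical counts).
latency_buckets = {"0.1": 7_000, "0.5": 9_850, "1.0": 9_980, "+Inf": 10_000}
latency = ratio_sli(latency_buckets["0.5"], latency_buckets["+Inf"])

print(f"availability={availability:.4f} latency={latency:.4f}")
```

Note that 4xx responses count as good in this sketch, which mirrors the `status!~"5.."` matcher in the availability query.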

Setting SLO Targets

Availability SLO Examples

| SLO % | Downtime/Month (30 days) | Downtime/Year (365 days) |
|--------|--------------------------|--------------------------|
| 99% | 7.2 hours | 3.65 days |
| 99.9% | 43.2 minutes | 8.76 hours |
| 99.95% | 21.6 minutes | 4.38 hours |
| 99.99% | 4.32 minutes | 52.56 minutes |
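The downtime figures follow directly from the SLO percentage and the length of the period. As a quick check, here is a small Python sketch that reproduces them. The function name is hypothetical, and the 30-day month and 365-day year are assumptions that match the numbers in the table.

```python
def allowed_downtime_minutes(slo_percent: float, period_days: float) -> float:
    """Minutes of complete downtime an availability SLO permits over a period."""
    error_budget = 1 - slo_percent / 100
    return error_budget * period_days * 24 * 60


for slo in (99.0, 99.9, 99.95, 99.99):
    monthly = allowed_downtime_minutes(slo, 30)   # assumes a 30-day month
    yearly = allowed_downtime_minutes(slo, 365)   # assumes a 365-day year
    print(f"{slo}%: {monthly:.2f} min/month, {yearly / 60:.2f} h/year")
```

Each additional nine divides the allowance by ten, which is why moving from 99.9% to 99.99% tends to be far more expensive than the change in percentage suggests.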

Choose Appropriate SLOs

Consider:
  • User expectations
  • Business requirements
  • Current performance
  • Cost of reliability
  • Competitor benchmarks
Example SLOs:
yaml
slos:
  - name: api_availability
    target: 99.9
    window: 28d
    sli: |
      sum(rate(http_requests_total{status!~"5.."}[28d]))
      /
      sum(rate(http_requests_total[28d]))

  - name: api_latency_p95
    target: 99
    window: 28d
    sli: |
      sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
      /
      sum(rate(http_request_duration_seconds_count[28d]))

Error Budget Calculation

Error Budget Formula

Error Budget = 1 - SLO Target
Example:
  • SLO: 99.9% availability
  • Error Budget: 0.1% = 43.2 minutes/month
  • Current Error: 0.05% = 21.6 minutes/month
  • Remaining Budget: 50%
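The worked example can be reproduced in a few lines of Python. This is a sketch of the arithmetic only; the function name is hypothetical, and SLO and SLI are expressed as fractions (0.999) rather than percentages.

```python
def error_budget_remaining(slo_target: float, observed_sli: float) -> float:
    """Fraction of the error budget still unspent; negative once the SLO is violated."""
    budget = 1 - slo_target        # e.g. 0.001 for a 99.9% SLO
    consumed = 1 - observed_sli    # observed error rate, e.g. 0.0005
    return (budget - consumed) / budget


remaining = error_budget_remaining(slo_target=0.999, observed_sli=0.9995)
print(f"{remaining:.0%} of the error budget remains")
```

A result below zero means more errors were served than the SLO allows for the window.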

Error Budget Policy

yaml
error_budget_policy:
  - remaining_budget: 100%
    action: Normal development velocity
  - remaining_budget: 50%
    action: Consider postponing risky changes
  - remaining_budget: 10%
    action: Freeze non-critical changes
  - remaining_budget: 0%
    action: Feature freeze, focus on reliability
Reference: See references/error-budget.md
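One way to make the policy machine-readable is a simple threshold lookup. The sketch below assumes each `remaining_budget` value is an upper bound ("at or below this level, take this action"); the policy itself does not say how values between the listed levels should be handled, so that interpretation and the `policy_action` name are assumptions.

```python
# (upper bound on remaining budget in percent, action), checked from most to least severe.
POLICY = [
    (0.0, "Feature freeze, focus on reliability"),
    (10.0, "Freeze non-critical changes"),
    (50.0, "Consider postponing risky changes"),
]
DEFAULT_ACTION = "Normal development velocity"


def policy_action(remaining_pct: float) -> str:
    """Return the action for the given remaining error budget (in percent)."""
    for upper_bound, action in POLICY:
        if remaining_pct <= upper_bound:
            return action
    return DEFAULT_ACTION


for pct in (100, 65, 50, 10, 0, -20):
    print(f"{pct:>4}% -> {policy_action(pct)}")
```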

SLO Implementation

Prometheus Recording Rules

yaml
# SLI Recording Rules
groups:
  - name: sli_rules
    interval: 30s
    rules:
      # Availability SLI
      - record: sli:http_availability:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[28d]))
          /
          sum(rate(http_requests_total[28d]))

      # Latency SLI (requests < 500ms)
      - record: sli:http_latency:ratio
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
          /
          sum(rate(http_request_duration_seconds_count[28d]))

  - name: slo_rules
    interval: 5m
    rules:
      # SLO compliance (1 = meeting SLO, 0 = violating)
      - record: slo:http_availability:compliance
        expr: sli:http_availability:ratio >= bool 0.999

      - record: slo:http_latency:compliance
        expr: sli:http_latency:ratio >= bool 0.99

      # Error budget remaining (percentage)
      - record: slo:http_availability:error_budget_remaining
        expr: |
          (sli:http_availability:ratio - 0.999) / (1 - 0.999) * 100

      # Error budget burn rate
      - record: slo:http_availability:burn_rate_5m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          )) / (1 - 0.999)

      # Same expression over the 30m, 1h and 6h windows that the alerting rules reference
      - record: slo:http_availability:burn_rate_30m
        expr: (1 - sum(rate(http_requests_total{status!~"5.."}[30m])) / sum(rate(http_requests_total[30m]))) / (1 - 0.999)
      - record: slo:http_availability:burn_rate_1h
        expr: (1 - sum(rate(http_requests_total{status!~"5.."}[1h])) / sum(rate(http_requests_total[1h]))) / (1 - 0.999)
      - record: slo:http_availability:burn_rate_6h
        expr: (1 - sum(rate(http_requests_total{status!~"5.."}[6h])) / sum(rate(http_requests_total[6h]))) / (1 - 0.999)

SLO Alerting Rules

yaml
groups:
  - name: slo_alerts
    interval: 1m
    rules:
      # Fast burn: 14.4x rate, 1 hour window
      # Consumes 2% of a 30-day error budget in 1 hour
      - alert: SLOErrorBudgetBurnFast
        expr: |
          slo:http_availability:burn_rate_1h > 14.4
          and
          slo:http_availability:burn_rate_5m > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Fast error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"

      # Slow burn: 6x rate, 6 hour window
      # Consumes 5% of a 30-day error budget in 6 hours
      - alert: SLOErrorBudgetBurnSlow
        expr: |
          slo:http_availability:burn_rate_6h > 6
          and
          slo:http_availability:burn_rate_30m > 6
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Slow error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"

      # Error budget exhausted
      - alert: SLOErrorBudgetExhausted
        expr: slo:http_availability:error_budget_remaining < 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "SLO error budget exhausted"
          description: "Error budget remaining: {{ $value }}%"
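The 14.4 and 6 thresholds are not arbitrary: a burn rate of 1 spends exactly the whole budget over the SLO period, so the threshold is the share of budget you are willing to lose divided by the share of the period the alert window covers. The Python sketch below derives them; the function name is hypothetical, and the 30-day default is an assumption chosen because it reproduces the numbers in the rules. The SLIs in this document use a 28-day window, for which the same budget shares give slightly lower thresholds.

```python
def burn_rate_threshold(
    budget_fraction: float,
    alert_window_hours: float,
    slo_period_days: float = 30,
) -> float:
    """Burn rate at which `budget_fraction` of the error budget is spent within the alert window."""
    period_hours = slo_period_days * 24
    return budget_fraction * period_hours / alert_window_hours


# 30-day period: matches the thresholds used in the alerting rules.
print(round(burn_rate_threshold(0.02, 1), 2))   # fast burn: 2% of budget in 1 hour
print(round(burn_rate_threshold(0.05, 6), 2))   # slow burn: 5% of budget in 6 hours

# 28-day period: what the same budget shares would imply for a 28d window.
print(round(burn_rate_threshold(0.02, 1, slo_period_days=28), 2))
print(round(burn_rate_threshold(0.05, 6, slo_period_days=28), 2))
```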

SLO Dashboard

Grafana Dashboard Structure:
┌────────────────────────────────────┐
│ SLO Compliance (Current)           │
│ ✓ 99.95% (Target: 99.9%)          │
├────────────────────────────────────┤
│ Error Budget Remaining: 65%        │
│ ████████░░ 65%                     │
├────────────────────────────────────┤
│ SLI Trend (28 days)                │
│ [Time series graph]                │
├────────────────────────────────────┤
│ Burn Rate Analysis                 │
│ [Burn rate by time window]         │
└────────────────────────────────────┘
Example Queries:
promql
# Current SLO compliance
sli:http_availability:ratio * 100

# Error budget remaining
slo:http_availability:error_budget_remaining

# Days until error budget exhausted (at current burn rate)
(slo:http_availability:error_budget_remaining / 100) * 28 / (1 - sli:http_availability:ratio) * (1 - 0.999)
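The last query is easier to follow when split into its parts: the remaining share of the budget, multiplied by the window length, divided by the burn rate. The Python sketch below mirrors that arithmetic under the same 99.9% target and 28-day window. The function name and the handling of the edge cases (no errors, budget already spent) are assumptions added for illustration.

```python
def days_until_exhausted(
    sli: float,
    slo_target: float = 0.999,
    window_days: float = 28,
) -> float:
    """Days until the error budget runs out if errors continue at the window's average rate."""
    budget = 1 - slo_target
    remaining_fraction = (sli - slo_target) / budget  # share of budget still unspent
    burn_rate = (1 - sli) / budget                    # 1.0 = spending exactly on budget
    if remaining_fraction <= 0:
        return 0.0           # assumption: budget already spent
    if burn_rate == 0:
        return float("inf")  # assumption: no errors observed, budget never runs out
    return remaining_fraction * window_days / burn_rate


print(round(days_until_exhausted(0.9995), 1))  # half the budget left, burning at 0.5x
print(round(days_until_exhausted(0.9992), 1))  # 20% of budget left, burning at 0.8x
```

Because the SLI is averaged over the full 28 days, this estimate reacts slowly to a sudden spike; the short-window burn rates are the better signal for that case.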

Multi-Window Burn Rate Alerts

yaml
# Combination of short and long windows reduces false positives
rules:
  - alert: SLOBurnRateHigh
    expr: |
      (
        slo:http_availability:burn_rate_1h > 14.4
        and
        slo:http_availability:burn_rate_5m > 14.4
      )
      or
      (
        slo:http_availability:burn_rate_6h > 6
        and
        slo:http_availability:burn_rate_30m > 6
      )
    labels:
      severity: critical
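To see why pairing windows cuts false positives, it helps to evaluate the same condition against a few scenarios. The Python sketch below restates the alert expression as a boolean function; the burn-rate values in each scenario are made up for illustration.

```python
def should_alert(burn: dict[str, float]) -> bool:
    """Mirror of the SLOBurnRateHigh expression: long and short window must both exceed the threshold."""
    fast = burn["1h"] > 14.4 and burn["5m"] > 14.4
    slow = burn["6h"] > 6 and burn["30m"] > 6
    return fast or slow


scenarios = {
    # A brief spike: only the 5m window is hot, so the long windows veto the alert.
    "short blip": {"5m": 20.0, "30m": 4.0, "1h": 2.0, "6h": 0.5},
    # An ongoing outage: both the 1h and 5m windows are above 14.4.
    "sustained outage": {"5m": 20.0, "30m": 18.0, "1h": 16.0, "6h": 3.0},
    # An outage that has ended: 1h is still high, but 5m has recovered.
    "already recovered": {"5m": 0.3, "30m": 5.0, "1h": 16.0, "6h": 3.0},
    # A slow leak: never dramatic, but 6h and 30m both exceed 6.
    "slow leak": {"5m": 7.0, "30m": 7.5, "1h": 7.0, "6h": 6.5},
}

for name, burn in scenarios.items():
    print(f"{name}: {'ALERT' if should_alert(burn) else 'ok'}")
```

The long window confirms that the problem is significant, and the short window confirms that it is still happening, so the alert also clears quickly once the service recovers.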

SLO Review Process

Weekly Review

  • Current SLO compliance
  • Error budget status
  • Trend analysis
  • Incident impact

Monthly Review

  • SLO achievement
  • Error budget usage
  • Incident postmortems
  • SLO adjustments

Quarterly Review

  • SLO relevance
  • Target adjustments
  • Process improvements
  • Tooling enhancements

Best Practices

  1. Start with user-facing services
  2. Use multiple SLIs (availability, latency, etc.)
  3. Set achievable SLOs (don't aim for 100%)
  4. Implement multi-window alerts to reduce noise
  5. Track error budget consistently
  6. Review SLOs regularly
  7. Document SLO decisions
  8. Align with business goals
  9. Automate SLO reporting
  10. Use SLOs for prioritization

Reference Files

  • assets/slo-template.md - SLO definition template
  • references/slo-definitions.md - SLO definition patterns
  • references/error-budget.md - Error budget calculations

Related Skills

  • prometheus-configuration - For metric collection
  • grafana-dashboards - For SLO visualization