See Extended Examples for complete template files.
**Advanced SRE runbook template** (excerpt):
Key template components:
- **Metadata**: Service ownership, severity, on-call rotation
- **Diagnostic Phase**: Quick checks → detailed investigation → failure patterns
- **Resolution Phase**: Immediate mitigation → root cause fix → verification
- **Escalation**: Criteria and contact paths
- **Communication**: Internal/external templates
- **Prevention**: Short/long-term actions
**Expected:** The selected template matches the incident's complexity, and its sections suit the service type.
**On failure:**
- Start with basic template, iterate based on incident patterns
- Review industry examples (Google SRE books, vendor runbooks)
- Adapt template based on team feedback after first use
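
One way to check a draft against the component list above is a small script that looks for each section in the runbook's headings. This is a minimal sketch, assuming the runbook is markdown and that each component appears as a heading containing its name; `missing_sections` and the sample draft are hypothetical, not part of the original template.

```python
import re
from typing import List

# Section names taken from the "Key template components" list.
REQUIRED_SECTIONS = [
    "Metadata",
    "Diagnostic",
    "Resolution",
    "Escalation",
    "Communication",
    "Prevention",
]


def missing_sections(runbook_markdown: str) -> List[str]:
    """Return the required section names that no markdown heading mentions."""
    headings = re.findall(r"^#{1,6}\s+(.*)$", runbook_markdown, flags=re.MULTILINE)
    found = " ".join(headings).lower()
    return [name for name in REQUIRED_SECTIONS if name.lower() not in found]


# Hypothetical draft that still lacks two sections.
draft = """# API Service Runbook
## Metadata
## Diagnostic Phase
## Resolution Phase
## Communication
"""

print(missing_sections(draft))  # ['Escalation', 'Prevention']
```

A check like this fits the "start basic, then iterate" advice: it only flags absent sections and says nothing about their quality.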
See Extended Examples for complete diagnostic queries and decision trees.

    curl -I https://api.example.com/health  # Expected: HTTP 200 OK

    up{job="api-service"}  # Expected: 1 for all instances

    sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m])) * 100  # Expected: < 1%

    {job="api-service"} |= "error" | json | level="error"

    avg(rate(container_cpu_usage_seconds_total{pod=~"api-service.*"}[5m])) * 100
    # Expected: < 70%

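
The expected values in the quick checks (every instance reporting 1, error rate under 1%, CPU under 70%) can be expressed as a small triage function. The sketch below is illustrative only: the function names and the finding labels are hypothetical, and it assumes the metric values have already been fetched from Prometheus by some other means.

```python
from typing import List


def error_rate_percent(errors_per_sec: float, requests_per_sec: float) -> float:
    """Same arithmetic as the PromQL ratio: 5xx rate / total rate * 100."""
    if requests_per_sec <= 0:
        return 0.0  # no traffic, so no measurable error rate
    return errors_per_sec / requests_per_sec * 100


def quick_check(instances_up: List[int], error_pct: float, cpu_pct: float) -> List[str]:
    """Compare observed values against the thresholds from the quick checks."""
    findings = []
    if not all(value == 1 for value in instances_up):
        findings.append("instance_down")
    if error_pct >= 1:
        findings.append("high_error_rate")
    if cpu_pct >= 70:
        findings.append("high_cpu")
    return findings


# Hypothetical readings: one instance down, 3 errors/s out of 200 requests/s, 45% CPU.
print(quick_check([1, 1, 0], error_rate_percent(3, 200), 45.0))
# ['instance_down', 'high_error_rate']
```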
See Extended Examples for all 5 resolution options with full commands and rollback procedures.
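
    # Roll back to the previous revision
    kubectl rollout undo deployment/api-service

    # Scale up by 50%; reading the current replica count this way is an assumption
    current=$(kubectl get deployment/api-service -o jsonpath='{.spec.replicas}')
    kubectl scale deployment/api-service --replicas=$((current * 3/2))

    # Rolling restart of all pods
    kubectl rollout restart deployment/api-service

    # Disable a feature flag (FEATURE_NAME is a placeholder)
    kubectl set env deployment/api-service FEATURE_NAME=false

    -- Kill long-running queries, restart connection pool, increase pool size
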
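
The scale-up expression `$((current * 3/2))` uses shell integer arithmetic, so it multiplies first and then truncates the division. The following sketch reproduces that calculation to show what replica counts it actually yields; `scaled_replicas` is a hypothetical helper, and for non-negative counts Python's floor division matches the shell's truncation.

```python
def scaled_replicas(current: int) -> int:
    """Mirror the shell expression: multiply by 3, then integer-divide by 2."""
    return current * 3 // 2


for replicas in (1, 2, 3, 4):
    print(replicas, "->", scaled_replicas(replicas))
# 1 -> 1
# 2 -> 3
# 3 -> 4
# 4 -> 6
```

Note that a single-replica deployment stays at 1, so the command is a no-op in that case and a fixed replica count may be the safer choice.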
See Extended Examples for full escalation levels and contact directory template.
See Extended Examples for all internal and external templates with full formatting.

    🚨 INCIDENT: [Title] | Severity: [Critical/High/Medium]
    Impact: [users/services] | Owner: @username | Dashboard: [link]
    Quick Summary: [1-2 sentences] | Next update: 15 min

    📊 UPDATE #N | Status: [Investigating/Mitigating/Monitoring]
    Actions: [what we tried and outcomes]
    Theory: [what we think is happening]
    Next: [planned actions]

    ✅ MITIGATION | Metrics: Error [before→after], Latency [before→after]
    Root Cause: [brief or "investigating"] | Monitoring 30min before resolved

    🎉 RESOLVED | Duration: [time] | Root Cause + Impact + Follow-up actions

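
Teams that post these messages from tooling may want to fill the placeholders programmatically. Below is a minimal sketch for the initial announcement only, assuming the three severity levels shown in the template; `render_announcement` and its parameter names are hypothetical, and the sample values are made up for illustration.

```python
ANNOUNCEMENT = (
    "🚨 INCIDENT: {title} | Severity: {severity}\n"
    "Impact: {impact} | Owner: @{owner} | Dashboard: {dashboard}\n"
    "Quick Summary: {summary} | Next update: {next_update_min} min"
)

# Severity levels listed in the announcement template.
ALLOWED_SEVERITIES = ("Critical", "High", "Medium")


def render_announcement(title, severity, impact, owner, dashboard, summary,
                        next_update_min=15):
    """Fill the announcement template, rejecting unknown severity levels."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"severity must be one of {ALLOWED_SEVERITIES}")
    return ANNOUNCEMENT.format(
        title=title, severity=severity, impact=impact, owner=owner,
        dashboard=dashboard, summary=summary, next_update_min=next_update_min,
    )


print(render_announcement(
    title="Elevated 5xx on api-service",
    severity="High",
    impact="checkout users",
    owner="oncall",
    dashboard="https://grafana.example.com/d/service-overview",
    summary="Error rate rose after the latest deploy.",
))
```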
See Extended Examples for complete Prometheus alert configuration and Grafana dashboard JSON.

    - alert: HighErrorRate
      annotations:
        runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
        dashboard_url: "https://grafana.example.com/d/service-overview"
        incident_channel: "#incident-platform"

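
To keep every alert linked to its runbook, a check like the one below could run in CI against the parsed rule files. It is a sketch under assumptions: the rule is represented as an already-parsed dictionary (YAML loading is left out), and `missing_annotations` is a hypothetical helper that treats the three annotations from the excerpt as mandatory.

```python
from typing import List

# Annotation keys shown in the HighErrorRate excerpt.
REQUIRED_ANNOTATIONS = ("runbook_url", "dashboard_url", "incident_channel")


def missing_annotations(rule: dict) -> List[str]:
    """Return required annotation keys that are absent or empty on a rule."""
    annotations = rule.get("annotations") or {}
    return [key for key in REQUIRED_ANNOTATIONS if not annotations.get(key)]


complete_rule = {
    "alert": "HighErrorRate",
    "annotations": {
        "runbook_url": "https://wiki.example.com/runbooks/high-error-rate",
        "dashboard_url": "https://grafana.example.com/d/service-overview",
        "incident_channel": "#incident-platform",
    },
}
# Hypothetical rule that was never linked to a runbook.
bare_rule = {"alert": "HighLatency", "annotations": {"dashboard_url": "https://grafana.example.com/d/service-overview"}}

print(missing_annotations(complete_rule))  # []
print(missing_annotations(bare_rule))      # ['runbook_url', 'incident_channel']
```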
- configure-alerting-rules
- build-grafana-dashboards
- setup-prometheus-monitoring
- define-slo-sli-sla