juicebox-incident-runbook

Juicebox Incident Runbook

Overview

Standardized incident response procedures for Juicebox integration issues.

Incident Severity Levels

Severity	Description	Response Time	Examples
P1	Critical	< 15 min	Complete outage, data loss
P2	High	< 1 hour	Major feature broken, degraded performance
P3	Medium	< 4 hours	Minor feature issue, workaround exists
P4	Low	< 24 hours	Cosmetic, non-blocking
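The response-time targets in the table can be encoded so that tooling, such as a paging script, looks them up instead of hard-coding them. The following is a minimal sketch; the function names `response_minutes` and `response_deadline` are hypothetical, and the deadline helper assumes GNU `date`.

```shell
#!/bin/bash
# Map a severity level to its response-time target in minutes,
# using the upper bounds from the severity table.
response_minutes() {
  case "$1" in
    P1) echo 15 ;;
    P2) echo 60 ;;
    P3) echo 240 ;;
    P4) echo 1440 ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# Print the UTC time by which a first response is due.
# Assumes GNU date (the -d flag); BSD/macOS date uses different flags.
response_deadline() {
  minutes=$(response_minutes "$1") || return 1
  date -u -d "+${minutes} minutes" +%Y-%m-%dT%H:%M:%SZ
}
```

For example, `response_deadline P2` prints a timestamp one hour from now in the same format the diagnostics script uses.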

Quick Diagnostics

Step 1: Immediate Assessment

bash

#!/bin/bash
# quick-diag.sh - Run immediately when an incident is detected

echo "=== Juicebox Quick Diagnostics ==="
echo "Timestamp: $(date -u +%Y-%m-%dT%H:%M:%SZ)"

# Check Juicebox status page
echo ""
echo "=== Juicebox Status ==="
curl -s https://status.juicebox.ai/api/status | jq '.status'

# Check our API health
echo ""
echo "=== Our API Health ==="
curl -s http://localhost:8080/health/ready | jq '.'

# Check recent error logs
echo ""
echo "=== Recent Errors (last 5 min) ==="
kubectl logs -l app=juicebox-integration --since=5m | grep -i error | tail -20

# Check metrics (the query is URL-encoded so braces and quotes reach Prometheus intact)
echo ""
echo "=== Error Rate ==="
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=rate(juicebox_requests_total{status="error"}[5m])' \
  | jq '.data.result[0].value[1]'

Step 2: Identify Root Cause

Incident Triage Decision Tree

Is Juicebox status page showing issues?
- YES → External outage, skip to "External Outage Response"
- NO → Continue
Are we getting authentication errors (401)?
- YES → Check API key validity, skip to "Auth Issues"
- NO → Continue
Are we getting rate limited (429)?
- YES → Skip to "Rate Limit Response"
- NO → Continue
Are requests timing out?
- YES → Skip to "Timeout Response"
- NO → Continue
Are we getting unexpected errors?
- YES → Skip to "Application Error Response"
- NO → Gather more data
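The decision tree above can be sketched as a small lookup that takes what the diagnostics report and names the procedure to follow. This is an illustration only: the function name `triage_section` is hypothetical, the healthy status value `operational` is an assumption about what the status page returns, and `000` is the code curl's `%{http_code}` prints when no HTTP response arrived.

```shell
#!/bin/bash
# triage_section STATUS_PAGE HTTP_CODE
#   STATUS_PAGE: value reported by the Juicebox status page
#                ("operational" is assumed to mean healthy)
#   HTTP_CODE:   status code of a failing request; 000 means no response
triage_section() {
  status_page=$1
  http_code=$2

  # First question in the tree: is the provider itself reporting issues?
  if [ "$status_page" != "operational" ]; then
    echo "External Outage Response"
    return 0
  fi

  # Remaining questions, in the same order as the tree.
  case "$http_code" in
    401)     echo "Auth Issues" ;;
    429)     echo "Rate Limit Response" ;;
    000)     echo "Timeout Response" ;;
    4??|5??) echo "Application Error Response" ;;
    *)       echo "Gather more data" ;;
  esac
}
```

A call such as `triage_section operational 429` prints `Rate Limit Response`, which is the section to jump to.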

Response Procedures

External Outage Response

When Juicebox is Down

Confirm Outage
- Check https://status.juicebox.ai
- Verify with curl test to API

Enable Fallback Mode

bash

kubectl set env deployment/juicebox-integration JUICEBOX_FALLBACK=true

Notify Stakeholders
- Post to #incidents channel
- Update status page if customer-facing
Monitor Recovery
- Set up alert for Juicebox status change
- Prepare to disable fallback mode
Post-Incident
- Disable fallback when Juicebox recovers
- Document timeline and impact
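Because fallback mode is enabled during the outage and disabled after recovery, it can help to wrap both directions in one helper with a dry-run switch. The sketch below is hypothetical: `set_fallback` and `DRY_RUN` are illustrative names, and setting `JUICEBOX_FALLBACK=false` to disable fallback is an assumption, since the runbook only shows the `true` case.

```shell
#!/bin/bash
# set_fallback on|off
# With DRY_RUN=1 the kubectl command is printed instead of executed,
# so the change can be reviewed before it touches the deployment.
set_fallback() {
  case "$1" in
    on)  value=true ;;
    off) value=false ;;   # assumption: "false" turns fallback off
    *)   echo "usage: set_fallback on|off" >&2; return 2 ;;
  esac

  cmd="kubectl set env deployment/juicebox-integration JUICEBOX_FALLBACK=$value"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$cmd"
  else
    $cmd
  fi
}
```

Running `DRY_RUN=1 set_fallback off` first shows the exact command that the recovery step would apply.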

Auth Issues Response

When Authentication Fails

Verify API Key

bash

# Mask key for logging
echo "Key prefix: ${JUICEBOX_API_KEY:0:10}..."

# Test auth
curl -H "Authorization: Bearer $JUICEBOX_API_KEY" \
  https://api.juicebox.ai/v1/auth/me

Check Key Status in Dashboard
- Log into https://app.juicebox.ai
- Verify key is active and not revoked
Rotate Key if Compromised
- Generate new key in dashboard
- Update secret manager
- Restart pods
bash

kubectl rollout restart deployment/juicebox-integration

Verify Recovery
- Check health endpoint
- Monitor error rate
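The key check above can be made scriptable by capturing only the HTTP status of the auth test and turning it into a verdict. This is a sketch under assumptions: `mask_key`, `auth_verdict`, and `check_auth` are hypothetical helper names, and the endpoint is the one used in the runbook's curl example.

```shell
#!/bin/bash
# Print only the first 10 characters of a key, for safe logging.
mask_key() {
  printf '%.10s...\n' "$1"
}

# Translate the HTTP status of the auth test into a next step.
auth_verdict() {
  case "$1" in
    2??) echo "key accepted" ;;
    401) echo "key rejected: check dashboard, rotate if compromised" ;;
    000) echo "no response: treat as timeout, not an auth failure" ;;
    *)   echo "unexpected status $1: gather more data" ;;
  esac
}

# Run the auth test and report the verdict.
# curl prints 000 for %{http_code} when no HTTP response was received.
check_auth() {
  code=$(curl -s -o /dev/null -w '%{http_code}' \
    -H "Authorization: Bearer $JUICEBOX_API_KEY" \
    https://api.juicebox.ai/v1/auth/me)
  echo "Key prefix: $(mask_key "$JUICEBOX_API_KEY")"
  auth_verdict "$code"
}
```

Separating `auth_verdict` from the network call keeps the decision logic easy to review and reuse after a key rotation.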

Rate Limit Response

When Rate Limited

Check Current Usage

bash

curl -H "Authorization: Bearer $JUICEBOX_API_KEY" \
  https://api.juicebox.ai/v1/usage

Immediate Mitigation

Enable aggressive caching
Reduce request rate

bash

kubectl set env deployment/juicebox-integration JUICEBOX_RATE_LIMIT=10

If Quota Exhausted
- Contact Juicebox support for temporary increase
- Implement request queuing
Long-term Fix
- Review usage patterns
- Implement better caching
- Consider plan upgrade
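One common way to reduce request rate on the client side is capped exponential backoff between retries. The runbook does not prescribe this technique, so the sketch below is only one possible approach; `backoff_delay` and `retry_with_backoff` are hypothetical names, and the 1-second base and 60-second cap are illustrative defaults.

```shell
#!/bin/bash
# backoff_delay ATTEMPT [BASE] [CAP]
# Prints the wait in seconds before retry number ATTEMPT (0-based):
# BASE * 2^ATTEMPT, never more than CAP.
backoff_delay() {
  attempt=$1
  base=${2:-1}
  cap=${3:-60}
  delay=$base
  i=0
  while [ "$i" -lt "$attempt" ] && [ "$delay" -lt "$cap" ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt "$cap" ]; then
    delay=$cap
  fi
  echo "$delay"
}

# retry_with_backoff MAX_ATTEMPTS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached,
# sleeping with increasing delays between attempts.
retry_with_backoff() {
  max=$1
  shift
  n=0
  while [ "$n" -lt "$max" ]; do
    "$@" && return 0
    n=$((n + 1))
    if [ "$n" -lt "$max" ]; then
      sleep "$(backoff_delay "$((n - 1))")"
    fi
  done
  return 1
}
```

A usage check could be wrapped as `retry_with_backoff 5 curl -sf -H "Authorization: Bearer $JUICEBOX_API_KEY" https://api.juicebox.ai/v1/usage`, where `-f` makes curl return a non-zero exit status on HTTP errors such as 429.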

Timeout Response

When Requests Time Out

Check Network

bash

# DNS resolution
nslookup api.juicebox.ai

# Connectivity
curl -v --connect-timeout 5 https://api.juicebox.ai/v1/health

Check Load
- Review query complexity
- Check for unusually large requests

Increase Timeout

bash

kubectl set env deployment/juicebox-integration JUICEBOX_TIMEOUT=60000

Implement Circuit Breaker
- Enable circuit breaker if repeated timeouts
- Serve cached/fallback data
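The circuit breaker step can be illustrated with a small state machine that opens after a run of consecutive timeouts and allows a trial request once a cooldown has passed. This is a minimal sketch, not the integration's actual implementation: the `cb_*` names, the threshold of 3, and the 30-second cooldown are all hypothetical.

```shell
#!/bin/bash
CB_THRESHOLD=3    # consecutive timeouts before the breaker opens (illustrative)
CB_COOLDOWN=30    # seconds the breaker stays open before a trial request
cb_failures=0
cb_opened_at=0

# Record a timeout. $1 is the current epoch time (defaults to now).
cb_record_failure() {
  now=${1:-$(date +%s)}
  cb_failures=$((cb_failures + 1))
  if [ "$cb_failures" -ge "$CB_THRESHOLD" ]; then
    cb_opened_at=$now   # open, or re-open after a failed trial request
  fi
}

# Record a successful request: close the breaker.
cb_record_success() {
  cb_failures=0
  cb_opened_at=0
}

# Succeeds (exit 0) while the breaker is open, meaning the caller
# should skip the API and serve cached or fallback data instead.
cb_is_open() {
  now=${1:-$(date +%s)}
  [ "$cb_failures" -ge "$CB_THRESHOLD" ] &&
    [ $((now - cb_opened_at)) -lt "$CB_COOLDOWN" ]
}
```

A caller would check `cb_is_open` before each request, then call `cb_record_success` or `cb_record_failure` depending on the outcome. Once the cooldown elapses, the next request acts as the trial that either closes the breaker or re-opens it.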

Incident Communication Template

Incident Report Template

Incident ID: INC-YYYY-MM-DD-XXX
Status: Investigating | Identified | Monitoring | Resolved
Severity: P1 | P2 | P3 | P4
Started: YYYY-MM-DD HH:MM UTC
Resolved: (fill in once resolved)

Summary

[Brief description of the incident]

Impact

Users affected: [number/percentage]
Features affected: [list]
Duration: [time]

Timeline

HH:MM - Incident detected
HH:MM - Investigation started
HH:MM - Root cause identified
HH:MM - Mitigation applied
HH:MM - Incident resolved

Root Cause

[Description of what caused the incident]

Resolution

[What was done to fix it]

Action Items

Action 1 (Owner, Due Date)
Action 2 (Owner, Due Date)
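Filling in the template header by hand is error-prone during an incident, so the identifier can be generated. The sketch below follows the template's `INC-YYYY-MM-DD-XXX` pattern; treating `XXX` as a zero-padded daily sequence number is an assumption, and `incident_id` is a hypothetical helper name.

```shell
#!/bin/bash
# incident_id SEQ [DATE]
#   SEQ:  sequence number for the day, as a plain integer (7, not 007)
#   DATE: YYYY-MM-DD, defaults to today's date in UTC
incident_id() {
  seq_no=$1
  day=${2:-$(date -u +%Y-%m-%d)}
  printf 'INC-%s-%03d\n' "$day" "$seq_no"
}
```

For instance, `incident_id 7 2024-01-15` prints `INC-2024-01-15-007`.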

On-Call Checklist

On-Call Handoff Checklist

Before Shift

Access to all monitoring dashboards
VPN/access to production systems
Runbook bookmarked
Escalation contacts available

During Shift

End of Shift

Output

Diagnostic scripts
Response procedures
Communication templates
On-call checklists

Resources

Next Steps

After incident, see juicebox-data-handling for data management.
