clay-incident-runbook

Clay Incident Runbook

Overview

Rapid incident response procedures for Clay-related outages.

Prerequisites

  • Access to Clay dashboard and status page
  • kubectl access to production cluster
  • Prometheus/Grafana access
  • Communication channels (Slack, PagerDuty)

Severity Levels

| Level | Definition | Response Time | Examples |
|-------|------------|---------------|----------|
| P1 | Complete outage | < 15 min | Clay API unreachable |
| P2 | Degraded service | < 1 hour | High latency, partial failures |
| P3 | Minor impact | < 4 hours | Webhook delays, non-critical errors |
| P4 | No user impact | Next business day | Monitoring gaps |

Quick Triage

1. Check Clay status (status.clay.com).

2. Check our integration health:

    curl -s https://api.yourapp.com/health | jq '.services.clay'

3. Check error rate (last 5 min). The query is passed with -G and --data-urlencode so the shell and curl do not misread the parentheses and brackets:

    curl -sG localhost:9090/api/v1/query --data-urlencode 'query=rate(clay_errors_total[5m])'

4. Recent error logs:

    kubectl logs -l app=clay-integration --since=5m | grep -i error | tail -20

Decision Tree

Clay API returning errors?
├─ YES: Is status.clay.com showing incident?
│   ├─ YES → Wait for Clay to resolve. Enable fallback.
│   └─ NO → Our integration issue. Check credentials, config.
└─ NO: Is our service healthy?
    ├─ YES → Likely resolved or intermittent. Monitor.
    └─ NO → Our infrastructure issue. Check pods, memory, network.
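
The tree above can be encoded as a small shell function so that every responder lands on the same branch. This is a minimal sketch: the function name `clay_triage_decision` and its three yes/no arguments are illustrative and not part of any existing tooling.

```bash
# Hypothetical helper: maps the three yes/no answers from the decision tree
# to the recommended action.
#   $1 = is the Clay API returning errors?      (yes|no)
#   $2 = does status.clay.com show an incident? (yes|no)
#   $3 = is our own service healthy?            (yes|no)
clay_triage_decision() {
  local api_errors=$1 clay_incident=$2 service_healthy=$3
  if [ "$api_errors" = "yes" ]; then
    # Errors from Clay: the status page decides whose problem it is.
    if [ "$clay_incident" = "yes" ]; then
      echo "Wait for Clay to resolve. Enable fallback."
    else
      echo "Our integration issue. Check credentials, config."
    fi
  else
    # No Clay errors: look at our own service health instead.
    if [ "$service_healthy" = "yes" ]; then
      echo "Likely resolved or intermittent. Monitor."
    else
      echo "Our infrastructure issue. Check pods, memory, network."
    fi
  fi
}
```

For example, `clay_triage_decision yes no yes` prints the integration branch, which points at credentials and config.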

Immediate Actions by Error Type

401/403 - Authentication

Verify API key is set:

    kubectl get secret clay-secrets -o jsonpath='{.data.api-key}' | base64 -d

Check if key was rotated: verify in the Clay dashboard.

Remediation: update the secret, then restart the pods (two separate commands):

    kubectl create secret generic clay-secrets --from-literal=api-key=NEW_KEY --dry-run=client -o yaml | kubectl apply -f -
    kubectl rollout restart deployment/clay-integration
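
Decoding the key straight to the terminal leaves it in scrollback. One alternative is to compare short fingerprints instead of the key itself. The sketch below assumes `sha256sum` (GNU coreutils) is available. The name `key_fingerprint` is hypothetical.

```bash
# Hypothetical helper: reads a secret on stdin and prints the first 8 hex
# characters of its SHA-256 digest, so two keys can be compared without
# displaying either one. Assumes sha256sum (GNU coreutils) is installed.
key_fingerprint() {
  sha256sum | cut -c1-8
}

# Usage sketch (not run here): fingerprint the key stored in the cluster,
# then fingerprint the candidate key and compare the two values.
#   kubectl get secret clay-secrets -o jsonpath='{.data.api-key}' | base64 -d | key_fingerprint
#   printf '%s' "$CANDIDATE_KEY" | key_fingerprint
```

Use `printf '%s'` rather than `echo` for the candidate key: a trailing newline changes the digest and produces a false mismatch.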

429 - Rate Limited

Check rate limit headers (including the standard Retry-After header, if sent):

    curl -v https://api.clay.com 2>&1 | grep -iE 'rate|retry-after'

Enable request queuing:

    kubectl set env deployment/clay-integration RATE_LIMIT_MODE=queue

Long-term: contact Clay for a limit increase.
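
If the 429 response carries a `Retry-After` header, its value says how long to back off. Whether Clay sends this header is an assumption, since the runbook only greps for "rate". The sketch below extracts the value from raw response headers. The function name is hypothetical.

```bash
# Hypothetical helper: reads raw HTTP response headers on stdin and prints the
# value of the Retry-After header (seconds or an HTTP date). Prints nothing if
# the header is absent. The header name is matched case-insensitively.
retry_after_from_headers() {
  tr -d '\r' |
    awk 'tolower($0) ~ /^retry-after:/ {
           sub(/^[^:]*:[ \t]*/, "")   # strip the header name and separator
           print
           exit
         }'
}

# Usage sketch (not run here): dump only the response headers and parse them.
#   curl -s -D - -o /dev/null https://api.clay.com | retry_after_from_headers
```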

500/503 - Clay Errors

Enable graceful degradation:

    kubectl set env deployment/clay-integration CLAY_FALLBACK=true

Then:

  • Notify users of degraded service
  • Update status page
  • Monitor Clay status for resolution
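
Monitoring for resolution can be scripted as a bounded polling loop instead of manual refreshes. This is a minimal sketch: `wait_until_ok` is a hypothetical helper, and the check command, attempt count, and interval are placeholders to adapt.

```bash
# Hypothetical helper: runs a check command up to max_attempts times, sleeping
# interval seconds between attempts. Returns 0 as soon as the check succeeds,
# or 1 if every attempt fails.
#   $1 = max_attempts, $2 = interval in seconds, remaining args = check command
wait_until_ok() {
  local max_attempts=$1 interval=$2
  shift 2
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep only when another attempt will follow.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$interval"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Usage sketch (not run here): poll our health endpoint once a minute for up
# to 30 minutes and report the outcome.
#   wait_until_ok 30 60 curl -sf -o /dev/null https://api.yourapp.com/health \
#     && echo "health check passing" || echo "still failing after 30 attempts"
```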

Communication Templates

Internal (Slack)

🔴 P1 INCIDENT: Clay Integration
Status: INVESTIGATING
Impact: [Describe user impact]
Current action: [What you're doing]
Next update: [Time]
Incident commander: @[name]

External (Status Page)

Clay Integration Issue

We're experiencing issues with our Clay integration.
Some users may experience [specific impact].

We're actively investigating and will provide updates.

Last updated: [timestamp]

Post-Incident

Evidence Collection

Generate debug bundle:

    ./scripts/clay-debug-bundle.sh

Export relevant logs:

    kubectl logs -l app=clay-integration --since=1h > incident-logs.txt

Capture metrics for the last 2 hours. The query_range endpoint takes absolute start and end timestamps plus a step (60 seconds here, as an example):

    now=$(date +%s)
    curl -sG localhost:9090/api/v1/query_range --data-urlencode 'query=clay_errors_total' --data-urlencode "start=$((now - 7200))" --data-urlencode "end=$now" --data-urlencode 'step=60' > metrics.json
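
When evidence is collected some time after the incident, a window tied to the incident timestamps may be more useful than one relative to "now". The sketch below computes such a window with padding on both sides. The function name and the 15-minute default padding are assumptions for illustration.

```bash
# Hypothetical helper: prints "start end" as Unix timestamps covering the
# incident plus a padding margin on each side.
#   $1 = incident start (Unix seconds)
#   $2 = incident end   (Unix seconds)
#   $3 = padding in minutes (optional, defaults to 15)
incident_window() {
  local start=$1 end=$2 pad_minutes=${3:-15}
  local pad=$((pad_minutes * 60))
  echo "$((start - pad)) $((end + pad))"
}

# Usage sketch (not run here): feed the window into a range query.
#   read -r from to < <(incident_window "$INCIDENT_START" "$INCIDENT_END")
#   curl -sG localhost:9090/api/v1/query_range \
#     --data-urlencode 'query=clay_errors_total' \
#     --data-urlencode "start=$from" --data-urlencode "end=$to" \
#     --data-urlencode 'step=60' > metrics.json
```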

Postmortem Template

    Incident: Clay [Error Type]

    Date: YYYY-MM-DD
    Duration: X hours Y minutes
    Severity: P[1-4]

    Summary

    [1-2 sentence description]

    Timeline

      • HH:MM - [Event]
      • HH:MM - [Event]

    Root Cause

    [Technical explanation]

    Impact

      • Users affected: N
      • Revenue impact: $X

    Action Items

      • [Preventive measure] - Owner - Due date

Instructions

Step 1: Quick Triage

Run the triage commands to identify the issue source.

Step 2: Follow Decision Tree

Determine if the issue is Clay-side or internal.

Step 3: Execute Immediate Actions

Apply the appropriate remediation for the error type.

Step 4: Communicate Status

Update internal and external stakeholders.

Output

  • Issue identified and categorized
  • Remediation applied
  • Stakeholders notified
  • Evidence collected for postmortem

Error Handling

| Issue | Cause | Solution |
|-------|-------|----------|
| Can't reach status page | Network issue | Use mobile or VPN |
| kubectl fails | Auth expired | Re-authenticate |
| Metrics unavailable | Prometheus down | Check backup metrics |
| Secret rotation fails | Permission denied | Escalate to admin |

Examples

One-Line Health Check

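    curl -sf https://api.yourapp.com/health | jq -e '.services.clay.status' || echo "UNHEALTHY"

The -e flag makes jq exit non-zero when the status is missing, null, or false, or when curl fails and sends it no input. Without it, the pipeline takes jq's exit code of 0 and UNHEALTHY is never printed.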

Resources

Next Steps

For data handling, see clay-data-handling.