clay-incident-runbook

Clay Incident Runbook

Overview

Rapid incident response procedures for Clay-related outages.

Prerequisites

Access to Clay dashboard and status page
kubectl access to production cluster
Prometheus/Grafana access
Communication channels (Slack, PagerDuty)

Severity Levels

Level	Definition	Response Time	Examples
P1	Complete outage	< 15 min	Clay API unreachable
P2	Degraded service	< 1 hour	High latency, partial failures
P3	Minor impact	< 4 hours	Webhook delays, non-critical errors
P4	No user impact	Next business day	Monitoring gaps
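
The table above can be encoded as a small lookup so that alerting or paging scripts print the same targets. This is a minimal sketch; `severity_response_time` is a hypothetical helper name and is not part of any Clay or PagerDuty tooling.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the response-time target for a severity level,
# using the values from the Severity Levels table.
severity_response_time() {
  case "$1" in
    P1) echo "< 15 min" ;;
    P2) echo "< 1 hour" ;;
    P3) echo "< 4 hours" ;;
    P4) echo "Next business day" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

Example: `severity_response_time P2` prints the P2 target, and an unrecognized level returns a non-zero exit status.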

Quick Triage

# 1. Check Clay status
curl -s https://status.clay.com | jq

# 2. Check our integration health
curl -s https://api.yourapp.com/health | jq '.services.clay'

# 3. Check error rate (last 5 min)
# -G with --data-urlencode keeps the shell from interpreting ( ) [ ] in the query
curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=rate(clay_errors_total[5m])'

# 4. Recent error logs
kubectl logs -l app=clay-integration --since=5m | grep -i error | tail -20

Decision Tree

Clay API returning errors?
├─ YES: Is status.clay.com showing incident?
│   ├─ YES → Wait for Clay to resolve. Enable fallback.
│   └─ NO → Our integration issue. Check credentials, config.
└─ NO: Is our service healthy?
    ├─ YES → Likely resolved or intermittent. Monitor.
    └─ NO → Our infrastructure issue. Check pods, memory, network.
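
The tree above can be sketched as a shell function that takes the three answers as arguments. The function name `triage_decision` and the "yes"/"no" argument convention are assumptions made for illustration; how each answer is obtained (status page, health endpoint) is left to the triage commands.

```bash
#!/usr/bin/env bash
# Hypothetical helper mirroring the decision tree.
# Usage: triage_decision API_ERRORS CLAY_INCIDENT SERVICE_HEALTHY  (each "yes" or "no")
triage_decision() {
  local api_errors="$1" clay_incident="$2" service_healthy="$3"
  if [ "$api_errors" = "yes" ]; then
    # Clay API is returning errors: is Clay reporting an incident?
    if [ "$clay_incident" = "yes" ]; then
      echo "Wait for Clay to resolve. Enable fallback."
    else
      echo "Our integration issue. Check credentials, config."
    fi
  elif [ "$service_healthy" = "yes" ]; then
    echo "Likely resolved or intermittent. Monitor."
  else
    echo "Our infrastructure issue. Check pods, memory, network."
  fi
}
```

Example: `triage_decision yes no yes` points at credentials and config, since Clay reports no incident while the API still returns errors.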

Immediate Actions by Error Type

401/403 - Authentication

# Verify API key is set
kubectl get secret clay-secrets -o jsonpath='{.data.api-key}' | base64 -d

# Check if key was rotated
# → Verify in Clay dashboard

# Remediation: Update secret and restart pods
kubectl create secret generic clay-secrets --from-literal=api-key=NEW_KEY --dry-run=client -o yaml | kubectl apply -f -
kubectl rollout restart deployment/clay-integration

429 - Rate Limited

# Check rate limit headers
curl -v https://api.clay.com 2>&1 | grep -i rate

# Enable request queuing
kubectl set env deployment/clay-integration RATE_LIMIT_MODE=queue

# Long-term: Contact Clay for limit increase
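
A plain `grep -i rate` does not match a `Retry-After` header, which many APIs send alongside a 429. The sketch below filters response headers for both. Which headers Clay actually returns is not stated in this runbook, so the header names in the pattern are assumptions, and `rate_limit_headers` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read raw HTTP response headers on stdin and print only
# Retry-After and RateLimit-style headers (assumed names, not confirmed for Clay).
# Accepts both plain headers and curl -v output, where lines start with "< ".
rate_limit_headers() {
  grep -iE '^[<[:space:]]*(retry-after|(x-)?rate-?limit[^:]*):' | tr -d '\r'
}
```

Example: `curl -s -o /dev/null -D - https://api.clay.com | rate_limit_headers` dumps the response headers to stdout and keeps only the matching ones.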

500/503 - Clay Errors

# Enable graceful degradation
kubectl set env deployment/clay-integration CLAY_FALLBACK=true

# Notify users of degraded service
# Update status page
# Monitor Clay status for resolution
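
The three error-type sections above amount to a mapping from HTTP status code to a first action. A minimal sketch of that mapping follows; `action_for_status` is a hypothetical name, and the messages only summarize the steps already listed in this runbook.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map an HTTP status code from the Clay API to the
# immediate action described in this runbook.
action_for_status() {
  case "$1" in
    401|403) echo "Authentication: verify the API key, update the secret, restart pods" ;;
    429)     echo "Rate limited: check rate limit headers, enable request queuing" ;;
    500|503) echo "Clay error: enable graceful degradation, notify users, monitor Clay status" ;;
    *)       echo "No immediate action listed for status $1"; return 1 ;;
  esac
}
```

Example: `action_for_status 429` prints the rate-limit steps, while a code the runbook does not cover returns a non-zero exit status so callers can escalate.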

Communication Templates

Internal (Slack)

🔴 P1 INCIDENT: Clay Integration
Status: INVESTIGATING
Impact: [Describe user impact]
Current action: [What you're doing]
Next update: [Time]
Incident commander: @[name]
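
To keep updates consistent under pressure, the template above can be filled from arguments. This is a sketch only; `render_incident_update` is a hypothetical helper, and posting the result to Slack is left out because the runbook does not specify a webhook or bot.

```bash
#!/usr/bin/env bash
# Hypothetical helper: fill the internal Slack template from positional arguments.
# Usage: render_incident_update SEVERITY STATUS IMPACT ACTION NEXT_UPDATE COMMANDER
render_incident_update() {
  local severity="$1" status="$2" impact="$3" action="$4" next_update="$5" commander="$6"
  printf '%s\n' \
    "🔴 ${severity} INCIDENT: Clay Integration" \
    "Status: ${status}" \
    "Impact: ${impact}" \
    "Current action: ${action}" \
    "Next update: ${next_update}" \
    "Incident commander: @${commander}"
}
```

Example: `render_incident_update P1 INVESTIGATING "Clay API unreachable" "Checking credentials" "14:30 UTC" alice` prints the six-line message ready to paste into the incident channel.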

External (Status Page)

Clay Integration Issue

We're experiencing issues with our Clay integration.
Some users may experience [specific impact].

We're actively investigating and will provide updates.

Last updated: [timestamp]

Post-Incident

Evidence Collection

# Generate debug bundle
./scripts/clay-debug-bundle.sh

# Export relevant logs
kubectl logs -l app=clay-integration --since=1h > incident-logs.txt

# Capture metrics for the last 2 hours
# query_range needs start and end as timestamps plus a step; "start=2h" is not accepted
curl -sG http://localhost:9090/api/v1/query_range \
  --data-urlencode 'query=clay_errors_total' \
  --data-urlencode "start=$(( $(date +%s) - 7200 ))" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode 'step=60s' > metrics.json

Postmortem Template

Incident: Clay [Error Type]

Date: YYYY-MM-DD
Duration: X hours Y minutes
Severity: P[1-4]

Summary

[1-2 sentence description]

Timeline

HH:MM - [Event]
HH:MM - [Event]

Root Cause

[Technical explanation]

Impact

Users affected: N
Revenue impact: $X

Action Items

[Preventive measure] - Owner - Due date

Instructions

Step 1: Quick Triage

Run the triage commands to identify the issue source.

Step 2: Follow Decision Tree

Determine if the issue is Clay-side or internal.

Step 3: Execute Immediate Actions

Apply the appropriate remediation for the error type.

Step 4: Communicate Status

Update internal and external stakeholders.

Output

Issue identified and categorized
Remediation applied
Stakeholders notified
Evidence collected for postmortem

Error Handling

Issue	Cause	Solution
Can't reach status page	Network issue	Use mobile or VPN
kubectl fails	Auth expired	Re-authenticate
Metrics unavailable	Prometheus down	Check backup metrics
Secret rotation fails	Permission denied	Escalate to admin

Examples

One-Line Health Check

# jq -e exits non-zero when curl produced no output or the field is null, so the fallback fires
curl -sf https://api.yourapp.com/health | jq -e '.services.clay.status' || echo "UNHEALTHY"

Resources

Next Steps

For data handling, see clay-data-handling.