drill-recovery

Disaster Recovery Drills


Disaster drill scenarios and security checklists for indie web apps. Teaches big-tech resilience principles through indie-scale practice.
Scope: Web applications only (SPA, SSR, full-stack). Not mobile, desktop, CLI, or games.
Audience: Solo devs, indie builders, vibe-coders. No corporate jargon.

Workflow


Step 1: Read Project Context


Read the project's context file to understand the codebase:
  • Look for CLAUDE.md, GEMINI.md, or AGENTS.md at the project root
  • If none found, ask the human to describe their project briefly
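The lookup above could be sketched as follows. This is a minimal illustration, not the skill's actual implementation: the function name `find_context_file` is hypothetical, and checking the files in the order they are listed is an assumption.

```python
from pathlib import Path
from typing import Optional

# File names come from the step above; the priority order is an assumption.
CONTEXT_FILES = ("CLAUDE.md", "GEMINI.md", "AGENTS.md")


def find_context_file(project_root: str) -> Optional[Path]:
    """Return the first context file found at the project root, or None."""
    root = Path(project_root)
    for name in CONTEXT_FILES:
        candidate = root / name
        if candidate.is_file():
            return candidate
    # Nothing found: the caller should ask the human to describe the project.
    return None
```

A `None` result is the cue to fall back to asking the human for a brief project description.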

Step 2: Scan Project Stack


Scan the project directly using your file tools. Gather:
From package.json:
  • Framework (next.js, vite-react, nuxt, sveltekit, remix, astro, etc.)
  • Database SDK (@supabase/supabase-js, firebase, prisma, drizzle, mongoose)
  • Auth (supabase-auth, nextauth, lucia, clerk)
  • Payments (stripe, lemonsqueezy)
  • AI APIs (openai, @anthropic-ai/sdk, @google/generative-ai)
  • Monitoring (@sentry/react, dd-trace, logrocket)
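One way to picture the package.json scan is a lookup of installed dependency names against a small table of known packages. The sketch below is hedged: `detect_stack` and `KNOWN_PACKAGES` are hypothetical names, the table is deliberately incomplete, and the framework entries (`next`, `nuxt`, `astro`) are assumed npm package names rather than the labels used in the list above.

```python
import json

# Category -> npm package names. Illustrative and incomplete; the framework
# names are assumed npm package names, the rest appear in the list above.
KNOWN_PACKAGES = {
    "framework": ["next", "nuxt", "astro"],
    "database": ["@supabase/supabase-js", "firebase", "prisma", "mongoose"],
    "payments": ["stripe"],
    "ai": ["openai", "@anthropic-ai/sdk", "@google/generative-ai"],
    "monitoring": ["@sentry/react", "dd-trace", "logrocket"],
}


def detect_stack(package_json_text: str) -> dict[str, list[str]]:
    """Map each category to the known packages found in the manifest."""
    manifest = json.loads(package_json_text)
    installed = set(manifest.get("dependencies", {}))
    installed |= set(manifest.get("devDependencies", {}))
    found = {}
    for category, names in KNOWN_PACKAGES.items():
        hits = [name for name in names if name in installed]
        if hits:
            found[category] = hits
    return found
```

Categories with no hits are left out of the result, so a missing `monitoring` key is itself a finding worth flagging.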
From project files:
  • Hosting config (vercel.json, wrangler.toml, netlify.toml, fly.toml)
  • TypeScript (tsconfig.json)
  • CI/CD (.github/workflows/)
  • Edge functions (supabase/functions/*)
  • Database tables (from supabase/migrations/ or prisma/schema.prisma)
  • Storage buckets (from migrations or storage config)
From env/security files:
  • .gitignore covers .env files?
  • Client-side env vars (NEXT_PUBLIC_, VITE_) — flag only if they contain actual secrets, not public-by-design keys like anon keys or site keys
  • CSP headers configured? (check _headers, middleware, next.config)
  • RLS enabled? (check migration files for ENABLE ROW LEVEL SECURITY)
If no project files are available, ask 3-5 quick questions: stack, hosting, database, users, backups.
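Two of the env/security checks above lend themselves to simple text heuristics. The sketch below assumes the .gitignore and migration files have already been read into strings; the function names are hypothetical, and both checks are rough heuristics rather than a full gitignore matcher or SQL parser.

```python
import re

# Matches statements such as:
#   ALTER TABLE public.users ENABLE ROW LEVEL SECURITY;
RLS_PATTERN = re.compile(
    r'alter\s+table\s+(?:if\s+exists\s+)?(?:only\s+)?([\w."]+)'
    r'\s+enable\s+row\s+level\s+security',
    re.IGNORECASE,
)


def gitignore_covers_env(gitignore_text: str) -> bool:
    """Heuristic: True if a pattern that ignores .env itself is present."""
    for raw in gitignore_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and negation rules.
        if not line or line.startswith(("#", "!")):
            continue
        if line.lstrip("/") in {".env", ".env*"}:
            return True
    return False


def tables_with_rls(migration_sql: str) -> set[str]:
    """Return unqualified names of tables that have RLS enabled."""
    names = set()
    for match in RLS_PATTERN.finditer(migration_sql):
        # Drop any schema prefix and surrounding double quotes.
        names.add(match.group(1).strip('"').split(".")[-1].strip('"'))
    return names


def tables_missing_rls(tables: list[str], migration_sql: str) -> list[str]:
    """Return the tables that never get RLS enabled in the migrations."""
    protected = tables_with_rls(migration_sql)
    return [t for t in tables if t not in protected]
```

Anything returned by `tables_missing_rls` is a candidate action item; tables that already have RLS simply do not show up, which matches the rule of skipping items that are already safe.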

Step 3: Load Previous State


Check for docs/.dr-state.json. This tracks completed items across runs.

```json
{
  "last_run": "2026-02-21",
  "checklist_completed": ["monitoring_added", "ci_pipeline_added"],
  "drills_completed": [
    { "domain": "secrets", "difficulty": "beginner", "date": "2026-02-21" }
  ],
  "runbook_exists": true,
  "postmortem_exists": false,
  "stack_snapshot": {
    "edge_functions": ["advance-game", "submit-answer"],
    "tables": ["users", "questions", "game_sessions"],
    "services": ["supabase", "cloudflare", "resend"],
    "storage_buckets": ["question-images"]
  }
}
```
If it exists:
  • Skip checklist items in checklist_completed. These are items the human confirmed they fixed, NOT items that were already safe at scan time. Items that are "already safe" (e.g., RLS enabled, CSP configured) are handled by the conciseness rules: the agent re-scans and re-skips them naturally every run. Never auto-populate checklist_completed.
  • Skip drill domains already done at that difficulty
  • Don't re-ask about runbook/postmortem if already created
  • Show a brief "Previously completed" summary
Only add to checklist_completed when the human explicitly confirms they fixed an action item (e.g., "I added Sentry" → add "monitoring_added").
If it doesn't exist, this is a first run — create it after this session.
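The state handling described in this step might look like the sketch below. The helper names are hypothetical; only the file path and field names come from the example state file. Note that `confirm_fixed` is meant to be called only after an explicit human confirmation, never from scan results.

```python
import json
from pathlib import Path

STATE_PATH = Path("docs/.dr-state.json")


def load_state(path: Path = STATE_PATH) -> dict:
    """Load the state file, or return an empty dict on a first run."""
    if not path.is_file():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def drill_already_done(state: dict, domain: str, difficulty: str) -> bool:
    """True if this domain was already drilled at this difficulty."""
    return any(
        drill.get("domain") == domain and drill.get("difficulty") == difficulty
        for drill in state.get("drills_completed", [])
    )


def confirm_fixed(state: dict, item: str) -> None:
    """Record an item the human explicitly said they fixed (no duplicates)."""
    completed = state.setdefault("checklist_completed", [])
    if item not in completed:
        completed.append(item)
```

For example, after the human says "I added Sentry", calling `confirm_fixed(state, "monitoring_added")` records it once, and repeat calls leave the list unchanged.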

Step 4: Choose Mode


Present two options:
📋 CHECKLIST — "Am I prepared?" Proactive audit with prioritized fixes. Best for: first-time use, new projects, pre-launch, quarterly review.
🔥 EXERCISE DRILL — "Can I handle it?" Simulated incident in three phases:
  • Before: Prep your playbook, confirm monitoring, define stop conditions
  • During: Scenario injects with pause-and-think prompts
  • After: Observation log, follow-up TODOs with deadlines
Best for: after basics are solid, building muscle memory, testing response speed.
Solo devs play all roles: incident commander, service owner, on-call, comms lead.
Recommend Checklist first if the user has never done this.

Step 5: Generate & Write Persistent Doc


Generate the output AND write it to docs/. The file is the real deliverable.
File path: docs/DR_<MODE>_<DATE>.md
  • Checklist → docs/DR_CHECKLIST_2026-02-21.md
  • Drill → docs/DR_DRILL_<DOMAIN>_2026-02-21.md
Tone: Notes to future me at 2am. Practical, direct, copy-paste-friendly.
Conciseness rules:
  • Only include items that need action. If something is safe or properly configured, skip it entirely. No "this is fine" entries.
  • Skip items already completed in .dr-state.json
  • Every section must earn its place. Empty = omit.
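The file naming pattern above can be expressed as a small helper. This is a sketch under assumptions: `doc_path` is a hypothetical name, and upper-casing the domain is a guess based on the `<DOMAIN>` placeholder, since the source does not show a filled-in drill filename.

```python
from datetime import date
from typing import Optional


def doc_path(mode: str, run_date: date, domain: Optional[str] = None) -> str:
    """Build the output path for a checklist or drill doc."""
    stamp = run_date.isoformat()  # YYYY-MM-DD, as in the examples above
    if mode == "checklist":
        return f"docs/DR_CHECKLIST_{stamp}.md"
    if mode == "drill":
        if not domain:
            raise ValueError("drill docs need a domain")
        # Upper-casing the domain is an assumption, not confirmed by the source.
        return f"docs/DR_DRILL_{domain.upper()}_{stamp}.md"
    raise ValueError(f"unknown mode: {mode}")
```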

Checklist doc structure


<Project Name> — DR Checklist


Version: 1.0 Created: <date> Profile: <framework> / <hosting> / <database>

Recovery Targets


| Metric | Target | Why |
| --- | --- | --- |
| RTO | < X hours | <1 sentence> |
| RPO | < X hours | <1 sentence> |

What matters most


| Tier | Data | Recovery |
| --- | --- | --- |
| Critical | <actual tables> | <method> |
| Can rebuild | <derived data> | <method> |
| Expendable | <ephemeral data> | Restart |

Your Stack


<ASCII diagram — keep it simple, only real services>

Weak spots


<Only single points of failure with no mitigation yet. Skip if none.>

Action Items


<Only items needing action. Severity first, then quick wins.>
| # | What | How | Effort |
| --- | --- | --- | --- |
| 1 | <problem> | <specific fix with command> | ⚡/🔧 |

Readiness


<Scores — only domains below 8/10. If solid, skip it.>

Drill doc structure


Keep it concise. The doc is a practice exercise, not a textbook. Teach through the scenario itself, not extra sections explaining concepts.

<Project Name> — Drill: <Vivid Scenario Title>


Domain: <emoji> <domain> | Difficulty: <level> | Created: <date>

Before you start


<3-4 honest self-check questions. Short. No fluff.>

Scenario


<Background — 2-3 sentences setting the scene with real stack details.>

⏱️ INJECT 1 — <timestamp>


<What happened. Real error messages, real service names, real URLs. End with 1-2 pause-and-think questions in bold.>

⏱️ INJECT 2 — <timestamp>


<Escalation or new info. Same format.>

Resolution


Right now: <commands>
Today: <stabilize>
This week: <prevent recurrence>

TODOs


| # | What | Deadline | Done? |
| --- | --- | --- | --- |
| 1 | ... | This week | |
The takeaway: <1-2 sentences. What big-tech calls this, what to actually do at indie scale. No jargon walls.>
Next suggested drill: <pick untried domain from .dr-state.json>

Drill domains


Pick from these 7 domains (or random weighted by detected risks):
  • cost — 💸 Cost & Billing (DDoS, billing spikes, API abuse)
  • data — 🗑️ Data Loss (backup failure, accidental delete, corruption)
  • secrets — 🔐 Secrets & Credentials (leaked keys, rotation)
  • access — 🔓 Access Control (broken auth, IDOR, missing RLS)
  • availability — 🚫 Availability (outage, deploy failure, DNS)
  • code — 🤖 Code Vulnerabilities (XSS, SQLi, dependency CVEs)
  • recovery — 🔄 Recoverability (rebuild from scratch, lost env vars)
Difficulty controls inject count:
  • beginner: 2 injects, ~15 min
  • intermediate: 3 injects, ~20 min
  • advanced: 4 injects, ~30 min
Read references/risk-domains.md for extra scenario seeds and checklist items if you need more variety.
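Domain selection and inject counts can be sketched as below. The domain keys and inject counts come from the lists above; everything else is an assumption for illustration: the `pick_domain` name, the default weight of 1.0, and the choice to prefer domains not yet drilled at the chosen difficulty.

```python
import random

DOMAINS = ["cost", "data", "secrets", "access", "availability", "code", "recovery"]

# Inject count per difficulty, as listed above.
INJECTS = {"beginner": 2, "intermediate": 3, "advanced": 4}


def pick_domain(
    risk_weights: dict[str, float],
    already_done: set[str],
    rng: random.Random,
) -> str:
    """Pick a domain at random, weighted by detected risk.

    Domains already drilled at this difficulty are excluded unless every
    domain has been done, in which case all are eligible again.
    """
    candidates = [d for d in DOMAINS if d not in already_done] or DOMAINS
    # Domains with no detected risk get a neutral weight of 1.0 (assumed).
    weights = [risk_weights.get(d, 1.0) for d in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

Passing in a `random.Random` instance keeps the choice reproducible when a seed is supplied, which helps when re-running a session.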

Step 6: Offer Follow-Up Docs


After writing the main doc, ask the human — don't assume:
  1. Runbook drift check: Check if docs/RUNBOOK.md exists.
    • If no → ask: "Want me to write a docs/RUNBOOK.md with step-by-step recovery commands for your stack?" Only write if they say yes.
    • If yes → compare current stack against stack_snapshot in .dr-state.json (or scan runbook content if no state file). Look for:
      • New edge functions not in the runbook
      • New tables not covered by recovery scenarios
      • New services with no runbook entry
      • Removed components still referenced
    • If drift found → tell human: "Your docs/RUNBOOK.md is missing coverage for: X, Y. Want me to update it?"
    • If no drift → skip silently
  2. Post-mortem (Drill mode only): Ask: "Want me to save a post-mortem to docs/POSTMORTEM_<DOMAIN>_<DATE>.md? Useful to track patterns." Only write if they say yes.
  3. Backup script: If no backup strategy detected, ask: "Want me to generate a scripts/dr-backup.sh?" Only write if they say yes.
Update docs/.dr-state.json after each run:
  • checklist_completed / drills_completed for this session
  • stack_snapshot with current edge functions, tables, services, buckets
  • runbook_exists / postmortem_exists flags
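The drift check amounts to a set comparison between the stored stack_snapshot and the current scan. The sketch below uses the four snapshot keys from the example state file; the function name `snapshot_drift` and the shape of its return value are hypothetical.

```python
SNAPSHOT_KEYS = ("edge_functions", "tables", "services", "storage_buckets")


def snapshot_drift(previous: dict, current: dict) -> dict:
    """Return added/removed components per key; empty dict means no drift."""
    drift = {}
    for key in SNAPSHOT_KEYS:
        before = set(previous.get(key, []))
        now = set(current.get(key, []))
        added = sorted(now - before)      # new, possibly missing from the runbook
        removed = sorted(before - now)    # gone, possibly still referenced
        if added or removed:
            drift[key] = {"added": added, "removed": removed}
    return drift
```

An empty result corresponds to the "skip silently" branch; anything else feeds the "missing coverage for: X, Y" prompt.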

Step 7: Follow Up


  • For Checklist: offer to generate fix code for top action items
  • For Drill: offer to implement the top TODO right now
  • Suggest next drill: pick an untried domain from .dr-state.json
  • Remind: "Run this again next quarter — I'll skip what you've already fixed."

Reference Files


The references/ directory has supplemental content for deeper scenarios:
  • references/risk-domains.md: all 7 risk domains with extra scenario seeds and checklist item libraries