launch-runbook

Launch Runbook

Plan and execute the launch of a website, product, or major release. The runbook is the document everyone uses on launch day. Stack-agnostic.
This skill is for the launch event. For pre-launch QA, use qa-testing. For post-launch incident handling, use incident-response.

When to use

  • Launching a new website or major redesign
  • Migrating from one platform to another
  • Releasing a major product or feature
  • Coordinating cross-team launches
  • Building a runbook for a recurring deploy

When NOT to use

  • Pre-launch testing (use qa-testing)
  • Post-launch incident response (use incident-response)
  • After-launch retrospective (use after-action-report)

Required inputs

  • The launch scope (what's being launched)
  • The launch window (date, time, duration)
  • The team (roles, on-call rotation)
  • The rollback criteria (when to abort)
  • The communication plan (who tells whom what, when)

The framework: 4 phases

A launch has four phases. The runbook covers all four.

Phase 1: Pre-launch (T-30 days to T-1 hour)

Verify everything is ready before the launch window.
T-30 days:
  • Final scope locked
  • Cross-team commitments confirmed
  • Pre-launch QA scheduled
  • Comms plan drafted
T-7 days:
  • Pre-launch QA complete
  • All critical and major issues resolved
  • Performance baseline measured
  • Rollback procedures documented and tested
  • DNS TTL lowered (if DNS change is part of launch)
T-1 day:
  • Final go/no-go meeting
  • Roles confirmed
  • Communication channels set up
  • Backup of current production state
T-1 hour:
  • Team assembled in shared communication channel
  • Tools and access verified
  • Final smoke test on staging
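
The checkpoints above are all offsets from the launch time, so they can be derived rather than tracked by hand. A minimal sketch, assuming the four offsets from this checklist; the function name and the example launch time are hypothetical:

```python
from datetime import datetime, timedelta

# Offsets taken from the Phase 1 checklist. The dictionary name is illustrative.
CHECKPOINT_OFFSETS = {
    "T-30 days": timedelta(days=30),
    "T-7 days": timedelta(days=7),
    "T-1 day": timedelta(days=1),
    "T-1 hour": timedelta(hours=1),
}


def prelaunch_checkpoints(launch_at: datetime) -> dict[str, datetime]:
    """Return the calendar time of each pre-launch checkpoint."""
    return {label: launch_at - offset for label, offset in CHECKPOINT_OFFSETS.items()}


if __name__ == "__main__":
    launch = datetime(2025, 6, 10, 14, 0)  # hypothetical launch time
    for label, when in prelaunch_checkpoints(launch).items():
        print(f"{label}: {when:%Y-%m-%d %H:%M}")
```

The resulting dates can be pasted into the runbook's pre-launch checklist or into calendar invites.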

Phase 2: Cutover (T-0)

The actual launch. Sequenced steps with owners and verifications.
Standard cutover steps:
  1. Announce start to internal team
  2. Enable maintenance mode (if applicable)
  3. Run final database migrations (if applicable)
  4. Deploy code to production
  5. Verify deploy completed without errors
  6. Run smoke tests on production
  7. DNS cutover (if applicable)
  8. Verify DNS propagation
  9. Disable maintenance mode
  10. Run full smoke tests on production
  11. Announce launch to internal team
  12. Begin monitoring window
Each step has:
  • Owner
  • Pre-conditions
  • Action
  • Verification
  • Time estimate
  • Rollback procedure
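
The per-step fields above map naturally onto a small record type. The following is a sketch only, assuming each field can be expressed as a callable; the class name, field names, and runner are hypothetical and not part of the skill. It also unwinds completed steps automatically on failure, which is a simplification: in this runbook the launch lead makes the rollback call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CutoverStep:
    name: str
    owner: str
    estimate_minutes: int
    precondition: Callable[[], bool]
    action: Callable[[], None]
    verify: Callable[[], bool]
    rollback: Callable[[], None]


def run_cutover(steps: list[CutoverStep]) -> tuple[bool, list[str]]:
    """Run steps in order; on failure, roll back executed steps in reverse."""
    log: list[str] = []
    executed: list[CutoverStep] = []
    for step in steps:
        if not step.precondition():
            log.append(f"ABORT {step.name}: precondition not met")
            break
        step.action()
        executed.append(step)  # the action ran, so it may need undoing
        if not step.verify():
            log.append(f"FAIL {step.name}: verification failed")
            break
        log.append(f"OK {step.name} ({step.owner})")
    else:
        # No break occurred: every step ran and verified.
        return True, log
    for step in reversed(executed):
        step.rollback()
        log.append(f"ROLLBACK {step.name}")
    return False, log
```

Verifying each step before moving to the next keeps a failure from propagating into later steps, and the returned log doubles as a record for status updates.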

Phase 3: Verification (T+0 to T+24 hours)

Confirm the launch is healthy.
Within first hour:
  • Critical user flows working (checkout, signup, login)
  • No spike in error rates
  • Performance within expected ranges
  • Analytics tracking firing
  • Email and notifications working
Within first 24 hours:
  • No regression in key business metrics
  • No accumulating error patterns
  • Core Web Vitals stable
  • Search Console showing no critical issues (if SEO-relevant)

Phase 4: Stabilization (T+24 hours to T+7 days)

Monitor the long tail.
  • Track error rates day over day
  • Track performance day over day
  • Track key business metrics vs baseline
  • Address any non-blocking issues identified
  • Plan the AAR (after-action report)
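
Tracking a metric against its baseline can be as simple as a percent-change calculation per day. A minimal sketch, assuming a metric where higher is worse (error rate, latency) and a tolerance the team picks; both function names are hypothetical:

```python
def pct_change_vs_baseline(baseline: float, observed: list[float]) -> list[float]:
    """Percent change of each daily value relative to the pre-launch baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return [round((value - baseline) / baseline * 100, 1) for value in observed]


def flag_regressions(changes: list[float], tolerance_pct: float) -> list[int]:
    """Return the 1-based day numbers whose change exceeds the tolerance."""
    return [day for day, change in enumerate(changes, start=1) if change > tolerance_pct]
```

For metrics where lower is worse, such as conversion rate, the comparison in `flag_regressions` would need to be inverted.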

Roles and responsibilities

A launch needs clear role assignments. Ambiguity here is the most common cause of launch chaos.

| Role | Responsibility |
| --- | --- |
| Launch lead | Owns the runbook. Calls go/no-go. Calls rollback. |
| Deploy operator | Executes the technical deploy steps. |
| QA lead | Runs verification tests and confirms each milestone. |
| Comms lead | Posts internal updates, manages external messaging. |
| On-call engineer | Available for issues during and after launch. |
| Stakeholder rep | Approves on behalf of business stakeholders. |

For small teams, one person may fill multiple roles. Each role's responsibilities should still be explicit.

Rollback criteria

Define rollback criteria before the launch. Decisions are easier to make in advance than under pressure.
Automatic rollback triggers:
  • Error rate exceeds X percent of normal
  • Critical user flow (defined) is broken
  • Database integrity issue
  • Security vulnerability discovered post-deploy
Discretionary rollback triggers:
  • Performance degradation beyond Y percent
  • Significant degradation in key business metric
  • Customer-facing error patterns
Decision authority: The launch lead calls rollback. Pre-define who acts as deputy if launch lead is unavailable.
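
Writing the triggers down as code forces the thresholds X and Y to be chosen in advance. The sketch below is one possible encoding, not a prescribed implementation: the type names, fields, and the use of latency as the performance measure are assumptions, and the discretionary triggers that need human judgment (business metrics, customer-facing error patterns) are left out.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    CONTINUE = "continue"
    LEAD_DECIDES = "discretionary: launch lead decides"
    ROLLBACK = "automatic rollback"


@dataclass
class LaunchHealth:
    error_rate: float            # current error rate, e.g. 0.02 for 2%
    baseline_error_rate: float   # "normal" error rate measured before launch
    latency_ms: float
    baseline_latency_ms: float
    critical_flow_broken: bool = False
    db_integrity_issue: bool = False
    security_issue: bool = False


def evaluate(health: LaunchHealth, x_pct_of_normal: float, y_pct_degradation: float) -> Decision:
    """Apply automatic triggers first, then the performance trigger."""
    error_limit = health.baseline_error_rate * x_pct_of_normal / 100
    if (
        health.error_rate > error_limit
        or health.critical_flow_broken
        or health.db_integrity_issue
        or health.security_issue
    ):
        return Decision.ROLLBACK
    latency_limit = health.baseline_latency_ms * (1 + y_pct_degradation / 100)
    if health.latency_ms > latency_limit:
        return Decision.LEAD_DECIDES
    return Decision.CONTINUE
```

For example, with X = 200 and Y = 25, an error rate of 3% against a 1% baseline returns `Decision.ROLLBACK`, while a latency of 260 ms against a 200 ms baseline returns `Decision.LEAD_DECIDES`.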

Communication plan

Internal channels

  • Primary launch channel: Real-time chat for the launch team only
  • Status channel: Broader internal updates
  • War room: Optional video call for high-stakes launches

Update cadence during launch

  • Every 15 minutes during cutover
  • Every hour during verification phase
  • Daily during stabilization phase
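
The cadence above can be turned into concrete posting times for the comms lead. A minimal sketch, assuming the phase start and end times are known; the phase keys and function name are hypothetical:

```python
from datetime import datetime, timedelta

# Intervals taken from the cadence list. The phase keys are illustrative.
UPDATE_INTERVAL = {
    "cutover": timedelta(minutes=15),
    "verification": timedelta(hours=1),
    "stabilization": timedelta(days=1),
}


def update_times(phase: str, start: datetime, end: datetime) -> list[datetime]:
    """List the times at which a status update is due, from start through end."""
    step = UPDATE_INTERVAL[phase]
    times = []
    current = start
    while current <= end:
        times.append(current)
        current += step
    return times
```

A one-hour cutover starting at 14:00 yields five update times: 14:00, 14:15, 14:30, 14:45, and 15:00.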

External communication

  • Customer-facing announcement: Pre-drafted, scheduled to publish at confirmed-success milestone
  • Status page: Updated proactively if any user impact
  • Support team: Briefed in advance on what's launching, common questions, escalation path

Workflow

  1. Build the runbook 30 days out. Scope, sequence, roles, rollback criteria, comms plan.
  2. Test the rollback procedure. Untested rollback is hope, not procedure.
  3. Run a tabletop exercise. Walk through the runbook with the full team. Find gaps.
  4. Lower DNS TTL 48 to 72 hours before launch (if DNS change is part of launch).
  5. Day-of: Run the runbook step by step. Verify each step before moving to next.
  6. Monitor. First hour, first day, first week. Document anything noteworthy.
  7. Schedule the AAR within 1 to 2 weeks of launch.

Failure patterns

  • Runbook written by one person, not reviewed. Single perspective misses scenarios.
  • No tested rollback. You discover the rollback is broken at the moment you need it.
  • Vague step descriptions. "Deploy to production" without specifying which tool, which command, which environment.
  • No verification step after each action. Errors propagate.
  • Communication gaps. Team doesn't know launch is happening, or doesn't know it succeeded.
  • Launching at end of day Friday or just before a holiday. Both reduce the time available to respond.
  • Skipping pre-launch QA to hit a date. The bugs appear on launch day instead.
  • Launch fatigue. Long launches without breaks lead to errors. Plan rest cycles for multi-day launches.
  • No on-call for first 24 hours. Someone must be reachable.
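
Some of these patterns can be caught mechanically when the launch window is chosen. A small sketch for the Friday and holiday cases, assuming a caller-supplied set of holiday dates and a hypothetical 15:00 cutoff for "end of day":

```python
from datetime import date, datetime, timedelta


def launch_window_warnings(
    launch_at: datetime,
    holidays: set[date],
    end_of_day_hour: int = 15,  # assumed cutoff, adjust per team
) -> list[str]:
    """Flag launch times that leave little time to respond."""
    warnings = []
    # datetime.weekday() returns 4 for Friday.
    if launch_at.weekday() == 4 and launch_at.hour >= end_of_day_hour:
        warnings.append("late Friday launch")
    if launch_at.date() + timedelta(days=1) in holidays:
        warnings.append("launch on the day before a holiday")
    return warnings
```

An empty list means neither pattern applies. It does not mean the window is otherwise safe.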

Output format

Default output: a markdown runbook at launch-runbook-[project].md plus supporting checklists.
Structure:
  1. Launch metadata (what, when, who)
  2. Roles and responsibilities
  3. Pre-launch checklist (T-30, T-7, T-1, T-1hr)
  4. Cutover sequence (numbered steps, owners, verifications)
  5. Rollback procedure
  6. Rollback criteria (automatic and discretionary)
  7. Communication plan
  8. Verification checklist (first hour, first day)
  9. Stabilization plan (first week)
  10. Contacts (escalation paths, on-call)
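
The structure above can be scaffolded as an empty markdown file before the content is filled in. A minimal sketch, assuming `#` and `##` heading levels and a `TODO` placeholder, neither of which is specified by the skill; the function names are hypothetical:

```python
# Section titles copied from the structure list.
SECTIONS = [
    "Launch metadata",
    "Roles and responsibilities",
    "Pre-launch checklist",
    "Cutover sequence",
    "Rollback procedure",
    "Rollback criteria",
    "Communication plan",
    "Verification checklist",
    "Stabilization plan",
    "Contacts",
]


def runbook_filename(project: str) -> str:
    """Build the default output filename for a project."""
    return f"launch-runbook-{project}.md"


def runbook_skeleton(project: str) -> str:
    """Return a markdown skeleton with one numbered heading per section."""
    lines = [f"# Launch Runbook: {project}", ""]
    for number, title in enumerate(SECTIONS, start=1):
        lines += [f"## {number}. {title}", "", "TODO", ""]
    return "\n".join(lines)
```

For a project named `checkout-v2`, `runbook_filename` returns `launch-runbook-checkout-v2.md`.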

Reference files

  • references/runbook-template.md: Fillable runbook template with example cutover sequences.