launch-runbook
Launch Runbook
Plan and execute the launch of a website, product, or major release. The runbook is the document everyone uses on launch day. Stack-agnostic.
This skill is for the launch event. For pre-launch QA, use `qa-testing`. For post-launch incident handling, use `incident-response`.
When to use
- Launching a new website or major redesign
- Migrating from one platform to another
- Releasing a major product or feature
- Coordinating cross-team launches
- Building a runbook for a recurring deploy
When NOT to use
- Pre-launch testing (use `qa-testing`)
- Post-launch incident response (use `incident-response`)
- After-launch retrospective (use `after-action-report`)
Required inputs
- The launch scope (what's being launched)
- The launch window (date, time, duration)
- The team (roles, on-call rotation)
- The rollback criteria (when to abort)
- The communication plan (who tells whom what, when)
The framework: 4 phases
A launch has four phases. The runbook covers all four.
Phase 1: Pre-launch (T-30 days to T-1 hour)
Verify everything is ready before the launch window.
T-30 days:
- Final scope locked
- Cross-team commitments confirmed
- Pre-launch QA scheduled
- Comms plan drafted
T-7 days:
- Pre-launch QA complete
- All critical and major issues resolved
- Performance baseline measured
- Rollback procedures documented and tested
- DNS TTL lowered (if DNS change is part of launch)
T-1 day:
- Final go/no-go meeting
- Roles confirmed
- Communication channels set up
- Backup of current production state
T-1 hour:
- Team assembled in shared communication channel
- Tools and access verified
- Final smoke test on staging
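The checkpoints above can be turned into calendar times once the launch window is fixed. The following is a minimal sketch, assuming a single launch timestamp; the names `CHECKPOINTS` and `prelaunch_schedule` and the example date are hypothetical, not part of the runbook.

```python
from datetime import datetime, timedelta

# Offsets mirror the four pre-launch checkpoints (T-30 days to T-1 hour).
CHECKPOINTS = {
    "T-30 days": timedelta(days=30),
    "T-7 days": timedelta(days=7),
    "T-1 day": timedelta(days=1),
    "T-1 hour": timedelta(hours=1),
}


def prelaunch_schedule(launch_at: datetime) -> dict[str, datetime]:
    """Return the calendar time of each pre-launch checkpoint."""
    return {label: launch_at - offset for label, offset in CHECKPOINTS.items()}


if __name__ == "__main__":
    launch = datetime(2025, 6, 10, 14, 0)  # hypothetical launch time
    for label, when in prelaunch_schedule(launch).items():
        print(f"{label}: {when:%Y-%m-%d %H:%M}")
```

Putting the computed dates into the runbook next to each checklist makes it harder for a checkpoint to slip unnoticed.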
Phase 2: Cutover (T-0)
The actual launch. Sequenced steps with owners and verifications.
Standard cutover steps:
- Announce start to internal team
- Enable maintenance mode (if applicable)
- Run final database migrations (if applicable)
- Deploy code to production
- Verify deploy completed without errors
- Run smoke tests on production
- DNS cutover (if applicable)
- Verify DNS propagation
- Disable maintenance mode
- Run full smoke tests on production
- Announce launch to internal team
- Begin monitoring window
Each step has:
- Owner
- Pre-conditions
- Action
- Verification
- Time estimate
- Rollback procedure
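One way to picture the per-step fields is as a small record with a runner that refuses to advance until a step verifies. This is a sketch under stated assumptions: `CutoverStep` and `run_cutover` are hypothetical names, and the runner rolls back automatically for simplicity, whereas in a real launch the launch lead makes that call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CutoverStep:
    name: str
    owner: str
    precondition: Callable[[], bool]   # must hold before the action runs
    action: Callable[[], None]
    verification: Callable[[], bool]   # confirms the action worked
    rollback: Callable[[], None]       # undoes the action
    estimate_minutes: int


def run_cutover(steps: list[CutoverStep]) -> bool:
    """Run steps in order; on failure, undo executed steps in reverse order."""
    executed: list[CutoverStep] = []
    for step in steps:
        ok = step.precondition()
        if ok:
            step.action()
            executed.append(step)  # the action ran, so it may need undoing
            ok = step.verification()
        if not ok:
            for prior in reversed(executed):
                prior.rollback()
            return False
    return True
```

A step whose precondition fails is never executed, so only the earlier steps are undone. A step whose verification fails is undone along with everything before it.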
Phase 3: Verification (T+0 to T+24 hours)
Confirm the launch is healthy.
Within first hour:
- Critical user flows working (checkout, signup, login)
- No spike in error rates
- Performance within expected ranges
- Analytics tracking firing
- Email and notifications working
Within first 24 hours:
- No regression in key business metrics
- No accumulating error patterns
- Core Web Vitals stable
- Search Console showing no critical issues (if SEO-relevant)
Phase 4: Stabilization (T+24 hours to T+7 days)
Monitor the long tail.
- Track error rates day over day
- Track performance day over day
- Track key business metrics vs baseline
- Address any non-blocking issues identified
- Plan the AAR (after-action report)
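Day-over-day and versus-baseline tracking comes down to percent change. The helpers below are a minimal sketch; the function names and sample numbers are illustrative assumptions, and the baseline is assumed to be non-zero.

```python
def pct_change(current: float, reference: float) -> float:
    """Percent change of current relative to a non-zero reference value."""
    return 100.0 * (current - reference) / reference


def day_over_day(daily_values: list[float]) -> list[float]:
    """Percent change between each consecutive pair of daily values."""
    return [
        pct_change(today, yesterday)
        for yesterday, today in zip(daily_values, daily_values[1:])
    ]


if __name__ == "__main__":
    errors_per_day = [100, 110, 99]  # hypothetical daily error counts
    print(day_over_day(errors_per_day))
    print(pct_change(95, 100))  # a metric versus its pre-launch baseline
```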
Roles and responsibilities
A launch has clear role assignments. Ambiguity here is the most common cause of launch chaos.
| Role | Responsibility |
|---|---|
| Launch lead | Owns the runbook. Calls go/no-go. Calls rollback. |
| Deploy operator | Executes the technical deploy steps. |
| QA lead | Runs verification tests and confirms each milestone. |
| Comms lead | Posts internal updates, manages external messaging. |
| On-call engineer | Available for issues during and after launch. |
| Stakeholder rep | Approves on behalf of business stakeholders. |
For small teams, one person may fill multiple roles. Each role's responsibilities should still be explicit.
Rollback criteria
Define before the launch. Decisions are easier to make pre-emptively than under pressure.
Automatic rollback triggers:
- Error rate exceeds X percent of normal
- Critical user flow (defined) is broken
- Database integrity issue
- Security vulnerability discovered post-deploy
Discretionary rollback triggers:
- Performance degradation beyond Y percent
- Significant degradation in key business metric
- Customer-facing error patterns
Decision authority: The launch lead calls rollback. Pre-define who acts as deputy if the launch lead is unavailable.
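Writing the triggers down as code forces the X and Y placeholders to become actual numbers before launch day. The sketch below is illustrative only: `LaunchHealth`, `Decision`, and `evaluate` are hypothetical names, latency stands in for "performance", and the business-metric and customer-facing triggers are left to human judgment.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    CONTINUE = "continue"
    DISCRETIONARY = "launch lead decides"
    ROLLBACK = "roll back"


@dataclass
class LaunchHealth:
    error_rate: float
    baseline_error_rate: float
    latency_ms: float
    baseline_latency_ms: float
    critical_flow_broken: bool = False
    db_integrity_issue: bool = False
    security_issue: bool = False


def evaluate(health: LaunchHealth, x_pct: float, y_pct: float) -> Decision:
    """Map current health to a decision using the thresholds X and Y."""
    # Automatic triggers: any one of these means roll back.
    if (
        health.error_rate > health.baseline_error_rate * x_pct / 100.0
        or health.critical_flow_broken
        or health.db_integrity_issue
        or health.security_issue
    ):
        return Decision.ROLLBACK
    # Discretionary trigger: performance degraded beyond Y percent.
    if health.latency_ms > health.baseline_latency_ms * (1 + y_pct / 100.0):
        return Decision.DISCRETIONARY
    return Decision.CONTINUE
```

For example, with X = 300 and Y = 20, an error rate of 0.05 against a baseline of 0.01 exceeds 300 percent of normal and returns `Decision.ROLLBACK`. The function only recommends; the launch lead or the named deputy still makes the call.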
Communication plan
Internal channels
- Primary launch channel: Real-time chat for the launch team only
- Status channel: Broader internal updates
- War room: Optional video call for high-stakes launches
Update cadence during launch
- Every 15 minutes during cutover
- Every hour during verification phase
- Daily during stabilization phase
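The cadence above maps directly onto a lookup table, which a reminder bot or a checklist script could use to compute when the next update is due. This is a minimal sketch; the phase keys and the `next_update` helper are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Intervals taken from the cadence list; the phase keys are hypothetical.
UPDATE_INTERVAL = {
    "cutover": timedelta(minutes=15),
    "verification": timedelta(hours=1),
    "stabilization": timedelta(days=1),
}


def next_update(phase: str, last_update: datetime) -> datetime:
    """Return when the next status update is due for the given phase."""
    return last_update + UPDATE_INTERVAL[phase]


if __name__ == "__main__":
    last = datetime(2025, 6, 10, 14, 0)  # hypothetical time of the last update
    print(next_update("cutover", last))
```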
External communication
- Customer-facing announcement: Pre-drafted, scheduled to publish at confirmed-success milestone
- Status page: Updated proactively if any user impact
- Support team: Briefed in advance on what's launching, common questions, escalation path
Workflow
- Build the runbook 30 days out. Scope, sequence, roles, rollback criteria, comms plan.
- Test the rollback procedure. Untested rollback is hope, not procedure.
- Run a tabletop exercise. Walk through the runbook with the full team. Find gaps.
- Lower DNS TTL 48 to 72 hours before launch (if DNS change is part of launch).
- Day-of: Run the runbook step by step. Verify each step before moving to next.
- Monitor. First hour, first day, first week. Document anything noteworthy.
- Schedule the AAR within 1 to 2 weeks of launch.
Failure patterns
- Runbook written by one person, not reviewed. Single perspective misses scenarios.
- No tested rollback. You discover the rollback is broken at the moment you need it.
- Vague step descriptions. "Deploy to production" without specifying which tool, which command, which environment.
- No verification step after each action. Errors propagate.
- Communication gaps. Team doesn't know launch is happening, or doesn't know it succeeded.
- Launching at end of day Friday, or before a holiday. Both reduce the time available to respond.
- Skipping pre-launch QA to hit a date. The bugs appear on launch day instead.
- Launch fatigue. Long launches without breaks lead to errors. Plan rest cycles for multi-day launches.
- No on-call for first 24 hours. Someone must be reachable.
Output format
Default output: a markdown runbook at `launch-runbook-[project].md` plus supporting checklists.
Structure:
- Launch metadata (what, when, who)
- Roles and responsibilities
- Pre-launch checklist (T-30, T-7, T-1, T-1hr)
- Cutover sequence (numbered steps, owners, verifications)
- Rollback procedure
- Rollback criteria (automatic and discretionary)
- Communication plan
- Verification checklist (first hour, first day)
- Stabilization plan (first week)
- Contacts (escalation paths, on-call)
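A skeleton with these sections can be generated mechanically so every runbook starts from the same outline. The sketch below assumes one heading per section and a `TODO` placeholder body; the heading levels, the placeholder, and the function names are illustrative choices, not a prescribed layout.

```python
SECTIONS = [
    "Launch metadata",
    "Roles and responsibilities",
    "Pre-launch checklist",
    "Cutover sequence",
    "Rollback procedure",
    "Rollback criteria",
    "Communication plan",
    "Verification checklist",
    "Stabilization plan",
    "Contacts",
]


def runbook_filename(project: str) -> str:
    """Build the output filename from the launch-runbook-[project].md pattern."""
    return f"launch-runbook-{project}.md"


def runbook_skeleton(project: str) -> str:
    """Return markdown text with one empty section per structure item."""
    lines = [f"# Launch Runbook: {project}", ""]
    for title in SECTIONS:
        lines += [f"## {title}", "", "TODO", ""]
    return "\n".join(lines)


if __name__ == "__main__":
    project = "example-site"  # hypothetical project name
    with open(runbook_filename(project), "w", encoding="utf-8") as f:
        f.write(runbook_skeleton(project))
```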
Reference files
- `references/runbook-template.md` - Fillable runbook template with example cutover sequences.