Runbook Creation

Create effective operational runbooks and procedures.

Runbook Structure

Each runbook should contain the following sections.

Runbook: [Service/Process Name]

Overview

Brief description of the service and of this runbook's purpose.

Prerequisites

  • Required access
  • Tools needed
  • Knowledge required

Procedure

Step-by-step instructions, including the exact commands to run.

Verification

How to confirm success.

Rollback

Steps to undo the procedure if needed.

Escalation

When and how to escalate.

Related Runbooks

Links to related procedures.
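A runbook that follows this outline can be checked mechanically, for example in CI. The sketch below is a minimal, hypothetical helper (`lint_runbook` is not an existing tool). It assumes each section is written as a markdown `## <Section>` heading that uses exactly the names listed above.

```bash
# Hypothetical helper: verify that a markdown runbook contains every section
# in the outline, assuming each one is written as a "## <Section>" heading.
lint_runbook() {
  local file="$1"
  local section missing=0
  if [ ! -r "$file" ]; then
    echo "cannot read ${file}" >&2
    return 2
  fi
  for section in "Overview" "Prerequisites" "Procedure" "Verification" \
                 "Rollback" "Escalation" "Related Runbooks"; do
    # -x: match the whole line; -F: treat the heading as a literal string
    if ! grep -qxF "## ${section}" "$file"; then
      echo "missing section: ${section}" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Running `lint_runbook path/to/runbook.md` (the path is a placeholder) prints each missing section to stderr and returns non-zero if any section is absent.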

Example Runbook

Runbook: Database Failover

Overview

Procedure for failing over a PostgreSQL primary to its replica.

Prerequisites

  • DBA access to primary and replica
  • VPN connected
  • Slack channel #db-ops open

Procedure

1. Verify Replica Status

```bash
psql -h replica -c "SELECT pg_is_in_recovery();"
# Should return 't'
```
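Step 1 can also be scripted as a guard that aborts the failover unless the replica is still in recovery. This sketch assumes `psql` can authenticate non-interactively (for example through `~/.pgpass` or `PG*` environment variables), and the function name `assert_in_recovery` is hypothetical. The `-t` and `-A` flags make `psql` print only the bare value (`t` or `f`), without a header or padding.

```bash
# Hypothetical guard for step 1: refuse to continue unless the replica is
# still a standby (pg_is_in_recovery() returns 't').
assert_in_recovery() {
  local host="$1"
  local state
  # -t: tuples only (no header/footer); -A: unaligned (no padding)
  if ! state="$(psql -h "$host" -tA -c "SELECT pg_is_in_recovery();")"; then
    echo "could not query ${host}" >&2
    return 2
  fi
  if [ "$state" != "t" ]; then
    echo "${host} is not in recovery (got '${state}'); aborting failover" >&2
    return 1
  fi
  echo "${host} is a standby in recovery; safe to continue"
}
```

For example: `assert_in_recovery replica || exit 1`.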

2. Stop Application Writes

```bash
kubectl scale deployment app --replicas=0
```
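`kubectl scale` returns as soon as the API server accepts the new replica count. The existing pods then terminate over their grace period, so the old primary can still receive writes for a short time. The sketch below polls until no matching pods remain. The label selector `app=app` and the helper name `wait_for_no_pods` are assumptions, so check the labels in your own deployment manifest.

```bash
# Hypothetical helper for step 2: wait until no pods match the selector,
# so in-flight writes from the app have stopped before promotion.
wait_for_no_pods() {
  local selector="${1:-app=app}"   # assumed label; adjust to your deployment
  local timeout="${2:-120}"        # seconds
  local deadline=$(( SECONDS + timeout ))
  local pods
  while :; do
    # A failed kubectl call must not be mistaken for "zero pods left".
    # When nothing matches, kubectl exits 0 and prints its notice on stderr.
    if ! pods="$(kubectl get pods -l "$selector" --no-headers 2>/dev/null)"; then
      echo "kubectl get pods failed" >&2
      return 2
    fi
    if [ -z "$pods" ]; then
      echo "no pods left for ${selector}"
      return 0
    fi
    if [ "$SECONDS" -ge "$deadline" ]; then
      echo "timed out; pods still present for ${selector}:" >&2
      printf '%s\n' "$pods" >&2
      return 1
    fi
    sleep 5
  done
}
```

For example: `kubectl scale deployment app --replicas=0 && wait_for_no_pods app=app 120`.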

3. Promote Replica

```bash
psql -h replica -c "SELECT pg_promote();"
```
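Step 3 can also confirm that promotion actually finished. Since PostgreSQL 12, `pg_promote()` accepts optional `wait` (default `true`) and `wait_seconds` (default `60`) arguments. It returns true only if promotion completed within that time. The wrapper name `promote_replica` is hypothetical, and the sketch assumes non-interactive `psql` access.

```bash
# Hypothetical wrapper for step 3: promote and fail loudly if promotion
# does not complete within wait_seconds.
promote_replica() {
  local host="$1"
  local wait_seconds="${2:-60}"
  local result
  # The value is interpolated into SQL, so accept only a whole number.
  if ! [[ "$wait_seconds" =~ ^[0-9]+$ ]]; then
    echo "wait_seconds must be a whole number" >&2
    return 2
  fi
  # pg_promote(wait => true, wait_seconds => N) blocks until promotion is
  # done or N seconds pass, and returns 't' only on success.
  if ! result="$(psql -h "$host" -tA -c "SELECT pg_promote(true, ${wait_seconds});")"; then
    echo "pg_promote() failed on ${host}" >&2
    return 2
  fi
  if [ "$result" != "t" ]; then
    echo "${host} did not finish promoting within ${wait_seconds}s" >&2
    return 1
  fi
  echo "${host} promoted to primary"
}
```

`promote_replica replica 60` issues the same promotion as the plain `SELECT pg_promote();` call, but it returns non-zero on a timeout or error.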

4. Update DNS

```bash
aws route53 change-resource-record-sets ...
```
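A DNS update like step 4 is commonly submitted to Route 53 as an `UPSERT` change batch, which creates the record or replaces it if it already exists. In the sketch below, `point_db_dns_at` is a hypothetical helper. The record type (`CNAME`), the 60-second TTL and every example value are assumptions to replace with your own. A short TTL limits how long clients keep resolving the old primary.

```bash
# Hypothetical sketch of step 4. The record type (CNAME) and TTL (60s) are
# assumptions; all argument values shown in examples are placeholders.
point_db_dns_at() {
  local zone_id="$1"   # Route 53 hosted zone ID
  local record="$2"    # DNS name clients use for the primary
  local target="$3"    # hostname of the promoted replica
  local batch rc=0
  batch="$(mktemp)" || return 2
  # UPSERT creates the record, or replaces it if it already exists.
  cat > "$batch" <<EOF
{
  "Comment": "Database failover: point ${record} at ${target}",
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "${record}",
      "Type": "CNAME",
      "TTL": 60,
      "ResourceRecords": [{ "Value": "${target}" }]
    }
  }]
}
EOF
  aws route53 change-resource-record-sets \
    --hosted-zone-id "$zone_id" \
    --change-batch "file://${batch}" || rc=$?
  rm -f "$batch"
  return "$rc"
}
```

For example, `point_db_dns_at Z0000000EXAMPLE db.example.internal replica.example.internal`, where all three arguments are placeholders.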

Verification

  • Application connects to new primary
  • No replication lag errors
  • Transactions completing
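The "Transactions completing" check can be partly automated. The hypothetical `verify_new_primary` sketch first confirms that the promoted node returns `f` from `pg_is_in_recovery()`. It then runs a throwaway write against a temporary table, which a hot standby would reject as a read-only transaction. It assumes the same non-interactive `psql` access as the rest of the runbook. It does not check replication lag or application connectivity.

```bash
# Hypothetical smoke test for the Verification list: the node must be out of
# recovery and must accept a write.
verify_new_primary() {
  local host="$1"
  local state
  state="$(psql -h "$host" -tA -c "SELECT pg_is_in_recovery();")" || return 2
  if [ "$state" != "f" ]; then
    echo "${host} is still in recovery" >&2
    return 1
  fi
  # A temp table lives only for this session, so nothing is left behind.
  # A standby would fail this with a read-only transaction error.
  if ! psql -h "$host" -v ON_ERROR_STOP=1 -q \
       -c "CREATE TEMP TABLE failover_smoke (id int); INSERT INTO failover_smoke VALUES (1);" >/dev/null; then
    echo "write test failed on ${host}" >&2
    return 1
  fi
  echo "${host} is accepting writes"
}
```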

Escalation

If issues persist after 15 minutes, escalate to:
  • Primary: @dba-lead
  • Secondary: @platform-oncall
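The 15-minute rule can be enforced with a small watch loop. In this hypothetical sketch, `watch_or_escalate` re-runs a health-check command every INTERVAL seconds. If LIMIT seconds pass without a successful check, it prints the escalation contacts from this runbook. The caller chooses the health check.

```bash
# Hypothetical helper for the escalation rule: retry a check until it
# passes, or print the escalation contacts once the time limit is reached.
watch_or_escalate() {
  if [ "$#" -lt 3 ]; then
    echo "usage: watch_or_escalate LIMIT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]" >&2
    return 2
  fi
  local limit="$1" interval="$2"
  shift 2
  local deadline=$(( SECONDS + limit ))
  # Run the check; keep retrying until it passes or the deadline is reached.
  until "$@"; do
    if [ "$SECONDS" -ge "$deadline" ]; then
      echo "Still failing after $(( limit / 60 )) min: escalate to @dba-lead (primary), then @platform-oncall (secondary)"
      return 1
    fi
    sleep "$interval"
  done
  echo "Check passed; no escalation needed"
}
```

For example, `watch_or_escalate 900 30 psql -h replica -c "SELECT 1;"` retries a trivial query every 30 seconds for up to 15 minutes.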

Best Practices

  • Keep procedures simple and clear
  • Include verification steps
  • Test runbooks regularly
  • Keep runbooks under version control
  • Include troubleshooting tips
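One inexpensive way to support regular testing is to syntax-check a runbook's shell snippets whenever the file changes. The hypothetical `check_runbook_snippets` helper extracts every fenced code block tagged `bash` from a markdown file and runs `bash -n` on the result, which parses the code without executing it. It only catches errors such as an unclosed quote. It does not prove the commands work, so it complements rather than replaces periodic rehearsals of the procedure.

```bash
# Hypothetical check: parse (but never run) all bash-tagged fenced blocks
# in a markdown runbook.
check_runbook_snippets() {
  local file="$1"
  local snippets
  # Print lines between an opening bash fence and the next closing fence.
  snippets="$(awk '
    /^```bash[[:space:]]*$/ { inblock = 1; next }
    /^```[[:space:]]*$/     { if (inblock) { inblock = 0; print "" }; next }
    inblock                 { print }
  ' "$file")" || return 2
  # bash -n reads the script from stdin and only checks its syntax.
  if ! bash -n <<<"$snippets"; then
    echo "syntax error in a bash snippet in ${file}" >&2
    return 1
  fi
  echo "all bash snippets in ${file} parse"
}
```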