backup-disaster-recovery
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseBackup and Disaster Recovery
备份与灾难恢复
Overview
概述
Design and implement comprehensive backup and disaster recovery strategies to ensure data protection, business continuity, and rapid recovery from infrastructure failures.
设计并实施全面的备份与灾难恢复策略,确保数据保护、业务连续性,并能从基础设施故障中快速恢复。
When to Use
适用场景
- Data protection and compliance
- Business continuity planning
- Disaster recovery planning
- Point-in-time recovery
- Cross-region failover
- Data migration
- Compliance and audit requirements
- Recovery time objective (RTO) optimization
- 数据保护与合规
- 业务连续性规划
- 灾难恢复规划
- 时间点恢复
- 跨区域故障转移
- 数据迁移
- 合规与审计要求
- 恢复时间目标(RTO)优化
Implementation Examples
实施示例
1. Database Backup Configuration
1. 数据库备份配置
yaml
undefinedyaml
undefinedpostgres-backup-cronjob.yaml
postgres-backup-cronjob.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: backup-script
namespace: databases
data:
backup.sh: |
#!/bin/bash
set -euo pipefail
BACKUP_DIR="/backups/postgresql"
RETENTION_DAYS=30
DB_HOST="${POSTGRES_HOST}"
DB_PORT="${POSTGRES_PORT:-5432}"
DB_USER="${POSTGRES_USER}"
DB_PASSWORD="${POSTGRES_PASSWORD}"
export PGPASSWORD="$DB_PASSWORD"
# Create backup directory
mkdir -p "$BACKUP_DIR"
# Full backup
BACKUP_FILE="$BACKUP_DIR/full-$(date +%Y%m%d-%H%M%S).sql"
echo "Starting backup to $BACKUP_FILE"
pg_dump -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -v \
--format=plain --no-owner --no-privileges > "$BACKUP_FILE"
# Compress backup
gzip "$BACKUP_FILE"
echo "Backup compressed: ${BACKUP_FILE}.gz"
# Upload to S3
aws s3 cp "${BACKUP_FILE}.gz" \
"s3://my-backups/postgres/$(date +%Y/%m/%d)/"
# Clean local old backups
find "$BACKUP_DIR" -type f -mtime +7 -delete
# Verify backup
if pg_restore -d "postgresql://$DB_USER@$DB_HOST:$DB_PORT/test_restore" \
"${BACKUP_FILE}.gz" --single-transaction 2>/dev/null; then
echo "Backup verification successful"
dropdb -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" test_restore
fi
echo "Backup complete: ${BACKUP_FILE}.gz"apiVersion: batch/v1
kind: CronJob
metadata:
name: postgres-backup
namespace: databases
spec:
schedule: "0 2 * * *" # 2 AM daily
successfulJobsHistoryLimit: 3
failedJobsHistoryLimit: 3
jobTemplate:
spec:
template:
spec:
serviceAccountName: backup-sa
containers:
- name: backup
image: postgres:15-alpine
env:
- name: POSTGRES_HOST
valueFrom:
secretKeyRef:
name: postgres-credentials
key: host
- name: POSTGRES_USER
valueFrom:
secretKeyRef:
name: postgres-credentials
key: username
- name: POSTGRES_PASSWORD
valueFrom:
secretKeyRef:
name: postgres-credentials
key: password
- name: AWS_ACCESS_KEY_ID
valueFrom:
secretKeyRef:
name: aws-credentials
key: access-key
- name: AWS_SECRET_ACCESS_KEY
valueFrom:
secretKeyRef:
name: aws-credentials
key: secret-key
volumeMounts:
- name: backup-script
mountPath: /backup
- name: backup-storage
mountPath: /backups
command:
- sh
- -c
- apk add --no-cache aws-cli && bash /backup/backup.sh
volumes:
- name: backup-script
configMap:
name: backup-script
defaultMode: 0755
- name: backup-storage
emptyDir:
sizeLimit: 100Gi
restartPolicy: OnFailure
undefinedapiVersion: v1
kind: ConfigMap
metadata:
name: backup-script
namespace: databases
data:
backup.sh: |
#!/bin/bash
set -euo pipefail
BACKUP_DIR="/backups/postgresql"
RETENTION_DAYS=30
DB_HOST="${POSTGRES_HOST}"
DB_PORT="${POSTGRES_PORT:-5432}"
DB_USER="${POSTGRES_USER}"
DB_PASSWORD="${POSTGRES_PASSWORD}"
export PGPASSWORD="$DB_PASSWORD"
# Create backup directory
mkdir -p "$BACKUP_DIR"
# Full backup
BACKUP_FILE="$BACKUP_DIR/full-$(date +%Y%m%d-%H%M%S).sql"
echo "Starting backup to $BACKUP_FILE"
pg_dump -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -v \
--format=plain --no-owner --no-privileges > "$BACKUP_FILE"
# Compress backup
gzip "$BACKUP_FILE"
echo "Backup compressed: ${BACKUP_FILE}.gz"
# Upload to S3
aws s3 cp "${BACKUP_FILE}.gz" \
"s3://my-backups/postgres/$(date +%Y/%m/%d)/"
# Clean local old backups
find "$BACKUP_DIR" -type f -mtime +7 -delete
# Verify backup
if pg_restore -d "postgresql://$DB_USER@$DB_HOST:$DB_PORT/test_restore" \
"${BACKUP_FILE}.gz" --single-transaction 2>/dev/null; then
echo "Backup verification successful"
dropdb -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" test_restore
fi
echo "Backup complete: ${BACKUP_FILE}.gz"apiVersion: batch/v1
kind: CronJob
metadata:
name: postgres-backup
namespace: databases
spec:
schedule: "0 2 * * *" # 2 AM daily
successfulJobsHistoryLimit: 3
failedJobsHistoryLimit: 3
jobTemplate:
spec:
template:
spec:
serviceAccountName: backup-sa
containers:
- name: backup
image: postgres:15-alpine
env:
- name: POSTGRES_HOST
valueFrom:
secretKeyRef:
name: postgres-credentials
key: host
- name: POSTGRES_USER
valueFrom:
secretKeyRef:
name: postgres-credentials
key: username
- name: POSTGRES_PASSWORD
valueFrom:
secretKeyRef:
name: postgres-credentials
key: password
- name: AWS_ACCESS_KEY_ID
valueFrom:
secretKeyRef:
name: aws-credentials
key: access-key
- name: AWS_SECRET_ACCESS_KEY
valueFrom:
secretKeyRef:
name: aws-credentials
key: secret-key
volumeMounts:
- name: backup-script
mountPath: /backup
- name: backup-storage
mountPath: /backups
command:
- sh
- -c
- apk add --no-cache aws-cli && bash /backup/backup.sh
volumes:
- name: backup-script
configMap:
name: backup-script
defaultMode: 0755
- name: backup-storage
emptyDir:
sizeLimit: 100Gi
restartPolicy: OnFailure
undefined2. Disaster Recovery Plan Template
2. 灾难恢复计划模板
yaml
undefinedyaml
undefineddisaster-recovery-plan.yaml
disaster-recovery-plan.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: dr-procedures
namespace: operations
data:
dr-runbook.md: |
# Disaster Recovery Runbook
## RTO and RPO Targets
- RTO (Recovery Time Objective): 4 hours
- RPO (Recovery Point Objective): 1 hour
## Pre-Disaster Checklist
- [ ] Verify backups are current
- [ ] Test backup restoration process
- [ ] Verify DR site resources are provisioned
- [ ] Confirm failover DNS is configured
## Primary Region Failure
### Detection (0-15 minutes)
- Alerting system detects primary region down
- Incident commander declared
- War room opened in Slack #incidents
### Initial Actions (15-30 minutes)
- Verify primary region is truly down
- Check backup systems in secondary region
- Validate latest backup timestamp
### Failover Procedure (30 minutes - 2 hours)
1. Validate backup integrity
2. Restore database from latest backup
3. Update application configuration
4. Perform DNS failover to secondary region
5. Verify application health
### Recovery Steps
1. Restore from backup: `restore-backup.sh --backup-id=latest`
2. Update DNS: `aws route53 change-resource-record-sets --cli-input-json file://failover.json`
3. Verify: `curl https://myapp.com/health`
4. Run smoke tests
5. Monitor error rates and performance
## Post-Disaster
- Document timeline and RCA
- Update runbooks
- Schedule post-mortem
- Test backups againapiVersion: v1
kind: Secret
metadata:
name: dr-credentials
namespace: operations
type: Opaque
stringData:
backup_aws_access_key: "AKIAIOSFODNN7EXAMPLE"
backup_aws_secret_key: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
dr_site_password: "secure-password-here"
undefinedapiVersion: v1
kind: ConfigMap
metadata:
name: dr-procedures
namespace: operations
data:
dr-runbook.md: |
# Disaster Recovery Runbook
## RTO and RPO Targets
- RTO (Recovery Time Objective): 4 hours
- RPO (Recovery Point Objective): 1 hour
## Pre-Disaster Checklist
- [ ] Verify backups are current
- [ ] Test backup restoration process
- [ ] Verify DR site resources are provisioned
- [ ] Confirm failover DNS is configured
## Primary Region Failure
### Detection (0-15 minutes)
- Alerting system detects primary region down
- Incident commander declared
- War room opened in Slack #incidents
### Initial Actions (15-30 minutes)
- Verify primary region is truly down
- Check backup systems in secondary region
- Validate latest backup timestamp
### Failover Procedure (30 minutes - 2 hours)
1. Validate backup integrity
2. Restore database from latest backup
3. Update application configuration
4. Perform DNS failover to secondary region
5. Verify application health
### Recovery Steps
1. Restore from backup: `restore-backup.sh --backup-id=latest`
2. Update DNS: `aws route53 change-resource-record-sets --cli-input-json file://failover.json`
3. Verify: `curl https://myapp.com/health`
4. Run smoke tests
5. Monitor error rates and performance
## Post-Disaster
- Document timeline and RCA
- Update runbooks
- Schedule post-mortem
- Test backups againapiVersion: v1
kind: Secret
metadata:
name: dr-credentials
namespace: operations
type: Opaque
stringData:
backup_aws_access_key: "AKIAIOSFODNN7EXAMPLE"
backup_aws_secret_key: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
dr_site_password: "secure-password-here"
undefined3. Backup and Restore Script
3. 备份与恢复脚本
bash
#!/bin/bashbash
#!/bin/bashbackup-restore.sh - Complete backup and restore utilities
backup-restore.sh - Complete backup and restore utilities
set -euo pipefail
BACKUP_BUCKET="s3://my-backups"
BACKUP_RETENTION_DAYS=30
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
set -euo pipefail
BACKUP_BUCKET="s3://my-backups"
BACKUP_RETENTION_DAYS=30
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
Colors
Colors
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'
log_info() {
echo -e "${GREEN}[INFO]${NC} $1"
}
log_error() {
echo -e "${RED}[ERROR]${NC} $1"
}
log_warn() {
echo -e "${YELLOW}[WARN]${NC} $1"
}
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'
log_info() {
echo -e "${GREEN}[INFO]${NC} $1"
}
log_error() {
echo -e "${RED}[ERROR]${NC} $1"
}
log_warn() {
echo -e "${YELLOW}[WARN]${NC} $1"
}
Backup function
Backup function
backup_all() {
local environment=$1
log_info "Starting backup for $environment environment"
# Backup databases
log_info "Backing up databases..."
for db in myapp_db analytics_db; do
local backup_file="$BACKUP_BUCKET/$environment/databases/${db}-${TIMESTAMP}.sql.gz"
pg_dump "$db" | gzip | aws s3 cp - "$backup_file"
log_info "Backed up $db to $backup_file"
done
# Backup Kubernetes resources
log_info "Backing up Kubernetes resources..."
kubectl get all,configmap,secret,ingress,pvc -A -o yaml | \
gzip | aws s3 cp - "$BACKUP_BUCKET/$environment/kubernetes-${TIMESTAMP}.yaml.gz"
log_info "Kubernetes resources backed up"
# Backup volumes
log_info "Backing up persistent volumes..."
for pvc in $(kubectl get pvc -A -o name); do
local pvc_name=$(echo $pvc | cut -d'/' -f2)
log_info "Backing up PVC: $pvc_name"
kubectl exec -n default -it backup-pod -- \
tar czf - /data | aws s3 cp - "$BACKUP_BUCKET/$environment/volumes/${pvc_name}-${TIMESTAMP}.tar.gz"
done
log_info "All backups completed successfully"}
backup_all() {
local environment=$1
log_info "Starting backup for $environment environment"
# Backup databases
log_info "Backing up databases..."
for db in myapp_db analytics_db; do
local backup_file="$BACKUP_BUCKET/$environment/databases/${db}-${TIMESTAMP}.sql.gz"
pg_dump "$db" | gzip | aws s3 cp - "$backup_file"
log_info "Backed up $db to $backup_file"
done
# Backup Kubernetes resources
log_info "Backing up Kubernetes resources..."
kubectl get all,configmap,secret,ingress,pvc -A -o yaml | \
gzip | aws s3 cp - "$BACKUP_BUCKET/$environment/kubernetes-${TIMESTAMP}.yaml.gz"
log_info "Kubernetes resources backed up"
# Backup volumes
log_info "Backing up persistent volumes..."
for pvc in $(kubectl get pvc -A -o name); do
local pvc_name=$(echo $pvc | cut -d'/' -f2)
log_info "Backing up PVC: $pvc_name"
kubectl exec -n default -it backup-pod -- \
tar czf - /data | aws s3 cp - "$BACKUP_BUCKET/$environment/volumes/${pvc_name}-${TIMESTAMP}.tar.gz"
done
log_info "All backups completed successfully"}
Restore function
Restore function
restore_all() {
local environment=$1
local backup_date=$2
log_warn "Restoring from backup date: $backup_date"
read -p "Are you sure? (yes/no): " confirm
if [ "$confirm" != "yes" ]; then
log_error "Restore cancelled"
exit 1
fi
# Restore databases
log_info "Restoring databases..."
for db in myapp_db analytics_db; do
local backup_file="$BACKUP_BUCKET/$environment/databases/${db}-${backup_date}.sql.gz"
log_info "Restoring $db from $backup_file"
aws s3 cp "$backup_file" - | gunzip | psql "$db"
done
# Restore Kubernetes resources
log_info "Restoring Kubernetes resources..."
local k8s_backup="$BACKUP_BUCKET/$environment/kubernetes-${backup_date}.yaml.gz"
aws s3 cp "$k8s_backup" - | gunzip | kubectl apply -f -
log_info "Restore completed successfully"}
restore_all() {
local environment=$1
local backup_date=$2
log_warn "Restoring from backup date: $backup_date"
read -p "Are you sure? (yes/no): " confirm
if [ "$confirm" != "yes" ]; then
log_error "Restore cancelled"
exit 1
fi
# Restore databases
log_info "Restoring databases..."
for db in myapp_db analytics_db; do
local backup_file="$BACKUP_BUCKET/$environment/databases/${db}-${backup_date}.sql.gz"
log_info "Restoring $db from $backup_file"
aws s3 cp "$backup_file" - | gunzip | psql "$db"
done
# Restore Kubernetes resources
log_info "Restoring Kubernetes resources..."
local k8s_backup="$BACKUP_BUCKET/$environment/kubernetes-${backup_date}.yaml.gz"
aws s3 cp "$k8s_backup" - | gunzip | kubectl apply -f -
log_info "Restore completed successfully"}
Test restore
Test restore
test_restore() {
local environment=$1
log_info "Testing restore procedure..."
# Get latest backup
local latest_backup=$(aws s3 ls "$BACKUP_BUCKET/$environment/databases/" | \
sort | tail -n 1 | awk '{print $4}')
if [ -z "$latest_backup" ]; then
log_error "No backups found"
exit 1
fi
log_info "Testing restore from: $latest_backup"
# Create test database
psql -c "CREATE DATABASE test_restore_$(date +%s);"
# Download and restore
aws s3 cp "$BACKUP_BUCKET/$environment/databases/$latest_backup" - | \
gunzip | psql "test_restore_$(date +%s)"
log_info "Test restore successful"}
test_restore() {
local environment=$1
log_info "Testing restore procedure..."
# Get latest backup
local latest_backup=$(aws s3 ls "$BACKUP_BUCKET/$environment/databases/" | \
sort | tail -n 1 | awk '{print $4}')
if [ -z "$latest_backup" ]; then
log_error "No backups found"
exit 1
fi
log_info "Testing restore from: $latest_backup"
# Create test database
psql -c "CREATE DATABASE test_restore_$(date +%s);"
# Download and restore
aws s3 cp "$BACKUP_BUCKET/$environment/databases/$latest_backup" - | \
gunzip | psql "test_restore_$(date +%s)"
log_info "Test restore successful"}
List backups
List backups
list_backups() {
local environment=$1
log_info "Available backups for $environment:"
aws s3 ls "$BACKUP_BUCKET/$environment/" --recursive | grep -E ".sql.gz|.yaml.gz|.tar.gz"
}
list_backups() {
local environment=$1
log_info "Available backups for $environment:"
aws s3 ls "$BACKUP_BUCKET/$environment/" --recursive | grep -E ".sql.gz|.yaml.gz|.tar.gz"
}
Cleanup old backups
Cleanup old backups
cleanup_old_backups() {
local environment=$1
log_info "Cleaning up backups older than $BACKUP_RETENTION_DAYS days"
find "$BACKUP_BUCKET/$environment" -type f -mtime "+$BACKUP_RETENTION_DAYS" -delete
log_info "Cleanup completed"}
cleanup_old_backups() {
local environment=$1
log_info "Cleaning up backups older than $BACKUP_RETENTION_DAYS days"
find "$BACKUP_BUCKET/$environment" -type f -mtime "+$BACKUP_RETENTION_DAYS" -delete
log_info "Cleanup completed"}
Main
Main
main() {
case "${1:-}" in
backup)
backup_all "${2:-production}"
;;
restore)
restore_all "${2:-production}" "${3:-}"
;;
test)
test_restore "${2:-production}"
;;
list)
list_backups "${2:-production}"
;;
cleanup)
cleanup_old_backups "${2:-production}"
;;
*)
echo "Usage: $0 {backup|restore|test|list|cleanup} [environment] [backup-date]"
exit 1
;;
esac
}
main "$@"
undefinedmain() {
case "${1:-}" in
backup)
backup_all "${2:-production}"
;;
restore)
restore_all "${2:-production}" "${3:-}"
;;
test)
test_restore "${2:-production}"
;;
list)
list_backups "${2:-production}"
;;
cleanup)
cleanup_old_backups "${2:-production}"
;;
*)
echo "Usage: $0 {backup|restore|test|list|cleanup} [environment] [backup-date]"
exit 1
;;
esac
}
main "$@"
undefined4. Cross-Region Failover
4. 跨区域故障转移
yaml
undefinedyaml
undefinedroute53-failover.yaml
route53-failover.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: failover-config
namespace: operations
data:
failover.sh: |
#!/bin/bash
set -euo pipefail
PRIMARY_REGION="us-east-1"
SECONDARY_REGION="us-west-2"
DOMAIN="myapp.com"
HOSTED_ZONE_ID="Z1234567890ABC"
echo "Initiating failover to $SECONDARY_REGION"
# Get primary endpoint
PRIMARY_ENDPOINT=$(aws elbv2 describe-load-balancers \
--region "$PRIMARY_REGION" \
--query 'LoadBalancers[0].DNSName' \
--output text)
# Get secondary endpoint
SECONDARY_ENDPOINT=$(aws elbv2 describe-load-balancers \
--region "$SECONDARY_REGION" \
--query 'LoadBalancers[0].DNSName' \
--output text)
# Update Route53 to failover
aws route53 change-resource-record-sets \
--hosted-zone-id "$HOSTED_ZONE_ID" \
--change-batch '{
"Changes": [
{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "'$DOMAIN'",
"Type": "A",
"TTL": 60,
"SetIdentifier": "Primary",
"Failover": "PRIMARY",
"AliasTarget": {
"HostedZoneId": "Z35SXDOTRQ7X7K",
"DNSName": "'$PRIMARY_ENDPOINT'",
"EvaluateTargetHealth": true
}
}
},
{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "'$DOMAIN'",
"Type": "A",
"TTL": 60,
"SetIdentifier": "Secondary",
"Failover": "SECONDARY",
"AliasTarget": {
"HostedZoneId": "Z35SXDOTRQ7X7K",
"DNSName": "'$SECONDARY_ENDPOINT'",
"EvaluateTargetHealth": false
}
}
}
]
}'
echo "Failover completed"undefinedapiVersion: v1
kind: ConfigMap
metadata:
name: failover-config
namespace: operations
data:
failover.sh: |
#!/bin/bash
set -euo pipefail
PRIMARY_REGION="us-east-1"
SECONDARY_REGION="us-west-2"
DOMAIN="myapp.com"
HOSTED_ZONE_ID="Z1234567890ABC"
echo "Initiating failover to $SECONDARY_REGION"
# Get primary endpoint
PRIMARY_ENDPOINT=$(aws elbv2 describe-load-balancers \
--region "$PRIMARY_REGION" \
--query 'LoadBalancers[0].DNSName' \
--output text)
# Get secondary endpoint
SECONDARY_ENDPOINT=$(aws elbv2 describe-load-balancers \
--region "$SECONDARY_REGION" \
--query 'LoadBalancers[0].DNSName' \
--output text)
# Update Route53 to failover
aws route53 change-resource-record-sets \
--hosted-zone-id "$HOSTED_ZONE_ID" \
--change-batch '{
"Changes": [
{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "'$DOMAIN'",
"Type": "A",
"TTL": 60,
"SetIdentifier": "Primary",
"Failover": "PRIMARY",
"AliasTarget": {
"HostedZoneId": "Z35SXDOTRQ7X7K",
"DNSName": "'$PRIMARY_ENDPOINT'",
"EvaluateTargetHealth": true
}
}
},
{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "'$DOMAIN'",
"Type": "A",
"TTL": 60,
"SetIdentifier": "Secondary",
"Failover": "SECONDARY",
"AliasTarget": {
"HostedZoneId": "Z35SXDOTRQ7X7K",
"DNSName": "'$SECONDARY_ENDPOINT'",
"EvaluateTargetHealth": false
}
}
}
]
}'
echo "Failover completed"undefinedBest Practices
最佳实践
✅ DO
✅ 建议做法
- Perform regular backup testing
- Use multiple backup locations
- Implement automated backups
- Document recovery procedures
- Test failover procedures regularly
- Monitor backup completion
- Use immutable backups
- Encrypt backups at rest and in transit
- 定期执行备份测试
- 使用多个备份位置
- 实施自动化备份
- 记录恢复流程
- 定期测试故障转移流程
- 监控备份完成情况
- 使用不可变备份
- 在静态和传输过程中加密备份
❌ DON'T
❌ 避免做法
- Rely on a single backup location
- Ignore backup failures
- Store backups with production data
- Skip testing recovery procedures
- Over-compress backups beyond recovery speed needs
- Forget to verify backup integrity
- Store encryption keys with backups
- Assume backups are automatically working
- 依赖单一备份位置
- 忽略备份失败
- 将备份与生产数据存储在一起
- 跳过恢复流程测试
- 过度压缩备份影响恢复速度
- 忘记验证备份完整性
- 将加密密钥与备份存储在一起
- 假设备份自动正常工作