devops-infrastructure
DevOps & Infrastructure
When to Load
- Trigger: Docker, CI/CD pipelines, deployment configuration, monitoring, infrastructure as code
- Skip: Application logic only with no infrastructure or deployment concerns
DevOps Workflow
Copy this checklist and track progress:
DevOps Setup Progress:
- [ ] Step 1: Containerize application (Dockerfile)
- [ ] Step 2: Set up CI/CD pipeline
- [ ] Step 3: Define deployment strategy
- [ ] Step 4: Configure monitoring & alerting
- [ ] Step 5: Set up environment management
- [ ] Step 6: Document runbooks
- [ ] Step 7: Validate against anti-patterns checklist
Docker Best Practices
Multi-Stage Build
dockerfile
# WRONG: Single stage, bloated image
FROM node:20
WORKDIR /app
COPY . .
RUN npm install
RUN npm run build
CMD ["node", "dist/index.js"]
# Result: 1.2GB image with devDependencies and source code
# CORRECT: Multi-stage build
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
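RUN npm run build
# Drop devDependencies so the runner stage copies only production packages
RUN npm prune --omit=dev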
FROM node:20-alpine AS runner
WORKDIR /app
ENV NODE_ENV=production
RUN addgroup -g 1001 appgroup && adduser -u 1001 -G appgroup -s /bin/sh -D appuser
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./
USER appuser
EXPOSE 3000
CMD ["node", "dist/index.js"]
# Result: ~150MB image, no devDependencies, non-root user
Python Multi-Stage
dockerfile
FROM python:3.12-slim AS builder
WORKDIR /app
RUN pip install uv
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev
COPY . .
FROM python:3.12-slim AS runner
WORKDIR /app
RUN useradd -r -s /bin/false appuser
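# Copy from the builder stage; without --from, COPY reads from the build context
COPY --from=builder /app/.venv /app/.venv
COPY --from=builder /app/src ./src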
ENV PATH="/app/.venv/bin:$PATH"
USER appuser
CMD ["python", "-m", "src.main"]
Layer Caching
dockerfile
# WRONG: Cache busted on every code change
COPY . .
RUN npm ci
# CORRECT: Dependencies cached separately
# Cached unless package.json or package-lock.json changes
COPY package*.json ./
RUN npm ci
# Only source code changes bust this layer
COPY . .
.dockerignore
node_modules
.git
.env
*.md
.vscode
coverage
dist
__pycache__
.pytest_cache
*.pyc
Security
dockerfile
# Always pin versions (never node:latest)
FROM node:20.11.0-alpine
# Don't run as root
USER appuser
# Read-only filesystem where possible (shell command, run on the host)
docker run --read-only --tmpfs /tmp myapp
# Scan images (shell commands, run on the host)
docker scout cves myimage:latest
trivy image myimage:latest
CI/CD Pipeline Design
GitHub Actions Structure
yaml
name: CI/CD
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: "npm"
      - run: npm ci
      - run: npm run lint
  test:
    runs-on: ubuntu-latest
    needs: lint
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: testdb
          # The postgres image will not start without a password; test-only value
          POSTGRES_PASSWORD: postgres
        ports: ["5432:5432"]
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: "npm"
      - run: npm ci
      - run: npm test
  build:
    runs-on: ubuntu-latest
    needs: test
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      # Log in to GHCR so the push below is authorized
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          push: ${{ github.event_name == 'push' }}
          tags: ghcr.io/${{ github.repository }}:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
  deploy:
    runs-on: ubuntu-latest
    needs: build
    if: github.ref == 'refs/heads/main'
    environment: production
    steps:
      - run: echo "Deploy to production"
Caching Strategies
yaml
# Node modules
- uses: actions/setup-node@v4
  with:
    cache: "npm"
# Python with uv
- name: Cache uv
  uses: actions/cache@v4
  with:
    path: ~/.cache/uv
    key: uv-${{ runner.os }}-${{ hashFiles('uv.lock') }}
# Docker layer caching
- uses: docker/build-push-action@v5
  with:
    cache-from: type=gha
    cache-to: type=gha,mode=max
Deployment Strategies
Blue-Green Deployment
1. Run two identical environments: Blue (live) and Green (idle)
2. Deploy new version to Green
3. Run smoke tests on Green
4. Switch load balancer to Green
5. Green is now live, Blue is idle
6. Rollback: switch back to Blue
Pros: Instant rollback, zero downtime
Cons: 2x infrastructure cost during deploy
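The switch-and-rollback steps above can be sketched in a few lines. This is a minimal in-memory model, not a real load balancer integration: `BlueGreenState`, `deployBlueGreen`, and `rollback` are hypothetical names introduced only for illustration, and the smoke test is a plain callback.

```typescript
// Hypothetical types and helpers for illustration; not a real load balancer API.
type Color = "blue" | "green";

interface BlueGreenState {
  live: Color; // environment currently receiving traffic
  versions: Record<Color, string>; // version deployed to each environment
}

function idleColor(state: BlueGreenState): Color {
  return state.live === "blue" ? "green" : "blue";
}

// Steps 2-5: deploy to the idle environment, smoke test it, then switch traffic.
function deployBlueGreen(
  state: BlueGreenState,
  version: string,
  smokeTest: (deployedVersion: string) => boolean,
): boolean {
  const idle = idleColor(state);
  state.versions[idle] = version;
  if (!smokeTest(state.versions[idle])) {
    return false; // live traffic never reached the new version
  }
  state.live = idle;
  return true;
}

// Step 6: rollback only repoints traffic; the old environment was left untouched.
function rollback(state: BlueGreenState): void {
  state.live = idleColor(state);
}

const state: BlueGreenState = { live: "blue", versions: { blue: "1.0.0", green: "1.0.0" } };
deployBlueGreen(state, "1.1.0", (v) => v === "1.1.0");
console.log(state.live, state.versions[state.live]); // green 1.1.0
rollback(state);
console.log(state.live, state.versions[state.live]); // blue 1.0.0
```

Rollback is instant in this model because the previous environment still holds the old version, which is also why both environments have to stay provisioned during the deploy.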
Canary Deployment
1. Deploy new version to small subset (5% of traffic)
2. Monitor error rates and latency
3. Gradually increase: 5% -> 25% -> 50% -> 100%
4. Rollback: route all traffic back to old version
Pros: Limited blast radius, real-world testing
Cons: More complex routing, longer rollout
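A staged canary rollout can be sketched as a loop over the traffic percentages listed above. This is an illustrative model only: `runCanary` and `StageMetrics` are hypothetical names, the `observe` callback stands in for whatever metrics source is in use, and the default thresholds (1% error rate, 2s P99) are assumptions, not values prescribed for canaries.

```typescript
// Traffic percentages taken from the rollout steps above.
const STAGES = [5, 25, 50, 100];

// Hypothetical shape for the metrics observed at each stage.
interface StageMetrics {
  errorRate: number; // fraction of failed requests, e.g. 0.01 = 1%
  p99LatencyMs: number;
}

interface CanaryResult {
  finalPercent: number; // share of traffic on the new version at the end
  rolledBack: boolean;
  history: number[]; // stages that were attempted, in order
}

// Advance through the stages; roll back to 0% as soon as one looks unhealthy.
function runCanary(
  observe: (percent: number) => StageMetrics,
  maxErrorRate: number = 0.01, // assumed threshold
  maxP99LatencyMs: number = 2000, // assumed threshold
): CanaryResult {
  const history: number[] = [];
  for (const percent of STAGES) {
    history.push(percent);
    const metrics = observe(percent);
    if (metrics.errorRate > maxErrorRate || metrics.p99LatencyMs > maxP99LatencyMs) {
      return { finalPercent: 0, rolledBack: true, history };
    }
  }
  return { finalPercent: 100, rolledBack: false, history };
}

// Example: the new version degrades once it receives 25% of traffic.
const result = runCanary((percent) =>
  percent >= 25 ? { errorRate: 0.04, p99LatencyMs: 900 } : { errorRate: 0.001, p99LatencyMs: 300 },
);
console.log(result.rolledBack, result.history);
```

Because the failure is caught at the 25% stage, at most a quarter of traffic ever reaches the bad version, which is the limited blast radius described above.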
Rolling Deployment
1. Replace instances one at a time
2. Each new instance passes health checks before next starts
3. Continue until all instances updated
Pros: No extra infrastructure, gradual rollout
Cons: Mixed versions during deploy, slower rollback
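The one-at-a-time replacement above can be modeled as a loop that halts at the first failed health check. This is a simplified sketch: `Instance` and `rollingUpdate` are hypothetical names, and a real orchestrator would also drain connections and wait on readiness before moving on.

```typescript
// Hypothetical instance record for illustration.
interface Instance {
  id: string;
  version: string;
}

// Replace instances one at a time; stop as soon as a new instance is unhealthy.
function rollingUpdate(
  instances: Instance[],
  newVersion: string,
  healthCheck: (instance: Instance) => boolean,
): { updated: number; completed: boolean } {
  let updated = 0;
  for (const instance of instances) {
    const previousVersion = instance.version;
    instance.version = newVersion;
    if (!healthCheck(instance)) {
      // Restore only the failing instance; earlier ones stay on the new version.
      instance.version = previousVersion;
      return { updated, completed: false };
    }
    updated += 1;
  }
  return { updated, completed: true };
}

const fleet: Instance[] = [
  { id: "a", version: "1.0" },
  { id: "b", version: "1.0" },
  { id: "c", version: "1.0" },
];
// Simulate the third instance failing its health check.
const outcome = rollingUpdate(fleet, "2.0", (instance) => instance.id !== "c");
console.log(outcome, fleet.map((instance) => instance.version));
```

After the halted rollout the fleet runs both 2.0 and 1.0, which illustrates the mixed-version drawback: a full rollback means walking the already-updated instances again.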
Feature Flags
typescript
// Simple feature flag implementation
// User, newCheckoutFlow, and legacyCheckoutFlow are assumed to be defined elsewhere in the app
const features = {
  NEW_CHECKOUT: process.env.FF_NEW_CHECKOUT === "true",
  DARK_MODE: process.env.FF_DARK_MODE === "true",
};
function getCheckoutFlow(user: User) {
  if (features.NEW_CHECKOUT && user.betaGroup) {
    return newCheckoutFlow(user);
  }
  return legacyCheckoutFlow(user);
}
// Use a proper service for production: LaunchDarkly, Unleash, Flagsmith
Infrastructure as Code
Terraform Basics
hcl
# main.tf
terraform {
  required_version = ">= 1.5"
  backend "s3" {
    bucket = "myapp-terraform-state"
    key    = "prod/terraform.tfstate"
    region = "us-east-1"
  }
}
resource "aws_instance" "web" {
  ami           = var.ami_id
  instance_type = var.instance_type
  tags = {
    Name        = "web-${var.environment}"
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}
# variables.tf
# ami_id is referenced in main.tf, so it must be declared; no default, the caller supplies it
variable "ami_id" {
  type = string
}
variable "environment" {
  type    = string
  default = "dev"
}
variable "instance_type" {
  type    = string
  default = "t3.micro"
}
Terraform Rules
1. Always use remote state (S3, GCS, Terraform Cloud)
2. Lock state files to prevent concurrent modifications
3. Use variables and modules for reusability
4. Tag all resources with environment and ManagedBy
5. Run `terraform plan` before `terraform apply`
6. Never edit infrastructure manually (all changes via code)
7. Use workspaces or separate state files per environment
Monitoring & Observability
The Three Pillars
METRICS: Numeric measurements over time
- Request rate, error rate, latency (RED method)
- CPU, memory, disk, network (USE method)
- Business metrics (signups, purchases)
Tools: Prometheus, Datadog, CloudWatch
LOGS: Discrete events with context
- Structured JSON format
- Correlation IDs across services
- Log levels: DEBUG, INFO, WARN, ERROR
Tools: ELK Stack, Loki, CloudWatch Logs
TRACES: Request flow across services
- Distributed tracing with span context
- Latency breakdown per service
- Dependency mapping
Tools: Jaeger, Zipkin, Datadog APM
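For the LOGS pillar, structured JSON with a correlation ID might look like the sketch below. `formatLog` and `createLogger` are hypothetical helpers written for this example, not the API of any logging library, and the field names are one reasonable choice rather than a standard.

```typescript
// Hypothetical helper names; not from any logging library.
type LogLevel = "DEBUG" | "INFO" | "WARN" | "ERROR";

type LogContext = Record<string, unknown>;

// Serialize one log event as a single JSON line.
function formatLog(
  level: LogLevel,
  message: string,
  correlationId: string,
  context: LogContext = {},
): string {
  // Context is spread first so it cannot overwrite the standard fields.
  return JSON.stringify({
    ...context,
    timestamp: new Date().toISOString(),
    level,
    message,
    correlationId,
  });
}

// Bind a correlation ID once per request so every line carries it.
function createLogger(correlationId: string) {
  return {
    info: (message: string, context?: LogContext) =>
      console.log(formatLog("INFO", message, correlationId, context)),
    error: (message: string, context?: LogContext) =>
      console.error(formatLog("ERROR", message, correlationId, context)),
  };
}

const log = createLogger("req-123");
log.info("order created", { orderId: 42, service: "checkout" });
log.error("payment failed", { orderId: 42, service: "payments" });
```

Both lines share `correlationId: "req-123"`, so a log backend can join events emitted by different services while handling the same request.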
Health Check Endpoint
typescript
// Express health check
// app, db, and redis are assumed to be initialized elsewhere
app.get("/health", async (req, res) => {
  const checks = {
    uptime: process.uptime(),
    timestamp: Date.now(),
    database: "unknown",
    redis: "unknown",
  };
  try {
    await db.query("SELECT 1");
    checks.database = "healthy";
  } catch (e) {
    checks.database = "unhealthy";
  }
  try {
    await redis.ping();
    checks.redis = "healthy";
  } catch (e) {
    checks.redis = "unhealthy";
  }
  // Only the database gates the status code; Redis is reported but treated as non-critical
  const isHealthy = checks.database === "healthy";
  res.status(isHealthy ? 200 : 503).json(checks);
});
Alerting Rules
Good alerts:
- Error rate > 1% for 5 minutes (actionable)
- P99 latency > 2s for 10 minutes (meaningful)
- Disk usage > 80% (preventive)
Bad alerts:
- CPU spike for 30 seconds (too noisy)
- Any single 500 error (too sensitive)
- "Something might be wrong" (not actionable)
Alert fatigue is real. Every alert should require human action.
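The difference between "error rate > 1% for 5 minutes" and "any single 500 error" can be made concrete with a small evaluator. This is a sketch under assumptions: `MinuteSample` and `shouldAlert` are hypothetical names, samples are assumed to arrive as one bucket per minute, and real alerting systems express this as a rule in their own configuration language.

```typescript
// Hypothetical per-minute bucket of request counts.
interface MinuteSample {
  requests: number;
  errors: number;
}

// Fire only when the error rate stays above the threshold for the whole window.
function shouldAlert(
  samples: MinuteSample[],
  threshold: number = 0.01, // 1% error rate
  sustainedMinutes: number = 5,
): boolean {
  if (samples.length < sustainedMinutes) {
    return false; // not enough history to call it sustained
  }
  const recent = samples.slice(-sustainedMinutes);
  return recent.every(
    (sample) => sample.requests > 0 && sample.errors / sample.requests > threshold,
  );
}

const sustained: MinuteSample[] = Array.from({ length: 5 }, () => ({ requests: 1000, errors: 20 }));
const singleSpike: MinuteSample[] = [
  { requests: 1000, errors: 0 },
  { requests: 1000, errors: 0 },
  { requests: 1000, errors: 500 },
  { requests: 1000, errors: 0 },
  { requests: 1000, errors: 0 },
];
console.log(shouldAlert(sustained)); // true
console.log(shouldAlert(singleSpike)); // false
```

Requiring every minute in the window to breach the threshold filters out brief spikes, at the cost of a detection delay equal to the window length.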
Environment Management
Dev/Staging/Prod Parity
yaml
# docker-compose.yml for local development
services:
  app:
    build: .
    env_file: .env
    ports: ["3000:3000"]
    depends_on:
      postgres:
        condition: service_healthy
  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: myapp
      # Required by the postgres image; local development value only
      POSTGRES_PASSWORD: placeholder
    healthcheck:
      test: ["CMD-SHELL", "pg_isready"]
      interval: 5s
    volumes:
      - pgdata:/var/lib/postgresql/data
  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]
volumes:
  pgdata:
Environment Variables
# .env.example (committed to git, no real values)
DATABASE_URL=postgresql://user:placeholder@localhost:5432/myapp
REDIS_URL=redis://localhost:6379
LOG_LEVEL=debug
API_KEY=your-key-here
# .env (never committed, listed in .gitignore)
# Contains real values for local development
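One way to catch a missing or unfilled `.env` early is to validate the variables at startup. The sketch below assumes the four keys from `.env.example`; `loadConfig` is a hypothetical helper, and the placeholder check relies on the `your-key-here` value shown in the example file.

```typescript
// Key names mirror .env.example; loadConfig itself is a hypothetical helper.
const REQUIRED_KEYS = ["DATABASE_URL", "REDIS_URL", "LOG_LEVEL", "API_KEY"] as const;

type RequiredKey = (typeof REQUIRED_KEYS)[number];
type AppConfig = Record<RequiredKey, string>;

// Fail fast with one message listing every missing variable.
function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const missing = REQUIRED_KEYS.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  // Catch a .env that was copied from .env.example without being filled in.
  if (env.API_KEY === "your-key-here") {
    throw new Error("API_KEY still holds the .env.example placeholder");
  }
  return {
    DATABASE_URL: env.DATABASE_URL as string,
    REDIS_URL: env.REDIS_URL as string,
    LOG_LEVEL: env.LOG_LEVEL as string,
    API_KEY: env.API_KEY as string,
  };
}

// Illustrative values only; a Node application would pass process.env instead.
const config = loadConfig({
  DATABASE_URL: "postgresql://user:placeholder@localhost:5432/myapp",
  REDIS_URL: "redis://localhost:6379",
  LOG_LEVEL: "debug",
  API_KEY: "local-dev-key",
});
console.log(config.LOG_LEVEL); // debug
```

Validating once at startup keeps configuration errors out of request handling, where a missing variable would otherwise surface as an unrelated runtime failure.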
Common Anti-Patterns Summary
AVOID                              DO INSTEAD
-------------------------------------------------------------------
FROM node:latest                   Pin exact versions (node:20.11.0-alpine)
Running as root in container       Create and use non-root user
No .dockerignore                   Exclude .git, node_modules, .env
Single CI job does everything      Separate lint, test, build, deploy stages
Manual deployment                  Automated pipeline with approvals
No health checks                   Liveness + readiness probes
Alerts on every error              Alert on error RATE thresholds
Same config in all environments    Per-environment configuration
No rollback plan                   Test rollback before every deploy
Logs as unstructured strings       Structured JSON logs with correlation IDs