DevOps & Infrastructure

When to Load

Trigger: Docker, CI/CD pipelines, deployment configuration, monitoring, infrastructure as code
Skip: Application logic only with no infrastructure or deployment concerns

DevOps Workflow

Copy this checklist and track progress:

DevOps Setup Progress:
- [ ] Step 1: Containerize application (Dockerfile)
- [ ] Step 2: Set up CI/CD pipeline
- [ ] Step 3: Define deployment strategy
- [ ] Step 4: Configure monitoring & alerting
- [ ] Step 5: Set up environment management
- [ ] Step 6: Document runbooks
- [ ] Step 7: Validate against anti-patterns checklist

Docker Best Practices

Multi-Stage Build

dockerfile

# WRONG: Single stage, bloated image
FROM node:20
WORKDIR /app
COPY . .
RUN npm install
RUN npm run build
CMD ["node", "dist/index.js"]
# Result: 1.2GB image with devDependencies and source code

# CORRECT: Multi-stage build
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
# Drop devDependencies so they are not copied into the runtime image
RUN npm prune --omit=dev

FROM node:20-alpine AS runner
WORKDIR /app
ENV NODE_ENV=production
RUN addgroup -g 1001 appgroup && adduser -u 1001 -G appgroup -s /bin/sh -D appuser
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./
USER appuser
EXPOSE 3000
CMD ["node", "dist/index.js"]
# Result: ~150MB image, no devDependencies, non-root user

Python Multi-Stage

dockerfile

FROM python:3.12-slim AS builder
WORKDIR /app
RUN pip install uv
COPY pyproject.toml uv.lock ./
# Install dependencies only; the project source is copied in afterwards
RUN uv sync --frozen --no-dev --no-install-project
COPY . .

FROM python:3.12-slim AS runner
WORKDIR /app
RUN useradd -r -s /bin/false appuser
COPY --from=builder /app/.venv /app/.venv
COPY --from=builder /app/src ./src
ENV PATH="/app/.venv/bin:$PATH"
USER appuser
CMD ["python", "-m", "src.main"]

Layer Caching

dockerfile

# WRONG: Cache busted on every code change
COPY . .
RUN npm ci

# CORRECT: Dependencies cached separately
COPY package*.json ./
# Cached unless package.json or package-lock.json changes
RUN npm ci
# Only source code changes bust this layer
COPY . .

.dockerignore

node_modules
.git
.env
*.md
.vscode
coverage
dist
__pycache__
.pytest_cache
*.pyc

Security

dockerfile

# Always pin versions (NOT node:latest)
FROM node:20.11.0-alpine

# Don't run as root
USER appuser

# Read-only filesystem where possible (shell command, run on the host)
# docker run --read-only --tmpfs /tmp myapp

# Scan images (shell commands, run on the host)
# docker scout cves myimage:latest
# trivy image myimage:latest

CI/CD Pipeline Design

GitHub Actions Structure

yaml

name: CI/CD
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: "npm"
      - run: npm ci
      - run: npm run lint

  test:
    runs-on: ubuntu-latest
    needs: lint
    services:
      postgres:
        image: postgres:16
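        env:
          POSTGRES_DB: testdb
          # Throwaway credential for the CI service container; the postgres image
          # refuses to start without a password (or an explicit trust auth method)
          POSTGRES_PASSWORD: postgres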
        ports: ["5432:5432"]
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: "npm"
      - run: npm ci
      - run: npm test

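  build:
    runs-on: ubuntu-latest
    needs: test
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      # Log in to GHCR so the push on main is authenticated
      - uses: docker/login-action@v3
        if: github.event_name == 'push'
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          push: ${{ github.event_name == 'push' }}
          tags: ghcr.io/${{ github.repository }}:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max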

  deploy:
    runs-on: ubuntu-latest
    needs: build
    if: github.ref == 'refs/heads/main'
    environment: production
    steps:
      - run: echo "Deploy to production"

Caching Strategies

yaml

# Node modules
- uses: actions/setup-node@v4
  with:
    cache: "npm"

# Python with uv
- name: Cache uv
  uses: actions/cache@v4
  with:
    path: ~/.cache/uv
    key: uv-${{ runner.os }}-${{ hashFiles('uv.lock') }}

# Docker layer caching
- uses: docker/build-push-action@v5
  with:
    cache-from: type=gha
    cache-to: type=gha,mode=max

Deployment Strategies

Blue-Green Deployment

1. Run two identical environments: Blue (live) and Green (idle)
2. Deploy new version to Green
3. Run smoke tests on Green
4. Switch load balancer to Green
5. Green is now live, Blue is idle
6. Rollback: switch back to Blue

Pros: Instant rollback, zero downtime
Cons: 2x infrastructure cost during deploy
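
The blue-green steps above can be sketched as a small state transition. This is a minimal illustration, not a real load balancer integration: the `BlueGreenState` and `Environment` types and the `smokeTest` callback are hypothetical names, and the smoke test is kept synchronous for brevity (a real one would usually be asynchronous).

```typescript
// Hypothetical model of a blue-green setup; none of these names come from a real API.
type Color = "blue" | "green";

interface Environment {
  color: Color;
  version: string;
}

interface BlueGreenState {
  live: Color; // which environment the load balancer points at
  environments: Record<Color, Environment>;
}

function otherColor(color: Color): Color {
  return color === "blue" ? "green" : "blue";
}

// Deploys to the idle environment, runs the smoke test there, and switches
// traffic only if the test passes. On failure the live side is left untouched.
function blueGreenDeploy(
  state: BlueGreenState,
  newVersion: string,
  smokeTest: (env: Environment) => boolean,
): BlueGreenState {
  const idle = otherColor(state.live);
  const candidate: Environment = { color: idle, version: newVersion };
  const environments: Record<Color, Environment> = {
    ...state.environments,
    [idle]: candidate,
  };
  const live = smokeTest(candidate) ? idle : state.live;
  return { live, environments };
}

// Rollback is just pointing traffic back at the previous environment.
function rollback(state: BlueGreenState): BlueGreenState {
  return { ...state, live: otherColor(state.live) };
}
```

Because the previous environment keeps running its old version after the switch, `rollback` does not need to redeploy anything, which is where the "instant rollback" property comes from.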

Canary Deployment

1. Deploy new version to small subset (5% of traffic)
2. Monitor error rates and latency
3. Gradually increase: 5% -> 25% -> 50% -> 100%
4. Rollback: route all traffic back to old version

Pros: Limited blast radius, real-world testing
Cons: More complex routing, longer rollout
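
The canary progression above can be sketched as a function that picks the next traffic percentage from the observed metrics. The stage list mirrors the 5% -> 25% -> 50% -> 100% sequence; the `CanaryMetrics` and `CanaryThresholds` shapes are assumptions made for illustration, and the threshold values are up to the caller.

```typescript
// Traffic stages taken from the rollout sequence described above.
const CANARY_STAGES = [5, 25, 50, 100] as const;

// Hypothetical metric and threshold shapes; adapt to whatever your monitoring exposes.
interface CanaryMetrics {
  errorRate: number; // fraction of failed requests, e.g. 0.01 for 1%
  p99LatencyMs: number;
}

interface CanaryThresholds {
  maxErrorRate: number;
  maxP99LatencyMs: number;
}

// Returns the percentage of traffic the new version should receive next.
// 0 means "roll back": route all traffic to the old version.
function nextCanaryPercent(
  currentPercent: number,
  metrics: CanaryMetrics,
  thresholds: CanaryThresholds,
): number {
  const healthy =
    metrics.errorRate <= thresholds.maxErrorRate &&
    metrics.p99LatencyMs <= thresholds.maxP99LatencyMs;
  if (!healthy) {
    return 0;
  }
  // Advance to the first stage above the current one; stay at 100 once complete.
  const next = CANARY_STAGES.find((stage) => stage > currentPercent);
  return next ?? 100;
}
```

Calling this once per monitoring interval keeps the blast radius limited: an unhealthy reading at any stage sends all traffic back to the old version.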

Rolling Deployment

1. Replace instances one at a time
2. Each new instance passes health checks before next starts
3. Continue until all instances updated

Pros: No extra infrastructure, gradual rollout
Cons: Mixed versions during deploy, slower rollback
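
A rolling update can be sketched as a loop that replaces one instance at a time and halts at the first failed health check. The `Instance` type and the `isHealthy` callback are hypothetical; a real orchestrator would also wait for startup and drain connections.

```typescript
// Hypothetical instance model for illustration only.
interface Instance {
  id: string;
  version: string;
}

interface RollingResult {
  instances: Instance[];
  completed: boolean; // false if the rollout halted on a failed health check
}

// Replaces instances one at a time. Each replacement must pass the health
// check before the next one starts; on failure the rollout stops and the
// remaining instances keep their old version.
function rollingUpdate(
  instances: Instance[],
  newVersion: string,
  isHealthy: (candidate: Instance) => boolean,
): RollingResult {
  const result: Instance[] = [];
  let halted = false;
  for (const instance of instances) {
    if (halted) {
      result.push(instance);
      continue;
    }
    const candidate: Instance = { ...instance, version: newVersion };
    if (isHealthy(candidate)) {
      result.push(candidate);
    } else {
      halted = true;
      result.push(instance); // keep the old instance in service
    }
  }
  return { instances: result, completed: !halted };
}
```

A halted rollout leaves the fleet on mixed versions, which is the trade-off listed under Cons: rolling back means running the same loop again with the old version.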

Feature Flags

typescript

// Simple feature flag implementation
const features = {
  NEW_CHECKOUT: process.env.FF_NEW_CHECKOUT === "true",
  DARK_MODE: process.env.FF_DARK_MODE === "true",
};

function getCheckoutFlow(user: User) {
  if (features.NEW_CHECKOUT && user.betaGroup) {
    return newCheckoutFlow(user);
  }
  return legacyCheckoutFlow(user);
}

// Use a proper service for production: LaunchDarkly, Unleash, Flagsmith

Infrastructure as Code

Terraform Basics

hcl

# main.tf
terraform {
  required_version = ">= 1.5"

  backend "s3" {
    bucket = "myapp-terraform-state"
    key    = "prod/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "web" {
  ami           = var.ami_id
  instance_type = var.instance_type

  tags = {
    Name        = "web-${var.environment}"
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

# variables.tf
variable "environment" {
  type    = string
  default = "dev"
}

variable "instance_type" {
  type    = string
  default = "t3.micro"
}

# Referenced by aws_instance.web above; no default, so it must be supplied
variable "ami_id" {
  type = string
}

Terraform Rules

1. Always use remote state (S3, GCS, Terraform Cloud)
2. Lock state files to prevent concurrent modifications
3. Use variables and modules for reusability
4. Tag all resources with environment and ManagedBy
5. Run `terraform plan` before `terraform apply`
6. Never edit infrastructure manually (all changes via code)
7. Use workspaces or separate state files per environment

Monitoring & Observability

The Three Pillars

METRICS: Numeric measurements over time
  - Request rate, error rate, latency (RED method)
  - CPU, memory, disk, network (USE method)
  - Business metrics (signups, purchases)
  Tools: Prometheus, Datadog, CloudWatch

LOGS: Discrete events with context
  - Structured JSON format
  - Correlation IDs across services
  - Log levels: DEBUG, INFO, WARN, ERROR
  Tools: ELK Stack, Loki, CloudWatch Logs

TRACES: Request flow across services
  - Distributed tracing with span context
  - Latency breakdown per service
  - Dependency mapping
  Tools: Jaeger, Zipkin, Datadog APM
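
The LOGS items above (structured JSON, correlation IDs, log levels) can be sketched in a few lines. This is a minimal illustration rather than a replacement for a logging library; the `LogEntry` shape and the function names are assumptions.

```typescript
type LogLevel = "DEBUG" | "INFO" | "WARN" | "ERROR";

// Hypothetical entry shape: four fixed fields plus arbitrary context.
interface LogEntry {
  timestamp: string;
  level: LogLevel;
  message: string;
  correlationId: string;
  [key: string]: unknown;
}

// Builds one structured entry. Context is spread first so it can never
// overwrite the fixed fields (timestamp, level, message, correlationId).
function buildLogEntry(
  level: LogLevel,
  message: string,
  correlationId: string,
  context: Record<string, unknown> = {},
  now: Date = new Date(),
): LogEntry {
  return {
    ...context,
    timestamp: now.toISOString(),
    level,
    message,
    correlationId,
  };
}

// Serializes the entry as a single JSON line, ready for a log shipper.
function formatLogLine(entry: LogEntry): string {
  return JSON.stringify(entry);
}
```

Passing the same `correlationId` through every service that handles a request is what lets the log backend stitch those lines together into one request history.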

Health Check Endpoint

typescript

// Express health check
app.get("/health", async (req, res) => {
  const checks = {
    uptime: process.uptime(),
    timestamp: Date.now(),
    database: "unknown",
    redis: "unknown",
  };

  try {
    await db.query("SELECT 1");
    checks.database = "healthy";
  } catch (e) {
    checks.database = "unhealthy";
  }

  try {
    await redis.ping();
    checks.redis = "healthy";
  } catch (e) {
    checks.redis = "unhealthy";
  }

  // Only the database gates the status code; Redis is reported but treated as non-critical
  const isHealthy = checks.database === "healthy";
  res.status(isHealthy ? 200 : 503).json(checks);
});

Alerting Rules

Good alerts:
- Error rate > 1% for 5 minutes (actionable)
- P99 latency > 2s for 10 minutes (meaningful)
- Disk usage > 80% (preventive)

Bad alerts:
- CPU spike for 30 seconds (too noisy)
- Any single 500 error (too sensitive)
- "Something might be wrong" (not actionable)

Alert fatigue is real. Every alert should require human action.
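
The difference between a rate-over-time alert and a single-spike alert can be sketched as a sustained-threshold check. The sketch assumes one error-rate sample per minute; the `Sample` shape, the function name, and the default values (1% over 5 minutes, taken from the first example above) are illustrative only.

```typescript
// Hypothetical sample shape: one error-rate measurement at a point in time.
interface Sample {
  timestampMs: number;
  errorRate: number; // fraction of failed requests, e.g. 0.01 for 1%
}

// Fires only when every sample in the trailing window exceeds the threshold
// and the window holds at least minSamples measurements, so a single spike
// or a gap in the data does not page anyone.
function shouldAlert(
  samples: Sample[],
  nowMs: number,
  threshold: number = 0.01,
  windowMs: number = 5 * 60 * 1000,
  minSamples: number = 5,
): boolean {
  const windowStart = nowMs - windowMs;
  const inWindow = samples.filter(
    (s) => s.timestampMs >= windowStart && s.timestampMs <= nowMs,
  );
  if (inWindow.length < minSamples) {
    return false;
  }
  return inWindow.every((s) => s.errorRate > threshold);
}
```

Monitoring systems such as Prometheus express the same idea declaratively with a duration on the alert rule; the function above only illustrates the logic.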

Environment Management

Dev/Staging/Prod Parity

yaml

# docker-compose.yml for local development
services:
  app:
    build: .
    env_file: .env
    ports: ["3000:3000"]
    depends_on:
      postgres:
        condition: service_healthy

  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: myapp
      # Local-only placeholder credentials, matching .env.example;
      # the postgres image will not start without a password
      POSTGRES_USER: user
      POSTGRES_PASSWORD: placeholder
    healthcheck:
      test: ["CMD-SHELL", "pg_isready"]
      interval: 5s
    volumes:
      - pgdata:/var/lib/postgresql/data

  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]

volumes:
  pgdata:

Environment Variables

# .env.example (committed to git, no real values)
DATABASE_URL=postgresql://user:placeholder@localhost:5432/myapp
REDIS_URL=redis://localhost:6379
LOG_LEVEL=debug
API_KEY=your-key-here

# .env (never committed, listed in .gitignore)
# Contains real values for local development
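
Since the real `.env` is never committed, a missing variable is easy to overlook until runtime. One way to catch that early is to validate the environment at startup; the sketch below assumes the four variable names from `.env.example` are all required, and `findMissingEnv` is a hypothetical helper name.

```typescript
// Variable names taken from .env.example; treating all four as required is an assumption.
const REQUIRED_ENV_VARS = ["DATABASE_URL", "REDIS_URL", "LOG_LEVEL", "API_KEY"] as const;

// Returns the names of required variables that are unset or blank.
function findMissingEnv(env: Record<string, string | undefined>): string[] {
  return REQUIRED_ENV_VARS.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}
```

At startup, an application might call `findMissingEnv(process.env)` and exit with a clear message if the returned list is not empty, instead of failing later on the first database or API call.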

Common Anti-Patterns Summary

AVOID                              DO INSTEAD
-------------------------------------------------------------------
FROM node:latest                   Pin exact versions (node:20.11.0-alpine)
Running as root in container       Create and use non-root user
No .dockerignore                   Exclude .git, node_modules, .env
Single CI job does everything      Separate lint, test, build, deploy stages
Manual deployment                  Automated pipeline with approvals
No health checks                   Liveness + readiness probes
Alerts on every error              Alert on error RATE thresholds
Same config in all environments    Per-environment configuration
No rollback plan                   Test rollback before every deploy
Logs as unstructured strings       Structured JSON logs with correlation IDs
