devops-automation

DevOps Automation

GitHub Actions Workflow Structure

yaml
name: CI/CD
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: 'npm'
      - run: npm ci
      - run: npm run lint

  test:
    runs-on: ubuntu-latest
    needs: lint
    strategy:
      matrix:
        node-version: [20, 22]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node-version }}
          cache: 'npm'
      - run: npm ci
      - run: npm test -- --coverage
      - uses: actions/upload-artifact@v4
        with:
          name: coverage-${{ matrix.node-version }}
          path: coverage/

  deploy:
    runs-on: ubuntu-latest
    needs: test
    if: github.ref == 'refs/heads/main'
    environment: production
    steps:
      - uses: actions/checkout@v4
      - run: ./deploy.sh

Key patterns:
  • Use `concurrency` to cancel outdated runs
  • Cache dependencies with the setup action's `cache` option
  • Use `needs` for job dependencies
  • Gate deploys with `environment` protection rules
  • Use `matrix` for cross-version testing

Docker Multi-Stage Builds

dockerfile
FROM node:22-alpine AS deps
WORKDIR /app
COPY package.json package-lock.json ./
# Install runtime dependencies only; --omit=dev replaces the deprecated --production flag
RUN npm ci --omit=dev

FROM node:22-alpine AS builder
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
RUN npm run build

FROM node:22-alpine AS runner
WORKDIR /app
RUN addgroup -g 1001 appgroup && adduser -u 1001 -G appgroup -S appuser
COPY --from=deps /app/node_modules ./node_modules
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/package.json ./
USER appuser
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=3s CMD wget -qO- http://localhost:3000/health || exit 1
CMD ["node", "dist/server.js"]

Rules:
  • Use specific image tags, never `latest`
  • Run as a non-root user
  • Copy only necessary files into the final stage
  • Add `HEALTHCHECK` so Docker-based orchestrators can track container health (Kubernetes ignores it and relies on its own probes)
  • Use `.dockerignore` to exclude `node_modules`, `.git`, and tests

Kubernetes Deployment Manifest

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  labels:
    app: api-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api
          image: registry.example.com/api:v1.2.3
          ports:
            - containerPort: 3000
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
          readinessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 15
            periodSeconds: 20
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: api-secrets
                  key: database-url

Always set resource requests and limits. Always define readiness and liveness probes. Use `maxUnavailable: 0` for zero-downtime deploys.

Helm Chart Structure

chart/
  Chart.yaml
  values.yaml
  values-staging.yaml
  values-production.yaml
  templates/
    deployment.yaml
    service.yaml
    ingress.yaml
    hpa.yaml
    _helpers.tpl

values.yaml

yaml
replicaCount: 2
image:
  repository: registry.example.com/api
  tag: v1.2.3  # pinned tag or git SHA, never latest
  pullPolicy: IfNotPresent
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi
ingress:
  enabled: true
  host: api.example.com
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilization: 70

Use `values-{env}.yaml` overrides per environment. Lint charts with `helm lint`. Test with `helm template` before deploying.
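
As a rough illustration of the per-environment override pattern, a `values-production.yaml` could restate only the keys that differ from the defaults, since Helm merges it over `values.yaml`. The replica counts and resource figures below are hypothetical, chosen only to show the shape of an override file.

```yaml
# values-production.yaml (hypothetical overrides; keys mirror values.yaml)
replicaCount: 4
resources:
  limits:
    cpu: "1"             # assumed production ceiling, not from the source
    memory: 1Gi
autoscaling:
  minReplicas: 4
  maxReplicas: 20
```

Assuming a release named `api` (an illustrative name), `helm lint chart/ -f chart/values-production.yaml` checks the chart with these overrides applied, and `helm template api chart/ -f chart/values-production.yaml` renders the resulting manifests for review.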

ArgoCD GitOps Pattern

yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-server
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/k8s-manifests
    targetRevision: main
    path: apps/api-server
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

GitOps principles:
  • Git is the single source of truth for cluster state
  • All changes go through PRs (no `kubectl apply` in production)
  • ArgoCD auto-syncs from Git to the cluster
  • Enable `selfHeal` to revert manual cluster changes
  • Separate app code repos from deployment manifest repos

Monitoring Stack

Prometheus ServiceMonitor

yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-server
spec:
  selector:
    matchLabels:
      app: api-server
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
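
The ServiceMonitor above selects Services by label and scrapes the Service port named `metrics`, so a matching Service has to exist. The source does not show one; the sketch below is an assumed shape, and in particular it assumes `/metrics` is served on the same port 3000 as the application.

```yaml
# Sketch of a Service the ServiceMonitor could select (not from the source)
apiVersion: v1
kind: Service
metadata:
  name: api-server
  labels:
    app: api-server        # matched by the ServiceMonitor's spec.selector
spec:
  selector:
    app: api-server        # routes to the Deployment's pods
  ports:
    - name: metrics        # referenced by endpoints[].port in the ServiceMonitor
      port: 3000
      targetPort: 3000     # assumption: metrics share the app port
```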

Key metrics to expose:
- `http_request_duration_seconds` (histogram) - request latency by route and status
- `http_requests_total` (counter) - request count by route and status
- `process_resident_memory_bytes` (gauge) - memory usage
- `db_query_duration_seconds` (histogram) - database query latency

Alert on: error rate >1%, P99 latency >2s, memory >80% of limit, pod restarts >3 in 10 minutes.
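
The thresholds above could be expressed as a `PrometheusRule`, roughly as sketched here. Several details are assumptions rather than anything the source specifies: that "error" means a 5xx response recorded in a `status` label, the 5-minute rate windows and `for` durations, and that kube-state-metrics is installed to provide the restart counter.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-server-alerts   # hypothetical name
spec:
  groups:
    - name: api-server
      rules:
        - alert: HighErrorRate
          # assumes http_requests_total carries a `status` label with the HTTP code
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > 0.01
          for: 5m
        - alert: HighP99Latency
          expr: |
            histogram_quantile(0.99,
              sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
          for: 5m
        - alert: HighMemoryUsage
          # 512Mi is the memory limit set in the Deployment manifest
          expr: process_resident_memory_bytes > 0.8 * 512 * 1024 * 1024
          for: 5m
        - alert: FrequentPodRestarts
          # needs kube-state-metrics; `api` is the container name from the Deployment
          expr: increase(kube_pod_container_status_restarts_total{container="api"}[10m]) > 3
```

Hard-coding the memory limit keeps the expression simple but has to be updated whenever the Deployment's limit changes.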

Pipeline Best Practices

  1. Keep CI under 10 minutes (parallelize jobs, cache aggressively)
  2. Run linting and type checking before tests
  3. Use ephemeral environments for PR previews
  4. Pin all action versions to SHA, not tags
  5. Store secrets in GitHub Secrets, never in workflow files
  6. Use OIDC for cloud provider authentication (no long-lived keys)
  7. Tag images with git SHA, not `latest`
  8. Run security scans (Trivy, Snyk) on container images in CI
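
Several of these practices (SHA pinning, OIDC, SHA-tagged images, image scanning) can be combined in one job, sketched below under stated assumptions. AWS is assumed as the cloud provider purely for illustration, the role ARN and region are hypothetical, and `<full-commit-sha>` is a placeholder to be replaced with the reviewed commit SHA of each action release.

```yaml
jobs:
  build-image:
    runs-on: ubuntu-latest
    permissions:
      id-token: write      # allows the job to request an OIDC token
      contents: read
    steps:
      # Pin each action to a full commit SHA instead of a tag
      - uses: actions/checkout@<full-commit-sha>
      - uses: aws-actions/configure-aws-credentials@<full-commit-sha>
        with:
          role-to-assume: arn:aws:iam::123456789012:role/ci-deploy  # hypothetical
          aws-region: us-east-1                                      # hypothetical
      # Tag the image with the commit SHA rather than latest
      - run: docker build -t registry.example.com/api:${{ github.sha }} .
      # Fail the job when the scan finds high or critical vulnerabilities
      - uses: aquasecurity/trivy-action@<full-commit-sha>
        with:
          image-ref: registry.example.com/api:${{ github.sha }}
          severity: CRITICAL,HIGH
          exit-code: '1'
```

With OIDC, the cloud provider exchanges the short-lived token for temporary credentials, so no long-lived access key needs to be stored in GitHub Secrets.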