DevOps Automation
GitHub Actions Workflow Structure
```yaml
name: CI/CD
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: 'npm'
      - run: npm ci
      - run: npm run lint
  test:
    runs-on: ubuntu-latest
    needs: lint
    strategy:
      matrix:
        node-version: [20, 22]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node-version }}
          cache: 'npm'
      - run: npm ci
      - run: npm test -- --coverage
      - uses: actions/upload-artifact@v4
        with:
          name: coverage-${{ matrix.node-version }}
          path: coverage/
  deploy:
    runs-on: ubuntu-latest
    needs: test
    if: github.ref == 'refs/heads/main'
    environment: production
    steps:
      - uses: actions/checkout@v4
      - run: ./deploy.sh
```

Key patterns:
- Use `concurrency` to cancel outdated runs
- Cache dependencies with the setup action's `cache` option
- Use `needs` for job dependencies
- Gate deploys with `environment` protection rules
- Use matrix for cross-version testing
Docker Multi-Stage Builds
```dockerfile
# Stage 1: production dependencies only
FROM node:22-alpine AS deps
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci --omit=dev

# Stage 2: full install and build
FROM node:22-alpine AS builder
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
RUN npm run build

# Stage 3: minimal runtime image
FROM node:22-alpine AS runner
WORKDIR /app
RUN addgroup -g 1001 appgroup && adduser -u 1001 -G appgroup -S appuser
COPY --from=deps /app/node_modules ./node_modules
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/package.json ./
USER appuser
EXPOSE 3000
HEALTHCHECK CMD wget -qO- http://localhost:3000/health || exit 1
CMD ["node", "dist/server.js"]
```

Rules:
- Use specific image tags, never `latest`
- Run as non-root user
- Copy only necessary files into final stage
- Add `HEALTHCHECK` for orchestrator integration (Docker and Swarm honor it; Kubernetes ignores it and relies on its own probes)
- Use `.dockerignore` to exclude `node_modules`, `.git`, and tests
Kubernetes Deployment Manifest
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  labels:
    app: api-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api
          image: registry.example.com/api:v1.2.3
          ports:
            - containerPort: 3000
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
          readinessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 15
            periodSeconds: 20
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: api-secrets
                  key: database-url
```

Always set resource requests and limits. Always define readiness and liveness probes. Use `maxUnavailable: 0` for zero-downtime deploys.
Helm Chart Structure
```
chart/
  Chart.yaml
  values.yaml
  values-staging.yaml
  values-production.yaml
  templates/
    deployment.yaml
    service.yaml
    ingress.yaml
    hpa.yaml
    _helpers.tpl
```

```yaml
# values.yaml
replicaCount: 2
image:
  repository: registry.example.com/api
  tag: v1.2.3  # pin a version or git SHA rather than latest
  pullPolicy: IfNotPresent
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi
ingress:
  enabled: true
  host: api.example.com
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilization: 70
```

Use `values-{env}.yaml` overrides per environment. Lint charts with `helm lint`. Render them with `helm template` to check the output before deploying.
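As a rough illustration of a per-environment override, a production file might look like the sketch below. The numbers are hypothetical and not taken from the chart. Helm deep-merges values maps, so only the keys that differ from `values.yaml` need to appear.

```yaml
# values-production.yaml (hypothetical values; only keys that differ from values.yaml)
replicaCount: 3
resources:
  requests:
    cpu: 250m
    memory: 256Mi
autoscaling:
  minReplicas: 3
  maxReplicas: 20
```

Assuming the chart lives in `chart/`, the override can be checked with `helm lint chart/ -f chart/values-production.yaml` and rendered with `helm template api-server chart/ -f chart/values-production.yaml`.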
ArgoCD GitOps Pattern
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-server
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/k8s-manifests
    targetRevision: main
    path: apps/api-server
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```

GitOps principles:
- Git is the single source of truth for cluster state
- All changes go through PRs (no `kubectl apply` in production)
- ArgoCD auto-syncs from Git to cluster
- Enable `selfHeal` to revert manual cluster changes
- Separate app code repos from deployment manifest repos
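One way the "changes go through PRs" and "separate repos" principles could fit together is a workflow in the app repo that opens a PR against the manifest repo. This is only a sketch under several assumptions: `apps/api-server` contains a `kustomization.yaml`, `kustomize` and `gh` are available on the runner, and `MANIFESTS_REPO_TOKEN` is a hypothetical secret with write access to `org/k8s-manifests`.

```yaml
name: Bump image tag
on:
  workflow_dispatch:
    inputs:
      image_tag:
        description: Image tag to deploy (for example a git SHA)
        required: true
jobs:
  bump:
    runs-on: ubuntu-latest
    steps:
      # Check out the manifest repo, not the app repo this workflow lives in.
      - uses: actions/checkout@v4
        with:
          repository: org/k8s-manifests
          token: ${{ secrets.MANIFESTS_REPO_TOKEN }}  # hypothetical secret name
      - name: Update image tag and open a PR
        env:
          GH_TOKEN: ${{ secrets.MANIFESTS_REPO_TOKEN }}
          IMAGE_TAG: ${{ inputs.image_tag }}
        run: |
          branch="bump-api-server-${IMAGE_TAG}"
          git switch -c "$branch"
          cd apps/api-server
          # Assumes a kustomization.yaml exists in this directory.
          kustomize edit set image "registry.example.com/api=registry.example.com/api:${IMAGE_TAG}"
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git commit -am "api-server: deploy ${IMAGE_TAG}"
          git push --set-upstream origin "$branch"
          gh pr create --fill --base main --head "$branch"
```

Once the PR is merged to `main`, the `Application` with automated sync picks up the change; nothing in this workflow talks to the cluster directly.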
Monitoring Stack
```yaml
# Prometheus ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-server
spec:
  selector:
    matchLabels:
      app: api-server
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
```

Key metrics to expose:
- `http_request_duration_seconds` (histogram) - request latency by route and status
- `http_requests_total` (counter) - request count by route and status
- `process_resident_memory_bytes` (gauge) - memory usage
- `db_query_duration_seconds` (histogram) - database query latency

Alert on: error rate >1%, P99 latency >2s, memory >80% of limit, pod restarts >3 in 10 minutes.
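The alert thresholds above could be expressed as a Prometheus Operator `PrometheusRule` along these lines. Treat it as a sketch: the rule names, the `status` label on `http_requests_total`, the `job="api-server"` selector, and the availability of kube-state-metrics for the restart counter are all assumptions, and the 5m/10m windows are illustrative.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-server-alerts  # hypothetical name
spec:
  groups:
    - name: api-server
      rules:
        - alert: ApiHighErrorRate
          # Assumes a `status` label holding the HTTP status code.
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > 0.01
          for: 5m
        - alert: ApiHighP99Latency
          expr: |
            histogram_quantile(0.99,
              sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
          for: 5m
        - alert: ApiMemoryNearLimit
          # 80% of the 512Mi limit from the Deployment; the job label is an assumption.
          expr: process_resident_memory_bytes{job="api-server"} > 0.8 * 512 * 1024 * 1024
          for: 10m
        - alert: ApiPodRestarts
          # Requires kube-state-metrics; container name matches the Deployment.
          expr: increase(kube_pod_container_status_restarts_total{container="api"}[10m]) > 3
```

Hard-coding the memory limit keeps the expression simple but means the rule has to change whenever the Deployment's limit does.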
Pipeline Best Practices
- Keep CI under 10 minutes (parallelize jobs, cache aggressively)
- Run linting and type checking before tests
- Use ephemeral environments for PR previews
- Pin all action versions to SHA, not tags
- Store secrets in GitHub Secrets, never in workflow files
- Use OIDC for cloud provider authentication (no long-lived keys)
- Tag images with git SHA, not `latest`
- Run security scans (Trivy, Snyk) on container images in CI
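Several of these practices can be combined in one job. The sketch below assumes AWS as the cloud provider and a hypothetical repository variable `AWS_DEPLOY_ROLE_ARN`; `<full-commit-sha>` is a placeholder to be replaced with the 40-character commit SHA of each action release you have reviewed, so the file will not run until those are filled in.

```yaml
name: Build and scan image
on:
  push:
    branches: [main]
permissions:
  contents: read
  id-token: write  # allows the job to request an OIDC token
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      # Pin actions to a commit SHA; keep the tag in a comment for readability.
      - uses: actions/checkout@<full-commit-sha> # v4
      # OIDC: exchange the workflow's identity token for short-lived credentials.
      - uses: aws-actions/configure-aws-credentials@<full-commit-sha>
        with:
          role-to-assume: ${{ vars.AWS_DEPLOY_ROLE_ARN }}  # hypothetical variable
          aws-region: us-east-1  # assumed region
      # Tag the image with the commit SHA instead of latest.
      - run: docker build -t registry.example.com/api:${{ github.sha }} .
      # Fail the job when Trivy finds high or critical vulnerabilities.
      - uses: aquasecurity/trivy-action@<full-commit-sha>
        with:
          image-ref: registry.example.com/api:${{ github.sha }}
          severity: CRITICAL,HIGH
          exit-code: '1'
```

No long-lived cloud keys appear in the workflow or in GitHub Secrets; the IAM role's trust policy decides which repositories and branches may assume it.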