devops-iac-engineer

DevOps IaC Engineer

This skill provides expertise in designing and managing cloud infrastructure using Infrastructure as Code (IaC) and DevOps/SRE best practices.

When to Use

  • Designing cloud architecture (AWS, GCP, Azure)
  • Implementing or refactoring CI/CD pipelines
  • Setting up observability (logging, metrics, tracing)
  • Creating Kubernetes clusters and container orchestration strategies
  • Implementing security controls and compliance checks
  • Improving system reliability (SLO/SLA, Disaster Recovery)

Infrastructure as Code (IaC) Principles

  • Declarative Code: Use Terraform/OpenTofu to define the desired state.
  • GitOps: The code repository is the single source of truth. Changes are applied via PRs and automated pipelines.
  • Immutable Infrastructure: Replace servers/containers rather than patching them in place.
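
The declarative principle above can be illustrated with a toy sketch: the desired state is described as data, and a plan is computed by comparing it with what currently exists. This is not how Terraform or OpenTofu work internally; the `plan` function and the resource names (`vm-web`, `vm-api`, `vm-old`) are hypothetical and exist only for illustration.

```python
def plan(desired: dict, actual: dict) -> dict:
    """Compute the changes needed to move `actual` to `desired`."""
    # Resources declared but not yet present must be created.
    create = {k: v for k, v in desired.items() if k not in actual}
    # Resources present with different settings must be updated.
    update = {k: v for k, v in desired.items() if k in actual and actual[k] != v}
    # Resources present but no longer declared must be deleted.
    delete = sorted(k for k in actual if k not in desired)
    return {"create": create, "update": update, "delete": delete}


# Hypothetical desired state (what the repository declares).
desired = {"vm-web": {"size": "small"}, "vm-api": {"size": "medium"}}
# Hypothetical actual state (what exists in the cloud account).
actual = {"vm-web": {"size": "large"}, "vm-old": {"size": "small"}}

changes = plan(desired, actual)
```

The author never writes "create" or "delete" steps by hand. Only the desired state changes, and the tool derives the steps.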

Core Domains

1. Terraform & IaC

  • Use modules for reusability.
  • Separate state by environment (dev, stage, prod) and region.
  • Automate plan and apply in CI/CD.
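
One way to keep state separated by environment and region is to derive a distinct state path for each combination. The sketch below assumes a remote backend that stores state under a key or path; the `state_key` helper, the `project/environment/region` layout, and the project name `shop` are assumptions for illustration, not a Terraform convention.

```python
ENVIRONMENTS = ("dev", "stage", "prod")


def state_key(project: str, environment: str, region: str) -> str:
    """Build a distinct state path for one environment/region pair."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    if not region:
        raise ValueError("region must not be empty")
    # Each combination gets its own state file, so a change applied
    # in dev cannot touch the state that tracks prod.
    return f"{project}/{environment}/{region}/terraform.tfstate"


# Hypothetical project and regions, used only for illustration.
keys = [
    state_key("shop", env, region)
    for env in ENVIRONMENTS
    for region in ("us-east1", "europe-west1")
]
```

A CI/CD job could pass the resulting key to the backend configuration before running plan and apply for that environment.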

2. Kubernetes & Containers

  • Build small, stateless containers.
  • Use Helm or Kustomize for resource management.
  • Implement resource limits and requests.
  • Use namespaces for isolation.
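
The resource limits and requests guideline can be checked before a manifest is applied. The sketch below is a minimal check that assumes a pod spec already loaded into a Python dict using the Kubernetes field names `containers`, `resources`, `requests`, and `limits`. The `missing_resources` helper and the container names are hypothetical.

```python
def missing_resources(pod_spec: dict) -> list:
    """List containers that omit CPU or memory requests/limits."""
    problems = []
    for container in pod_spec.get("containers", []):
        resources = container.get("resources", {})
        for section in ("requests", "limits"):
            for resource in ("cpu", "memory"):
                if resource not in resources.get(section, {}):
                    problems.append(
                        f"{container['name']}: missing {section}.{resource}"
                    )
    return problems


# Hypothetical pod spec: `web` is fully specified, `sidecar` has no limits.
pod_spec = {
    "containers": [
        {
            "name": "web",
            "resources": {
                "requests": {"cpu": "100m", "memory": "128Mi"},
                "limits": {"cpu": "500m", "memory": "256Mi"},
            },
        },
        {
            "name": "sidecar",
            "resources": {"requests": {"cpu": "50m", "memory": "64Mi"}},
        },
    ]
}

problems = missing_resources(pod_spec)
```

A check like this could run in CI next to Helm or Kustomize rendering, failing the build when `problems` is not empty.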

3. CI/CD Pipelines

  • CI: Lint, test, build, and scan (security) on every commit.
  • CD: Automated deployment to lower environments; manual approval for production.
  • Use tools like GitHub Actions, Cloud Build, or ArgoCD.
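
The CD rule above (automatic deployment to lower environments, manual approval for production) amounts to a small gate. The sketch below shows that logic in isolation. The `can_deploy` function, the environment names, and the `approved_by` parameter are assumptions for illustration; in practice the gate is usually configured in the pipeline tool itself.

```python
from typing import Optional


def can_deploy(environment: str, ci_passed: bool,
               approved_by: Optional[str] = None) -> bool:
    """Decide whether a build may be deployed to `environment`."""
    # Nothing is deployed anywhere unless lint, test, build, and scan passed.
    if not ci_passed:
        return False
    # Production additionally requires a named human approver.
    if environment == "prod":
        return bool(approved_by)
    # Lower environments deploy automatically.
    return True


dev_ok = can_deploy("dev", ci_passed=True)
prod_blocked = can_deploy("prod", ci_passed=True)
prod_ok = can_deploy("prod", ci_passed=True, approved_by="release-manager")
```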

4. Observability

  • Logs: Centralized logging (e.g., Cloud Logging, ELK).
  • Metrics: Prometheus/Grafana or Cloud Monitoring.
  • Tracing: OpenTelemetry for distributed tracing.
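
Centralized logging works best when services emit structured records that a log backend can index. The sketch below uses only the Python standard library to format log records as JSON and attach a trace identifier, which is what lets logs be correlated with traces. The `JsonFormatter` class, the logger name `checkout`, and the `trace_id` value are hypothetical; in a real setup the trace ID would typically come from the tracing library rather than being passed by hand.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # `trace_id` is an optional extra field supplied by the caller.
        trace_id = getattr(record, "trace_id", None)
        if trace_id is not None:
            payload["trace_id"] = trace_id
        return json.dumps(payload)


logger = logging.getLogger("checkout")
handler = logging.StreamHandler()  # writes to stderr by default
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Hypothetical event; the trace ID is passed through `extra`.
logger.info("order placed", extra={"trace_id": "abc123"})
```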

5. Security (DevSecOps)

  • Scan IaC for misconfigurations (e.g., Checkov, Trivy).
  • Manage secrets with Secret Manager or Vault; never store them in code.
  • Grant IAM roles least-privilege permissions.
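
As a rough illustration of a least-privilege check, the sketch below flags statements in an AWS-style IAM policy document that allow a bare `*` action or resource. It is far narrower than a scanner such as Checkov or Trivy (for example, it does not catch `s3:*`), and the `wildcard_statements` helper and the sample policy are hypothetical.

```python
def wildcard_statements(policy: dict) -> list:
    """Return (statement index, field) pairs that allow a bare '*'."""
    findings = []
    statements = policy.get("Statement", [])
    # A policy may hold a single statement object instead of a list.
    if isinstance(statements, dict):
        statements = [statements]
    for index, statement in enumerate(statements):
        if statement.get("Effect") != "Allow":
            continue
        for field in ("Action", "Resource"):
            values = statement.get(field, [])
            # Each field may be a single string or a list of strings.
            if isinstance(values, str):
                values = [values]
            if "*" in values:
                findings.append((index, field))
    return findings


# Hypothetical policy: statement 0 is scoped, statement 1 is too broad.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::app-logs/*",
        },
        {"Effect": "Allow", "Action": "*", "Resource": "*"},
    ],
}

findings = wildcard_statements(policy)
```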

SRE Practices

  • SLI/SLO: Define Service Level Indicators and Objectives for critical user journeys.
  • Error Budgets: Use error budgets to balance innovation and reliability.
  • Post-Mortems: Conduct blameless post-mortems for incidents.
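
An error budget follows directly from the SLO: if the objective is a success ratio, the budget is the share of requests that may fail without breaching it. The sketch below works through that arithmetic for a request-based SLI. The `error_budget` function and the numbers (a 99.9% target, one million requests, 250 failures) are assumptions chosen for illustration.

```python
def error_budget(slo_target: float, total_requests: int,
                 failed_requests: int) -> dict:
    """Compute error budget figures for a request-based SLO."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    # Failures the SLO tolerates over the measured window.
    allowed = (1 - slo_target) * total_requests
    return {
        "allowed_failures": allowed,
        "remaining": allowed - failed_requests,
        "consumed_fraction": failed_requests / allowed,
    }


# Hypothetical window: 99.9% target, 1,000,000 requests, 250 failures.
budget = error_budget(0.999, 1_000_000, 250)
```

With these numbers roughly 1,000 failures are allowed, so about a quarter of the budget is consumed. A team might keep shipping features while budget remains and shift to reliability work once it is exhausted.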