aws-containers


AWS Containers


Service Overview


| Developer Need | Recommend | Key CLI / CDK |
|---|---|---|
| Simplest container deploy (HTTP app/API, new customers) | ECS Express Mode | `aws ecs create-express-gateway-service` |
| Web app, worker, batch, scheduled task | ECS on Fargate | `aws ecs create-service` / CDK `ecsPatterns.ApplicationLoadBalancedFargateService` |
| GPU workloads or >16 vCPU | ECS on EC2 | CDK `ecs.Ec2Service` |
| Store container images | ECR | `aws ecr create-repository` |
| Web app behind a load balancer | ECS Fargate + ALB | CDK `ecsPatterns.ApplicationLoadBalancedFargateService` |
| SQS worker scaling on queue depth | ECS Fargate + SQS | CDK `ecsPatterns.QueueProcessingFargateService` |
| Cron job / scheduled task | ECS Fargate + EventBridge | CDK `ecsPatterns.ScheduledFargateTask` |
| Service mesh / service-to-service | ECS Service Connect | Configure on ECS service with Cloud Map namespace |
| Debug a running container | ECS Exec | `aws ecs execute-command --interactive --command "/bin/sh"` |

When a developer says "deploy my container" without naming a service: recommend ECS Express Mode for simple HTTP apps (replaces App Runner for new customers). Recommend ECS Fargate for everything else. Never recommend EKS unless they explicitly ask for Kubernetes.
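
The selection rule above can be sketched as a small decision function. This is only an illustration of the rule as stated in the table and paragraph; the `Workload` shape and the `recommend` name are hypothetical and not part of any AWS API.

```typescript
type Recommendation = "ECS Express Mode" | "ECS on Fargate" | "ECS on EC2" | "EKS";

// Hypothetical description of what the developer asked for.
interface Workload {
  simpleHttpApp: boolean;       // plain HTTP app or API with no special infrastructure needs
  needsGpu?: boolean;           // GPU workloads go to ECS on EC2
  vcpu?: number;                // more than 16 vCPU also goes to ECS on EC2
  asksForKubernetes?: boolean;  // true only when the user explicitly names Kubernetes
}

function recommend(w: Workload): Recommendation {
  // EKS is returned only on an explicit Kubernetes request.
  if (w.asksForKubernetes) return "EKS";
  if (w.needsGpu || (w.vcpu !== undefined && w.vcpu > 16)) return "ECS on EC2";
  if (w.simpleHttpApp) return "ECS Express Mode";
  return "ECS on Fargate";
}
```

For example, `recommend({ simpleHttpApp: false })` falls through to ECS on Fargate, which matches the "everything else" default.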

Overview


Provides expertise for building, deploying, and operating containerized workloads using Amazon ECS, AWS Fargate, Amazon ECR, and AWS App Runner.
Recommended setup: Install the AWS MCP server for sandboxed execution, audit logging, and enterprise controls. See: aws.amazon.com/mcp
Without AWS MCP: This skill works with any agent that has AWS CLI access. All commands use standard AWS CLI syntax.
When NOT to use this skill:
  • Kubernetes or EKS workloads → use the kubernetes skill
  • CI/CD pipeline setup for container deployments → use the deploy skill
  • VPC subnet design and security group architecture → use the networking skill
  • Running code without containers (Lambda, Step Functions) → use the serverless skill
Before executing any commands:
  • You MUST verify AWS CLI v2 is installed and configured before running commands
  • You MUST inform the user if required tools (AWS CLI, Docker, Session Manager plugin) are missing
  • You MUST respect the user's decision to abort at any point

Gotchas


Apply these every time. Each corrects a mistake agents make without explicit instruction.
  1. Fargate CPU/memory must be valid combinations. Arbitrary values cause `Invalid 'cpu' setting for task`:
    • 256 (0.25 vCPU): 512 MiB, 1 GB, 2 GB
    • 512 (0.5 vCPU): 1–4 GB (1 GB increments)
    • 1024 (1 vCPU): 2–8 GB (1 GB increments)
    • 2048 (2 vCPU): 4–16 GB (1 GB increments)
    • 4096 (4 vCPU): 8–30 GB (1 GB increments)
    • 8192 (8 vCPU): 16–60 GB (4 GB increments)
    • 16384 (16 vCPU): 32–120 GB (8 GB increments)
    If the user requests an invalid combination, tell them and recommend the nearest valid option. You MUST NOT silently produce an invalid task definition.
  2. Fargate requires `awsvpc` networking mode, no exceptions. Agents frequently suggest `bridge` or `host` mode for Fargate tasks, which causes immediate registration failure. You MUST set `networkMode` to `awsvpc` for all Fargate task definitions. On EC2, `awsvpc` is recommended; `bridge` is legacy only.
  3. Execution role vs task role: never confuse them. `executionRoleArn`: the ECS agent uses it to pull images, fetch secrets, and write logs. `taskRoleArn`: application code uses it to call AWS APIs. ECS Exec permissions (`ssmmessages:*`) go on the task role. ECR pull permissions go on the execution role. `ecr:GetAuthorizationToken` MUST use `Resource: "*"` (registry-level action).
  4. Secrets are injected at task launch only; there is no hot-reload. Changed secrets require `aws ecs update-service --force-new-deployment`. To reference a specific JSON key in Secrets Manager, use `arn:aws:secretsmanager:region:account:secret:name-hash:json-key::`. The trailing colons are required (they represent empty version-stage and version-id fields). You can also use SSM Parameter Store with `valueFrom` pointing to the parameter ARN; the execution role needs `ssm:GetParameters` permission.
  5. ALB deregistration delay defaults to 300s — reduce to 30–60s. This is the #1 cause of slow deployments. Set it on the target group. It SHOULD exceed your longest request duration.
  6. Set `healthCheckGracePeriodSeconds` on every ECS service behind an ALB. Without it, the ALB marks tasks unhealthy before they're ready, the circuit breaker counts failures, and the deployment rolls back. JVM/Spring Boot apps need 60–120s.
  7. Always enable deployment circuit breaker with rollback. Without it, bad deployments stay "in progress" for 30+ minutes. In CDK: `circuitBreaker: { rollback: true }` (specifying the property implicitly enables it; `enable` defaults to `true`).
  8. Private subnet Fargate tasks need NAT or all four VPC endpoints. Required endpoints: `ecr.dkr` (interface), `ecr.api` (interface), `s3` (gateway; ECR stores layers in S3), `logs` (interface, for CloudWatch). The S3 gateway endpoint is the most commonly missed. For ECS Exec, also add `ssmmessages`.
  9. ECR lifecycle policies evaluate within 24 hours, not immediately. Multi-architecture images referenced by a manifest list cannot be expired until the manifest list is deleted first. Preview before applying: first `aws ecr start-lifecycle-policy-preview --repository-name $REPO`, then `aws ecr get-lifecycle-policy-preview --repository-name $REPO --output json` to see which images would be affected.
  10. ECS Exec requires task role permissions, NOT execution role. The task role needs `ssmmessages:CreateControlChannel`, `ssmmessages:CreateDataChannel`, `ssmmessages:OpenControlChannel`, and `ssmmessages:OpenDataChannel`. Tasks launched before enabling `enableExecuteCommand` do NOT support ECS Exec; force a new deployment. The container image must include the binary specified in `--command` (e.g., `/bin/sh` for interactive sessions). For command logging to S3 or CloudWatch Logs, `script` and `cat` must also be installed. Fargate platform version MUST be 1.4.0+.
  11. `awslogs` log driver mode: check your account's default. Per ECS docs, the ECS service defaults to `non-blocking` mode, which drops logs when the buffer fills. The `defaultLogDriverMode` account setting can override this per account. For guaranteed log delivery (audit/compliance), explicitly set `"mode": "blocking"` in `logConfiguration.options`. Check your effective default: `aws ecs list-account-settings --name defaultLogDriverMode --effective-settings --output json`.
  12. App Runner VPC connector routes ALL application-initiated outbound traffic through the VPC. (App Runner is sunset — new customers should use ECS Express Mode instead.) Without a NAT gateway, external API calls and AWS service calls from your application code break. App Runner's own managed traffic (pulling images, pushing logs, retrieving secrets) is NOT routed through the VPC and is unaffected. Implement retry logic with backoff for database connections at startup.
  13. For `desiredCount=1` zero-downtime deploys: `minimumHealthyPercent=100, maximumPercent=200`. This requires capacity for 2 tasks during deployment. You MUST NOT set `minimumHealthyPercent=0` if zero downtime is required.
  14. 502 Bad Gateway from ALB: check in this order: (a) Container not listening on the port in the target group. (b) Container crashing before responding. (c) Task security group doesn't allow inbound from ALB security group on the container port. (d) Health check path returns non-200. (e) Response time exceeds the health check timeout.
  15. Fargate platform version: always use `LATEST` or `1.4.0`. Version 1.3.0 is being retired June 15, 2026 and terminated June 30, 2026.
  16. SQS worker scaling: use a custom backlog-per-task metric. Raw `ApproximateNumberOfMessagesVisible` with target tracking doesn't work because adding tasks doesn't reduce queue depth proportionally. Use a custom metric (`ApproximateNumberOfMessagesVisible / RunningTaskCount`) with target tracking, or use step scaling. CDK `QueueProcessingFargateService` handles this automatically via `scalingSteps`. Workers MUST handle SIGTERM gracefully within `stopTimeout` (default 30s, max 120s on Fargate).
  17. Blue/green deployments: use native ECS blue/green (July 2025+) for new services. Supports all-at-once, canary, and linear traffic shifting (canary/linear added October 2025), plus Service Connect, headless services, EBS volumes, and lifecycle hooks. CodeDeploy blue/green is now legacy — native ECS blue/green has full feature parity.
  18. Container dependency `HEALTHY` condition requires a health check on the dependency container. Without a configured health check, the dependent container never starts because ECS does not progress it to its next state. If `startTimeout` is set (max 120s), the dependency times out and the task fails; if not set, the dependent container blocks indefinitely. For init containers, use the `SUCCESS` condition instead.
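
As a rough illustration of gotcha 1, the sketch below encodes the CPU/memory table from the list above and checks a requested pair before a task definition is written. The function names are hypothetical, and "nearest valid option" is interpreted here as rounding memory up to the next valid value for the same CPU setting, which is an assumption rather than a rule from the list.

```typescript
// Valid Fargate memory values in MiB for a CPU setting, per the table in gotcha 1.
function validMemoryMiB(cpu: number): number[] {
  const range = (fromGb: number, toGb: number, stepGb: number): number[] => {
    const out: number[] = [];
    for (let g = fromGb; g <= toGb; g += stepGb) out.push(g * 1024);
    return out;
  };
  switch (cpu) {
    case 256: return [512, 1024, 2048];
    case 512: return range(1, 4, 1);
    case 1024: return range(2, 8, 1);
    case 2048: return range(4, 16, 1);
    case 4096: return range(8, 30, 1);
    case 8192: return range(16, 60, 4);
    case 16384: return range(32, 120, 8);
    default: return []; // not a Fargate CPU size
  }
}

function isValidFargateSize(cpu: number, memoryMiB: number): boolean {
  return validMemoryMiB(cpu).indexOf(memoryMiB) !== -1;
}

// Assumption: "nearest" means the smallest valid memory >= the request for this CPU.
// Returns undefined when the CPU size cannot hold the requested memory at all.
function nearestValidMemoryMiB(cpu: number, requestedMiB: number): number | undefined {
  for (const m of validMemoryMiB(cpu)) {
    if (m >= requestedMiB) return m;
  }
  return undefined;
}
```

A request for 8192 CPU units with 17 GB would be rejected by `isValidFargateSize`, and `nearestValidMemoryMiB(8192, 17 * 1024)` suggests 20 GB, which can be offered to the user instead of silently emitting an invalid definition.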

Quick-Start: CDK Fargate Web App


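```typescript
import * as cdk from 'aws-cdk-lib';
import * as ecr from 'aws-cdk-lib/aws-ecr';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecsPatterns from 'aws-cdk-lib/aws-ecs-patterns';
import * as secretsmanager from 'aws-cdk-lib/aws-secretsmanager';
import { Construct } from 'constructs';

export class WebAppStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Placeholder lookups: substitute your own repository and secret names.
    const repo = ecr.Repository.fromRepositoryName(this, 'Repo', 'web-app');
    const dbSecret = secretsmanager.Secret.fromSecretNameV2(this, 'DbSecret', 'web-app/db');

    const service = new ecsPatterns.ApplicationLoadBalancedFargateService(this, 'WebApp', {
      taskImageOptions: {
        image: ecs.ContainerImage.fromEcrRepository(repo, 'latest'),
        containerPort: 8080,
        // Injected at task launch; changing the secret needs a new deployment.
        secrets: { DB_PASSWORD: ecs.Secret.fromSecretsManager(dbSecret) },
      },
      cpu: 512,             // must pair with a valid Fargate memory value
      memoryLimitMiB: 1024,
      desiredCount: 2,
      publicLoadBalancer: true,
      circuitBreaker: { rollback: true }, // roll back failed deployments
      minHealthyPercent: 100,
    });

    // Shorten the ALB deregistration delay from its 300s default.
    service.targetGroup.setAttribute('deregistration_delay.timeout_seconds', '30');

    // Scale between 2 and 10 tasks on average CPU utilization.
    const scaling = service.service.autoScaleTaskCount({ minCapacity: 2, maxCapacity: 10 });
    scaling.scaleOnCpuUtilization('CpuScaling', { targetUtilizationPercent: 70 });
  }
}
```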

CDK L3 patterns auto-create VPC, cluster, ALB, target group, and security groups. For production, create these separately and pass them in. `ApplicationLoadBalancedFargateService` defaults to `assignPublicIp: false`: tasks in public subnets need `assignPublicIp: true` for internet access, or use private subnets with NAT.

Quick-Start: ECS Exec

```bash
# 1. Enable on the service (existing tasks won't support it; force a new deployment)
aws ecs update-service --cluster "$CLUSTER" --service "$SERVICE" \
  --enable-execute-command --force-new-deployment --output json

# 2. Connect (task role must have ssmmessages:* permissions)
aws ecs execute-command --cluster "$CLUSTER" --task "$TASK_ID" \
  --container "$CONTAINER" --interactive --command "/bin/sh"
```

If `TargetNotConnectedException`: wait 30–60s for SSM agent startup, check NAT/VPC endpoint for `ssmmessages`, verify task role (not execution role) has permissions.

Common Workflows


Use the best available tool for AWS operations (MCP server, AWS CLI, or SDK). The commands below show the AWS CLI form.
Read reference files only when the conversation requires deeper detail.
  • Read references/task-definition-authoring.md if the user needs to author a task definition, configure CPU/memory, set up networking modes, inject secrets, mount volumes, or configure container dependencies.
  • Read references/fargate-service-deployment.md if the user needs to deploy a Fargate service behind an ALB, configure health checks, tune deregistration delay, set up path-based routing, or handle private subnet networking.
  • Read references/ecr-repository-management.md if the user needs ECR lifecycle policies, image scanning, cross-account image pulls, or is debugging image pull errors.
  • Read references/ecs-exec-debugging.md if the user needs to set up ECS Exec, debug TargetNotConnectedException, configure session logging, or validate ECS Exec prerequisites.
  • Read references/service-scaling-and-updates.md if the user needs auto-scaling, deployment strategies (rolling, blue/green), circuit breaker configuration, or Service Connect setup.
  • Read references/app-runner-guide.md if the user has an existing App Runner service, needs to troubleshoot App Runner connectivity, or wants to migrate from App Runner to ECS Express Mode.
  • Read references/ecs-infrastructure-patterns.md if the user needs CDK or CloudFormation examples for Fargate services, SQS workers, scheduled tasks, EFS volumes, ECS Exec, path-based routing, private subnets, or FireLens.
  • Read references/ecs-logging-and-firelens.md if the user needs awslogs configuration, FireLens/Fluent Bit setup, multiline log handling, or guaranteed log delivery.
  • Read references/ecs-troubleshooting-guide.md if the user is debugging task placement failures, OOM kills (exit code 137), health check failures, image pull errors, or networking issues in private subnets.
  • Read references/fargate-spot.md if the user asks about Fargate Spot pricing, capacity provider strategies, or interruption handling.

Decision Guide: ECS Express Mode vs ECS Fargate


App Runner: Sunset April 30, 2026. No new customers, no new features. Existing customers should migrate to ECS Express Mode. See App Runner Availability Change.

| Factor | ECS Express Mode | ECS Fargate |
|---|---|---|
| Setup complexity | Minimal (single API call) | Moderate: task def, service, cluster, ALB |
| Networking control | Managed (ALB in default VPC) | Full: awsvpc, security groups, subnets |
| Scaling | Auto (CPU-based) | Configurable target/step scaling |
| Use when | New simple HTTP app/API, zero infra management | Production services needing VPC, ALB, fine-grained IAM |
| Limitations | New service, evolving feature set | Most setup required |

Default recommendation: Use ECS Fargate for production workloads. Use ECS Express Mode for the simplest path (new customers).

Troubleshooting


CannotPullContainerError

Cause: Task cannot reach ECR. In private subnets, tasks need a NAT gateway or VPC endpoints (`ecr.api`, `ecr.dkr`, `s3` gateway, `logs`).

Fix: Verify the route table has a route to the NAT gateway, or create the required VPC endpoints. Verify the execution role has `ecr:GetDownloadUrlForLayer`, `ecr:BatchGetImage`, and `ecr:GetAuthorizationToken` (Resource: `"*"`). Check that the security group allows outbound HTTPS (443).
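
One way to check the endpoint half of this fix is to compare the endpoints that exist in the VPC against the required set. The sketch below assumes the caller has already collected the `ServiceName` values of the VPC's endpoints (for example from `aws ec2 describe-vpc-endpoints`); the function name and signature are hypothetical.

```typescript
// Endpoint service suffixes named in the text above; the full service name
// follows the pattern com.amazonaws.<region>.<suffix>.
const REQUIRED_SUFFIXES: string[] = ["ecr.api", "ecr.dkr", "s3", "logs"];

// Returns the full service names that are required but not present.
// Set usesEcsExec to true to also require the ssmmessages endpoint.
function missingEndpoints(
  region: string,
  existingServiceNames: string[],
  usesEcsExec: boolean = false,
): string[] {
  const suffixes = usesEcsExec ? REQUIRED_SUFFIXES.concat(["ssmmessages"]) : REQUIRED_SUFFIXES;
  return suffixes
    .map((s) => `com.amazonaws.${region}.${s}`)
    .filter((name) => existingServiceNames.indexOf(name) === -1);
}
```

An empty result means the endpoint set is complete for image pulls and logging; route tables, endpoint security groups, and IAM still need to be verified separately.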

Task failed ELB health checks

Cause: Health check path returns non-200, container not listening on the configured port, or health check grace period too short.

Fix: Verify the container responds on the health check path and port. Set `healthCheckGracePeriodSeconds` to at least 60s (longer for JVM apps). Ensure the security group allows traffic from the ALB security group on the container port.

OutOfMemoryError / exit code 137

Cause: Container exceeded its memory hard limit (SIGKILL). On Fargate, task-level memory is the hard limit.

Fix: Increase task-level memory. For JVM apps, use `-XX:MaxRAMPercentage=75` instead of a fixed `-Xmx`; this automatically adapts to the container's memory allocation. Check container-level `memory` (hard limit) vs `memoryReservation` (soft limit).
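
To make the `-XX:MaxRAMPercentage=75` advice concrete, the arithmetic can be sketched as below. It is an approximation that treats the container memory limit as the RAM the JVM sees; the function name is hypothetical.

```typescript
// Approximate maximum heap (MiB) the JVM derives from the container memory
// limit when -XX:MaxRAMPercentage is set. The remainder is left for non-heap
// memory such as metaspace and thread stacks.
function approxMaxHeapMiB(containerMemoryMiB: number, maxRamPercentage: number = 75): number {
  return Math.floor((containerMemoryMiB * maxRamPercentage) / 100);
}
```

With a 1024 MiB limit this gives about 768 MiB of heap; raising the limit to 2048 MiB raises it to about 1536 MiB with no change to the JVM flags, which is the advantage over a fixed `-Xmx`.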

AccessDeniedException on AWS API calls from container

Cause: Permissions are on the execution role instead of the task role, or the task role is missing.

Fix: Verify the task definition has `taskRoleArn` set (not just `executionRoleArn`). Add the required permissions to the task role.

Service stuck deploying / tasks keep restarting

Cause: Deployment circuit breaker not enabled, or health check failing on new tasks.

Fix: Enable circuit breaker with rollback. Check service events: `aws ecs describe-services --cluster $CLUSTER --services $SERVICE --output json`. Check stopped task reasons: `aws ecs describe-tasks --cluster $CLUSTER --tasks $TASK_ID --output json`.

ECS Exec TargetNotConnectedException

Cause: SSM agent not running, missing task role permissions, or missing VPC endpoint.

Fix: Verify `enableExecuteCommand` is true on the service. Check the task role has SSM permissions. For private subnets, create the `ssmmessages` VPC endpoint. Verify with `aws ecs describe-tasks` that `ExecuteCommandAgent` status is `RUNNING`.
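
The last verification step can be sketched as a check over the parsed `aws ecs describe-tasks` JSON. The interfaces below model only the fields this check reads (`containers[].managedAgents[]` with `name` and `lastStatus`); the function name is hypothetical.

```typescript
// Minimal shape of the relevant part of one task from `aws ecs describe-tasks`.
interface ManagedAgent { name?: string; lastStatus?: string }
interface TaskContainer { name?: string; managedAgents?: ManagedAgent[] }
interface TaskSummary { containers?: TaskContainer[] }

// Returns the names of containers whose ExecuteCommandAgent is absent or not RUNNING.
function containersNotReadyForExec(task: TaskSummary): string[] {
  const notReady: string[] = [];
  for (const c of task.containers || []) {
    const agent = (c.managedAgents || []).filter((a) => a.name === "ExecuteCommandAgent")[0];
    if (!agent || agent.lastStatus !== "RUNNING") {
      notReady.push(c.name || "(unnamed)");
    }
  }
  return notReady;
}
```

An empty result suggests the agent side is healthy, so the remaining suspects are task role permissions and network reachability of `ssmmessages`.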

Error retry classification


| Retry | Do NOT retry |
|---|---|
| ThrottlingException | InvalidParameterException |
| ServiceUnavailableException | ClientException |
| ServerException | AccessDeniedException |
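
A minimal sketch of how this classification might drive a retry loop follows. It assumes errors expose the exception name in a `name` property; the backoff base, cap, and attempt count are illustrative values, and all function names are hypothetical.

```typescript
// Exception names from the "Retry" column above.
const RETRYABLE_ERRORS: string[] = [
  "ThrottlingException",
  "ServiceUnavailableException",
  "ServerException",
];

function isRetryable(errorName: string): boolean {
  return RETRYABLE_ERRORS.indexOf(errorName) !== -1;
}

// Exponential backoff delay in milliseconds, capped at capMs (illustrative defaults).
function backoffMs(attempt: number, baseMs: number = 200, capMs: number = 10000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// Runs op, retrying only errors classified as retryable, up to maxAttempts tries.
async function withRetry<T>(op: () => Promise<T>, maxAttempts: number = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      const name = (err as { name?: string }).name || "";
      if (!isRetryable(name) || attempt + 1 >= maxAttempts) throw err;
      await new Promise<void>((resolve) => setTimeout(resolve, backoffMs(attempt)));
    }
  }
}
```

Errors from the "Do NOT retry" column are rethrown immediately, since repeating the call cannot fix a bad parameter or a missing permission.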

Security Considerations


  • You MUST use IAM roles (execution role + task role). Never embed credentials in container images or environment variables
  • You MUST use Secrets Manager or SSM Parameter Store for sensitive configuration, injected via the `secrets` field in the task definition
  • You SHOULD enable ECR image scanning on push for vulnerability detection
  • You SHOULD use private subnets with NAT gateway or VPC endpoints for production workloads
  • You MUST enable CloudTrail for ECS API audit logging
  • You SHOULD configure CloudWatch Container Insights for monitoring
  • You SHOULD use `readonlyRootFilesystem: true` in container definitions where possible (note: incompatible with ECS Exec)
  • You MUST scope task role permissions to specific resources; avoid `*` wildcards and `*FullAccess` policies
  • You MUST confirm with the user before executing destructive operations: `--force-new-deployment` (replaces all running tasks), `delete-service`, `deregister-task-definition`. ECS does not support `--dry-run`, so use the plan-validate-execute pattern: explain what will happen, get confirmation, then execute
  • You SHOULD use ACM certificates with HTTPS listeners on ALBs fronting ECS services, per ECS network security best practices: "provision certificates for the load balancer using AWS Certificate Manager (ACM)"
  • You SHOULD avoid logging sensitive data (secrets, PII, tokens) in container stdout/stderr, because these flow to CloudWatch Logs via the awslogs driver. If sensitive data may appear in logs, enable CloudWatch Logs encryption with a KMS key
  • You SHOULD attach an AWS WAF WebACL to internet-facing ALBs for defense in depth against common web exploits
  • You SHOULD include `aws:SourceArn` and `aws:SourceAccount` condition keys in ECR repository policies for cross-account access to prevent confused deputy attacks
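
For the `secrets` field mentioned in the list above, the following sketch builds one entry in the `name`/`valueFrom` shape used by container definitions. The helper name is hypothetical, and the ARN used in the usage note is a made-up placeholder; the `:<json-key>::` suffix follows the Secrets Manager reference format described in the Gotchas section.

```typescript
// One entry of a container definition's `secrets` array.
interface SecretRef { name: string; valueFrom: string }

// Builds a secrets entry. When jsonKey is given, appends ":<json-key>::" so the
// version-stage and version-id fields are left empty; otherwise the whole
// secret value is referenced by its ARN.
function secretRef(envName: string, secretArn: string, jsonKey?: string): SecretRef {
  const valueFrom = jsonKey ? `${secretArn}:${jsonKey}::` : secretArn;
  return { name: envName, valueFrom };
}
```

For instance, `secretRef("DB_PASSWORD", "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/db-AbCdEf", "password")` yields a `valueFrom` ending in `:password::`, and the execution role (not the task role) needs permission to read that secret.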

Additional Resources
