aws-containers

AWS Containers

Service Overview

| Developer Need | Recommend | Key CLI / CDK |
|---|---|---|
| Simplest container deploy (HTTP app/API, new customers) | ECS Express Mode | `aws ecs create-express-gateway-service` |
| Web app, worker, batch, scheduled task | ECS on Fargate | `aws ecs create-service` / CDK `ecsPatterns.ApplicationLoadBalancedFargateService` |
| GPU workloads or >16 vCPU | ECS on EC2 | CDK `ecs.Ec2Service` |
| Store container images | ECR | `aws ecr create-repository` |
| Web app behind a load balancer | ECS Fargate + ALB | CDK `ecsPatterns.ApplicationLoadBalancedFargateService` |
| SQS worker scaling on queue depth | ECS Fargate + SQS | CDK `ecsPatterns.QueueProcessingFargateService` |
| Cron job / scheduled task | ECS Fargate + EventBridge | CDK `ecsPatterns.ScheduledFargateTask` |
| Service mesh / service-to-service | ECS Service Connect | Configure on ECS service with Cloud Map namespace |
| Debug a running container | ECS Exec | `aws ecs execute-command --interactive --command "/bin/sh"` |

When a developer says "deploy my container" without naming a service: recommend ECS Express Mode for simple HTTP apps (replaces App Runner for new customers). Recommend ECS Fargate for everything else. Never recommend EKS unless they explicitly ask for Kubernetes.
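The default-recommendation rule above can be written down as a small decision function. This is only a sketch: the `DeployRequest` shape and the `recommendService` name are hypothetical and not part of any AWS API. The branches mirror the rule and the Service Overview table.

```typescript
// Hypothetical request shape, used only for this illustration.
interface DeployRequest {
  simpleHttpApp: boolean;        // plain HTTP app or API with no special infrastructure needs
  needsGpu?: boolean;            // GPU workloads go to EC2-backed ECS
  vcpu?: number;                 // requested vCPU per task
  askedForKubernetes?: boolean;  // the user explicitly asked for Kubernetes
}

type Recommendation = 'EKS' | 'ECS on EC2' | 'ECS Express Mode' | 'ECS on Fargate';

function recommendService(req: DeployRequest): Recommendation {
  // EKS is only ever returned on an explicit Kubernetes request.
  if (req.askedForKubernetes) return 'EKS';
  // GPU or more than 16 vCPU exceeds what the table assigns to Fargate.
  if (req.needsGpu || (req.vcpu ?? 0) > 16) return 'ECS on EC2';
  // Simple HTTP apps take the simplest path.
  if (req.simpleHttpApp) return 'ECS Express Mode';
  // Everything else defaults to Fargate.
  return 'ECS on Fargate';
}

console.log(recommendService({ simpleHttpApp: true }));             // ECS Express Mode
console.log(recommendService({ simpleHttpApp: false, vcpu: 32 }));  // ECS on EC2
```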

Overview

Provides expertise for building, deploying, and operating containerized workloads using Amazon ECS, AWS Fargate, Amazon ECR, and AWS App Runner.

Recommended setup: Install the AWS MCP server for sandboxed execution, audit logging, and enterprise controls. See: aws.amazon.com/mcp

Without AWS MCP: This skill works with any agent that has AWS CLI access. All commands use standard AWS CLI syntax.

When NOT to use this skill:

- Kubernetes or EKS workloads → use the kubernetes skill
- CI/CD pipeline setup for container deployments → use the deploy skill
- VPC subnet design and security group architecture → use the networking skill
- Running code without containers (Lambda, Step Functions) → use the serverless skill

Before executing any commands:

- You MUST verify AWS CLI v2 is installed and configured before running commands
- You MUST inform the user if required tools (AWS CLI, Docker, Session Manager plugin) are missing
- You MUST respect the user's decision to abort at any point

Gotchas

Apply these every time. Each corrects a mistake agents make without explicit instruction.

1. Fargate CPU/memory must be valid combinations. Arbitrary values cause `Invalid 'cpu' setting for task`:
   - 256 (0.25 vCPU): 512 MiB, 1 GB, 2 GB
   - 512 (0.5 vCPU): 1–4 GB (1 GB increments)
   - 1024 (1 vCPU): 2–8 GB (1 GB increments)
   - 2048 (2 vCPU): 4–16 GB (1 GB increments)
   - 4096 (4 vCPU): 8–30 GB (1 GB increments)
   - 8192 (8 vCPU): 16–60 GB (4 GB increments)
   - 16384 (16 vCPU): 32–120 GB (8 GB increments)

   If the user requests an invalid combination, tell them and recommend the nearest valid option. You MUST NOT silently produce an invalid task definition.
2. Fargate requires `awsvpc` networking mode, no exceptions. Agents frequently suggest `bridge` or `host` mode for Fargate tasks, which causes immediate registration failure. You MUST set `networkMode` to `awsvpc` for all Fargate task definitions. On EC2, `awsvpc` is recommended; `bridge` is legacy only.
3. Execution role vs task role: never confuse them. `executionRoleArn`: ECS agent uses it to pull images, fetch secrets, write logs. `taskRoleArn`: application code uses it to call AWS APIs. ECS Exec permissions (`ssmmessages:*`) go on the task role. ECR pull permissions go on the execution role. `ecr:GetAuthorizationToken` MUST use `Resource: "*"` (registry-level action).
4. Secrets are injected at task launch only; there is no hot-reload. Changed secrets require `aws ecs update-service --force-new-deployment`. To reference a specific JSON key in Secrets Manager: `arn:aws:secretsmanager:region:account:secret:name-hash:json-key::`. The trailing colons are required (they represent empty version-stage and version-id fields). You can also use SSM Parameter Store with `valueFrom` pointing to the parameter ARN; the execution role needs `ssm:GetParameters` permission.
5. ALB deregistration delay defaults to 300s; reduce to 30–60s. This is the #1 cause of slow deployments. Set it on the target group. It SHOULD exceed your longest request duration.
6. Set `healthCheckGracePeriodSeconds` on every ECS service behind an ALB. Without it, the ALB marks tasks unhealthy before they're ready, the circuit breaker counts failures, and the deployment rolls back. JVM/Spring Boot apps need 60–120s.
7. Always enable deployment circuit breaker with rollback. Without it, bad deployments stay "in progress" for 30+ minutes. In CDK: `circuitBreaker: { rollback: true }` (specifying the property implicitly enables it; `enable` defaults to `true`).
8. Private subnet Fargate tasks need NAT or all four VPC endpoints. Required endpoints: `ecr.dkr` (interface), `ecr.api` (interface), `s3` (gateway, because ECR stores layers in S3), `logs` (interface, for CloudWatch). The S3 gateway endpoint is the most commonly missed. For ECS Exec, also add `ssmmessages`.
9. ECR lifecycle policies evaluate within 24 hours, not immediately. Multi-architecture images referenced by a manifest list cannot be expired until the manifest list is deleted first. Preview before applying: first `aws ecr start-lifecycle-policy-preview --repository-name $REPO`, then `aws ecr get-lifecycle-policy-preview --repository-name $REPO --output json` to see which images would be affected.
10. ECS Exec requires task role permissions, NOT execution role. The task role needs `ssmmessages:CreateControlChannel`, `CreateDataChannel`, `OpenControlChannel`, `OpenDataChannel`. Tasks launched before enabling `enableExecuteCommand` do NOT support ECS Exec; force a new deployment. The container image must include the binary specified in `--command` (e.g., `/bin/sh` for interactive sessions). For command logging to S3 or CloudWatch Logs, `script` and `cat` must also be installed. Fargate platform version MUST be 1.4.0+.
11. `awslogs` log driver mode: check your account's default. Per ECS docs, the ECS service defaults to `non-blocking` mode, which drops logs when the buffer fills. The `defaultLogDriverMode` account setting can override this per account. For guaranteed log delivery (audit/compliance), explicitly set `"mode": "blocking"` in `logConfiguration.options`. Check your effective default: `aws ecs list-account-settings --name defaultLogDriverMode --effective-settings --output json`.
12. App Runner VPC connector routes ALL application-initiated outbound traffic through the VPC. (App Runner is sunset; new customers should use ECS Express Mode instead.) Without a NAT gateway, external API calls and AWS service calls from your application code break. App Runner's own managed traffic (pulling images, pushing logs, retrieving secrets) is NOT routed through the VPC and is unaffected. Implement retry logic with backoff for database connections at startup.
13. For `desiredCount=1` zero-downtime deploys: `minimumHealthyPercent=100, maximumPercent=200`. This requires capacity for 2 tasks during deployment. You MUST NOT set `minimumHealthyPercent=0` if zero downtime is required.
14. 502 Bad Gateway from ALB: check in this order: (a) Container not listening on the port in the target group. (b) Container crashing before responding. (c) Task security group doesn't allow inbound from ALB security group on the container port. (d) Health check path returns non-200. (e) Health check timeout exceeds response time.
15. Fargate platform version: always use `LATEST` or `1.4.0`. Version 1.3.0 is being retired June 15, 2026 and terminated June 30, 2026.
16. SQS worker scaling: use a custom backlog-per-task metric. Raw `ApproximateNumberOfMessagesVisible` with target tracking doesn't work because adding tasks doesn't reduce queue depth proportionally. Use custom metric (`ApproximateNumberOfMessagesVisible / RunningTaskCount`) with target tracking, or use step scaling. CDK `QueueProcessingFargateService` handles this automatically via `scalingSteps`. Workers MUST handle SIGTERM gracefully within `stopTimeout` (default 30s, max 120s on Fargate).
17. Blue/green deployments: use native ECS blue/green (July 2025+) for new services. Supports all-at-once, canary, and linear traffic shifting (canary/linear added October 2025), plus Service Connect, headless services, EBS volumes, and lifecycle hooks. CodeDeploy blue/green is now legacy; native ECS blue/green has full feature parity.
18. Container dependency `HEALTHY` condition requires a health check on the dependency container. Without a configured health check, the dependent container never starts, because ECS does not progress it to its next state. If `startTimeout` is set (max 120s), the dependency times out and the task fails; if not set, the dependent container blocks indefinitely. For init containers, use `SUCCESS` condition instead.
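The backlog-per-task arithmetic from the SQS worker scaling item can be sketched as below. The function names and the `targetBacklogPerTask` parameter are hypothetical. In practice you would read `ApproximateNumberOfMessagesVisible` from SQS and the running task count from ECS, publish the ratio as a custom CloudWatch metric, and let target tracking act on it; this sketch only shows the numbers involved.

```typescript
// Backlog per task = visible messages / running tasks (the custom metric named above).
function backlogPerTask(visibleMessages: number, runningTasks: number): number {
  // Treating zero running tasks as one keeps the metric finite.
  // This is an assumption of the sketch, not documented ECS behavior.
  return visibleMessages / Math.max(runningTasks, 1);
}

// Task count that would bring backlog per task down to the target, clamped to limits.
function desiredTaskCount(
  visibleMessages: number,
  targetBacklogPerTask: number,
  minTasks: number,
  maxTasks: number,
): number {
  if (targetBacklogPerTask <= 0) {
    throw new RangeError('targetBacklogPerTask must be positive');
  }
  const needed = Math.ceil(visibleMessages / targetBacklogPerTask);
  return Math.min(maxTasks, Math.max(minTasks, needed));
}

console.log(backlogPerTask(1000, 4));            // 250
console.log(desiredTaskCount(1000, 100, 1, 8));  // 8 (10 needed, capped at maxTasks)
```

A target such as 100 messages per task is usually derived from the latency you can tolerate divided by the average processing time per message; treat the value here as a placeholder.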

Quick-Start: CDK Fargate Web App

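```typescript
import * as cdk from 'aws-cdk-lib';
import * as ecr from 'aws-cdk-lib/aws-ecr';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecsPatterns from 'aws-cdk-lib/aws-ecs-patterns';
import * as secretsmanager from 'aws-cdk-lib/aws-secretsmanager';
import { Construct } from 'constructs';

export class WebAppStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Placeholder lookups: substitute your own repository and secret names.
    const repo = ecr.Repository.fromRepositoryName(this, 'Repo', 'my-app');
    const dbSecret = secretsmanager.Secret.fromSecretNameV2(this, 'DbSecret', 'my-app/db-password');

    const service = new ecsPatterns.ApplicationLoadBalancedFargateService(this, 'WebApp', {
      taskImageOptions: {
        image: ecs.ContainerImage.fromEcrRepository(repo, 'latest'),
        containerPort: 8080,
        // Injected at task launch through the execution role.
        secrets: { DB_PASSWORD: ecs.Secret.fromSecretsManager(dbSecret) },
      },
      cpu: 512,              // must pair with a valid Fargate memory value
      memoryLimitMiB: 1024,
      desiredCount: 2,
      publicLoadBalancer: true,
      circuitBreaker: { rollback: true },
      minHealthyPercent: 100,
    });

    // Shorten the 300s default deregistration delay to speed up deployments.
    service.targetGroup.setAttribute('deregistration_delay.timeout_seconds', '30');

    // Target-tracking scaling on average CPU utilization.
    const scaling = service.service.autoScaleTaskCount({ minCapacity: 2, maxCapacity: 10 });
    scaling.scaleOnCpuUtilization('CpuScaling', { targetUtilizationPercent: 70 });
  }
}
```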

CDK L3 patterns auto-create VPC, cluster, ALB, target group, and security groups. For production, create these separately and pass them in.

`ApplicationLoadBalancedFargateService` defaults to `assignPublicIp: false`. Tasks in public subnets need `assignPublicIp: true` for internet access, or use private subnets with NAT.
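The quick-start pairs `cpu: 512` with `memoryLimitMiB: 1024`, which is one of the valid Fargate combinations listed under Gotchas. A minimal sketch for checking a pair before writing it into a task definition follows. The table is transcribed from that list (in MiB), and the names `FARGATE_MEMORY_RULES`, `isValidFargateSize`, and `nearestValidMemory` are hypothetical helpers, not CDK or AWS APIs.

```typescript
// Valid Fargate memory ranges (MiB) per CPU value, transcribed from the Gotchas list.
interface MemoryRule {
  minMiB: number;
  maxMiB: number;
  stepMiB: number;
  extra?: number[]; // values outside the regular stepped range (512 MiB for 256 CPU)
}

const FARGATE_MEMORY_RULES: Record<number, MemoryRule> = {
  256: { minMiB: 1024, maxMiB: 2048, stepMiB: 1024, extra: [512] },
  512: { minMiB: 1024, maxMiB: 4096, stepMiB: 1024 },
  1024: { minMiB: 2048, maxMiB: 8192, stepMiB: 1024 },
  2048: { minMiB: 4096, maxMiB: 16384, stepMiB: 1024 },
  4096: { minMiB: 8192, maxMiB: 30720, stepMiB: 1024 },
  8192: { minMiB: 16384, maxMiB: 61440, stepMiB: 4096 },
  16384: { minMiB: 32768, maxMiB: 122880, stepMiB: 8192 },
};

// All memory values (MiB) allowed for a CPU value; empty if the CPU value is invalid.
function validMemoryOptions(cpu: number): number[] {
  const rule: MemoryRule | undefined = FARGATE_MEMORY_RULES[cpu];
  if (!rule) return [];
  const options: number[] = (rule.extra ?? []).slice();
  for (let m = rule.minMiB; m <= rule.maxMiB; m += rule.stepMiB) options.push(m);
  return options;
}

function isValidFargateSize(cpu: number, memoryMiB: number): boolean {
  return validMemoryOptions(cpu).indexOf(memoryMiB) !== -1;
}

// Nearest valid memory for the given CPU, or undefined when the CPU value is invalid.
function nearestValidMemory(cpu: number, memoryMiB: number): number | undefined {
  const options = validMemoryOptions(cpu);
  if (options.length === 0) return undefined;
  return options.reduce((best, m) =>
    Math.abs(m - memoryMiB) < Math.abs(best - memoryMiB) ? m : best,
  );
}

console.log(isValidFargateSize(512, 1024));  // true
console.log(isValidFargateSize(256, 4096));  // false
console.log(nearestValidMemory(256, 4096));  // 2048
```

When a requested pair is invalid, the nearest value is something to suggest to the user, not something to apply silently.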

Quick-Start: ECS Exec

```bash
# 1. Enable on the service (existing tasks won't support it; force a new deployment)
aws ecs update-service --cluster $CLUSTER --service $SERVICE \
  --enable-execute-command --force-new-deployment --output json

# 2. Connect (task role must have ssmmessages:* permissions)
aws ecs execute-command --cluster $CLUSTER --task $TASK_ID \
  --container $CONTAINER --interactive --command "/bin/sh"
```

If `TargetNotConnectedException`: wait 30–60s for SSM agent startup, check NAT/VPC endpoint for `ssmmessages`, verify task role (not execution role) has permissions.

Common Workflows

Use the best available tool for AWS operations (MCP server, AWS CLI, or SDK). The commands below show the AWS CLI form.

Read reference files only when the conversation requires deeper detail.

- Read references/task-definition-authoring.md if the user needs to author a task definition, configure CPU/memory, set up networking modes, inject secrets, mount volumes, or configure container dependencies.
- Read references/fargate-service-deployment.md if the user needs to deploy a Fargate service behind an ALB, configure health checks, tune deregistration delay, set up path-based routing, or handle private subnet networking.
- Read references/ecr-repository-management.md if the user needs ECR lifecycle policies, image scanning, cross-account image pulls, or is debugging image pull errors.
- Read references/ecs-exec-debugging.md if the user needs to set up ECS Exec, debug TargetNotConnectedException, configure session logging, or validate ECS Exec prerequisites.
- Read references/service-scaling-and-updates.md if the user needs auto-scaling, deployment strategies (rolling, blue/green), circuit breaker configuration, or Service Connect setup.
- Read references/app-runner-guide.md if the user has an existing App Runner service, needs to troubleshoot App Runner connectivity, or wants to migrate from App Runner to ECS Express Mode.
- Read references/ecs-infrastructure-patterns.md if the user needs CDK or CloudFormation examples for Fargate services, SQS workers, scheduled tasks, EFS volumes, ECS Exec, path-based routing, private subnets, or FireLens.
- Read references/ecs-logging-and-firelens.md if the user needs awslogs configuration, FireLens/Fluent Bit setup, multiline log handling, or guaranteed log delivery.
- Read references/ecs-troubleshooting-guide.md if the user is debugging task placement failures, OOM kills (exit code 137), health check failures, image pull errors, or networking issues in private subnets.
- Read references/fargate-spot.md if the user asks about Fargate Spot pricing, capacity provider strategies, or interruption handling.

Decision Guide: ECS Express Mode vs ECS Fargate

App Runner: Sunset April 30, 2026 — no new customers, no new features. Existing customers should migrate to ECS Express Mode. See App Runner Availability Change.

| Factor | ECS Express Mode | ECS Fargate |
|---|---|---|
| Setup complexity | Minimal (single API call) | Moderate: task def, service, cluster, ALB |
| Networking control | Managed (ALB in default VPC) | Full: awsvpc, security groups, subnets |
| Scaling | Auto (CPU-based) | Configurable target/step scaling |
| Use when | New simple HTTP app/API, zero infra management | Production services needing VPC, ALB, fine-grained IAM |
| Limitations | New service, evolving feature set | Most setup required |

Default recommendation: Use ECS Fargate for production workloads. Use ECS Express Mode for the simplest path (new customers).

Troubleshooting

CannotPullContainerError

Cause: Task cannot reach ECR. In private subnets, tasks need a NAT gateway or VPC endpoints (`ecr.api`, `ecr.dkr`, `s3` gateway, `logs`).

Fix: Verify the route table has a route to a NAT gateway, or create the required VPC endpoints. Verify the execution role has `ecr:GetDownloadUrlForLayer`, `ecr:BatchGetImage`, and `ecr:GetAuthorizationToken` (`Resource: "*"`). Check that the security group allows outbound HTTPS (443).

Task failed ELB health checks

Cause: Health check path returns non-200, container not listening on the configured port, or health check grace period too short.

Fix: Verify the container responds on the health check path and port. Set `healthCheckGracePeriodSeconds` to at least 60s (longer for JVM apps). Ensure the security group allows traffic from the ALB security group on the container port.

OutOfMemoryError / exit code 137

Cause: Container exceeded its memory hard limit (SIGKILL). On Fargate, task-level memory is the hard limit.

Fix: Increase task-level memory. For JVM apps, use `-XX:MaxRAMPercentage=75` instead of a fixed `-Xmx`; this automatically adapts to the container's memory allocation. Check container-level `memory` (hard limit) vs `memoryReservation` (soft limit).

AccessDeniedException on AWS API calls from container

Cause: Permissions are on the execution role instead of the task role, or the task role is missing.

Fix: Verify the task definition has `taskRoleArn` set (not just `executionRoleArn`). Add the required permissions to the task role.

Service stuck deploying / tasks keep restarting

Cause: Deployment circuit breaker not enabled, or health check failing on new tasks.

Fix: Enable circuit breaker with rollback. Check service events: `aws ecs describe-services --cluster $CLUSTER --services $SERVICE --output json`. Check stopped task reasons: `aws ecs describe-tasks --cluster $CLUSTER --tasks $TASK_ID --output json`.

ECS Exec TargetNotConnectedException

Cause: SSM agent not running, missing task role permissions, or missing VPC endpoint.

Fix: Verify `enableExecuteCommand` is true on the service. Check the task role has SSM permissions. For private subnets, create the `ssmmessages` VPC endpoint. Verify with `aws ecs describe-tasks` that `ExecuteCommandAgent` status is `RUNNING`.
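For the task role permission check, a minimal policy document granting the four `ssmmessages` actions listed under Gotchas might look like the TypeScript constant below. The constant names are hypothetical, and `Resource: '*'` is an assumption of this sketch that you may want to review against your own IAM standards.

```typescript
// Inline policy for the TASK role (not the execution role) so ECS Exec can open
// its SSM control and data channels.
const ecsExecTaskRolePolicy = {
  Version: '2012-10-17',
  Statement: [
    {
      Effect: 'Allow',
      Action: [
        'ssmmessages:CreateControlChannel',
        'ssmmessages:CreateDataChannel',
        'ssmmessages:OpenControlChannel',
        'ssmmessages:OpenDataChannel',
      ],
      Resource: '*', // assumption of this sketch
    },
  ],
};

// Serialized form, suitable for saving to a file and attaching as an inline policy.
const ecsExecPolicyJson: string = JSON.stringify(ecsExecTaskRolePolicy, null, 2);
console.log(ecsExecPolicyJson);
```

Attach the document to the role referenced by `taskRoleArn`; putting it on the execution role leaves ECS Exec failing with the same exception.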

Error retry classification

| Retry | Do NOT retry |
|---|---|
| ThrottlingException | InvalidParameterException |
| ServiceUnavailableException | ClientException |
| ServerException | AccessDeniedException |
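The classification above could drive a small retry wrapper like the following sketch. It assumes the failure surfaces as an `Error` whose `name` is the exception name (as AWS SDK for JavaScript v3 errors do); the helper names, attempt count, and backoff constants are hypothetical.

```typescript
// Error names from the "Retry" column; everything else fails immediately.
const RETRYABLE_ERRORS: string[] = [
  'ThrottlingException',
  'ServiceUnavailableException',
  'ServerException',
];

function isRetryable(errorName: string): boolean {
  return RETRYABLE_ERRORS.indexOf(errorName) !== -1;
}

// Exponential backoff: baseMs * 2^attempt, capped at capMs. Values are assumptions.
function backoffDelayMs(attempt: number, baseMs: number = 200, capMs: number = 10000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// Runs `operation`, retrying only errors classified as retryable.
async function withRetry<T>(operation: () => Promise<T>, maxAttempts: number = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      const name = err instanceof Error ? err.name : '';
      // Give up on non-retryable errors or once the attempt budget is spent.
      if (!isRetryable(name) || attempt + 1 >= maxAttempts) throw err;
      const delay = backoffDelayMs(attempt);
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

A call such as `withRetry(() => client.send(command))` would retry throttling and server-side failures while letting `AccessDeniedException` and parameter errors surface on the first attempt; `client` and `command` here stand in for whatever SDK objects you already have.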

Security Considerations

- You MUST use IAM roles (execution role + task role); never embed credentials in container images or environment variables
- You MUST use Secrets Manager or SSM Parameter Store for sensitive configuration, injected via the `secrets` field in the task definition
- You SHOULD enable ECR image scanning on push for vulnerability detection
- You SHOULD use private subnets with NAT gateway or VPC endpoints for production workloads
- You MUST enable CloudTrail for ECS API audit logging
- You SHOULD configure CloudWatch Container Insights for monitoring
- You SHOULD use `readonlyRootFilesystem: true` in container definitions where possible (note: incompatible with ECS Exec)
- You MUST scope task role permissions to specific resources; avoid `*` wildcards and `*FullAccess` policies
- You MUST confirm with the user before executing destructive operations: `--force-new-deployment` (replaces all running tasks), `delete-service`, `deregister-task-definition`. ECS does not support `--dry-run`, so use the plan-validate-execute pattern: explain what will happen, get confirmation, then execute
- You SHOULD use ACM certificates with HTTPS listeners on ALBs fronting ECS services, per ECS network security best practices: "provision certificates for the load balancer using AWS Certificate Manager (ACM)"
- You SHOULD avoid logging sensitive data (secrets, PII, tokens) in container stdout/stderr, because these flow to CloudWatch Logs via the awslogs driver. If sensitive data may appear in logs, enable CloudWatch Logs encryption with a KMS key
- You SHOULD attach an AWS WAF WebACL to internet-facing ALBs for defense in depth against common web exploits
- You SHOULD include `aws:SourceArn` and `aws:SourceAccount` condition keys in ECR repository policies for cross-account access to prevent confused deputy attacks
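For the `secrets` rule, the `valueFrom` string for a single JSON key follows the format given under Gotchas: the secret ARN, then the key, then two trailing colons. A small builder could look like the sketch below; the function and type names are hypothetical and the example ARN is made up for illustration.

```typescript
// Shape of one entry in a container definition's `secrets` array.
interface ContainerSecret {
  name: string;      // environment variable exposed to the container
  valueFrom: string; // Secrets Manager or SSM Parameter Store reference
}

// Builds the reference for one JSON key of a Secrets Manager secret.
// The two trailing colons stand for the empty version-stage and version-id fields.
function secretJsonKeyRef(envName: string, secretArn: string, jsonKey: string): ContainerSecret {
  if (secretArn.indexOf(':secretsmanager:') === -1) {
    throw new Error('expected a Secrets Manager ARN');
  }
  return { name: envName, valueFrom: `${secretArn}:${jsonKey}::` };
}

// Made-up ARN, for illustration only.
const dbPassword = secretJsonKeyRef(
  'DB_PASSWORD',
  'arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/db-AbCdEf',
  'password',
);
console.log(dbPassword.valueFrom);
// arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/db-AbCdEf:password::
```

Keeping the value out of `environment` and in `secrets` means it is resolved by the execution role at task launch and never appears in the task definition itself.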
