aws-containers
AWS Containers
Service Overview
| Developer Need | Recommend | Key CLI / CDK |
|---|---|---|
| Simplest container deploy (HTTP app/API, new customers) | ECS Express Mode | |
| Web app, worker, batch, scheduled task | ECS on Fargate | |
| GPU workloads or >16 vCPU | ECS on EC2 | CDK |
| Store container images | ECR | |
| Web app behind a load balancer | ECS Fargate + ALB | CDK |
| SQS worker scaling on queue depth | ECS Fargate + SQS | CDK |
| Cron job / scheduled task | ECS Fargate + EventBridge | CDK |
| Service mesh / service-to-service | ECS Service Connect | Configure on ECS service with Cloud Map namespace |
| Debug a running container | ECS Exec | |
When a developer says "deploy my container" without naming a service: recommend ECS Express Mode for simple HTTP apps (replaces App Runner for new customers). Recommend ECS Fargate for everything else. Never recommend EKS unless they explicitly ask for Kubernetes.
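The recommendation rules above can be sketched as a small decision function. This is only an illustration: the `DeployRequest` shape and its field names are hypothetical and not part of any AWS API.

```typescript
// Hypothetical request shape; field names are illustrative only.
interface DeployRequest {
  wantsKubernetes: boolean; // user explicitly asked for Kubernetes or EKS
  simpleHttpApp: boolean;   // plain HTTP app or API with no special infra needs
  needsGpu: boolean;        // GPU workload
  vcpu: number;             // requested vCPU per task
}

type Recommendation = 'EKS' | 'ECS on EC2' | 'ECS Express Mode' | 'ECS on Fargate';

function recommendService(req: DeployRequest): Recommendation {
  // EKS only when Kubernetes is explicitly requested.
  if (req.wantsKubernetes) return 'EKS';
  // GPU workloads or more than 16 vCPU go to ECS on EC2.
  if (req.needsGpu || req.vcpu > 16) return 'ECS on EC2';
  // Simple HTTP apps take the simplest path.
  if (req.simpleHttpApp) return 'ECS Express Mode';
  // Everything else: ECS on Fargate.
  return 'ECS on Fargate';
}

console.log(recommendService({ wantsKubernetes: false, simpleHttpApp: true, needsGpu: false, vcpu: 1 }));
```

The checks run from most to least specific, so an explicit Kubernetes request or a hardware constraint wins over the "simple HTTP app" shortcut.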
Overview
Provides expertise for building, deploying, and operating containerized workloads using Amazon ECS, AWS Fargate, Amazon ECR, and AWS App Runner.
Recommended setup: Install the AWS MCP server for sandboxed execution, audit logging, and enterprise controls. See: aws.amazon.com/mcp
Without AWS MCP: This skill works with any agent that has AWS CLI access. All commands use standard AWS CLI syntax.
When NOT to use this skill:
- Kubernetes or EKS workloads → use the kubernetes skill
- CI/CD pipeline setup for container deployments → use the deploy skill
- VPC subnet design and security group architecture → use the networking skill
- Running code without containers (Lambda, Step Functions) → use the serverless skill
Before executing any commands:
- You MUST verify AWS CLI v2 is installed and configured before running commands
- You MUST inform the user if required tools (AWS CLI, Docker, Session Manager plugin) are missing
- You MUST respect the user's decision to abort at any point
Gotchas
Apply these every time. Each corrects a mistake agents make without explicit instruction.
- Fargate CPU/memory must be valid combinations. Arbitrary values cause: `Invalid 'cpu' setting for task`
  - 256 (0.25 vCPU): 512 MiB, 1 GB, 2 GB
  - 512 (0.5 vCPU): 1–4 GB (1 GB increments)
  - 1024 (1 vCPU): 2–8 GB (1 GB increments)
  - 2048 (2 vCPU): 4–16 GB (1 GB increments)
  - 4096 (4 vCPU): 8–30 GB (1 GB increments)
  - 8192 (8 vCPU): 16–60 GB (4 GB increments)
  - 16384 (16 vCPU): 32–120 GB (8 GB increments)

  If the user requests an invalid combination, tell them and recommend the nearest valid option. You MUST NOT silently produce an invalid task definition.
- Fargate requires `awsvpc` networking mode, no exceptions. Agents frequently suggest `bridge` or `host` mode for Fargate tasks, which causes immediate registration failure. You MUST set `networkMode` to `awsvpc` for all Fargate task definitions. On EC2, `awsvpc` is recommended; `bridge` is legacy only.
- Execution role vs task role: never confuse them. `executionRoleArn`: ECS agent uses it to pull images, fetch secrets, write logs. `taskRoleArn`: application code uses it to call AWS APIs. ECS Exec permissions (`ssmmessages:*`) go on the task role. ECR pull permissions go on the execution role. `ecr:GetAuthorizationToken` MUST use `Resource: "*"` (registry-level action).
- Secrets are injected at task launch only; there is no hot-reload. Changed secrets require `aws ecs update-service --force-new-deployment`. To reference a specific JSON key in Secrets Manager: `arn:aws:secretsmanager:region:account:secret:name-hash:json-key::` (the trailing colons are required; they represent empty version-stage and version-id fields). You can also use SSM Parameter Store with `valueFrom` pointing to the parameter ARN; the execution role needs `ssm:GetParameters` permission.
- ALB deregistration delay defaults to 300s; reduce it to 30–60s. This is the #1 cause of slow deployments. Set it on the target group. It SHOULD exceed your longest request duration.
- Set `healthCheckGracePeriodSeconds` on every ECS service behind an ALB. Without it, the ALB marks tasks unhealthy before they're ready, the circuit breaker counts failures, and the deployment rolls back. JVM/Spring Boot apps need 60–120s.
- Always enable deployment circuit breaker with rollback. Without it, bad deployments stay "in progress" for 30+ minutes. In CDK: `circuitBreaker: { rollback: true }` (specifying the property implicitly enables it; `enable` defaults to `true`).
- Private subnet Fargate tasks need NAT or all four VPC endpoints. Required endpoints: `ecr.dkr` (interface), `ecr.api` (interface), `s3` (gateway, because ECR stores layers in S3), `logs` (interface, for CloudWatch). The S3 gateway endpoint is the most commonly missed. For ECS Exec, also add `ssmmessages`.
- ECR lifecycle policies evaluate within 24 hours, not immediately. Multi-architecture images referenced by a manifest list cannot be expired until the manifest list is deleted first. Preview before applying: `aws ecr start-lifecycle-policy-preview --repository-name $REPO` first, then `aws ecr get-lifecycle-policy-preview --repository-name $REPO --output json` to see which images would be affected.
- ECS Exec requires task role permissions, NOT execution role. The task role needs `ssmmessages:CreateControlChannel`, `CreateDataChannel`, `OpenControlChannel`, `OpenDataChannel`. Tasks launched before enabling `enableExecuteCommand` do NOT support ECS Exec; force a new deployment. The container image must include the binary specified in `--command` (e.g., `/bin/sh` for interactive sessions). For command logging to S3 or CloudWatch Logs, `script` and `cat` must also be installed. Fargate platform version MUST be 1.4.0+.
- `awslogs` log driver mode: check your account's default. Per ECS docs, the ECS service defaults to `non-blocking` mode, which drops logs when the buffer fills. The `defaultLogDriverMode` account setting can override this per account. For guaranteed log delivery (audit/compliance), explicitly set `"mode": "blocking"` in `logConfiguration.options`. Check your effective default: `aws ecs list-account-settings --name defaultLogDriverMode --effective-settings --output json`
- App Runner VPC connector routes ALL application-initiated outbound traffic through the VPC. (App Runner is sunset; new customers should use ECS Express Mode instead.) Without a NAT gateway, external API calls and AWS service calls from your application code break. App Runner's own managed traffic (pulling images, pushing logs, retrieving secrets) is NOT routed through the VPC and is unaffected. Implement retry logic with backoff for database connections at startup.
- For `desiredCount=1` zero-downtime deploys: `minimumHealthyPercent=100, maximumPercent=200`. This requires capacity for 2 tasks during deployment. You MUST NOT set `minimumHealthyPercent=0` if zero downtime is required.
- 502 Bad Gateway from ALB: check in this order: (a) Container not listening on the port in the target group. (b) Container crashing before responding. (c) Task security group doesn't allow inbound from ALB security group on the container port. (d) Health check path returns non-200. (e) Health check timeout exceeds response time.
- Fargate platform version: always use `LATEST` or `1.4.0`. Version 1.3.0 is being retired June 15, 2026 and terminated June 30, 2026.
- SQS worker scaling: use a custom backlog-per-task metric. Raw `ApproximateNumberOfMessagesVisible` with target tracking doesn't work because adding tasks doesn't reduce queue depth proportionally. Use custom metric (`ApproximateNumberOfMessagesVisible / RunningTaskCount`) with target tracking, or use step scaling. CDK `QueueProcessingFargateService` handles this automatically via `scalingSteps`. Workers MUST handle SIGTERM gracefully within `stopTimeout` (default 30s, max 120s on Fargate).
- Blue/green deployments: use native ECS blue/green (July 2025+) for new services. Supports all-at-once, canary, and linear traffic shifting (canary/linear added October 2025), plus Service Connect, headless services, EBS volumes, and lifecycle hooks. CodeDeploy blue/green is now legacy; native ECS blue/green has full feature parity.
- Container dependency `HEALTHY` condition requires a health check on the dependency container. Without a configured health check, the dependent container never starts, because ECS does not progress it to its next state. If `startTimeout` is set (max 120s), the dependency times out and the task fails; if not set, the dependent container blocks indefinitely. For init containers, use `SUCCESS` condition instead.
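The CPU/memory rule in the first gotcha can be checked mechanically before a task definition is produced. The sketch below transcribes the combinations listed above into a lookup table; the function names are hypothetical, and "nearest valid option" is interpreted here as rounding memory up to the next valid value (falling back to the maximum for that CPU size), which is an assumption rather than something the list prescribes.

```typescript
// Build a list of memory values in MiB from a GB range with a fixed step.
function rangeMiB(minGb: number, maxGb: number, stepGb: number): number[] {
  const out: number[] = [];
  for (let gb = minGb; gb <= maxGb; gb += stepGb) out.push(gb * 1024);
  return out;
}

// Valid Fargate memory values (MiB) per CPU setting, transcribed from the list above.
const FARGATE_MEMORY_MIB: Record<number, number[]> = {
  256: [512, 1024, 2048],
  512: rangeMiB(1, 4, 1),
  1024: rangeMiB(2, 8, 1),
  2048: rangeMiB(4, 16, 1),
  4096: rangeMiB(8, 30, 1),
  8192: rangeMiB(16, 60, 4),
  16384: rangeMiB(32, 120, 8),
};

function isValidFargateCombo(cpu: number, memoryMiB: number): boolean {
  const options = FARGATE_MEMORY_MIB[cpu];
  return options !== undefined && options.indexOf(memoryMiB) !== -1;
}

// Smallest valid memory >= the request, or the largest valid value if the
// request exceeds the maximum. Returns undefined for an unknown CPU setting.
function nearestValidMemory(cpu: number, memoryMiB: number): number | undefined {
  const options = FARGATE_MEMORY_MIB[cpu];
  if (options === undefined) return undefined;
  for (const m of options) {
    if (m >= memoryMiB) return m;
  }
  return options[options.length - 1];
}

console.log(isValidFargateCombo(512, 1024));  // valid pair from the table
console.log(nearestValidMemory(8192, 17000)); // rounds up to the next 4 GB step
```

Rounding up rather than down avoids handing the user less memory than they asked for; when the request exceeds the maximum for that CPU size, a larger CPU setting is the better suggestion.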
Quick-Start: CDK Fargate Web App
```typescript
import * as cdk from 'aws-cdk-lib';
import * as ecr from 'aws-cdk-lib/aws-ecr';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecsPatterns from 'aws-cdk-lib/aws-ecs-patterns';
import * as secretsmanager from 'aws-cdk-lib/aws-secretsmanager';
import { Construct } from 'constructs';

export class WebAppStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);
    // Placeholder lookups (hypothetical names): substitute your own repository and secret.
    const repo = ecr.Repository.fromRepositoryName(this, 'Repo', 'my-app');
    const dbSecret = secretsmanager.Secret.fromSecretNameV2(this, 'DbSecret', 'my-app/db-password');

    const service = new ecsPatterns.ApplicationLoadBalancedFargateService(this, 'WebApp', {
      taskImageOptions: {
        image: ecs.ContainerImage.fromEcrRepository(repo, 'latest'),
        containerPort: 8080,
        secrets: { DB_PASSWORD: ecs.Secret.fromSecretsManager(dbSecret) },
      },
      cpu: 512,
      memoryLimitMiB: 1024,
      desiredCount: 2,
      publicLoadBalancer: true,
      circuitBreaker: { rollback: true },
      minHealthyPercent: 100,
    });
    // Shorten the ALB deregistration delay from the 300s default.
    service.targetGroup.setAttribute('deregistration_delay.timeout_seconds', '30');
    const scaling = service.service.autoScaleTaskCount({ minCapacity: 2, maxCapacity: 10 });
    scaling.scaleOnCpuUtilization('CpuScaling', { targetUtilizationPercent: 70 });
  }
}
```
CDK L3 patterns auto-create VPC, cluster, ALB, target group, and security groups. For production, create these separately and pass them in. `ApplicationLoadBalancedFargateService` defaults to `assignPublicIp: false`; tasks in public subnets need `assignPublicIp: true` for internet access, or use private subnets with NAT.
Quick-Start: ECS Exec
```bash
# 1. Enable on the service (existing tasks won't support it; force a new deployment)
aws ecs update-service --cluster $CLUSTER --service $SERVICE \
  --enable-execute-command --force-new-deployment --output json

# 2. Connect (task role must have ssmmessages:* permissions)
aws ecs execute-command --cluster $CLUSTER --task $TASK_ID \
  --container $CONTAINER --interactive --command "/bin/sh"
```
If `TargetNotConnectedException`: wait 30–60s for SSM agent startup, check NAT/VPC endpoint for `ssmmessages`, verify task role (not execution role) has permissions.
Common Workflows
Use the best available tool for AWS operations (MCP server, AWS CLI, or SDK). The commands below show the AWS CLI form.
Read reference files only when the conversation requires deeper detail.
- Read references/task-definition-authoring.md if the user needs to author a task definition, configure CPU/memory, set up networking modes, inject secrets, mount volumes, or configure container dependencies.
- Read references/fargate-service-deployment.md if the user needs to deploy a Fargate service behind an ALB, configure health checks, tune deregistration delay, set up path-based routing, or handle private subnet networking.
- Read references/ecr-repository-management.md if the user needs ECR lifecycle policies, image scanning, cross-account image pulls, or is debugging image pull errors.
- Read references/ecs-exec-debugging.md if the user needs to set up ECS Exec, debug TargetNotConnectedException, configure session logging, or validate ECS Exec prerequisites.
- Read references/service-scaling-and-updates.md if the user needs auto-scaling, deployment strategies (rolling, blue/green), circuit breaker configuration, or Service Connect setup.
- Read references/app-runner-guide.md if the user has an existing App Runner service, needs to troubleshoot App Runner connectivity, or wants to migrate from App Runner to ECS Express Mode.
- Read references/ecs-infrastructure-patterns.md if the user needs CDK or CloudFormation examples for Fargate services, SQS workers, scheduled tasks, EFS volumes, ECS Exec, path-based routing, private subnets, or FireLens.
- Read references/ecs-logging-and-firelens.md if the user needs awslogs configuration, FireLens/Fluent Bit setup, multiline log handling, or guaranteed log delivery.
- Read references/ecs-troubleshooting-guide.md if the user is debugging task placement failures, OOM kills (exit code 137), health check failures, image pull errors, or networking issues in private subnets.
- Read references/fargate-spot.md if the user asks about Fargate Spot pricing, capacity provider strategies, or interruption handling.
Decision Guide: ECS Express Mode vs ECS Fargate
App Runner: Sunset April 30, 2026 — no new customers, no new features. Existing customers should migrate to ECS Express Mode. See App Runner Availability Change.
| Factor | ECS Express Mode | ECS Fargate |
|---|---|---|
| Setup complexity | Minimal (single API call) | Moderate — task def, service, cluster, ALB |
| Networking control | Managed (ALB in default VPC) | Full — awsvpc, security groups, subnets |
| Scaling | Auto (CPU-based) | Configurable target/step scaling |
| Use when | New simple HTTP app/API, zero infra management | Production services needing VPC, ALB, fine-grained IAM |
| Limitations | New service, evolving feature set | Most setup required |
Default recommendation: Use ECS Fargate for production workloads. Use ECS Express Mode for the simplest path (new customers).
Troubleshooting
CannotPullContainerError
Cause: Task cannot reach ECR. In private subnets, tasks need NAT gateway or VPC endpoints (`ecr.api`, `ecr.dkr`, `s3` gateway, `logs`).
Fix: Verify route table has a route to NAT gateway or create the required VPC endpoints. Verify the execution role has `ecr:GetDownloadUrlForLayer`, `ecr:BatchGetImage`, `ecr:GetAuthorizationToken` (Resource: `"*"`). Check security group allows outbound HTTPS (443).
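As a sketch of the execution-role permissions named in the fix, the helper below builds an IAM policy document containing only those three actions. The function name and the repository ARN are placeholders; a real execution role usually needs more (for example CloudWatch Logs permissions), so treat this as a starting point rather than a complete policy.

```typescript
interface PolicyStatement {
  Effect: 'Allow' | 'Deny';
  Action: string[];
  Resource: string;
}

interface PolicyDocument {
  Version: string;
  Statement: PolicyStatement[];
}

// Hypothetical helper: builds a pull-only policy for one ECR repository.
function ecrPullPolicy(repositoryArn: string): PolicyDocument {
  return {
    Version: '2012-10-17',
    Statement: [
      // Registry-level action: must use Resource "*".
      { Effect: 'Allow', Action: ['ecr:GetAuthorizationToken'], Resource: '*' },
      // Repository-level actions: scoped to the one repository.
      {
        Effect: 'Allow',
        Action: ['ecr:BatchGetImage', 'ecr:GetDownloadUrlForLayer'],
        Resource: repositoryArn,
      },
    ],
  };
}

// Placeholder ARN for illustration only.
const policy = ecrPullPolicy('arn:aws:ecr:us-east-1:123456789012:repository/my-app');
console.log(JSON.stringify(policy, null, 2));
```

Splitting the statements keeps the wildcard confined to the one action that requires it.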
Task failed ELB health checks
Cause: Health check path returns non-200, container not listening on the configured port, or health check grace period too short.
Fix: Verify the container responds on the health check path and port. Set `healthCheckGracePeriodSeconds` to at least 60s (longer for JVM apps). Ensure the security group allows traffic from the ALB security group on the container port.
OutOfMemoryError / exit code 137
Cause: Container exceeded its memory hard limit (SIGKILL). On Fargate, task-level memory is the hard limit.
Fix: Increase task-level memory. For JVM apps, use `-XX:MaxRAMPercentage=75` instead of fixed `-Xmx`; this automatically adapts to the container's memory allocation. Check container-level `memory` (hard limit) vs `memoryReservation` (soft limit).
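To see why a percentage-based heap setting adapts while a fixed `-Xmx` does not, the arithmetic below estimates the heap ceiling for a given memory limit. The function name is hypothetical, and the real JVM applies its own rounding and uses non-heap memory on top of this, so treat the numbers as approximations.

```typescript
// Approximate max heap (MiB) for a memory limit and a MaxRAMPercentage value.
function approxMaxHeapMiB(memoryLimitMiB: number, maxRamPercentage: number): number {
  return Math.floor((memoryLimitMiB * maxRamPercentage) / 100);
}

// Raising the task memory raises the heap ceiling with no JVM flag change.
for (const limit of [1024, 2048, 4096]) {
  console.log(`${limit} MiB limit -> ~${approxMaxHeapMiB(limit, 75)} MiB heap`);
}
```

The remaining 25% is headroom for metaspace, thread stacks, and native buffers, which also count toward the container limit.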
AccessDeniedException on AWS API calls from container
Cause: Permissions are on the execution role instead of the task role, or the task role is missing.
Fix: Verify the task definition has `taskRoleArn` set (not just `executionRoleArn`). Add the required permissions to the task role.
Service stuck deploying / tasks keep restarting
Cause: Deployment circuit breaker not enabled, or health check failing on new tasks.
Fix: Enable circuit breaker with rollback. Check service events: `aws ecs describe-services --cluster $CLUSTER --services $SERVICE --output json`. Check stopped task reasons: `aws ecs describe-tasks --cluster $CLUSTER --tasks $TASK_ID --output json`.
ECS Exec TargetNotConnectedException
Cause: SSM agent not running, missing task role permissions, or missing VPC endpoint.
Fix: Verify `enableExecuteCommand` is true on the service. Check the task role has SSM permissions. For private subnets, create the `ssmmessages` VPC endpoint. Verify with `aws ecs describe-tasks` that `ExecuteCommandAgent` status is `RUNNING`.
Error retry classification
| Retry | Do NOT retry |
|---|---|
| ThrottlingException | InvalidParameterException |
| ServiceUnavailableException | ClientException |
| ServerException | AccessDeniedException |
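The table above could be applied in code roughly as follows. This is a minimal sketch: the helper names, the attempt limit, and the base delay are assumptions, and it relies only on the thrown error exposing a `name` property matching the exception names in the table.

```typescript
// Exception names from the "Retry" column of the table above.
const RETRYABLE_ERRORS = ['ThrottlingException', 'ServiceUnavailableException', 'ServerException'];

function isRetryable(errorName: string): boolean {
  return RETRYABLE_ERRORS.indexOf(errorName) !== -1;
}

// Exponential backoff: base, 2x base, 4x base, ... for attempt 1, 2, 3, ...
function backoffDelayMs(attempt: number, baseDelayMs: number): number {
  return baseDelayMs * Math.pow(2, attempt - 1);
}

// Retries op() only for retryable error names; rethrows everything else.
// maxAttempts and baseDelayMs defaults are illustrative, not recommendations.
async function withRetry<T>(op: () => Promise<T>, maxAttempts = 4, baseDelayMs = 200): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      const name = (err as { name?: string }).name ?? '';
      if (!isRetryable(name) || attempt >= maxAttempts) throw err;
      const delay = backoffDelayMs(attempt, baseDelayMs);
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Errors in the "Do NOT retry" column fall through to the `throw` on the first attempt, since repeating a request with bad parameters or missing permissions cannot succeed.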
Security Considerations
- You MUST use IAM roles (execution role + task role); never embed credentials in container images or environment variables
- You MUST use Secrets Manager or SSM Parameter Store for sensitive configuration, injected via the `secrets` field in the task definition
- You SHOULD enable ECR image scanning on push for vulnerability detection
- You SHOULD use private subnets with NAT gateway or VPC endpoints for production workloads
- You MUST enable CloudTrail for ECS API audit logging
- You SHOULD configure CloudWatch Container Insights for monitoring
- You SHOULD use `readonlyRootFilesystem: true` in container definitions where possible (note: incompatible with ECS Exec)
- You MUST scope task role permissions to specific resources; avoid `*` wildcards and `*FullAccess` policies
- You MUST confirm with the user before executing destructive operations: `--force-new-deployment` (replaces all running tasks), `delete-service`, `deregister-task-definition`. ECS does not support `--dry-run`, so use the plan-validate-execute pattern: explain what will happen, get confirmation, then execute
- You SHOULD use ACM certificates with HTTPS listeners on ALBs fronting ECS services, per ECS network security best practices: "provision certificates for the load balancer using AWS Certificate Manager (ACM)"
- You SHOULD avoid logging sensitive data (secrets, PII, tokens) in container stdout/stderr, because these flow to CloudWatch Logs via the awslogs driver. If sensitive data may appear in logs, enable CloudWatch Logs encryption with a KMS key
- You SHOULD attach an AWS WAF WebACL to internet-facing ALBs for defense in depth against common web exploits
- You SHOULD include `aws:SourceArn` and `aws:SourceAccount` condition keys in ECR repository policies for cross-account access to prevent confused deputy attacks