processing-s3-uploads-with-step-functions

Step Functions Workflow: Route S3 Uploads to Lambda or Fargate

Overview

This skill deploys an event-driven workflow using the AWS CLI. When a file is uploaded to an S3 bucket, EventBridge triggers a Step Functions state machine. The state machine checks the file size and routes processing to either a Lambda function (files ≤ 6 MB) or a Fargate task (files > 6 MB).
The architecture includes:
  • An S3 bucket with EventBridge notifications enabled
  • An EventBridge rule that triggers Step Functions on S3 object creation
  • A Step Functions state machine with a Choice state for routing
  • A Lambda function for processing small files
  • An ECS Fargate task for processing large files
  • A VPC with two subnets, internet gateway, and security group
  • An ECR repository for the Fargate container image
  • Scoped IAM roles for Lambda, Step Functions, and ECS tasks
Use this skill when:
  • You need to process S3 uploads with different compute based on file size
  • You want a serverless workflow that can handle both small and large files
  • You need Step Functions orchestration with Lambda and Fargate
Do not use this skill when:
  • All files are small enough for Lambda (use S3 → Lambda directly)
  • You need real-time streaming (use Kinesis)
  • You don't need file-size-based routing
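The routing rule described above can be sketched in a few lines of Python. This is an illustration only, not the skill's state machine definition (which lives in scripts/statemachine.asl.json and is not shown here). The function names are hypothetical, and reading "6 MB" as 6 × 1024 × 1024 bytes is an assumption. The size is read from the detail.object.size field of the S3 "Object Created" event.

```python
# Hypothetical sketch of the size-based routing rule; not the actual state machine.
SIZE_THRESHOLD_BYTES = 6 * 1024 * 1024  # assumption: "6 MB" interpreted as 6 MiB


def size_from_event(event: dict) -> int:
    """Read the object size from an S3 "Object Created" EventBridge event."""
    return int(event["detail"]["object"]["size"])


def choose_compute(size_bytes: int) -> str:
    """Return "lambda" for files at or below the threshold, otherwise "fargate"."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    return "lambda" if size_bytes <= SIZE_THRESHOLD_BYTES else "fargate"
```

A file of exactly the threshold size goes to Lambda, matching the "≤ 6 MB" wording above.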

Prerequisites

  1. AWS CLI v2: installed and configured. Verify with:
    aws sts get-caller-identity
  2. Python 3.12: used as the Lambda function runtime.
  3. Docker: used to build and push the Fargate container image.

Parameters

  • bucket_name (required): Name for the S3 bucket (globally unique, lowercase, 3-63 characters)
  • region (required): AWS region for all resources
  • ecr_repo_name (required): Name for the ECR repository
  • state_machine_name (required): Name for the Step Functions state machine
  • kms_key_arn (optional): ARN of a KMS key for CloudWatch Logs encryption. If not provided, create one with
    aws kms create-key --description "Key for CloudWatch Logs encryption" --region {region}
Constraints for parameter acquisition:
  • You MUST ask for all required parameters upfront in a single prompt
  • You MUST support multiple input methods (direct input, file path, URL)
  • You MUST confirm successful acquisition of all parameters before proceeding
  • You MUST validate that bucket_name follows S3 naming rules
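One way to validate bucket_name before calling AWS is a small local check. The sketch below covers only a subset of the S3 general-purpose bucket naming rules (length, allowed characters, first and last character, no adjacent periods, not shaped like an IP address, and two reserved affixes). It is an assumption-laden helper with a hypothetical name, not a complete implementation of the rules.

```python
import re

# Subset of the S3 naming rules: 3-63 chars, lowercase letters, digits, dots,
# hyphens, and a letter or digit at both ends.
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")
_IPV4_SHAPE_RE = re.compile(r"\d{1,3}(\.\d{1,3}){3}")


def is_valid_bucket_name(name: str) -> bool:
    """Return True if name passes this partial set of S3 naming checks."""
    if not _BUCKET_RE.fullmatch(name):
        return False
    if ".." in name:  # adjacent periods are not allowed
        return False
    if _IPV4_SHAPE_RE.fullmatch(name):  # must not look like an IP address
        return False
    if name.startswith("xn--") or name.endswith("-s3alias"):  # reserved affixes
        return False
    return True
```

A name that passes this check can still be rejected by S3, for example because it is already taken, so the create-bucket call remains the final authority.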

Procedures

Step 0: Verify Dependencies

Constraints:
  • You MUST verify the following tools are available: aws-cli, python3 (3.12+), docker
  • You MUST inform the user about any missing tools with a clear message
  • You MUST ask if the user wants to proceed despite missing tools
  • You MUST respect the user's decision to abort at any point
  • You MUST explain to the user which step is being executed, why, and which tool is being called
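The dependency check could be sketched as follows. It assumes the tools are discoverable on PATH under the executable names aws, python3, and docker, and the helper names are hypothetical. The lookup function is injectable so the logic can be exercised without the tools installed.

```python
import shutil
from typing import Callable, Iterable, Optional

# Assumption: these are the executable names on PATH for the three prerequisites.
REQUIRED_TOOLS = ("aws", "python3", "docker")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the tools that cannot be found on PATH, in the order given."""
    return [tool for tool in tools if which(tool) is None]


def python_ok(version: tuple[int, int]) -> bool:
    """Return True if a (major, minor) version satisfies the 3.12+ requirement."""
    return version >= (3, 12)
```

An empty list from missing_tools() means all three tools were found. Otherwise the list can be shown to the user before asking whether to proceed.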

Step 1: Retrieve AWS Account ID

Constraints:
  • You MUST retrieve the account ID with:
    aws sts get-caller-identity --query 'Account' --output text
  • You MUST store the result as {account_id} for use in all subsequent steps
  • You MUST abort if credentials are not configured
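A small guard can catch the case where the command prints an error instead of an account ID. The sketch below assumes the command's text output has already been captured as a string. The function name is hypothetical. It relies on AWS account IDs being 12-digit numbers.

```python
import re


def parse_account_id(cli_output: str) -> str:
    """Validate and return the account ID printed by get-caller-identity."""
    account_id = cli_output.strip()
    if not re.fullmatch(r"\d{12}", account_id):
        # Anything else (empty output, an error message) means we should abort.
        raise ValueError(f"unexpected account ID output: {account_id!r}")
    return account_id
```

Raising on unexpected output gives a single place to implement the "abort if credentials are not configured" rule.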

Step 2: Get the Default VPC and Networking

Constraints:
  • You MUST retrieve the default VPC ID with:
    aws ec2 describe-vpcs --filters Name=isDefault,Values=true --query 'Vpcs[0].VpcId' --output text --region {region}
  • If no default VPC exists (the command above prints None), you MUST inform the user that they need to either create one with:
    aws ec2 create-default-vpc --region {region}
    or provide a VPC ID manually
  • You MUST retrieve two subnet IDs from the default VPC:
    aws ec2 describe-subnets --filters Name=vpc-id,Values={vpc_id} --query 'Subnets[0:2].SubnetId' --output text --region {region}
  • You MUST create a security group in the default VPC:
    aws ec2 create-security-group --group-name fargate-sg --description "Security group for Fargate tasks" --vpc-id {vpc_id} --region {region}
  • You MUST configure security group egress rules to allow only HTTPS and DNS outbound. First revoke the default allow-all egress rule:
    aws ec2 revoke-security-group-egress --group-id {sg_id} --ip-permissions IpProtocol=-1,IpRanges='[{CidrIp=0.0.0.0/0}]' --region {region}
    Then add scoped rules:
    aws ec2 authorize-security-group-egress --group-id {sg_id} --protocol tcp --port 443 --cidr 0.0.0.0/0 --region {region}
    and
    aws ec2 authorize-security-group-egress --group-id {sg_id} --protocol udp --port 53 --cidr 0.0.0.0/0 --region {region}
  • You MUST recommend VPC endpoints for S3 and CloudWatch Logs for production workloads to avoid internet-routed traffic and eliminate the need for broad egress rules
  • You MUST capture {vpc_id}, {subnet1_id}, {subnet2_id}, and {sg_id} for use in later steps

Step 3: Create the ECR Repository

Constraints:
  • You MUST create the repository with:
    aws ecr create-repository --repository-name {ecr_repo_name} --region {region}
  • You MUST capture the repositoryUri from the response

Step 4: Build and Push the Container Image

Constraints:
  • You MUST verify Docker is installed by running:
    docker --version
    If Docker is not installed, instruct the user to install it from https://docs.docker.com/get-docker/ and do not continue until it is available
  • You MUST authenticate Docker with ECR:
    aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {account_id}.dkr.ecr.{region}.amazonaws.com
  • The Dockerfile and processor code are in
    scripts/Dockerfile
    and
    scripts/fargate_processor.py
  • You MUST build and push the image from the scripts directory:
    cd scripts
    docker build --platform linux/amd64 -t {ecr_repo_name} .
    docker tag {ecr_repo_name}:latest {account_id}.dkr.ecr.{region}.amazonaws.com/{ecr_repo_name}:latest
    docker push {account_id}.dkr.ecr.{region}.amazonaws.com/{ecr_repo_name}:latest
    cd ..
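The image URI is repeated across the tag and push commands, so building it once helps avoid typos. The sketch below assembles the same three docker commands as argument lists. The helper names are hypothetical, and the amazonaws.com registry suffix assumes the standard AWS partition.

```python
def ecr_image_uri(account_id: str, region: str, repo_name: str, tag: str = "latest") -> str:
    """Build the full ECR image URI (assumes the standard amazonaws.com partition)."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo_name}:{tag}"


def docker_commands(account_id: str, region: str, repo_name: str) -> list[list[str]]:
    """Return the build, tag, and push commands as argv lists, in run order."""
    uri = ecr_image_uri(account_id, region, repo_name)
    return [
        ["docker", "build", "--platform", "linux/amd64", "-t", repo_name, "."],
        ["docker", "tag", f"{repo_name}:latest", uri],
        ["docker", "push", uri],
    ]
```

Each list could be passed to subprocess.run(..., check=True, cwd="scripts") so that a failed build stops the sequence.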

Step 5: Create IAM Roles

Follow the detailed instructions in references/iam-roles.md to create all IAM roles (Lambda, ECS task execution, ECS task, Step Functions, and EventBridge).
Constraints:
  • You MUST wait at least 10 seconds after creating the roles so that they can propagate before they are used

Step 6: Create the Lambda Function

Constraints:
  • The function code is in
    scripts/lambda_function.py
  • You MUST be in the skill root directory before packaging and creating the function
  • You MUST package it with:
    python3 -c "import zipfile,io; z=io.BytesIO(); f=zipfile.ZipFile(z,'w'); f.writestr('lambda_function.py', open('scripts/lambda_function.py').read()); f.close(); open('/tmp/lambda_function.zip','wb').write(z.getvalue())"
  • You MUST create the function with:
    aws lambda create-function \
        --function-name sfn-file-processor \
        --runtime python3.12 \
        --handler lambda_function.lambda_handler \
        --role arn:aws:iam::{account_id}:role/sfn-lambda-role \
        --zip-file fileb:///tmp/lambda_function.zip \
        --timeout 60 \
        --architectures x86_64 \
        --region {region}
  • You MUST verify the function was created with:
    aws lambda get-function --function-name sfn-file-processor --region {region}
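The packaging one-liner above is compact but hard to read. A more readable equivalent is sketched below under the same assumptions (a single source file, stored at the archive root as lambda_function.py). The function name is hypothetical, and unlike the one-liner this version compresses the file.

```python
import zipfile
from pathlib import Path


def build_lambda_zip(source: Path, dest: Path) -> Path:
    """Write source into a zip at dest, stored at the root as lambda_function.py."""
    with zipfile.ZipFile(dest, "w", zipfile.ZIP_DEFLATED) as archive:
        # arcname keeps the file at the archive root so the handler
        # lambda_function.lambda_handler resolves correctly.
        archive.write(source, arcname="lambda_function.py")
    return dest
```

For the paths used in this step, the call would be build_lambda_zip(Path("scripts/lambda_function.py"), Path("/tmp/lambda_function.zip")), run from the skill root directory.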

Step 7: Create the CloudWatch Log Group

Constraints:
  • You MUST create the log group for Fargate:
    aws logs create-log-group --log-group-name /StepFunctionFargateTask --region {region}
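  • You MUST encrypt the log group with the KMS key given by {kms_key_arn}. The key policy must allow the CloudWatch Logs service principal (logs.{region}.amazonaws.com) to use the key, otherwise the association fails:
    aws logs associate-kms-key --log-group-name /StepFunctionFargateTask --kms-key-arn {kms_key_arn} --region {region}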

Step 8: Create the ECS Cluster and Task Definition

Follow the detailed instructions in references/ecs-task-definition.md to create the ECS cluster and register the Fargate task definition.
Constraints:
  • You MUST capture the task definition ARN from the response

Step 9: Create the S3 Bucket with EventBridge Notifications

Constraints:
  • You MUST create the bucket with:
    aws s3api create-bucket --bucket {bucket_name} --region {region} --create-bucket-configuration LocationConstraint={region}
  • You MUST NOT include --create-bucket-configuration if the region is us-east-1, because S3 rejects us-east-1 as a LocationConstraint value
  • You MUST enable EventBridge notifications on the bucket:
    aws s3api put-bucket-notification-configuration --bucket {bucket_name} --notification-configuration '{"EventBridgeConfiguration": {}}' --region {region}
  • You MUST enable default encryption on the bucket:
    aws s3api put-bucket-encryption --bucket {bucket_name} --server-side-encryption-configuration '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"aws:kms"}}]}' --region {region}
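The us-east-1 special case is easy to get wrong when the command is assembled by hand. The sketch below builds the create-bucket argument list and adds the location constraint only for other regions. The function name is hypothetical.

```python
def create_bucket_command(bucket_name: str, region: str) -> list[str]:
    """Build the create-bucket argv, omitting LocationConstraint for us-east-1."""
    command = [
        "aws", "s3api", "create-bucket",
        "--bucket", bucket_name,
        "--region", region,
    ]
    if region != "us-east-1":
        command += ["--create-bucket-configuration", f"LocationConstraint={region}"]
    return command
```

Passing the list to subprocess.run avoids shell quoting issues with the bucket name.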

Step 10: Create the Step Functions State Machine

Constraints:
  • The state machine definition is in
    scripts/statemachine.asl.json
  • You MUST create a working copy and replace all placeholders:
    sed -e 's|${LambdaFunction}|arn:aws:lambda:{region}:{account_id}:function:sfn-file-processor|g' \
        -e 's|${Cluster}|arn:aws:ecs:{region}:{account_id}:cluster/sfn-cluster|g' \
        -e 's|${TaskDefinition}|{task_definition_arn}|g' \
        -e 's|${Subnet1}|{subnet1_id}|g' \
        -e 's|${Subnet2}|{subnet2_id}|g' \
        -e 's|${SecurityGroup}|{sg_id}|g' \
        scripts/statemachine.asl.json > /tmp/statemachine.asl.json
  • You MUST create the state machine with:
    aws stepfunctions create-state-machine \
        --name {state_machine_name} \
        --definition file:///tmp/statemachine.asl.json \
        --role-arn arn:aws:iam::{account_id}:role/sfn-state-machine-role \
        --type STANDARD \
        --region {region}
  • You MUST capture the stateMachineArn from the response
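The sed substitution above silently leaves any placeholder it does not know about. A Python alternative that fails loudly on an unfilled ${Name} placeholder is sketched below. It is not part of the skill's scripts, the function name is hypothetical, and it assumes placeholders always have the ${Name} form used in the sed command.

```python
import re

# Matches placeholders of the form ${Name}, as used in the sed command.
_PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def render_definition(template: str, values: dict[str, str]) -> str:
    """Replace every ${Name} in template; raise KeyError if a value is missing."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value for placeholder ${{{name}}}")
        return values[name]

    return _PLACEHOLDER.sub(substitute, template)
```

The values dictionary would carry the same six keys as the sed command: LambdaFunction, Cluster, TaskDefinition, Subnet1, Subnet2, and SecurityGroup.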

Step 11: Create the EventBridge Rule

Constraints:
  • You MUST create the EventBridge rule to trigger on S3 object creation:
    aws events put-rule \
        --name s3-to-stepfunctions \
        --event-pattern '{
          "source": ["aws.s3"],
          "detail-type": ["Object Created"],
          "detail": {
            "bucket": {
              "name": ["{bucket_name}"]
            }
          }
        }' \
        --region {region}
  • You MUST add the state machine as a target:
    aws events put-targets \
        --rule s3-to-stepfunctions \
        --targets '[{
          "Id": "StepFunctionsTarget",
          "Arn": "{state_machine_arn}",
          "RoleArn": "arn:aws:iam::{account_id}:role/sfn-eventbridge-role"
        }]' \
        --region {region}
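Multi-line JSON inside shell quotes is a common source of errors. Generating the event pattern with json.dumps, as sketched below, produces a single-line string that can be passed as the --event-pattern argument. The function names are hypothetical, and the rule name default mirrors the one used in this step.

```python
import json


def s3_object_created_pattern(bucket_name: str) -> str:
    """Return the EventBridge pattern for "Object Created" events on one bucket."""
    pattern = {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": [bucket_name]}},
    }
    return json.dumps(pattern)


def put_rule_command(bucket_name: str, region: str, rule_name: str = "s3-to-stepfunctions") -> list[str]:
    """Build the put-rule argv with the pattern passed as a single argument."""
    return [
        "aws", "events", "put-rule",
        "--name", rule_name,
        "--event-pattern", s3_object_created_pattern(bucket_name),
        "--region", region,
    ]
```

Because the pattern travels as one list element, no shell escaping of the embedded quotes is needed when the list is run with subprocess.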

Step 12: Configure Monitoring

Constraints:
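  • You MUST create a Dead Letter Queue for failed EventBridge invocations. EventBridge can deliver to the queue only if the queue's resource policy allows events.amazonaws.com to call sqs:SendMessage:
    aws sqs create-queue --queue-name s3-to-stepfunctions-dlq --region {region}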
  • You MUST update the EventBridge target to attach the DLQ:
    aws events put-targets \
        --rule s3-to-stepfunctions \
        --targets '[{
          "Id": "StepFunctionsTarget",
          "Arn": "{state_machine_arn}",
          "RoleArn": "arn:aws:iam::{account_id}:role/sfn-eventbridge-role",
          "DeadLetterConfig": {
            "Arn": "arn:aws:sqs:{region}:{account_id}:s3-to-stepfunctions-dlq"
          }
        }]' \
        --region {region}
  • You MUST create a CloudWatch alarm for Step Functions execution failures:
    aws cloudwatch put-metric-alarm --alarm-name sfn-execution-failures --metric-name ExecutionsFailed --namespace AWS/States --statistic Sum --period 300 --threshold 1 --comparison-operator GreaterThanOrEqualToThreshold --evaluation-periods 1 --dimensions Name=StateMachineArn,Value={state_machine_arn} --region {region}
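For EventBridge to deliver failed events to the Dead Letter Queue, the queue needs a resource policy that grants sqs:SendMessage to the EventBridge service. A sketch of such a policy document, scoped to the rule created earlier, is shown below. The function name and statement Sid are hypothetical, and the queue and rule names default to the ones used in this skill.

```python
import json


def dlq_policy(
    region: str,
    account_id: str,
    queue_name: str = "s3-to-stepfunctions-dlq",
    rule_name: str = "s3-to-stepfunctions",
) -> str:
    """Return a queue policy letting one EventBridge rule send to the DLQ."""
    queue_arn = f"arn:aws:sqs:{region}:{account_id}:{queue_name}"
    rule_arn = f"arn:aws:events:{region}:{account_id}:rule/{rule_name}"
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowEventBridgeToSendToDlq",  # illustrative identifier
            "Effect": "Allow",
            "Principal": {"Service": "events.amazonaws.com"},
            "Action": "sqs:SendMessage",
            "Resource": queue_arn,
            # Restrict delivery to events coming from this specific rule.
            "Condition": {"ArnEquals": {"aws:SourceArn": rule_arn}},
        }],
    }
    return json.dumps(policy)
```

The resulting string could be set as the queue's Policy attribute with aws sqs set-queue-attributes.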

Step 13: Validate

Constraints:
  • You MUST test with a small file (< 6 MB) to verify Lambda processing:
    echo 'test data' > /tmp/small-file.txt
    aws s3 cp /tmp/small-file.txt s3://{bucket_name}/small-file.txt --region {region}
  • You MUST wait 15 seconds then check the Step Functions execution:
    aws stepfunctions list-executions --state-machine-arn {state_machine_arn} --region {region}
  • You MUST verify the execution succeeded and routed to Lambda
  • You MUST provide a summary of all created resources including: VPC ID, subnet IDs, security group ID, ECR repo URI, ECS cluster ARN, task definition ARN, Lambda function ARN, state machine ARN, bucket name, and EventBridge rule name

Troubleshooting

EventBridge rule not triggering

  • Verify EventBridge notifications are enabled on the bucket:
    aws s3api get-bucket-notification-configuration --bucket {bucket_name}
  • Verify the rule exists:
    aws events describe-rule --name s3-to-stepfunctions --region {region}
  • Check that the target has the correct state machine ARN and role

Step Functions execution fails at Fargate task

  • Verify the container image exists in ECR:
    aws ecr describe-images --repository-name {ecr_repo_name} --region {region}
  • Check that the subnets have internet access (route table with IGW)
  • Verify the security group allows outbound HTTPS (TCP 443) and DNS (UDP 53) traffic
  • Check the container logs in the CloudWatch log group /StepFunctionFargateTask

Lambda invocation fails

  • Check CloudWatch Logs:
    aws logs tail /aws/lambda/sfn-file-processor --region {region}
  • Verify the Step Functions role has the lambda:InvokeFunction permission

IAM PassRole errors

  • The Step Functions role must have iam:PassRole for both the ECS task execution role and the ECS task role ARNs

Fargate task stuck in PROVISIONING

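  • Verify the task receives a public IP. For Fargate this is controlled by assignPublicIp (ENABLED) in the task's awsvpc network configuration rather than by the subnet's auto-assign public IP setting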
  • Verify the internet gateway is attached and route table has 0.0.0.0/0 route

Security Considerations

  • Fargate tasks with public IPs are exposed to the internet. Revoke the default allow-all egress rule:
    aws ec2 revoke-security-group-egress --group-id {sg_id} --ip-permissions IpProtocol=-1,IpRanges='[{CidrIp=0.0.0.0/0}]'
    Then add scoped egress rules for HTTPS and DNS:
    aws ec2 authorize-security-group-egress --group-id {sg_id} --protocol tcp --port 443 --cidr 0.0.0.0/0
    aws ec2 authorize-security-group-egress --group-id {sg_id} --protocol udp --port 53 --cidr 0.0.0.0/0
    For production, consider using VPC endpoints for S3 and CloudWatch Logs instead of internet-routed traffic.
  • Scan container images for vulnerabilities before pushing to ECR. Enable ECR image scanning with:
    aws ecr put-image-scanning-configuration --repository-name {ecr_repo_name} --image-scanning-configuration scanOnPush=true --region {region}
  • Use IAM roles for credentials — never hardcode access keys in container code.
  • Enable encryption at rest for the S3 bucket:
    aws s3api put-bucket-encryption --bucket {bucket_name} --server-side-encryption-configuration '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"aws:kms"}}]}'
  • Enable CloudWatch Logs encryption for Fargate container logs:
    aws logs associate-kms-key --log-group-name /StepFunctionFargateTask --kms-key-arn {kms_key_arn}
  • Configure a Dead Letter Queue on the EventBridge rule for failed invocations
  • Set up CloudWatch alarms on Step Functions execution failures for operational visibility

Version information

  • AWS CLI: 2.x
  • Python runtime: 3.12
  • Last validated: 2026-04-27

Additional Resources
