creating-data-lake-table
Create Data Lake Tables with Amazon S3 Tables
Overview
Amazon S3 Tables provides managed Iceberg tables with automatic compaction and snapshot management. Queryable via Athena and Iceberg-compatible engines.
Common Tasks
You MUST use AWS MCP server tools when connected; they provide command validation, sandboxed execution, and audit logging. Fall back to the AWS CLI if MCP is unavailable.
Decision Guide
Before creating, you MUST check what exists.

You MUST run `aws glue get-tables --database-name <NAME>` when the user mentions a database.

| What you find | Action |
|---|---|
| Fuzzy database name ("our analytics db") | You MUST STOP. Delegate to |
| Non-S3-Tables table with matching name | You MUST STOP. Delegate to |
| Existing S3 Tables table with matching name | You MUST check schema match. Reuse if compatible, recreate only if user confirms. |
| No matching tables | Proceed with creation (Steps 1-8). |
| User explicitly requests new S3 Tables table | Skip checks, proceed with creation. |

Creation paths:
- Existing data in S3: Create empty table (Steps 1-8), then use the `ingesting-into-data-lake` skill.
- Glue ETL pipeline: Read `references/table-creation-glue-etl.md` first, then Steps 1-6.
- Lake Formation access control: Search AWS docs for "S3 Tables integration with Lake Formation".
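The decision table above can be sketched as a small routine. The tuple shape `(name_is_fuzzy, is_s3_tables, schema_compatible)` is a hypothetical simplification of facts you would derive from the `aws glue get-tables` output, not a real API shape:

```python
def decide_action(matches, force_new=False):
    """Sketch of the decision table above.

    matches: list of (name_is_fuzzy, is_s3_tables, schema_compatible)
    tuples derived (hypothetically) from `aws glue get-tables` output.
    """
    if force_new:
        return "create"             # user explicitly requested a new table
    for name_is_fuzzy, is_s3_tables, schema_compatible in matches:
        if name_is_fuzzy or not is_s3_tables:
            return "stop-delegate"  # ambiguous name or non-S3-Tables match
        if schema_compatible:
            return "reuse"          # compatible S3 Tables table already exists
        return "confirm-recreate"   # same name, incompatible schema
    return "create"                 # no matching tables: proceed with Steps 1-8
```

The first matching row wins, mirroring the top-down precedence of the table.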
1. Verify Dependencies
Constraints:
- You MUST check whether AWS MCP server tools or AWS CLI are available and inform user if missing
- You MUST confirm target AWS region and verify credentials with `aws sts get-caller-identity`
2. Understand the Schema
- Explicit schema: Validate Iceberg types.
- Loose description: Ask for columns, types, and grain. Propose and confirm.
- Existing S3 data: Infer schema from file headers only. Create the empty table first, then use the `ingesting-into-data-lake` skill.

Constraints:
- You MUST read `references/best-practices.md` for Iceberg type mapping, partitions, and naming.
- You MUST ask for all required parameters upfront: table name, columns, types, partition strategy. For schema evolution, see `references/athena-ddl-path.md`.
- You MUST use all-lowercase names -- Glue rejects mixed case with `GENERIC_INTERNAL_ERROR`. Namespace and table names MUST NOT contain hyphens.
- You SHOULD suggest partition columns based on access patterns.
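The naming constraints above (all lowercase, no hyphens) are cheap to pre-check before calling Glue, avoiding an opaque `GENERIC_INTERNAL_ERROR` later. A minimal sketch, assuming lowercase letters, digits, and underscores are the safe character set:

```python
import re

def valid_glue_name(name: str) -> bool:
    """Pre-check a namespace or table name: all lowercase, no hyphens.

    Assumes [a-z0-9_] is the safe set; Glue rejects mixed case with
    GENERIC_INTERNAL_ERROR, so catching it client-side is friendlier.
    """
    return bool(re.fullmatch(r"[a-z0-9_]+", name))
```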
3. Create Table Bucket
Names: 3-63 chars; lowercase letters, numbers, hyphens.

```bash
aws s3tables create-table-bucket --name <BUCKET_NAME> --region <REGION>
```

Capture the `table-bucket-arn`. Encryption (SSE-S3 default, or SSE-KMS) and storage class (STANDARD, INTELLIGENT_TIERING) are set at creation. See `references/best-practices.md`.

Constraints:
- You MUST check existing buckets with `aws s3tables list-table-buckets` and ask the user to select one or create new.
- If using SSE-KMS, the KMS key policy MUST allow the S3 Tables maintenance service principal to read data. Search AWS docs for "S3 Tables KMS key policy" for the required policy.
- If bucket creation fails, see `references/best-practices.md` for common errors.
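The stated bucket-name rule (3-63 chars; lowercase letters, numbers, hyphens) can be validated the same way before the `create-table-bucket` call. A sketch of exactly that rule, nothing more:

```python
import re

def valid_table_bucket_name(name: str) -> bool:
    """Check the rule above: 3-63 chars of lowercase letters, digits, hyphens."""
    return bool(re.fullmatch(r"[a-z0-9-]{3,63}", name))
```

Note this checks only the constraint quoted in this step; AWS may enforce additional rules (such as no leading hyphen) that this sketch does not cover.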
4. Create Namespace
```bash
aws s3tables create-namespace --table-bucket-arn <ARN> --namespace <NAMESPACE>
```

Constraints:
- You MUST list existing namespaces first and suggest reusing one if relevant
- You MUST use lowercase names with no hyphens
5. Create Glue Data Catalog Integration
Check if `s3tablescatalog` exists (create once per region per account):

```bash
aws glue get-catalog --catalog-id s3tablescatalog
```

If not found, create it (requires `glue:CreateCatalog`, `glue:PassConnection`):

```bash
aws glue create-catalog --name "s3tablescatalog" --catalog-input '{
  "FederatedCatalog": {
    "Identifier": "arn:aws:s3tables:<REGION>:<ACCOUNT_ID>:bucket/*",
    "ConnectionName": "aws:s3tables"
  },
  "CreateDatabaseDefaultPermissions": [{"Principal": {"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"}, "Permissions": ["ALL"]}],
  "CreateTableDefaultPermissions": [{"Principal": {"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"}, "Permissions": ["ALL"]}],
  "AllowFullTableExternalDataAccess": "True"
}'
```

Verify with `aws glue get-catalogs --parent-catalog-id s3tablescatalog`.
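Rather than hand-editing the `<REGION>` and `<ACCOUNT_ID>` placeholders, the `--catalog-input` document can be generated. A sketch that renders exactly the JSON shown above:

```python
import json

def build_catalog_input(region: str, account_id: str) -> str:
    """Render the --catalog-input JSON for `aws glue create-catalog`."""
    allow_all = [{"Principal": {"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"},
                  "Permissions": ["ALL"]}]
    doc = {
        "FederatedCatalog": {
            "Identifier": f"arn:aws:s3tables:{region}:{account_id}:bucket/*",
            "ConnectionName": "aws:s3tables",
        },
        "CreateDatabaseDefaultPermissions": allow_all,
        "CreateTableDefaultPermissions": allow_all,
        "AllowFullTableExternalDataAccess": "True",
    }
    return json.dumps(doc)
```

Passing the result to the CLI avoids quoting mistakes in the embedded ARN.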
6. Configure Access Control
S3 Tables uses the `s3tables:*` IAM namespace (not `s3:*`).

Querying principal permissions (bucket policy):
- `s3tables:GetTableBucket`, `s3tables:GetNamespace`, `s3tables:GetTable`, `s3tables:GetTableMetadataLocation`, `s3tables:GetTableData`

Querying principal permissions (IAM policy):
- `glue:GetCatalog`, `glue:GetDatabase`, `glue:GetTable`

You MUST scope to the correct ARN patterns. You MUST read `references/access-control.md` for exact resource ARNs.

Constraints:
- You MUST ask the user for the querying principal ARN
- You MUST NOT grant broader permissions than necessary
- You MUST NOT create IAM roles automatically; verify existing roles and guide the user
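A least-privilege bucket policy for one querying principal can be built from exactly the actions listed above. This is a sketch: the statement ID is illustrative, and the `/table/*` resource suffix is an assumption; confirm the exact ARN patterns against `references/access-control.md` before use:

```python
import json

READ_ACTIONS = [
    "s3tables:GetTableBucket",
    "s3tables:GetNamespace",
    "s3tables:GetTable",
    "s3tables:GetTableMetadataLocation",
    "s3tables:GetTableData",
]

def build_read_policy(principal_arn: str, table_bucket_arn: str) -> str:
    """Render a read-only table-bucket policy for a single querying principal."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "S3TablesReadOnly",  # illustrative statement id
            "Effect": "Allow",
            "Principal": {"AWS": principal_arn},
            "Action": READ_ACTIONS,
            # Bucket plus tables under it; "/table/*" is an assumed pattern --
            # verify against references/access-control.md.
            "Resource": [table_bucket_arn, f"{table_bucket_arn}/table/*"],
        }],
    }
    return json.dumps(policy)
```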
7. Create the Table
| Context | Path |
|---|---|
| Default (any user) | S3 Tables API (below) |
| User specifically wants SQL DDL | Athena DDL (see `references/athena-ddl-path.md`) |
| Glue ETL pipeline | Spark DDL via `references/table-creation-glue-etl.md` |

Default: S3 Tables API:

```bash
aws s3tables create-table \
  --table-bucket-arn <ARN> \
  --namespace <NAMESPACE> \
  --name <TABLE_NAME> \
  --format ICEBERG \
  --metadata '<METADATA_JSON>'
```

The metadata JSON MUST nest under the `"iceberg"` key:

```json
{"iceberg":{"schema":{"fields":[
  {"name":"order_date","type":"date","required":true},
  {"name":"customer_id","type":"string","required":true},
  {"name":"amount","type":"double","required":false}
]},
"partitionSpec":{"fields":[
  {"sourceId":1,"fieldId":1000,"transform":"month","name":"order_date_month"}
]}}}
```

Constraints:
- `partitionSpec.sourceId` MUST reference a valid schema field ID
- For schema evolution after creation, use Athena DDL. See `references/athena-ddl-path.md`
- You MUST use `schemaV2` for complex types (list, map, struct) with explicit field IDs. See `references/best-practices.md`
- You SHOULD search AWS docs for "IcebergPartitionField S3 Tables" for supported partition transforms
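The `partitionSpec.sourceId` constraint can be enforced mechanically while building the metadata JSON. In this sketch field IDs are assumed to be assigned 1..N in declaration order, which matches the example above but is an assumption, not a documented guarantee:

```python
def build_metadata(fields, partition_fields):
    """Build the create-table metadata dict and validate partition sourceIds.

    fields: list of (name, type, required) tuples.
    partition_fields: list of (source_name, transform) tuples. Field IDs
    are assumed to be 1..N in declaration order (an assumption).
    """
    ids = {name: i + 1 for i, (name, _, _) in enumerate(fields)}
    spec = []
    for i, (source, transform) in enumerate(partition_fields):
        if source not in ids:
            raise ValueError(f"partition source {source!r} is not a schema field")
        spec.append({"sourceId": ids[source], "fieldId": 1000 + i,
                     "transform": transform, "name": f"{source}_{transform}"})
    return {"iceberg": {
        "schema": {"fields": [
            {"name": n, "type": t, "required": r} for n, t, r in fields]},
        "partitionSpec": {"fields": spec},
    }}
```

Rendering the example schema through this helper reproduces the JSON shown above, and a typo in a partition column fails fast instead of surfacing as a create-table error.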
8. Verify and Confirm
You MUST verify with `aws s3tables get-table` and confirm queryability via Athena with `DESCRIBE <table_name>`, using `--query-execution-context '{"Catalog":"s3tablescatalog/<BUCKET_NAME>","Database":"<NAMESPACE>"}'`. Do NOT put the catalog in the SQL. Present a summary: bucket ARN, namespace, table, schema, partitions.
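The same verification query can be issued from code via the Athena `StartQueryExecution` API; a sketch that builds the request parameters, keeping the catalog in the execution context rather than the SQL (the output S3 location is a caller-supplied placeholder):

```python
def athena_describe_params(bucket_name, namespace, table_name, output_s3):
    """Build StartQueryExecution kwargs to confirm the table is queryable.

    The catalog goes in QueryExecutionContext, never in the SQL itself.
    """
    return {
        "QueryString": f"DESCRIBE {table_name}",
        "QueryExecutionContext": {
            "Catalog": f"s3tablescatalog/{bucket_name}",
            "Database": namespace,
        },
        "ResultConfiguration": {"OutputLocation": output_s3},
    }
```

The resulting dict can be splatted into `boto3.client("athena").start_query_execution(**params)`.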
Troubleshooting
| Error | Cause | Fix |
|---|---|---|
| "Table location can not be specified" | LOCATION in CREATE TABLE | Remove the LOCATION clause. S3 Tables manages storage automatically. |
| Access denied | Using `s3:*` permissions | S3 Tables uses the `s3tables:*` namespace. See `references/access-control.md`. |
Additional Resources
- access-control.md -- IAM permissions, ARN patterns, permission errors
- best-practices.md -- Iceberg types, partitions, naming, common errors
- athena-ddl-path.md -- Athena DDL, schema evolution
- table-creation-glue-etl.md -- Spark DDL via Glue ETL
- Loading data: the `ingesting-into-data-lake` skill