ingesting-into-data-lake
Ingest into Data Lake
Move data from a source into a queryable table in the data lake. This skill assumes the source connection (if one is needed) already exists. For Glue connection setup or troubleshooting, delegate to connecting-to-data-source.
Philosophy
Default to S3 Tables unless the environment says otherwise. S3 Tables is the recommended target for new data lake work. If the user's catalog inventory shows they haven't adopted S3 Tables, recommend standard Iceberg on their existing general-purpose bucket instead of forcing them to change posture.
Common Tasks
You MUST execute commands using AWS MCP server tools when connected -- they provide validation, sandboxed execution, and audit logging. Fall back to AWS CLI only if MCP is unavailable. You MUST explain each step before executing.
Workflow
1. Verify Dependencies and Context
- You MUST check whether AWS MCP tools or AWS CLI are available and inform the user if missing
- You MUST confirm the target AWS region and verify credentials with aws sts get-caller-identity (see the sketch below)
- For SageMaker Unified Studio project roles, note that target tables and connections may be scoped to the project. See the caller ARN detection pattern in querying-data-lake.
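A minimal pre-flight sketch, assuming the AWS CLI fallback is in use rather than MCP tools:
```bash
# Confirm credentials resolve and capture the account and caller ARN.
aws sts get-caller-identity
# Confirm the region later commands will default to; override per call with --region if needed.
aws configure get region
```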
2. Classify the Source
| User says... | Source type | Reference |
|---|---|---|
| "upload my file", "local CSV", "move to S3" | Local file | local-upload.md |
| "load from S3", "import CSV/JSON/Parquet from s3://" | S3 files | s3-files.md |
| "import from Oracle/Postgres/MySQL/SQL Server/Redshift/RDS/Aurora" | JDBC | jdbc-ingest.md |
| "pull from Snowflake", "Snowflake table to S3" | Snowflake | snowflake-ingest.md |
| "import from BigQuery", "GCP analytics to S3" | BigQuery | bigquery-ingest.md |
| "export DynamoDB", "DynamoDB to data lake" | DynamoDB | dynamodb-ingest.md |
| "migrate Glue table", "convert Hive to Iceberg" | Catalog migration | catalog-migration.md |
If the user names Salesforce, ServiceNow, SAP, MongoDB, Kafka, or another SaaS/streaming source, decline -- these are not supported in this release.
If the source table is referenced by a fuzzy or business name ("migrate our orders table", "pull from the sales warehouse"), delegate to finding-data-lake-assets to resolve before proceeding.
3. Confirm Connection Exists (if applicable)
For JDBC, Snowflake, and BigQuery sources, a Glue connection is required. Check:
```bash
aws glue get-connection --name <CONNECTION_NAME> --region <REGION>
```
If the connection does not exist, stop and delegate to connecting-to-data-source to create and test it. Do not proceed with ingest until the connection is verified.
Local files, S3 files, DynamoDB, and catalog migration do not need a Glue connection.
4. Clarify the Target
You MUST ask the user (or suggest based on catalog inventory) before creating or writing to any table:
- Database/namespace: Does a specific target database exist? Or should one be created?
- Table: Existing table (append/merge) or new table (delegate to creating-data-lake-table)?
- Format: S3 Tables (default), standard Iceberg, or raw Parquet?
Inventory-aware defaults:
If you have already run exploring-data-catalog or can quickly check (see the probe sketch after this list), use what exists:
- Account has an s3tablescatalog federated catalog and active table buckets: recommend S3 Tables
- Account has general-purpose buckets with Iceberg tables and no S3 Tables usage: recommend standard Iceberg on their existing bucket
- Account uses Parquet/ORC on S3 without Iceberg metadata: ask whether to adopt Iceberg now (recommend yes) or continue with raw files
Do not force S3 Tables on customers who haven't adopted it. See iceberg-catalog-config-and-usage.md.
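If a quick inventory check is needed before recommending a format, two read-only probes usually suffice. A sketch assuming the AWS CLI; the region value is a placeholder:
```bash
# Any S3 table buckets? Their presence indicates S3 Tables is already adopted.
aws s3tables list-table-buckets --region us-east-1
# Databases in the default Glue catalog, to spot existing Iceberg/Hive databases.
aws glue get-databases --region us-east-1 --query 'DatabaseList[].Name'
```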
Delegations from this step:
- Target table doesn't exist -> creating-data-lake-table
- Target database named by fuzzy term -> finding-data-lake-assets
- User doesn't know what exists -> exploring-data-catalog
5. Execute Source Workflow
Read the source-specific reference and follow its phases. Each is self-contained with job templates, gotchas, and troubleshooting:
- Local / S3 / JDBC / Snowflake / BigQuery / DynamoDB / catalog migration -- one reference per source
Common job configuration for Glue 5.1 or higher and shared PySpark templates are in glue-job-config.md and glue-job-scripts.md.
6. Validate
Run all three checks; do not skip any:
- Row count matches expected (source vs target; see the sketch below)
- Null check on critical columns
- Spot-check 3-5 sample rows
See data-quality-validation.md.
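For the row-count check, a minimal Athena-based sketch; the workgroup, database, table, and output location are placeholders (S3 Tables targets are addressed through their federated catalog), and data-quality-validation.md covers the full workflow:
```bash
# Count rows in the target table and compare against the source count.
QUERY_ID=$(aws athena start-query-execution \
  --query-string "SELECT COUNT(*) FROM analytics_db.orders" \
  --work-group primary \
  --result-configuration OutputLocation=s3://my-athena-results/ \
  --query QueryExecutionId --output text)
# Wait for the query to finish, then read the single-cell result (row 0 is the header row).
aws athena get-query-execution --query-execution-id "$QUERY_ID" \
  --query 'QueryExecution.Status.State'
aws athena get-query-results --query-execution-id "$QUERY_ID" \
  --query 'ResultSet.Rows[1].Data[0].VarCharValue' --output text
```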
7. Schedule (if recurring)
For recurring pipelines, create a Glue Trigger with a cron schedule. See testing-and-scheduling.md. Simple single-step pipelines use Glue Triggers; multi-step pipelines with branching use MWAA.
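A sketch of the single-step case; the trigger name, job name, and schedule are placeholders, and testing-and-scheduling.md has the full patterns:
```bash
# Nightly run at 06:00 UTC against an already-created Glue job.
aws glue create-trigger \
  --name orders-ingest-nightly \
  --type SCHEDULED \
  --schedule "cron(0 6 * * ? *)" \
  --actions JobName=orders-s3tables-ingest \
  --start-on-creation \
  --region us-east-1
```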
Argument Routing
- S3 path only: Infer one-time load, start Step 2 with S3 files
- Connection name: Start Step 3 with the named connection
- Table name: Start Step 4, ask whether this is source or target
- --target flag: Pre-fill the target format in Step 4
- No args: Walk through interactively
Gotchas
- S3 Tables requires Glue 5.1 or higher and the job argument --datalake-formats iceberg
- All spark.sql.catalog.* config MUST go in --conf job arguments, never in spark.conf.set(). Glue 5.x throws AnalysisException: Cannot modify the value of a static config otherwise. See iceberg-catalog-config-and-usage.md for correct catalog configs (a CLI sketch follows this list).
- The warehouse parameter is required in S3 Tables catalog config. Without it Spark fails with "Cannot derive default warehouse location".
- Table and column names in S3 Tables MUST be all lowercase
- overwritePartitions() only replaces partitions present in the DataFrame -- for full refresh with deletes, use createOrReplace()
- Standard Iceberg targets MUST include a LOCATION clause; S3 Tables MUST NOT
- DynamoDB does not need a Glue connection -- do not attempt to create one
- Connection failures during ingest delegate back to connecting-to-data-source; do not debug network/credentials in this skill
- For target tables in SageMaker Unified Studio projects, ensure the project role has write access to the target namespace before the Glue job runs
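A hedged sketch of how those job arguments can look when the Glue job is created from the CLI. Every name, ARN, script path, and the exact Glue version string are placeholders, and the authoritative catalog properties live in iceberg-catalog-config-and-usage.md, not this snippet:
```bash
# Iceberg enabled via --datalake-formats; catalog config passed as --conf job arguments
# (never spark.conf.set() at runtime); warehouse points at the table bucket ARN; names lowercase.
aws glue create-job \
  --name orders-s3tables-ingest \
  --role GlueIngestRole \
  --glue-version "5.1" \
  --command Name=glueetl,ScriptLocation=s3://my-glue-scripts/ingest_orders.py,PythonVersion=3 \
  --default-arguments '{
    "--datalake-formats": "iceberg",
    "--conf": "spark.sql.catalog.s3tables=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.s3tables.catalog-impl=software.amazon.s3tables.iceberg.S3TablesCatalog --conf spark.sql.catalog.s3tables.warehouse=arn:aws:s3tables:us-east-1:111122223333:bucket/analytics"
  }' \
  --region us-east-1
```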
Troubleshooting
| Error | Likely cause | Action |
|---|---|---|
| Access Denied on S3 | Missing IAM permissions | Check Glue role has s3:GetObject, s3:PutObject |
| Access Denied on S3 Tables | Missing s3tables:* permissions | Add S3 Tables inline policy to Glue role |
| CTAS timeout | Dataset too large for Athena | Switch to Glue ETL or batch with WHERE filters |
| JDBC connection timeout/auth failure | Connection-level issue | Delegate to connecting-to-data-source |
| Throughput exceeded (DynamoDB) | Read percent too high | Lower the DynamoDB read percent setting |
See error-handling.md for the full catalog.
References
Source-specific
- local-upload.md -- Local files
- s3-files.md -- S3 files (CSV, JSON, Parquet, Avro, ORC)
- jdbc-ingest.md -- Oracle, SQL Server, PostgreSQL, MySQL, RDS, Aurora, Redshift
- snowflake-ingest.md -- Snowflake
- bigquery-ingest.md -- BigQuery
- dynamodb-ingest.md -- DynamoDB (export and Glue direct read)
- catalog-migration.md -- Existing Glue catalog tables (Hive, self-managed Iceberg)
Cross-cutting
- iceberg-catalog-config-and-usage.md -- S3 Tables, standard Iceberg, raw files: catalog config, engine access patterns
- glue-job-config.md -- Job sizing, monitoring, retry
- glue-job-scripts.md -- PySpark templates (append, upsert, custom SQL, full refresh)
- incremental-loading.md -- Watermark strategies
- testing-and-scheduling.md -- Glue Triggers, MWAA
- data-quality-validation.md -- Row counts, null checks, Glue Data Quality
- schema-evolution.md -- ALTER TABLE ADD COLUMNS, nested JSON
- type-transformations.md -- Type conflict resolution
- format-specific-loading.md -- CSV/JSON/Parquet/Avro/ORC specifics
- athena-loading.md -- Athena INSERT INTO as simple-load fallback
- error-handling.md -- Ingest errors (connection errors delegate to connecting-to-data-source)
- upload-options.md -- aws s3 cp vs sync, multipart
Migration-specific
- ctas-patterns.md -- Athena CTAS syntax and partition transforms
- glue-etl-migration.md -- Large-table migration via Glue 5.1 or higher PySpark
- migration-validation.md -- Full validation checklist
- migration-troubleshooting.md -- CTAS failures, visibility, partitions
JDBC-specific
- jdbc-schema-discovery.md -- Crawler, direct inspection, custom SQL
- jdbc-performance.md -- Parallel reads, partitioning