ingesting-into-data-lake
Ingest into Data Lake
Move data from a source into a queryable table in the data lake. This skill assumes the source connection (if one is needed) already exists. For Glue connection setup or troubleshooting, delegate to connecting-to-data-source.
Philosophy
Default to S3 Tables unless the environment says otherwise. S3 Tables is the recommended target for new data lake work. If the user's catalog inventory shows they haven't adopted S3 Tables, recommend standard Iceberg on their existing general-purpose bucket instead of forcing them to change posture.
Common Tasks
You MUST execute commands using AWS MCP server tools when connected -- they provide validation, sandboxed execution, and audit logging. Fall back to AWS CLI only if MCP is unavailable. You MUST explain each step before executing.
Workflow
1. Verify Dependencies and Context
- You MUST check whether AWS MCP tools or AWS CLI are available and inform the user if missing
- You MUST confirm the target AWS region and verify credentials with aws sts get-caller-identity (see the sketch below)
- For SageMaker Unified Studio project roles, note that target tables and connections may be scoped to the project. See the caller ARN detection pattern in querying-data-lake.
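A minimal pre-flight sketch, assuming the AWS CLI fallback is in use rather than MCP tools:
```bash
# Confirm credentials resolve and capture the account and caller ARN.
aws sts get-caller-identity
# Confirm the region later commands will default to; override per call with --region if needed.
aws configure get region
```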
2. Classify the Source
| User says... | Source type | Reference |
|---|---|---|
| "upload my file", "local CSV", "move to S3" | Local file | local-upload.md |
| "load from S3", "import CSV/JSON/Parquet from s3://" | S3 files | s3-files.md |
| "import from Oracle/Postgres/MySQL/SQL Server/Redshift/RDS/Aurora" | JDBC | jdbc-ingest.md |
| "pull from Snowflake", "Snowflake table to S3" | Snowflake | snowflake-ingest.md |
| "import from BigQuery", "GCP analytics to S3" | BigQuery | bigquery-ingest.md |
| "export DynamoDB", "DynamoDB to data lake" | DynamoDB | dynamodb-ingest.md |
| "migrate Glue table", "convert Hive to Iceberg" | Catalog migration | catalog-migration.md |
If the user names Salesforce, ServiceNow, SAP, MongoDB, Kafka, or another SaaS/streaming source, decline -- these are not supported in this release.
If the source table is referenced by a fuzzy or business name ("migrate our orders table", "pull from the sales warehouse"), delegate to finding-data-lake-assets to resolve before proceeding.
3. Confirm Connection Exists (if applicable)
For JDBC, Snowflake, and BigQuery sources, a Glue connection is required. Check:
```bash
aws glue get-connection --name <CONNECTION_NAME> --region <REGION>
```
If the connection does not exist, stop and delegate to connecting-to-data-source to create and test it. Do not proceed with ingest until the connection is verified.
Local files, S3 files, DynamoDB, and catalog migration do not need a Glue connection.
4. Clarify the Target
You MUST ask the user (or suggest based on catalog inventory) before creating or writing to any table:
- Database/namespace: Does a specific target database exist? Or should one be created?
- Table: Existing table (append/merge) or new table (delegate to creating-data-lake-table)?
- Format: S3 Tables (default), standard Iceberg, or raw Parquet?
Inventory-aware defaults:
If you have already run exploring-data-catalog or can quickly check (see the probe sketch after this list), use what exists:
- Account has an s3tablescatalog federated catalog and active table buckets: recommend S3 Tables
- Account has general-purpose buckets with Iceberg tables and no S3 Tables usage: recommend standard Iceberg on their existing bucket
- Account uses Parquet/ORC on S3 without Iceberg metadata: ask whether to adopt Iceberg now (recommend yes) or continue with raw files
Do not force S3 Tables on customers who haven't adopted it. See iceberg-catalog-config-and-usage.md.
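If a quick inventory check is needed before recommending a format, two read-only probes usually suffice. A sketch assuming the AWS CLI; the region value is a placeholder:
```bash
# Any S3 table buckets? Their presence indicates S3 Tables is already adopted.
aws s3tables list-table-buckets --region us-east-1
# Databases in the default Glue catalog, to spot existing Iceberg/Hive databases.
aws glue get-databases --region us-east-1 --query 'DatabaseList[].Name'
```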
Delegations from this step:
- Target table doesn't exist -> creating-data-lake-table
- Target database named by fuzzy term -> finding-data-lake-assets
- User doesn't know what exists -> exploring-data-catalog
5. Execute Source Workflow
Read the source-specific reference and follow its phases. Each is self-contained with job templates, gotchas, and troubleshooting:
- Local / S3 / JDBC / Snowflake / BigQuery / DynamoDB / catalog migration -- one reference per source
Common job configuration for Glue 5.1 or higher and shared PySpark templates are in glue-job-config.md and glue-job-scripts.md.
6. Validate
Run all three checks; do not skip any:
- Row count matches expected (source vs target; see the sketch below)
- Null check on critical columns
- Spot-check 3-5 sample rows
See data-quality-validation.md.
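For the row-count check, a minimal Athena-based sketch; the workgroup, database, table, and output location are placeholders (S3 Tables targets are addressed through their federated catalog), and data-quality-validation.md covers the full workflow:
```bash
# Count rows in the target table and compare against the source count.
QUERY_ID=$(aws athena start-query-execution \
  --query-string "SELECT COUNT(*) FROM analytics_db.orders" \
  --work-group primary \
  --result-configuration OutputLocation=s3://my-athena-results/ \
  --query QueryExecutionId --output text)
# Wait for the query to finish, then read the single-cell result (row 0 is the header row).
aws athena get-query-execution --query-execution-id "$QUERY_ID" \
  --query 'QueryExecution.Status.State'
aws athena get-query-results --query-execution-id "$QUERY_ID" \
  --query 'ResultSet.Rows[1].Data[0].VarCharValue' --output text
```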
7. Schedule (if recurring)
For recurring pipelines, create a Glue Trigger with a cron schedule. See testing-and-scheduling.md. Simple single-step pipelines use Glue Triggers; multi-step pipelines with branching use MWAA.
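A sketch of the single-step case; the trigger name, job name, and schedule are placeholders, and testing-and-scheduling.md has the full patterns:
```bash
# Nightly run at 06:00 UTC against an already-created Glue job.
aws glue create-trigger \
  --name orders-ingest-nightly \
  --type SCHEDULED \
  --schedule "cron(0 6 * * ? *)" \
  --actions JobName=orders-s3tables-ingest \
  --start-on-creation \
  --region us-east-1
```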
Argument Routing
- S3 path only: Infer one-time load, start Step 2 with S3 files
- Connection name: Start Step 3 with the named connection
- Table name: Start Step 4, ask whether this is source or target
- --target flag: Pre-fill the target format in Step 4
- No args: Walk through interactively
Gotchas
- S3 Tables requires Glue 5.1 or higher and the job argument --datalake-formats iceberg
- All spark.sql.catalog.* config MUST go in --conf job arguments, never in spark.conf.set(). Glue 5.x throws AnalysisException: Cannot modify the value of a static config otherwise. See iceberg-catalog-config-and-usage.md for correct catalog configs (a CLI sketch follows this list).
- The warehouse parameter is required in S3 Tables catalog config. Without it Spark fails with "Cannot derive default warehouse location".
- Table and column names in S3 Tables MUST be all lowercase
- overwritePartitions() only replaces partitions present in the DataFrame -- for full refresh with deletes, use createOrReplace()
- Standard Iceberg targets MUST include a LOCATION clause; S3 Tables MUST NOT
- DynamoDB does not need a Glue connection -- do not attempt to create one
- Connection failures during ingest delegate back to connecting-to-data-source; do not debug network/credentials in this skill
- For target tables in SageMaker Unified Studio projects, ensure the project role has write access to the target namespace before the Glue job runs
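A hedged sketch of how those job arguments can look when the Glue job is created from the CLI. Every name, ARN, script path, and the exact Glue version string are placeholders, and the authoritative catalog properties live in iceberg-catalog-config-and-usage.md, not this snippet:
```bash
# Iceberg enabled via --datalake-formats; catalog config passed as --conf job arguments
# (never spark.conf.set() at runtime); warehouse points at the table bucket ARN; names lowercase.
aws glue create-job \
  --name orders-s3tables-ingest \
  --role GlueIngestRole \
  --glue-version "5.1" \
  --command Name=glueetl,ScriptLocation=s3://my-glue-scripts/ingest_orders.py,PythonVersion=3 \
  --default-arguments '{
    "--datalake-formats": "iceberg",
    "--conf": "spark.sql.catalog.s3tables=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.s3tables.catalog-impl=software.amazon.s3tables.iceberg.S3TablesCatalog --conf spark.sql.catalog.s3tables.warehouse=arn:aws:s3tables:us-east-1:111122223333:bucket/analytics"
  }' \
  --region us-east-1
```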
Troubleshooting
| Error | Likely cause | Action |
|---|---|---|
| Access Denied on S3 | Missing IAM permissions | Check Glue role has s3:GetObject, s3:PutObject |
| Access Denied on S3 Tables | Missing s3tables:* permissions | Add S3 Tables inline policy to Glue role |
| CTAS timeout | Dataset too large for Athena | Switch to Glue ETL or batch with WHERE filters |
| JDBC connection timeout/auth failure | Connection-level issue | Delegate to connecting-to-data-source |
| Throughput exceeded (DynamoDB) | Read percent too high | Lower the DynamoDB read percent setting |
See error-handling.md for the full catalog.
References
Source-specific
- local-upload.md -- Local files
- s3-files.md -- S3 files (CSV, JSON, Parquet, Avro, ORC)
- jdbc-ingest.md -- Oracle, SQL Server, PostgreSQL, MySQL, RDS, Aurora, Redshift
- snowflake-ingest.md -- Snowflake
- bigquery-ingest.md -- BigQuery
- dynamodb-ingest.md -- DynamoDB (export and Glue direct read)
- catalog-migration.md -- Existing Glue catalog tables (Hive, self-managed Iceberg)
Cross-cutting
- iceberg-catalog-config-and-usage.md -- S3 Tables, standard Iceberg, raw files: catalog config, engine access patterns
- glue-job-config.md -- Job sizing, monitoring, retry
- glue-job-scripts.md -- PySpark templates (append, upsert, custom SQL, full refresh)
- incremental-loading.md -- Watermark strategies
- testing-and-scheduling.md -- Glue Triggers, MWAA
- data-quality-validation.md -- Row counts, null checks, Glue Data Quality
- schema-evolution.md -- ALTER TABLE ADD COLUMNS, nested JSON
- type-transformations.md -- Type conflict resolution
- format-specific-loading.md -- CSV/JSON/Parquet/Avro/ORC specifics
- athena-loading.md -- Athena INSERT INTO as simple-load fallback
- error-handling.md -- Ingest errors (connection errors delegate to connecting-to-data-source)
- upload-options.md -- aws s3 cp vs sync, multipart
Migration-specific
- ctas-patterns.md -- Athena CTAS syntax and partition transforms
- glue-etl-migration.md -- Large-table migration via Glue 5.1 or higher PySpark
- migration-validation.md -- Full validation checklist
- migration-troubleshooting.md -- CTAS failures, visibility, partitions
JDBC-specific
- jdbc-schema-discovery.md -- Crawler, direct inspection, custom SQL
- jdbc-performance.md -- Parallel reads, partitioning