querying-data-lake

Query Data Lake

Execute SQL queries on Amazon Athena across default and federated catalogs (Glue, S3 Tables, Redshift) with workgroup selection, statement classification, and error recovery.

Overview

Executes and manages Athena SQL queries across default and federated catalogs. Selects a workgroup, resolves target assets (delegating fuzzy references to finding-data-lake-assets), classifies statements for safety, and reports cost and data scanned. Use the AWS MCP server for sandboxed execution and audit logging; the same AWS CLI commands work directly when the MCP server is not available.
Constraints for parameter acquisition:
  • You MUST accept a single optional argument: SQL text, a named-query name, a workgroup name, a catalog name, or profile TABLE_NAME
  • You MUST accept the argument as direct text or a pointer to a file containing SQL
  • You MUST ask the user for the target AWS region if not already set
  • You MUST confirm the output S3 location before executing any non-trivial query
  • You MUST respect the user's decision to abort at any step

Common Tasks

1. Verify Dependencies

Check for required tools and AWS access before running queries.
Constraints:
  • You MUST verify AWS MCP server tools are available (aws___call_aws) and run queries through them when present; fall back to AWS CLI only if the MCP server is unavailable
  • You MUST NOT fall back to shell or Bash for query execution; results must be captured via the MCP tool or the aws athena CLI so output location and cost are tracked
  • You MUST confirm credentials with aws sts get-caller-identity and inform the user about any missing tools
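A minimal dependency check, assuming the MCP tool name above (fall back to the plain CLI only when the MCP server is absent):
# Preferred: run through the MCP server for sandboxed execution and audit logging
aws___call_aws(command="aws sts get-caller-identity")
# Fallback: direct CLI when the MCP server is unavailable
aws sts get-caller-identity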

2. Resolve Workgroup

Check caller identity, list workgroups, auto-select the best one (see workgroup-selection.md).
Constraints:
  • You MUST select a workgroup before submitting any query (prevents output-location errors)
  • You MUST present the selected workgroup and its output location to the user
  • You MUST NOT auto-escalate to a different workgroup on failure without user confirmation
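Workgroup discovery uses standard Athena calls (a sketch; run them through aws___call_aws when the MCP server is present):
# Enumerate candidate workgroups
aws athena list-work-groups
# Inspect the selected workgroup, including its configured output location
aws athena get-work-group --work-group <WORKGROUP_NAME>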

3. Resolve the Target Asset

If the user refers to a table by name, by business concept ("our quarterly report", "the sales data"), by S3 path, or by catalog without specifying the table, delegate to finding-data-lake-assets to return the concrete database.table (and catalog if non-default).
Constraints:
  • You MUST NOT attempt to resolve fuzzy asset references with athena list-data-catalogs or by iterating get-tables; those miss federated catalogs and waste tokens
  • You SHOULD skip this step only when the user provides a fully-qualified reference (exact database.table) or raw SQL they want executed as-is
  • You MUST state the resolved asset explicitly before building the query: "Found [table] in [catalog]. Using this for the query."
  • You SHOULD default to the default Glue catalog unless the user mentions "federated", "Redshift", "S3 Tables", or finding-data-lake-assets returns a different catalog
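For illustration (hypothetical names), references that skip resolution versus ones to delegate:
sales_db.orders                      # fully qualified: use as-is
"the sales data"                     # business concept: delegate to finding-data-lake-assets
s3://analytics-bucket/raw/orders/    # S3 path: delegate to finding-data-lake-assets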

4. Discover Schema

For analytical queries, you SHOULD profile the target table before building the final query. You MUST show sample rows (SELECT ... LIMIT 5) as part of profiling.
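A profiling sketch (table name hypothetical; see query-patterns.md for the full pattern):
-- Sample rows to inspect shape and values
SELECT * FROM sales_db.orders LIMIT 5;
-- Column names and types
DESCRIBE sales_db.orders;
-- Row count before running heavier aggregations
SELECT COUNT(*) FROM sales_db.orders;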

5. Build Query

Table addressing depends on catalog type:
  • Default Glue catalog: database.table (omit the catalog prefix for single-catalog queries). In cross-catalog queries, qualify default-catalog tables with "awsdatacatalog".database.table.
  • Registered data source: datasource.database.table
  • Unregistered Glue catalog: "catalog/subcatalog".database.table

6. Classify and Execute

Classify the SQL statement before executing:

| Statement | Behavior |
| --- | --- |
| SELECT, SHOW, DESCRIBE, EXPLAIN | Safe: execute |
| INSERT, UPDATE, DELETE, DROP, ALTER, CREATE, TRUNCATE, MERGE | Destructive: warn the user and require explicit confirmation |
| Unsure | Treat as destructive; confirm |
Example tool call (via AWS MCP server):
aws___call_aws(command="aws athena start-query-execution --work-group <WORKGROUP_NAME> --query-string '<sql>' --query-execution-context Database=<db>")
For federated or S3 Tables catalogs, also set Catalog=<CATALOG_PATH> in the execution context (e.g. Catalog=s3tablescatalog/<BUCKET_NAME>).
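For example, an S3 Tables variant of the call above (a sketch; CLI shorthand syntax for the execution context assumed):
aws___call_aws(command="aws athena start-query-execution --work-group <WORKGROUP_NAME> --query-string '<sql>' --query-execution-context Catalog=s3tablescatalog/<BUCKET_NAME>,Database=<db>")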
Constraints:
  • You MUST warn the user before executing when the target is Redshift-federated ("No partition pruning — every query scans the full table")
  • You MUST warn the user before executing a cross-catalog join ("Cross-catalog joins incur network overhead and may be slow")
  • You MUST confirm the output S3 location before executing
  • You MUST explain which tool is being called before executing
  • You MUST respect the user's decision to abort
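After submission, status and results come from the standard follow-up calls (<QUERY_ID> is the execution ID returned by start-query-execution):
# Poll until Status.State is SUCCEEDED or FAILED; Statistics reports data scanned and runtime
aws___call_aws(command="aws athena get-query-execution --query-execution-id <QUERY_ID>")
# Fetch result rows once the query succeeds
aws___call_aws(command="aws athena get-query-results --query-execution-id <QUERY_ID> --max-results 50")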

7. Present and Recover

Present results with cost, data scanned, duration, and actionable insights. On failure, list available workgroups and let the user choose which to retry with.
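Data scanned comes from the execution statistics, and cost can be estimated from it (a sketch; the per-TB scan rate varies by region and pricing model):
aws___call_aws(command="aws athena get-query-execution --query-execution-id <QUERY_ID> --query 'QueryExecution.Statistics'")
# Estimated cost ~ DataScannedInBytes / 1e12 * regional per-TB scan rate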

Argument Routing

Resolve in this order; stop at the first match:
  1. Contains SQL keywords (SELECT, SHOW, DESCRIBE, INSERT, etc.): SQL text, execute directly
  2. profile TABLE_NAME: run comprehensive table profiling (see query-patterns.md)
  3. Matches a known named query: look up and execute
  4. Matches a known workgroup: show workgroup status and recent queries
  5. Matches a known catalog: delegate to exploring-data-catalog to enumerate databases and tables
  6. No args: show recent query activity and available tables
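Example arguments and how they route (all names hypothetical):
SELECT * FROM sales_db.orders LIMIT 10   # rule 1: SQL text, execute directly
profile sales_db.orders                  # rule 2: table profiling
daily_revenue_by_region                  # rule 3: named query, if one matches
analytics-wg                             # rule 4: workgroup status
s3tablescatalog/analytics                # rule 5: delegate to exploring-data-catalog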

Principles

  • Always select workgroup before executing (prevents output-location errors)
  • Profile unfamiliar tables before running analytical queries
  • Present cost alongside results so users build cost awareness
  • Suggest LIMIT for exploratory queries on large tables
  • Never ask domain questions with obvious answers, but always confirm security-relevant actions (workgroup switches, output location changes, non-SELECT statements)

Troubleshooting

| Error | Cause | Fix |
| --- | --- | --- |
| Redshift identifier error with mixed case | Redshift-federated names are lowercase only | Lowercase the identifier |
| CatalogId validation failure | ARN passed instead of catalog name | Pass the catalog name, not the ARN |
| Cross-catalog information_schema returns nothing | Missing catalog qualifier | Use the catalog-qualified path: "catalog".information_schema.tables |
| Query fails with output-location error | Workgroup has no output location configured | Select a different workgroup with an output location, or configure one |
| Destructive statement executed without confirmation | Statement classification skipped | Always classify INSERT/UPDATE/DELETE/DROP/ALTER/CREATE/TRUNCATE/MERGE and confirm with the user |
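For the information_schema case, a catalog-qualified probe looks like this (catalog and schema names hypothetical):
SELECT table_name
FROM "my_federated_catalog".information_schema.tables
WHERE table_schema = 'sales_db';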

Additional Resources
