querying-data-lake

Query Data Lake

Execute SQL queries on Amazon Athena across default and federated catalogs (Glue, S3 Tables, Redshift) with workgroup selection, statement classification, and error recovery.

Overview

Executes and manages Athena SQL queries across default and federated catalogs. Selects a workgroup, resolves target assets (delegating fuzzy references to finding-data-lake-assets), classifies statements for safety, and reports cost and data scanned. Use the AWS MCP server for sandboxed execution and audit logging; the same AWS CLI commands work directly when the MCP server is not available.
Constraints for parameter acquisition:
  • You MUST accept a single optional argument: SQL text, a named-query name, a workgroup name, a catalog name, or profile TABLE_NAME
  • You MUST accept the argument as direct text or a pointer to a file containing SQL
  • You MUST ask the user for the target AWS region if not already set
  • You MUST confirm the output S3 location before executing any non-trivial query
  • You MUST respect the user's decision to abort at any step

Common Tasks

1. Verify Dependencies

Check for required tools and AWS access before running queries.
Constraints:
  • You MUST verify AWS MCP server tools are available (aws___call_aws) and run queries through them when present; fall back to AWS CLI only if the MCP server is unavailable
  • You MUST NOT fall back to shell or Bash for query execution; results must be captured via the MCP tool or the aws athena CLI so output location and cost are tracked
  • You MUST confirm credentials with aws sts get-caller-identity and inform the user about any missing tools
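A minimal dependency check, assuming the MCP tool name above (fall back to the plain CLI only when the MCP server is absent):
# Preferred: run through the MCP server for sandboxed execution and audit logging
aws___call_aws(command="aws sts get-caller-identity")
# Fallback: direct CLI when the MCP server is unavailable
aws sts get-caller-identity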

2. Resolve Workgroup

Check caller identity, list workgroups, auto-select the best one (see workgroup-selection.md).
Constraints:
  • You MUST select a workgroup before submitting any query (prevents output-location errors)
  • You MUST present the selected workgroup and its output location to the user
  • You MUST NOT auto-escalate to a different workgroup on failure without user confirmation
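Workgroup discovery uses standard Athena calls (a sketch; run them through aws___call_aws when the MCP server is present):
# Enumerate candidate workgroups
aws athena list-work-groups
# Inspect the selected workgroup, including its configured output location
aws athena get-work-group --work-group <WORKGROUP_NAME>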

3. Resolve the Target Asset

If the user refers to a table by name, by business concept ("our quarterly report", "the sales data"), by S3 path, or by catalog without specifying the table, delegate to finding-data-lake-assets to return the concrete database.table (and catalog if non-default).
Constraints:
  • You MUST NOT attempt to resolve fuzzy asset references with athena list-data-catalogs or by iterating get-tables; those miss federated catalogs and waste tokens
  • You SHOULD skip this step only when the user provides a fully-qualified reference (exact database.table) or raw SQL they want executed as-is
  • You MUST state the resolved asset explicitly before building the query: "Found [table] in [catalog]. Using this for the query."
  • You SHOULD default to the default Glue catalog unless the user mentions "federated", "Redshift", "S3 Tables", or finding-data-lake-assets returns a different catalog
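For illustration (hypothetical names), references that skip resolution versus ones to delegate:
sales_db.orders                      # fully qualified: use as-is
"the sales data"                     # business concept: delegate to finding-data-lake-assets
s3://analytics-bucket/raw/orders/    # S3 path: delegate to finding-data-lake-assets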

4. Discover Schema

For analytical queries, you SHOULD profile the target table before building the final query. You MUST show sample rows (SELECT ... LIMIT 5) as part of profiling.
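A profiling sketch (table name hypothetical; see query-patterns.md for the full pattern):
-- Sample rows to inspect shape and values
SELECT * FROM sales_db.orders LIMIT 5;
-- Column names and types
DESCRIBE sales_db.orders;
-- Row count before running heavier aggregations
SELECT COUNT(*) FROM sales_db.orders;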

5. Build Query

Table addressing depends on catalog type:
  • Default Glue catalog: database.table (omit the catalog prefix for single-catalog queries). In cross-catalog queries, qualify default-catalog tables with "awsdatacatalog".database.table.
  • Registered data source: datasource.database.table
  • Unregistered Glue catalog: "catalog/subcatalog".database.table

6. Classify and Execute

Classify the SQL statement before executing:

| Statement | Behavior |
| --- | --- |
| SELECT, SHOW, DESCRIBE, EXPLAIN | Safe: execute |
| INSERT, UPDATE, DELETE, DROP, ALTER, CREATE, TRUNCATE, MERGE | Destructive: warn the user and require explicit confirmation |
| Unsure | Treat as destructive; confirm |
Example tool call (via AWS MCP server):
aws___call_aws(command="aws athena start-query-execution --work-group <WORKGROUP_NAME> --query-string '<sql>' --query-execution-context Database=<db>")
For federated or S3 Tables catalogs, also set Catalog=<CATALOG_PATH> in the execution context (e.g. Catalog=s3tablescatalog/<BUCKET_NAME>).
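For example, an S3 Tables variant of the call above (a sketch; CLI shorthand syntax for the execution context assumed):
aws___call_aws(command="aws athena start-query-execution --work-group <WORKGROUP_NAME> --query-string '<sql>' --query-execution-context Catalog=s3tablescatalog/<BUCKET_NAME>,Database=<db>")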
Constraints:
  • You MUST warn the user before executing when the target is Redshift-federated ("No partition pruning — every query scans the full table")
  • You MUST warn the user before executing a cross-catalog join ("Cross-catalog joins incur network overhead and may be slow")
  • You MUST confirm the output S3 location before executing
  • You MUST explain which tool is being called before executing
  • You MUST respect the user's decision to abort
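After submission, status and results come from the standard follow-up calls (<QUERY_ID> is the execution ID returned by start-query-execution):
# Poll until Status.State is SUCCEEDED or FAILED; Statistics reports data scanned and runtime
aws___call_aws(command="aws athena get-query-execution --query-execution-id <QUERY_ID>")
# Fetch result rows once the query succeeds
aws___call_aws(command="aws athena get-query-results --query-execution-id <QUERY_ID> --max-results 50")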

7. Present and Recover

Present results with cost, data scanned, duration, and actionable insights. On failure, list available workgroups and let the user choose which to retry with.
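Data scanned comes from the execution statistics, and cost can be estimated from it (a sketch; the per-TB scan rate varies by region and pricing model):
aws___call_aws(command="aws athena get-query-execution --query-execution-id <QUERY_ID> --query 'QueryExecution.Statistics'")
# Estimated cost ~ DataScannedInBytes / 1e12 * regional per-TB scan rate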

Argument Routing

Resolve in this order; stop at the first match:
  1. Contains SQL keywords (SELECT, SHOW, DESCRIBE, INSERT, etc.): SQL text, execute directly
  2. profile TABLE_NAME: run comprehensive table profiling (see query-patterns.md)
  3. Matches a known named query: look up and execute
  4. Matches a known workgroup: show workgroup status and recent queries
  5. Matches a known catalog: delegate to exploring-data-catalog to enumerate databases and tables
  6. No args: show recent query activity and available tables
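Example arguments and how they route (all names hypothetical):
SELECT * FROM sales_db.orders LIMIT 10   # rule 1: SQL text, execute directly
profile sales_db.orders                  # rule 2: table profiling
daily_revenue_by_region                  # rule 3: named query, if one matches
analytics-wg                             # rule 4: workgroup status
s3tablescatalog/analytics                # rule 5: delegate to exploring-data-catalog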

Principles

  • Always select workgroup before executing (prevents output-location errors)
  • Profile unfamiliar tables before running analytical queries
  • Present cost alongside results so users build cost awareness
  • Suggest LIMIT for exploratory queries on large tables
  • Never ask domain questions with obvious answers, but always confirm security-relevant actions (workgroup switches, output location changes, non-SELECT statements)

Troubleshooting

| Error | Cause | Fix |
| --- | --- | --- |
| Redshift identifier error with mixed case | Redshift-federated names are lowercase only | Lowercase the identifier |
| CatalogId validation failure | ARN passed instead of catalog name | Pass the catalog name, not the ARN |
| Cross-catalog information_schema returns nothing | Missing catalog qualifier | Use the catalog-qualified path: "catalog".information_schema.tables |
| Query fails with output-location error | Workgroup has no output location configured | Select a different workgroup with an output location, or configure one |
| Destructive statement executed without confirmation | Statement classification skipped | Always classify INSERT/UPDATE/DELETE/DROP/ALTER/CREATE/TRUNCATE/MERGE and confirm with the user |
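For the information_schema case, a catalog-qualified probe looks like this (catalog and schema names hypothetical):
SELECT table_name
FROM "my_federated_catalog".information_schema.tables
WHERE table_schema = 'sales_db';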

Additional Resources
