querying-data-lake
Query Data Lake
Execute SQL queries on Amazon Athena across default and federated catalogs (Glue, S3 Tables, Redshift) with workgroup selection, statement classification, and error recovery.
Overview
Executes and manages Athena SQL queries across default and federated catalogs. Selects a workgroup, resolves target assets (delegating fuzzy references to `finding-data-lake-assets`), classifies statements for safety, and reports cost and data scanned. Use the AWS MCP server for sandboxed execution and audit logging; the same AWS CLI commands work directly when the MCP server is not available.
Constraints for parameter acquisition:
- You MUST accept a single optional argument: SQL text, a named-query name, a workgroup name, a catalog name, or `profile TABLE_NAME`
- You MUST accept the argument as direct text or a pointer to a file containing SQL
- You MUST ask the user for the target AWS region if not already set
- You MUST confirm the output S3 location before executing any non-trivial query
- You MUST respect the user's decision to abort at any step
Common Tasks
1. Verify Dependencies
Check for required tools and AWS access before running queries.
Constraints:
- You MUST verify AWS MCP server tools are available (`aws___call_aws`) and run queries through them when present; fall back to AWS CLI only if the MCP server is unavailable
- You MUST NOT fall back to shell or Bash for query execution — results must be captured via the MCP tool or the `aws athena` CLI so output location and cost are tracked
- You MUST confirm credentials with `aws sts get-caller-identity` and inform the user about any missing tools
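A minimal sketch of the dependency gate described above, as pure logic. The function name and the shape of `available_mcp_tools` are illustrative assumptions; only the tool name `aws___call_aws` comes from this document.

```python
import shutil

# MCP tool name for sandboxed execution (from the doc); falling back to
# the AWS CLI is allowed, falling back to raw shell/Bash is not.
MCP_QUERY_TOOL = "aws___call_aws"

def resolve_execution_path(available_mcp_tools: set[str]) -> str:
    """Decide how queries will be executed, preferring the MCP server."""
    if MCP_QUERY_TOOL in available_mcp_tools:
        return "mcp"
    if shutil.which("aws"):  # AWS CLI present on PATH
        return "cli"
    raise RuntimeError(
        "Neither the AWS MCP server nor the aws CLI is available; "
        "inform the user before proceeding."
    )
```

Either path keeps execution inside a tool whose output location and cost can be tracked, which is why plain shell is never returned as an option.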
2. Resolve Workgroup
Check caller identity, list workgroups, auto-select the best one (see workgroup-selection.md).
Constraints:
- You MUST select a workgroup before submitting any query (prevents output-location errors)
- You MUST present the selected workgroup and its output location to the user
- You MUST NOT auto-escalate to a different workgroup on failure without user confirmation
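The auto-selection step can be sketched as pure logic over workgroup descriptions like those returned by `aws athena get-work-group`. The preference rule shown here (first enabled workgroup with an output location configured) is an assumption standing in for workgroup-selection.md, and the dict shape is illustrative.

```python
def select_workgroup(workgroups: list[dict]) -> dict:
    """Pick the first enabled workgroup that has an output location.

    Each dict is assumed to look like:
    {"Name": ..., "State": "ENABLED"|"DISABLED", "OutputLocation": str|None}
    """
    candidates = [
        wg for wg in workgroups
        if wg.get("State") == "ENABLED" and wg.get("OutputLocation")
    ]
    if not candidates:
        raise LookupError(
            "No enabled workgroup with an output location; ask the user."
        )
    chosen = candidates[0]
    # Present the selection to the user before any query is submitted.
    print(f"Using workgroup {chosen['Name']} -> {chosen['OutputLocation']}")
    return chosen
```

Filtering on the output location up front is what prevents the output-location failures called out in the constraints.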
3. Resolve the Target Asset
If the user refers to a table by name, by business concept ("our quarterly report", "the sales data"), by S3 path, or by catalog without specifying the table, delegate to `finding-data-lake-assets` to return the concrete `database.table` (and catalog if non-default).
Constraints:
- You MUST NOT attempt to resolve fuzzy asset references with `athena list-data-catalogs` or by iterating `get-tables` — those miss federated catalogs and waste tokens
- You SHOULD skip this step only when the user provides a fully-qualified reference (exact `database.table`) or raw SQL they want executed as-is
- You MUST state the resolved asset explicitly before building the query: "Found [table] in [catalog]. Using this for the query."
- You SHOULD default to the default Glue catalog unless the user mentions "federated", "Redshift", "S3 Tables", or `finding-data-lake-assets` returns a different catalog
4. Discover Schema
For analytical queries, You SHOULD profile the target table before building the final query. You MUST show sample rows (`SELECT ... LIMIT 5`) as part of profiling.
5. Build Query
Table addressing depends on catalog type:
- Default Glue catalog: `database.table` (omit the catalog prefix for single-catalog queries). In cross-catalog queries, qualify default-catalog tables with `"awsdatacatalog".database.table`
- Registered data source: `datasource.database.table`
- Unregistered Glue catalog: `"catalog/subcatalog".database.table`
6. Classify and Execute
Classify the SQL statement before executing:
| Statement | Behavior |
|---|---|
| Safe | Execute |
| Destructive | Warn the user and require explicit confirmation |
| Unsure | Treat as destructive; confirm |
Example tool call (via AWS MCP server):
`aws___call_aws(command="aws athena start-query-execution --work-group <WORKGROUP_NAME> --query-string '<sql>' --query-execution-context Database=<db>")`
For federated or S3 Tables catalogs, also set `Catalog=<CATALOG_PATH>` in the execution context (e.g. `Catalog=s3tablescatalog/<BUCKET_NAME>`).
Constraints:
- You MUST warn the user before executing when the target is Redshift-federated ("No partition pruning — every query scans the full table")
- You MUST warn the user before executing a cross-catalog join ("Cross-catalog joins incur network overhead and may be slow")
- You MUST confirm the output S3 location before executing
- You MUST explain which tool is being called before executing
- You MUST respect the user's decision to abort
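A keyword-based sketch of the classification step. The specific keyword lists are an assumption (the table above names only the categories); anything that doesn't match falls into the "unsure" case and is treated as destructive.

```python
# Assumed keyword lists; extend to match your Athena workload.
READ_ONLY = {"SELECT", "SHOW", "DESCRIBE", "EXPLAIN", "WITH"}
DESTRUCTIVE = {"DROP", "DELETE", "TRUNCATE", "INSERT",
               "UPDATE", "CREATE", "ALTER", "MERGE"}

def classify_statement(sql: str) -> str:
    """Return 'safe' or 'destructive'; unsure defaults to destructive."""
    first = sql.lstrip().split(None, 1)[0].upper() if sql.strip() else ""
    if first in READ_ONLY:
        return "safe"
    if first in DESTRUCTIVE:
        return "destructive"
    # Unsure: per the table, treat as destructive and ask for confirmation.
    return "destructive"
```

Note that classifying on the leading keyword alone is deliberately conservative; a `WITH ... INSERT` statement would need deeper parsing, which is another reason the unsure branch requires confirmation.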
7. Present and Recover
Present results with cost, data scanned, duration, and actionable insights. On failure, list available workgroups and let the user choose which to retry with.
Argument Routing
Resolve in this order; stop at the first match:
- Contains SQL keywords (`SELECT`, `INSERT`, `SHOW`, `DESCRIBE`, etc.) — SQL text, execute directly
- `profile TABLE_NAME` — run comprehensive table profiling (see query-patterns.md)
- Matches a known named query — look up and execute
- Matches a known workgroup — show workgroup status and recent queries
- Matches a known catalog — delegate to `exploring-data-catalog` to enumerate databases and tables
- No args — show recent query activity and available tables
Principles
- Always select workgroup before executing (prevents output-location errors)
- Profile unfamiliar tables before running analytical queries
- Present cost alongside results so users build cost awareness
- Suggest `LIMIT` for exploratory queries on large tables
- Never ask domain questions with obvious answers, but always confirm security-relevant actions (workgroup switches, output location changes, non-SELECT statements)
Troubleshooting
| Error | Cause | Fix |
|---|---|---|
| Redshift identifier error with mixed case | Redshift-federated names are lowercase only | Lowercase the identifier |
| Catalog lookup fails | ARN passed instead of catalog name | Pass the catalog name, not the ARN |
| Cross-catalog query cannot find the table | Missing catalog qualifier | Use a catalog-qualified path, e.g. `"catalog".database.table` |
| Query fails with output-location error | Workgroup has no output location configured | Select a different workgroup with an output location, or configure one |
| Destructive statement executed without confirmation | Statement classification skipped | Always classify before executing |
Additional Resources
- Workgroup selection logic
- Common query patterns
- Athena best practices
- Athena federated query