Find Data Lake Assets
Overview
Resolves data lake asset references to concrete catalog entries. Acts as a
resolver for other skills and direct user requests. Covers Glue,
S3, S3 Tables, and Redshift. Optimized for low token usage — return the
answer fast and get out of the way.
Constraints for parameter acquisition:
- You MUST accept a single argument: table name, keyword, column name, or S3 path
- You MUST accept the argument as direct input or a pointer to a file containing the spec
- You MUST ask for the target AWS region if not already set
- You MUST confirm ambiguous input before searching (e.g., "Did you mean table X or bucket Y?")
- You MUST respect the user's decision to abort at any step
Common Tasks
You MUST execute commands using AWS MCP server tools when connected — they
provide validation, sandboxed execution, and audit logging. Fall back to
AWS CLI only if MCP is unavailable. You MUST explain each step before
executing.
1. Verify Dependencies
Check for required tools and AWS access before searching.
Constraints:
- You MUST verify AWS MCP server tools (aws___call_aws) are available; fall back to AWS CLI if not
- You MUST confirm credentials with aws sts get-caller-identity
- You MUST inform the user about any missing tools and ask whether to proceed
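If the credential check runs inside a script rather than through the CLI, a minimal boto3 sketch of the same probe might look like this (verify_aws_access is a hypothetical helper name, not part of this skill):

```python
import boto3
from botocore.exceptions import BotoCoreError, ClientError

def verify_aws_access(region: str) -> bool:
    """Hypothetical helper: scripted equivalent of `aws sts get-caller-identity`."""
    try:
        identity = boto3.client("sts", region_name=region).get_caller_identity()
        # A successful call proves the credential chain resolves.
        print(f"Authenticated as {identity['Arn']} (account {identity['Account']})")
        return True
    except (BotoCoreError, ClientError) as err:
        print(f"AWS access check failed: {err}")
        return False
```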
2. Classify the Request
Determine the mode:
- Resolve (most common): User/skill references something specific. Signals: possessive/definite articles ("our X table", "the Y dataset") imply the asset exists. Goal: find it, return the reference, done.
- Search: User is exploring. Signals: "find tables with", "what has customer_id". Goal: rank candidates, present top matches.
You SHOULD default to Resolve mode when ambiguous.
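The split is heuristic. Purely as an illustration, a keyword-based triage could look like the sketch below; the signal lists are assumptions, not an exhaustive spec:

```python
import re

# Hypothetical signal lists; extend with whatever phrasing your users favor.
SEARCH_SIGNALS = re.compile(r"\b(find|what (has|have)|which tables|explore)\b", re.I)
RESOLVE_SIGNALS = re.compile(r"\b(our|the|my)\s+\w+\s+(table|dataset|bucket)\b", re.I)

def classify(request: str) -> str:
    """Default to Resolve when signals are absent or conflicting."""
    if SEARCH_SIGNALS.search(request) and not RESOLVE_SIGNALS.search(request):
        return "search"
    return "resolve"
```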
3. Extract Search Terms
Parse the request into search dimensions:
- Name terms: Table or database names mentioned
- Domain terms: Business concepts (billing, orders, churn)
- Column terms: Specific column names (customer_id, event_type)
- Location terms: S3 paths, bucket names, prefixes
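One way to carry these dimensions through the later layers is a small record per request; the structure below is illustrative only:

```python
from dataclasses import dataclass, field

@dataclass
class SearchTerms:
    """Illustrative container for the four dimensions above."""
    name_terms: list = field(default_factory=list)      # e.g. ["orders"]
    domain_terms: list = field(default_factory=list)    # e.g. ["billing"]
    column_terms: list = field(default_factory=list)    # e.g. ["customer_id"]
    location_terms: list = field(default_factory=list)  # e.g. ["s3://lake/raw/"]

terms = SearchTerms(name_terms=["orders"], column_terms=["customer_id"])
```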
4. Layered Search (stop early)
Search sources in order. Stop at the first layer that returns a
high-confidence match. Do NOT search all layers every time.
You MUST track which layers were searched and which were skipped.
Report this in the output (see Step 6).
Layer 1: Glue Data Catalog (always start here)
You SHOULD use SearchTables as the primary API — it searches table
names, column names, and column comments across the entire catalog in
one call. You MUST NOT loop over databases with get-tables unless
you already know the database name. See search-strategy.md for patterns.

```
aws glue search-tables --search-text "orders"
aws glue get-tables --database-name sales --expression "order.*"
```

Layer 2: S3 Reverse Lookup (S3 path provided)
When a user provides an S3 path, you SHOULD default to reverse lookup first —
they usually want the Glue table, not the file contents.

```
aws glue search-tables --search-text "<path-keyword>"
aws s3api list-objects-v2 --bucket <bucket-name> --prefix <prefix>
```

Layer 3: Redshift Catalog (if user mentions Redshift, warehouse, or lakehouse)

```sql
SELECT schema_name, table_name, table_type
FROM svv_all_tables
WHERE table_name ILIKE '%orders%';
```

Redshift Spectrum external tables also appear in Glue. If Layer 1
found the table with a Spectrum SerDe, skip Layer 3.
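Where the Layer 1 lookup runs as a script instead of the CLI or MCP tools, the same call maps to boto3's search_tables. A minimal sketch, with pagination handled manually through NextToken (region and search text are placeholders):

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# One cross-catalog call: matches table names, column names, and comments.
kwargs = {"SearchText": "orders", "MaxResults": 100}
while True:
    resp = glue.search_tables(**kwargs)
    for tbl in resp["TableList"]:
        loc = tbl.get("StorageDescriptor", {}).get("Location", "")
        print(f"{tbl['DatabaseName']}.{tbl['Name']}  {loc}")
    token = resp.get("NextToken")
    if not token:
        break
    kwargs["NextToken"] = token
```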
4b. Broad Scan Fallback (single turn)
When search-tables returns nothing and S3 Tables enumeration also
misses, you MAY need to scan across databases. Do NOT issue separate
CLI calls per database — that burns turns and tokens. Instead, write a
short Python script using boto3 paginators that does the full scan in
one execution. Write the script to a file and run it with python3.

The script MUST:
- Paginate get_databases() to collect all database names
- For each database, paginate get_tables() with an Expression filter matching the search term
- Print only matching results as structured output (JSON or table)
- Accept the region and search term as arguments or variables

```python
import boto3, sys, json

region = sys.argv[1]
term = sys.argv[2]
glue = boto3.client("glue", region_name=region)
matches = []
db_paginator = glue.get_paginator("get_databases")
for db_page in db_paginator.paginate():
    for db in db_page["DatabaseList"]:
        db_name = db["Name"]
        tbl_paginator = glue.get_paginator("get_tables")
        for tbl_page in tbl_paginator.paginate(
            DatabaseName=db_name, Expression=f".*{term}.*"
        ):
            for tbl in tbl_page["TableList"]:
                matches.append({
                    "database": db_name,
                    "table": tbl["Name"],
                    "format": tbl.get("Parameters", {}).get("classification", "unknown"),
                    "location": tbl.get("StorageDescriptor", {}).get("Location", ""),
                })

print(json.dumps(matches, indent=2) if matches else "No matches found.")
```

You MUST only use this fallback after search-tables and S3 Tables
enumeration have already returned nothing. This is a last resort, not
a first choice.
5. Apply the Confidence Gate
- High confidence (exact name match, single result): Return the resolved reference immediately. No summary, no options.
- Medium confidence (fuzzy match, 2-3 results): Present top matches with one line each: name, why it matched, format. Let the user pick.
- Low confidence (many weak matches or none): Report what was searched and what was skipped, suggest refining the query or running exploring-data-catalog.
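As an illustration only, the gate can be read as a small decision function; the structure and result-count bands below mirror the rules above but are otherwise assumptions:

```python
def confidence_gate(query: str, results: list) -> str:
    """Illustrative only: map matches to one of the three bands above."""
    exact = [r for r in results if r.get("table", "").lower() == query.lower()]
    if len(exact) == 1:
        return "resolve"           # high: return the reference immediately, done
    if 2 <= len(results) <= 3:
        return "present_options"   # medium: one line per match, let the user pick
    return "report_and_refine"     # low: report searched/skipped, suggest refining
```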
6. Return the Reference
For high-confidence resolve, return a structured reference. Always
include a "Sources searched / skipped" line so the user knows which
data stores were checked and which were not.
```
Table: database_name.table_name
Catalog: default | catalog_name
Format: Parquet | CSV | JSON | ORC | Iceberg
Location: s3://bucket/prefix/
Partition keys: [key1, key2] or none
Sources searched: Glue Data Catalog
Sources skipped: S3, Redshift (stopped early — high-confidence match in Glue)
```

S3 Tables use a 4-level hierarchy (catalog / table-bucket / namespace /
table), and search-tables does not index s3tablescatalog/*. If the
user mentions S3 Tables explicitly or Layer 1 returns nothing for an
expected S3 Tables asset, enumerate via aws s3tables list-table-buckets
and list-namespaces. Return as:

```
Table: s3tablescatalog/<table-bucket>/<namespace>/<table>
Format: Iceberg
Location: arn:aws:s3tables:<region>:<account>:bucket/<table-bucket>/table/<table-uuid>
Sources searched: Glue Data Catalog, S3 Tables
Sources skipped: Redshift (not relevant to S3 Tables lookup)
```

SQL reference: "s3tablescatalog/<table-bucket>"."<namespace>"."<table>".

You MUST always include both "Sources searched" and "Sources skipped"
in the output. List the reason for skipping in parentheses. Valid
reasons: "stopped early", "not relevant to this request", "access
denied", "no results in prior layer".
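If the enumeration is scripted rather than run through the CLI, a sketch using boto3's s3tables client could look like the following, assuming the installed boto3 is recent enough to include that client and these list calls:

```python
import boto3

s3tables = boto3.client("s3tables", region_name="us-east-1")

for bucket in s3tables.list_table_buckets().get("tableBuckets", []):
    ns_resp = s3tables.list_namespaces(tableBucketARN=bucket["arn"])
    for ns in ns_resp.get("namespaces", []):
        # Namespace names come back as a list of path components.
        print(f"s3tablescatalog/{bucket['name']}/{'/'.join(ns['namespace'])}")
```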
Troubleshooting
| Error | Cause | Fix |
|---|---|---|
| get-tables requires a database name | The API is scoped to a single database | For cross-database search, use search-tables |
| search-tables does not cover S3 Tables federated catalogs | SearchTables does not index s3tablescatalog/* | Use aws s3tables list-table-buckets and list-namespaces |
| Caller lacks search permissions | IAM policy does not allow the search API | Request the permission or fall back to Glue get-tables with a known database |
| API call times out or throttles | Throttled by service-level rate limits | Retry with exponential backoff; reduce parallel calls |
| Resource not in expected region | Cross-region lookup | Confirm the AWS region; the Glue catalog is region-scoped |
| Delegating caller expects verbose output | Other skill called this as a resolver | Return minimal output — caller needs a catalog reference, not a formatted summary |
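For the throttling row, one way to get backoff without hand-rolling it is botocore's built-in retry configuration; a minimal sketch:

```python
import boto3
from botocore.config import Config

# "adaptive" layers client-side rate limiting on top of the SDK's
# exponential backoff with jitter; max_attempts caps total tries.
retry_cfg = Config(retries={"max_attempts": 10, "mode": "adaptive"})
glue = boto3.client("glue", region_name="us-east-1", config=retry_cfg)

resp = glue.search_tables(SearchText="orders")  # now retried on throttling
```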
Principles
- You MUST prefer search-tables over iterating databases. One API call beats N.
- You MUST pass an Expression filter when calling get-tables; never call it without one.
- You MUST NOT issue separate CLI calls per database. If a broad scan is needed, use the boto3 paginator script from Step 4b to do it in a single turn.
- You SHOULD resolve fast and stop early. Every extra API call costs tokens.
- You SHOULD assume the asset exists in Resolve mode — search to find it, not to confirm it.
Additional Resources
- Search strategy details
- AWS Glue SearchTables API
- S3 Tables overview
- S3 Metadata tables