exploring-data-catalog

Structured inventory and cataloging across your AWS data landscape: Glue Data Catalog with S3 Tables, Redshift-federated, and remote Iceberg catalogs.

Overview

Maps the data in an AWS account. Discovery starts with the catalog landscape (Glue, S3 Tables, federated), then drills into databases and tables. The workflow is read-only: it does not execute queries.

Constraints for parameter acquisition:

You MUST ask for the target AWS region upfront if not provided
You MUST support a single optional argument: search term, catalog name, database name, S3 path, or table name
You MUST accept the argument as direct input or a pointer to a file containing the spec
You MUST confirm the scope (full landscape vs. targeted deep dive) before making API calls
You MUST respect the user's decision to abort at any step

Common Tasks

Pagination: All list and search calls in this workflow may return paginated results. You MUST pass `--next-token` from the previous response until no more tokens are returned. You MUST NOT assume a single page contains all results.
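
The pagination rule can be sketched as a small loop. This is a minimal illustration, not part of the workflow itself: `fetch_page` is a hypothetical callable that runs one list or search call with the given token and returns the parsed JSON response, and the `NextToken` response key is assumed (Glue responses use it; other services may name it differently, so it is a parameter).

```python
from typing import Callable, Iterator, Optional


def iter_all_pages(
    fetch_page: Callable[[Optional[str]], dict],
    items_key: str,
    token_key: str = "NextToken",  # assumption: Glue-style response key
) -> Iterator[dict]:
    """Yield every item across all pages of a list or search call.

    fetch_page is hypothetical: it receives the token from the previous
    response (None on the first call) and returns one parsed response.
    """
    token = None
    while True:
        page = fetch_page(token)
        yield from page.get(items_key, [])
        token = page.get(token_key)
        if not token:
            # No token in the response means this was the last page.
            break
```

A caller would wrap each CLI or MCP call in its own `fetch_page` and consume the generator, so no step silently stops after the first page.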

1. Verify Dependencies

Check for required tools and AWS access before discovery.

Constraints:

You MUST verify AWS MCP server tools are available (`aws___call_aws`, `aws___search_documentation`) and fall back to AWS CLI if not
You MUST confirm credentials are valid: `aws sts get-caller-identity`
You MUST inform the user about any missing tools and ask whether to proceed

2. Discover Catalogs

List catalogs in account:

```bash
aws glue get-catalogs --recursive --include-root
```

Classify each catalog by type:

| Field Present | Catalog Type | What It Contains |
| --- | --- | --- |
| Neither `TargetRedshiftCatalog` nor `FederatedCatalog` | Default (Glue) | Standard Glue databases and tables |
| `FederatedCatalog.ConnectionName` = `aws:s3tables` | S3 Tables | Managed Iceberg table buckets |
| `TargetRedshiftCatalog` | Redshift-federated | Redshift databases exposed as Glue catalogs |
| `FederatedCatalog` with `ConnectionName` ≠ `aws:s3tables` | Remote Iceberg | External catalogs (Snowflake, Databricks, Iceberg REST) |

Constraints:

You MUST include `--include-root` to capture the default account catalog
You MUST present a summary of catalog counts by type
If only the default catalog exists, you SHOULD skip the catalog overview and go to step 3
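
The classification rules and the per-type summary could be expressed as follows. This is a sketch under the assumption that each catalog entry from `get-catalogs` has already been parsed into a dict carrying the fields named in the table; the function names are illustrative.

```python
from collections import Counter

S3_TABLES_CONNECTION = "aws:s3tables"


def classify_catalog(catalog: dict) -> str:
    """Map one parsed catalog entry to the catalog types described above."""
    if catalog.get("TargetRedshiftCatalog"):
        return "Redshift-federated"
    federated = catalog.get("FederatedCatalog")
    if federated:
        if federated.get("ConnectionName") == S3_TABLES_CONNECTION:
            return "S3 Tables"
        return "Remote Iceberg"
    # Neither field present: a standard Glue catalog.
    return "Default (Glue)"


def summarize_catalogs(catalogs: list) -> dict:
    """Count catalogs by type for the summary presented to the user."""
    return dict(Counter(classify_catalog(c) for c in catalogs))
```

If the summary contains only `Default (Glue)`, the catalog overview can be skipped as described above.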

3. Enumerate Databases and Tables

For each catalog (or the user-specified one):

```bash
aws glue get-databases --catalog-id <catalog-id>
aws glue get-tables --database-name <db> --catalog-id <catalog-id>
```

For S3 Tables catalogs, also enumerate via the S3 Tables API:

```bash
aws s3tables list-table-buckets
aws s3tables list-namespaces --table-bucket-arn <arn>
aws s3tables list-tables --table-bucket-arn <arn> --namespace <ns>
```

Constraints:

You MUST flag S3 Tables not registered in Glue; you SHOULD suggest registration
For sub-catalogs, `--catalog-id` accepts the catalog name (not the ARN)
For the default catalog, omit `--catalog-id` or pass the account ID
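
One way to flag S3 Tables that are missing from Glue is a set difference over the two enumerations. The sketch below assumes both listings have been reduced to `(namespace, table_name)` pairs and that a Glue database name corresponds to the S3 Tables namespace; `find_unregistered` is a hypothetical helper name.

```python
def find_unregistered(s3_tables: set, glue_tables: set) -> list:
    """Return (namespace, table) pairs seen in the S3 Tables API but not in Glue.

    Assumption: both inputs use the same (namespace, table_name) keys, with
    the Glue database name standing in for the namespace.
    """
    return sorted(s3_tables - glue_tables)


# Illustrative data only.
from_s3tables_api = {("sales", "orders"), ("sales", "returns"), ("ops", "events")}
from_glue = {("sales", "orders")}

for namespace, table in find_unregistered(from_s3tables_api, from_glue):
    print(f"{namespace}.{table}: not registered in Glue; suggest registration")
```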

4. Capture Details and Analyze

For each database, capture table count, formats, partitioning, and S3 locations. For each table of interest, capture column schemas, types, partition keys, SerDe format, and last access time.

You MUST report data formats in human-readable terms (Parquet, CSV, JSON), not raw SerDe class names.
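
A lookup table is one simple way to turn SerDe class names into readable formats. The sketch below assumes the table has been parsed into a dict with the Glue `StorageDescriptor.SerdeInfo.SerializationLibrary` field; the mapping covers common Hive SerDe classes and is not exhaustive, and the helper names are illustrative.

```python
SERDE_FORMATS = {
    "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe": "Parquet",
    "org.apache.hadoop.hive.ql.io.orc.OrcSerde": "ORC",
    "org.apache.hadoop.hive.serde2.avro.AvroSerDe": "Avro",
    "org.apache.hadoop.hive.serde2.OpenCSVSerde": "CSV",
    "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe": "Delimited text (CSV/TSV)",
    "org.openx.data.jsonserde.JsonSerDe": "JSON",
    "org.apache.hive.hcatalog.data.JsonSerDe": "JSON",
}


def readable_format(serde_class) -> str:
    """Translate a SerDe class name into a human-readable format label."""
    if not serde_class:
        return "Unknown"
    # Keep the raw class visible for unmapped SerDes instead of guessing.
    return SERDE_FORMATS.get(serde_class, f"Unknown ({serde_class})")


def table_format(table: dict) -> str:
    """Read the SerDe class from a parsed Glue table and label it."""
    serde = table.get("StorageDescriptor", {}).get("SerdeInfo", {})
    return readable_format(serde.get("SerializationLibrary"))
```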

See discovery-checklist.md for the analysis framework.

Argument Routing

Resolve the argument in this order; stop at the first match:

Starts with `s3://`: S3 path (explore unregistered data, detect formats)
Matches a known catalog from step 2 (`get-catalogs`): deep dive into that catalog
Matches a known database (`get-databases`): deep dive into that database
Matches a known table (`get-tables`): detailed table analysis with schema and partitions
No match: treat as search term (Glue `search-tables`)
No args: full landscape discovery (catalogs, then databases and tables)
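
The resolution order above can be written as a first-match chain. This is a sketch only: the known catalog, database, and table names are assumed to have been collected already into sets, and the returned labels are illustrative rather than part of the workflow.

```python
from typing import Optional


def route_argument(arg: Optional[str], catalogs: set, databases: set, tables: set) -> str:
    """Resolve the optional argument in the documented order, first match wins."""
    if arg is None or not arg.strip():
        return "full-landscape"
    arg = arg.strip()
    if arg.startswith("s3://"):
        return "s3-path"
    if arg in catalogs:
        return "catalog"
    if arg in databases:
        return "database"
    if arg in tables:
        return "table"
    # Nothing matched: fall back to a Glue search-tables lookup.
    return "search"
```

Because the checks run in order, a name that is both a catalog and a database resolves to the catalog deep dive.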

Principles

Start with catalog landscape, then narrow based on user interest
Always report catalog types — users need to know where data lives
Always report data formats — they drive cost and performance decisions
Flag stale tables and missing descriptions
Suggest partitioning for large unpartitioned tables
Summary first, details on request
You MUST NOT execute Athena queries (`start-query-execution`) during discovery; query execution belongs to `querying-data-lake`
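
The flagging principles could be checked per table roughly as follows. This is a sketch under stated assumptions: the table is a parsed Glue table dict whose `LastAccessTime` has already been converted to a timezone-aware `datetime`, the 90-day staleness threshold is an arbitrary illustrative default, and table size is not inspected, so "unpartitioned" is only a candidate for a partitioning suggestion.

```python
from datetime import datetime, timedelta, timezone


def quality_flags(table: dict, now: datetime, stale_after_days: int = 90) -> list:
    """Return metadata-quality flags for one parsed Glue table.

    stale_after_days is an assumed default, not a value from the workflow.
    """
    flags = []
    if not table.get("Description"):
        flags.append("missing description")
    last_access = table.get("LastAccessTime")
    if last_access is not None and now - last_access > timedelta(days=stale_after_days):
        flags.append("stale")
    if not table.get("PartitionKeys"):
        # Size is unknown here; only large tables warrant a suggestion.
        flags.append("unpartitioned")
    return flags


# Illustrative usage with the current time.
example = {"Name": "events", "PartitionKeys": [{"Name": "dt"}], "Description": "Raw events"}
print(quality_flags(example, datetime.now(timezone.utc)))
```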

Troubleshooting

| Error | Cause | Fix |
| --- | --- | --- |
| Only sub-catalogs returned, default missing | `--include-root` omitted | Re-run `get-catalogs` with `--include-root` |
| Federated catalog query slow or failing | Network call to remote source; connection misconfigured | Report connection errors clearly rather than silently skipping |
| S3 Tables not queryable via Athena | Tables exist in S3 Tables API but are not registered in Glue | Flag as "not queryable"; suggest registration |
| `get-databases` / `get-tables` fails with `--catalog-id` | The default catalog requires omitting `--catalog-id` or passing the account ID | Omit `--catalog-id` or pass the account ID for the default catalog |
