exploring-data-catalog

Structured inventory and cataloging across your AWS data landscape: Glue Data Catalog with S3 Tables, Redshift-federated, and remote Iceberg catalogs.

Overview

Maps data in an AWS account. Starts with catalog landscape (Glue, S3 Tables, federated), then drills into databases and tables. Read-only — no query execution.
Constraints for parameter acquisition:
  • You MUST ask for the target AWS region upfront if not provided
  • You MUST support a single optional argument: search term, catalog name, database name, S3 path, or table name
  • You MUST accept the argument as direct input or a pointer to a file containing the spec
  • You MUST confirm the scope (full landscape vs. targeted deep dive) before making API calls
  • You MUST respect the user's decision to abort at any step

Common Tasks

Pagination: All list and search calls in this workflow may return paginated results. You MUST pass `--next-token` from the previous response until no more tokens are returned. You MUST NOT assume a single page contains all results.
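The token-following loop can be sketched as below. This is a minimal illustration, not the workflow's actual implementation: `paginate_all` and its fetcher contract are hypothetical. The fetcher stands in for a wrapper around any list or search call; it receives the previous token (empty on the first call) and prints the next token on its first line, then one item per line.

```bash
# Hypothetical helper: keep calling the fetcher until it returns no next token.
# Fetcher contract (an assumption for this sketch):
#   input  $1 = token from the previous page ("" for the first page)
#   output line 1 = next token ("" when there are no more pages)
#          remaining lines = items on this page
paginate_all() {
  local fetch="$1" token="" page
  while :; do
    page="$("$fetch" "$token")"
    # First line carries the next token.
    token="$(printf '%s\n' "$page" | sed -n '1p')"
    # Everything after the first line is this page's items.
    printf '%s\n' "$page" | sed '1d'
    # Stop only when the service stops handing back a token.
    [ -n "$token" ] || break
  done
}
```

A real fetcher would wrap the CLI call and pass the token through `--next-token`; the loop never assumes that the first page is complete.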

1. Verify Dependencies

Check for required tools and AWS access before discovery.
Constraints:
  • You MUST verify AWS MCP server tools are available (`aws___call_aws`, `aws___search_documentation`) and fall back to AWS CLI if not
  • You MUST confirm credentials are valid: `aws sts get-caller-identity`
  • You MUST inform the user about any missing tools and ask whether to proceed
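For the AWS CLI fallback path, the checks above might look like the following sketch. The function names are hypothetical, and MCP tool availability cannot be probed from a shell, so only the CLI side is covered.

```bash
# Return 0 if the named command is on PATH.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# Return 0 if the current credentials resolve to a caller identity.
credentials_valid() {
  aws sts get-caller-identity >/dev/null 2>&1
}

# Print one line per missing tool so the user can decide whether to proceed.
report_missing_tools() {
  local tool
  for tool in "$@"; do
    have_tool "$tool" || printf 'missing: %s\n' "$tool"
  done
}
```

For example, `report_missing_tools aws` prints nothing when the CLI is installed, and `credentials_valid || echo "credentials invalid or expired"` surfaces an authentication problem before discovery starts.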

2. Discover Catalogs

List catalogs in account:
```bash
aws glue get-catalogs --recursive --include-root
```
Classify each catalog by type:
| Field Present | Catalog Type | What It Contains |
|---|---|---|
| Neither `TargetRedshiftCatalog` nor `FederatedCatalog` | Default (Glue) | Standard Glue databases and tables |
| `FederatedCatalog.ConnectionName` = `aws:s3tables` | S3 Tables | Managed Iceberg table buckets |
| `TargetRedshiftCatalog` | Redshift-federated | Redshift databases exposed as Glue catalogs |
| `FederatedCatalog` with `ConnectionName` ≠ `aws:s3tables` | Remote Iceberg | External catalogs (Snowflake, Databricks, Iceberg REST) |
Constraints:
  • You MUST include `--include-root` to capture the default account catalog
  • You MUST present summary of catalog counts by type
  • If only default catalog exists, You SHOULD skip catalog overview and go to step 3
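The classification rules can be expressed as a small decision function. This is a sketch under stated assumptions: `classify_catalog` is a hypothetical helper, its three string arguments are assumed to have been extracted from one `get-catalogs` entry beforehand, a federated catalog whose `ConnectionName` is anything other than `aws:s3tables` is treated as Remote Iceberg, and checking `TargetRedshiftCatalog` first is an arbitrary precedence choice.

```bash
# classify_catalog HAS_REDSHIFT HAS_FEDERATED CONNECTION_NAME
#   $1 = "yes" if TargetRedshiftCatalog is present, else "no"
#   $2 = "yes" if FederatedCatalog is present, else "no"
#   $3 = FederatedCatalog.ConnectionName (may be empty)
classify_catalog() {
  if [ "$1" = "yes" ]; then
    echo "Redshift-federated"
  elif [ "$2" = "yes" ] && [ "$3" = "aws:s3tables" ]; then
    echo "S3 Tables"
  elif [ "$2" = "yes" ]; then
    # Assumption: any other connection name means a remote Iceberg catalog.
    echo "Remote Iceberg"
  else
    echo "Default (Glue)"
  fi
}
```

Running the function once per catalog and piping the results through `sort | uniq -c` gives the required summary of catalog counts by type.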

3. Enumerate Databases and Tables

For each catalog (or the user-specified one):
```bash
aws glue get-databases --catalog-id <catalog-id>
aws glue get-tables --database-name <db> --catalog-id <catalog-id>
```
For S3 Tables catalogs, also enumerate via the S3 Tables API:
```bash
aws s3tables list-table-buckets
aws s3tables list-namespaces --table-bucket-arn <arn>
aws s3tables list-tables --table-bucket-arn <arn> --namespace <ns>
```
Constraints:
  • You MUST flag S3 Tables not registered in Glue; You SHOULD suggest registration
  • For sub-catalogs, `--catalog-id` accepts the catalog name (not the ARN)
  • For the default catalog, omit `--catalog-id` or pass the account ID
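Flagging unregistered S3 Tables comes down to a set difference between the two listings. The sketch below assumes both listings have already been reduced to newline-separated names in the same form (for instance `namespace.table`); `unregistered_tables` is a hypothetical helper, and how names are normalized between the two APIs is left open.

```bash
# unregistered_tables S3TABLES_LIST GLUE_LIST
# Print every name that appears in the S3 Tables listing but not in Glue.
unregistered_tables() {
  local s3_list="$1" glue_list="$2" name
  printf '%s\n' "$s3_list" | while IFS= read -r name; do
    [ -n "$name" ] || continue
    # -F fixed string, -x whole-line match, -q quiet.
    printf '%s\n' "$glue_list" | grep -Fxq -e "$name" || printf '%s\n' "$name"
  done
}
```

Each name printed is a candidate to flag and to suggest for registration.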

4. Capture Details and Analyze

For each database, capture table count, formats, partitioning, and S3 locations. For each table of interest, capture column schemas, types, partition keys, SerDe format, and last access time.
You MUST report data formats in human-readable terms (Parquet, CSV, JSON), not raw SerDe class names.
See discovery-checklist.md for analysis framework.
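Translating SerDe class names into readable formats can be done with a lookup on the class-name suffix. The sketch below is illustrative: `serde_to_format` is a hypothetical helper, and the class names it recognizes are commonly seen Hive SerDe classes listed from memory rather than taken from this document, so the mapping should be checked against what the catalog actually returns.

```bash
# serde_to_format SERDE_CLASS
# Map a SerDe class name to a human-readable format label.
serde_to_format() {
  case "$1" in
    *ParquetHiveSerDe) echo "Parquet" ;;
    *OrcSerde)         echo "ORC" ;;
    *AvroSerDe)        echo "Avro" ;;
    *OpenCSVSerde)     echo "CSV" ;;
    *JsonSerDe)        echo "JSON" ;;
    # LazySimpleSerDe covers delimited text; the delimiter decides CSV vs TSV.
    *LazySimpleSerDe)  echo "Delimited text (CSV/TSV)" ;;
    "")                echo "Unknown (no SerDe recorded)" ;;
    *)                 echo "Unrecognized ($1)" ;;
  esac
}
```

Unrecognized classes are reported with the raw name kept in parentheses so that nothing is silently mislabeled.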

Argument Routing

Resolve the argument in this order; stop at the first match:
  1. Starts with `s3://`: S3 path (explore unregistered data, detect formats)
  2. Matches a known catalog from step 2 (`get-catalogs`): deep dive into that catalog
  3. Matches a known database (`get-databases`): deep dive into that database
  4. Matches a known table (`get-tables`): detailed table analysis with schema and partitions
  5. No match: treat as search term (Glue `search-tables`)
  6. No args: full landscape discovery (catalogs, then databases and tables)
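The resolution order above can be sketched as a first-match dispatcher. `route_argument` and `in_list` are hypothetical helpers. The known catalogs, databases, and tables are assumed to be passed in as newline-separated lists gathered earlier, and matching is assumed to be exact and case-sensitive.

```bash
# in_list NEEDLE LIST: succeed if NEEDLE equals one whole line of LIST.
in_list() {
  printf '%s\n' "$2" | grep -Fxq -e "$1"
}

# route_argument ARG CATALOGS DATABASES TABLES
# Print the route chosen for ARG, stopping at the first rule that matches.
route_argument() {
  local arg="$1"
  # No argument at all: full landscape discovery.
  if [ -z "$arg" ]; then
    echo "full-landscape"
    return
  fi
  case "$arg" in
    s3://*) echo "s3-path"; return ;;
  esac
  if in_list "$arg" "$2"; then echo "catalog"
  elif in_list "$arg" "$3"; then echo "database"
  elif in_list "$arg" "$4"; then echo "table"
  else echo "search-term"
  fi
}
```

The empty-argument case is checked first so that it never falls through to the search-term branch.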

Principles

  • Start with catalog landscape, then narrow based on user interest
  • Always report catalog types — users need to know where data lives
  • Always report data formats — they drive cost and performance decisions
  • Flag stale tables and missing descriptions
  • Suggest partitioning for large unpartitioned tables
  • Summary first, details on request
  • You MUST NOT execute Athena queries (`start-query-execution`) during discovery; query execution belongs to `querying-data-lake`

Troubleshooting

| Error | Cause | Fix |
|---|---|---|
| Only sub-catalogs returned, default missing | `--include-root` omitted | Re-run `get-catalogs` with `--include-root` |
| Federated catalog query slow or failing | Network call to remote source; connection misconfigured | Report connection errors clearly rather than silently skipping |
| S3 Tables not queryable via Athena | Tables exist in S3 Tables API but are not registered in Glue | Flag as "not queryable"; suggest registration |
| `get-databases` / `get-tables` fails with `--catalog-id` | Default catalog requires omitting `--catalog-id` or passing the account ID | Omit `--catalog-id` or pass the account ID for the default catalog |

Additional Resources
