querying-aws-sagemaker-catalog

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Query AWS SageMaker Catalog System Tables

查询AWS SageMaker Catalog系统表

Overview

概述

Works best with the AWS MCP server for sandboxed execution and audit logging. All commands below use the AWS CLI and work in any environment with configured AWS credentials.
Amazon SageMaker Unified Studio (whose catalog feature is referred to below as SageMaker Catalog) exports asset metadata as a daily-snapshot Apache Iceberg table in the AWS-managed
aws-sagemaker-catalog
table bucket. This enables SQL queries over your entire data catalog inventory — asset counts, governance gaps, ownership audits, and historical comparisons — without building custom ETL.
Data is partitioned by
snapshot_time
and exported once daily (around midnight per region). The table is read-only.
最佳搭配AWS MCP server使用,以实现沙箱执行和审计日志记录。以下所有命令均使用AWS CLI,可在任何配置了AWS凭证的环境中运行。
Amazon SageMaker Unified Studio(下文将其目录功能称为SageMaker Catalog)会将资产元数据以每日快照的形式导出为Apache Iceberg表,存储在AWS托管的
aws-sagemaker-catalog
表存储桶中。这使得您无需构建自定义ETL,即可对整个数据目录清单执行SQL查询——包括资产计数、治理缺口、所有权审计以及历史对比。
数据按
snapshot_time
分区,每日导出一次(大致为每个区域的午夜时段)。该表为只读状态。

Decision Tree

决策树

User intentUse this skill?Alternative
SQL analytics on catalog state (counts, governance, trends)Yes
Historical comparison ("what changed in catalog last week")Yes — time travel via
snapshot_time
Find assets without owners or descriptionsYes
Find a specific table by name or conceptNo
finding-data-lake-assets
or Glue Discovery
search
Browse/enumerate catalog interactivelyNo
exploring-data-catalog
Run a query on a table's dataNo
querying-data-lake
Manage catalog metadata (add descriptions, tags)NoGlue Discovery
put-form-type
/
associate-glossary-terms
用户意图是否使用此技能?替代方案
对目录状态进行SQL分析(计数、治理、趋势)
历史对比("上周目录有哪些变化") — 通过
snapshot_time
实现时间旅行
查找无所有者或无描述的资产
按名称或概念查找特定表
finding-data-lake-assets
或Glue Discovery
search
交互式浏览/枚举目录
exploring-data-catalog
对表数据运行查询
querying-data-lake
管理目录元数据(添加描述、标签)Glue Discovery
put-form-type
/
associate-glossary-terms

Common Tasks

常见任务

1. Check If Configured

1. 检查配置情况

bash
aws datazone get-data-export-configuration \
  --domain-identifier <DOMAIN_ID> \
  --region <REGION>
  • If no domain exists:
    aws datazone list-domains --region <REGION>
  • If export not enabled: guide user to enable.
  • One domain per account per region.
Verify table bucket exists:
bash
aws s3tables list-table-buckets --region <REGION> \
  --query "tableBuckets[?name=='aws-sagemaker-catalog']"
bash
aws datazone get-data-export-configuration \
  --domain-identifier <DOMAIN_ID> \
  --region <REGION>
  • 若不存在域:执行
    aws datazone list-domains --region <REGION>
  • 若未启用导出:引导用户启用。
  • 每个区域的每个账户对应一个域。
验证表存储桶是否存在:
bash
aws s3tables list-table-buckets --region <REGION> \
  --query "tableBuckets[?name=='aws-sagemaker-catalog']"

2. Enable

2. 启用导出

With KMS encryption (recommended for production):
bash
aws datazone put-data-export-configuration \
  --domain-identifier <DOMAIN_ID> \
  --region <REGION> \
  --enable-export \
  --encryption-configuration kmsKeyArn=<KMS_KEY_ARN>,sseAlgorithm=aws:kms
Note: Encryption cannot be changed after creation. Always specify KMS for sensitive catalog data.
Without encryption (for quick testing only):
bash
aws datazone put-data-export-configuration \
  --domain-identifier <DOMAIN_ID> \
  --region <REGION> \
  --enable-export
First data available within 24 hours. See: Exporting asset metadata
使用KMS加密(生产环境推荐):
bash
aws datazone put-data-export-configuration \
  --domain-identifier <DOMAIN_ID> \
  --region <REGION> \
  --enable-export \
  --encryption-configuration kmsKeyArn=<KMS_KEY_ARN>,sseAlgorithm=aws:kms
注意:加密设置创建后无法修改。对于敏感目录数据,请始终指定KMS加密。
不使用加密(仅用于快速测试):
bash
aws datazone put-data-export-configuration \
  --domain-identifier <DOMAIN_ID> \
  --region <REGION> \
  --enable-export
首次数据将在24小时内可用。参考: 导出资产元数据

3. Verify Permissions for Querying

3. 验证查询权限

Requires:
  • S3 Tables federated catalog registered in Glue (
    s3tablescatalog
    )
  • Lake Formation SELECT + DESCRIBE grants on the table
Grant access:
bash
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=<ROLE_ARN> \
  --resource '{"Table": {"CatalogId": "<ACCOUNT>:s3tablescatalog/aws-sagemaker-catalog", "DatabaseName": "asset_metadata", "Name": "asset"}}' \
  --permissions DESCRIBE SELECT \
  --region <REGION>
需要:
  • 在Glue中注册的S3 Tables联合目录(
    s3tablescatalog
  • 对该表的Lake Formation SELECT + DESCRIBE权限
授予访问权限:
bash
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=<ROLE_ARN> \
  --resource '{"Table": {"CatalogId": "<ACCOUNT>:s3tablescatalog/aws-sagemaker-catalog", "DatabaseName": "asset_metadata", "Name": "asset"}}' \
  --permissions DESCRIBE SELECT \
  --region <REGION>

4. Query

4. 执行查询

Query syntax:
sql
"s3tablescatalog/aws-sagemaker-catalog"."asset_metadata"."asset"
Constraints:
  • You MUST always filter by
    snapshot_time
    — without it, the query scans all historical snapshots and returns duplicates
  • You MUST confirm workgroup and output location before executing
  • Default to
    DATE(snapshot_time) = CURRENT_DATE
    for current state
  • You SHOULD use the key columns documented in this skill to build queries. If you need the full schema, run
    get-tables
    once:
    aws glue get-tables --catalog-id "<ACCOUNT>:s3tablescatalog/aws-sagemaker-catalog" --database-name "asset_metadata" --region <REGION>
Key columns:
ColumnWhat it holdsUsage
snapshot_time
Partition key — daily snapshot timestampAlways filter on this
asset_id
Unique catalog asset identifierPrimary key for lookups
resource_type_enum
GlueTable, RedshiftTable, S3Collection, etc.Filter by asset type
resource_id
ARN or native identifierCross-reference with source systems
asset_name
Business-friendly nameDisplay, search
resource_name
Technical name (table name, prefix)Filtering
business_description
Business context (NULL if not provided)Governance gaps
extended_metadata
map<string,string>
— flexible key-value attributes
Use bracket notation:
extended_metadata['owningEntityId']
asset_created_time
When asset first appeared in catalogGrowth analysis
asset_updated_time
Last modification timeFreshness checks
Current catalog state:
sql
SELECT resource_type_enum, COUNT(*) as count
FROM "s3tablescatalog/aws-sagemaker-catalog"."asset_metadata"."asset"
WHERE DATE(snapshot_time) = CURRENT_DATE
GROUP BY resource_type_enum
ORDER BY count DESC;
Assets without business descriptions:
sql
SELECT asset_name, resource_name, resource_type_enum, account_id
FROM "s3tablescatalog/aws-sagemaker-catalog"."asset_metadata"."asset"
WHERE DATE(snapshot_time) = CURRENT_DATE
  AND business_description IS NULL;
Asset growth over last 30 days:
sql
SELECT DATE(snapshot_time) as date, COUNT(*) as total_assets
FROM "s3tablescatalog/aws-sagemaker-catalog"."asset_metadata"."asset"
WHERE DATE(snapshot_time) >= CURRENT_DATE - INTERVAL '30' DAY
GROUP BY DATE(snapshot_time)
ORDER BY date DESC;
Time travel — compare current vs 7 days ago (new descriptions added):
sql
SELECT t.asset_id, t.resource_name,
       p.business_description as before,
       t.business_description as now
FROM "s3tablescatalog/aws-sagemaker-catalog"."asset_metadata"."asset" t
JOIN "s3tablescatalog/aws-sagemaker-catalog"."asset_metadata"."asset" p
  ON t.asset_id = p.asset_id
WHERE DATE(t.snapshot_time) = CURRENT_DATE
  AND DATE(p.snapshot_time) = CURRENT_DATE - INTERVAL '7' DAY
  AND p.business_description IS NULL
  AND t.business_description IS NOT NULL;
Assets by owner:
sql
SELECT extended_metadata['owningEntityId'] as owner, COUNT(*) as count
FROM "s3tablescatalog/aws-sagemaker-catalog"."asset_metadata"."asset"
WHERE DATE(snapshot_time) = CURRENT_DATE
  AND extended_metadata['owningEntityId'] IS NOT NULL
GROUP BY extended_metadata['owningEntityId']
ORDER BY count DESC;
Filter by metadata form field:
sql
SELECT *
FROM "s3tablescatalog/aws-sagemaker-catalog"."asset_metadata"."asset"
WHERE DATE(snapshot_time) = CURRENT_DATE
  AND extended_metadata['<metadata-form-name>.<field-name>'] = '<field-value>';
查询语法:
sql
"s3tablescatalog/aws-sagemaker-catalog"."asset_metadata"."asset"
约束条件:
  • 必须始终按
    snapshot_time
    过滤——若不添加此条件,查询会扫描所有历史快照并返回重复数据
  • 执行前必须确认工作组和输出位置
  • 默认使用
    DATE(snapshot_time) = CURRENT_DATE
    查询当前状态
  • 应使用本技能中记录的关键字段构建查询。如需完整架构,执行一次
    get-tables
    aws glue get-tables --catalog-id "<ACCOUNT>:s3tablescatalog/aws-sagemaker-catalog" --database-name "asset_metadata" --region <REGION>
关键字段:
字段存储内容用途
snapshot_time
分区键——每日快照时间戳必须始终按此字段过滤
asset_id
目录资产唯一标识符查找操作的主键
resource_type_enum
资产类型,如GlueTable、RedshiftTable、S3Collection等按资产类型过滤
resource_id
ARN或原生标识符与源系统交叉引用
asset_name
业务友好名称展示、搜索
resource_name
技术名称(表名、前缀)过滤
business_description
业务上下文(未提供则为NULL)识别治理缺口
extended_metadata
map<string,string>
——灵活的键值属性
使用方括号语法:
extended_metadata['owningEntityId']
asset_created_time
资产首次出现在目录中的时间增长分析
asset_updated_time
最后修改时间新鲜度检查
查询当前目录状态:
sql
SELECT resource_type_enum, COUNT(*) as count
FROM "s3tablescatalog/aws-sagemaker-catalog"."asset_metadata"."asset"
WHERE DATE(snapshot_time) = CURRENT_DATE
GROUP BY resource_type_enum
ORDER BY count DESC;
查询无业务描述的资产:
sql
SELECT asset_name, resource_name, resource_type_enum, account_id
FROM "s3tablescatalog/aws-sagemaker-catalog"."asset_metadata"."asset"
WHERE DATE(snapshot_time) = CURRENT_DATE
  AND business_description IS NULL;
查询过去30天的资产增长情况:
sql
SELECT DATE(snapshot_time) as date, COUNT(*) as total_assets
FROM "s3tablescatalog/aws-sagemaker-catalog"."asset_metadata"."asset"
WHERE DATE(snapshot_time) >= CURRENT_DATE - INTERVAL '30' DAY
GROUP BY DATE(snapshot_time)
ORDER BY date DESC;
时间旅行——对比当前与7天前的状态(新增的描述):
sql
SELECT t.asset_id, t.resource_name,
       p.business_description as before,
       t.business_description as now
FROM "s3tablescatalog/aws-sagemaker-catalog"."asset_metadata"."asset" t
JOIN "s3tablescatalog/aws-sagemaker-catalog"."asset_metadata"."asset" p
  ON t.asset_id = p.asset_id
WHERE DATE(t.snapshot_time) = CURRENT_DATE
  AND DATE(p.snapshot_time) = CURRENT_DATE - INTERVAL '7' DAY
  AND p.business_description IS NULL
  AND t.business_description IS NOT NULL;
按所有者统计资产:
sql
SELECT extended_metadata['owningEntityId'] as owner, COUNT(*) as count
FROM "s3tablescatalog/aws-sagemaker-catalog"."asset_metadata"."asset"
WHERE DATE(snapshot_time) = CURRENT_DATE
  AND extended_metadata['owningEntityId'] IS NOT NULL
GROUP BY extended_metadata['owningEntityId']
ORDER BY count DESC;
按元数据表单字段过滤:
sql
SELECT *
FROM "s3tablescatalog/aws-sagemaker-catalog"."asset_metadata"."asset"
WHERE DATE(snapshot_time) = CURRENT_DATE
  AND extended_metadata['<metadata-form-name>.<field-name>'] = '<field-value>';

Key Behaviors

核心特性

  • Daily snapshots — exported around midnight per region
  • Always filter by
    snapshot_time
    — without it you get all history (duplicates, slow)
  • One domain per account per region — to switch domains, delete config first
  • No additional charge beyond S3 Tables storage + Athena queries
  • Read-only — to update asset metadata, use Glue Discovery APIs or SageMaker Unified Studio
  • 每日快照——大致在每个区域的午夜时段导出
  • 必须按
    snapshot_time
    过滤
    ——若不添加此条件,会获取所有历史数据(重复、查询缓慢)
  • 每个区域每个账户对应一个域——切换域需先删除现有配置
  • 无额外费用——仅产生S3 Tables存储和Athena查询的费用
  • 只读——如需更新资产元数据,请使用Glue Discovery API或SageMaker Unified Studio

Troubleshooting

故障排查

ErrorCauseFix
aws-sagemaker-catalog
bucket not found
Export not enabledRun
put-data-export-configuration --enable-export
Empty results with
CURRENT_DATE
First export hasn't run yet (takes up to 24h)Wait; try yesterday's date
AccessDenied
on query
Missing Lake Formation grantsGrant SELECT + DESCRIBE on the table
CATALOG_NOT_FOUND
S3 Tables not registered in GlueEnable integration: S3 console > Table buckets > Enable integration
Duplicate rows in resultsMissing
snapshot_time
filter
Add
WHERE DATE(snapshot_time) = CURRENT_DATE
extended_metadata
key returns NULL
Key doesn't exist for that assetCheck available keys:
SELECT DISTINCT key FROM ... CROSS JOIN UNNEST(map_keys(extended_metadata)) AS t(key) WHERE DATE(snapshot_time) = CURRENT_DATE
Cannot update export encryptionEncryption set at creation time onlyDelete and recreate export config
错误原因解决方法
未找到
aws-sagemaker-catalog
存储桶
未启用导出执行
put-data-export-configuration --enable-export
使用
CURRENT_DATE
查询返回空结果
首次导出尚未完成(最长需24小时)等待;尝试查询昨日日期
查询时出现
AccessDenied
缺少Lake Formation权限为该表授予SELECT + DESCRIBE权限
出现
CATALOG_NOT_FOUND
S3 Tables未在Glue中注册启用集成:S3控制台 > 表存储桶 > 启用集成
查询结果出现重复行缺少
snapshot_time
过滤条件
添加
WHERE DATE(snapshot_time) = CURRENT_DATE
extended_metadata
键返回NULL
该资产不存在此键查看可用键:
SELECT DISTINCT key FROM ... CROSS JOIN UNNEST(map_keys(extended_metadata)) AS t(key) WHERE DATE(snapshot_time) = CURRENT_DATE
无法修改导出加密设置加密设置仅在创建时指定删除并重新创建导出配置

Security Considerations

安全注意事项

Data sensitivity: Catalog metadata exposes organizational structure including asset names, ownership, account IDs, naming conventions, and internal resource identifiers. Treat query results as sensitive by default.
Encryption at rest: Always enable KMS encryption when creating the export configuration. Encryption cannot be changed after creation. Additionally, configure SSE-KMS on your Athena workgroup output bucket.
Least-privilege access: Grant Lake Formation SELECT + DESCRIBE only on the specific
asset_metadata.asset
table to roles that need catalog analytics. Avoid granting access to the entire
aws-sagemaker-catalog
bucket.
Audit trail: Enable CloudTrail logging for DataZone (
PutDataExportConfiguration
,
GetDataExportConfiguration
), Athena (
StartQueryExecution
,
GetQueryResults
), and S3 Tables API calls to track who queries catalog metadata.
Credential hygiene: Use IAM roles with temporary credentials for querying. Avoid long-lived access keys for users accessing catalog metadata. Scope down or rotate principals when access is no longer needed.
数据敏感性:目录元数据会暴露组织结构,包括资产名称、所有权、账户ID、命名规范以及内部资源标识符。默认情况下,请将查询结果视为敏感数据。
静态加密:创建导出配置时请始终启用KMS加密。加密设置创建后无法修改。此外,请为Athena工作组输出存储桶配置SSE-KMS加密。
最小权限访问:仅为需要目录分析的角色授予Lake Formation对特定
asset_metadata.asset
表的SELECT + DESCRIBE权限。避免授予对整个
aws-sagemaker-catalog
存储桶的访问权限。
审计跟踪:为DataZone(
PutDataExportConfiguration
GetDataExportConfiguration
)、Athena(
StartQueryExecution
GetQueryResults
)和S3 Tables API调用启用CloudTrail日志记录,以跟踪谁查询了目录元数据。
凭证安全:使用带有临时凭证的IAM角色执行查询。避免使用长期访问密钥让用户访问目录元数据。当不再需要访问权限时,缩小权限范围或轮换主体。

Additional Resources

额外资源