querying-aws-sagemaker-catalog
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseQuery AWS SageMaker Catalog System Tables
查询AWS SageMaker Catalog系统表
Overview
概述
Works best with the AWS MCP server for sandboxed execution and audit logging. All commands below use the AWS CLI and work in any environment with configured AWS credentials.
Amazon SageMaker Unified Studio (whose catalog feature is referred to below as SageMaker Catalog) exports asset metadata as a daily-snapshot
Apache Iceberg table in the AWS-managed table bucket. This
enables SQL queries over your entire data catalog inventory — asset counts, governance
gaps, ownership audits, and historical comparisons — without building custom ETL.
aws-sagemaker-catalogData is partitioned by and exported once daily (around midnight per
region). The table is read-only.
snapshot_time最佳搭配AWS MCP server使用,以实现沙箱执行和审计日志记录。以下所有命令均使用AWS CLI,可在任何配置了AWS凭证的环境中运行。
Amazon SageMaker Unified Studio(下文将其目录功能称为SageMaker Catalog)会将资产元数据以每日快照的形式导出为Apache Iceberg表,存储在AWS托管的表存储桶中。这使得您无需构建自定义ETL,即可对整个数据目录清单执行SQL查询——包括资产计数、治理缺口、所有权审计以及历史对比。
aws-sagemaker-catalog数据按分区,每日导出一次(大致为每个区域的午夜时段)。该表为只读状态。
snapshot_timeDecision Tree
决策树
| User intent | Use this skill? | Alternative |
|---|---|---|
| SQL analytics on catalog state (counts, governance, trends) | Yes | — |
| Historical comparison ("what changed in catalog last week") | Yes — time travel via | — |
| Find assets without owners or descriptions | Yes | — |
| Find a specific table by name or concept | No | |
| Browse/enumerate catalog interactively | No | |
| Run a query on a table's data | No | |
| Manage catalog metadata (add descriptions, tags) | No | Glue Discovery |
| 用户意图 | 是否使用此技能? | 替代方案 |
|---|---|---|
| 对目录状态进行SQL分析(计数、治理、趋势) | 是 | — |
| 历史对比("上周目录有哪些变化") | 是 — 通过 | — |
| 查找无所有者或无描述的资产 | 是 | — |
| 按名称或概念查找特定表 | 否 | |
| 交互式浏览/枚举目录 | 否 | |
| 对表数据运行查询 | 否 | |
| 管理目录元数据(添加描述、标签) | 否 | Glue Discovery |
Common Tasks
常见任务
1. Check If Configured
1. 检查配置情况
bash
aws datazone get-data-export-configuration \
--domain-identifier <DOMAIN_ID> \
--region <REGION>- If no domain exists:
aws datazone list-domains --region <REGION> - If export not enabled: guide user to enable.
- One domain per account per region.
Verify table bucket exists:
bash
aws s3tables list-table-buckets --region <REGION> \
--query "tableBuckets[?name=='aws-sagemaker-catalog']"bash
aws datazone get-data-export-configuration \
--domain-identifier <DOMAIN_ID> \
--region <REGION>- 若不存在域:执行
aws datazone list-domains --region <REGION> - 若未启用导出:引导用户启用。
- 每个区域的每个账户对应一个域。
验证表存储桶是否存在:
bash
aws s3tables list-table-buckets --region <REGION> \
--query "tableBuckets[?name=='aws-sagemaker-catalog']"2. Enable
2. 启用导出
With KMS encryption (recommended for production):
bash
aws datazone put-data-export-configuration \
--domain-identifier <DOMAIN_ID> \
--region <REGION> \
--enable-export \
--encryption-configuration kmsKeyArn=<KMS_KEY_ARN>,sseAlgorithm=aws:kmsNote: Encryption cannot be changed after creation. Always specify KMS for sensitive catalog data.
Without encryption (for quick testing only):
bash
aws datazone put-data-export-configuration \
--domain-identifier <DOMAIN_ID> \
--region <REGION> \
--enable-exportFirst data available within 24 hours. See:
Exporting asset metadata
使用KMS加密(生产环境推荐):
bash
aws datazone put-data-export-configuration \
--domain-identifier <DOMAIN_ID> \
--region <REGION> \
--enable-export \
--encryption-configuration kmsKeyArn=<KMS_KEY_ARN>,sseAlgorithm=aws:kms注意:加密设置创建后无法修改。对于敏感目录数据,请始终指定KMS加密。
不使用加密(仅用于快速测试):
bash
aws datazone put-data-export-configuration \
--domain-identifier <DOMAIN_ID> \
--region <REGION> \
--enable-export首次数据将在24小时内可用。参考:
导出资产元数据
3. Verify Permissions for Querying
3. 验证查询权限
Requires:
- S3 Tables federated catalog registered in Glue ()
s3tablescatalog - Lake Formation SELECT + DESCRIBE grants on the table
Grant access:
bash
aws lakeformation grant-permissions \
--principal DataLakePrincipalIdentifier=<ROLE_ARN> \
--resource '{"Table": {"CatalogId": "<ACCOUNT>:s3tablescatalog/aws-sagemaker-catalog", "DatabaseName": "asset_metadata", "Name": "asset"}}' \
--permissions DESCRIBE SELECT \
--region <REGION>需要:
- 在Glue中注册的S3 Tables联合目录()
s3tablescatalog - 对该表的Lake Formation SELECT + DESCRIBE权限
授予访问权限:
bash
aws lakeformation grant-permissions \
--principal DataLakePrincipalIdentifier=<ROLE_ARN> \
--resource '{"Table": {"CatalogId": "<ACCOUNT>:s3tablescatalog/aws-sagemaker-catalog", "DatabaseName": "asset_metadata", "Name": "asset"}}' \
--permissions DESCRIBE SELECT \
--region <REGION>4. Query
4. 执行查询
Query syntax:
sql
"s3tablescatalog/aws-sagemaker-catalog"."asset_metadata"."asset"Constraints:
-
You MUST always filter by— without it, the query scans all historical snapshots and returns duplicates
snapshot_time -
You MUST confirm workgroup and output location before executing
-
Default tofor current state
DATE(snapshot_time) = CURRENT_DATE -
You SHOULD use the key columns documented in this skill to build queries. If you need the full schema, runonce:
get-tablesaws glue get-tables --catalog-id "<ACCOUNT>:s3tablescatalog/aws-sagemaker-catalog" --database-name "asset_metadata" --region <REGION>
Key columns:
| Column | What it holds | Usage |
|---|---|---|
| Partition key — daily snapshot timestamp | Always filter on this |
| Unique catalog asset identifier | Primary key for lookups |
| GlueTable, RedshiftTable, S3Collection, etc. | Filter by asset type |
| ARN or native identifier | Cross-reference with source systems |
| Business-friendly name | Display, search |
| Technical name (table name, prefix) | Filtering |
| Business context (NULL if not provided) | Governance gaps |
| | Use bracket notation: |
| When asset first appeared in catalog | Growth analysis |
| Last modification time | Freshness checks |
Current catalog state:
sql
SELECT resource_type_enum, COUNT(*) as count
FROM "s3tablescatalog/aws-sagemaker-catalog"."asset_metadata"."asset"
WHERE DATE(snapshot_time) = CURRENT_DATE
GROUP BY resource_type_enum
ORDER BY count DESC;Assets without business descriptions:
sql
SELECT asset_name, resource_name, resource_type_enum, account_id
FROM "s3tablescatalog/aws-sagemaker-catalog"."asset_metadata"."asset"
WHERE DATE(snapshot_time) = CURRENT_DATE
AND business_description IS NULL;Asset growth over last 30 days:
sql
SELECT DATE(snapshot_time) as date, COUNT(*) as total_assets
FROM "s3tablescatalog/aws-sagemaker-catalog"."asset_metadata"."asset"
WHERE DATE(snapshot_time) >= CURRENT_DATE - INTERVAL '30' DAY
GROUP BY DATE(snapshot_time)
ORDER BY date DESC;Time travel — compare current vs 7 days ago (new descriptions added):
sql
SELECT t.asset_id, t.resource_name,
p.business_description as before,
t.business_description as now
FROM "s3tablescatalog/aws-sagemaker-catalog"."asset_metadata"."asset" t
JOIN "s3tablescatalog/aws-sagemaker-catalog"."asset_metadata"."asset" p
ON t.asset_id = p.asset_id
WHERE DATE(t.snapshot_time) = CURRENT_DATE
AND DATE(p.snapshot_time) = CURRENT_DATE - INTERVAL '7' DAY
AND p.business_description IS NULL
AND t.business_description IS NOT NULL;Assets by owner:
sql
SELECT extended_metadata['owningEntityId'] as owner, COUNT(*) as count
FROM "s3tablescatalog/aws-sagemaker-catalog"."asset_metadata"."asset"
WHERE DATE(snapshot_time) = CURRENT_DATE
AND extended_metadata['owningEntityId'] IS NOT NULL
GROUP BY extended_metadata['owningEntityId']
ORDER BY count DESC;Filter by metadata form field:
sql
SELECT *
FROM "s3tablescatalog/aws-sagemaker-catalog"."asset_metadata"."asset"
WHERE DATE(snapshot_time) = CURRENT_DATE
AND extended_metadata['<metadata-form-name>.<field-name>'] = '<field-value>';查询语法:
sql
"s3tablescatalog/aws-sagemaker-catalog"."asset_metadata"."asset"约束条件:
-
必须始终按过滤——若不添加此条件,查询会扫描所有历史快照并返回重复数据
snapshot_time -
执行前必须确认工作组和输出位置
-
默认使用查询当前状态
DATE(snapshot_time) = CURRENT_DATE -
应使用本技能中记录的关键字段构建查询。如需完整架构,执行一次:
get-tablesaws glue get-tables --catalog-id "<ACCOUNT>:s3tablescatalog/aws-sagemaker-catalog" --database-name "asset_metadata" --region <REGION>
关键字段:
| 字段 | 存储内容 | 用途 |
|---|---|---|
| 分区键——每日快照时间戳 | 必须始终按此字段过滤 |
| 目录资产唯一标识符 | 查找操作的主键 |
| 资产类型,如GlueTable、RedshiftTable、S3Collection等 | 按资产类型过滤 |
| ARN或原生标识符 | 与源系统交叉引用 |
| 业务友好名称 | 展示、搜索 |
| 技术名称(表名、前缀) | 过滤 |
| 业务上下文(未提供则为NULL) | 识别治理缺口 |
| | 使用方括号语法: |
| 资产首次出现在目录中的时间 | 增长分析 |
| 最后修改时间 | 新鲜度检查 |
查询当前目录状态:
sql
SELECT resource_type_enum, COUNT(*) as count
FROM "s3tablescatalog/aws-sagemaker-catalog"."asset_metadata"."asset"
WHERE DATE(snapshot_time) = CURRENT_DATE
GROUP BY resource_type_enum
ORDER BY count DESC;查询无业务描述的资产:
sql
SELECT asset_name, resource_name, resource_type_enum, account_id
FROM "s3tablescatalog/aws-sagemaker-catalog"."asset_metadata"."asset"
WHERE DATE(snapshot_time) = CURRENT_DATE
AND business_description IS NULL;查询过去30天的资产增长情况:
sql
SELECT DATE(snapshot_time) as date, COUNT(*) as total_assets
FROM "s3tablescatalog/aws-sagemaker-catalog"."asset_metadata"."asset"
WHERE DATE(snapshot_time) >= CURRENT_DATE - INTERVAL '30' DAY
GROUP BY DATE(snapshot_time)
ORDER BY date DESC;时间旅行——对比当前与7天前的状态(新增的描述):
sql
SELECT t.asset_id, t.resource_name,
p.business_description as before,
t.business_description as now
FROM "s3tablescatalog/aws-sagemaker-catalog"."asset_metadata"."asset" t
JOIN "s3tablescatalog/aws-sagemaker-catalog"."asset_metadata"."asset" p
ON t.asset_id = p.asset_id
WHERE DATE(t.snapshot_time) = CURRENT_DATE
AND DATE(p.snapshot_time) = CURRENT_DATE - INTERVAL '7' DAY
AND p.business_description IS NULL
AND t.business_description IS NOT NULL;按所有者统计资产:
sql
SELECT extended_metadata['owningEntityId'] as owner, COUNT(*) as count
FROM "s3tablescatalog/aws-sagemaker-catalog"."asset_metadata"."asset"
WHERE DATE(snapshot_time) = CURRENT_DATE
AND extended_metadata['owningEntityId'] IS NOT NULL
GROUP BY extended_metadata['owningEntityId']
ORDER BY count DESC;按元数据表单字段过滤:
sql
SELECT *
FROM "s3tablescatalog/aws-sagemaker-catalog"."asset_metadata"."asset"
WHERE DATE(snapshot_time) = CURRENT_DATE
AND extended_metadata['<metadata-form-name>.<field-name>'] = '<field-value>';Key Behaviors
核心特性
- Daily snapshots — exported around midnight per region
- Always filter by — without it you get all history (duplicates, slow)
snapshot_time - One domain per account per region — to switch domains, delete config first
- No additional charge beyond S3 Tables storage + Athena queries
- Read-only — to update asset metadata, use Glue Discovery APIs or SageMaker Unified Studio
- 每日快照——大致在每个区域的午夜时段导出
- 必须按过滤——若不添加此条件,会获取所有历史数据(重复、查询缓慢)
snapshot_time - 每个区域每个账户对应一个域——切换域需先删除现有配置
- 无额外费用——仅产生S3 Tables存储和Athena查询的费用
- 只读——如需更新资产元数据,请使用Glue Discovery API或SageMaker Unified Studio
Troubleshooting
故障排查
| Error | Cause | Fix |
|---|---|---|
| Export not enabled | Run |
Empty results with | First export hasn't run yet (takes up to 24h) | Wait; try yesterday's date |
| Missing Lake Formation grants | Grant SELECT + DESCRIBE on the table |
| S3 Tables not registered in Glue | Enable integration: S3 console > Table buckets > Enable integration |
| Duplicate rows in results | Missing | Add |
| Key doesn't exist for that asset | Check available keys: |
| Cannot update export encryption | Encryption set at creation time only | Delete and recreate export config |
| 错误 | 原因 | 解决方法 |
|---|---|---|
未找到 | 未启用导出 | 执行 |
使用 | 首次导出尚未完成(最长需24小时) | 等待;尝试查询昨日日期 |
查询时出现 | 缺少Lake Formation权限 | 为该表授予SELECT + DESCRIBE权限 |
出现 | S3 Tables未在Glue中注册 | 启用集成:S3控制台 > 表存储桶 > 启用集成 |
| 查询结果出现重复行 | 缺少 | 添加 |
| 该资产不存在此键 | 查看可用键: |
| 无法修改导出加密设置 | 加密设置仅在创建时指定 | 删除并重新创建导出配置 |
Security Considerations
安全注意事项
Data sensitivity: Catalog metadata exposes organizational structure including asset names, ownership, account IDs, naming conventions, and internal resource identifiers. Treat query results as sensitive by default.
Encryption at rest: Always enable KMS encryption when creating the export configuration. Encryption cannot be changed after creation. Additionally, configure SSE-KMS on your Athena workgroup output bucket.
Least-privilege access: Grant Lake Formation SELECT + DESCRIBE only on the specific table to roles that need catalog analytics. Avoid granting access to the entire bucket.
asset_metadata.assetaws-sagemaker-catalogAudit trail: Enable CloudTrail logging for DataZone (, ), Athena (, ), and S3 Tables API calls to track who queries catalog metadata.
PutDataExportConfigurationGetDataExportConfigurationStartQueryExecutionGetQueryResultsCredential hygiene: Use IAM roles with temporary credentials for querying. Avoid long-lived access keys for users accessing catalog metadata. Scope down or rotate principals when access is no longer needed.
数据敏感性:目录元数据会暴露组织结构,包括资产名称、所有权、账户ID、命名规范以及内部资源标识符。默认情况下,请将查询结果视为敏感数据。
静态加密:创建导出配置时请始终启用KMS加密。加密设置创建后无法修改。此外,请为Athena工作组输出存储桶配置SSE-KMS加密。
最小权限访问:仅为需要目录分析的角色授予Lake Formation对特定表的SELECT + DESCRIBE权限。避免授予对整个存储桶的访问权限。
asset_metadata.assetaws-sagemaker-catalog审计跟踪:为DataZone(、)、Athena(、)和S3 Tables API调用启用CloudTrail日志记录,以跟踪谁查询了目录元数据。
PutDataExportConfigurationGetDataExportConfigurationStartQueryExecutionGetQueryResults凭证安全:使用带有临时凭证的IAM角色执行查询。避免使用长期访问密钥让用户访问目录元数据。当不再需要访问权限时,缩小权限范围或轮换主体。