# warehouse-init: Initialize Warehouse Schema
Generate a comprehensive, user-editable schema reference file for the data warehouse.
**Scripts:** `../analyzing-data/scripts/` — all CLI commands below (`scripts/cli.py`) are relative to the `analyzing-data` skill's directory. Before running any command, `cd ../analyzing-data/` relative to this file.

## What This Does
- Discovers all databases, schemas, tables, and columns from the warehouse
- Enriches with codebase context (dbt models, gusty SQL, schema docs)
- Records row counts and identifies large tables
- Generates `.astro/warehouse.md` - a version-controllable, team-shareable reference
- Enables instant concept→table lookups without warehouse queries
## Process

### Step 1: Read Warehouse Configuration
```bash
cat ~/.astro/agents/warehouse.yml
```

Get the list of databases to discover (e.g., `databases: [HQ, ANALYTICS, RAW]`).

### Step 2: Search Codebase for Context (Parallel)
Launch a subagent to find business context in code:

```
Task(
    subagent_type="Explore",
    prompt="""
    Search for data model documentation in the codebase:
    1. dbt models: **/models/**/*.yml, **/schema.yml
       - Extract table descriptions, column descriptions
       - Note primary keys and tests
    2. Gusty/declarative SQL: **/dags/**/*.sql with YAML frontmatter
       - Parse frontmatter for: description, primary_key, tests
       - Note schema mappings
    3. AGENTS.md or CLAUDE.md files with data layer documentation
    Return a mapping of:
    table_name -> {description, primary_key, important_columns, layer}
    """
)
```

### Step 3: Parallel Warehouse Discovery
Launch one subagent per database using the Task tool:

```
For each database in configured_databases:
    Task(
        subagent_type="general-purpose",
        prompt="""
        Discover all metadata for database {DATABASE}.
        Use the CLI to run SQL queries:
            # Scripts are relative to ../analyzing-data/
            uv run scripts/cli.py exec "df = run_sql('...')"
            uv run scripts/cli.py exec "print(df)"
        1. Query schemas:
           SELECT SCHEMA_NAME FROM {DATABASE}.INFORMATION_SCHEMA.SCHEMATA
        2. Query tables with row counts:
           SELECT TABLE_SCHEMA, TABLE_NAME, ROW_COUNT, COMMENT
           FROM {DATABASE}.INFORMATION_SCHEMA.TABLES
           ORDER BY TABLE_SCHEMA, TABLE_NAME
        3. For important schemas (MODEL_*, METRICS_*, MART_*), query columns:
           SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, COMMENT
           FROM {DATABASE}.INFORMATION_SCHEMA.COLUMNS
           WHERE TABLE_SCHEMA = 'X'
        Return a structured summary:
        - Database name
        - List of schemas with table counts
        - For each table: name, row_count, key columns
        - Flag any tables with >100M rows as "large"
        """
    )
```

Run all subagents in parallel (single message with multiple Task calls).
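The per-table summary a discovery subagent returns can be sketched as follows. Only the >100M-row threshold comes from the prompt above; the `tables` input shape and the helper name are illustrative assumptions:

```python
LARGE_TABLE_THRESHOLD = 100_000_000  # >100M rows, per the discovery prompt

def summarize_tables(tables):
    """Build the structured per-table summary a discovery subagent returns.

    `tables` is a list of dicts keyed like the INFORMATION_SCHEMA.TABLES
    columns (TABLE_SCHEMA, TABLE_NAME, ROW_COUNT).
    """
    summary = []
    for t in tables:
        summary.append({
            "schema": t["TABLE_SCHEMA"],
            "name": t["TABLE_NAME"],
            "row_count": t["ROW_COUNT"],
            "large": t["ROW_COUNT"] > LARGE_TABLE_THRESHOLD,
        })
    return summary

rows = [
    {"TABLE_SCHEMA": "MODEL_ASTRO", "TABLE_NAME": "ORGANIZATIONS", "ROW_COUNT": 120_500},
    {"TABLE_SCHEMA": "RAW_EVENTS", "TABLE_NAME": "TASK_RUNS", "ROW_COUNT": 450_000_000},
]
print([t["name"] for t in summarize_tables(rows) if t["large"]])  # ['TASK_RUNS']
```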
### Step 4: Discover Categorical Value Families
For key categorical columns (like OPERATOR, STATUS, TYPE, FEATURE), discover value families:
```bash
uv run cli.py exec "df = run_sql('''
    SELECT DISTINCT column_name, COUNT(*) as occurrences
    FROM table
    WHERE column_name IS NOT NULL
    GROUP BY column_name
    ORDER BY occurrences DESC
    LIMIT 50
''')"
uv run cli.py exec "print(df)"
```

Group related values into families by common prefix/suffix (e.g., `Export*` for ExportCSV, ExportJSON, ExportParquet).

### Step 5: Merge Results
Combine warehouse metadata + codebase context:
- Quick Reference table - concept → table mappings (pre-populated from code if found)
- Categorical Columns - value families for key filter columns
- Database sections - one per database
- Schema subsections - tables grouped by schema
- Table details - columns, row counts, descriptions from code, warnings
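The merge itself is essentially a keyed dictionary merge of warehouse metadata with the codebase mapping from Step 2. The data shapes below are illustrative assumptions:

```python
def merge_context(warehouse_tables, code_context):
    """Attach codebase descriptions and keys to discovered tables by name."""
    merged = []
    for t in warehouse_tables:
        ctx = code_context.get(t["name"], {})
        merged.append({**t,
                       "description": ctx.get("description"),
                       "primary_key": ctx.get("primary_key")})
    return merged

tables = [{"name": "ORGANIZATIONS", "row_count": 120_500}]
ctx = {"ORGANIZATIONS": {"description": "One row per customer org",
                         "primary_key": "ORG_ID"}}
print(merge_context(tables, ctx)[0]["primary_key"])  # ORG_ID
```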
### Step 6: Generate warehouse.md
Write the file to:

- `.astro/warehouse.md` (default - project-specific, version-controllable)
- `~/.astro/agents/warehouse.md` (if `--global` flag)
## Output Format
```markdown
# Warehouse Schema

Generated by /data:warehouse-init on {DATE}. Edit freely to add business context.

## Quick Reference

| Concept | Table | Key Column | Date Column |
|---|---|---|---|
| customers | HQ.MODEL_ASTRO.ORGANIZATIONS | ORG_ID | CREATED_AT |

## Categorical Columns

When filtering on these columns, explore value families first (values often have variants):

| Table | Column | Value Families |
|---|---|---|
| {TABLE} | {COLUMN} | |

## Data Layer Hierarchy

Query downstream first: reporting > mart_* > metric_* > model_* > IN_*

| Layer | Prefix | Purpose |
|---|---|---|
| Reporting | reporting | Dashboard-optimized |
| Mart | mart_* | Combined analytics |
| Metric | metric_* | KPIs at various grains |
| Model | model_* | Cleansed sources of truth |
| Raw | IN_* | Source data - avoid |

## {DATABASE} Database

### {SCHEMA} Schema

#### {TABLE_NAME}

{DESCRIPTION from code if found}

| Column | Type | Description |
|---|---|---|
| COL1 | VARCHAR | {from code or inferred} |

- Rows: {ROW_COUNT}
- Key column: {PRIMARY_KEY from code or inferred}
{IF ROW_COUNT > 100M: - ⚠️ WARNING: Large table - always add date filters}

## Relationships

{Inferred relationships based on column names like *_ID}
```

## Command Options
| Option | Effect |
|---|---|
| (default) | Generate .astro/warehouse.md |
| `--refresh` | Regenerate, preserving user edits |
|  | Only discover specific database |
| `--global` | Write to ~/.astro/agents/ instead |
### Step 7: Pre-populate Cache
After generating warehouse.md, populate the concept cache:
```bash
# Scripts are relative to ../analyzing-data/
uv run cli.py concept import -p .astro/warehouse.md
uv run cli.py concept learn customers HQ.MART_CUST.CURRENT_ASTRO_CUSTS -k ACCT_ID
```
undefinedStep 8: Offer CLAUDE.md Integration (Ask User)
步骤8:提供CLAUDE.md集成选项(询问用户)
Ask the user:

Would you like to add the Quick Reference table to your CLAUDE.md file? This ensures the schema mappings are always in context for data queries, improving accuracy from ~25% to ~100% for complex queries.

Options:
- Yes, add to CLAUDE.md (Recommended) - Append Quick Reference section
- No, skip - Use warehouse.md and cache only

If user chooses Yes:
- Check if `.claude/CLAUDE.md` or `CLAUDE.md` exists
- If it exists, append the Quick Reference section (avoid duplicates)
- If not, create `.claude/CLAUDE.md` with just the Quick Reference

Quick Reference section to add:

```markdown
# Data Warehouse Quick Reference

When querying the warehouse, use these table mappings:

| Concept | Table | Key Column | Date Column |
|---|---|---|---|
| {rows from warehouse.md Quick Reference} |

Large tables (always filter by date): {list tables with >100M rows}

Auto-generated by /data:warehouse-init. Run /data:warehouse-init --refresh to update.
```
**If yes:** Append the Quick Reference section to `.claude/CLAUDE.md` or `CLAUDE.md`.
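The "check, append, avoid duplicates" logic from this step can be sketched as follows. The heading sentinel, function name, and return shape are assumptions, not the skill's actual code:

```python
from pathlib import Path

# Hypothetical sentinel: the heading the appended section starts with.
SECTION_HEADER = "# Data Warehouse Quick Reference"

def add_quick_reference(section, candidates=(".claude/CLAUDE.md", "CLAUDE.md")):
    """Append `section` to the first existing CLAUDE.md, skipping duplicates.

    Returns (path, appended) so the caller can report what happened.
    """
    for name in candidates:
        p = Path(name)
        if p.exists():
            text = p.read_text()
            if SECTION_HEADER in text:  # already present - avoid duplicates
                return p, False
            p.write_text(text.rstrip() + "\n\n" + section + "\n")
            return p, True
    # Neither file exists: create .claude/CLAUDE.md with just the section.
    p = Path(candidates[0])
    p.parent.mkdir(parents=True, exist_ok=True)
    p.write_text(section + "\n")
    return p, True
```

Calling it twice with the same section appends once and then becomes a no-op, which is the "avoid duplicates" guarantee the step asks for.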
## After Generation
Tell the user:

```
Generated .astro/warehouse.md

Summary:
- {N} databases, {N} schemas, {N} tables
- {N} tables enriched with code descriptions
- {N} concepts cached for instant lookup

Next steps:
1. Edit .astro/warehouse.md to add business context
2. Commit to version control
3. Run /data:warehouse-init --refresh when schema changes
```

## Refresh Behavior
When `--refresh` is specified:

- Read existing warehouse.md
- Preserve all HTML comments (`<!-- ... -->`)
- Preserve Quick Reference table entries (user-added)
- Preserve user-added descriptions
- Update row counts and add new tables
- Mark removed tables with a `<!-- REMOVED -->` comment
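The add/remove bookkeeping behind a refresh can be sketched like this; helper names and data shapes are illustrative assumptions:

```python
def diff_tables(old_names, new_names):
    """Return (added, removed) table-name lists between two discovery runs."""
    old, new = set(old_names), set(new_names)
    return sorted(new - old), sorted(old - new)

def mark_removed(heading, removed):
    """Tag a warehouse.md table heading with <!-- REMOVED --> if it is gone."""
    name = heading.lstrip("# ").strip()
    if name in removed and "<!-- REMOVED -->" not in heading:
        return heading + " <!-- REMOVED -->"
    return heading

added, removed = diff_tables(["ORGS", "USERS", "LEGACY"], ["ORGS", "USERS", "EVENTS"])
print(added, removed)                        # ['EVENTS'] ['LEGACY']
print(mark_removed("#### LEGACY", removed))  # #### LEGACY <!-- REMOVED -->
```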
## Cache Staleness & Schema Drift
The runtime cache has a 7-day TTL by default. After 7 days, cached entries expire and will be re-discovered on next use.
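The TTL check amounts to a timestamp comparison. The 7-day default comes from above; the function and field names are illustrative:

```python
from datetime import datetime, timedelta

CACHE_TTL = timedelta(days=7)  # default runtime-cache TTL

def is_stale(cached_at, now=None):
    """True if a cache entry is older than the TTL and must be re-discovered."""
    now = now or datetime.now()
    return now - cached_at > CACHE_TTL

t0 = datetime(2024, 1, 1)
print(is_stale(t0, now=datetime(2024, 1, 5)))  # False (4 days old)
print(is_stale(t0, now=datetime(2024, 1, 9)))  # True  (8 days old)
```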
### When to Refresh
Run `/data:warehouse-init --refresh` when:

- Schema changes: Tables added, renamed, or removed
- Column changes: New columns added or types changed
- After deployments: If your data pipeline deploys schema migrations
- Weekly: As a good practice, even if no known changes
### Signs of Stale Cache
Watch for these indicators:
- Queries fail with "table not found" errors
- Results seem wrong or outdated
- New tables aren't being discovered
### Manual Cache Reset
If you suspect cache issues:
```bash
# Scripts are relative to ../analyzing-data/
uv run scripts/cli.py cache status
uv run scripts/cli.py cache clear --stale-only
uv run scripts/cli.py cache clear
```

## Codebase Patterns Recognized
| Pattern | Source | What We Extract |
|---|---|---|
| `**/models/**/*.yml` | dbt | table/column descriptions, tests |
| `**/dags/**/*.sql` | gusty | YAML frontmatter (description, primary_key) |
| `AGENTS.md` | docs | data layer hierarchy, conventions |
| `CLAUDE.md` | docs | business context |
## Example Session
```
User: /data:warehouse-init

Agent:
→ Reading warehouse configuration...
→ Found 1 warehouse with databases: HQ, PRODUCT
→ Searching codebase for data documentation...
  Found: AGENTS.md with data layer hierarchy
  Found: 45 SQL files with YAML frontmatter in dags/declarative/
→ Launching parallel warehouse discovery...
  [Database: HQ] Discovering schemas...
  [Database: PRODUCT] Discovering schemas...
→ HQ: Found 29 schemas, 401 tables
→ PRODUCT: Found 1 schema, 0 tables
→ Merging warehouse metadata with code context...
  Enriched 45 tables with descriptions from code
→ Generated .astro/warehouse.md

Summary:
- 2 databases
- 30 schemas
- 401 tables
- 45 tables enriched with code descriptions
- 8 large tables flagged (>100M rows)

Next steps:
1. Review .astro/warehouse.md
2. Add concept mappings to Quick Reference
3. Commit to version control
4. Run /data:warehouse-init --refresh when schema changes
```