warehouse-init


Initialize Warehouse Schema


Generate a comprehensive, user-editable schema reference file for the data warehouse.

Scripts: `../analyzing-data/scripts/`. All CLI commands below are relative to the `analyzing-data` skill's directory. Before running any `scripts/cli.py` command, `cd` to `../analyzing-data/` relative to this file.

What This Does


  1. Discovers all databases, schemas, tables, and columns from the warehouse
  2. Enriches with codebase context (dbt models, gusty SQL, schema docs)
  3. Records row counts and identifies large tables
  4. Generates `.astro/warehouse.md` - a version-controllable, team-shareable reference
  5. Enables instant concept→table lookups without warehouse queries

Process


Step 1: Read Warehouse Configuration


```bash
cat ~/.astro/agents/warehouse.yml
```
Get the list of databases to discover (e.g., `databases: [HQ, ANALYTICS, RAW]`).
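As an illustration, extracting the database list from step 1 can be sketched with a stdlib-only helper (the config shape is assumed from the example above; a real implementation would use a YAML parser):

```python
import re

def read_databases(config_text: str) -> list[str]:
    """Extract the `databases: [A, B, C]` inline list from warehouse.yml text."""
    match = re.search(r"databases:\s*\[([^\]]*)\]", config_text)
    if not match:
        return []
    return [name.strip() for name in match.group(1).split(",") if name.strip()]

config = "warehouse:\n  databases: [HQ, ANALYTICS, RAW]\n"
print(read_databases(config))  # -> ['HQ', 'ANALYTICS', 'RAW']
```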

Step 2: Search Codebase for Context (Parallel)


Launch a subagent to find business context in code:
Task(
    subagent_type="Explore",
    prompt="""
    Search for data model documentation in the codebase:

    1. dbt models: **/models/**/*.yml, **/schema.yml
       - Extract table descriptions, column descriptions
       - Note primary keys and tests

    2. Gusty/declarative SQL: **/dags/**/*.sql with YAML frontmatter
       - Parse frontmatter for: description, primary_key, tests
       - Note schema mappings

    3. AGENTS.md or CLAUDE.md files with data layer documentation

    Return a mapping of:
      table_name -> {description, primary_key, important_columns, layer}
    """
)
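The gusty frontmatter parsing the subagent is asked to do can be sketched as follows (a minimal stdlib-only version handling flat `key: value` pairs; a real implementation would use a YAML library):

```python
def parse_frontmatter(sql_text: str) -> dict:
    """Split YAML frontmatter (between leading --- markers) from a gusty-style SQL file."""
    lines = sql_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter block
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta

sample = "---\ndescription: Current customers\nprimary_key: ACCT_ID\n---\nSELECT * FROM customers\n"
print(parse_frontmatter(sample))  # -> {'description': 'Current customers', 'primary_key': 'ACCT_ID'}
```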

Step 3: Parallel Warehouse Discovery


Launch one subagent per database using the Task tool:
For each database in configured_databases:
    Task(
        subagent_type="general-purpose",
        prompt="""
        Discover all metadata for database {DATABASE}.

        Use the CLI to run SQL queries:
        # Scripts are relative to ../analyzing-data/
        uv run scripts/cli.py exec "df = run_sql('...')"
        uv run scripts/cli.py exec "print(df)"

        1. Query schemas:
           SELECT SCHEMA_NAME FROM {DATABASE}.INFORMATION_SCHEMA.SCHEMATA

        2. Query tables with row counts:
           SELECT TABLE_SCHEMA, TABLE_NAME, ROW_COUNT, COMMENT
           FROM {DATABASE}.INFORMATION_SCHEMA.TABLES
           ORDER BY TABLE_SCHEMA, TABLE_NAME

        3. For important schemas (MODEL_*, METRICS_*, MART_*), query columns:
           SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, COMMENT
           FROM {DATABASE}.INFORMATION_SCHEMA.COLUMNS
           WHERE TABLE_SCHEMA = 'X'

        Return a structured summary:
        - Database name
        - List of schemas with table counts
        - For each table: name, row_count, key columns
        - Flag any tables with >100M rows as "large"
        """
    )
Run all subagents in parallel (single message with multiple Task calls).
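The large-table flagging in step 3 amounts to a simple threshold check; a sketch (the row-dict shape mirrors the INFORMATION_SCHEMA columns queried above):

```python
LARGE_ROW_THRESHOLD = 100_000_000  # flag tables above 100M rows

def flag_large(tables: list[dict]) -> list[str]:
    """Return fully qualified names of tables whose ROW_COUNT exceeds the threshold."""
    return [
        f"{t['TABLE_SCHEMA']}.{t['TABLE_NAME']}"
        for t in tables
        if (t.get("ROW_COUNT") or 0) > LARGE_ROW_THRESHOLD  # treat NULL counts as 0
    ]

tables = [
    {"TABLE_SCHEMA": "MODEL_ASTRO", "TABLE_NAME": "ORGANIZATIONS", "ROW_COUNT": 120_000},
    {"TABLE_SCHEMA": "MODEL_ASTRO", "TABLE_NAME": "EVENTS", "ROW_COUNT": 450_000_000},
]
print(flag_large(tables))  # -> ['MODEL_ASTRO.EVENTS']
```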

Step 4: Discover Categorical Value Families


For key categorical columns (like OPERATOR, STATUS, TYPE, FEATURE), discover value families:
```bash
uv run cli.py exec "df = run_sql('''
SELECT DISTINCT column_name, COUNT(*) as occurrences
FROM table
WHERE column_name IS NOT NULL
GROUP BY column_name
ORDER BY occurrences DESC
LIMIT 50
''')"
uv run cli.py exec "print(df)"
```
Group related values into families by common prefix/suffix (e.g., `Export*` for ExportCSV, ExportJSON, ExportParquet).
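The prefix grouping described above can be sketched as a heuristic that buckets values by their leading CamelCase word (illustrative only; real value families may need manual review):

```python
import re
from collections import defaultdict

def group_by_prefix(values: list[str], min_size: int = 2) -> dict[str, list[str]]:
    """Group categorical values into families by their leading CamelCase word.

    Heuristic: 'ExportCSV' and 'ExportJSON' both fall under the family 'Export*'.
    """
    families = defaultdict(list)
    for value in values:
        match = re.match(r"[A-Z][a-z]+", value)
        prefix = match.group(0) if match else value
        families[prefix].append(value)
    # Keep only groups with at least min_size members
    return {
        f"{prefix}*": sorted(members)
        for prefix, members in families.items()
        if len(members) >= min_size
    }

print(group_by_prefix(["ExportCSV", "ExportJSON", "ExportParquet", "Login"]))
# -> {'Export*': ['ExportCSV', 'ExportJSON', 'ExportParquet']}
```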

Step 5: Merge Results


Combine warehouse metadata + codebase context:
  1. Quick Reference table - concept → table mappings (pre-populated from code if found)
  2. Categorical Columns - value families for key filter columns
  3. Database sections - one per database
  4. Schema subsections - tables grouped by schema
  5. Table details - columns, row counts, descriptions from code, warnings
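Conceptually, the merge in step 5 overlays code-derived context onto the discovered metadata; a sketch (both inputs keyed by table name, the dict shapes are assumptions):

```python
def merge_context(warehouse: dict, code_context: dict) -> dict:
    """Overlay code-derived descriptions onto discovered warehouse metadata.

    Non-empty values from the codebase (dbt/gusty docs) win over inferred ones.
    """
    merged = {}
    for table, meta in warehouse.items():
        enriched = dict(meta)
        if table in code_context:
            # Only copy non-empty values so discovery data isn't blanked out
            enriched.update({k: v for k, v in code_context[table].items() if v})
        merged[table] = enriched
    return merged

warehouse = {"ORGANIZATIONS": {"row_count": 120_000, "description": ""}}
code = {"ORGANIZATIONS": {"description": "Customer orgs", "primary_key": "ORG_ID"}}
print(merge_context(warehouse, code))
```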

Step 6: Generate warehouse.md


Write the file to:
  • `.astro/warehouse.md` (default - project-specific, version-controllable)
  • `~/.astro/agents/warehouse.md` (with the `--global` flag)

Output Format



Warehouse Schema


Generated by `/data:warehouse-init` on {DATE}. Edit freely to add business context.

Quick Reference


| Concept | Table | Key Column | Date Column |
| --- | --- | --- | --- |
| customers | HQ.MODEL_ASTRO.ORGANIZATIONS | ORG_ID | CREATED_AT |

<!-- Add your concept mappings here -->

Categorical Columns


When filtering on these columns, explore value families first (values often have variants):
| Table | Column | Value Families |
| --- | --- | --- |
| {TABLE} | {COLUMN} | {PREFIX}* ({VALUE1}, {VALUE2}, ...) |

<!-- Populated by /data:warehouse-init from actual warehouse data -->

Data Layer Hierarchy


Query downstream first: `reporting` > `mart_*` > `metric_*` > `model_*` > `IN_*`

| Layer | Prefix | Purpose |
| --- | --- | --- |
| Reporting | `reporting.*` | Dashboard-optimized |
| Mart | `mart_*` | Combined analytics |
| Metric | `metric_*` | KPIs at various grains |
| Model | `model_*` | Cleansed sources of truth |
| Raw | `IN_*` | Source data - avoid |

{DATABASE} Database


{SCHEMA} Schema


{TABLE_NAME}


{DESCRIPTION from code if found}
| Column | Type | Description |
| --- | --- | --- |
| COL1 | VARCHAR | {from code or inferred} |

  • Rows: {ROW_COUNT}
  • Key column: {PRIMARY_KEY from code or inferred} {IF ROW_COUNT > 100M: - ⚠️ WARNING: Large table - always add date filters}

Relationships


{Inferred relationships based on column names like *_ID}
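The `*_ID` inference mentioned above can be sketched as a name-matching heuristic (illustrative only; inferred links should be verified before use):

```python
def infer_relationships(columns_by_table: dict[str, list[str]]) -> list[tuple[str, str, str]]:
    """Guess foreign-key links: a column FOO_ID in one table pointing at another
    table whose name starts with FOO.
    """
    links = []
    for table, columns in columns_by_table.items():
        for col in columns:
            if not col.endswith("_ID"):
                continue
            stem = col[:-3]  # e.g. ORG from ORG_ID
            for other in columns_by_table:
                if other != table and other.upper().startswith(stem.upper()):
                    links.append((table, col, other))
    return links

tables = {
    "ORGANIZATIONS": ["ORG_ID", "NAME"],
    "DEPLOYMENTS": ["DEPLOYMENT_ID", "ORG_ID"],
}
print(infer_relationships(tables))  # -> [('DEPLOYMENTS', 'ORG_ID', 'ORGANIZATIONS')]
```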

Command Options


| Option | Effect |
| --- | --- |
| `/data:warehouse-init` | Generate .astro/warehouse.md |
| `/data:warehouse-init --refresh` | Regenerate, preserving user edits |
| `/data:warehouse-init --database HQ` | Only discover the specified database |
| `/data:warehouse-init --global` | Write to ~/.astro/agents/ instead |

Step 7: Pre-populate Cache


After generating warehouse.md, populate the concept cache:
```bash
# Scripts are relative to ../analyzing-data/
uv run cli.py concept import -p .astro/warehouse.md
uv run cli.py concept learn customers HQ.MART_CUST.CURRENT_ASTRO_CUSTS -k ACCT_ID
```

Step 8: Offer CLAUDE.md Integration (Ask User)


Ask the user:
Would you like to add the Quick Reference table to your CLAUDE.md file?
This ensures the schema mappings are always in context for data queries, improving accuracy from ~25% to ~100% for complex queries.
Options:
  1. Yes, add to CLAUDE.md (Recommended) - Append Quick Reference section
  2. No, skip - Use warehouse.md and cache only
If user chooses Yes:
  1. Check if `.claude/CLAUDE.md` or `CLAUDE.md` exists
  2. If exists, append the Quick Reference section (avoid duplicates)
  3. If not exists, create `.claude/CLAUDE.md` with just the Quick Reference
Quick Reference section to add:
Quick Reference section to add:

Data Warehouse Quick Reference


When querying the warehouse, use these table mappings:
| Concept | Table | Key Column | Date Column |
| --- | --- | --- | --- |
{rows from warehouse.md Quick Reference}

Large tables (always filter by date): {list tables with >100M rows}

Auto-generated by `/data:warehouse-init`. Run `/data:warehouse-init --refresh` to update.
**If yes:** Append the Quick Reference section to `.claude/CLAUDE.md` or `CLAUDE.md`.

After Generation


Tell the user:
Generated .astro/warehouse.md

Summary:
  - {N} databases, {N} schemas, {N} tables
  - {N} tables enriched with code descriptions
  - {N} concepts cached for instant lookup

Next steps:
  1. Edit .astro/warehouse.md to add business context
  2. Commit to version control
  3. Run /data:warehouse-init --refresh when schema changes

Refresh Behavior


When `--refresh` is specified:
  1. Read existing warehouse.md
  2. Preserve all HTML comments (`<!-- ... -->`)
  3. Preserve Quick Reference table entries (user-added)
  4. Preserve user-added descriptions
  5. Update row counts and add new tables
  6. Mark removed tables with a `<!-- REMOVED -->` comment
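Marking removed tables during a refresh merge can be sketched as a set difference over the old and newly discovered table names:

```python
def mark_removed(previous: set[str], current: set[str]) -> dict[str, str]:
    """Annotate tables present in the old warehouse.md but absent from the
    fresh discovery with a <!-- REMOVED --> marker instead of deleting them."""
    annotations = {}
    for table in previous - current:
        annotations[table] = "<!-- REMOVED -->"
    return annotations

print(mark_removed({"ORGS", "LEGACY_EVENTS"}, {"ORGS"}))  # -> {'LEGACY_EVENTS': '<!-- REMOVED -->'}
```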

Cache Staleness & Schema Drift


The runtime cache has a 7-day TTL by default. After 7 days, cached entries expire and will be re-discovered on next use.
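The TTL check is a simple timestamp comparison; a minimal sketch (in practice `now` would come from `time.time()`):

```python
CACHE_TTL_SECONDS = 7 * 24 * 3600  # 7-day default TTL

def is_stale(cached_at: float, now: float) -> bool:
    """True when a cache entry written at `cached_at` (epoch seconds) has
    outlived the TTL at time `now` and should be re-discovered."""
    return (now - cached_at) > CACHE_TTL_SECONDS

print(is_stale(0.0, 8 * 24 * 3600))  # -> True
```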

When to Refresh


Run `/data:warehouse-init --refresh` when:
  • Schema changes: Tables added, renamed, or removed
  • Column changes: New columns added or types changed
  • After deployments: If your data pipeline deploys schema migrations
  • Weekly: As a good practice, even if no known changes

Signs of Stale Cache


Watch for these indicators:
  • Queries fail with "table not found" errors
  • Results seem wrong or outdated
  • New tables aren't being discovered

Manual Cache Reset


If you suspect cache issues:
```bash
# Scripts are relative to ../analyzing-data/
uv run scripts/cli.py cache status
uv run scripts/cli.py cache clear --stale-only
uv run scripts/cli.py cache clear
```

Codebase Patterns Recognized


| Pattern | Source | What We Extract |
| --- | --- | --- |
| `**/models/**/*.yml` | dbt | table/column descriptions, tests |
| `**/dags/**/*.sql` | gusty | YAML frontmatter (description, primary_key) |
| `AGENTS.md`, `CLAUDE.md` | docs | data layer hierarchy, conventions |
| `**/docs/**/*.md` | docs | business context |

Example Session


User: /data:warehouse-init

Agent:
→ Reading warehouse configuration...
→ Found 1 warehouse with databases: HQ, PRODUCT

→ Searching codebase for data documentation...
  Found: AGENTS.md with data layer hierarchy
  Found: 45 SQL files with YAML frontmatter in dags/declarative/

→ Launching parallel warehouse discovery...
  [Database: HQ] Discovering schemas...
  [Database: PRODUCT] Discovering schemas...

→ HQ: Found 29 schemas, 401 tables
→ PRODUCT: Found 1 schema, 0 tables

→ Merging warehouse metadata with code context...
  Enriched 45 tables with descriptions from code

→ Generated .astro/warehouse.md

Summary:
  - 2 databases
  - 30 schemas
  - 401 tables
  - 45 tables enriched with code descriptions
  - 8 large tables flagged (>100M rows)

Next steps:
  1. Review .astro/warehouse.md
  2. Add concept mappings to Quick Reference
  3. Commit to version control
  4. Run /data:warehouse-init --refresh when schema changes