warehouse-init


Initialize Warehouse Schema


Generate a comprehensive, user-editable schema reference file for the data warehouse.

Scripts: `../analyzing-data/scripts/`. All CLI commands below are relative to the `analyzing-data` skill's directory. Before running any `scripts/cli.py` command, `cd` to `../analyzing-data/` relative to this file.

What This Does


  1. Discovers all databases, schemas, tables, and columns from the warehouse
  2. Enriches with codebase context (dbt models, gusty SQL, schema docs)
  3. Records row counts and identifies large tables
  4. Generates `.astro/warehouse.md` - a version-controllable, team-shareable reference
  5. Enables instant concept→table lookups without warehouse queries

Process


Step 1: Read Warehouse Configuration


```bash
cat ~/.astro/agents/warehouse.yml
```
Get the list of databases to discover (e.g., `databases: [HQ, ANALYTICS, RAW]`).
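As an illustration, extracting the database list from step 1 can be sketched with a stdlib-only helper (the config shape is assumed from the example above; a real implementation would use a YAML parser):

```python
import re

def read_databases(config_text: str) -> list[str]:
    """Extract the `databases: [A, B, C]` inline list from warehouse.yml text."""
    match = re.search(r"databases:\s*\[([^\]]*)\]", config_text)
    if not match:
        return []
    return [name.strip() for name in match.group(1).split(",") if name.strip()]

config = "warehouse:\n  databases: [HQ, ANALYTICS, RAW]\n"
print(read_databases(config))  # -> ['HQ', 'ANALYTICS', 'RAW']
```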

Step 2: Search Codebase for Context (Parallel)


Launch a subagent to find business context in code:
Task(
    subagent_type="Explore",
    prompt="""
    Search for data model documentation in the codebase:

    1. dbt models: **/models/**/*.yml, **/schema.yml
       - Extract table descriptions, column descriptions
       - Note primary keys and tests

    2. Gusty/declarative SQL: **/dags/**/*.sql with YAML frontmatter
       - Parse frontmatter for: description, primary_key, tests
       - Note schema mappings

    3. AGENTS.md or CLAUDE.md files with data layer documentation

    Return a mapping of:
      table_name -> {description, primary_key, important_columns, layer}
    """
)
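The gusty frontmatter parsing the subagent is asked to do can be sketched as follows (a minimal stdlib-only version handling flat `key: value` pairs; a real implementation would use a YAML library):

```python
def parse_frontmatter(sql_text: str) -> dict:
    """Split YAML frontmatter (between leading --- markers) from a gusty-style SQL file."""
    lines = sql_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter block
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta

sample = "---\ndescription: Current customers\nprimary_key: ACCT_ID\n---\nSELECT * FROM customers\n"
print(parse_frontmatter(sample))  # -> {'description': 'Current customers', 'primary_key': 'ACCT_ID'}
```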

Step 3: Parallel Warehouse Discovery


Launch one subagent per database using the Task tool:
For each database in configured_databases:
    Task(
        subagent_type="general-purpose",
        prompt="""
        Discover all metadata for database {DATABASE}.

        Use the CLI to run SQL queries:
        # Scripts are relative to ../analyzing-data/
        uv run scripts/cli.py exec "df = run_sql('...')"
        uv run scripts/cli.py exec "print(df)"

        1. Query schemas:
           SELECT SCHEMA_NAME FROM {DATABASE}.INFORMATION_SCHEMA.SCHEMATA

        2. Query tables with row counts:
           SELECT TABLE_SCHEMA, TABLE_NAME, ROW_COUNT, COMMENT
           FROM {DATABASE}.INFORMATION_SCHEMA.TABLES
           ORDER BY TABLE_SCHEMA, TABLE_NAME

        3. For important schemas (MODEL_*, METRICS_*, MART_*), query columns:
           SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, COMMENT
           FROM {DATABASE}.INFORMATION_SCHEMA.COLUMNS
           WHERE TABLE_SCHEMA = 'X'

        Return a structured summary:
        - Database name
        - List of schemas with table counts
        - For each table: name, row_count, key columns
        - Flag any tables with >100M rows as "large"
        """
    )
Run all subagents in parallel (single message with multiple Task calls).
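The large-table flagging in step 3 amounts to a simple threshold check; a sketch (the row-dict shape mirrors the INFORMATION_SCHEMA columns queried above):

```python
LARGE_ROW_THRESHOLD = 100_000_000  # flag tables above 100M rows

def flag_large(tables: list[dict]) -> list[str]:
    """Return fully qualified names of tables whose ROW_COUNT exceeds the threshold."""
    return [
        f"{t['TABLE_SCHEMA']}.{t['TABLE_NAME']}"
        for t in tables
        if (t.get("ROW_COUNT") or 0) > LARGE_ROW_THRESHOLD  # treat NULL counts as 0
    ]

tables = [
    {"TABLE_SCHEMA": "MODEL_ASTRO", "TABLE_NAME": "ORGANIZATIONS", "ROW_COUNT": 120_000},
    {"TABLE_SCHEMA": "MODEL_ASTRO", "TABLE_NAME": "EVENTS", "ROW_COUNT": 450_000_000},
]
print(flag_large(tables))  # -> ['MODEL_ASTRO.EVENTS']
```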

Step 4: Discover Categorical Value Families


For key categorical columns (like OPERATOR, STATUS, TYPE, FEATURE), discover value families:
```bash
uv run cli.py exec "df = run_sql('''
SELECT DISTINCT column_name, COUNT(*) as occurrences
FROM table
WHERE column_name IS NOT NULL
GROUP BY column_name
ORDER BY occurrences DESC
LIMIT 50
''')"
uv run cli.py exec "print(df)"
```
Group related values into families by common prefix/suffix (e.g., `Export*` for ExportCSV, ExportJSON, ExportParquet).
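The prefix grouping described above can be sketched as a heuristic that buckets values by their leading CamelCase word (illustrative only; real value families may need manual review):

```python
import re
from collections import defaultdict

def group_by_prefix(values: list[str], min_size: int = 2) -> dict[str, list[str]]:
    """Group categorical values into families by their leading CamelCase word.

    Heuristic: 'ExportCSV' and 'ExportJSON' both fall under the family 'Export*'.
    """
    families = defaultdict(list)
    for value in values:
        match = re.match(r"[A-Z][a-z]+", value)
        prefix = match.group(0) if match else value
        families[prefix].append(value)
    # Keep only groups with at least min_size members
    return {
        f"{prefix}*": sorted(members)
        for prefix, members in families.items()
        if len(members) >= min_size
    }

print(group_by_prefix(["ExportCSV", "ExportJSON", "ExportParquet", "Login"]))
# -> {'Export*': ['ExportCSV', 'ExportJSON', 'ExportParquet']}
```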

Step 5: Merge Results


Combine warehouse metadata + codebase context:
  1. Quick Reference table - concept → table mappings (pre-populated from code if found)
  2. Categorical Columns - value families for key filter columns
  3. Database sections - one per database
  4. Schema subsections - tables grouped by schema
  5. Table details - columns, row counts, descriptions from code, warnings
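Conceptually, the merge in step 5 overlays code-derived context onto the discovered metadata; a sketch (both inputs keyed by table name, the dict shapes are assumptions):

```python
def merge_context(warehouse: dict, code_context: dict) -> dict:
    """Overlay code-derived descriptions onto discovered warehouse metadata.

    Non-empty values from the codebase (dbt/gusty docs) win over inferred ones.
    """
    merged = {}
    for table, meta in warehouse.items():
        enriched = dict(meta)
        if table in code_context:
            # Only copy non-empty values so discovery data isn't blanked out
            enriched.update({k: v for k, v in code_context[table].items() if v})
        merged[table] = enriched
    return merged

warehouse = {"ORGANIZATIONS": {"row_count": 120_000, "description": ""}}
code = {"ORGANIZATIONS": {"description": "Customer orgs", "primary_key": "ORG_ID"}}
print(merge_context(warehouse, code))
```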

Step 6: Generate warehouse.md


Write the file to:
  • `.astro/warehouse.md` (default - project-specific, version-controllable)
  • `~/.astro/agents/warehouse.md` (with the `--global` flag)

Output Format



Warehouse Schema


Generated by `/data:warehouse-init` on {DATE}. Edit freely to add business context.

Quick Reference


| Concept | Table | Key Column | Date Column |
| --- | --- | --- | --- |
| customers | HQ.MODEL_ASTRO.ORGANIZATIONS | ORG_ID | CREATED_AT |

<!-- Add your concept mappings here -->

Categorical Columns


When filtering on these columns, explore value families first (values often have variants):
| Table | Column | Value Families |
| --- | --- | --- |
| {TABLE} | {COLUMN} | {PREFIX}* ({VALUE1}, {VALUE2}, ...) |

<!-- Populated by /data:warehouse-init from actual warehouse data -->

Data Layer Hierarchy


Query downstream first: `reporting` > `mart_*` > `metric_*` > `model_*` > `IN_*`

| Layer | Prefix | Purpose |
| --- | --- | --- |
| Reporting | `reporting.*` | Dashboard-optimized |
| Mart | `mart_*` | Combined analytics |
| Metric | `metric_*` | KPIs at various grains |
| Model | `model_*` | Cleansed sources of truth |
| Raw | `IN_*` | Source data - avoid |

{DATABASE} Database


{SCHEMA} Schema


{TABLE_NAME}


{DESCRIPTION from code if found}
| Column | Type | Description |
| --- | --- | --- |
| COL1 | VARCHAR | {from code or inferred} |

  • Rows: {ROW_COUNT}
  • Key column: {PRIMARY_KEY from code or inferred} {IF ROW_COUNT > 100M: - ⚠️ WARNING: Large table - always add date filters}

Relationships


{Inferred relationships based on column names like *_ID}
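The `*_ID` inference mentioned above can be sketched as a name-matching heuristic (illustrative only; inferred links should be verified before use):

```python
def infer_relationships(columns_by_table: dict[str, list[str]]) -> list[tuple[str, str, str]]:
    """Guess foreign-key links: a column FOO_ID in one table pointing at another
    table whose name starts with FOO.
    """
    links = []
    for table, columns in columns_by_table.items():
        for col in columns:
            if not col.endswith("_ID"):
                continue
            stem = col[:-3]  # e.g. ORG from ORG_ID
            for other in columns_by_table:
                if other != table and other.upper().startswith(stem.upper()):
                    links.append((table, col, other))
    return links

tables = {
    "ORGANIZATIONS": ["ORG_ID", "NAME"],
    "DEPLOYMENTS": ["DEPLOYMENT_ID", "ORG_ID"],
}
print(infer_relationships(tables))  # -> [('DEPLOYMENTS', 'ORG_ID', 'ORGANIZATIONS')]
```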

Command Options


| Option | Effect |
| --- | --- |
| `/data:warehouse-init` | Generate .astro/warehouse.md |
| `/data:warehouse-init --refresh` | Regenerate, preserving user edits |
| `/data:warehouse-init --database HQ` | Only discover the specified database |
| `/data:warehouse-init --global` | Write to ~/.astro/agents/ instead |

Step 7: Pre-populate Cache


After generating warehouse.md, populate the concept cache:
```bash
# Scripts are relative to ../analyzing-data/
uv run cli.py concept import -p .astro/warehouse.md
uv run cli.py concept learn customers HQ.MART_CUST.CURRENT_ASTRO_CUSTS -k ACCT_ID
```

Step 8: Offer CLAUDE.md Integration (Ask User)


Ask the user:
Would you like to add the Quick Reference table to your CLAUDE.md file?
This ensures the schema mappings are always in context for data queries, improving accuracy from ~25% to ~100% for complex queries.
Options:
  1. Yes, add to CLAUDE.md (Recommended) - Append Quick Reference section
  2. No, skip - Use warehouse.md and cache only
If user chooses Yes:
  1. Check if `.claude/CLAUDE.md` or `CLAUDE.md` exists
  2. If exists, append the Quick Reference section (avoid duplicates)
  3. If not exists, create `.claude/CLAUDE.md` with just the Quick Reference
Quick Reference section to add:
Quick Reference section to add:

Data Warehouse Quick Reference


When querying the warehouse, use these table mappings:
| Concept | Table | Key Column | Date Column |
| --- | --- | --- | --- |
{rows from warehouse.md Quick Reference}

Large tables (always filter by date): {list tables with >100M rows}

Auto-generated by `/data:warehouse-init`. Run `/data:warehouse-init --refresh` to update.
**If yes:** Append the Quick Reference section to `.claude/CLAUDE.md` or `CLAUDE.md`.

After Generation


Tell the user:
Generated .astro/warehouse.md

Summary:
  - {N} databases, {N} schemas, {N} tables
  - {N} tables enriched with code descriptions
  - {N} concepts cached for instant lookup

Next steps:
  1. Edit .astro/warehouse.md to add business context
  2. Commit to version control
  3. Run /data:warehouse-init --refresh when schema changes

Refresh Behavior


When `--refresh` is specified:
  1. Read existing warehouse.md
  2. Preserve all HTML comments (`<!-- ... -->`)
  3. Preserve Quick Reference table entries (user-added)
  4. Preserve user-added descriptions
  5. Update row counts and add new tables
  6. Mark removed tables with a `<!-- REMOVED -->` comment
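Marking removed tables during a refresh merge can be sketched as a set difference over the old and newly discovered table names:

```python
def mark_removed(previous: set[str], current: set[str]) -> dict[str, str]:
    """Annotate tables present in the old warehouse.md but absent from the
    fresh discovery with a <!-- REMOVED --> marker instead of deleting them."""
    annotations = {}
    for table in previous - current:
        annotations[table] = "<!-- REMOVED -->"
    return annotations

print(mark_removed({"ORGS", "LEGACY_EVENTS"}, {"ORGS"}))  # -> {'LEGACY_EVENTS': '<!-- REMOVED -->'}
```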

Cache Staleness & Schema Drift


The runtime cache has a 7-day TTL by default. After 7 days, cached entries expire and will be re-discovered on next use.
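The TTL check is a simple timestamp comparison; a minimal sketch (in practice `now` would come from `time.time()`):

```python
CACHE_TTL_SECONDS = 7 * 24 * 3600  # 7-day default TTL

def is_stale(cached_at: float, now: float) -> bool:
    """True when a cache entry written at `cached_at` (epoch seconds) has
    outlived the TTL at time `now` and should be re-discovered."""
    return (now - cached_at) > CACHE_TTL_SECONDS

print(is_stale(0.0, 8 * 24 * 3600))  # -> True
```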

When to Refresh


Run `/data:warehouse-init --refresh` when:
  • Schema changes: Tables added, renamed, or removed
  • Column changes: New columns added or types changed
  • After deployments: If your data pipeline deploys schema migrations
  • Weekly: As a good practice, even if no known changes

Signs of Stale Cache


Watch for these indicators:
  • Queries fail with "table not found" errors
  • Results seem wrong or outdated
  • New tables aren't being discovered

Manual Cache Reset


If you suspect cache issues:
```bash
# Scripts are relative to ../analyzing-data/
uv run scripts/cli.py cache status
uv run scripts/cli.py cache clear --stale-only
uv run scripts/cli.py cache clear
```

Codebase Patterns Recognized


| Pattern | Source | What We Extract |
| --- | --- | --- |
| `**/models/**/*.yml` | dbt | table/column descriptions, tests |
| `**/dags/**/*.sql` | gusty | YAML frontmatter (description, primary_key) |
| `AGENTS.md`, `CLAUDE.md` | docs | data layer hierarchy, conventions |
| `**/docs/**/*.md` | docs | business context |

Example Session


User: /data:warehouse-init

Agent:
→ Reading warehouse configuration...
→ Found 1 warehouse with databases: HQ, PRODUCT

→ Searching codebase for data documentation...
  Found: AGENTS.md with data layer hierarchy
  Found: 45 SQL files with YAML frontmatter in dags/declarative/

→ Launching parallel warehouse discovery...
  [Database: HQ] Discovering schemas...
  [Database: PRODUCT] Discovering schemas...

→ HQ: Found 29 schemas, 401 tables
→ PRODUCT: Found 1 schema, 0 tables

→ Merging warehouse metadata with code context...
  Enriched 45 tables with descriptions from code

→ Generated .astro/warehouse.md

Summary:
  - 2 databases
  - 30 schemas
  - 401 tables
  - 45 tables enriched with code descriptions
  - 8 large tables flagged (>100M rows)

Next steps:
  1. Review .astro/warehouse.md
  2. Add concept mappings to Quick Reference
  3. Commit to version control
  4. Run /data:warehouse-init --refresh when schema changes