init


Initialize Warehouse Schema


Generate a comprehensive, user-editable schema reference file for the data warehouse.
Scripts:
../analyzing-data/scripts/

What This Does


  1. Discovers all databases, schemas, tables, and columns from the warehouse
  2. Enriches with codebase context (dbt models, gusty SQL, schema docs)
  3. Records row counts and identifies large tables
  4. Generates `.astro/warehouse.md` - a version-controllable, team-shareable reference
  5. Enables instant concept→table lookups without warehouse queries

Process


Step 1: Read Warehouse Configuration


```bash
cat ~/.astro/agents/warehouse.yml
```

Get the list of databases to discover (e.g., `databases: [HQ, ANALYTICS, RAW]`).
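Reading the database list from this config can be sketched as below. This is a minimal illustration assuming the file keeps the simple flow-style `databases: [...]` form shown above; a real implementation should use a proper YAML parser, and `parse_databases` is a hypothetical helper, not part of the CLI.

```python
import re

def parse_databases(yml_text: str) -> list[str]:
    """Extract a flow-style `databases: [A, B, C]` list from warehouse.yml."""
    match = re.search(r"databases:\s*\[([^\]]*)\]", yml_text)
    if not match:
        return []
    return [name.strip() for name in match.group(1).split(",") if name.strip()]

config = """\
databases: [HQ, ANALYTICS, RAW]
"""
print(parse_databases(config))  # ['HQ', 'ANALYTICS', 'RAW']
```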

Step 2: Search Codebase for Context (Parallel)


Launch a subagent to find business context in code:

```python
Task(
    subagent_type="Explore",
    prompt="""
    Search for data model documentation in the codebase:

    1. dbt models: **/models/**/*.yml, **/schema.yml
       - Extract table descriptions, column descriptions
       - Note primary keys and tests

    2. Gusty/declarative SQL: **/dags/**/*.sql with YAML frontmatter
       - Parse frontmatter for: description, primary_key, tests
       - Note schema mappings

    3. AGENTS.md or CLAUDE.md files with data layer documentation

    Return a mapping of:
      table_name -> {description, primary_key, important_columns, layer}
    """
)
```
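The frontmatter extraction the subagent is asked to perform can be sketched as follows. This is an illustrative parser for the simple `key: value` frontmatter format described above, not gusty's actual implementation; nested YAML values would need a real YAML library.

```python
def parse_sql_frontmatter(sql_text: str) -> dict:
    """Parse flat `key: value` YAML frontmatter delimited by `---` lines."""
    lines = sql_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no frontmatter block at the top of the file
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # closing delimiter ends the frontmatter
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta

sql = """---
description: Current customer snapshot
primary_key: ACCT_ID
---
SELECT * FROM raw.customers
"""
print(parse_sql_frontmatter(sql))
```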

Step 3: Parallel Warehouse Discovery


Launch one subagent per database using the Task tool:

```python
# For each database in configured_databases:
Task(
    subagent_type="general-purpose",
    prompt="""
    Discover all metadata for database {DATABASE}.

    Use the CLI to run SQL queries:
    # Scripts are relative to ../analyzing-data/
    uv run scripts/cli.py exec "df = run_sql('...')"
    uv run scripts/cli.py exec "print(df)"

    1. Query schemas:
       SELECT SCHEMA_NAME FROM {DATABASE}.INFORMATION_SCHEMA.SCHEMATA

    2. Query tables with row counts:
       SELECT TABLE_SCHEMA, TABLE_NAME, ROW_COUNT, COMMENT
       FROM {DATABASE}.INFORMATION_SCHEMA.TABLES
       ORDER BY TABLE_SCHEMA, TABLE_NAME

    3. For important schemas (MODEL_*, METRICS_*, MART_*), query columns:
       SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, COMMENT
       FROM {DATABASE}.INFORMATION_SCHEMA.COLUMNS
       WHERE TABLE_SCHEMA = 'X'

    Return a structured summary:
    - Database name
    - List of schemas with table counts
    - For each table: name, row_count, key columns
    - Flag any tables with >100M rows as "large"
    """
)
```
Run all subagents in parallel (single message with multiple Task calls).
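The structured summary each subagent returns might be assembled like this. The function and tuple shape are hypothetical, mirroring the INFORMATION_SCHEMA columns queried above; only the >100M large-table threshold comes from the prompt.

```python
LARGE_ROW_THRESHOLD = 100_000_000  # tables above this get flagged "large"

def summarize_tables(rows):
    """Group (schema, table, row_count) tuples into a per-schema summary."""
    summary = {}
    for schema, table, row_count in rows:
        summary.setdefault(schema, []).append({
            "table": table,
            "row_count": row_count,
            "large": row_count is not None and row_count > LARGE_ROW_THRESHOLD,
        })
    return summary

rows = [
    ("MODEL_ASTRO", "ORGANIZATIONS", 120_000),
    ("MODEL_ASTRO", "EVENTS", 450_000_000),
]
result = summarize_tables(rows)
print(result["MODEL_ASTRO"][1]["large"])  # True
```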

Step 4: Discover Categorical Value Families


For key categorical columns (like OPERATOR, STATUS, TYPE, FEATURE), discover value families:
```bash
uv run cli.py exec "df = run_sql('''
SELECT DISTINCT column_name, COUNT(*) as occurrences
FROM table
WHERE column_name IS NOT NULL
GROUP BY column_name
ORDER BY occurrences DESC
LIMIT 50
''')"
uv run cli.py exec "print(df)"
```

Group related values into families by common prefix/suffix (e.g., `Export*` for ExportCSV, ExportJSON, ExportParquet).
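The prefix grouping can be sketched as below. The leading-CamelCase-word heuristic and the `group_value_families` helper are illustrative assumptions, not part of the CLI; real value families may need suffix matching or manual curation too.

```python
import re
from collections import defaultdict

def group_value_families(values):
    """Group values by a leading CamelCase word, e.g. ExportCSV -> 'Export*'."""
    families = defaultdict(list)
    for value in values:
        match = re.match(r"[A-Z][a-z]+", value)
        prefix = match.group(0) if match else value
        families[prefix + "*"].append(value)
    # Only groups with more than one member count as families
    return {k: sorted(v) for k, v in families.items() if len(v) > 1}

values = ["ExportCSV", "ExportJSON", "ExportParquet", "Login"]
print(group_value_families(values))  # {'Export*': ['ExportCSV', 'ExportJSON', 'ExportParquet']}
```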

Step 5: Merge Results


Combine warehouse metadata + codebase context:
  1. Quick Reference table - concept → table mappings (pre-populated from code if found)
  2. Categorical Columns - value families for key filter columns
  3. Database sections - one per database
  4. Schema subsections - tables grouped by schema
  5. Table details - columns, row counts, descriptions from code, warnings
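At its core, the merge overlays code-derived context onto discovered warehouse metadata, with code taking precedence where both exist. A minimal sketch (the dict shapes are assumptions for illustration):

```python
def merge_context(warehouse_tables, code_context):
    """Attach code-derived descriptions/keys to discovered tables.

    warehouse_tables: {table_name: {"row_count": int, ...}} from discovery
    code_context:     {table_name: {"description": str, "primary_key": str}} from code search
    """
    merged = {}
    for name, meta in warehouse_tables.items():
        enriched = dict(meta)                     # start from warehouse metadata
        enriched.update(code_context.get(name, {}))  # code context wins on overlap
        merged[name] = enriched
    return merged

tables = {"ORGANIZATIONS": {"row_count": 120000}}
context = {"ORGANIZATIONS": {"description": "One row per org", "primary_key": "ORG_ID"}}
print(merge_context(tables, context)["ORGANIZATIONS"]["primary_key"])  # ORG_ID
```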

Step 6: Generate warehouse.md


Write the file to:
  • `.astro/warehouse.md` (default - project-specific, version-controllable)
  • `~/.astro/agents/warehouse.md` (if the `--global` flag is set)

Output Format


```markdown
# Warehouse Schema

Generated by /data:init on {DATE}. Edit freely to add business context.

## Quick Reference

| Concept | Table | Key Column | Date Column |
|---------|-------|------------|-------------|
| customers | HQ.MODEL_ASTRO.ORGANIZATIONS | ORG_ID | CREATED_AT |
<!-- Add your concept mappings here -->

## Categorical Columns

When filtering on these columns, explore value families first (values often have variants):

| Table | Column | Value Families |
|-------|--------|----------------|
| {TABLE} | {COLUMN} | {PREFIX}* ({VALUE1}, {VALUE2}, ...) |
<!-- Populated by /data:init from actual warehouse data -->

## Data Layer Hierarchy

Query downstream first: reporting > mart_* > metric_* > model_* > IN_*

| Layer | Prefix | Purpose |
|-------|--------|---------|
| Reporting | reporting.* | Dashboard-optimized |
| Mart | mart_* | Combined analytics |
| Metric | metric_* | KPIs at various grains |
| Model | model_* | Cleansed sources of truth |
| Raw | IN_* | Source data - avoid |

## {DATABASE} Database

### {SCHEMA} Schema

#### {TABLE_NAME}

{DESCRIPTION from code if found}

| Column | Type | Description |
|--------|------|-------------|
| COL1 | VARCHAR | {from code or inferred} |

- Rows: {ROW_COUNT}
- Key column: {PRIMARY_KEY from code or inferred} {IF ROW_COUNT > 100M: - ⚠️ WARNING: Large table - always add date filters}

## Relationships

{Inferred relationships based on column names like *_ID}
```
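The Relationships section can be seeded with a naive heuristic: a column named `FOO_ID` likely references a table named `FOO` or `FOOS`. A sketch (the add-an-S pluralization rule and `infer_relationships` helper are deliberate simplifications for illustration):

```python
def infer_relationships(columns_by_table):
    """Guess foreign-key links from *_ID column names."""
    table_names = set(columns_by_table)
    links = []
    for table, columns in columns_by_table.items():
        for col in columns:
            if not col.endswith("_ID"):
                continue
            stem = col[:-3]  # strip the "_ID" suffix
            for candidate in (stem, stem + "S"):
                if candidate in table_names and candidate != table:
                    links.append((table, col, candidate))
    return links

cols = {
    "ORGANIZATIONS": ["ORG_ID", "NAME"],
    "DEPLOYMENTS": ["DEPLOYMENT_ID", "ORGANIZATION_ID"],
}
print(infer_relationships(cols))  # [('DEPLOYMENTS', 'ORGANIZATION_ID', 'ORGANIZATIONS')]
```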

Command Options


| Option | Effect |
|--------|--------|
| `/data:init` | Generate .astro/warehouse.md |
| `/data:init --refresh` | Regenerate, preserving user edits |
| `/data:init --database HQ` | Only discover specific database |
| `/data:init --global` | Write to ~/.astro/agents/ instead |

Step 7: Pre-populate Cache


After generating warehouse.md, populate the concept cache:
```bash
# Scripts are relative to ../analyzing-data/
uv run cli.py concept import -p .astro/warehouse.md
uv run cli.py concept learn customers HQ.MART_CUST.CURRENT_ASTRO_CUSTS -k ACCT_ID
```

Step 8: Offer CLAUDE.md Integration (Ask User)


Ask the user:
Would you like to add the Quick Reference table to your CLAUDE.md file?
This ensures the schema mappings are always in context for data queries, improving accuracy from ~25% to ~100% for complex queries.
Options:
  1. Yes, add to CLAUDE.md (Recommended) - Append Quick Reference section
  2. No, skip - Use warehouse.md and cache only
If user chooses Yes:
  1. Check if `.claude/CLAUDE.md` or `CLAUDE.md` exists
  2. If it exists, append the Quick Reference section (avoid duplicates)
  3. If it does not, create `.claude/CLAUDE.md` with just the Quick Reference
Quick Reference section to add:

```markdown
## Data Warehouse Quick Reference

When querying the warehouse, use these table mappings:

| Concept | Table | Key Column | Date Column |
|---------|-------|------------|-------------|
{rows from warehouse.md Quick Reference}

Large tables (always filter by date): {list tables with >100M rows}

Auto-generated by /data:init. Run /data:init --refresh to update.
```
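Appending without duplicates reduces to checking for the section header before writing. A minimal sketch (assuming the section starts with the `## Data Warehouse Quick Reference` heading shown above; `append_quick_reference` is a hypothetical helper):

```python
from pathlib import Path

SECTION_HEADER = "## Data Warehouse Quick Reference"

def append_quick_reference(claude_md: Path, section: str) -> bool:
    """Append the Quick Reference section unless it is already present (idempotent)."""
    existing = claude_md.read_text() if claude_md.exists() else ""
    if SECTION_HEADER in existing:
        return False  # avoid duplicate sections on repeated runs
    claude_md.parent.mkdir(parents=True, exist_ok=True)
    body = existing.rstrip() + "\n\n" + section + "\n" if existing else section + "\n"
    claude_md.write_text(body)
    return True
```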

After Generation


Tell the user:

```
Generated .astro/warehouse.md

Summary:
  - {N} databases, {N} schemas, {N} tables
  - {N} tables enriched with code descriptions
  - {N} concepts cached for instant lookup

Next steps:
  1. Edit .astro/warehouse.md to add business context
  2. Commit to version control
  3. Run /data:init --refresh when schema changes
```

Refresh Behavior


When `--refresh` is specified:
  1. Read existing warehouse.md
  2. Preserve all HTML comments (`<!-- ... -->`)
  3. Preserve Quick Reference table entries (user-added)
  4. Preserve user-added descriptions
  5. Update row counts and add new tables
  6. Mark removed tables with a `<!-- REMOVED -->` comment
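Step 6 of the refresh is a set difference over table names: anything in the old file but not in the new discovery gets flagged rather than silently dropped. A minimal sketch:

```python
def mark_removed_tables(old_tables, new_tables):
    """During --refresh, flag tables that vanished instead of deleting their entries."""
    notes = {}
    for name in old_tables:
        if name not in new_tables:
            notes[name] = "<!-- REMOVED -->"
    return notes

print(mark_removed_tables({"ORGS", "LEGACY_EVENTS"}, {"ORGS"}))  # {'LEGACY_EVENTS': '<!-- REMOVED -->'}
```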

Cache Staleness & Schema Drift


The runtime cache has a 7-day TTL by default. After 7 days, cached entries expire and will be re-discovered on next use.

When to Refresh


Run `/data:init --refresh` when:
  • Schema changes: Tables added, renamed, or removed
  • Column changes: New columns added or types changed
  • After deployments: If your data pipeline deploys schema migrations
  • Weekly: As a good practice, even if no known changes

Signs of Stale Cache


Watch for these indicators:
  • Queries fail with "table not found" errors
  • Results seem wrong or outdated
  • New tables aren't being discovered

Manual Cache Reset


If you suspect cache issues:
```bash
# Scripts are relative to ../analyzing-data/
uv run scripts/cli.py cache status
uv run scripts/cli.py cache clear --stale-only
uv run scripts/cli.py cache clear
```

Codebase Patterns Recognized


| Pattern | Source | What We Extract |
|---------|--------|-----------------|
| `**/models/**/*.yml` | dbt | table/column descriptions, tests |
| `**/dags/**/*.sql` | gusty | YAML frontmatter (description, primary_key) |
| `AGENTS.md`, `CLAUDE.md` | docs | data layer hierarchy, conventions |
| `**/docs/**/*.md` | docs | business context |

Example Session


```
User: /data:init

Agent:
→ Reading warehouse configuration...
→ Found 1 warehouse with databases: HQ, PRODUCT

→ Searching codebase for data documentation...
  Found: AGENTS.md with data layer hierarchy
  Found: 45 SQL files with YAML frontmatter in dags/declarative/

→ Launching parallel warehouse discovery...
  [Database: HQ] Discovering schemas...
  [Database: PRODUCT] Discovering schemas...

→ HQ: Found 29 schemas, 401 tables
→ PRODUCT: Found 1 schema, 0 tables

→ Merging warehouse metadata with code context...
  Enriched 45 tables with descriptions from code

→ Generated .astro/warehouse.md

Summary:
  - 2 databases
  - 30 schemas
  - 401 tables
  - 45 tables enriched with code descriptions
  - 8 large tables flagged (>100M rows)

Next steps:
  1. Review .astro/warehouse.md
  2. Add concept mappings to Quick Reference
  3. Commit to version control
  4. Run /data:init --refresh when schema changes
```