init


Initialize Warehouse Schema


Generate a comprehensive, user-editable schema reference file for the data warehouse.
Scripts:
../analyzing-data/scripts/

What This Does


  1. Discovers all databases, schemas, tables, and columns from the warehouse
  2. Enriches with codebase context (dbt models, gusty SQL, schema docs)
  3. Records row counts and identifies large tables
  4. Generates `.astro/warehouse.md` - a version-controllable, team-shareable reference
  5. Enables instant concept→table lookups without warehouse queries

Process


Step 1: Read Warehouse Configuration


```bash
cat ~/.astro/agents/warehouse.yml
```

Get the list of databases to discover (e.g., `databases: [HQ, ANALYTICS, RAW]`).
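Reading the database list from this config can be sketched as below. This is a minimal illustration assuming the file keeps the simple flow-style `databases: [...]` form shown above; a real implementation should use a proper YAML parser, and `parse_databases` is a hypothetical helper, not part of the CLI.

```python
import re

def parse_databases(yml_text: str) -> list[str]:
    """Extract a flow-style `databases: [A, B, C]` list from warehouse.yml."""
    match = re.search(r"databases:\s*\[([^\]]*)\]", yml_text)
    if not match:
        return []
    return [name.strip() for name in match.group(1).split(",") if name.strip()]

config = """\
databases: [HQ, ANALYTICS, RAW]
"""
print(parse_databases(config))  # ['HQ', 'ANALYTICS', 'RAW']
```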

Step 2: Search Codebase for Context (Parallel)


Launch a subagent to find business context in code:

```python
Task(
    subagent_type="Explore",
    prompt="""
    Search for data model documentation in the codebase:

    1. dbt models: **/models/**/*.yml, **/schema.yml
       - Extract table descriptions, column descriptions
       - Note primary keys and tests

    2. Gusty/declarative SQL: **/dags/**/*.sql with YAML frontmatter
       - Parse frontmatter for: description, primary_key, tests
       - Note schema mappings

    3. AGENTS.md or CLAUDE.md files with data layer documentation

    Return a mapping of:
      table_name -> {description, primary_key, important_columns, layer}
    """
)
```
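The frontmatter extraction the subagent is asked to perform can be sketched as follows. This is an illustrative parser for the simple `key: value` frontmatter format described above, not gusty's actual implementation; nested YAML values would need a real YAML library.

```python
def parse_sql_frontmatter(sql_text: str) -> dict:
    """Parse flat `key: value` YAML frontmatter delimited by `---` lines."""
    lines = sql_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no frontmatter block at the top of the file
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # closing delimiter ends the frontmatter
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta

sql = """---
description: Current customer snapshot
primary_key: ACCT_ID
---
SELECT * FROM raw.customers
"""
print(parse_sql_frontmatter(sql))
```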

Step 3: Parallel Warehouse Discovery


Launch one subagent per database using the Task tool:

```python
# For each database in configured_databases:
Task(
    subagent_type="general-purpose",
    prompt="""
    Discover all metadata for database {DATABASE}.

    Use the CLI to run SQL queries:
    # Scripts are relative to ../analyzing-data/
    uv run scripts/cli.py exec "df = run_sql('...')"
    uv run scripts/cli.py exec "print(df)"

    1. Query schemas:
       SELECT SCHEMA_NAME FROM {DATABASE}.INFORMATION_SCHEMA.SCHEMATA

    2. Query tables with row counts:
       SELECT TABLE_SCHEMA, TABLE_NAME, ROW_COUNT, COMMENT
       FROM {DATABASE}.INFORMATION_SCHEMA.TABLES
       ORDER BY TABLE_SCHEMA, TABLE_NAME

    3. For important schemas (MODEL_*, METRICS_*, MART_*), query columns:
       SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, COMMENT
       FROM {DATABASE}.INFORMATION_SCHEMA.COLUMNS
       WHERE TABLE_SCHEMA = 'X'

    Return a structured summary:
    - Database name
    - List of schemas with table counts
    - For each table: name, row_count, key columns
    - Flag any tables with >100M rows as "large"
    """
)
```
Run all subagents in parallel (single message with multiple Task calls).
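The structured summary each subagent returns might be assembled like this. The function and tuple shape are hypothetical, mirroring the INFORMATION_SCHEMA columns queried above; only the >100M large-table threshold comes from the prompt.

```python
LARGE_ROW_THRESHOLD = 100_000_000  # tables above this get flagged "large"

def summarize_tables(rows):
    """Group (schema, table, row_count) tuples into a per-schema summary."""
    summary = {}
    for schema, table, row_count in rows:
        summary.setdefault(schema, []).append({
            "table": table,
            "row_count": row_count,
            "large": row_count is not None and row_count > LARGE_ROW_THRESHOLD,
        })
    return summary

rows = [
    ("MODEL_ASTRO", "ORGANIZATIONS", 120_000),
    ("MODEL_ASTRO", "EVENTS", 450_000_000),
]
result = summarize_tables(rows)
print(result["MODEL_ASTRO"][1]["large"])  # True
```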

Step 4: Discover Categorical Value Families


For key categorical columns (like OPERATOR, STATUS, TYPE, FEATURE), discover value families:
```bash
uv run cli.py exec "df = run_sql('''
SELECT DISTINCT column_name, COUNT(*) as occurrences
FROM table
WHERE column_name IS NOT NULL
GROUP BY column_name
ORDER BY occurrences DESC
LIMIT 50
''')"
uv run cli.py exec "print(df)"
```

Group related values into families by common prefix/suffix (e.g., `Export*` for ExportCSV, ExportJSON, ExportParquet).
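The prefix grouping can be sketched as below. The leading-CamelCase-word heuristic and the `group_value_families` helper are illustrative assumptions, not part of the CLI; real value families may need suffix matching or manual curation too.

```python
import re
from collections import defaultdict

def group_value_families(values):
    """Group values by a leading CamelCase word, e.g. ExportCSV -> 'Export*'."""
    families = defaultdict(list)
    for value in values:
        match = re.match(r"[A-Z][a-z]+", value)
        prefix = match.group(0) if match else value
        families[prefix + "*"].append(value)
    # Only groups with more than one member count as families
    return {k: sorted(v) for k, v in families.items() if len(v) > 1}

values = ["ExportCSV", "ExportJSON", "ExportParquet", "Login"]
print(group_value_families(values))  # {'Export*': ['ExportCSV', 'ExportJSON', 'ExportParquet']}
```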

Step 5: Merge Results


Combine warehouse metadata + codebase context:
  1. Quick Reference table - concept → table mappings (pre-populated from code if found)
  2. Categorical Columns - value families for key filter columns
  3. Database sections - one per database
  4. Schema subsections - tables grouped by schema
  5. Table details - columns, row counts, descriptions from code, warnings
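At its core, the merge overlays code-derived context onto discovered warehouse metadata, with code taking precedence where both exist. A minimal sketch (the dict shapes are assumptions for illustration):

```python
def merge_context(warehouse_tables, code_context):
    """Attach code-derived descriptions/keys to discovered tables.

    warehouse_tables: {table_name: {"row_count": int, ...}} from discovery
    code_context:     {table_name: {"description": str, "primary_key": str}} from code search
    """
    merged = {}
    for name, meta in warehouse_tables.items():
        enriched = dict(meta)                     # start from warehouse metadata
        enriched.update(code_context.get(name, {}))  # code context wins on overlap
        merged[name] = enriched
    return merged

tables = {"ORGANIZATIONS": {"row_count": 120000}}
context = {"ORGANIZATIONS": {"description": "One row per org", "primary_key": "ORG_ID"}}
print(merge_context(tables, context)["ORGANIZATIONS"]["primary_key"])  # ORG_ID
```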

Step 6: Generate warehouse.md


Write the file to:
  • `.astro/warehouse.md` (default - project-specific, version-controllable)
  • `~/.astro/agents/warehouse.md` (if the `--global` flag is set)

Output Format


```markdown
# Warehouse Schema

Generated by /data:init on {DATE}. Edit freely to add business context.

## Quick Reference

| Concept | Table | Key Column | Date Column |
|---------|-------|------------|-------------|
| customers | HQ.MODEL_ASTRO.ORGANIZATIONS | ORG_ID | CREATED_AT |
<!-- Add your concept mappings here -->

## Categorical Columns

When filtering on these columns, explore value families first (values often have variants):

| Table | Column | Value Families |
|-------|--------|----------------|
| {TABLE} | {COLUMN} | {PREFIX}* ({VALUE1}, {VALUE2}, ...) |
<!-- Populated by /data:init from actual warehouse data -->

## Data Layer Hierarchy

Query downstream first: reporting > mart_* > metric_* > model_* > IN_*

| Layer | Prefix | Purpose |
|-------|--------|---------|
| Reporting | reporting.* | Dashboard-optimized |
| Mart | mart_* | Combined analytics |
| Metric | metric_* | KPIs at various grains |
| Model | model_* | Cleansed sources of truth |
| Raw | IN_* | Source data - avoid |

## {DATABASE} Database

### {SCHEMA} Schema

#### {TABLE_NAME}

{DESCRIPTION from code if found}

| Column | Type | Description |
|--------|------|-------------|
| COL1 | VARCHAR | {from code or inferred} |

- Rows: {ROW_COUNT}
- Key column: {PRIMARY_KEY from code or inferred} {IF ROW_COUNT > 100M: - ⚠️ WARNING: Large table - always add date filters}

## Relationships

{Inferred relationships based on column names like *_ID}
```
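The Relationships section can be seeded with a naive heuristic: a column named `FOO_ID` likely references a table named `FOO` or `FOOS`. A sketch (the add-an-S pluralization rule and `infer_relationships` helper are deliberate simplifications for illustration):

```python
def infer_relationships(columns_by_table):
    """Guess foreign-key links from *_ID column names."""
    table_names = set(columns_by_table)
    links = []
    for table, columns in columns_by_table.items():
        for col in columns:
            if not col.endswith("_ID"):
                continue
            stem = col[:-3]  # strip the "_ID" suffix
            for candidate in (stem, stem + "S"):
                if candidate in table_names and candidate != table:
                    links.append((table, col, candidate))
    return links

cols = {
    "ORGANIZATIONS": ["ORG_ID", "NAME"],
    "DEPLOYMENTS": ["DEPLOYMENT_ID", "ORGANIZATION_ID"],
}
print(infer_relationships(cols))  # [('DEPLOYMENTS', 'ORGANIZATION_ID', 'ORGANIZATIONS')]
```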

Command Options


| Option | Effect |
|--------|--------|
| `/data:init` | Generate .astro/warehouse.md |
| `/data:init --refresh` | Regenerate, preserving user edits |
| `/data:init --database HQ` | Only discover specific database |
| `/data:init --global` | Write to ~/.astro/agents/ instead |

Step 7: Pre-populate Cache


After generating warehouse.md, populate the concept cache:
```bash
# Scripts are relative to ../analyzing-data/
uv run cli.py concept import -p .astro/warehouse.md
uv run cli.py concept learn customers HQ.MART_CUST.CURRENT_ASTRO_CUSTS -k ACCT_ID
```

Step 8: Offer CLAUDE.md Integration (Ask User)


Ask the user:
Would you like to add the Quick Reference table to your CLAUDE.md file?
This ensures the schema mappings are always in context for data queries, improving accuracy from ~25% to ~100% for complex queries.
Options:
  1. Yes, add to CLAUDE.md (Recommended) - Append Quick Reference section
  2. No, skip - Use warehouse.md and cache only
If user chooses Yes:
  1. Check if `.claude/CLAUDE.md` or `CLAUDE.md` exists
  2. If it exists, append the Quick Reference section (avoid duplicates)
  3. If it does not, create `.claude/CLAUDE.md` with just the Quick Reference
Quick Reference section to add:

```markdown
## Data Warehouse Quick Reference

When querying the warehouse, use these table mappings:

| Concept | Table | Key Column | Date Column |
|---------|-------|------------|-------------|
{rows from warehouse.md Quick Reference}

Large tables (always filter by date): {list tables with >100M rows}

Auto-generated by /data:init. Run /data:init --refresh to update.
```
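Appending without duplicates reduces to checking for the section header before writing. A minimal sketch (assuming the section starts with the `## Data Warehouse Quick Reference` heading shown above; `append_quick_reference` is a hypothetical helper):

```python
from pathlib import Path

SECTION_HEADER = "## Data Warehouse Quick Reference"

def append_quick_reference(claude_md: Path, section: str) -> bool:
    """Append the Quick Reference section unless it is already present (idempotent)."""
    existing = claude_md.read_text() if claude_md.exists() else ""
    if SECTION_HEADER in existing:
        return False  # avoid duplicate sections on repeated runs
    claude_md.parent.mkdir(parents=True, exist_ok=True)
    body = existing.rstrip() + "\n\n" + section + "\n" if existing else section + "\n"
    claude_md.write_text(body)
    return True
```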

After Generation


Tell the user:

```
Generated .astro/warehouse.md

Summary:
  - {N} databases, {N} schemas, {N} tables
  - {N} tables enriched with code descriptions
  - {N} concepts cached for instant lookup

Next steps:
  1. Edit .astro/warehouse.md to add business context
  2. Commit to version control
  3. Run /data:init --refresh when schema changes
```

Refresh Behavior


When `--refresh` is specified:
  1. Read existing warehouse.md
  2. Preserve all HTML comments (`<!-- ... -->`)
  3. Preserve Quick Reference table entries (user-added)
  4. Preserve user-added descriptions
  5. Update row counts and add new tables
  6. Mark removed tables with a `<!-- REMOVED -->` comment
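Step 6 of the refresh is a set difference over table names: anything in the old file but not in the new discovery gets flagged rather than silently dropped. A minimal sketch:

```python
def mark_removed_tables(old_tables, new_tables):
    """During --refresh, flag tables that vanished instead of deleting their entries."""
    notes = {}
    for name in old_tables:
        if name not in new_tables:
            notes[name] = "<!-- REMOVED -->"
    return notes

print(mark_removed_tables({"ORGS", "LEGACY_EVENTS"}, {"ORGS"}))  # {'LEGACY_EVENTS': '<!-- REMOVED -->'}
```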

Cache Staleness & Schema Drift


The runtime cache has a 7-day TTL by default. After 7 days, cached entries expire and will be re-discovered on next use.

When to Refresh


Run `/data:init --refresh` when:
  • Schema changes: Tables added, renamed, or removed
  • Column changes: New columns added or types changed
  • After deployments: If your data pipeline deploys schema migrations
  • Weekly: As a good practice, even if no known changes

Signs of Stale Cache


Watch for these indicators:
  • Queries fail with "table not found" errors
  • Results seem wrong or outdated
  • New tables aren't being discovered

Manual Cache Reset


If you suspect cache issues:
```bash
# Scripts are relative to ../analyzing-data/
uv run scripts/cli.py cache status
uv run scripts/cli.py cache clear --stale-only
uv run scripts/cli.py cache clear
```

Codebase Patterns Recognized


| Pattern | Source | What We Extract |
|---------|--------|-----------------|
| `**/models/**/*.yml` | dbt | table/column descriptions, tests |
| `**/dags/**/*.sql` | gusty | YAML frontmatter (description, primary_key) |
| `AGENTS.md`, `CLAUDE.md` | docs | data layer hierarchy, conventions |
| `**/docs/**/*.md` | docs | business context |

Example Session


```
User: /data:init

Agent:
→ Reading warehouse configuration...
→ Found 1 warehouse with databases: HQ, PRODUCT

→ Searching codebase for data documentation...
  Found: AGENTS.md with data layer hierarchy
  Found: 45 SQL files with YAML frontmatter in dags/declarative/

→ Launching parallel warehouse discovery...
  [Database: HQ] Discovering schemas...
  [Database: PRODUCT] Discovering schemas...

→ HQ: Found 29 schemas, 401 tables
→ PRODUCT: Found 1 schema, 0 tables

→ Merging warehouse metadata with code context...
  Enriched 45 tables with descriptions from code

→ Generated .astro/warehouse.md

Summary:
  - 2 databases
  - 30 schemas
  - 401 tables
  - 45 tables enriched with code descriptions
  - 8 large tables flagged (>100M rows)

Next steps:
  1. Review .astro/warehouse.md
  2. Add concept mappings to Quick Reference
  3. Commit to version control
  4. Run /data:init --refresh when schema changes
```