analyzing-data

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Data Analysis

数据分析

Answer business questions by querying the data warehouse. The kernel starts automatically on first use.
通过查询数据仓库来解答业务问题。内核在首次使用时会自动启动。

Prerequisites

前提条件

uv must be installed:
bash
curl -LsSf https://astral.sh/uv/install.sh | sh
Scripts are located relative to this skill file.
必须安装uv:
bash
curl -LsSf https://astral.sh/uv/install.sh | sh
脚本位于此技能文件的相对路径下。

MANDATORY FIRST STEP

强制第一步

Before any other action, check for cached patterns:
bash
uv run scripts/cli.py pattern lookup "<user's question>"
This is NON-NEGOTIABLE. Patterns contain proven strategies that save time and avoid failed queries.

在执行任何其他操作之前,检查缓存的模式:
bash
uv run scripts/cli.py pattern lookup "<user's question>"
这是必须执行的步骤。模式包含经过验证的策略,可节省时间并避免查询失败。

Workflow

工作流程

Analysis Progress:
- [ ] Step 1: pattern lookup (check for cached strategy)
- [ ] Step 2: concept lookup (check for known tables)
- [ ] Step 3: Search codebase for table definitions (Grep)
- [ ] Step 4: Read SQL file to get table/column names
- [ ] Step 5: Execute query via kernel (run_sql)
- [ ] Step 6: learn_concept (ALWAYS before presenting results)
- [ ] Step 7: learn_pattern (ALWAYS if discovery required)
- [ ] Step 8: record_pattern_outcome (if you used a pattern in Step 1)
- [ ] Step 9: Present findings to user

分析进度:
- [ ] 步骤1: 模式查询(检查缓存的策略)
- [ ] 步骤2: 概念查询(检查已知表)
- [ ] 步骤3: 在代码库中搜索表定义(Grep)
- [ ] 步骤4: 读取SQL文件以获取表/列名称
- [ ] 步骤5: 通过内核执行查询(run_sql)
- [ ] 步骤6: 学习概念(在呈现结果前必须执行)
- [ ] 步骤7: 学习模式(如果需要探索,必须执行)
- [ ] 步骤8: 记录模式结果(如果在步骤1中使用了模式)
- [ ] 步骤9: 向用户呈现结果

CLI Commands

CLI命令

Kernel Management

内核管理

bash
uv run scripts/cli.py warehouse list  # List available warehouses
uv run scripts/cli.py start           # Start kernel with default warehouse
uv run scripts/cli.py start -w my_pg  # Start with specific warehouse
uv run scripts/cli.py exec "..."      # Execute Python code
uv run scripts/cli.py status          # Check kernel status
uv run scripts/cli.py restart         # Restart kernel
uv run scripts/cli.py stop            # Stop kernel
uv run scripts/cli.py install plotly  # Install additional packages
bash
uv run scripts/cli.py warehouse list  # List available warehouses
uv run scripts/cli.py start           # Start kernel with default warehouse
uv run scripts/cli.py start -w my_pg  # Start with specific warehouse
uv run scripts/cli.py exec "..."      # Execute Python code
uv run scripts/cli.py status          # Check kernel status
uv run scripts/cli.py restart         # Restart kernel
uv run scripts/cli.py stop            # Stop kernel
uv run scripts/cli.py install plotly  # Install additional packages

Concept Cache (concept -> table mappings)

概念缓存(概念 -> 表映射)

bash
undefined
bash
undefined

Look up a concept

Look up a concept

uv run scripts/cli.py concept lookup customers
uv run scripts/cli.py concept lookup customers

Learn a new concept

Learn a new concept

uv run scripts/cli.py concept learn customers HQ.MART_CUST.CURRENT_ASTRO_CUSTS -k ACCT_ID
uv run scripts/cli.py concept learn customers HQ.MART_CUST.CURRENT_ASTRO_CUSTS -k ACCT_ID

List all concepts

List all concepts

uv run scripts/cli.py concept list
uv run scripts/cli.py concept list

Import concepts from warehouse.md

Import concepts from warehouse.md

uv run scripts/cli.py concept import -p /path/to/warehouse.md
undefined
uv run scripts/cli.py concept import -p /path/to/warehouse.md
undefined

Pattern Cache (query strategies)

模式缓存(查询策略)

bash
undefined
bash
undefined

Look up patterns for a question

Look up patterns for a question

uv run scripts/cli.py pattern lookup "who uses operator X"
uv run scripts/cli.py pattern lookup "who uses operator X"

Learn a new pattern

Learn a new pattern

uv run scripts/cli.py pattern learn operator_usage
-q "who uses X operator"
-q "which customers use X"
-s "1. Query TASK_RUNS for operator_class"
-s "2. Join with ORGS on org_id"
-t "HQ.MODEL_ASTRO.TASK_RUNS"
-t "HQ.MODEL_ASTRO.ORGANIZATIONS"
-g "TASK_RUNS is huge - always filter by date"
uv run scripts/cli.py pattern learn operator_usage
-q "who uses X operator"
-q "which customers use X"
-s "1. Query TASK_RUNS for operator_class"
-s "2. Join with ORGS on org_id"
-t "HQ.MODEL_ASTRO.TASK_RUNS"
-t "HQ.MODEL_ASTRO.ORGANIZATIONS"
-g "TASK_RUNS is huge - always filter by date"

Record pattern outcome

Record pattern outcome

uv run scripts/cli.py pattern record operator_usage --success
uv run scripts/cli.py pattern record operator_usage --success

List all patterns

List all patterns

uv run scripts/cli.py pattern list
uv run scripts/cli.py pattern list

Delete a pattern

Delete a pattern

uv run scripts/cli.py pattern delete operator_usage
undefined
uv run scripts/cli.py pattern delete operator_usage
undefined

Table Schema Cache

表结构缓存

bash
undefined
bash
undefined

Look up cached table schema

Look up cached table schema

uv run scripts/cli.py table lookup HQ.MART_CUST.CURRENT_ASTRO_CUSTS
uv run scripts/cli.py table lookup HQ.MART_CUST.CURRENT_ASTRO_CUSTS

Cache a table schema

Cache a table schema

uv run scripts/cli.py table cache DB.SCHEMA.TABLE -c '[{"name":"id","type":"INT"}]'
uv run scripts/cli.py table cache DB.SCHEMA.TABLE -c '[{"name":"id","type":"INT"}]'

List all cached tables

List all cached tables

uv run scripts/cli.py table list
uv run scripts/cli.py table list

Delete from cache

Delete from cache

uv run scripts/cli.py table delete DB.SCHEMA.TABLE
undefined
uv run scripts/cli.py table delete DB.SCHEMA.TABLE
undefined

Cache Management

缓存管理

bash
undefined
bash
undefined

View cache statistics

View cache statistics

uv run scripts/cli.py cache status
uv run scripts/cli.py cache status

Clear all caches

Clear all caches

uv run scripts/cli.py cache clear
uv run scripts/cli.py cache clear

Clear only stale entries (older than 90 days)

Clear only stale entries (older than 90 days)

uv run scripts/cli.py cache clear --stale-only

---
uv run scripts/cli.py cache clear --stale-only

---

Quick Start Example

快速开始示例

bash
undefined
bash
undefined

1. Check for existing patterns

1. Check for existing patterns

uv run scripts/cli.py pattern lookup "how many customers"
uv run scripts/cli.py pattern lookup "how many customers"

2. Check for known concepts

2. Check for known concepts

uv run scripts/cli.py concept lookup customers
uv run scripts/cli.py concept lookup customers

3. Execute query

3. Execute query

uv run scripts/cli.py exec "df = run_sql('SELECT COUNT(*) FROM HQ.MART_CUST.CURRENT_ASTRO_CUSTS')" uv run scripts/cli.py exec "print(df)"
uv run scripts/cli.py exec "df = run_sql('SELECT COUNT(*) FROM HQ.MART_CUST.CURRENT_ASTRO_CUSTS')" uv run scripts/cli.py exec "print(df)"

4. Cache what we learned

4. Cache what we learned

uv run scripts/cli.py concept learn customers HQ.MART_CUST.CURRENT_ASTRO_CUSTS -k ACCT_ID

---
uv run scripts/cli.py concept learn customers HQ.MART_CUST.CURRENT_ASTRO_CUSTS -k ACCT_ID

---

Available Functions in Kernel

内核中的可用函数

Once kernel starts, these are available:
FunctionDescription
run_sql(query, limit=100)
Execute SQL, return Polars DataFrame
run_sql_pandas(query, limit=100)
Execute SQL, return Pandas DataFrame
pl
Polars library (imported)
pd
Pandas library (imported)

内核启动后,以下函数可用:
函数描述
run_sql(query, limit=100)
执行SQL,返回Polars DataFrame
run_sql_pandas(query, limit=100)
执行SQL,返回Pandas DataFrame
pl
Polars库(已导入)
pd
Pandas库(已导入)

Table Discovery via Codebase

通过代码库发现表

If concept/pattern cache miss, search the codebase:
Grep pattern="<concept>" glob="**/*.sql"
Repo TypeWhere to Look
Gusty
dags/declarative/04_metric/
,
06_reporting/
,
05_mart/
dbt
models/marts/
,
models/staging/

如果概念/模式缓存未命中,请搜索代码库:
Grep pattern="<concept>" glob="**/*.sql"
仓库类型查找位置
Gusty
dags/declarative/04_metric/
,
06_reporting/
,
05_mart/
dbt
models/marts/
,
models/staging/

Known Tables Quick Reference

已知表快速参考

ConceptTableKey ColumnDate Column
customersHQ.MART_CUST.CURRENT_ASTRO_CUSTSACCT_ID-
organizationsHQ.MODEL_ASTRO.ORGANIZATIONSORG_IDCREATED_TS
deploymentsHQ.MODEL_ASTRO.DEPLOYMENTSDEPLOYMENT_IDCREATED_TS
task_runsHQ.MODEL_ASTRO.TASK_RUNS-START_TS
dag_runsHQ.MODEL_ASTRO.DAG_RUNS-START_TS
usersHQ.MODEL_ASTRO.USERSUSER_ID-
accountsHQ.MODEL_CRM.SF_ACCOUNTSACCT_ID-
Large tables (always filter by date): TASK_RUNS (6B rows), DAG_RUNS (500M rows)

概念关键字段日期字段
customersHQ.MART_CUST.CURRENT_ASTRO_CUSTSACCT_ID-
organizationsHQ.MODEL_ASTRO.ORGANIZATIONSORG_IDCREATED_TS
deploymentsHQ.MODEL_ASTRO.DEPLOYMENTSDEPLOYMENT_IDCREATED_TS
task_runsHQ.MODEL_ASTRO.TASK_RUNS-START_TS
dag_runsHQ.MODEL_ASTRO.DAG_RUNS-START_TS
usersHQ.MODEL_ASTRO.USERSUSER_ID-
accountsHQ.MODEL_CRM.SF_ACCOUNTSACCT_ID-
大型表(必须按日期过滤): TASK_RUNS(60亿行)、DAG_RUNS(5亿行)

Query Tips

查询技巧

  • Use LIMIT during exploration
  • Filter early with WHERE clauses
  • Prefer pre-aggregated tables (
    METRICS_*
    ,
    MART_*
    ,
    AGG_*
    )
  • For 100M+ row tables: no JOINs or GROUP BY on first query
SQL Dialect Differences:
OperationSnowflakePostgreSQLBigQuery
Date subtract
DATEADD(day, -7, x)
x - INTERVAL '7 days'
DATE_SUB(x, INTERVAL 7 DAY)
Case-insensitive
ILIKE
ILIKE
LOWER(x) LIKE LOWER(y)

  • 在探索阶段使用LIMIT
  • 尽早使用WHERE子句过滤
  • 优先使用预聚合表(
    METRICS_*
    MART_*
    AGG_*
  • 对于1亿行以上的表:首次查询不要使用JOIN或GROUP BY
SQL方言差异:
操作SnowflakePostgreSQLBigQuery
日期减法
DATEADD(day, -7, x)
x - INTERVAL '7 days'
DATE_SUB(x, INTERVAL 7 DAY)
不区分大小写
ILIKE
ILIKE
LOWER(x) LIKE LOWER(y)

Reference

参考资料

  • reference/discovery-warehouse.md - Large table handling, warehouse discovery
  • reference/discovery-warehouse.md - 大型表处理、仓库发现