# Data Analysis
Answer business questions by querying the data warehouse. The kernel starts automatically on first use.
## Prerequisites
uv must be installed:
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Scripts are located relative to this skill file.
## MANDATORY FIRST STEP
Before any other action, check for cached patterns:
```bash
uv run scripts/cli.py pattern lookup "<user's question>"
```

This is NON-NEGOTIABLE. Patterns contain proven strategies that save time and avoid failed queries.
## Workflow
Analysis Progress:
- [ ] Step 1: pattern lookup (check for cached strategy)
- [ ] Step 2: concept lookup (check for known tables)
- [ ] Step 3: Search codebase for table definitions (Grep)
- [ ] Step 4: Read SQL file to get table/column names
- [ ] Step 5: Execute query via kernel (run_sql)
- [ ] Step 6: learn_concept (ALWAYS before presenting results)
- [ ] Step 7: learn_pattern (ALWAYS if discovery required)
- [ ] Step 8: record_pattern_outcome (if you used a pattern in Step 1)
- [ ] Step 9: Present findings to user
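The nine steps above can be sketched as an ordered checklist. This is an illustrative Python sketch (the step names are shorthand chosen here, not real commands) whose only job is to make the ordering invariants checkable, e.g. that `learn_concept` always precedes presenting results.

```python
# Illustrative sketch, not part of the skill: the workflow as ordered steps.
WORKFLOW = [
    "pattern_lookup",          # Step 1: check for cached strategy
    "concept_lookup",          # Step 2: check for known tables
    "grep_codebase",           # Step 3: search for table definitions
    "read_sql_file",           # Step 4: get table/column names
    "run_sql",                 # Step 5: execute query via kernel
    "learn_concept",           # Step 6: ALWAYS before presenting results
    "learn_pattern",           # Step 7: ALWAYS if discovery was required
    "record_pattern_outcome",  # Step 8: only if a pattern was used in Step 1
    "present_findings",        # Step 9
]

def precedes(earlier: str, later: str) -> bool:
    """True if `earlier` appears before `later` in the workflow."""
    return WORKFLOW.index(earlier) < WORKFLOW.index(later)
```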
## CLI Commands
### Kernel Management
```bash
uv run scripts/cli.py warehouse list   # List available warehouses
uv run scripts/cli.py start            # Start kernel with default warehouse
uv run scripts/cli.py start -w my_pg   # Start with specific warehouse
uv run scripts/cli.py exec "..."       # Execute Python code
uv run scripts/cli.py status           # Check kernel status
uv run scripts/cli.py restart          # Restart kernel
uv run scripts/cli.py stop             # Stop kernel
uv run scripts/cli.py install plotly   # Install additional packages
```
### Concept Cache (concept -> table mappings)
```bash
# Look up a concept
uv run scripts/cli.py concept lookup customers

# Learn a new concept
uv run scripts/cli.py concept learn customers HQ.MART_CUST.CURRENT_ASTRO_CUSTS -k ACCT_ID

# List all concepts
uv run scripts/cli.py concept list

# Import concepts from warehouse.md
uv run scripts/cli.py concept import -p /path/to/warehouse.md
```
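As a rough illustration of what a concept -> table mapping holds (the cache's actual storage format is not documented here, so the shape below is an assumption), a lookup is just a keyed fetch that returns `None` on a miss:

```python
# Hypothetical in-memory shape for the concept cache (assumption: the real
# cache persists more metadata than this).
CONCEPTS = {
    "customers": {
        "table": "HQ.MART_CUST.CURRENT_ASTRO_CUSTS",
        "key_column": "ACCT_ID",
    },
}

def lookup_concept(name: str):
    """Return the cached mapping for a concept, or None on a cache miss."""
    return CONCEPTS.get(name)
```

A cache miss is the signal to fall through to codebase discovery (see "Table Discovery via Codebase" below).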
### Pattern Cache (query strategies)
```bash
# Look up patterns for a question
uv run scripts/cli.py pattern lookup "who uses operator X"

# Learn a new pattern
uv run scripts/cli.py pattern learn operator_usage \
  -q "who uses X operator" \
  -q "which customers use X" \
  -s "1. Query TASK_RUNS for operator_class" \
  -s "2. Join with ORGS on org_id" \
  -t "HQ.MODEL_ASTRO.TASK_RUNS" \
  -t "HQ.MODEL_ASTRO.ORGANIZATIONS" \
  -g "TASK_RUNS is huge - always filter by date"

# Record pattern outcome
uv run scripts/cli.py pattern record operator_usage --success

# List all patterns
uv run scripts/cli.py pattern list

# Delete a pattern
uv run scripts/cli.py pattern delete operator_usage
```
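When a pattern carries several repeated `-q`/`-s`/`-t`/`-g` flags, building the argument vector programmatically avoids shell-quoting mistakes. A hedged sketch (the flag semantics are taken from the command above; the helper itself is hypothetical):

```python
# Build the `pattern learn` argv from structured data, so each
# question/step/table/gotcha becomes its own properly separated argument.
def pattern_learn_argv(name, questions, steps, tables, gotchas):
    argv = ["uv", "run", "scripts/cli.py", "pattern", "learn", name]
    for q in questions:
        argv += ["-q", q]
    for s in steps:
        argv += ["-s", s]
    for t in tables:
        argv += ["-t", t]
    for g in gotchas:
        argv += ["-g", g]
    return argv

argv = pattern_learn_argv(
    "operator_usage",
    questions=["who uses X operator", "which customers use X"],
    steps=["1. Query TASK_RUNS for operator_class", "2. Join with ORGS on org_id"],
    tables=["HQ.MODEL_ASTRO.TASK_RUNS", "HQ.MODEL_ASTRO.ORGANIZATIONS"],
    gotchas=["TASK_RUNS is huge - always filter by date"],
)
```

Pass `argv` to `subprocess.run(argv)` rather than joining it into a single shell string.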
### Table Schema Cache
```bash
# Look up cached table schema
uv run scripts/cli.py table lookup HQ.MART_CUST.CURRENT_ASTRO_CUSTS

# Cache a table schema
uv run scripts/cli.py table cache DB.SCHEMA.TABLE -c '[{"name":"id","type":"INT"}]'

# List all cached tables
uv run scripts/cli.py table list

# Delete from cache
uv run scripts/cli.py table delete DB.SCHEMA.TABLE
```
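The `-c` payload is a JSON array of column objects; generating it with `json.dumps` rather than hand-writing the string keeps the quoting valid. An illustrative sketch:

```python
import json

# Columns as plain Python data; json.dumps produces a valid JSON string for
# the `-c` flag, e.g. '[{"name": "id", "type": "INT"}]'.
columns = [{"name": "id", "type": "INT"}]
payload = json.dumps(columns)
argv = ["uv", "run", "scripts/cli.py", "table", "cache", "DB.SCHEMA.TABLE", "-c", payload]
```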
### Cache Management
```bash
# View cache statistics
uv run scripts/cli.py cache status

# Clear all caches
uv run scripts/cli.py cache clear

# Clear only stale entries (older than 90 days)
uv run scripts/cli.py cache clear --stale-only
```
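The `--stale-only` cutoff ("older than 90 days") amounts to a timestamp comparison; a minimal sketch of that rule (illustrative, not the CLI's actual implementation):

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=90)

def is_stale(cached_at: datetime, now: datetime) -> bool:
    """An entry is stale once it is more than 90 days old."""
    return now - cached_at > STALE_AFTER
```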
---

## Quick Start Example
```bash
# 1. Check for existing patterns
uv run scripts/cli.py pattern lookup "how many customers"

# 2. Check for known concepts
uv run scripts/cli.py concept lookup customers

# 3. Execute query
uv run scripts/cli.py exec "df = run_sql('SELECT COUNT(*) FROM HQ.MART_CUST.CURRENT_ASTRO_CUSTS')"
uv run scripts/cli.py exec "print(df)"

# 4. Cache what we learned
uv run scripts/cli.py concept learn customers HQ.MART_CUST.CURRENT_ASTRO_CUSTS -k ACCT_ID
```
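Step 3 nests single-quoted SQL inside a double-quoted Python snippet. When generating such commands programmatically, passing an argument list (with `shlex.quote` only for display) keeps the quoting intact. A hedged sketch:

```python
import shlex

# Single quotes for the SQL string let the whole Python snippet sit safely
# inside one shell argument.
code = "df = run_sql('SELECT COUNT(*) FROM HQ.MART_CUST.CURRENT_ASTRO_CUSTS')"
argv = ["uv", "run", "scripts/cli.py", "exec", code]

# For logging only: a copy-pasteable shell line with correct quoting.
printable = " ".join(shlex.quote(part) for part in argv)
```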
---

## Available Functions in Kernel
Once the kernel starts, these are available:

| Function | Description |
|---|---|
| `run_sql(...)` | Execute SQL, return Polars DataFrame |
|  | Execute SQL, return Pandas DataFrame |
| `pl` | Polars library (imported) |
| `pd` | Pandas library (imported) |
## Table Discovery via Codebase
On a concept/pattern cache miss, search the codebase:

```
Grep pattern="<concept>" glob="**/*.sql"
```

| Repo Type | Where to Look |
|---|---|
| Gusty | |
| dbt | |
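A rough plain-Python equivalent of the Grep call above (illustrative only; the Grep tool is what the workflow actually uses):

```python
from pathlib import Path

def find_table_definitions(root: str, concept: str) -> list:
    """Return .sql files under `root` whose text mentions `concept` (case-insensitive)."""
    hits = []
    for path in Path(root).rglob("*.sql"):
        if concept.lower() in path.read_text(errors="ignore").lower():
            hits.append(path)
    return sorted(hits)
```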
## Known Tables Quick Reference
| Concept | Table | Key Column | Date Column |
|---|---|---|---|
| customers | HQ.MART_CUST.CURRENT_ASTRO_CUSTS | ACCT_ID | - |
| organizations | HQ.MODEL_ASTRO.ORGANIZATIONS | ORG_ID | CREATED_TS |
| deployments | HQ.MODEL_ASTRO.DEPLOYMENTS | DEPLOYMENT_ID | CREATED_TS |
| task_runs | HQ.MODEL_ASTRO.TASK_RUNS | - | START_TS |
| dag_runs | HQ.MODEL_ASTRO.DAG_RUNS | - | START_TS |
| users | HQ.MODEL_ASTRO.USERS | USER_ID | - |
| accounts | HQ.MODEL_CRM.SF_ACCOUNTS | ACCT_ID | - |
Large tables (always filter by date): TASK_RUNS (6B rows), DAG_RUNS (500M rows)
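The row counts above make the date filter non-negotiable for TASK_RUNS and DAG_RUNS. A hypothetical helper (a sketch, not part of the skill) that encodes the rule "large tables always get a date filter, exploration always gets a LIMIT":

```python
# Known large tables mapped to their date columns (from the quick reference).
LARGE_TABLES = {
    "HQ.MODEL_ASTRO.TASK_RUNS": "START_TS",  # ~6B rows
    "HQ.MODEL_ASTRO.DAG_RUNS": "START_TS",   # ~500M rows
}

def explore_query(table: str, days: int = 7, limit: int = 100) -> str:
    """First-pass exploration query: date-filtered for large tables, always LIMITed."""
    where = ""
    date_col = LARGE_TABLES.get(table)
    if date_col:
        # Snowflake-style date arithmetic; adjust for your warehouse dialect.
        where = f" WHERE {date_col} >= CURRENT_DATE - {days}"
    return f"SELECT * FROM {table}{where} LIMIT {limit}"
```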
## Query Tips
- Use LIMIT during exploration
- Filter early with WHERE clauses
- Prefer pre-aggregated tables (AGG_*, METRICS_*, MART_*)
- For 100M+ row tables: no JOINs or GROUP BY on first query

SQL Dialect Differences:

| Operation | Snowflake | PostgreSQL | BigQuery |
|---|---|---|---|
| Date subtract | | | |
| Case-insensitive | | | |
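As a hedged illustration of the dialect differences above (drawn from general SQL knowledge, not from this skill; verify against your warehouse's documentation), common spellings of a 30-day subtraction and a case-insensitive prefix match are:

```python
# Illustrative dialect map; `name` is a placeholder column.
DATE_SUBTRACT = {
    "snowflake": "DATEADD(day, -30, CURRENT_DATE)",
    "postgresql": "CURRENT_DATE - INTERVAL '30 days'",
    "bigquery": "DATE_SUB(CURRENT_DATE, INTERVAL 30 DAY)",
}
CASE_INSENSITIVE = {
    "snowflake": "name ILIKE 'acme%'",
    "postgresql": "name ILIKE 'acme%'",
    "bigquery": "LOWER(name) LIKE 'acme%'",
}
```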
## Reference
- reference/discovery-warehouse.md - Large table handling, warehouse discovery