data-context-extractor


Data Context Extractor

A meta-skill that extracts company-specific data knowledge from analysts and generates tailored data analysis skills.

How It Works

This skill has two modes:
  1. Bootstrap Mode: Create a new data analysis skill from scratch
  2. Iteration Mode: Improve an existing skill by adding domain-specific reference files

Bootstrap Mode

Use when: User wants to create a new data context skill for their warehouse.

Phase 1: Database Connection & Discovery

Step 1: Identify the database type
Ask: "What data warehouse are you using?"
Common options:
  • BigQuery
  • Snowflake
  • PostgreSQL/Redshift
  • Databricks
Use the data warehouse tools (query and schema) to connect. If unclear, check available MCP tools in the current session.
Step 2: Explore the schema
Use the data warehouse schema tools to:
  1. List available datasets/schemas
  2. Identify the most important tables (ask user: "Which 3-5 tables do analysts query most often?")
  3. Pull schema details for those key tables
Sample exploration queries by dialect:

```sql
-- BigQuery: List datasets
SELECT schema_name FROM INFORMATION_SCHEMA.SCHEMATA

-- BigQuery: List tables in a dataset
SELECT table_name FROM `project.dataset.INFORMATION_SCHEMA.TABLES`

-- Snowflake: List schemas
SHOW SCHEMAS IN DATABASE my_database

-- Snowflake: List tables
SHOW TABLES IN SCHEMA my_schema
```
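The samples above cover BigQuery and Snowflake; for the other warehouses listed in Step 1, equivalent exploration queries might look like this (`public` and `my_schema` are placeholder names):

```sql
-- PostgreSQL/Redshift: List schemas
SELECT schema_name FROM information_schema.schemata

-- PostgreSQL/Redshift: List tables in a schema
SELECT table_name FROM information_schema.tables WHERE table_schema = 'public'

-- Databricks: List schemas
SHOW SCHEMAS

-- Databricks: List tables in a schema
SHOW TABLES IN my_schema
```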

Phase 2: Core Questions (Ask These)

After schema discovery, ask these questions conversationally (not all at once):
Entity Disambiguation (Critical)
"When people here say 'user' or 'customer', what exactly do they mean? Are there different types?"
Listen for:
  • Multiple entity types (user vs account vs organization)
  • Relationships between them (1:1, 1:many, many:many)
  • Which ID fields link them together
Primary Identifiers
"What's the main identifier for a [customer/user/account]? Are there multiple IDs for the same entity?"
Listen for:
  • Primary keys vs business keys
  • UUID vs integer IDs
  • Legacy ID systems
Key Metrics
"What are the 2-3 metrics people ask about most? How is each one calculated?"
Listen for:
  • Exact formulas (ARR = monthly_revenue × 12)
  • Which tables/columns feed each metric
  • Time period conventions (trailing 7 days, calendar month, etc.)
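It helps to translate each captured formula into a query right away as a sanity check. For the ARR example above, that might look like the following (the subscriptions table, its columns, and the active-status filter are all assumptions to confirm with the analyst):

```sql
-- Hypothetical: ARR as monthly_revenue × 12, summed over active subscriptions
SELECT SUM(monthly_revenue) * 12 AS arr
FROM subscriptions
WHERE status = 'active'
```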
Data Hygiene
"What should ALWAYS be filtered out of queries? (test data, fraud, internal users, etc.)"
Listen for:
  • Standard WHERE clauses to always include
  • Flag columns that indicate exclusions (is_test, is_internal, is_fraud)
  • Specific values to exclude (status = 'deleted')
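These answers typically condense into a standard filter block that every generated query should include. A sketch, assuming the flag columns mentioned above exist:

```sql
-- Hypothetical standard exclusions; replace with the analyst's actual flags
SELECT user_id, event_type, created_at
FROM events
WHERE is_test = FALSE
  AND is_internal = FALSE
  AND is_fraud = FALSE
  AND status != 'deleted'
```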
Common Gotchas
"What mistakes do new analysts typically make with this data?"
Listen for:
  • Confusing column names
  • Timezone issues
  • NULL handling quirks
  • Historical vs current state tables

Phase 3: Generate the Skill

Create a skill with this structure:
[company]-data-analyst/
├── SKILL.md
└── references/
    ├── entities.md          # Entity definitions and relationships
    ├── metrics.md           # KPI calculations
    ├── tables/              # One file per domain
    │   ├── [domain1].md
    │   └── [domain2].md
    └── dashboards.json      # Optional: existing dashboards catalog
SKILL.md Template: See references/skill-template.md
SQL Dialect Section: See references/sql-dialects.md and include the appropriate dialect notes.
Reference File Template: See references/domain-template.md

Phase 4: Package and Deliver

  1. Create all files in the skill directory
  2. Package as a zip file
  3. Present to user with summary of what was captured

Iteration Mode

Use when: User has an existing skill but needs to add more context.

Step 1: Load Existing Skill

Ask user to upload their existing skill (zip or folder), or locate it if already in the session.
Read the current SKILL.md and reference files to understand what's already documented.

Step 2: Identify the Gap

Ask: "What domain or topic needs more context? What queries are failing or producing wrong results?"
Common gaps:
  • A new data domain (marketing, finance, product, etc.)
  • Missing metric definitions
  • Undocumented table relationships
  • New terminology

Step 3: Targeted Discovery

For the identified domain:
  1. Explore relevant tables: Use the data warehouse schema tools to find tables in that domain
  2. Ask domain-specific questions:
    • "What tables are used for [domain] analysis?"
    • "What are the key metrics for [domain]?"
    • "Any special filters or gotchas for [domain] data?"
  3. Generate new reference file: Create references/[domain].md using the domain template

Step 4: Update and Repackage

  1. Add the new reference file
  2. Update SKILL.md's "Knowledge Base Navigation" section to include the new domain
  3. Repackage the skill
  4. Present the updated skill to user

Reference File Standards

Each reference file should include:

For Table Documentation

  • Location: Full table path
  • Description: What this table contains, when to use it
  • Primary Key: How to uniquely identify rows
  • Update Frequency: How often data refreshes
  • Key Columns: Table with column name, type, description, notes
  • Relationships: How this table joins to others
  • Sample Queries: 2-3 common query patterns
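For the Relationships and Sample Queries sections, a concrete pattern is worth more than prose. A hypothetical entry (table and key names are illustrative):

```sql
-- Orders per customer, joined on the documented key
SELECT c.customer_id, COUNT(o.order_id) AS order_count
FROM customers AS c
LEFT JOIN orders AS o ON o.customer_id = c.customer_id
GROUP BY c.customer_id
```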

For Metrics Documentation

  • Metric Name: Human-readable name
  • Definition: Plain English explanation
  • Formula: Exact calculation with column references
  • Source Table(s): Where the data comes from
  • Caveats: Edge cases, exclusions, gotchas

For Entity Documentation

  • Entity Name: What it's called
  • Definition: What it represents in the business
  • Primary Table: Where to find this entity
  • ID Field(s): How to identify it
  • Relationships: How it relates to other entities
  • Common Filters: Standard exclusions (internal, test, etc.)

Quality Checklist

Before delivering a generated skill, verify:
  • SKILL.md has complete frontmatter (name, description)
  • Entity disambiguation section is clear
  • Key terminology is defined
  • Standard filters/exclusions are documented
  • At least 2-3 sample queries per domain
  • SQL uses correct dialect syntax
  • Reference files are linked from SKILL.md navigation section
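For the frontmatter check, "complete" means both required fields are present at the top of SKILL.md. A minimal example (values are placeholders):

```yaml
---
name: acme-data-analyst
description: Query Acme's warehouse with the correct entities, metrics, and standard filters.
---
```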