data-context-extractor


Data Context Extractor

A meta-skill that extracts company-specific data knowledge from analysts and generates tailored data analysis skills.

How It Works

This skill has two modes:
  1. Bootstrap Mode: Create a new data analysis skill from scratch
  2. Iteration Mode: Improve an existing skill by adding domain-specific reference files

Bootstrap Mode

Use when: User wants to create a new data context skill for their warehouse.

Phase 1: Database Connection & Discovery

Step 1: Identify the database type
Ask: "What data warehouse are you using?"
Common options:
  • BigQuery
  • Snowflake
  • PostgreSQL/Redshift
  • Databricks
Use the data warehouse tools (query and schema) to connect. If unclear, check available MCP tools in the current session.
Step 2: Explore the schema
Use the data warehouse schema tools to:
  1. List available datasets/schemas
  2. Identify the most important tables (ask user: "Which 3-5 tables do analysts query most often?")
  3. Pull schema details for those key tables
Sample exploration queries by dialect:

```sql
-- BigQuery: List datasets
SELECT schema_name FROM INFORMATION_SCHEMA.SCHEMATA

-- BigQuery: List tables in a dataset
SELECT table_name FROM `project.dataset.INFORMATION_SCHEMA.TABLES`

-- Snowflake: List schemas
SHOW SCHEMAS IN DATABASE my_database

-- Snowflake: List tables
SHOW TABLES IN SCHEMA my_schema
```
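The samples above cover BigQuery and Snowflake; for the other warehouses listed in Step 1, equivalent exploration queries might look like this (`public` and `my_schema` are placeholder names):

```sql
-- PostgreSQL/Redshift: List schemas
SELECT schema_name FROM information_schema.schemata

-- PostgreSQL/Redshift: List tables in a schema
SELECT table_name FROM information_schema.tables WHERE table_schema = 'public'

-- Databricks: List schemas
SHOW SCHEMAS

-- Databricks: List tables in a schema
SHOW TABLES IN my_schema
```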

Phase 2: Core Questions (Ask These)

After schema discovery, ask these questions conversationally (not all at once):
Entity Disambiguation (Critical)
"When people here say 'user' or 'customer', what exactly do they mean? Are there different types?"
Listen for:
  • Multiple entity types (user vs account vs organization)
  • Relationships between them (1:1, 1:many, many:many)
  • Which ID fields link them together
Primary Identifiers
"What's the main identifier for a [customer/user/account]? Are there multiple IDs for the same entity?"
Listen for:
  • Primary keys vs business keys
  • UUID vs integer IDs
  • Legacy ID systems
Key Metrics
"What are the 2-3 metrics people ask about most? How is each one calculated?"
Listen for:
  • Exact formulas (ARR = monthly_revenue × 12)
  • Which tables/columns feed each metric
  • Time period conventions (trailing 7 days, calendar month, etc.)
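It helps to translate each captured formula into a query right away as a sanity check. For the ARR example above, that might look like the following (the subscriptions table, its columns, and the active-status filter are all assumptions to confirm with the analyst):

```sql
-- Hypothetical: ARR as monthly_revenue × 12, summed over active subscriptions
SELECT SUM(monthly_revenue) * 12 AS arr
FROM subscriptions
WHERE status = 'active'
```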
Data Hygiene
"What should ALWAYS be filtered out of queries? (test data, fraud, internal users, etc.)"
Listen for:
  • Standard WHERE clauses to always include
  • Flag columns that indicate exclusions (is_test, is_internal, is_fraud)
  • Specific values to exclude (status = 'deleted')
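These answers typically condense into a standard filter block that every generated query should include. A sketch, assuming the flag columns mentioned above exist:

```sql
-- Hypothetical standard exclusions; replace with the analyst's actual flags
SELECT user_id, event_type, created_at
FROM events
WHERE is_test = FALSE
  AND is_internal = FALSE
  AND is_fraud = FALSE
  AND status != 'deleted'
```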
Common Gotchas
"What mistakes do new analysts typically make with this data?"
Listen for:
  • Confusing column names
  • Timezone issues
  • NULL handling quirks
  • Historical vs current state tables

Phase 3: Generate the Skill

Create a skill with this structure:
[company]-data-analyst/
├── SKILL.md
└── references/
    ├── entities.md          # Entity definitions and relationships
    ├── metrics.md           # KPI calculations
    ├── tables/              # One file per domain
    │   ├── [domain1].md
    │   └── [domain2].md
    └── dashboards.json      # Optional: existing dashboards catalog
SKILL.md Template: See references/skill-template.md
SQL Dialect Section: See references/sql-dialects.md and include the appropriate dialect notes.
Reference File Template: See references/domain-template.md

Phase 4: Package and Deliver

  1. Create all files in the skill directory
  2. Package as a zip file
  3. Present to user with summary of what was captured

Iteration Mode

Use when: User has an existing skill but needs to add more context.

Step 1: Load Existing Skill

Ask user to upload their existing skill (zip or folder), or locate it if already in the session.
Read the current SKILL.md and reference files to understand what's already documented.

Step 2: Identify the Gap

Ask: "What domain or topic needs more context? What queries are failing or producing wrong results?"
Common gaps:
  • A new data domain (marketing, finance, product, etc.)
  • Missing metric definitions
  • Undocumented table relationships
  • New terminology

Step 3: Targeted Discovery

For the identified domain:
  1. Explore relevant tables: Use the data warehouse schema tools to find tables in that domain
  2. Ask domain-specific questions:
    • "What tables are used for [domain] analysis?"
    • "What are the key metrics for [domain]?"
    • "Any special filters or gotchas for [domain] data?"
  3. Generate new reference file: Create references/[domain].md using the domain template

Step 4: Update and Repackage

  1. Add the new reference file
  2. Update SKILL.md's "Knowledge Base Navigation" section to include the new domain
  3. Repackage the skill
  4. Present the updated skill to user

Reference File Standards

Each reference file should include:

For Table Documentation

  • Location: Full table path
  • Description: What this table contains, when to use it
  • Primary Key: How to uniquely identify rows
  • Update Frequency: How often data refreshes
  • Key Columns: Table with column name, type, description, notes
  • Relationships: How this table joins to others
  • Sample Queries: 2-3 common query patterns
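For the Relationships and Sample Queries sections, a concrete pattern is worth more than prose. A hypothetical entry (table and key names are illustrative):

```sql
-- Orders per customer, joined on the documented key
SELECT c.customer_id, COUNT(o.order_id) AS order_count
FROM customers AS c
LEFT JOIN orders AS o ON o.customer_id = c.customer_id
GROUP BY c.customer_id
```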

For Metrics Documentation

  • Metric Name: Human-readable name
  • Definition: Plain English explanation
  • Formula: Exact calculation with column references
  • Source Table(s): Where the data comes from
  • Caveats: Edge cases, exclusions, gotchas

For Entity Documentation

  • Entity Name: What it's called
  • Definition: What it represents in the business
  • Primary Table: Where to find this entity
  • ID Field(s): How to identify it
  • Relationships: How it relates to other entities
  • Common Filters: Standard exclusions (internal, test, etc.)

Quality Checklist

Before delivering a generated skill, verify:
  • SKILL.md has complete frontmatter (name, description)
  • Entity disambiguation section is clear
  • Key terminology is defined
  • Standard filters/exclusions are documented
  • At least 2-3 sample queries per domain
  • SQL uses correct dialect syntax
  • Reference files are linked from SKILL.md navigation section
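For the frontmatter check, "complete" means both required fields are present at the top of SKILL.md. A minimal example (values are placeholders):

```yaml
---
name: acme-data-analyst
description: Query Acme's warehouse with the correct entities, metrics, and standard filters.
---
```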