# hugging-face-datasets

## Overview
This skill provides tools to manage datasets on the Hugging Face Hub with a focus on creation, configuration, content management, and SQL-based data manipulation. It is designed to complement the existing Hugging Face MCP server by providing dataset editing and querying capabilities.
## Integration with HF MCP Server

- **Use HF MCP Server for**: Dataset discovery, search, and metadata retrieval
- **Use This Skill for**: Dataset creation, content editing, SQL queries, data transformation, and structured data formatting
## Version

2.1.0
## Dependencies

This skill uses PEP 723 scripts with inline dependency management. Scripts auto-install their requirements when run with `uv run scripts/script_name.py`.

- `uv` (Python package manager)
- **Getting Started**: See "Usage Instructions" below for PEP 723 usage
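For reference, a PEP 723 header is an inline TOML block inside a comment at the top of the script, which `uv run` parses before executing the body. The sketch below shows the general shape; the dependency list is illustrative, not necessarily what the bundled scripts declare:

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "duckdb",
#     "huggingface-hub",
# ]
# ///
# Everything below the header is an ordinary Python script; `uv run`
# reads the TOML block above and installs the listed packages into an
# ephemeral environment before running the code.

def main() -> str:
    return "ready"

if __name__ == "__main__":
    print(main())
```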
## Core Capabilities

### 1. Dataset Lifecycle Management

- **Initialize**: Create new dataset repositories with proper structure
- **Configure**: Store detailed configuration including system prompts and metadata
- **Stream Updates**: Add rows efficiently without downloading entire datasets

### 2. SQL-Based Dataset Querying (NEW)

Query any Hugging Face dataset using DuckDB SQL via `scripts/sql_manager.py`:

- **Direct Queries**: Run SQL on datasets using the `hf://` protocol
- **Schema Discovery**: Describe dataset structure and column types
- **Data Sampling**: Get random samples for exploration
- **Aggregations**: Count, histogram, unique values analysis
- **Transformations**: Filter, join, reshape data with SQL
- **Export & Push**: Save results locally or push to new Hub repos
### 3. Multi-Format Dataset Support

Supports diverse dataset types through a template system:

- **Chat/Conversational**: Chat templating, multi-turn dialogues, tool usage examples
- **Text Classification**: Sentiment analysis, intent detection, topic classification
- **Question-Answering**: Reading comprehension, factual QA, knowledge bases
- **Text Completion**: Language modeling, code completion, creative writing
- **Tabular Data**: Structured data for regression/classification tasks
- **Custom Formats**: Flexible schema definition for specialized needs

### 4. Quality Assurance Features

- **JSON Validation**: Ensures data integrity during uploads
- **Batch Processing**: Efficient handling of large datasets
- **Error Recovery**: Graceful handling of upload failures and conflicts
## Usage Instructions

The skill includes two Python scripts that use PEP 723 inline dependency management. All paths are relative to the directory containing this SKILL.md file. Scripts are run with `uv run scripts/script_name.py [arguments]`:

- `scripts/dataset_manager.py` - Dataset creation and management
- `scripts/sql_manager.py` - SQL-based dataset querying and transformation
### Prerequisites

- `uv` package manager installed
- `HF_TOKEN` environment variable set to a write-access token
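Both prerequisites can be checked up front before invoking either script. A minimal sketch (`preflight` is a hypothetical helper, not part of the skill's scripts):

```python
import os
import shutil

def preflight(env: dict, have_uv: bool) -> list[str]:
    """Return a list of missing prerequisites (empty when ready to run)."""
    problems = []
    if not have_uv:
        problems.append("uv is not on PATH")
    if not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set (needs a write-access token)")
    return problems

if __name__ == "__main__":
    # Probe the real environment: is `uv` installed, is HF_TOKEN exported?
    print(preflight(dict(os.environ), shutil.which("uv") is not None))
```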
## SQL Dataset Querying (sql_manager.py)

Query, transform, and push Hugging Face datasets using DuckDB SQL. The `hf://` protocol provides direct access to any public dataset (or to private datasets with a token).

### Quick Start

```bash
# Query a dataset
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data WHERE subject='nutrition' LIMIT 10"

# Get dataset schema
uv run scripts/sql_manager.py describe --dataset "cais/mmlu"

# Sample random rows
uv run scripts/sql_manager.py sample --dataset "cais/mmlu" --n 5

# Count rows with a filter
uv run scripts/sql_manager.py count --dataset "cais/mmlu" --where "subject='nutrition'"
```

### SQL Query Syntax

Use `data` as the table name in your SQL - it gets replaced with the actual `hf://` path:
```sql
-- Basic select
SELECT * FROM data LIMIT 10

-- Filtering
SELECT * FROM data WHERE subject='nutrition'

-- Aggregations
SELECT subject, COUNT(*) as cnt FROM data GROUP BY subject ORDER BY cnt DESC

-- Column selection and transformation
SELECT question, choices[answer] AS correct_answer FROM data

-- Regex matching
SELECT * FROM data WHERE regexp_matches(question, 'nutrition|diet')

-- String functions
SELECT regexp_replace(question, '\n', '') AS cleaned FROM data
```
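The table-name substitution described above can be sketched in a few lines of Python. This is illustrative only — `expand_table` is a hypothetical helper, and the real sql_manager.py may implement the rewrite differently:

```python
# Sketch of the substitution sql_manager.py is described as performing:
# the bare table name `data` is swapped for the dataset's auto-converted
# parquet files on the Hub before the query is handed to DuckDB.

def expand_table(sql: str, dataset: str,
                 config: str = "default", split: str = "train") -> str:
    """Replace the table name `data` with an hf:// parquet glob."""
    path = f"hf://datasets/{dataset}@~parquet/{config}/{split}/*.parquet"
    return sql.replace("FROM data", f"FROM '{path}'")

expanded = expand_table("SELECT * FROM data LIMIT 10", "cais/mmlu")
print(expanded)
# Executing it for real needs network access and the duckdb package:
#   import duckdb; duckdb.sql(expanded).show()
```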
### Common Operations

#### 1. Explore Dataset Structure

```bash
# Get schema
uv run scripts/sql_manager.py describe --dataset "cais/mmlu"

# Get unique values in a column
uv run scripts/sql_manager.py unique --dataset "cais/mmlu" --column "subject"

# Get value distribution
uv run scripts/sql_manager.py histogram --dataset "cais/mmlu" --column "subject" --bins 20
```
#### 2. Filter and Transform

```bash
# Complex filtering with SQL
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT subject, COUNT(*) as cnt FROM data GROUP BY subject HAVING cnt > 100"

# Using the transform command
uv run scripts/sql_manager.py transform \
  --dataset "cais/mmlu" \
  --select "subject, COUNT(*) as cnt" \
  --group-by "subject" \
  --order-by "cnt DESC" \
  --limit 10
```
#### 3. Create Subsets and Push to Hub

```bash
# Query and push to a new dataset
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data WHERE subject='nutrition'" \
  --push-to "username/mmlu-nutrition-subset" \
  --private

# Transform and push
uv run scripts/sql_manager.py transform \
  --dataset "ibm/duorc" \
  --config "ParaphraseRC" \
  --select "question, answers" \
  --where "LENGTH(question) > 50" \
  --push-to "username/duorc-long-questions"
```
#### 4. Export to Local Files

```bash
# Export to Parquet
uv run scripts/sql_manager.py export \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data WHERE subject='nutrition'" \
  --output "nutrition.parquet" \
  --format parquet

# Export to JSONL
uv run scripts/sql_manager.py export \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data LIMIT 100" \
  --output "sample.jsonl" \
  --format jsonl
```
#### 5. Working with Dataset Configs/Splits

```bash
# Specify a config (subset)
uv run scripts/sql_manager.py query \
  --dataset "ibm/duorc" \
  --config "ParaphraseRC" \
  --sql "SELECT * FROM data LIMIT 5"

# Specify a split
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --split "test" \
  --sql "SELECT COUNT(*) FROM data"

# Query all splits
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --split "*" \
  --sql "SELECT * FROM data LIMIT 10"
```
#### 6. Raw SQL with Full Paths

For complex queries or joining datasets:

```bash
uv run scripts/sql_manager.py raw --sql "
SELECT a.*, b.*
FROM 'hf://datasets/dataset1@~parquet/default/train/*.parquet' a
JOIN 'hf://datasets/dataset2@~parquet/default/train/*.parquet' b
ON a.id = b.id
LIMIT 100
"
```

### Python API Usage
```python
from sql_manager import HFDatasetSQL

sql = HFDatasetSQL()

# Query
results = sql.query("cais/mmlu", "SELECT * FROM data WHERE subject='nutrition' LIMIT 10")

# Get schema
schema = sql.describe("cais/mmlu")

# Sample
samples = sql.sample("cais/mmlu", n=5, seed=42)

# Count
count = sql.count("cais/mmlu", where="subject='nutrition'")

# Histogram
dist = sql.histogram("cais/mmlu", "subject")

# Filter and transform
results = sql.filter_and_transform(
    "cais/mmlu",
    select="subject, COUNT(*) as cnt",
    group_by="subject",
    order_by="cnt DESC",
    limit=10
)

# Push to Hub
url = sql.push_to_hub(
    "cais/mmlu",
    "username/nutrition-subset",
    sql="SELECT * FROM data WHERE subject='nutrition'",
    private=True
)

# Export locally
sql.export_to_parquet("cais/mmlu", "output.parquet", sql="SELECT * FROM data LIMIT 100")
sql.close()
```
undefinedHF Path Format
HF路径格式
DuckDB uses the protocol to access datasets:
hf://hf://datasets/{dataset_id}@{revision}/{config}/{split}/*.parquetExamples:
hf://datasets/cais/mmlu@~parquet/default/train/*.parquethf://datasets/ibm/duorc@~parquet/ParaphraseRC/test/*.parquet
The revision provides auto-converted Parquet files for any dataset format.
@~parquetDuckDB使用协议访问数据集:
hf://hf://datasets/{dataset_id}@{revision}/{config}/{split}/*.parquet示例:
hf://datasets/cais/mmlu@~parquet/default/train/*.parquethf://datasets/ibm/duorc@~parquet/ParaphraseRC/test/*.parquet
@~parquetUseful DuckDB SQL Functions
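Given the `@~parquet` layout above, a path for any dataset/config/split can be assembled mechanically. `hf_parquet_path` is an illustrative helper, not part of the skill:

```python
def hf_parquet_path(dataset_id: str, config: str, split: str,
                    revision: str = "~parquet") -> str:
    """Assemble the hf:// glob for a dataset's parquet files on the Hub."""
    return f"hf://datasets/{dataset_id}@{revision}/{config}/{split}/*.parquet"

# Reproduces the second example path from the section above.
print(hf_parquet_path("ibm/duorc", "ParaphraseRC", "test"))
```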
### Useful DuckDB SQL Functions

```sql
-- String functions
LENGTH(column)                    -- String length
regexp_replace(col, '\n', '')     -- Regex replace
regexp_matches(col, 'pattern')    -- Regex match
LOWER(col), UPPER(col)            -- Case conversion

-- Array functions
choices[1]                        -- Array indexing (1-based in DuckDB)
array_length(choices)             -- Array length
unnest(choices)                   -- Expand array to rows

-- Aggregations
COUNT(*), SUM(col), AVG(col)
GROUP BY col HAVING condition

-- Sampling
USING SAMPLE 10                   -- Random sample
USING SAMPLE 10 (RESERVOIR, 42)   -- Reproducible sample

-- Window functions
ROW_NUMBER() OVER (PARTITION BY col ORDER BY col2)
```

## Dataset Creation (dataset_manager.py)
### Recommended Workflow

**1. Discovery (Use HF MCP Server):**

```python
# Use HF MCP tools to find existing datasets
search_datasets("conversational AI training")
get_dataset_details("username/dataset-name")
```

**2. Creation (Use This Skill):**

```bash
# Initialize a new dataset
uv run scripts/dataset_manager.py init --repo_id "your-username/dataset-name" [--private]

# Configure with a detailed system prompt
uv run scripts/dataset_manager.py config --repo_id "your-username/dataset-name" --system_prompt "$(cat system_prompt.txt)"
```

**3. Content Management (Use This Skill):**

```bash
# Quick setup with any template
uv run scripts/dataset_manager.py quick_setup \
  --repo_id "your-username/dataset-name" \
  --template classification

# Add data with template validation
uv run scripts/dataset_manager.py add_rows \
  --repo_id "your-username/dataset-name" \
  --template qa \
  --rows_json "$(cat your_qa_data.json)"
```
undefinedTemplate-Based Data Structures
基于模板的数据结构
1. Chat Template ()
--template chatjson
{
"messages": [
{"role": "user", "content": "Natural user request"},
{"role": "assistant", "content": "Response with tool usage"},
{"role": "tool", "content": "Tool response", "tool_call_id": "call_123"}
],
"scenario": "Description of use case",
"complexity": "simple|intermediate|advanced"
}2. Classification Template ()
--template classificationjson
{
"text": "Input text to be classified",
"label": "classification_label",
"confidence": 0.95,
"metadata": {"domain": "technology", "language": "en"}
}3. QA Template ()
--template qajson
{
"question": "What is the question being asked?",
"answer": "The complete answer",
"context": "Additional context if needed",
"answer_type": "factual|explanatory|opinion",
"difficulty": "easy|medium|hard"
}4. Completion Template ()
--template completionjson
{
"prompt": "The beginning text or context",
"completion": "The expected continuation",
"domain": "code|creative|technical|conversational",
"style": "description of writing style"
}5. Tabular Template ()
--template tabularjson
{
"columns": [
{"name": "feature1", "type": "numeric", "description": "First feature"},
{"name": "target", "type": "categorical", "description": "Target variable"}
],
"data": [
{"feature1": 123, "target": "class_a"},
{"feature1": 456, "target": "class_b"}
]
}1. 对话模板()
--template chatjson
{
"messages": [
{"role": "user", "content": "自然的用户请求"},
{"role": "assistant", "content": "包含工具使用的响应"},
{"role": "tool", "content": "工具响应", "tool_call_id": "call_123"}
],
"scenario": "使用场景描述",
"complexity": "simple|intermediate|advanced"
}2. 分类模板()
--template classificationjson
{
"text": "待分类的输入文本",
"label": "classification_label",
"confidence": 0.95,
"metadata": {"domain": "technology", "language": "en"}
}3. 问答模板()
--template qajson
{
"question": "提出的问题是什么?",
"answer": "完整的答案",
"context": "(可选)额外的上下文信息",
"answer_type": "factual|explanatory|opinion",
"difficulty": "easy|medium|hard"
}4. 补全模板()
--template completionjson
{
"prompt": "开头文本或上下文",
"completion": "预期的续写内容",
"domain": "code|creative|technical|conversational",
"style": "写作风格描述"
}5. 表格模板()
--template tabularjson
{
"columns": [
{"name": "feature1", "type": "numeric", "description": "第一个特征"},
{"name": "target", "type": "categorical", "description": "目标变量"}
],
"data": [
{"feature1": 123, "target": "class_a"},
{"feature1": 456, "target": "class_b"}
]
}Advanced System Prompt Template
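Template validation of the kind `add_rows --template qa` performs amounts, at minimum, to checking that each row carries the template's required keys. The sketch below is a guess at that logic — `REQUIRED_FIELDS` and `validate_rows` are hypothetical names, and the real script may enforce stricter rules:

```python
import json

# Hypothetical required-field map mirroring the five templates above.
REQUIRED_FIELDS = {
    "chat": {"messages"},
    "classification": {"text", "label"},
    "qa": {"question", "answer"},
    "completion": {"prompt", "completion"},
    "tabular": {"columns", "data"},
}

def validate_rows(rows_json: str, template: str) -> list[dict]:
    """Parse a JSON array of rows; raise ValueError if any row is
    missing a field the chosen template requires."""
    rows = json.loads(rows_json)
    required = REQUIRED_FIELDS[template]
    for i, row in enumerate(rows):
        missing = required - row.keys()
        if missing:
            raise ValueError(f"row {i} is missing fields: {sorted(missing)}")
    return rows

rows = validate_rows('[{"question": "What is AI?", "answer": "..."}]', "qa")
print(f"{len(rows)} row(s) valid")
```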
## Advanced System Prompt Template

For high-quality training data generation:

```text
You are an AI assistant expert at using MCP tools effectively.

MCP SERVER DEFINITIONS
[Define available servers and tools]

TRAINING EXAMPLE STRUCTURE
[Specify exact JSON schema for chat templating]

QUALITY GUIDELINES
[Detail requirements for realistic scenarios, progressive complexity, proper tool usage]

EXAMPLE CATEGORIES
[List development workflows, debugging scenarios, data management tasks]
```
## Example Categories & Templates

The skill includes diverse training examples beyond just MCP usage.

**Available Example Sets:**

- `training_examples.json` - MCP tool usage examples (debugging, project setup, database analysis)
- `diverse_training_examples.json` - Broader scenarios including:
  - **Educational Chat** - Explaining programming concepts, tutorials
  - **Git Workflows** - Feature branches, version control guidance
  - **Code Analysis** - Performance optimization, architecture review
  - **Content Generation** - Professional writing, creative brainstorming
  - **Codebase Navigation** - Legacy code exploration, systematic analysis
  - **Conversational Support** - Problem-solving, technical discussions

**Using Different Example Sets:**

```bash
# Add MCP-focused examples
uv run scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
  --rows_json "$(cat examples/training_examples.json)"

# Add diverse conversational examples
uv run scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
  --rows_json "$(cat examples/diverse_training_examples.json)"

# Mix both for comprehensive training data
uv run scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
  --rows_json "$(jq -s '.[0] + .[1]' examples/training_examples.json examples/diverse_training_examples.json)"
```
## Commands Reference

**List Available Templates:**

```bash
uv run scripts/dataset_manager.py list_templates
```

**Quick Setup (Recommended):**

```bash
uv run scripts/dataset_manager.py quick_setup --repo_id "your-username/dataset-name" --template classification
```

**Manual Setup:**

```bash
# Initialize the repository
uv run scripts/dataset_manager.py init --repo_id "your-username/dataset-name" [--private]

# Configure with a system prompt
uv run scripts/dataset_manager.py config --repo_id "your-username/dataset-name" --system_prompt "Your prompt here"

# Add data with validation
uv run scripts/dataset_manager.py add_rows \
  --repo_id "your-username/dataset-name" \
  --template qa \
  --rows_json '[{"question": "What is AI?", "answer": "Artificial Intelligence..."}]'
```

**View Dataset Statistics:**

```bash
uv run scripts/dataset_manager.py stats --repo_id "your-username/dataset-name"
```

## Error Handling
- **Repository exists**: Script will notify and continue with configuration
- **Invalid JSON**: Clear error message with parsing details
- **Network issues**: Automatic retry for transient failures
- **Token permissions**: Validation before operations begin
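The automatic retry for transient failures might look roughly like this. An illustrative sketch only: the scripts' actual backoff policy, exception types, and parameters are not documented here, and `with_retry`/`flaky_upload` are hypothetical names:

```python
import time

def with_retry(fn, attempts: int = 3, base_delay: float = 1.0):
    """Call fn(), retrying on ConnectionError with exponential backoff;
    re-raise after the final attempt fails."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

calls = {"count": 0}

def flaky_upload():
    """Stand-in for a Hub upload that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient network failure")
    return "uploaded"

print(with_retry(flaky_upload, base_delay=0.01))
```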
## Combined Workflow Examples

### Example 1: Create a Training Subset from an Existing Dataset

```bash
# 1. Explore the source dataset
uv run scripts/sql_manager.py describe --dataset "cais/mmlu"
uv run scripts/sql_manager.py histogram --dataset "cais/mmlu" --column "subject"

# 2. Query and create the subset
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data WHERE subject IN ('nutrition', 'anatomy', 'clinical_knowledge')" \
  --push-to "username/mmlu-medical-subset" \
  --private
```
### Example 2: Transform and Reshape Data

```bash
# Transform MMLU to QA format with the correct answers extracted
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT question, choices[answer] as correct_answer, subject FROM data" \
  --push-to "username/mmlu-qa-format"
```
### Example 3: Merge Multiple Dataset Splits

```bash
# Export all splits combined into one file
uv run scripts/sql_manager.py export \
  --dataset "cais/mmlu" \
  --split "*" \
  --output "mmlu_all.parquet"
```
### Example 4: Quality Filtering

```bash
# Filter for high-quality examples
uv run scripts/sql_manager.py query \
  --dataset "squad" \
  --sql "SELECT * FROM data WHERE LENGTH(context) > 500 AND LENGTH(question) > 20" \
  --push-to "username/squad-filtered"
```
### Example 5: Create a Custom Training Dataset

```bash
# 1. Query the source data
uv run scripts/sql_manager.py export \
  --dataset "cais/mmlu" \
  --sql "SELECT question, subject FROM data WHERE subject='nutrition'" \
  --output "nutrition_source.jsonl" \
  --format jsonl

# 2. Process with your pipeline (add answers, format, etc.)

# 3. Push the processed data
uv run scripts/dataset_manager.py init --repo_id "username/nutrition-training"
uv run scripts/dataset_manager.py add_rows \
  --repo_id "username/nutrition-training" \
  --template qa \
  --rows_json "$(cat processed_data.json)"
```