# hugging-face-datasets


## Overview

This skill provides tools to manage datasets on the Hugging Face Hub with a focus on creation, configuration, content management, and SQL-based data manipulation. It is designed to complement the existing Hugging Face MCP server by providing dataset editing and querying capabilities.

## Integration with HF MCP Server

- **Use HF MCP Server for:** dataset discovery, search, and metadata retrieval
- **Use This Skill for:** dataset creation, content editing, SQL queries, data transformation, and structured data formatting

## Version

2.1.0

## Dependencies

This skill uses PEP 723 scripts with inline dependency management. Scripts auto-install their requirements when run with `uv run scripts/script_name.py`.

- `uv` (Python package manager)
- **Getting Started:** see "Usage Instructions" below for PEP 723 usage
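For reference, a PEP 723 header is a short comment block at the top of a script. A minimal sketch (the actual dependency lists in `scripts/dataset_manager.py` and `scripts/sql_manager.py` may differ):

```python
# /// script
# requires-python = ">=3.10"
# dependencies = ["duckdb", "huggingface-hub"]
# ///
# When a script starting with the block above is invoked via `uv run`,
# uv parses the header and installs the listed packages into an
# ephemeral environment before executing the script.

SCRIPT_DEPS = ["duckdb", "huggingface-hub"]  # mirrors the header, for illustration
print(f"declared dependencies: {SCRIPT_DEPS}")
```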

## Core Capabilities

### 1. Dataset Lifecycle Management

- **Initialize:** create new dataset repositories with proper structure
- **Configure:** store detailed configuration including system prompts and metadata
- **Stream Updates:** add rows efficiently without downloading entire datasets

### 2. SQL-Based Dataset Querying (NEW)

Query any Hugging Face dataset using DuckDB SQL via `scripts/sql_manager.py`:

- **Direct Queries:** run SQL on datasets using the `hf://` protocol
- **Schema Discovery:** describe dataset structure and column types
- **Data Sampling:** get random samples for exploration
- **Aggregations:** count, histogram, and unique-value analysis
- **Transformations:** filter, join, and reshape data with SQL
- **Export & Push:** save results locally or push to new Hub repos

### 3. Multi-Format Dataset Support

Supports diverse dataset types through a template system:

- **Chat/Conversational:** chat templating, multi-turn dialogues, tool-usage examples
- **Text Classification:** sentiment analysis, intent detection, topic classification
- **Question-Answering:** reading comprehension, factual QA, knowledge bases
- **Text Completion:** language modeling, code completion, creative writing
- **Tabular Data:** structured data for regression/classification tasks
- **Custom Formats:** flexible schema definition for specialized needs

### 4. Quality Assurance Features

- **JSON Validation:** ensures data integrity during uploads
- **Batch Processing:** efficient handling of large datasets
- **Error Recovery:** graceful handling of upload failures and conflicts
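To illustrate the kind of check involved, here is a simplified sketch of JSON validation before upload (a hypothetical helper, not the actual code in `scripts/dataset_manager.py`):

```python
import json

def validate_rows(rows_json: str, required_keys: set[str]) -> list[dict]:
    """Parse a JSON array of rows and check each row has the required keys.

    A simplified sketch of the kind of check add_rows performs; the real
    script's behaviour may differ.
    """
    rows = json.loads(rows_json)  # raises ValueError on malformed JSON
    if not isinstance(rows, list):
        raise ValueError("expected a JSON array of row objects")
    for i, row in enumerate(rows):
        missing = required_keys - row.keys()
        if missing:
            raise ValueError(f"row {i} is missing keys: {sorted(missing)}")
    return rows

rows = validate_rows('[{"question": "What is AI?", "answer": "..."}]',
                     {"question", "answer"})
print(len(rows))  # 1
```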

## Usage Instructions

The skill includes two Python scripts that use PEP 723 inline dependency management. All paths are relative to the directory containing this SKILL.md file. Scripts are run with `uv run scripts/script_name.py [arguments]`:

- `scripts/dataset_manager.py` - dataset creation and management
- `scripts/sql_manager.py` - SQL-based dataset querying and transformation

## Prerequisites

- `uv` package manager installed
- `HF_TOKEN` environment variable set to a token with write access
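Before the first run, a quick pre-flight check can save a confusing failure later. An illustrative snippet (not part of the scripts):

```shell
# Illustrative pre-flight check; run it once before using the scripts.
check_env() {
  command -v uv >/dev/null 2>&1 || { echo "missing: uv" >&2; return 1; }
  [ -n "${HF_TOKEN:-}" ] || { echo "missing: HF_TOKEN" >&2; return 1; }
  echo "environment ok"
}
check_env || echo "fix the issues above, then re-run"
```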

## SQL Dataset Querying (sql_manager.py)

Query, transform, and push Hugging Face datasets using DuckDB SQL. The `hf://` protocol provides direct access to any public dataset (or private datasets with a token).

### Quick Start

```bash
# Query a dataset
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data WHERE subject='nutrition' LIMIT 10"

# Get dataset schema
uv run scripts/sql_manager.py describe --dataset "cais/mmlu"

# Sample random rows
uv run scripts/sql_manager.py sample --dataset "cais/mmlu" --n 5

# Count rows with a filter
uv run scripts/sql_manager.py count --dataset "cais/mmlu" --where "subject='nutrition'"
```

### SQL Query Syntax

Use `data` as the table name in your SQL; it gets replaced with the actual `hf://` path:

```sql
-- Basic select
SELECT * FROM data LIMIT 10

-- Filtering
SELECT * FROM data WHERE subject='nutrition'

-- Aggregations
SELECT subject, COUNT(*) as cnt FROM data GROUP BY subject ORDER BY cnt DESC

-- Column selection and transformation
SELECT question, choices[answer] AS correct_answer FROM data

-- Regex matching
SELECT * FROM data WHERE regexp_matches(question, 'nutrition|diet')

-- String functions
SELECT regexp_replace(question, '\n', '') AS cleaned FROM data
```

### Common Operations

#### 1. Explore Dataset Structure

```bash
# Get schema
uv run scripts/sql_manager.py describe --dataset "cais/mmlu"

# Get unique values in a column
uv run scripts/sql_manager.py unique --dataset "cais/mmlu" --column "subject"

# Get value distribution
uv run scripts/sql_manager.py histogram --dataset "cais/mmlu" --column "subject" --bins 20
```

#### 2. Filter and Transform

```bash
# Complex filtering with SQL
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT subject, COUNT(*) as cnt FROM data GROUP BY subject HAVING cnt > 100"

# Using the transform command
uv run scripts/sql_manager.py transform \
  --dataset "cais/mmlu" \
  --select "subject, COUNT(*) as cnt" \
  --group-by "subject" \
  --order-by "cnt DESC" \
  --limit 10
```

#### 3. Create Subsets and Push to Hub

```bash
# Query and push to a new dataset
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data WHERE subject='nutrition'" \
  --push-to "username/mmlu-nutrition-subset" \
  --private

# Transform and push
uv run scripts/sql_manager.py transform \
  --dataset "ibm/duorc" \
  --config "ParaphraseRC" \
  --select "question, answers" \
  --where "LENGTH(question) > 50" \
  --push-to "username/duorc-long-questions"
```

#### 4. Export to Local Files

```bash
# Export to Parquet
uv run scripts/sql_manager.py export \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data WHERE subject='nutrition'" \
  --output "nutrition.parquet" \
  --format parquet

# Export to JSONL
uv run scripts/sql_manager.py export \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data LIMIT 100" \
  --output "sample.jsonl" \
  --format jsonl
```

#### 5. Working with Dataset Configs/Splits

```bash
# Specify a config (subset)
uv run scripts/sql_manager.py query \
  --dataset "ibm/duorc" \
  --config "ParaphraseRC" \
  --sql "SELECT * FROM data LIMIT 5"

# Specify a split
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --split "test" \
  --sql "SELECT COUNT(*) FROM data"

# Query all splits
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --split "*" \
  --sql "SELECT * FROM data LIMIT 10"
```

#### 6. Raw SQL with Full Paths

For complex queries or joining datasets:

```bash
uv run scripts/sql_manager.py raw --sql "
  SELECT a.*, b.*
  FROM 'hf://datasets/dataset1@~parquet/default/train/*.parquet' a
  JOIN 'hf://datasets/dataset2@~parquet/default/train/*.parquet' b
  ON a.id = b.id
  LIMIT 100
"
```

### Python API Usage

```python
from sql_manager import HFDatasetSQL

sql = HFDatasetSQL()

# Query
results = sql.query("cais/mmlu", "SELECT * FROM data WHERE subject='nutrition' LIMIT 10")

# Get schema
schema = sql.describe("cais/mmlu")

# Sample
samples = sql.sample("cais/mmlu", n=5, seed=42)

# Count
count = sql.count("cais/mmlu", where="subject='nutrition'")

# Histogram
dist = sql.histogram("cais/mmlu", "subject")

# Filter and transform
results = sql.filter_and_transform(
    "cais/mmlu",
    select="subject, COUNT(*) as cnt",
    group_by="subject",
    order_by="cnt DESC",
    limit=10,
)

# Push to Hub
url = sql.push_to_hub(
    "cais/mmlu",
    "username/nutrition-subset",
    sql="SELECT * FROM data WHERE subject='nutrition'",
    private=True,
)

# Export locally
sql.export_to_parquet("cais/mmlu", "output.parquet", sql="SELECT * FROM data LIMIT 100")
sql.close()
```

### HF Path Format

DuckDB uses the `hf://` protocol to access datasets:

```
hf://datasets/{dataset_id}@{revision}/{config}/{split}/*.parquet
```

Examples:

- `hf://datasets/cais/mmlu@~parquet/default/train/*.parquet`
- `hf://datasets/ibm/duorc@~parquet/ParaphraseRC/test/*.parquet`

The `@~parquet` revision provides auto-converted Parquet files for any dataset format.
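The path template above can also be assembled programmatically. A small sketch (the helper name is illustrative, and the commented-out DuckDB call needs network access plus the `duckdb` package):

```python
def hf_parquet_path(dataset_id: str, config: str = "default",
                    split: str = "train") -> str:
    """Build the auto-converted Parquet path for a Hub dataset."""
    return f"hf://datasets/{dataset_id}@~parquet/{config}/{split}/*.parquet"

path = hf_parquet_path("cais/mmlu")
print(path)  # hf://datasets/cais/mmlu@~parquet/default/train/*.parquet

# With duckdb installed and network access:
# import duckdb
# duckdb.sql(f"SELECT COUNT(*) FROM '{path}'").show()
```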

### Useful DuckDB SQL Functions

```sql
-- String functions
LENGTH(column)                    -- String length
regexp_replace(col, '\n', '')     -- Regex replace
regexp_matches(col, 'pattern')    -- Regex match
LOWER(col), UPPER(col)            -- Case conversion

-- Array functions
choices[1]                        -- List indexing (1-based in DuckDB)
array_length(choices)             -- Array length
unnest(choices)                   -- Expand array to rows

-- Aggregations
COUNT(*), SUM(col), AVG(col)
GROUP BY col HAVING condition

-- Sampling
USING SAMPLE 10                   -- Random sample
USING SAMPLE 10 (RESERVOIR, 42)   -- Reproducible sample

-- Window functions
ROW_NUMBER() OVER (PARTITION BY col ORDER BY col2)
```

## Dataset Creation (dataset_manager.py)

### Recommended Workflow

**1. Discovery (Use HF MCP Server):**

```python
# Use HF MCP tools to find existing datasets
search_datasets("conversational AI training")
get_dataset_details("username/dataset-name")
```

**2. Creation (Use This Skill):**

```bash
# Initialize a new dataset
uv run scripts/dataset_manager.py init --repo_id "your-username/dataset-name" [--private]

# Configure with a detailed system prompt
uv run scripts/dataset_manager.py config --repo_id "your-username/dataset-name" --system_prompt "$(cat system_prompt.txt)"
```

**3. Content Management (Use This Skill):**

```bash
# Quick setup with any template
uv run scripts/dataset_manager.py quick_setup \
  --repo_id "your-username/dataset-name" \
  --template classification

# Add data with template validation
uv run scripts/dataset_manager.py add_rows \
  --repo_id "your-username/dataset-name" \
  --template qa \
  --rows_json "$(cat your_qa_data.json)"
```

### Template-Based Data Structures

**1. Chat Template (`--template chat`):**

```json
{
  "messages": [
    {"role": "user", "content": "Natural user request"},
    {"role": "assistant", "content": "Response with tool usage"},
    {"role": "tool", "content": "Tool response", "tool_call_id": "call_123"}
  ],
  "scenario": "Description of use case",
  "complexity": "simple|intermediate|advanced"
}
```

**2. Classification Template (`--template classification`):**

```json
{
  "text": "Input text to be classified",
  "label": "classification_label",
  "confidence": 0.95,
  "metadata": {"domain": "technology", "language": "en"}
}
```

**3. QA Template (`--template qa`):**

```json
{
  "question": "What is the question being asked?",
  "answer": "The complete answer",
  "context": "Additional context if needed",
  "answer_type": "factual|explanatory|opinion",
  "difficulty": "easy|medium|hard"
}
```

**4. Completion Template (`--template completion`):**

```json
{
  "prompt": "The beginning text or context",
  "completion": "The expected continuation",
  "domain": "code|creative|technical|conversational",
  "style": "description of writing style"
}
```

**5. Tabular Template (`--template tabular`):**

```json
{
  "columns": [
    {"name": "feature1", "type": "numeric", "description": "First feature"},
    {"name": "target", "type": "categorical", "description": "Target variable"}
  ],
  "data": [
    {"feature1": 123, "target": "class_a"},
    {"feature1": 456, "target": "class_b"}
  ]
}
```
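Rows matching these templates can be assembled in Python and handed to `add_rows`. A brief sketch using the QA template (the content values are placeholders):

```python
import json

# Build a rows_json payload matching the QA template above.
qa_rows = [
    {
        "question": "What protocol does DuckDB use to read Hub datasets?",
        "answer": "The hf:// protocol, over auto-converted Parquet files.",
        "context": "",
        "answer_type": "factual",
        "difficulty": "easy",
    }
]
rows_json = json.dumps(qa_rows, ensure_ascii=False)
print(rows_json[:20])
```

The resulting string can be passed directly to `--rows_json`, or written to a file and substituted with `$(cat your_qa_data.json)` as in the workflow above.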

### Advanced System Prompt Template

For high-quality training data generation:

```text
You are an AI assistant expert at using MCP tools effectively.

MCP SERVER DEFINITIONS
[Define available servers and tools]

TRAINING EXAMPLE STRUCTURE
[Specify the exact JSON schema for chat templating]

QUALITY GUIDELINES
[Detail requirements for realistic scenarios, progressive complexity, proper tool usage]

EXAMPLE CATEGORIES
[List development workflows, debugging scenarios, data management tasks]
```

### Example Categories & Templates

The skill includes diverse training examples beyond just MCP usage.

**Available Example Sets:**

- `training_examples.json` - MCP tool-usage examples (debugging, project setup, database analysis)
- `diverse_training_examples.json` - broader scenarios, including:
  - **Educational Chat** - explaining programming concepts, tutorials
  - **Git Workflows** - feature branches, version-control guidance
  - **Code Analysis** - performance optimization, architecture review
  - **Content Generation** - professional writing, creative brainstorming
  - **Codebase Navigation** - legacy-code exploration, systematic analysis
  - **Conversational Support** - problem-solving, technical discussions

**Using Different Example Sets:**

```bash
# Add MCP-focused examples
uv run scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
  --rows_json "$(cat examples/training_examples.json)"

# Add diverse conversational examples
uv run scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
  --rows_json "$(cat examples/diverse_training_examples.json)"

# Mix both for comprehensive training data
uv run scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
  --rows_json "$(jq -s '.[0] + .[1]' examples/training_examples.json examples/diverse_training_examples.json)"
```

### Commands Reference

**List Available Templates:**

```bash
uv run scripts/dataset_manager.py list_templates
```

**Quick Setup (Recommended):**

```bash
uv run scripts/dataset_manager.py quick_setup --repo_id "your-username/dataset-name" --template classification
```

**Manual Setup:**

```bash
# Initialize the repository
uv run scripts/dataset_manager.py init --repo_id "your-username/dataset-name" [--private]

# Configure with a system prompt
uv run scripts/dataset_manager.py config --repo_id "your-username/dataset-name" --system_prompt "Your prompt here"

# Add data with validation
uv run scripts/dataset_manager.py add_rows \
  --repo_id "your-username/dataset-name" \
  --template qa \
  --rows_json '[{"question": "What is AI?", "answer": "Artificial Intelligence..."}]'
```

**View Dataset Statistics:**

```bash
uv run scripts/dataset_manager.py stats --repo_id "your-username/dataset-name"
```

## Error Handling

- **Repository exists:** the script notifies you and continues with configuration
- **Invalid JSON:** clear error message with parsing details
- **Network issues:** automatic retry for transient failures
- **Token permissions:** validated before operations begin
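As a sketch of what retry-on-transient-failure can look like (a hypothetical helper; the scripts' actual retry policy may differ):

```python
import random
import time

def with_retry(fn, attempts: int = 3, base_delay: float = 0.5):
    """Call fn, retrying with exponential backoff and jitter on ConnectionError."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * 2 ** attempt + random.random() * 0.1)

# Demo with a function that fails once, then succeeds.
calls = {"n": 0}
def flaky_upload():
    calls["n"] += 1
    if calls["n"] < 2:
        raise ConnectionError("transient network failure")
    return "uploaded"

print(with_retry(flaky_upload, base_delay=0.01))  # uploaded
```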

## Combined Workflow Examples

### Example 1: Create a Training Subset from an Existing Dataset

```bash
# 1. Explore the source dataset
uv run scripts/sql_manager.py describe --dataset "cais/mmlu"
uv run scripts/sql_manager.py histogram --dataset "cais/mmlu" --column "subject"

# 2. Query and create the subset
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data WHERE subject IN ('nutrition', 'anatomy', 'clinical_knowledge')" \
  --push-to "username/mmlu-medical-subset" \
  --private
```

### Example 2: Transform and Reshape Data

```bash
# Transform MMLU to QA format with the correct answers extracted
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT question, choices[answer] as correct_answer, subject FROM data" \
  --push-to "username/mmlu-qa-format"
```

### Example 3: Merge Multiple Dataset Splits

```bash
# Export multiple splits and combine them
uv run scripts/sql_manager.py export \
  --dataset "cais/mmlu" \
  --split "*" \
  --output "mmlu_all.parquet"
```

### Example 4: Quality Filtering

```bash
# Filter for high-quality examples
uv run scripts/sql_manager.py query \
  --dataset "squad" \
  --sql "SELECT * FROM data WHERE LENGTH(context) > 500 AND LENGTH(question) > 20" \
  --push-to "username/squad-filtered"
```

### Example 5: Create a Custom Training Dataset

```bash
# 1. Query the source data
uv run scripts/sql_manager.py export \
  --dataset "cais/mmlu" \
  --sql "SELECT question, subject FROM data WHERE subject='nutrition'" \
  --output "nutrition_source.jsonl" \
  --format jsonl

# 2. Process with your pipeline (add answers, format, etc.)

# 3. Push the processed data
uv run scripts/dataset_manager.py init --repo_id "username/nutrition-training"
uv run scripts/dataset_manager.py add_rows \
  --repo_id "username/nutrition-training" \
  --template qa \
  --rows_json "$(cat processed_data.json)"
```
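Step 2 of this workflow is deliberately open-ended. One possible sketch of turning the exported questions into QA-template rows (the file names and the answer source are placeholders for whatever your pipeline uses):

```python
import json

def build_qa_rows(jsonl_path: str, answer_fn) -> list[dict]:
    """Read exported question records and emit QA-template rows.

    answer_fn stands in for whatever produces answers in your pipeline
    (a model call, a lookup table, manual annotation, and so on).
    """
    rows = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            rows.append({
                "question": rec["question"],
                "answer": answer_fn(rec["question"]),
                "answer_type": "factual",
                "difficulty": "medium",
            })
    return rows
```

The returned list can then be serialized with `json.dumps` and saved as `processed_data.json` for step 3.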