# hugging-face-datasets


## Overview

This skill provides tools to manage datasets on the Hugging Face Hub with a focus on creation, configuration, content management, and SQL-based data manipulation. It is designed to complement the existing Hugging Face MCP server by providing dataset editing and querying capabilities.

## Integration with HF MCP Server

- **Use HF MCP Server for:** dataset discovery, search, and metadata retrieval
- **Use This Skill for:** dataset creation, content editing, SQL queries, data transformation, and structured data formatting

## Version

2.1.0

## Dependencies

This skill uses PEP 723 scripts with inline dependency management. Scripts auto-install their requirements when run with `uv run scripts/script_name.py`.

- `uv` (Python package manager)
- **Getting Started:** see "Usage Instructions" below for PEP 723 usage
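For reference, a PEP 723 header is a short comment block at the top of a script. A minimal sketch (the actual dependency lists in `scripts/dataset_manager.py` and `scripts/sql_manager.py` may differ):

```python
# /// script
# requires-python = ">=3.10"
# dependencies = ["duckdb", "huggingface-hub"]
# ///
# When a script starting with the block above is invoked via `uv run`,
# uv parses the header and installs the listed packages into an
# ephemeral environment before executing the script.

SCRIPT_DEPS = ["duckdb", "huggingface-hub"]  # mirrors the header, for illustration
print(f"declared dependencies: {SCRIPT_DEPS}")
```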

## Core Capabilities

### 1. Dataset Lifecycle Management

- **Initialize:** create new dataset repositories with proper structure
- **Configure:** store detailed configuration including system prompts and metadata
- **Stream Updates:** add rows efficiently without downloading entire datasets

### 2. SQL-Based Dataset Querying (NEW)

Query any Hugging Face dataset using DuckDB SQL via `scripts/sql_manager.py`:

- **Direct Queries:** run SQL on datasets using the `hf://` protocol
- **Schema Discovery:** describe dataset structure and column types
- **Data Sampling:** get random samples for exploration
- **Aggregations:** count, histogram, and unique-value analysis
- **Transformations:** filter, join, and reshape data with SQL
- **Export & Push:** save results locally or push to new Hub repos

### 3. Multi-Format Dataset Support

Supports diverse dataset types through a template system:

- **Chat/Conversational:** chat templating, multi-turn dialogues, tool-usage examples
- **Text Classification:** sentiment analysis, intent detection, topic classification
- **Question-Answering:** reading comprehension, factual QA, knowledge bases
- **Text Completion:** language modeling, code completion, creative writing
- **Tabular Data:** structured data for regression/classification tasks
- **Custom Formats:** flexible schema definition for specialized needs

### 4. Quality Assurance Features

- **JSON Validation:** ensures data integrity during uploads
- **Batch Processing:** efficient handling of large datasets
- **Error Recovery:** graceful handling of upload failures and conflicts
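To illustrate the kind of check involved, here is a simplified sketch of JSON validation before upload (a hypothetical helper, not the actual code in `scripts/dataset_manager.py`):

```python
import json

def validate_rows(rows_json: str, required_keys: set[str]) -> list[dict]:
    """Parse a JSON array of rows and check each row has the required keys.

    A simplified sketch of the kind of check add_rows performs; the real
    script's behaviour may differ.
    """
    rows = json.loads(rows_json)  # raises ValueError on malformed JSON
    if not isinstance(rows, list):
        raise ValueError("expected a JSON array of row objects")
    for i, row in enumerate(rows):
        missing = required_keys - row.keys()
        if missing:
            raise ValueError(f"row {i} is missing keys: {sorted(missing)}")
    return rows

rows = validate_rows('[{"question": "What is AI?", "answer": "..."}]',
                     {"question", "answer"})
print(len(rows))  # 1
```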

## Usage Instructions

The skill includes two Python scripts that use PEP 723 inline dependency management. All paths are relative to the directory containing this SKILL.md file. Scripts are run with `uv run scripts/script_name.py [arguments]`:

- `scripts/dataset_manager.py` - dataset creation and management
- `scripts/sql_manager.py` - SQL-based dataset querying and transformation

## Prerequisites

- `uv` package manager installed
- `HF_TOKEN` environment variable set to a token with write access
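Before the first run, a quick pre-flight check can save a confusing failure later. An illustrative snippet (not part of the scripts):

```shell
# Illustrative pre-flight check; run it once before using the scripts.
check_env() {
  command -v uv >/dev/null 2>&1 || { echo "missing: uv" >&2; return 1; }
  [ -n "${HF_TOKEN:-}" ] || { echo "missing: HF_TOKEN" >&2; return 1; }
  echo "environment ok"
}
check_env || echo "fix the issues above, then re-run"
```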

## SQL Dataset Querying (sql_manager.py)

Query, transform, and push Hugging Face datasets using DuckDB SQL. The `hf://` protocol provides direct access to any public dataset (or private datasets with a token).

### Quick Start

```bash
# Query a dataset
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data WHERE subject='nutrition' LIMIT 10"

# Get dataset schema
uv run scripts/sql_manager.py describe --dataset "cais/mmlu"

# Sample random rows
uv run scripts/sql_manager.py sample --dataset "cais/mmlu" --n 5

# Count rows with a filter
uv run scripts/sql_manager.py count --dataset "cais/mmlu" --where "subject='nutrition'"
```

### SQL Query Syntax

Use `data` as the table name in your SQL; it gets replaced with the actual `hf://` path:

```sql
-- Basic select
SELECT * FROM data LIMIT 10

-- Filtering
SELECT * FROM data WHERE subject='nutrition'

-- Aggregations
SELECT subject, COUNT(*) as cnt FROM data GROUP BY subject ORDER BY cnt DESC

-- Column selection and transformation
SELECT question, choices[answer] AS correct_answer FROM data

-- Regex matching
SELECT * FROM data WHERE regexp_matches(question, 'nutrition|diet')

-- String functions
SELECT regexp_replace(question, '\n', '') AS cleaned FROM data
```

### Common Operations

#### 1. Explore Dataset Structure

```bash
# Get schema
uv run scripts/sql_manager.py describe --dataset "cais/mmlu"

# Get unique values in a column
uv run scripts/sql_manager.py unique --dataset "cais/mmlu" --column "subject"

# Get value distribution
uv run scripts/sql_manager.py histogram --dataset "cais/mmlu" --column "subject" --bins 20
```

#### 2. Filter and Transform

```bash
# Complex filtering with SQL
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT subject, COUNT(*) as cnt FROM data GROUP BY subject HAVING cnt > 100"

# Using the transform command
uv run scripts/sql_manager.py transform \
  --dataset "cais/mmlu" \
  --select "subject, COUNT(*) as cnt" \
  --group-by "subject" \
  --order-by "cnt DESC" \
  --limit 10
```

#### 3. Create Subsets and Push to Hub

```bash
# Query and push to a new dataset
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data WHERE subject='nutrition'" \
  --push-to "username/mmlu-nutrition-subset" \
  --private

# Transform and push
uv run scripts/sql_manager.py transform \
  --dataset "ibm/duorc" \
  --config "ParaphraseRC" \
  --select "question, answers" \
  --where "LENGTH(question) > 50" \
  --push-to "username/duorc-long-questions"
```

#### 4. Export to Local Files

```bash
# Export to Parquet
uv run scripts/sql_manager.py export \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data WHERE subject='nutrition'" \
  --output "nutrition.parquet" \
  --format parquet

# Export to JSONL
uv run scripts/sql_manager.py export \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data LIMIT 100" \
  --output "sample.jsonl" \
  --format jsonl
```

#### 5. Working with Dataset Configs/Splits

```bash
# Specify a config (subset)
uv run scripts/sql_manager.py query \
  --dataset "ibm/duorc" \
  --config "ParaphraseRC" \
  --sql "SELECT * FROM data LIMIT 5"

# Specify a split
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --split "test" \
  --sql "SELECT COUNT(*) FROM data"

# Query all splits
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --split "*" \
  --sql "SELECT * FROM data LIMIT 10"
```

#### 6. Raw SQL with Full Paths

For complex queries or joining datasets:

```bash
uv run scripts/sql_manager.py raw --sql "
  SELECT a.*, b.*
  FROM 'hf://datasets/dataset1@~parquet/default/train/*.parquet' a
  JOIN 'hf://datasets/dataset2@~parquet/default/train/*.parquet' b
  ON a.id = b.id
  LIMIT 100
"
```

### Python API Usage

```python
from sql_manager import HFDatasetSQL

sql = HFDatasetSQL()

# Query
results = sql.query("cais/mmlu", "SELECT * FROM data WHERE subject='nutrition' LIMIT 10")

# Get schema
schema = sql.describe("cais/mmlu")

# Sample
samples = sql.sample("cais/mmlu", n=5, seed=42)

# Count
count = sql.count("cais/mmlu", where="subject='nutrition'")

# Histogram
dist = sql.histogram("cais/mmlu", "subject")

# Filter and transform
results = sql.filter_and_transform(
    "cais/mmlu",
    select="subject, COUNT(*) as cnt",
    group_by="subject",
    order_by="cnt DESC",
    limit=10,
)

# Push to Hub
url = sql.push_to_hub(
    "cais/mmlu",
    "username/nutrition-subset",
    sql="SELECT * FROM data WHERE subject='nutrition'",
    private=True,
)

# Export locally
sql.export_to_parquet("cais/mmlu", "output.parquet", sql="SELECT * FROM data LIMIT 100")
sql.close()
```

### HF Path Format

DuckDB uses the `hf://` protocol to access datasets:

```
hf://datasets/{dataset_id}@{revision}/{config}/{split}/*.parquet
```

Examples:

- `hf://datasets/cais/mmlu@~parquet/default/train/*.parquet`
- `hf://datasets/ibm/duorc@~parquet/ParaphraseRC/test/*.parquet`

The `@~parquet` revision provides auto-converted Parquet files for any dataset format.
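The path template above can also be assembled programmatically. A small sketch (the helper name is illustrative, and the commented-out DuckDB call needs network access plus the `duckdb` package):

```python
def hf_parquet_path(dataset_id: str, config: str = "default",
                    split: str = "train") -> str:
    """Build the auto-converted Parquet path for a Hub dataset."""
    return f"hf://datasets/{dataset_id}@~parquet/{config}/{split}/*.parquet"

path = hf_parquet_path("cais/mmlu")
print(path)  # hf://datasets/cais/mmlu@~parquet/default/train/*.parquet

# With duckdb installed and network access:
# import duckdb
# duckdb.sql(f"SELECT COUNT(*) FROM '{path}'").show()
```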

### Useful DuckDB SQL Functions

```sql
-- String functions
LENGTH(column)                    -- String length
regexp_replace(col, '\n', '')     -- Regex replace
regexp_matches(col, 'pattern')    -- Regex match
LOWER(col), UPPER(col)            -- Case conversion

-- Array functions
choices[1]                        -- List indexing (1-based in DuckDB)
array_length(choices)             -- Array length
unnest(choices)                   -- Expand array to rows

-- Aggregations
COUNT(*), SUM(col), AVG(col)
GROUP BY col HAVING condition

-- Sampling
USING SAMPLE 10                   -- Random sample
USING SAMPLE 10 (RESERVOIR, 42)   -- Reproducible sample

-- Window functions
ROW_NUMBER() OVER (PARTITION BY col ORDER BY col2)
```

## Dataset Creation (dataset_manager.py)

### Recommended Workflow

**1. Discovery (Use HF MCP Server):**

```python
# Use HF MCP tools to find existing datasets
search_datasets("conversational AI training")
get_dataset_details("username/dataset-name")
```

**2. Creation (Use This Skill):**

```bash
# Initialize a new dataset
uv run scripts/dataset_manager.py init --repo_id "your-username/dataset-name" [--private]

# Configure with a detailed system prompt
uv run scripts/dataset_manager.py config --repo_id "your-username/dataset-name" --system_prompt "$(cat system_prompt.txt)"
```

**3. Content Management (Use This Skill):**

```bash
# Quick setup with any template
uv run scripts/dataset_manager.py quick_setup \
  --repo_id "your-username/dataset-name" \
  --template classification

# Add data with template validation
uv run scripts/dataset_manager.py add_rows \
  --repo_id "your-username/dataset-name" \
  --template qa \
  --rows_json "$(cat your_qa_data.json)"
```

### Template-Based Data Structures

**1. Chat Template (`--template chat`):**

```json
{
  "messages": [
    {"role": "user", "content": "Natural user request"},
    {"role": "assistant", "content": "Response with tool usage"},
    {"role": "tool", "content": "Tool response", "tool_call_id": "call_123"}
  ],
  "scenario": "Description of use case",
  "complexity": "simple|intermediate|advanced"
}
```

**2. Classification Template (`--template classification`):**

```json
{
  "text": "Input text to be classified",
  "label": "classification_label",
  "confidence": 0.95,
  "metadata": {"domain": "technology", "language": "en"}
}
```

**3. QA Template (`--template qa`):**

```json
{
  "question": "What is the question being asked?",
  "answer": "The complete answer",
  "context": "Additional context if needed",
  "answer_type": "factual|explanatory|opinion",
  "difficulty": "easy|medium|hard"
}
```

**4. Completion Template (`--template completion`):**

```json
{
  "prompt": "The beginning text or context",
  "completion": "The expected continuation",
  "domain": "code|creative|technical|conversational",
  "style": "description of writing style"
}
```

**5. Tabular Template (`--template tabular`):**

```json
{
  "columns": [
    {"name": "feature1", "type": "numeric", "description": "First feature"},
    {"name": "target", "type": "categorical", "description": "Target variable"}
  ],
  "data": [
    {"feature1": 123, "target": "class_a"},
    {"feature1": 456, "target": "class_b"}
  ]
}
```
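Rows matching these templates can be assembled in Python and handed to `add_rows`. A brief sketch using the QA template (the content values are placeholders):

```python
import json

# Build a rows_json payload matching the QA template above.
qa_rows = [
    {
        "question": "What protocol does DuckDB use to read Hub datasets?",
        "answer": "The hf:// protocol, over auto-converted Parquet files.",
        "context": "",
        "answer_type": "factual",
        "difficulty": "easy",
    }
]
rows_json = json.dumps(qa_rows, ensure_ascii=False)
print(rows_json[:20])
```

The resulting string can be passed directly to `--rows_json`, or written to a file and substituted with `$(cat your_qa_data.json)` as in the workflow above.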

### Advanced System Prompt Template

For high-quality training data generation:

```text
You are an AI assistant expert at using MCP tools effectively.

MCP SERVER DEFINITIONS
[Define available servers and tools]

TRAINING EXAMPLE STRUCTURE
[Specify the exact JSON schema for chat templating]

QUALITY GUIDELINES
[Detail requirements for realistic scenarios, progressive complexity, proper tool usage]

EXAMPLE CATEGORIES
[List development workflows, debugging scenarios, data management tasks]
```

### Example Categories & Templates

The skill includes diverse training examples beyond just MCP usage.

**Available Example Sets:**

- `training_examples.json` - MCP tool-usage examples (debugging, project setup, database analysis)
- `diverse_training_examples.json` - broader scenarios, including:
  - **Educational Chat** - explaining programming concepts, tutorials
  - **Git Workflows** - feature branches, version-control guidance
  - **Code Analysis** - performance optimization, architecture review
  - **Content Generation** - professional writing, creative brainstorming
  - **Codebase Navigation** - legacy-code exploration, systematic analysis
  - **Conversational Support** - problem-solving, technical discussions

**Using Different Example Sets:**

```bash
# Add MCP-focused examples
uv run scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
  --rows_json "$(cat examples/training_examples.json)"

# Add diverse conversational examples
uv run scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
  --rows_json "$(cat examples/diverse_training_examples.json)"

# Mix both for comprehensive training data
uv run scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
  --rows_json "$(jq -s '.[0] + .[1]' examples/training_examples.json examples/diverse_training_examples.json)"
```

### Commands Reference

**List Available Templates:**

```bash
uv run scripts/dataset_manager.py list_templates
```

**Quick Setup (Recommended):**

```bash
uv run scripts/dataset_manager.py quick_setup --repo_id "your-username/dataset-name" --template classification
```

**Manual Setup:**

```bash
# Initialize the repository
uv run scripts/dataset_manager.py init --repo_id "your-username/dataset-name" [--private]

# Configure with a system prompt
uv run scripts/dataset_manager.py config --repo_id "your-username/dataset-name" --system_prompt "Your prompt here"

# Add data with validation
uv run scripts/dataset_manager.py add_rows \
  --repo_id "your-username/dataset-name" \
  --template qa \
  --rows_json '[{"question": "What is AI?", "answer": "Artificial Intelligence..."}]'
```

**View Dataset Statistics:**

```bash
uv run scripts/dataset_manager.py stats --repo_id "your-username/dataset-name"
```

## Error Handling

- **Repository exists:** the script notifies you and continues with configuration
- **Invalid JSON:** clear error message with parsing details
- **Network issues:** automatic retry for transient failures
- **Token permissions:** validated before operations begin
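As a sketch of what retry-on-transient-failure can look like (a hypothetical helper; the scripts' actual retry policy may differ):

```python
import random
import time

def with_retry(fn, attempts: int = 3, base_delay: float = 0.5):
    """Call fn, retrying with exponential backoff and jitter on ConnectionError."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * 2 ** attempt + random.random() * 0.1)

# Demo with a function that fails once, then succeeds.
calls = {"n": 0}
def flaky_upload():
    calls["n"] += 1
    if calls["n"] < 2:
        raise ConnectionError("transient network failure")
    return "uploaded"

print(with_retry(flaky_upload, base_delay=0.01))  # uploaded
```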

## Combined Workflow Examples

### Example 1: Create a Training Subset from an Existing Dataset

```bash
# 1. Explore the source dataset
uv run scripts/sql_manager.py describe --dataset "cais/mmlu"
uv run scripts/sql_manager.py histogram --dataset "cais/mmlu" --column "subject"

# 2. Query and create the subset
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data WHERE subject IN ('nutrition', 'anatomy', 'clinical_knowledge')" \
  --push-to "username/mmlu-medical-subset" \
  --private
```

### Example 2: Transform and Reshape Data

```bash
# Transform MMLU to QA format with the correct answers extracted
uv run scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT question, choices[answer] as correct_answer, subject FROM data" \
  --push-to "username/mmlu-qa-format"
```

### Example 3: Merge Multiple Dataset Splits

```bash
# Export multiple splits and combine them
uv run scripts/sql_manager.py export \
  --dataset "cais/mmlu" \
  --split "*" \
  --output "mmlu_all.parquet"
```

### Example 4: Quality Filtering

```bash
# Filter for high-quality examples
uv run scripts/sql_manager.py query \
  --dataset "squad" \
  --sql "SELECT * FROM data WHERE LENGTH(context) > 500 AND LENGTH(question) > 20" \
  --push-to "username/squad-filtered"
```

### Example 5: Create a Custom Training Dataset

```bash
# 1. Query the source data
uv run scripts/sql_manager.py export \
  --dataset "cais/mmlu" \
  --sql "SELECT question, subject FROM data WHERE subject='nutrition'" \
  --output "nutrition_source.jsonl" \
  --format jsonl

# 2. Process with your pipeline (add answers, format, etc.)

# 3. Push the processed data
uv run scripts/dataset_manager.py init --repo_id "username/nutrition-training"
uv run scripts/dataset_manager.py add_rows \
  --repo_id "username/nutrition-training" \
  --template qa \
  --rows_json "$(cat processed_data.json)"
```
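Step 2 of this workflow is deliberately open-ended. One possible sketch of turning the exported questions into QA-template rows (the file names and the answer source are placeholders for whatever your pipeline uses):

```python
import json

def build_qa_rows(jsonl_path: str, answer_fn) -> list[dict]:
    """Read exported question records and emit QA-template rows.

    answer_fn stands in for whatever produces answers in your pipeline
    (a model call, a lookup table, manual annotation, and so on).
    """
    rows = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            rows.append({
                "question": rec["question"],
                "answer": answer_fn(rec["question"]),
                "answer_type": "factual",
                "difficulty": "medium",
            })
    return rows
```

The returned list can then be serialized with `json.dumps` and saved as `processed_data.json` for step 3.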