fine-tuning-data-generator

Fine-Tuning Data Generator

This skill generates high-quality synthetic training data in ChatML format for fine-tuning language models using frameworks like Unsloth, Axolotl, or similar tools.

What Do I Need?

| Need | Resource |
| --- | --- |
| Planning my dataset - requirements, strategy, quality checklist | resources/dataset-strategy.md |
| How to create diverse examples - variation techniques, multi-turn patterns, format-specific guidance | resources/generation-techniques.md |
| ChatML format details - structure, specification, common issues, framework compatibility | resources/chatml-format.md |
| Example datasets - inspiration across domains, multi-turn samples, edge cases | resources/examples.md |
| Validating quality - validation workflow, analyzing datasets, troubleshooting | resources/quality-validation.md |
| Training & deployment - framework setup, hyperparameters, optimization, deployment | resources/framework-integration.md |

Workflow

Phase 1: Gather Requirements

Start with these essential clarifying questions:
Task Definition:
  • What is the model being trained to do? (e.g., customer support, code generation, creative writing)
  • What specific domain or subject matter? (e.g., legal, medical, e-commerce, software development)
  • How many training examples are needed? (Recommend: 100+ for simple tasks, 500-1000+ for complex)
Quality & Diversity:
  • Complexity range: simple to complex mix, or focus on specific difficulty level?
  • Diversity: edge cases, error handling, unusual scenarios?
  • Tone/style: professional, friendly, technical, concise, detailed?
  • Response length preferences?
  • Any specific formats: code blocks, lists, tables, JSON?
Dataset Composition:
  • Distribution across subtopics: evenly distributed or weighted?
  • Include negative examples (what NOT to do)?
  • Need validation split? (Recommend 10-20% of total)
See resources/dataset-strategy.md for detailed question templates.
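The validation-split recommendation above can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill's scripts: the function name `split_dataset`, the 15% default, and the fixed seed are all assumptions chosen for the example.

```python
import random


def split_dataset(examples, validation_fraction=0.15, seed=42):
    """Shuffle examples and split them into (train, validation) lists.

    validation_fraction is an assumed default inside the recommended
    10-20% range; seed makes the shuffle reproducible.
    """
    if not 0.0 <= validation_fraction < 1.0:
        raise ValueError("validation_fraction must be in [0, 1)")
    shuffled = list(examples)  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Hypothetical placeholder records standing in for real ChatML examples.
examples = [{"id": i} for i in range(200)]
train, validation = split_dataset(examples)
print(len(train), len(validation))  # 170 30
```

Shuffling before splitting keeps the validation set from inheriting whatever order the examples were generated in, which matters when examples are produced category by category.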

Phase 2: Create Generation Plan

Present a plan covering:
  • Number and distribution of examples across categories
  • Key topics/scenarios to cover
  • Diversity strategies (phrasing variations, complexity levels, edge cases)
  • System prompt approach (consistent vs. varied)
  • Quality assurance approach
Get user approval before generating.
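One way to make the "number and distribution of examples across categories" concrete is to turn category weights into exact counts. The sketch below is an assumption about how a plan might be represented; `allocate_examples`, the category names, and the weights are hypothetical.

```python
def allocate_examples(total, weights):
    """Split `total` examples across categories in proportion to `weights`.

    Uses largest-remainder rounding so the counts always sum to `total`.
    """
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive number")
    exact = {name: total * w / weight_sum for name, w in weights.items()}
    counts = {name: int(value) for name, value in exact.items()}
    leftover = total - sum(counts.values())
    # Hand the remaining examples to the largest fractional parts.
    by_remainder = sorted(
        exact, key=lambda name: exact[name] - counts[name], reverse=True
    )
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Hypothetical customer-support plan; category names and weights are made up.
plan = allocate_examples(
    500, {"billing": 3, "shipping": 2, "returns": 2, "edge_cases": 1}
)
print(plan)
```

A table of counts like this is easy to show to the user for approval and to compare against the final dataset afterwards.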

Phase 3: Generate Synthetic Data

Create examples following these quality standards:
Key Principles:
  • Realistic scenarios reflecting real-world use cases
  • Natural language with varied phrasing and formality levels
  • Accurate, helpful responses aligned with desired behavior
  • Consistent ChatML formatting throughout
  • Balanced difficulty (unless specified)
  • Meaningful variety (no repetition)
  • Include edge cases and error scenarios
Diversity Techniques:
  • Vary query phrasing (questions, commands, statements)
  • Include different expertise levels (beginner, intermediate, expert)
  • Cover both positive and negative examples
  • Mix short and long-form responses
  • Include multi-step reasoning when appropriate
  • Add context variations
See resources/generation-techniques.md for detailed techniques, domain-specific guidance, and batch generation workflow.
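The diversity techniques above can be applied systematically by enumerating combinations of phrasing, expertise level, and response length, then generating one example per combination. The following is a rough sketch under that assumption; `build_generation_specs` and the example topics are hypothetical, and the resource file may describe a different workflow.

```python
import itertools
import random

PHRASINGS = ["question", "command", "statement"]
EXPERTISE_LEVELS = ["beginner", "intermediate", "expert"]
RESPONSE_LENGTHS = ["short", "long"]


def build_generation_specs(topics, n_examples, seed=0):
    """Return n_examples spec dicts that cycle evenly through every combination."""
    combos = list(
        itertools.product(topics, PHRASINGS, EXPERTISE_LEVELS, RESPONSE_LENGTHS)
    )
    random.Random(seed).shuffle(combos)  # avoid long runs of the same topic
    keys = ("topic", "phrasing", "expertise", "response_length")
    return [dict(zip(keys, combos[i % len(combos)])) for i in range(n_examples)]


# Two hypothetical topics give 2 * 3 * 3 * 2 = 36 distinct combinations.
specs = build_generation_specs(["password reset", "refund request"], n_examples=36)
print(specs[0])
```

Each spec can then be turned into a generation prompt, which keeps coverage even instead of relying on ad hoc variation.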

Phase 4: Validate & Document

Run validation tools and checks:
bash
# Validate JSON formatting and structure
python scripts/validate_chatml.py training_data.jsonl

# Analyze dataset statistics and diversity
python scripts/analyze_dataset.py training_data.jsonl

# Export statistics
python scripts/analyze_dataset.py training_data.jsonl --export stats.json

Quality Checklist:
  • JSON validation passed (no errors)
  • Analysis shows good diversity metrics
  • Manual sample review passed
  • No duplicate or near-duplicate examples
  • All required fields present
  • Realistic user queries
  • Accurate, helpful responses
  • Balanced category distribution
  • Dataset metadata documented
See resources/quality-validation.md for validation details, troubleshooting, and documentation templates.
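For the "no duplicate or near-duplicate examples" item, a quick manual spot check might look like the sketch below. It is not the logic of the bundled scripts, which is not shown here; `find_near_duplicates`, the 0.9 threshold, and the sample queries are assumptions for illustration.

```python
from difflib import SequenceMatcher


def first_user_message(example):
    """Return the first user message, lowercased and whitespace-normalized."""
    for message in example["messages"]:
        if message["role"] == "user":
            return " ".join(message["content"].lower().split())
    return ""


def find_near_duplicates(examples, threshold=0.9):
    """Return (i, j, ratio) for pairs whose first user queries are nearly identical.

    Pairwise comparison is O(n^2), so this suits spot checks on small batches.
    """
    queries = [first_user_message(example) for example in examples]
    pairs = []
    for i in range(len(queries)):
        for j in range(i + 1, len(queries)):
            ratio = SequenceMatcher(None, queries[i], queries[j]).ratio()
            if ratio >= threshold:
                pairs.append((i, j, round(ratio, 3)))
    return pairs


# Made-up sample data: the first two queries differ only in spacing and punctuation.
samples = [
    {"messages": [{"role": "user", "content": "How do I reset my password?"}]},
    {"messages": [{"role": "user", "content": "How do I reset my  password"}]},
    {"messages": [{"role": "user", "content": "What is your refund policy?"}]},
]
print(find_near_duplicates(samples))
```

The threshold is a judgment call: lower values catch more paraphrases but also flag legitimately similar queries.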

Phase 5: Integration & Training

Prepare for training with your framework of choice:
Output Files:
  • training_data.jsonl - Main training set
  • validation_data.jsonl - Optional validation set
  • dataset_info.txt - Metadata and statistics
Framework Setup:
  • Unsloth: Automatic ChatML detection, efficient 4-bit training
  • Axolotl: Specify type: chat_template and chat_template: chatml
  • Hugging Face: Use the tokenizer's apply_chat_template() method
  • Custom: Load from JSONL, handle ChatML formatting
See resources/framework-integration.md for setup code, hyperparameters, deployment options, and best practices.
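For the "Custom" path, loading JSONL and rendering ChatML text by hand could look roughly like this. The `<|im_start|>` and `<|im_end|>` delimiters are the commonly used ChatML markers, but a given model's template may differ in details, so treat this as a sketch and prefer the tokenizer's own chat template where one is available. The helper names are hypothetical.

```python
import json


def render_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts as ChatML text."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn for the model to complete at inference time.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


def load_jsonl(path):
    """Read one JSON object per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


example = {
    "messages": [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello!"},
    ]
}
print(render_chatml(example["messages"]))
```

In a training loop, `load_jsonl("training_data.jsonl")` would supply the records and `render_chatml` the text passed to the tokenizer.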

ChatML Format Overview

Each training example is a JSON object with a messages array:
json
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "How do I reverse a string in Python?"}, {"role": "assistant", "content": "Use slicing: `text[::-1]`"}]}
Roles:
  • system: Sets assistant behavior (optional but recommended)
  • user: User's input/query
  • assistant: Model's expected response
Multi-turn: Add additional user/assistant message pairs for conversations.
See resources/chatml-format.md for detailed specification, validation, common issues, and framework-specific notes.
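As an illustration of the multi-turn case, the sketch below extends the single-turn example with one more user/assistant pair and serializes it as a single JSONL line. The follow-up turn is invented for the example.

```python
import json

record = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "How do I reverse a string in Python?"},
        {"role": "assistant", "content": "Use slicing: `text[::-1]`"},
        # Invented follow-up turn, added only to show the multi-turn shape.
        {"role": "user", "content": "And a list?"},
        {"role": "assistant", "content": "The same slice works: `items[::-1]`"},
    ]
}

# JSONL means one compact JSON object per line, with no embedded newlines.
line = json.dumps(record, ensure_ascii=False)
print(line)
```

Using `json.dumps` rather than hand-writing the line avoids the most common formatting errors, such as unescaped quotes or newlines inside `content`.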

Tool Reference

Scripts in scripts/

validate_chatml.py

Validates ChatML format JSONL files:
bash
python scripts/validate_chatml.py training_data.jsonl
python scripts/validate_chatml.py training_data.jsonl --verbose
Checks:
  • Valid JSON formatting
  • Required fields (messages, role, content)
  • Valid role values (system, user, assistant)
  • Proper message order
  • Duplicate detection
  • Diversity metrics
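The script's source is not reproduced here, so the following is only a rough sketch of what the JSON, required-field, role, and ordering checks might involve. `validate_line` is a hypothetical helper, and the ordering rules it enforces (system first, alternating user/assistant turns, ending on assistant) are assumptions rather than the script's documented behavior. Duplicate detection and diversity metrics are left out.

```python
import json

VALID_ROLES = {"system", "user", "assistant"}


def validate_line(line):
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as error:
        return [f"invalid JSON: {error.msg}"]
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' array"]
    problems = []
    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {index}: not an object")
            continue
        if message.get("role") not in VALID_ROLES:
            problems.append(f"message {index}: invalid role {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            problems.append(f"message {index}: missing 'content'")
    roles = [m.get("role") for m in messages if isinstance(m, dict)]
    if "system" in roles[1:]:
        problems.append("system message must come first")
    turns = [r for r in roles if r in ("user", "assistant")]
    # Assumed ordering rule: turns alternate user, assistant, user, ...
    expected = ["user", "assistant"] * len(turns)
    if turns != expected[: len(turns)]:
        problems.append("user/assistant turns do not alternate starting with user")
    if turns and turns[-1] != "assistant":
        problems.append("conversation should end with an assistant message")
    return problems


good = '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}]}'
print(validate_line(good))  # []
```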

analyze_dataset.py

Provides comprehensive statistics and analysis:
bash
python scripts/analyze_dataset.py training_data.jsonl
python scripts/analyze_dataset.py training_data.jsonl --export stats.json
Provides:
  • Dataset overview (total examples, message counts)
  • Message length statistics
  • System prompt variations
  • User query patterns (questions, commands, code-related, length categories)
  • Assistant response patterns (code blocks, lists, headers, length categories)
  • Quality indicators (diversity score, balance ratio)
  • Token estimates and cost projection
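To give a feel for these statistics, here is a small sketch that computes a handful of them. It does not reproduce the script's output or formulas; `summarize`, the returned keys, and the four-characters-per-token heuristic are all assumptions.

```python
from collections import Counter
from statistics import mean


def summarize(examples, chars_per_token=4):
    """Compute a few overview numbers for a list of ChatML examples.

    chars_per_token=4 is a rough rule of thumb for English text, not a
    tokenizer-accurate count.
    """
    messages = [m for example in examples for m in example["messages"]]
    lengths_by_role = {}
    for message in messages:
        lengths_by_role.setdefault(message["role"], []).append(len(message["content"]))
    system_prompts = Counter(m["content"] for m in messages if m["role"] == "system")
    total_chars = sum(len(m["content"]) for m in messages)
    return {
        "total_examples": len(examples),
        "total_messages": len(messages),
        "mean_chars_by_role": {role: mean(v) for role, v in lengths_by_role.items()},
        "distinct_system_prompts": len(system_prompts),
        "estimated_tokens": total_chars // chars_per_token,
    }


# Made-up sample data for illustration.
dataset = [
    {"messages": [
        {"role": "system", "content": "Be brief."},
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello!"},
    ]},
    {"messages": [
        {"role": "system", "content": "Be brief."},
        {"role": "user", "content": "Thanks"},
        {"role": "assistant", "content": "Welcome"},
    ]},
]
summary = summarize(dataset)
print(summary)
```

For real cost projections, counting tokens with the target model's tokenizer will be more accurate than a character-based estimate.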

Common Workflows

Small Dataset (100-200 examples)

  1. Gather requirements
  2. Create generation plan for 1-2 categories
  3. Generate in single batch, review quality
  4. Validate and document
  5. Ready for training

Medium Dataset (500-1000 examples)

  1. Gather requirements
  2. Create detailed plan with multiple categories
  3. Generate in 2-3 batches, reviewing after each
  4. Analyze diversity and adjust approach
  5. Fill any gaps
  6. Final validation and documentation

Large Dataset (2000+ examples)

  1. Gather comprehensive requirements
  2. Create multi-batch generation plan
  3. Batch 1 (50-100): Foundation examples
  4. Batch 2 (100-200): Complexity expansion
  5. Batch 3 (100-200): Coverage filling
  6. Batch 4 (50-100): Polish and validation
  7. Run full validation suite
  8. Generate comprehensive documentation

Best Practices

Start Small, Iterate

  1. Generate 10-20 examples first
  2. Review and get feedback
  3. Refine approach based on feedback
  4. Scale up to full dataset

Quality Over Quantity

  • Better to have 500 diverse, high-quality examples than 5,000 repetitive ones
  • Each example should teach something new
  • Maintain consistent response quality throughout

Diversify Systematically

  • Vary query phrasing (questions, commands, statements)
  • Cover different expertise levels
  • Mix response complexities
  • Include edge cases (typically 20-30% of dataset)
  • Use batch generation workflow for large datasets

Test Before Deployment

  • Test dataset with actual training framework
  • Monitor training metrics for issues
  • Test fine-tuned model outputs before deployment
  • Compare results to base model

Document Everything

  • Keep notes on generation parameters
  • Save different dataset versions
  • Document any modifications made
  • Record generation strategies used
  • Track model performance metrics

Advanced Features

Batch Generation Strategy

For datasets of 500+ examples:
  • Generate 50-100 examples at a time
  • Review distribution and diversity after each batch
  • Adjust generation strategy based on identified gaps
  • Prevents repetition and maintains creativity
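The batch strategy above amounts to a generate-review-adjust loop. The sketch below shows one possible shape for it; `plan_batches`, `generate_in_batches`, and the two callables are hypothetical, and the default batch size of 75 is simply an assumed midpoint of the 50-100 range.

```python
def plan_batches(total_examples, batch_size=75):
    """Return batch sizes that sum to total_examples.

    The final batch holds the remainder and may be smaller than the others.
    """
    if total_examples < 0 or batch_size <= 0:
        raise ValueError("total_examples must be >= 0 and batch_size > 0")
    full_batches, remainder = divmod(total_examples, batch_size)
    sizes = [batch_size] * full_batches
    if remainder:
        sizes.append(remainder)
    return sizes


def generate_in_batches(total_examples, generate_batch, review_batch, batch_size=75):
    """Generate batch by batch, passing review feedback into the next batch.

    generate_batch(size, feedback) and review_batch(all_examples) are
    hypothetical callables supplied by the caller.
    """
    examples, feedback = [], None
    for size in plan_batches(total_examples, batch_size):
        examples.extend(generate_batch(size, feedback))
        feedback = review_batch(examples)  # e.g. under-covered categories
    return examples


print(plan_batches(500))  # [75, 75, 75, 75, 75, 75, 50]
```

Feeding the review result back into the next call is what lets later batches target the gaps found in earlier ones.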

Common Pitfalls to Avoid

  • Over-templating: Creates repetitive patterns (vary naturally)
  • Unrealistic Queries: Overly formal/robotic user inputs (use varied phrasing)
  • Narrow Coverage: Limited scenarios and phrasing (ensure diversity)
  • Inconsistent Quality: Quality degradation over time (use quality checklist)
  • JSON Errors: Invalid formatting breaking training (always validate)
  • Missing Context: System prompts without detail (provide clear instructions)
  • Response Mismatch: Responses don't address queries (verify relevance)

Dataset Size Recommendations

| Task Complexity | Recommended Size | Notes |
| --- | --- | --- |
| Simple tasks | 100-500 | Well-defined, limited variation |
| Medium tasks | 500-2,000 | Multiple scenarios, moderate complexity |
| Complex tasks | 2,000-10,000+ | Many edge cases, high variability |
| Domain adaptation | 1,000-5,000 | Specialized knowledge required |

Resources

  • Planning & Strategy: resources/dataset-strategy.md - Requirements gathering, planning, quality checklists
  • Generation Techniques: resources/generation-techniques.md - Diversity techniques, domain-specific guidance, batch workflows
  • ChatML Specification: resources/chatml-format.md - Format details, validation, framework notes
  • Example Datasets: resources/examples.md - Diverse domain examples, multi-turn patterns
  • Quality Validation: resources/quality-validation.md - Validation workflow, analysis, troubleshooting
  • Framework Integration: resources/framework-integration.md - Setup for Unsloth, Axolotl, HuggingFace; deployment options

Version: 2.0 | Updated: 2024 | Pattern: Modular Orchestration