fine-tuning-data-generator
Fine-Tuning Data Generator
This skill generates high-quality synthetic training data in ChatML format for fine-tuning language models using frameworks like Unsloth, Axolotl, or similar tools.
What Do I Need?
| Need | Resource |
|---|---|
| Planning my dataset - requirements, strategy, quality checklist | `resources/dataset-strategy.md` |
| How to create diverse examples - variation techniques, multi-turn patterns, format-specific guidance | `resources/generation-techniques.md` |
| ChatML format details - structure, specification, common issues, framework compatibility | `resources/chatml-format.md` |
| Example datasets - inspiration across domains, multi-turn samples, edge cases | `resources/examples.md` |
| Validating quality - validation workflow, analyzing datasets, troubleshooting | `resources/quality-validation.md` |
| Training & deployment - framework setup, hyperparameters, optimization, deployment | `resources/framework-integration.md` |
Workflow
Phase 1: Gather Requirements
Start with these essential clarifying questions:
Task Definition:
- What is the model being trained to do? (e.g., customer support, code generation, creative writing)
- What specific domain or subject matter? (e.g., legal, medical, e-commerce, software development)
- How many training examples are needed? (Recommend: 100+ for simple tasks, 500-1000+ for complex)
Quality & Diversity:
- Complexity range: simple to complex mix, or focus on specific difficulty level?
- Diversity: edge cases, error handling, unusual scenarios?
- Tone/style: professional, friendly, technical, concise, detailed?
- Response length preferences?
- Any specific formats: code blocks, lists, tables, JSON?
Dataset Composition:
- Distribution across subtopics: evenly distributed or weighted?
- Include negative examples (what NOT to do)?
- Need validation split? (Recommend 10-20% of total)
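The recommended 10-20% validation split can be carved off mechanically. A minimal sketch (function names are illustrative, not part of the bundled scripts):

```python
import json
import random

def split_examples(examples, val_fraction=0.2, seed=42):
    """Shuffle and carve off a validation split; returns (train, val)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]

def write_jsonl(path, examples):
    """Write one JSON object per line (the JSONL convention)."""
    with open(path, "w", encoding="utf-8") as f:
        for ex in examples:
            f.write(json.dumps(ex, ensure_ascii=False) + "\n")

# Example: split an in-memory dataset and persist both parts
# train, val = split_examples(dataset, val_fraction=0.2)
# write_jsonl("training_data.jsonl", train)
# write_jsonl("validation_data.jsonl", val)
```

Shuffling before splitting matters: generated datasets are usually ordered by category, and a tail-end split would otherwise hold out entire categories.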
See `resources/dataset-strategy.md` for detailed question templates.
Phase 2: Create Generation Plan
Present a plan covering:
- Number and distribution of examples across categories
- Key topics/scenarios to cover
- Diversity strategies (phrasing variations, complexity levels, edge cases)
- System prompt approach (consistent vs. varied)
- Quality assurance approach
Get user approval before generating.
Phase 3: Generate Synthetic Data
Create examples following these quality standards:
Key Principles:
- Realistic scenarios reflecting real-world use cases
- Natural language with varied phrasing and formality levels
- Accurate, helpful responses aligned with desired behavior
- Consistent ChatML formatting throughout
- Balanced difficulty (unless specified)
- Meaningful variety (no repetition)
- Include edge cases and error scenarios
Diversity Techniques:
- Vary query phrasing (questions, commands, statements)
- Include different expertise levels (beginner, intermediate, expert)
- Cover both positive and negative examples
- Mix short and long-form responses
- Include multi-step reasoning when appropriate
- Add context variations
See `resources/generation-techniques.md` for detailed techniques, domain-specific guidance, and batch generation workflow.
Phase 4: Validate & Document
Run validation tools and checks:

```bash
# Validate JSON formatting and structure
python scripts/validate_chatml.py training_data.jsonl

# Analyze dataset statistics and diversity
python scripts/analyze_dataset.py training_data.jsonl

# Export statistics
python scripts/analyze_dataset.py training_data.jsonl --export stats.json
```
**Quality Checklist:**
- [ ] JSON validation passed (no errors)
- [ ] Analysis shows good diversity metrics
- [ ] Manual sample review passed
- [ ] No duplicate or near-duplicate examples
- [ ] All required fields present
- [ ] Realistic user queries
- [ ] Accurate, helpful responses
- [ ] Balanced category distribution
- [ ] Dataset metadata documented
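The "no near-duplicate examples" item can be spot-checked with stdlib `difflib`. This is an illustrative heuristic, not necessarily what `validate_chatml.py` implements:

```python
import difflib

def near_duplicate_pairs(queries, threshold=0.9):
    """Return (i, j, ratio) for query pairs above the similarity threshold."""
    pairs = []
    for i in range(len(queries)):
        for j in range(i + 1, len(queries)):
            # ratio() is 1.0 for identical strings, 0.0 for fully dissimilar
            ratio = difflib.SequenceMatcher(None, queries[i], queries[j]).ratio()
            if ratio >= threshold:
                pairs.append((i, j, round(ratio, 2)))
    return pairs

queries = [
    "How do I reverse a string in Python?",
    "How do I reverse a string in Python??",  # near-duplicate
    "Explain binary search trees.",
]
print(near_duplicate_pairs(queries))  # flags the first two queries
```

The pairwise comparison is O(n²), so for datasets in the thousands prefer hashing or embedding-based deduplication.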
See [`resources/quality-validation.md`](resources/quality-validation.md) for validation details, troubleshooting, and documentation templates.
Phase 5: Integration & Training
Prepare for training with your framework of choice:
Output Files:
- `training_data.jsonl` - Main training set
- `validation_data.jsonl` - Optional validation set
- `dataset_info.txt` - Metadata and statistics

Framework Setup:
- Unsloth: Automatic ChatML detection, efficient 4-bit training
- Axolotl: Specify `type: chat_template` and `chat_template: chatml`
- Hugging Face: Use the tokenizer's `apply_chat_template()` method
- Custom: Load from JSONL, handle ChatML formatting

See `resources/framework-integration.md` for setup code, hyperparameters, deployment options, and best practices.
ChatML Format Overview
Each training example is a JSON object with a `messages` array:

```json
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "How do I reverse a string in Python?"}, {"role": "assistant", "content": "Use slicing: `text[::-1]`"}]}
```

Roles:
- `system`: Sets assistant behavior (optional but recommended)
- `user`: The user's input/query
- `assistant`: The model's expected response

Multi-turn: Add additional user/assistant message pairs for conversations.
See `resources/chatml-format.md` for the detailed specification, validation, common issues, and framework-specific notes.
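The same `messages` structure can be emitted programmatically. This sketch serializes one example to a JSONL line and also shows the raw `<|im_start|>` token form that ChatML ultimately renders to (roughly what a tokenizer's `apply_chat_template()` produces for a ChatML template):

```python
import json

def to_chatml_tokens(messages):
    """Render a messages array in raw ChatML token form."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

example = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "How do I reverse a string in Python?"},
        {"role": "assistant", "content": "Use slicing: `text[::-1]`"},
    ]
}

jsonl_line = json.dumps(example, ensure_ascii=False)  # one example = one line
print(to_chatml_tokens(example["messages"]))
```

In practice you store the JSON form and let the training framework apply the token template, so the dataset stays framework-agnostic.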
Tool Reference
Scripts in `scripts/`
validate_chatml.py
Validates ChatML format JSONL files:
```bash
python scripts/validate_chatml.py training_data.jsonl
python scripts/validate_chatml.py training_data.jsonl --verbose
```

Checks:
- Valid JSON formatting
- Required fields (messages, role, content)
- Valid role values (system, user, assistant)
- Proper message order
- Duplicate detection
- Diversity metrics
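The structural checks above could be sketched as follows (an illustrative approximation; `validate_chatml.py` itself may implement them differently):

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def check_example(line):
    """Return a list of problems for one JSONL line (empty list = valid)."""
    errors = []
    try:
        obj = json.loads(line)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    messages = obj.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' array"]
    for m in messages:
        if m.get("role") not in VALID_ROLES:
            errors.append(f"invalid role: {m.get('role')!r}")
        if not m.get("content"):
            errors.append("missing 'content'")
    roles = [m.get("role") for m in messages]
    core = [r for r in roles if r != "system"]
    # After an optional system prompt, messages should start with 'user'
    # and alternate user/assistant
    if core[:1] != ["user"] or any(a == b for a, b in zip(core, core[1:])):
        errors.append(f"unexpected message order: {roles}")
    return errors
```

Collecting errors per line (rather than raising on the first one) mirrors how a validator can report every problem in one pass.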
analyze_dataset.py
Provides comprehensive statistics and analysis:
```bash
python scripts/analyze_dataset.py training_data.jsonl
python scripts/analyze_dataset.py training_data.jsonl --export stats.json
```

Provides:
- Dataset overview (total examples, message counts)
- Message length statistics
- System prompt variations
- User query patterns (questions, commands, code-related, length categories)
- Assistant response patterns (code blocks, lists, headers, length categories)
- Quality indicators (diversity score, balance ratio)
- Token estimates and cost projection
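A minimal version of a few of these metrics, to make the terms concrete: diversity as the unique-query ratio, tokens via the common ~4-characters-per-token heuristic. The real script likely computes much more:

```python
def quick_stats(examples):
    """Rough metrics over parsed examples (each a {'messages': [...]} dict)."""
    queries, total_chars = [], 0
    for ex in examples:
        for m in ex["messages"]:
            total_chars += len(m["content"])
            if m["role"] == "user":
                queries.append(m["content"].strip().lower())
    return {
        "examples": len(examples),
        # 1.0 means every user query is unique; low values signal repetition
        "diversity_score": len(set(queries)) / max(len(queries), 1),
        "est_tokens": total_chars // 4,  # rough heuristic: ~4 chars per token
    }
```

The 4-chars-per-token figure is only a ballpark for English text; for accurate counts, run the actual tokenizer of the target model.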
Common Workflows
Small Dataset (100-200 examples)
- Gather requirements
- Create generation plan for 1-2 categories
- Generate in single batch, review quality
- Validate and document
- Ready for training
Medium Dataset (500-1000 examples)
- Gather requirements
- Create detailed plan with multiple categories
- Generate in 2-3 batches, reviewing after each
- Analyze diversity and adjust approach
- Fill any gaps
- Final validation and documentation
Large Dataset (2000+ examples)
- Gather comprehensive requirements
- Create multi-batch generation plan
- Batch 1 (50-100): Foundation examples
- Batch 2 (100-200): Complexity expansion
- Batch 3 (100-200): Coverage filling
- Batch 4 (50-100): Polish and validation
- Run full validation suite
- Generate comprehensive documentation
Best Practices
Start Small, Iterate
- Generate 10-20 examples first
- Review and get feedback
- Refine approach based on feedback
- Scale up to full dataset
Quality Over Quantity
- Better to have 500 diverse, high-quality examples than 5,000 repetitive ones
- Each example should teach something new
- Maintain consistent response quality throughout
Diversify Systematically
- Vary query phrasing (questions, commands, statements)
- Cover different expertise levels
- Mix response complexities
- Include edge cases (typically 20-30% of dataset)
- Use batch generation workflow for large datasets
Test Before Deployment
- Test dataset with actual training framework
- Monitor training metrics for issues
- Test fine-tuned model outputs before deployment
- Compare results to base model
Document Everything
- Keep notes on generation parameters
- Save different dataset versions
- Document any modifications made
- Record generation strategies used
- Track model performance metrics
Advanced Features
Batch Generation Strategy
For datasets 500+ examples:
- Generate 50-100 examples at a time
- Review distribution and diversity after each batch
- Adjust generation strategy based on identified gaps
- Prevents repetition and maintains creativity
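Reviewing distribution after each batch can be as simple as tagging each generated example with a category and counting. The `category` field here is a bookkeeping assumption added during generation, not part of ChatML:

```python
from collections import Counter

def category_distribution(examples):
    """Count examples per category tag.

    Assumes each example dict carries a 'category' bookkeeping field
    added during generation; it is not part of the ChatML format.
    """
    return Counter(ex.get("category", "uncategorized") for ex in examples)

# After each batch, compare the counts against the target distribution
# from the generation plan to spot gaps before producing the next batch.
```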
Common Pitfalls to Avoid
- Over-templating: Creates repetitive patterns (vary naturally)
- Unrealistic Queries: Overly formal/robotic user inputs (use varied phrasing)
- Narrow Coverage: Limited scenarios and phrasing (ensure diversity)
- Inconsistent Quality: Quality degradation over time (use quality checklist)
- JSON Errors: Invalid formatting breaking training (always validate)
- Missing Context: System prompts without detail (provide clear instructions)
- Response Mismatch: Responses don't address queries (verify relevance)
Dataset Size Recommendations
| Task Complexity | Recommended Size | Notes |
|---|---|---|
| Simple tasks | 100-500 | Well-defined, limited variation |
| Medium tasks | 500-2,000 | Multiple scenarios, moderate complexity |
| Complex tasks | 2,000-10,000+ | Many edge cases, high variability |
| Domain adaptation | 1,000-5,000 | Specialized knowledge required |
Resources
- Planning & Strategy: `resources/dataset-strategy.md` - Requirements gathering, planning, quality checklists
- Generation Techniques: `resources/generation-techniques.md` - Diversity techniques, domain-specific guidance, batch workflows
- ChatML Specification: `resources/chatml-format.md` - Format details, validation, framework notes
- Example Datasets: `resources/examples.md` - Diverse domain examples, multi-turn patterns
- Quality Validation: `resources/quality-validation.md` - Validation workflow, analysis, troubleshooting
- Framework Integration: `resources/framework-integration.md` - Setup for Unsloth, Axolotl, HuggingFace; deployment options
Version: 2.0 | Updated: 2024 | Pattern: Modular Orchestration