# nanoGPT - Minimalist GPT Training
## Quick start
nanoGPT is a simplified GPT implementation designed for learning and experimentation.
Installation:

```bash
pip install torch numpy transformers datasets tiktoken wandb tqdm
```

Train on Shakespeare (CPU-friendly):

```bash
# Prepare data
python data/shakespeare_char/prepare.py

# Train (5 minutes on CPU)
python train.py config/train_shakespeare_char.py

# Generate text
python sample.py --out_dir=out-shakespeare-char
```

**Output**:

```
ROMEO:
What say'st thou? Shall I speak, and be a man?

JULIET:
I am afeard, and yet I'll speak; for thou art
One that hath been a man, and yet I know not
What thou art.
```
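Under the hood, `sample.py` simply loads the checkpoint that `train.py` wrote and calls the model's `generate` method. A minimal sketch of that flow, assuming nanoGPT's current checkpoint layout (`out_dir/ckpt.pt` containing `model_args` and a `model` state dict; see `sample.py` for details such as the compiled-model key prefix it strips):

```python
# Minimal sketch of what sample.py does; assumes nanoGPT's checkpoint
# layout (ckpt.pt with 'model_args' and 'model' entries) -- see sample.py.
import torch
from model import GPT, GPTConfig  # nanoGPT's model.py

ckpt = torch.load('out-shakespeare-char/ckpt.pt', map_location='cpu')
model = GPT(GPTConfig(**ckpt['model_args']))
model.load_state_dict(ckpt['model'])
model.eval()

idx = torch.zeros((1, 1), dtype=torch.long)  # start from token id 0 (typically '\n' for this dataset)
with torch.no_grad():
    out = model.generate(idx, max_new_tokens=200, temperature=0.8, top_k=200)
# Token ids decode back to text via the itos map saved in data/.../meta.pkl
```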
## Common workflows
### Workflow 1: Character-level Shakespeare
Complete training pipeline:
```bash
# Step 1: Prepare data (creates train.bin, val.bin)
python data/shakespeare_char/prepare.py

# Step 2: Train small model
python train.py config/train_shakespeare_char.py

# Step 3: Generate text
python sample.py --out_dir=out-shakespeare-char
```
**Config** (`config/train_shakespeare_char.py`):
```python
# Model config
n_layer = 6       # 6 transformer layers
n_head = 6        # 6 attention heads
n_embd = 384      # 384-dim embeddings
block_size = 256  # 256-char context

# Training config
batch_size = 64
learning_rate = 1e-3
max_iters = 5000
eval_interval = 500

# Hardware
device = 'cpu'   # Or 'cuda'
compile = False  # Set True for PyTorch 2.0
```

**Training time**: ~5 minutes (CPU), ~1 minute (GPU)
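For a sense of scale, every optimizer step consumes `batch_size × block_size` characters; back-of-envelope arithmetic (mine, not from the config file):

```python
# Rough training volume implied by the config above (illustrative arithmetic)
tokens_per_iter = 64 * 256              # batch_size * block_size = 16,384 chars/step
total_tokens = tokens_per_iter * 5000   # ~82M chars over max_iters
# The tiny Shakespeare corpus is only ~1M chars, so training makes
# many passes over the data -- fine for a toy model this small.
```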
### Workflow 2: Reproduce GPT-2 (124M)
Multi-GPU training on OpenWebText:
```bash
# Step 1: Prepare OpenWebText (takes ~1 hour)
python data/openwebtext/prepare.py

# Step 2: Train GPT-2 124M with DDP (8 GPUs)
torchrun --standalone --nproc_per_node=8 \
    train.py config/train_gpt2.py

# Step 3: Sample from trained model
python sample.py --out_dir=out
```
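`torchrun` starts one process per GPU and sets `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` in each. A generic sketch of the DDP plumbing a script like `train.py` relies on (standard PyTorch boilerplate with a stand-in model, not nanoGPT's exact code):

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for each process
dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

model = nn.Linear(768, 768).cuda()           # stand-in for the GPT model
model = DDP(model, device_ids=[local_rank])
# backward() all-reduces gradients across ranks, so each process can
# train on its own slice of the data while the replicas stay in sync.
```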
**Config** (`config/train_gpt2.py`):
```python
# GPT-2 (124M) architecture
n_layer = 12
n_head = 12
n_embd = 768
block_size = 1024
dropout = 0.0

# Training
batch_size = 12
gradient_accumulation_steps = 5 * 8  # Total batch ~0.5M tokens
learning_rate = 6e-4
max_iters = 600000
lr_decay_iters = 600000

# System
compile = True  # PyTorch 2.0
```

**Training time**: ~4 days (8× A100)
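The "~0.5M tokens" comment is easy to verify (my arithmetic, using the values above):

```python
# Effective batch size in tokens for the GPT-2 run (illustrative check)
micro_batch = 12 * 1024                     # batch_size * block_size = 12,288 tokens
grad_accum = 5 * 8                          # 40 micro-steps (5 per GPU x 8 GPUs)
tokens_per_step = micro_batch * grad_accum  # 491,520 ~= 0.5M tokens, as advertised
```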
### Workflow 3: Fine-tune pretrained GPT-2
Start from OpenAI checkpoint:
```python
# In train.py or a config file
init_from = 'gpt2'  # Options: gpt2, gpt2-medium, gpt2-large, gpt2-xl
# Model loads the OpenAI weights automatically
```

```bash
python train.py config/finetune_shakespeare.py
```
**Example config** (`config/finetune_shakespeare.py`):
```python
# Start from GPT-2
init_from = 'gpt2'

# Dataset
dataset = 'shakespeare'  # GPT-2 BPE-tokenized Shakespeare (not the char-level set)
batch_size = 1
block_size = 1024

# Fine-tuning
learning_rate = 3e-5  # Lower LR for fine-tuning
max_iters = 2000
warmup_iters = 100

# Regularization
weight_decay = 1e-1
```
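Under the hood, `init_from='gpt2'` maps the HuggingFace GPT-2 checkpoint into nanoGPT's module layout. The repo exposes this as a classmethod on the model; a sketch based on the current code (check `model.py` for the exact signature, which may change):

```python
# Sketch of the weight-loading path (see model.py; signature may differ)
from model import GPT

# Downloads the HF checkpoint and copies weights into nanoGPT's modules,
# transposing the Conv1D weights that the original OpenAI code used.
model = GPT.from_pretrained('gpt2', dict(dropout=0.0))
```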
### Workflow 4: Custom dataset
Train on your own text:
```python
# data/custom/prepare.py
import numpy as np

# Load your data
with open('my_data.txt', 'r') as f:
    text = f.read()

# Create character mappings
chars = sorted(list(set(text)))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

# Tokenize
data = np.array([stoi[ch] for ch in text], dtype=np.uint16)

# Split train/val
n = len(data)
train_data = data[:int(n * 0.9)]
val_data = data[int(n * 0.9):]

# Save
train_data.tofile('data/custom/train.bin')
val_data.tofile('data/custom/val.bin')
```
**Train**:
```bash
python data/custom/prepare.py
python train.py --dataset=custom
```
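At train time the `.bin` files are read back with `np.memmap`, and random context windows are sliced out of them; a hedged sketch of that batch sampling, shaped like the repo's `get_batch` (details may differ; see `train.py`). Note that the stock `shakespeare_char/prepare.py` also pickles `stoi`/`itos` into a `meta.pkl` so `sample.py` can decode output; a custom prepare script should do the same.

```python
import numpy as np
import torch

# How train.py consumes the .bin files (sketch modeled on get_batch)
data = np.memmap('data/custom/train.bin', dtype=np.uint16, mode='r')

def get_batch(batch_size=64, block_size=256):
    # random start offsets, then block_size-long windows
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
    return x, y  # y is x shifted by one position: next-character targets
```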
## When to use vs alternatives
**Use nanoGPT when**:
- Learning how GPT works
- Experimenting with transformer variants
- Teaching/education purposes
- Quick prototyping
- Limited compute (can run on CPU)

**Simplicity advantages**:
- ~300 lines: Entire model in `model.py`
- ~300 lines: Training loop in `train.py`
- Hackable: Easy to modify
- No abstractions: Pure PyTorch

**Use alternatives instead**:
- HuggingFace Transformers: Production use, many models
- Megatron-LM: Large-scale distributed training
- LitGPT: More architectures, production-ready
- PyTorch Lightning: Need high-level framework
## Common issues
**Issue: CUDA out of memory**

Reduce batch size or context length:

```python
batch_size = 1    # Reduce from 12
block_size = 512  # Reduce from 1024
gradient_accumulation_steps = 40  # Increase to maintain effective batch
```

**Issue: Training too slow**

Enable compilation (PyTorch 2.0+):

```python
compile = True  # 2× speedup
```

Use mixed precision:

```python
dtype = 'bfloat16'  # Or 'float16'
```

**Issue: Poor generation quality**

Train longer:

```python
max_iters = 10000  # Increase from 5000
```

Lower temperature:

```python
# In sample.py
temperature = 0.7  # Lower from 1.0
top_k = 200        # Add top-k sampling
```

**Issue: Can't load GPT-2 weights**

Install transformers:

```bash
pip install transformers
```

Check model name:

```python
init_from = 'gpt2'  # Valid: gpt2, gpt2-medium, gpt2-large, gpt2-xl
```
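What `temperature` and `top_k` do mechanically: logits are scaled before the softmax, and everything outside the k most likely tokens is masked out. A self-contained sketch of this standard sampling step (matching what `sample.py` implements in spirit):

```python
import torch
import torch.nn.functional as F

def sample_next(logits, temperature=0.7, top_k=200):
    """Sample one token id from (vocab_size,) next-token logits."""
    logits = logits / temperature                # <1.0 sharpens the distribution
    if top_k is not None:
        v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        logits = logits.masked_fill(logits < v[-1], -float('inf'))  # keep only top-k
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # draw one token id
```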
## Advanced topics
**Model architecture**: See `references/architecture.md` for GPT block structure, multi-head attention, and MLP layers explained simply.

**Training loop**: See `references/training.md` for the learning rate schedule, gradient accumulation, and distributed data parallel setup.

**Data preparation**: See `references/data.md` for tokenization strategies (character-level vs BPE) and binary format details.
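The learning rate schedule referenced above is linear warmup followed by cosine decay to a floor. A sketch of the shape (mirrors `train.py`'s `get_lr` in spirit; the constants here are illustrative):

```python
import math

def get_lr(it, lr=6e-4, min_lr=6e-5, warmup_iters=2000, lr_decay_iters=600000):
    if it < warmup_iters:                  # 1) linear warmup from 0
        return lr * it / warmup_iters
    if it > lr_decay_iters:                # 3) hold at the floor afterwards
        return min_lr
    ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))  # 2) cosine from 1 down to 0
    return min_lr + coeff * (lr - min_lr)
```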
## Hardware requirements
- **Shakespeare (char-level)**:
  - CPU: 5 minutes
  - GPU (T4): 1 minute
  - VRAM: <1GB
- **GPT-2 (124M)**:
  - 1× A100: ~1 week
  - 8× A100: ~4 days
  - VRAM: ~16GB per GPU
- **GPT-2 Medium (350M)**:
  - 8× A100: ~2 weeks
  - VRAM: ~40GB per GPU

**Performance**:
- With `compile=True`: 2× speedup
- With `dtype=bfloat16`: 50% memory reduction
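Both switches map onto standard PyTorch 2.x APIs; a generic, self-contained sketch (using a stand-in linear model, not nanoGPT's training loop):

```python
import torch
import torch.nn as nn

model = nn.Linear(384, 384).cuda()   # stand-in model for illustration
model = torch.compile(model)         # PyTorch 2.0+: JIT-compiles and fuses kernels

x = torch.randn(64, 384, device='cuda')
# bfloat16 autocast keeps activations in 16-bit, roughly halving memory
with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
    out = model(x)
out.float().sum().backward()
```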
## Resources
- GitHub: https://github.com/karpathy/nanoGPT ⭐ 48,000+
- Video: "Let's build GPT" by Andrej Karpathy
- Paper: "Attention is All You Need" (Vaswani et al.)
- OpenWebText: https://huggingface.co/datasets/Skylion007/openwebtext
- Educational: Best for understanding transformers from scratch