nanoGPT - Minimalist GPT Training

Quick start


nanoGPT is a simplified GPT implementation designed for learning and experimentation.

Installation:

```bash
pip install torch numpy transformers datasets tiktoken wandb tqdm
```

Train on Shakespeare (CPU-friendly):

```bash
# Prepare data
python data/shakespeare_char/prepare.py

# Train (5 minutes on CPU)
python train.py config/train_shakespeare_char.py

# Generate text
python sample.py --out_dir=out-shakespeare-char
```

**Output**:

```
ROMEO: What say'st thou? Shall I speak, and be a man?
JULIET: I am afeard, and yet I'll speak; for thou art
One that hath been a man, and yet I know not
What thou art.
```

Common workflows


Workflow 1: Character-level Shakespeare

Complete training pipeline:

```bash
# Step 1: Prepare data (creates train.bin, val.bin)
python data/shakespeare_char/prepare.py

# Step 2: Train small model
python train.py config/train_shakespeare_char.py

# Step 3: Generate text
python sample.py --out_dir=out-shakespeare-char
```

**Config** (`config/train_shakespeare_char.py`):

```python
# Model config
n_layer = 6       # 6 transformer layers
n_head = 6        # 6 attention heads
n_embd = 384      # 384-dim embeddings
block_size = 256  # 256-char context

# Training config
batch_size = 64
learning_rate = 1e-3
max_iters = 5000
eval_interval = 500

# Hardware
device = 'cpu'   # Or 'cuda'
compile = False  # Set True for PyTorch 2.0
```

**Training time**: ~5 minutes (CPU), ~1 minute (GPU)
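The model config above pins down the parameter count, which you can estimate by hand. A rough sketch, ignoring biases and LayerNorm weights, and assuming the 65-character Shakespeare vocabulary (check the generated `meta.pkl` for the exact value):

```python
# Rough parameter count for the character-level config above.
# vocab_size=65 is an assumption (Shakespeare character set).
vocab_size, n_layer, n_embd, block_size = 65, 6, 384, 256

tok_emb = vocab_size * n_embd   # token embedding table
pos_emb = block_size * n_embd   # position embedding table
attn = 4 * n_embd * n_embd      # QKV projections + output projection
mlp = 8 * n_embd * n_embd       # 4x expansion: up + down projections
per_layer = attn + mlp          # ~12 * n_embd^2 per transformer block

total = n_layer * per_layer + tok_emb + pos_emb
print(f"{total / 1e6:.1f}M parameters")  # 10.7M parameters
```

This matches the ~10.7M figure train.py reports for this config (the reported number differs slightly because of the terms ignored here).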

Workflow 2: Reproduce GPT-2 (124M)

Multi-GPU training on OpenWebText:

```bash
# Step 1: Prepare OpenWebText (takes ~1 hour)
python data/openwebtext/prepare.py

# Step 2: Train GPT-2 124M with DDP (8 GPUs)
torchrun --standalone --nproc_per_node=8 train.py config/train_gpt2.py

# Step 3: Sample from trained model
python sample.py --out_dir=out
```

**Config** (`config/train_gpt2.py`):

```python
# GPT-2 (124M) architecture
n_layer = 12
n_head = 12
n_embd = 768
block_size = 1024
dropout = 0.0

# Training
batch_size = 12
gradient_accumulation_steps = 5 * 8  # Total batch ~0.5M tokens
learning_rate = 6e-4
max_iters = 600000
lr_decay_iters = 600000

# System
compile = True  # PyTorch 2.0
```

**Training time**: ~4 days (8× A100)
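The "~0.5M tokens" comment in the config above follows from gradient accumulation: each GPU runs 5 micro-steps of 12 sequences before the optimizer steps, across 8 GPUs. The arithmetic:

```python
batch_size = 12                      # sequences per GPU per micro-step
gradient_accumulation_steps = 5 * 8  # micro-steps per GPU * number of GPUs
block_size = 1024                    # tokens per sequence

tokens_per_iter = batch_size * gradient_accumulation_steps * block_size
print(tokens_per_iter)  # 491520, i.e. ~0.5M tokens per optimizer step
```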

Workflow 3: Fine-tune pretrained GPT-2

Start from an OpenAI checkpoint:

```python
# In train.py or config
init_from = 'gpt2'  # Options: gpt2, gpt2-medium, gpt2-large, gpt2-xl
# Model loads OpenAI weights automatically
```

```bash
python train.py config/finetune_shakespeare.py
```

**Example config** (`config/finetune_shakespeare.py`):

```python
# Start from GPT-2
init_from = 'gpt2'

# Dataset
dataset = 'shakespeare_char'
batch_size = 1
block_size = 1024

# Fine-tuning
learning_rate = 3e-5  # Lower LR for fine-tuning
max_iters = 2000
warmup_iters = 100

# Regularization
weight_decay = 1e-1
```
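The `warmup_iters` and low peak learning rate above feed nanoGPT's warmup-then-cosine-decay schedule. A minimal sketch of that schedule's shape; the `min_lr` floor and decay horizon here are illustrative values, not taken from the config:

```python
import math

learning_rate = 3e-5   # peak LR from the fine-tuning config above
warmup_iters = 100
lr_decay_iters = 2000  # decay over the full run (assumption)
min_lr = 3e-6          # illustrative floor, ~LR/10

def get_lr(it: int) -> float:
    """Linear warmup to the peak LR, then cosine decay down to min_lr."""
    if it < warmup_iters:
        return learning_rate * (it + 1) / warmup_iters
    if it > lr_decay_iters:
        return min_lr
    # Cosine decay between warmup_iters and lr_decay_iters
    ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))
    return min_lr + coeff * (learning_rate - min_lr)
```

The LR climbs linearly for the first 100 iterations, peaks at 3e-5, then follows a half-cosine down to the floor by the end of training.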

Workflow 4: Custom dataset

Train on your own text:

```python
# data/custom/prepare.py
import numpy as np

# Load your data
with open('my_data.txt', 'r') as f:
    text = f.read()

# Create character mappings
chars = sorted(list(set(text)))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

# Tokenize
data = np.array([stoi[ch] for ch in text], dtype=np.uint16)

# Split train/val (90/10)
n = len(data)
train_data = data[:int(n * 0.9)]
val_data = data[int(n * 0.9):]

# Save
train_data.tofile('data/custom/train.bin')
val_data.tofile('data/custom/val.bin')
```

**Train**:

```bash
python data/custom/prepare.py
python train.py --dataset=custom
```
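On the consuming side, train.py memory-maps these `.bin` files and slices random windows out of them, so even large datasets never load fully into RAM. A minimal sketch of that read path (the file path and batch parameters here are illustrative):

```python
import os
import tempfile
import numpy as np

# Write a tiny stand-in for train.bin (normally produced by prepare.py)
path = os.path.join(tempfile.mkdtemp(), 'train.bin')
tokens = np.arange(1000, dtype=np.uint16)
tokens.tofile(path)

# Memory-map the file instead of reading it into memory
data = np.memmap(path, dtype=np.uint16, mode='r')

# Slice random (input, target) windows; targets are offset by one token
block_size, batch_size = 8, 4
ix = np.random.randint(0, len(data) - block_size, size=batch_size)
x = np.stack([data[i : i + block_size].astype(np.int64) for i in ix])
y = np.stack([data[i + 1 : i + 1 + block_size].astype(np.int64) for i in ix])

print(x.shape, y.shape)  # (4, 8) (4, 8)
```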

When to use vs alternatives


Use nanoGPT when:
  • Learning how GPT works
  • Experimenting with transformer variants
  • Teaching/education purposes
  • Quick prototyping
  • Limited compute (can run on CPU)
Simplicity advantages:
  • ~300 lines: Entire model in `model.py`
  • ~300 lines: Training loop in `train.py`
  • Hackable: Easy to modify
  • No abstractions: Pure PyTorch
Use alternatives instead:
  • HuggingFace Transformers: Production use, many models
  • Megatron-LM: Large-scale distributed training
  • LitGPT: More architectures, production-ready
  • PyTorch Lightning: Need high-level framework

Common issues


**Issue: CUDA out of memory**

Reduce batch size or context length:

```python
batch_size = 1    # Reduce from 12
block_size = 512  # Reduce from 1024
gradient_accumulation_steps = 40  # Increase to maintain effective batch
```

**Issue: Training too slow**

Enable compilation (PyTorch 2.0+):

```python
compile = True  # 2× speedup
```

Use mixed precision:

```python
dtype = 'bfloat16'  # Or 'float16'
```

**Issue: Poor generation quality**

Train longer:

```python
max_iters = 10000  # Increase from 5000
```

Lower the temperature:

```python
# In sample.py
temperature = 0.7  # Lower from 1.0
top_k = 200        # Add top-k sampling
```

**Issue: Can't load GPT-2 weights**

Install transformers:

```bash
pip install transformers
```

Check the model name:

```python
init_from = 'gpt2'  # Valid: gpt2, gpt2-medium, gpt2-large, gpt2-xl
```
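Temperature and top-k reshape the output distribution before a token is drawn. A minimal numpy sketch of what the two knobs do, using dummy logits rather than nanoGPT's actual sample.py code:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, 0.1, -1.0])

# Temperature < 1 sharpens the distribution toward the top token
p_hot = softmax(logits / 1.0)
p_cool = softmax(logits / 0.7)
assert p_cool[0] > p_hot[0]  # top token becomes more likely

# Top-k: mask everything outside the k largest logits before softmax
k = 2
cutoff = np.sort(logits)[-k]
masked = np.where(logits >= cutoff, logits, -np.inf)
p_topk = softmax(masked)
print(p_topk)  # only the top-2 entries get nonzero probability
```

Lowering temperature concentrates probability on likely tokens (more coherent, less varied text); top-k zeroes out the long tail so rare garbage tokens cannot be sampled at all.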

Advanced topics


Model architecture: See references/architecture.md for GPT block structure, multi-head attention, and MLP layers explained simply.
Training loop: See references/training.md for learning rate schedule, gradient accumulation, and distributed data parallel setup.
Data preparation: See references/data.md for tokenization strategies (character-level vs BPE) and binary format details.

Hardware requirements


  • Shakespeare (char-level):
    • CPU: 5 minutes
    • GPU (T4): 1 minute
    • VRAM: <1GB
  • GPT-2 (124M):
    • 1× A100: ~1 week
    • 8× A100: ~4 days
    • VRAM: ~16GB per GPU
  • GPT-2 Medium (350M):
    • 8× A100: ~2 weeks
    • VRAM: ~40GB per GPU
Performance:
  • With `compile=True`: 2× speedup
  • With `dtype=bfloat16`: 50% memory reduction

Resources
