peft-fine-tuning
# PEFT (Parameter-Efficient Fine-Tuning)

Fine-tune LLMs by training <1% of parameters using LoRA, QLoRA, and 25+ adapter methods.
## When to use PEFT
Use PEFT/LoRA when:
- Fine-tuning 7B-70B models on consumer GPUs (RTX 4090, A100)
- Need to train <1% of parameters (megabyte-scale adapters vs a 14GB full model)
- Want fast iteration with multiple task-specific adapters
- Deploying multiple fine-tuned variants from one base model
Use QLoRA (PEFT + quantization) when:
- Fine-tuning 30B-class models on a single 24GB GPU (a 70B model needs roughly 48GB even in 4-bit)
- Memory is the primary constraint
- Can accept ~5% quality trade-off vs full fine-tuning
Use full fine-tuning instead when:
- Training small models (<1B parameters)
- Need maximum quality and have compute budget
- Significant domain shift requires updating all weights
## Quick start

### Installation
```bash
# Basic installation
pip install peft

# With quantization support (recommended)
pip install peft bitsandbytes

# Full stack
pip install peft transformers accelerate bitsandbytes datasets
```
### LoRA fine-tuning (standard)
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from peft import get_peft_model, LoraConfig, TaskType
from datasets import load_dataset

# Load base model
model_name = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,              # Rank (8-64, higher = more capacity)
    lora_alpha=32,     # Scaling factor (typically 2*r)
    lora_dropout=0.05, # Dropout for regularization
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # Attention layers
    bias="none"        # Don't train biases
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 13,631,488 || all params: 8,043,307,008 || trainable%: 0.17%

# Prepare dataset
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

def tokenize(example):
    text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"
    return tokenizer(text, truncation=True, max_length=512, padding="max_length")

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

# Training
training_args = TrainingArguments(
    output_dir="./lora-llama",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=lambda data: {
        "input_ids": torch.stack([torch.tensor(f["input_ids"]) for f in data]),
        "attention_mask": torch.stack([torch.tensor(f["attention_mask"]) for f in data]),
        "labels": torch.stack([torch.tensor(f["input_ids"]) for f in data]),
    }
)
trainer.train()

# Save adapter only (megabytes vs 16GB for the full model)
model.save_pretrained("./lora-llama-adapter")
```
### QLoRA fine-tuning (memory-efficient)
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model, LoraConfig, prepare_model_for_kbit_training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NormalFloat4 (best for LLMs)
    bnb_4bit_compute_dtype="bfloat16",  # Compute in bf16
    bnb_4bit_use_double_quant=True      # Nested quantization
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",
    quantization_config=bnb_config,
    device_map="auto"
)

# Prepare for training (enables gradient checkpointing)
model = prepare_model_for_kbit_training(model)

# LoRA config for QLoRA
lora_config = LoraConfig(
    r=64,  # Higher rank for 70B
    lora_alpha=128,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)

# The 4-bit 70B model now fits on a single 48GB GPU
```
## LoRA parameter selection
### Rank (r) - capacity vs efficiency
| Rank | Trainable Params | Memory | Quality | Use Case |
|---|---|---|---|---|
| 4 | ~3M | Minimal | Lower | Simple tasks, prototyping |
| 8 | ~7M | Low | Good | Recommended starting point |
| 16 | ~14M | Medium | Better | General fine-tuning |
| 32 | ~27M | Higher | High | Complex tasks |
| 64 | ~54M | High | Highest | Domain adaptation, 70B models |
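The per-rank figures above can be reproduced with a back-of-the-envelope count: for each targeted weight of shape `(d_out, d_in)`, LoRA adds `A` (`r × d_in`) and `B` (`d_out × r`), i.e. `r * (d_in + d_out)` trainable parameters. A sketch assuming Llama 3.1 8B shapes (32 layers; GQA, so `q_proj`/`o_proj` are 4096→4096 while `k_proj`/`v_proj` are 4096→1024) and the four attention projections targeted in the quick-start example:

```python
# LoRA adds A (r x d_in) and B (d_out x r) per targeted matrix,
# i.e. r * (d_in + d_out) trainable parameters.
# Shapes assume Llama 3.1 8B: 32 layers, GQA with 8 KV heads.
LAYERS = 32
SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}

def lora_params(r: int) -> int:
    """Trainable parameters for rank-r LoRA over the attention projections."""
    return LAYERS * sum(r * (d_in + d_out) for d_in, d_out in SHAPES.values())

for r in (4, 8, 16, 32, 64):
    print(f"r={r:>2}: {lora_params(r):>10,} trainable params")
# r=16 gives 13,631,488 -- matching the print_trainable_parameters() output above.
```

Parameter count scales linearly with `r`, which is why the table's memory column grows so gently: even r=64 stays well under 1% of an 8B model.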
### Alpha (lora_alpha) - scaling factor
```python
# Rule of thumb: alpha = 2 * rank
LoraConfig(r=16, lora_alpha=32)  # Standard
LoraConfig(r=16, lora_alpha=16)  # Conservative (lower effective learning rate)
LoraConfig(r=16, lora_alpha=64)  # Aggressive (higher effective learning rate)
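Alpha only matters relative to the rank: LoRA applies the learned update as ΔW = (lora_alpha / r) · BA, so at a fixed rank, doubling alpha doubles the magnitude of the applied update. A toy numpy sketch of that scaling (random 4×4 matrices, not real model weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 4, 2
A = rng.normal(size=(r, d))  # LoRA down-projection
B = rng.normal(size=(d, r))  # LoRA up-projection

def delta_w(lora_alpha: float, r: int) -> np.ndarray:
    # LoRA applies the update as (lora_alpha / r) * B @ A
    return (lora_alpha / r) * (B @ A)

conservative = delta_w(2, r)  # alpha = r
standard = delta_w(4, r)      # alpha = 2*r
aggressive = delta_w(8, r)    # alpha = 4*r

# Doubling alpha exactly doubles the applied update:
assert np.allclose(aggressive, 2 * standard)
assert np.allclose(standard, 2 * conservative)
```

This is why the rule of thumb is stated as a ratio (alpha = 2·r): it keeps the effective update scale constant when you sweep the rank.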
### Target modules by architecture
```python
# Llama / Mistral / Qwen
target_modules = ["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

# GPT-2 / GPT-Neo
target_modules = ["c_attn", "c_proj", "c_fc"]

# Falcon
target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]

# BLOOM
target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]

# Auto-detect all linear layers
target_modules = "all-linear"  # PEFT 0.6.0+
```
## Loading and merging adapters

### Load trained adapter
```python
from peft import PeftModel, AutoPeftModelForCausalLM
from transformers import AutoModelForCausalLM

# Option 1: Load with PeftModel
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model = PeftModel.from_pretrained(base_model, "./lora-llama-adapter")

# Option 2: Load directly (recommended)
model = AutoPeftModelForCausalLM.from_pretrained(
    "./lora-llama-adapter",
    device_map="auto"
)
```
undefinedMerge adapter into base model
将适配器合并到基础模型
```python
# Merge for deployment (no adapter overhead)
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("./llama-merged")
tokenizer.save_pretrained("./llama-merged")

# Push to Hub
merged_model.push_to_hub("username/llama-finetuned")
```
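Merging folds the adapter into the base weights, W_merged = W + (alpha/r)·BA, after which a plain forward pass reproduces base-plus-adapter output with no extra matmul at inference time. A toy numpy sketch of that identity (random small matrices, not real Llama weights):

```python
import numpy as np

rng = np.random.default_rng(42)
d, r = 8, 2
W = rng.normal(size=(d, d))        # frozen base weight
A = rng.normal(size=(r, d)) * 0.1  # LoRA down-projection
B = rng.normal(size=(d, r)) * 0.1  # LoRA up-projection
scale = 2.0                        # lora_alpha / r
x = rng.normal(size=(d,))

# Unmerged: base path plus the LoRA side branch
adapter_forward = W @ x + scale * (B @ (A @ x))

# Merged: what merge_and_unload effectively stores
W_merged = W + scale * (B @ A)
merged_forward = W_merged @ x

assert np.allclose(adapter_forward, merged_forward)
```

The trade-off: a merged model is faster to serve but can no longer switch or disable adapters, which is why the multi-adapter pattern below keeps them separate.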
### Multi-adapter serving
```python
from peft import AutoPeftModelForCausalLM

# Load base with first adapter
model = AutoPeftModelForCausalLM.from_pretrained("./adapter-task1")

# Load additional adapters
model.load_adapter("./adapter-task2", adapter_name="task2")
model.load_adapter("./adapter-task3", adapter_name="task3")

# Switch between adapters at runtime
model.set_adapter("task1")  # Use task1 adapter
output1 = model.generate(**inputs)
model.set_adapter("task2")  # Switch to task2
output2 = model.generate(**inputs)

# Disable adapters (use base model)
with model.disable_adapter():
    base_output = model.generate(**inputs)
```
## PEFT methods comparison
| Method | Trainable % | Memory | Speed | Best For |
|---|---|---|---|---|
| LoRA | 0.1-1% | Low | Fast | General fine-tuning |
| QLoRA | 0.1-1% | Very Low | Medium | Memory-constrained |
| AdaLoRA | 0.1-1% | Low | Medium | Automatic rank selection |
| IA3 | 0.01% | Minimal | Fastest | Few-shot adaptation |
| Prefix Tuning | 0.1% | Low | Medium | Generation control |
| Prompt Tuning | 0.001% | Minimal | Fast | Simple task adaptation |
| P-Tuning v2 | 0.1% | Low | Medium | NLU tasks |
### IA3 (minimal parameters)
```python
from peft import IA3Config

ia3_config = IA3Config(
    target_modules=["q_proj", "v_proj", "k_proj", "down_proj"],
    feedforward_modules=["down_proj"]
)
model = get_peft_model(model, ia3_config)
# Trains only 0.01% of parameters
```
### Prefix Tuning
```python
from peft import PrefixTuningConfig

prefix_config = PrefixTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=20,  # Prepended virtual tokens
    prefix_projection=True  # Use MLP projection
)
model = get_peft_model(model, prefix_config)
```

## Integration patterns
### With TRL (SFTTrainer)
```python
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear")
trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="./output", max_seq_length=512),
    train_dataset=dataset,
    peft_config=lora_config,  # Pass LoRA config directly
)
trainer.train()
```

### With Axolotl (YAML config)
```yaml
# axolotl config.yaml
adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
lora_target_linear: true  # Target all linear layers
```
### With vLLM (inference)
```python
from vllm import LLM
from vllm.lora.request import LoRARequest

# Load base model with LoRA support
llm = LLM(model="meta-llama/Llama-3.1-8B", enable_lora=True)

# Serve with adapter
outputs = llm.generate(
    prompts,
    lora_request=LoRARequest("adapter1", 1, "./lora-adapter")
)
```
## Performance benchmarks

### Memory usage (Llama 3.1 8B)
| Method | GPU Memory | Trainable Params |
|---|---|---|
| Full fine-tuning | 60+ GB | 8B (100%) |
| LoRA r=16 | 18 GB | 14M (0.17%) |
| QLoRA r=16 | 6 GB | 14M (0.17%) |
| IA3 | 16 GB | 800K (0.01%) |
### Training speed (A100 80GB)
| Method | Tokens/sec | vs Full FT |
|---|---|---|
| Full FT | 2,500 | 1x |
| LoRA | 3,200 | 1.3x |
| QLoRA | 2,100 | 0.84x |
### Quality (MMLU benchmark)
| Model | Full FT | LoRA | QLoRA |
|---|---|---|---|
| Llama 2-7B | 45.3 | 44.8 | 44.1 |
| Llama 2-13B | 54.8 | 54.2 | 53.5 |
## Common issues

### CUDA OOM during training
```python
# Solution 1: Enable gradient checkpointing
model.gradient_checkpointing_enable()

# Solution 2: Reduce batch size + increase accumulation
TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16
)

# Solution 3: Use QLoRA
from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
```
### Adapter not applying
```python
# Verify adapter is active
print(model.active_adapters)  # Should show adapter name

# Check trainable parameters
model.print_trainable_parameters()

# Ensure model is in training mode
model.train()
```
### Quality degradation
```python
# Increase rank
LoraConfig(r=32, lora_alpha=64)

# Target more modules
target_modules = "all-linear"

# Use more training data and epochs
TrainingArguments(num_train_epochs=5)

# Lower learning rate
TrainingArguments(learning_rate=1e-4)
```
## Best practices
- Start with r=8-16, increase if quality insufficient
- Use alpha = 2 * rank as starting point
- Target attention + MLP layers for best quality/efficiency
- Enable gradient checkpointing for memory savings
- Save adapters frequently (small files, easy rollback)
- Evaluate on held-out data before merging
- Use QLoRA for 70B+ models on consumer hardware
## References
- Advanced Usage - DoRA, LoftQ, rank stabilization, custom modules
- Troubleshooting - Common errors, debugging, optimization
## Resources
- GitHub: https://github.com/huggingface/peft
- Docs: https://huggingface.co/docs/peft
- LoRA Paper: arXiv:2106.09685
- QLoRA Paper: arXiv:2305.14314
- Models: https://huggingface.co/models?library=peft