peft-fine-tuning


PEFT (Parameter-Efficient Fine-Tuning)

Fine-tune LLMs by training <1% of parameters using LoRA, QLoRA, and 25+ adapter methods.

When to use PEFT

Use PEFT/LoRA when:
  • Fine-tuning 7B-70B models on a single GPU (e.g., RTX 4090, A100)
  • You need to train <1% of parameters (a tens-of-MB adapter vs. a 16GB full checkpoint)
  • You want fast iteration with multiple task-specific adapters
  • Deploying multiple fine-tuned variants from one base model
Use QLoRA (PEFT + quantization) when:
  • Fine-tuning 70B models on a single 48GB GPU
  • Memory is the primary constraint
  • A ~5% quality trade-off vs. full fine-tuning is acceptable
Use full fine-tuning instead when:
  • Training small models (<1B parameters)
  • You need maximum quality and have the compute budget
  • A significant domain shift requires updating all weights

Quick start

Installation

```bash
# Basic installation
pip install peft

# With quantization support (recommended)
pip install peft bitsandbytes

# Full stack
pip install peft transformers accelerate bitsandbytes datasets
```

LoRA fine-tuning (standard)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from peft import get_peft_model, LoraConfig, TaskType
from datasets import load_dataset

# Load base model
model_name = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                # Rank (8-64, higher = more capacity)
    lora_alpha=32,       # Scaling factor (typically 2*r)
    lora_dropout=0.05,   # Dropout for regularization
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # Attention layers
    bias="none",         # Don't train biases
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 13,631,488 || all params: 8,043,307,008 || trainable%: 0.17%

# Prepare dataset
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

def tokenize(example):
    text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"
    return tokenizer(text, truncation=True, max_length=512, padding="max_length")

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

# Training
training_args = TrainingArguments(
    output_dir="./lora-llama",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
)

def collate(data):
    # Causal LM objective: labels are the input ids
    return {
        "input_ids": torch.tensor([f["input_ids"] for f in data]),
        "attention_mask": torch.tensor([f["attention_mask"] for f in data]),
        "labels": torch.tensor([f["input_ids"] for f in data]),
    }

trainer = Trainer(model=model, args=training_args, train_dataset=tokenized, data_collator=collate)
trainer.train()

# Save adapter only (tens of MB vs. a 16GB full checkpoint)
model.save_pretrained("./lora-llama-adapter")
```
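The 0.17% figure printed above can be reproduced with a quick parameter count. Each adapted linear layer of shape (d_out, d_in) gains r*(d_in + d_out) LoRA parameters. The shapes below come from the Llama 3.1 8B architecture (32 layers, hidden size 4096, grouped-query attention with 1024-dim k/v projections), not from this guide:

```python
# Reproduce "trainable params: 13,631,488" for r=16 on Llama 3.1 8B.
r = 16
num_layers = 32
hidden = 4096
kv_dim = 1024  # 8 KV heads * head_dim 128 (grouped-query attention)

def lora_params(d_in, d_out, r):
    # A is (r x d_in), B is (d_out x r)
    return r * (d_in + d_out)

per_layer = (
    lora_params(hidden, hidden, r)    # q_proj
    + lora_params(hidden, kv_dim, r)  # k_proj
    + lora_params(hidden, kv_dim, r)  # v_proj
    + lora_params(hidden, hidden, r)  # o_proj
)
total = num_layers * per_layer
print(total)  # 13631488
```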

QLoRA fine-tuning (memory-efficient)

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model, LoraConfig, prepare_model_for_kbit_training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NormalFloat4 (best for LLMs)
    bnb_4bit_compute_dtype="bfloat16",  # Compute in bf16
    bnb_4bit_use_double_quant=True,     # Nested quantization
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare for training (enables gradient checkpointing)
model = prepare_model_for_kbit_training(model)

# LoRA config for QLoRA
lora_config = LoraConfig(
    r=64,  # Higher rank for 70B
    lora_alpha=128,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# The 70B model now trains on a single 48GB GPU
```

LoRA parameter selection

Rank (r) - capacity vs efficiency

| Rank | Trainable Params | Memory | Quality | Use Case |
|------|------------------|---------|---------|----------|
| 4 | ~3M | Minimal | Lower | Simple tasks, prototyping |
| 8 | ~7M | Low | Good | Recommended starting point |
| 16 | ~14M | Medium | Better | General fine-tuning |
| 32 | ~27M | Higher | High | Complex tasks |
| 64 | ~54M | High | Highest | Domain adaptation, 70B models |

Alpha (lora_alpha) - scaling factor

Rule of thumb: alpha = 2 * rank.

```python
LoraConfig(r=16, lora_alpha=32)  # Standard
LoraConfig(r=16, lora_alpha=16)  # Conservative (acts like a lower learning rate)
LoraConfig(r=16, lora_alpha=64)  # Aggressive (acts like a higher learning rate)
```
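Why alpha behaves like a learning-rate knob: a LoRA layer computes h = Wx + (alpha/r) · B(Ax), so alpha linearly scales the adapter's contribution. A minimal NumPy sketch (an illustration of the math, not PEFT's internals):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 16, 32
W = rng.normal(size=(d, d))         # frozen base weight
A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection (zero-init)
x = rng.normal(size=d)

def lora_forward(x, alpha):
    # h = W x + (alpha / r) * B (A x)
    return W @ x + (alpha / r) * (B @ (A @ x))

# Zero-initialized B makes the adapter a no-op at step 0
assert np.allclose(lora_forward(x, alpha), W @ x)

# Once B is nonzero ("after training"), doubling alpha doubles the update
B = rng.normal(size=(d, r)) * 0.01
delta = lora_forward(x, alpha) - W @ x
assert np.allclose(lora_forward(x, 2 * alpha) - W @ x, 2 * delta)
```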

Target modules by architecture

```python
# Llama / Mistral / Qwen
target_modules = ["q_proj", "v_proj", "k_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj"]

# GPT-2 / GPT-Neo
target_modules = ["c_attn", "c_proj", "c_fc"]

# Falcon / BLOOM
target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]

# Auto-detect all linear layers (PEFT 0.6.0+)
target_modules = "all-linear"
```

Loading and merging adapters

Load trained adapter

```python
from peft import PeftModel, AutoPeftModelForCausalLM
from transformers import AutoModelForCausalLM

# Option 1: Load with PeftModel
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model = PeftModel.from_pretrained(base_model, "./lora-llama-adapter")

# Option 2: Load directly (recommended)
model = AutoPeftModelForCausalLM.from_pretrained(
    "./lora-llama-adapter",
    device_map="auto",
)
```

Merge adapter into base model

```python
# Merge for deployment (no adapter overhead)
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("./llama-merged")
tokenizer.save_pretrained("./llama-merged")

# Push to Hub
merged_model.push_to_hub("username/llama-finetuned")
```

Multi-adapter serving

```python
from peft import AutoPeftModelForCausalLM

# Load base with first adapter (name it so it can be selected later)
model = AutoPeftModelForCausalLM.from_pretrained("./adapter-task1", adapter_name="task1")

# Load additional adapters
model.load_adapter("./adapter-task2", adapter_name="task2")
model.load_adapter("./adapter-task3", adapter_name="task3")

# Switch between adapters at runtime
model.set_adapter("task1")  # Use task1 adapter
output1 = model.generate(**inputs)
model.set_adapter("task2")  # Switch to task2
output2 = model.generate(**inputs)

# Disable adapters (use the base model)
with model.disable_adapter():
    base_output = model.generate(**inputs)
```

PEFT methods comparison

| Method | Trainable % | Memory | Speed | Best For |
|--------|-------------|--------|-------|----------|
| LoRA | 0.1-1% | Low | Fast | General fine-tuning |
| QLoRA | 0.1-1% | Very Low | Medium | Memory-constrained |
| AdaLoRA | 0.1-1% | Low | Medium | Automatic rank selection |
| IA3 | 0.01% | Minimal | Fastest | Few-shot adaptation |
| Prefix Tuning | 0.1% | Low | Medium | Generation control |
| Prompt Tuning | 0.001% | Minimal | Fast | Simple task adaptation |
| P-Tuning v2 | 0.1% | Low | Medium | NLU tasks |

IA3 (minimal parameters)

```python
from peft import IA3Config

ia3_config = IA3Config(
    target_modules=["q_proj", "v_proj", "k_proj", "down_proj"],
    feedforward_modules=["down_proj"],
)
model = get_peft_model(model, ia3_config)
# Trains only ~0.01% of parameters
```

Prefix Tuning

```python
from peft import PrefixTuningConfig

prefix_config = PrefixTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=20,   # Prepended virtual tokens
    prefix_projection=True,  # Use an MLP projection
)
model = get_peft_model(model, prefix_config)
```

Integration patterns

With TRL (SFTTrainer)

```python
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear")

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="./output", max_seq_length=512),
    train_dataset=dataset,
    peft_config=lora_config,  # Pass LoRA config directly
)
trainer.train()
```

With Axolotl (YAML config)

```yaml
# axolotl config.yaml
adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
lora_target_linear: true  # Target all linear layers
```

With vLLM (inference)

```python
from vllm import LLM
from vllm.lora.request import LoRARequest

# Load the base model with LoRA support
llm = LLM(model="meta-llama/Llama-3.1-8B", enable_lora=True)

# Serve with an adapter (name, ID, path)
outputs = llm.generate(
    prompts,
    lora_request=LoRARequest("adapter1", 1, "./lora-adapter"),
)
```

Performance benchmarks

Memory usage (Llama 3.1 8B)

| Method | GPU Memory | Trainable Params |
|--------|------------|------------------|
| Full fine-tuning | 60+ GB | 8B (100%) |
| LoRA r=16 | 18 GB | 14M (0.17%) |
| QLoRA r=16 | 6 GB | 14M (0.17%) |
| IA3 | 16 GB | 800K (0.01%) |
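A back-of-envelope sketch of where these orders of magnitude come from. The estimates below ignore activations, KV cache, and framework overhead, and assume fp16 weights/grads with two fp32 Adam moments, so they bracket rather than match the measured table figures:

```python
# Rough weight + optimizer-state memory for Llama 3.1 8B (estimates only).
params = 8e9    # base model parameters
adapter = 14e6  # LoRA r=16 trainable parameters

def gib(nbytes):
    return nbytes / 2**30

# Full fine-tuning: fp16 weights + fp16 grads + two fp32 Adam moments
full_ft = gib(params * (2 + 2 + 4 + 4))
# LoRA: frozen fp16 weights; grads and optimizer state only for the adapter
lora = gib(params * 2 + adapter * (2 + 2 + 4 + 4))
# QLoRA: 4-bit base weights (~0.5 byte/param) + the same adapter state
qlora = gib(params * 0.5 + adapter * (2 + 2 + 4 + 4))

print(f"full FT ~{full_ft:.0f} GiB, LoRA ~{lora:.0f} GiB, QLoRA ~{qlora:.0f} GiB")
```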

Training speed (A100 80GB)

| Method | Tokens/sec | vs Full FT |
|--------|------------|------------|
| Full FT | 2,500 | 1x |
| LoRA | 3,200 | 1.3x |
| QLoRA | 2,100 | 0.84x |

Quality (MMLU benchmark)

| Model | Full FT | LoRA | QLoRA |
|-------|---------|------|-------|
| Llama 2-7B | 45.3 | 44.8 | 44.1 |
| Llama 2-13B | 54.8 | 54.2 | 53.5 |

Common issues

CUDA OOM during training

```python
# Solution 1: Enable gradient checkpointing
model.gradient_checkpointing_enable()

# Solution 2: Reduce batch size and increase gradient accumulation
TrainingArguments(per_device_train_batch_size=1, gradient_accumulation_steps=16)

# Solution 3: Use QLoRA (4-bit base weights)
from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
```

Adapter not applying

```python
# Verify the adapter is active
print(model.active_adapters)  # Should show the adapter name

# Check trainable parameters
model.print_trainable_parameters()

# Ensure the model is in training mode
model.train()
```

Quality degradation

```python
# Increase the rank
LoraConfig(r=32, lora_alpha=64)

# Target more modules
target_modules = "all-linear"

# Use more training data and epochs
TrainingArguments(num_train_epochs=5)

# Lower the learning rate
TrainingArguments(learning_rate=1e-4)
```

Best practices

  1. Start with r=8-16, increase if quality insufficient
  2. Use alpha = 2 * rank as starting point
  3. Target attention + MLP layers for best quality/efficiency
  4. Enable gradient checkpointing for memory savings
  5. Save adapters frequently (small files, easy rollback)
  6. Evaluate on held-out data before merging
  7. Use QLoRA for 70B+ models on consumer hardware
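Practices 1-3 above translate directly into a starting configuration. A sketch with the recommended defaults (Llama-style module names assumed; tune from here rather than treating these values as final):

```python
# Starting LoRA settings encoding best practices 1-3 above.
# Use as: LoraConfig(**starter_lora_kwargs)
starter_lora_kwargs = dict(
    task_type="CAUSAL_LM",
    r=16,               # practice 1: start with r=8-16
    lora_alpha=32,      # practice 2: alpha = 2 * rank
    lora_dropout=0.05,
    target_modules=[    # practice 3: attention + MLP layers
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
)
assert starter_lora_kwargs["lora_alpha"] == 2 * starter_lora_kwargs["r"]
```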

References

  • Advanced Usage - DoRA, LoftQ, rank stabilization, custom modules
  • Troubleshooting - Common errors, debugging, optimization

Resources
