peft-fine-tuning


PEFT (Parameter-Efficient Fine-Tuning)

Fine-tune LLMs by training <1% of parameters using LoRA, QLoRA, and 25+ adapter methods.

When to use PEFT

Use PEFT/LoRA when:
  • Fine-tuning 7B-70B models on a single GPU (e.g., RTX 4090, A100)
  • You need to train <1% of parameters (a tens-of-MB adapter vs. a 16GB full checkpoint)
  • You want fast iteration with multiple task-specific adapters
  • Deploying multiple fine-tuned variants from one base model
Use QLoRA (PEFT + quantization) when:
  • Fine-tuning 70B models on a single 48GB GPU
  • Memory is the primary constraint
  • A ~5% quality trade-off vs. full fine-tuning is acceptable
Use full fine-tuning instead when:
  • Training small models (<1B parameters)
  • You need maximum quality and have the compute budget
  • A significant domain shift requires updating all weights

Quick start

Installation

```bash
# Basic installation
pip install peft

# With quantization support (recommended)
pip install peft bitsandbytes

# Full stack
pip install peft transformers accelerate bitsandbytes datasets
```

LoRA fine-tuning (standard)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from peft import get_peft_model, LoraConfig, TaskType
from datasets import load_dataset

# Load base model
model_name = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                # Rank (8-64, higher = more capacity)
    lora_alpha=32,       # Scaling factor (typically 2*r)
    lora_dropout=0.05,   # Dropout for regularization
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # Attention layers
    bias="none",         # Don't train biases
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 13,631,488 || all params: 8,043,307,008 || trainable%: 0.17%

# Prepare dataset
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

def tokenize(example):
    text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"
    return tokenizer(text, truncation=True, max_length=512, padding="max_length")

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

# Training
training_args = TrainingArguments(
    output_dir="./lora-llama",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
)

def collate(data):
    # Causal LM objective: labels are the input ids
    return {
        "input_ids": torch.tensor([f["input_ids"] for f in data]),
        "attention_mask": torch.tensor([f["attention_mask"] for f in data]),
        "labels": torch.tensor([f["input_ids"] for f in data]),
    }

trainer = Trainer(model=model, args=training_args, train_dataset=tokenized, data_collator=collate)
trainer.train()

# Save adapter only (tens of MB vs. a 16GB full checkpoint)
model.save_pretrained("./lora-llama-adapter")
```
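The 0.17% figure printed above can be reproduced with a quick parameter count. Each adapted linear layer of shape (d_out, d_in) gains r*(d_in + d_out) LoRA parameters. The shapes below come from the Llama 3.1 8B architecture (32 layers, hidden size 4096, grouped-query attention with 1024-dim k/v projections), not from this guide:

```python
# Reproduce "trainable params: 13,631,488" for r=16 on Llama 3.1 8B.
r = 16
num_layers = 32
hidden = 4096
kv_dim = 1024  # 8 KV heads * head_dim 128 (grouped-query attention)

def lora_params(d_in, d_out, r):
    # A is (r x d_in), B is (d_out x r)
    return r * (d_in + d_out)

per_layer = (
    lora_params(hidden, hidden, r)    # q_proj
    + lora_params(hidden, kv_dim, r)  # k_proj
    + lora_params(hidden, kv_dim, r)  # v_proj
    + lora_params(hidden, hidden, r)  # o_proj
)
total = num_layers * per_layer
print(total)  # 13631488
```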

QLoRA fine-tuning (memory-efficient)

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model, LoraConfig, prepare_model_for_kbit_training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NormalFloat4 (best for LLMs)
    bnb_4bit_compute_dtype="bfloat16",  # Compute in bf16
    bnb_4bit_use_double_quant=True,     # Nested quantization
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare for training (enables gradient checkpointing)
model = prepare_model_for_kbit_training(model)

# LoRA config for QLoRA
lora_config = LoraConfig(
    r=64,  # Higher rank for 70B
    lora_alpha=128,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# The 70B model now trains on a single 48GB GPU
```

LoRA parameter selection

Rank (r) - capacity vs efficiency

| Rank | Trainable Params | Memory | Quality | Use Case |
|------|------------------|---------|---------|----------|
| 4 | ~3M | Minimal | Lower | Simple tasks, prototyping |
| 8 | ~7M | Low | Good | Recommended starting point |
| 16 | ~14M | Medium | Better | General fine-tuning |
| 32 | ~27M | Higher | High | Complex tasks |
| 64 | ~54M | High | Highest | Domain adaptation, 70B models |

Alpha (lora_alpha) - scaling factor

Rule of thumb: alpha = 2 * rank.

```python
LoraConfig(r=16, lora_alpha=32)  # Standard
LoraConfig(r=16, lora_alpha=16)  # Conservative (acts like a lower learning rate)
LoraConfig(r=16, lora_alpha=64)  # Aggressive (acts like a higher learning rate)
```
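Why alpha behaves like a learning-rate knob: a LoRA layer computes h = Wx + (alpha/r) · B(Ax), so alpha linearly scales the adapter's contribution. A minimal NumPy sketch (an illustration of the math, not PEFT's internals):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 16, 32
W = rng.normal(size=(d, d))         # frozen base weight
A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection (zero-init)
x = rng.normal(size=d)

def lora_forward(x, alpha):
    # h = W x + (alpha / r) * B (A x)
    return W @ x + (alpha / r) * (B @ (A @ x))

# Zero-initialized B makes the adapter a no-op at step 0
assert np.allclose(lora_forward(x, alpha), W @ x)

# Once B is nonzero ("after training"), doubling alpha doubles the update
B = rng.normal(size=(d, r)) * 0.01
delta = lora_forward(x, alpha) - W @ x
assert np.allclose(lora_forward(x, 2 * alpha) - W @ x, 2 * delta)
```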

Target modules by architecture

```python
# Llama / Mistral / Qwen
target_modules = ["q_proj", "v_proj", "k_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj"]

# GPT-2 / GPT-Neo
target_modules = ["c_attn", "c_proj", "c_fc"]

# Falcon / BLOOM
target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]

# Auto-detect all linear layers (PEFT 0.6.0+)
target_modules = "all-linear"
```

Loading and merging adapters

Load trained adapter

```python
from peft import PeftModel, AutoPeftModelForCausalLM
from transformers import AutoModelForCausalLM

# Option 1: Load with PeftModel
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model = PeftModel.from_pretrained(base_model, "./lora-llama-adapter")

# Option 2: Load directly (recommended)
model = AutoPeftModelForCausalLM.from_pretrained(
    "./lora-llama-adapter",
    device_map="auto",
)
```

Merge adapter into base model

```python
# Merge for deployment (no adapter overhead)
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("./llama-merged")
tokenizer.save_pretrained("./llama-merged")

# Push to Hub
merged_model.push_to_hub("username/llama-finetuned")
```

Multi-adapter serving

```python
from peft import AutoPeftModelForCausalLM

# Load base with first adapter (name it so it can be selected later)
model = AutoPeftModelForCausalLM.from_pretrained("./adapter-task1", adapter_name="task1")

# Load additional adapters
model.load_adapter("./adapter-task2", adapter_name="task2")
model.load_adapter("./adapter-task3", adapter_name="task3")

# Switch between adapters at runtime
model.set_adapter("task1")  # Use task1 adapter
output1 = model.generate(**inputs)
model.set_adapter("task2")  # Switch to task2
output2 = model.generate(**inputs)

# Disable adapters (use the base model)
with model.disable_adapter():
    base_output = model.generate(**inputs)
```

PEFT methods comparison

| Method | Trainable % | Memory | Speed | Best For |
|--------|-------------|--------|-------|----------|
| LoRA | 0.1-1% | Low | Fast | General fine-tuning |
| QLoRA | 0.1-1% | Very Low | Medium | Memory-constrained |
| AdaLoRA | 0.1-1% | Low | Medium | Automatic rank selection |
| IA3 | 0.01% | Minimal | Fastest | Few-shot adaptation |
| Prefix Tuning | 0.1% | Low | Medium | Generation control |
| Prompt Tuning | 0.001% | Minimal | Fast | Simple task adaptation |
| P-Tuning v2 | 0.1% | Low | Medium | NLU tasks |

IA3 (minimal parameters)

```python
from peft import IA3Config

ia3_config = IA3Config(
    target_modules=["q_proj", "v_proj", "k_proj", "down_proj"],
    feedforward_modules=["down_proj"],
)
model = get_peft_model(model, ia3_config)
# Trains only ~0.01% of parameters
```

Prefix Tuning

```python
from peft import PrefixTuningConfig

prefix_config = PrefixTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=20,   # Prepended virtual tokens
    prefix_projection=True,  # Use an MLP projection
)
model = get_peft_model(model, prefix_config)
```

Integration patterns

With TRL (SFTTrainer)

```python
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear")

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="./output", max_seq_length=512),
    train_dataset=dataset,
    peft_config=lora_config,  # Pass LoRA config directly
)
trainer.train()
```

With Axolotl (YAML config)

```yaml
# axolotl config.yaml
adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
lora_target_linear: true  # Target all linear layers
```

With vLLM (inference)

```python
from vllm import LLM
from vllm.lora.request import LoRARequest

# Load the base model with LoRA support
llm = LLM(model="meta-llama/Llama-3.1-8B", enable_lora=True)

# Serve with an adapter (name, ID, path)
outputs = llm.generate(
    prompts,
    lora_request=LoRARequest("adapter1", 1, "./lora-adapter"),
)
```

Performance benchmarks

Memory usage (Llama 3.1 8B)

| Method | GPU Memory | Trainable Params |
|--------|------------|------------------|
| Full fine-tuning | 60+ GB | 8B (100%) |
| LoRA r=16 | 18 GB | 14M (0.17%) |
| QLoRA r=16 | 6 GB | 14M (0.17%) |
| IA3 | 16 GB | 800K (0.01%) |
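A back-of-envelope sketch of where these orders of magnitude come from. The estimates below ignore activations, KV cache, and framework overhead, and assume fp16 weights/grads with two fp32 Adam moments, so they bracket rather than match the measured table figures:

```python
# Rough weight + optimizer-state memory for Llama 3.1 8B (estimates only).
params = 8e9    # base model parameters
adapter = 14e6  # LoRA r=16 trainable parameters

def gib(nbytes):
    return nbytes / 2**30

# Full fine-tuning: fp16 weights + fp16 grads + two fp32 Adam moments
full_ft = gib(params * (2 + 2 + 4 + 4))
# LoRA: frozen fp16 weights; grads and optimizer state only for the adapter
lora = gib(params * 2 + adapter * (2 + 2 + 4 + 4))
# QLoRA: 4-bit base weights (~0.5 byte/param) + the same adapter state
qlora = gib(params * 0.5 + adapter * (2 + 2 + 4 + 4))

print(f"full FT ~{full_ft:.0f} GiB, LoRA ~{lora:.0f} GiB, QLoRA ~{qlora:.0f} GiB")
```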

Training speed (A100 80GB)

| Method | Tokens/sec | vs Full FT |
|--------|------------|------------|
| Full FT | 2,500 | 1x |
| LoRA | 3,200 | 1.3x |
| QLoRA | 2,100 | 0.84x |

Quality (MMLU benchmark)

| Model | Full FT | LoRA | QLoRA |
|-------|---------|------|-------|
| Llama 2-7B | 45.3 | 44.8 | 44.1 |
| Llama 2-13B | 54.8 | 54.2 | 53.5 |

Common issues

CUDA OOM during training

```python
# Solution 1: Enable gradient checkpointing
model.gradient_checkpointing_enable()

# Solution 2: Reduce batch size and increase gradient accumulation
TrainingArguments(per_device_train_batch_size=1, gradient_accumulation_steps=16)

# Solution 3: Use QLoRA (4-bit base weights)
from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
```

Adapter not applying

```python
# Verify the adapter is active
print(model.active_adapters)  # Should show the adapter name

# Check trainable parameters
model.print_trainable_parameters()

# Ensure the model is in training mode
model.train()
```

Quality degradation

```python
# Increase the rank
LoraConfig(r=32, lora_alpha=64)

# Target more modules
target_modules = "all-linear"

# Use more training data and epochs
TrainingArguments(num_train_epochs=5)

# Lower the learning rate
TrainingArguments(learning_rate=1e-4)
```

Best practices

  1. Start with r=8-16, increase if quality insufficient
  2. Use alpha = 2 * rank as starting point
  3. Target attention + MLP layers for best quality/efficiency
  4. Enable gradient checkpointing for memory savings
  5. Save adapters frequently (small files, easy rollback)
  6. Evaluate on held-out data before merging
  7. Use QLoRA for 70B+ models on consumer hardware
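Practices 1-3 above translate directly into a starting configuration. A sketch with the recommended defaults (Llama-style module names assumed; tune from here rather than treating these values as final):

```python
# Starting LoRA settings encoding best practices 1-3 above.
# Use as: LoraConfig(**starter_lora_kwargs)
starter_lora_kwargs = dict(
    task_type="CAUSAL_LM",
    r=16,               # practice 1: start with r=8-16
    lora_alpha=32,      # practice 2: alpha = 2 * rank
    lora_dropout=0.05,
    target_modules=[    # practice 3: attention + MLP layers
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
)
assert starter_lora_kwargs["lora_alpha"] == 2 * starter_lora_kwargs["r"]
```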

References

  • Advanced Usage - DoRA, LoftQ, rank stabilization, custom modules
  • Troubleshooting - Common errors, debugging, optimization

Resources
