# Using LoRA for Fine-tuning
LoRA (Low-Rank Adaptation) enables efficient fine-tuning by freezing pretrained weights and injecting small trainable matrices into transformer layers. This reduces trainable parameters to ~0.1% of the original model while maintaining performance.
## Table of Contents

- [Core Concepts](#core-concepts)
- [Basic Setup](#basic-setup)
- [Configuration Parameters](#configuration-parameters)
- [QLoRA (Quantized LoRA)](#qlora-quantized-lora)
- [Training Patterns](#training-patterns)
- [Saving and Loading](#saving-and-loading)
- [Merging Adapters](#merging-adapters)
- [Best Practices](#best-practices)
- [References](#references)

## Core Concepts

### How LoRA Works
Instead of updating all weights during fine-tuning, LoRA decomposes weight updates into low-rank matrices:

```
W' = W + BA
```

Where:

- **W** is the frozen pretrained weight matrix (d × k)
- **B** is a trainable matrix (d × r)
- **A** is a trainable matrix (r × k)
- **r** is the rank, much smaller than d and k

The key insight: weight updates during fine-tuning have low intrinsic rank, so we can represent them efficiently with smaller matrices.
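The decomposition above can be sketched as a minimal LoRA layer in plain PyTorch. This is an illustrative sketch only, not the PEFT implementation: the class name and initialization scheme are assumptions, though the zero-init of B (so BA = 0 at the start of training) matches standard LoRA practice.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update BA."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze pretrained W
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # (r × k)
        self.B = nn.Parameter(torch.zeros(d, r))         # (d × r), zero-init so BA = 0
        self.scaling = alpha / r

    def forward(self, x):
        # W'x = Wx + (alpha/r) * B(Ax)
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} / {total}")  # 8192 trainable of 270848 total
```

Because B starts at zero, the adapted layer is exactly equivalent to the frozen base layer before any training step.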
### Why Use LoRA
| Aspect | Full Fine-tuning | LoRA |
|---|---|---|
| Trainable params | 100% | ~0.1-1% |
| Memory usage | High | Low |
| Adapter size | Full model | ~3-100 MB |
| Training speed | Slower | Faster |
| Multiple tasks | Separate models | Swap adapters |
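The trainable-parameter row follows directly from the matrix shapes. A back-of-envelope check, using a hypothetical 4096 × 4096 projection with r = 16:

```python
d, k, r = 4096, 4096, 16

full = d * k          # full fine-tuning updates the entire weight matrix
lora = d * r + r * k  # LoRA trains only B (d × r) and A (r × k)

print(full)                  # 16777216
print(lora)                  # 131072
print(f"{lora / full:.2%}")  # 0.78%
```

Larger ranks or more target modules push the percentage toward the upper end of the ~0.1-1% range.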
## Basic Setup

### Installation

```bash
pip install peft transformers accelerate
```

### Minimal Example
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
import torch

# Load base model
model_name = "meta-llama/Llama-3.2-1B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

```
trainable params: 3,407,872 || all params: 1,238,300,672 || trainable%: 0.28%
```
## Configuration Parameters

### LoraConfig Options
```python
from peft import LoraConfig, TaskType

config = LoraConfig(
    # Core parameters
    r=16,                                 # Rank of update matrices
    lora_alpha=32,                        # Scaling factor (alpha/r applied to updates)
    target_modules=["q_proj", "v_proj"],  # Layers to adapt
    # Regularization
    lora_dropout=0.05,                    # Dropout on LoRA layers
    bias="none",                          # "none", "all", or "lora_only"
    # Task configuration
    task_type=TaskType.CAUSAL_LM,         # CAUSAL_LM, SEQ_CLS, SEQ_2_SEQ_LM, TOKEN_CLS
    # Advanced
    modules_to_save=None,                 # Additional modules to train (e.g., ["lm_head"])
    layers_to_transform=None,             # Specific layer indices to adapt
    use_rslora=False,                     # Rank-stabilized LoRA scaling
    use_dora=False,                       # Weight-Decomposed LoRA
)
```

### Target Modules by Architecture
```python
# Llama, Mistral, Qwen
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

# GPT-2, GPT-J
target_modules = ["c_attn", "c_proj", "c_fc"]

# BERT, RoBERTa
target_modules = ["query", "key", "value", "dense"]

# Falcon
target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]

# Phi
target_modules = ["q_proj", "k_proj", "v_proj", "dense", "fc1", "fc2"]
```

### Finding Target Modules
```python
import torch

# Print all unique linear-layer names
def find_target_modules(model):
    linear_modules = set()
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            # Get the last part of the name (e.g., "q_proj" from "model.layers.0.self_attn.q_proj")
            layer_name = name.split(".")[-1]
            linear_modules.add(layer_name)
    return sorted(linear_modules)

print(find_target_modules(model))
```
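As a sanity check, the helper can be exercised against a toy module (the `ToyBlock` class and its layer names are made up purely for illustration; the helper is restated inline so the snippet runs standalone):

```python
import torch.nn as nn

def find_target_modules(model):
    # Same helper as above: collect unique Linear-layer leaf names
    return sorted({name.split(".")[-1]
                   for name, module in model.named_modules()
                   if isinstance(module, nn.Linear)})

class ToyBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.q_proj = nn.Linear(64, 64)
        self.v_proj = nn.Linear(64, 64)
        self.up_proj = nn.Linear(64, 256)

toy = nn.Sequential(ToyBlock(), ToyBlock())
print(find_target_modules(toy))  # ['q_proj', 'up_proj', 'v_proj']
```

Repeated layers across blocks collapse to one entry, which is exactly the form `target_modules` expects.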
## QLoRA (Quantized LoRA)
QLoRA combines 4-bit quantization with LoRA, enabling fine-tuning of large models on consumer GPUs.
### Setup
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
import torch

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # Normalized float 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,         # Nested quantization
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare for k-bit training
model = prepare_model_for_kbit_training(model)

# Apply LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
```
### Memory Requirements
| Model Size | Full FT (16-bit) | LoRA (16-bit) | QLoRA (4-bit) |
|---|---|---|---|
| 7B | ~60 GB | ~16 GB | ~6 GB |
| 13B | ~104 GB | ~28 GB | ~10 GB |
| 70B | ~560 GB | ~160 GB | ~48 GB |
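The table's orders of magnitude can be reproduced with a rough bytes-per-parameter estimate. The byte counts below are simplifying assumptions (bf16 weights/gradients/optimizer states, ~1% trainable params for the adapter cases); activations and framework overhead account for the remaining gap to the table:

```python
params = 7e9  # 7B model

full_ft = params * (2 + 2 + 4) / 1e9                # bf16 weights + grads + Adam moments
lora    = (params * 2 + 0.01 * params * 8) / 1e9    # frozen bf16 base + ~1% trainable
qlora   = (params * 0.5 + 0.01 * params * 8) / 1e9  # 4-bit base + ~1% trainable

print(f"full: ~{full_ft:.0f} GB, lora: ~{lora:.0f} GB, qlora: ~{qlora:.0f} GB")
```

The dominant term shifts from optimizer state (full fine-tuning) to the frozen base weights (LoRA), which 4-bit quantization then shrinks by another 4x.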
## Training Patterns

### With Hugging Face Trainer
```python
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
from datasets import load_dataset

# Prepare dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train")

def format_prompt(example):
    if example["input"]:
        text = f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
    else:
        text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
    return {"text": text}

dataset = dataset.map(format_prompt)

def tokenize(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,
        padding=False,
    )

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# Training arguments (note the higher learning rate)
training_args = TrainingArguments(
    output_dir="./lora-output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,  # Higher than full fine-tuning
    bf16=True,
    logging_steps=10,
    save_steps=500,
    warmup_ratio=0.03,
    gradient_checkpointing=True,
    optim="adamw_torch_fused",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
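The prompt template above has two branches, one with and one without an `### Input:` section; it can be exercised standalone with made-up records (the function is restated inline so the snippet runs by itself):

```python
def format_prompt(example):
    # Same Alpaca-style template as above
    if example["input"]:
        text = f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
    else:
        text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
    return {"text": text}

with_input = format_prompt(
    {"instruction": "Translate to French.", "input": "Hello", "output": "Bonjour"}
)
no_input = format_prompt(
    {"instruction": "Say hi.", "input": "", "output": "Hi!"}
)

print(with_input["text"].count("###"))  # 3 section headers
print(no_input["text"].count("###"))    # 2 (the Input section is omitted)
```

Verifying the template before training is cheap insurance: a malformed prompt format silently degrades the fine-tuned model.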
### With SFTTrainer (TRL)
```python
from trl import SFTTrainer, SFTConfig

sft_config = SFTConfig(
    output_dir="./sft-lora",
    max_seq_length=1024,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
    gradient_checkpointing=True,
)

trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=lora_config,  # Pass config directly; SFTTrainer applies it
    dataset_text_field="text",
)
trainer.train()
```

### Classification Task
```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model, TaskType

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["query", "value"],
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.SEQ_CLS,
    modules_to_save=["classifier"],  # Train classification head fully
)
model = get_peft_model(model, lora_config)
```

## Saving and Loading
### Save Adapter

```python
# Save only LoRA weights (small file)
model.save_pretrained("./my-lora-adapter")
tokenizer.save_pretrained("./my-lora-adapter")

# Push to Hub
model.push_to_hub("username/my-lora-adapter")
```

### Load Adapter
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM
import torch

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load adapter
model = PeftModel.from_pretrained(base_model, "./my-lora-adapter")

# For inference
model.eval()
```

### Switch Between Adapters
```python
# Load multiple adapters
model.load_adapter("./adapter-1", adapter_name="task1")
model.load_adapter("./adapter-2", adapter_name="task2")

# Switch the active adapter
model.set_adapter("task1")
output = model.generate(**inputs)
model.set_adapter("task2")
output = model.generate(**inputs)

# Disable adapters (use the base model)
with model.disable_adapter():
    output = model.generate(**inputs)
```

## Merging Adapters
Merge LoRA weights into the base model for deployment without adapter overhead.

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM
import torch

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    torch_dtype=torch.bfloat16,
    device_map="cpu",  # Merge on CPU to avoid memory issues
)

# Load adapter
model = PeftModel.from_pretrained(base_model, "./my-lora-adapter")

# Merge and unload
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("./merged-model")
tokenizer.save_pretrained("./merged-model")

# Push merged model to Hub
merged_model.push_to_hub("username/my-merged-model")
```
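Numerically, merging folds the scaled low-rank product into the base weight once, after which the adapter matrices are no longer needed. A plain-tensor sketch (illustrative shapes, not PEFT internals):

```python
import torch

d, k, r, alpha = 64, 64, 8, 16
W = torch.randn(d, k)         # frozen base weight
B = torch.randn(d, r) * 0.01  # trained LoRA factors
A = torch.randn(r, k) * 0.01
scaling = alpha / r

# Adapter-style inference: Wx + scaling * B(Ax)
x = torch.randn(k)
adapter_out = W @ x + scaling * (B @ (A @ x))

# Merged inference: fold the update into a single matrix
W_merged = W + scaling * (B @ A)
merged_out = W_merged @ x

print(torch.allclose(adapter_out, merged_out, atol=1e-5))  # True
```

Both paths compute the same function (up to float rounding); the merged form just trades adapter swappability for one fewer matmul per layer.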
## Best Practices
- **Start with r=16**: Scale up to 32 or 64 if the model underfits; down to 8 if overfitting or memory-constrained
- **Set lora_alpha = 2 × r**: This is a common heuristic; the effective scaling is `alpha/r`
- **Target all attention and MLP layers**: For best results on LLMs, include gate/up/down projections: `target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]`
- **Use a higher learning rate**: 2e-4 is typical for LoRA vs 2e-5 for full fine-tuning
- **Enable gradient checkpointing**: Reduces memory at the cost of ~20% slower training: `model.gradient_checkpointing_enable()`
- **Use QLoRA for large models**: Essential for fine-tuning 7B+ models on consumer GPUs
- **Keep dropout low**: 0.05 is usually sufficient; higher values may hurt performance
- **Save checkpoints frequently**: LoRA adapters are small, so save often
- **Evaluate on the base model too**: Ensure the adapter doesn't degrade base capabilities
- **Consider modules_to_save for task heads**: For classification, train the classifier fully: `modules_to_save=["classifier", "score"]`
## References

See `reference/` for detailed documentation:

- `advanced-techniques.md` - DoRA, rsLoRA, adapter composition, and debugging