
# Using LoRA for Fine-tuning

LoRA (Low-Rank Adaptation) enables efficient fine-tuning by freezing pretrained weights and injecting small trainable matrices into transformer layers. This reduces trainable parameters to ~0.1% of the original model while maintaining performance.

## Table of Contents

- Core Concepts
- Basic Setup
- Configuration Parameters
- QLoRA (Quantized LoRA)
- Training Patterns
- Saving and Loading
- Merging Adapters
- Best Practices
- References

## Core Concepts

### How LoRA Works

Instead of updating all weights during fine-tuning, LoRA decomposes weight updates into low-rank matrices:

W' = W + BA

Where:

- **W** is the frozen pretrained weight matrix (d × k)
- **B** is a trainable matrix (d × r)
- **A** is a trainable matrix (r × k)
- **r** is the rank, much smaller than d and k

The key insight: weight updates during fine-tuning have low intrinsic rank, so we can represent them efficiently with smaller matrices.
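The decomposition above can be sketched numerically. Dimensions here are hypothetical, chosen only to illustrate the parameter savings; B starts at zero so the adapted model initially matches the base model, mirroring LoRA's initialization:

```python
import numpy as np

# Toy illustration of the LoRA decomposition W' = W + BA.
d, k, r = 1024, 1024, 16

rng = np.random.default_rng(0)
W = rng.standard_normal((d, k))   # frozen pretrained weight (d x k)
B = np.zeros((d, r))              # trainable, initialized to zero
A = rng.standard_normal((r, k))   # trainable

W_prime = W + B @ A               # effective weight after adaptation
                                  # (equals W at initialization, since B = 0)

full_params = d * k               # parameters updated by full fine-tuning
lora_params = d * r + r * k       # parameters LoRA actually trains
print(f"full: {full_params:,}  lora: {lora_params:,}  "
      f"ratio: {lora_params / full_params:.2%}")
```

With these shapes LoRA trains 32,768 parameters instead of 1,048,576 — about 3% — and the ratio shrinks further as d and k grow while r stays small.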

### Why Use LoRA

| Aspect | Full Fine-tuning | LoRA |
|---|---|---|
| Trainable params | 100% | ~0.1-1% |
| Memory usage | High | Low |
| Adapter size | Full model | ~3-100 MB |
| Training speed | Slower | Faster |
| Multiple tasks | Separate models | Swap adapters |

## Basic Setup

### Installation

```bash
pip install peft transformers accelerate
```

### Minimal Example

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
import torch

# Load base model
model_name = "meta-llama/Llama-3.2-1B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Output:

```
trainable params: 3,407,872 || all params: 1,238,300,672 || trainable%: 0.28%
```
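As a sanity check, the 3,407,872 figure can be reproduced from Llama-3.2-1B's architecture. The shapes below (16 decoder layers, hidden size 2048, grouped-query KV dimension 512) are assumptions from the model's published configuration, not something the PEFT output itself states:

```python
# Reproduce the trainable-parameter count for r=16 on q/k/v/o projections.
r = 16
hidden, kv_dim, layers = 2048, 512, 16  # assumed Llama-3.2-1B shapes

per_layer = 0
for d_in, d_out in [(hidden, hidden),   # q_proj
                    (hidden, kv_dim),   # k_proj
                    (hidden, kv_dim),   # v_proj
                    (hidden, hidden)]:  # o_proj
    # Each adapted layer adds A (r x d_in) and B (d_out x r)
    per_layer += r * d_in + d_out * r

print(layers * per_layer)  # → 3407872
```

Note that k_proj and v_proj contribute fewer parameters than q_proj and o_proj because grouped-query attention shrinks their output dimension.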

## Configuration Parameters

### LoraConfig Options

```python
from peft import LoraConfig, TaskType

config = LoraConfig(
    # Core parameters
    r=16,                          # Rank of update matrices
    lora_alpha=32,                 # Scaling factor (alpha/r applied to updates)
    target_modules=["q_proj", "v_proj"],  # Layers to adapt

    # Regularization
    lora_dropout=0.05,             # Dropout on LoRA layers
    bias="none",                   # "none", "all", or "lora_only"

    # Task configuration
    task_type=TaskType.CAUSAL_LM,  # CAUSAL_LM, SEQ_CLS, SEQ_2_SEQ_LM, TOKEN_CLS

    # Advanced
    modules_to_save=None,          # Additional modules to train (e.g., ["lm_head"])
    layers_to_transform=None,      # Specific layer indices to adapt
    use_rslora=False,              # Rank-stabilized LoRA scaling
    use_dora=False,                # Weight-Decomposed LoRA
)
```
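To build intuition for `lora_alpha` and `use_rslora`, here is a small standalone sketch (not part of PEFT's API) comparing the standard `alpha / r` scaling with the rank-stabilized `alpha / sqrt(r)` variant:

```python
import math

# Standard LoRA scales the update BA by alpha / r, so with the common
# alpha = 2r heuristic the scale stays constant as r grows.
# rsLoRA (use_rslora=True) scales by alpha / sqrt(r) instead, which
# avoids shrinking the effective update at high ranks.
for r, alpha in [(8, 16), (16, 32), (64, 128)]:
    standard = alpha / r
    rs = alpha / math.sqrt(r)
    print(f"r={r:3d} alpha={alpha:3d}  standard={standard:.2f}  rslora={rs:.2f}")
```

Under alpha = 2r, the standard scale is always 2.0, while the rsLoRA scale grows with r — one reason rsLoRA is worth trying when experimenting with ranks above ~32.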

### Target Modules by Architecture

```python
# Llama, Mistral, Qwen
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

# GPT-2, GPT-J
target_modules = ["c_attn", "c_proj", "c_fc"]

# BERT, RoBERTa
target_modules = ["query", "key", "value", "dense"]

# Falcon
target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]

# Phi
target_modules = ["q_proj", "k_proj", "v_proj", "dense", "fc1", "fc2"]
```

### Finding Target Modules

```python
import torch

# Print all linear layer names in the model
def find_target_modules(model):
    linear_modules = set()
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            # Get the last part of the name
            # (e.g., "q_proj" from "model.layers.0.self_attn.q_proj")
            layer_name = name.split(".")[-1]
            linear_modules.add(layer_name)
    return sorted(linear_modules)

print(find_target_modules(model))
```

## QLoRA (Quantized LoRA)

QLoRA combines 4-bit quantization with LoRA, enabling fine-tuning of large models on consumer GPUs.

### Setup

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
import torch

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # Normalized float 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,         # Nested quantization
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare for k-bit training
model = prepare_model_for_kbit_training(model)

# Apply LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
```

### Memory Requirements

| Model Size | Full FT (16-bit) | LoRA (16-bit) | QLoRA (4-bit) |
|---|---|---|---|
| 7B | ~60 GB | ~16 GB | ~6 GB |
| 13B | ~104 GB | ~28 GB | ~10 GB |
| 70B | ~560 GB | ~160 GB | ~48 GB |

## Training Patterns

### With Hugging Face Trainer

```python
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
from datasets import load_dataset

# Prepare dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train")

def format_prompt(example):
    if example["input"]:
        text = (
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    else:
        text = (
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Response:\n{example['output']}"
        )
    return {"text": text}

dataset = dataset.map(format_prompt)

def tokenize(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,
        padding=False,
    )

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# Training arguments (note the higher learning rate)
training_args = TrainingArguments(
    output_dir="./lora-output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,            # Higher than full fine-tuning
    bf16=True,
    logging_steps=10,
    save_steps=500,
    warmup_ratio=0.03,
    gradient_checkpointing=True,
    optim="adamw_torch_fused",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()
```

### With SFTTrainer (TRL)

```python
from trl import SFTTrainer, SFTConfig

sft_config = SFTConfig(
    output_dir="./sft-lora",
    max_seq_length=1024,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
    gradient_checkpointing=True,
)

trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=lora_config,      # Pass config directly; SFTTrainer applies it
    dataset_text_field="text",
)

trainer.train()
```

### Classification Task

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model, TaskType

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["query", "value"],
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.SEQ_CLS,
    modules_to_save=["classifier"],  # Train classification head fully
)

model = get_peft_model(model, lora_config)
```

## Saving and Loading

### Save Adapter

```python
# Save only LoRA weights (small file)
model.save_pretrained("./my-lora-adapter")
tokenizer.save_pretrained("./my-lora-adapter")

# Push to Hub
model.push_to_hub("username/my-lora-adapter")
```

### Load Adapter

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load adapter
model = PeftModel.from_pretrained(base_model, "./my-lora-adapter")

# For inference
model.eval()
```

### Switch Between Adapters

```python
# Load multiple adapters
model.load_adapter("./adapter-1", adapter_name="task1")
model.load_adapter("./adapter-2", adapter_name="task2")

# Switch active adapter
model.set_adapter("task1")
output = model.generate(**inputs)

model.set_adapter("task2")
output = model.generate(**inputs)

# Disable adapter (use base model)
with model.disable_adapter():
    output = model.generate(**inputs)
```

## Merging Adapters

Merge LoRA weights into the base model for deployment without adapter overhead.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    torch_dtype=torch.bfloat16,
    device_map="cpu",  # Merge on CPU to avoid memory issues
)

# Load adapter
model = PeftModel.from_pretrained(base_model, "./my-lora-adapter")

# Merge and unload
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("./merged-model")
tokenizer.save_pretrained("./merged-model")

# Push merged model to Hub
merged_model.push_to_hub("username/my-merged-model")
```
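What merging does to each adapted linear layer can be verified numerically. The sketch below (plain NumPy, hypothetical shapes) folds the scaled low-rank update into the base weight — the core of what `merge_and_unload()` performs per layer — and checks that both paths produce the same output:

```python
import numpy as np

# One linear layer with a LoRA update: adapter path vs merged path.
rng = np.random.default_rng(1)
d, k, r, alpha = 64, 32, 4, 8
W = rng.standard_normal((d, k))   # frozen base weight
A = rng.standard_normal((r, k))   # trained LoRA A
B = rng.standard_normal((d, r))   # trained LoRA B
scale = alpha / r

x = rng.standard_normal(k)

# Adapter path: base output plus the scaled low-rank correction
y_adapter = W @ x + scale * (B @ (A @ x))

# Merged path: a single dense matmul with the folded weight
W_merged = W + scale * (B @ A)
y_merged = W_merged @ x

print(np.allclose(y_adapter, y_merged))  # → True
```

The merged model pays no extra matmuls at inference time, but it loses the ability to swap or disable adapters, so keep the unmerged adapter files around for further training.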

## Best Practices

1. **Start with r=16**: Scale up to 32 or 64 if the model underfits; down to 8 if overfitting or memory-constrained.
2. **Set lora_alpha = 2 × r**: A common heuristic; the effective scaling is alpha/r.
3. **Target all attention and MLP layers**: For best results on LLMs, include the gate/up/down projections:
   ```python
   target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
   ```
4. **Use a higher learning rate**: 2e-4 is typical for LoRA vs 2e-5 for full fine-tuning.
5. **Enable gradient checkpointing**: Reduces memory at the cost of ~20% slower training:
   ```python
   model.gradient_checkpointing_enable()
   ```
6. **Use QLoRA for large models**: Essential for fine-tuning 7B+ models on consumer GPUs.
7. **Keep dropout low**: 0.05 is usually sufficient; higher values may hurt performance.
8. **Save checkpoints frequently**: LoRA adapters are small, so save often.
9. **Evaluate on the base model too**: Ensure the adapter doesn't degrade base capabilities.
10. **Consider modules_to_save for task heads**: For classification, train the classifier fully:
    ```python
    modules_to_save=["classifier", "score"]
    ```

## References

See reference/ for detailed documentation:

- advanced-techniques.md - DoRA, rsLoRA, adapter composition, and debugging