# Using Hugging Face Transformers
Transformers is the model-definition framework for state-of-the-art machine learning across text, vision, audio, and multimodal domains. It provides unified APIs for loading pretrained models, running inference, and fine-tuning.
## Table of Contents

- [Core Concepts](#core-concepts)
- [Pipeline API](#pipeline-api)
- [Model Loading](#model-loading)
- [Inference Patterns](#inference-patterns)
- [Fine-tuning with Trainer](#fine-tuning-with-trainer)
- [Working with Modalities](#working-with-modalities)
- [Memory and Performance](#memory-and-performance)
- [Best Practices](#best-practices)
- [References](#references)

## Core Concepts

### The Three Core Classes
Every model in Transformers has three core components:
```python
from transformers import AutoConfig, AutoModel, AutoTokenizer

# Configuration: hyperparameters and architecture settings
config = AutoConfig.from_pretrained("bert-base-uncased")

# Model: the neural network weights
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenizer/Processor: converts inputs to tensors
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
```

### The `from_pretrained` Pattern
All loading uses `from_pretrained()`, which handles downloading, caching, and device placement:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # Automatic device placement
)
```

### Auto Classes
Use task-specific Auto classes for the correct model head:
```python
from transformers import (
    AutoModelForCausalLM,                # Text generation (GPT, Llama)
    AutoModelForSeq2SeqLM,               # Encoder-decoder (T5, BART)
    AutoModelForSequenceClassification,  # Classification
    AutoModelForTokenClassification,     # NER, POS tagging
    AutoModelForQuestionAnswering,       # Extractive QA
    AutoModelForMaskedLM,                # BERT-style masked LM
    AutoModelForImageClassification,     # Vision models
    AutoModelForSpeechSeq2Seq,           # Speech recognition
)
```

## Pipeline API
The `pipeline()` function provides high-level inference with minimal code:

### Text Tasks
```python
from transformers import pipeline

# Text generation
generator = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B")
output = generator("The secret to success is", max_new_tokens=50)

# Text classification
classifier = pipeline("sentiment-analysis")
result = classifier("I love this product!")
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Named entity recognition
ner = pipeline("ner", aggregation_strategy="simple")
entities = ner("Hugging Face is based in New York City.")

# Question answering
qa = pipeline("question-answering")
answer = qa(question="What is the capital?", context="Paris is the capital of France.")

# Summarization
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
summary = summarizer(long_text, max_length=130, min_length=30)

# Translation
translator = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")
result = translator("Hello, how are you?")
```

### Chat/Conversational
```python
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum computing in simple terms."},
]
response = pipe(messages, max_new_tokens=256)
print(response[0]["generated_text"][-1]["content"])
```

### Vision Tasks
```python
from transformers import pipeline

# Image classification
classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
result = classifier("path/to/image.jpg")

# Object detection
detector = pipeline("object-detection", model="facebook/detr-resnet-50")
objects = detector("path/to/image.jpg")

# Image segmentation
segmenter = pipeline("image-segmentation", model="facebook/mask2former-swin-base-coco-panoptic")
masks = segmenter("path/to/image.jpg")
```

### Audio Tasks
```python
from transformers import pipeline

# Speech recognition
transcriber = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
text = transcriber("path/to/audio.mp3")

# Audio classification
classifier = pipeline("audio-classification", model="superb/wav2vec2-base-superb-ks")
result = classifier("path/to/audio.wav")
```

### Multimodal Tasks
```python
from transformers import pipeline

# Visual question answering
vqa = pipeline("visual-question-answering", model="Salesforce/blip-vqa-base")
answer = vqa(image="image.jpg", question="What color is the car?")

# Image-to-text (captioning)
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
caption = captioner("image.jpg")

# Document question answering
doc_qa = pipeline("document-question-answering", model="impira/layoutlm-document-qa")
answer = doc_qa(image="document.png", question="What is the total?")
```

## Model Loading
### Device Placement
```python
import torch
from transformers import AutoModelForCausalLM

# Automatic placement across available devices
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Specific device
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    device_map="cuda:0",
)

# Custom device map for model parallelism
device_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 1,
    "model.norm": 1,
    "lm_head": 1,
}
model = AutoModelForCausalLM.from_pretrained(model_name, device_map=device_map)
```

### Loading from Local Path
```python
# Save model locally
model.save_pretrained("./my_model")
tokenizer.save_pretrained("./my_model")

# Load from local path
model = AutoModelForCausalLM.from_pretrained("./my_model")
tokenizer = AutoTokenizer.from_pretrained("./my_model")
```

### Trust Remote Code
Some models require executing custom code from the Hub:
```python
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    trust_remote_code=True,  # Required for custom architectures
)
```

## Inference Patterns
### Text Generation
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Basic generation
inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# With generation config
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.1,
)
```

### Chat Templates
```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

# Apply chat template
input_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
```

### Getting Embeddings
```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def get_embeddings(texts: list[str]) -> torch.Tensor:
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pooling
    attention_mask = inputs["attention_mask"]
    embeddings = outputs.last_hidden_state
    mask_expanded = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()
    sum_embeddings = (embeddings * mask_expanded).sum(1)
    sum_mask = mask_expanded.sum(1).clamp(min=1e-9)
    return sum_embeddings / sum_mask

embeddings = get_embeddings(["Hello world", "How are you?"])
```
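Pooled embeddings like these are usually compared with cosine similarity. A minimal helper for that (the name `pairwise_cosine` is ours, not a Transformers API; it works on any `(n, d)` float tensor, such as the output of `get_embeddings` above):

```python
import torch
import torch.nn.functional as F

def pairwise_cosine(embeddings: torch.Tensor) -> torch.Tensor:
    """All-pairs cosine similarity for an (n, d) matrix of row embeddings."""
    normalized = F.normalize(embeddings, p=2, dim=1)  # unit-length rows
    return normalized @ normalized.T                   # (n, n) similarity matrix

# Toy vectors stand in for real sentence embeddings here
sims = pairwise_cosine(torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
```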
### Classification
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

inputs = tokenizer("I love this movie!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.softmax(outputs.logits, dim=-1)
labels = model.config.id2label
for idx, prob in enumerate(predictions[0]):
    print(f"{labels[idx]}: {prob:.4f}")
```

## Fine-tuning with Trainer
### Basic Fine-tuning
```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Load data and model
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
)

# Tokenize dataset
def tokenize(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized = dataset.map(tokenize, batched=True)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_steps=100,
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
)
trainer.train()
```

### Pushing to Hub
```python
# Login first: huggingface-cli login

# Push model and tokenizer
model.push_to_hub("my-username/my-fine-tuned-model")
tokenizer.push_to_hub("my-username/my-fine-tuned-model")

# Or use the trainer
trainer.push_to_hub()
```

See `reference/fine-tuning.md` for advanced patterns including LoRA, custom data collators, and evaluation metrics.

## Working with Modalities
### Vision Models
```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")

image = Image.open("image.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

predicted_class = outputs.logits.argmax(-1).item()
print(model.config.id2label[predicted_class])
```

### Audio Models
```python
import torch
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

processor = AutoProcessor.from_pretrained("openai/whisper-large-v3")
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load audio (use librosa, soundfile, or datasets)
import librosa
audio, sr = librosa.load("audio.mp3", sr=16000)

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
inputs = inputs.to(model.device)
generated_ids = model.generate(**inputs)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

### Vision-Language Models
```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_name = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForVision2Seq.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

image = Image.open("image.jpg")
prompt = "USER: <image>\nDescribe this image in detail.\nASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
response = processor.decode(outputs[0], skip_special_tokens=True)
```

## Memory and Performance
### Quantization
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    quantization_config=bnb_config,
    device_map="auto",
)

# 8-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    load_in_8bit=True,
    device_map="auto",
)
```
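A quick way to see why quantization matters: weight-only memory is roughly parameter count times bytes per parameter. A back-of-the-envelope estimator (our own helper, ignoring activations, KV cache, and quantization overhead such as scales):

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate weight-only footprint in GiB: params * bits / 8 bytes."""
    return num_params * bits_per_param / 8 / 1024**3

# A 3B-parameter model at the precisions used above: roughly 5.6 / 2.8 / 1.4 GiB
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gib(3e9, bits):.1f} GiB")
```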
### Flash Attention
```python
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # Requires the flash-attn package
    device_map="auto",
)
```

### torch.compile
```python
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model = torch.compile(model, mode="reduce-overhead")
```

### Batched Inference
```python
texts = ["First prompt", "Second prompt", "Third prompt"]
inputs = tokenizer(texts, return_tensors="pt", padding=True).to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50)
decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

See `reference/advanced-inference.md` for streaming, KV caching, and serving patterns.
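One detail worth knowing before batching a decoder-only model: it should be left-padded, so each prompt sits flush against its generated continuation. A small setup guard (a sketch; `tokenizer` stands for any loaded `AutoTokenizer`, and `pad_token`/`padding_side` are standard tokenizer attributes):

```python
def prepare_for_batched_generation(tokenizer):
    """Configure a tokenizer for batched generation with a decoder-only model."""
    if tokenizer.pad_token is None:
        # Many causal LM tokenizers ship without a pad token; reuse EOS.
        tokenizer.pad_token = tokenizer.eos_token
    # Right padding would put pad tokens between the prompt and the output.
    tokenizer.padding_side = "left"
    return tokenizer
```

Call it once after `AutoTokenizer.from_pretrained` and before any batched `generate` call.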
## Best Practices
- Use bfloat16 over float16: Better numerical stability on modern GPUs
- Set pad token for generation: `tokenizer.pad_token = tokenizer.eos_token`
- Use `device_map="auto"`: Let Accelerate handle device placement
- Enable Flash Attention: Significant speedup for long sequences
- Batch when possible: Amortize fixed costs across multiple inputs
- Use pipeline for quick prototyping: Switch to manual control for production
- Cache models locally: Set the `HF_HOME` environment variable for the model cache location
- Check model license: Verify usage rights before deployment
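The cache-location practice is a one-liner in process setup; `HF_HOME` is typically read when the Hugging Face libraries are first imported, so set it early. A sketch (the path is a placeholder, not a recommendation):

```python
import os

# Set before the first model download; subsequent from_pretrained calls
# read and write this cache directory.
os.environ["HF_HOME"] = "/data/hf-cache"  # placeholder path
```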
## References

See `reference/` for detailed documentation:

- `fine-tuning.md` - Advanced fine-tuning patterns with LoRA, PEFT, and custom training
- `advanced-inference.md` - Generation strategies, streaming, and serving