# Using Hugging Face Transformers


Transformers is the model-definition framework for state-of-the-art machine learning across text, vision, audio, and multimodal domains. It provides unified APIs for loading pretrained models, running inference, and fine-tuning.

## Table of Contents

- Core Concepts
- Pipeline API
- Model Loading
- Inference Patterns
- Fine-tuning with Trainer
- Working with Modalities
- Memory and Performance
- Best Practices
- References

## Core Concepts

### The Three Core Classes

Every model in Transformers has three core components:

```python
from transformers import AutoConfig, AutoModel, AutoTokenizer

# Configuration: hyperparameters and architecture settings
config = AutoConfig.from_pretrained("bert-base-uncased")

# Model: the neural network weights
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenizer/Processor: converts inputs to tensors
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
```
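To see how the three components fit together, here is a quick sanity check that runs the objects loaded above through one forward pass (the example sentence is arbitrary):

```python
import torch

# Tokenize, run the model, and confirm the config describes the output shape
inputs = tokenizer("Transformers is great!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(config.hidden_size)               # 768 for bert-base-uncased
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)
```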

### The `from_pretrained` Pattern

All loading uses `from_pretrained()`, which handles downloading, caching, and device placement:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # Automatic device placement
)
```

### Auto Classes

Use task-specific Auto classes to get the correct model head:

```python
from transformers import (
    AutoModelForCausalLM,          # Text generation (GPT, Llama)
    AutoModelForSeq2SeqLM,         # Encoder-decoder (T5, BART)
    AutoModelForSequenceClassification,  # Classification
    AutoModelForTokenClassification,     # NER, POS tagging
    AutoModelForQuestionAnswering,       # Extractive QA
    AutoModelForMaskedLM,                # BERT-style masked LM
    AutoModelForImageClassification,     # Vision models
    AutoModelForSpeechSeq2Seq,           # Speech recognition
)
```

## Pipeline API

The `pipeline()` function provides high-level inference with minimal code.

### Text Tasks

```python
from transformers import pipeline

# Text generation
generator = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B")
output = generator("The secret to success is", max_new_tokens=50)

# Text classification
classifier = pipeline("sentiment-analysis")
result = classifier("I love this product!")
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Named entity recognition
ner = pipeline("ner", aggregation_strategy="simple")
entities = ner("Hugging Face is based in New York City.")

# Question answering
qa = pipeline("question-answering")
answer = qa(question="What is the capital?", context="Paris is the capital of France.")

# Summarization
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
summary = summarizer(long_text, max_length=130, min_length=30)

# Translation
translator = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")
result = translator("Hello, how are you?")
```

### Chat/Conversational

```python
from transformers import pipeline
import torch

pipe = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum computing in simple terms."},
]

response = pipe(messages, max_new_tokens=256)
print(response[0]["generated_text"][-1]["content"])
```
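Because `generated_text` holds the full message list (under the pipeline's default `return_full_text` behavior), a follow-up turn is just an append and another call; the follow-up question below is illustrative:

```python
# Continue the conversation: reuse the returned messages and append a new user turn
messages = response[0]["generated_text"]
messages.append({"role": "user", "content": "Can you give a concrete example?"})
response = pipe(messages, max_new_tokens=256)
print(response[0]["generated_text"][-1]["content"])
```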

### Vision Tasks

```python
from transformers import pipeline

# Image classification
classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
result = classifier("path/to/image.jpg")

# Object detection
detector = pipeline("object-detection", model="facebook/detr-resnet-50")
objects = detector("path/to/image.jpg")

# Image segmentation
segmenter = pipeline("image-segmentation", model="facebook/mask2former-swin-base-coco-panoptic")
masks = segmenter("path/to/image.jpg")
```

### Audio Tasks

```python
from transformers import pipeline

# Speech recognition
transcriber = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
text = transcriber("path/to/audio.mp3")

# Audio classification
classifier = pipeline("audio-classification", model="superb/wav2vec2-base-superb-ks")
result = classifier("path/to/audio.wav")
```

### Multimodal Tasks

```python
from transformers import pipeline

# Visual question answering
vqa = pipeline("visual-question-answering", model="Salesforce/blip-vqa-base")
answer = vqa(image="image.jpg", question="What color is the car?")

# Image-to-text (captioning)
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
caption = captioner("image.jpg")

# Document question answering
doc_qa = pipeline("document-question-answering", model="impira/layoutlm-document-qa")
answer = doc_qa(image="document.png", question="What is the total?")
```

## Model Loading

### Device Placement

```python
from transformers import AutoModelForCausalLM
import torch

# Automatic placement across available devices
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Specific device
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    device_map="cuda:0",
)

# Custom device map for model parallelism
device_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 1,
    "model.norm": 1,
    "lm_head": 1,
}
model = AutoModelForCausalLM.from_pretrained(model_name, device_map=device_map)
```
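When sharding with `device_map="auto"`, you can also cap how much memory Accelerate may use per device via `max_memory`; the limits below are placeholder values for a single-GPU setup:

```python
# Hypothetical memory caps: layers that exceed the GPU budget spill to CPU
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    max_memory={0: "10GiB", "cpu": "30GiB"},
)
```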

### Loading from Local Path

```python
# Save model locally
model.save_pretrained("./my_model")
tokenizer.save_pretrained("./my_model")

# Load from local path
model = AutoModelForCausalLM.from_pretrained("./my_model")
tokenizer = AutoTokenizer.from_pretrained("./my_model")
```

### Trust Remote Code

Some models require executing custom code from the Hub:

```python
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    trust_remote_code=True,  # Required for custom architectures
)
```
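Since `trust_remote_code=True` executes whatever code the repo currently ships, a reasonable precaution is to pin the `revision` you have reviewed; the hash below is a placeholder:

```python
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    trust_remote_code=True,
    revision="abc123",  # hypothetical commit hash of an audited revision
)
```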

## Inference Patterns

### Text Generation

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Qwen/Qwen2.5-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Basic generation
inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# With generation config
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.1,
)
```

### Chat Templates

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

# Apply chat template
input_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
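Note that `generate()` returns the prompt plus the continuation for decoder-only models, so the decoded `response` above still contains the rendered chat template. A small sketch to keep only the new tokens:

```python
# Slice off the prompt tokens before decoding
prompt_len = inputs["input_ids"].shape[-1]
response = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
```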

### Getting Embeddings

```python
from transformers import AutoModel, AutoTokenizer
import torch

model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def get_embeddings(texts: list[str]) -> torch.Tensor:
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    # Mean pooling
    attention_mask = inputs["attention_mask"]
    embeddings = outputs.last_hidden_state
    mask_expanded = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()
    sum_embeddings = (embeddings * mask_expanded).sum(1)
    sum_mask = mask_expanded.sum(1).clamp(min=1e-9)
    return sum_embeddings / sum_mask

embeddings = get_embeddings(["Hello world", "How are you?"])
```
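These embeddings are typically compared with cosine similarity; a short usage sketch on the two sentences above:

```python
import torch.nn.functional as F

# L2-normalize, then a dot product gives cosine similarity
normalized = F.normalize(embeddings, p=2, dim=1)
similarity = normalized @ normalized.T
print(similarity[0, 1].item())  # closer to 1.0 = more semantically similar
```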

### Classification

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

inputs = tokenizer("I love this movie!", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.softmax(outputs.logits, dim=-1)

labels = model.config.id2label
for idx, prob in enumerate(predictions[0]):
    print(f"{labels[idx]}: {prob:.4f}")
```

## Fine-tuning with Trainer

### Basic Fine-tuning

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Load data and model
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
)

# Tokenize dataset
def tokenize(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized = dataset.map(tokenize, batched=True)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_steps=100,
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
)
trainer.train()
```
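With `load_best_model_at_end=True`, the Trainer selects checkpoints by eval loss unless you supply a metric. A minimal sketch of a `compute_metrics` hook, assuming plain accuracy is what you want to track:

```python
import numpy as np

def compute_metrics(eval_pred):
    # eval_pred unpacks to (logits, labels)
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}

# Pass compute_metrics=compute_metrics to the Trainer above (and optionally set
# metric_for_best_model="accuracy" in TrainingArguments).
```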

### Pushing to Hub

```python
# Login first: huggingface-cli login

# Push model and tokenizer
model.push_to_hub("my-username/my-fine-tuned-model")
tokenizer.push_to_hub("my-username/my-fine-tuned-model")

# Or use the trainer
trainer.push_to_hub()
```

See `reference/fine-tuning.md` for advanced patterns including LoRA, custom data collators, and evaluation metrics.

## Working with Modalities

### Vision Models

```python
from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image
import torch

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")

image = Image.open("image.jpg")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    predicted_class = outputs.logits.argmax(-1).item()
    print(model.config.id2label[predicted_class])
```

### Audio Models

```python
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
import torch
import librosa

processor = AutoProcessor.from_pretrained("openai/whisper-large-v3")
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load audio (librosa, soundfile, or datasets all work)
audio, sr = librosa.load("audio.mp3", sr=16000)

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
inputs = inputs.to(model.device, dtype=torch.float16)  # match the model's dtype

generated_ids = model.generate(**inputs)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

### Vision-Language Models

```python
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import torch

model_name = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForVision2Seq.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

image = Image.open("image.jpg")
prompt = "USER: <image>\nDescribe this image in detail.\nASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
response = processor.decode(outputs[0], skip_special_tokens=True)
```

## Memory and Performance

### Quantization

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    quantization_config=bnb_config,
    device_map="auto",
)

# 8-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
```
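To verify what quantization buys you, `get_memory_footprint()` reports the in-memory size of the loaded weights:

```python
# Compare this figure between the full-precision and quantized loads
print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
```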

### Flash Attention

```python
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # Requires the flash-attn package
    device_map="auto",
)
```

### torch.compile

```python
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model = torch.compile(model, mode="reduce-overhead")
```

### Batched Inference

```python
# Decoder-only models need a pad token and left padding for batched generation
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

texts = ["First prompt", "Second prompt", "Third prompt"]
inputs = tokenizer(texts, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50)

decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

See `reference/advanced-inference.md` for streaming, KV caching, and serving patterns.
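For a quick taste of streaming before reaching for the reference doc, `TextStreamer` prints tokens as they are generated:

```python
from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)

# Tokens are written to stdout incrementally as generate() produces them
model.generate(**inputs, max_new_tokens=100, streamer=streamer)
```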

## Best Practices

1. Use bfloat16 over float16: better numerical stability on modern GPUs
2. Set a pad token for generation: `tokenizer.pad_token = tokenizer.eos_token`
3. Use `device_map="auto"`: let Accelerate handle device placement
4. Enable Flash Attention: significant speedup for long sequences
5. Batch when possible: amortize fixed costs across multiple inputs
6. Use `pipeline` for quick prototyping: switch to manual control for production
7. Cache models locally: set the `HF_HOME` environment variable to control the cache location
8. Check model license: verify usage rights before deployment

Several of these combine when loading a model, as in the sketch below.
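A minimal loading sketch applying practices 1-3 and 7 together; the cache path is a placeholder:

```python
# Practice 7: point HF_HOME at a persistent cache *before* launching Python,
# e.g. `export HF_HOME=/data/hf-cache` (path is hypothetical)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-3B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # practice 2: pad token for generation

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # practice 1: bfloat16 on modern GPUs
    device_map="auto",           # practice 3: Accelerate handles placement
)
```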

## References

See `reference/` for detailed documentation:

- `fine-tuning.md` - advanced fine-tuning patterns with LoRA, PEFT, and custom training
- `advanced-inference.md` - generation strategies, streaming, and serving