Hugging Face Transformers - Modern AI Models

Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models. It reduces compute costs and carbon footprint by allowing researchers to reuse models instead of training from scratch.

When to Use

  • Natural Language Processing (Summarization, Translation, Named Entity Recognition).
  • Scientific Sequence Analysis (Protein folding, DNA/RNA sequence modeling).
  • Chemical Property Prediction (Using molecular strings like SMILES).
  • Computer Vision (Vision Transformers - ViT, Image Classification).
  • Time Series Forecasting with foundation models.
  • Fine-tuning Large Language Models (LLMs) on domain-specific scientific literature.
  • Multimodal tasks (Document AI, Visual Question Answering).

Reference Documentation

Official docs: https://huggingface.co/docs/transformers/
Model Hub: https://huggingface.co/models
Search patterns: `pipeline`, `AutoModel`, `AutoTokenizer`, `Trainer`, `PEFT` (Parameter-Efficient Fine-Tuning)

Core Principles

The "Auto" Classes

Hugging Face uses "Auto" classes (`AutoModel`, `AutoTokenizer`) that automatically infer the correct architecture from the model name or path. This makes code highly portable.
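As a minimal sketch of that portability (assuming network access to the Hugging Face Hub), the same call resolves to the right tokenizer class for two different architectures:

```python
from transformers import AutoTokenizer

# The same line of code yields a different tokenizer class per checkpoint:
# AutoTokenizer reads each checkpoint's config and picks the matching architecture.
for name in ["bert-base-uncased", "distilbert-base-uncased"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, "->", type(tok).__name__)
# bert-base-uncased -> BertTokenizerFast
# distilbert-base-uncased -> DistilBertTokenizerFast
```

Swapping in a new checkpoint name is all it takes to switch architectures; no other code changes.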

Tokenization

Before data enters a model, it must be converted into numerical tokens. The tokenizer handles this, including padding, truncation, and special tokens (like `[CLS]` and `[SEP]`).
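A quick sketch of what this looks like in practice (assumes the `bert-base-uncased` vocabulary can be downloaded):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The tokenizer lowercases the text, maps words to vocabulary ids,
# and wraps the sequence in the model's special tokens.
encoded = tokenizer("Hello world")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'hello', 'world', '[SEP]']
```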

Pipelines

The simplest way to use a model. A pipeline abstracts away tokenization, model execution, and post-processing into a single `pipe(data)` call.

Quick Reference

Installation

```bash
pip install transformers datasets tokenizers
```

Requires a backend (PyTorch or JAX)

```bash
pip install torch
```

Standard Imports

```python
from transformers import pipeline, AutoModel, AutoTokenizer, TrainingArguments, Trainer
import torch
```

Basic Pattern - Using a Pretrained Pipeline

```python
from transformers import pipeline

# 1. Initialize a pipeline (automatically downloads the model)
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

# 2. Run inference
results = classifier("The molecular structure of this compound is fascinating.")
print(results)
```

Critical Rules

✅ DO

  • Use the Auto Classes - Always prefer `AutoTokenizer.from_pretrained()` and `AutoModel.from_pretrained()` for flexibility.
  • Set the Device - Explicitly set `device=0` (for CUDA) or `device="mps"` (for Mac) in pipelines to ensure GPU acceleration.
  • Cache Models - Models are large. Use the `HF_HOME` environment variable to control where models are stored on disk.
  • Handle Truncation - Most models have a maximum sequence length (usually 512). Always pass `truncation=True` to tokenizers.
  • Use the Datasets Library - For training, use the `datasets` library to handle data loading and streaming without filling RAM.
  • Save Tokenizers with Models - When fine-tuning, always save the tokenizer alongside the model to ensure consistency.
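The last point can be sketched as follows; `./my-finetuned-model` is a hypothetical output directory, and the stock `bert-base-uncased` checkpoint stands in for a fine-tuned model:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Save both to the same directory so they can never drift apart;
# from_pretrained(save_dir) later restores the matching pair.
save_dir = "./my-finetuned-model"  # hypothetical output directory
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
```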

❌ DON'T

  • Load Models in a Loop - Loading a model takes seconds and gigabytes of RAM. Load once, reuse many times.
  • Upload Private Data - Be careful with models that might send data to an API (though transformers mostly executes locally).
  • Ignore Padding - For batch processing, pass `padding=True` so all sequences in the batch have the same length.
  • Use the Wrong Model for the Task - A "BERT" model is for understanding; "GPT" is for generation. Use the right architecture.

Anti-Patterns (NEVER)

```python
from transformers import AutoModel, AutoTokenizer

# ❌ BAD: Re-initializing the model inside a frequently called function
def get_prediction_bad(text):
    model = AutoModel.from_pretrained("bert-base-uncased")  # ❌ slow and RAM-heavy
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    return model(**tokenizer(text, return_tensors="pt"))

# ✅ GOOD: Load once globally (or in a class), then reuse
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def get_prediction(text):
    return model(**tokenizer(text, return_tensors="pt"))

# ❌ BAD: Manual string splitting for "tokens"
tokens = text.split(" ")  # ❌ not compatible with the model vocabulary

# ✅ GOOD: Use the model's own tokenizer
inputs = tokenizer(text, return_tensors="pt")

# ❌ BAD: Forgetting to move the model to the GPU
model = AutoModel.from_pretrained("...")
output = model(inputs.to("cuda"))  # ❌ error: the model is still on the CPU!
```

Tokenization Deep Dive

Preparing Data for Models

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

texts = ["Science is cool.", "Quantum physics is hard."]

# Batch encoding
inputs = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt",  # returns PyTorch tensors
)
print(inputs["input_ids"].shape)  # (batch_size, seq_len)
```

The Trainer API

Simplified Training Loop

```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    evaluation_strategy="epoch",
    logging_dir="./logs",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
)

trainer.train()
```

Scientific Applications

1. Protein Sequence Analysis (ESM)

```python
import torch
from transformers import AutoModel, AutoTokenizer

# ESM-2 is a powerful protein language model
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = AutoModel.from_pretrained("facebook/esm2_t6_8M_UR50D")

protein_seq = "MAPLRKTYLLG"
inputs = tokenizer(protein_seq, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    # 'last_hidden_state' holds a feature vector for each amino acid
    embeddings = outputs.last_hidden_state
```

2. Chemical Property Prediction (SMILES)

```python
from transformers import pipeline

# Using a model trained on molecular strings
pipe = pipeline("text-classification", model="seyonec/ChemBERTa-zinc-base-v1")

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
result = pipe(smiles)
print(f"Prediction: {result}")
```

3. Named Entity Recognition (NER) for Papers

```python
from transformers import pipeline

# Extracting genes, proteins, or chemicals from text
ner_pipe = pipeline("ner", model="dslim/bert-base-NER")
text = "The expression of the BRCA1 gene was observed in the sample."
entities = ner_pipe(text)
```

Performance and Efficiency

1. Quantization (bitsandbytes)

Running large models on consumer GPUs by reducing precision (8-bit or 4-bit).

```python
from transformers import AutoModel, BitsAndBytesConfig

# Load the model in 4-bit precision
quant_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModel.from_pretrained("model_name", quantization_config=quant_config)
```

2. Using pipeline with GPU

```python
from transformers import pipeline

# 'device=0' targets the first CUDA device
pipe = pipeline("translation_en_to_fr", model="t5-base", device=0)
```

Common Pitfalls and Solutions

"Out of Memory" (OOM) on GPU

❌ Problem: The batch size is too large for GPU RAM.

✅ Solutions:

1. Reduce `per_device_train_batch_size`.
2. Use `gradient_accumulation_steps` to keep the effective batch size.
3. Set `fp16=True` in `TrainingArguments`.

Model Output is a Dictionary, not a Tensor

❌ Problem: `outputs[0]` works, but is confusing.

✅ Solution: Access fields by name.

```python
outputs = model(**inputs)
hidden_states = outputs.last_hidden_state
```

Slow Tokenization

✅ Solution: Use "Fast" tokenizers (written in Rust, usually the default).

```python
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
```

Hugging Face Transformers has democratized AI for the scientific community. By providing a unified interface to the world's most powerful models, it allows researchers to spend less time on engineering and more time on discovering insights from data.