Hugging Face Transformers - Modern AI Models
Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models. It reduces compute costs and carbon footprint by allowing researchers to reuse models instead of training from scratch.
When to Use
- Natural Language Processing (Summarization, Translation, Named Entity Recognition).
- Scientific Sequence Analysis (Protein folding, DNA/RNA sequence modeling).
- Chemical Property Prediction (Using molecular strings like SMILES).
- Computer Vision (Vision Transformers - ViT, Image Classification).
- Time Series Forecasting with foundation models.
- Fine-tuning Large Language Models (LLMs) on domain-specific scientific literature.
- Multimodal tasks (Document AI, Visual Question Answering).
Reference Documentation
Official docs: https://huggingface.co/docs/transformers/
Model Hub: https://huggingface.co/models
Search patterns: `pipeline`, `AutoModel`, `AutoTokenizer`, `Trainer`, `PEFT` (Parameter-Efficient Fine-Tuning)
Core Principles
The "Auto" Classes
Hugging Face uses "Auto" classes (`AutoModel`, `AutoTokenizer`) that automatically infer the correct architecture from the model name/path. This makes code highly portable.
Tokenization
Before data enters a model, it must be converted into numerical tokens. The Tokenizer handles this, including padding, truncation, and special tokens (like `[CLS]` and `[SEP]`).
Pipelines
The simplest way to use a model. It abstracts away tokenization, model execution, and post-processing into a single `pipe(data)` call.
Quick Reference
Installation
```bash
pip install transformers datasets tokenizers

# Requires a backend (PyTorch or JAX)
pip install torch
```
Standard Imports
```python
from transformers import pipeline, AutoModel, AutoTokenizer, TrainingArguments, Trainer
import torch
```
Basic Pattern - Using a Pretrained Pipeline
```python
from transformers import pipeline

# 1. Initialize a pipeline (automatically downloads model)
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

# 2. Run inference
results = classifier("The molecular structure of this compound is fascinating.")
print(results)
```
Critical Rules
✅ DO
- Use the Auto Classes - Always prefer `AutoModel.from_pretrained()` and `AutoTokenizer.from_pretrained()` for flexibility.
- Set the Device - Explicitly set `device=0` (for CUDA) or `device="mps"` (for Mac) in pipelines to ensure GPU acceleration.
- Cache Models - Models are large. Use the `HF_HOME` environment variable to manage where models are stored on disk.
- Handle Truncation - Most models have a maximum sequence length (usually 512). Always use `truncation=True` in tokenizers.
- Use Datasets Library - For training, use the `datasets` library to handle data loading and streaming without filling RAM.
- Save Tokenizers with Models - When fine-tuning, always save the tokenizer alongside the model to ensure consistency.
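As a toy illustration of what `padding=True` and `truncation=True` do to a batch (plain Python only, no tokenizer download; the pad id `0` and the token ids are made up for the sketch — real tokenizers also add special tokens and an attention mask):

```python
# Toy sketch of padding/truncation on already-numericalized sequences.
def pad_and_truncate(batch, max_length, pad_id=0):
    truncated = [seq[:max_length] for seq in batch]     # enforce the max length
    width = max(len(seq) for seq in truncated)          # pad up to the longest survivor
    return [seq + [pad_id] * (width - len(seq)) for seq in truncated]

batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
print(pad_and_truncate(batch, max_length=4))
# [[101, 7592, 102, 0], [101, 7592, 2088, 999]]
```

Every row ends up the same length, which is what lets the batch be stacked into a single rectangular tensor.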
❌ DON'T
- Load Models in a Loop - Loading a model takes seconds and GBs of RAM. Load once, reuse many times.
- Upload Private Data - Be careful when using models that might send data to an API (though transformers is mostly local execution).
- Ignore Padding - For batch processing, ensure `padding=True` so all sequences in the batch have the same length.
- Use Wrong Model for Task - A "BERT" model is for understanding; "GPT" is for generation. Use the right architecture.
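One lightweight way to enforce the load-once rule is to memoize the loader. This sketch stubs out the actual `from_pretrained` call (the stub dict is an assumption standing in for a real model object, so the example runs without downloading weights):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def get_model(name: str):
    # In real code this would be AutoModel.from_pretrained(name);
    # stubbed here so the sketch runs offline.
    return {"name": name}  # placeholder for a loaded model object

a = get_model("bert-base-uncased")
b = get_model("bert-base-uncased")
print(a is b)  # True: the second call hits the cache instead of reloading
```

The same pattern works for tokenizers, and keeps hot code paths from paying the multi-second, multi-GB load cost more than once.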
Anti-Patterns (NEVER)
```python
from transformers import AutoModel, AutoTokenizer

# ❌ BAD: Re-initializing the model inside a function called frequently
def get_prediction(text):
    model = AutoModel.from_pretrained("bert-base-uncased")  # ❌ SLOW & RAM HEAVY
    return model(text)

# ✅ GOOD: Load once globally or in a class
model = AutoModel.from_pretrained("bert-base-uncased")

def get_prediction(text):
    return model(text)

# ❌ BAD: Manual string splitting for "tokens"
tokens = text.split(" ")  # ❌ Not compatible with model vocabulary

# ✅ GOOD: Use the model's specific tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer(text, return_tensors="pt")

# ❌ BAD: Forgetting to move model to GPU
model = AutoModel.from_pretrained("...")
output = model(inputs.to("cuda"))  # ❌ Error: Model is on CPU!
```
Tokenization Deep Dive
Preparing Data for Models
```python
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
texts = ["Science is cool.", "Quantum physics is hard."]

# Batch encoding
inputs = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt"  # Returns PyTorch tensors
)
print(inputs['input_ids'].shape)  # (batch_size, seq_len)
```
The Trainer API
Simplified Training Loop
```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    evaluation_strategy="epoch",
    logging_dir="./logs",
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
)

trainer.train()
```
Scientific Applications
1. Protein Sequence Analysis (ESM)
```python
# ESM-2 is a powerful protein language model
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = AutoModel.from_pretrained("facebook/esm2_t6_8M_UR50D")

protein_seq = "MAPLRKTYLLG"
inputs = tokenizer(protein_seq, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# The 'last_hidden_state' represents a "feature vector" for each amino acid
embeddings = outputs.last_hidden_state
```
2. Chemical Property Prediction (SMILES)
```python
# Using a model trained on molecular strings
pipe = pipeline("text-classification", model="seyonec/ChemBERTa-zinc-base-v1")

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # Aspirin
result = pipe(smiles)
print(f"Prediction: {result}")
```
3. Named Entity Recognition (NER) for Papers
```python
# Extracting genes, proteins, or chemicals from text
ner_pipe = pipeline("ner", model="dslim/bert-base-NER")
text = "The expression of the BRCA1 gene was observed in the sample."
entities = ner_pipe(text)
```
Performance and Efficiency
1. Quantization (bitsandbytes)
Running large models on consumer GPUs by reducing precision (8-bit or 4-bit).
```python
from transformers import BitsAndBytesConfig

# Load model in 4-bit precision
quant_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModel.from_pretrained("model_name", quantization_config=quant_config)
```
2. Using pipeline with GPU
```python
# 'device=0' targets the first CUDA device
pipe = pipeline("translation_en_to_fr", model="t5-base", device=0)
```
Common Pitfalls and Solutions
"Out of Memory" (OOM) on GPU
```python
# ❌ Problem: Batch size is too large for GPU RAM
# ✅ Solution:
# 1. Reduce 'per_device_train_batch_size'
# 2. Use 'gradient_accumulation_steps' to keep effective batch size
# 3. Use 'fp16=True' in TrainingArguments
```
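To see why `gradient_accumulation_steps` preserves the effective batch size while lowering per-step memory, a quick sanity check of the arithmetic (the numbers are illustrative, not from the source):

```python
# Original setup: batch size 16 on one GPU runs out of memory.
# Reduced setup: batch size 4, accumulating gradients over 4 steps
# before each optimizer update.
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_devices = 1

effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_devices)
print(effective_batch_size)  # 16 - same effective batch as the OOM setup
```

Each forward/backward pass only needs memory for 4 samples, but the optimizer still sees gradients averaged over 16.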
Model Output is a Dictionary, not a Tensor
```python
# ❌ Problem: outputs[0] works, but is confusing
# ✅ Solution: Access by name
outputs = model(**inputs)
hidden_states = outputs.last_hidden_state
```
Slow Tokenization
```python
# ✅ Solution: Use "Fast" tokenizers (written in Rust, usually default)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
```
Hugging Face Transformers has democratized AI for the scientific community. By providing a unified interface to the world's most powerful models, it allows researchers to spend less time on engineering and more time on discovering insights from data.