<!-- Adapted from: claude-scientific-skills/scientific-skills/huggingface-tokenizers -->
# HuggingFace Tokenizers

Fast, production-ready tokenization - Rust-powered, with a Python API.
## When to Use
- High-performance tokenization (<20s per GB)
- Train custom tokenizers from scratch
- Track token-to-text alignment
- Production NLP pipelines
- Need BPE, WordPiece, or Unigram tokenization
## Quick Start
```python
from tokenizers import Tokenizer

# Load pretrained
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

# Encode
output = tokenizer.encode("Hello, how are you?")
print(output.tokens)  # ['hello', ',', 'how', 'are', 'you', '?']
print(output.ids)     # [7592, 1010, 2129, 2024, 2017, 1029]

# Decode
text = tokenizer.decode(output.ids)
```
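The loaded tokenizer also exposes direct vocabulary lookups; a small sketch (the IDs shown are from the bert-base-uncased vocabulary):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

# Vocabulary size, including added special tokens
print(tokenizer.get_vocab_size())      # 30522 for bert-base-uncased

# Map between token strings and IDs
print(tokenizer.token_to_id("hello"))  # 7592
print(tokenizer.id_to_token(7592))     # 'hello'
```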
## Train Custom Tokenizer
### BPE (GPT-2 style)
```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel

# Initialize
tokenizer = Tokenizer(BPE(unk_token="<|endoftext|>"))
tokenizer.pre_tokenizer = ByteLevel()

# Configure trainer
trainer = BpeTrainer(
    vocab_size=50000,
    special_tokens=["<|endoftext|>", "<|pad|>"],
    min_frequency=2
)

# Train
tokenizer.train(files=["data.txt"], trainer=trainer)

# Save
tokenizer.save("my-tokenizer.json")
```
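If the training corpus is already in memory (a list of strings, a datasets column) rather than in files, `train_from_iterator` can be used instead of `train`. A minimal sketch with a hypothetical two-sentence corpus:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel

tokenizer = Tokenizer(BPE(unk_token="<|endoftext|>"))
tokenizer.pre_tokenizer = ByteLevel()
trainer = BpeTrainer(vocab_size=50000, special_tokens=["<|endoftext|>"])

# Feed training data from memory instead of files on disk
corpus = ["First training sentence.", "Second training sentence."]
tokenizer.train_from_iterator(corpus, trainer=trainer)
```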
### WordPiece (BERT style)
```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)

tokenizer.train(files=["data.txt"], trainer=trainer)
```
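WordPiece marks subword continuations with a `##` prefix, so attaching the matching decoder lets `decode()` rejoin pieces cleanly. A small sketch, continuing from the tokenizer trained above:

```python
from tokenizers import decoders

# Rejoin '##'-prefixed subword pieces when decoding back to text
tokenizer.decoder = decoders.WordPiece()

output = tokenizer.encode("tokenization")
# Pieces such as 'token', '##ization' (exact split depends on the trained
# vocabulary) are merged back into 'tokenization'
print(tokenizer.decode(output.ids))
```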
## Encoding Options
```python
# Single text
output = tokenizer.encode("Hello world")

# Batch encoding
outputs = tokenizer.encode_batch(["Hello", "World"])

# With padding
tokenizer.enable_padding(pad_id=0, pad_token="[PAD]")
outputs = tokenizer.encode_batch(texts)

# With truncation
tokenizer.enable_truncation(max_length=512)
output = tokenizer.encode(long_text)
```
## Access Encoding Data
```python
output = tokenizer.encode("Hello world")
output.ids             # Token IDs
output.tokens          # Token strings
output.attention_mask  # Attention mask
output.offsets         # Character offsets (alignment)
output.word_ids        # Word indices
```
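The offsets are what enable token-to-text alignment: each one is a `(start, end)` character span into the original string (special tokens map to empty spans). A minimal sketch that recovers the source text behind each token:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

text = "Hello world"
output = tokenizer.encode(text)

# Map every token back to the exact span it came from
for token, (start, end) in zip(output.tokens, output.offsets):
    print(token, "->", repr(text[start:end]))
```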
## Pre-tokenizers
```python
from tokenizers.pre_tokenizers import (
    Whitespace,        # Split on whitespace
    ByteLevel,         # Byte-level (GPT-2)
    BertPreTokenizer,  # BERT style
    Punctuation,       # Split on punctuation
    Sequence,          # Chain multiple
)

# Chain pre-tokenizers
tokenizer.pre_tokenizer = Sequence([Whitespace(), Punctuation()])
```
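A pre-tokenizer can also be run on its own via `pre_tokenize_str`, which is handy for checking how it splits before committing to it. For example:

```python
from tokenizers.pre_tokenizers import Whitespace

pre_tok = Whitespace()

# Returns (fragment, (start, end)) pairs before the model sees them
print(pre_tok.pre_tokenize_str("Hello, world!"))
# [('Hello', (0, 5)), (',', (5, 6)), ('world', (7, 12)), ('!', (12, 13))]
```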
## Post-processing
```python
from tokenizers.processors import TemplateProcessing

# BERT-style: [CLS] ... [SEP]
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)
```
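The pair template controls how two sequences are packed and which segment ID each token receives. A quick check using the pretrained BERT tokenizer, which ships with this same template:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

output = tokenizer.encode("How are you?", "Fine, thanks.")
print(output.tokens)    # ['[CLS]', 'how', ..., '[SEP]', 'fine', ..., '[SEP]']
print(output.type_ids)  # 0 for the first sentence's tokens, 1 for the second's
```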
## Normalization
```python
from tokenizers.normalizers import (
    NFD, NFKC, Lowercase, StripAccents, Sequence
)

# BERT normalization
tokenizer.normalizer = Sequence([NFD(), Lowercase(), StripAccents()])
```
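A normalizer can likewise be previewed directly on a string:

```python
from tokenizers.normalizers import NFD, Lowercase, StripAccents, Sequence

normalizer = Sequence([NFD(), Lowercase(), StripAccents()])

# Decompose accents, lowercase, then strip the combining marks
print(normalizer.normalize_str("Héllò, Wörld!"))  # 'hello, world!'
```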
## With Transformers
```python
from transformers import PreTrainedTokenizerFast

# Wrap for transformers compatibility
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)

# Now works with transformers
encoded = fast_tokenizer("Hello world", return_tensors="pt")
```
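When wrapping a custom-trained tokenizer it usually helps to tell the wrapper which special tokens it uses; `save_pretrained` then writes a directory that `AutoTokenizer` can reload. A sketch assuming BERT-style special tokens, `tokenizer` from the training sections above, and a hypothetical output directory `my-tokenizer/`:

```python
from transformers import AutoTokenizer, PreTrainedTokenizerFast

fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

# Writes tokenizer.json plus config files that AutoTokenizer understands
fast_tokenizer.save_pretrained("my-tokenizer")
reloaded = AutoTokenizer.from_pretrained("my-tokenizer")
```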
## Save and Load
```python
# Save
tokenizer.save("tokenizer.json")

# Load
tokenizer = Tokenizer.from_file("tokenizer.json")

# From HuggingFace Hub
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
```
## Performance Tips
- Use batch encoding for multiple texts
- Enable padding/truncation once, not per encode call (see the sketch after this list)
- Pre-tokenizer choice affects speed significantly
- Train on representative data for better vocabulary
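A minimal sketch of the configure-once pattern from the second tip:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

# Configure padding/truncation once, up front
tokenizer.enable_truncation(max_length=512)
tokenizer.enable_padding(pad_id=0, pad_token="[PAD]")

# Then encode everything in one batched call
texts = ["First example.", "Second, slightly longer example."]
outputs = tokenizer.encode_batch(texts)

# Turn the settings off when raw, unpadded encodings are needed again
tokenizer.no_padding()
tokenizer.no_truncation()
```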
## vs Alternatives
| Tool | Best For |
|---|---|
| tokenizers | Speed, custom training, production |
| SentencePiece | T5/ALBERT, language-independent |
| tiktoken | OpenAI models (GPT) |