huggingface-tokenizers

<!-- Adapted from: claude-scientific-skills/scientific-skills/huggingface-tokenizers -->

HuggingFace Tokenizers

Fast, production-ready tokenization - Rust-powered, Python API.

When to Use

  • High-performance tokenization (<20s per GB)
  • Train custom tokenizers from scratch
  • Track token-to-text alignment
  • Production NLP pipelines
  • Need BPE, WordPiece, or Unigram tokenization

Quick Start

python
from tokenizers import Tokenizer

Load pretrained

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

Encode

output = tokenizer.encode("Hello, how are you?")
print(output.tokens)  # ['hello', ',', 'how', 'are', 'you', '?']
print(output.ids)     # [7592, 1010, 2129, 2024, 2017, 1029]

Decode

text = tokenizer.decode(output.ids)

Train Custom Tokenizer

BPE (GPT-2 style)

python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel

Initialize

tokenizer = Tokenizer(BPE(unk_token="<|endoftext|>"))
tokenizer.pre_tokenizer = ByteLevel()

Configure trainer

trainer = BpeTrainer(
    vocab_size=50000,
    special_tokens=["<|endoftext|>", "<|pad|>"],
    min_frequency=2
)

Train

tokenizer.train(files=["data.txt"], trainer=trainer)

Save

tokenizer.save("my-tokenizer.json")

WordPiece (BERT style)

python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)

tokenizer.train(files=["data.txt"], trainer=trainer)
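
Unigram (SentencePiece style)

Unigram is listed under When to Use but not shown above; as a hedged sketch, training one follows the same pattern as BPE and WordPiece. The vocab_size, special tokens, and pre-tokenizer chosen here are illustrative assumptions, not settings from this guide.

python
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer
from tokenizers.pre_tokenizers import Whitespace

# Start from an empty Unigram model and train it on raw text files
tokenizer = Tokenizer(Unigram())
tokenizer.pre_tokenizer = Whitespace()

trainer = UnigramTrainer(
    vocab_size=8000,                      # illustrative size
    special_tokens=["<unk>", "<pad>"],
    unk_token="<unk>",
)

tokenizer.train(files=["data.txt"], trainer=trainer)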

Encoding Options

python

Single text

output = tokenizer.encode("Hello world")

Batch encoding

outputs = tokenizer.encode_batch(["Hello", "World"])

With padding

tokenizer.enable_padding(pad_id=0, pad_token="[PAD]")
outputs = tokenizer.encode_batch(texts)

With truncation

tokenizer.enable_truncation(max_length=512)
output = tokenizer.encode(long_text)

Access Encoding Data

python
output = tokenizer.encode("Hello world")

output.ids           # Token IDs
output.tokens        # Token strings
output.attention_mask  # Attention mask
output.offsets       # Character offsets (alignment)
output.word_ids      # Word indices
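
The offsets field is what makes the "token-to-text alignment" use case work; a minimal sketch (the sample string is arbitrary):

python
text = "Hello world"
output = tokenizer.encode(text)

# Each token maps back to the character span of the original text it covers;
# special tokens inserted by a post-processor (e.g. [CLS]) map to an empty (0, 0) span.
for token, (start, end) in zip(output.tokens, output.offsets):
    print(token, "->", repr(text[start:end]))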

Pre-tokenizers

python
from tokenizers.pre_tokenizers import (
    Whitespace,      # Split on whitespace
    ByteLevel,       # Byte-level (GPT-2)
    BertPreTokenizer,  # BERT style
    Punctuation,     # Split on punctuation
    Sequence,        # Chain multiple
)

Chain pre-tokenizers

from tokenizers.pre_tokenizers import Sequence, Whitespace, Punctuation

tokenizer.pre_tokenizer = Sequence([Whitespace(), Punctuation()])
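
To check what a pre-tokenizer will do before training with it, run it directly on a string; a small sketch (the sample sentence and the spans shown are illustrative):

python
from tokenizers.pre_tokenizers import Sequence, Whitespace, Punctuation

pre = Sequence([Whitespace(), Punctuation()])
# Returns (fragment, (start, end)) pairs,
# e.g. [('Hello', (0, 5)), (',', (5, 6)), ('world', (7, 12)), ('!', (12, 13))]
print(pre.pre_tokenize_str("Hello, world!"))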

Post-processing

python
from tokenizers.processors import TemplateProcessing

BERT-style: [CLS] ... [SEP]

tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)
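
Assuming the vocabulary actually contains [CLS] and [SEP] (token_to_id returns None otherwise), every encoded sequence is now framed by them:

python
output = tokenizer.encode("Hello world")
# For a BERT-style vocabulary this yields something like
# ['[CLS]', 'hello', 'world', '[SEP]']
print(output.tokens)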

Normalization

python
from tokenizers.normalizers import (
    NFD, NFKC, Lowercase, StripAccents, Sequence
)

BERT normalization

tokenizer.normalizer = Sequence([NFD(), Lowercase(), StripAccents()])
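
Normalizers can also be tested in isolation with normalize_str; a quick check (the input string is arbitrary):

python
from tokenizers.normalizers import NFD, Lowercase, StripAccents, Sequence

normalizer = Sequence([NFD(), Lowercase(), StripAccents()])
print(normalizer.normalize_str("Héllo Wörld"))  # "hello world"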

With Transformers

python
from transformers import PreTrainedTokenizerFast

Wrap for transformers compatibility

fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)

Now works with transformers

encoded = fast_tokenizer("Hello world", return_tensors="pt")
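
The wrapper does not know which strings are your special tokens, so it is usually worth passing them explicitly and saving in the transformers layout; a sketch assuming the BERT-style tokens trained above (the output directory name is arbitrary):

python
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

# Persist in the standard transformers format; reload with AutoTokenizer.from_pretrained
fast_tokenizer.save_pretrained("my-tokenizer")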

Save and Load

python

Save

tokenizer.save("tokenizer.json")

Load

tokenizer = Tokenizer.from_file("tokenizer.json")

From HuggingFace Hub

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

Performance Tips

  1. Use batch encoding for multiple texts
  2. Enable padding/truncation once, not per-encode (combined with batch encoding in the sketch below)
  3. Pre-tokenizer choice affects speed significantly
  4. Train on representative data for better vocabulary
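
A minimal sketch of tips 1 and 2 together (the texts and max_length are placeholder values):

python
# Configure padding and truncation once...
tokenizer.enable_padding(pad_id=0, pad_token="[PAD]")
tokenizer.enable_truncation(max_length=512)

# ...then encode whole batches instead of calling encode() in a loop
texts = ["First example.", "A somewhat longer second example."]
outputs = tokenizer.encode_batch(texts)
print([len(o.ids) for o in outputs])  # padded to a common length within the batch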

vs Alternatives

Tool            Best For
tokenizers      Speed, custom training, production
SentencePiece   T5/ALBERT, language-independent
tiktoken        OpenAI models (GPT)

Resources
