train-sentence-transformers

Train a sentence-transformers Model

**This SKILL.md is a router, not a manual.** It tells you which references and example scripts to load for your task. The actual content (recommended losses, evaluators, training-script structure, model selection, training-arg knobs, troubleshooting) lives in `references/` and `scripts/`.

**Do not synthesize a training script from this file alone.** Open the per-type production template (`scripts/train_<type>_example.py`) and copy it as your starting point. The templates contain load-bearing scaffolding (autocast helper, model-card class, logger silencing list, `force=True`, `seed`, TF32, version-compatible imports, named-evaluator metric handling) that prior agent runs have repeatedly missed when rolling their own from a synthesized snippet.

1. Identify the model type

| Tag | Class | What it does | When to pick |
| --- | --- | --- | --- |
| [SentenceTransformer] | `SentenceTransformer` (bi-encoder) | Maps each input to a fixed-dim dense vector | Retrieval, similarity, clustering, classification, paraphrase mining, dedup |
| [CrossEncoder] | `CrossEncoder` (reranker) | Scores `(query, passage)` pairs jointly | Two-stage retrieval (rerank top-100 from bi-encoder), pair classification |
| [SparseEncoder] | `SparseEncoder` (SPLADE) | Sparse vectors over the vocabulary | Learned-sparse retrieval, inverted-index backends (Elasticsearch / OpenSearch / Lucene) |

Tiebreakers when the request is ambiguous: "embedding model" / "vector search" / "similarity" → [SentenceTransformer]. "rerank" / "ranker" / "two-stage" → [CrossEncoder]. "SPLADE" / "sparse" / "inverted index" → [SparseEncoder]. If still unclear, ask.
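The tiebreakers above can be sketched as a small keyword router. This is only an illustration: the keyword lists come from the text, while the function name `guess_model_type` and the rule "return `None` unless exactly one type matches" are assumptions made here, standing in for "if still unclear, ask".

```python
from typing import Optional

# Keyword lists taken from the tiebreaker sentence above.
# The structure and names below are hypothetical, for illustration only.
TIEBREAKERS = [
    ("SentenceTransformer", ("embedding model", "vector search", "similarity")),
    ("CrossEncoder", ("rerank", "ranker", "two-stage")),
    ("SparseEncoder", ("splade", "sparse", "inverted index")),
]


def guess_model_type(request: str) -> Optional[str]:
    """Return a model-type tag, or None when the request is still ambiguous."""
    text = request.lower()
    matches = [
        tag
        for tag, keywords in TIEBREAKERS
        if any(keyword in text for keyword in keywords)
    ]
    # Zero matches or conflicting matches both mean: ask the user.
    return matches[0] if len(matches) == 1 else None
```

A request such as "sparse embedding model" matches two types, so the sketch returns `None` rather than guessing.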

2. Required reading

Read these in full before writing any code. Do not triage by perceived relevance.

Per-type — always required

[SentenceTransformer]

- `references/losses_sentence_transformer.md`: loss-to-data-shape mapping; `BatchSamplers.NO_DUPLICATES` requirement for MNRL-family; `Cached*` ↔ `gradient_checkpointing` incompatibility.
- `references/evaluators_sentence_transformer.md`: evaluator-to-task mapping; `metric_for_best_model` key construction (named vs unnamed); per-evaluator `primary_metric` values.
- `references/model_architectures.md`: encoder vs decoder vs static vs Router pipelines; pooling rules (mean / cls / lasttoken); auto-mean-pooling behavior for fresh-start MLM bases.
- `scripts/train_sentence_transformer_example.py`: production template; copy this as your starting point.

[CrossEncoder]

- `references/losses_cross_encoder.md`: pointwise / pairwise / listwise / distillation; `pos_weight` derivation; `activation_fn=Identity()` mandatory for non-BCE losses (silent eval-rank collapse otherwise).
- `references/evaluators_cross_encoder.md`: `CrossEncoderRerankingEvaluator` recipe; named-evaluator key format `eval_{name}_{primary_metric}`.
- `scripts/train_cross_encoder_example.py`: production template; copy this as your starting point.

[SparseEncoder]

- `references/losses_sparse_encoder.md`: `SpladeLoss` wrapper requirement; FLOPS regularizer weights; smoke-test active-dim ramp behavior.
- `references/evaluators_sparse_encoder.md`: `SparseNanoBEIREvaluator` (English-only) and the in-domain alternative; `eval_{name}_{primary_metric}` key format.
- `scripts/train_sparse_encoder_example.py`: production template; copy this as your starting point.

Cross-cutting — always required (regardless of task)

- `references/training_args.md`: `TrainingArguments` knobs, precision rules (load fp32 + autocast bf16/fp16; never `torch_dtype=bfloat16`), `warmup_steps` (float) vs deprecated `warmup_ratio`, `save_steps` must be a multiple of `eval_steps` for `load_best_model_at_end`, schedulers, HPO, tracker, resume, hub-push variants.
- `references/dataset_formats.md`: column-matching rules (label name auto-detection; column-order-not-name); reshaping recipes; hard-negative mining options.
- `references/base_model_selection.md`: discovery commands; per-type model namespaces; ModernBERT-family `max_seq_length=8192` trap; `datasets >= 4` script-loader rejection; non-English starting-point shortcuts.
- `references/troubleshooting.md`: symptom-indexed failure recipes. Skim the section headings on every run, even a healthy one; the "Metrics don't improve" and "Hub push fails" entries cover bugs that bite frequently and are cheaper to recognize before they fire than to debug after.
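One of the rules listed for `references/training_args.md` (`save_steps` must be a multiple of `eval_steps` when `load_best_model_at_end` is used) is easy to check before a run starts. The following is a minimal sketch of such a pre-flight check; the helper name `check_save_eval_steps` is hypothetical and is not part of the templates or of any library.

```python
def check_save_eval_steps(save_steps: int, eval_steps: int) -> None:
    """Raise ValueError if save_steps is not a positive multiple of eval_steps.

    Hypothetical pre-flight helper: it only restates the rule from the text
    so a misconfiguration fails fast, before any model is loaded.
    """
    if save_steps <= 0 or eval_steps <= 0:
        raise ValueError("save_steps and eval_steps must both be positive")
    if save_steps % eval_steps != 0:
        raise ValueError(
            f"save_steps={save_steps} is not a multiple of eval_steps={eval_steps}; "
            "load_best_model_at_end needs every checkpoint to line up with an evaluation"
        )
```

For example, `check_save_eval_steps(200, 100)` passes silently, while `check_save_eval_steps(150, 100)` raises.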

Cross-cutting — load when applicable

- `references/hardware_guide.md`: VRAM sizing, multi-GPU, FSDP / DeepSpeed, HF Jobs flavors. Required for >24GB models, multi-GPU, or HF Jobs runs.
- `references/hf_jobs_execution.md`: required when running on HF Jobs.
- `references/prompts_and_instructions.md`: required when using prompt-tuned bases (E5, BGE, GTE, Qwen3-Embedding, Instructor, Nomic, etc.) or adding `query: ` / `passage: ` style prefixes.

Variant scripts (open when the task matches)

- [SentenceTransformer] `scripts/train_sentence_transformer_<matryoshka|multi_dataset|with_lora|distillation|make_multilingual|static_embedding>_example.py`
- [CrossEncoder] `scripts/train_cross_encoder_<distillation|listwise>_example.py`
- [SparseEncoder] `scripts/train_sparse_encoder_distillation_example.py`
- Hard-negative mining CLI: `scripts/mine_hard_negatives.py`.

3. Defaults

Override only if the user specifies otherwise:

- Local execution. Pitch HF Jobs only if local hardware can't fit the job.
- Single run. After it completes, propose experimentation if the user would benefit (weak/marginal verdict, "see how high you can push it" framing, etc.). Iteration rules are in `references/training_args.md` (Experimentation section).
- Public Hub push at end-of-run, wrapped in try-except. On HF Jobs (ephemeral env) ALSO enable in-trainer push (`push_to_hub=True` + `hub_strategy="every_save"`); details in `references/hf_jobs_execution.md`.
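The "wrapped in try-except" default can be sketched as a small guard around whatever performs the push. This is an assumption-laden illustration, not the templates' code: `push_with_guard` is a hypothetical helper, and the push itself is passed in as a callable so the sketch stays independent of any particular model class.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def push_with_guard(push_fn: Callable[[], object]) -> bool:
    """Run a Hub push and report failure instead of crashing the run.

    Hypothetical helper: returns True on success and False if the push
    raised, so a finished training run is not lost to an upload error.
    """
    try:
        push_fn()
    except Exception as exc:  # network, auth, or repo errors all land here
        logger.warning("Hub push failed, model remains on local disk: %s", exc)
        return False
    return True
```

In a training script this might be called as `push_with_guard(lambda: model.push_to_hub(repo_id))`, where `repo_id` is a placeholder for the target repository.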

4. Constraints the produced script must satisfy

These are non-negotiable contracts. Implementation lives in the production templates and references; do not reinvent.

- Capture the pre-training evaluator score as `baseline_eval` before `trainer.train()`.
- Emit a single end-of-run line: `VERDICT: WIN|MARGINAL|REGRESSION | score=... | baseline=... | delta=...`. A monitor scrapes for this.
- Silence `httpx`, `httpcore`, `huggingface_hub`, `urllib3`, `filelock`, `fsspec` to WARNING (otherwise HF download URLs flood the agent's context).
- Tee logs to `logs/{RUN_NAME}.log`.
- End with `model.push_to_hub(...)` wrapped in `try/except`.
- Smoke-test before any long run (`max_steps=1` + tiny dataset slice). The production templates show one common pattern (`SMOKE_TEST` env var).
- [CrossEncoder] Include `EarlyStoppingCallback(patience>=3)`: CE rerankers often peak mid-training and regress.
- [SparseEncoder] Log `query_active_dims` / `corpus_active_dims` on the verdict line; high nDCG with collapsed sparsity is not a win. The keys come back name-prefixed (e.g. `..._query_active_dims`); use suffix matching to pluck them. See the SPARSE production template for the exact pattern.
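Three of the contracts above (logger silencing, the verdict line, and suffix matching for the active-dims keys) can be illustrated with standard-library code alone. The sketch below is not the production templates' implementation. The helper names are hypothetical, and the `marginal_band` threshold that separates WIN, MARGINAL, and REGRESSION is an assumption, since the text does not define where those boundaries lie.

```python
import logging
from typing import Mapping, Optional

# Logger names taken from the constraint list above.
NOISY_LOGGERS = ("httpx", "httpcore", "huggingface_hub", "urllib3", "filelock", "fsspec")


def silence_noisy_loggers() -> None:
    """Raise the listed loggers to WARNING so download URLs stay out of the log."""
    for name in NOISY_LOGGERS:
        logging.getLogger(name).setLevel(logging.WARNING)


def pluck_by_suffix(metrics: Mapping[str, float], suffix: str) -> Optional[float]:
    """Return the first metric whose key ends with `suffix`, else None."""
    for key, value in metrics.items():
        if key.endswith(suffix):
            return value
    return None


def verdict_line(
    score: float,
    baseline: float,
    marginal_band: float = 0.005,  # ASSUMPTION: threshold is not given in the text
    extras: Optional[Mapping[str, object]] = None,
) -> str:
    """Format the single end-of-run line that a monitor can scrape."""
    delta = score - baseline
    if delta > marginal_band:
        label = "WIN"
    elif delta >= -marginal_band:
        label = "MARGINAL"
    else:
        label = "REGRESSION"
    parts = [
        f"VERDICT: {label}",
        f"score={score:.4f}",
        f"baseline={baseline:.4f}",
        f"delta={delta:+.4f}",
    ]
    # Extra fields, e.g. active dims for [SparseEncoder] runs.
    parts.extend(f"{key}={value}" for key, value in (extras or {}).items())
    return " | ".join(parts)
```

For a [SparseEncoder] run, the values returned by `pluck_by_suffix(metrics, "query_active_dims")` and `pluck_by_suffix(metrics, "corpus_active_dims")` could be passed through `extras` so that sparsity appears on the same line as the score.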

5. Workflow

1. Identify the model type (§1). Ask if ambiguous.
2. Load the §2 required-reading files for that type.
3. Open `scripts/train_<type>_example.py` and copy it as your starting point.
4. Replace `MODEL_NAME`, `DATASET_NAME`, `RUN_NAME`, the loss, and the evaluator with the user's task. Cross-check loss/data-shape match against `references/losses_<type>.md`; cross-check the `metric_for_best_model` key against `references/evaluators_<type>.md` (named evaluators format the key as `eval_{name}_{primary_metric}`).
5. Smoke-test (`max_steps=1`).
6. Run.
7. After the run, append to `logs/experiments.md` and propose iteration if the verdict is weak/marginal.
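The key format mentioned in the workflow, `eval_{name}_{primary_metric}`, can be sketched as a one-line helper. The named form comes from the text. The unnamed form (`eval_{primary_metric}`) is an assumption made here and should be confirmed against `references/evaluators_<type>.md`; the function name and the example values in the usage note are hypothetical.

```python
from typing import Optional


def metric_key(primary_metric: str, name: Optional[str] = None) -> str:
    """Build a candidate `metric_for_best_model` key.

    Named evaluators: eval_{name}_{primary_metric} (format stated in the text).
    Unnamed evaluators: eval_{primary_metric} (ASSUMPTION, verify in the references).
    """
    if name:
        return f"eval_{name}_{primary_metric}"
    return f"eval_{primary_metric}"
```

With a hypothetical evaluator named `dev` whose primary metric is `ndcg@10`, `metric_key("ndcg@10", "dev")` yields `eval_dev_ndcg@10`.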

Prerequisites

```bash
pip install "sentence-transformers[train]>=5.0"        # add [train,image] / [audio] / [video] for [SentenceTransformer] multimodal
pip install trackio                                    # optional tracker; or wandb / tensorboard / mlflow
hf auth login                                          # or set HF_TOKEN with write scope (for Hub push)
```

GPU strongly recommended. CPU works only for demos and [SentenceTransformer] `StaticEmbedding` models.
