train-sentence-transformers


Train a sentence-transformers Model

**This SKILL.md is a router, not a manual.** It tells you which references and example scripts to load for your task. The actual content (recommended losses, evaluators, training-script structure, model selection, training-arg knobs, troubleshooting) lives in `references/` and `scripts/`.

**Do not synthesize a training script from this file alone.** Open the per-type production template (`scripts/train_<type>_example.py`) and copy it as your starting point. The templates contain load-bearing scaffolding (autocast helper, model-card class, logger silencing list, `force=True`, `seed`, TF32, version-compatible imports, named-evaluator metric handling) that prior agent runs have repeatedly missed when rolling their own from a synthesized snippet.

1. Identify the model type

| Tag | Class | What it does | When to pick |
| --- | --- | --- | --- |
| [SentenceTransformer] | `SentenceTransformer` (bi-encoder) | Maps each input to a fixed-dim dense vector | Retrieval, similarity, clustering, classification, paraphrase mining, dedup |
| [CrossEncoder] | `CrossEncoder` (reranker) | Scores `(query, passage)` pairs jointly | Two-stage retrieval (rerank top-100 from bi-encoder), pair classification |
| [SparseEncoder] | `SparseEncoder` (SPLADE) | Sparse vectors over the vocabulary | Learned-sparse retrieval, inverted-index backends (Elasticsearch / OpenSearch / Lucene) |

Tiebreakers when the request is ambiguous: "embedding model" / "vector search" / "similarity" → [SentenceTransformer]. "rerank" / "ranker" / "two-stage" → [CrossEncoder]. "SPLADE" / "sparse" / "inverted index" → [SparseEncoder]. If still unclear, ask.
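
The tiebreakers above can be read as a small keyword lookup. The sketch below is only an illustration of that routing logic: the `guess_model_type` helper and the `TIEBREAKERS` table are hypothetical names, not part of the skill or of sentence-transformers. It returns `None` whenever zero or several types match, which corresponds to the "if still unclear, ask" rule.

```python
from typing import Optional

# Hypothetical lookup table built from the tiebreaker phrases listed above.
TIEBREAKERS = {
    "SentenceTransformer": ("embedding model", "vector search", "similarity"),
    "CrossEncoder": ("rerank", "ranker", "two-stage"),
    "SparseEncoder": ("splade", "sparse", "inverted index"),
}


def guess_model_type(request: str) -> Optional[str]:
    """Return the single matching tag, or None when the request is ambiguous."""
    text = request.lower()
    matches = [
        tag
        for tag, keywords in TIEBREAKERS.items()
        if any(keyword in text for keyword in keywords)
    ]
    # Zero matches or more than one match: do not guess, ask the user.
    return matches[0] if len(matches) == 1 else None
```

A request that mentions both "sparse" and "rerank" matches two types, so the helper returns `None` rather than picking one.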
标签功能适用场景
[SentenceTransformer]
SentenceTransformer
(双编码器)
将每个输入映射为固定维度的密集向量检索、相似度计算、聚类、分类、paraphrase mining、去重
[CrossEncoder]
CrossEncoder
(重排序器)
联合对
(query, passage)
配对进行评分
两阶段检索(对双编码器输出的前100个结果重排序)、配对分类
[SparseEncoder]
SparseEncoder
(SPLADE)
基于词汇表生成稀疏向量学习型稀疏检索、倒排索引后端(Elasticsearch / OpenSearch / Lucene)
当请求模糊时的优先级判断:"embedding model" / "vector search" / "similarity" → [SentenceTransformer];"rerank" / "ranker" / "two-stage" → [CrossEncoder];"SPLADE" / "sparse" / "inverted index" → [SparseEncoder]。若仍不明确,请询问用户。

2. Required reading

Read these in full before writing any code. Do not triage by perceived relevance.

Per-type — always required

[SentenceTransformer]
  • `references/losses_sentence_transformer.md`: loss-to-data-shape mapping; `BatchSamplers.NO_DUPLICATES` requirement for MNRL-family; `Cached*` / `gradient_checkpointing` incompatibility.
  • `references/evaluators_sentence_transformer.md`: evaluator-to-task mapping; `metric_for_best_model` key construction (named vs unnamed); per-evaluator `primary_metric` values.
  • `references/model_architectures.md`: encoder vs decoder vs static vs Router pipelines; pooling rules (mean / cls / lasttoken); auto-mean-pooling behavior for fresh-start MLM bases.
  • `scripts/train_sentence_transformer_example.py`: production template; copy this as your starting point.
[CrossEncoder]
  • `references/losses_cross_encoder.md`: pointwise / pairwise / listwise / distillation; `pos_weight` derivation; `activation_fn=Identity()` mandatory for non-BCE losses (silent eval-rank collapse otherwise).
  • `references/evaluators_cross_encoder.md`: `CrossEncoderRerankingEvaluator` recipe; named-evaluator key format `eval_{name}_{primary_metric}`.
  • `scripts/train_cross_encoder_example.py`: production template; copy this as your starting point.
[SparseEncoder]
  • `references/losses_sparse_encoder.md`: `SpladeLoss` wrapper requirement; FLOPS regularizer weights; smoke-test active-dim ramp behavior.
  • `references/evaluators_sparse_encoder.md`: `SparseNanoBEIREvaluator` (English-only) and the in-domain alternative; `eval_{name}_{primary_metric}` key format.
  • `scripts/train_sparse_encoder_example.py`: production template; copy this as your starting point.
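
The named-evaluator key format `eval_{name}_{primary_metric}` mentioned above can be sketched as a one-line string builder. `metric_key` is a hypothetical helper, and the unnamed form it produces (`eval_{primary_metric}`) is an assumption made for illustration; the per-type evaluator references remain the authority on the exact keys.

```python
def metric_key(primary_metric: str, name: str = "") -> str:
    """Build a metric_for_best_model key.

    Named evaluators follow eval_{name}_{primary_metric} as stated above.
    The unnamed branch (eval_{primary_metric}) is an assumption; check
    references/evaluators_<type>.md before relying on it.
    """
    parts = ["eval", name, primary_metric] if name else ["eval", primary_metric]
    return "_".join(parts)


# Illustrative values only; "dev" and "ndcg@10" are placeholders.
print(metric_key("ndcg@10", name="dev"))  # eval_dev_ndcg@10
```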

Cross-cutting — always required (regardless of task)

  • `references/training_args.md`: `TrainingArguments` knobs, precision rules (load fp32 + autocast bf16/fp16; never `torch_dtype=bfloat16`), `warmup_steps` (float) vs deprecated `warmup_ratio`, `save_steps` must be a multiple of `eval_steps` for `load_best_model_at_end`, schedulers, HPO, tracker, resume, hub-push variants.
  • `references/dataset_formats.md`: column-matching rules (label name auto-detection; column-order-not-name); reshaping recipes; hard-negative mining options.
  • `references/base_model_selection.md`: discovery commands; per-type model namespaces; ModernBERT-family `max_seq_length=8192` trap; `datasets >= 4` script-loader rejection; non-English starting-point shortcuts.
  • `references/troubleshooting.md`: symptom-indexed failure recipes. Skim the section headings on every run, even a healthy one; the "Metrics don't improve" and "Hub push fails" entries cover bugs that bite frequently and are cheaper to recognize before they fire than to debug after.
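
One rule from the list above, that `save_steps` must be a multiple of `eval_steps` when `load_best_model_at_end` is used, is easy to check before building the training arguments. The following is a minimal sketch with a hypothetical `check_step_alignment` helper; it assumes both values are given as integer step counts.

```python
def check_step_alignment(save_steps: int, eval_steps: int) -> None:
    """Raise ValueError unless save_steps is a positive multiple of eval_steps."""
    if save_steps <= 0 or eval_steps <= 0:
        raise ValueError("save_steps and eval_steps must be positive integers")
    if save_steps % eval_steps != 0:
        raise ValueError(
            f"save_steps={save_steps} is not a multiple of eval_steps={eval_steps}; "
            "load_best_model_at_end needs every checkpoint to coincide with an evaluation"
        )


check_step_alignment(save_steps=200, eval_steps=100)  # passes silently
```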

Cross-cutting — load when applicable

  • `references/hardware_guide.md`: VRAM sizing, multi-GPU, FSDP / DeepSpeed, HF Jobs flavors. Required for >24GB models, multi-GPU, or HF Jobs runs.
  • `references/hf_jobs_execution.md`: required when running on HF Jobs.
  • `references/prompts_and_instructions.md`: required when using prompt-tuned bases (E5, BGE, GTE, Qwen3-Embedding, Instructor, Nomic, etc.) or adding `query: ` / `passage: ` style prefixes.

Variant scripts (open when the task matches)

  • [SentenceTransformer] `scripts/train_sentence_transformer_<matryoshka|multi_dataset|with_lora|distillation|make_multilingual|static_embedding>_example.py`.
  • [CrossEncoder] `scripts/train_cross_encoder_<distillation|listwise>_example.py`.
  • [SparseEncoder] `scripts/train_sparse_encoder_distillation_example.py`.
  • Hard-negative mining CLI: `scripts/mine_hard_negatives.py`.

3. Defaults

Override only if the user specifies otherwise:
  • Local execution. Pitch HF Jobs only if local hardware can't fit the job.
  • Single run. After it completes, propose experimentation if the user would benefit (weak/marginal verdict, "see how high you can push it" framing, etc.). Iteration rules are in `references/training_args.md` (Experimentation section).
  • Public Hub push at end-of-run, wrapped in try-except. On HF Jobs (ephemeral env) ALSO enable in-trainer push (`push_to_hub=True` + `hub_strategy="every_save"`); details in `references/hf_jobs_execution.md`.
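
The end-of-run push wrapped in try-except could look roughly like the sketch below. `safe_push` and the `"train"` logger name are hypothetical; the helper only assumes the model object exposes a `push_to_hub(repo_id)` method. Visibility and other push arguments are left to the production templates.

```python
import logging

logger = logging.getLogger("train")  # hypothetical logger name


def safe_push(model, repo_id: str) -> bool:
    """Try to push the trained model; log the failure instead of crashing the run."""
    try:
        model.push_to_hub(repo_id)
    except Exception:
        # A failed push should not discard a finished training run:
        # the local checkpoint is still on disk and can be pushed later.
        logger.exception("Hub push to %s failed; model remains saved locally", repo_id)
        return False
    return True
```

Returning a boolean lets the calling script report the push outcome without re-raising.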

4. Constraints the produced script must satisfy

These are non-negotiable contracts. Implementation lives in the production templates and references; do not reinvent it.
  • Capture the pre-training evaluator score as `baseline_eval` before `trainer.train()`.
  • Emit a single end-of-run line: `VERDICT: WIN|MARGINAL|REGRESSION | score=... | baseline=... | delta=...`. A monitor scrapes for this.
  • Silence `httpx`, `httpcore`, `huggingface_hub`, `urllib3`, `filelock`, `fsspec` to WARNING (otherwise HF download URLs flood the agent's context).
  • Tee logs to `logs/{RUN_NAME}.log`.
  • End with `model.push_to_hub(...)` wrapped in `try/except`.
  • Smoke-test before any long run (`max_steps=1` + tiny dataset slice). The production templates show one common pattern (`SMOKE_TEST` env var).
  • [CrossEncoder] Include `EarlyStoppingCallback(patience>=3)`: CE rerankers often peak mid-training and regress.
  • [SparseEncoder] Log `query_active_dims` / `corpus_active_dims` on the verdict line; high nDCG with collapsed sparsity is not a win. The keys come back name-prefixed (e.g. `..._query_active_dims`); use suffix matching to pluck them. See the SPARSE production template for the exact pattern.
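
Three of the contracts above (logger silencing, the verdict line, and suffix matching on name-prefixed keys) can be sketched with the standard library alone. This is an illustration, not the templates' implementation: the helper names are hypothetical, and the `margin` threshold separating WIN / MARGINAL / REGRESSION is an assumption, since this file does not define those cut-offs.

```python
import logging
from typing import Optional

NOISY_LOGGERS = ("httpx", "httpcore", "huggingface_hub", "urllib3", "filelock", "fsspec")


def silence_noisy_loggers() -> None:
    """Raise the listed loggers to WARNING so download URLs stay out of the logs."""
    for name in NOISY_LOGGERS:
        logging.getLogger(name).setLevel(logging.WARNING)


def pluck_by_suffix(metrics: dict, suffix: str) -> Optional[float]:
    """Return the first metric whose (name-prefixed) key ends with `suffix`."""
    for key, value in metrics.items():
        if key.endswith(suffix):
            return value
    return None


def verdict_line(score: float, baseline: float, margin: float = 0.005,
                 extras: Optional[dict] = None) -> str:
    """Format the single end-of-run line. `margin` is an assumed threshold."""
    delta = score - baseline
    if delta > margin:
        label = "WIN"
    elif delta >= -margin:
        label = "MARGINAL"
    else:
        label = "REGRESSION"
    line = f"VERDICT: {label} | score={score:.4f} | baseline={baseline:.4f} | delta={delta:+.4f}"
    # Extra fields, e.g. active-dim counts for sparse encoders, go on the same line.
    for key, value in (extras or {}).items():
        line += f" | {key}={value}"
    return line
```

For a sparse encoder, the values returned by `pluck_by_suffix(metrics, "query_active_dims")` and its corpus counterpart could be passed through `extras` so they land on the verdict line.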

5. Workflow

  1. Identify the model type (§1). Ask if ambiguous.
  2. Load the §2 required-reading files for that type.
  3. Open `scripts/train_<type>_example.py` and copy it as your starting point.
  4. Replace `MODEL_NAME`, `DATASET_NAME`, `RUN_NAME`, the loss, and the evaluator with the user's task. Cross-check loss/data-shape match against `references/losses_<type>.md`; cross-check the `metric_for_best_model` key against `references/evaluators_<type>.md` (named evaluators format the key as `eval_{name}_{primary_metric}`).
  5. Smoke-test (`max_steps=1`).
  6. Run.
  7. After the run, append to `logs/experiments.md` and propose iteration if the verdict is weak/marginal.
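
The smoke-test step can be driven by an environment variable, as the production templates reportedly do with `SMOKE_TEST`. The sketch below is an assumption about how such a switch might be wired, not a copy of the templates: `smoke_test_overrides`, the accepted truthy strings, the `train_slice` key, and the default slice size of 64 are all hypothetical.

```python
import os
from typing import Mapping, Optional


def smoke_test_overrides(env: Optional[Mapping[str, str]] = None,
                         slice_size: int = 64) -> dict:
    """Return overrides for a smoke test, or an empty dict for a full run."""
    env = os.environ if env is None else env
    enabled = env.get("SMOKE_TEST", "").lower() in ("1", "true", "yes")
    if not enabled:
        return {}
    # max_steps=1 comes from the workflow above; the slice size is an assumed default.
    return {"max_steps": 1, "train_slice": slice_size}


overrides = smoke_test_overrides()
if overrides:
    print(f"Smoke test: max_steps={overrides['max_steps']}, "
          f"first {overrides['train_slice']} training rows only")
```

A training script could then pass `max_steps` into its training arguments and truncate the training set to `train_slice` rows before the real run.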

Prerequisites

```bash
pip install "sentence-transformers[train]>=5.0"        # add [train,image] / [audio] / [video] for [SentenceTransformer] multimodal
pip install trackio                                    # optional tracker; or wandb / tensorboard / mlflow
hf auth login                                          # or set HF_TOKEN with write scope (for Hub push)
```

GPU strongly recommended. CPU works only for demos and [SentenceTransformer] `StaticEmbedding`.