train-sentence-transformers
Train a sentence-transformers Model
**This SKILL.md is a router, not a manual.** It tells you which references and example scripts to load for your task. The actual content (recommended losses, evaluators, training-script structure, model selection, training-arg knobs, troubleshooting) lives in `references/` and `scripts/`.
**Do not synthesize a training script from this file alone.** Open the per-type production template (`scripts/train_<type>_example.py`) and copy it as your starting point. The templates contain load-bearing scaffolding (autocast helper, model-card class, logger silencing list, `force=True`, `seed`, TF32, version-compatible imports, named-evaluator metric handling) that prior agent runs have repeatedly missed when rolling their own from a synthesized snippet.
1. Identify the model type
| Tag | Class | What it does | When to pick |
|---|---|---|---|
| [SentenceTransformer] | `SentenceTransformer` | Maps each input to a fixed-dim dense vector | Retrieval, similarity, clustering, classification, paraphrase mining, dedup |
| [CrossEncoder] | `CrossEncoder` | Jointly scores text pairs, e.g. (query, passage) | Two-stage retrieval (rerank top-100 from bi-encoder), pair classification |
| [SparseEncoder] | `SparseEncoder` | Sparse vectors over the vocabulary | Learned-sparse retrieval, inverted-index backends (Elasticsearch / OpenSearch / Lucene) |
Tiebreakers when the request is ambiguous: "embedding model" / "vector search" / "similarity" → [SentenceTransformer]. "rerank" / "ranker" / "two-stage" → [CrossEncoder]. "SPLADE" / "sparse" / "inverted index" → [SparseEncoder]. If still unclear, ask.
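The tiebreaker rule above can be sketched as a small keyword lookup. This is only an illustration: the keyword lists are copied from the sentence above, while the function name `guess_model_type` and the "return `None` when zero or several tags match" behavior are assumptions, not part of the skill.

```python
from typing import Optional

# Keyword lists copied from the tiebreaker sentence; everything else is hypothetical.
TIEBREAKERS = {
    "SentenceTransformer": ("embedding model", "vector search", "similarity"),
    "CrossEncoder": ("rerank", "ranker", "two-stage"),
    "SparseEncoder": ("splade", "sparse", "inverted index"),
}


def guess_model_type(request: str) -> Optional[str]:
    """Return the single matching tag, or None when zero or several tags match."""
    text = request.lower()
    matches = [
        tag
        for tag, keywords in TIEBREAKERS.items()
        if any(keyword in text for keyword in keywords)
    ]
    # None signals "still unclear": the caller should ask the user.
    return matches[0] if len(matches) == 1 else None
```

A `None` result corresponds to the "if still unclear, ask" branch; a request that mentions keywords from two types is deliberately treated as ambiguous rather than resolved by order.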
2. Required reading
Read these in full before writing any code. Do not triage by perceived relevance.
Per-type — always required
[SentenceTransformer]
- `references/losses_sentence_transformer.md`: loss-to-data-shape mapping; `BatchSamplers.NO_DUPLICATES` requirement for MNRL-family; `Cached*` ↔ `gradient_checkpointing` incompatibility.
- `references/evaluators_sentence_transformer.md`: evaluator-to-task mapping; `metric_for_best_model` key construction (named vs unnamed); per-evaluator `primary_metric` values.
- `references/model_architectures.md`: encoder vs decoder vs static vs Router pipelines; pooling rules (mean / cls / lasttoken); auto-mean-pooling behavior for fresh-start MLM bases.
- `scripts/train_sentence_transformer_example.py`: production template; copy this as your starting point.
[CrossEncoder]
- `references/losses_cross_encoder.md`: pointwise / pairwise / listwise / distillation; `pos_weight` derivation; `activation_fn=Identity()` mandatory for non-BCE losses (silent eval-rank collapse otherwise).
- `references/evaluators_cross_encoder.md`: `CrossEncoderRerankingEvaluator` recipe; named-evaluator key format `eval_{name}_{primary_metric}`.
- `scripts/train_cross_encoder_example.py`: production template; copy this as your starting point.
[SparseEncoder]
- `references/losses_sparse_encoder.md`: `SpladeLoss` wrapper requirement; FLOPS regularizer weights; smoke-test active-dim ramp behavior.
- `references/evaluators_sparse_encoder.md`: `SparseNanoBEIREvaluator` (English-only) and the in-domain alternative; `eval_{name}_{primary_metric}` key format.
- `scripts/train_sparse_encoder_example.py`: production template; copy this as your starting point.
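The named-evaluator key format `eval_{name}_{primary_metric}` mentioned in the evaluator references can be illustrated with a tiny string helper. This is a sketch only: the helper name `metric_key`, the unnamed form `eval_{primary_metric}`, and the example evaluator name and metric are assumptions; the evaluator reference files remain the source of truth for the real values.

```python
def metric_key(primary_metric: str, name: str = "") -> str:
    """Build a metric_for_best_model key for a named or unnamed evaluator.

    Named evaluators follow eval_{name}_{primary_metric}; the unnamed form
    eval_{primary_metric} is an assumption made for this sketch.
    """
    parts = ["eval", name, primary_metric] if name else ["eval", primary_metric]
    return "_".join(parts)


# Hypothetical evaluator name and metric, for illustration only.
print(metric_key("ndcg@10", name="dev"))
print(metric_key("ndcg@10"))
```

Building the key from the same `name` string that is passed to the evaluator avoids the most common mismatch, where the evaluator is renamed but `metric_for_best_model` still points at the old key.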
Cross-cutting — always required (regardless of task)
- `references/training_args.md`: `TrainingArguments` knobs, precision rules (load fp32 + autocast bf16/fp16; never `torch_dtype=bfloat16`), `warmup_steps` (float) vs deprecated `warmup_ratio`, `save_steps` must be a multiple of `eval_steps` for `load_best_model_at_end`, schedulers, HPO, tracker, resume, hub-push variants.
- `references/dataset_formats.md`: column-matching rules (label name auto-detection; column-order-not-name); reshaping recipes; hard-negative mining options.
- `references/base_model_selection.md`: discovery commands; per-type model namespaces; ModernBERT-family `max_seq_length=8192` trap; `datasets >= 4` script-loader rejection; non-English starting-point shortcuts.
- `references/troubleshooting.md`: symptom-indexed failure recipes. Skim the section headings on every run, even a healthy one; the "Metrics don't improve" and "Hub push fails" entries cover bugs that bite frequently and are cheaper to recognize before they fire than to debug after.
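One of the rules listed above, that `save_steps` must be a multiple of `eval_steps` when `load_best_model_at_end` is enabled, is easy to check before constructing the training arguments. The sketch below is a standalone pre-flight check; the function name `check_step_alignment` is hypothetical and is not taken from the templates.

```python
def check_step_alignment(
    save_steps: int, eval_steps: int, load_best_model_at_end: bool = True
) -> None:
    """Raise early if checkpoints would not line up with evaluations."""
    if not load_best_model_at_end:
        return  # the constraint only matters when the best checkpoint is reloaded
    if eval_steps <= 0 or save_steps % eval_steps != 0:
        raise ValueError(
            f"save_steps={save_steps} must be a positive multiple of "
            f"eval_steps={eval_steps} when load_best_model_at_end=True"
        )


check_step_alignment(save_steps=200, eval_steps=100)  # passes silently
```

Failing here is cheaper than failing inside the trainer after the model and dataset have already been loaded.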
Cross-cutting — load when applicable
- `references/hardware_guide.md`: VRAM sizing, multi-GPU, FSDP / DeepSpeed, HF Jobs flavors. Required for >24GB models, multi-GPU, or HF Jobs runs.
- `references/hf_jobs_execution.md`: required when running on HF Jobs.
- `references/prompts_and_instructions.md`: required when using prompt-tuned bases (E5, BGE, GTE, Qwen3-Embedding, Instructor, Nomic, etc.) or adding `query:` / `passage:` style prefixes.
Variant scripts (open when the task matches)
- [SentenceTransformer] `scripts/train_sentence_transformer_<matryoshka|multi_dataset|with_lora|distillation|make_multilingual|static_embedding>_example.py`.
- [CrossEncoder] `scripts/train_cross_encoder_<distillation|listwise>_example.py`.
- [SparseEncoder] `scripts/train_sparse_encoder_distillation_example.py`.
- Hard-negative mining CLI: `scripts/mine_hard_negatives.py`.
3. Defaults
Override only if the user specifies otherwise:
- Local execution. Pitch HF Jobs only if local hardware can't fit the job.
- Single run. After it completes, propose experimentation if the user would benefit (weak/marginal verdict, "see how high you can push it" framing, etc.). Iteration rules are in `references/training_args.md` (Experimentation section).
- Public Hub push at end-of-run, wrapped in try-except. On HF Jobs (ephemeral env) ALSO enable in-trainer push (`push_to_hub=True` + `hub_strategy="every_save"`); details in `references/hf_jobs_execution.md`.
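A minimal sketch of the "Hub push wrapped in try-except" default follows. The helper name `push_model_safely` and the `HF_JOBS_EXTRA_ARGS` dictionary are hypothetical; the sketch only assumes that the model object exposes a `push_to_hub(repo_id)` method. The production templates hold the actual pattern.

```python
import logging

logger = logging.getLogger(__name__)

# Extra training arguments the text asks for on HF Jobs (ephemeral environment).
HF_JOBS_EXTRA_ARGS = {"push_to_hub": True, "hub_strategy": "every_save"}


def push_model_safely(model, repo_id: str) -> bool:
    """Try the end-of-run Hub push; log and continue instead of crashing."""
    try:
        model.push_to_hub(repo_id)
    except Exception as exc:  # broad on purpose: a failed push must not lose the run
        logger.warning("Hub push to %s failed: %s", repo_id, exc)
        return False
    return True
```

Returning a boolean lets the caller mention the failed push in the end-of-run summary while the locally saved model stays intact.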
4. Constraints the produced script must satisfy
These are non-negotiable contracts. Implementation lives in the production templates and references — do not reinvent.
- Capture the pre-training evaluator score as `baseline_eval` before `trainer.train()`.
- Emit a single end-of-run line: `VERDICT: WIN|MARGINAL|REGRESSION | score=... | baseline=... | delta=...`. A monitor scrapes for this.
- Silence `httpx`, `httpcore`, `huggingface_hub`, `urllib3`, `filelock`, `fsspec` to WARNING (otherwise HF download URLs flood the agent's context).
- Tee logs to `logs/{RUN_NAME}.log`.
- End with `model.push_to_hub(...)` wrapped in `try/except`.
- Smoke-test before any long run (`max_steps=1` + tiny dataset slice). The production templates show one common pattern (`SMOKE_TEST` env var).
- [CrossEncoder] Include `EarlyStoppingCallback(patience>=3)`, because CE rerankers often peak mid-training and then regress.
- [SparseEncoder] Log `query_active_dims` / `corpus_active_dims` on the verdict line; high nDCG with collapsed sparsity is not a win. The keys come back name-prefixed (e.g. `..._query_active_dims`); use suffix matching to pluck them. See the SPARSE production template for the exact pattern.
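Three of the contracts above (logger silencing, the verdict line, and suffix matching on name-prefixed metric keys) can be sketched with the standard library alone. This is an illustration under stated assumptions: the function names are hypothetical, and the `margin` threshold separating WIN from MARGINAL is invented for the example, since the text does not define one. Copy the real logic from the production templates.

```python
import logging

# Logger names taken from the constraint list above.
NOISY_LOGGERS = ("httpx", "httpcore", "huggingface_hub", "urllib3", "filelock", "fsspec")


def silence_noisy_loggers() -> None:
    """Raise the listed loggers to WARNING so download URLs stay out of the logs."""
    for name in NOISY_LOGGERS:
        logging.getLogger(name).setLevel(logging.WARNING)


def pluck_by_suffix(metrics: dict, suffix: str):
    """Return the first metric whose key ends with `suffix`, or None if absent."""
    for key, value in metrics.items():
        if key.endswith(suffix):
            return value
    return None


def verdict_line(score: float, baseline: float, margin: float = 0.01) -> str:
    """Format the single end-of-run line; `margin` is a hypothetical threshold."""
    delta = score - baseline
    if delta > margin:
        label = "WIN"
    elif delta >= 0:
        label = "MARGINAL"
    else:
        label = "REGRESSION"
    return f"VERDICT: {label} | score={score:.4f} | baseline={baseline:.4f} | delta={delta:+.4f}"
```

For a sparse encoder, something like `pluck_by_suffix(metrics, "_query_active_dims")` would supply the extra fields to append to the verdict line, without hard-coding the evaluator name into the key.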
5. Workflow
- Identify the model type (§1). Ask if ambiguous.
- Load the §2 required-reading files for that type.
- Open `scripts/train_<type>_example.py` and copy it as your starting point.
- Replace `MODEL_NAME`, `DATASET_NAME`, `RUN_NAME`, the loss, and the evaluator with the user's task. Cross-check loss/data-shape match against `references/losses_<type>.md`; cross-check the `metric_for_best_model` key against `references/evaluators_<type>.md` (named evaluators format the key as `eval_{name}_{primary_metric}`).
- Smoke-test (`max_steps=1`).
- Run.
- After the run, append to `logs/experiments.md` and propose iteration if the verdict is weak/marginal.
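The smoke-test step can be driven by an environment variable, as the constraints section notes for the production templates. The sketch below shows one possible shape; the function name, the `dataset_slice` key, the default slice size of 64, and the rule that an empty string or `"0"` means "off" are all assumptions rather than the templates' actual behavior.

```python
import os
from typing import Mapping


def smoke_test_overrides(
    environ: Mapping[str, str] = os.environ, slice_size: int = 64
) -> dict:
    """Return overrides for a smoke test when SMOKE_TEST is set, else an empty dict."""
    if environ.get("SMOKE_TEST", "0") in ("", "0"):
        return {}
    # max_steps=1 comes from the workflow above; the slice size is hypothetical.
    return {"max_steps": 1, "dataset_slice": slice_size}


overrides = smoke_test_overrides()
print(overrides or "full run")
```

In a training script, `max_steps` would be passed to the training arguments and `dataset_slice` used to take the first few rows of the training set before the trainer is built, so the same script serves both the smoke test and the real run.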
Prerequisites
```bash
pip install "sentence-transformers[train]>=5.0"  # add [train,image] / [audio] / [video] for [SentenceTransformer] multimodal
pip install trackio                               # optional tracker; or wandb / tensorboard / mlflow
hf auth login                                     # or set HF_TOKEN with write scope (for Hub push)
```
GPU strongly recommended. CPU works only for demos and [SentenceTransformer] `StaticEmbedding` models.