MTEB Retrieve
Overview
This skill provides guidance for text embedding retrieval tasks that involve encoding documents and queries using embedding models, computing similarity scores, and retrieving or ranking documents based on semantic similarity.
Workflow
Step 1: Inspect and Parse Data
Before writing any code, carefully inspect the raw data format:
- Read the data file and examine actual line contents
- Identify formatting artifacts such as:
- Line number prefixes (e.g., 1→, 2→, 1., 1:)
- Whitespace or tab characters
- Quote characters or escape sequences
- Header rows or metadata
- Design parsing logic that strips all non-content artifacts
Common data format issues:
- Files with line numbers prepended (e.g., 1→Document text here)
- CSV/TSV files with headers
- JSON files with nested structures
- Files with trailing whitespace or newlines
Verification: Print 2-3 parsed documents to confirm they contain only the actual text content.
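The parsing step above can be sketched as a small helper. This is a minimal sketch, not a definitive implementation: the prefix pattern is an assumption based on the artifacts listed above, and should be adjusted after inspecting the actual data.

```python
import re

def parse_documents(path):
    """Read a document file and strip line-number prefixes such as '1→', '2→', '1.' or '1:'.

    The prefix pattern below is an assumption drawn from the artifact
    examples above; adjust it to match what the raw file actually contains.
    """
    docs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            text = line.rstrip("\n").strip()
            if not text:
                continue  # skip blank lines
            # Drop a leading line number followed by an arrow, dot, or colon
            text = re.sub(r"^\d+\s*(?:→|\.|:)\s*", "", text)
            docs.append(text)
    return docs

# Verification: print a few parsed documents; repr() makes stray
# whitespace or quote characters visible.
# for doc in parse_documents("documents.txt")[:3]:
#     print(repr(doc))
```

Printing with repr() rather than print() alone makes leftover whitespace, tabs, or quote characters easy to spot during the verification step.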
Step 2: Load the Embedding Model
- Identify the model specified in the task (e.g., sentence-transformers/all-MiniLM-L6-v2)
- Load the model using the appropriate library (typically sentence-transformers)
- Verify model loading succeeded before proceeding

```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('model-name')
```

Step 3: Encode Documents and Query
- Encode all documents using the model's encode method
- Encode the query using the same model
- Ensure consistent encoding - same model and parameters for both
Step 4: Compute Similarities
- Use cosine similarity (most common for embedding retrieval)
- Compute similarity between query embedding and all document embeddings
- Store similarities with corresponding document indices
```python
from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity([query_embedding], document_embeddings)[0]
```
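The same computation can also be done directly with NumPy, which makes the underlying math explicit. This is a minimal sketch; the toy 3-dimensional vectors are assumptions for illustration only.

```python
import numpy as np

def cosine_similarities(query_embedding, document_embeddings):
    """Cosine similarity between one query vector and a matrix of document vectors."""
    q = np.asarray(query_embedding, dtype=float)
    docs = np.asarray(document_embeddings, dtype=float)
    # dot(q, d) / (||q|| * ||d||) for every document row
    return docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))

# Toy 3-dimensional embeddings (illustrative only)
query = [1.0, 0.0, 0.0]
docs = [[1.0, 0.0, 0.0],   # identical direction -> similarity 1.0
        [0.0, 1.0, 0.0]]   # orthogonal -> similarity 0.0
print(cosine_similarities(query, docs))  # -> [1. 0.]
```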
Step 5: Rank and Retrieve
- Sort documents by similarity score in descending order
- Handle the ranking request (e.g., "5th most similar" means index 4 after sorting)
- Extract the requested document(s)
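The ranking step can be sketched with np.argsort. The five similarity scores and document names here are assumed toy values, not from any real task; the point is the descending sort and the 1-indexed-to-0-indexed conversion.

```python
import numpy as np

# Assumed toy similarity scores for five documents (illustrative only)
similarities = np.array([0.12, 0.87, 0.45, 0.91, 0.33])
documents = ["doc A", "doc B", "doc C", "doc D", "doc E"]

# Indices sorted by similarity, highest first
ranked = np.argsort(similarities)[::-1]

# "5th most similar" is position 5 (1-indexed), i.e. index 4 after sorting
fifth_most_similar = documents[ranked[4]]
print(fifth_most_similar)  # -> doc A (score 0.12, lowest of the five)
```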
Step 6: Validate Results
Critical verification steps before finalizing:
- Print top 10 results with similarity scores and document text
- Semantic sanity check: Do top results relate to the query?
- If query is "terminal-bench", expect documents containing "terminal", "bench", or "benchmark"
- If results seem unrelated, investigate data parsing or encoding issues
- Check for anomalies:
- Are similarity scores reasonable (typically 0.0 to 1.0)?
- Are there unexpected ties in similarity values?
- Do document texts look properly parsed?
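The validation checks above can be sketched as a single helper that prints the top-k results and flags out-of-range scores. This is a minimal sketch; the example scores and documents in the final call are assumptions for illustration.

```python
import numpy as np

def print_top_results(similarities, documents, k=10):
    """Print the top-k documents with their similarity scores for a sanity check."""
    sims = np.asarray(similarities, dtype=float)
    ranked = np.argsort(sims)[::-1][:k]
    for rank, idx in enumerate(ranked, start=1):
        print(f"{rank:2d}. score={sims[idx]:.4f}  {documents[idx]}")
    # Basic anomaly check: cosine scores should fall within [-1, 1]
    # (and typically land in [0, 1] for natural-language embeddings)
    assert -1.0001 <= sims.min() and sims.max() <= 1.0001, "suspicious score range"

# Illustrative call with assumed scores and documents
print_top_results([0.9, 0.1, 0.5],
                  ["terminal benchmark", "cooking recipes", "MTEB paper"],
                  k=3)
```

Reading the printed list alongside the query is what catches parsing or encoding bugs: if the top documents look unrelated to the query, go back to Steps 1-3 before trusting the ranking.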
Common Pitfalls
1. Data Format Parsing Errors
Problem: Document files often include line numbers, prefixes, or other formatting artifacts.
Example: A file might contain:

```
1→Beyond the Imitation Game...
2→MTEB: Massive Text Embedding Benchmark
```

If not properly parsed, embeddings are computed on "1→Beyond the Imitation Game..." instead of just "Beyond the Imitation Game...".
Solution: Always inspect raw file contents and strip all formatting artifacts before encoding.
2. Skipping Validation
Problem: Accepting results without verification can lead to incorrect answers.
Solution: Always print intermediate results (top 10 documents with scores) and verify they make semantic sense given the query.
3. Off-by-One Errors in Ranking
Problem: Confusion between 0-indexed and 1-indexed rankings.
Example: "5th most similar" means:
- Sort by similarity descending
- Take index 4 (0-indexed) or position 5 (1-indexed)
Solution: Be explicit about indexing when retrieving ranked results.
4. Ignoring Semantic Reasonableness
Problem: Not questioning whether results make logical sense.
Example: If query is "terminal-bench" and the 5th result is "HumanEval: Benchmarking Python code generation", ask: Does this semantically relate to the query? If not, something may be wrong.
Solution: Apply domain knowledge to sanity-check results before finalizing.
Verification Checklist
Before submitting results, confirm:
- Data was inspected for formatting artifacts
- Documents were parsed to contain only actual text content
- Embedding model loaded successfully
- Query and documents encoded with same model
- Similarity computation used correct metric (cosine similarity)
- Top 10+ results printed and reviewed
- Top results semantically relate to the query
- Correct document extracted based on ranking request