MTEB Retrieve
Overview
This skill provides guidance for text embedding retrieval tasks that involve encoding documents and queries using embedding models, computing similarity scores, and retrieving or ranking documents based on semantic similarity.
Workflow
Step 1: Inspect and Parse Data
Before writing any code, carefully inspect the raw data format:
- Read the data file and examine actual line contents
- Identify formatting artifacts such as:
- Line number prefixes (e.g., 1→, 2→, 1., 1:)
- Whitespace or tab characters
- Quote characters or escape sequences
- Header rows or metadata
- Design parsing logic that strips all non-content artifacts
Common data format issues:
- Files with line numbers prepended (e.g., 1→Document text here)
- CSV/TSV files with headers
- JSON files with nested structures
- Files with trailing whitespace or newlines
Verification: Print 2-3 parsed documents to confirm they contain only the actual text content.
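The parsing step above can be sketched as a small helper. This is a minimal sketch, not a definitive implementation: the prefix pattern is an assumption based on the artifacts listed above, and should be adjusted after inspecting the actual data.

```python
import re

def parse_documents(path):
    """Read a document file and strip line-number prefixes such as '1→', '2→', '1.' or '1:'.

    The prefix pattern below is an assumption drawn from the artifact
    examples above; adjust it to match what the raw file actually contains.
    """
    docs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            text = line.rstrip("\n").strip()
            if not text:
                continue  # skip blank lines
            # Drop a leading line number followed by an arrow, dot, or colon
            text = re.sub(r"^\d+\s*(?:→|\.|:)\s*", "", text)
            docs.append(text)
    return docs

# Verification: print a few parsed documents; repr() makes stray
# whitespace or quote characters visible.
# for doc in parse_documents("documents.txt")[:3]:
#     print(repr(doc))
```

Printing with repr() rather than print() alone makes leftover whitespace, tabs, or quote characters easy to spot during the verification step.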
Step 2: Load the Embedding Model
- Identify the model specified in the task (e.g., sentence-transformers/all-MiniLM-L6-v2)
- Load the model using the appropriate library (typically sentence-transformers)
- Verify model loading succeeded before proceeding

```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('model-name')
```

Step 3: Encode Documents and Query
- Encode all documents using the model's encode method
- Encode the query using the same model
- Ensure consistent encoding - same model and parameters for both
Step 4: Compute Similarities
- Use cosine similarity (most common for embedding retrieval)
- Compute similarity between query embedding and all document embeddings
- Store similarities with corresponding document indices
```python
from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity([query_embedding], document_embeddings)[0]
```
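The same computation can also be done directly with NumPy, which makes the underlying math explicit. This is a minimal sketch; the toy 3-dimensional vectors are assumptions for illustration only.

```python
import numpy as np

def cosine_similarities(query_embedding, document_embeddings):
    """Cosine similarity between one query vector and a matrix of document vectors."""
    q = np.asarray(query_embedding, dtype=float)
    docs = np.asarray(document_embeddings, dtype=float)
    # dot(q, d) / (||q|| * ||d||) for every document row
    return docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))

# Toy 3-dimensional embeddings (illustrative only)
query = [1.0, 0.0, 0.0]
docs = [[1.0, 0.0, 0.0],   # identical direction -> similarity 1.0
        [0.0, 1.0, 0.0]]   # orthogonal -> similarity 0.0
print(cosine_similarities(query, docs))  # -> [1. 0.]
```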
Step 5: Rank and Retrieve
- Sort documents by similarity score in descending order
- Handle the ranking request (e.g., "5th most similar" means index 4 after sorting)
- Extract the requested document(s)
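The ranking step can be sketched with np.argsort. The five similarity scores and document names here are assumed toy values, not from any real task; the point is the descending sort and the 1-indexed-to-0-indexed conversion.

```python
import numpy as np

# Assumed toy similarity scores for five documents (illustrative only)
similarities = np.array([0.12, 0.87, 0.45, 0.91, 0.33])
documents = ["doc A", "doc B", "doc C", "doc D", "doc E"]

# Indices sorted by similarity, highest first
ranked = np.argsort(similarities)[::-1]

# "5th most similar" is position 5 (1-indexed), i.e. index 4 after sorting
fifth_most_similar = documents[ranked[4]]
print(fifth_most_similar)  # -> doc A (score 0.12, lowest of the five)
```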
Step 6: Validate Results
Critical verification steps before finalizing:
- Print top 10 results with similarity scores and document text
- Semantic sanity check: Do top results relate to the query?
- If query is "terminal-bench", expect documents containing "terminal", "bench", or "benchmark"
- If results seem unrelated, investigate data parsing or encoding issues
- Check for anomalies:
- Are similarity scores reasonable (typically 0.0 to 1.0)?
- Are there unexpected ties in similarity values?
- Do document texts look properly parsed?
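The validation checks above can be sketched as a single helper that prints the top-k results and flags out-of-range scores. This is a minimal sketch; the example scores and documents in the final call are assumptions for illustration.

```python
import numpy as np

def print_top_results(similarities, documents, k=10):
    """Print the top-k documents with their similarity scores for a sanity check."""
    sims = np.asarray(similarities, dtype=float)
    ranked = np.argsort(sims)[::-1][:k]
    for rank, idx in enumerate(ranked, start=1):
        print(f"{rank:2d}. score={sims[idx]:.4f}  {documents[idx]}")
    # Basic anomaly check: cosine scores should fall within [-1, 1]
    # (and typically land in [0, 1] for natural-language embeddings)
    assert -1.0001 <= sims.min() and sims.max() <= 1.0001, "suspicious score range"

# Illustrative call with assumed scores and documents
print_top_results([0.9, 0.1, 0.5],
                  ["terminal benchmark", "cooking recipes", "MTEB paper"],
                  k=3)
```

Reading the printed list alongside the query is what catches parsing or encoding bugs: if the top documents look unrelated to the query, go back to Steps 1-3 before trusting the ranking.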
Common Pitfalls
1. Data Format Parsing Errors
Problem: Document files often include line numbers, prefixes, or other formatting artifacts.
Example: A file might contain:

```
1→Beyond the Imitation Game...
2→MTEB: Massive Text Embedding Benchmark
```

If not properly parsed, embeddings are computed on "1→Beyond the Imitation Game..." instead of just "Beyond the Imitation Game...".
Solution: Always inspect raw file contents and strip all formatting artifacts before encoding.
2. Skipping Validation
Problem: Accepting results without verification can lead to incorrect answers.
Solution: Always print intermediate results (top 10 documents with scores) and verify they make semantic sense given the query.
3. Off-by-One Errors in Ranking
Problem: Confusion between 0-indexed and 1-indexed rankings.
Example: "5th most similar" means:
- Sort by similarity descending
- Take index 4 (0-indexed) or position 5 (1-indexed)
Solution: Be explicit about indexing when retrieving ranked results.
4. Ignoring Semantic Reasonableness
Problem: Not questioning whether results make logical sense.
Example: If query is "terminal-bench" and the 5th result is "HumanEval: Benchmarking Python code generation", ask: Does this semantically relate to the query? If not, something may be wrong.
Solution: Apply domain knowledge to sanity-check results before finalizing.
Verification Checklist
Before submitting results, confirm:
- Data was inspected for formatting artifacts
- Documents were parsed to contain only actual text content
- Embedding model loaded successfully
- Query and documents encoded with same model
- Similarity computation used correct metric (cosine similarity)
- Top 10+ results printed and reviewed
- Top results semantically relate to the query
- Correct document extracted based on ranking request