rag-eval

On-disk RAG evaluation (`corpus/` + `train.json`)


Purpose


Guide agents through NVIDIA RAG Blueprint filesystem benchmarks: preparing `corpus/` and `train.json`, running `scripts/eval/evaluate_rag.py`, tuning retrieval and generation flags for quality comparisons, interpreting RAGAS JSON outputs, and triaging failures (HTTP/stream errors, empty contexts, collection mismatch, judge API).

For latency, throughput, and load testing, use the rag-perf skill (`scripts/rag-perf`, `docs/performance-benchmarking.md`), not this skill.


When not to use


Do not use this skill for: deploying or repairing services (use rag-blueprint); evaluating APIs without the `corpus/` + `train.json` layout; general ML experimentation unrelated to this evaluator; production monitoring/alerting; or latency/throughput benchmarking (use rag-perf).


Prerequisites


- Repo cloned; run commands from the repo root (imports and paths assume this).
- Python 3.11+ and uv; install the eval dependencies with `uv sync --project scripts/eval`.
- Reachable RAG server and ingestor (defaults are often `localhost:8081` / `8082`).
- `NVIDIA_API_KEY` for RAGAS (see credential hygiene); optional `RAG_EVAL_JUDGE_MODEL`.
- Dataset roots passed to `--dataset-paths` each contain `corpus/` and `train.json`.
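
The dataset prerequisite above can be checked before any run. The following is a minimal sketch, not part of the evaluator: `check_dataset_root` is a hypothetical helper, and the "corpus/ contains no files" check is an assumption added for convenience. It only verifies the directory layout and that `train.json` is a top-level array of objects; per-record field rules live in `references/dataset-and-conversion.md` and are not checked here.

```python
import json
from pathlib import Path


def check_dataset_root(root: Path) -> list[str]:
    """Return a list of layout problems for one dataset root (empty if none)."""
    problems = []
    corpus = root / "corpus"
    train = root / "train.json"

    if not corpus.is_dir():
        problems.append(f"{root}: missing corpus/ directory")
    elif not any(p.is_file() for p in corpus.rglob("*")):
        # Assumption: an empty corpus is treated as a problem in this sketch.
        problems.append(f"{root}: corpus/ contains no files")

    if not train.is_file():
        problems.append(f"{root}: missing train.json")
        return problems

    try:
        data = json.loads(train.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        problems.append(f"{train}: invalid JSON ({exc})")
        return problems

    if not isinstance(data, list):
        problems.append(f"{train}: top level must be a JSON array")
    elif not all(isinstance(row, dict) for row in data):
        problems.append(f"{train}: every array element must be an object")
    return problems
```

Calling `check_dataset_root(Path("/path/to/my_dataset"))` on each root before passing it to `--dataset-paths` surfaces layout mistakes without contacting the RAG server or the judge API.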


Instructions


1. Prepare data: Ensure each dataset directory matches the layout and `train.json` rules in `references/dataset-and-conversion.md`. When sources arrive as public links (sites or dataset pages), materialize documents under `corpus/`; prefer PDF for multimodal content so images stay embedded, and convert CSV/JSONL/etc. using the patterns there.
2. Run eval: `uv run --project scripts/eval python scripts/eval/evaluate_rag.py` with `--dataset-paths`, `--host`, and `--port`. See `references/benchmark-execution.md` for command examples, outputs, and errors. Use `references/evaluate-rag-cli.md` for flag-level detail.
3. Tune quality: Adjust `--top_k` / `--vdb_top_k`, reranker and query-rewriting toggles, and generation overrides (`--temperature`, `--top-p`, `--max-tokens`) as documented in `references/benchmark-execution.md` when comparing retrieval/generation configs for RAGAS scores.
4. Analyze results: Use `references/result-analysis.md` for scripts; scan `rag_*_evaluation_summary.json` for headline RAGAS metrics.
5. Triage errors: Use the error signal table and the Troubleshooting section below.
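
A quality comparison usually means running the same dataset under several retrieval settings. The sketch below only builds and prints the command lines for such a sweep; it does not run them. The flag names come from this document, but `sweep_commands` is a hypothetical helper, the numeric values are illustrative, and it assumes `--top_k` and `--vdb_top_k` each take a single integer. Confirm the actual types and defaults with `--help` or `references/evaluate-rag-cli.md`.

```python
import shlex

# Base invocation as given in this document; run it from the repo root.
BASE = ["uv", "run", "--project", "scripts/eval",
        "python", "scripts/eval/evaluate_rag.py"]


def sweep_commands(dataset, host, port, top_k_values, vdb_top_k):
    """Build one argv list per top_k value, holding vdb_top_k fixed."""
    commands = []
    for top_k in top_k_values:
        argv = BASE + [
            "--dataset-paths", dataset,
            "--host", host,
            "--port", str(port),
            "--top_k", str(top_k),          # assumed to accept an integer
            "--vdb_top_k", str(vdb_top_k),  # assumed to accept an integer
        ]
        commands.append(argv)
    return commands


# Illustrative values only; print the commands for review instead of running them.
for argv in sweep_commands("/path/to/my_dataset", "localhost", 8081, [4, 10], 100):
    print(shlex.join(argv))
```

Printing first makes it easy to review the exact flags before spending judge API calls on a full run.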


Examples


Set API key without putting secrets in shell history (preferred patterns): load from a gitignored env file or secrets manager; avoid committing `.env`; rotate keys if exposed. Details: `references/benchmark-execution.md#credential-hygiene-nvidia_api_key`
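
One way to follow the "gitignored env file" pattern from a Python wrapper is sketched below. It is an illustration under stated assumptions, not the repository's mechanism: `load_env_file` and `export_api_key` are hypothetical helpers, and the file is assumed to hold plain `KEY=VALUE` lines. The key is placed in the current process environment (and inherited by any subprocess started afterwards) and is never printed.

```python
import os
from pathlib import Path


def load_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; skip blanks, # comments, and malformed lines."""
    values = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def export_api_key(path: Path) -> None:
    """Put NVIDIA_API_KEY into this process's environment without echoing it."""
    key = load_env_file(path).get("NVIDIA_API_KEY", "")
    if not key:
        raise SystemExit(f"NVIDIA_API_KEY not found in {path}")
    os.environ["NVIDIA_API_KEY"] = key
```

Because the value only lives in the process environment, it stays out of shell history; the env file itself still needs to be listed in `.gitignore`.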

Minimal eval (key already in environment):

```bash
uv sync --project scripts/eval
uv run --project scripts/eval python scripts/eval/evaluate_rag.py \
  --dataset-paths /path/to/my_dataset \
  --host localhost \
  --port 8081
```

Pretty-print summary JSON:

```bash
python3 -m json.tool results/my_dataset/rag_my_dataset_evaluation_summary.json
```
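
When only the numbers matter, a short script can list every numeric value in the summary instead of printing the whole file. This sketch makes no assumption about the summary schema (which this document does not specify); it simply walks whatever JSON it finds. `numeric_leaves` and `print_summary` are hypothetical helper names.

```python
import json
from pathlib import Path


def numeric_leaves(node, prefix=""):
    """Yield (dotted_path, number) for every numeric value in a JSON tree."""
    if isinstance(node, bool):
        return  # bool is a subclass of int in Python; skip it
    if isinstance(node, (int, float)):
        yield prefix, node
    elif isinstance(node, dict):
        for key, value in node.items():
            yield from numeric_leaves(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(node, list):
        for i, value in enumerate(node):
            yield from numeric_leaves(value, f"{prefix}[{i}]")


def print_summary(path: Path) -> None:
    """Print one 'path: value' line per numeric entry in the summary file."""
    summary = json.loads(path.read_text(encoding="utf-8"))
    for name, value in numeric_leaves(summary):
        print(f"{name}: {value}")
```

For example, `print_summary(Path("results/my_dataset/rag_my_dataset_evaluation_summary.json"))` prints whichever metrics that file contains.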

More examples (skip ingestion, quality sweeps): `references/benchmark-execution.md`


Limitations


- Evaluator behavior is fixed to the filesystem contract and `evaluate_rag.py`; it does not substitute for custom offline judges or non-RAG benchmarks.
- Vector DB and embedding choices follow the deployed ingestor and RAG environment; this CLI alone does not override them.
- Scores depend on retrieval quality, judge model availability, and `NVIDIA_API_KEY`; empty contexts yield partial RAGAS metrics (see references).
- Large procedural detail lives under `references/` to keep routing concise; read those files when the user needs step-by-step conversion, full flags, or error tables.


Troubleshooting


| Error / signal | Likely cause | What to do |
| --- | --- | --- |
| Immediate exit mentioning `NVIDIA_API_KEY` | Missing or invalid key | Set key via secure channel; see credential hygiene in `references/benchmark-execution.md`. |
| `train.json must be a JSON array` | Wrong JSON shape | Top-level array of objects; validate per `references/dataset-and-conversion.md`. |
| Fewer rows in `evaluation_data.json` than `train.json` | Per-query failures | Check stderr: network or stream JSON errors; see error table in benchmark-execution. |
| Empty `generated_contexts` everywhere | Retrieval gap | Verify collection, ingestion, `top_k` / `vdb_top_k`, and `ingestor_server_url` without `/v1` suffix. |
| Ingestor 404 on upload | Bad ingestor base URL | Pass `http://host:port` only; the code appends `/v1/`. |

Full signal table: `references/benchmark-execution.md#common-error-cases-and-signals`
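
Two of the signals above (fewer evaluated rows than `train.json`, and empty `generated_contexts`) can be counted directly. The sketch below assumes `evaluation_data.json` is a JSON array of objects, each carrying a `generated_contexts` key; that shape is an assumption, not something this document states, so adjust it to the actual output. `diagnose` is a hypothetical helper.

```python
import json
from pathlib import Path


def diagnose(train_path: Path, eval_path: Path) -> dict[str, int]:
    """Count missing rows and rows whose generated_contexts is empty or absent."""
    train = json.loads(train_path.read_text(encoding="utf-8"))
    rows = json.loads(eval_path.read_text(encoding="utf-8"))
    # Assumption: each evaluated row is an object with a generated_contexts list.
    empty = sum(1 for row in rows if not row.get("generated_contexts"))
    return {
        "train_rows": len(train),
        "evaluated_rows": len(rows),
        "missing_rows": len(train) - len(rows),
        "empty_context_rows": empty,
    }
```

A non-zero `missing_rows` points at per-query failures (check stderr), while `empty_context_rows` equal to `evaluated_rows` points at a retrieval gap.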


Gotchas


- Run from repo root: paths and imports in `scripts/eval/evaluate_rag.py` assume this; a wrong directory silently breaks imports.
- `--ingestor_server_url`: pass `http://host:port` without `/v1`; the code appends `/v1/` automatically. Including `/v1` causes 404s on ingestor calls.
- Vector DB / embedding settings: not set by this CLI; configure via the deployed ingestor and RAG server env vars (e.g. `APP_VECTORSTORE_URL`, embedding model).
- `--model` / `--llm_endpoint`: forwarded verbatim only when explicitly set; omit to keep the server's configured LLM.
- Stale collections: a previous run's ingested data persists unless you use `--force_ingestion`. Use `--collection` with a unique name when comparing quality across isolated runs.
- Empty context metrics: if all `generated_contexts` are empty, RAGAS scores only `nv_accuracy` and leaves the other two metrics blank; this is not a silent success.
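
The `/v1` gotcha is easy to guard against in a wrapper script. The helper below is a hypothetical sketch (it is not part of `evaluate_rag.py`): it strips a trailing `/v1` or `/v1/` so that only `http://host:port` is passed to `--ingestor_server_url`.

```python
def ingestor_base_url(url: str) -> str:
    """Return the URL without a trailing /v1 segment or trailing slashes."""
    base = url.strip().rstrip("/")
    if base.endswith("/v1"):
        base = base[: -len("/v1")]
    return base.rstrip("/")
```

Passing a configured URL through `ingestor_base_url` before building the command leaves an already-correct `http://host:port` value unchanged.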


Source of truth


| Piece | Location |
| --- | --- |
| Driver | `scripts/eval/evaluate_rag.py` (`CORPUS_DIRECTORY` = `corpus`, `EVAL_DATA` = `train.json`) |
| Human README (always in-repo) | `scripts/eval/README.md` |
| Full CLI (flags, defaults) | `scripts/eval/evaluate_rag.py --help`; `references/evaluate-rag-cli.md` |
| Dataset / conversion | `references/dataset-and-conversion.md` |
| Runs, outputs, errors | `references/benchmark-execution.md` |
| Result analysis scripts | `references/result-analysis.md` |
| Latency / throughput | rag-perf skill, `docs/performance-benchmarking.md` |


Agent playbook


1. Run eval: `uv sync --project scripts/eval`, then `uv run --project scripts/eval python scripts/eval/evaluate_rag.py` with required `--dataset-paths`, `--host`, and `--port` (and env `NVIDIA_API_KEY`). Argument `--ingestor_server_url` is optional (defaults to `http://localhost:8082`); pass it only when overriding the ingestor endpoint.
2. Quality tuning: See `references/benchmark-execution.md`: `--top_k` / `--vdb_top_k`, reranker and query-rewriting toggles, `--temperature`, `--top-p`, `--max-tokens`.
3. Data conversion: Follow `references/dataset-and-conversion.md`.
4. Analyze results: `references/result-analysis.md`; quick scan: `python3 -m json.tool results/<dataset>/rag_<dataset>_evaluation_summary.json`
5. Error triage: `references/benchmark-execution.md#common-error-cases-and-signals`

