rag-eval

On-disk RAG evaluation (`corpus/` + `train.json`)

Purpose

Guide agents through NVIDIA RAG Blueprint filesystem benchmarks: preparing `corpus/` and `train.json`, running `scripts/eval/evaluate_rag.py`, tuning retrieval and generation flags for quality comparisons, interpreting RAGAS JSON outputs, and triaging failures (HTTP/stream errors, empty contexts, collection mismatch, judge API).

For latency, throughput, and load testing, use the rag-perf skill (`scripts/rag-perf`, `docs/performance-benchmarking.md`), not this skill.

When not to use

Do not use this skill for: deploying or repairing services (use rag-blueprint); evaluating APIs without the `corpus/` + `train.json` layout; general ML experimentation unrelated to this evaluator; production monitoring/alerting; or latency/throughput benchmarking (use rag-perf).

Prerequisites

  • Repo cloned; run commands from repo root (imports and paths assume this).
  • Python 3.11+ and uv; eval deps: `uv sync --project scripts/eval`.
  • Reachable RAG server and ingestor (defaults often `localhost:8081` / `8082`).
  • `NVIDIA_API_KEY` for RAGAS (see credential hygiene); optional `RAG_EVAL_JUDGE_MODEL`.
  • Dataset roots passed to `--dataset-paths` each contain `corpus/` and `train.json`.
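
The last prerequisite can be checked before a run. The following is a minimal sketch, not part of the repository: `check_dataset_root` is a hypothetical helper that only verifies what this document states (a `corpus/` directory, a `train.json` file, and a top-level JSON array of objects). The per-row field rules live in `references/dataset-and-conversion.md` and are not checked here.

```python
import json
from pathlib import Path


def check_dataset_root(root: str) -> list[str]:
    """Return a list of layout problems; an empty list means the layout looks right."""
    problems = []
    base = Path(root)
    corpus = base / "corpus"
    train = base / "train.json"

    # corpus/ must exist and hold at least one file to ingest.
    if not corpus.is_dir():
        problems.append(f"missing directory: {corpus}")
    elif not any(p.is_file() for p in corpus.rglob("*")):
        problems.append(f"no files under: {corpus}")

    # train.json must exist and be a top-level array of objects.
    if not train.is_file():
        problems.append(f"missing file: {train}")
    else:
        try:
            data = json.loads(train.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"train.json is not valid JSON: {exc}")
        else:
            if not isinstance(data, list):
                problems.append("train.json must be a JSON array")
            elif not all(isinstance(row, dict) for row in data):
                problems.append("train.json array must contain only objects")
    return problems
```

Calling `check_dataset_root("/path/to/my_dataset")` on each path intended for `--dataset-paths` catches layout mistakes before any ingestion starts.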

Instructions

  1. Prepare data: Ensure each dataset directory matches the layout and `train.json` rules in `references/dataset-and-conversion.md`. When sources arrive as public links (sites or dataset pages), materialize documents under `corpus/`. Prefer PDF for multimodal content so images stay embedded; convert CSV/JSONL/etc. using the patterns in that reference.
  2. Run eval: Run `uv run --project scripts/eval python scripts/eval/evaluate_rag.py` with `--dataset-paths`, `--host`, and `--port`. See `references/benchmark-execution.md` for command examples, outputs, and errors. Use `references/evaluate-rag-cli.md` for flag-level detail.
  3. Tune quality: Adjust `--top_k` / `--vdb_top_k`, reranker and query-rewriting toggles, and generation overrides (`--temperature`, `--top-p`, `--max-tokens`) as documented in `references/benchmark-execution.md` when comparing retrieval/generation configs for RAGAS scores.
  4. Analyze results: Use `references/result-analysis.md` for scripts; scan `rag_*_evaluation_summary.json` for headline RAGAS metrics.
  5. Triage errors: Use the error signal table and the Troubleshooting section below.
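
For step 3, a quality sweep amounts to repeating the step 2 command with one setting changed per run. The sketch below only assembles the command lines; it is an illustration, not a script from the repository. The helper names and the `eval_top_k_<k>` collection naming are hypothetical, and it assumes `--top_k` and `--collection` each take a single value after the flag, which should be confirmed with `--help` or `references/evaluate-rag-cli.md`.

```python
import shlex

# Base invocation as given in this document; run it from the repo root.
BASE_CMD = [
    "uv", "run", "--project", "scripts/eval",
    "python", "scripts/eval/evaluate_rag.py",
]


def build_eval_command(dataset_path, host="localhost", port=8081, overrides=None):
    """Assemble argv for one run; `overrides` maps flag names to values."""
    cmd = BASE_CMD + ["--dataset-paths", dataset_path, "--host", host, "--port", str(port)]
    for flag, value in (overrides or {}).items():
        cmd += [flag, str(value)]
    return cmd


def sweep_commands(dataset_path, top_k_values):
    """One command per top_k value, each with its own collection name
    so that runs do not share previously ingested data."""
    return [
        build_eval_command(
            dataset_path,
            overrides={"--top_k": k, "--collection": f"eval_top_k_{k}"},
        )
        for k in top_k_values
    ]


if __name__ == "__main__":
    for cmd in sweep_commands("/path/to/my_dataset", [5, 10]):
        print(shlex.join(cmd))
```

Printing the commands first, rather than executing them directly, leaves room to review each run before it touches the ingestor.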

Examples

Set API key without putting secrets in shell history (preferred patterns): load from a gitignored env file or secrets manager; avoid committing `.env`; rotate keys if exposed. Details: `references/benchmark-execution.md#credential-hygiene-nvidia_api_key`.

Minimal eval (key already in environment):

```bash
uv sync --project scripts/eval
uv run --project scripts/eval python scripts/eval/evaluate_rag.py \
  --dataset-paths /path/to/my_dataset \
  --host localhost \
  --port 8081
```

Pretty-print summary JSON:

```bash
python3 -m json.tool results/my_dataset/rag_my_dataset_evaluation_summary.json
```

More examples (skip ingestion, quality sweeps): `references/benchmark-execution.md`.
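
One way to follow the "gitignored env file" pattern from a Python wrapper is sketched below. This is an assumption-laden illustration: `load_env_file` is a hypothetical helper, the file is assumed to hold plain `KEY=VALUE` lines (optionally prefixed with `export`), and the repository may already document a preferred mechanism in `references/benchmark-execution.md`.

```python
import os
from pathlib import Path


def load_env_file(path: str) -> list[str]:
    """Load KEY=VALUE lines into os.environ; return the variable names found."""
    loaded = []
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines that are not assignments.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key.startswith("export "):
            key = key[len("export "):].strip()
        value = value.strip().strip('"').strip("'")
        # setdefault: a value already exported in the environment wins.
        os.environ.setdefault(key, value)
        loaded.append(key)
    return loaded
```

Because the key is read from a file and never typed on the command line, it does not land in shell history; any process started afterwards from the same Python session inherits it through the environment.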

Limitations

  • Evaluator behavior is fixed to the filesystem contract and `evaluate_rag.py`; it does not substitute for custom offline judges or non-RAG benchmarks.
  • Vector DB / embedding choices follow the deployed ingestor and RAG env; this CLI alone does not override them.
  • Scores depend on retrieval quality, judge model availability, and `NVIDIA_API_KEY`; empty contexts yield partial RAGAS metrics (see references).
  • Large procedural detail lives under `references/` to keep routing concise; read those files when the user needs step-by-step conversion, full flags, or error tables.

Troubleshooting

| Error / signal | Likely cause | What to do |
| --- | --- | --- |
| Immediate exit mentioning `NVIDIA_API_KEY` | Missing or invalid key | Set key via secure channel; see credential hygiene in `references/benchmark-execution.md`. |
| `train.json must be a JSON array` | Wrong JSON shape | Top-level array of objects; validate per `references/dataset-and-conversion.md`. |
| Fewer rows in `evaluation_data.json` than `train.json` | Per-query failures | Check stderr: network or stream JSON errors; see error table in `references/benchmark-execution.md`. |
| Empty `generated_contexts` everywhere | Retrieval gap | Verify collection, ingestion, `top_k` / `vdb_top_k`, and `ingestor_server_url` without `/v1` suffix. |
| Ingestor 404 on upload | Bad ingestor base URL | Pass `http://host:port` only; the code appends `/v1/`. |

Full signal table: `references/benchmark-execution.md#common-error-cases-and-signals`.
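
Two of the signals above (fewer rows than `train.json`, empty `generated_contexts`) can be counted directly from the output files. The sketch below assumes, without confirmation from the source, that `evaluation_data.json` is a JSON array of objects and that each object carries a `generated_contexts` key; `triage_counts` is a hypothetical helper, not a repository script.

```python
import json
from pathlib import Path


def triage_counts(train_path: str, eval_data_path: str) -> dict:
    """Compare row counts and count rows whose generated_contexts is empty or absent."""
    train = json.loads(Path(train_path).read_text(encoding="utf-8"))
    rows = json.loads(Path(eval_data_path).read_text(encoding="utf-8"))
    empty = sum(1 for row in rows if not row.get("generated_contexts"))
    return {
        "train_rows": len(train),
        "evaluated_rows": len(rows),
        # Positive value: some queries failed and were dropped.
        "missing_rows": len(train) - len(rows),
        "empty_context_rows": empty,
    }
```

A non-zero `missing_rows` points at per-query failures (check stderr), while `empty_context_rows` equal to `evaluated_rows` points at the retrieval gap row of the table.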

Gotchas

  • Run from repo root: paths and imports in `scripts/eval/evaluate_rag.py` assume this; a wrong directory silently breaks imports.
  • `--ingestor_server_url`: pass `http://host:port` without `/v1`; the code appends `/v1/` automatically. Including `/v1` causes 404s on ingestor calls.
  • Vector DB / embedding settings: not set by this CLI; configure via the deployed ingestor and RAG server env vars (e.g. `APP_VECTORSTORE_URL`, embedding model).
  • `--model` / `--llm_endpoint`: forwarded verbatim only when explicitly set; omit to keep the server's configured LLM.
  • Stale collections: a previous run's ingested data persists unless you use `--force_ingestion`. Use `--collection` with a unique name when comparing quality across isolated runs.
  • Empty context metrics: if all `generated_contexts` are empty, RAGAS scores only `nv_accuracy` and leaves the other two metrics blank; do not read this as a successful run.
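
The `--ingestor_server_url` gotcha is easy to guard against before launching a run. A minimal sketch, assuming the value is an ordinary `http://host:port` string; `normalize_ingestor_url` is a hypothetical helper, not something the evaluator provides.

```python
def normalize_ingestor_url(url: str) -> str:
    """Strip a trailing /v1 (with or without a slash) so only http://host:port remains."""
    cleaned = url.strip().rstrip("/")
    if cleaned.endswith("/v1"):
        cleaned = cleaned[: -len("/v1")]
    return cleaned
```

For example, `normalize_ingestor_url("http://localhost:8082/v1/")` returns `http://localhost:8082`, which is the form the CLI expects.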

Source of truth

| Piece | Location |
| --- | --- |
| Driver | `scripts/eval/evaluate_rag.py` (`CORPUS_DIRECTORY` = `corpus`, `EVAL_DATA` = `train.json`) |
| Human README (always in-repo) | `scripts/eval/README.md` |
| Full CLI (flags, defaults) | `scripts/eval/evaluate_rag.py --help`; `references/evaluate-rag-cli.md` |
| Dataset / conversion | `references/dataset-and-conversion.md` |
| Runs, outputs, errors | `references/benchmark-execution.md` |
| Result analysis scripts | `references/result-analysis.md` |
| Latency / throughput | rag-perf skill, `docs/performance-benchmarking.md` |

Agent playbook

  1. Run eval: Run `uv sync --project scripts/eval`, then `uv run --project scripts/eval python scripts/eval/evaluate_rag.py` with required `--dataset-paths`, `--host`, and `--port` (and env `NVIDIA_API_KEY`). The `--ingestor_server_url` argument is optional (defaults to `http://localhost:8082`); pass it only when overriding the ingestor endpoint.
  2. Quality tuning: See `references/benchmark-execution.md`: `--top_k` / `--vdb_top_k`, reranker and query-rewriting toggles, `--temperature`, `--top-p`, `--max-tokens`.
  3. Data conversion: Follow `references/dataset-and-conversion.md`.
  4. Analyze results: See `references/result-analysis.md`; quick scan: `python3 -m json.tool results/<dataset>/rag_<dataset>_evaluation_summary.json`.
  5. Error triage: See `references/benchmark-execution.md#common-error-cases-and-signals`.
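
For the quick scan in step 4, a small script can flag summaries that contain blank metrics, which is the empty-context symptom described under Gotchas. This is a sketch under stated assumptions: the internal structure of the summary JSON is not described in this document, so the code walks whatever structure it finds, and it assumes a "blank" metric is serialized as `null` or `NaN`. Both function names are hypothetical.

```python
import json
import math
from pathlib import Path


def blank_fields(value, prefix=""):
    """Yield dotted paths whose value is null or NaN, at any nesting depth."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from blank_fields(child, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(value, list):
        for i, child in enumerate(value):
            yield from blank_fields(child, f"{prefix}[{i}]")
    elif value is None or (isinstance(value, float) and math.isnan(value)):
        yield prefix


def scan_summaries(results_dir: str) -> dict:
    """Map each rag_*_evaluation_summary.json under results_dir to its blank fields."""
    report = {}
    for path in sorted(Path(results_dir).rglob("rag_*_evaluation_summary.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        report[str(path)] = list(blank_fields(data))
    return report
```

An empty list for a file means no blank fields were found; a non-empty list is a prompt to check retrieval before comparing scores.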