rag-eval
On-disk RAG evaluation (`corpus/` + `train.json`)

Purpose
Guide agents through NVIDIA RAG Blueprint filesystem benchmarks: preparing `corpus/` and `train.json`, running `scripts/eval/evaluate_rag.py`, tuning retrieval and generation flags for quality comparisons, interpreting RAGAS JSON outputs, and triaging failures (HTTP/stream errors, empty contexts, collection mismatch, judge API).

For latency, throughput, and load testing, use the rag-perf skill (`scripts/rag-perf`, `docs/performance-benchmarking.md`), not this skill.

When not to use
Do not use this skill for: deploying or repairing services (use rag-blueprint); evaluating APIs without the `corpus/` + `train.json` layout; general ML experimentation unrelated to this evaluator; production monitoring/alerting; or latency/throughput benchmarking (use rag-perf).

Prerequisites
- Repo cloned; run commands from repo root (imports and paths assume this).
- Python 3.11+ and uv; eval deps: `uv sync --project scripts/eval`.
- Reachable RAG server and ingestor (defaults often `localhost:8081` / `8082`).
- `NVIDIA_API_KEY` for RAGAS (see credential hygiene); optional `RAG_EVAL_JUDGE_MODEL`.
- Dataset roots passed to `--dataset-paths` each contain `corpus/` and `train.json`.
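
The dataset layout prerequisite can be checked before a run. The following is a minimal sketch, not part of the repository: `check_dataset_root` is a hypothetical helper, and it encodes only what this document states (a `corpus/` directory and a `train.json` file per dataset root). The top-level-array-of-objects check is an assumption drawn from the Troubleshooting table; the authoritative rules are in `references/dataset-and-conversion.md` and are not reproduced here.

```python
import json
from pathlib import Path


def check_dataset_root(root):
    """Return a list of layout problems for one dataset root (empty list means OK)."""
    root = Path(root)
    problems = []

    # Each dataset root is expected to contain a corpus/ directory.
    corpus = root / "corpus"
    if not corpus.is_dir():
        problems.append(f"missing directory: {corpus}")

    # Each dataset root is expected to contain train.json.
    train = root / "train.json"
    if not train.is_file():
        problems.append(f"missing file: {train}")
        return problems

    try:
        data = json.loads(train.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        problems.append(f"{train} is not valid JSON: {exc}")
        return problems

    # Assumption: train.json is a top-level array of objects.
    if not isinstance(data, list) or not all(isinstance(item, dict) for item in data):
        problems.append(f"{train} must be a top-level array of objects")

    return problems
```

Calling `check_dataset_root("/path/to/my_dataset")` from the repo root and refusing to start the evaluation when the list is non-empty keeps layout mistakes from surfacing later as evaluator errors.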
Instructions
- Prepare data: Ensure each dataset directory matches the layout and `train.json` rules in `references/dataset-and-conversion.md`. When sources arrive as public links (sites or dataset pages), materialize documents under `corpus/`; prefer PDF for multimodal content so images stay embedded, and convert CSV/JSONL/etc. using the patterns there.
- Run eval: `uv run --project scripts/eval python scripts/eval/evaluate_rag.py` with `--dataset-paths`, `--host`, and `--port`. See `references/benchmark-execution.md` for command examples, outputs, and errors. Use `references/evaluate-rag-cli.md` for flag-level detail.
- Tune quality: Adjust `--top_k` / `--vdb_top_k`, reranker and query-rewriting toggles, and generation overrides (`--temperature`, `--top-p`, `--max-tokens`) as documented in `references/benchmark-execution.md` when comparing retrieval/generation configs for RAGAS scores.
- Analyze results: Use `references/result-analysis.md` for scripts; scan `rag_*_evaluation_summary.json` for headline RAGAS metrics.
- Triage errors: Use the error signal table and the Troubleshooting section below.
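
When the run step is scripted, the command can be assembled as an argument list instead of a shell string. This is a sketch under stated assumptions: `build_eval_command` and its `extra_args` parameter are hypothetical names, and the `localhost` / `8081` defaults only mirror the minimal example in this document, not necessarily the CLI's own defaults.

```python
import shlex


def build_eval_command(dataset_path, host="localhost", port=8081, extra_args=()):
    """Return the evaluate_rag.py invocation as an argv list (run it from the repo root)."""
    cmd = [
        "uv", "run", "--project", "scripts/eval",
        "python", "scripts/eval/evaluate_rag.py",
        "--dataset-paths", str(dataset_path),
        "--host", str(host),
        "--port", str(port),
    ]
    # extra_args is appended verbatim; this helper does not validate flag names or values.
    cmd.extend(str(arg) for arg in extra_args)
    return cmd


baseline = build_eval_command("/path/to/my_dataset")
print(shlex.join(baseline))
```

To execute the result, pass the list to `subprocess.run(cmd, check=True)` from the repo root. Tuning flags can be supplied through `extra_args`; their exact value syntax is documented in `references/evaluate-rag-cli.md` and is not assumed here.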
Examples
Set the API key without putting secrets in shell history (preferred patterns): load it from a gitignored env file or a secrets manager; avoid committing `.env`; rotate keys if exposed. Details: `references/benchmark-execution.md#credential-hygiene-nvidia_api_key`.

Minimal eval (key already in environment):

```bash
uv sync --project scripts/eval
uv run --project scripts/eval python scripts/eval/evaluate_rag.py \
  --dataset-paths /path/to/my_dataset \
  --host localhost \
  --port 8081
```

Pretty-print summary JSON:

```bash
python3 -m json.tool results/my_dataset/rag_my_dataset_evaluation_summary.json
```

More examples (skip ingestion, quality sweeps): `references/benchmark-execution.md`.

Limitations
- Evaluator behavior is fixed to the filesystem contract and `evaluate_rag.py`; it does not substitute for custom offline judges or non-RAG benchmarks.
- Vector DB / embedding choices follow the deployed ingestor and RAG env; they are not overridden by this CLI alone.
- Scores depend on retrieval quality, judge model availability, and `NVIDIA_API_KEY`; empty contexts yield partial RAGAS metrics (see references).
- Large procedural detail lives under `references/` to keep routing concise; read those files when the user needs step-by-step conversion, full flags, or error tables.
Troubleshooting
| Error / signal | Likely cause | What to do |
|---|---|---|
| Immediate exit mentioning … | Missing or invalid key | Set key via secure channel; see credential hygiene in … |
| … | Wrong JSON shape | Top-level array of objects; validate per … |
| Fewer rows in … | Per-query failures | Check stderr: network or stream JSON errors; see error table in benchmark-execution. |
| Empty … | Retrieval gap | Verify collection, ingestion, … |
| Ingestor 404 on upload | Bad ingestor base URL | Pass … |

Full signal table: `references/benchmark-execution.md#common-error-cases-and-signals`.

Gotchas
- Run from repo root: paths and imports in `scripts/eval/evaluate_rag.py` assume this; a wrong directory silently breaks imports.
- `--ingestor_server_url`: pass `http://host:port` without `/v1`; the code appends `/v1/` automatically. Including `/v1` causes 404s on ingestor calls.
- Vector DB / embedding settings: not set by this CLI; configure via the deployed ingestor and RAG server env vars (e.g. `APP_VECTORSTORE_URL`, embedding model).
- `--model` / `--llm_endpoint`: forwarded verbatim only when explicitly set; omit to keep the server's configured LLM.
- Stale collections: a previous run's ingested data persists unless you use `--force_ingestion`. Use `--collection` with a unique name when comparing quality across isolated runs.
- Empty context metrics: if all `generated_contexts` are empty, RAGAS scores only `nv_accuracy` and leaves the other two metrics blank; this is not a silent success.
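
The `--ingestor_server_url` gotcha can be guarded against in wrapper scripts by normalizing the URL before it reaches the CLI. The helper below is a hypothetical sketch, not code from the repository; it assumes only what the gotcha states, namely that the evaluator appends `/v1/` itself and that a caller-supplied `/v1` suffix leads to 404s.

```python
def normalize_ingestor_url(url):
    """Strip a trailing /v1 or /v1/ so the evaluator can append its own suffix."""
    base = url.strip().rstrip("/")
    if base.endswith("/v1"):
        # Drop the caller-supplied suffix; the evaluator adds /v1/ on its own.
        base = base[: -len("/v1")]
    return base.rstrip("/")
```

A URL that is already in `http://host:port` form passes through unchanged, so the helper is safe to apply unconditionally.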
Source of truth
| Piece | Location |
|---|---|
| Driver | |
| Human README (always in-repo) | |
| Full CLI (flags, defaults) | |
| Dataset / conversion | |
| Runs, outputs, errors | |
| Result analysis scripts | |
| Latency / throughput | rag-perf skill, |
Agent playbook
- Run eval: `uv sync --project scripts/eval`, then `uv run --project scripts/eval python scripts/eval/evaluate_rag.py` with required `--dataset-paths`, `--host`, and `--port` (and env `NVIDIA_API_KEY`). Argument `--ingestor_server_url` is optional (defaults to `http://localhost:8082`); pass it only when overriding the ingestor endpoint.
- Quality tuning: See `references/benchmark-execution.md`: `--top_k` / `--vdb_top_k`, reranker and query-rewriting toggles, `--temperature`, `--top-p`, `--max-tokens`.
- Data conversion: Follow `references/dataset-and-conversion.md`.
- Analyze results: `references/result-analysis.md`; quick scan: `python3 -m json.tool results/<dataset>/rag_<dataset>_evaluation_summary.json`.
- Error triage: `references/benchmark-execution.md#common-error-cases-and-signals`.
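
For comparing several runs at once, the quick scan above can be extended to every summary file under a results directory. This is a rough sketch with hypothetical helper names. It assumes each summary is a JSON object whose headline metrics are top-level numeric values; the actual schema is not described in this document, so no metric names are hard-coded. The scripts in `references/result-analysis.md` remain the supported route.

```python
import json
from pathlib import Path


def load_summaries(results_dir="results"):
    """Map each dataset directory name to its parsed evaluation summary."""
    summaries = {}
    pattern = "*/rag_*_evaluation_summary.json"
    for path in sorted(Path(results_dir).glob(pattern)):
        with path.open(encoding="utf-8") as fh:
            summaries[path.parent.name] = json.load(fh)
    return summaries


def numeric_fields(summary):
    """Keep only top-level numeric values, without assuming specific metric names."""
    if not isinstance(summary, dict):
        return {}
    return {
        key: value
        for key, value in summary.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }
```

Looping over `load_summaries().items()` and printing `numeric_fields(summary)` for each dataset gives a side-by-side view of whatever numeric values the summaries expose.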