rag-perf
RAG-Perf: config-driven perf benchmark CLI
Purpose
Drive a deployed NVIDIA RAG Blueprint server with a YAML config, run a server-side profiling pass (per-stage timing, citation quality, bottleneck inference) and an optional aiperf load test (TTFT / E2E / token & request throughput / error rate), and write a unified report. The CLI is intentionally minimal: `rag-perf -c <config>` plus `--help` / `--version`. Behaviour is fully config-driven; field variations belong in YAML.
Scope
- Accuracy / RAGAS scoring of answer quality is out of scope → use the rag-eval skill.
- Deploying, repairing, or configuring services (compose, helm, NIM env vars) is out of scope → use the rag-blueprint skill.
- Production monitoring / alerting is out of scope: rag-perf is a one-shot benchmark tool.
- Runtime requirement: a deployed RAG server reachable on the network.
Prerequisites
- Repo cloned; run commands from the repo root (config paths in the presets are repo-root-relative).
- Python 3.11+ and uv on PATH.
- Install rag-perf into its own uv-managed venv: `uv sync --project scripts/rag-perf`.
- For unit tests: install dev extras as well, `uv sync --project scripts/rag-perf --extra dev` (otherwise `pytest-asyncio` is missing and async tests error out at collection time).
- A reachable RAG server (default `http://localhost:8081`). For the aiperf phase, the bundled `nvidia_rag` endpoint plugin must be installed; `pip install -e ./scripts/rag-perf` registers it via the `aiperf.plugins` entry point.
- For synthetic queries: an OpenAI-compatible chat-completions endpoint reachable at `synthetic.llm_url` (default `http://localhost:8999/v1/chat/completions`).
- rag-perf itself runs without `NVIDIA_API_KEY` (unlike rag-eval). The synthetic LLM endpoint may require its own auth; that is the deployment's concern.
Instructions
- Pick a preset. The three under `scripts/rag-perf/configs/` are:
  - `quick_profile.yaml`: profile-only, ~30 s. Skips the load test. For fast iteration on retrieval / reranker tuning.
  - `single_run.yaml`: one concurrency level, profiling + aiperf, ~2 min. For regression checks.
  - `sweep.yaml`: multi-axis sweep. `load.concurrency`, `rag.vdb_top_k`, and `rag.reranker_top_k` are all `int | list[int]`; any of them given as a list becomes a sweep axis (Cartesian product).
- Edit the preset. Required: replace `rag.collection_names: ["<collection_name>"]` with a real collection on the deployed ingestor server. Verify the collection exists via `GET /v1/collections` on the ingestor. The placeholder `<collection_name>` validates fine, but every request will fail at retrieval. Use a copied YAML preset for variants; the CLI surface is intentionally config-only.
- Run. From repo root:

      uv run --project scripts/rag-perf rag-perf -c scripts/rag-perf/configs/single_run.yaml

  Same form for the other presets. The CLI accepts only `-c / --config` (required), `--help`, and `--version`.
- Read stdout. Every invocation prints, in order: a startup banner, a one-line summary, the fully resolved config as YAML (so the run is reproducible from terminal output), per-grid-point progress with the shlex-joined aiperf command in copy-pastable form, a rich per-point summary table (stage breakdown with bars, citation quality, bottleneck, load-test block), and finally a side-by-side comparison table auto-labelled by whichever axis varied. See `references/output-and-analysis.md`.
- Inspect artifacts. Layout depends on run shape: flat for single-point + `iterations=1`, nested under `iter_<i>/<point>/...` otherwise. See `references/output-and-analysis.md` for the full directory tree, file purposes, and how to parse `results.json` / `results.csv` / `report.md`.
- Summarise for the user. When reporting back, follow the playbook in `references/output-and-analysis.md#summarising-results-to-the-user`: pick the canonical result file for the run shape, build a headline table (concurrency × top-k axes × TTFT × throughput × bottleneck × citation quality), compute scaling efficiency on sweeps, always flag zero citations / non-zero error rate / suspect `llm_ttft_ms` / small-sample p99, and propose a concrete next-experiment YAML.
- Tune. Schema is fully documented in `docs/performance-benchmarking.md` and the deeper-dive references below. Common knobs: set `aiperf.enabled: false` for profile-only mode, increase `load.iterations` for variance estimation, set `load.sleep_between_points_s: 60` for overnight Cartesian sweeps.
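
The sweep rule in the preset step (any of `load.concurrency`, `rag.vdb_top_k`, `rag.reranker_top_k` given as a list becomes an axis, and the grid is their Cartesian product) can be sketched in a few lines of Python. This is an illustrative sketch, not rag-perf's actual implementation; the helper names `as_list` and `expand_grid` are hypothetical.

```python
from itertools import product


def as_list(value):
    """Normalise a scalar-or-list field to a list."""
    return list(value) if isinstance(value, list) else [value]


def expand_grid(concurrency, vdb_top_k, reranker_top_k):
    """Return one dict per grid point (Cartesian product of the three axes)."""
    axes = (as_list(concurrency), as_list(vdb_top_k), as_list(reranker_top_k))
    return [
        {"concurrency": c, "vdb_top_k": v, "reranker_top_k": r}
        for c, v, r in product(*axes)
    ]


# Two list-valued axes and one scalar: 3 * 2 * 1 grid points.
points = expand_grid(concurrency=[1, 4, 16], vdb_top_k=[20, 100], reranker_top_k=4)
print(len(points))  # 6
```

Keeping a field scalar pins that axis to one value, so a sweep only grows along the fields written as lists.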
Examples
Profile-only (quickest signal on retrieval / reranker tuning):

    uv run --project scripts/rag-perf rag-perf -c scripts/rag-perf/configs/quick_profile.yaml

Output: `rag-perf-results/quick_profile/run_<ts>/{profile_report.md, profile_results.json, profiling/}`. The `aiperf_rag_on/` directory is omitted. Filenames are `profile_*` because `aiperf.enabled: false`.

Single benchmark point with full report:

    uv run --project scripts/rag-perf rag-perf -c scripts/rag-perf/configs/single_run.yaml

Output: flat `run_<ts>/{report.md, results.json, results.csv, profiling/, aiperf_rag_on/}`.

Concurrency sweep:

    uv run --project scripts/rag-perf rag-perf -c scripts/rag-perf/configs/sweep.yaml

Output: nested `run_<ts>/iter_1/<CR:_VDB-K:_RERANKER-K:_…>/{profiling,aiperf_rag_on}/` per point, plus aggregate `report.md` / `results.json` / `results.csv` at the run root.

Run unit tests:

    uv sync --project scripts/rag-perf --extra dev   # one-time, installs pytest-asyncio
    uv run --project scripts/rag-perf python -m pytest tests/unit/test_rag_perf/
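
When scripting around these runs, the filename rule from the examples (plain names with aiperf enabled, a `profile_` prefix in profile-only mode) can be captured in a small helper. This is a sketch under that stated rule only; the function name `top_level_outputs` is hypothetical and is not part of rag-perf.

```python
def top_level_outputs(aiperf_enabled: bool) -> list[str]:
    """Return the top-level result filenames expected for a run."""
    # Profile-only runs (aiperf.enabled: false) prefix every top-level file.
    prefix = "" if aiperf_enabled else "profile_"
    return [f"{prefix}report.md", f"{prefix}results.json", f"{prefix}results.csv"]


print(top_level_outputs(True))   # ['report.md', 'results.json', 'results.csv']
print(top_level_outputs(False))  # ['profile_report.md', 'profile_results.json', 'profile_results.csv']
```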
Limitations
- The CLI is config-only: author or copy YAML to vary a parameter.
- `load.concurrency` / `rag.vdb_top_k` / `rag.reranker_top_k` accept `int | list[int]`; the validator requires unique list values because each value names a unique point dir.
- `input.file` and `input.synthetic` follow an XOR rule: setting both fails validation. When neither is set, `synthetic` auto-fills with defaults so a bare config still validates.
- File-based input format is inferred from extension only (`.jsonl` or `.csv`); other extensions are rejected.
- Synthetic generation streams each query to disk as it completes (failure-resilient) but fails fast on the first LLM error; partial JSONL is preserved. Re-run after fixing the endpoint.
- Reasoning models (Nemotron Omni, Qwen-Reasoning) require `synthetic.disable_thinking: true` (the default). Without it the model exhausts the token budget on chain-of-thought and `content` returns empty; the generator now raises with a clear message instead of substituting `reasoning_content` for the answer.
- aiperf-specific knobs outside the YAML surface (request rate distribution, GPU telemetry config, etc.) require editing `AiperfRunner._base_aiperf_cmd` in `scripts/rag-perf/rag_perf/runner.py`.
- Procedural detail lives under `references/` to keep this file concise.
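
The input rules above (XOR between `input.file` and `input.synthetic`, extension-only format inference, unique sweep values) can be illustrated with a minimal validator. This is a sketch of the rules as described, assuming plain function arguments rather than rag-perf's real schema classes; `validate_input` and `validate_unique` are hypothetical names and the error messages are invented for illustration.

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {".jsonl", ".csv"}


def validate_input(file=None, synthetic=None):
    """Apply the XOR rule and return (kind, value) for the chosen input source."""
    if file is not None and synthetic is not None:
        raise ValueError("input.file and input.synthetic are mutually exclusive")
    if file is None:
        # Neither set, or only synthetic: use synthetic, with empty defaults if absent.
        return "synthetic", synthetic if synthetic is not None else {}
    ext = Path(file).suffix
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported input extension: {ext!r}")
    return ext.lstrip("."), file


def validate_unique(name, values):
    """Reject duplicate sweep values, since each value names its own point dir."""
    if len(set(values)) != len(values):
        raise ValueError(f"{name} contains duplicate values: {values}")
    return values


print(validate_input())                       # ('synthetic', {})
print(validate_input(file="queries.jsonl"))   # ('jsonl', 'queries.jsonl')
```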
Troubleshooting
| Error / signal | Likely cause | What to do |
|---|---|---|
| | Both `input.file` and `input.synthetic` set | Pick one. The XOR validator runs at YAML load time. |
| | Extension other than `.jsonl` or `.csv` | Rename or convert. |
| | Duplicate values in a sweep list, e.g. `load.concurrency` | Each concurrency maps to a unique point dir; dedupe. |
| | YAML had warmup set to 0 | aiperf rejects warmup=0; minimum is 1. |
| | Reasoning model used CoT and ran out of tokens | Set `synthetic.disable_thinking: true`. |
| | Bad URL, server down, wrong collection | Verify … |
| Per-iteration … | Some requests timed out / errored mid-run | Check rag-server logs, raise … |
| | LLM endpoint rejected a request mid-generation | Partial JSONL is at … |
| | Collection mismatch between … | Run … |
| Tests error with … | Dev extras missing | `uv sync --project scripts/rag-perf --extra dev` |
| CI: … | rag-perf package missing from CI venv | Add … |
Gotchas
- Run from repo root. Preset configs reference `scripts/rag-perf/examples/queries.jsonl` and `scripts/rag-perf/prompts/default_prompts.yaml` with repo-root-relative paths. Running from inside `scripts/rag-perf/` will fail those file lookups.
- CLI is config-only. Edit the YAML or copy a preset for URL, concurrency, collection, and similar fields.
- Always edit `rag.collection_names` before the first run. The presets ship with `["<collection_name>"]` as a deliberate placeholder. Validation passes, but retrieval fails silently for every request; this manifests as `Citation count (mean): 0` everywhere.
- `load.concurrency_list`, `rag.vdb_top_k_list`, `rag.reranker_top_k_list` are read-only properties that normalise scalar-or-list to a list. Use them when reasoning about the grid; the underlying YAML field is whatever the user wrote.
- `aiperf.enabled: false` changes filenames. The top-level outputs become `profile_report.md` / `profile_results.json` / `profile_results.csv`. The aggregate sweep table also suppresses load-test rows and the "Optimal throughput" footer.
- Resolved-config dump is verbose (50+ lines); this is expected. It's what makes terminal output a self-contained reproducer; don't filter it out in scripts.
- The aiperf shell command is logged before each subprocess. Look for `\n  $ python -m aiperf profile -m ... --endpoint-type nvidia_rag ...` in stdout; it is copy-paste runnable for reproducing a single point outside rag-perf.
- `--endpoint-type nvidia_rag` comes from the bundled plugin at `scripts/rag-perf/rag_perf/plugin/nvidia_rag.py`. It teaches aiperf about the RAG `/v1/generate` request shape and parses citations + per-stage `metrics` out of the SSE stream. If aiperf can't resolve `nvidia_rag`, rag-perf needs editable installation in the venv: re-run `uv sync --project scripts/rag-perf` (or `uv pip install -e ./scripts/rag-perf`).
- Sweep-mode point-name collision. When two points differ only in concurrency (e.g. `[1, 4]` × single `vdb_top_k`), the dir name encodes everything: `CR:1_ISL:50_OSL:512_VDB-K:20_RERANKER-K:4_Model:...`. Cluster / GPU / experiment_name (`output.cluster`, `output.gpu`, `output.experiment_name`) are appended too, which is useful for diff-friendly artifact paths across machines.
- `load.iterations > 1` repeats the entire grid. Each repetition writes to its own `iter_<i>/`. Aggregate CSV row count = `n_points × iterations`.
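
The read-only `*_list` properties and the aggregate row-count rule can be pictured with a small stand-in class. This is a hedged sketch: `LoadConfig` is a hypothetical dataclass written for illustration, not rag-perf's real schema, and only the scalar-or-list normalisation and the `n_points × iterations` arithmetic are taken from the notes above.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class LoadConfig:
    """Hypothetical stand-in for the `load` section of the YAML."""

    concurrency: Union[int, List[int]] = 1
    iterations: int = 1

    @property
    def concurrency_list(self) -> List[int]:
        # Read-only view: a scalar becomes a one-element list, a list is copied.
        if isinstance(self.concurrency, list):
            return list(self.concurrency)
        return [self.concurrency]


scalar = LoadConfig(concurrency=4)
sweep = LoadConfig(concurrency=[1, 4], iterations=3)

print(scalar.concurrency_list)  # [4]
print(sweep.concurrency_list)   # [1, 4]

# With a single vdb_top_k and reranker_top_k, n_points is just the number of
# concurrency values, so the aggregate CSV has n_points * iterations rows.
n_points = len(sweep.concurrency_list)
print(n_points * sweep.iterations)  # 6
```

The underlying `concurrency` attribute still holds whatever the user wrote, which mirrors the note that the YAML field itself is not rewritten.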
Source of truth
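| Piece | Location |
|---|---|
| Driver | |
| Schema | |
| Orchestrator | |
| aiperf plugin | `scripts/rag-perf/rag_perf/plugin/nvidia_rag.py` |
| User-facing doc | `docs/performance-benchmarking.md` |
| Presets | `scripts/rag-perf/configs/` |
| Sample queries | `scripts/rag-perf/examples/queries.jsonl` |
| Synthetic prompts | `scripts/rag-perf/prompts/default_prompts.yaml` |
| Config schema details | |
| Synthetic-query generation | |
| Output layout & metric semantics | `references/output-and-analysis.md` |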
Agent playbook
- Sync deps: `uv sync --project scripts/rag-perf` (one-time per checkout).
- Pick & customise a preset: copy `scripts/rag-perf/configs/<preset>.yaml` if you want a variant; always set `rag.collection_names` to a real collection.
- Run: `uv run --project scripts/rag-perf rag-perf -c <config>` from repo root.
- Read the per-point + aggregate tables on stdout. Bottleneck inference is in the per-point profiling section; comparison across points is the final aggregate table.
- Parse artifacts under `output.dir/run_<ts>/`; see `references/output-and-analysis.md`. For multi-point runs, `results.csv` has one row per (point × iteration).
- Summarise for the user using the playbook in `references/output-and-analysis.md#summarising-results-to-the-user`: headline table, scaling-efficiency math for sweeps, mandatory flags for zero citations / non-zero errors / suspect `llm_ttft_ms` / low sample size, and a concrete next-experiment YAML.
- Tune retrieval / reranker: flip to `quick_profile.yaml` or `aiperf.enabled: false` for fast iteration, then return to `single_run.yaml` / `sweep.yaml` when characterising under load.
- Triage failures: see Troubleshooting above and `references/output-and-analysis.md` for empty-citation / bottleneck=N/A patterns.
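
For the scaling-efficiency step, one common definition is throughput gain divided by concurrency gain, relative to the lowest-concurrency point. The sketch below assumes that definition; the authoritative formula lives in the referenced playbook and may differ. The row keys `concurrency` and `req_per_s` are hypothetical and are not the real `results.csv` column names, and the numbers are made up for illustration.

```python
def scaling_efficiency(rows):
    """Return {concurrency: efficiency} relative to the lowest-concurrency row.

    Efficiency here is (throughput / baseline throughput) divided by
    (concurrency / baseline concurrency); 1.0 means perfectly linear scaling.
    """
    ordered = sorted(rows, key=lambda r: r["concurrency"])
    base = ordered[0]
    return {
        r["concurrency"]: (r["req_per_s"] / base["req_per_s"])
        / (r["concurrency"] / base["concurrency"])
        for r in ordered
    }


# Made-up numbers purely for illustration; not measured results.
rows = [
    {"concurrency": 1, "req_per_s": 2.0},
    {"concurrency": 4, "req_per_s": 6.0},
    {"concurrency": 8, "req_per_s": 8.0},
]
print(scaling_efficiency(rows))  # {1: 1.0, 4: 0.75, 8: 0.5}
```

A value well below 1.0 at higher concurrency suggests the server is saturating, which is a cue to look at the bottleneck column for that point.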