rag-perf

RAG-Perf — config-driven perf benchmark CLI


Purpose


Drive a deployed NVIDIA RAG Blueprint server with a YAML config, run a server-side profiling pass (per-stage timing, citation quality, bottleneck inference) and an optional aiperf load test (TTFT / E2E / token & request throughput / error rate), and write a unified report. The CLI is intentionally minimal: `rag-perf -c <config>` plus `--help` and `--version`. Behaviour is fully config-driven; field variations belong in YAML.
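
For scripted use, the invocation described above can be assembled from Python. This is a minimal sketch that assumes the `uv run --project scripts/rag-perf rag-perf -c <config>` form used throughout this document; the helper name `build_rag_perf_cmd` is hypothetical and is not part of rag-perf.

```python
import shlex


def build_rag_perf_cmd(config: str, project: str = "scripts/rag-perf") -> list[str]:
    """Return the argv for `uv run --project <project> rag-perf -c <config>`.

    Hypothetical helper for illustration only; rag-perf does not ship it.
    """
    # The CLI accepts only -c/--config, so every other setting must live in the YAML.
    return ["uv", "run", "--project", project, "rag-perf", "-c", config]


cmd = build_rag_perf_cmd("scripts/rag-perf/configs/single_run.yaml")
print(shlex.join(cmd))
# To execute it, run from the repo root: subprocess.run(cmd, check=True)
```

Keeping the command as a list avoids shell-quoting problems; `shlex.join` is used only to print a copy-pastable form.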


Scope


- Accuracy / RAGAS scoring of answer quality → use the rag-eval skill.
- Deploying, repairing, or configuring services (compose, helm, NIM env vars) → use the rag-blueprint skill.
- Production monitoring / alerting is out of scope: rag-perf is a one-shot benchmark tool.
- Runtime requirement: a deployed RAG server reachable on the network.


Prerequisites


- Repo cloned; run commands from the repo root (config paths in the presets are repo-root-relative).
- Python 3.11+ and uv on PATH.
- Install rag-perf into its own uv-managed venv: `uv sync --project scripts/rag-perf`.
- For unit tests, install dev extras as well: `uv sync --project scripts/rag-perf --extra dev` (otherwise `pytest-asyncio` is missing and async tests error out at collection time).
- A reachable RAG server (default `http://localhost:8081`). For the aiperf phase, the bundled `nvidia_rag` endpoint plugin must be installed; `pip install -e ./scripts/rag-perf` registers it via the `aiperf.plugins` entry point.
- For synthetic queries: an OpenAI-compatible chat-completions endpoint reachable at `synthetic.llm_url` (default `http://localhost:8999/v1/chat/completions`).
- rag-perf itself runs without `NVIDIA_API_KEY` (unlike rag-eval). The synthetic LLM endpoint may require its own auth; that's the deployment's concern.


Instructions


1. Pick a preset. The three under `scripts/rag-perf/configs/` are:
   - `quick_profile.yaml`: profile-only, ~30 s. Skips load test. For fast iteration on retrieval / reranker tuning.
   - `single_run.yaml`: one concurrency level, profiling + aiperf, ~2 min. Regression checks.
   - `sweep.yaml`: multi-axis sweep. `load.concurrency`, `rag.vdb_top_k`, and `rag.reranker_top_k` are all `int | list[int]`; any of them given as a list becomes a sweep axis (Cartesian product).
2. Edit the preset. Required: replace `rag.collection_names: ["<collection_name>"]` with a real collection on the deployed ingestor server. Verify the collection exists via `GET /v1/collections` on the ingestor. The placeholder `<collection_name>` validates fine but every request will fail at retrieval. Use a copied YAML preset for variants; the CLI surface is intentionally config-only.
3. Run. From repo root: `uv run --project scripts/rag-perf rag-perf -c scripts/rag-perf/configs/single_run.yaml`. Same form for the other presets. The CLI accepts only `-c` / `--config` (required), `--help`, and `--version`.
4. Read stdout. Every invocation prints, in order: a startup banner, a one-line summary, the fully resolved config as YAML (so the run is reproducible from terminal output), per-grid-point progress with the shlex-joined aiperf command in copy-pastable form, a rich per-point summary table (stage breakdown with bars, citation quality, bottleneck, load-test block), and finally a side-by-side comparison table auto-labelled by whichever axis varied. See `references/output-and-analysis.md`.
5. Inspect artifacts. Layout depends on run shape: flat for single-point + `iterations=1`, nested under `iter_<i>/<point>/...` otherwise. See `references/output-and-analysis.md` for the full directory tree, file purposes, and how to parse `results.json` / `results.csv` / `report.md`.
6. Summarise for the user. When reporting back, follow the playbook in `references/output-and-analysis.md#summarising-results-to-the-user`: pick the canonical result file for the run shape, build a headline table (concurrency × top-k axes × TTFT × throughput × bottleneck × citation quality), compute scaling efficiency on sweeps, always flag zero citations / non-zero error rate / suspect `llm_ttft_ms` / small-sample p99, and propose a concrete next-experiment YAML.
7. Tune. Schema is fully documented in `docs/performance-benchmarking.md` and the deeper-dive references below. Common knobs: set `aiperf.enabled: false` for profile-only mode, increase `load.iterations` for variance estimation, set `load.sleep_between_points_s: 60` for overnight Cartesian sweeps.
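
The sweep rule in step 1 (any scalar-or-list field given as a list becomes an axis, and the axes combine as a Cartesian product) can be sketched as follows. This is an illustration under stated assumptions, not rag-perf's implementation: `as_list` and `expand_grid` are hypothetical names, and the iteration order of the real tool is not documented here.

```python
from __future__ import annotations

from itertools import product


def as_list(value: int | list[int]) -> list[int]:
    """Normalise a scalar-or-list field to a list (hypothetical helper)."""
    return list(value) if isinstance(value, list) else [value]


def expand_grid(
    concurrency: int | list[int],
    vdb_top_k: int | list[int],
    reranker_top_k: int | list[int],
) -> list[dict[str, int]]:
    """Return one dict per grid point: the Cartesian product of the three axes."""
    axes = {
        "concurrency": as_list(concurrency),
        "vdb_top_k": as_list(vdb_top_k),
        "reranker_top_k": as_list(reranker_top_k),
    }
    # product() varies the last axis fastest; the real tool's order is an assumption.
    return [dict(zip(axes, combo)) for combo in product(*axes.values())]


grid = expand_grid(concurrency=[1, 4, 8], vdb_top_k=[10, 20], reranker_top_k=4)
print(len(grid))  # 3 * 2 * 1 = 6 points
print(grid[0])
```

A scalar contributes an axis of length one, so a preset with no lists yields exactly one point.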


Examples


Profile-only (quickest signal on retrieval / reranker tuning): `uv run --project scripts/rag-perf rag-perf -c scripts/rag-perf/configs/quick_profile.yaml`

Output: `rag-perf-results/quick_profile/run_<ts>/{profile_report.md, profile_results.json, profiling/}`. The `aiperf_rag_on/` directory is omitted. Filenames are `profile_*` because `aiperf.enabled: false`.

Single benchmark point with full report: `uv run --project scripts/rag-perf rag-perf -c scripts/rag-perf/configs/single_run.yaml`

Output: flat `run_<ts>/{report.md, results.json, results.csv, profiling/, aiperf_rag_on/}`.

Concurrency sweep: `uv run --project scripts/rag-perf rag-perf -c scripts/rag-perf/configs/sweep.yaml`

Output: nested `run_<ts>/iter_1/<CR:_VDB-K:_RERANKER-K:_…>/{profiling,aiperf_rag_on}/` per point, plus aggregate `report.md`, `results.json`, and `results.csv` at the run root.

Run unit tests: first `uv sync --project scripts/rag-perf --extra dev` (one-time, installs pytest-asyncio), then `uv run --project scripts/rag-perf python -m pytest tests/unit/test_rag_perf/`.
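
Because the top-level filenames differ between full runs (`results.json`) and profile-only runs (`profile_results.json`), a script that consumes a run directory has to check for both. A minimal sketch, assuming only the filenames listed in the outputs above; `canonical_results` is a hypothetical helper, and the demo directory is a throwaway stand-in for a real `run_<ts>/`.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def canonical_results(run_dir: Path) -> Path | None:
    """Return results.json if present, else profile_results.json, else None."""
    for name in ("results.json", "profile_results.json"):
        candidate = run_dir / name
        if candidate.is_file():
            return candidate
    return None


# Demo on a temporary directory that mimics a profile-only run.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "run_demo"
    run_dir.mkdir()
    (run_dir / "profile_results.json").write_text("{}")
    found = canonical_results(run_dir)
    found_name = found.name if found else None

print(found_name)
```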


Limitations


- The CLI is config-only: author or copy YAML to vary a parameter.
- `load.concurrency` / `rag.vdb_top_k` / `rag.reranker_top_k` accept `int | list[int]`; the validator requires unique list values because each value names a unique point dir.
- `input.file` and `input.synthetic` follow an XOR rule: setting both fails validation. When neither is set, `synthetic` auto-fills with defaults so a bare config still validates.
- File-based input format is inferred from extension only (`.jsonl` or `.csv`); other extensions are rejected.
- Synthetic generation streams each query to disk as it completes (failure-resilient) but fails fast on the first LLM error; partial JSONL is preserved. Re-run after fixing the endpoint.
- Reasoning models (Nemotron Omni, Qwen-Reasoning) require `synthetic.disable_thinking: true` (the default). Without it the model exhausts the token budget on chain-of-thought and `content` returns empty; the generator now raises with a clear message instead of substituting `reasoning_content` for the answer.
- aiperf-specific knobs outside the YAML surface (request rate distribution, GPU telemetry config, etc.) require editing `AiperfRunner._base_aiperf_cmd` in `scripts/rag-perf/rag_perf/runner.py`.
- Procedural detail lives under `references/` to keep this file concise.
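
The validation rules listed above (XOR between `input.file` and `input.synthetic`, extension-based format inference, unique sweep values) can be approximated in a few lines. This is a hedged sketch, not the validator in `config.py`: the function names are hypothetical, the XOR message wording is invented for illustration, and case-sensitive suffix matching is an assumption.

```python
from __future__ import annotations

from pathlib import Path

ALLOWED_SUFFIXES = {".jsonl", ".csv"}


def check_input(file: str | None, synthetic: dict | None) -> list[str]:
    """Return error strings for the input block (empty list means it passes)."""
    errors: list[str] = []
    if file is not None and synthetic is not None:
        # Illustrative wording; the real message is elided in this document.
        errors.append("input: set either file or synthetic, not both")
    if file is not None and Path(file).suffix not in ALLOWED_SUFFIXES:
        errors.append("input.file must end in .jsonl or .csv")
    return errors


def check_unique(name: str, values: int | list[int]) -> list[str]:
    """Reject duplicate list values, since each value names a point directory."""
    if isinstance(values, list) and len(set(values)) != len(values):
        return [f"{name} has duplicate values"]
    return []


print(check_input("queries.txt", None))
print(check_unique("load.concurrency", [2, 2, 4]))
```

Neither `file` nor `synthetic` being set returns no error here, mirroring the rule that a bare config still validates.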


Troubleshooting


| Error / signal | Likely cause | What to do |
| --- | --- | --- |
| `Configuration errors in <yaml>: • input — ... XOR rule` | Both `input.file` and `input.synthetic` set | Pick one. The XOR validator runs at YAML load time. |
| `input.file must end in .jsonl or .csv` | Extension other than `.jsonl` / `.csv` | Rename or convert. |
| `load.concurrency has duplicate values` | e.g. `[2, 2, 4]` | Each concurrency maps to a unique point dir; dedupe. |
| `warmup_requests must be >= 1` | YAML had `warmup_requests: 0` | aiperf rejects warmup=0; minimum is 1. |
| `LLM returned empty content (reasoning_content was populated — model exhausted its budget on chain-of-thought; raise min_query_tokens or set synthetic.disable_thinking=true).` | Reasoning model used CoT and ran out of tokens | Set `synthetic.disable_thinking: true` (the default) or raise `min_query_tokens`. |
| `✗ All N profiling requests failed across M point(s).` + exit 1 | Bad URL, server down, wrong collection | Verify `target.url` and `rag.collection_names` (the `<collection_name>` placeholder will hit this). |
| Per-iteration `⚠ N profiling requests failed` warning, run continues | Some requests timed out / errored mid-run | Check rag-server logs, raise `target.timeout_s`, drop concurrency. |
| `RuntimeError: Random synthetic query generation failed at query N: ...` | LLM endpoint rejected a request mid-generation | Partial JSONL is at `synthetic.jsonl_output_path`; fix endpoint and re-run with reduced `num_queries`, or point `input.file` at the partial file. |
| `Citation count (mean): 0` and `Citation relevance score: N/A` for a non-empty deployment | Collection mismatch between `rag.collection_names` and what's actually ingested | Run `curl -s http://<ingestor>:8082/v1/collections` to list real collections. |
| Tests error with `ModuleNotFoundError: No module named 'pytest_asyncio'` | Dev extras missing | `uv sync --project scripts/rag-perf --extra dev`. |
| CI: `ModuleNotFoundError: No module named 'ruamel'` from `tests/unit/test_rag_perf/` | rag-perf package missing from CI venv | Add `uv pip install -e ./scripts/rag-perf` after the top-level install in the unit-tests job. |
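
Several rows above trace back to the `<collection_name>` placeholder surviving into a run, so a pre-flight check on the parsed config can catch it early. A minimal sketch, assuming the config has already been loaded into a plain dict shaped like the YAML (`rag.collection_names` as a list of strings); `has_placeholder_collection` is a hypothetical helper, not part of rag-perf.

```python
PLACEHOLDER = "<collection_name>"


def has_placeholder_collection(config: dict) -> bool:
    """True if rag.collection_names still contains the shipped placeholder."""
    names = config.get("rag", {}).get("collection_names", [])
    return PLACEHOLDER in names


shipped = {"rag": {"collection_names": ["<collection_name>"]}}
edited = {"rag": {"collection_names": ["my_docs"]}}  # "my_docs" is a made-up name

print(has_placeholder_collection(shipped))
print(has_placeholder_collection(edited))
```

This only detects the untouched placeholder; confirming that a named collection really exists still requires querying the ingestor.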


Gotchas


- Run from repo root. Preset configs reference `scripts/rag-perf/examples/queries.jsonl` and `scripts/rag-perf/prompts/default_prompts.yaml` with repo-root-relative paths. Running from inside `scripts/rag-perf/` will fail those file lookups.
- CLI is config-only. Edit the YAML or copy a preset for URL, concurrency, collection, and similar fields.
- Always edit `rag.collection_names` before the first run. The presets ship with `["<collection_name>"]` as a deliberate placeholder. Validation passes, but retrieval fails silently for every request, which manifests as `Citation count (mean): 0` everywhere.
- `load.concurrency_list`, `rag.vdb_top_k_list`, and `rag.reranker_top_k_list` are read-only properties that normalise scalar-or-list to a list. Use them when reasoning about the grid; the underlying YAML field is whatever the user wrote.
- `aiperf.enabled: false` changes filenames. The top-level outputs become `profile_report.md` / `profile_results.json` / `profile_results.csv`. The aggregate sweep table also suppresses load-test rows and the "Optimal throughput" footer.
- Resolved-config dump is verbose (50+ lines); this is expected. It's what makes terminal output a self-contained reproducer; don't filter it out in scripts.
- The aiperf shell command is logged before each subprocess. Look for `\n  $ python -m aiperf profile -m ... --endpoint-type nvidia_rag ...` in stdout; it is copy-paste runnable for reproducing a single point outside rag-perf.
- `--endpoint-type nvidia_rag` comes from the bundled plugin at `scripts/rag-perf/rag_perf/plugin/nvidia_rag.py`. It teaches aiperf about the RAG `/v1/generate` request shape and parses citations + per-stage `metrics` out of the SSE stream. If aiperf can't resolve `nvidia_rag`, rag-perf needs an editable installation in the venv: re-run `uv sync --project scripts/rag-perf` (or `uv pip install -e ./scripts/rag-perf`).
- Sweep-mode point-name collision. When two points differ only in concurrency (e.g. `[1, 4]` × single `vdb_top_k`), the dir name encodes every axis value, so the points still get distinct directories: `CR:1_ISL:50_OSL:512_VDB-K:20_RERANKER-K:4_Model:...`. Cluster / GPU / experiment_name (`output.cluster`, `output.gpu`, `output.experiment_name`) are appended too, which is useful for diff-friendly artifact paths across machines.
- `load.iterations > 1` repeats the entire grid. Each repetition writes to its own `iter_<i>/`. Aggregate CSV row count = `n_points × iterations`.
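
The layout and row-count rules above (flat only for a single point with one iteration, otherwise `iter_<i>/<point>/`, and `n_points × iterations` aggregate rows) can be sketched as below. The helper name `point_dirs` is hypothetical, the point names are shortened stand-ins for the real directory names, and 1-based iteration numbering is inferred from the `iter_1` path shown in the examples.

```python
def point_dirs(point_names: list[str], iterations: int) -> list[str]:
    """Return per-point output directories relative to run_<ts>/."""
    if len(point_names) == 1 and iterations == 1:
        # Flat layout: artifacts sit directly in the run root.
        return ["."]
    # Nested layout; iterations numbered from 1 (assumption based on `iter_1`).
    return [
        f"iter_{i}/{name}"
        for i in range(1, iterations + 1)
        for name in point_names
    ]


dirs = point_dirs(["CR:1", "CR:4"], iterations=3)
print(len(dirs))  # 2 points * 3 iterations = 6, matching the aggregate CSV row count
print(dirs[0])
```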


Source of truth


| Piece | Location |
| --- | --- |
| Driver | `scripts/rag-perf/rag_perf/cli.py` (`main` is the single Click command) |
| Schema | `scripts/rag-perf/rag_perf/config.py` (`RunConfig` and sub-models) |
| Orchestrator | `scripts/rag-perf/rag_perf/runner.py` (`BenchmarkRunner.run`, `RagProfiler`, `AiperfRunner`) |
| aiperf plugin | `scripts/rag-perf/rag_perf/plugin/nvidia_rag.py` |
| User-facing doc | `docs/performance-benchmarking.md` |
| Presets | `scripts/rag-perf/configs/{quick_profile,single_run,sweep}.yaml` |
| Sample queries | `scripts/rag-perf/examples/queries.jsonl` |
| Synthetic prompts | `scripts/rag-perf/prompts/default_prompts.yaml` |
| Config schema details | `references/config-schema.md` |
| Synthetic-query generation | `references/synthetic-generation.md` |
| Output layout & metric semantics | `references/output-and-analysis.md` |


Agent playbook


1. Sync deps: `uv sync --project scripts/rag-perf` (one-time per checkout).
2. Pick & customise a preset: copy `scripts/rag-perf/configs/<preset>.yaml` if you want a variant; always set `rag.collection_names` to a real collection.
3. Run: `uv run --project scripts/rag-perf rag-perf -c <config>` from repo root.
4. Read the per-point + aggregate tables on stdout. Bottleneck inference is in the per-point profiling section; comparison across points is the final aggregate table.
5. Parse artifacts under `output.dir/run_<ts>/`; see `references/output-and-analysis.md`. For multi-point runs, `results.csv` has one row per (point × iteration).
6. Summarise for the user using the playbook in `references/output-and-analysis.md#summarising-results-to-the-user`: headline table, scaling-efficiency math for sweeps, mandatory flags for zero citations / non-zero errors / suspect `llm_ttft_ms` / low sample size, and a concrete next-experiment YAML.
7. Tune retrieval / reranker: flip to `quick_profile.yaml` or `aiperf.enabled: false` for fast iteration, then return to `single_run.yaml` / `sweep.yaml` when characterising under load.
8. Triage failures: see Troubleshooting above and `references/output-and-analysis.md` for empty-citation / bottleneck=N/A patterns.
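
The scaling-efficiency step in the playbook is defined in `references/output-and-analysis.md`, which is not reproduced here. As a rough illustration only, the sketch below assumes one common definition: throughput gain relative to the lowest concurrency, divided by the concurrency ratio. The function name is hypothetical and the input numbers are made-up placeholders, not measurements.

```python
def scaling_efficiency(throughput_by_concurrency: dict[int, float]) -> dict[int, float]:
    """Efficiency of 1.0 means throughput scaled linearly with concurrency.

    Assumed definition: (T_c / T_base) / (c / c_base), with the lowest
    concurrency in the sweep as the baseline.
    """
    base_c = min(throughput_by_concurrency)
    base_t = throughput_by_concurrency[base_c]
    return {
        c: (t / base_t) / (c / base_c)
        for c, t in sorted(throughput_by_concurrency.items())
    }


# Placeholder request-throughput values keyed by concurrency (illustrative only).
eff = scaling_efficiency({1: 10.0, 4: 32.0, 8: 48.0})
for concurrency, value in eff.items():
    print(concurrency, round(value, 2))
```

If the reference file defines efficiency differently, its definition takes precedence over this sketch.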
