rag-perf

RAG-Perf — config-driven perf benchmark CLI

Purpose

Drive a deployed NVIDIA RAG Blueprint server with a YAML config, run a server-side profiling pass (per-stage timing, citation quality, bottleneck inference) and an optional aiperf load test (TTFT / E2E / token & request throughput / error rate), and write a unified report. The CLI is intentionally minimal: `rag-perf -c <config>` plus `--help` / `--version`. Behaviour is fully config-driven; field variations belong in YAML.

Scope

  • Accuracy / RAGAS scoring of answer quality → use the rag-eval skill.
  • Deploying, repairing, or configuring services (compose, helm, NIM env vars) → use the rag-blueprint skill.
  • Production monitoring / alerting is out of scope: rag-perf is a one-shot benchmark tool.
  • Runtime requirement: a deployed RAG server reachable on the network.

Prerequisites

  • Repo cloned; run commands from the repo root (config paths in the presets are repo-root-relative).
  • Python 3.11+ and uv on PATH.
  • Install rag-perf into its own uv-managed venv: `uv sync --project scripts/rag-perf`.
  • For unit tests: install dev extras as well with `uv sync --project scripts/rag-perf --extra dev` (otherwise `pytest-asyncio` is missing and async tests error out at collection time).
  • A reachable RAG server (default `http://localhost:8081`). For the aiperf phase, the bundled `nvidia_rag` endpoint plugin must be installed; `pip install -e ./scripts/rag-perf` registers it via the `aiperf.plugins` entry point.
  • For synthetic queries: an OpenAI-compatible chat-completions endpoint reachable at `synthetic.llm_url` (default `http://localhost:8999/v1/chat/completions`).
  • rag-perf itself runs without `NVIDIA_API_KEY` (unlike rag-eval). The synthetic LLM endpoint may require its own auth; that's the deployment's concern.
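
As a rough pre-flight check for the first two prerequisites, the sketch below verifies the Python version and looks for `uv` on PATH. `check_prereqs` is a hypothetical helper, not part of rag-perf, and it uses only the standard library; it does not probe the RAG server or the synthetic LLM endpoint.

```python
import shutil
import sys


def check_prereqs(version_info=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means the basics look fine.

    Hypothetical helper (not part of rag-perf). It checks only what the
    prerequisites list states: Python 3.11+ and `uv` on PATH.
    """
    version = tuple((version_info or sys.version_info)[:2])
    problems = []
    if version < (3, 11):
        problems.append(f"Python 3.11+ required, found {version[0]}.{version[1]}")
    if which("uv") is None:
        problems.append("uv not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prereqs():
        print(problem)
```

The `version_info` and `which` parameters exist only so the check can be exercised without changing the real environment.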

Instructions

  1. Pick a preset. The three under `scripts/rag-perf/configs/` are:
    • `quick_profile.yaml`: profile-only, ~30 s. Skips load test. For fast iteration on retrieval / reranker tuning.
    • `single_run.yaml`: one concurrency level, profiling + aiperf, ~2 min. Regression checks.
    • `sweep.yaml`: multi-axis sweep. `load.concurrency`, `rag.vdb_top_k`, `rag.reranker_top_k` are all `int | list[int]`; any of them as a list becomes a sweep axis (Cartesian product).
  2. Edit the preset. Required: replace `rag.collection_names: ["<collection_name>"]` with a real collection on the deployed ingestor server. Verify the collection exists via `GET /v1/collections` on the ingestor. The placeholder `<collection_name>` validates fine but every request will fail at retrieval. Use a copied YAML preset for variants; the CLI surface is intentionally config-only.
  3. Run. From repo root:
    ```bash
    uv run --project scripts/rag-perf rag-perf -c scripts/rag-perf/configs/single_run.yaml
    ```
    Same form for the other presets. The CLI accepts only `-c / --config` (required), `--help`, `--version`.
  4. Read stdout. Every invocation prints, in order: a startup banner, a one-line summary, the fully resolved config as YAML (so the run is reproducible from terminal output), per-grid-point progress with the shlex-joined aiperf command in copy-pastable form, a rich per-point summary table (stage breakdown with bars, citation quality, bottleneck, load-test block), and finally a side-by-side comparison table auto-labelled by whichever axis varied. See `references/output-and-analysis.md`.
  5. Inspect artifacts. Layout depends on run shape: flat for single-point + `iterations=1`, nested under `iter_<i>/<point>/...` otherwise. See `references/output-and-analysis.md` for the full directory tree, file purposes, and how to parse `results.json` / `results.csv` / `report.md`.
  6. Summarise for the user. When reporting back, follow the playbook in `references/output-and-analysis.md#summarising-results-to-the-user`: pick the canonical result file for the run shape, build a headline table (concurrency × top-k axes × TTFT × throughput × bottleneck × citation quality), compute scaling efficiency on sweeps, always flag zero citations / non-zero error rate / suspect `llm_ttft_ms` / small-sample p99, and propose a concrete next-experiment YAML.
  7. Tune. Schema is fully documented in `docs/performance-benchmarking.md` and the deeper-dive references below. Common knobs: set `aiperf.enabled: false` for profile-only mode, increase `load.iterations` for variance estimation, set `load.sleep_between_points_s: 60` for overnight Cartesian sweeps.
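
To make the sweep semantics in step 1 concrete, here is a minimal sketch of how scalar-or-list fields could expand into a Cartesian grid. `as_list` and `expand_grid` are illustrative names, not rag-perf functions, and the iteration order of the real runner is not confirmed by this document.

```python
from itertools import product


def as_list(value):
    """Normalise an `int | list[int]` field to a list."""
    return list(value) if isinstance(value, (list, tuple)) else [value]


def expand_grid(concurrency, vdb_top_k, reranker_top_k):
    """Cartesian product of the three sweepable fields, one dict per grid point."""
    axes = (as_list(concurrency), as_list(vdb_top_k), as_list(reranker_top_k))
    return [
        {"concurrency": c, "vdb_top_k": v, "reranker_top_k": r}
        for c, v, r in product(*axes)
    ]


# Two list-valued fields become two sweep axes; the scalar stays fixed.
points = expand_grid(concurrency=[1, 4], vdb_top_k=20, reranker_top_k=[4, 8])
iterations = 2  # stands in for load.iterations

print(len(points))               # number of grid points
print(len(points) * iterations)  # total runs when the whole grid repeats per iteration
```

A sweep with several list-valued fields grows multiplicatively, which is why a pause between points matters for long overnight runs.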

Examples

Profile-only (quickest signal on retrieval / reranker tuning):

```bash
uv run --project scripts/rag-perf rag-perf -c scripts/rag-perf/configs/quick_profile.yaml
```

Output: `rag-perf-results/quick_profile/run_<ts>/{profile_report.md, profile_results.json, profiling/}`. The `aiperf_rag_on/` directory is omitted. Filenames are `profile_*` because `aiperf.enabled: false`.

Single benchmark point with full report:

```bash
uv run --project scripts/rag-perf rag-perf -c scripts/rag-perf/configs/single_run.yaml
```

Output: flat `run_<ts>/{report.md, results.json, results.csv, profiling/, aiperf_rag_on/}`.

Concurrency sweep:

```bash
uv run --project scripts/rag-perf rag-perf -c scripts/rag-perf/configs/sweep.yaml
```

Output: nested `run_<ts>/iter_1/<CR:_VDB-K:_RERANKER-K:_…>/{profiling,aiperf_rag_on}/` per point, plus aggregate `report.md` / `results.json` / `results.csv` at the run root.

Run unit tests:

```bash
uv sync --project scripts/rag-perf --extra dev   # one-time, installs pytest-asyncio
uv run --project scripts/rag-perf python -m pytest tests/unit/test_rag_perf/
```
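
Because the top-level report name differs between profile-only and full runs, a script that post-processes a run directory may want to look for either name. The sketch below is one way to do that; `find_report` is a hypothetical helper, and it assumes it is handed a `run_<ts>/` directory.

```python
from pathlib import Path


def find_report(run_dir):
    """Return the top-level report in a run directory, or None if neither name exists.

    Assumption: `run_dir` is a `run_<ts>/` directory. `report.md` is expected
    when the aiperf phase ran, `profile_report.md` when `aiperf.enabled: false`.
    """
    for name in ("report.md", "profile_report.md"):
        candidate = Path(run_dir) / name
        if candidate.is_file():
            return candidate
    return None
```

Returning `None` rather than raising leaves it to the caller to decide whether a missing report is an error.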

Limitations

  • The CLI is config-only: author or copy YAML to vary a parameter.
  • `load.concurrency` / `rag.vdb_top_k` / `rag.reranker_top_k` accept `int | list[int]`; the validator requires unique list values because each value names a unique point dir.
  • `input.file` and `input.synthetic` follow an XOR rule: setting both fails validation. When neither is set, `synthetic` auto-fills with defaults so a bare config still validates.
  • File-based input format is inferred from extension only (`.jsonl` or `.csv`); other extensions are rejected.
  • Synthetic generation streams each query to disk as it completes (failure-resilient) but fails fast on the first LLM error — partial JSONL is preserved. Re-run after fixing the endpoint.
  • Reasoning models (Nemotron Omni, Qwen-Reasoning) require `synthetic.disable_thinking: true` (the default). Without it the model exhausts the token budget on chain-of-thought and `content` returns empty; the generator now raises with a clear message instead of substituting `reasoning_content` for the answer.
  • aiperf-specific knobs outside the YAML surface (request rate distribution, GPU telemetry config, etc.) require editing `AiperfRunner._base_aiperf_cmd` in `scripts/rag-perf/rag_perf/runner.py`.
  • Procedural detail lives under `references/` to keep this file concise.
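
The validation rules in this list can be sketched in plain Python as follows. This is an illustration of the stated rules only; the actual checks live in `scripts/rag-perf/rag_perf/config.py`, and the function names and the exact XOR error wording here are assumptions.

```python
from pathlib import PurePath

ALLOWED_EXTENSIONS = {".jsonl", ".csv"}


def validate_input(file=None, synthetic=None):
    """Apply the XOR rule and the extension check; return the effective input mode."""
    if file is not None and synthetic is not None:
        # Assumed wording; the real message mentions the XOR rule.
        raise ValueError("input.file and input.synthetic are mutually exclusive")
    if file is not None:
        if PurePath(file).suffix not in ALLOWED_EXTENSIONS:
            raise ValueError("input.file must end in .jsonl or .csv")
        return "file"
    # Synthetic set, or neither set: a bare config falls back to synthetic defaults.
    return "synthetic"


def validate_axis(name, value):
    """Reject duplicate values in an `int | list[int]` sweep field."""
    values = value if isinstance(value, list) else [value]
    if len(set(values)) != len(values):
        raise ValueError(f"{name} has duplicate values")
    return values
```

For example, `validate_axis("load.concurrency", [2, 2, 4])` raises, while `validate_input()` with no arguments resolves to the synthetic mode.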

Troubleshooting

| Error / signal | Likely cause | What to do |
|---|---|---|
| `Configuration errors in <yaml>:  •  input  —  ... XOR rule` | Both `input.file` and `input.synthetic` set | Pick one. The XOR validator runs at YAML load time. |
| `input.file must end in .jsonl or .csv` | Extension other than `.jsonl` / `.csv` | Rename or convert. |
| `load.concurrency has duplicate values` | e.g. `[2, 2, 4]` | Each concurrency maps to a unique point dir; dedupe. |
| `warmup_requests must be >= 1` | YAML had `warmup_requests: 0` | aiperf rejects warmup=0; minimum is 1. |
| `LLM returned empty content (reasoning_content was populated — model exhausted its budget on chain-of-thought; raise min_query_tokens or set synthetic.disable_thinking=true).` | Reasoning model used CoT and ran out of tokens | Set `synthetic.disable_thinking: true` (the default) or raise `min_query_tokens`. |
| `✗ All N profiling requests failed across M point(s).` + exit 1 | Bad URL, server down, wrong collection | Verify `target.url`, `rag.collection_names` (the `<collection_name>` placeholder will hit this). |
| Per-iteration `⚠ N profiling requests failed` warning, run continues | Some requests timed out / errored mid-run | Check rag-server logs, raise `target.timeout_s`, drop concurrency. |
| `RuntimeError: Random synthetic query generation failed at query N: ...` | LLM endpoint rejected a request mid-generation | Partial JSONL is at `synthetic.jsonl_output_path`; fix endpoint and re-run with reduced `num_queries`, or point `input.file` at the partial file. |
| `Citation count (mean): 0` and `Citation relevance score: N/A` for a non-empty deployment | Collection mismatch between `rag.collection_names` and what's actually ingested | Run `curl -s http://<ingestor>:8082/v1/collections` to list real collections. |
| Tests error with `ModuleNotFoundError: No module named 'pytest_asyncio'` | Dev extras missing | `uv sync --project scripts/rag-perf --extra dev`. |
| CI: `ModuleNotFoundError: No module named 'ruamel'` from `tests/unit/test_rag_perf/` | rag-perf package missing from CI venv | Add `uv pip install -e ./scripts/rag-perf` after the top-level install in the unit-tests job. |
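
Several rows above trace back to a wrong or placeholder collection name. Once the real collection names have been read off the ingestor's `GET /v1/collections` output (its response shape is not described here, so no parsing is attempted), a comparison like the following could catch the mismatch before a run. `collection_problems` is a hypothetical helper, and treating any `<...>` value as a placeholder is an assumption.

```python
import re

# Assumption: anything wrapped in angle brackets is an unedited placeholder.
PLACEHOLDER = re.compile(r"^<.*>$")


def collection_problems(configured, ingested):
    """Compare rag.collection_names against the names the ingestor reports."""
    problems = []
    for name in configured:
        if PLACEHOLDER.match(name):
            problems.append(f"{name!r} is still a placeholder")
        elif name not in ingested:
            problems.append(f"{name!r} is not among the ingested collections")
    return problems


print(collection_problems(["<collection_name>"], ["docs"]))
```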

Gotchas

  • Run from repo root. Preset configs reference `scripts/rag-perf/examples/queries.jsonl` and `scripts/rag-perf/prompts/default_prompts.yaml` with repo-root-relative paths. Running from inside `scripts/rag-perf/` will fail those file lookups.
  • CLI is config-only. Edit the YAML or copy a preset for URL, concurrency, collection, and similar fields.
  • Always edit `rag.collection_names` before the first run. The presets ship with `["<collection_name>"]` as a deliberate placeholder. Validation passes, but retrieval fails silently for every request, which manifests as `Citation count (mean): 0` everywhere.
  • `load.concurrency_list`, `rag.vdb_top_k_list`, `rag.reranker_top_k_list` are read-only properties that normalise scalar-or-list to a list. Use them when reasoning about the grid; the underlying YAML field is whatever the user wrote.
  • `aiperf.enabled: false` changes filenames. The top-level outputs become `profile_report.md` / `profile_results.json` / `profile_results.csv`. The aggregate sweep table also suppresses load-test rows and the "Optimal throughput" footer.
  • Resolved-config dump is verbose (50+ lines) — expected. It's what makes terminal output a self-contained reproducer; don't filter it out in scripts.
  • The aiperf shell command is logged before each subprocess. Look for `\n  $ python -m aiperf profile -m ... --endpoint-type nvidia_rag ...` in stdout; it is copy-paste runnable for reproducing a single point outside rag-perf.
  • `--endpoint-type nvidia_rag` comes from the bundled plugin at `scripts/rag-perf/rag_perf/plugin/nvidia_rag.py`. It teaches aiperf about the RAG `/v1/generate` request shape and parses citations + per-stage `metrics` out of the SSE stream. If aiperf can't resolve `nvidia_rag`, rag-perf needs editable installation in the venv; re-run `uv sync --project scripts/rag-perf` (or `uv pip install -e ./scripts/rag-perf`).
  • Sweep-mode point-name collision. When two points differ only in concurrency (e.g. `[1, 4]` × single `vdb_top_k`), the dir name encodes everything: `CR:1_ISL:50_OSL:512_VDB-K:20_RERANKER-K:4_Model:...`. Cluster / GPU / experiment_name (`output.cluster`, `output.gpu`, `output.experiment_name`) are appended too, which is useful for diff-friendly artifact paths across machines.
  • `load.iterations > 1` repeats the entire grid. Each repetition writes to its own `iter_<i>/`. Aggregate CSV row count = `n_points × iterations`.
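
When comparing artifact paths across points, it can help to split a point directory name back into its fields. The sketch below assumes the name is a sequence of `_`-separated `KEY:VALUE` tokens, as in the `CR:1_ISL:50_...` example above; the full naming scheme is not spelled out here, so `parse_point_name` is a hypothetical helper that may mis-split values which themselves contain a colon.

```python
def parse_point_name(name):
    """Split a point directory name into {key: value} pairs.

    Assumption: fields are `_`-separated `KEY:VALUE` tokens. Tokens without a
    colon are appended to the previous value, since values such as model or
    experiment names might contain underscores.
    """
    fields = {}
    last_key = None
    for token in name.split("_"):
        key, sep, value = token.partition(":")
        if sep:
            fields[key] = value
            last_key = key
        elif last_key is not None:
            fields[last_key] += "_" + token
    return fields


print(parse_point_name("CR:1_ISL:50_OSL:512_VDB-K:20_RERANKER-K:4"))
```

Values are kept as strings; converting `CR` or `VDB-K` to integers is left to the caller.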

Source of truth

| Piece | Location |
|---|---|
| Driver | `scripts/rag-perf/rag_perf/cli.py` (`main` is the single Click command) |
| Schema | `scripts/rag-perf/rag_perf/config.py` (`RunConfig` and sub-models) |
| Orchestrator | `scripts/rag-perf/rag_perf/runner.py` (`BenchmarkRunner.run`, `RagProfiler`, `AiperfRunner`) |
| aiperf plugin | `scripts/rag-perf/rag_perf/plugin/nvidia_rag.py` |
| User-facing doc | `docs/performance-benchmarking.md` |
| Presets | `scripts/rag-perf/configs/{quick_profile,single_run,sweep}.yaml` |
| Sample queries | `scripts/rag-perf/examples/queries.jsonl` |
| Synthetic prompts | `scripts/rag-perf/prompts/default_prompts.yaml` |
| Config schema details | `references/config-schema.md` |
| Synthetic-query generation | `references/synthetic-generation.md` |
| Output layout & metric semantics | `references/output-and-analysis.md` |

Agent playbook

  1. Sync deps: `uv sync --project scripts/rag-perf` (one-time per checkout).
  2. Pick & customise a preset: copy `scripts/rag-perf/configs/<preset>.yaml` if you want a variant; always set `rag.collection_names` to a real collection.
  3. Run: `uv run --project scripts/rag-perf rag-perf -c <config>` from repo root.
  4. Read the per-point + aggregate tables on stdout. Bottleneck inference is in the per-point profiling section; comparison across points is the final aggregate table.
  5. Parse artifacts under `output.dir/run_<ts>/`; see `references/output-and-analysis.md`. For multi-point runs, `results.csv` has one row per (point × iteration).
  6. Summarise for the user using the playbook in `references/output-and-analysis.md#summarising-results-to-the-user`: headline table, scaling-efficiency math for sweeps, mandatory flags for zero citations / non-zero errors / suspect `llm_ttft_ms` / low sample size, and a concrete next-experiment YAML.
  7. Tune retrieval / reranker: flip to `quick_profile.yaml` or `aiperf.enabled: false` for fast iteration, then return to `single_run.yaml` / `sweep.yaml` when characterising under load.
  8. Triage failures: see Troubleshooting above and `references/output-and-analysis.md` for empty-citation / bottleneck=N/A patterns.
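
Step 6 mentions scaling-efficiency math without giving a formula here; the authoritative definition is in `references/output-and-analysis.md`. As a sketch under an assumed definition (throughput ratio divided by concurrency ratio, relative to the lowest concurrency), the calculation could look like this. The function name and the sample numbers are made up for illustration.

```python
def scaling_efficiency(points, base_concurrency=None):
    """Map concurrency -> efficiency relative to the lowest (or given) concurrency.

    `points` maps concurrency to measured request throughput. The formula
    (throughput ratio divided by concurrency ratio) is an assumed definition;
    1.0 means perfectly linear scaling from the baseline.
    """
    if not points:
        return {}
    base = min(points) if base_concurrency is None else base_concurrency
    base_throughput = points[base]
    return {
        c: (throughput / base_throughput) / (c / base)
        for c, throughput in sorted(points.items())
    }


# Made-up throughput numbers, purely for illustration.
print(scaling_efficiency({1: 10.0, 4: 30.0, 8: 40.0}))
```

Efficiency falling well below 1.0 as concurrency rises is the kind of signal worth pairing with the per-point bottleneck inference when proposing the next experiment.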