add-benchmark

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Add Benchmark to NeMo-Gym

向NeMo-Gym添加基准测试

Determine Integration Type

确定集成类型

Before starting, determine which type of benchmark you're adding:
Native benchmark — verification logic implemented directly in a Gym resources server:
  • Resources server implements
    verify()
    with reward logic
  • Agent server orchestrates model calls (use
    simple_agent
    for single-turn, or custom agent for multi-turn)
  • Example:
    code_gen
    ,
    instruction_following
    ,
    math_with_judge
External benchmark — wrapping a 3rd-party library that has its own orchestration:
  • Integrate at the agent server level (not resources server)
  • Agent's
    /run
    endpoint wraps the external library
  • Pre-process from Gym schema to library input, post-process back to
    BaseVerifyResponse
  • Reproduce publicly reported numbers with the original repo first, then reproduce again after Gym integration
  • Add the dependency in
    requirements.txt
开始之前,请确定你要添加的基准测试类型:
原生基准测试 — 验证逻辑直接在Gym资源服务器中实现:
  • 资源服务器实现包含奖励逻辑的
    verify()
    方法
  • Agent服务器编排模型调用(单轮任务使用
    simple_agent
    ,多轮任务使用自定义agent)
  • 示例:
    code_gen
    instruction_following
    math_with_judge
外部基准测试 — 封装自带编排逻辑的第三方库:
  • 在Agent服务器层面集成(而非资源服务器)
  • Agent的
    /run
    端点封装外部库
  • 完成从Gym schema到库输入的预处理,以及从库输出到
    BaseVerifyResponse
    的后处理
  • 先在原仓库中复现公开报告的数值,再在Gym集成后再次复现
  • requirements.txt
    中添加依赖

Workflow

工作流

Step 1: Scaffold the server

步骤1:搭建服务器脚手架

Run
ng_init_resources_server
to generate the directory structure:
bash
ng_init_resources_server +entrypoint=resources_servers/my_benchmark
This creates:
resources_servers/my_benchmark/
├── app.py              # Server template
├── configs/my_benchmark.yaml
├── data/.gitignore
├── tests/test_app.py
├── requirements.txt
└── README.md
For external benchmarks, create the agent server manually under
responses_api_agents/my_agent/
with the same structure.
运行
ng_init_resources_server
生成目录结构:
bash
ng_init_resources_server +entrypoint=resources_servers/my_benchmark
该命令会创建以下结构:
resources_servers/my_benchmark/
├── app.py              # 服务器模板
├── configs/my_benchmark.yaml
├── data/.gitignore
├── tests/test_app.py
├── requirements.txt
└── README.md
对于外部基准测试,请手动在
responses_api_agents/my_agent/
下创建具有相同结构的Agent服务器。

Step 2: Prepare data

步骤2:准备数据

Convert your source dataset to Gym JSONL format. Each line must have
responses_create_params.input
(OpenAI message format). Task-specific verification data goes in
verifier_metadata
.
json
{
  "responses_create_params": {
    "input": [
      {"role": "system", "content": "System prompt"},
      {"role": "user", "content": "Problem statement"}
    ]
  },
  "verifier_metadata": {
    "test_cases": [{"input": "...", "expected_output": "..."}],
    "task_id": "unique_id"
  }
}
Data conversion: Write conversion scripts in the source repo (e.g. your dataset repository), not in NeMo-Gym. Prompt files also belong in the source repo. Exception: when there is no external source repo. See
references/patterns.md
§ "Data Conversion Script Pattern".
example.jsonl
: Generate 5 entries for smoke testing. This file is committed directly to git in
data/example.jsonl
.
train
/
validation
datasets
: Upload to the GitLab dataset registry — these must NOT be committed to git.
bash
ng_upload_dataset_to_gitlab \
    +dataset_name=my_benchmark \
    +version=0.0.1 \
    +input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl
Requires MLflow credentials in
env.yaml
(or passed via CLI):
yaml
mlflow_tracking_uri: <your-gitlab-mlflow-tracking-uri>
mlflow_tracking_token: <your-gitlab-api-token>
data/.gitignore
: The scaffold generates default patterns (
*train.jsonl
,
*validation.jsonl
, etc.). If your filename doesn't match (e.g.
my_eval.jsonl
), add a custom pattern (e.g.
*eval.jsonl
). If data was previously tracked, run
git rm --cached <file>
.
Validate your data:
bash
undefined
将源数据集转换为Gym JSONL格式。每行必须包含
responses_create_params.input
(OpenAI消息格式)。任务特定的验证数据放在
verifier_metadata
中。
json
{
  "responses_create_params": {
    "input": [
      {"role": "system", "content": "System prompt"},
      {"role": "user", "content": "Problem statement"}
    ]
  },
  "verifier_metadata": {
    "test_cases": [{"input": "...", "expected_output": "..."}],
    "task_id": "unique_id"
  }
}
数据转换:在源仓库(例如你的数据集仓库)中编写转换脚本,而非NeMo-Gym。提示文件也应放在源仓库中。例外情况:当没有外部源仓库时。请参考
references/patterns.md
中的「数据转换脚本模式」章节。
example.jsonl
:生成5条数据用于冒烟测试。该文件直接提交到git,路径为
data/example.jsonl
train
/
validation
数据集
:上传到GitLab数据集注册表 — 这些数据集禁止提交到git。
bash
ng_upload_dataset_to_gitlab \
    +dataset_name=my_benchmark \
    +version=0.0.1 \
    +input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl
需要在
env.yaml
中配置MLflow凭证(或通过CLI传入):
yaml
mlflow_tracking_uri: <your-gitlab-mlflow-tracking-uri>
mlflow_tracking_token: <your-gitlab-api-token>
data/.gitignore
:脚手架会生成默认规则(如
*train.jsonl
*validation.jsonl
等)。如果你的文件名不匹配(例如
my_eval.jsonl
),请添加自定义规则(例如
*eval.jsonl
)。如果数据之前已被git追踪,请运行
git rm --cached <file>
验证数据
bash
undefined

Validate example data (for PR submission)

验证示例数据(用于PR提交)

ng_prepare_data "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml]"
+output_dirpath=/tmp/prepare +mode=example_validation
ng_prepare_data "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml]"
+output_dirpath=/tmp/prepare +mode=example_validation

Download and prepare train/validation from GitLab

从GitLab下载并准备训练/验证数据

ng_prepare_data "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml]"
+output_dirpath=data/my_benchmark +mode=train_preparation +should_download=true +data_source=gitlab
undefined
ng_prepare_data "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml]"
+output_dirpath=data/my_benchmark +mode=train_preparation +should_download=true +data_source=gitlab
undefined

Step 3: Implement verify()

步骤3:实现verify()方法

Edit
app.py
. The
verify()
method receives model output +
verifier_metadata
, returns reward.
For code execution benchmarks, see
references/patterns.md
§ "Subprocess Execution with Ray" and "Resources Server Pattern".
Critical rules:
  • Return
    reward
    as 0.0 or 1.0 (binary)
  • Handle empty/missing model output gracefully — return 0.0, don't crash
  • Must handle 4k-65k concurrent requests without crashing
  • Use
    asyncio.Semaphore
    for subprocess concurrency control
  • For Ray remote tasks:
    result = await future
    (Ray futures are directly awaitable). Never call
    ray.get()
    in async context.
  • Decode subprocess output with
    errors="replace"
  • Strip
    <think>
    /
    <thinking>
    blocks before parsing model output (thinking models emit these)
  • Tests should
    pytest.mark.skipif
    when external tools aren't installed
  • If the benchmark auto-installs its tool (see Step 3b), add a
    pytest_configure
    hook in
    conftest.py
    to run the install before test collection —
    skipif
    evaluates at import time, before fixtures run
编辑
app.py
verify()
方法接收模型输出和
verifier_metadata
,返回奖励值。
对于代码执行类基准测试,请参考
references/patterns.md
中的「基于Ray的子进程执行」和「资源服务器模式」章节。
关键规则:
  • 返回的
    reward
    必须为0.0或1.0(二进制)
  • 优雅处理空/缺失的模型输出 — 返回0.0,避免崩溃
  • 必须能处理4k-65k并发请求而不崩溃
  • 使用
    asyncio.Semaphore
    控制子进程并发
  • 对于Ray远程任务:使用
    result = await future
    (Ray futures可直接await)。绝不要在异步上下文调用
    ray.get()
  • 使用
    errors="replace"
    解码子进程输出
  • 解析模型输出前,去除
    <think>
    /
    <thinking>
    块(思考型模型会生成这些内容)
  • 当外部工具未安装时,测试应使用
    pytest.mark.skipif
    跳过
  • 如果基准测试会自动安装工具(见步骤3b),请在
    conftest.py
    中添加
    pytest_configure
    钩子,在测试收集前运行安装 —
    skipif
    在导入时评估,早于fixtures执行

Step 3b: Auto-install external tools (if applicable)

步骤3b:自动安装外部工具(如适用)

If the benchmark requires an external tool (compiler, runtime, etc.), auto-install it on server startup so users don't need manual setup. See
references/patterns.md
§ "External Tool Auto-Install Pattern".
Key points:
  • Create
    setup_<tool>.py
    with
    ensure_<tool>()
    — checks PATH, forks on
    sys.platform
    (brew on macOS, build from source on Linux)
  • Call it in
    model_post_init()
    before semaphore init
  • Build scripts should be idempotent and install into a local gitignored prefix
  • Add a
    pytest_configure
    hook in
    tests/conftest.py
    that calls
    ensure_<tool>()
    before collection
如果基准测试需要外部工具(编译器、运行时等),请在服务器启动时自动安装,避免用户手动配置。请参考
references/patterns.md
中的「外部工具自动安装模式」章节。
关键点:
  • 创建
    setup_<tool>.py
    ,包含
    ensure_<tool>()
    方法 — 检查PATH,根据
    sys.platform
    分支处理(macOS使用brew,Linux从源码构建)
  • model_post_init()
    中、信号量初始化前调用该方法
  • 构建脚本应具备幂等性,并安装到本地被git忽略的前缀目录
  • tests/conftest.py
    中添加
    pytest_configure
    钩子,在测试收集前调用
    ensure_<tool>()

Step 4: Wire YAML config

步骤4:配置YAML文件

Edit
configs/my_benchmark.yaml
. Define the resources server instance and agent pairing(s). See
references/patterns.md
§ "YAML Config Pattern".
Key points:
  • verified: false
    is auto-added by pre-commit hook (set to
    true
    after baselining)
  • license
    is required for
    train
    and
    validation
    datasets
  • Agent references resources server and model server by instance name
For multi-turn benchmarks, either use
proof_refinement_agent
or create a custom agent. See
references/patterns.md
§ "Agent Patterns".
For
train
/
validation
datasets, add
gitlab_identifier
alongside
jsonl_fpath
:
yaml
datasets:
- name: my_dataset
  type: train
  jsonl_fpath: resources_servers/my_benchmark/data/my_dataset.jsonl
  gitlab_identifier:
    dataset_name: my_benchmark
    version: 0.0.1
    artifact_fpath: my_dataset.jsonl
  license: MIT
- name: example
  type: example
  jsonl_fpath: resources_servers/my_benchmark/data/example.jsonl
Both fields must coexist:
jsonl_fpath
is the local download destination,
gitlab_identifier
tells the system where to fetch from.
example
datasets don't need
gitlab_identifier
— they're committed to git directly.
编辑
configs/my_benchmark.yaml
。定义资源服务器实例和Agent配对。请参考
references/patterns.md
中的「YAML配置模式」章节。
关键点:
  • 预提交钩子会自动添加
    verified: false
    (完成基准测试后设置为
    true
  • train
    validation
    数据集必须填写
    license
  • Agent通过实例名称关联资源服务器和模型服务器
对于多轮基准测试,可使用
proof_refinement_agent
或创建自定义Agent。请参考
references/patterns.md
中的「Agent模式」章节。
对于
train
/
validation
数据集,在
jsonl_fpath
旁添加
gitlab_identifier
yaml
datasets:
- name: my_dataset
  type: train
  jsonl_fpath: resources_servers/my_benchmark/data/my_dataset.jsonl
  gitlab_identifier:
    dataset_name: my_benchmark
    version: 0.0.1
    artifact_fpath: my_dataset.jsonl
  license: MIT
- name: example
  type: example
  jsonl_fpath: resources_servers/my_benchmark/data/example.jsonl
两个字段必须同时存在:
jsonl_fpath
是本地下载目标路径,
gitlab_identifier
告知系统从何处获取数据。
example
数据集无需
gitlab_identifier
— 它们直接提交到git。

Step 5: Test

步骤5:测试

bash
undefined
bash
undefined

Run server tests (creates isolated .venv, slow on first run)

运行服务器测试(创建独立的.venv,首次运行较慢)

ng_test +entrypoint=resources_servers/my_benchmark
ng_test +entrypoint=resources_servers/my_benchmark

Run core library tests to check nothing broke

运行核心库测试,检查是否引入问题

pytest tests/unit_tests/ -x

Test coverage must be >= 95%. Write tests for: verify pass, verify fail (wrong output), verify fail (no code extracted), verify fail (compilation error if applicable), verify timeout.
pytest tests/unit_tests/ -x

测试覆盖率必须≥95%。需编写以下测试用例:验证通过、验证失败(输出错误)、验证失败(未提取代码)、验证失败(编译错误,如适用)、验证超时。

Step 6: Smoke test end-to-end

步骤6:端到端冒烟测试

bash
undefined
bash
undefined

Start servers

启动服务器

ng_run "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml,responses_api_models/openai_model/configs/openai_model.yaml]"
ng_run "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml,responses_api_models/openai_model/configs/openai_model.yaml]"

Quick test with example data

使用示例数据快速测试

ng_collect_rollouts +agent_name=my_benchmark_simple_agent
+input_jsonl_fpath=resources_servers/my_benchmark/data/example.jsonl
+output_jsonl_fpath=results/example_rollouts.jsonl
+num_repeats=1
"+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"
ng_collect_rollouts +agent_name=my_benchmark_simple_agent
+input_jsonl_fpath=resources_servers/my_benchmark/data/example.jsonl
+output_jsonl_fpath=results/example_rollouts.jsonl
+num_repeats=1
"+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"

Inspect results

检查结果

undefined
undefined

Step 7: Baseline (reward profiling)

步骤7:基准测试(奖励分析)

Run against multiple models to validate correctness. Recommended suite:
  • Your policy model of interest
  • At least one open-source instruct model (e.g. Qwen 3 30B A3B Instruct)
  • At least one open-source thinking model (e.g. Qwen 3 30B A3B Thinking)
  • At least one closed-source model (e.g. GPT-5 Nano or GPT-5)
bash
undefined
在多个模型上运行以验证正确性。推荐测试套件:
  • 你关注的策略模型
  • 至少一个开源指令模型(例如Qwen 3 30B A3B Instruct)
  • 至少一个开源思考模型(例如Qwen 3 30B A3B Thinking)
  • 至少一个闭源模型(例如GPT-5 Nano或GPT-5)
bash
undefined

Collect rollouts

收集rollouts数据

ng_collect_rollouts +agent_name=my_benchmark_simple_agent
+input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl
+output_jsonl_fpath=results/rollouts.jsonl
+num_repeats=5
"+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"
ng_collect_rollouts +agent_name=my_benchmark_simple_agent
+input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl
+output_jsonl_fpath=results/rollouts.jsonl
+num_repeats=5
"+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"

Compute per-task pass rates

计算每个任务的通过率

ng_reward_profile +input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl
+rollouts_jsonl_fpath=results/rollouts.jsonl
+output_jsonl_fpath=results/profiled.jsonl
+pass_threshold=1.0
ng_reward_profile +input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl
+rollouts_jsonl_fpath=results/rollouts.jsonl
+output_jsonl_fpath=results/profiled.jsonl
+pass_threshold=1.0

Aggregate metrics (pass@1 = avg_reward, pass@k from max_reward)

聚合指标(pass@1 = avg_reward,pass@k来自max_reward)

python scripts/print_aggregate_results.py +jsonl_fpath=results/profiled.jsonl

Increase `num_repeats` until variance < 1% across runs on the same model.

Closed-source models should score at or above open-source models. If not, investigate for bugs. Inspect actual failure cases in the rollout JSONL, not just aggregate numbers.

For external benchmarks: reproduce the original repo's published numbers first. Then reproduce after Gym integration. Scores should match.
python scripts/print_aggregate_results.py +jsonl_fpath=results/profiled.jsonl

增加`num_repeats`直到同一模型多次运行的方差<1%。

闭源模型的得分应等于或高于开源模型。如果不是,请排查问题。检查rollout JSONL中的实际失败案例,不要只看聚合数值。

对于外部基准测试:先在原仓库中复现公开数值,再在Gym集成后复现。得分应一致。

Step 8: Pre-commit and PR

步骤8:预提交和PR

bash
pre-commit run --all-files
First run may fail as hooks auto-modify files (
verified: false
flag, README table). Stage changes and run again.
Set
verified: true
in YAML config after successful baselining. Include W&B links and screenshots of results in the PR description.
To avoid committing unrelated auto-fixes from other servers, scope pre-commit to your files:
bash
pre-commit run --files resources_servers/my_benchmark/**/*
If hooks modify files in other directories, discard those changes:
bash
git checkout -- resources_servers/other_server/
bash
pre-commit run --all-files
首次运行可能失败,因为钩子会自动修改文件(
verified: false
标记、README表格)。暂存更改后再次运行。
成功完成基准测试后,将YAML配置中的
verified
设置为
true
。在PR描述中包含W&B链接和结果截图。
为避免提交其他服务器的自动修复内容,请将预提交范围限定在你的文件:
bash
pre-commit run --files resources_servers/my_benchmark/**/*
如果钩子修改了其他目录的文件,请丢弃这些更改:
bash
git checkout -- resources_servers/other_server/

Constraints

约束条件

  • Use NeMo Gym's OpenAI client (
    nemo_gym/openai_utils.py
    ), not LiteLLM/Anthropic/other
  • Use aiohttp, not httpx, for async HTTP. All async HTTP calls must go through
    nemo_gym.server_utils.request()
    (aiohttp). httpx has O(n^2) connection pooling that hangs at high concurrency. When wrapping external libraries that use httpx internally, replace their HTTP transport with an aiohttp adapter — see
    resources_servers/tavily_search/app.py
    (
    TavilySearchAIOHTTPClient
    ) for the pattern and
    docs/infrastructure/engineering-notes/aiohttp-vs-httpx.md
    for the rationale.
  • Pass configuration through Gym config (YAML), not environment variables
  • Code must run on Linux
  • /run
    endpoint must be async
  • Errors from tool execution or bad model output must return error responses, not crash
  • All commits require DCO sign-off (
    -s
    ) and cryptographic signature (
    -S
    )
  • 使用NeMo Gym的OpenAI客户端(
    nemo_gym/openai_utils.py
    ),不要使用LiteLLM/Anthropic或其他客户端
  • 异步HTTP调用使用aiohttp,不要使用httpx。所有异步HTTP调用必须通过
    nemo_gym.server_utils.request()
    (基于aiohttp)。httpx的连接池存在O(n²)问题,高并发时会挂起。当封装内部使用httpx的外部库时,将其HTTP传输替换为aiohttp适配器 — 参考
    resources_servers/tavily_search/app.py
    中的
    TavilySearchAIOHTTPClient
    实现,以及
    docs/infrastructure/engineering-notes/aiohttp-vs-httpx.md
    中的原理说明。
  • 通过Gym配置(YAML)传递参数,不要使用环境变量
  • 代码必须能在Linux上运行
  • /run
    端点必须是异步的
  • 工具执行或模型输出错误必须返回错误响应,不要崩溃
  • 所有提交需要DCO签名(
    -s
    )和加密签名(
    -S

Reference

参考资料

For detailed code patterns, schemas, and examples: see references/patterns.md.
如需详细的代码模式、schema和示例,请查看references/patterns.md