add-benchmark

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

Add Benchmark to NeMo-Gym

向NeMo-Gym添加基准测试

Determine Integration Type

确定集成类型

Before starting, determine which type of benchmark you're adding:

Native benchmark — verification logic implemented directly in a Gym resources server:

Resources server implements
```
verify()
```
with reward logic
Agent server orchestrates model calls (use
```
simple_agent
```
for single-turn, or custom agent for multi-turn)

Example:

code_gen

instruction_following

math_with_judge

External benchmark — wrapping a 3rd-party library that has its own orchestration:

Integrate at the agent server level (not resources server)
Agent's
```
/run
```
endpoint wraps the external library
Pre-process from Gym schema to library input, post-process back to
```
BaseVerifyResponse
```
Reproduce publicly reported numbers with the original repo first, then reproduce again after Gym integration
Add the dependency in
```
requirements.txt
```

开始之前，请确定你要添加的基准测试类型：

原生基准测试 — 验证逻辑直接在Gym资源服务器中实现：

资源服务器实现包含奖励逻辑的
```
verify()
```
方法
Agent服务器编排模型调用（单轮任务使用
```
simple_agent
```
，多轮任务使用自定义agent）

示例：

code_gen

、

instruction_following

、

math_with_judge

外部基准测试 — 封装自带编排逻辑的第三方库：

在Agent服务器层面集成（而非资源服务器）
Agent的
```
/run
```
端点封装外部库
完成从Gym schema到库输入的预处理，以及从库输出到
```
BaseVerifyResponse
```
的后处理
先在原仓库中复现公开报告的数值，再在Gym集成后再次复现
在
```
requirements.txt
```
中添加依赖

Workflow

工作流

Step 1: Scaffold the server

步骤1：搭建服务器脚手架

Run

ng_init_resources_server

to generate the directory structure:

bash

ng_init_resources_server +entrypoint=resources_servers/my_benchmark

This creates:

resources_servers/my_benchmark/
├── app.py              # Server template
├── configs/my_benchmark.yaml
├── data/.gitignore
├── tests/test_app.py
├── requirements.txt
└── README.md

For external benchmarks, create the agent server manually under

responses_api_agents/my_agent/

with the same structure.

运行

ng_init_resources_server

生成目录结构：

bash

ng_init_resources_server +entrypoint=resources_servers/my_benchmark

该命令会创建以下结构：

resources_servers/my_benchmark/
├── app.py              # 服务器模板
├── configs/my_benchmark.yaml
├── data/.gitignore
├── tests/test_app.py
├── requirements.txt
└── README.md

对于外部基准测试，请手动在

responses_api_agents/my_agent/

下创建具有相同结构的Agent服务器。

Step 2: Prepare data

步骤2：准备数据

Convert your source dataset to Gym JSONL format. Each line must have

responses_create_params.input

(OpenAI message format). Task-specific verification data goes in

verifier_metadata

json

{
  "responses_create_params": {
    "input": [
      {"role": "system", "content": "System prompt"},
      {"role": "user", "content": "Problem statement"}
    ]
  },
  "verifier_metadata": {
    "test_cases": [{"input": "...", "expected_output": "..."}],
    "task_id": "unique_id"
  }
}

Data conversion: Write conversion scripts in the source repo (e.g. your dataset repository), not in NeMo-Gym. Prompt files also belong in the source repo. Exception: when there is no external source repo. See

references/patterns.md

§ "Data Conversion Script Pattern".

example.jsonl
: Generate 5 entries for smoke testing. This file is committed directly to git in

data/example.jsonl

train
/
validation
datasets: Upload to the GitLab dataset registry — these must NOT be committed to git.

bash

ng_upload_dataset_to_gitlab \
    +dataset_name=my_benchmark \
    +version=0.0.1 \
    +input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl

Requires MLflow credentials in

env.yaml

(or passed via CLI):

yaml

mlflow_tracking_uri: <your-gitlab-mlflow-tracking-uri>
mlflow_tracking_token: <your-gitlab-api-token>

data/.gitignore
: The scaffold generates default patterns (

*train.jsonl

*validation.jsonl

, etc.). If your filename doesn't match (e.g.

my_eval.jsonl

), add a custom pattern (e.g.

*eval.jsonl

). If data was previously tracked, run

git rm --cached <file>

Validate your data:

bash

undefined

将源数据集转换为Gym JSONL格式。每行必须包含

responses_create_params.input

（OpenAI消息格式）。任务特定的验证数据放在

verifier_metadata

中。

json

{
  "responses_create_params": {
    "input": [
      {"role": "system", "content": "System prompt"},
      {"role": "user", "content": "Problem statement"}
    ]
  },
  "verifier_metadata": {
    "test_cases": [{"input": "...", "expected_output": "..."}],
    "task_id": "unique_id"
  }
}

数据转换：在源仓库（例如你的数据集仓库）中编写转换脚本，而非NeMo-Gym。提示文件也应放在源仓库中。例外情况：当没有外部源仓库时。请参考

references/patterns.md

中的「数据转换脚本模式」章节。

example.jsonl
：生成5条数据用于冒烟测试。该文件直接提交到git，路径为

data/example.jsonl

。

train
/
validation
数据集：上传到GitLab数据集注册表 — 这些数据集禁止提交到git。

bash

ng_upload_dataset_to_gitlab \
    +dataset_name=my_benchmark \
    +version=0.0.1 \
    +input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl

需要在

env.yaml

中配置MLflow凭证（或通过CLI传入）：

yaml

mlflow_tracking_uri: <your-gitlab-mlflow-tracking-uri>
mlflow_tracking_token: <your-gitlab-api-token>

data/.gitignore
：脚手架会生成默认规则（如

*train.jsonl

、

*validation.jsonl

等）。如果你的文件名不匹配（例如

my_eval.jsonl

），请添加自定义规则（例如

*eval.jsonl

）。如果数据之前已被git追踪，请运行

git rm --cached <file>

。

验证数据：

bash

undefined

Validate example data (for PR submission)

验证示例数据（用于PR提交）

ng_prepare_data "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml]"
+output_dirpath=/tmp/prepare +mode=example_validation

Download and prepare train/validation from GitLab

从GitLab下载并准备训练/验证数据

ng_prepare_data "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml]"
+output_dirpath=data/my_benchmark +mode=train_preparation +should_download=true +data_source=gitlab

undefined

ng_prepare_data "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml]"
+output_dirpath=data/my_benchmark +mode=train_preparation +should_download=true +data_source=gitlab

undefined

Step 3: Implement verify()

步骤3：实现verify()方法

Edit

app.py

. The

verify()

method receives model output +

verifier_metadata

, returns reward.

For code execution benchmarks, see

references/patterns.md

§ "Subprocess Execution with Ray" and "Resources Server Pattern".

Critical rules:

Return
```
reward
```
as 0.0 or 1.0 (binary)
Handle empty/missing model output gracefully — return 0.0, don't crash
Must handle 4k-65k concurrent requests without crashing
Use
```
asyncio.Semaphore
```
for subprocess concurrency control
For Ray remote tasks:
```
result = await future
```
(Ray futures are directly awaitable). Never call
```
ray.get()
```
in async context.
Decode subprocess output with
```
errors="replace"
```
Strip
```
<think>
```
/
```
<thinking>
```
blocks before parsing model output (thinking models emit these)
Tests should
```
pytest.mark.skipif
```
when external tools aren't installed
If the benchmark auto-installs its tool (see Step 3b), add a
```
pytest_configure
```
hook in
```
conftest.py
```
to run the install before test collection —
```
skipif
```
evaluates at import time, before fixtures run

编辑

app.py

。

verify()

方法接收模型输出和

verifier_metadata

，返回奖励值。

对于代码执行类基准测试，请参考

references/patterns.md

中的「基于Ray的子进程执行」和「资源服务器模式」章节。

关键规则：

返回的
```
reward
```
必须为0.0或1.0（二进制）
优雅处理空/缺失的模型输出 — 返回0.0，避免崩溃
必须能处理4k-65k并发请求而不崩溃
使用
```
asyncio.Semaphore
```
控制子进程并发
对于Ray远程任务：使用
```
result = await future
```
（Ray futures可直接await）。绝不要在异步上下文调用
```
ray.get()
```
使用
```
errors="replace"
```
解码子进程输出
解析模型输出前，去除
```
<think>
```
/
```
<thinking>
```
块（思考型模型会生成这些内容）
当外部工具未安装时，测试应使用
```
pytest.mark.skipif
```
跳过
如果基准测试会自动安装工具（见步骤3b），请在
```
conftest.py
```
中添加
```
pytest_configure
```
钩子，在测试收集前运行安装 —
```
skipif
```
在导入时评估，早于fixtures执行

Step 3b: Auto-install external tools (if applicable)

步骤3b：自动安装外部工具（如适用）

If the benchmark requires an external tool (compiler, runtime, etc.), auto-install it on server startup so users don't need manual setup. See

references/patterns.md

§ "External Tool Auto-Install Pattern".

Key points:

Create
```
setup_<tool>.py
```
with
```
ensure_<tool>()
```
— checks PATH, forks on
```
sys.platform
```
(brew on macOS, build from source on Linux)
Call it in
```
model_post_init()
```
before semaphore init
Build scripts should be idempotent and install into a local gitignored prefix

Add a

pytest_configure

hook in

tests/conftest.py

that calls

ensure_<tool>()

before collection

如果基准测试需要外部工具（编译器、运行时等），请在服务器启动时自动安装，避免用户手动配置。请参考

references/patterns.md

中的「外部工具自动安装模式」章节。

关键点：

创建
```
setup_<tool>.py
```
，包含
```
ensure_<tool>()
```
方法 — 检查PATH，根据
```
sys.platform
```
分支处理（macOS使用brew，Linux从源码构建）
在
```
model_post_init()
```
中、信号量初始化前调用该方法
构建脚本应具备幂等性，并安装到本地被git忽略的前缀目录

在

tests/conftest.py

中添加

pytest_configure

钩子，在测试收集前调用

ensure_<tool>()

Step 4: Wire YAML config

步骤4：配置YAML文件

Edit

configs/my_benchmark.yaml

. Define the resources server instance and agent pairing(s). See

references/patterns.md

§ "YAML Config Pattern".

Key points:

```
verified: false
```
is auto-added by pre-commit hook (set to
```
true
```
after baselining)
```
license
```
is required for
```
train
```
and
```
validation
```
datasets
Agent references resources server and model server by instance name

For multi-turn benchmarks, either use

proof_refinement_agent

or create a custom agent. See

references/patterns.md

§ "Agent Patterns".

For

train

validation

datasets, add

gitlab_identifier

alongside

jsonl_fpath

yaml

datasets:
- name: my_dataset
  type: train
  jsonl_fpath: resources_servers/my_benchmark/data/my_dataset.jsonl
  gitlab_identifier:
    dataset_name: my_benchmark
    version: 0.0.1
    artifact_fpath: my_dataset.jsonl
  license: MIT
- name: example
  type: example
  jsonl_fpath: resources_servers/my_benchmark/data/example.jsonl

Both fields must coexist:

jsonl_fpath

is the local download destination,

gitlab_identifier

tells the system where to fetch from.

example

datasets don't need

gitlab_identifier

— they're committed to git directly.

编辑

configs/my_benchmark.yaml

。定义资源服务器实例和Agent配对。请参考

references/patterns.md

中的「YAML配置模式」章节。

关键点：

预提交钩子会自动添加
```
verified: false
```
（完成基准测试后设置为
```
true
```
）
```
train
```
和
```
validation
```
数据集必须填写
```
license
```
Agent通过实例名称关联资源服务器和模型服务器

对于多轮基准测试，可使用

proof_refinement_agent

或创建自定义Agent。请参考

references/patterns.md

中的「Agent模式」章节。

对于

train

validation

数据集，在

jsonl_fpath

旁添加

gitlab_identifier

：

yaml

datasets:
- name: my_dataset
  type: train
  jsonl_fpath: resources_servers/my_benchmark/data/my_dataset.jsonl
  gitlab_identifier:
    dataset_name: my_benchmark
    version: 0.0.1
    artifact_fpath: my_dataset.jsonl
  license: MIT
- name: example
  type: example
  jsonl_fpath: resources_servers/my_benchmark/data/example.jsonl

两个字段必须同时存在：

jsonl_fpath

是本地下载目标路径，

gitlab_identifier

告知系统从何处获取数据。

example

数据集无需

gitlab_identifier

— 它们直接提交到git。

Step 5: Test

步骤5：测试

bash

undefined

bash

undefined

Run server tests (creates isolated .venv, slow on first run)

运行服务器测试（创建独立的.venv，首次运行较慢）

ng_test +entrypoint=resources_servers/my_benchmark

Run core library tests to check nothing broke

运行核心库测试，检查是否引入问题

pytest tests/unit_tests/ -x


Test coverage must be >= 95%. Write tests for: verify pass, verify fail (wrong output), verify fail (no code extracted), verify fail (compilation error if applicable), verify timeout.

pytest tests/unit_tests/ -x


测试覆盖率必须≥95%。需编写以下测试用例：验证通过、验证失败（输出错误）、验证失败（未提取代码）、验证失败（编译错误，如适用）、验证超时。

Step 6: Smoke test end-to-end

步骤6：端到端冒烟测试

bash

undefined

bash

undefined

Start servers

启动服务器

ng_run "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml,responses_api_models/openai_model/configs/openai_model.yaml]"

Quick test with example data

使用示例数据快速测试

ng_collect_rollouts +agent_name=my_benchmark_simple_agent
+input_jsonl_fpath=resources_servers/my_benchmark/data/example.jsonl
+output_jsonl_fpath=results/example_rollouts.jsonl
+num_repeats=1
"+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"

Inspect results

检查结果

undefined

undefined

Step 7: Baseline (reward profiling)

步骤7：基准测试（奖励分析）

Run against multiple models to validate correctness. Recommended suite:

Your policy model of interest
At least one open-source instruct model (e.g. Qwen 3 30B A3B Instruct)
At least one open-source thinking model (e.g. Qwen 3 30B A3B Thinking)
At least one closed-source model (e.g. GPT-5 Nano or GPT-5)

bash

undefined

在多个模型上运行以验证正确性。推荐测试套件：

你关注的策略模型
至少一个开源指令模型（例如Qwen 3 30B A3B Instruct）
至少一个开源思考模型（例如Qwen 3 30B A3B Thinking）
至少一个闭源模型（例如GPT-5 Nano或GPT-5）

bash

undefined

Collect rollouts

收集rollouts数据

ng_collect_rollouts +agent_name=my_benchmark_simple_agent
+input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl
+output_jsonl_fpath=results/rollouts.jsonl
+num_repeats=5
"+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"

Compute per-task pass rates

计算每个任务的通过率

ng_reward_profile +input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl
+rollouts_jsonl_fpath=results/rollouts.jsonl
+output_jsonl_fpath=results/profiled.jsonl
+pass_threshold=1.0

Aggregate metrics (pass@1 = avg_reward, pass@k from max_reward)

聚合指标（pass@1 = avg_reward，pass@k来自max_reward）

python scripts/print_aggregate_results.py +jsonl_fpath=results/profiled.jsonl


Increase `num_repeats` until variance < 1% across runs on the same model.

Closed-source models should score at or above open-source models. If not, investigate for bugs. Inspect actual failure cases in the rollout JSONL, not just aggregate numbers.

For external benchmarks: reproduce the original repo's published numbers first. Then reproduce after Gym integration. Scores should match.

python scripts/print_aggregate_results.py +jsonl_fpath=results/profiled.jsonl


增加`num_repeats`直到同一模型多次运行的方差<1%。

闭源模型的得分应等于或高于开源模型。如果不是，请排查问题。检查rollout JSONL中的实际失败案例，不要只看聚合数值。

对于外部基准测试：先在原仓库中复现公开数值，再在Gym集成后复现。得分应一致。

Step 8: Pre-commit and PR

步骤8：预提交和PR

bash

pre-commit run --all-files

First run may fail as hooks auto-modify files (

verified: false

flag, README table). Stage changes and run again.

Set

verified: true

in YAML config after successful baselining. Include W&B links and screenshots of results in the PR description.

To avoid committing unrelated auto-fixes from other servers, scope pre-commit to your files:

bash

pre-commit run --files resources_servers/my_benchmark/**/*

If hooks modify files in other directories, discard those changes:

bash

git checkout -- resources_servers/other_server/

bash

pre-commit run --all-files

首次运行可能失败，因为钩子会自动修改文件（

verified: false

标记、README表格）。暂存更改后再次运行。

成功完成基准测试后，将YAML配置中的

verified

设置为

true

。在PR描述中包含W&B链接和结果截图。

为避免提交其他服务器的自动修复内容，请将预提交范围限定在你的文件：

bash

pre-commit run --files resources_servers/my_benchmark/**/*

如果钩子修改了其他目录的文件，请丢弃这些更改：

bash

git checkout -- resources_servers/other_server/

Constraints

约束条件

Use NeMo Gym's OpenAI client (
```
nemo_gym/openai_utils.py
```
), not LiteLLM/Anthropic/other
Use aiohttp, not httpx, for async HTTP. All async HTTP calls must go through
```
nemo_gym.server_utils.request()
```
(aiohttp). httpx has O(n^2) connection pooling that hangs at high concurrency. When wrapping external libraries that use httpx internally, replace their HTTP transport with an aiohttp adapter — see
```
resources_servers/tavily_search/app.py
```
(
```
TavilySearchAIOHTTPClient
```
) for the pattern and
```
docs/infrastructure/engineering-notes/aiohttp-vs-httpx.md
```
for the rationale.
Pass configuration through Gym config (YAML), not environment variables
Code must run on Linux
```
/run
```
endpoint must be async
Errors from tool execution or bad model output must return error responses, not crash
All commits require DCO sign-off (
```
-s
```
) and cryptographic signature (
```
-S
```
)

使用NeMo Gym的OpenAI客户端（
```
nemo_gym/openai_utils.py
```
），不要使用LiteLLM/Anthropic或其他客户端
异步HTTP调用使用aiohttp，不要使用httpx。所有异步HTTP调用必须通过
```
nemo_gym.server_utils.request()
```
（基于aiohttp）。httpx的连接池存在O(n²)问题，高并发时会挂起。当封装内部使用httpx的外部库时，将其HTTP传输替换为aiohttp适配器 — 参考
```
resources_servers/tavily_search/app.py
```
中的
```
TavilySearchAIOHTTPClient
```
实现，以及
```
docs/infrastructure/engineering-notes/aiohttp-vs-httpx.md
```
中的原理说明。
通过Gym配置（YAML）传递参数，不要使用环境变量
代码必须能在Linux上运行
```
/run
```
端点必须是异步的
工具执行或模型输出错误必须返回错误响应，不要崩溃
所有提交需要DCO签名（
```
-s
```
）和加密签名（
```
-S
```
）

Reference

参考资料

For detailed code patterns, schemas, and examples: see references/patterns.md.

如需详细的代码模式、schema和示例，请查看references/patterns.md。