add-benchmark
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseAdd Benchmark to NeMo-Gym
向NeMo-Gym添加基准测试
Determine Integration Type
确定集成类型
Before starting, determine which type of benchmark you're adding:
Native benchmark — verification logic implemented directly in a Gym resources server:
- Resources server implements with reward logic
verify() - Agent server orchestrates model calls (use for single-turn, or custom agent for multi-turn)
simple_agent - Example: ,
code_gen,instruction_followingmath_with_judge
External benchmark — wrapping a 3rd-party library that has its own orchestration:
- Integrate at the agent server level (not resources server)
- Agent's endpoint wraps the external library
/run - Pre-process from Gym schema to library input, post-process back to
BaseVerifyResponse - Reproduce publicly reported numbers with the original repo first, then reproduce again after Gym integration
- Add the dependency in
requirements.txt
开始之前,请确定你要添加的基准测试类型:
原生基准测试 — 验证逻辑直接在Gym资源服务器中实现:
- 资源服务器实现包含奖励逻辑的方法
verify() - Agent服务器编排模型调用(单轮任务使用,多轮任务使用自定义agent)
simple_agent - 示例:、
code_gen、instruction_followingmath_with_judge
外部基准测试 — 封装自带编排逻辑的第三方库:
- 在Agent服务器层面集成(而非资源服务器)
- Agent的端点封装外部库
/run - 完成从Gym schema到库输入的预处理,以及从库输出到的后处理
BaseVerifyResponse - 先在原仓库中复现公开报告的数值,再在Gym集成后再次复现
- 在中添加依赖
requirements.txt
Workflow
工作流
Step 1: Scaffold the server
步骤1:搭建服务器脚手架
Run to generate the directory structure:
ng_init_resources_serverbash
ng_init_resources_server +entrypoint=resources_servers/my_benchmarkThis creates:
resources_servers/my_benchmark/
├── app.py # Server template
├── configs/my_benchmark.yaml
├── data/.gitignore
├── tests/test_app.py
├── requirements.txt
└── README.mdFor external benchmarks, create the agent server manually under with the same structure.
responses_api_agents/my_agent/运行生成目录结构:
ng_init_resources_serverbash
ng_init_resources_server +entrypoint=resources_servers/my_benchmark该命令会创建以下结构:
resources_servers/my_benchmark/
├── app.py # 服务器模板
├── configs/my_benchmark.yaml
├── data/.gitignore
├── tests/test_app.py
├── requirements.txt
└── README.md对于外部基准测试,请手动在下创建具有相同结构的Agent服务器。
responses_api_agents/my_agent/Step 2: Prepare data
步骤2:准备数据
Convert your source dataset to Gym JSONL format. Each line must have (OpenAI message format). Task-specific verification data goes in .
responses_create_params.inputverifier_metadatajson
{
"responses_create_params": {
"input": [
{"role": "system", "content": "System prompt"},
{"role": "user", "content": "Problem statement"}
]
},
"verifier_metadata": {
"test_cases": [{"input": "...", "expected_output": "..."}],
"task_id": "unique_id"
}
}Data conversion: Write conversion scripts in the source repo (e.g. your dataset repository), not in NeMo-Gym. Prompt files also belong in the source repo. Exception: when there is no external source repo. See § "Data Conversion Script Pattern".
references/patterns.mdexample.jsonldata/example.jsonltrainvalidationbash
ng_upload_dataset_to_gitlab \
+dataset_name=my_benchmark \
+version=0.0.1 \
+input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonlRequires MLflow credentials in (or passed via CLI):
env.yamlyaml
mlflow_tracking_uri: <your-gitlab-mlflow-tracking-uri>
mlflow_tracking_token: <your-gitlab-api-token>data/.gitignore*train.jsonl*validation.jsonlmy_eval.jsonl*eval.jsonlgit rm --cached <file>Validate your data:
bash
undefined将源数据集转换为Gym JSONL格式。每行必须包含(OpenAI消息格式)。任务特定的验证数据放在中。
responses_create_params.inputverifier_metadatajson
{
"responses_create_params": {
"input": [
{"role": "system", "content": "System prompt"},
{"role": "user", "content": "Problem statement"}
]
},
"verifier_metadata": {
"test_cases": [{"input": "...", "expected_output": "..."}],
"task_id": "unique_id"
}
}数据转换:在源仓库(例如你的数据集仓库)中编写转换脚本,而非NeMo-Gym。提示文件也应放在源仓库中。例外情况:当没有外部源仓库时。请参考中的「数据转换脚本模式」章节。
references/patterns.mdexample.jsonldata/example.jsonltrainvalidationbash
ng_upload_dataset_to_gitlab \
+dataset_name=my_benchmark \
+version=0.0.1 \
+input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl需要在中配置MLflow凭证(或通过CLI传入):
env.yamlyaml
mlflow_tracking_uri: <your-gitlab-mlflow-tracking-uri>
mlflow_tracking_token: <your-gitlab-api-token>data/.gitignore*train.jsonl*validation.jsonlmy_eval.jsonl*eval.jsonlgit rm --cached <file>验证数据:
bash
undefinedValidate example data (for PR submission)
验证示例数据(用于PR提交)
ng_prepare_data "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml]"
+output_dirpath=/tmp/prepare +mode=example_validation
+output_dirpath=/tmp/prepare +mode=example_validation
ng_prepare_data "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml]"
+output_dirpath=/tmp/prepare +mode=example_validation
+output_dirpath=/tmp/prepare +mode=example_validation
Download and prepare train/validation from GitLab
从GitLab下载并准备训练/验证数据
ng_prepare_data "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml]"
+output_dirpath=data/my_benchmark +mode=train_preparation +should_download=true +data_source=gitlab
+output_dirpath=data/my_benchmark +mode=train_preparation +should_download=true +data_source=gitlab
undefinedng_prepare_data "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml]"
+output_dirpath=data/my_benchmark +mode=train_preparation +should_download=true +data_source=gitlab
+output_dirpath=data/my_benchmark +mode=train_preparation +should_download=true +data_source=gitlab
undefinedStep 3: Implement verify()
步骤3:实现verify()方法
Edit . The method receives model output + , returns reward.
app.pyverify()verifier_metadataFor code execution benchmarks, see § "Subprocess Execution with Ray" and "Resources Server Pattern".
references/patterns.mdCritical rules:
- Return as 0.0 or 1.0 (binary)
reward - Handle empty/missing model output gracefully — return 0.0, don't crash
- Must handle 4k-65k concurrent requests without crashing
- Use for subprocess concurrency control
asyncio.Semaphore - For Ray remote tasks: (Ray futures are directly awaitable). Never call
result = await futurein async context.ray.get() - Decode subprocess output with
errors="replace" - Strip /
<think>blocks before parsing model output (thinking models emit these)<thinking> - Tests should when external tools aren't installed
pytest.mark.skipif - If the benchmark auto-installs its tool (see Step 3b), add a hook in
pytest_configureto run the install before test collection —conftest.pyevaluates at import time, before fixtures runskipif
编辑。方法接收模型输出和,返回奖励值。
app.pyverify()verifier_metadata对于代码执行类基准测试,请参考中的「基于Ray的子进程执行」和「资源服务器模式」章节。
references/patterns.md关键规则:
- 返回的必须为0.0或1.0(二进制)
reward - 优雅处理空/缺失的模型输出 — 返回0.0,避免崩溃
- 必须能处理4k-65k并发请求而不崩溃
- 使用控制子进程并发
asyncio.Semaphore - 对于Ray远程任务:使用(Ray futures可直接await)。绝不要在异步上下文调用
result = await futureray.get() - 使用解码子进程输出
errors="replace" - 解析模型输出前,去除/
<think>块(思考型模型会生成这些内容)<thinking> - 当外部工具未安装时,测试应使用跳过
pytest.mark.skipif - 如果基准测试会自动安装工具(见步骤3b),请在中添加
conftest.py钩子,在测试收集前运行安装 —pytest_configure在导入时评估,早于fixtures执行skipif
Step 3b: Auto-install external tools (if applicable)
步骤3b:自动安装外部工具(如适用)
If the benchmark requires an external tool (compiler, runtime, etc.), auto-install it on server startup so users don't need manual setup. See § "External Tool Auto-Install Pattern".
references/patterns.mdKey points:
- Create with
setup_<tool>.py— checks PATH, forks onensure_<tool>()(brew on macOS, build from source on Linux)sys.platform - Call it in before semaphore init
model_post_init() - Build scripts should be idempotent and install into a local gitignored prefix
- Add a hook in
pytest_configurethat callstests/conftest.pybefore collectionensure_<tool>()
如果基准测试需要外部工具(编译器、运行时等),请在服务器启动时自动安装,避免用户手动配置。请参考中的「外部工具自动安装模式」章节。
references/patterns.md关键点:
- 创建,包含
setup_<tool>.py方法 — 检查PATH,根据ensure_<tool>()分支处理(macOS使用brew,Linux从源码构建)sys.platform - 在中、信号量初始化前调用该方法
model_post_init() - 构建脚本应具备幂等性,并安装到本地被git忽略的前缀目录
- 在中添加
tests/conftest.py钩子,在测试收集前调用pytest_configureensure_<tool>()
Step 4: Wire YAML config
步骤4:配置YAML文件
Edit . Define the resources server instance and agent pairing(s). See § "YAML Config Pattern".
configs/my_benchmark.yamlreferences/patterns.mdKey points:
- is auto-added by pre-commit hook (set to
verified: falseafter baselining)true - is required for
licenseandtraindatasetsvalidation - Agent references resources server and model server by instance name
For multi-turn benchmarks, either use or create a custom agent. See § "Agent Patterns".
proof_refinement_agentreferences/patterns.mdFor / datasets, add alongside :
trainvalidationgitlab_identifierjsonl_fpathyaml
datasets:
- name: my_dataset
type: train
jsonl_fpath: resources_servers/my_benchmark/data/my_dataset.jsonl
gitlab_identifier:
dataset_name: my_benchmark
version: 0.0.1
artifact_fpath: my_dataset.jsonl
license: MIT
- name: example
type: example
jsonl_fpath: resources_servers/my_benchmark/data/example.jsonlBoth fields must coexist: is the local download destination, tells the system where to fetch from. datasets don't need — they're committed to git directly.
jsonl_fpathgitlab_identifierexamplegitlab_identifier编辑。定义资源服务器实例和Agent配对。请参考中的「YAML配置模式」章节。
configs/my_benchmark.yamlreferences/patterns.md关键点:
- 预提交钩子会自动添加(完成基准测试后设置为
verified: false)true - 和
train数据集必须填写validationlicense - Agent通过实例名称关联资源服务器和模型服务器
对于多轮基准测试,可使用或创建自定义Agent。请参考中的「Agent模式」章节。
proof_refinement_agentreferences/patterns.md对于/数据集,在旁添加:
trainvalidationjsonl_fpathgitlab_identifieryaml
datasets:
- name: my_dataset
type: train
jsonl_fpath: resources_servers/my_benchmark/data/my_dataset.jsonl
gitlab_identifier:
dataset_name: my_benchmark
version: 0.0.1
artifact_fpath: my_dataset.jsonl
license: MIT
- name: example
type: example
jsonl_fpath: resources_servers/my_benchmark/data/example.jsonl两个字段必须同时存在:是本地下载目标路径,告知系统从何处获取数据。数据集无需 — 它们直接提交到git。
jsonl_fpathgitlab_identifierexamplegitlab_identifierStep 5: Test
步骤5:测试
bash
undefinedbash
undefinedRun server tests (creates isolated .venv, slow on first run)
运行服务器测试(创建独立的.venv,首次运行较慢)
ng_test +entrypoint=resources_servers/my_benchmark
ng_test +entrypoint=resources_servers/my_benchmark
Run core library tests to check nothing broke
运行核心库测试,检查是否引入问题
pytest tests/unit_tests/ -x
Test coverage must be >= 95%. Write tests for: verify pass, verify fail (wrong output), verify fail (no code extracted), verify fail (compilation error if applicable), verify timeout.pytest tests/unit_tests/ -x
测试覆盖率必须≥95%。需编写以下测试用例:验证通过、验证失败(输出错误)、验证失败(未提取代码)、验证失败(编译错误,如适用)、验证超时。Step 6: Smoke test end-to-end
步骤6:端到端冒烟测试
bash
undefinedbash
undefinedStart servers
启动服务器
ng_run "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml,responses_api_models/openai_model/configs/openai_model.yaml]"
ng_run "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml,responses_api_models/openai_model/configs/openai_model.yaml]"
Quick test with example data
使用示例数据快速测试
ng_collect_rollouts +agent_name=my_benchmark_simple_agent
+input_jsonl_fpath=resources_servers/my_benchmark/data/example.jsonl
+output_jsonl_fpath=results/example_rollouts.jsonl
+num_repeats=1
"+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"
+input_jsonl_fpath=resources_servers/my_benchmark/data/example.jsonl
+output_jsonl_fpath=results/example_rollouts.jsonl
+num_repeats=1
"+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"
ng_collect_rollouts +agent_name=my_benchmark_simple_agent
+input_jsonl_fpath=resources_servers/my_benchmark/data/example.jsonl
+output_jsonl_fpath=results/example_rollouts.jsonl
+num_repeats=1
"+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"
+input_jsonl_fpath=resources_servers/my_benchmark/data/example.jsonl
+output_jsonl_fpath=results/example_rollouts.jsonl
+num_repeats=1
"+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"
Inspect results
检查结果
undefinedundefinedStep 7: Baseline (reward profiling)
步骤7:基准测试(奖励分析)
Run against multiple models to validate correctness. Recommended suite:
- Your policy model of interest
- At least one open-source instruct model (e.g. Qwen 3 30B A3B Instruct)
- At least one open-source thinking model (e.g. Qwen 3 30B A3B Thinking)
- At least one closed-source model (e.g. GPT-5 Nano or GPT-5)
bash
undefined在多个模型上运行以验证正确性。推荐测试套件:
- 你关注的策略模型
- 至少一个开源指令模型(例如Qwen 3 30B A3B Instruct)
- 至少一个开源思考模型(例如Qwen 3 30B A3B Thinking)
- 至少一个闭源模型(例如GPT-5 Nano或GPT-5)
bash
undefinedCollect rollouts
收集rollouts数据
ng_collect_rollouts +agent_name=my_benchmark_simple_agent
+input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl
+output_jsonl_fpath=results/rollouts.jsonl
+num_repeats=5
"+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"
+input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl
+output_jsonl_fpath=results/rollouts.jsonl
+num_repeats=5
"+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"
ng_collect_rollouts +agent_name=my_benchmark_simple_agent
+input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl
+output_jsonl_fpath=results/rollouts.jsonl
+num_repeats=5
"+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"
+input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl
+output_jsonl_fpath=results/rollouts.jsonl
+num_repeats=5
"+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"
Compute per-task pass rates
计算每个任务的通过率
ng_reward_profile +input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl
+rollouts_jsonl_fpath=results/rollouts.jsonl
+output_jsonl_fpath=results/profiled.jsonl
+pass_threshold=1.0
+rollouts_jsonl_fpath=results/rollouts.jsonl
+output_jsonl_fpath=results/profiled.jsonl
+pass_threshold=1.0
ng_reward_profile +input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl
+rollouts_jsonl_fpath=results/rollouts.jsonl
+output_jsonl_fpath=results/profiled.jsonl
+pass_threshold=1.0
+rollouts_jsonl_fpath=results/rollouts.jsonl
+output_jsonl_fpath=results/profiled.jsonl
+pass_threshold=1.0
Aggregate metrics (pass@1 = avg_reward, pass@k from max_reward)
聚合指标(pass@1 = avg_reward,pass@k来自max_reward)
python scripts/print_aggregate_results.py +jsonl_fpath=results/profiled.jsonl
Increase `num_repeats` until variance < 1% across runs on the same model.
Closed-source models should score at or above open-source models. If not, investigate for bugs. Inspect actual failure cases in the rollout JSONL, not just aggregate numbers.
For external benchmarks: reproduce the original repo's published numbers first. Then reproduce after Gym integration. Scores should match.python scripts/print_aggregate_results.py +jsonl_fpath=results/profiled.jsonl
增加`num_repeats`直到同一模型多次运行的方差<1%。
闭源模型的得分应等于或高于开源模型。如果不是,请排查问题。检查rollout JSONL中的实际失败案例,不要只看聚合数值。
对于外部基准测试:先在原仓库中复现公开数值,再在Gym集成后复现。得分应一致。Step 8: Pre-commit and PR
步骤8:预提交和PR
bash
pre-commit run --all-filesFirst run may fail as hooks auto-modify files ( flag, README table). Stage changes and run again.
verified: falseSet in YAML config after successful baselining. Include W&B links and screenshots of results in the PR description.
verified: trueTo avoid committing unrelated auto-fixes from other servers, scope pre-commit to your files:
bash
pre-commit run --files resources_servers/my_benchmark/**/*If hooks modify files in other directories, discard those changes:
bash
git checkout -- resources_servers/other_server/bash
pre-commit run --all-files首次运行可能失败,因为钩子会自动修改文件(标记、README表格)。暂存更改后再次运行。
verified: false成功完成基准测试后,将YAML配置中的设置为。在PR描述中包含W&B链接和结果截图。
verifiedtrue为避免提交其他服务器的自动修复内容,请将预提交范围限定在你的文件:
bash
pre-commit run --files resources_servers/my_benchmark/**/*如果钩子修改了其他目录的文件,请丢弃这些更改:
bash
git checkout -- resources_servers/other_server/Constraints
约束条件
- Use NeMo Gym's OpenAI client (), not LiteLLM/Anthropic/other
nemo_gym/openai_utils.py - Use aiohttp, not httpx, for async HTTP. All async HTTP calls must go through (aiohttp). httpx has O(n^2) connection pooling that hangs at high concurrency. When wrapping external libraries that use httpx internally, replace their HTTP transport with an aiohttp adapter — see
nemo_gym.server_utils.request()(resources_servers/tavily_search/app.py) for the pattern andTavilySearchAIOHTTPClientfor the rationale.docs/infrastructure/engineering-notes/aiohttp-vs-httpx.md - Pass configuration through Gym config (YAML), not environment variables
- Code must run on Linux
- endpoint must be async
/run - Errors from tool execution or bad model output must return error responses, not crash
- All commits require DCO sign-off () and cryptographic signature (
-s)-S
- 使用NeMo Gym的OpenAI客户端(),不要使用LiteLLM/Anthropic或其他客户端
nemo_gym/openai_utils.py - 异步HTTP调用使用aiohttp,不要使用httpx。所有异步HTTP调用必须通过(基于aiohttp)。httpx的连接池存在O(n²)问题,高并发时会挂起。当封装内部使用httpx的外部库时,将其HTTP传输替换为aiohttp适配器 — 参考
nemo_gym.server_utils.request()中的resources_servers/tavily_search/app.py实现,以及TavilySearchAIOHTTPClient中的原理说明。docs/infrastructure/engineering-notes/aiohttp-vs-httpx.md - 通过Gym配置(YAML)传递参数,不要使用环境变量
- 代码必须能在Linux上运行
- 端点必须是异步的
/run - 工具执行或模型输出错误必须返回错误响应,不要崩溃
- 所有提交需要DCO签名()和加密签名(
-s)-S
Reference
参考资料
For detailed code patterns, schemas, and examples: see references/patterns.md.
如需详细的代码模式、schema和示例,请查看references/patterns.md。