Guide for adding a new benchmark or training environment to NeMo-Gym. Use when the user asks to add, create, or integrate a benchmark, evaluation, training environment, or resources server into NeMo-Gym. Also use when wrapping an existing 3rd-party benchmark library. Covers the full workflow: data preparation, resources server implementation, agent wiring, YAML config, testing, and reward profiling (baselining). Triggered by: "add benchmark", "new resources server", "integrate benchmark", "wrap benchmark", "add training environment", "add eval".
npx skill4agent add nvidia/skills add-benchmark
verify(), simple_agent, code_gen, instruction_following, math_with_judge, /run, BaseVerifyResponse, requirements.txt, ng_init_resources_server
ng_init_resources_server +entrypoint=resources_servers/my_benchmark
resources_servers/my_benchmark/
├── app.py # Server template
├── configs/my_benchmark.yaml
├── data/.gitignore
├── tests/test_app.py
├── requirements.txt
└── README.md
responses_api_agents/my_agent/, responses_create_params.input, verifier_metadata
{
  "responses_create_params": {
    "input": [
      {"role": "system", "content": "System prompt"},
      {"role": "user", "content": "Problem statement"}
    ]
  },
  "verifier_metadata": {
    "test_cases": [{"input": "...", "expected_output": "..."}],
    "task_id": "unique_id"
  }
}
references/patterns.md, example.jsonl, data/example.jsonl, train, validation
ng_upload_dataset_to_gitlab \
+dataset_name=my_benchmark \
+version=0.0.1 \
+input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl
env.yaml
mlflow_tracking_uri: <your-gitlab-mlflow-tracking-uri>
mlflow_tracking_token: <your-gitlab-api-token>
data/.gitignore, *train.jsonl, *validation.jsonl, my_eval.jsonl, *eval.jsonl, git rm --cached <file>
# Validate example data (for PR submission)
ng_prepare_data "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml]" \
+output_dirpath=/tmp/prepare +mode=example_validation
# Download and prepare train/validation from GitLab
ng_prepare_data "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml]" \
+output_dirpath=data/my_benchmark +mode=train_preparation +should_download=true +data_source=gitlab
app.py, verify(), verifier_metadata, references/patterns.md, reward, asyncio.Semaphore, result = await future, ray.get(), errors="replace", <think>, <thinking>, pytest.mark.skipif, pytest_configure, conftest.py, skipif, references/patterns.md, setup_<tool>.py, ensure_<tool>(), sys.platform, model_post_init(), pytest_configure, tests/conftest.py, ensure_<tool>()
configs/my_benchmark.yaml, references/patterns.md, verified: false, true, license, train, validation, proof_refinement_agent, references/patterns.md, train, validation, gitlab_identifier, jsonl_fpath
datasets:
  - name: my_dataset
    type: train
    jsonl_fpath: resources_servers/my_benchmark/data/my_dataset.jsonl
    gitlab_identifier:
      dataset_name: my_benchmark
      version: 0.0.1
      artifact_fpath: my_dataset.jsonl
    license: MIT
  - name: example
    type: example
    jsonl_fpath: resources_servers/my_benchmark/data/example.jsonl
jsonl_fpath, gitlab_identifier, example, gitlab_identifier
# Run server tests (creates isolated .venv, slow on first run)
ng_test +entrypoint=resources_servers/my_benchmark
# Run core library tests to check nothing broke
pytest tests/unit_tests/ -x
# Start servers
ng_run "+config_paths=[resources_servers/my_benchmark/configs/my_benchmark.yaml,responses_api_models/openai_model/configs/openai_model.yaml]"
# Quick test with example data
ng_collect_rollouts +agent_name=my_benchmark_simple_agent \
+input_jsonl_fpath=resources_servers/my_benchmark/data/example.jsonl \
+output_jsonl_fpath=results/example_rollouts.jsonl \
+num_repeats=1 \
"+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"
# Inspect results
# Collect rollouts
ng_collect_rollouts +agent_name=my_benchmark_simple_agent \
+input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl \
+output_jsonl_fpath=results/rollouts.jsonl \
+num_repeats=5 \
"+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"
# Compute per-task pass rates
ng_reward_profile +input_jsonl_fpath=resources_servers/my_benchmark/data/my_dataset.jsonl \
+rollouts_jsonl_fpath=results/rollouts.jsonl \
+output_jsonl_fpath=results/profiled.jsonl \
+pass_threshold=1.0
# Aggregate metrics (pass@1 = avg_reward, pass@k from max_reward)
python scripts/print_aggregate_results.py +jsonl_fpath=results/profiled.jsonl
num_repeats
pre-commit run --all-files
verified: false, verified: true
pre-commit run --files resources_servers/my_benchmark/**/*
git checkout -- resources_servers/other_server/
nemo_gym/openai_utils.py, nemo_gym.server_utils.request(), resources_servers/tavily_search/app.py, TavilySearchAIOHTTPClient, docs/infrastructure/engineering-notes/aiohttp-vs-httpx.md, /run, -s, -S
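The comment on the aggregation step in the reward-profiling commands (pass@1 = avg_reward, pass@k from max_reward) can be illustrated with a small standalone sketch. This is not the implementation of ng_reward_profile or print_aggregate_results.py. The function names, the in-memory dictionary of rewards, and the assumption that rewards are binary (0.0 or 1.0) are all hypothetical and used only for illustration.

```python
from statistics import mean

# Hypothetical helper: summarize the repeated rollouts of a single task.
def profile_task(rewards, pass_threshold=1.0):
    best = max(rewards)
    return {
        "avg_reward": mean(rewards),           # per-task pass@1 when rewards are 0/1
        "max_reward": best,                    # best of the k repeats
        "solved_at_k": best >= pass_threshold, # did any repeat reach the threshold?
    }

# Hypothetical helper: average the per-task summaries over all tasks.
def aggregate(per_task_rewards, pass_threshold=1.0):
    profiles = [profile_task(r, pass_threshold) for r in per_task_rewards.values()]
    return {
        "pass@1": mean(p["avg_reward"] for p in profiles),
        "pass@k": mean(1.0 if p["solved_at_k"] else 0.0 for p in profiles),
    }

# Made-up rewards for two tasks, five repeats each (mirrors +num_repeats=5).
rollout_rewards = {
    "task_a": [1.0, 0.0, 1.0, 1.0, 0.0],
    "task_b": [0.0, 0.0, 0.0, 0.0, 0.0],
}
summary = aggregate(rollout_rewards)
print(summary)
```

In this sketch, pass@k counts a task as solved if any of its repeats reaches the threshold, so it can only stay flat or rise as num_repeats grows, while pass@1 estimates the single-attempt success rate.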