nemo-evaluator-sdk
NeMo Evaluator SDK - Enterprise LLM Benchmarking
Quick Start
NeMo Evaluator SDK evaluates LLMs across 100+ benchmarks from 18+ harnesses using containerized, reproducible evaluation with multi-backend execution (local Docker, Slurm HPC, Lepton cloud).
Installation:

```bash
pip install nemo-evaluator-launcher
```

Set your API key and run an evaluation:

```bash
export NGC_API_KEY=nvapi-your-key-here
```

Create a minimal config:

```bash
cat > config.yaml << 'EOF'
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY

evaluation:
  tasks:
    - name: ifeval
EOF
```
Run the evaluation:

```bash
nemo-evaluator-launcher run --config-dir . --config-name config
```
**View available tasks**:

```bash
nemo-evaluator-launcher ls tasks
```

Common Workflows
Workflow 1: Evaluate Model on Standard Benchmarks
Run core academic benchmarks (MMLU, GSM8K, IFEval) on any OpenAI-compatible endpoint.
Checklist:

Standard Evaluation:
- [ ] Step 1: Configure API endpoint
- [ ] Step 2: Select benchmarks
- [ ] Step 3: Run evaluation
- [ ] Step 4: Check results

**Step 1: Configure API endpoint**

```yaml
# config.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY
```
For self-hosted endpoints (vLLM, TRT-LLM):
```yaml
target:
  api_endpoint:
    model_id: my-model
    url: http://localhost:8000/v1/chat/completions
    api_key_name: ""  # No key needed for local deployments
```

**Step 2: Select benchmarks**

Add tasks to your config:

```yaml
evaluation:
  tasks:
    - name: ifeval               # Instruction following
    - name: gpqa_diamond         # Graduate-level QA
      env_vars:
        HF_TOKEN: HF_TOKEN       # Some tasks need an HF token
    - name: gsm8k_cot_instruct   # Math reasoning
    - name: humaneval            # Code generation
```

**Step 3: Run evaluation**
```bash
# Run with config file
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config
```
```bash
# Override the output directory
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config \
  -o execution.output_dir=./my_results
```
```bash
# Limit samples for quick testing
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config \
  -o +evaluation.nemo_evaluator_config.config.params.limit_samples=10
```

**Step 4: Check results**

```bash
# Check job status
nemo-evaluator-launcher status <invocation_id>

# List all runs
nemo-evaluator-launcher ls runs

# View results
cat results/<invocation_id>/<task>/artifacts/results.yml
```
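The results.yml layout varies by harness, so there is no single schema to rely on. As an illustrative, dependency-free sketch (the flat `metric: value` layout is an assumption, not a guaranteed format), top-level numeric entries can be skimmed like this:

```python
def skim_results(text: str) -> dict[str, float]:
    """Pull top-level 'key: <number>' lines out of a results.yml-style file.

    Assumes a flat 'metric: value' layout; nested or list-valued entries
    are simply skipped.
    """
    scores = {}
    for line in text.splitlines():
        # Skip nested/indented lines and comments; keep only top-level pairs.
        if ":" not in line or line.startswith((" ", "\t", "#")):
            continue
        key, _, value = line.partition(":")
        try:
            scores[key.strip()] = float(value.strip())
        except ValueError:
            continue  # non-numeric values (task names, timestamps) are skipped

    return scores

sample = "accuracy: 0.8123\ntask: ifeval\nstrict_accuracy: 0.7741\n"
print(skim_results(sample))  # {'accuracy': 0.8123, 'strict_accuracy': 0.7741}
```

For anything beyond a quick skim, parse the file with a real YAML library instead.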
Workflow 2: Run Evaluation on Slurm HPC Cluster
Execute large-scale evaluation on HPC infrastructure.
Checklist:

Slurm Evaluation:
- [ ] Step 1: Configure Slurm settings
- [ ] Step 2: Set up model deployment
- [ ] Step 3: Launch evaluation
- [ ] Step 4: Monitor job status

**Step 1: Configure Slurm settings**

```yaml
# slurm_config.yaml
defaults:
  - execution: slurm
  - deployment: vllm
  - _self_

execution:
  hostname: cluster.example.com
  account: my_slurm_account
  partition: gpu
  output_dir: /shared/results
  walltime: "04:00:00"
  nodes: 1
  gpus_per_node: 8
```
**Step 2: Set up model deployment**
```yaml
deployment:
  checkpoint_path: /shared/models/llama-3.1-8b
  tensor_parallel_size: 2
  data_parallel_size: 4
  max_model_len: 4096

target:
  api_endpoint:
    model_id: llama-3.1-8b
    # URL auto-generated by deployment
```
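With this deployment, the tensor-parallel and data-parallel factors must multiply out to the GPUs requested from Slurm (here 2 × 4 = 8 = nodes × gpus_per_node). A quick sanity check before submitting — plain Python, not part of the launcher:

```python
def gpu_layout_ok(tensor_parallel: int, data_parallel: int,
                  nodes: int, gpus_per_node: int) -> bool:
    """True if the parallelism plan exactly fills the Slurm GPU allocation."""
    return tensor_parallel * data_parallel == nodes * gpus_per_node

print(gpu_layout_ok(2, 4, 1, 8))  # True: matches the config above
print(gpu_layout_ok(4, 4, 1, 8))  # False: would need 16 GPUs
```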
**Step 3: Launch evaluation**

```bash
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name slurm_config
```

**Step 4: Monitor job status**
```bash
# Check status (queries sacct)
nemo-evaluator-launcher status <invocation_id>

# View detailed info
nemo-evaluator-launcher info <invocation_id>

# Kill if needed
nemo-evaluator-launcher kill <invocation_id>
```

Workflow 3: Compare Multiple Models
Benchmark multiple models on the same tasks for comparison.
Checklist:

Model Comparison:
- [ ] Step 1: Create base config
- [ ] Step 2: Run evaluations with overrides
- [ ] Step 3: Export and compare results

**Step 1: Create base config**

```yaml
# base_eval.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./comparison_results

evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.01
        parallelism: 4
  tasks:
    - name: mmlu_pro
    - name: gsm8k_cot_instruct
    - name: ifeval
```
**Step 2: Run evaluations with model overrides**
```bash
# Evaluate Llama 3.1 8B
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name base_eval \
  -o target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \
  -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions
```
```bash
# Evaluate Mistral 7B
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name base_eval \
  -o target.api_endpoint.model_id=mistralai/mistral-7b-instruct-v0.3 \
  -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions
```

**Step 3: Export and compare**

```bash
# Export to MLflow
nemo-evaluator-launcher export <invocation_id_1> --dest mlflow
nemo-evaluator-launcher export <invocation_id_2> --dest mlflow

# Export to local JSON
nemo-evaluator-launcher export <invocation_id> --dest local --format json

# Export to Weights & Biases
nemo-evaluator-launcher export <invocation_id> --dest wandb
```
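Once two runs are exported as local JSON, the score diff can be computed in a few lines. This is a sketch under an assumed schema — real export files nest more metadata, so adapt the `task -> {"score": ...}` lookups to what your export actually contains:

```python
import json

def compare_runs(path_a: str, path_b: str) -> dict[str, float]:
    """Per-task score delta (b - a) between two exported JSON result files.

    Assumes each file maps task name -> {"score": float}; the real export
    schema is richer, so adjust the lookups accordingly.
    """
    with open(path_a) as fa, open(path_b) as fb:
        a, b = json.load(fa), json.load(fb)
    # Only tasks present in both runs are comparable; sorting keeps output stable.
    return {task: round(b[task]["score"] - a[task]["score"], 4)
            for task in sorted(a.keys() & b.keys())}
```

Tasks present in only one file are ignored rather than raising, so partially overlapping runs still produce a usable diff.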
Workflow 4: Safety and Vision-Language Evaluation
Evaluate models on safety benchmarks and VLM tasks.
Checklist:

Safety/VLM Evaluation:
- [ ] Step 1: Configure safety tasks
- [ ] Step 2: Set up VLM tasks (if applicable)
- [ ] Step 3: Run evaluation

**Step 1: Configure safety tasks**

```yaml
evaluation:
  tasks:
    - name: aegis      # Safety harness
    - name: wildguard  # Safety classification
    - name: garak      # Security probing
```

**Step 2: Configure VLM tasks**

```yaml
# For vision-language models
target:
  api_endpoint:
    type: vlm  # Vision-language endpoint
    model_id: nvidia/llama-3.2-90b-vision-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions

evaluation:
  tasks:
    - name: ocrbench  # OCR evaluation
    - name: chartqa   # Chart understanding
    - name: mmmu      # Multimodal understanding
```
When to Use vs Alternatives
Use NeMo Evaluator when:
- Need 100+ benchmarks from 18+ harnesses in one platform
- Running evaluations on Slurm HPC clusters or cloud
- Requiring reproducible containerized evaluation
- Evaluating against OpenAI-compatible APIs (vLLM, TRT-LLM, NIMs)
- Need enterprise-grade evaluation with result export (MLflow, W&B)
Use alternatives instead:
- lm-evaluation-harness: Simpler setup for quick local evaluation
- bigcode-evaluation-harness: Focused only on code benchmarks
- HELM: Stanford's broader evaluation (fairness, efficiency)
- Custom scripts: Highly specialized domain evaluation
Supported Harnesses and Tasks
| Task Count | Categories |
|---|---|
| 60+ | MMLU, GSM8K, HellaSwag, ARC |
| 20+ | GPQA, MATH, AIME |
| 25+ | HumanEval, MBPP, MultiPL-E |
| 3 | Aegis, WildGuard |
| 1 | Security probing |
| 6+ | OCRBench, ChartQA, MMMU |
| 6 | Function calling v2/v3 |
| 2 | Multi-turn conversation |
| 10+ | Live coding evaluation |
| 15 | Medical domain |
| 8 | Math, science, agentic |
Common Issues
**Issue: Container pull fails**

Ensure NGC credentials are configured:

```bash
docker login nvcr.io -u '$oauthtoken' -p $NGC_API_KEY
```

**Issue: Task requires environment variable**

Some tasks need HF_TOKEN or JUDGE_API_KEY:

```yaml
evaluation:
  tasks:
    - name: gpqa_diamond
      env_vars:
        HF_TOKEN: HF_TOKEN  # Maps the task's env var name to your env var
```

**Issue: Evaluation timeout**

Increase parallelism or reduce samples:

```bash
-o +evaluation.nemo_evaluator_config.config.params.parallelism=8
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=100
```

**Issue: Slurm job not starting**

Check the Slurm account and partition:

```yaml
execution:
  account: correct_account
  partition: gpu
  qos: normal  # May need a specific QOS
```

**Issue: Different results than expected**

Verify the configuration matches the reported settings:

```yaml
evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.0  # Deterministic
        num_fewshot: 5    # Check the paper's few-shot count
```

CLI Reference
| Command | Description |
|---|---|
| `nemo-evaluator-launcher run` | Execute evaluation with config |
| `nemo-evaluator-launcher status <invocation_id>` | Check job status |
| `nemo-evaluator-launcher info <invocation_id>` | View detailed job info |
| `nemo-evaluator-launcher ls tasks` | List available benchmarks |
| `nemo-evaluator-launcher ls runs` | List all invocations |
| `nemo-evaluator-launcher export <invocation_id>` | Export results (mlflow/wandb/local) |
| `nemo-evaluator-launcher kill <invocation_id>` | Terminate running job |
Configuration Override Examples
```bash
# Override model endpoint
-o target.api_endpoint.model_id=my-model
-o target.api_endpoint.url=http://localhost:8000/v1/chat/completions

# Add evaluation parameters
-o +evaluation.nemo_evaluator_config.config.params.temperature=0.5
-o +evaluation.nemo_evaluator_config.config.params.parallelism=8
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=50

# Change execution settings
-o execution.output_dir=/custom/path
-o execution.mode=parallel

# Dynamically set tasks
-o 'evaluation.tasks=[{name: ifeval}, {name: gsm8k}]'
```
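The `-o` flag takes Hydra-style dotted paths, and a leading `+` adds a key that is absent from the base config. Conceptually each override walks the nested config — a simplified illustration of the mechanics, not the launcher's actual implementation (Hydra also handles type coercion, lists, and interpolation):

```python
def apply_override(config: dict, override: str) -> dict:
    """Apply one 'a.b.c=value' override to a nested dict (simplified sketch)."""
    path, _, value = override.lstrip("+").partition("=")
    *parents, leaf = path.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})  # '+' semantics: create missing levels
    node[leaf] = value  # real Hydra would coerce the value's type here
    return config

cfg = {"execution": {"output_dir": "./results"}}
apply_override(cfg, "execution.output_dir=/custom/path")
apply_override(cfg, "+evaluation.nemo_evaluator_config.config.params.limit_samples=10")
print(cfg["execution"]["output_dir"])  # /custom/path
```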
Python API Usage
For programmatic evaluation without the CLI:
```python
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig,
    EvaluationTarget,
    ApiEndpoint,
    EndpointType,
    ConfigParams,
)

# Configure evaluation
eval_config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=10,
        temperature=0.0,
        max_new_tokens=1024,
        parallelism=4,
    ),
)

# Configure target endpoint
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        model_id="meta/llama-3.1-8b-instruct",
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        type=EndpointType.CHAT,
        api_key="nvapi-your-key-here",
    )
)

# Run evaluation
result = evaluate(eval_cfg=eval_config, target_cfg=target_config)
```
Advanced Topics
- Multi-backend execution: see references/execution-backends.md
- Configuration deep-dive: see references/configuration.md
- Adapter and interceptor system: see references/adapter-system.md
- Custom benchmark integration: see references/custom-benchmarks.md
Requirements
- Python: 3.10-3.13
- Docker: Required for local execution
- NGC API Key: For pulling containers and using NVIDIA Build
- HF_TOKEN: Required for some benchmarks (GPQA, MMLU)
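These prerequisites can be checked up front. A minimal preflight sketch — the `nvapi-` prefix check mirrors the key format shown in the Quick Start and is a heuristic, not an official validation:

```python
import os
import shutil
import sys

def preflight() -> list[str]:
    """Return a list of problems with the local-execution prerequisites."""
    problems = []
    if not ((3, 10) <= sys.version_info[:2] <= (3, 13)):
        problems.append(
            f"Python {sys.version_info[0]}.{sys.version_info[1]} is outside 3.10-3.13"
        )
    if shutil.which("docker") is None:
        problems.append("docker CLI not found on PATH (required for local execution)")
    # Heuristic: NGC keys shown in this guide start with 'nvapi-'.
    if not os.environ.get("NGC_API_KEY", "").startswith("nvapi-"):
        problems.append("NGC_API_KEY not set or not an 'nvapi-' key")
    return problems

for problem in preflight():
    print("WARNING:", problem)
```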
Resources
- GitHub: https://github.com/NVIDIA-NeMo/Evaluator
- NGC Containers: nvcr.io/nvidia/eval-factory/
- NVIDIA Build: https://build.nvidia.com (free hosted models)
- Documentation: https://github.com/NVIDIA-NeMo/Evaluator/tree/main/docs