nemo-evaluator-sdk

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

# NeMo Evaluator SDK - Enterprise LLM Benchmarking


## Quick Start


NeMo Evaluator SDK evaluates LLMs across 100+ benchmarks from 18+ harnesses using containerized, reproducible evaluation with multi-backend execution (local Docker, Slurm HPC, Lepton cloud).

**Installation**:

```bash
pip install nemo-evaluator-launcher
```

**Set API key and run evaluation**:

```bash
export NGC_API_KEY=nvapi-your-key-here
```

**Create minimal config**:


```bash
cat > config.yaml << 'EOF'
defaults:
  - execution: local
  - deployment: none
  - _self_
execution:
  output_dir: ./results
target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY
evaluation:
  tasks:
    - name: ifeval
EOF
```

**Run evaluation**:


```bash
nemo-evaluator-launcher run --config-dir . --config-name config
```

**View available tasks**:

```bash
nemo-evaluator-launcher ls tasks
```

## Common Workflows


### Workflow 1: Evaluate Model on Standard Benchmarks


Run core academic benchmarks (MMLU, GSM8K, IFEval) on any OpenAI-compatible endpoint.

**Checklist**:

Standard Evaluation:
- [ ] Step 1: Configure API endpoint
- [ ] Step 2: Select benchmarks
- [ ] Step 3: Run evaluation
- [ ] Step 4: Check results

**Step 1: Configure API endpoint**

```yaml
# config.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_
execution:
  output_dir: ./results
target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY
```

For self-hosted endpoints (vLLM, TRT-LLM):

```yaml
target:
  api_endpoint:
    model_id: my-model
    url: http://localhost:8000/v1/chat/completions
    api_key_name: ""  # No key needed for local
```

**Step 2: Select benchmarks**

Add tasks to your config:

```yaml
evaluation:
  tasks:
    - name: ifeval           # Instruction following
    - name: gpqa_diamond     # Graduate-level QA
      env_vars:
        HF_TOKEN: HF_TOKEN   # Some tasks need an HF token
    - name: gsm8k_cot_instruct  # Math reasoning
    - name: humaneval        # Code generation
```

**Step 3: Run evaluation**

```bash
# Run with config file
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config

# Override output directory
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config \
  -o execution.output_dir=./my_results

# Limit samples for quick testing
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config \
  -o +evaluation.nemo_evaluator_config.config.params.limit_samples=10
```

**Step 4: Check results**

```bash
# Check job status
nemo-evaluator-launcher status <invocation_id>

# List all runs
nemo-evaluator-launcher ls runs

# View results
cat results/<invocation_id>/<task>/artifacts/results.yml
```
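Once a results file is loaded into a Python dict (e.g. `json.load` on a local JSON export, or a YAML loader on `results.yml`), the scores can be pulled out programmatically. The exact schema of the results file is not documented here, so the walker below is a generic sketch: it recursively collects numeric leaf values from a nested dict, keyed by dotted path. `example` is a placeholder structure purely for illustration, not real benchmark output.

```python
def collect_metrics(node, prefix=""):
    """Recursively collect numeric leaves from a nested results dict."""
    metrics = {}
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            metrics.update(collect_metrics(value, path))
    elif isinstance(node, (int, float)) and not isinstance(node, bool):
        metrics[prefix] = float(node)
    return metrics

# Placeholder structure for illustration only -- not real output.
example = {"tasks": {"ifeval": {"score": 0.81}, "gsm8k": {"accuracy": 0.74}}}
print(collect_metrics(example))
# → {'tasks.ifeval.score': 0.81, 'tasks.gsm8k.accuracy': 0.74}
```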

### Workflow 2: Run Evaluation on Slurm HPC Cluster


Execute large-scale evaluation on HPC infrastructure.

**Checklist**:

Slurm Evaluation:
- [ ] Step 1: Configure Slurm settings
- [ ] Step 2: Set up model deployment
- [ ] Step 3: Launch evaluation
- [ ] Step 4: Monitor job status

**Step 1: Configure Slurm settings**

```yaml
# slurm_config.yaml
defaults:
  - execution: slurm
  - deployment: vllm
  - _self_
execution:
  hostname: cluster.example.com
  account: my_slurm_account
  partition: gpu
  output_dir: /shared/results
  walltime: "04:00:00"
  nodes: 1
  gpus_per_node: 8
```

**Step 2: Set up model deployment**

```yaml
deployment:
  checkpoint_path: /shared/models/llama-3.1-8b
  tensor_parallel_size: 2
  data_parallel_size: 4
  max_model_len: 4096

target:
  api_endpoint:
    model_id: llama-3.1-8b
    # URL auto-generated by deployment
```

**Step 3: Launch evaluation**

```bash
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name slurm_config
```

**Step 4: Monitor job status**
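The deployment above requests `tensor_parallel_size: 2` with `data_parallel_size: 4`, which must fit inside `nodes × gpus_per_node` from the Slurm config. A quick sanity check is sketched below; the GPUs-needed formula (one GPU per tp × dp worker slot) follows the usual vLLM convention, so confirm it against your deployment backend.

```python
def gpus_required(tensor_parallel_size: int, data_parallel_size: int) -> int:
    # Usual vLLM convention: one GPU per (tensor, data) parallel worker slot.
    return tensor_parallel_size * data_parallel_size

def fits_allocation(tp: int, dp: int, nodes: int, gpus_per_node: int) -> bool:
    """True if the deployment's parallelism fits the Slurm allocation."""
    return gpus_required(tp, dp) <= nodes * gpus_per_node

# Values from the configs above: 2 * 4 = 8 GPUs on 1 node with 8 GPUs.
print(fits_allocation(2, 4, nodes=1, gpus_per_node=8))  # → True
```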

```bash
# Check status (queries sacct)
nemo-evaluator-launcher status <invocation_id>

# View detailed info
nemo-evaluator-launcher info <invocation_id>

# Kill if needed
nemo-evaluator-launcher kill <invocation_id>
```

### Workflow 3: Compare Multiple Models


Benchmark multiple models on the same tasks for comparison.

**Checklist**:

Model Comparison:
- [ ] Step 1: Create base config
- [ ] Step 2: Run evaluations with overrides
- [ ] Step 3: Export and compare results

**Step 1: Create base config**

```yaml
# base_eval.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_
execution:
  output_dir: ./comparison_results
evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.01
        parallelism: 4
  tasks:
    - name: mmlu_pro
    - name: gsm8k_cot_instruct
    - name: ifeval
```

**Step 2: Run evaluations with model overrides**

```bash
# Evaluate Llama 3.1 8B
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name base_eval \
  -o target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \
  -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions
```

```bash
# Evaluate Mistral 7B
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name base_eval \
  -o target.api_endpoint.model_id=mistralai/mistral-7b-instruct-v0.3 \
  -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions
```

**Step 3: Export and compare**


```bash
# Export to MLflow
nemo-evaluator-launcher export <invocation_id_1> --dest mlflow
nemo-evaluator-launcher export <invocation_id_2> --dest mlflow

# Export to local JSON
nemo-evaluator-launcher export <invocation_id> --dest local --format json

# Export to Weights & Biases
nemo-evaluator-launcher export <invocation_id> --dest wandb
```
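Once both runs are exported, a small script can line the numbers up side by side. The sketch below renders a `{model: {task: score}}` dict as a fixed-width table; the scores shown are placeholders purely to illustrate the shape, so substitute the values read from each model's exported results.

```python
def comparison_table(results: dict[str, dict[str, float]]) -> str:
    """Render {model: {task: score}} as a fixed-width comparison table."""
    tasks = sorted({t for scores in results.values() for t in scores})
    header = f"{'task':<22}" + "".join(f"{m:>36}" for m in results)
    lines = [header]
    for task in tasks:
        row = f"{task:<22}"
        for model in results:
            row += f"{results[model].get(task, float('nan')):>36.3f}"
        lines.append(row)
    return "\n".join(lines)

# Placeholder scores for illustration only -- not real results.
print(comparison_table({
    "meta/llama-3.1-8b-instruct": {"ifeval": 0.80, "mmlu_pro": 0.47},
    "mistralai/mistral-7b-instruct-v0.3": {"ifeval": 0.55, "mmlu_pro": 0.33},
}))
```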

### Workflow 4: Safety and Vision-Language Evaluation


Evaluate models on safety benchmarks and VLM tasks.

**Checklist**:

Safety/VLM Evaluation:
- [ ] Step 1: Configure safety tasks
- [ ] Step 2: Set up VLM tasks (if applicable)
- [ ] Step 3: Run evaluation

**Step 1: Configure safety tasks**

```yaml
evaluation:
  tasks:
    - name: aegis              # Safety harness
    - name: wildguard          # Safety classification
    - name: garak              # Security probing
```

**Step 2: Configure VLM tasks**

```yaml
# For vision-language models
target:
  api_endpoint:
    type: vlm  # Vision-language endpoint
    model_id: nvidia/llama-3.2-90b-vision-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
evaluation:
  tasks:
    - name: ocrbench   # OCR evaluation
    - name: chartqa    # Chart understanding
    - name: mmmu       # Multimodal understanding
```

## When to Use vs Alternatives


**Use NeMo Evaluator when**:

- You need 100+ benchmarks from 18+ harnesses in one platform
- You run evaluations on Slurm HPC clusters or in the cloud
- You require reproducible, containerized evaluation
- You evaluate against OpenAI-compatible APIs (vLLM, TRT-LLM, NIMs)
- You need enterprise-grade evaluation with result export (MLflow, W&B)

**Use alternatives instead**:

- lm-evaluation-harness: simpler setup for quick local evaluation
- bigcode-evaluation-harness: focused only on code benchmarks
- HELM: Stanford's broader evaluation (fairness, efficiency)
- Custom scripts: highly specialized domain evaluation

## Supported Harnesses and Tasks


| Harness | Task Count | Categories |
| --- | --- | --- |
| lm-evaluation-harness | 60+ | MMLU, GSM8K, HellaSwag, ARC |
| simple-evals | 20+ | GPQA, MATH, AIME |
| bigcode-evaluation-harness | 25+ | HumanEval, MBPP, MultiPL-E |
| safety-harness | 3 | Aegis, WildGuard |
| garak | 1 | Security probing |
| vlmevalkit | 6+ | OCRBench, ChartQA, MMMU |
| bfcl | 6 | Function calling v2/v3 |
| mtbench | 2 | Multi-turn conversation |
| livecodebench | 10+ | Live coding evaluation |
| helm | 15 | Medical domain |
| nemo-skills | 8 | Math, science, agentic |

## Common Issues


**Issue: Container pull fails**

Ensure NGC credentials are configured:

```bash
docker login nvcr.io -u '$oauthtoken' -p $NGC_API_KEY
```

**Issue: Task requires environment variable**

Some tasks need HF_TOKEN or JUDGE_API_KEY:

```yaml
evaluation:
  tasks:
    - name: gpqa_diamond
      env_vars:
        HF_TOKEN: HF_TOKEN  # Maps the task's env var name to the host env var
```

**Issue: Evaluation timeout**

Increase parallelism or reduce samples:

```bash
-o +evaluation.nemo_evaluator_config.config.params.parallelism=8
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=100
```

**Issue: Slurm job not starting**

Check the Slurm account and partition:

```yaml
execution:
  account: correct_account
  partition: gpu
  qos: normal  # May need a specific QOS
```

**Issue: Different results than expected**

Verify the configuration matches the reported settings:

```yaml
evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.0  # Deterministic
        num_fewshot: 5    # Check the paper's few-shot count
```

## CLI Reference


| Command | Description |
| --- | --- |
| `run` | Execute evaluation with config |
| `status <id>` | Check job status |
| `info <id>` | View detailed job info |
| `ls tasks` | List available benchmarks |
| `ls runs` | List all invocations |
| `export <id>` | Export results (mlflow/wandb/local) |
| `kill <id>` | Terminate a running job |

## Configuration Override Examples


```bash
# Override model endpoint
-o target.api_endpoint.model_id=my-model \
-o target.api_endpoint.url=http://localhost:8000/v1/chat/completions

# Add evaluation parameters
-o +evaluation.nemo_evaluator_config.config.params.temperature=0.5 \
-o +evaluation.nemo_evaluator_config.config.params.parallelism=8 \
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=50

# Change execution settings
-o execution.output_dir=/custom/path \
-o execution.mode=parallel

# Dynamically set tasks
-o 'evaluation.tasks=[{name: ifeval}, {name: gsm8k}]'
```
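The `-o key.path=value` syntax follows Hydra-style dotted paths (the `defaults:` lists in the configs above suggest a Hydra-based launcher), and the `+` prefix adds a key not present in the base config. The toy function below only illustrates how a dotted override lands in a nested config; it is not the launcher's actual implementation, and it skips type parsing (values stay strings).

```python
def apply_override(config: dict, override: str) -> dict:
    """Apply one 'a.b.c=value' style override to a nested dict (illustrative)."""
    path, _, raw = override.lstrip("+").partition("=")
    keys = path.split(".")
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = raw  # no type coercion in this sketch
    return config

cfg = {"execution": {"output_dir": "./results"}}
apply_override(cfg, "execution.output_dir=/custom/path")
apply_override(cfg, "+execution.mode=parallel")
print(cfg)  # → {'execution': {'output_dir': '/custom/path', 'mode': 'parallel'}}
```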

## Python API Usage


For programmatic evaluation without the CLI:

```python
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig,
    EvaluationTarget,
    ApiEndpoint,
    EndpointType,
    ConfigParams,
)
```

```python
# Configure evaluation
eval_config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=10,
        temperature=0.0,
        max_new_tokens=1024,
        parallelism=4,
    ),
)

# Configure target endpoint
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        model_id="meta/llama-3.1-8b-instruct",
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        type=EndpointType.CHAT,
        api_key="nvapi-your-key-here",
    )
)

# Run evaluation
result = evaluate(eval_cfg=eval_config, target_cfg=target_config)
```
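Hardcoding the key as above is fine for a quick test, but in practice you would read it from the environment, mirroring the `api_key_name: NGC_API_KEY` pattern in the YAML configs. A small helper along these lines (a convenience sketch, not part of the SDK):

```python
import os

def resolve_api_key(env_var: str = "NGC_API_KEY") -> str:
    """Fetch the API key from the environment, failing loudly if unset."""
    key = os.environ.get(env_var, "")
    if not key:
        raise RuntimeError(f"Set {env_var} before running an evaluation")
    return key

# Then pass it in place of the literal: ApiEndpoint(..., api_key=resolve_api_key())
```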

## Advanced Topics


- **Multi-backend execution**: see references/execution-backends.md
- **Configuration deep-dive**: see references/configuration.md
- **Adapter and interceptor system**: see references/adapter-system.md
- **Custom benchmark integration**: see references/custom-benchmarks.md

## Requirements


- **Python**: 3.10-3.13
- **Docker**: required for local execution
- **NGC API Key**: for pulling containers and using NVIDIA Build
- **HF_TOKEN**: required for some benchmarks (GPQA, MMLU)
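The requirements above can be checked up front. This preflight script is a convenience sketch, not part of the SDK; it only verifies what is visible from the host (interpreter version, `docker` on PATH, `NGC_API_KEY` set).

```python
import os
import shutil
import sys

def preflight() -> list[str]:
    """Return a list of problems; an empty list means the basics look OK."""
    problems = []
    if not ((3, 10) <= sys.version_info[:2] <= (3, 13)):
        problems.append("Python 3.10-3.13 required")
    if shutil.which("docker") is None:
        problems.append("docker not found on PATH (needed for local execution)")
    if not os.environ.get("NGC_API_KEY"):
        problems.append("NGC_API_KEY is not set")
    return problems

for problem in preflight():
    print("WARNING:", problem)
```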

## Resources
