Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from the Artificial Analysis API, and running custom model evaluations with vLLM/lighteval. Works with the `model-index` metadata format.
## Installation

```bash
npx skill4agent add huggingface/skills hugging-face-evaluation
```

Scripts are run with `uv run`. Quick start:

```bash
uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"
```

All paths are relative to the directory containing this SKILL.md file. Before running any script, first `cd` to that directory or use the full path.
Every subcommand supports `--help`:

```bash
uv run scripts/evaluation_manager.py --help
uv run scripts/evaluation_manager.py inspect-tables --help
uv run scripts/evaluation_manager.py extract-readme --help
```

Key subcommands are `get-prs`, `inspect-tables`, and `extract-readme` (with `--table N`, `--apply`, or `--create-pr`). Run `inspect-tables` first to find the right `--table N`, `--model-column-index`, and `--model-name-override` values. `--task-type` sets `task.type` in the generated model-index (e.g. `text-generation`, `summarization`).

Scripts are launched with `uv run`; to install the dependencies manually instead:

```bash
pip install huggingface-hub markdown-it-py python-dotenv pyyaml requests
```

Set `HF_TOKEN` and `AA_API_KEY` either in the environment or in a `.env` file (picked up via `python-dotenv`).
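For reference, a minimal sketch of how a script can pick these variables up with `python-dotenv` (illustrative only; the actual loading lives inside the scripts themselves):

```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory, if present

hf_token = os.environ.get("HF_TOKEN")
aa_api_key = os.environ.get("AA_API_KEY")
if not hf_token:
    raise SystemExit("HF_TOKEN is not set; add it to .env or export it")
```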
### Workflow: extract evals from a README

```bash
# 1) Inspect tables to get table numbers and column hints
uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model"
# 2) Extract a specific table (prints YAML by default)
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "username/model" \
--table 1 \
[--model-column-index <column index shown by inspect-tables>] \
[--model-name-override "<column header/model name>"] # use exact header text if you can't use the index
# 3) Apply changes (push or PR)
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "username/model" \
--table 1 \
--apply # push directly
# or
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "username/model" \
--table 1 \
--create-pr # open a PR
```

If auto-detection picks the wrong column, pass `--model-column-index` or `--model-name-override` explicitly.

### Import from Artificial Analysis

```bash
AA_API_KEY="your-api-key" uv run scripts/evaluation_manager.py import-aa \
--creator-slug "anthropic" \
--model-name "claude-sonnet-4" \
--repo-id "username/model-name"
```

Or keep the keys in a `.env` file:

```bash
# Create .env file
echo "AA_API_KEY=your-api-key" >> .env
echo "HF_TOKEN=your-hf-token" >> .env
# Run import
uv run scripts/evaluation_manager.py import-aa \
--creator-slug "anthropic" \
--model-name "claude-sonnet-4" \
--repo-id "username/model-name"
```

To open a PR instead of pushing directly:

```bash
uv run scripts/evaluation_manager.py import-aa \
--creator-slug "anthropic" \
--model-name "claude-sonnet-4" \
--repo-id "username/model-name" \
--create-pr
```

## Run evaluations on HF Jobs

Launch evaluation jobs with `hf jobs uv run`:

```bash
HF_TOKEN=$HF_TOKEN \
hf jobs uv run hf-evaluation/scripts/inspect_eval_uv.py \
--flavor cpu-basic \
--secret HF_TOKEN=$HF_TOKEN \
-- --model "meta-llama/Llama-2-7b-hf" \
--task "mmlu"HF_TOKEN=$HF_TOKEN \
hf jobs uv run hf-evaluation/scripts/inspect_eval_uv.py \
--flavor a10g-small \
--secret HF_TOKEN=$HF_TOKEN \
-- --model "meta-llama/Llama-2-7b-hf" \
--task "gsm8k"uv run scripts/run_eval_job.py \
--model "meta-llama/Llama-2-7b-hf" \
--task "mmlu" \
--hardware "t4-small"| Feature | vLLM Scripts | Inference Provider Scripts |
|---|---|---|
| Model access | Any HF model | Models with API endpoints |
| Hardware | Your GPU (or HF Jobs GPU) | Provider's infrastructure |
| Cost | HF Jobs compute cost | API usage fees |
| Speed | vLLM optimized | Depends on provider |
| Offline | Yes (after download) | No |
## lighteval with vLLM

```bash
# Run MMLU 5-shot with vLLM
uv run scripts/lighteval_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--tasks "leaderboard|mmlu|5"
# Run multiple tasks
uv run scripts/lighteval_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5"
# Use accelerate backend instead of vLLM
uv run scripts/lighteval_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--tasks "leaderboard|mmlu|5" \
--backend accelerate
# Chat/instruction-tuned models
uv run scripts/lighteval_vllm_uv.py \
--model meta-llama/Llama-3.2-1B-Instruct \
--tasks "leaderboard|mmlu|5" \
--use-chat-template
```

On HF Jobs:

```bash
hf jobs uv run scripts/lighteval_vllm_uv.py \
--flavor a10g-small \
--secrets HF_TOKEN=$HF_TOKEN \
-- --model meta-llama/Llama-3.2-1B \
--tasks "leaderboard|mmlu|5"suite|task|num_fewshotleaderboard|mmlu|5leaderboard|gsm8k|5lighteval|hellaswag|0leaderboard|arc_challenge|25suite|task|num_fewshot|00leaderboardlightevalbigbenchoriginalsuite|task|num_fewshot0--tasksleaderboard|mmlu|0leaderboard|mmlu|05bigbench|abstract_narrative_understanding|0bigbench|abstract_narrative_understanding|0lighteval|wmt14:hi-en|0lighteval|wmt14:hi-en|0--tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5"# Run MMLU with vLLM
## inspect-ai with vLLM

```bash
# Run MMLU with vLLM
uv run scripts/inspect_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--task mmlu
# Use HuggingFace Transformers backend
uv run scripts/inspect_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--task mmlu \
--backend hf
# Multi-GPU with tensor parallelism
uv run scripts/inspect_vllm_uv.py \
--model meta-llama/Llama-3.2-70B \
--task mmlu \
--tensor-parallel-size 4
```

On HF Jobs:

```bash
hf jobs uv run scripts/inspect_vllm_uv.py \
--flavor a10g-small \
--secrets HF_TOKEN=$HF_TOKEN \
-- --model meta-llama/Llama-3.2-1B \
--task mmlu
```

Supported inspect-ai tasks include `mmlu`, `gsm8k`, `hellaswag`, `arc_challenge`, `truthfulqa`, `winogrande`, and `humaneval`.

## Helper script: run_vllm_eval_job.py

```bash
# Auto-detect hardware based on model size
uv run scripts/run_vllm_eval_job.py \
--model meta-llama/Llama-3.2-1B \
--task "leaderboard|mmlu|5" \
--framework lighteval
# Explicit hardware selection
uv run scripts/run_vllm_eval_job.py \
--model meta-llama/Llama-3.2-70B \
--task mmlu \
--framework inspect \
--hardware a100-large \
--tensor-parallel-size 4
# Use HF Transformers backend
uv run scripts/run_vllm_eval_job.py \
--model microsoft/phi-2 \
--task mmlu \
--framework inspect \
--backend hf
```

Recommended hardware by model size (a rough guide; adjust for quantization and context length):

| Model Size | Recommended Hardware |
|---|---|
| < 3B params | `a10g-small` |
| 3B - 13B | `a10g-large` |
| 13B - 34B | `a100-large` |
| 34B+ | `a100-large` with `--tensor-parallel-size` |
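A sketch of the kind of size-based heuristic the helper applies (hypothetical thresholds and function name; `run_vllm_eval_job.py` is the source of truth):

```python
def pick_hardware(num_params_billions: float) -> str:
    """Map a model's parameter count to an HF Jobs flavor (rough heuristic)."""
    if num_params_billions < 3:
        return "a10g-small"
    if num_params_billions < 13:
        return "a10g-large"
    if num_params_billions < 34:
        return "a100-large"
    # Largest models also need sharding via --tensor-parallel-size
    return "a100-large"


print(pick_hardware(70))  # a100-large
```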
## Command reference

### evaluation_manager.py

```bash
uv run scripts/evaluation_manager.py --help
uv run scripts/evaluation_manager.py --version
```

Inspect README tables:

```bash
uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model-name"
```

Extract a table into model-index metadata:

```bash
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "username/model-name" \
--table N \
[--model-column-index N] \
[--model-name-override "Exact Column Header or Model Name"] \
[--task-type "text-generation"] \
[--dataset-name "Custom Benchmarks"] \
[--apply | --create-pr]
```

Import from Artificial Analysis:

```bash
AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa \
--creator-slug "creator-name" \
--model-name "model-slug" \
--repo-id "username/model-name" \
[--create-pr]
```

Show or validate existing model-index metadata:

```bash
uv run scripts/evaluation_manager.py show --repo-id "username/model-name"
uv run scripts/evaluation_manager.py validate --repo-id "username/model-name"
```

List open PRs:

```bash
uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"
```

### inspect-ai jobs

```bash
hf jobs uv run scripts/inspect_eval_uv.py \
--flavor "cpu-basic|t4-small|..." \
--secret HF_TOKEN=$HF_TOKEN \
-- --model "model-id" \
--task "task-name"uv run scripts/run_eval_job.py \
--model "model-id" \
--task "task-name" \
--hardware "cpu-basic|t4-small|..."# lighteval with vLLM
hf jobs uv run scripts/lighteval_vllm_uv.py \
--flavor "a10g-small" \
--secrets HF_TOKEN=$HF_TOKEN \
-- --model "model-id" \
--tasks "leaderboard|mmlu|5"
# inspect-ai with vLLM
hf jobs uv run scripts/inspect_vllm_uv.py \
--flavor "a10g-small" \
--secrets HF_TOKEN=$HF_TOKEN \
-- --model "model-id" \
--task "mmlu"
# Helper script (auto hardware selection)
uv run scripts/run_vllm_eval_job.py \
--model "model-id" \
--task "leaderboard|mmlu|5" \
--framework lighteval
```

## model-index metadata format

Evaluation results are stored in the model card's YAML front matter:
```yaml
model-index:
- name: Model Name
  results:
  - task:
      type: text-generation
    dataset:
      name: Benchmark Dataset
      type: benchmark_type
    metrics:
    - name: MMLU
      type: mmlu
      value: 85.2
    - name: HumanEval
      type: humaneval
      value: 72.5
    source:
      name: Source Name
      url: https://source-url.com
```

## Tips

- Always run `get-prs` before opening a PR and `inspect-tables` before extracting; every subcommand has `--help` (e.g. `inspect-tables --help`).
- Use `--apply` to push directly to repos you own; use `--create-pr` for repos you don't.
- If the wrong table or model column is picked, pass `--table N` or `--model-name-override` (`inspect-tables` shows the available choices).

### Model name matching

When matching a model name against table columns, markdown formatting (`**`, `[]()`, `-`, `_`) is stripped and the name is tokenized: `"OLMo-3-32B"` becomes `{"olmo", "3", "32b"}`, so headers like `"**Olmo 3 32B**"` or `"[Olmo-3-32B](...)"` all match.
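A sketch of that normalization (illustrative only; the real matching logic lives in `evaluation_manager.py`):

```python
import re


def normalize_tokens(name: str) -> set[str]:
    """Strip markdown formatting and split a model name into comparable tokens."""
    name = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", name)  # [text](url) -> text
    name = name.replace("**", "")                          # drop bold markers
    tokens = re.split(r"[-_\s]+", name.lower())            # split on -, _, spaces
    return {t for t in tokens if t}


assert normalize_tokens("OLMo-3-32B") == {"olmo", "3", "32b"}
assert normalize_tokens("**Olmo 3 32B**") == normalize_tokens("[Olmo-3-32B](...)")
```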
### Extract and push directly

```bash
# Extract from README and push directly
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "your-username/your-model" \
--task-type "text-generation"# Step 1: ALWAYS check for existing PRs first
uv run scripts/evaluation_manager.py get-prs \
--repo-id "other-username/their-model"
# Step 2: If NO open PRs exist, proceed with creating one
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "other-username/their-model" \
--create-pr
# If open PRs DO exist:
# - Warn the user about existing PRs
# - Show them the PR URLs
# - Do NOT create a new PR unless user explicitly confirms
```
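The same check can be done directly with `huggingface_hub` (a sketch of the etiquette above, not what the script runs internally; `open_prs` is an illustration-only helper):

```python
from huggingface_hub import get_repo_discussions


def open_prs(repo_id: str) -> list:
    """Return the open pull requests on a Hub repo."""
    return [
        d for d in get_repo_discussions(repo_id=repo_id)
        if d.is_pull_request and d.status == "open"
    ]


prs = open_prs("other-username/their-model")
if prs:
    # Warn instead of piling on duplicate PRs
    for pr in prs:
        print(f"Existing PR #{pr.num}: {pr.title} -> {pr.url}")
else:
    print("No open PRs; safe to create one.")
```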
### Importing Artificial Analysis scores via PR

```bash
# Step 1: Check for existing PRs
uv run scripts/evaluation_manager.py get-prs \
--repo-id "anthropic/claude-sonnet-4"
# Step 2: If no PRs, import from Artificial Analysis
AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa \
--creator-slug "anthropic" \
--model-name "claude-sonnet-4" \
--repo-id "anthropic/claude-sonnet-4" \
--create-pr
```

## Troubleshooting

- Wrong model matched: pass `--model-name-override` with the exact header text, e.g. `--model-name-override "**Olmo 3-32B**"`.
- vLLM out of memory: lower `--gpu-memory-utilization` or shard across GPUs with `--tensor-parallel-size`.
- Model unsupported by vLLM: fall back to `--backend hf` (inspect-ai) or `--backend accelerate` (lighteval).
- Custom architectures: add `--trust-remote-code`.
- Chat/instruction-tuned models: add `--use-chat-template`.

## Programmatic usage

```python
import subprocess

def update_model_evaluations(repo_id: str) -> None:
    """Update a model card with evaluations extracted from its README."""
    # Run the manager via uv so its dependencies are resolved
    result = subprocess.run(
        [
            "uv", "run", "scripts/evaluation_manager.py",
            "extract-readme",
            "--repo-id", repo_id,
            "--create-pr",
        ],
        capture_output=True,
        text=True,
    )
    if result.returncode == 0:
        print(f"Successfully updated {repo_id}")
    else:
        print(f"Error: {result.stderr}")
```