vision-bench

Vision Bench — LLM Image Evaluation

Compare images by scoring them with one or more vision LLM judges against structured rubric criteria.

Quick Start

Install dependencies

```bash
pip install pyyaml openai anthropic mistralai
```

Score a single image

```bash
python bench.py image.png --criteria photorealism --judge gemini-2.5-flash
```

Compare two AI-generated images

```bash
python bench.py img_a.png img_b.png \
  --criteria text_to_image \
  --prompt "a fox in a snowy forest" \
  --judge gpt-4o
```

Multi-judge consensus

```bash
python bench.py img.png \
  --criteria portrait \
  --judges gpt-4o gemini-2.5-flash claude-opus-4-5-20251022
```
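
The README does not say how scores from several judges are combined. One plausible reading of "consensus" is a per-criterion mean plus a spread figure that flags disagreement between judges. The sketch below illustrates only that idea: the function name `consensus`, the score layout, and the example numbers are assumptions, not taken from `judge.py`.

```python
from statistics import mean, pstdev


def consensus(scores_by_judge):
    """Combine per-criterion scores from several judges.

    scores_by_judge maps judge name -> {criterion: score}.
    Returns {criterion: {"mean": ..., "spread": ...}}, computed over
    the judges that scored that criterion.
    """
    by_criterion = {}
    for judge_scores in scores_by_judge.values():
        for criterion, score in judge_scores.items():
            by_criterion.setdefault(criterion, []).append(score)
    return {
        criterion: {"mean": mean(values), "spread": pstdev(values)}
        for criterion, values in by_criterion.items()
    }


# Hypothetical scores, for illustration only.
example = {
    "gpt-4o": {"likeness": 8, "lighting": 6},
    "gemini-2.5-flash": {"likeness": 6, "lighting": 6},
}
result = consensus(example)
```

A large spread on one criterion suggests the judges disagree and the image may deserve a manual look.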

OpenRouter models (any vision-capable model)

```bash
python bench.py img_a.png img_b.png \
  --criteria artistic_style \
  --judges "openrouter/meta-llama/llama-4-maverick" "openrouter/mistralai/pixtral-large-2411"
```

List all presets

```bash
python bench.py --list-presets
```

Save report to file

```bash
python bench.py img.png --criteria chart_analysis --save report.md
```

Presets

| Preset | Use Case |
| --- | --- |
| `text_to_image` | Compare AI image generators (Midjourney, DALL-E, Flux) |
| `photorealism` | How convincingly an image passes for a photograph |
| `artistic_style` | Style consistency, composition, color harmony |
| `portrait` | AI-generated portrait quality and realism |
| `product_photo` | E-commerce product image quality |
| `document_ocr` | Document text extraction and layout understanding |
| `chart_analysis` | Chart and data visualization comprehension |
| `invoice` | Financial document field extraction accuracy |
| `ui_screenshot` | App/web screenshot understanding |
| `scientific` | Scientific/medical image accuracy |
| `alt_text` | Accessibility image description quality |

Custom criteria: pass any `.yaml` file as `--criteria path/to/my.yaml`.

Judge Providers

| Prefix | Provider | Example |
| --- | --- | --- |
| `gpt-`, `o1`, `o3`, `o4` | OpenAI | `gpt-4o` |
| `claude-` | Anthropic | `claude-sonnet-4-5-20251022` |
| `gemini-` | Google Gemini | `gemini-2.5-flash` |
| `pixtral-`, `mistral-`, `ministral-` | Mistral | `pixtral-12b-2409` |
| `openrouter/` | OpenRouter (any model) | `openrouter/meta-llama/llama-4-maverick` |
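
The table above implies that a judge name is routed to a provider by its prefix. A minimal sketch of that lookup follows, assuming a simple first-match scan. The prefixes come from the table, while the function name `resolve_provider` and the lowercase provider labels are hypothetical and may not match what `judge.py` does.

```python
# Prefixes copied from the table; the provider labels are illustrative.
PREFIXES = [
    ("openrouter/", "openrouter"),
    ("gpt-", "openai"),
    ("o1", "openai"),
    ("o3", "openai"),
    ("o4", "openai"),
    ("claude-", "anthropic"),
    ("gemini-", "gemini"),
    ("pixtral-", "mistral"),
    ("mistral-", "mistral"),
    ("ministral-", "mistral"),
]


def resolve_provider(model):
    """Return the provider label for a judge name, based on its prefix."""
    for prefix, provider in PREFIXES:
        if model.startswith(prefix):
            return provider
    raise ValueError(f"no provider matches judge name: {model!r}")
```

Raising on an unknown prefix, rather than guessing a default provider, keeps a mistyped judge name from silently calling the wrong API.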

API Keys

Keys are loaded from `secrets.enc.yaml` (SOPS + age encrypted), with a fallback to environment variables.

Supported keys: `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GEMINI_API_KEY`, `OPENROUTER_API_KEY`

To encrypt your own keys:

```bash
sops --config .sops.yaml --encrypt --input-type yaml --output-type yaml secrets.yaml > secrets.enc.yaml
```
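
The lookup order described above (encrypted file first, environment second) can be sketched as follows. This assumes the decrypted file has already been parsed into a plain dict. The function name `get_api_key` and its parameters are hypothetical, and `vault.py` may handle this differently.

```python
import os


def get_api_key(name, secrets, env=None):
    """Look up a key in the decrypted secrets first, then in the environment.

    secrets is the parsed content of the decrypted secrets file (a dict);
    env defaults to os.environ and is a parameter only to ease testing.
    """
    env = os.environ if env is None else env
    value = secrets.get(name) or env.get(name)
    if not value:
        raise KeyError(f"{name} not found in secrets or environment")
    return value
```

For example, `get_api_key("GEMINI_API_KEY", {})` falls through to the `GEMINI_API_KEY` environment variable when the secrets file has no such entry.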

Output Formats

`--output markdown` (default) · `--output json` · `--output table`
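
To make the three options concrete, here is a rough sketch of one score set rendered three ways. The layouts are assumptions for illustration, and the function name `render` is hypothetical. The real output of `report.py` is not shown in this README and may look different.

```python
import json


def render(scores, fmt="markdown"):
    """Render {criterion: score} in one of three illustrative layouts."""
    if fmt == "json":
        return json.dumps(scores, indent=2)
    if fmt == "markdown":
        rows = ["| Criterion | Score |", "| --- | --- |"]
        rows += [f"| {name} | {score} |" for name, score in scores.items()]
        return "\n".join(rows)
    if fmt == "table":
        # Pad criterion names so the score column lines up in a terminal.
        width = max((len(name) for name in scores), default=0)
        return "\n".join(
            f"{name.ljust(width)}  {score}" for name, score in scores.items()
        )
    raise ValueError(f"unknown output format: {fmt!r}")
```

JSON suits downstream scripts, markdown suits saved reports, and the plain table suits quick reading in a terminal.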

Files

  • `bench.py`: CLI entry point
  • `judge.py`: Multi-provider LLM judge logic
  • `report.py`: Report generation
  • `vault.py`: SOPS secrets decryption
  • `criteria/`: 11 YAML preset files
  • `.sops.yaml`: Age key config for encryption
  • `secrets.enc.yaml`: Encrypted API keys