vision-bench
Vision Bench: LLM Image Evaluation
Compare images by scoring them with one or more vision LLM judges against structured rubric criteria.
Quick Start
```bash
# Install dependencies
pip install pyyaml openai anthropic mistralai

# Score a single image
python bench.py image.png --criteria photorealism --judge gemini-2.5-flash

# Compare two AI-generated images
python bench.py img_a.png img_b.png \
  --criteria text_to_image \
  --prompt "a fox in a snowy forest" \
  --judge gpt-4o

# Multi-judge consensus
python bench.py img.png \
  --criteria portrait \
  --judges gpt-4o gemini-2.5-flash claude-opus-4-5-20251022

# OpenRouter models (any vision-capable model)
python bench.py img_a.png img_b.png \
  --criteria artistic_style \
  --judges "openrouter/meta-llama/llama-4-maverick" "openrouter/mistralai/pixtral-large-2411"

# List all presets
python bench.py --list-presets

# Save report to file
python bench.py img.png --criteria chart_analysis --save report.md
```
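A multi-judge run yields one set of scores per judge. This README does not say how `bench.py` combines them, so the following is only a sketch under the assumption of a plain per-criterion mean. The function name `consensus` and the criterion names `likeness` and `lighting` are hypothetical.

```python
from statistics import mean


def consensus(scores_by_judge):
    """Average each criterion across judges and report how far they disagree.

    scores_by_judge maps a judge name to a {criterion: score} dict.
    This aggregation rule is an assumption, not the tool's documented behavior.
    """
    criteria = sorted({c for scores in scores_by_judge.values() for c in scores})
    summary = {}
    for criterion in criteria:
        # A judge that skipped a criterion is simply left out of that average.
        values = [s[criterion] for s in scores_by_judge.values() if criterion in s]
        summary[criterion] = {
            "mean": mean(values),
            "spread": max(values) - min(values),
            "judges": len(values),
        }
    return summary


# Hypothetical scores for one image from the three judges used in the example above.
example = {
    "gpt-4o": {"likeness": 8, "lighting": 7},
    "gemini-2.5-flash": {"likeness": 6, "lighting": 7},
    "claude-opus-4-5-20251022": {"likeness": 7, "lighting": 9},
}
summary = consensus(example)
for criterion, stats in summary.items():
    print(criterion, stats)
```

A large `spread` marks a criterion on which the judges disagree, which is usually worth inspecting by hand before trusting the mean.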
Presets
| Preset | Use Case |
|---|---|
| `text_to_image` | Compare AI image generators (Midjourney, DALL-E, Flux) |
| `photorealism` | How convincingly an image looks like a photo |
| `artistic_style` | Style consistency, composition, color harmony |
| `portrait` | AI-generated portrait quality and realism |
| | E-commerce product image quality |
| | Document text extraction and layout understanding |
| `chart_analysis` | Chart and data visualization comprehension |
| | Financial document field extraction accuracy |
| | App/web screenshot understanding |
| | Scientific/medical image accuracy |
| | Accessibility image description quality |
Custom criteria: pass any `.yaml` file as `--criteria path/to/my.yaml`.
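The schema of a criteria file is not shown in this README, so the rubric below is purely illustrative. It assumes a file with a `name` and a list of `criteria`, each carrying an `id`, a `description`, and a `weight`; all of these field names, and the helpers `validate_rubric` and `weighted_score`, are hypothetical. Check one of the files under `criteria/` for the real layout before writing your own.

```python
# Hypothetical rubric, written as the dict that yaml.safe_load() would return
# for a file such as:
#
#   name: logo_quality
#   criteria:
#     - id: legibility
#       description: Text in the logo is readable at small sizes
#       weight: 2
#     - id: balance
#       description: Visual weight is evenly distributed
#       weight: 1
rubric = {
    "name": "logo_quality",
    "criteria": [
        {"id": "legibility",
         "description": "Text in the logo is readable at small sizes",
         "weight": 2},
        {"id": "balance",
         "description": "Visual weight is evenly distributed",
         "weight": 1},
    ],
}


def validate_rubric(rubric):
    """Return a list of problems found in an (assumed-schema) rubric dict."""
    problems = []
    if not rubric.get("name"):
        problems.append("missing name")
    criteria = rubric.get("criteria") or []
    if not criteria:
        problems.append("no criteria defined")
    for index, criterion in enumerate(criteria):
        for field in ("id", "description", "weight"):
            if field not in criterion:
                problems.append(f"criterion {index}: missing {field}")
    return problems


def weighted_score(rubric, scores):
    """Combine per-criterion scores into one number using the rubric weights."""
    total_weight = sum(c["weight"] for c in rubric["criteria"])
    weighted = sum(c["weight"] * scores[c["id"]] for c in rubric["criteria"])
    return weighted / total_weight


print(validate_rubric(rubric))
print(weighted_score(rubric, {"legibility": 8, "balance": 5}))
```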
Judge Providers
| Prefix | Provider | Example |
|---|---|---|
| | OpenAI | |
| | Anthropic | |
| | Google Gemini | |
| | Mistral | |
| | OpenRouter (any model) | |
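Judges are selected by the prefix of the model name. A minimal sketch of that kind of routing follows. The prefixes are assumptions inferred from the Quick Start examples (`gpt-4o`, `gemini-2.5-flash`, `claude-opus-4-5-20251022`, `openrouter/...`); the Mistral prefixes in particular are guesses, and `resolve_provider` is a hypothetical name rather than a function known to exist in `judge.py`.

```python
# Assumed prefix table; the real mapping lives in judge.py and may differ.
PREFIX_TO_PROVIDER = (
    ("openrouter/", "openrouter"),
    ("gpt-", "openai"),
    ("claude-", "anthropic"),
    ("gemini-", "gemini"),
    ("mistral-", "mistral"),   # guess
    ("pixtral-", "mistral"),   # guess
)


def resolve_provider(model):
    """Map a judge model name to a provider label by its prefix."""
    for prefix, provider in PREFIX_TO_PROVIDER:
        if model.startswith(prefix):
            return provider
    raise ValueError(f"no provider registered for model {model!r}")


for name in ("gpt-4o", "gemini-2.5-flash", "claude-opus-4-5-20251022",
             "openrouter/meta-llama/llama-4-maverick"):
    print(name, "->", resolve_provider(name))
```

Raising on an unknown prefix, instead of silently defaulting to one provider, keeps a typo in `--judge` from sending requests to the wrong API.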
API Keys
Keys are loaded from `secrets.enc.yaml` (SOPS + age encrypted), with fallback to environment variables.

Supported keys: `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GEMINI_API_KEY`, `OPENROUTER_API_KEY`

To encrypt your own keys:

```bash
sops --config .sops.yaml --encrypt --input-type yaml --output-type yaml secrets.yaml > secrets.enc.yaml
```
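The lookup order (encrypted file first, environment second) can be sketched as below. This is not the code in `vault.py`; it assumes the decrypted file is a flat `KEY: value` mapping, shells out to `sops --decrypt`, and uses the hypothetical helper names `decrypt_with_sops`, `parse_flat_yaml`, and `load_keys`.

```python
import os
import shutil
import subprocess

SUPPORTED_KEYS = (
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "GEMINI_API_KEY",
    "OPENROUTER_API_KEY",
)


def decrypt_with_sops(path):
    """Return the decrypted file text, or None if sops or the file is missing."""
    if shutil.which("sops") is None or not os.path.exists(path):
        return None
    proc = subprocess.run(["sops", "--decrypt", path],
                          capture_output=True, text=True)
    return proc.stdout if proc.returncode == 0 else None


def parse_flat_yaml(text):
    """Parse 'KEY: value' lines; assumes a flat secrets file with no nesting."""
    secrets = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        name, _, value = line.partition(":")
        secrets[name.strip()] = value.strip().strip("'\"")
    return secrets


def load_keys(path="secrets.enc.yaml", decrypt=decrypt_with_sops, env=None):
    """Prefer values from the encrypted file, then fall back to the environment."""
    env = os.environ if env is None else env
    decrypted = decrypt(path)
    secrets = parse_flat_yaml(decrypted) if decrypted else {}
    keys = {}
    for name in SUPPORTED_KEYS:
        value = secrets.get(name) or env.get(name)
        if value:
            keys[name] = value
    return keys


# Print only the key names that were found, never the secret values.
print(sorted(load_keys()))
```

Passing `decrypt` and `env` as parameters makes the fallback order easy to exercise without a real `secrets.enc.yaml` or a `sops` install.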
Output Formats
`--output markdown`, `--output json`, `--output table`

Files
- `bench.py`: CLI entry point
- `judge.py`: Multi-provider LLM judge logic
- `report.py`: Report generation
- `vault.py`: SOPS secrets decryption
- `criteria/`: 11 YAML preset files
- `.sops.yaml`: Age key config for encryption
- `secrets.enc.yaml`: Encrypted API keys