
# Promptfoo Evaluation

## Overview

This skill provides guidance for configuring and running LLM evaluations using Promptfoo, an open-source CLI tool for testing and comparing LLM outputs.

## Quick Start

```bash
# Initialize a new evaluation project
npx promptfoo@latest init

# Run evaluation
npx promptfoo@latest eval

# View results in browser
npx promptfoo@latest view
```

## Configuration Structure

A typical Promptfoo project structure:

```
project/
├── promptfooconfig.yaml    # Main configuration
├── prompts/
│   ├── system.md           # System prompt
│   └── chat.json           # Chat format prompt
├── tests/
│   └── cases.yaml          # Test cases
└── scripts/
    └── metrics.py          # Custom Python assertions
```

## Core Configuration (promptfooconfig.yaml)

```yaml
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: "My LLM Evaluation"

# Prompts to test
prompts:
  - file://prompts/system.md
  - file://prompts/chat.json

# Models to compare
providers:
  - id: anthropic:messages:claude-sonnet-4-5-20250929
    label: Claude-4.5-Sonnet
  - id: openai:gpt-4.1
    label: GPT-4.1

# Test cases
tests: file://tests/cases.yaml

# Default assertions for all tests
defaultTest:
  assert:
    - type: python
      value: file://scripts/metrics.py:custom_assert
    - type: llm-rubric
      value: |
        Evaluate the response quality on a 0-1 scale.
      threshold: 0.7

# Output path
outputPath: results/eval-results.json
```

## Prompt Formats

### Text Prompt (system.md)

```markdown
You are a helpful assistant.

Task: {{task}}
Context: {{context}}
```

### Chat Format (chat.json)

```json
[
  {"role": "system", "content": "{{system_prompt}}"},
  {"role": "user", "content": "{{user_input}}"}
]
```

### Few-Shot Pattern

Embed examples directly in the prompt, or use chat format with assistant messages:

```json
[
  {"role": "system", "content": "{{system_prompt}}"},
  {"role": "user", "content": "Example input: {{example_input}}"},
  {"role": "assistant", "content": "{{example_output}}"},
  {"role": "user", "content": "Now process: {{actual_input}}"}
]
```

## Test Cases (tests/cases.yaml)

```yaml
- description: "Test case 1"
  vars:
    system_prompt: file://prompts/system.md
    user_input: "Hello world"
    # Load content from files
    context: file://data/context.txt
  assert:
    - type: contains
      value: "expected text"
    - type: python
      value: file://scripts/metrics.py:custom_check
      threshold: 0.8
```

## Python Custom Assertions

Create a Python file for custom assertions (e.g., `scripts/metrics.py`):

```python
def get_assert(output: str, context: dict) -> dict:
    """Default assertion function."""
    vars_dict = context.get('vars', {})

    # Access test variables
    expected = vars_dict.get('expected', '')

    # Return result
    return {
        "pass": expected in output,
        "score": 0.8,
        "reason": "Contains expected content",
        "named_scores": {"relevance": 0.9}
    }

def custom_check(output: str, context: dict) -> dict:
    """Custom named assertion."""
    word_count = len(output.split())
    passed = 100 <= word_count <= 500

    return {
        "pass": passed,
        "score": min(1.0, word_count / 300),
        "reason": f"Word count: {word_count}"
    }
```

Key points:
- The default function name is `get_assert`
- Specify a function with `file://path.py:function_name`
- Return `bool`, `float` (score), or `dict` with pass/score/reason
- Access test variables via `context['vars']`
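Besides the dict form shown above, an assertion function can also return a bare bool or float. A hypothetical sketch of both simpler forms (function names and the length threshold are illustrative, not part of any Promptfoo API):

```python
def has_greeting(output: str, context: dict) -> bool:
    """Bool form: True counts as pass, False as fail."""
    return "hello" in output.lower()

def brevity_score(output: str, context: dict) -> float:
    """Float form: treated as a score in [0, 1]."""
    return max(0.0, 1.0 - len(output) / 1000)

print(has_greeting("Hello there!", {}))   # True
print(brevity_score("x" * 200, {}))       # 0.8
```

These would be referenced the same way as the dict-returning functions, e.g. `file://scripts/metrics.py:has_greeting`.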

## LLM-as-Judge (llm-rubric)

```yaml
assert:
  - type: llm-rubric
    value: |
      Evaluate the response based on:
      1. Accuracy of information
      2. Clarity of explanation
      3. Completeness

      Score 0.0-1.0 where 0.7+ is passing.
    threshold: 0.7
    provider: openai:gpt-4.1  # Optional: override grader model
```

Best practices:
- Provide clear scoring criteria
- Use `threshold` to set the minimum passing score
- The default grader uses available API keys (OpenAI → Anthropic → Google)

## Common Assertion Types

| Type | Usage | Example |
|------|-------|---------|
| `contains` | Check substring | `value: "hello"` |
| `icontains` | Case-insensitive substring | `value: "HELLO"` |
| `equals` | Exact match | `value: "42"` |
| `regex` | Pattern match | `value: "\\d{4}"` |
| `python` | Custom logic | `value: file://script.py` |
| `llm-rubric` | LLM grading | `value: "Is professional"` |
| `latency` | Response time | `threshold: 1000` |
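Several of these types can be combined in one `assert` list, and every entry must pass for the test to pass. A sketch with illustrative values:

```yaml
assert:
  - type: icontains
    value: "thank you"    # case-insensitive substring check
  - type: regex
    value: "\\d{4}"       # require a 4-digit number, e.g. a year
  - type: latency
    threshold: 1000       # fail if the response takes over 1000 ms
```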

## File References

All paths are relative to the config file location:

```yaml
# Load file content as a variable
vars:
  content: file://data/input.txt

# Load prompt from file
prompts:
  - file://prompts/main.md

# Load test cases from file
tests: file://tests/cases.yaml

# Load a Python assertion
assert:
  - type: python
    value: file://scripts/check.py:validate
```

## Running Evaluations

```bash
# Basic run
npx promptfoo@latest eval

# With a specific config
npx promptfoo@latest eval --config path/to/config.yaml

# Output to file
npx promptfoo@latest eval --output results.json

# Filter tests
npx promptfoo@latest eval --filter-metadata category=math

# View results
npx promptfoo@latest view
```

## Troubleshooting

**Python not found:**

```bash
export PROMPTFOO_PYTHON=python3
```

**Large outputs truncated:** Outputs over 30,000 characters are truncated; use `head_limit` in assertions.

**File not found errors:** Ensure paths are relative to the `promptfooconfig.yaml` location.

## Echo Provider (Preview Mode)

Use the echo provider to preview rendered prompts without making API calls:

```yaml
# promptfooconfig-preview.yaml
providers:
  - echo  # Returns prompt as output, no API calls

tests:
  - vars:
      input: "test content"
```

**Use cases:**
- Preview prompt rendering before expensive API calls
- Verify few-shot examples are loaded correctly
- Debug variable substitution issues
- Validate prompt structure

```bash
# Run preview mode
npx promptfoo@latest eval --config promptfooconfig-preview.yaml
```

**Cost:** Free; no API tokens consumed.

## Advanced Few-Shot Implementation

### Multi-turn Conversation Pattern

For complex few-shot learning with full examples:

```json
[
  {"role": "system", "content": "{{system_prompt}}"},

  {"role": "user", "content": "Task: {{example_input_1}}"},
  {"role": "assistant", "content": "{{example_output_1}}"},

  {"role": "user", "content": "Task: {{example_input_2}}"},
  {"role": "assistant", "content": "{{example_output_2}}"},

  {"role": "user", "content": "Task: {{actual_input}}"}
]
```

The first two user/assistant pairs are few-shot examples (the second is optional) and the final user message is the actual test; strict JSON does not allow comments, so the pairs are distinguished by variable names.

Test case configuration:

```yaml
tests:
  - vars:
      system_prompt: file://prompts/system.md
      # Few-shot examples
      example_input_1: file://data/examples/input1.txt
      example_output_1: file://data/examples/output1.txt
      example_input_2: file://data/examples/input2.txt
      example_output_2: file://data/examples/output2.txt
      # Actual test
      actual_input: file://data/test1.txt
```

Best practices:
- Use 1-3 few-shot examples (more may dilute effectiveness)
- Ensure examples match the task format exactly
- Load examples from files for better maintainability
- Use the echo provider first to verify structure

## Long Text Handling

For Chinese/long-form content evaluations (10k+ characters):

**Configuration:**

```yaml
providers:
  - id: anthropic:messages:claude-sonnet-4-5-20250929
    config:
      max_tokens: 8192  # Increase for long outputs

defaultTest:
  assert:
    - type: python
      value: file://scripts/metrics.py:check_length
```

**Python assertion for text metrics:**

```python
import re

def strip_tags(text: str) -> str:
    """Remove HTML tags, leaving plain text."""
    return re.sub(r'<[^>]+>', '', text)

def check_length(output: str, context: dict) -> dict:
    """Check output length constraints."""
    raw_input = context['vars'].get('raw_input', '')

    input_len = len(strip_tags(raw_input))
    output_len = len(strip_tags(output))

    reduction_ratio = 1 - (output_len / input_len) if input_len > 0 else 0

    return {
        "pass": 0.7 <= reduction_ratio <= 0.9,
        "score": reduction_ratio,
        "reason": f"Reduction: {reduction_ratio:.1%} (target: 70-90%)",
        "named_scores": {
            "input_length": input_len,
            "output_length": output_len,
            "reduction_ratio": reduction_ratio
        }
    }
```
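As a sanity check of the reduction metric, the arithmetic can be run by hand: a 1000-character transcript condensed to 200 characters gives 1 - 200/1000 = 0.8, inside the 70-90% target band. A self-contained rerun with synthetic inputs:

```python
import re

def strip_tags(text: str) -> str:
    """Remove HTML tags, leaving plain text (same helper as above)."""
    return re.sub(r'<[^>]+>', '', text)

raw = "<p>" + "x" * 1000 + "</p>"   # 1000 visible characters
summary = "y" * 200                 # 200-character condensation
ratio = 1 - len(strip_tags(summary)) / len(strip_tags(raw))
print(ratio)  # 0.8
```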

## Real-World Example

**Project:** Chinese short-video content curation from long transcripts

**Structure:**

```
tiaogaoren/
├── promptfooconfig.yaml          # Production config
├── promptfooconfig-preview.yaml  # Preview config (echo provider)
├── prompts/
│   ├── tiaogaoren-prompt.json   # Chat format with few-shot
│   └── v4/system-v4.md          # System prompt
├── tests/cases.yaml              # 3 test samples
├── scripts/metrics.py            # Custom metrics (reduction ratio, etc.)
├── data/                         # 5 samples (2 few-shot, 3 eval)
└── results/
```

See /Users/tiansheng/Workspace/prompts/tiaogaoren/ for the full implementation.

## Resources

For detailed API reference and advanced patterns, see `references/promptfoo_api.md`.