
# Promptfoo Evaluation

## Overview

This skill provides guidance for configuring and running LLM evaluations using Promptfoo, an open-source CLI tool for testing and comparing LLM outputs.

## Quick Start

```bash
# Initialize a new evaluation project
npx promptfoo@latest init

# Run evaluation
npx promptfoo@latest eval

# View results in browser
npx promptfoo@latest view
```

## Configuration Structure

A typical Promptfoo project structure:

```
project/
├── promptfooconfig.yaml    # Main configuration
├── prompts/
│   ├── system.md           # System prompt
│   └── chat.json           # Chat format prompt
├── tests/
│   └── cases.yaml          # Test cases
└── scripts/
    └── metrics.py          # Custom Python assertions
```

## Core Configuration (promptfooconfig.yaml)

```yaml
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: "My LLM Evaluation"

# Prompts to test
prompts:
  - file://prompts/system.md
  - file://prompts/chat.json

# Models to compare
providers:
  - id: anthropic:messages:claude-sonnet-4-5-20250929
    label: Claude-4.5-Sonnet
  - id: openai:gpt-4.1
    label: GPT-4.1

# Test cases
tests: file://tests/cases.yaml

# Default assertions for all tests
defaultTest:
  assert:
    - type: python
      value: file://scripts/metrics.py:custom_assert
    - type: llm-rubric
      value: |
        Evaluate the response quality on a 0-1 scale.
      threshold: 0.7

# Output path
outputPath: results/eval-results.json
```

## Prompt Formats

### Text Prompt (system.md)

```markdown
You are a helpful assistant.

Task: {{task}}
Context: {{context}}
```

### Chat Format (chat.json)

```json
[
  {"role": "system", "content": "{{system_prompt}}"},
  {"role": "user", "content": "{{user_input}}"}
]
```

### Few-Shot Pattern

Embed examples directly in the prompt, or use chat format with assistant messages:

```json
[
  {"role": "system", "content": "{{system_prompt}}"},
  {"role": "user", "content": "Example input: {{example_input}}"},
  {"role": "assistant", "content": "{{example_output}}"},
  {"role": "user", "content": "Now process: {{actual_input}}"}
]
```

## Test Cases (tests/cases.yaml)

```yaml
- description: "Test case 1"
  vars:
    system_prompt: file://prompts/system.md
    user_input: "Hello world"
    # Load content from files
    context: file://data/context.txt
  assert:
    - type: contains
      value: "expected text"
    - type: python
      value: file://scripts/metrics.py:custom_check
      threshold: 0.8
```

## Python Custom Assertions

Create a Python file for custom assertions (e.g., `scripts/metrics.py`):

```python
def get_assert(output: str, context: dict) -> dict:
    """Default assertion function."""
    vars_dict = context.get('vars', {})

    # Access test variables
    expected = vars_dict.get('expected', '')

    # Return result
    return {
        "pass": expected in output,
        "score": 0.8,
        "reason": "Contains expected content",
        "named_scores": {"relevance": 0.9}
    }

def custom_check(output: str, context: dict) -> dict:
    """Custom named assertion."""
    word_count = len(output.split())
    passed = 100 <= word_count <= 500

    return {
        "pass": passed,
        "score": min(1.0, word_count / 300),
        "reason": f"Word count: {word_count}"
    }
```

Key points:
- The default function name is `get_assert`
- Specify a function with `file://path.py:function_name`
- Return `bool`, `float` (score), or `dict` with pass/score/reason
- Access test variables via `context['vars']`
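Besides the dict form shown above, an assertion function can also return a bare bool or float. A hypothetical sketch of both simpler forms (function names and the length threshold are illustrative, not part of any Promptfoo API):

```python
def has_greeting(output: str, context: dict) -> bool:
    """Bool form: True counts as pass, False as fail."""
    return "hello" in output.lower()

def brevity_score(output: str, context: dict) -> float:
    """Float form: treated as a score in [0, 1]."""
    return max(0.0, 1.0 - len(output) / 1000)

print(has_greeting("Hello there!", {}))   # True
print(brevity_score("x" * 200, {}))       # 0.8
```

These would be referenced the same way as the dict-returning functions, e.g. `file://scripts/metrics.py:has_greeting`.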

## LLM-as-Judge (llm-rubric)

```yaml
assert:
  - type: llm-rubric
    value: |
      Evaluate the response based on:
      1. Accuracy of information
      2. Clarity of explanation
      3. Completeness

      Score 0.0-1.0 where 0.7+ is passing.
    threshold: 0.7
    provider: openai:gpt-4.1  # Optional: override grader model
```

Best practices:
- Provide clear scoring criteria
- Use `threshold` to set the minimum passing score
- The default grader uses available API keys (OpenAI → Anthropic → Google)

## Common Assertion Types

| Type | Usage | Example |
|------|-------|---------|
| `contains` | Check substring | `value: "hello"` |
| `icontains` | Case-insensitive substring | `value: "HELLO"` |
| `equals` | Exact match | `value: "42"` |
| `regex` | Pattern match | `value: "\\d{4}"` |
| `python` | Custom logic | `value: file://script.py` |
| `llm-rubric` | LLM grading | `value: "Is professional"` |
| `latency` | Response time | `threshold: 1000` |
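Several of these types can be combined in one `assert` list, and every entry must pass for the test to pass. A sketch with illustrative values:

```yaml
assert:
  - type: icontains
    value: "thank you"    # case-insensitive substring check
  - type: regex
    value: "\\d{4}"       # require a 4-digit number, e.g. a year
  - type: latency
    threshold: 1000       # fail if the response takes over 1000 ms
```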

## File References

All paths are relative to the config file location:

```yaml
# Load file content as a variable
vars:
  content: file://data/input.txt

# Load prompt from file
prompts:
  - file://prompts/main.md

# Load test cases from file
tests: file://tests/cases.yaml

# Load a Python assertion
assert:
  - type: python
    value: file://scripts/check.py:validate
```

## Running Evaluations

```bash
# Basic run
npx promptfoo@latest eval

# With a specific config
npx promptfoo@latest eval --config path/to/config.yaml

# Output to file
npx promptfoo@latest eval --output results.json

# Filter tests
npx promptfoo@latest eval --filter-metadata category=math

# View results
npx promptfoo@latest view
```

## Troubleshooting

**Python not found:**

```bash
export PROMPTFOO_PYTHON=python3
```

**Large outputs truncated:** Outputs over 30,000 characters are truncated; use `head_limit` in assertions.

**File not found errors:** Ensure paths are relative to the `promptfooconfig.yaml` location.

## Echo Provider (Preview Mode)

Use the echo provider to preview rendered prompts without making API calls:

```yaml
# promptfooconfig-preview.yaml
providers:
  - echo  # Returns prompt as output, no API calls

tests:
  - vars:
      input: "test content"
```

**Use cases:**
- Preview prompt rendering before expensive API calls
- Verify few-shot examples are loaded correctly
- Debug variable substitution issues
- Validate prompt structure

```bash
# Run preview mode
npx promptfoo@latest eval --config promptfooconfig-preview.yaml
```

**Cost:** Free; no API tokens consumed.

## Advanced Few-Shot Implementation

### Multi-turn Conversation Pattern

For complex few-shot learning with full examples:

```json
[
  {"role": "system", "content": "{{system_prompt}}"},

  {"role": "user", "content": "Task: {{example_input_1}}"},
  {"role": "assistant", "content": "{{example_output_1}}"},

  {"role": "user", "content": "Task: {{example_input_2}}"},
  {"role": "assistant", "content": "{{example_output_2}}"},

  {"role": "user", "content": "Task: {{actual_input}}"}
]
```

The first two user/assistant pairs are few-shot examples (the second is optional) and the final user message is the actual test; strict JSON does not allow comments, so the pairs are distinguished by variable names.

Test case configuration:

```yaml
tests:
  - vars:
      system_prompt: file://prompts/system.md
      # Few-shot examples
      example_input_1: file://data/examples/input1.txt
      example_output_1: file://data/examples/output1.txt
      example_input_2: file://data/examples/input2.txt
      example_output_2: file://data/examples/output2.txt
      # Actual test
      actual_input: file://data/test1.txt
```

Best practices:
- Use 1-3 few-shot examples (more may dilute effectiveness)
- Ensure examples match the task format exactly
- Load examples from files for better maintainability
- Use the echo provider first to verify structure

## Long Text Handling

For Chinese/long-form content evaluations (10k+ characters):

**Configuration:**

```yaml
providers:
  - id: anthropic:messages:claude-sonnet-4-5-20250929
    config:
      max_tokens: 8192  # Increase for long outputs

defaultTest:
  assert:
    - type: python
      value: file://scripts/metrics.py:check_length
```

**Python assertion for text metrics:**

```python
import re

def strip_tags(text: str) -> str:
    """Remove HTML tags, leaving plain text."""
    return re.sub(r'<[^>]+>', '', text)

def check_length(output: str, context: dict) -> dict:
    """Check output length constraints."""
    raw_input = context['vars'].get('raw_input', '')

    input_len = len(strip_tags(raw_input))
    output_len = len(strip_tags(output))

    reduction_ratio = 1 - (output_len / input_len) if input_len > 0 else 0

    return {
        "pass": 0.7 <= reduction_ratio <= 0.9,
        "score": reduction_ratio,
        "reason": f"Reduction: {reduction_ratio:.1%} (target: 70-90%)",
        "named_scores": {
            "input_length": input_len,
            "output_length": output_len,
            "reduction_ratio": reduction_ratio
        }
    }
```
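As a sanity check of the reduction metric, the arithmetic can be run by hand: a 1000-character transcript condensed to 200 characters gives 1 - 200/1000 = 0.8, inside the 70-90% target band. A self-contained rerun with synthetic inputs:

```python
import re

def strip_tags(text: str) -> str:
    """Remove HTML tags, leaving plain text (same helper as above)."""
    return re.sub(r'<[^>]+>', '', text)

raw = "<p>" + "x" * 1000 + "</p>"   # 1000 visible characters
summary = "y" * 200                 # 200-character condensation
ratio = 1 - len(strip_tags(summary)) / len(strip_tags(raw))
print(ratio)  # 0.8
```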

## Real-World Example

**Project:** Chinese short-video content curation from long transcripts

**Structure:**

```
tiaogaoren/
├── promptfooconfig.yaml          # Production config
├── promptfooconfig-preview.yaml  # Preview config (echo provider)
├── prompts/
│   ├── tiaogaoren-prompt.json   # Chat format with few-shot
│   └── v4/system-v4.md          # System prompt
├── tests/cases.yaml              # 3 test samples
├── scripts/metrics.py            # Custom metrics (reduction ratio, etc.)
├── data/                         # 5 samples (2 few-shot, 3 eval)
└── results/
```

See /Users/tiansheng/Workspace/prompts/tiaogaoren/ for the full implementation.

## Resources

For detailed API reference and advanced patterns, see `references/promptfoo_api.md`.