# Eval-Driven Development with pixie
This skill is about doing the work, not describing it. When a user asks you to set up evals for their app, you should be reading their code, editing their files, running commands, and producing a working test pipeline — not writing a plan for them to follow later.
All pixie-generated files live in a single `pixie_qa/` directory at the project root:

```
pixie_qa/
  MEMORY.md        # your understanding and eval plan
  observations.db  # SQLite trace DB (auto-created by enable_storage)
  datasets/        # golden datasets (JSON files)
  tests/           # eval test files (test_*.py)
  scripts/         # helper scripts (build_dataset.py, etc.)
```

## Setup vs. Iteration: when to stop
This is critical. What you do depends on what the user asked for.
### "Setup QA" / "set up evals" / "add tests" (setup intent)
The user wants a working eval pipeline. Your job is Stages 0–6: install, understand, instrument, write tests, build dataset, run tests. Stop after the first test run, regardless of whether tests pass or fail. Report:
- What you set up (instrumentation, test file, dataset)
- The test results (pass/fail, scores)
- If tests failed: a brief summary of what failed and likely causes — but do NOT fix anything
Then ask: "QA setup is complete. Tests show N/M passing. Want me to investigate the failures and start iterating?"
Only proceed to Stage 7 (investigation and fixes) if the user confirms.
Exception: If the test run itself errors out (import failures, missing API keys, configuration bugs) — those are setup problems, not eval failures. Fix them and re-run until you get a clean test execution where pass/fail reflects actual app quality, not broken plumbing.
### "Fix" / "improve" / "debug" / "why is X failing" (iteration intent)
The user wants you to investigate and fix. Proceed through all stages including Stage 7 — investigate failures, root-cause them, apply fixes, rebuild dataset, re-run tests, iterate.
### Ambiguous requests
If the intent is unclear, default to setup only and ask before iterating. It's better to stop early and ask than to make unwanted changes to the user's application code.
## The eval boundary: what to evaluate
Eval-driven development focuses on LLM-dependent behaviour. The purpose is to catch quality regressions in the parts of the system that are non-deterministic and hard to test with traditional unit tests — namely, LLM calls and the decisions they drive.
### In scope (evaluate this)
- LLM response quality: factual accuracy, relevance, format compliance, safety
- Agent routing decisions: did the LLM choose the right tool/handoff/action?
- Prompt effectiveness: does the prompt produce the desired behaviour?
- Multi-turn coherence: does the agent maintain context across turns?
### Out of scope (do NOT evaluate this with evals)
- Tool implementations (database queries, API calls, keyword matching, business logic) — these are traditional software; test them with unit tests
- Infrastructure (authentication, rate limiting, caching, serialization)
- Deterministic post-processing (formatting, filtering, sorting results)
The boundary is: everything downstream of the LLM call (tools, databases, APIs) produces deterministic outputs that serve as inputs to the LLM-powered system. Eval tests should treat those as given facts and focus on what the LLM does with them.
Example: If an FAQ tool has a keyword-matching bug that returns wrong data, that's a traditional bug — fix it with a regular code change, not by adjusting eval thresholds. The eval tests exist to verify that given correct tool outputs, the LLM agent produces correct user-facing responses.
When building datasets and expected outputs, use the actual tool/system outputs as ground truth. The expected output for an eval case should reflect what a correct LLM response looks like given the tool results the system actually produces.
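A minimal sketch makes the boundary concrete (a hypothetical `faq_lookup` tool, all names assumed): the tool is deterministic and belongs in a unit test; only what the LLM does with its output belongs in an eval.

```python
# Hypothetical deterministic tool — everything in here is ordinary software.
def faq_lookup(question: str) -> str:
    """Keyword-matched FAQ tool (illustrative only)."""
    if any(kw in question.lower() for kw in ("seat", "seats", "seating")):
        return "rows 5-8 Economy Plus with extra legroom"
    return "I don't know."

# In scope for a UNIT test: the tool's deterministic behavior.
assert faq_lookup("Which seats have extra legroom?") == "rows 5-8 Economy Plus with extra legroom"
assert faq_lookup("What is the baggage limit?") == "I don't know."

# In scope for an EVAL: whether the LLM's final answer faithfully uses the
# tool output above. That part is non-deterministic, so it is measured with
# evaluators and thresholds rather than exact-match asserts.
```

The eval dataset would treat the string returned by `faq_lookup` as a given fact and score the LLM's user-facing answer against it.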
## Stage 0: Ensure pixie-qa is Installed and API Keys Are Set
Before doing anything else, check that the `pixie-qa` package is available:

```bash
python -c "import pixie" 2>/dev/null && echo "installed" || echo "not installed"
```

If it's not installed, install it:

```bash
pip install pixie-qa
```

This provides the `pixie` Python module, the `pixie` CLI, and the `pixie test` runner — all required for instrumentation and evals. Don't skip this step; everything else in this skill depends on it.

### Verify API keys
The application under test almost certainly needs an LLM provider API key (e.g. `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`). LLM-as-judge evaluators like `FactualityEval` also need `OPENAI_API_KEY`. Before running anything, verify the key is set:

```bash
[ -n "$OPENAI_API_KEY" ] && echo "OPENAI_API_KEY set" || echo "OPENAI_API_KEY missing"
```

If not set, ask the user. Do not proceed with running the app or evals without it — you'll get silent failures or import-time errors.
## Stage 1: Understand the Application
Before touching any code, spend time actually reading the source. The code will tell you more than asking the user would, and it puts you in a much better position to make good decisions about what and how to evaluate.
### What to investigate
- **How the software runs**: What is the entry point? How do you start it? Is it a CLI, a server, a library function? What are the required arguments, config files, or environment variables?
- **All inputs to the LLM**: This is not limited to the user's message. Trace every piece of data that gets incorporated into any LLM prompt:
  - User input (queries, messages, uploaded files)
  - System prompts (hardcoded or templated)
  - Retrieved context (RAG chunks, search results, database records)
  - Tool definitions and function schemas
  - Conversation history / memory
  - Configuration or feature flags that change prompt behavior
- **All intermediate steps and outputs**: Walk through the code path from input to final output and document each stage:
  - Retrieval / search results
  - Tool calls and their results
  - Agent routing / handoff decisions
  - Intermediate LLM calls (e.g., summarization before final answer)
  - Post-processing or formatting steps
- **The final output**: What does the user see? What format is it in? What are the quality expectations?
- **Use cases and expected behaviors**: What are the distinct things the app is supposed to handle? For each use case, what does a "good" response look like? What would constitute a failure?
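As a starting point for this investigation, a rough scan can surface likely LLM call sites to read first. This is a hypothetical helper, and the marker list is an assumption; adjust it to the providers and SDKs the app actually uses:

```python
import os
import re

# Heuristic markers for LLM-related code (assumed; extend per provider/SDK)
MARKERS = re.compile(r"openai|anthropic|system_prompt|chat\.completions", re.IGNORECASE)

def find_llm_callsites(root: str) -> list[tuple[str, int, str]]:
    """Scan .py files under `root` for lines that look LLM-related."""
    hits = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".py"):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as f:
                    for lineno, line in enumerate(f, 1):
                        if MARKERS.search(line):
                            hits.append((path, lineno, line.strip()))
    return hits
```

Each hit is a candidate entry for MEMORY.md: read the surrounding code and document the prompt, the dynamic inputs, and the available tools.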
### Write MEMORY.md
Write your findings down in `pixie_qa/MEMORY.md`. This is the primary working document for the eval effort. It should be human-readable and detailed enough that someone unfamiliar with the project can understand the application and the eval strategy.

**CRITICAL**: MEMORY.md documents your understanding of the existing application code. It must NOT contain references to pixie commands, instrumentation code you plan to add, or scripts/functions that don't exist yet. Those belong in later sections, only after they've been implemented.

The understanding section should include:

```markdown
# Eval Notes: <Project Name>

## How the application works

### Entry point and execution flow
<Describe how to start/run the app, what happens step by step>

### Inputs to LLM calls
<For each LLM call in the codebase, document:>
- Where it is in the code (file + function name)
- What system prompt it uses (quote it or summarize)
- What user/dynamic content feeds into it
- What tools/functions are available to it

### Intermediate processing
<Describe any steps between input and output:>
- Retrieval, routing, tool execution, etc.
- Include code pointers (file:line) for each step

### Final output
<What the user sees, what format, what the quality bar should be>

### Use cases
<List each distinct scenario the app handles, with examples of good/bad outputs>

## Evaluation plan

### What to evaluate and why
<Quality dimensions: factual accuracy, relevance, format compliance, safety, etc.>

### Evaluation granularity
<Which function/span boundary captures one "test case"? Why that boundary?>

### Evaluators and criteria
<For each eval test, specify: evaluator, dataset, threshold, reasoning>

### Data needed for evaluation
<What data points need to be captured, with code pointers to where they live>
```

If something is genuinely unclear from the code, ask the user — but most questions answer themselves once you've read the code carefully.

---

## Stage 2: Decide What to Evaluate
Now that you understand the app, you can make thoughtful choices about what to measure:
- **What quality dimension matters most?** Factual accuracy for QA apps, output format for structured extraction, relevance for RAG, safety for user-facing text.
- **Which span to evaluate**: the whole pipeline (`root`) or just the LLM call (`last_llm_call`)? If you're debugging retrieval, you might evaluate at a different point than if you're checking final answer quality.
- **Which evaluators fit**: see `references/pixie-api.md` → Evaluators. For factual QA: `FactualityEval`. For structured output: `ValidJSONEval`/`JSONDiffEval`. For RAG pipelines: `ContextRelevancyEval`/`FaithfulnessEval`.
- **Pass criteria**: `ScoreThreshold(threshold=0.7, pct=0.8)` means 80% of cases must score ≥ 0.7. Think about what "good enough" looks like for this app.
- **Expected outputs**: `FactualityEval` needs them. Format evaluators usually don't.

Update `pixie_qa/MEMORY.md` with the plan before writing any code.
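To sanity-check a pass criterion before committing to it, the rule can be sketched in plain Python. This mirrors the semantics described above; it is not pixie's actual `ScoreThreshold` implementation:

```python
def meets_score_threshold(scores: list[float], threshold: float = 0.7, pct: float = 0.8) -> bool:
    """True when at least `pct` of cases score >= `threshold`."""
    if not scores:
        return False
    passing = sum(1 for s in scores if s >= threshold)
    return passing / len(scores) >= pct

# 4 of 5 cases score >= 0.7 → 80% passing, meets pct=0.8
print(meets_score_threshold([0.9, 0.8, 0.75, 0.7, 0.4]))  # True
# 3 of 5 cases score >= 0.7 → 60% passing, fails pct=0.8
print(meets_score_threshold([0.9, 0.8, 0.75, 0.6, 0.4]))  # False
```

Running a few hypothetical score distributions through this helps you pick a threshold that fails genuinely bad runs without flaking on borderline ones.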
## Stage 3: Instrument the Application
Add pixie instrumentation to the existing production code. The goal is to capture the inputs and outputs of functions that are already part of the application's normal execution path. Instrumentation must be on the real code path — the same code that runs when the app is used in production — so that traces are captured both during eval runs and real usage.
### Add `enable_storage()` at application startup

Call `enable_storage()` once at the beginning of the application's startup code — inside `main()`, or at the top of a server's initialization. Never at module level (top of a file outside any function), because that causes storage setup to trigger on import.

Good places:
- Inside `if __name__ == "__main__":` blocks
- In a FastAPI `lifespan` or `on_startup` handler
- At the top of `main()`/`run()` functions
- Inside the `runnable` function in test files

```python
# ✅ CORRECT — at application startup
async def main():
    enable_storage()
    ...

# ✅ CORRECT — in a runnable for tests
def runnable(eval_input):
    enable_storage()
    my_function(**eval_input)

# ❌ WRONG — at module level, runs on import
from pixie import enable_storage
enable_storage()  # this runs when any file imports this module!
```
undefinedWrap existing functions with @observe
or start_observation
@observestart_observation使用@observe
或start_observation
包裹现有函数
@observestart_observationCRITICAL: Instrument the production code path. Never create separate functions or alternate code paths for testing.
The decorator or context manager goes on the existing function that the app actually calls during normal operation. If the app's entry point is an interactive loop, instrument or the core function it calls per user turn — not a new helper function that duplicates logic.
@observestart_observationmain()main()python
undefined关键注意事项:埋点生产代码路径。绝不要为测试创建单独的函数或替代代码路径。
@observestart_observationmain()main()python
undefined✅ CORRECT — decorating the existing production function
✅ 正确——装饰现有生产函数
from pixie import observe
@observe(name="answer_question")
def answer_question(question: str, context: str) -> str: # existing function
... # existing code, unchanged
```pythonfrom pixie import observe
@observe(name="answer_question")
def answer_question(question: str, context: str) -> str: # 现有函数
... # 现有代码,保持不变
```python✅ CORRECT — context manager inside an existing function
✅ 正确——在现有函数内使用上下文管理器
from pixie import start_observation
async def main(): # existing function
...
with start_observation(input={"user_input": user_input}, name="handle_turn") as obs:
result = await Runner.run(current_agent, input_items, context=context)
# ... existing response handling ...
obs.set_output(response_text)
...
```pythonfrom pixie import start_observation
async def main(): # 现有函数
...
with start_observation(input={"user_input": user_input}, name="handle_turn") as obs:
result = await Runner.run(current_agent, input_items, context=context)
# ... 现有响应处理代码 ...
obs.set_output(response_text)
...
```python❌ WRONG — creating a new function that duplicates logic from main()
❌ 错误——创建重复main()逻辑的新函数
@observe(name="run_for_eval")
async def run_for_eval(user_messages: list[str]) -> str:
# This duplicates what main() does, creating a separate code path
# that diverges from production. Don't do this.
...
**Rules:**
- **Never add new wrapper functions** to the application code for eval purposes.
- **Never change the function's interface** (arguments, return type, behavior).
- **Never duplicate production logic** into a separate "testable" function.
- The instrumentation is purely additive — if you removed all pixie imports and decorators, the app would work identically.
- After instrumentation, call `flush()` at the end of runs to make sure all spans are written.
- For interactive apps (CLI loops, chat interfaces), instrument the **per-turn processing** function — the one that takes user input and produces a response. The eval `runnable` should call this same function.
**Important**: All pixie symbols are importable from the top-level `pixie` package. Never tell users to import from submodules (`pixie.instrumentation`, `pixie.evals`, `pixie.storage.evaluable`, etc.) — always use `from pixie import ...`.
---

## Stage 4: Write the Eval Test File
Write the test file before building the dataset. This might seem backwards, but it forces you to decide what you're actually measuring before you start collecting data — otherwise the data collection has no direction.
Create `pixie_qa/tests/test_<feature>.py`. The pattern is: a `runnable` adapter that calls the app's existing production function, plus an async test function that calls `assert_dataset_pass`:

```python
from pixie import enable_storage, assert_dataset_pass, FactualityEval, ScoreThreshold, last_llm_call
from myapp import answer_question

def runnable(eval_input):
    """Replays one dataset item through the app.

    Calls the same function the production app uses.
    enable_storage() here ensures traces are captured during eval runs.
    """
    enable_storage()
    answer_question(**eval_input)

async def test_factuality():
    await assert_dataset_pass(
        runnable=runnable,
        dataset_name="<dataset-name>",
        evaluators=[FactualityEval()],
        pass_criteria=ScoreThreshold(threshold=0.7, pct=0.8),
        from_trace=last_llm_call,
    )
```

Note that `enable_storage()` belongs inside the `runnable`, not at module level in the test file — it needs to fire on each invocation so the trace is captured for that specific run.

The `runnable` calls the same function that production uses — it does not create a new code path. The only addition is `enable_storage()` to capture traces during eval.

The test runner is `pixie test` (not `pytest`):

```bash
pixie test                  # run all test_*.py in current directory
pixie test pixie_qa/tests/  # specify path
pixie test -k factuality    # filter by name
pixie test -v               # verbose: shows per-case scores and reasoning
```

## Stage 5: Build the Dataset
Create the dataset first, then populate it by actually running the app with representative inputs. This is critical — dataset items should contain real app outputs and trace metadata, not fabricated data.

```bash
pixie dataset create <dataset-name>
pixie dataset list   # verify it exists
```

### Run the app and capture traces to the dataset
Write a simple script (`pixie_qa/scripts/build_dataset.py`) that calls the instrumented function for each input, flushes traces, then saves them to the dataset:

```python
import asyncio
from pixie import enable_storage, flush, DatasetStore, Evaluable
from myapp import answer_question

GOLDEN_CASES = [
    ("What is the capital of France?", "Paris"),
    ("What is the speed of light?", "299,792,458 meters per second"),
]

async def build_dataset():
    enable_storage()
    store = DatasetStore()
    try:
        store.create("qa-golden-set")
    except FileExistsError:
        pass
    for question, expected in GOLDEN_CASES:
        result = answer_question(question=question)
        flush()
        store.append("qa-golden-set", Evaluable(
            eval_input={"question": question},
            eval_output=result,
            expected_output=expected,
        ))

asyncio.run(build_dataset())
```

Alternatively, use the CLI for per-case capture:

```bash
# Run the app (enable_storage() must be active)
python -c "from myapp import main; main('What is the capital of France?')"

# Save the root span to the dataset
pixie dataset save <dataset-name>

# Or specifically save the last LLM call:
pixie dataset save <dataset-name> --select last_llm_call

# Add context:
pixie dataset save <dataset-name> --notes "basic geography question"

# Attach expected output for evaluators like FactualityEval:
echo '"Paris"' | pixie dataset save <dataset-name> --expected-output
```
**Key rules for dataset building:**
- **Always run the app** — never fabricate `eval_output` manually. The whole point is capturing what the app actually produces.
- **Include expected outputs** for comparison-based evaluators like `FactualityEval`. Expected outputs should reflect the **correct LLM response given what the tools/system actually return** — not an idealized answer predicated on fixing non-LLM bugs.
- **Cover the range** of inputs you care about: normal cases, edge cases, things the app might plausibly get wrong.
- When using `pixie dataset save`, the evaluable's `eval_metadata` will automatically include `trace_id` and `span_id` for later debugging.
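A quick pre-flight check can catch violations of these rules before a test run. This is a hypothetical helper over plain dict records mirroring the `Evaluable` fields (`eval_input`, `eval_output`, `expected_output`); it is not part of the pixie API:

```python
def check_dataset_items(items: list[dict]) -> list[str]:
    """Return a list of problems found in golden-dataset items.

    Comparison evaluators like FactualityEval need expected_output,
    and eval_output must come from a real app run.
    """
    problems = []
    for i, item in enumerate(items):
        if not item.get("eval_input"):
            problems.append(f"item {i}: missing eval_input")
        if "eval_output" not in item:
            problems.append(f"item {i}: missing eval_output (did you run the app?)")
        if item.get("expected_output") in (None, ""):
            problems.append(f"item {i}: missing expected_output (needed by FactualityEval)")
    return problems

items = [
    {"eval_input": {"question": "What is the capital of France?"},
     "eval_output": "The capital of France is Paris.",
     "expected_output": "Paris"},
    {"eval_input": {"question": "What is the speed of light?"},
     "eval_output": "299,792,458 m/s"},  # no expected_output
]
print(check_dataset_items(items))
```

Running a check like this before `pixie test` turns a confusing mid-run evaluator error into an obvious dataset fix.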
---

## Stage 6: Run the Tests
```bash
pixie test pixie_qa/tests/ -v
```

The `-v` flag shows per-case scores and reasoning, which makes it much easier to see what's passing and what isn't. Check that the pass rates look reasonable given your `ScoreThreshold`.

**After this stage, if the user's intent was "setup" — STOP.** Report results and ask before proceeding. See "Setup vs. Iteration" above.

## Stage 7: Investigate Failures
Only proceed here if the user asked for iteration/fixing, or explicitly confirmed after setup.
When tests fail, the goal is to understand why, not to adjust thresholds until things pass. Investigation must be thorough and documented — the user needs to see the actual data, your reasoning, and your conclusion.
### Step 1: Get the detailed test output
```bash
pixie test pixie_qa/tests/ -v   # shows score and reasoning per case
```

Capture the full verbose output. For each failing case, note:
- The `eval_input` (what was sent)
- The `eval_output` (what the app produced)
- The `expected_output` (what was expected, if applicable)
- The evaluator score and reasoning
### Step 2: Inspect the trace data
For each failing case, look up the full trace to see what happened inside the app:

```python
from pixie import DatasetStore

store = DatasetStore()
ds = store.get("<dataset-name>")
for i, item in enumerate(ds.items):
    print(i, item.eval_metadata)  # trace_id is here
```

Then inspect the full span tree:

```python
import asyncio
from pixie import ObservationStore

async def inspect(trace_id: str):
    store = ObservationStore()
    roots = await store.get_trace(trace_id)
    for root in roots:
        print(root.to_text())  # full span tree: inputs, outputs, LLM messages

asyncio.run(inspect("the-trace-id-here"))
```

### Step 3: Root-cause analysis
Walk through the trace and identify exactly where the failure originates. Common patterns:
LLM-related failures (fix with prompt/model/eval changes):
| Symptom | Likely cause |
|---|---|
| Output is factually wrong despite correct tool results | Prompt doesn't instruct the LLM to use tool output faithfully |
| Agent routes to wrong tool/handoff | Routing prompt or handoff descriptions are ambiguous |
| Output format is wrong | Missing format instructions in prompt |
| LLM hallucinated instead of using tool | Prompt doesn't enforce tool usage |
Non-LLM failures (fix with traditional code changes, out of eval scope):
| Symptom | Likely cause |
|---|---|
| Tool returned wrong data | Bug in tool implementation — fix the tool, not the eval |
| Tool wasn't called at all due to keyword mismatch | Tool-selection logic is broken — fix the code |
| Database returned stale/wrong records | Data issue — fix independently |
| API call failed with error | Infrastructure issue |
For non-LLM failures: note them in the investigation log and recommend the code fix, but do not adjust eval expectations or thresholds to accommodate bugs in non-LLM code. The eval test should measure LLM quality assuming the rest of the system works correctly.
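The keyword-mismatch row is worth seeing in miniature. A hypothetical sketch of such a tool (all names assumed, not from any real codebase) shows why the fix is a code change rather than an eval change:

```python
def faq_lookup_tool(question: str, keywords: list[str]) -> str:
    """Illustrative keyword-matched FAQ tool."""
    if any(kw in question.lower() for kw in keywords):
        return "rows 5-8 Economy Plus with extra legroom"
    return "I'm sorry, I don't know."

buggy_keywords = ["seat", "seats", "seating", "plane"]
fixed_keywords = buggy_keywords + ["row", "rows", "legroom"]

question = "What rows have extra legroom?"
# The question contains none of the buggy keywords, so the tool falls
# through to its default response — a deterministic, non-LLM bug.
print(faq_lookup_tool(question, buggy_keywords))   # "I'm sorry, I don't know."
# Extending the keyword list fixes the tool; no prompt or threshold changes.
print(faq_lookup_tool(question, fixed_keywords))   # correct FAQ answer
```

The eval only surfaced this bug; the repair lives entirely in traditional code, and the eval's expectations stay unchanged.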
### Step 4: Document findings in MEMORY.md
Every failure investigation must be documented in `pixie_qa/MEMORY.md` in a structured format:

````markdown
## Investigation: <test_name> failure — <date>

**Test**: `test_faq_factuality` in `pixie_qa/tests/test_customer_service.py`
**Result**: 3/5 cases passed (60%), threshold was 80% ≥ 0.7

### Failing case 1: "What rows have extra legroom?"
- eval_input: `{"user_message": "What rows have extra legroom?"}`
- eval_output: "I'm sorry, I don't have the exact row numbers for extra legroom..."
- expected_output: "rows 5-8 Economy Plus with extra legroom"
- Evaluator score: 0.1 (FactualityEval)
- Evaluator reasoning: "The output claims not to know the answer while the reference clearly states rows 5-8..."

**Trace analysis**:
Inspected trace `abc123`. The span tree shows:
- Triage Agent routed to FAQ Agent ✓
- FAQ Agent called `faq_lookup_tool("What rows have extra legroom?")` ✓
- `faq_lookup_tool` returned "I'm sorry, I don't know..." ← root cause

**Root cause**: `faq_lookup_tool` (customer_service.py:112) uses keyword matching.
The seat FAQ entry is triggered by keywords `["seat", "seats", "seating", "plane"]`.
The question "What rows have extra legroom?" contains none of these keywords, so it
falls through to the default "I don't know" response.

**Classification**: Non-LLM failure — the keyword-matching tool is broken.
The LLM agent correctly routed to the FAQ agent and used the tool; the tool
itself returned wrong data.

**Fix**: Add `"row"`, `"rows"`, `"legroom"` to the seating keyword list in
`faq_lookup_tool` (customer_service.py:130). This is a traditional code fix,
not an eval/prompt change.

**Verification**: After fix, re-run:
```bash
python pixie_qa/scripts/build_dataset.py   # refresh dataset
pixie test pixie_qa/tests/ -k faq -v       # verify
```
````

### Step 5: Fix and re-run
Make the targeted change, rebuild the dataset if needed, and re-run. Always finish by giving the user the exact commands to verify:

```bash
pixie test pixie_qa/tests/test_<feature>.py -v
```

## Memory Template
```markdown
# Eval Notes: <Project Name>

## How the application works

### Entry point and execution flow
<How to start/run the app. Step-by-step flow from input to output.>

### Inputs to LLM calls
<For EACH LLM call, document: location in code, system prompt, dynamic content, available tools>

### Intermediate processing
<Steps between input and output: retrieval, routing, tool calls, etc. Code pointers for each.>

### Final output
<What the user sees. Format. Quality expectations.>

### Use cases
<Each scenario with examples of good/bad outputs:>
- <Use case 1>: <description>
  - Input example: ...
  - Good output: ...
  - Bad output: ...

## Evaluation plan

### What to evaluate and why
<Quality dimensions and rationale>

### Evaluators and criteria
| Test | Dataset | Evaluator | Criteria | Rationale |
|---|---|---|---|---|
| ... | ... | ... | ... | ... |

### Data needed for evaluation
<What data to capture, with code pointers>

## Datasets
| Dataset | Items | Purpose |
|---|---|---|
| ... | ... | ... |

## Investigation log

### <date> — <test_name> failure
<Full structured investigation as described in Stage 7>
```

---

## Reference
See `references/pixie-api.md` for all CLI commands, evaluator signatures, and the Python dataset/store API.