# Eval-Driven Development with pixie
This skill is about doing the work, not describing it. When a user asks you to set up evals for their app, you should be reading their code, editing their files, running commands, and producing a working test pipeline — not writing a plan for them to follow later.
All pixie-generated files live in a single `pixie_qa/` directory at the project root:

```
pixie_qa/
  MEMORY.md        # your understanding and eval plan
  observations.db  # SQLite trace DB (auto-created by enable_storage)
  datasets/        # golden datasets (JSON files)
  tests/           # eval test files (test_*.py)
  scripts/         # helper scripts (build_dataset.py, etc.)
```

## Setup vs. Iteration: when to stop
This is critical. What you do depends on what the user asked for.
### "Setup QA" / "set up evals" / "add tests" (setup intent)
The user wants a working eval pipeline. Your job is Stages 0–6: install, understand, instrument, write tests, build dataset, run tests. Stop after the first test run, regardless of whether tests pass or fail. Report:
- What you set up (instrumentation, test file, dataset)
- The test results (pass/fail, scores)
- If tests failed: a brief summary of what failed and likely causes — but do NOT fix anything
Then ask: "QA setup is complete. Tests show N/M passing. Want me to investigate the failures and start iterating?"
Only proceed to Stage 7 (investigation and fixes) if the user confirms.
Exception: If the test run itself errors out (import failures, missing API keys, configuration bugs) — those are setup problems, not eval failures. Fix them and re-run until you get a clean test execution where pass/fail reflects actual app quality, not broken plumbing.
### "Fix" / "improve" / "debug" / "why is X failing" (iteration intent)
The user wants you to investigate and fix. Proceed through all stages including Stage 7 — investigate failures, root-cause them, apply fixes, rebuild dataset, re-run tests, iterate.
### Ambiguous requests
If the intent is unclear, default to setup only and ask before iterating. It's better to stop early and ask than to make unwanted changes to the user's application code.
## The eval boundary: what to evaluate
Eval-driven development focuses on LLM-dependent behaviour. The purpose is to catch quality regressions in the parts of the system that are non-deterministic and hard to test with traditional unit tests — namely, LLM calls and the decisions they drive.
### In scope (evaluate this)
- LLM response quality: factual accuracy, relevance, format compliance, safety
- Agent routing decisions: did the LLM choose the right tool/handoff/action?
- Prompt effectiveness: does the prompt produce the desired behaviour?
- Multi-turn coherence: does the agent maintain context across turns?
### Out of scope (do NOT evaluate this with evals)
- Tool implementations (database queries, API calls, keyword matching, business logic) — these are traditional software; test them with unit tests
- Infrastructure (authentication, rate limiting, caching, serialization)
- Deterministic post-processing (formatting, filtering, sorting results)
The boundary is: everything downstream of the LLM call (tools, databases, APIs) produces deterministic outputs that serve as inputs to the LLM-powered system. Eval tests should treat those as given facts and focus on what the LLM does with them.
Example: If an FAQ tool has a keyword-matching bug that returns wrong data, that's a traditional bug — fix it with a regular code change, not by adjusting eval thresholds. The eval tests exist to verify that given correct tool outputs, the LLM agent produces correct user-facing responses.
When building datasets and expected outputs, use the actual tool/system outputs as ground truth. The expected output for an eval case should reflect what a correct LLM response looks like given the tool results the system actually produces.
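A minimal sketch makes the boundary concrete (a hypothetical `faq_lookup` tool, all names assumed): the tool is deterministic and belongs in a unit test; only what the LLM does with its output belongs in an eval.

```python
# Hypothetical deterministic tool — everything in here is ordinary software.
def faq_lookup(question: str) -> str:
    """Keyword-matched FAQ tool (illustrative only)."""
    if any(kw in question.lower() for kw in ("seat", "seats", "seating")):
        return "rows 5-8 Economy Plus with extra legroom"
    return "I don't know."

# In scope for a UNIT test: the tool's deterministic behavior.
assert faq_lookup("Which seats have extra legroom?") == "rows 5-8 Economy Plus with extra legroom"
assert faq_lookup("What is the baggage limit?") == "I don't know."

# In scope for an EVAL: whether the LLM's final answer faithfully uses the
# tool output above. That part is non-deterministic, so it is measured with
# evaluators and thresholds rather than exact-match asserts.
```

The eval dataset would treat the string returned by `faq_lookup` as a given fact and score the LLM's user-facing answer against it.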
## Stage 0: Ensure pixie-qa is Installed and API Keys Are Set
Before doing anything else, check that the `pixie-qa` package is available:

```bash
python -c "import pixie" 2>/dev/null && echo "installed" || echo "not installed"
```

If it's not installed, install it:

```bash
pip install pixie-qa
```

This provides the `pixie` Python module, the `pixie` CLI, and the `pixie test` runner — all required for instrumentation and evals. Don't skip this step; everything else in this skill depends on it.

### Verify API keys
The application under test almost certainly needs an LLM provider API key (e.g. `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`). LLM-as-judge evaluators like `FactualityEval` also need `OPENAI_API_KEY`. Before running anything, verify the key is set:

```bash
[ -n "$OPENAI_API_KEY" ] && echo "OPENAI_API_KEY set" || echo "OPENAI_API_KEY missing"
```

If not set, ask the user. Do not proceed with running the app or evals without it — you'll get silent failures or import-time errors.
## Stage 1: Understand the Application
Before touching any code, spend time actually reading the source. The code will tell you more than asking the user would, and it puts you in a much better position to make good decisions about what and how to evaluate.
### What to investigate
- **How the software runs**: What is the entry point? How do you start it? Is it a CLI, a server, a library function? What are the required arguments, config files, or environment variables?
- **All inputs to the LLM**: This is not limited to the user's message. Trace every piece of data that gets incorporated into any LLM prompt:
  - User input (queries, messages, uploaded files)
  - System prompts (hardcoded or templated)
  - Retrieved context (RAG chunks, search results, database records)
  - Tool definitions and function schemas
  - Conversation history / memory
  - Configuration or feature flags that change prompt behavior
- **All intermediate steps and outputs**: Walk through the code path from input to final output and document each stage:
  - Retrieval / search results
  - Tool calls and their results
  - Agent routing / handoff decisions
  - Intermediate LLM calls (e.g., summarization before final answer)
  - Post-processing or formatting steps
- **The final output**: What does the user see? What format is it in? What are the quality expectations?
- **Use cases and expected behaviors**: What are the distinct things the app is supposed to handle? For each use case, what does a "good" response look like? What would constitute a failure?
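As a starting point for this investigation, a rough scan can surface likely LLM call sites to read first. This is a hypothetical helper, and the marker list is an assumption; adjust it to the providers and SDKs the app actually uses:

```python
import os
import re

# Heuristic markers for LLM-related code (assumed; extend per provider/SDK)
MARKERS = re.compile(r"openai|anthropic|system_prompt|chat\.completions", re.IGNORECASE)

def find_llm_callsites(root: str) -> list[tuple[str, int, str]]:
    """Scan .py files under `root` for lines that look LLM-related."""
    hits = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".py"):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as f:
                    for lineno, line in enumerate(f, 1):
                        if MARKERS.search(line):
                            hits.append((path, lineno, line.strip()))
    return hits
```

Each hit is a candidate entry for MEMORY.md: read the surrounding code and document the prompt, the dynamic inputs, and the available tools.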
### Write MEMORY.md
Write your findings down in `pixie_qa/MEMORY.md`. This is the primary working document for the eval effort. It should be human-readable and detailed enough that someone unfamiliar with the project can understand the application and the eval strategy.

**CRITICAL**: MEMORY.md documents your understanding of the existing application code. It must NOT contain references to pixie commands, instrumentation code you plan to add, or scripts/functions that don't exist yet. Those belong in later sections, only after they've been implemented.

The understanding section should include:

```markdown
# Eval Notes: <Project Name>

## How the application works

### Entry point and execution flow
<Describe how to start/run the app, what happens step by step>

### Inputs to LLM calls
<For each LLM call in the codebase, document:>
- Where it is in the code (file + function name)
- What system prompt it uses (quote it or summarize)
- What user/dynamic content feeds into it
- What tools/functions are available to it

### Intermediate processing
<Describe any steps between input and output:>
- Retrieval, routing, tool execution, etc.
- Include code pointers (file:line) for each step

### Final output
<What the user sees, what format, what the quality bar should be>

### Use cases
<List each distinct scenario the app handles, with examples of good/bad outputs>

## Evaluation plan

### What to evaluate and why
<Quality dimensions: factual accuracy, relevance, format compliance, safety, etc.>

### Evaluation granularity
<Which function/span boundary captures one "test case"? Why that boundary?>

### Evaluators and criteria
<For each eval test, specify: evaluator, dataset, threshold, reasoning>

### Data needed for evaluation
<What data points need to be captured, with code pointers to where they live>
```

If something is genuinely unclear from the code, ask the user — but most questions answer themselves once you've read the code carefully.

---

## Stage 2: Decide What to Evaluate
Now that you understand the app, you can make thoughtful choices about what to measure:
- **What quality dimension matters most?** Factual accuracy for QA apps, output format for structured extraction, relevance for RAG, safety for user-facing text.
- **Which span to evaluate**: the whole pipeline (`root`) or just the LLM call (`last_llm_call`)? If you're debugging retrieval, you might evaluate at a different point than if you're checking final answer quality.
- **Which evaluators fit**: see `references/pixie-api.md` → Evaluators. For factual QA: `FactualityEval`. For structured output: `ValidJSONEval`/`JSONDiffEval`. For RAG pipelines: `ContextRelevancyEval`/`FaithfulnessEval`.
- **Pass criteria**: `ScoreThreshold(threshold=0.7, pct=0.8)` means 80% of cases must score ≥ 0.7. Think about what "good enough" looks like for this app.
- **Expected outputs**: `FactualityEval` needs them. Format evaluators usually don't.

Update `pixie_qa/MEMORY.md` with the plan before writing any code.
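To sanity-check a pass criterion before committing to it, the rule can be sketched in plain Python. This mirrors the semantics described above; it is not pixie's actual `ScoreThreshold` implementation:

```python
def meets_score_threshold(scores: list[float], threshold: float = 0.7, pct: float = 0.8) -> bool:
    """True when at least `pct` of cases score >= `threshold`."""
    if not scores:
        return False
    passing = sum(1 for s in scores if s >= threshold)
    return passing / len(scores) >= pct

# 4 of 5 cases score >= 0.7 → 80% passing, meets pct=0.8
print(meets_score_threshold([0.9, 0.8, 0.75, 0.7, 0.4]))  # True
# 3 of 5 cases score >= 0.7 → 60% passing, fails pct=0.8
print(meets_score_threshold([0.9, 0.8, 0.75, 0.6, 0.4]))  # False
```

Running a few hypothetical score distributions through this helps you pick a threshold that fails genuinely bad runs without flaking on borderline ones.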
## Stage 3: Instrument the Application
Add pixie instrumentation to the existing production code. The goal is to capture the inputs and outputs of functions that are already part of the application's normal execution path. Instrumentation must be on the real code path — the same code that runs when the app is used in production — so that traces are captured both during eval runs and real usage.
### Add `enable_storage()` at application startup

Call `enable_storage()` once at the beginning of the application's startup code — inside `main()`, or at the top of a server's initialization. Never at module level (top of a file outside any function), because that causes storage setup to trigger on import.

Good places:
- Inside `if __name__ == "__main__":` blocks
- In a FastAPI `lifespan` or `on_startup` handler
- At the top of `main()`/`run()` functions
- Inside the `runnable` function in test files

```python
# ✅ CORRECT — at application startup
async def main():
    enable_storage()
    ...

# ✅ CORRECT — in a runnable for tests
def runnable(eval_input):
    enable_storage()
    my_function(**eval_input)

# ❌ WRONG — at module level, runs on import
from pixie import enable_storage
enable_storage()  # this runs when any file imports this module!
```
undefinedWrap existing functions with @observe
or start_observation
@observestart_observation使用@observe
或start_observation
包裹现有函数
@observestart_observationCRITICAL: Instrument the production code path. Never create separate functions or alternate code paths for testing.
The decorator or context manager goes on the existing function that the app actually calls during normal operation. If the app's entry point is an interactive loop, instrument or the core function it calls per user turn — not a new helper function that duplicates logic.
@observestart_observationmain()main()python
undefined关键注意事项:埋点生产代码路径。绝不要为测试创建单独的函数或替代代码路径。
@observestart_observationmain()main()python
undefined✅ CORRECT — decorating the existing production function
✅ 正确——装饰现有生产函数
from pixie import observe
@observe(name="answer_question")
def answer_question(question: str, context: str) -> str: # existing function
... # existing code, unchanged
```pythonfrom pixie import observe
@observe(name="answer_question")
def answer_question(question: str, context: str) -> str: # 现有函数
... # 现有代码,保持不变
```python✅ CORRECT — context manager inside an existing function
✅ 正确——在现有函数内使用上下文管理器
from pixie import start_observation
async def main(): # existing function
...
with start_observation(input={"user_input": user_input}, name="handle_turn") as obs:
result = await Runner.run(current_agent, input_items, context=context)
# ... existing response handling ...
obs.set_output(response_text)
...
```pythonfrom pixie import start_observation
async def main(): # 现有函数
...
with start_observation(input={"user_input": user_input}, name="handle_turn") as obs:
result = await Runner.run(current_agent, input_items, context=context)
# ... 现有响应处理代码 ...
obs.set_output(response_text)
...
```python❌ WRONG — creating a new function that duplicates logic from main()
❌ 错误——创建重复main()逻辑的新函数
@observe(name="run_for_eval")
async def run_for_eval(user_messages: list[str]) -> str:
# This duplicates what main() does, creating a separate code path
# that diverges from production. Don't do this.
...
**Rules:**
- **Never add new wrapper functions** to the application code for eval purposes.
- **Never change the function's interface** (arguments, return type, behavior).
- **Never duplicate production logic** into a separate "testable" function.
- The instrumentation is purely additive — if you removed all pixie imports and decorators, the app would work identically.
- After instrumentation, call `flush()` at the end of runs to make sure all spans are written.
- For interactive apps (CLI loops, chat interfaces), instrument the **per-turn processing** function — the one that takes user input and produces a response. The eval `runnable` should call this same function.
**Important**: All pixie symbols are importable from the top-level `pixie` package. Never tell users to import from submodules (`pixie.instrumentation`, `pixie.evals`, `pixie.storage.evaluable`, etc.) — always use `from pixie import ...`.
---

## Stage 4: Write the Eval Test File
Write the test file before building the dataset. This might seem backwards, but it forces you to decide what you're actually measuring before you start collecting data — otherwise the data collection has no direction.
Create `pixie_qa/tests/test_<feature>.py`. The pattern is: a `runnable` adapter that calls the app's existing production function, plus an async test function that calls `assert_dataset_pass`:

```python
from pixie import enable_storage, assert_dataset_pass, FactualityEval, ScoreThreshold, last_llm_call
from myapp import answer_question

def runnable(eval_input):
    """Replays one dataset item through the app.

    Calls the same function the production app uses.
    enable_storage() here ensures traces are captured during eval runs.
    """
    enable_storage()
    answer_question(**eval_input)

async def test_factuality():
    await assert_dataset_pass(
        runnable=runnable,
        dataset_name="<dataset-name>",
        evaluators=[FactualityEval()],
        pass_criteria=ScoreThreshold(threshold=0.7, pct=0.8),
        from_trace=last_llm_call,
    )
```

Note that `enable_storage()` belongs inside the `runnable`, not at module level in the test file — it needs to fire on each invocation so the trace is captured for that specific run.

The `runnable` calls the same function that production uses — it does not create a new code path. The only addition is `enable_storage()` to capture traces during eval.

The test runner is `pixie test` (not `pytest`):

```bash
pixie test                  # run all test_*.py in current directory
pixie test pixie_qa/tests/  # specify path
pixie test -k factuality    # filter by name
pixie test -v               # verbose: shows per-case scores and reasoning
```

## Stage 5: Build the Dataset
Create the dataset first, then populate it by actually running the app with representative inputs. This is critical — dataset items should contain real app outputs and trace metadata, not fabricated data.

```bash
pixie dataset create <dataset-name>
pixie dataset list   # verify it exists
```

### Run the app and capture traces to the dataset
Write a simple script (`pixie_qa/scripts/build_dataset.py`) that calls the instrumented function for each input, flushes traces, then saves them to the dataset:

```python
import asyncio
from pixie import enable_storage, flush, DatasetStore, Evaluable
from myapp import answer_question

GOLDEN_CASES = [
    ("What is the capital of France?", "Paris"),
    ("What is the speed of light?", "299,792,458 meters per second"),
]

async def build_dataset():
    enable_storage()
    store = DatasetStore()
    try:
        store.create("qa-golden-set")
    except FileExistsError:
        pass
    for question, expected in GOLDEN_CASES:
        result = answer_question(question=question)
        flush()
        store.append("qa-golden-set", Evaluable(
            eval_input={"question": question},
            eval_output=result,
            expected_output=expected,
        ))

asyncio.run(build_dataset())
```

Alternatively, use the CLI for per-case capture:

```bash
# Run the app (enable_storage() must be active)
python -c "from myapp import main; main('What is the capital of France?')"

# Save the root span to the dataset
pixie dataset save <dataset-name>

# Or specifically save the last LLM call:
pixie dataset save <dataset-name> --select last_llm_call

# Add context:
pixie dataset save <dataset-name> --notes "basic geography question"

# Attach expected output for evaluators like FactualityEval:
echo '"Paris"' | pixie dataset save <dataset-name> --expected-output
```
**Key rules for dataset building:**
- **Always run the app** — never fabricate `eval_output` manually. The whole point is capturing what the app actually produces.
- **Include expected outputs** for comparison-based evaluators like `FactualityEval`. Expected outputs should reflect the **correct LLM response given what the tools/system actually return** — not an idealized answer predicated on fixing non-LLM bugs.
- **Cover the range** of inputs you care about: normal cases, edge cases, things the app might plausibly get wrong.
- When using `pixie dataset save`, the evaluable's `eval_metadata` will automatically include `trace_id` and `span_id` for later debugging.
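A quick pre-flight check can catch violations of these rules before a test run. This is a hypothetical helper over plain dict records mirroring the `Evaluable` fields (`eval_input`, `eval_output`, `expected_output`); it is not part of the pixie API:

```python
def check_dataset_items(items: list[dict]) -> list[str]:
    """Return a list of problems found in golden-dataset items.

    Comparison evaluators like FactualityEval need expected_output,
    and eval_output must come from a real app run.
    """
    problems = []
    for i, item in enumerate(items):
        if not item.get("eval_input"):
            problems.append(f"item {i}: missing eval_input")
        if "eval_output" not in item:
            problems.append(f"item {i}: missing eval_output (did you run the app?)")
        if item.get("expected_output") in (None, ""):
            problems.append(f"item {i}: missing expected_output (needed by FactualityEval)")
    return problems

items = [
    {"eval_input": {"question": "What is the capital of France?"},
     "eval_output": "The capital of France is Paris.",
     "expected_output": "Paris"},
    {"eval_input": {"question": "What is the speed of light?"},
     "eval_output": "299,792,458 m/s"},  # no expected_output
]
print(check_dataset_items(items))
```

Running a check like this before `pixie test` turns a confusing mid-run evaluator error into an obvious dataset fix.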
---

## Stage 6: Run the Tests
```bash
pixie test pixie_qa/tests/ -v
```

The `-v` flag shows per-case scores and reasoning, which makes it much easier to see what's passing and what isn't. Check that the pass rates look reasonable given your `ScoreThreshold`.

**After this stage, if the user's intent was "setup" — STOP.** Report results and ask before proceeding. See "Setup vs. Iteration" above.

## Stage 7: Investigate Failures
Only proceed here if the user asked for iteration/fixing, or explicitly confirmed after setup.
When tests fail, the goal is to understand why, not to adjust thresholds until things pass. Investigation must be thorough and documented — the user needs to see the actual data, your reasoning, and your conclusion.
### Step 1: Get the detailed test output
```bash
pixie test pixie_qa/tests/ -v   # shows score and reasoning per case
```

Capture the full verbose output. For each failing case, note:
- The `eval_input` (what was sent)
- The `eval_output` (what the app produced)
- The `expected_output` (what was expected, if applicable)
- The evaluator score and reasoning
### Step 2: Inspect the trace data
For each failing case, look up the full trace to see what happened inside the app:

```python
from pixie import DatasetStore

store = DatasetStore()
ds = store.get("<dataset-name>")
for i, item in enumerate(ds.items):
    print(i, item.eval_metadata)  # trace_id is here
```

Then inspect the full span tree:

```python
import asyncio
from pixie import ObservationStore

async def inspect(trace_id: str):
    store = ObservationStore()
    roots = await store.get_trace(trace_id)
    for root in roots:
        print(root.to_text())  # full span tree: inputs, outputs, LLM messages

asyncio.run(inspect("the-trace-id-here"))
```

### Step 3: Root-cause analysis
Walk through the trace and identify exactly where the failure originates. Common patterns:
LLM-related failures (fix with prompt/model/eval changes):
| Symptom | Likely cause |
|---|---|
| Output is factually wrong despite correct tool results | Prompt doesn't instruct the LLM to use tool output faithfully |
| Agent routes to wrong tool/handoff | Routing prompt or handoff descriptions are ambiguous |
| Output format is wrong | Missing format instructions in prompt |
| LLM hallucinated instead of using tool | Prompt doesn't enforce tool usage |
Non-LLM failures (fix with traditional code changes, out of eval scope):
| Symptom | Likely cause |
|---|---|
| Tool returned wrong data | Bug in tool implementation — fix the tool, not the eval |
| Tool wasn't called at all due to keyword mismatch | Tool-selection logic is broken — fix the code |
| Database returned stale/wrong records | Data issue — fix independently |
| API call failed with error | Infrastructure issue |
For non-LLM failures: note them in the investigation log and recommend the code fix, but do not adjust eval expectations or thresholds to accommodate bugs in non-LLM code. The eval test should measure LLM quality assuming the rest of the system works correctly.
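The keyword-mismatch row is worth seeing in miniature. A hypothetical sketch of such a tool (all names assumed, not from any real codebase) shows why the fix is a code change rather than an eval change:

```python
def faq_lookup_tool(question: str, keywords: list[str]) -> str:
    """Illustrative keyword-matched FAQ tool."""
    if any(kw in question.lower() for kw in keywords):
        return "rows 5-8 Economy Plus with extra legroom"
    return "I'm sorry, I don't know."

buggy_keywords = ["seat", "seats", "seating", "plane"]
fixed_keywords = buggy_keywords + ["row", "rows", "legroom"]

question = "What rows have extra legroom?"
# The question contains none of the buggy keywords, so the tool falls
# through to its default response — a deterministic, non-LLM bug.
print(faq_lookup_tool(question, buggy_keywords))   # "I'm sorry, I don't know."
# Extending the keyword list fixes the tool; no prompt or threshold changes.
print(faq_lookup_tool(question, fixed_keywords))   # correct FAQ answer
```

The eval only surfaced this bug; the repair lives entirely in traditional code, and the eval's expectations stay unchanged.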
### Step 4: Document findings in MEMORY.md
Every failure investigation must be documented in `pixie_qa/MEMORY.md` in a structured format:

````markdown
## Investigation: <test_name> failure — <date>

**Test**: `test_faq_factuality` in `pixie_qa/tests/test_customer_service.py`
**Result**: 3/5 cases passed (60%), threshold was 80% ≥ 0.7

### Failing case 1: "What rows have extra legroom?"
- eval_input: `{"user_message": "What rows have extra legroom?"}`
- eval_output: "I'm sorry, I don't have the exact row numbers for extra legroom..."
- expected_output: "rows 5-8 Economy Plus with extra legroom"
- Evaluator score: 0.1 (FactualityEval)
- Evaluator reasoning: "The output claims not to know the answer while the reference clearly states rows 5-8..."

**Trace analysis**:
Inspected trace `abc123`. The span tree shows:
- Triage Agent routed to FAQ Agent ✓
- FAQ Agent called `faq_lookup_tool("What rows have extra legroom?")` ✓
- `faq_lookup_tool` returned "I'm sorry, I don't know..." ← root cause

**Root cause**: `faq_lookup_tool` (customer_service.py:112) uses keyword matching.
The seat FAQ entry is triggered by keywords `["seat", "seats", "seating", "plane"]`.
The question "What rows have extra legroom?" contains none of these keywords, so it
falls through to the default "I don't know" response.

**Classification**: Non-LLM failure — the keyword-matching tool is broken.
The LLM agent correctly routed to the FAQ agent and used the tool; the tool
itself returned wrong data.

**Fix**: Add `"row"`, `"rows"`, `"legroom"` to the seating keyword list in
`faq_lookup_tool` (customer_service.py:130). This is a traditional code fix,
not an eval/prompt change.

**Verification**: After fix, re-run:
```bash
python pixie_qa/scripts/build_dataset.py   # refresh dataset
pixie test pixie_qa/tests/ -k faq -v       # verify
```
````

### Step 5: Fix and re-run
Make the targeted change, rebuild the dataset if needed, and re-run. Always finish by giving the user the exact commands to verify:

```bash
pixie test pixie_qa/tests/test_<feature>.py -v
```

## Memory Template
```markdown
# Eval Notes: <Project Name>

## How the application works

### Entry point and execution flow
<How to start/run the app. Step-by-step flow from input to output.>

### Inputs to LLM calls
<For EACH LLM call, document: location in code, system prompt, dynamic content, available tools>

### Intermediate processing
<Steps between input and output: retrieval, routing, tool calls, etc. Code pointers for each.>

### Final output
<What the user sees. Format. Quality expectations.>

### Use cases
<Each scenario with examples of good/bad outputs:>
- <Use case 1>: <description>
  - Input example: ...
  - Good output: ...
  - Bad output: ...

## Evaluation plan

### What to evaluate and why
<Quality dimensions and rationale>

### Evaluators and criteria
| Test | Dataset | Evaluator | Criteria | Rationale |
|---|---|---|---|---|
| ... | ... | ... | ... | ... |

### Data needed for evaluation
<What data to capture, with code pointers>

## Datasets
| Dataset | Items | Purpose |
|---|---|---|
| ... | ... | ... |

## Investigation log

### <date> — <test_name> failure
<Full structured investigation as described in Stage 7>
```

---

## Reference
See `references/pixie-api.md` for all CLI commands, evaluator signatures, and the Python dataset/store API.