phoenix-evals
Build evaluators for AI/LLM applications. Code first, LLM for nuance, validate against humans.
Quick Reference
| Task | Files |
|---|---|
| Setup | |
| Build code evaluator | |
| Build LLM evaluator | |
| Run experiment | |
| Create dataset | |
| Validate evaluator | |
| Analyze errors | |
| RAG evals | |
| Production | |
Workflows
Starting Fresh:
`observe-tracing-setup` → `error-analysis` → `axial-coding` → `evaluators-overview`

Building Evaluator:
`fundamentals` → `evaluators-{code|llm}-{python|typescript}` → `validation-calibration-{python|typescript}`

RAG Systems:
`evaluators-rag` → `evaluators-code-*` (retrieval) → `evaluators-llm-*` (faithfulness)

Production:
`production-overview` → `production-guardrails` → `production-continuous`
Rule Categories
| Prefix | Description |
|---|---|
| | Types, scores, anti-patterns |
| | Tracing, sampling |
| | Finding failures |
| | Categorizing failures |
| | Code, LLM, RAG evaluators |
| | Datasets, running experiments |
| | Calibrating judges |
| | CI/CD, monitoring |
Key Principles
| Principle | Action |
|---|---|
| Error analysis first | Can't automate what you haven't observed |
| Custom > generic | Build from your failures |
| Code first | Deterministic before LLM |
| Validate judges | >80% TPR/TNR against human labels |
| Binary > Likert | Pass/fail, not 1-5 |
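The "Code first" and "Binary > Likert" rows above can be sketched together as a deterministic pass/fail evaluator that runs before any LLM judge is involved. This is an illustrative example, not part of the phoenix-evals API; the check (`has_valid_citations`) and its rule are hypothetical.

```python
import re

def has_valid_citations(output: str) -> bool:
    """Deterministic binary check: every bracketed citation is numeric.

    A code-first evaluator: no LLM call, and it returns pass/fail rather
    than a 1-5 score, so results are cheap, reproducible, and easy to
    aggregate across an experiment.
    """
    citations = re.findall(r"\[([^\]]+)\]", output)
    if not citations:
        return False  # fail: the answer cites nothing at all
    return all(c.isdigit() for c in citations)

print(has_valid_citations("Paris is the capital [1], founded long ago [2]."))  # True
print(has_valid_citations("Paris is the capital [source needed]."))            # False
```

Only the nuanced residue that such deterministic checks cannot express (e.g. "is the cited passage actually relevant?") should fall through to an LLM judge.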
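The "Validate judges" threshold above (>80% TPR/TNR) is computed by comparing the judge's pass/fail verdicts to human labels on the same examples. A minimal sketch, assuming binary labels; the function name and data are hypothetical, not phoenix-evals API:

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> tuple[float, float]:
    """True-positive and true-negative rates of an LLM judge
    measured against human pass/fail labels."""
    tp = sum(h and j for h, j in zip(human, judge))            # both say pass
    tn = sum((not h) and (not j) for h, j in zip(human, judge))  # both say fail
    pos = sum(human)
    neg = len(human) - pos
    tpr = tp / pos if pos else 0.0
    tnr = tn / neg if neg else 0.0
    return tpr, tnr

# Toy labeled set: human verdicts vs. judge verdicts for five outputs.
human = [True, True, True, False, False]
judge = [True, True, False, False, True]
tpr, tnr = judge_agreement(human, judge)
# tpr ≈ 0.67, tnr = 0.50 — below the 80% bar, so this judge
# needs prompt calibration before it can replace human review.
```

Tracking both rates matters: a judge that passes everything has a perfect TPR but a useless TNR, which a single accuracy number would hide.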