ai-evals

AI Evals

Help the user create systematic evaluations for AI products using insights from AI practitioners.

How to Help

When the user asks for help with AI evals:
  1. Understand what they're evaluating - Ask what AI feature or model they're testing and what "good" looks like
  2. Help design the eval approach - Suggest rubrics, test cases, and measurement methods (see the sketch after this list)
  3. Guide implementation - Help them think through edge cases, scoring criteria, and iteration cycles
  4. Connect to product requirements - Ensure evals align with actual user needs, not just technical metrics
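
A minimal sketch of what "rubrics and test cases" can look like in code, assuming a Python harness. The prompts, criteria names, and the `run_model` hook are hypothetical placeholders; the point is that each test case pairs an input with specific, checkable criteria rather than a vague notion of "good".

```python
# Minimal eval-harness sketch (all names are hypothetical).
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class TestCase:
    prompt: str
    # Each criterion is a named, binary check on the model output.
    criteria: dict[str, Callable[[str], bool]] = field(default_factory=dict)

def run_eval(run_model: Callable[[str], str], cases: list[TestCase]) -> None:
    """Run every case and print a Pass/Fail verdict per criterion."""
    for case in cases:
        output = run_model(case.prompt)
        for name, check in case.criteria.items():
            verdict = "PASS" if check(output) else "FAIL"
            print(f"{verdict}  {name}  ({case.prompt!r})")

# Example cases for a hypothetical email-drafting feature.
cases = [
    TestCase(
        prompt="Draft a reply declining the meeting invite",
        criteria={
            "explicitly_declines": lambda out: "decline" in out.lower(),
            "under_100_words": lambda out: len(out.split()) < 100,
        },
    ),
]
# run_eval(my_model_call, cases)  # plug in your own model call here
```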

Core Principles

Evals are the new PRD

Brendan Foody: "If the model is the product, then the eval is the product requirement document." Evals define what success looks like in AI products—they're not optional quality checks, they're core specifications.

Evals are a core product skill

Hamel Husain & Shreya Shankar: "Both the chief product officers of Anthropic and OpenAI shared that evals are becoming the most important new skill for product builders." This isn't just for ML engineers—product people need to master this.

The workflow matters

Building good evals involves error analysis, open coding (writing down what's wrong), clustering failure patterns, and creating rubrics. It's a systematic process, not a one-time test.
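
A sketch of that workflow on invented data: write one short note per failed trace (open coding), then tally the notes into recurring failure patterns that later become rubric items. The notes and categories below are illustrative only.

```python
# Open coding -> clustering failure patterns (illustrative data).
from collections import Counter

# Step 1, open coding: one short free-text note per failed trace,
# written down while manually reviewing outputs.
failure_notes = [
    "ignored the user's stated word limit",
    "hallucinated a meeting time that was never mentioned",
    "ignored the user's stated word limit",
    "tone too formal for a casual request",
]

# Step 2, clustering: group similar notes into named failure modes.
# A human normally does this grouping; exact-match counting here just
# stands in for that judgment.
for pattern, count in Counter(failure_notes).most_common():
    print(f"{count}x  {pattern}")

# Step 3: the most frequent clusters become rubric items, e.g.
# "respects explicit length constraints" as a Pass/Fail check.
```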

Questions to Help Users

  • "What does 'good' look like for this AI output?"
  • "What are the most common failure modes you've seen?"
  • "How will you know if the model got better or worse?"
  • "Are you measuring what users actually care about?"
  • "Have you manually reviewed enough outputs to understand failure patterns?"

Common Mistakes to Flag

  • Skipping manual review - You can't write good evals without first understanding failure patterns through manual trace analysis
  • Using vague criteria - "The output should be good" isn't an eval; you need specific, measurable criteria
  • LLM-as-judge without validation - If using an LLM to judge, you must validate that judge against human experts (see the sketch after this list)
  • Likert scales over binary - Force Pass/Fail decisions; 1-5 scales produce meaningless averages
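
To make the last two points concrete, here is a sketch of validating an LLM judge against human Pass/Fail labels on a shared sample and reporting simple agreement. The labels are invented for illustration; in practice the human labels come from domain experts reviewing the same outputs the judge scored.

```python
# Sketch: validate an LLM-as-judge against human Pass/Fail labels.
# Both label lists are invented; align them index-by-index on the
# same set of reviewed outputs.
human_labels = [True, True, False, True, False, False, True, True]   # expert verdicts
judge_labels = [True, True, False, False, False, False, True, True]  # LLM judge verdicts

agreements = sum(h == j for h, j in zip(human_labels, judge_labels))
print(f"Judge/human agreement: {agreements / len(human_labels):.0%} "
      f"on {len(human_labels)} samples")

# Binary Pass/Fail keeps this comparison meaningful: the judge either
# agreed with the expert or it didn't. Averaging 1-5 Likert scores
# hides disagreement behind numbers that raters don't use consistently.
```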

Deep Dive

For all 2 insights from 2 guests, see references/guest-insights.md

Related Skills

  • Building with LLMs
  • AI Product Strategy
  • Evaluating New Technology