LLM-as-Judge
Build reliable automated evaluators that use an LLM to judge the outputs of another LLM pipeline. Each judge targets a single, binary (Pass/Fail) failure mode identified during error analysis.
When to Use LLM-as-Judge vs. Code
Choose the right evaluator type for each failure mode:
Use code-based evaluators when the failure is objective and deterministic:
- JSON/SQL syntax validity, regex/string matching, structural constraints, execution errors, logical checks.
- These are fast, cheap, deterministic, and interpretable.
Use LLM-as-Judge when the failure requires interpretation or nuance:
- Tone appropriateness, summary faithfulness, response helpfulness, explanation clarity, creative quality.
- These require a separate LLM (distinct from the application) to judge outputs.
Each failure mode gets its own dedicated evaluator. Never combine multiple criteria into a single judge prompt—this introduces ambiguity and makes diagnosis harder.
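The code-based side of this split can be illustrated with a minimal sketch. The function name and the JSON-validity criterion are illustrative assumptions, not from the source:

```python
import json

def judge_json_validity(output: str) -> bool:
    """Code-based evaluator for one failure mode: Pass iff the pipeline
    output parses as a JSON object. Fast, cheap, deterministic, and
    needs no LLM call."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict)
```

A subjective criterion such as tone appropriateness has no equivalent deterministic check, which is exactly when an LLM judge is warranted.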
The Full Workflow
1. Write Prompt Template
2. Split Labeled Data (Train / Dev / Test)
3. Iteratively Refine Prompt (measure TPR/TNR on Dev)
4. Estimate & Correct Success Rate (on Test + Unlabeled)
Step 1: Write the Judge Prompt
A well-structured judge prompt has four essential components. Read references/prompt-template.md for a complete annotated example.
1. Clear Task and Evaluation Criterion
Focus on ONE well-scoped failure mode. Vague tasks lead to unreliable judgments.
- ❌ "Is this email good?"
- ✅ "Is the tone appropriate for a luxury buyer persona?"
2. Precise Pass/Fail Definitions
Define what counts as Pass (failure absent) and Fail (failure present), grounded in the failure descriptions from error analysis. Be specific about boundary conditions.
3. Few-Shot Examples
Include labeled examples that clearly Pass and clearly Fail. These calibrate the judge's decision boundary. Best drawn from human-labeled traces.
- Use clear-cut cases, not edge cases, for initial examples.
- For binary judgments, include at least one Pass and one Fail example.
- If using finer-grained scales (e.g., 1–3 severity), include examples for every point on the scale.
4. Structured Output Format
The judge responds in a consistent, machine-readable format:
```json
{
  "reasoning": "1-2 sentence explanation for the decision.",
  "answer": "Pass"
}
```
The `reasoning` field comes first: this induces chain-of-thought before the verdict, improving accuracy.
Step 2: Split Labeled Data
Designing a judge resembles training a classifier, except "training" happens through prompt engineering. Split your human-labeled traces into three disjoint sets:
| Set | Purpose | Typical Allocation |
|---|---|---|
| Training | Pool of candidates for few-shot examples in the prompt | 10–20% |
| Dev | Iteratively refine the prompt; measure agreement with human labels | 40–45% |
| Test | Final, unbiased measurement of judge accuracy (TPR/TNR) | 40–45% |
Key rules:
- Dev examples must never appear in the prompt. This ensures generalization measurement.
- Test examples are held out until the prompt is finalized. Never look at them during development.
- In-context learning typically saturates after 1–8 well-chosen examples. Allocate more data to evaluation.
- Both Dev and Test should contain enough Pass and Fail examples—ideally 30–50 of each.
- Reusing examples across splits leads to overfitting and inflated accuracy.
If you have ~100 labeled traces (50 Pass, 50 Fail), a reasonable split: 10 training, 40 dev, 50 test.
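The split above can be sketched as follows; `split_traces` is a hypothetical helper, and a production version should additionally stratify by label so each set keeps enough Pass and Fail examples:

```python
import random

def split_traces(traces, seed=0, n_train=10, n_dev=40):
    """Shuffle labeled traces and split into disjoint Train/Dev/Test
    pools, mirroring the ~100-trace example above (10/40/50)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = traces[:]
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test
```

Because the three slices come from one shuffled list, no trace can appear in more than one set, which enforces the no-reuse rule mechanically.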
Step 3: Iteratively Refine the Prompt
This is the core loop. Think of it as tuning a classifier, but by revising text instead of adjusting parameters.
The Refinement Loop
- Write a baseline prompt using the four components above, with a few examples from the Training set.
- Run the judge on the Dev set. Compare each judgment to human ground truth.
- Measure agreement using TPR and TNR:
- TPR = (actual Passes correctly judged Pass) / (total actual Passes)
- TNR = (actual Fails correctly judged Fail) / (total actual Fails)
- Inspect disagreements. Review false passes (judge said Pass, human said Fail) and false fails. Identify ambiguous criteria or missing edge cases.
- Refine the prompt: Clarify Pass/Fail definitions, swap in better few-shot examples from Training, add representative edge cases.
- Repeat until TPR and TNR stabilize at acceptable levels.
Why TPR and TNR (Not Precision/Recall)
The end goal is estimating the true pass rate of the pipeline. A judge can only mis-estimate this in two ways: missing real Passes (lowers the observed rate) or passing real Fails (inflates it). TPR and TNR capture these two error modes directly.
When to Stop
Stop when TPR and TNR reach satisfactory levels (typically >90%). Missing a real failure may be costlier than flagging a false one—adjust thresholds to your application's risk tolerance.
If Alignment Stalls
- Use a more capable LLM — a larger model may resolve subtle errors.
- Decompose the criterion — break a complex failure into smaller, atomic checks.
- Improve labeled data — add diverse, high-quality examples, especially edge cases.
- Verify label quality — sometimes the issue is inconsistent or incorrect human labels.
Manual iteration is recommended before automation (e.g., DSPy). It builds intuition about both the failure mode and the judge's behavior. Writing the prompt forces you to externalize your specification.
Step 4: Estimate True Success Rates
After finalizing the prompt, freeze it and run on the Test set to get TPR and TNR. Then use the judge on unlabeled production traces with bias correction.
Read references/success-rate-estimation.md for the full procedure, formula, Python code, and confidence interval calculation.
Quick Reference
- Measure judge accuracy on Test set → TPR, TNR
- Observe raw success rate on unlabeled data → p_obs = k/m
- Correct for bias using the Rogan-Gladen formula: θ̂ = (p_obs + TNR - 1) / (TPR + TNR - 1), clipped to [0, 1]
- Bootstrap confidence interval: resample Test set labels B times, recompute the corrected rate each time, take the 2.5th/97.5th percentiles.
If TPR + TNR - 1 ≈ 0, the judge is no better than random chance and correction is invalid.
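The correction step can be sketched as follows (hypothetical helper; the bootstrap loop is described in the referenced file and omitted here for brevity):

```python
def corrected_pass_rate(p_obs, tpr, tnr):
    """Rogan-Gladen correction of the observed pass rate p_obs = k/m.
    Returns the bias-corrected estimate, clipped to [0, 1]."""
    denom = tpr + tnr - 1
    if abs(denom) < 1e-6:
        # Judge is no better than random chance; correction is invalid.
        raise ValueError("TPR + TNR - 1 ≈ 0; cannot correct")
    theta = (p_obs + tnr - 1) / denom
    return min(1.0, max(0.0, theta))
```

With a perfect judge (TPR = TNR = 1) the correction is the identity, which is a useful sanity check.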
Key Insight
Improving TPR (the judge's ability to identify true successes) narrows the confidence interval the most. Judge errors mainly inflate uncertainty rather than shifting the corrected estimate.
Common Pitfalls
- Omitting examples from the prompt. Without concrete examples, the judge lacks grounding. This is the most common mistake.
- Evaluating multiple criteria in a single prompt. Break complex metrics into narrower, specific prompts for better alignment and diagnosability.
- Skipping alignment validation. Don't assume the judge "just works." Domain-specific criteria require prompt refinement and human-labeled validation.
- Overfitting to labeled traces. If few-shot examples also appear in the evaluation set, TPR/TNR will be inflated. Any trace used in the prompt must be excluded from Dev and Test.
- Never revisiting the judge. Production data drifts, new failure modes emerge, and LLM updates shift behavior. Periodically re-validate.
- Not pinning the judge model version. In CI pipelines, pin the exact model version (e.g., claude-sonnet-4-5-20250929) to prevent results from fluctuating due to unannounced updates.
Long-Document Considerations
When judging outputs from long-document pipelines:
- Don't feed the full document into the judge — use only the relevant portion (e.g., the source paragraph a summary came from).
- Consider chunk-level evaluation with aggregated per-chunk judgments.
- Make rubrics especially clear about what "correct" means since the judge won't see the full context.
CI Integration
For continuous integration, build a golden dataset of curated input examples with reference outputs. On each pipeline change:
- Run all golden inputs through the pipeline.
- Evaluate outputs with your suite of automated evaluators (code-based + LLM-as-Judge).
- Pin the judge model version to prevent CI flicker.
- Include examples covering core features, known failure modes, and edge cases.
This catches regressions but does not predict overall production accuracy — its purpose is stability as the pipeline evolves.
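A possible shape for such a CI run, with `pipeline` and the evaluator callables as hypothetical stand-ins for your own components:

```python
def run_ci_suite(golden_examples, pipeline, evaluators):
    """Run every golden input through the pipeline and apply the full
    evaluator suite (code-based checks and LLM judges alike).
    Each evaluator returns True (Pass) or False (Fail).
    Returns (example_id, evaluator_name) pairs for every failure."""
    failures = []
    for example in golden_examples:
        output = pipeline(example["input"])
        for name, evaluate in evaluators.items():
            if not evaluate(output, example):
                failures.append((example["id"], name))
    return failures
```

A CI job would fail the build when the returned list is non-empty, reporting exactly which golden example regressed under which evaluator.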