Loading...
Loading...
Compare original and translation side by side
| Task Type | Primary Metrics | Secondary Metrics |
|---|---|---|
| Binary classification (pass/fail) | Recall, Precision, F1 | Cohen's kappa |
| Ordinal scale (1-5 rating) | Spearman's rho, Kendall's tau | Cohen's kappa (weighted) |
| Pairwise preference | Agreement rate, Position consistency | Confidence calibration |
| Multi-label | Macro-F1, Micro-F1 | Per-label precision/recall |
| 任务类型 | 主要指标 | 次要指标 |
|---|---|---|
| 二元分类(通过/不通过) | 召回率、精确率、F1值 | Cohen's kappa系数 |
| 有序量表(1-5评分) | Spearman相关系数、Kendall tau系数 | 加权Cohen's kappa系数 |
| 成对偏好 | 一致率、位置一致性 | 置信度校准 |
| 多标签分类 | Macro-F1、Micro-F1 | 单标签精确率/召回率 |
Criterion: [Name]
Description: [What this criterion measures]
Weight: [Relative importance, 0-1]You are an expert evaluator assessing response quality.Criterion: [名称]
Description: [该标准衡量的内容]
Weight: [相对重要性,0-1]You are an expert evaluator assessing response quality.
Always require justification before the score in all scoring prompts because research shows this improves reliability by 15-25% compared to score-first approaches.
在所有评分提示词中,始终要求先提供理由再给出分数,因为研究表明,与先给分后说明理由的方式相比,这种方法可将评估可靠性提升15-25%。You are an expert evaluator comparing two AI responses.You are an expert evaluator comparing two AI responses.
**Confidence Calibration** — Map confidence to position consistency:
- Both passes agree: confidence = average of individual confidences
- Passes disagree: confidence = 0.5, verdict = TIE
**置信度校准** — 将置信度与位置一致性关联:
- 两轮结果一致:置信度 = 两轮单独置信度的平均值
- 两轮结果不一致:置信度 = 0.5,结论 = TIEIs there an objective ground truth?
+-- Yes -> Direct Scoring
| Examples: factual accuracy, instruction following, format compliance
|
+-- No -> Is it a preference or quality judgment?
+-- Yes -> Pairwise Comparison
| Examples: tone, style, persuasiveness, creativity
|
+-- No -> Consider reference-based evaluation
Examples: summarization (compare to source), translation (compare to reference)Is there an objective ground truth?
+-- Yes -> Direct Scoring
| Examples: factual accuracy, instruction following, format compliance
|
+-- No -> Is it a preference or quality judgment?
+-- Yes -> Pairwise Comparison
| Examples: tone, style, persuasiveness, creativity
|
+-- No -> Consider reference-based evaluation
Examples: summarization (compare to source), translation (compare to reference)Prompt: "What causes seasons on Earth?"
Response: "Seasons are caused by Earth's tilted axis. As Earth orbits the Sun,
different hemispheres receive more direct sunlight at different times of year."
Criterion: Factual Accuracy (weight: 1.0)
Scale: 1-5{
"criterion": "Factual Accuracy",
"score": 5,
"evidence": [
"Correctly identifies axial tilt as primary cause",
"Correctly explains differential sunlight by hemisphere",
"No factual errors present"
],
"justification": "Response accurately explains the cause of seasons with correct
scientific reasoning. Both the axial tilt and its effect on sunlight distribution
are correctly described.",
"improvement": "Could add the specific tilt angle (23.5 degrees) for completeness."
}Prompt: "What causes seasons on Earth?"
Response: "Seasons are caused by Earth's tilted axis. As Earth orbits the Sun,
different hemispheres receive more direct sunlight at different times of year."
Criterion: Factual Accuracy (weight: 1.0)
Scale: 1-5{
"criterion": "Factual Accuracy",
"score": 5,
"evidence": [
"Correctly identifies axial tilt as primary cause",
"Correctly explains differential sunlight by hemisphere",
"No factual errors present"
],
"justification": "Response accurately explains the cause of seasons with correct
scientific reasoning. Both the axial tilt and its effect on sunlight distribution
are correctly described.",
"improvement": "Could add the specific tilt angle (23.5 degrees) for completeness."
}Prompt: "Explain machine learning to a beginner"
Response A: [Technical explanation with jargon]
Response B: [Simple analogy-based explanation]
Criteria: ["clarity", "accessibility"]{ "winner": "B", "confidence": 0.8 }{ "winner": "A", "confidence": 0.6 }{ "winner": "B", "confidence": 0.6 }{
"winner": "B",
"confidence": 0.7,
"positionConsistency": {
"consistent": true,
"firstPassWinner": "B",
"secondPassWinner": "B"
}
}Prompt: "Explain machine learning to a beginner"
Response A: [Technical explanation with jargon]
Response B: [Simple analogy-based explanation]
Criteria: ["clarity", "accessibility"]{ "winner": "B", "confidence": 0.8 }{ "winner": "A", "confidence": 0.6 }{ "winner": "B", "confidence": 0.6 }{
"winner": "B",
"confidence": 0.7,
"positionConsistency": {
"consistent": true,
"firstPassWinner": "B",
"secondPassWinner": "B"
}
}criterionName: "Code Readability"
criterionDescription: "How easy the code is to understand and maintain"
domain: "software engineering"
scale: "1-5"
strictness: "balanced"{
"levels": [
{
"score": 1,
"label": "Poor",
"description": "Code is difficult to understand without significant effort",
"characteristics": [
"No meaningful variable or function names",
"No comments or documentation",
"Deeply nested or convoluted logic"
]
},
{
"score": 3,
"label": "Adequate",
"description": "Code is understandable with some effort",
"characteristics": [
"Most variables have meaningful names",
"Basic comments present for complex sections",
"Logic is followable but could be cleaner"
]
},
{
"score": 5,
"label": "Excellent",
"description": "Code is immediately clear and maintainable",
"characteristics": [
"All names are descriptive and consistent",
"Comprehensive documentation",
"Clean, modular structure"
]
}
],
"edgeCases": [
{
"situation": "Code is well-structured but uses domain-specific abbreviations",
"guidance": "Score based on readability for domain experts, not general audience"
}
]
}criterionName: "Code Readability"
criterionDescription: "How easy the code is to understand and maintain"
domain: "software engineering"
scale: "1-5"
strictness: "balanced"{
"levels": [
{
"score": 1,
"label": "Poor",
"description": "Code is difficult to understand without significant effort",
"characteristics": [
"No meaningful variable or function names",
"No comments or documentation",
"Deeply nested or convoluted logic"
]
},
{
"score": 3,
"label": "Adequate",
"description": "Code is understandable with some effort",
"characteristics": [
"Most variables have meaningful names",
"Basic comments present for complex sections",
"Logic is followable but could be cleaner"
]
},
{
"score": 5,
"label": "Excellent",
"description": "Code is immediately clear and maintainable",
"characteristics": [
"All names are descriptive and consistent",
"Comprehensive documentation",
"Clean, modular structure"
]
}
],
"edgeCases": [
{
"situation": "Code is well-structured but uses domain-specific abbreviations",
"guidance": "Score based on readability for domain experts, not general audience"
}
]
}