LLM Evaluation
Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.
When to Use This Skill
- Measuring LLM application performance systematically
- Comparing different models or prompts
- Detecting performance regressions before deployment
- Validating improvements from prompt changes
- Building confidence in production systems
- Establishing baselines and tracking progress over time
- Debugging unexpected model behavior
Core Evaluation Types
1. Automated Metrics
Fast, repeatable, scalable evaluation using computed scores.
Text Generation:
- BLEU: N-gram overlap (translation)
- ROUGE: Recall-oriented (summarization)
- METEOR: Semantic similarity
- BERTScore: Embedding-based similarity
- Perplexity: Language model confidence
Classification:
- Accuracy: Percentage correct
- Precision/Recall/F1: Class-specific performance
- Confusion Matrix: Error patterns
- AUC-ROC: Ranking quality
Retrieval (RAG):
- MRR: Mean Reciprocal Rank
- NDCG: Normalized Discounted Cumulative Gain
- Precision@K: Relevant in top K
- Recall@K: Fraction of all relevant items retrieved in the top K (see the sketch after this list)
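A minimal sketch of these retrieval metrics, assuming each query yields a ranked list of retrieved document IDs plus a set of relevant IDs (the function names here are illustrative, not from a specific library):

```python
import numpy as np

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved items that are relevant."""
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top k."""
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / max(len(relevant), 1)

def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average of 1/rank of the first relevant hit per query (0 if no hit)."""
    reciprocal_ranks = []
    for retrieved, relevant in runs:
        rank = next((i + 1 for i, doc in enumerate(retrieved) if doc in relevant), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    return float(np.mean(reciprocal_ranks))
```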
2. Human Evaluation
Manual assessment for quality aspects difficult to automate.
Dimensions:
- Accuracy: Factual correctness
- Coherence: Logical flow
- Relevance: Answers the question
- Fluency: Natural language quality
- Safety: No harmful content
- Helpfulness: Useful to the user
3. LLM-as-Judge
Use stronger LLMs to evaluate weaker model outputs.
Approaches:
- Pointwise: Score individual responses
- Pairwise: Compare two responses
- Reference-based: Compare to gold standard
- Reference-free: Judge without ground truth
Quick Start
```python
from dataclasses import dataclass
from typing import Callable

import numpy as np

@dataclass
class Metric:
    name: str
    fn: Callable

    @staticmethod
    def accuracy():
        return Metric("accuracy", calculate_accuracy)

    @staticmethod
    def bleu():
        return Metric("bleu", calculate_bleu)

    @staticmethod
    def bertscore():
        return Metric("bertscore", calculate_bertscore)

    @staticmethod
    def custom(name: str, fn: Callable):
        return Metric(name, fn)

class EvaluationSuite:
    def __init__(self, metrics: list[Metric]):
        self.metrics = metrics

    async def evaluate(self, model, test_cases: list[dict]) -> dict:
        results = {m.name: [] for m in self.metrics}
        for test in test_cases:
            prediction = await model.predict(test["input"])
            for metric in self.metrics:
                score = metric.fn(
                    prediction=prediction,
                    reference=test.get("expected"),
                    context=test.get("context")
                )
                results[metric.name].append(score)
        return {
            "metrics": {k: np.mean(v) for k, v in results.items()},
            "raw_scores": results
        }
```
Usage
```python
suite = EvaluationSuite([
    Metric.accuracy(),
    Metric.bleu(),
    Metric.bertscore(),
    Metric.custom("groundedness", calculate_groundedness)
])

test_cases = [
    {
        "input": "What is the capital of France?",
        "expected": "Paris",
        "context": "France is a country in Europe. Paris is its capital."
    },
]

results = await suite.evaluate(model=your_model, test_cases=test_cases)
```
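The suite assumes each metric function accepts `prediction`, `reference`, and `context` keyword arguments and returns a numeric score. `calculate_accuracy`, referenced by `Metric.accuracy()`, is not defined elsewhere in this guide; a minimal exact-match sketch might look like this (one reasonable definition, not the only one):

```python
def calculate_accuracy(prediction: str, reference: str, **kwargs) -> float:
    """Exact-match accuracy for a single example: 1.0 if the normalized
    prediction equals the normalized reference, else 0.0."""
    if reference is None:
        return 0.0
    return float(prediction.strip().lower() == reference.strip().lower())
```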
Automated Metrics Implementation
BLEU Score
```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def calculate_bleu(reference: str, hypothesis: str, **kwargs) -> float:
    """Calculate BLEU score between reference and hypothesis."""
    smoothie = SmoothingFunction().method4
    return sentence_bleu(
        [reference.split()],
        hypothesis.split(),
        smoothing_function=smoothie
    )
```
ROUGE Score
```python
from rouge_score import rouge_scorer

def calculate_rouge(reference: str, hypothesis: str, **kwargs) -> dict:
    """Calculate ROUGE scores."""
    scorer = rouge_scorer.RougeScorer(
        ['rouge1', 'rouge2', 'rougeL'],
        use_stemmer=True
    )
    scores = scorer.score(reference, hypothesis)
    return {
        'rouge1': scores['rouge1'].fmeasure,
        'rouge2': scores['rouge2'].fmeasure,
        'rougeL': scores['rougeL'].fmeasure
    }
```
BERTScore
```python
from bert_score import score

def calculate_bertscore(
    references: list[str],
    hypotheses: list[str],
    **kwargs
) -> dict:
    """Calculate BERTScore using pre-trained model."""
    P, R, F1 = score(
        hypotheses,
        references,
        lang='en',
        model_type='microsoft/deberta-xlarge-mnli'
    )
    return {
        'precision': P.mean().item(),
        'recall': R.mean().item(),
        'f1': F1.mean().item()
    }
```
Custom Metrics
```python
def calculate_groundedness(response: str, context: str, **kwargs) -> float:
    """Check if response is grounded in provided context."""
    from transformers import pipeline
    nli = pipeline(
        "text-classification",
        model="microsoft/deberta-large-mnli"
    )
    result = nli(f"{context} [SEP] {response}")[0]
    # Return confidence that response is entailed by context
    return result['score'] if result['label'] == 'ENTAILMENT' else 0.0

def calculate_toxicity(text: str, **kwargs) -> float:
    """Measure toxicity in generated text."""
    from detoxify import Detoxify
    results = Detoxify('original').predict(text)
    return max(results.values())  # Return highest toxicity score

def calculate_factuality(claim: str, sources: list[str], **kwargs) -> float:
    """Verify factual claims against sources."""
    from transformers import pipeline
    nli = pipeline("text-classification", model="facebook/bart-large-mnli")
    scores = []
    for source in sources:
        result = nli(f"{source}</s></s>{claim}")[0]
        if result['label'] == 'entailment':
            scores.append(result['score'])
    return max(scores) if scores else 0.0
```
LLM-as-Judge Patterns
Single Output Evaluation
```python
from anthropic import Anthropic
from pydantic import BaseModel, Field
import json

class QualityRating(BaseModel):
    accuracy: int = Field(ge=1, le=10, description="Factual correctness")
    helpfulness: int = Field(ge=1, le=10, description="Answers the question")
    clarity: int = Field(ge=1, le=10, description="Well-written and understandable")
    reasoning: str = Field(description="Brief explanation")

async def llm_judge_quality(
    response: str,
    question: str,
    context: str = None
) -> QualityRating:
    """Use Claude to judge response quality."""
    client = Anthropic()

    system = """You are an expert evaluator of AI responses.
Rate responses on accuracy, helpfulness, and clarity (1-10 scale).
Provide brief reasoning for your ratings."""

    prompt = f"""Rate the following response:

Question: {question}
{f'Context: {context}' if context else ''}
Response: {response}

Provide ratings in JSON format:
{{
    "accuracy": <1-10>,
    "helpfulness": <1-10>,
    "clarity": <1-10>,
    "reasoning": "<brief explanation>"
}}"""

    message = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=500,
        system=system,
        messages=[{"role": "user", "content": prompt}]
    )

    return QualityRating(**json.loads(message.content[0].text))
```
Pairwise Comparison
```python
from anthropic import Anthropic
from pydantic import BaseModel, Field
from typing import Literal
import json

class ComparisonResult(BaseModel):
    winner: Literal["A", "B", "tie"]
    reasoning: str
    confidence: int = Field(ge=1, le=10)

async def compare_responses(
    question: str,
    response_a: str,
    response_b: str
) -> ComparisonResult:
    """Compare two responses using LLM judge."""
    client = Anthropic()

    prompt = f"""Compare these two responses and determine which is better.

Question: {question}

Response A: {response_a}

Response B: {response_b}

Consider accuracy, helpfulness, and clarity.

Answer with JSON:
{{
    "winner": "A" or "B" or "tie",
    "reasoning": "<explanation>",
    "confidence": <1-10>
}}"""

    message = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )

    return ComparisonResult(**json.loads(message.content[0].text))
```
Reference-Based Evaluation
```python
from anthropic import Anthropic
from pydantic import BaseModel, Field
import json

class ReferenceEvaluation(BaseModel):
    semantic_similarity: float = Field(ge=0, le=1)
    factual_accuracy: float = Field(ge=0, le=1)
    completeness: float = Field(ge=0, le=1)
    issues: list[str]

async def evaluate_against_reference(
    response: str,
    reference: str,
    question: str
) -> ReferenceEvaluation:
    """Evaluate response against gold standard reference."""
    client = Anthropic()

    prompt = f"""Compare the response to the reference answer.

Question: {question}

Reference Answer: {reference}

Response to Evaluate: {response}

Evaluate:
1. Semantic similarity (0-1): How similar is the meaning?
2. Factual accuracy (0-1): Are all facts correct?
3. Completeness (0-1): Does it cover all key points?
4. List any specific issues or errors.

Respond in JSON:
{{
    "semantic_similarity": <0-1>,
    "factual_accuracy": <0-1>,
    "completeness": <0-1>,
    "issues": ["issue1", "issue2"]
}}"""

    message = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )

    return ReferenceEvaluation(**json.loads(message.content[0].text))
```
Human Evaluation Frameworks
Annotation Guidelines
```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AnnotationTask:
    """Structure for human annotation task."""
    response: str
    question: str
    context: Optional[str] = None

    def get_annotation_form(self) -> dict:
        return {
            "question": self.question,
            "context": self.context,
            "response": self.response,
            "ratings": {
                "accuracy": {
                    "scale": "1-5",
                    "description": "Is the response factually correct?"
                },
                "relevance": {
                    "scale": "1-5",
                    "description": "Does it answer the question?"
                },
                "coherence": {
                    "scale": "1-5",
                    "description": "Is it logically consistent?"
                }
            },
            "issues": {
                "factual_error": False,
                "hallucination": False,
                "off_topic": False,
                "unsafe_content": False
            },
            "feedback": ""
        }
```
Inter-Rater Agreement
```python
from sklearn.metrics import cohen_kappa_score

def calculate_agreement(
    rater1_scores: list[int],
    rater2_scores: list[int]
) -> dict:
    """Calculate inter-rater agreement."""
    kappa = cohen_kappa_score(rater1_scores, rater2_scores)

    if kappa < 0:
        interpretation = "Poor"
    elif kappa < 0.2:
        interpretation = "Slight"
    elif kappa < 0.4:
        interpretation = "Fair"
    elif kappa < 0.6:
        interpretation = "Moderate"
    elif kappa < 0.8:
        interpretation = "Substantial"
    else:
        interpretation = "Almost Perfect"

    return {
        "kappa": kappa,
        "interpretation": interpretation
    }
```
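A quick usage sketch, assuming two annotators rated the same eight responses on the 1-5 scale above (the ratings are illustrative):

```python
rater1 = [5, 4, 4, 3, 5, 2, 4, 3]
rater2 = [5, 4, 3, 3, 5, 2, 4, 4]

agreement = calculate_agreement(rater1, rater2)
print(agreement["kappa"], agreement["interpretation"])  # kappa plus its qualitative band
```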
A/B Testing
Statistical Testing Framework
```python
from scipy import stats
import numpy as np
from dataclasses import dataclass, field

@dataclass
class ABTest:
    variant_a_name: str = "A"
    variant_b_name: str = "B"
    variant_a_scores: list[float] = field(default_factory=list)
    variant_b_scores: list[float] = field(default_factory=list)

    def add_result(self, variant: str, score: float):
        """Add evaluation result for a variant."""
        if variant == "A":
            self.variant_a_scores.append(score)
        else:
            self.variant_b_scores.append(score)

    def analyze(self, alpha: float = 0.05) -> dict:
        """Perform statistical analysis."""
        a_scores = np.array(self.variant_a_scores)
        b_scores = np.array(self.variant_b_scores)

        # T-test
        t_stat, p_value = stats.ttest_ind(a_scores, b_scores)

        # Effect size (Cohen's d)
        pooled_std = np.sqrt((np.std(a_scores)**2 + np.std(b_scores)**2) / 2)
        cohens_d = (np.mean(b_scores) - np.mean(a_scores)) / pooled_std

        return {
            "variant_a_mean": np.mean(a_scores),
            "variant_b_mean": np.mean(b_scores),
            "difference": np.mean(b_scores) - np.mean(a_scores),
            "relative_improvement": (np.mean(b_scores) - np.mean(a_scores)) / np.mean(a_scores),
            "p_value": p_value,
            "statistically_significant": p_value < alpha,
            "cohens_d": cohens_d,
            "effect_size": self._interpret_cohens_d(cohens_d),
            "winner": self.variant_b_name if np.mean(b_scores) > np.mean(a_scores) else self.variant_a_name
        }

    @staticmethod
    def _interpret_cohens_d(d: float) -> str:
        """Interpret Cohen's d effect size."""
        abs_d = abs(d)
        if abs_d < 0.2:
            return "negligible"
        elif abs_d < 0.5:
            return "small"
        elif abs_d < 0.8:
            return "medium"
        else:
            return "large"
```
Regression Testing
Regression Detection
```python
from dataclasses import dataclass

@dataclass
class RegressionResult:
    metric: str
    baseline: float
    current: float
    change: float
    is_regression: bool

class RegressionDetector:
    def __init__(self, baseline_results: dict, threshold: float = 0.05):
        self.baseline = baseline_results
        self.threshold = threshold

    def check_for_regression(self, new_results: dict) -> dict:
        """Detect if new results show regression."""
        regressions = []

        for metric in self.baseline.keys():
            baseline_score = self.baseline[metric]
            new_score = new_results.get(metric)
            if new_score is None:
                continue

            # Calculate relative change
            relative_change = (new_score - baseline_score) / baseline_score

            # Flag if significant decrease
            is_regression = relative_change < -self.threshold

            if is_regression:
                regressions.append(RegressionResult(
                    metric=metric,
                    baseline=baseline_score,
                    current=new_score,
                    change=relative_change,
                    is_regression=True
                ))

        return {
            "has_regression": len(regressions) > 0,
            "regressions": regressions,
            "summary": f"{len(regressions)} metric(s) regressed"
        }
```
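A usage sketch for a pre-deployment gate, assuming baseline metrics were saved from a previous evaluation run (the numbers are illustrative):

```python
baseline = {"accuracy": 0.86, "bleu": 0.42, "groundedness": 0.91}
current = {"accuracy": 0.84, "bleu": 0.43, "groundedness": 0.80}

detector = RegressionDetector(baseline_results=baseline, threshold=0.05)
report = detector.check_for_regression(current)

if report["has_regression"]:
    for r in report["regressions"]:
        print(f"{r.metric}: {r.baseline:.2f} -> {r.current:.2f} ({r.change:+.1%})")
    raise SystemExit(1)  # fail the CI job so the regression blocks deployment
```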
LangSmith Evaluation Integration
```python
from langsmith import Client
from langsmith.evaluation import evaluate, LangChainStringEvaluator

# Initialize LangSmith client
client = Client()

# Create dataset (questions / expected_answers come from your own test data)
dataset = client.create_dataset("qa_test_cases")
client.create_examples(
    inputs=[{"question": q} for q in questions],
    outputs=[{"answer": a} for a in expected_answers],
    dataset_id=dataset.id
)

# Define evaluators
evaluators = [
    LangChainStringEvaluator("qa"),          # QA correctness
    LangChainStringEvaluator("context_qa"),  # Context-grounded QA
    LangChainStringEvaluator("cot_qa"),      # Chain-of-thought QA
]

# Run evaluation
async def target_function(inputs: dict) -> dict:
    result = await your_chain.ainvoke(inputs)
    return {"answer": result}

experiment_results = await evaluate(
    target_function,
    data=dataset.name,
    evaluators=evaluators,
    experiment_prefix="v1.0.0",
    metadata={"model": "claude-sonnet-4-5", "version": "1.0.0"}
)

print(f"Mean score: {experiment_results.aggregate_metrics['qa']['mean']}")
```
Benchmarking
Running Benchmarks
```python
from dataclasses import dataclass
import numpy as np

@dataclass
class BenchmarkResult:
    metric: str
    mean: float
    std: float
    min: float
    max: float

class BenchmarkRunner:
    def __init__(self, benchmark_dataset: list[dict]):
        self.dataset = benchmark_dataset

    async def run_benchmark(
        self,
        model,
        metrics: list[Metric]
    ) -> dict[str, BenchmarkResult]:
        """Run model on benchmark and calculate metrics."""
        results = {metric.name: [] for metric in metrics}

        for example in self.dataset:
            # Generate prediction
            prediction = await model.predict(example["input"])

            # Calculate each metric
            for metric in metrics:
                score = metric.fn(
                    prediction=prediction,
                    reference=example["reference"],
                    context=example.get("context")
                )
                results[metric.name].append(score)

        # Aggregate results
        return {
            metric: BenchmarkResult(
                metric=metric,
                mean=np.mean(scores),
                std=np.std(scores),
                min=min(scores),
                max=max(scores)
            )
            for metric, scores in results.items()
        }
```
Resources
Best Practices
- Multiple Metrics: Use diverse metrics for comprehensive view
- Representative Data: Test on real-world, diverse examples
- Baselines: Always compare against baseline performance
- Statistical Rigor: Use proper statistical tests for comparisons
- Continuous Evaluation: Integrate into CI/CD pipeline
- Human Validation: Combine automated metrics with human judgment
- Error Analysis: Investigate failures to understand weaknesses
- Version Control: Track evaluation results over time
Common Pitfalls
- Single Metric Obsession: Optimizing for one metric at the expense of others
- Small Sample Size: Drawing conclusions from too few examples
- Data Contamination: Testing on training data
- Ignoring Variance: Not accounting for statistical uncertainty
- Metric Mismatch: Using metrics not aligned with business goals
- Position Bias: LLM judges can favor a response based on its position in pairwise evals; randomize the order (see the sketch after this list)
- Overfitting Prompts: Optimizing for test set instead of real use
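A minimal sketch of order randomization for pairwise judging, reusing the `compare_responses` judge defined above (one simple mitigation, assuming the judge returns "A", "B", or "tie"):

```python
import random

async def compare_unbiased(question: str, response_a: str, response_b: str) -> str:
    """Randomize which response the judge sees first, then map the verdict back."""
    swapped = random.random() < 0.5
    first, second = (response_b, response_a) if swapped else (response_a, response_b)

    result = await compare_responses(question, first, second)
    if result.winner == "tie":
        return "tie"
    # Map the judge's "A"/"B" verdict back to the original labels
    if not swapped:
        return result.winner
    return "A" if result.winner == "B" else "B"
```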