prompt-regression-tester
Prompt Regression Tester
Systematically test prompt changes to prevent regressions.
Test Case Format
json
{
  "test_cases": [
    {
      "id": "test_001",
      "input": "Summarize this article",
      "context": "Article text here...",
      "expected_behavior": "Concise 2-3 sentence summary",
      "baseline_output": "Output from v1.0 prompt",
      "must_include": ["main point", "conclusion"],
      "must_not_include": ["opinion", "speculation"]
    }
  ]
}
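Before running a suite, it can help to check that every test case is well formed. The sketch below is one way to do that; `validate_test_cases`, `REQUIRED_FIELDS`, and `LIST_FIELDS` are hypothetical names, and treating `id`, `input`, and `expected_behavior` as mandatory is an assumption rather than something the format above requires.

```python
import json

# Assumption: these fields are treated as mandatory; the format itself does not say so.
REQUIRED_FIELDS = ("id", "input", "expected_behavior")
LIST_FIELDS = ("must_include", "must_not_include")


def validate_test_cases(suite):
    """Return a list of human-readable problems; an empty list means the suite is usable."""
    problems = []
    seen_ids = set()
    for index, case in enumerate(suite.get("test_cases", [])):
        label = case.get("id", f"<case #{index}>")
        for field in REQUIRED_FIELDS:
            if not case.get(field):
                problems.append(f"{label}: missing '{field}'")
        for field in LIST_FIELDS:
            if not isinstance(case.get(field, []), list):
                problems.append(f"{label}: '{field}' must be a list")
        if label in seen_ids:
            problems.append(f"{label}: duplicate id")
        seen_ids.add(label)
    return problems


suite = json.loads("""
{
  "test_cases": [
    {
      "id": "test_001",
      "input": "Summarize this article",
      "context": "Article text here...",
      "expected_behavior": "Concise 2-3 sentence summary",
      "must_include": ["main point", "conclusion"],
      "must_not_include": ["opinion", "speculation"]
    }
  ]
}
""")
problems = validate_test_cases(suite)
```

An empty `problems` list means the suite passed these basic checks; anything else is worth fixing before comparing prompts.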
Comparison Framework
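python
def compare_prompts(old_prompt, new_prompt, test_cases):
    """Run every test case through both prompt versions and tally the verdicts.

    Assumes llm, compute_diff, score_output, classify_change and
    analyze_breakage are defined elsewhere in the project.
    """
    results = {
        "test_cases": [],
        "summary": {
            "total": len(test_cases),
            "improvements": 0,
            "regressions": 0,
            "unchanged": 0,
        },
        "breakages": [],
    }
    for test in test_cases:
        # Prompts are str.format templates filled from the test case fields
        # (for example {input} and {context}); unused fields are ignored.
        old_output = llm(old_prompt.format(**test))
        new_output = llm(new_prompt.format(**test))
        comparison = {
            "test_id": test["id"],
            "old_output": old_output,
            "new_output": new_output,
            "diff": compute_diff(old_output, new_output),
            "scores": {
                "old": score_output(old_output, test),
                "new": score_output(new_output, test),
            },
            "verdict": classify_change(old_output, new_output, test),
        }
        results["test_cases"].append(comparison)
        # classify_change must return one of the summary keys:
        # "improvements", "regressions", or "unchanged".
        results["summary"][comparison["verdict"]] += 1
        if comparison["verdict"] == "regressions":
            results["breakages"].append(analyze_breakage(comparison, test))
    return results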
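The comparison loop relies on `compute_diff`, `score_output`, and `classify_change`, which are not defined in this document. The following is a minimal sketch of what they might look like, assuming keyword-based scoring against `must_include` and `must_not_include`; the `tolerance` parameter is hypothetical, and a real project may well score outputs differently (for example with an LLM judge).

```python
import difflib


def compute_diff(old_output, new_output):
    """Return a unified line diff between the two outputs ('' when identical)."""
    return "\n".join(
        difflib.unified_diff(
            old_output.splitlines(),
            new_output.splitlines(),
            fromfile="old",
            tofile="new",
            lineterm="",
        )
    )


def score_output(output, test):
    """Fraction of keyword constraints satisfied (1.0 when the test defines none)."""
    text = output.lower()
    checks = [kw.lower() in text for kw in test.get("must_include", [])]
    checks += [kw.lower() not in text for kw in test.get("must_not_include", [])]
    return sum(checks) / len(checks) if checks else 1.0


def classify_change(old_output, new_output, test, tolerance=0.0):
    """Map the score delta onto the summary keys used by compare_prompts."""
    delta = score_output(new_output, test) - score_output(old_output, test)
    if delta > tolerance:
        return "improvements"
    if delta < -tolerance:
        return "regressions"
    return "unchanged"


test = {"must_include": ["cost"], "must_not_include": ["opinion"]}
old = "The main benefit is cost savings"
new = "This approach provides flexibility"
verdict = classify_change(old, new, test)
```

Returning the plural summary keys directly keeps `results["summary"][verdict] += 1` valid without a lookup table.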
Stability Metrics
python
def calculate_stability_metrics(results):
    """Collect the stability metrics for one comparison run.

    The check_* and measure_* helpers other than measure_output_consistency
    are assumed to be defined elsewhere.
    """
    return {
        "output_stability": measure_output_consistency(results),
        "format_stability": check_format_preservation(results),
        "constraint_adherence": check_constraints(results),
        "behavioral_consistency": measure_behavior_delta(results),
    }


def measure_output_consistency(results):
    """How similar are outputs between versions?"""
    similarities = []
    for result in results["test_cases"]:
        sim = semantic_similarity(
            result["old_output"],
            result["new_output"],
        )
        similarities.append(sim)
    # Guard against an empty suite to avoid dividing by zero.
    if not similarities:
        return 0.0
    return sum(similarities) / len(similarities)
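`semantic_similarity` is left undefined above. As a dependency-free placeholder, the sketch below uses a word-level sequence ratio from the standard library. This is a lexical proxy, not true semantic similarity; an embedding-based comparison would be the more faithful choice, and this stand-in is only an assumption for illustration.

```python
from difflib import SequenceMatcher


def semantic_similarity(text_a, text_b):
    """Lexical stand-in: ratio of matching word sequences, from 0.0 to 1.0."""
    tokens_a = text_a.lower().split()
    tokens_b = text_b.lower().split()
    # autojunk=False keeps frequent words from being discarded on long outputs.
    return SequenceMatcher(None, tokens_a, tokens_b, autojunk=False).ratio()


# Hypothetical results in the shape produced by compare_prompts.
results = {
    "test_cases": [
        {"old_output": "the cat sat", "new_output": "the cat sat"},
        {"old_output": "The main benefit is cost savings",
         "new_output": "This approach provides flexibility"},
    ]
}
scores = [
    semantic_similarity(r["old_output"], r["new_output"])
    for r in results["test_cases"]
]
output_stability = sum(scores) / len(scores)
```

Because it only compares surface tokens, this proxy will score two paraphrases as dissimilar, so a low value should prompt a manual look rather than an automatic failure.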
Breakage Analysis
python
def analyze_breakage(comparison, test_case):
    """Identify why the new prompt failed."""
    reasons = []
    new_out = comparison["new_output"]
    new_out_lower = new_out.lower()

    # Missing required content (case-insensitive substring match)
    for keyword in test_case.get("must_include", []):
        if keyword.lower() not in new_out_lower:
            reasons.append(f"Missing required content: '{keyword}'")

    # Contains forbidden content
    for keyword in test_case.get("must_not_include", []):
        if keyword.lower() in new_out_lower:
            reasons.append(f"Contains forbidden content: '{keyword}'")

    # Format violations (only checked when the test case names a format;
    # check_format is assumed to be defined elsewhere)
    expected_format = test_case.get("expected_format")
    if expected_format and not check_format(new_out, expected_format):
        reasons.append("Output format violation")

    # Length issues: flag word counts more than 30% away from the target
    expected_length = test_case.get("expected_length")
    if expected_length:
        actual_length = len(new_out.split())
        if abs(actual_length - expected_length) > expected_length * 0.3:
            reasons.append(
                f"Length deviation: expected ~{expected_length}, got {actual_length}"
            )

    return {
        "test_id": test_case["id"],
        "reasons": reasons,
        # Truncate outputs so the report stays readable.
        "old_output": comparison["old_output"][:100],
        "new_output": comparison["new_output"][:100],
    }
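The breakage analyzer calls `check_format`, which this document does not define. One possible shape is sketched below; the format names `"json"` and `"bullets"` are hypothetical values for `expected_format`, chosen only to illustrate the idea.

```python
import json


def check_format(output, expected_format):
    """Return True when the output matches the named format.

    Assumption: expected_format is None, "json", or "bullets".
    """
    if expected_format is None:
        return True  # no format requirement for this test case
    if expected_format == "json":
        try:
            json.loads(output)
        except json.JSONDecodeError:
            return False
        return True
    if expected_format == "bullets":
        lines = [line for line in output.splitlines() if line.strip()]
        return bool(lines) and all(
            line.lstrip().startswith(("-", "*")) for line in lines
        )
    raise ValueError(f"Unknown expected_format: {expected_format!r}")


json_ok = check_format('{"summary": "short"}', "json")
json_bad = check_format("Here is the summary you asked for", "json")
```

Raising on an unknown format name, instead of returning `False`, makes typos in test cases surface as errors and not as spurious regressions.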
Fix Suggestions
python
def suggest_fixes(breakages):
    """Generate fix suggestions based on breakage patterns."""
    suggestions = []

    # Group breakages by reason: reason text -> list of affected test ids
    reason_groups = {}
    for breakage in breakages:
        for reason in breakage["reasons"]:
            reason_groups.setdefault(reason, []).append(breakage["test_id"])

    # Generate suggestions
    for reason, test_ids in reason_groups.items():
        if "Missing required content" in reason:
            # Split once and strip so the keyword has no stray leading space.
            keyword = reason.split(":", 1)[1].strip()
            suggestions.append({
                "issue": reason,
                "affected_tests": test_ids,
                "suggestion": "Add explicit instruction in prompt to include this content",
                "example": f"Make sure to mention {keyword} in your response.",
            })
        elif "format violation" in reason:
            suggestions.append({
                "issue": reason,
                "affected_tests": test_ids,
                "suggestion": "Add stricter format constraints to prompt",
                "example": "Output must follow this exact format: ...",
            })
    return suggestions
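The function above produces suggestions for two of the four reason types that `analyze_breakage` emits. A table-driven variant can cover all four by matching on the reason prefix. The sketch below is one possible arrangement; `SUGGESTION_RULES`, `suggest_by_category`, and the advice strings for forbidden content and length are assumptions, not part of the original tool.

```python
from collections import defaultdict

# (reason prefix, advice). The last two advice strings are hypothetical additions.
SUGGESTION_RULES = [
    ("Missing required content", "Add explicit instruction in prompt to include this content"),
    ("Contains forbidden content", "Add explicit instruction in prompt to avoid this content"),
    ("Output format violation", "Add stricter format constraints to prompt"),
    ("Length deviation", "State the target length in the prompt"),
]


def suggest_by_category(breakages):
    """Group affected tests by reason category and attach one suggestion per category."""
    groups = defaultdict(set)
    for breakage in breakages:
        for reason in breakage["reasons"]:
            for prefix, _advice in SUGGESTION_RULES:
                if reason.startswith(prefix):
                    groups[prefix].add(breakage["test_id"])
                    break
    return [
        {"issue": prefix, "affected_tests": sorted(groups[prefix]), "suggestion": advice}
        for prefix, advice in SUGGESTION_RULES
        if prefix in groups
    ]


breakages = [
    {"test_id": "test_005", "reasons": ["Missing required content: 'cost'"]},
    {"test_id": "test_012", "reasons": ["Output format violation",
                                        "Length deviation: expected ~50, got 120"]},
    {"test_id": "test_023", "reasons": ["Missing required content: 'conclusion'"]},
]
suggestions = suggest_by_category(breakages)
```

Grouping by category instead of by the full reason string means two tests that miss different keywords end up under a single entry, which tends to make the report shorter.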
Report Generation
markdown
# Prompt Regression Report

## Summary
- Total tests: 50
- Improvements: 5 (10%)
- Regressions: 3 (6%)
- Unchanged: 42 (84%)

## Stability Metrics
- Output stability: 0.87
- Format stability: 0.95
- Constraint adherence: 0.94

## Regressions (3)

### test_005: Missing required content
Old output: "The main benefit is cost savings..."
New output: "This approach provides flexibility..."
Issue: Missing required keyword 'cost'
Fix: Add explicit instruction: "Mention cost implications in your response"

## Recommended Actions
- Revert changes that caused regressions (tests: 005, 012, 023)
- Add format constraints for JSON output
- Run full test suite before deployment
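A report like the one above can be assembled directly from the comparison results and the stability metrics. The sketch below shows one way to do it; `render_report` is a hypothetical name, the Markdown heading levels are an assumption, and the sample data is illustrative only.

```python
def render_report(results, metrics):
    """Render comparison results and stability metrics as a Markdown string."""
    summary = results["summary"]
    total = summary["total"]

    def with_share(count):
        # Show "1 (25%)"; skip the percentage when there are no tests.
        return f"{count} ({count / total:.0%})" if total else str(count)

    lines = [
        "# Prompt Regression Report",
        "",
        "## Summary",
        f"- Total tests: {total}",
        f"- Improvements: {with_share(summary['improvements'])}",
        f"- Regressions: {with_share(summary['regressions'])}",
        f"- Unchanged: {with_share(summary['unchanged'])}",
        "",
        "## Stability Metrics",
    ]
    for name, value in metrics.items():
        lines.append(f"- {name.replace('_', ' ').capitalize()}: {value:.2f}")
    lines += ["", f"## Regressions ({len(results['breakages'])})"]
    for breakage in results["breakages"]:
        lines += ["", f"### {breakage['test_id']}"]
        lines += [f"- {reason}" for reason in breakage["reasons"]]
    return "\n".join(lines)


# Illustrative data in the shape produced by compare_prompts.
results = {
    "summary": {"total": 4, "improvements": 1, "regressions": 1, "unchanged": 2},
    "breakages": [
        {"test_id": "test_005", "reasons": ["Missing required content: 'cost'"]},
    ],
}
metrics = {"output_stability": 0.87, "format_stability": 0.95}
report = render_report(results, metrics)
```

Returning a string instead of writing a file leaves the caller free to print it, attach it to a pull request, or save it as a build artifact.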
Best Practices
- Test with diverse inputs
- Compare across multiple runs (LLMs are stochastic)
- Track metrics over time
- Automate in CI/CD
- Review all regressions before deploying
- Maintain a test case library
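Because LLM outputs are stochastic, a single run can misclassify a test case. One way to compare across multiple runs is to repeat each comparison and keep the most common verdict along with how often it occurred. The sketch below assumes a zero-argument callable `run_once` that performs one comparison and returns a verdict string; both `majority_verdict` and `run_once` are hypothetical names.

```python
from collections import Counter


def majority_verdict(run_once, runs=5):
    """Call run_once() `runs` times; return (most common verdict, agreement ratio).

    On a tie, the verdict that was seen first wins.
    """
    counts = Counter(run_once() for _ in range(runs))
    verdict, hits = counts.most_common(1)[0]
    return verdict, hits / runs


# Deterministic stand-in for repeated LLM runs of one test case.
simulated = iter(["regressions", "unchanged", "regressions", "regressions", "unchanged"])
verdict, agreement = majority_verdict(lambda: next(simulated), runs=5)
```

A low agreement ratio suggests the test case is flaky and may need a tighter constraint or more runs before it is trusted.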
Output Checklist
- Test cases defined (30+)
- Comparison runner
- Stability metrics
- Breakage analyzer
- Fix suggestions
- Diff visualizer
- Automated report
- CI integration
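For the CI integration item, a small gate function can turn the comparison summary into a process exit code. This is a sketch under stated assumptions: `ci_gate` is a hypothetical name, the default threshold of zero tolerated regressions is arbitrary, and the minimum of 30 tests simply mirrors the first checklist item.

```python
def ci_gate(summary, max_regression_rate=0.0, min_tests=30):
    """Return (exit_code, message); a non-zero exit code should fail the CI job."""
    total = summary["total"]
    if total < min_tests:
        return 1, f"Only {total} test cases; at least {min_tests} expected"
    rate = summary["regressions"] / total
    if rate > max_regression_rate:
        return 1, f"Regression rate {rate:.0%} exceeds limit {max_regression_rate:.0%}"
    return 0, f"Regression rate {rate:.0%} within limit"


summary = {"total": 50, "improvements": 5, "regressions": 3, "unchanged": 42}
exit_code, message = ci_gate(summary, max_regression_rate=0.05)
# In a CI script this would typically end with: sys.exit(exit_code)
```

With the sample summary, three regressions out of fifty exceed a 5% limit, so the gate fails the build; raising the limit to 10% would let it pass.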