Add-on: LLM Judge Evals
Use this skill when you need qualitative evaluation (clarity, domain fit, UX coherence, docs quality) in addition to deterministic checks.
Compatibility
- Works with all stacks.
- Best paired with `addon-deterministic-eval-suite`.
Inputs
Collect:
- `JUDGE_BACKEND`: `auto|langchain|google-adk` (default `auto`).
- `JUDGE_MODEL`: model id to run scoring.
- `JUDGE_TIMEOUT_SECONDS`: default `60`.
- `JUDGE_MAX_RETRIES`: default `2`.
- `JUDGE_TEMPERATURE`: default `0`.
- `JUDGE_FAIL_ON_BACKEND_MISMATCH`: `yes|no` (default `yes`).
- `JUDGE_RUBRIC_MODE`: `custom|product|security|developer-experience` (default `custom`).
- `PASS_THRESHOLD`: default `0.75`.
- `BLOCK_ON_JUDGE_FAIL`: `yes|no` (default `no`).
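The inputs above are plain environment variables. A minimal sketch of how a runner might collect them with their documented defaults (the helper name is illustrative, not part of the spec):

```python
import os

def collect_judge_inputs(env=os.environ):
    """Read judge configuration from environment variables, applying defaults."""
    return {
        "backend": env.get("JUDGE_BACKEND", "auto"),
        "model": env.get("JUDGE_MODEL"),  # no global default: resolved per backend later
        "timeout_seconds": int(env.get("JUDGE_TIMEOUT_SECONDS", "60")),
        "max_retries": int(env.get("JUDGE_MAX_RETRIES", "2")),
        "temperature": float(env.get("JUDGE_TEMPERATURE", "0")),
        "fail_on_backend_mismatch": env.get("JUDGE_FAIL_ON_BACKEND_MISMATCH", "yes") == "yes",
        "rubric_mode": env.get("JUDGE_RUBRIC_MODE", "custom"),
        "pass_threshold": float(env.get("PASS_THRESHOLD", "0.75")),
        "block_on_judge_fail": env.get("BLOCK_ON_JUDGE_FAIL", "no") == "yes",
    }
```

Passing `env` explicitly keeps the helper testable without mutating the process environment.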
Integration Workflow
- Add judge artifacts:

```text
config/skill_manifest.json
evals/judge/rubric.md
evals/judge/cases/
scripts/evals/run_llm_judge.py
.github/workflows/evals-judge.yml
REVIEW_BUNDLE/JUDGE_REPORT.md
```

- Copy and adapt this skill's bundled starter script: `scripts/run_llm_judge.py`. Place the adapted result in the target project at `scripts/evals/run_llm_judge.py`.
- Define rubric:
  - scoring categories and weights
  - failure reasons template
  - required evidence links (files/lines/commands)
- Execute judge run:
  - evaluate generated files against the rubric per scenario
  - resolve the backend from `config/skill_manifest.json` plus the judge inputs
  - use a single adapter boundary for backend-specific scoring
  - store structured JSON + a markdown summary
  - replace the bundled starter template's placeholder reporting with a real project-local backend adapter before treating judge scores as authoritative
- Merge policy:
  - default advisory (`BLOCK_ON_JUDGE_FAIL=no`)
  - blocking only when explicitly configured
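The "structured JSON + markdown summary" step can be sketched as follows; the paths follow the artifact list above, but the report shape is an assumption, not a fixed schema:

```python
import json
from pathlib import Path

def write_judge_report(scores: dict, passed: bool, out_dir: str = "REVIEW_BUNDLE") -> None:
    """Persist one judge run as machine-readable JSON plus a markdown summary."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Structured result for tooling (CI gates, dashboards).
    (out / "judge_report.json").write_text(
        json.dumps({"scores": scores, "passed": passed}, indent=2)
    )
    # Human-readable summary for reviewers.
    lines = ["# Judge Report", ""]
    lines += [f"- {name}: {value:.2f}" for name, value in scores.items()]
    lines += ["", f"Result: {'PASS' if passed else 'FAIL'}"]
    (out / "JUDGE_REPORT.md").write_text("\n".join(lines) + "\n")
```

A real adapter would also record the model id, backend, and timestamp, per the guardrails below.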
Backend Resolution Contract
- `scripts/evals/run_llm_judge.py` must read `config/skill_manifest.json` as the source of truth for selected skills and declared judge capabilities.
- The manifest should include:

```json
{
  "base_skill": "architect-python-uv-fastapi-sqlalchemy",
  "addons": [
    "addon-deterministic-eval-suite",
    "addon-llm-judge-evals",
    "addon-langchain-llm"
  ],
  "capabilities": {
    "judge_backends": ["langchain"]
  }
}
```

- Resolution order:
  - If `JUDGE_BACKEND != auto`, use the requested backend only if the matching addon is present in the manifest.
  - If `JUDGE_BACKEND=auto` and only `addon-langchain-llm` is present, use `langchain`.
  - If `JUDGE_BACKEND=auto` and only `addon-google-agent-dev-kit` is present, use `google-adk`.
  - If both addons are present, fail and require an explicit `JUDGE_BACKEND`.
  - If neither addon is present, fail with an explicit unsupported-configuration error.
- Model resolution:
  - `JUDGE_MODEL` wins when set.
  - For `langchain`, fall back to `DEFAULT_MODEL`.
  - For `google-adk`, fall back to `ADK_DEFAULT_MODEL`.
- The judge runner should expose a stable adapter interface (for example `JudgeBackend.score(prompt)`) so rubric logic, thresholding, and report generation stay backend-agnostic.
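The resolution order above translates almost line for line into code. A sketch, assuming the manifest shape shown earlier (the function name is illustrative):

```python
def resolve_backend(requested: str, manifest: dict) -> str:
    """Resolve the judge backend from JUDGE_BACKEND plus the manifest addons."""
    addons = set(manifest.get("addons", []))
    backend_addons = {
        "langchain": "addon-langchain-llm",
        "google-adk": "addon-google-agent-dev-kit",
    }
    # Explicit request: honor it only when the matching addon is declared.
    if requested != "auto":
        addon = backend_addons.get(requested)
        if addon is None or addon not in addons:
            raise ValueError(f"backend {requested!r} has no matching addon in the manifest")
        return requested
    has_langchain = backend_addons["langchain"] in addons
    has_adk = backend_addons["google-adk"] in addons
    # Ambiguous: never guess between two capable addons.
    if has_langchain and has_adk:
        raise ValueError("both LLM addons present: set JUDGE_BACKEND explicitly")
    if has_langchain:
        return "langchain"
    if has_adk:
        return "google-adk"
    raise ValueError("unsupported configuration: no judge-capable addon in the manifest")
```

Raising on ambiguity rather than picking a winner matches the "do not guess" guardrail below.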
Required Template
`evals/judge/rubric.md`:

```markdown
# Judge Rubric
- Technical coherence (0-1)
- Requirement coverage (0-1)
- Domain language alignment (0-1)
- UX quality and states (0-1)
- Documentation clarity (0-1)

Pass threshold: 0.75
```
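Applying the rubric reduces to a weighted mean of the 0-1 category scores compared against the pass threshold. A minimal sketch, where equal weights are an assumption (the project's rubric defines the real weights):

```python
def rubric_passes(scores: dict, weights: dict = None, threshold: float = 0.75) -> bool:
    """Return True when the weighted mean of 0-1 category scores meets the threshold."""
    if weights is None:
        weights = {name: 1.0 for name in scores}  # equal weights by default
    total_weight = sum(weights.values())
    weighted = sum(scores[name] * weights[name] for name in scores)
    return weighted / total_weight >= threshold
```

Keeping this pure (no I/O, no backend calls) is what lets thresholding stay backend-agnostic behind the adapter boundary.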
undefinedGuardrails
- Documentation contract for generated code:
  - Python: write module docstrings and docstrings for public classes, methods, and functions.
  - Next.js/TypeScript: write JSDoc for exported components, hooks, utilities, and route handlers.
  - Add concise rationale comments only for non-obvious logic, invariants, or safety constraints.
  - Apply this contract even when using template snippets below; expand templates as needed.
- Never replace deterministic gates with judge scores.
- Keep prompts/rubrics versioned in the repo for auditability.
- Record model/version and timestamp for each run.
- Surface uncertainty as explicit notes, not a silent pass.
- Do not infer the judge backend from incidental files or imports; use the manifest and explicit inputs.
- If multiple LLM-capable addons are installed, do not guess. Require an explicit `JUDGE_BACKEND`.
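For the Python half of the documentation contract, a minimal illustration (the module and function are invented for the example):

```python
"""Example module illustrating the documentation contract for generated code."""

def clamp_score(value: float) -> float:
    """Clamp a judge score into the 0-1 range expected by the rubric.

    Rationale (non-obvious constraint): scores outside 0-1 usually indicate a
    malformed model response, so we clamp rather than fail the whole run.
    """
    return max(0.0, min(1.0, value))
```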
Validation Checklist
- Confirm generated code includes required docstrings/JSDoc and rationale comments for non-obvious logic.

```bash
test -f evals/judge/rubric.md
test -f scripts/evals/run_llm_judge.py
test -f .github/workflows/evals-judge.yml
test -f REVIEW_BUNDLE/JUDGE_REPORT.md || true
```

Decision Justification Rule
- Every non-trivial decision must include a concrete justification.
- Capture the alternatives considered and why they were rejected.
- State tradeoffs and residual risks for the chosen option.
- If justification is missing, treat the task as incomplete and surface it as a blocker.