addon-llm-judge-evals


Add-on: LLM Judge Evals


Use this skill when you need qualitative evaluation (clarity, domain fit, UX coherence, docs quality) in addition to deterministic checks.

Compatibility


  • Works with all stacks.
  • Best paired with `addon-deterministic-eval-suite`.

Inputs


Collect:
  • `JUDGE_BACKEND`: `auto` | `langchain` | `google-adk` (default `auto`).
  • `JUDGE_MODEL`: model id to run scoring.
  • `JUDGE_TIMEOUT_SECONDS`: default `60`.
  • `JUDGE_MAX_RETRIES`: default `2`.
  • `JUDGE_TEMPERATURE`: default `0`.
  • `JUDGE_FAIL_ON_BACKEND_MISMATCH`: `yes` | `no` (default `yes`).
  • `JUDGE_RUBRIC_MODE`: `product` | `security` | `developer-experience` | `custom`.
  • `PASS_THRESHOLD`: default `0.75`.
  • `BLOCK_ON_JUDGE_FAIL`: `yes` | `no` (default `no`).

Integration Workflow


  1. Add judge artifacts:

```text
config/skill_manifest.json
evals/judge/rubric.md
evals/judge/cases/
scripts/evals/run_llm_judge.py
.github/workflows/evals-judge.yml
REVIEW_BUNDLE/JUDGE_REPORT.md
```

     • Copy and adapt this skill's bundled starter script: `scripts/run_llm_judge.py`.
     • Place the adapted result in the target project at `scripts/evals/run_llm_judge.py`.
  2. Define rubric:
     • scoring categories and weights
     • failure reasons template
     • required evidence links (files/lines/commands)
  3. Execute judge run:
     • evaluate generated files against the rubric per scenario
     • resolve the backend from `config/skill_manifest.json` plus judge inputs
     • use a single adapter boundary for backend-specific scoring
     • store structured JSON + markdown summary
     • replace the bundled starter template's placeholder reporting with a real project-local backend adapter before treating judge scores as authoritative
  4. Merge policy:
     • default advisory (`BLOCK_ON_JUDGE_FAIL=no`)
     • blocking only when explicitly configured
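The merge policy above reduces to a small exit-code decision in CI. A sketch under those semantics (function name is illustrative):

```python
def judge_exit_code(overall_score: float, pass_threshold: float,
                    block_on_judge_fail: bool) -> int:
    """Return the CI exit code for a judge run.

    Advisory mode (BLOCK_ON_JUDGE_FAIL=no, the default) always exits 0, so the
    judge never blocks a merge; blocking mode fails the job below threshold.
    """
    passed = overall_score >= pass_threshold
    if passed or not block_on_judge_fail:
        return 0
    return 1
```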

Backend Resolution Contract


  • `scripts/evals/run_llm_judge.py` must read `config/skill_manifest.json` as the source of truth for selected skills and declared judge capabilities.
  • The manifest should include:

```json
{
  "base_skill": "architect-python-uv-fastapi-sqlalchemy",
  "addons": [
    "addon-deterministic-eval-suite",
    "addon-llm-judge-evals",
    "addon-langchain-llm"
  ],
  "capabilities": {
    "judge_backends": ["langchain"]
  }
}
```

  • Resolution order:
    • If `JUDGE_BACKEND != auto`, use the requested backend only if the matching addon is present in the manifest.
    • If `JUDGE_BACKEND=auto` and only `addon-langchain-llm` is present, use `langchain`.
    • If `JUDGE_BACKEND=auto` and only `addon-google-agent-dev-kit` is present, use `google-adk`.
    • If both addons are present, fail and require an explicit `JUDGE_BACKEND`.
    • If neither addon is present, fail with an explicit unsupported-configuration error.
  • Model resolution:
    • `JUDGE_MODEL` wins when set.
    • For `langchain`, fall back to `DEFAULT_MODEL`.
    • For `google-adk`, fall back to `ADK_DEFAULT_MODEL`.
  • The judge runner should expose a stable adapter interface (for example `JudgeBackend.score(prompt)`) so rubric logic, thresholding, and report generation stay backend-agnostic.
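One way to express that adapter boundary is a `Protocol` plus a result type; the `JudgeScore` fields and the `LangChainJudge` class are illustrative assumptions, not part of the skill:

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class JudgeScore:
    """Backend-agnostic scoring result (illustrative shape)."""
    score: float          # normalized 0-1
    reasons: list[str]    # failure reasons / evidence notes


@runtime_checkable
class JudgeBackend(Protocol):
    """Adapter boundary: rubric logic and reporting only see this interface."""
    def score(self, prompt: str) -> JudgeScore: ...


class LangChainJudge:
    """Illustrative adapter; real LangChain scoring calls would live here."""
    def __init__(self, model: str):
        self.model = model

    def score(self, prompt: str) -> JudgeScore:
        # Placeholder, per the workflow note: replace before trusting scores.
        raise NotImplementedError("wire up a real project-local backend call")
```

Keeping thresholding and report generation on the `JudgeBackend` side of this boundary is what lets the same rubric run unchanged against either backend.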
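The resolution order above can be implemented mechanically once the manifest's `addons` list is parsed. A sketch (function and error wording are illustrative):

```python
# Maps LLM-capable addons, as named in the contract, to their backend ids.
LLM_ADDON_TO_BACKEND = {
    "addon-langchain-llm": "langchain",
    "addon-google-agent-dev-kit": "google-adk",
}


def resolve_backend(requested: str, addons: list[str]) -> str:
    """Apply the resolution order: explicit request wins if its addon is present;
    auto resolves only when exactly one LLM-capable addon is installed."""
    available = [b for a, b in LLM_ADDON_TO_BACKEND.items() if a in addons]
    if requested != "auto":
        if requested not in available:
            raise RuntimeError(
                f"requested backend {requested!r} has no matching addon in the manifest"
            )
        return requested
    if len(available) == 1:
        return available[0]
    if len(available) > 1:
        raise RuntimeError("multiple LLM addons present; set JUDGE_BACKEND explicitly")
    raise RuntimeError("unsupported configuration: no LLM-capable addon in the manifest")
```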

Required Template


`evals/judge/rubric.md`:

```markdown
# Judge Rubric

- Technical coherence (0-1)
- Requirement coverage (0-1)
- Domain language alignment (0-1)
- UX quality and states (0-1)
- Documentation clarity (0-1)

Pass threshold: 0.75
```
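Scoring against this rubric reduces to combining the five category scores and comparing to the threshold. A sketch with equal weights, which the rubric leaves open (category keys are illustrative normalizations of the names above):

```python
RUBRIC_CATEGORIES = [
    "technical_coherence",
    "requirement_coverage",
    "domain_language_alignment",
    "ux_quality_and_states",
    "documentation_clarity",
]


def overall_score(scores: dict[str, float]) -> float:
    """Equal-weight mean over the rubric categories; each score must be in [0, 1]."""
    missing = [c for c in RUBRIC_CATEGORIES if c not in scores]
    if missing:
        raise ValueError(f"missing rubric categories: {missing}")
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} out of range: {value}")
    return sum(scores[c] for c in RUBRIC_CATEGORIES) / len(RUBRIC_CATEGORIES)


def passes(scores: dict[str, float], threshold: float = 0.75) -> bool:
    """Compare the aggregate against the pass threshold (0.75 per the rubric)."""
    return overall_score(scores) >= threshold
```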

Guardrails


  • Documentation contract for generated code:
    • Python: write module docstrings and docstrings for public classes, methods, and functions.
    • Next.js/TypeScript: write JSDoc for exported components, hooks, utilities, and route handlers.
    • Add concise rationale comments only for non-obvious logic, invariants, or safety constraints.
    • Apply this contract even when using template snippets; expand templates as needed.
  • Never replace deterministic gates with judge scores.
  • Keep prompts/rubrics versioned in the repo for auditability.
  • Record model/version and timestamp for each run.
  • Surface uncertainty as explicit notes, not a silent pass.
  • Do not infer the judge backend from incidental files or imports; use the manifest and explicit inputs.
  • If multiple LLM-capable addons are installed, do not guess: require an explicit `JUDGE_BACKEND`.
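Recording model/version, timestamp, and explicit uncertainty notes per run might look like the following; the field names are illustrative, not a schema the skill prescribes:

```python
import json
from datetime import datetime, timezone


def build_run_record(backend: str, model: str, overall: float,
                     uncertainty_notes: list[str]) -> str:
    """Serialize one judge run as auditable JSON.

    Uncertainty is surfaced as explicit notes rather than silently passed over.
    """
    record = {
        "backend": backend,
        "model": model,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "overall_score": overall,
        "uncertainty_notes": uncertainty_notes,
    }
    return json.dumps(record, indent=2)
```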

Validation Checklist


  • Confirm generated code includes required docstrings/JSDoc and rationale comments for non-obvious logic.

```bash
test -f evals/judge/rubric.md
test -f scripts/evals/run_llm_judge.py
test -f .github/workflows/evals-judge.yml
test -f REVIEW_BUNDLE/JUDGE_REPORT.md || true
```

Decision Justification Rule


  • Every non-trivial decision must include a concrete justification.
  • Capture the alternatives considered and why they were rejected.
  • State tradeoffs and residual risks for the chosen option.
  • If justification is missing, treat the task as incomplete and surface it as a blocker.