addon-llm-judge-evals


Add-on: LLM Judge Evals


Use this skill when you need qualitative evaluation (clarity, domain fit, UX coherence, docs quality) in addition to deterministic checks.

Compatibility


  • Works with all stacks.
  • Best paired with `addon-deterministic-eval-suite`.

Inputs


Collect:
  • `JUDGE_BACKEND`: `auto` | `langchain` | `google-adk` (default `auto`).
  • `JUDGE_MODEL`: model id to run scoring.
  • `JUDGE_TIMEOUT_SECONDS`: default `60`.
  • `JUDGE_MAX_RETRIES`: default `2`.
  • `JUDGE_TEMPERATURE`: default `0`.
  • `JUDGE_FAIL_ON_BACKEND_MISMATCH`: `yes` | `no` (default `yes`).
  • `JUDGE_RUBRIC_MODE`: `product` | `security` | `developer-experience` | `custom`.
  • `PASS_THRESHOLD`: default `0.75`.
  • `BLOCK_ON_JUDGE_FAIL`: `yes` | `no` (default `no`).

Integration Workflow


  1. Add judge artifacts:

```text
config/skill_manifest.json
evals/judge/rubric.md
evals/judge/cases/
scripts/evals/run_llm_judge.py
.github/workflows/evals-judge.yml
REVIEW_BUNDLE/JUDGE_REPORT.md
```

     • Copy and adapt this skill's bundled starter script: `scripts/run_llm_judge.py`.
     • Place the adapted result in the target project at `scripts/evals/run_llm_judge.py`.
  2. Define rubric:
     • scoring categories and weights
     • failure reasons template
     • required evidence links (files/lines/commands)
  3. Execute judge run:
     • evaluate generated files against the rubric per scenario
     • resolve the backend from `config/skill_manifest.json` plus judge inputs
     • use a single adapter boundary for backend-specific scoring
     • store structured JSON + markdown summary
     • replace the bundled starter template's placeholder reporting with a real project-local backend adapter before treating judge scores as authoritative
  4. Merge policy:
     • default advisory (`BLOCK_ON_JUDGE_FAIL=no`)
     • blocking only when explicitly configured
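The merge policy above reduces to a small exit-code decision in CI. A sketch under those semantics (function name is illustrative):

```python
def judge_exit_code(overall_score: float, pass_threshold: float,
                    block_on_judge_fail: bool) -> int:
    """Return the CI exit code for a judge run.

    Advisory mode (BLOCK_ON_JUDGE_FAIL=no, the default) always exits 0, so the
    judge never blocks a merge; blocking mode fails the job below threshold.
    """
    passed = overall_score >= pass_threshold
    if passed or not block_on_judge_fail:
        return 0
    return 1
```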

Backend Resolution Contract


  • `scripts/evals/run_llm_judge.py` must read `config/skill_manifest.json` as the source of truth for selected skills and declared judge capabilities.
  • The manifest should include:

```json
{
  "base_skill": "architect-python-uv-fastapi-sqlalchemy",
  "addons": [
    "addon-deterministic-eval-suite",
    "addon-llm-judge-evals",
    "addon-langchain-llm"
  ],
  "capabilities": {
    "judge_backends": ["langchain"]
  }
}
```

  • Resolution order:
    • If `JUDGE_BACKEND != auto`, use the requested backend only if the matching addon is present in the manifest.
    • If `JUDGE_BACKEND=auto` and only `addon-langchain-llm` is present, use `langchain`.
    • If `JUDGE_BACKEND=auto` and only `addon-google-agent-dev-kit` is present, use `google-adk`.
    • If both addons are present, fail and require an explicit `JUDGE_BACKEND`.
    • If neither addon is present, fail with an explicit unsupported-configuration error.
  • Model resolution:
    • `JUDGE_MODEL` wins when set.
    • For `langchain`, fall back to `DEFAULT_MODEL`.
    • For `google-adk`, fall back to `ADK_DEFAULT_MODEL`.
  • The judge runner should expose a stable adapter interface (for example `JudgeBackend.score(prompt)`) so rubric logic, thresholding, and report generation stay backend-agnostic.
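One way to express that adapter boundary is a `Protocol` plus a result type; the `JudgeScore` fields and the `LangChainJudge` class are illustrative assumptions, not part of the skill:

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class JudgeScore:
    """Backend-agnostic scoring result (illustrative shape)."""
    score: float          # normalized 0-1
    reasons: list[str]    # failure reasons / evidence notes


@runtime_checkable
class JudgeBackend(Protocol):
    """Adapter boundary: rubric logic and reporting only see this interface."""
    def score(self, prompt: str) -> JudgeScore: ...


class LangChainJudge:
    """Illustrative adapter; real LangChain scoring calls would live here."""
    def __init__(self, model: str):
        self.model = model

    def score(self, prompt: str) -> JudgeScore:
        # Placeholder, per the workflow note: replace before trusting scores.
        raise NotImplementedError("wire up a real project-local backend call")
```

Keeping thresholding and report generation on the `JudgeBackend` side of this boundary is what lets the same rubric run unchanged against either backend.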
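The resolution order above can be implemented mechanically once the manifest's `addons` list is parsed. A sketch (function and error wording are illustrative):

```python
# Maps LLM-capable addons, as named in the contract, to their backend ids.
LLM_ADDON_TO_BACKEND = {
    "addon-langchain-llm": "langchain",
    "addon-google-agent-dev-kit": "google-adk",
}


def resolve_backend(requested: str, addons: list[str]) -> str:
    """Apply the resolution order: explicit request wins if its addon is present;
    auto resolves only when exactly one LLM-capable addon is installed."""
    available = [b for a, b in LLM_ADDON_TO_BACKEND.items() if a in addons]
    if requested != "auto":
        if requested not in available:
            raise RuntimeError(
                f"requested backend {requested!r} has no matching addon in the manifest"
            )
        return requested
    if len(available) == 1:
        return available[0]
    if len(available) > 1:
        raise RuntimeError("multiple LLM addons present; set JUDGE_BACKEND explicitly")
    raise RuntimeError("unsupported configuration: no LLM-capable addon in the manifest")
```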

Required Template


`evals/judge/rubric.md`:

```markdown
# Judge Rubric

- Technical coherence (0-1)
- Requirement coverage (0-1)
- Domain language alignment (0-1)
- UX quality and states (0-1)
- Documentation clarity (0-1)

Pass threshold: 0.75
```
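Scoring against this rubric reduces to combining the five category scores and comparing to the threshold. A sketch with equal weights, which the rubric leaves open (category keys are illustrative normalizations of the names above):

```python
RUBRIC_CATEGORIES = [
    "technical_coherence",
    "requirement_coverage",
    "domain_language_alignment",
    "ux_quality_and_states",
    "documentation_clarity",
]


def overall_score(scores: dict[str, float]) -> float:
    """Equal-weight mean over the rubric categories; each score must be in [0, 1]."""
    missing = [c for c in RUBRIC_CATEGORIES if c not in scores]
    if missing:
        raise ValueError(f"missing rubric categories: {missing}")
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} out of range: {value}")
    return sum(scores[c] for c in RUBRIC_CATEGORIES) / len(RUBRIC_CATEGORIES)


def passes(scores: dict[str, float], threshold: float = 0.75) -> bool:
    """Compare the aggregate against the pass threshold (0.75 per the rubric)."""
    return overall_score(scores) >= threshold
```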

Guardrails


  • Documentation contract for generated code:
    • Python: write module docstrings and docstrings for public classes, methods, and functions.
    • Next.js/TypeScript: write JSDoc for exported components, hooks, utilities, and route handlers.
    • Add concise rationale comments only for non-obvious logic, invariants, or safety constraints.
    • Apply this contract even when using template snippets; expand templates as needed.
  • Never replace deterministic gates with judge scores.
  • Keep prompts/rubrics versioned in the repo for auditability.
  • Record model/version and timestamp for each run.
  • Surface uncertainty as explicit notes, not a silent pass.
  • Do not infer the judge backend from incidental files or imports; use the manifest and explicit inputs.
  • If multiple LLM-capable addons are installed, do not guess: require an explicit `JUDGE_BACKEND`.
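Recording model/version, timestamp, and explicit uncertainty notes per run might look like the following; the field names are illustrative, not a schema the skill prescribes:

```python
import json
from datetime import datetime, timezone


def build_run_record(backend: str, model: str, overall: float,
                     uncertainty_notes: list[str]) -> str:
    """Serialize one judge run as auditable JSON.

    Uncertainty is surfaced as explicit notes rather than silently passed over.
    """
    record = {
        "backend": backend,
        "model": model,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "overall_score": overall,
        "uncertainty_notes": uncertainty_notes,
    }
    return json.dumps(record, indent=2)
```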

Validation Checklist


  • Confirm generated code includes required docstrings/JSDoc and rationale comments for non-obvious logic.

```bash
test -f evals/judge/rubric.md
test -f scripts/evals/run_llm_judge.py
test -f .github/workflows/evals-judge.yml
test -f REVIEW_BUNDLE/JUDGE_REPORT.md || true
```

Decision Justification Rule


  • Every non-trivial decision must include a concrete justification.
  • Capture the alternatives considered and why they were rejected.
  • State tradeoffs and residual risks for the chosen option.
  • If justification is missing, treat the task as incomplete and surface it as a blocker.