addon-llm-judge-evals
Use when you want rubric-based LLM quality scoring on generated outputs; pair with `addon-deterministic-eval-suite`.
Source: ajrlewis/ai-skills
NPX install: `npx skill4agent add ajrlewis/ai-skills addon-llm-judge-evals`

SKILL.md Content

Add-on: LLM Judge Evals
Use this skill when you need qualitative evaluation (clarity, domain fit, UX coherence, docs quality) in addition to deterministic checks.
Compatibility
- Works with all stacks.
- Best paired with `addon-deterministic-eval-suite`.
Inputs
Collect:
- `JUDGE_BACKEND`: `auto`|`langchain`|`google-adk` (default `auto`).
- `JUDGE_MODEL`: model id to run scoring.
- `JUDGE_TIMEOUT_SECONDS`: default `60`.
- `JUDGE_MAX_RETRIES`: default `2`.
- `JUDGE_TEMPERATURE`: default `0`.
- `JUDGE_FAIL_ON_BACKEND_MISMATCH`: `yes`|`no` (default `yes`).
- `JUDGE_RUBRIC_MODE`: `custom`|`product`|`security`|`developer-experience`.
- `PASS_THRESHOLD`: default `0.75`.
- `BLOCK_ON_JUDGE_FAIL`: `yes`|`no` (default `no`).
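One way the adapted runner might consume these inputs is sketched below, assuming they arrive as environment variables; the helper name and returned dictionary shape are illustrative, and only the variable names and documented defaults come from the list above.

```python
import os

def load_judge_config() -> dict:
    """Read judge inputs from environment variables, applying the documented defaults.

    Hypothetical helper for the adapted scripts/evals/run_llm_judge.py; the key names
    in the returned dict are illustrative.
    """
    return {
        "backend": os.getenv("JUDGE_BACKEND", "auto"),
        "model": os.getenv("JUDGE_MODEL"),  # no default here; backend fallback applies later
        "timeout_seconds": int(os.getenv("JUDGE_TIMEOUT_SECONDS", "60")),
        "max_retries": int(os.getenv("JUDGE_MAX_RETRIES", "2")),
        "temperature": float(os.getenv("JUDGE_TEMPERATURE", "0")),
        "fail_on_backend_mismatch": os.getenv("JUDGE_FAIL_ON_BACKEND_MISMATCH", "yes") == "yes",
        "rubric_mode": os.getenv("JUDGE_RUBRIC_MODE", "custom"),  # assuming custom as the fallback mode
        "pass_threshold": float(os.getenv("PASS_THRESHOLD", "0.75")),
        "block_on_judge_fail": os.getenv("BLOCK_ON_JUDGE_FAIL", "no") == "yes",
    }
```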
Integration Workflow
- Add judge artifacts:
```text
config/skill_manifest.json
evals/judge/rubric.md
evals/judge/cases/
scripts/evals/run_llm_judge.py
.github/workflows/evals-judge.yml
REVIEW_BUNDLE/JUDGE_REPORT.md
```
- Copy and adapt this skill's bundled starter script: `scripts/run_llm_judge.py`.
- Place the adapted result in the target project at `scripts/evals/run_llm_judge.py`.
- Define rubric:
- scoring categories and weights
- failure reasons template
- required evidence links (files/lines/commands)
- Execute judge run:
- evaluate generated files against rubric per scenario
- resolve backend from `config/skill_manifest.json` plus judge inputs (see the sketch after this workflow)
- use a single adapter boundary for backend-specific scoring
- store structured JSON + markdown summary
- replace the bundled starter template's placeholder reporting with a real project-local backend adapter before treating judge scores as authoritative
- Merge policy:
- default advisory (`BLOCK_ON_JUDGE_FAIL=no`)
- blocking only when explicitly configured
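The steps above might come together in the adapted runner roughly as sketched below. `load_judge_config` is the inputs sketch earlier, `resolve_backend_name` is sketched under the backend resolution contract, and `make_backend` stands in for the project-local adapter factory this workflow asks you to write; report filenames other than `REVIEW_BUNDLE/JUDGE_REPORT.md` are illustrative.

```python
import json
import sys
from pathlib import Path

def run_judge(config: dict) -> int:
    """Illustrative judge run: resolve backend, score cases against the rubric, report, apply merge policy."""
    manifest = json.loads(Path("config/skill_manifest.json").read_text())
    backend_name = resolve_backend_name(manifest, config["backend"])  # sketched below
    backend = make_backend(backend_name, config)                      # hypothetical project-local adapter factory
    rubric = Path("evals/judge/rubric.md").read_text()

    results = []
    for case in sorted(Path("evals/judge/cases").glob("*.md")):       # one scenario per file (assumed layout)
        score = backend.score(f"{rubric}\n\n{case.read_text()}")      # single adapter boundary
        results.append({"case": case.name, "score": score})

    passed = all(r["score"] >= config["pass_threshold"] for r in results)

    # Store structured JSON plus a markdown summary.
    out = Path("REVIEW_BUNDLE")
    out.mkdir(exist_ok=True)
    (out / "judge_results.json").write_text(json.dumps(results, indent=2))  # illustrative filename
    summary = "\n".join(f"- {r['case']}: {r['score']:.2f}" for r in results)
    (out / "JUDGE_REPORT.md").write_text(f"# Judge Report\n\n{summary}\n")

    # Merge policy: advisory by default; block only when explicitly configured.
    return 1 if (not passed and config["block_on_judge_fail"]) else 0

if __name__ == "__main__":
    sys.exit(run_judge(load_judge_config()))  # load_judge_config from the inputs sketch above
```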
Backend Resolution Contract
- `scripts/evals/run_llm_judge.py` must read `config/skill_manifest.json` as the source of truth for selected skills and declared judge capabilities.
- The manifest should include:
```json
{
  "base_skill": "architect-python-uv-fastapi-sqlalchemy",
  "addons": [
    "addon-deterministic-eval-suite",
    "addon-llm-judge-evals",
    "addon-langchain-llm"
  ],
  "capabilities": {
    "judge_backends": ["langchain"]
  }
}
```
- Resolution order:
- If `JUDGE_BACKEND != auto`, use the requested backend only if the matching addon is present in the manifest.
- If `JUDGE_BACKEND=auto` and only `addon-langchain-llm` is present, use `langchain`.
- If `JUDGE_BACKEND=auto` and only `addon-google-agent-dev-kit` is present, use `google-adk`.
- If both addons are present, fail and require explicit `JUDGE_BACKEND`.
- If neither addon is present, fail with an explicit unsupported configuration error.
- Model resolution:
- `JUDGE_MODEL` wins when set.
- For `langchain`, fall back to `DEFAULT_MODEL`.
- For `google-adk`, fall back to `ADK_DEFAULT_MODEL`.
- The judge runner should expose a stable adapter interface (for example `JudgeBackend.score(prompt)`, as sketched below) so rubric logic, thresholding, and report generation stay backend-agnostic.
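A minimal Python sketch of this contract, assuming the adapter is modeled as a small protocol; the function names are illustrative, the fallback model ids are placeholders, and construction of the concrete langchain/google-adk adapters is left to the project-local code.

```python
from typing import Protocol

# Placeholder fallbacks; the real ids are project/runner specific.
DEFAULT_MODEL = "<langchain-default-model-id>"
ADK_DEFAULT_MODEL = "<google-adk-default-model-id>"

class JudgeBackend(Protocol):
    """Stable adapter boundary so rubric logic, thresholding, and reporting stay backend-agnostic."""
    def score(self, prompt: str) -> float: ...

def resolve_backend_name(manifest: dict, judge_backend: str) -> str:
    """Apply the resolution order above; raise instead of guessing.

    A fuller version would also cross-check manifest["capabilities"]["judge_backends"].
    """
    addons = set(manifest.get("addons", []))
    has_langchain = "addon-langchain-llm" in addons
    has_adk = "addon-google-agent-dev-kit" in addons

    if judge_backend != "auto":
        available = {"langchain": has_langchain, "google-adk": has_adk}
        if not available.get(judge_backend, False):
            raise RuntimeError(f"JUDGE_BACKEND={judge_backend} requested but its addon is not in the manifest")
        return judge_backend
    if has_langchain and has_adk:
        raise RuntimeError("Both LLM addons present; set JUDGE_BACKEND explicitly")
    if has_langchain:
        return "langchain"
    if has_adk:
        return "google-adk"
    raise RuntimeError("Unsupported configuration: no judge-capable addon declared in the manifest")

def resolve_model(backend_name: str, judge_model: str | None) -> str:
    """JUDGE_MODEL wins when set; otherwise fall back to the backend's default."""
    if judge_model:
        return judge_model
    return DEFAULT_MODEL if backend_name == "langchain" else ADK_DEFAULT_MODEL
```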
Required Template
`evals/judge/rubric.md`:
```markdown
# Judge Rubric
- Technical coherence (0-1)
- Requirement coverage (0-1)
- Domain language alignment (0-1)
- UX quality and states (0-1)
- Documentation clarity (0-1)
Pass threshold: 0.75
```
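A small sketch of how per-category scores could be combined and checked against the pass threshold; equal weights are assumed unless the rubric specifies otherwise, and the function name and score keys are illustrative rather than part of the rubric.

```python
def rubric_verdict(scores: dict[str, float], weights: dict[str, float] | None = None,
                   threshold: float = 0.75) -> tuple[float, bool]:
    """Weighted average of per-category scores (each 0-1) compared against the pass threshold."""
    weights = weights or {name: 1.0 for name in scores}  # equal weights by default
    total_weight = sum(weights[name] for name in scores)
    overall = sum(scores[name] * weights[name] for name in scores) / total_weight
    return overall, overall >= threshold

# Example: scores keyed by the rubric categories above (illustrative values).
overall, passed = rubric_verdict({
    "technical_coherence": 0.9,
    "requirement_coverage": 0.8,
    "domain_language_alignment": 0.7,
    "ux_quality_and_states": 0.75,
    "documentation_clarity": 0.85,
})
print(f"{overall:.2f} -> {'pass' if passed else 'fail'}")
```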
Guardrails
- Documentation contract for generated code (see the Python sketch after this list):
- Python: write module docstrings and docstrings for public classes, methods, and functions.
- Next.js/TypeScript: write JSDoc for exported components, hooks, utilities, and route handlers.
- Add concise rationale comments only for non-obvious logic, invariants, or safety constraints.
- Apply this contract even when using template snippets below; expand templates as needed.
- Never replace deterministic gates with judge scores.
- Keep prompts/rubrics versioned in repo for auditability.
- Record model/version and timestamp for each run.
- Surface uncertainty as explicit notes, not silent pass.
- Do not infer judge backend from incidental files or imports; use the manifest and explicit inputs.
- If multiple LLM-capable addons are installed, do not guess. Require an explicit `JUDGE_BACKEND`.
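For the Python side of the documentation contract, generated code might look like the fragment below; the module, function, and rationale comment are purely illustrative.

```python
"""Payment retry helpers.

Module docstring per the documentation contract; this module and its contents
are an illustrative example, not part of the skill.
"""

def backoff_delay(attempt: int, base_seconds: float = 1.0, cap_seconds: float = 30.0) -> float:
    """Return the delay before retry `attempt`, doubling each time up to `cap_seconds`."""
    # Rationale comment for non-obvious logic: the cap prevents unbounded waits
    # when upstream outages push the attempt count high.
    return min(cap_seconds, base_seconds * (2 ** attempt))
```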
Validation Checklist
- Confirm generated code includes required docstrings/JSDoc and rationale comments for non-obvious logic.
```bash
test -f evals/judge/rubric.md
test -f scripts/evals/run_llm_judge.py
test -f .github/workflows/evals-judge.yml
test -f REVIEW_BUNDLE/JUDGE_REPORT.md || true
```
Decision Justification Rule
- Every non-trivial decision must include a concrete justification.
- Capture the alternatives considered and why they were rejected.
- State tradeoffs and residual risks for the chosen option.
- If justification is missing, treat the task as incomplete and surface it as a blocker.