/hub:eval — Evaluate Agent Results
Rank all agent results for a session. Supports metric-based evaluation (run a command), LLM judge (compare diffs), or hybrid.
Usage
```bash
/hub:eval                     # Eval latest session using configured criteria
/hub:eval 20260317-143022     # Eval specific session
/hub:eval --judge             # Force LLM judge mode (ignore metric config)
```

What It Does
Metric Mode (eval command configured)
Run the evaluation command in each agent's worktree:
```bash
python {skill_path}/scripts/result_ranker.py \
  --session {session-id} \
  --eval-cmd "{eval_cmd}" \
  --metric {metric} --direction {direction}
```

Output:
```
RANK  AGENT    METRIC  DELTA  FILES
1     agent-2  142ms   -38ms  2
2     agent-1  165ms   -15ms  3
3     agent-3  190ms   +10ms  1

Winner: agent-2 (142ms)
```
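The ranking rule can be sketched in a few lines of Python. This is a hedged illustration, not result_ranker.py's actual code: the function name is hypothetical, and it assumes the DELTA column is reported against a pre-change baseline measurement (180ms in this example).

```python
# Hypothetical sketch of metric-based ranking: sort agents by metric
# value, honoring --direction ("min" or "max"), and report each
# agent's delta against an assumed baseline measurement.

def rank_agents(results, baseline, direction="min"):
    """results: {agent_name: metric_value}; direction: 'min' or 'max'."""
    reverse = direction == "max"
    ranked = sorted(results.items(), key=lambda kv: kv[1], reverse=reverse)
    return [(name, value, value - baseline) for name, value in ranked]

ranking = rank_agents(
    {"agent-1": 165, "agent-2": 142, "agent-3": 190},
    baseline=180,
    direction="min",  # lower latency wins
)
# ranking[0] is the winner: ("agent-2", 142, -38)
```

With `direction="min"` the smallest metric ranks first, matching the example table above.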
LLM Judge Mode (no eval command, or --judge flag)
For each agent:
- Get the diff: git diff {base_branch}...{agent_branch}
- Read the agent's result post from .agenthub/board/results/agent-{i}-result.md
- Compare all diffs and rank by:
  - Correctness — Does it solve the task?
  - Simplicity — Fewer lines changed is better (when correctness is equal)
  - Quality — Clean execution, good structure, no regressions

Present rankings with justification.
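The evidence-gathering steps above can be sketched as follows. This is an assumption-laden illustration: the function name is hypothetical, and it assumes agent branches are passed in agent-number order so that branch i maps to agent-{i}-result.md.

```python
# Hypothetical helper: collect each agent's diff and result post
# so they can be handed to the LLM judge for ranking.
import subprocess
from pathlib import Path

def collect_agent_evidence(base_branch, agent_branches,
                           board=".agenthub/board/results"):
    """agent_branches: branch names in agent order (agent-1 first)."""
    evidence = {}
    for i, branch in enumerate(agent_branches, start=1):
        # Triple-dot diff: changes on the agent branch since it
        # diverged from the base branch.
        diff = subprocess.run(
            ["git", "diff", f"{base_branch}...{branch}"],
            capture_output=True, text=True, check=True,
        ).stdout
        post = Path(board, f"agent-{i}-result.md").read_text()
        evidence[branch] = {"diff": diff, "result_post": post}
    return evidence
```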
Example LLM judge output for a content task:

```
RANK  AGENT    VERDICT                           WORD COUNT
1     agent-1  Strong narrative, clear CTA       1480
2     agent-3  Good data points, weak intro      1520
3     agent-2  Generic tone, no differentiation  1350

Winner: agent-1 (strongest narrative arc and call-to-action)
```
Hybrid Mode
- Run metric evaluation first
- If top agents are within 10% of each other, use LLM judge to break ties
- Present both metric and qualitative rankings
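The tie-break rule above can be sketched as a simple threshold check. The 10% figure comes from the description; the function name is hypothetical and assumes metric values are already sorted best-first.

```python
# Hypothetical tie-break check for hybrid mode: fall back to the LLM
# judge when the top two metric scores are within 10% of each other.

def needs_llm_tiebreak(ranked_metrics, threshold=0.10):
    """ranked_metrics: metric values sorted best-first."""
    if len(ranked_metrics) < 2:
        return False
    best, runner_up = ranked_metrics[0], ranked_metrics[1]
    return abs(runner_up - best) / abs(best) <= threshold

print(needs_llm_tiebreak([142, 165]))  # ~16% gap -> False, metric ranking stands
print(needs_llm_tiebreak([142, 150]))  # ~6% gap  -> True, ask the LLM judge
```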
After Eval
- Update session state:

```bash
python {skill_path}/scripts/session_manager.py --update {session-id} --state evaluating
```

- Tell the user:
  - Ranked results, with the winner highlighted
  - Next step: /hub:merge to merge the winner, or /hub:merge {session-id} --agent {winner} to be explicit