rewardkit
Help the user write task verifiers with Reward Kit. Reward Kit is a lightweight Python
package that turns a directory of criteria files into a reward score. Each criterion is a
Python function call or a TOML judge file; folders become separate rewards.
Setup in a Harbor task
Put criteria alongside `test.sh` in the task's `tests/` directory:

```
tests/
├── test.sh
├── checks.py       # programmatic criteria
└── judge.toml      # optional LLM/agent judge
```

`tests/test.sh`:

```bash
#!/bin/bash
uvx --from 'harbor-rewardkit==0.1.*' rewardkit /tests
```

This runs all criteria in `/tests/` against the workspace at `/app` and writes
`/logs/verifier/reward.json`. Defaults match Harbor's conventions, so no extra config is needed.

If judge criteria need API keys, pass them through `task.toml`:

```toml
[verifier.env]
ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}"
```

Ask whether Reward Kit should run in the agent's shared environment or in a
separate verifier environment. Prefer a separate verifier environment when judge
prompts, grading dependencies, API keys, or clean-room checks should not be
available to the agent:

```toml
[environment]
network_mode = "no-network"   # Agent env baseline: offline during agent.run()

[verifier]
environment_mode = "separate"

[verifier.environment]
network_mode = "public"       # Verifier env baseline: LLM judge API calls
docker_image = "python:3.12-slim"
```

In shared mode, the verifier runs in the agent container and inherits
`[environment].network_mode`. Set `[verifier].network_mode` only when verify()
needs different network access than the agent phase (a phase override, not a
baseline). If agent and verifier need different baselines without runtime
switching, use `environment_mode = "separate"` and set
`[verifier.environment].network_mode`.

Judge criteria that call external APIs need a `public` baseline or allowlist on
the verifier environment. Programmatic checks that only read local files can use
`no-network`.

In separate mode, `tests/` is the verifier image build context and must provide
`/tests/test.sh` at runtime; Harbor does not upload `tests/` into the running
verifier container.
Programmatic criteria
Call built-ins from any `.py` file in `tests/`:

```python
import rewardkit as rk

rk.file_exists("output.txt")
rk.file_contains("output.txt", "hello")
rk.command_succeeds("python main.py", weight=2.0)
rk.json_key_equals("result.json", "status", "ok")
```

All criteria accept `weight` (default `1.0`) and `isolated` (default `False`; when `True`,
the criterion runs in overlayfs so side effects don't leak).
Available built-ins
- Files: `file_exists`, `file_not_exists`, `file_contains`, `file_contains_regex`, `file_matches`, `files_equal`, `diff_ratio`
- Commands: `command_succeeds`, `command_output_contains`, `command_output_matches`, `command_output_matches_regex` (30s default timeout, optional `cwd`)
- Data: `json_key_equals`, `json_path_equals`, `csv_cell_equals`, `xlsx_cell_equals` (needs `[office]` extra), `sqlite_query_equals`
- HTTP: `http_status_equals`, `http_response_contains`
- Images: `image_similarity`, `image_size_equals` (needs `[image]` extra)
- Trajectory: `trajectory_tool_used`, `trajectory_tool_not_used`, `trajectory_turn_count`

For extras, install with `uv tool install harbor-rewardkit[all]`.
Custom criteria
Use the `@criterion` decorator. The first parameter is always `workspace: Path`. The function returns
`bool` or `float`:

```python
from pathlib import Path
from rewardkit import criterion

@criterion
def has_valid_output(workspace: Path) -> bool:
    # Passes when output.txt exists and contains non-whitespace text.
    return (workspace / "output.txt").read_text().strip() != ""
```

Zero-parameter criteria (no arguments beyond `workspace`) auto-register. Criteria with extra args must be called via `rk`:

```python
from pathlib import Path

import rewardkit as rk
from rewardkit import criterion

@criterion(description="output has at least {n} lines")
def has_n_lines(workspace: Path, n: int) -> bool:
    return len((workspace / "output.txt").read_text().splitlines()) >= n

# Each call registers one criterion with its own argument and weight.
rk.has_n_lines(10, weight=2.0)
rk.has_n_lines(50, weight=1.0)
```

For criteria shared across reward subdirs, define them with `shared=True` in a root-level file
and call them from subdirs.
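Because a criterion may return a `float`, it can award partial credit. The sketch below shows only the scoring logic as a plain function, so it runs without Reward Kit installed; the file list `EXPECTED_FILES` and the function name are hypothetical, and it is assumed that adding `@criterion` as described above would register it.

```python
import tempfile
from pathlib import Path

# Hypothetical list of files the task is expected to produce.
EXPECTED_FILES = ["output.txt", "result.json", "report.md"]

def fraction_of_expected_files(workspace: Path) -> float:
    """Return the share of EXPECTED_FILES present in the workspace (0.0 to 1.0)."""
    present = sum((workspace / name).is_file() for name in EXPECTED_FILES)
    return present / len(EXPECTED_FILES)

# Exercise the function against a throwaway workspace with two of three files.
with tempfile.TemporaryDirectory() as tmp:
    ws = Path(tmp)
    (ws / "output.txt").write_text("hello\n")
    (ws / "result.json").write_text('{"status": "ok"}\n')
    score = fraction_of_expected_files(ws)

print(f"{score:.2f}")
```

Keeping the logic in a function that takes only `workspace` makes it easy to test locally before wiring it into a task.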
Judge criteria (LLM or agent-as-a-judge)
For subjective checks (quality, readability, edge cases), create a TOML file:

```toml
[judge]
judge = "anthropic/claude-sonnet-4-6"   # LiteLLM model string
files = ["/app/main.py"]

[[criterion]]
description = "Is the code correct?"
type = "binary"

[[criterion]]
description = "How readable is the code?"
type = "likert"
points = 5
weight = 2.0
```

Criterion types:
- `binary`: yes/no → 1.0 or 0.0
- `likert`: 1..points, normalized to [0, 1]
- `numeric`: min..max, normalized to [0, 1]
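To make the three normalizations concrete, here is a minimal sketch. It assumes a linear rescale whose endpoints map to 0.0 and 1.0; the exact formula Reward Kit uses is not stated here, and the function names are hypothetical.

```python
def normalize_binary(answer: bool) -> float:
    """yes/no -> 1.0 or 0.0."""
    return 1.0 if answer else 0.0

def normalize_likert(rating: int, points: int) -> float:
    """Map a rating in 1..points onto [0, 1] (assumed linear)."""
    if points < 2:
        raise ValueError("points must be at least 2")
    if not 1 <= rating <= points:
        raise ValueError("rating must lie in 1..points")
    return (rating - 1) / (points - 1)

def normalize_numeric(value: float, minimum: float, maximum: float) -> float:
    """Map a value in minimum..maximum onto [0, 1], clamping out-of-range input."""
    if maximum <= minimum:
        raise ValueError("maximum must exceed minimum")
    clamped = min(max(value, minimum), maximum)
    return (clamped - minimum) / (maximum - minimum)

print(normalize_binary(True))        # 1.0
print(normalize_likert(3, 5))        # 0.5
print(normalize_numeric(7, 0, 10))   # 0.7
```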
Agent judges
Agent judges shell out to a CLI and can explore the filesystem:

```toml
[judge]
judge = "claude-code"
model = "anthropic/claude-sonnet-4-6"
isolated = true

[[criterion]]
description = "Does the solution handle edge cases?"
type = "binary"
```

They are slower and more expensive than LLM judges, but they can run commands and inspect files.
Useful `[judge]` options
- `timeout`
- `reasoning_effort` (`low`, `medium`, `high`)
- `reference`
- `atif-trajectory`
- `weight`
- `prompt_template` (with a `{criteria}` placeholder)

Scoring aggregation (within one judge TOML)
```toml
[scoring]
aggregation = "all_pass"   # weighted_mean | all_pass | any_pass | threshold
threshold = 0.7            # only for threshold
```

This only affects how this file's own criteria combine. To aggregate across
dimensions, see Aggregating dimensions.
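The four aggregation modes can be sketched as follows. This is an illustration under stated assumptions, not Reward Kit's implementation: a criterion is assumed to "pass" only at a score of 1.0, and `threshold` is assumed to compare the weighted mean against the cutoff. The function name `aggregate` is hypothetical.

```python
def aggregate(scores, weights, mode="weighted_mean", threshold=0.0):
    """Combine per-criterion scores in [0, 1] into one score."""
    if not scores or len(scores) != len(weights):
        raise ValueError("scores and weights must be non-empty and the same length")
    weighted_mean = sum(s * w for s, w in zip(scores, weights)) / sum(weights)
    if mode == "weighted_mean":
        return weighted_mean
    if mode == "all_pass":
        # Assumption: "pass" means a full score of 1.0.
        return 1.0 if all(s >= 1.0 for s in scores) else 0.0
    if mode == "any_pass":
        return 1.0 if any(s >= 1.0 for s in scores) else 0.0
    if mode == "threshold":
        # Assumption: the weighted mean is compared against the cutoff.
        return 1.0 if weighted_mean >= threshold else 0.0
    raise ValueError(f"unknown aggregation mode: {mode}")

# A binary criterion that passed (weight 1.0) and a likert score of 0.5 (weight 2.0).
scores, weights = [1.0, 0.5], [1.0, 2.0]
for mode in ("weighted_mean", "all_pass", "any_pass", "threshold"):
    print(mode, round(aggregate(scores, weights, mode, threshold=0.7), 3))
```

With these inputs the weighted mean is 2/3, so under the assumptions above `threshold = 0.7` yields 0.0 while `any_pass` yields 1.0.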
Multi-reward tasks
Put criteria in subdirectories; each becomes a separate reward:

```
tests/
├── test.sh
├── correctness/
│   └── check.py
├── structure/
│   └── files_exist.py
└── quality/
    └── quality.toml
```

Produces:

```json
{ "correctness": 0.75, "structure": 1.0, "quality": 0.6 }
```
Aggregating dimensions
To add aggregated scores on top of the per-dimension keys, add a root-level
`tests/reward.toml` with one or more `[[reward]]` tables. Each adds one key to
`reward.json`, aggregating the dimensions with the same modes as `[scoring]`:

```toml
# tests/reward.toml
[[reward]]
name = "reward"
aggregation = "all_pass"   # weighted_mean | all_pass | any_pass | threshold
threshold = 0.7            # only for threshold
```

Produces:

```json
{ "correctness": 0.75, "structure": 1.0, "quality": 0.6, "reward": 0.0 }
```

The per-dimension scores stay; aggregated keys are added alongside them (a
`name` may not collide with a dimension). Each dimension is weighted by the sum
of its criteria weights; `reward-details.json` keeps the full breakdown.
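The weighting rule (each dimension weighted by the sum of its criteria weights) can be illustrated for the `weighted_mean` mode. The dimension scores below come from the example output; the per-criterion weights are hypothetical, and the function is a sketch rather than Reward Kit's code.

```python
# Dimension scores from the example; criterion weights are made up for illustration.
dimensions = {
    "correctness": (0.75, [2.0, 1.0, 1.0]),
    "structure": (1.0, [1.0, 1.0]),
    "quality": (0.6, [1.0, 2.0]),
}

def weighted_mean_across_dimensions(dims):
    """Weight each dimension's score by the sum of its criteria weights."""
    weighted_sum = 0.0
    total_weight = 0.0
    for score, criterion_weights in dims.values():
        dim_weight = sum(criterion_weights)
        weighted_sum += score * dim_weight
        total_weight += dim_weight
    return weighted_sum / total_weight

overall = weighted_mean_across_dimensions(dimensions)
print(round(overall, 3))  # 0.756
```

Here the dimension weights are 4, 2 and 3, so the result is (0.75·4 + 1.0·2 + 0.6·3) / 9.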
Output files
- `/logs/verifier/reward.json`: per-reward scores
- `/logs/verifier/reward-details.json`: per-criterion results, judge reasoning, errors
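When debugging a verifier locally, it can help to load the per-reward scores and spot the weakest one. The sketch below assumes `reward.json` is a flat name-to-score mapping, as in the examples in this document; the helper names are hypothetical, and a temporary file stands in for `/logs/verifier/reward.json`.

```python
import json
import tempfile
from pathlib import Path

def load_rewards(path: Path) -> dict:
    """Read a flat {reward name: score} mapping from a reward.json file."""
    data = json.loads(path.read_text())
    return {name: float(score) for name, score in data.items()}

def lowest_reward(rewards: dict) -> tuple:
    """Return the (name, score) pair with the smallest score."""
    return min(rewards.items(), key=lambda item: item[1])

# Stand-in for /logs/verifier/reward.json so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "reward.json"
    sample.write_text('{ "correctness": 0.75, "structure": 1.0, "quality": 0.6 }')
    rewards = load_rewards(sample)
    weakest = lowest_reward(rewards)

print(weakest)  # ('quality', 0.6)
```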
Multi-step tasks
In a multi-step task, each step has its own `tests/` directory at
`steps/{name}/tests/`, and the verifier runs once per step. Reward Kit behaves
the same as in a single-step task: for each step it reads `/tests`, runs the
criteria against `/app`, and writes `/logs/verifier/reward.json` for that step.
Harbor then aggregates per-step results into a trial-level reward via
`multi_step_reward_strategy` in `task.toml`. Aggregation happens outside
Reward Kit, so don't try to encode cross-step logic in your criteria.

A task-level `tests/` directory (at the task root) is uploaded to `/tests`
first, then the step's own `tests/` is layered on top (on a name clash, the step's file wins).
Put shared helpers (common `checks.py` functions with `shared=True`, fixture
files, a fallback `test.sh`) at the task level, and step-specific criteria
under each step.

Multi-reward subdirectories still work within a step: `steps/foo/tests/`
can contain `correctness/`, `structure/`, `quality/`. Each produces a
separate reward key for that step, and `multi_step_reward_strategy = "mean"`
averages each key across steps. Use `"final"` when the last step is an
end-to-end check whose rewards already represent the full task.
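The difference between the `"mean"` and `"final"` strategies can be sketched as below. This illustrates the behavior described above and is not Harbor's implementation; the function name is hypothetical, and it assumes every step reports the same reward keys.

```python
def combine_steps(step_rewards, strategy):
    """Combine per-step reward dicts into one trial-level dict."""
    if not step_rewards:
        raise ValueError("at least one step is required")
    if strategy == "final":
        # Only the last step's rewards count.
        return dict(step_rewards[-1])
    if strategy == "mean":
        keys = set(step_rewards[0])
        if any(set(step) != keys for step in step_rewards):
            # Assumption of this sketch: all steps share the same keys.
            raise ValueError("all steps must report the same reward keys")
        n = len(step_rewards)
        return {key: sum(step[key] for step in step_rewards) / n for key in sorted(keys)}
    raise ValueError(f"unknown strategy: {strategy}")

# Hypothetical per-step reward.json contents for a two-step task.
steps = [
    {"correctness": 0.5, "quality": 0.4},
    {"correctness": 1.0, "quality": 0.8},
]
mean_rewards = combine_steps(steps, "mean")
final_rewards = combine_steps(steps, "final")
print({k: round(v, 2) for k, v in mean_rewards.items()})
print(final_rewards)
```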
When to reach for what
- Use built-ins for file existence, string matches, command output, JSON/CSV checks, HTTP probes.
- Use `@criterion` when logic is task-specific but still programmatic.
- Use LLM judges for subjective quality dimensions (readability, correctness of prose).
- Use agent judges when the rubric requires exploring the filesystem or running code (e.g. "does the test suite actually pass?").
- Use subdirectories when you want separate scores (correctness vs structure vs quality) rather than one blended number.
- Use `isolated=True` for any criterion that runs mutating commands, so it doesn't corrupt the workspace for other criteria.
Working example
See `examples/tasks/reward-kit-example/` in the Harbor repo.