rewardkit

Help the user write task verifiers with Reward Kit. Reward Kit is a lightweight Python package that turns a directory of criteria files into a reward score. Each criterion is a Python function call or a TOML judge file; folders become separate rewards.

Setup in a Harbor task

Put criteria alongside test.sh in the task's tests/ directory:
tests/
├── test.sh
├── checks.py         # programmatic criteria
└── judge.toml        # optional LLM/agent judge
tests/test.sh:
bash
#!/bin/bash
uvx --from 'harbor-rewardkit==0.1.*' rewardkit /tests
This runs all criteria in /tests/ against the workspace at /app and writes /logs/verifier/reward.json. Defaults match Harbor's conventions, so no extra config is needed.
If judge criteria need API keys, pass them through task.toml:
toml
[verifier.env]
ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}"
Ask whether Reward Kit should run in the agent's shared environment or in a separate verifier environment. Prefer a separate verifier environment when judge prompts, grading dependencies, API keys, or clean-room checks should not be available to the agent:
toml
[environment]
network_mode = "no-network"   # Agent env baseline — offline during agent.run()

[verifier]
environment_mode = "separate"

[verifier.environment]
network_mode = "public"     # Verifier env baseline — LLM judge API calls
docker_image = "python:3.12-slim"
In shared mode, the verifier runs in the agent container and inherits [environment].network_mode. Set [verifier].network_mode only when verify() needs different network access than the agent phase (a phase override, not a baseline). If agent and verifier need different baselines without runtime switching, use environment_mode = "separate" and set [verifier.environment].network_mode.
Judge criteria that call external APIs need a public baseline or allowlist on the verifier environment. Programmatic checks that only read local files can use no-network.
In separate mode, tests/ is the verifier image build context and must provide /tests/test.sh at runtime; Harbor does not upload tests/ into the running verifier container.

Programmatic criteria

Call built-ins from any .py file in tests/:
python
import rewardkit as rk

rk.file_exists("output.txt")
rk.file_contains("output.txt", "hello")
rk.command_succeeds("python main.py", weight=2.0)
rk.json_key_equals("result.json", "status", "ok")
All criteria accept weight (default 1.0) and isolated (default False; when True, the criterion runs in an overlayfs so side effects don't leak).
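As a rough illustration of how weight shifts a blended score, the sketch below computes a weighted mean over the four example criteria. It assumes the reward is a weighted mean of per-criterion scores (weighted_mean is one of the aggregation modes Reward Kit lists; treating it as the default is an assumption), and the pass/fail outcomes are invented for the example.

```python
def weighted_mean(results: list[tuple[float, float]]) -> float:
    """Return the weighted mean of (score, weight) pairs, or 0.0 if the total weight is zero."""
    total_weight = sum(weight for _, weight in results)
    if total_weight == 0:
        return 0.0
    return sum(score * weight for score, weight in results) / total_weight


# Hypothetical outcomes for the four criteria above, as (score, weight).
results = [
    (1.0, 1.0),  # file_exists passed
    (1.0, 1.0),  # file_contains passed
    (0.0, 2.0),  # command_succeeds failed and carries double weight
    (1.0, 1.0),  # json_key_equals passed
]
score = weighted_mean(results)
print(score)  # 3 of the 5 weight units passed
```

Under this assumption, a failed criterion with weight=2.0 costs twice as much as a failed criterion at the default weight.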

Available built-ins

  • Files: file_exists, file_not_exists, file_contains, file_contains_regex, file_matches, files_equal, diff_ratio
  • Commands: command_succeeds, command_output_contains, command_output_matches, command_output_matches_regex (30s default timeout, optional cwd)
  • Data: json_key_equals, json_path_equals, csv_cell_equals, xlsx_cell_equals (needs [office] extra), sqlite_query_equals
  • HTTP: http_status_equals, http_response_contains
  • Images: image_similarity, image_size_equals (needs [image] extra)
  • Trajectory: trajectory_tool_used, trajectory_tool_not_used, trajectory_turn_count
For extras, install with uv tool install harbor-rewardkit[all].

Custom criteria

Use the @criterion decorator. The first parameter is always workspace: Path. The function returns bool or float:
python
from pathlib import Path
from rewardkit import criterion

@criterion
def has_valid_output(workspace: Path) -> bool:
    return (workspace / "output.txt").read_text().strip() != ""
Criteria that take no parameters beyond workspace auto-register. Criteria with extra args must be called via rk:
python
from pathlib import Path

import rewardkit as rk
from rewardkit import criterion

@criterion(description="output has at least {n} lines")
def has_n_lines(workspace: Path, n: int) -> bool:
    return len((workspace / "output.txt").read_text().splitlines()) >= n

rk.has_n_lines(10, weight=2.0)
rk.has_n_lines(50, weight=1.0)
For criteria shared across reward subdirs, define them with shared=True in a root-level file and call them from subdirs.
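Since a criterion may return a float, partial credit is possible. The sketch below shows one way such a check might look: it returns the fraction of expected files that are present. The file names and the function name files_present_ratio are hypothetical, and the @criterion decorator is deliberately left off so the example runs without Reward Kit installed; in a real tests/ file the function would carry the decorator.

```python
import tempfile
from pathlib import Path

# Hypothetical file names, chosen only for this illustration.
EXPECTED_FILES = ("output.txt", "result.json", "README.md")


def files_present_ratio(workspace: Path) -> float:
    """Return the fraction of EXPECTED_FILES that exist in the workspace."""
    present = sum((workspace / name).is_file() for name in EXPECTED_FILES)
    return present / len(EXPECTED_FILES)


# Demo against a throwaway directory holding two of the three files.
with tempfile.TemporaryDirectory() as tmp:
    demo_workspace = Path(tmp)
    (demo_workspace / "output.txt").write_text("hello\n")
    (demo_workspace / "result.json").write_text('{"status": "ok"}\n')
    demo_ratio = files_present_ratio(demo_workspace)

print(f"{demo_ratio:.2f}")
```

Returning a ratio rather than a bool lets a partially complete workspace earn part of the criterion's weight.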

Judge criteria (LLM or agent-as-a-judge)

For subjective checks (quality, readability, edge cases), create a TOML file:
toml
[judge]
judge = "anthropic/claude-sonnet-4-6"   # LiteLLM model string
files = ["/app/main.py"]

[[criterion]]
description = "Is the code correct?"
type = "binary"

[[criterion]]
description = "How readable is the code?"
type = "likert"
points = 5
weight = 2.0
Criterion types:
  • binary: yes/no → 1.0 or 0.0
  • likert: 1..points, normalized to [0, 1]
  • numeric: min..max, normalized to [0, 1]
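The three criterion types above all end up as a score in [0, 1]. The exact formula Reward Kit applies is not stated here, so the sketch below assumes plain linear scaling (a likert rating of 1 maps to 0.0 and a rating of points maps to 1.0); the function names are hypothetical.

```python
def normalize_binary(answer: bool) -> float:
    """Map a yes/no verdict to 1.0 or 0.0."""
    return 1.0 if answer else 0.0


def normalize_likert(rating: int, points: int) -> float:
    """Map a rating in 1..points linearly onto [0, 1] (assumed scaling)."""
    if points < 2:
        raise ValueError("points must be at least 2")
    if not 1 <= rating <= points:
        raise ValueError("rating must lie in 1..points")
    return (rating - 1) / (points - 1)


def normalize_numeric(value: float, minimum: float, maximum: float) -> float:
    """Map a value in minimum..maximum linearly onto [0, 1], clamping out-of-range input."""
    if maximum <= minimum:
        raise ValueError("maximum must be greater than minimum")
    clamped = min(max(value, minimum), maximum)
    return (clamped - minimum) / (maximum - minimum)


print(normalize_binary(True))
print(normalize_likert(4, 5))
print(normalize_numeric(7, 0, 10))
```

If Reward Kit instead divides the rating by points, a rating of 1 would score above zero; check reward-details.json on a real run to see which scaling applies.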

Agent judges

Agent judges shell out to a CLI and can explore the filesystem:
toml
[judge]
judge = "claude-code"
model = "anthropic/claude-sonnet-4-6"
isolated = true

[[criterion]]
description = "Does the solution handle edge cases?"
type = "binary"
Agent judges are slower and more expensive than LLM judges, but they can run commands and inspect files.

Useful [judge] options

timeout (default 300), reasoning_effort (low | medium | high), reference (path to reference solution), atif-trajectory (evaluate the agent's trajectory), weight, prompt_template (custom prompt with {criteria} placeholder).

Scoring aggregation (within one judge TOML)

toml
[scoring]
aggregation = "all_pass"   # weighted_mean | all_pass | any_pass | threshold
threshold = 0.7             # only for threshold
Only affects how this file's own criteria combine. To aggregate across dimensions, see Aggregating dimensions.
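To make the four aggregation modes concrete, here is a minimal sketch of how they might combine per-criterion scores. The semantics are assumptions, not confirmed behavior: "pass" is taken to mean a score of 1.0, and threshold is taken to compare the weighted mean against the configured value. The function name aggregate is hypothetical.

```python
def aggregate(results: list[tuple[float, float]], mode: str, threshold: float = 0.7) -> float:
    """Combine (score, weight) pairs under one of the four modes (assumed semantics)."""
    total_weight = sum(weight for _, weight in results)
    mean = sum(s * w for s, w in results) / total_weight if total_weight else 0.0
    if mode == "weighted_mean":
        return mean
    if mode == "all_pass":
        # Assumed: every criterion must reach the full score.
        return 1.0 if results and all(s >= 1.0 for s, _ in results) else 0.0
    if mode == "any_pass":
        # Assumed: one criterion at full score is enough.
        return 1.0 if any(s >= 1.0 for s, _ in results) else 0.0
    if mode == "threshold":
        # Assumed: the weighted mean is compared against the threshold.
        return 1.0 if mean >= threshold else 0.0
    raise ValueError(f"unknown aggregation mode: {mode}")


# Hypothetical judge output: a binary criterion that passed (weight 1.0)
# and a 5-point likert criterion rated 4, scored as 0.75 (weight 2.0).
judged = [(1.0, 1.0), (0.75, 2.0)]
for mode in ("weighted_mean", "all_pass", "any_pass", "threshold"):
    print(mode, round(aggregate(judged, mode), 3))
```

Under these assumptions, all_pass is the strictest choice, so it suits rubrics where any shortfall should zero the reward.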

Multi-reward tasks

Put criteria in subdirectories — each becomes a separate reward:
tests/
├── test.sh
├── correctness/
│   └── check.py
├── structure/
│   └── files_exist.py
└── quality/
    └── quality.toml
Produces:
json
{ "correctness": 0.75, "structure": 1.0, "quality": 0.6 }

Aggregating dimensions

To add aggregated scores on top of the per-dimension keys, add a root-level tests/reward.toml with one or more [[reward]] tables. Each adds one key to reward.json, aggregating the dimensions with the same modes as [scoring]:
toml
# tests/reward.toml
[[reward]]
name = "reward"
aggregation = "all_pass"   # weighted_mean | all_pass | any_pass | threshold
threshold = 0.7             # only for threshold
Produces:
json
{ "correctness": 0.75, "structure": 1.0, "quality": 0.6, "reward": 0.0 }
The per-dimension scores stay; aggregated keys are added alongside them (a name may not collide with a dimension). Each dimension is weighted by the sum of its criteria weights; reward-details.json keeps the full breakdown.

Output files

  • /logs/verifier/reward.json: per-reward scores
  • /logs/verifier/reward-details.json: per-criterion results, judge reasoning, errors
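When debugging a verifier locally, it can help to load reward.json and print the scores. The sketch below assumes the file is a flat JSON object mapping reward names to numbers, as in the sample outputs in this document; the helper names are hypothetical, and the demo writes a temporary file rather than reading /logs/verifier/.

```python
import json
import tempfile
from pathlib import Path


def load_rewards(path: Path) -> dict[str, float]:
    """Load a reward.json file, assumed to be a flat object of reward name -> score."""
    data = json.loads(path.read_text())
    return {name: float(score) for name, score in data.items()}


def format_rewards(rewards: dict[str, float]) -> str:
    """Render one 'name score' line per reward, sorted by name."""
    return "\n".join(f"{name:<12} {score:.2f}" for name, score in sorted(rewards.items()))


# Demo with the sample scores from this document, written to a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "reward.json"
    sample_path.write_text('{ "correctness": 0.75, "structure": 1.0, "quality": 0.6 }')
    rewards = load_rewards(sample_path)

print(format_rewards(rewards))
```

The layout of reward-details.json is not described here, so the sketch leaves it alone.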

Multi-step tasks

In a multi-step task, each step has its own tests directory at steps/{name}/tests/, and the verifier runs once per step. Reward Kit behaves the same as in a single-step task: for each step it reads /tests, runs the criteria against /app, and writes /logs/verifier/reward.json for that step. Harbor then aggregates per-step results into a trial-level reward via multi_step_reward_strategy in task.toml. Aggregation happens outside Reward Kit, so don't try to encode cross-step logic in your criteria.
A task-level tests/ directory (at the task root) is uploaded to /tests first, then the step's own tests/ is layered on top (on a name clash, the step's file wins). Put shared helpers (common checks.py functions with shared=True, fixture files, a fallback test.sh) at the task level, and step-specific criteria under each step.
Multi-reward subdirectories still work within a step: steps/foo/tests/ can contain correctness/, structure/, quality/, and each produces a separate reward key for that step; multi_step_reward_strategy = "mean" averages each key across steps. Use "final" when the last step is an end-to-end check whose rewards already represent the full task.
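The difference between the two strategies can be sketched in a few lines. This is not Harbor's implementation: the function name combine_steps is hypothetical, and averaging a key only over the steps that report it is an assumption, since the handling of missing keys is not described here.

```python
def combine_steps(step_rewards: list[dict[str, float]], strategy: str = "mean") -> dict[str, float]:
    """Combine per-step reward dicts into one trial-level dict (assumed semantics)."""
    if not step_rewards:
        return {}
    if strategy == "final":
        # Only the last step's rewards count.
        return dict(step_rewards[-1])
    if strategy == "mean":
        combined = {}
        for key in sorted({k for rewards in step_rewards for k in rewards}):
            # Assumed: average over the steps that actually report this key.
            values = [rewards[key] for rewards in step_rewards if key in rewards]
            combined[key] = sum(values) / len(values)
        return combined
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical rewards from a two-step task.
steps = [
    {"correctness": 0.5, "quality": 1.0},
    {"correctness": 1.0, "quality": 0.5},
]
print(combine_steps(steps, "mean"))
print(combine_steps(steps, "final"))
```

With "mean", an early weak step drags the trial score down even if the final state is correct, which is why "final" fits tasks whose last step checks everything end to end.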

When to reach for what

  • Use built-ins for file existence, string matches, command output, JSON/CSV checks, HTTP probes.
  • Use @criterion when logic is task-specific but still programmatic.
  • Use LLM judges for subjective quality dimensions (readability, correctness of prose).
  • Use agent judges when the rubric requires exploring the filesystem or running code (e.g. "does the test suite actually pass?").
  • Use subdirectories when you want separate scores (correctness vs structure vs quality) rather than one blended number.
  • Use isolated=True for any criterion that runs mutating commands, so it doesn't corrupt the workspace for other criteria.

Working example

See examples/tasks/reward-kit-example/ in the Harbor repo.