rewardkit

Help the user write task verifiers with Reward Kit. Reward Kit is a lightweight Python package that turns a directory of criteria files into a reward score. Each criterion is a Python function call or a TOML judge file; folders become separate rewards.

Setup in a Harbor task

Put criteria alongside `test.sh` in the task's `tests/` directory:

tests/
├── test.sh
├── checks.py         # programmatic criteria
└── judge.toml        # optional LLM/agent judge

tests/test.sh

bash

#!/bin/bash
uvx --from 'harbor-rewardkit==0.1.*' rewardkit /tests

This runs all criteria in `/tests/` against the workspace at `/app` and writes `/logs/verifier/reward.json`. Defaults match Harbor's conventions, so no extra config is needed.

If judge criteria need API keys, pass them through `task.toml`:

toml

[verifier.env]
ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}"

Ask whether Reward Kit should run in the agent's shared environment or in a separate verifier environment. Prefer a separate verifier environment when judge prompts, grading dependencies, API keys, or clean-room checks should not be available to the agent:

toml

[environment]
network_mode = "no-network"   # Agent env baseline — offline during agent.run()

[verifier]
environment_mode = "separate"

[verifier.environment]
network_mode = "public"     # Verifier env baseline — LLM judge API calls
docker_image = "python:3.12-slim"

In shared mode, the verifier runs in the agent container and inherits `[environment].network_mode`. Set `[verifier].network_mode` only when verify() needs different network access than the agent phase (a phase override, not a baseline). If agent and verifier need different baselines without runtime switching, use `environment_mode = "separate"` and set `[verifier.environment].network_mode`.

Judge criteria that call external APIs need a `public` baseline or allowlist on the verifier environment. Programmatic checks that only read local files can use `no-network`.

In separate mode, `tests/` is the verifier image build context and must provide `/tests/test.sh` at runtime; Harbor does not upload `tests/` into the running verifier container.

Programmatic criteria

Call built-ins from any `.py` file in `tests/`:

python

import rewardkit as rk

rk.file_exists("output.txt")
rk.file_contains("output.txt", "hello")
rk.command_succeeds("python main.py", weight=2.0)
rk.json_key_equals("result.json", "status", "ok")

All criteria accept `weight` (default `1.0`) and `isolated` (default `False`; when enabled, the criterion runs in overlayfs so side effects don't leak).

Available built-ins

Files: `file_exists`, `file_not_exists`, `file_contains`, `file_contains_regex`, `file_matches`, `files_equal`, `diff_ratio`

Commands: `command_succeeds`, `command_output_contains`, `command_output_matches`, `command_output_matches_regex` (30s default timeout, optional `cwd`)

Data: `json_key_equals`, `json_path_equals`, `csv_cell_equals`, `xlsx_cell_equals` (needs `[office]` extra), `sqlite_query_equals`

HTTP: `http_status_equals`, `http_response_contains`

Images: `image_similarity`, `image_size_equals` (needs `[image]` extra)

Trajectory: `trajectory_tool_used`, `trajectory_tool_not_used`, `trajectory_turn_count`

For extras, install with `uv tool install harbor-rewardkit[all]`.

Custom criteria

Use the `@criterion` decorator. The first parameter is always `workspace: Path`. The function returns a `bool` or a `float`:

python

from pathlib import Path
from rewardkit import criterion

@criterion
def has_valid_output(workspace: Path) -> bool:
    return (workspace / "output.txt").read_text().strip() != ""

Zero-parameter criteria auto-register. Criteria with extra args must be called via `rk`:

python

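from pathlib import Path

import rewardkit as rk
from rewardkit import criterion

@criterion(description="output has at least {n} lines")
def has_n_lines(workspace: Path, n: int) -> bool:
    return len((workspace / "output.txt").read_text().splitlines()) >= n

# Call via rk once per parameter set, each with its own weight
rk.has_n_lines(10, weight=2.0)
rk.has_n_lines(50, weight=1.0)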

For criteria shared across reward subdirs, define them with `shared=True` in a root-level file and call them from subdirs.
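Because a criterion may return a `float`, partial credit is possible. The sketch below shows only the scoring logic as a plain function, so it runs without Reward Kit installed. The name `fraction_of_files_present` and its `expected` parameter are hypothetical; registering it would still mean decorating it with `@criterion` and calling it via `rk`, as described above.

```python
import tempfile
from pathlib import Path


def fraction_of_files_present(workspace: Path, expected: list[str]) -> float:
    """Return the share of expected files that exist under workspace (0.0 to 1.0)."""
    if not expected:
        return 1.0  # nothing was required, so nothing is missing
    found = sum(1 for name in expected if (workspace / name).is_file())
    return found / len(expected)


# Demo against a throwaway workspace that holds one of the two expected files.
with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    (workspace / "output.txt").write_text("hello\n")
    print(fraction_of_files_present(workspace, ["output.txt", "result.json"]))  # 0.5
```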

Judge criteria (LLM or agent-as-a-judge)

For subjective checks (quality, readability, edge cases), create a TOML file:

toml

[judge]
judge = "anthropic/claude-sonnet-4-6"   # LiteLLM model string
files = ["/app/main.py"]

[[criterion]]
description = "Is the code correct?"
type = "binary"

[[criterion]]
description = "How readable is the code?"
type = "likert"
points = 5
weight = 2.0

Criterion types:

`binary`: yes/no → 1.0 or 0.0
`likert`: 1..points, normalized to [0, 1]
`numeric`: min..max, normalized to [0, 1]
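The normalizations above can be sketched in plain Python. This is an illustration rather than Reward Kit's implementation: the function names are hypothetical, and the linear mapping (lowest rating to 0, highest to 1) is an assumption about how "normalized to [0, 1]" is computed.

```python
def normalize_binary(passed: bool) -> float:
    """Map a yes/no verdict to 1.0 or 0.0."""
    return 1.0 if passed else 0.0


def normalize_range(value: float, low: float, high: float) -> float:
    """Linearly map value from [low, high] onto [0, 1], clamping out-of-range input."""
    if high <= low:
        raise ValueError("high must be greater than low")
    clamped = min(max(value, low), high)
    return (clamped - low) / (high - low)


def normalize_likert(rating: int, points: int) -> float:
    """Assumed mapping: a 1..points rating becomes (rating - 1) / (points - 1)."""
    return normalize_range(rating, 1, points)


print(normalize_binary(True))         # 1.0
print(normalize_likert(4, points=5))  # 0.75
print(normalize_range(30, 0, 120))    # 0.25
```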

Agent judges

Agent judges shell out to a CLI and can explore the filesystem:

toml

[judge]
judge = "claude-code"
model = "anthropic/claude-sonnet-4-6"
isolated = true

[[criterion]]
description = "Does the solution handle edge cases?"
type = "binary"

Slower and more expensive than LLM judges, but they can run commands and inspect files.

Useful `[judge]` options

`timeout` (default 300), `reasoning_effort` (`low`, `medium`, or `high`), `reference` (path to reference solution), `atif-trajectory` (evaluate the agent's trajectory), `weight`, `prompt_template` (custom prompt with `{criteria}` placeholder).

Scoring aggregation (within one judge TOML)

toml

[scoring]
aggregation = "all_pass"   # weighted_mean | all_pass | any_pass | threshold
threshold = 0.7             # only for threshold

Only affects how this file's own criteria combine. To aggregate across dimensions, see Aggregating dimensions.
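The four mode names come from the config above, but their exact semantics are not spelled out here. The sketch below is one plausible reading, with every rule an assumption: `all_pass` and `any_pass` treat a score of 1.0 as a pass, and `threshold` compares the weighted mean against the threshold. The function `aggregate` is hypothetical and not part of Reward Kit.

```python
from typing import Sequence


def aggregate(scores: Sequence[float], weights: Sequence[float],
              mode: str = "weighted_mean", threshold: float = 0.5) -> float:
    """Combine per-criterion scores in [0, 1] into one score (assumed semantics)."""
    if not scores or len(scores) != len(weights):
        raise ValueError("scores and weights must be non-empty and the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    mean = sum(s * w for s, w in zip(scores, weights)) / total_weight
    if mode == "weighted_mean":
        return mean
    if mode == "all_pass":   # assumed: every criterion must score a full 1.0
        return 1.0 if all(s >= 1.0 for s in scores) else 0.0
    if mode == "any_pass":   # assumed: one fully passing criterion is enough
        return 1.0 if any(s >= 1.0 for s in scores) else 0.0
    if mode == "threshold":  # assumed: weighted mean compared against threshold
        return 1.0 if mean >= threshold else 0.0
    raise ValueError(f"unknown aggregation mode: {mode}")


scores, weights = [1.0, 0.5], [1.0, 1.0]
print(aggregate(scores, weights))                              # 0.75
print(aggregate(scores, weights, "all_pass"))                  # 0.0
print(aggregate(scores, weights, "any_pass"))                  # 1.0
print(aggregate(scores, weights, "threshold", threshold=0.7))  # 1.0
```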

Multi-reward tasks

Put criteria in subdirectories — each becomes a separate reward:

tests/
├── test.sh
├── correctness/
│   └── check.py
├── structure/
│   └── files_exist.py
└── quality/
    └── quality.toml

Produces:

json

{ "correctness": 0.75, "structure": 1.0, "quality": 0.6 }
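To preview which reward keys a layout would yield, the subdirectory-to-reward mapping can be approximated in a few lines. This is a guess at the discovery rule, not Reward Kit's code: it assumes every subdirectory holding at least one `.py` or `.toml` file becomes a reward named after the directory. The helper `reward_names` is hypothetical.

```python
import tempfile
from pathlib import Path


def reward_names(tests_dir: Path) -> list[str]:
    """Return names of subdirectories that hold at least one .py or .toml file."""
    names = []
    for child in sorted(tests_dir.iterdir()):
        if child.is_dir() and any(f.suffix in {".py", ".toml"} for f in child.iterdir()):
            names.append(child.name)
    return names


# Rebuild the example layout in a temporary directory and list its reward keys.
with tempfile.TemporaryDirectory() as tmp:
    tests = Path(tmp) / "tests"
    for rel in ["test.sh", "correctness/check.py",
                "structure/files_exist.py", "quality/quality.toml"]:
        path = tests / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    print(reward_names(tests))  # ['correctness', 'quality', 'structure']
```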

Aggregating dimensions

To add aggregated scores on top of the per-dimension keys, add a root-level `tests/reward.toml` with one or more `[[reward]]` tables. Each adds one key to `reward.json`, aggregating the dimensions with the same modes as `[scoring]`:

toml

# tests/reward.toml
[[reward]]
name = "reward"
aggregation = "all_pass"   # weighted_mean | all_pass | any_pass | threshold
threshold = 0.7             # only for threshold

Produces:

json

{ "correctness": 0.75, "structure": 1.0, "quality": 0.6, "reward": 0.0 }

The per-dimension scores stay; aggregated keys are added alongside them (a `name` may not collide with a dimension). Each dimension is weighted by the sum of its criteria weights; `reward-details.json` keeps the full breakdown.
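The weighting rule can be made concrete with a small calculation. In this sketch the dimension scores are taken from the example output, while the per-criterion weights are invented for illustration, and `weighted_mean_of_dimensions` is a hypothetical helper rather than a Reward Kit function.

```python
def weighted_mean_of_dimensions(scores: dict[str, float],
                                criteria_weights: dict[str, list[float]]) -> float:
    """Average dimension scores, weighting each by the sum of its criteria weights."""
    total = 0.0
    weight_sum = 0.0
    for name, score in scores.items():
        dimension_weight = sum(criteria_weights[name])
        total += score * dimension_weight
        weight_sum += dimension_weight
    if weight_sum <= 0:
        raise ValueError("criteria weights must sum to a positive value")
    return total / weight_sum


scores = {"correctness": 0.75, "structure": 1.0, "quality": 0.6}
# Hypothetical criteria weights per dimension (sums: 4.0, 1.0, 3.0).
criteria_weights = {
    "correctness": [1.0, 1.0, 2.0],
    "structure": [1.0],
    "quality": [1.0, 2.0],
}
# (0.75 * 4 + 1.0 * 1 + 0.6 * 3) / 8
print(round(weighted_mean_of_dimensions(scores, criteria_weights), 3))  # 0.725
```

Under these assumed weights, `correctness` carries half of the total weight, so it moves a `weighted_mean` aggregate four times as much as `structure` does.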


Output files

`/logs/verifier/reward.json`: per-reward scores
`/logs/verifier/reward-details.json`: per-criterion results, judge reasoning, errors
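When debugging a verifier locally, a short script can load `reward.json` and confirm it has the flat name-to-score shape shown in the examples in this document. The sketch below assumes that shape and nothing else; `load_rewards` is a hypothetical helper, and the schema of `reward-details.json` is not covered because it is not described here.

```python
import json
import tempfile
from pathlib import Path


def load_rewards(path: Path) -> dict[str, float]:
    """Load reward.json and check that it is a flat mapping of name -> number."""
    data = json.loads(path.read_text())
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object of reward names to scores")
    for name, score in data.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            raise ValueError(f"reward {name!r} is not numeric: {score!r}")
    return {name: float(score) for name, score in data.items()}


# Demo with a sample file; on a real run the path would be /logs/verifier/reward.json.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "reward.json"
    sample.write_text('{ "correctness": 0.75, "structure": 1.0, "quality": 0.6 }')
    for name, score in load_rewards(sample).items():
        print(f"{name}: {score:.2f}")
```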

Multi-step tasks

In a multi-step task, each step has its own `tests/` under `steps/{name}/tests/`, and the verifier runs once per step. Reward Kit behaves the same as in a single-step task: for each step it reads `/tests`, runs the criteria against `/app`, and writes `/logs/verifier/reward.json` for that step. Harbor then aggregates per-step results into a trial-level reward via `multi_step_reward_strategy` in `task.toml`. Aggregation happens outside Reward Kit, so don't try to encode cross-step logic in your criteria.

A task-level `tests/` directory (at the task root) is uploaded to `/tests` first, then the step's own `tests/` is layered on top (same-name files win). Put shared helpers (common `checks.py` functions with `shared=True`, fixture files, a fallback `test.sh`) at the task level, and step-specific criteria under each step.

Multi-reward subdirectories still work within a step: `steps/foo/tests/` can contain `correctness/`, `structure/`, and `quality/`. Each produces a separate reward key for that step, and `multi_step_reward_strategy = "mean"` averages each key across steps. Use `"final"` when the last step is an end-to-end check whose rewards already represent the full task.
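The two strategies named above can be illustrated with per-step reward dictionaries. This sketch mimics what Harbor is described as doing and is not Harbor's code: `combine_steps` is a hypothetical function, and skipping a key that a step does not report (rather than counting it as 0) is an assumption.

```python
def combine_steps(step_rewards: list[dict[str, float]],
                  strategy: str = "mean") -> dict[str, float]:
    """Combine per-step reward dicts into one trial-level dict."""
    if not step_rewards:
        return {}
    if strategy == "final":
        # Only the last step's rewards count.
        return dict(step_rewards[-1])
    if strategy == "mean":
        keys = sorted({key for step in step_rewards for key in step})
        combined = {}
        for key in keys:
            # Assumption: steps that lack this key are left out of its average.
            values = [step[key] for step in step_rewards if key in step]
            combined[key] = sum(values) / len(values)
        return combined
    raise ValueError(f"unknown strategy: {strategy}")


steps = [
    {"correctness": 0.5, "quality": 0.5},
    {"correctness": 1.0, "quality": 0.75},
]
print(combine_steps(steps, "mean"))   # {'correctness': 0.75, 'quality': 0.625}
print(combine_steps(steps, "final"))  # {'correctness': 1.0, 'quality': 0.75}
```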

When to reach for what

Use built-ins for file existence, string matches, command output, JSON/CSV checks, HTTP probes.
Use `@criterion` when logic is task-specific but still programmatic.
Use LLM judges for subjective quality dimensions (readability, correctness of prose).
Use agent judges when the rubric requires exploring the filesystem or running code (e.g. "does the test suite actually pass?").
Use subdirectories when you want separate scores (correctness vs structure vs quality) rather than one blended number.
Use `isolated=True` for any criterion that runs mutating commands, so it doesn't corrupt the workspace for other criteria.

Working example

See `examples/tasks/reward-kit-example/` in the Harbor repo.
