create-task
Guide the user through creating a new Harbor task end-to-end. Don't just dump commands —
walk them through each decision, especially around the verifier (which is usually the
hardest part).
Step 1: Scaffold the task
```bash
harbor task init "<org>/<task-name>"
```

Useful flags:
- `--description "..."` — set the task description
- `--author "Jane Doe <jane@example.com>"` — set an author (repeat for multiple authors)
- `--no-pytest` — skip the pytest test template (use if planning Reward Kit or custom verifier)
- `--no-solution` — skip the solution/ directory
- `--metadata-template path.toml` — pre-populate task.toml

Produces:

```
<task-name>/
├── instruction.md           # Task prompt for the agent
├── task.toml                # Config and metadata
├── environment/Dockerfile   # Container definition
├── solution/solve.sh        # Reference solution (optional)
└── tests/test.sh            # Verifier script
```

If the user wants a multi-step task (ordered steps with per-step instructions, tests, and early stopping against a shared container), scaffold the single-step layout first, then convert to the `steps/` layout described in the Multi-step tasks section below.
Step 2: Write instruction.md
This is the prompt the agent receives. Help the user write it clearly:
- State the goal concretely — what file to create, what behavior to produce
- Specify expected outputs — paths, formats, content
- Include constraints — language, tools, approach
- Don't leak the tests — describe what "done" looks like, not how you'll check it
Example (from the ssh-key-pair tutorial):
```markdown
# SSH Key Pair Generation

Generate an SSH key pair in the files `~/.ssh/id_rsa` and `~/.ssh/id_rsa.pub`.
Don't make them password protected.
```
Step 3: Build the environment
Edit `environment/Dockerfile` to install dependencies the task needs. The agent works inside this container.

```dockerfile
FROM ubuntu:24.04
WORKDIR /app

# Install what the task requires — NOT the solution
RUN apt-get update && apt-get install -y openssh-client && rm -rf /var/lib/apt/lists/*
```
For multi-container setups, use `environment/docker-compose.yaml` instead (note: most
cloud sandbox providers only support Dockerfile).
**Test the environment interactively** before writing the solution or tests:
```bash
harbor task start-env -p "<task-path>" -e docker -a -i
```

This is usually where task authors realize something is missing from the Dockerfile.
Step 4: Decide how to verify
This is the most important decision. Ask the user: "How do you want to grade this
task?" Then help them pick:
Option A: Reward Kit (recommended for most cases)
Use when the verifier has multiple criteria, needs partial credit, uses an LLM/agent judge, or would benefit from composable reusable checks. See the `rewardkit` skill.

Good fit signals:
- Multiple things to check (file exists + content correct + command works)
- Subjective quality dimensions (readability, correctness of prose)
- Want partial credit rather than pass/fail
- Want to compose built-ins like `file_contains`, `command_succeeds`, `json_key_equals`

`tests/test.sh`:

```bash
#!/bin/bash
uvx --from 'harbor-rewardkit==0.1.*' rewardkit /tests
```

Note: the package is named `harbor-rewardkit` but the executable is `rewardkit`, hence `--from 'harbor-rewardkit==0.1.*' rewardkit`. Running `uvx harbor-rewardkit` directly will fail.

Then add `tests/checks.py` and/or `tests/judge.toml`. Invoke the `rewardkit` skill to design the criteria.
tests/checks.pytests/judge.tomlrewardkit适用于验证器包含多个评估标准、需要部分得分、使用LLM/Agent评判,或可从可组合的复用检查中受益的场景。参考技能文档。
rewardkit适用场景特征:
- 需要检查多个维度(文件存在性+内容正确性+命令可执行)
- 存在主观质量维度(可读性、文本正确性)
- 需要部分得分而非仅通过/不通过
- 需要组合内置检查项,如、
file_contains、command_succeedsjson_key_equals
tests/test.shbash
#!/bin/bash
uvx --from 'harbor-rewardkit==0.1.*' rewardkit /tests注意:包名为但可执行文件名为,因此需使用。直接运行会失败。
harbor-rewardkitrewardkit--from 'harbor-rewardkit==0.1.*' rewardkituvx harbor-rewardkit随后添加和/或。调用技能设计评估标准。
Option B: pytest (good for deterministic unit-style checks)
Use when the verification is straightforward assertion-style Python. This is the default template if `--no-pytest` wasn't passed.

`tests/test.sh`:

```bash
#!/bin/bash
apt-get update && apt-get install -y curl
curl -LsSf https://astral.sh/uv/0.9.7/install.sh | sh
source $HOME/.local/bin/env
uvx --with pytest==8.4.1 pytest /tests/test_outputs.py
if [ $? -eq 0 ]; then
  echo 1 > /logs/verifier/reward.txt
else
  echo 0 > /logs/verifier/reward.txt
fi
```

Example `tests/test_outputs.py`:

```python
from pathlib import Path

def test_file_exists():
    assert (Path.home() / ".ssh" / "id_rsa").exists()
```
Option C: Custom shell
For simple single-command checks (e.g. a binary pass/fail from one command):
```bash
#!/bin/bash
if diff -q /app/output.txt /tests/expected.txt; then
  echo 1 > /logs/verifier/reward.txt
else
  echo 0 > /logs/verifier/reward.txt
fi
```
Reward file format (all options)
- `/logs/verifier/reward.txt` — a single number (usually `0` or `1`)
- `/logs/verifier/reward.json` — for multiple metrics, e.g. `{"accuracy": 0.95, "runtime_sec": 1.2}`

Always use absolute paths in `test.sh`.
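For the JSON form, a tiny helper illustrates the expected shape — a flat object of numeric metrics written to an absolute path. The helper itself is hypothetical; Harbor only reads the file contents, not how you produce them:

```python
import json

def write_reward(metrics: dict, path: str = "/logs/verifier/reward.json") -> str:
    # A flat JSON object of numeric metrics, written to an absolute path
    payload = json.dumps(metrics)
    with open(path, "w") as f:
        f.write(payload)
    return payload
```

Writing the file in one place, from a dict you build up as checks run, also makes it harder to hit the "forgot to write the reward file" pitfall.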
Step 5: Write the solution
Write `solution/solve.sh` — a script that actually solves the task. The Oracle agent runs this to sanity-check that the task is solvable and that the tests pass on a correct solution.

```bash
#!/bin/bash
ssh-keygen -t rsa -f ~/.ssh/id_rsa -N ""
```

Make it executable: `chmod +x solution/solve.sh`.
chmod +x solution/solve.sh编写——实际解决任务的脚本。Oracle Agent将运行此脚本,以验证任务可解且正确解决方案能通过测试。
solution/solve.shbash
#!/bin/bash
ssh-keygen -t rsa -f ~/.ssh/id_rsa -N ""设置可执行权限:。
Step 6: Configure task.toml
Walk through the important fields:

```toml
[task]
name = "<org>/<task-name>"
description = "One-line description"
keywords = ["jax", "mnist", "rewardkit"]  # always populate — used for search/filtering

[metadata]
difficulty = "easy" | "medium" | "hard"
category = "programming" | "machine-learning" | "gpu" | ...
tags = ["..."]

[environment]
cpus = 1               # CPU cores
memory_mb = 2048       # RAM in MB
storage_mb = 10240     # Disk in MB
allow_internet = true  # Network access

[agent]
timeout_sec = 120.0    # How long the agent has

[verifier]
timeout_sec = 600.0    # How long tests have
```

Always populate `keywords`. Pick 3–8 lowercase tokens covering the domain (language/framework/benchmark family), the verifier style (`rewardkit`, `judge-grading`, `pytest`), and any notable hardware (`gpu`). They're surfaced in `harbor datasets list` and registry search.

For Reward Kit judges needing API keys:

```toml
[verifier.env]
ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}"
```
Step 7: Verify with the Oracle agent
```bash
harbor run -p "<task-path>" -a oracle
```

Oracle runs `solution/solve.sh` and then the verifier. Reward should be `1.0`. If it's not, debug in this order:
- Does `solve.sh` actually solve it? (`start-env -a -i` and run it manually)
- Does the verifier correctly detect success? (check `/logs/verifier/` output)
- Are paths correct? (absolute vs relative)
- Are dependencies installed in the Dockerfile?
Step 8: Test with a real agent (optional)
```bash
harbor run -p "<task-path>" -a terminus-2 -m anthropic/claude-sonnet-4-6
```

If the task is too easy (every model scores 1.0) or impossible (every model scores 0.0), consider adjusting difficulty.
Step 9: Update README.md (always the final step)
Fill in the `README.md` stub that `harbor task init` created:
- What the agent does — one paragraph, link to `instruction.md`.
- Environment — base image, key installed packages, cached data, hardware (GPU/CPU/RAM), agent timeout.
- Verifier — for Reward Kit tasks, a table of reward dimensions with type (programmatic / LLM judge / agent judge) and what each measures; how they're aggregated.
- Layout — a tree of the task directory with one-line annotations.
- Running — the concrete `harbor run` commands (Oracle + real agent), with the right provider flag if the task needs a GPU.

Treat this as docs, not marketing — the reader wants to know what they'd need to change to modify the task.
Multi-step tasks
Use when the work splits into ordered phases that should be scored separately,
when you want early stopping between phases, or when you're testing an agent's
ability to build on its own prior work. Steps share one container; files
persist across steps.
Directory layout
Replace the task-root `instruction.md`, `tests/`, and `solution/` with a `steps/` directory containing one sub-directory per step:

```
<task-name>/
├── task.toml
├── environment/Dockerfile       # Built once, shared across all steps
├── steps/
│   ├── scaffold/
│   │   ├── instruction.md       # Prompt for this step
│   │   ├── workdir/             # Uploaded to WORKDIR before the agent runs
│   │   │   └── setup.sh         # Optional pre-agent hook (reserved filename)
│   │   ├── tests/test.sh        # Per-step verifier
│   │   └── solution/solve.sh    # Per-step Oracle solution (optional)
│   ├── implement/
│   │   └── ...
│   └── document/
│       └── ...
└── tests/                       # Optional shared helpers + fallback test.sh
```

Task-level `tests/` is uploaded to `/tests` for each step's verification, then the step's own `tests/` is layered on top (same-name files win). Use this for shared helpers. The reserved `steps/{name}/workdir/setup.sh` hook runs before the agent; have it remove itself (`rm -- "$0"`) so it isn't left behind in `workdir/`.
task.toml
```toml
schema_version = "1.1"

[task]
name = "<org>/<task-name>"

# How per-step rewards roll up into the trial-level verifier_result.
# "mean" (default): per-key mean across steps that produced a result.
# "final": the last step's verifier_result verbatim.
multi_step_reward_strategy = "mean"

[[steps]]
name = "scaffold"      # Must match the directory under steps/
min_reward = 1.0       # Abort trial if this step's reward < 1.0
[steps.agent]
timeout_sec = 60.0     # Overrides task-level [agent].timeout_sec
[steps.verifier]
timeout_sec = 30.0

[[steps]]
name = "implement"
# Dict form gates on specific keys from a multi-dim reward:
min_reward = { correctness = 0.8, style = 0.5 }
[steps.agent]
timeout_sec = 120.0
[steps.verifier]
timeout_sec = 30.0

[[steps]]
name = "document"
[steps.agent]
timeout_sec = 60.0
[steps.verifier]
timeout_sec = 30.0
```

Per-step overrides available: `agent.timeout_sec`, `agent.user`, `verifier.timeout_sec`, `verifier.env`, `verifier.user`, `healthcheck.*`, `artifacts`. Unset fields fall back to the task-level values.

Choosing a reward strategy
- `"mean"` — aggregate signal across all steps; good for continuous progress rewards.
- `"final"` — the last step's verifier_result is the trial reward. Right when the final step is an end-to-end check whose dict already represents the full task. Caveat: if `min_reward` triggers an early abort, `"final"` uses the aborted step's result, not the intended final step.
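The two strategies can be sketched as follows; this is illustrative, and Harbor's actual aggregation code may differ in edge cases:

```python
def aggregate(step_results: list[dict], strategy: str = "mean") -> dict:
    # "final": last step's verifier_result verbatim
    if strategy == "final":
        return step_results[-1]
    # "mean": per-key mean across the steps that produced that key
    keys = {k for result in step_results for k in result}
    return {
        k: sum(r[k] for r in step_results if k in r)
           / sum(1 for r in step_results if k in r)
        for k in keys
    }
```

For example, `aggregate([{"correctness": 1.0}, {"correctness": 0.0, "style": 1.0}])` gives a per-key mean of `0.5` for `correctness` and `1.0` for `style`, since `style` only appeared in one step.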
Artifacts
Step-level `artifacts` are collected into `steps/{name}/artifacts/` after that step's verification. Task-level and trial-level artifacts are collected at every step in addition to the step-level ones.
Oracle verification
Run `harbor run -p "<task-path>" -a oracle` as before; the Oracle runs each step's `solution/solve.sh`, and the reward should be `1.0`.

Full reference + worked example
- Docs: `docs/content/docs/tasks/multi-step.mdx`
- Example task: `examples/tasks/hello-multi-step-advanced/`
Special features (mention if relevant)
- MCP servers: add `[[environment.mcp_servers]]` in task.toml for agent tooling
- Healthcheck: add `[environment.healthcheck]` for services that need to be ready
- GPU: set `environment.gpus` and optionally `environment.gpu_types`
- Pre-built image: set `environment.docker_image` instead of building from Dockerfile
- Non-root user: set `agent.user` / `verifier.user` for isolation
Common pitfalls
- Forgetting to write the reward file → task "passes" silently with reward 0
- Using relative paths in `test.sh` → breaks when Harbor runs it from a different cwd
- Installing the solution into the Dockerfile → agent already gets the answer
- Test script leaks into `instruction.md` → agent sees the rubric and gaming becomes trivial
- Forgetting `chmod +x solution/solve.sh` → Oracle agent fails
- Leaving `keywords = []` in task.toml → task is invisible to registry search
- Leaving `README.md` as a stub → teammates have no way to understand the task at a glance