write-kaggle-benchmarks

Write Kaggle Benchmarks

Keywords

Kaggle benchmarks, write a benchmark, benchmark task, kbench, push task, run task.

Official Resources

SDK source & API — https://github.com/Kaggle/kaggle-benchmarks
SDK auto-generated docs — https://deepwiki.com/Kaggle/kaggle-benchmarks
CLI docs — https://github.com/Kaggle/kaggle-cli/blob/main/docs/benchmarks.md

Command Hierarchy

kaggle benchmarks (alias: kaggle b)
├── auth              — Fetch Model Proxy credentials
├── init              — Fetch credentials + setup local dev environment
└── tasks (alias: t)  — Manage benchmark tasks
    ├── push          — Upload a task from a .py file
    ├── run           — Run a task against model(s)
    ├── list          — List your benchmark tasks
    ├── status        — Show task details and per-model run status
    ├── download      — Download completed run outputs (and optionally source notebooks)
    ├── log (logs)    — Show execution logs for run(s) (streams live for RUNNING runs)
    ├── publish       — Make a task public (publishes the backing notebook by default)
    ├── models        — List available benchmark models
    └── delete        — Delete a task (not yet supported by server)

Setup

bash

# Full setup: credentials + .env + example_task.py + kaggle_benchmarks_reference.md
kaggle b init -y

# Credentials only (refresh MODEL_PROXY_* in .env)
kaggle b auth -y

Custom paths: `--env-file <FILE>` and `--example-file <FILE>` for init.

Env vars written by init: `MODEL_PROXY_URL`, `MODEL_PROXY_API_KEY`, `MODEL_PROXY_EXPIRY_TIME`, `LLM_DEFAULT`, `LLM_DEFAULT_EVAL`, `LLMS_AVAILABLE`.

Core workflow: Init → Write → Validate → Push → Run → Status → Download

Pacing — check in at every stage

Do NOT chain the full pipeline. Treat each numbered step below as a checkpoint:

State what you are about to do for the current step (one sentence, including the exact command you intend to run).
Wait for the user's go-ahead before executing, including for steps that look "obvious" like `init` or `list`.
After the step completes, show the relevant output, then stop. Do not auto-advance to the next step.
Ask the user how they want to proceed: continue to the next documented step, change parameters, or branch off.

If the user explicitly asks for "the whole pipeline" or "do everything", you may chain, but summarize the planned chain in advance and ask for one confirmation covering the lot, instead of skipping the per-step checkpoints silently.
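
The checkpoint rule above can be pictured as a small confirm-then-run helper. This is only a sketch: `run_checkpoint`, `confirm`, and `execute` are hypothetical names introduced for illustration and are not part of the Kaggle CLI or SDK.

```python
from typing import Callable, Optional


def run_checkpoint(
    command: str,
    confirm: Callable[[str], bool],
    execute: Callable[[str], str],
) -> Optional[str]:
    """Announce one command, run it only after confirmation, and return its output."""
    if not confirm(f"About to run: {command}. Proceed?"):
        return None  # user declined: stop here, do not advance
    return execute(command)


# Stand-ins for a real prompt and a real shell call, for illustration only.
approve_all = lambda question: True
fake_execute = lambda command: f"(output of {command})"

print(run_checkpoint("kaggle b t list", approve_all, fake_execute))
# prints "(output of kaggle b t list)"
```

Returning after a single command, rather than looping over a list of commands, is what keeps each step a separate checkpoint.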

0. Init (once per environment, re-run when creds expire)

`init` fetches Model Proxy credentials, writes `.env`, and drops an `example_task.py` and `kaggle_benchmarks_reference.md` next to it. Every later step depends on the `MODEL_PROXY_*` vars it writes, so run it before anything else, and re-run it any time `python task.py` or `kaggle b t run` fails with an auth error (the API key is short-lived).

bash

kaggle b init -y                      # first-time setup
kaggle b auth -y                      # creds-only refresh (no scaffolding)
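
Because `init` and `auth` append to the env file (see Gotchas), the same key can appear more than once after a refresh. The following is a minimal sketch, assuming plain `KEY=VALUE` lines, of how a last-wins read resolves such duplicates. `parse_env_last_wins` and the sample values are hypothetical. They are not the CLI's or python-dotenv's actual implementation.

```python
def parse_env_last_wins(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines; a later duplicate key overrides an earlier one."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Hypothetical file contents after credentials were refreshed once.
sample = """\
MODEL_PROXY_API_KEY=old-key
LLM_DEFAULT=google/gemini-3.5-flash
MODEL_PROXY_API_KEY=new-key
"""
env = parse_env_last_wins(sample)
print(env["MODEL_PROXY_API_KEY"])  # prints "new-key"
```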

1. Write a task file

A task file must:

Import `kaggle_benchmarks as kbench`
Define at least one function decorated with `@kbench.task(...)`
Call `.run(kbench.llm)` (or `.evaluate(...)`) on the task function (see Gotchas)
Use `# %%` cell markers (jupytext percent format)
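
The four requirements above can be checked before pushing. The sketch below uses Python's standard `ast` module. `check_task_source` is a hypothetical helper, not part of the SDK or CLI, and its checks are heuristics: any `.run(...)` or `.evaluate(...)` call satisfies the third check, and a bare `@task` decorator would not be recognised.

```python
import ast


def check_task_source(source: str) -> list[str]:
    """Return a list of problems found in a task file's source text."""
    tree = ast.parse(source)
    problems = []

    # Requirement: `import kaggle_benchmarks as kbench`
    imports_kbench = any(
        isinstance(node, ast.Import)
        and any(alias.name == "kaggle_benchmarks" for alias in node.names)
        for node in ast.walk(tree)
    )
    if not imports_kbench:
        problems.append("missing `import kaggle_benchmarks as kbench`")

    # Requirement: at least one function decorated with @<something>.task(...)
    def is_task_decorator(dec: ast.expr) -> bool:
        target = dec.func if isinstance(dec, ast.Call) else dec
        return isinstance(target, ast.Attribute) and target.attr == "task"

    has_task = any(
        isinstance(node, ast.FunctionDef)
        and any(is_task_decorator(d) for d in node.decorator_list)
        for node in ast.walk(tree)
    )
    if not has_task:
        problems.append("no function decorated with @kbench.task(...)")

    # Requirement: a `.run(...)` or `.evaluate(...)` call somewhere in the file
    has_run = any(
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Attribute)
        and node.func.attr in {"run", "evaluate"}
        for node in ast.walk(tree)
    )
    if not has_run:
        problems.append("no .run(...) or .evaluate(...) call (silent no-op)")

    # Requirement: jupytext percent-format cell markers
    if "# %%" not in source:
        problems.append("no `# %%` cell markers")

    return problems


good = '''# %%
import kaggle_benchmarks as kbench

# %%
@kbench.task(name="my-test-task")
def my_test_task(llm):
    response = llm.prompt("What is 2 + 2?")

my_test_task.run(kbench.llm)
'''
print(check_task_source(good))  # prints []
```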

Minimal example:

python

# %%
import kaggle_benchmarks as kbench

# %%
@kbench.task(name="my-test-task")
def my_test_task(llm):
    response = llm.prompt("What is 2 + 2?")
    kbench.assertions.assert_in("4", response, expectation="Should contain 4")

my_test_task.run(kbench.llm)

LLM resolution precedence (highest → lowest):

Explicit model in code: `task.run(llm=kbench.llms["google/gemini-3.5-flash"])`
Default in code: `task.run(llm=kbench.llm)` (resolves to `LLM_DEFAULT`)
Env vars from .env (`LLM_DEFAULT`, `LLMS_AVAILABLE`, `MODEL_PROXY_*`)

2. Validate locally

Run the task end-to-end before pushing. This catches the silent-no-op gotcha and broken prompts before the push → run → wait → download round-trip.

bash

kaggle b init -y                     # ensure .env is current
python task.py                       # run the task directly
ls -1 *.run.json                     # confirm a run file was produced

If `python task.py` exits cleanly and a `*.run.json` file appears, the task is safe to push. If validation fails, fix and re-run before proceeding to Step 3.
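
The same check can be scripted. This is a small sketch that mirrors `ls -1 *.run.json`. `find_run_files` is a hypothetical helper name, and the only assumption about the run files is the `*.run.json` suffix stated above.

```python
from pathlib import Path


def find_run_files(directory: str = ".") -> list[Path]:
    """Return the *.run.json files in `directory`, sorted by name."""
    return sorted(Path(directory).glob("*.run.json"))


runs = find_run_files(".")
if runs:
    print("run files:", ", ".join(p.name for p in runs))
else:
    print("no *.run.json found: check that the task file calls .run(...)")
```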

3. Push

bash

kaggle b t push my-task -f task.py --wait
kaggle b t push my-task -f task.py -d owner/dataset1 -d owner/dataset2   # attach datasets

`--wait [TIMEOUT]` blocks until server-side creation finishes (no arg = indefinite). `--poll-interval <SECONDS>` caps the polling interval (default 60s; polling starts at 5s and grows adaptively). Repeat `-d` / `--kaggle-dataset` once per dataset (do not space-separate; see the Gotchas).
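
To make the polling behaviour concrete, here is a sketch of a capped, growing delay schedule. Only the 5s start and the 60s default cap come from the text above. The doubling factor is an assumption chosen for illustration, since the CLI's actual growth rule is not documented here, and `poll_delays` is a hypothetical name.

```python
from itertools import islice
from typing import Iterator


def poll_delays(start: float = 5.0, cap: float = 60.0, factor: float = 2.0) -> Iterator[float]:
    """Yield wait times that grow from `start` and never exceed `cap`."""
    delay = start
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)  # assumed growth rule: multiply, then clamp


print(list(islice(poll_delays(), 6)))  # prints [5.0, 10.0, 20.0, 40.0, 60.0, 60.0]
```

Under this assumption, a smaller `cap` simply flattens the schedule sooner.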

4. Run

bash

# Interactive picker
kaggle b t run my-task

# Specific model
kaggle b t run my-task -m google/gemini-3.5-flash

# Multiple models (repeat -m, do NOT space-separate)
kaggle b t run my-task -m google/gemini-3.5-flash -m anthropic/claude-haiku-4-5

# Wait for completion
kaggle b t run my-task -m google/gemini-3.5-flash --wait

List available models: `kaggle b t models`.
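
When the run command is assembled programmatically, the repeated-flag rule is easy to get wrong. The sketch below builds the argument list with one `-m` per model. `build_run_command` is a hypothetical helper, and it uses only the `-m` and `--wait` flags shown above.

```python
import shlex


def build_run_command(task: str, models: list[str], wait: bool = False) -> list[str]:
    """Build argv for `kaggle b t run`, emitting one -m flag per model."""
    argv = ["kaggle", "b", "t", "run", task]
    for model in models:
        argv += ["-m", model]  # repeat the flag; never pass "-m a b"
    if wait:
        argv.append("--wait")
    return argv


cmd = build_run_command(
    "my-task",
    ["google/gemini-3.5-flash", "anthropic/claude-haiku-4-5"],
    wait=True,
)
print(shlex.join(cmd))
# prints: kaggle b t run my-task -m google/gemini-3.5-flash -m anthropic/claude-haiku-4-5 --wait
```

Keeping the command as a list avoids shell quoting issues if it is later passed to `subprocess.run`.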

5. Status

bash

kaggle b t status my-task
kaggle b t status my-task -m google/gemini-3.5-flash

Prints task metadata (slug, version, state, created timestamp, public flag, task URL) and a per-model run table. Errored runs render their final exception line under an `Errors:` section.

6. Download

bash

kaggle b t download my-task                       # all terminal runs
kaggle b t download my-task -o ./results          # custom directory
kaggle b t download my-task -m google/gemini-3.5-flash
kaggle b t download my-task -s                    # also fetch source notebooks
kaggle b t download my-task -f                    # force re-download (overwrite)

Output layout: `<output>/<task>/<version>/<model>/<run_id>/...`.

Already-downloaded runs are skipped unless `--force` / `-f` is passed. With `--include-source` / `-s`, each run's directory also contains `__notebook__.ipynb` and `__notebook_source__.ipynb` alongside the regular outputs (useful for debugging the kernel session).

7. Log

bash

kaggle b t log my-task                            # logs for every run of the task
kaggle b t log my-task -m google/gemini-3.5-flash # filter to one model
kaggle b t log my-task -m model-a -m model-b      # multiple models, sequential

`RUNNING` runs stream live via SSE; `COMPLETED` / `ERRORED` runs print the persisted log in one shot; `QUEUED` runs print `(No logs available — server returned 404)` and continue.

8. Publish

bash

kaggle b t publish my-task                              # publish task + backing notebook (default)
kaggle b t publish my-task --no-publish-backing-notebook  # publish task only, keep notebook private

Publishes both the task and the backing notebook by default. If the task is already public, the command is a no-op for the task itself but will still publish the notebook unless `--no-publish-backing-notebook` is passed.

Quick Recipes

Reminder: these are reference snippets, not invocations to chain automatically. Per the "Pacing" section above, run them one at a time with user confirmation between each, unless the user explicitly asks you to chain them.

bash

# Push → run → download (run one command at a time, confirm between)
kaggle b t push my-task -f task.py --wait
kaggle b t run my-task -m google/gemini-3.5-flash --wait
kaggle b t download my-task -o ./results

# List tasks, filtered
kaggle b t list --name-regex "^math" --status errored

# Debug an errored run: pull logs first, then download source notebook
kaggle b t log my-task -m google/gemini-3.5-flash
kaggle b t download my-task -m google/gemini-3.5-flash -s -f

Gotchas

Most of these are silent failures the agent will not detect on its own — review before generating any task file or CLI invocation.

No `.run()` call → silent no-op. The push will succeed even if the file has no `.run()` (push validation only checks for `@task` decorators). The task will then execute on the server and produce no `.run.json`, so nothing is recorded. Every task function must end with `task_fn.run(kbench.llm)` (or `.evaluate(...)`).
`MODEL_PROXY_API_KEY` is short-lived. If `python task.py` fails with an auth error, re-run `kaggle b auth -y` (or `kaggle b init -y`) to refresh.
`init` / `auth` append to the env file. Loaded via `dotenv` so last-wins makes re-running safe, but the file accumulates duplicate entries over time.
Task slug must match a `@task` decorator. `kaggle b t push <SLUG> -f file.py` fails if `<SLUG>` doesn't match the slugified name of some `@kbench.task(name=...)` (or function name) in the file. Names are normalized: `My Task` → `my-task`, `my_task` → `my-task`.
Server returns model slugs with `@default` suffix sometimes (e.g. `google/gemini-3.5-flash@default`). The CLI normalizes `@` → `-` for matching; user-facing commands should use the plain `owner/model` form.
`delete` is not implemented server-side. The command exists but currently prints `Delete is not supported by the server yet.`
Repeated flags, not space-separated. For multi-value flags (`-m`, `-d` / `--kaggle-dataset`), pass the flag once per value: `-m a -m b`, not `-m a b`. Space-separated form is not supported and will error.
CLI scope is tasks only, not benchmarks. A benchmark is a curated collection of tasks. The CLI lets you create, push, and run individual tasks, but creating or managing benchmarks (collections) must be done on the Kaggle web UI.
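
The slug rule in the list above can be checked locally before pushing. The sketch below reproduces the two documented normalizations (`My Task` and `my_task` both become `my-task`). How the CLI treats other characters is not stated here, so the regular expression is an assumption, and `slugify` is a hypothetical helper rather than the CLI's own function.

```python
import re


def slugify(name: str) -> str:
    """Lowercase `name` and collapse runs of non-alphanumerics into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


print(slugify("My Task"))  # prints "my-task"
print(slugify("my_task"))  # prints "my-task"
```

Comparing `slugify(name)` against the slug passed to `kaggle b t push` catches a mismatch before the push fails.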
