write-kaggle-benchmarks

Write Kaggle Benchmarks

Keywords

Kaggle benchmarks, write a benchmark, benchmark task, kbench, push task, run task.

Official Resources

Command Hierarchy

kaggle benchmarks (alias: kaggle b)
├── auth              — Fetch Model Proxy credentials
├── init              — Fetch credentials + setup local dev environment
└── tasks (alias: t)  — Manage benchmark tasks
    ├── push          — Upload a task from a .py file
    ├── run           — Run a task against model(s)
    ├── list          — List your benchmark tasks
    ├── status        — Show task details and per-model run status
    ├── download      — Download completed run outputs (and optionally source notebooks)
    ├── log (logs)    — Show execution logs for run(s) (streams live for RUNNING runs)
    ├── publish       — Make a task public (publishes the backing notebook by default)
    ├── models        — List available benchmark models
    └── delete        — Delete a task (not yet supported by server)

Setup

bash
# Full setup: credentials + .env + example_task.py + kaggle_benchmarks_reference.md
kaggle b init -y

# Credentials only (refresh MODEL_PROXY_* in .env)
kaggle b auth -y
Custom paths: `--env-file <FILE>` and `--example-file <FILE>` for init.

Env vars written by init:

  • MODEL_PROXY_URL
  • MODEL_PROXY_API_KEY
  • MODEL_PROXY_EXPIRY_TIME
  • LLM_DEFAULT
  • LLM_DEFAULT_EVAL
  • LLMS_AVAILABLE
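
One way to confirm that `.env` carries every variable listed above is to parse it and report what is missing. The sketch below is a hypothetical helper, not part of the Kaggle CLI. It assumes plain `KEY=VALUE` lines and lets a later duplicate override an earlier one. The names `REQUIRED_VARS`, `parse_env_text`, and `missing_vars` are illustrative only.

```python
from typing import Dict, List

# Variable names exactly as listed in this section.
REQUIRED_VARS = [
    "MODEL_PROXY_URL",
    "MODEL_PROXY_API_KEY",
    "MODEL_PROXY_EXPIRY_TIME",
    "LLM_DEFAULT",
    "LLM_DEFAULT_EVAL",
    "LLMS_AVAILABLE",
]


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines; a later duplicate overrides an earlier one.

    Assumption: the file uses plain KEY=VALUE lines (optionally quoted values).
    """
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_vars(text: str) -> List[str]:
    """Return the required variable names that are absent or empty."""
    values = parse_env_text(text)
    return [name for name in REQUIRED_VARS if not values.get(name)]
```

Calling `missing_vars(open(".env").read())` after `kaggle b init -y` should return an empty list if all six variables were written.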

Core workflow: Init → Write → Validate → Push → Run → Status → Download

Pacing — check in at every stage

Do NOT chain the full pipeline. Treat each numbered step below as a checkpoint:
  1. State what you are about to do for the current step (one sentence, including the exact command you intend to run).
  2. Wait for the user's go-ahead before executing, including for steps that look "obvious" like `init` or `list`.
  3. After the step completes, show the relevant output, then stop. Do not auto-advance to the next step.
  4. Ask the user how they want to proceed: continue to the next documented step, change parameters, or branch off.
If the user explicitly asks for "the whole pipeline" or "do everything", you may chain, but summarize the planned chain in advance and ask for one confirmation covering the lot, instead of skipping the per-step checkpoints silently.
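
The checkpoint rule can be pictured as a loop that announces one command, waits for approval, runs it, and then asks again. This is only an illustrative sketch: `run_with_checkpoints`, `confirm`, and `execute` are hypothetical names, and the two callbacks stand in for however the agent actually talks to the user and the shell.

```python
from typing import Callable, List, Tuple


def run_with_checkpoints(
    commands: List[str],
    confirm: Callable[[str], bool],
    execute: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Run commands one at a time, stopping as soon as the user declines."""
    completed: List[Tuple[str, str]] = []
    for command in commands:
        # Announce the exact command and wait for a go-ahead before running it.
        if not confirm(f"About to run: {command}"):
            break
        output = execute(command)
        # Record the output so it can be shown before the next confirmation.
        completed.append((command, output))
    return completed
```

Because `confirm` is called before every command, even an "obvious" step such as `kaggle b init -y` cannot run without a go-ahead.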

0. Init (once per environment, re-run when creds expire)

`init` fetches Model Proxy credentials, writes `.env`, and drops an `example_task.py` + `kaggle_benchmarks_reference.md` next to it. Every later step depends on the `MODEL_PROXY_*` vars it writes, so run it before anything else, and re-run it any time `python task.py` or `kaggle b t run` fails with an auth error (the API key is short-lived).
bash
kaggle b init -y                      # first-time setup
kaggle b auth -y                      # creds-only refresh (no scaffolding)

1. Write a task file

A task file must:
  • Import `kaggle_benchmarks as kbench`
  • Define at least one function decorated with `@kbench.task(...)`
  • Call `.run(kbench.llm)` (or `.evaluate(...)`) on the task function (see Gotchas)
  • Use `# %%` cell markers (jupytext percent format)
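
Because a missing `.run()` call is easy to overlook, the four requirements above can be checked statically before running anything. The following is a rough sketch using Python's `ast` module. `check_task_source` is a hypothetical helper, not something the CLI or `kaggle_benchmarks` provides, and it only recognizes the straightforward spelling of each requirement (for example `@kbench.task(...)` with the `kbench` alias).

```python
import ast


def check_task_source(source: str) -> list:
    """Return a list of problems found in a task file's source text."""
    tree = ast.parse(source)  # parses only; the task file is never imported or run
    problems = []

    # 1. `import kaggle_benchmarks as kbench`
    imported = any(
        isinstance(node, ast.Import)
        and any(a.name == "kaggle_benchmarks" and a.asname == "kbench" for a in node.names)
        for node in ast.walk(tree)
    )
    if not imported:
        problems.append("missing `import kaggle_benchmarks as kbench`")

    # 2. At least one function decorated with @kbench.task or @kbench.task(...)
    def is_task_decorator(dec):
        target = dec.func if isinstance(dec, ast.Call) else dec
        return (
            isinstance(target, ast.Attribute)
            and target.attr == "task"
            and isinstance(target.value, ast.Name)
            and target.value.id == "kbench"
        )

    task_names = {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, ast.FunctionDef)
        and any(is_task_decorator(d) for d in node.decorator_list)
    }
    if not task_names:
        problems.append("no function decorated with @kbench.task(...)")

    # 3. A .run(...) or .evaluate(...) call on one of the task functions
    invoked = any(
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Attribute)
        and node.func.attr in ("run", "evaluate")
        and isinstance(node.func.value, ast.Name)
        and node.func.value.id in task_names
        for node in ast.walk(tree)
    )
    if task_names and not invoked:
        problems.append("no .run(...) or .evaluate(...) call on a task function")

    # 4. jupytext percent-format cell markers
    if not any(line.strip().startswith("# %%") for line in source.splitlines()):
        problems.append("no `# %%` cell markers")

    return problems
```

An empty list means the file passes these surface checks. It does not prove the task behaves correctly, so running the file locally is still worthwhile.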

Minimal example:

python
# %%
import kaggle_benchmarks as kbench

# %%
@kbench.task(name="my-test-task")
def my_test_task(llm):
    response = llm.prompt("What is 2 + 2?")
    kbench.assertions.assert_in("4", response, expectation="Should contain 4")

my_test_task.run(kbench.llm)

LLM resolution precedence (highest → lowest):

  1. Explicit model in code: `task.run(llm=kbench.llms["google/gemini-3.5-flash"])`
  2. Default in code: `task.run(llm=kbench.llm)` (resolves to `LLM_DEFAULT`)
  3. Env vars from .env (`LLM_DEFAULT`, `LLMS_AVAILABLE`, `MODEL_PROXY_*`)
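
The precedence above amounts to "an explicit model wins, otherwise fall back to `LLM_DEFAULT`". A minimal sketch of that rule follows. `resolve_model_name` is a hypothetical function written for illustration and is not how `kaggle_benchmarks` resolves models internally. It treats the environment as a plain mapping.

```python
from typing import Mapping, Optional


def resolve_model_name(explicit: Optional[str], env: Mapping[str, str]) -> str:
    """Pick a model slug: an explicit choice wins, otherwise use LLM_DEFAULT."""
    if explicit:
        return explicit
    default = env.get("LLM_DEFAULT")
    if not default:
        # Nothing to fall back on; the .env written by init is probably missing.
        raise KeyError("LLM_DEFAULT is not set; run `kaggle b init -y` first")
    return default
```

In practice `env` would be `os.environ` after the `.env` file has been loaded.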

2. Validate locally

Run the task end-to-end before pushing. This catches the silent-no-op gotcha and broken prompts before the push → run → wait → download round-trip.
bash
kaggle b init -y                     # ensure .env is current
python task.py                       # run the task directly
ls -1 *.run.json                     # confirm a run file was produced
If `python task.py` exits cleanly and `*.run.json` appears, the task is safe to push. If validation fails, fix and re-run before proceeding to Step 3.
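
The same check can be scripted. The sketch below runs the task file in a subprocess and then looks for any `*.run.json` in the working directory, mirroring the `python task.py` and `ls` commands above. `validate_task` is a hypothetical helper. Like the `ls` check, it tests only for presence, so a stale run file from an earlier attempt would also satisfy it. Removing old `*.run.json` files first avoids that.

```python
import glob
import os
import subprocess
import sys


def validate_task(task_file: str, workdir: str = ".") -> bool:
    """Run the task file and report whether it exited cleanly and a *.run.json exists."""
    # task_file is resolved relative to workdir because the child starts there.
    result = subprocess.run([sys.executable, task_file], cwd=workdir)
    run_files = glob.glob(os.path.join(workdir, "*.run.json"))
    # A clean exit alone is not enough: a file with no .run() call also exits 0.
    return result.returncode == 0 and bool(run_files)
```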

3. Push

bash
kaggle b t push my-task -f task.py --wait
kaggle b t push my-task -f task.py -d owner/dataset1 -d owner/dataset2   # attach datasets
`--wait [TIMEOUT]` blocks until server-side creation finishes (no arg = indefinite). `--poll-interval <SECONDS>` caps the polling interval (default 60s; polling starts at 5s and grows adaptively). Repeat `-d` / `--kaggle-dataset` once per dataset (do not space-separate; see the Gotchas).
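
To make the polling description concrete, here is a small sketch of a capped, growing delay schedule. It is not the CLI's actual implementation. The text above only states the 5-second start and the cap, so the doubling factor is an assumption chosen for illustration, and `poll_delays` is a hypothetical name.

```python
from itertools import islice
from typing import Iterator


def poll_delays(start: float = 5.0, cap: float = 60.0, factor: float = 2.0) -> Iterator[float]:
    """Yield wait times that start small and grow until they reach the cap.

    Assumption: growth is geometric (doubling); the real CLI may grow differently.
    """
    delay = start
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


# First six delays with the defaults: 5, 10, 20, 40, then held at the 60s cap.
print(list(islice(poll_delays(), 6)))
```

Under this assumption, passing a smaller `--poll-interval` would correspond to lowering `cap`, which shortens the longest wait between checks.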

4. Run

bash
# Interactive picker
kaggle b t run my-task

# Specific model
kaggle b t run my-task -m google/gemini-3.5-flash

# Multiple models (repeat -m, do NOT space-separate)
kaggle b t run my-task -m google/gemini-3.5-flash -m anthropic/claude-haiku-4-5

# Wait for completion
kaggle b t run my-task -m google/gemini-3.5-flash --wait
List available models: `kaggle b t models`.
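
When the run command is assembled programmatically, the repeated-flag rule is easy to get right by emitting one `-m` per model. The sketch below builds the argument list for the commands shown in this section. `build_run_command` is a hypothetical helper, and it uses only the flags that appear above (`-m` and `--wait`).

```python
import shlex
from typing import List, Sequence


def build_run_command(task: str, models: Sequence[str] = (), wait: bool = False) -> List[str]:
    """Build the argv for `kaggle b t run`, repeating -m once per model."""
    argv = ["kaggle", "b", "t", "run", task]
    for model in models:
        argv += ["-m", model]  # one flag per value, never "-m a b"
    if wait:
        argv.append("--wait")
    return argv


cmd = build_run_command(
    "my-task",
    ["google/gemini-3.5-flash", "anthropic/claude-haiku-4-5"],
    wait=True,
)
print(shlex.join(cmd))
```

The resulting list can be passed straight to `subprocess.run(cmd)`, which avoids shell quoting issues.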

5. Status

bash
kaggle b t status my-task
kaggle b t status my-task -m google/gemini-3.5-flash
Prints task metadata (slug, version, state, created timestamp, public flag, task URL) and a per-model run table. Errored runs render their final exception line under an `Errors:` section.

6. Download

bash
kaggle b t download my-task                       # all terminal runs
kaggle b t download my-task -o ./results          # custom directory
kaggle b t download my-task -m google/gemini-3.5-flash
kaggle b t download my-task -s                    # also fetch source notebooks
kaggle b t download my-task -f                    # force re-download (overwrite)
Output layout: `<output>/<task>/<version>/<model>/<run_id>/...`. Already-downloaded runs are skipped unless `--force` / `-f` is passed. With `--include-source` / `-s`, each run's directory also contains `__notebook__.ipynb` and `__notebook_source__.ipynb` alongside the regular outputs (useful for debugging the kernel session).
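
After a download with `-s`, it can be handy to list which run directories actually received the source notebooks. The sketch below searches the output directory recursively rather than relying on a fixed depth, since how a model slug containing `/` maps onto directories is not specified here. `runs_with_source` and `SOURCE_FILES` are hypothetical names. The two notebook file names are the ones given above.

```python
from pathlib import Path
from typing import List

# File names documented for --include-source / -s.
SOURCE_FILES = ("__notebook__.ipynb", "__notebook_source__.ipynb")


def runs_with_source(output_dir: str) -> List[Path]:
    """Return run directories that contain both source notebooks."""
    root = Path(output_dir)
    # Any directory holding the first notebook is a candidate run directory.
    candidates = {p.parent for p in root.rglob(SOURCE_FILES[0])}
    return sorted(
        d for d in candidates if all((d / name).is_file() for name in SOURCE_FILES)
    )
```

For example, `runs_with_source("./results")` lists the run directories to open when debugging an errored run.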

7. Log

bash
kaggle b t log my-task                            # logs for every run of the task
kaggle b t log my-task -m google/gemini-3.5-flash # filter to one model
kaggle b t log my-task -m model-a -m model-b      # multiple models, sequential
`RUNNING` runs stream live via SSE; `COMPLETED` / `ERRORED` runs print the persisted log in one shot; `QUEUED` runs print `(No logs available — server returned 404)` and continue.

8. Publish

bash
kaggle b t publish my-task                              # publish task + backing notebook (default)
kaggle b t publish my-task --no-publish-backing-notebook  # publish task only, keep notebook private
Publishes both the task and the backing notebook by default. If the task is already public, the command is a no-op for the task itself but will still publish the notebook unless `--no-publish-backing-notebook` is passed.

Quick Recipes

Reminder: these are reference snippets, not invocations to chain automatically. Per the "Pacing" section above, run them one at a time with user confirmation between each, unless the user explicitly asks you to chain them.
bash
# Push → run → download (run one command at a time, confirm between)
kaggle b t push my-task -f task.py --wait
kaggle b t run my-task -m google/gemini-3.5-flash --wait
kaggle b t download my-task -o ./results

# List tasks, filtered
kaggle b t list --name-regex "^math" --status errored

# Debug an errored run: pull logs first, then download source notebook
kaggle b t log my-task -m google/gemini-3.5-flash
kaggle b t download my-task -m google/gemini-3.5-flash -s -f

Gotchas

Most of these are silent failures the agent will not detect on its own, so review them before generating any task file or CLI invocation.
  • No `.run()` call → silent no-op. The push will succeed even if the file has no `.run()` (push validation only checks for `@task` decorators). The task will then execute on the server and produce no `.run.json`, so nothing is recorded. Every task function must end with `task_fn.run(kbench.llm)` (or `.evaluate(...)`).
  • `MODEL_PROXY_API_KEY` is short-lived. If `python task.py` fails with an auth error, re-run `kaggle b auth -y` (or `kaggle b init -y`) to refresh.
  • `init` / `auth` append to the env file. The file is loaded via `dotenv`, where the last entry wins, so re-running is safe, but the file accumulates duplicate entries over time.
  • Task slug must match a `@task` decorator. `kaggle b t push <SLUG> -f file.py` fails if `<SLUG>` doesn't match the slugified name of some `@kbench.task(name=...)` (or function name) in the file. Names are normalized: `My Task` → `my-task`, `my_task` → `my-task`.
  • The server sometimes returns model slugs with an `@default` suffix (e.g. `google/gemini-3.5-flash@default`). The CLI normalizes `@` → `-` for matching; user-facing commands should use the plain `owner/model` form.
  • `delete` is not implemented server-side. The command exists but currently prints `Delete is not supported by the server yet.`
  • Repeated flags, not space-separated. For multi-value flags (`-m`, `-d` / `--kaggle-dataset`), pass the flag once per value: `-m a -m b`, not `-m a b`. The space-separated form is not supported and will error.
  • CLI scope is tasks only, not benchmarks. A benchmark is a curated collection of tasks. The CLI lets you create, push, and run individual tasks, but creating or managing benchmarks (collections) must be done in the Kaggle web UI.
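
The slug-matching gotcha can be checked locally before pushing. The sketch below approximates the normalization using the two examples given (`My Task` and `my_task` both becoming `my-task`). The CLI's exact rules are not spelled out here, so `slugify_task_name` and `slug_matches` are hypothetical helpers that cover only lowercase conversion and replacing runs of non-alphanumeric characters with a hyphen.

```python
import re
from typing import Iterable


def slugify_task_name(name: str) -> str:
    """Lowercase the name and collapse runs of non-alphanumerics into single hyphens.

    Assumption: this approximates the CLI's normalization; edge cases may differ.
    """
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def slug_matches(slug: str, task_names: Iterable[str]) -> bool:
    """Return True if the push slug equals the slugified form of any task name."""
    return any(slugify_task_name(name) == slug for name in task_names)
```

For instance, `slug_matches("my-task", ["My Task"])` is true under these rules, while a slug that matches no decorator name in the file would be rejected at push time.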