write-kaggle-benchmarks
Write Kaggle Benchmarks
Keywords
Kaggle benchmarks, write a benchmark, benchmark task, kbench, push task, run task.
Official Resources
- SDK source & API — https://github.com/Kaggle/kaggle-benchmarks
- SDK auto-generated docs — https://deepwiki.com/Kaggle/kaggle-benchmarks
- CLI docs — https://github.com/Kaggle/kaggle-cli/blob/main/docs/benchmarks.md
Command Hierarchy
kaggle benchmarks (alias: kaggle b)
├── auth - Fetch Model Proxy credentials
├── init - Fetch credentials + set up local dev environment
└── tasks (alias: t) - Manage benchmark tasks
    ├── push - Upload a task from a .py file
    ├── run - Run a task against model(s)
    ├── list - List your benchmark tasks
    ├── status - Show task details and per-model run status
    ├── download - Download completed run outputs (and optionally source notebooks)
    ├── log (logs) - Show execution logs for run(s) (streams live for RUNNING runs)
    ├── publish - Make a task public (publishes the backing notebook by default)
    ├── models - List available benchmark models
    └── delete - Delete a task (not yet supported by server)
Setup
```bash
# Full setup: credentials + .env + example_task.py + kaggle_benchmarks_reference.md
kaggle b init -y

# Credentials only (refresh MODEL_PROXY_* in .env)
kaggle b auth -y
```
Custom paths: `--env-file <FILE>` and `--example-file <FILE>` for `init`.
Env vars written by `init`:
- `MODEL_PROXY_URL`
- `MODEL_PROXY_API_KEY`
- `MODEL_PROXY_EXPIRY_TIME`
- `LLM_DEFAULT`
- `LLM_DEFAULT_EVAL`
- `LLMS_AVAILABLE`
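Because `init` and `auth` append to the env file, it can help to check locally which of the variables listed above are set and which value wins. The following is a minimal sketch, not the `dotenv` library: it is a simplified parser written for illustration that ignores quoting, `export` prefixes, and multi-line values. The helper names `parse_env_text` and `missing_keys` are hypothetical.

```python
# Variables the document says `init` writes to the env file.
EXPECTED_KEYS = [
    "MODEL_PROXY_URL",
    "MODEL_PROXY_API_KEY",
    "MODEL_PROXY_EXPIRY_TIME",
    "LLM_DEFAULT",
    "LLM_DEFAULT_EVAL",
    "LLMS_AVAILABLE",
]


def parse_env_text(text):
    """Parse KEY=VALUE lines; a later duplicate overrides an earlier one."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values, expected=EXPECTED_KEYS):
    """Return the expected keys that are absent or empty."""
    return [key for key in expected if not values.get(key)]
```

A typical use would be `missing_keys(parse_env_text(open(".env").read()))`; a non-empty result suggests re-running `kaggle b init -y`.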
Core workflow: Init → Write → Validate → Push → Run → Status → Download
Pacing — check in at every stage
Do NOT chain the full pipeline. Treat each numbered step below as a checkpoint:
- State what you are about to do for the current step (one sentence, including the exact command you intend to run).
- Wait for the user's go-ahead before executing, including for steps that look "obvious" like `init` or `list`.
- After the step completes, show the relevant output, then stop. Do not auto-advance to the next step.
- Ask the user how they want to proceed: continue to the next documented step, change parameters, or branch off.
If the user explicitly asks for "the whole pipeline" or "do everything", you may chain, but summarize the planned chain in advance and ask for one confirmation covering the lot, instead of skipping the per-step checkpoints silently.
0. Init (once per environment, re-run when creds expire)
`init` sets up `.env`, `example_task.py`, and `kaggle_benchmarks_reference.md` (see Setup). Other names this step touches: `MODEL_PROXY_*`, `python task.py`, `kaggle b t run`.
```bash
kaggle b init -y   # first-time setup
kaggle b auth -y   # creds-only refresh (no scaffolding)
```
1. Write a task file
A task file must:
- Import `kaggle_benchmarks as kbench`
- Define at least one function decorated with `@kbench.task(...)`
- Call `.run(kbench.llm)` (or `.evaluate(...)`) on the task function (see Gotchas)
- Use `# %%` cell markers (jupytext percent format)
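The four requirements above can be checked before pushing with a small static lint. This is a hedged sketch built on Python's standard `ast` module; it is a local heuristic, not the CLI's own push validation, and `check_task_source` is a hypothetical helper name. It only recognizes calls made directly on the task function's name, so indirect calls would be reported as missing.

```python
import ast


def check_task_source(source):
    """Return a list of problems found in a task file's source text."""
    tree = ast.parse(source)
    problems = []

    # 1. `import kaggle_benchmarks as kbench`
    has_import = any(
        isinstance(node, ast.Import)
        and any(a.name == "kaggle_benchmarks" and a.asname == "kbench" for a in node.names)
        for node in ast.walk(tree)
    )
    if not has_import:
        problems.append("missing `import kaggle_benchmarks as kbench`")

    # 2. At least one function decorated with @kbench.task or @kbench.task(...)
    def is_task_decorator(dec):
        target = dec.func if isinstance(dec, ast.Call) else dec
        return (
            isinstance(target, ast.Attribute)
            and target.attr == "task"
            and isinstance(target.value, ast.Name)
            and target.value.id == "kbench"
        )

    task_names = {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, ast.FunctionDef)
        and any(is_task_decorator(d) for d in node.decorator_list)
    }
    if not task_names:
        problems.append("no function decorated with @kbench.task(...)")

    # 3. Each task function has a `.run(...)` or `.evaluate(...)` call on it
    called = {
        node.func.value.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Attribute)
        and node.func.attr in {"run", "evaluate"}
        and isinstance(node.func.value, ast.Name)
    }
    for name in sorted(task_names - called):
        problems.append(f"task `{name}` is never called with .run(...) or .evaluate(...)")

    # 4. jupytext percent-format cell markers
    if not any(line.startswith("# %%") for line in source.splitlines()):
        problems.append("no `# %%` cell markers")

    return problems
```

An empty list means none of the four checks failed; it does not guarantee the task will run correctly against a model.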
Minimal example:
```python
# %%
import kaggle_benchmarks as kbench

# %%
@kbench.task(name="my-test-task")
def my_test_task(llm):
    response = llm.prompt("What is 2 + 2?")
    kbench.assertions.assert_in("4", response, expectation="Should contain 4")

my_test_task.run(kbench.llm)
```
LLM resolution precedence (highest → lowest):
- Explicit model in code: `task.run(llm=kbench.llms["google/gemini-3.5-flash"])`
- Default in code: `task.run(llm=kbench.llm)` (resolves to `LLM_DEFAULT`)
- Env vars from .env (`LLM_DEFAULT`, `LLMS_AVAILABLE`, `MODEL_PROXY_*`)
2. Validate locally
Run the task end-to-end before pushing. This catches the silent-no-op gotcha and broken prompts before the push → run → wait → download round-trip.
```bash
kaggle b init -y   # ensure .env is current
python task.py     # run the task directly
ls -1 *.run.json   # confirm a run file was produced
```
If `python task.py` exits cleanly and `*.run.json` appears, the task is safe to push. If validation fails, fix and re-run before proceeding to Step 3.
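The same local check can be scripted. The sketch below assumes, as the `ls` command above implies, that run files land in the working directory; `find_run_files` and `validate_locally` are hypothetical helper names. Note that a stale `*.run.json` left by an earlier run would also satisfy the check, so clearing old run files first may be worthwhile.

```python
import subprocess
import sys
from pathlib import Path


def find_run_files(directory="."):
    """Return the *.run.json files directly inside `directory`, sorted by name."""
    return sorted(Path(directory).glob("*.run.json"))


def validate_locally(task_file, directory="."):
    """Run the task file with the current interpreter, using `directory` as cwd.

    Returns True only if the process exits with code 0 and at least one
    *.run.json file exists in `directory` afterwards.
    """
    result = subprocess.run([sys.executable, task_file], cwd=directory)
    return result.returncode == 0 and bool(find_run_files(directory))
```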
3. Push
```bash
kaggle b t push my-task -f task.py --wait
kaggle b t push my-task -f task.py -d owner/dataset1 -d owner/dataset2   # attach datasets
```
Flags: `--wait [TIMEOUT]`, `--poll-interval <SECONDS>`, and `-d` / `--kaggle-dataset` (repeat once per dataset).
4. Run
```bash
# Interactive picker
kaggle b t run my-task

# Specific model
kaggle b t run my-task -m google/gemini-3.5-flash

# Multiple models (repeat -m, do NOT space-separate)
kaggle b t run my-task -m google/gemini-3.5-flash -m anthropic/claude-haiku-4-5

# Wait for completion
kaggle b t run my-task -m google/gemini-3.5-flash --wait
```
List available models: `kaggle b t models`.
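When the run command is assembled from a script, the repeat-the-flag rule is easy to get wrong. The sketch below builds an argument list with one `-m` per model, using only the flags shown above; `build_run_command` is a hypothetical helper, not part of the CLI or SDK.

```python
def build_run_command(task, models=(), wait=False):
    """Build argv for `kaggle b t run`, repeating -m once per model."""
    argv = ["kaggle", "b", "t", "run", task]
    for model in models:
        # One flag per value: "-m a -m b", never "-m a b".
        argv += ["-m", model]
    if wait:
        argv.append("--wait")
    return argv
```

The resulting list can be passed to `subprocess.run(argv, check=True)`. With no models, the command falls back to the interactive picker, which is unlikely to suit a non-interactive script.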
5. Status
```bash
kaggle b t status my-task
kaggle b t status my-task -m google/gemini-3.5-flash
```
Prints task metadata (slug, version, state, created timestamp, public flag, task URL) and a per-model run table. Errored runs render their final exception line under an `Errors:` section.
6. Download
```bash
kaggle b t download my-task                              # all terminal runs
kaggle b t download my-task -o ./results                 # custom directory
kaggle b t download my-task -m google/gemini-3.5-flash
kaggle b t download my-task -s                           # also fetch source notebooks
kaggle b t download my-task -f                           # force re-download (overwrite)
```
Output layout: `<output>/<task>/<version>/<model>/<run_id>/...`. Already-downloaded runs are skipped unless `--force` / `-f` is passed. With `--include-source` / `-s`, each run's directory also contains `__notebook__.ipynb` and `__notebook_source__.ipynb` alongside the regular outputs (useful for debugging the kernel session).
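To inspect what has been downloaded, the output layout can be walked with `pathlib`. This sketch assumes each of `<task>`, `<version>`, `<model>`, and `<run_id>` is exactly one directory level; if the CLI stores a model slug such as `owner/model` as nested directories, the glob depth would need adjusting. `list_downloaded_runs` is a hypothetical helper name.

```python
from pathlib import Path


def list_downloaded_runs(output_dir):
    """Return (task, version, model, run_id, has_source) for each run directory.

    Assumes the layout <output>/<task>/<version>/<model>/<run_id>/ with one
    directory per component. `has_source` is True when __notebook__.ipynb
    is present in the run directory.
    """
    root = Path(output_dir)
    runs = []
    for run_dir in sorted(root.glob("*/*/*/*")):
        if not run_dir.is_dir():
            continue
        task, version, model, run_id = run_dir.relative_to(root).parts
        has_source = (run_dir / "__notebook__.ipynb").exists()
        runs.append((task, version, model, run_id, has_source))
    return runs
```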
7. Log
```bash
kaggle b t log my-task                               # logs for every run of the task
kaggle b t log my-task -m google/gemini-3.5-flash    # filter to one model
kaggle b t log my-task -m model-a -m model-b         # multiple models, sequential
```
Run states: `RUNNING` (logs stream live), `COMPLETED`, `ERRORED`, `QUEUED`. When no logs exist, the message is `(No logs available — server returned 404)`.
8. Publish
```bash
kaggle b t publish my-task                                 # publish task + backing notebook (default)
kaggle b t publish my-task --no-publish-backing-notebook   # publish task only, keep notebook private
```
Publishes both the task and the backing notebook by default. If the task is already public, the command is a no-op for the task itself but will still publish the notebook unless `--no-publish-backing-notebook` is passed.
Quick Recipes
Reminder: these are reference snippets, not invocations to chain automatically. Per the "Pacing" section above, run them one at a time with user confirmation between each, unless the user explicitly asks you to chain them.
```bash
# Push → run → download (run one command at a time, confirm between)
kaggle b t push my-task -f task.py --wait
kaggle b t run my-task -m google/gemini-3.5-flash --wait
kaggle b t download my-task -o ./results

# List tasks, filtered
kaggle b t list --name-regex "^math" --status errored

# Debug an errored run: pull logs first, then download source notebook
kaggle b t log my-task -m google/gemini-3.5-flash
kaggle b t download my-task -m google/gemini-3.5-flash -s -f
```
Gotchas
Most of these are silent failures the agent will not detect on its own — review before generating any task file or CLI invocation.
- No `.run()` call → silent no-op. The push will succeed even if the file has no `.run()` (push validation only checks for `@task` decorators). The task will then execute on the server and produce no `.run.json`, so nothing is recorded. Every task function must end with `task_fn.run(kbench.llm)` (or `.evaluate(...)`).
- `MODEL_PROXY_API_KEY` is short-lived. If `python task.py` fails with an auth error, re-run `kaggle b auth -y` (or `kaggle b init -y`) to refresh.
- `init` / `auth` append to the env file. It is loaded via `dotenv`, so last-wins makes re-running safe, but the file accumulates duplicate entries over time.
- Task slug must match a `@task` decorator. `kaggle b t push <SLUG> -f file.py` fails if `<SLUG>` doesn't match the slugified name of some `@kbench.task(name=...)` (or function name) in the file. Names are normalized: `My Task` → `my-task`, `my_task` → `my-task`.
- Server sometimes returns model slugs with an `@default` suffix (e.g. `google/gemini-3.5-flash@default`). The CLI normalizes `@` → `-` for matching; user-facing commands should use the plain `owner/model` form.
- `delete` is not implemented server-side. The command exists but currently prints `Delete is not supported by the server yet.`
- Repeated flags, not space-separated. For multi-value flags (`-m`, `-d` / `--kaggle-dataset`), pass the flag once per value: `-m a -m b`, not `-m a b`. The space-separated form is not supported and will error.
- CLI scope is tasks only, not benchmarks. A benchmark is a curated collection of tasks. The CLI lets you create, push, and run individual tasks, but creating or managing benchmarks (collections) must be done on the Kaggle web UI.
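The slug-matching gotcha can be checked ahead of a push by normalizing the task name locally. The sketch below reproduces only the two examples given above (`My Task` → `my-task`, `my_task` → `my-task`); the CLI's exact normalization rules are not documented here, so treating every run of non-alphanumeric characters as a single hyphen is an assumption, and `slugify_task_name` is a hypothetical helper.

```python
import re


def slugify_task_name(name):
    """Lowercase the name and collapse non-alphanumeric runs into hyphens.

    Assumption: this mirrors the documented examples only; the real CLI
    may treat other characters differently.
    """
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
```

Comparing `slugify_task_name(...)` of the decorator's `name` with the `<SLUG>` you plan to push gives an early hint of a mismatch, though the CLI remains the authority.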