# run-experiment

Submit an ML experiment to a compute environment: local machine, SLURM HPC (Ibex, UW, etc.), or RunAI/Kubernetes (EPFL). Generates a reproducible job script in `jobs/` that is committed alongside the code, then provides the exact submit command to run. Pair this skill with `research-project-memory` when a launched job should be linked to a planned experiment, evidence item, worktree, or project action.

## Skill Directory Layout


```text
<installed-skill-dir>/
├── SKILL.md
├── environments.yaml        # Environment profiles (extend for new clusters)
└── templates/
    ├── slurm_job.sh         # SLURM template (Ibex, UW, any SLURM cluster)
    ├── runai_job.sh         # RunAI/Kubernetes template (EPFL)
    └── local_run.sh         # Local tmux/nohup template
```


## Steps to Follow


### 1. Read the environment registry


Resolve `<installed-skill-dir>` as the directory containing this `SKILL.md`, then read `<installed-skill-dir>/environments.yaml`. List the available environments to the user with a one-line description each.
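Extracting the environment keys to list can be sketched as below; the sample `environments.yaml` content here is made up for illustration, and the real file may nest differently:

```shell
#!/usr/bin/env bash
# Sketch: list top-level environment keys from environments.yaml.
set -euo pipefail

# Hypothetical minimal registry, for illustration only:
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
local:
  display_name: "Local machine"
ibex:
  display_name: "KAUST Ibex"
EOF

# Top-level keys are unindented lines ending in ':'.
keys=$(grep -E '^[A-Za-z0-9_-]+:' "$tmpfile" | sed 's/:.*//')
echo "$keys"
```

Each key would then be paired with its `display_name` (or a short note) when shown to the user.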

### 2. Ask for experiment details


Ask the user in a single message:

1. Environment: Which compute env? (show available choices from `environments.yaml` + "other")
2. Script / command: What command to run? (e.g., `uv run python train.py --lr 1e-3`)
3. Job name: Short identifier (e.g., `baseline-cifar10`, `ablation-no-attn`). Default: script basename + date.
4. GPU count: How many GPUs? (default from env profile, 0 for CPU-only)
5. Walltime / time limit: (SLURM only) How long? (default from env profile)
6. Conda env or venv: Name of the conda environment, or `.venv` path (if applicable)
7. Output directory: Where to save checkpoints/results? (default: `outputs/<job-name>/`)
8. Anything special?: Extra env vars, array job, specific GPU type, PVC mounts (RunAI), etc.

If `--env`, `--script`, `--name`, or `--gpus` were passed as arguments, pre-fill those answers.
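The pre-fill step can be sketched as a plain flag loop, assuming the four flags arrive as `--flag value` pairs (the example invocation baked in below is hypothetical):

```shell
#!/usr/bin/env bash
# Sketch: pre-fill answers from --env/--script/--name/--gpus arguments.
set -euo pipefail

ENV="" SCRIPT="" NAME="" GPUS=""
set -- --env ibex --gpus 2          # example invocation, for illustration
while [ $# -gt 0 ]; do
  case "$1" in
    --env)    ENV="$2";    shift 2 ;;
    --script) SCRIPT="$2"; shift 2 ;;
    --name)   NAME="$2";   shift 2 ;;
    --gpus)   GPUS="$2";   shift 2 ;;
    *)        shift ;;               # anything else stays an interactive question
  esac
done

# Default job name: script basename + date.
[ -n "$NAME" ] || NAME="$(basename "${SCRIPT:-job}" .py)-$(date +%Y-%m-%d)"
echo "env=$ENV gpus=$GPUS name=$NAME"
```

Any value still empty after this loop is asked interactively.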

### 3. Locate the project root


```bash
git rev-parse --show-toplevel 2>/dev/null || pwd
```

Also capture the short git commit hash:

```bash
git rev-parse --short HEAD 2>/dev/null || echo "no-git"
```

### 4. Generate the job script


Based on the environment type:

#### `type: slurm` (ibex, uw, or any SLURM cluster)


Read the SLURM template from `<installed-skill-dir>/templates/slurm_job.sh`.

Fill in all `{PLACEHOLDER}` variables:

| Placeholder | Value |
|---|---|
| `{PROJECT}` | project directory name |
| `{ENV_NAME}` | environment key (e.g., `ibex`) |
| `{ENV_DISPLAY}` | display name from profile |
| `{DATE}` | today's date, YYYY-MM-DD |
| `{COMMIT}` | short git SHA |
| `{JOB_NAME}` | user-provided job name |
| `{SCRIPT_NAME}` | filename of the generated script |
| `{PARTITION}` | from env profile defaults (or user override) |
| `{CPUS}` | `cpus_per_task` from profile (or user override) |
| `{GPUS}` | user-provided GPU count |
| `{MEM}` | from profile defaults (or user override) |
| `{WALLTIME}` | user-provided or profile default |
| `{LOG_DIR}` | `outputs/logs/<job-name>` |
| `{OUTPUT_DIR}` | `outputs/<job-name>` |
| `{PROJECT_ROOT}` | absolute project root path |
| `{CONDA_ENV}` | user-provided env name |
| `{RUN_COMMAND}` | user-provided command |
| `{SCRATCH}` | scratch path from env profile |

Uncomment the relevant `module load` lines based on the env profile's `common_modules`. Uncomment the `conda activate` or `source .venv/bin/activate` line based on the user's answer. If a scratch path is in the env profile, uncomment the TMPDIR block.

Output path: `jobs/<job-name>.sh`
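The substitution itself can be sketched with `sed`; the three-line template below is a stand-in for the real `templates/slurm_job.sh`:

```shell
#!/usr/bin/env bash
# Sketch: fill {PLACEHOLDER} variables in a template with sed.
set -euo pipefail

template='#SBATCH --job-name={JOB_NAME}
#SBATCH --gres=gpu:{GPUS}
{RUN_COMMAND}'

JOB_NAME="baseline-cifar10"
GPUS=2
RUN_COMMAND="uv run python train.py --lr 1e-3"

# '|' as the sed delimiter so values containing '/' survive.
script=$(printf '%s' "$template" | sed \
  -e "s|{JOB_NAME}|$JOB_NAME|g" \
  -e "s|{GPUS}|$GPUS|g" \
  -e "s|{RUN_COMMAND}|$RUN_COMMAND|g")
printf '%s\n' "$script"
```

In practice the filled result is written to `jobs/<job-name>.sh` rather than printed.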

#### `type: runai` (runai profile)


Read the RunAI template from `<installed-skill-dir>/templates/runai_job.sh`.

Fill in placeholders:

| Placeholder | Value |
|---|---|
| `{PROJECT}` | project directory name |
| `{DATE}` | today's date |
| `{COMMIT}` | short git SHA |
| `{JOB_NAME}` | user-provided job name |
| `{SCRIPT_NAME}` | filename of the generated script |
| `{IMAGE}` | from env profile `default_image` (ask user to confirm) |
| `{RUNAI_PROJECT}` | from env profile `project` |
| `{GPUS}` | GPU count |
| `{CPUS}` | CPU count from profile defaults |
| `{MEM}` | memory from profile defaults |
| `{PVC_FLAGS}` | generated from `pvc_mounts` in profile: `--pvc claim:path \` per mount |
| `{RUN_COMMAND}` | user-provided command |

Output path: `jobs/<job-name>-runai.sh`
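Generating `{PVC_FLAGS}` from a profile's `pvc_mounts` list can be sketched as below; the claim names are made up:

```shell
#!/usr/bin/env bash
# Sketch: turn "claim:path" mounts into repeated --pvc flags.
set -euo pipefail

mounts="home-claim:/home/user data-claim:/data"   # hypothetical pvc_mounts
PVC_FLAGS=""
for m in $mounts; do
  PVC_FLAGS="$PVC_FLAGS--pvc $m "
done
echo "runai submit my-job $PVC_FLAGS..."
```

The resulting string is spliced into the `runai submit` line of the generated script.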

#### `type: local`


Read the local template from `<installed-skill-dir>/templates/local_run.sh`. Fill in placeholders similarly. Uncomment conda/venv activation as appropriate.

Output path: `jobs/<job-name>-local.sh`

#### `type: other` / unknown


If the user specifies an environment not in `environments.yaml`:

1. Ask: "What scheduler does it use? (slurm / runai / other)"
2. If SLURM-compatible: use the SLURM template with the info the user provides.
3. If truly novel: generate a minimal generic wrapper and explain what to fill in.
4. Suggest: "Want me to add this environment to `environments.yaml` for future use?"
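A minimal generic wrapper might look like the sketch below; every value in it is a placeholder for the user to fill in, not a real template from this skill:

```shell
#!/usr/bin/env bash
# Sketch: minimal generic wrapper for a scheduler not in environments.yaml.
set -euo pipefail

JOB_NAME="my-job"                       # fill in
OUTPUT_DIR="outputs/$JOB_NAME"
LOG_DIR="outputs/logs/$JOB_NAME"
mkdir -p "$OUTPUT_DIR" "$LOG_DIR"

GIT_COMMIT=$(git rev-parse --short HEAD 2>/dev/null || echo "no-git")
export GIT_COMMIT
echo "$JOB_NAME @ $GIT_COMMIT" | tee "$LOG_DIR/start.log"

# Replace with the actual run command, e.g.:
# python train.py --output "$OUTPUT_DIR"
```

The scheduler-specific submission step (queue directives, resource flags) is what the user must supply on top of this.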

### 5. Write the job script and preview


Create the job script directory, log directory, and output directory before previewing or submitting:

```bash
mkdir -p <project-root>/jobs
mkdir -p <project-root>/outputs/logs/<job-name>
mkdir -p <output-dir>
```

Write the filled-in script to `jobs/<job-name>.sh` (or `-runai.sh` / `-local.sh`). Show the user the full generated script for review.

### 6. Show the submit command and ask to launch


Print the exact command(s) to submit, tailored to the environment:

#### SLURM (Ibex / UW / etc.)



If you're already on the login node:


```bash
sbatch jobs/<job-name>.sh
```

If submitting from your laptop (requires ssh access):


```bash
scp jobs/<job-name>.sh <ssh-alias>:<project-root>/jobs/
ssh <ssh-alias> "cd <project-root> && mkdir -p outputs/logs/<job-name> <output-dir> jobs && sbatch jobs/<job-name>.sh"
```

Monitor:


```bash
squeue -u $USER
sacct -j <jobid> --format=JobID,State,Elapsed,AllocGRES
tail -f outputs/logs/<job-name>/slurm-<jobid>.out
```

#### RunAI

```bash
bash jobs/<job-name>-runai.sh
```

Monitor:


```bash
runai list
runai logs <job-name> -f
```

#### Local



Attached (output in terminal):


```bash
bash jobs/<job-name>-local.sh
```

Detached in tmux:


```bash
tmux new-session -d -s <job-name> "bash jobs/<job-name>-local.sh"
tmux attach -t <job-name>
```

Background with nohup:


```bash
nohup bash jobs/<job-name>-local.sh &
```

Ask: **"Want me to run the submit command now?"**

- If yes and local: run it directly.
- If yes and remote: run the `scp` + `ssh sbatch` command (requires ssh key auth to be set up).
- If no: remind the user that the script is saved in `jobs/` and ready to submit.

### 7. Offer to add to jobs index (optional)


If a `jobs/README.md` or `jobs/index.md` exists, offer to append a one-line entry:

```text
| {DATE} | {JOB_NAME} | {ENV_NAME} | {COMMIT} | {RUN_COMMAND_BRIEF} |
```

If the repo follows the code evidence layout from `init-python-project`, also offer to create or update a short run pointer under:

```text
docs/runs/<DATE>-<job-name>.md
```

This file should contain the command, config, commit, output path, expected metric, and monitor command. It should not contain raw logs.
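Appending the index entry can be sketched as follows; the column values and the temporary repo set up at the top are examples only:

```shell
#!/usr/bin/env bash
# Sketch: append a one-line entry to jobs/README.md when it exists.
set -euo pipefail

# Stand-in repo with an existing index (for illustration):
cd "$(mktemp -d)"
mkdir -p jobs && echo "| Date | Job | Env | Commit | Command |" > jobs/README.md

DATE=2024-01-15 JOB_NAME=baseline-cifar10 ENV_NAME=ibex COMMIT=abc1234
RUN_COMMAND_BRIEF="train.py --lr 1e-3"

if [ -f jobs/README.md ]; then
  printf '| %s | %s | %s | %s | %s |\n' \
    "$DATE" "$JOB_NAME" "$ENV_NAME" "$COMMIT" "$RUN_COMMAND_BRIEF" >> jobs/README.md
fi
tail -n 1 jobs/README.md
```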

### 8. Update project memory when present


If the repo has `memory/` or a worktree `.agent/worktree-status.md`, update only verified run pointers:

- `memory/evidence-board.md`: add or update the linked `EXP-###` with job script path, commit, command, output directory, and status (`planned`, `submitted`, or `running` only if verified)
- `docs/runs/`: write a small run pointer when the code repo uses that convention
- `memory/action-board.md`: mark the launch action as `doing` or create a monitor action
- `memory/current-status.md`: record the latest known job and what must be checked next
- `<worktree>/.agent/worktree-status.md`: link the run to the worktree purpose and exit condition

Do not store queue state, job success, or final metric values as durable facts unless they were verified in this session. Use `needs-verification` for monitor tasks.


## Environment Reference


All environments are defined in `environments.yaml`. The current known environments:

| Key | Type | Cluster | Notes |
|---|---|---|---|
| `local` | local | Current machine | tmux/nohup |
| `ibex` | slurm | KAUST Ibex | `ilogin.ibex.kaust.edu.sa`; gpu/batch/himem partitions |
| `uw` | slurm | UW HPC | Placeholder: update `environments.yaml` with actual details |
| `runai` | runai | EPFL RunAI | Kubernetes; update project/image in `environments.yaml` |

## Adding a New Environment


Edit `<installed-skill-dir>/environments.yaml` and add a block:

```yaml
my-cluster:
  type: slurm                       # or runai / local
  display_name: "My University HPC"
  login_node: "login.cluster.edu"
  ssh_alias: mycluster
  scheduler: slurm
  partitions:
    gpu:
      name: gpu
      flag: "--partition=gpu"
      gpu_flag: "--gres=gpu:{count}"
      max_gpus_per_job: 4
  defaults:
    partition: gpu
    gpus: 1
    cpus_per_task: 4
    mem: "32G"
    walltime: "12:00:00"
    max_walltime: "48:00:00"
  storage:
    home: "/home/{user}"
    scratch: "/scratch/{user}"
  module_system: lmod
  common_modules:
    - "cuda/12.1"
    - "python/3.11"
  notes: "..."
```


## Reproducibility Conventions


Every generated job script includes:

- Git commit hash in the header and as an env var (`GIT_COMMIT`)
- Structured output directory: `outputs/<job-name>/` for checkpoints, `outputs/logs/<job-name>/` for logs
- Timestamped log files so reruns don't overwrite
- Exit code propagation so job arrays and downstream scripts detect failures

The `jobs/` directory should be committed to git (the scripts are small text files). Actual outputs go to `outputs/`, which is typically `.gitignore`d.
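These conventions typically translate to a few lines near the top of each generated script; the sketch below illustrates the idea and is not the actual template contents:

```shell
#!/usr/bin/env bash
# Sketch: reproducibility boilerplate shared by generated job scripts.
set -euo pipefail

JOB_NAME="baseline-cifar10"                        # example name
GIT_COMMIT=$(git rev-parse --short HEAD 2>/dev/null || echo "no-git")
export GIT_COMMIT                                  # visible to the training code

STAMP=$(date +%Y%m%d-%H%M%S)                       # reruns get fresh log files
LOG_FILE="outputs/logs/$JOB_NAME/run-$STAMP.log"
mkdir -p "$(dirname "$LOG_FILE")" "outputs/$JOB_NAME"

# Stand-in for the real run command; capture its exit code.
echo "commit=$GIT_COMMIT" > "$LOG_FILE"
exit_code=$?
# A real job script ends with: exit "$exit_code"
```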


## Example Invocations


```text
/run-experiment                                              # interactive wizard
/run-experiment --env ibex --script train.py --gpus 2
/run-experiment --env local --script eval.py --name eval-baseline
/run-experiment --env runai --gpus 4 --name big-run
/run-experiment --env ibex --script sweep.py --name sweep --gpus 1
```


## Common Patterns


### Job Array (SLURM): hyperparameter sweep


When the user says "I want to sweep over N configs":

1. Ask for the sweep configs or config file (e.g., `configs/sweep.yaml` with N entries).
2. Add `#SBATCH --array=0-{N-1}%{max_concurrent}` to the script.
3. Add to the run command: `--config configs/sweep.yaml --config-idx $SLURM_ARRAY_TASK_ID`
4. Output dir: `outputs/<job-name>/$SLURM_ARRAY_TASK_ID/`
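For example, with 8 configs throttled to 4 concurrent tasks, the array directive computes to:

```shell
#!/usr/bin/env bash
# Sketch: compute the SLURM array directive for an N-config sweep.
set -euo pipefail

N=8                 # number of sweep configs
MAX_CONCURRENT=4    # at most this many array tasks run at once

ARRAY_SPEC="0-$((N - 1))%$MAX_CONCURRENT"
echo "#SBATCH --array=$ARRAY_SPEC"
# Each task then runs:
#   python train.py --config configs/sweep.yaml --config-idx $SLURM_ARRAY_TASK_ID
```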

### Multi-GPU (DDP)


When GPUs > 1 and the env is SLURM:

- Add the `--ntasks-per-node={GPUS}` directive
- Wrap the command with `torchrun --nproc_per_node={GPUS}` or `srun python -m torch.distributed.launch`
- Ask the user which distributed launcher they use
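The wrapping step can be sketched as below, assuming the user's command starts with `python` (as in the examples earlier):

```shell
#!/usr/bin/env bash
# Sketch: wrap a single-GPU command with torchrun when GPUS > 1.
set -euo pipefail

GPUS=4
RUN_COMMAND="python train.py --lr 1e-3"

if [ "$GPUS" -gt 1 ]; then
  # torchrun takes the script directly, so drop the leading "python ".
  RUN_COMMAND="torchrun --nproc_per_node=$GPUS ${RUN_COMMAND#python }"
fi
echo "$RUN_COMMAND"
```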

### Interactive Session (Debugging)


When the user wants to debug interactively (not submit a batch job):

Ibex:

```bash
srun --partition=gpu --gres=gpu:1 --cpus-per-task=4 --mem=32G --time=2:00:00 --pty bash
```

RunAI:

```bash
runai submit <name> --image <image> --gpu 1 --interactive --stdin -- bash
runai bash <name>
```

Generate this command directly without creating a script file.