# run-experiment

Submit an ML experiment to a compute environment: local machine, SLURM HPC (Ibex, UW, etc.), or RunAI/Kubernetes (EPFL). Generates a reproducible job script in `jobs/` that is committed alongside the code, then provides the exact submit command to run. Pair this skill with `research-project-memory` when a launched job should be linked to a planned experiment, evidence item, worktree, or project action.

## Skill Directory Layout


```text
<installed-skill-dir>/
├── SKILL.md
├── environments.yaml        # Environment profiles (extend for new clusters)
└── templates/
    ├── slurm_job.sh         # SLURM template (Ibex, UW, any SLURM cluster)
    ├── runai_job.sh         # RunAI/Kubernetes template (EPFL)
    └── local_run.sh         # Local tmux/nohup template
```


## Steps to Follow


### 1. Read the environment registry


Resolve `<installed-skill-dir>` as the directory containing this `SKILL.md`, then read `<installed-skill-dir>/environments.yaml`. List the available environments to the user with a one-line description each.
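Extracting the environment keys to list can be sketched as below; the sample `environments.yaml` content here is made up for illustration, and the real file may nest differently:

```shell
#!/usr/bin/env bash
# Sketch: list top-level environment keys from environments.yaml.
set -euo pipefail

# Hypothetical minimal registry, for illustration only:
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
local:
  display_name: "Local machine"
ibex:
  display_name: "KAUST Ibex"
EOF

# Top-level keys are unindented lines ending in ':'.
keys=$(grep -E '^[A-Za-z0-9_-]+:' "$tmpfile" | sed 's/:.*//')
echo "$keys"
```

Each key would then be paired with its `display_name` (or a short note) when shown to the user.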

### 2. Ask for experiment details


Ask the user in a single message:

1. Environment: Which compute env? (show available choices from `environments.yaml` + "other")
2. Script / command: What command to run? (e.g., `uv run python train.py --lr 1e-3`)
3. Job name: Short identifier (e.g., `baseline-cifar10`, `ablation-no-attn`). Default: script basename + date.
4. GPU count: How many GPUs? (default from env profile, 0 for CPU-only)
5. Walltime / time limit: (SLURM only) How long? (default from env profile)
6. Conda env or venv: Name of the conda environment, or `.venv` path (if applicable)
7. Output directory: Where to save checkpoints/results? (default: `outputs/<job-name>/`)
8. Anything special?: Extra env vars, array job, specific GPU type, PVC mounts (RunAI), etc.

If `--env`, `--script`, `--name`, or `--gpus` were passed as arguments, pre-fill those answers.
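The pre-fill step can be sketched as a plain flag loop, assuming the four flags arrive as `--flag value` pairs (the example invocation baked in below is hypothetical):

```shell
#!/usr/bin/env bash
# Sketch: pre-fill answers from --env/--script/--name/--gpus arguments.
set -euo pipefail

ENV="" SCRIPT="" NAME="" GPUS=""
set -- --env ibex --gpus 2          # example invocation, for illustration
while [ $# -gt 0 ]; do
  case "$1" in
    --env)    ENV="$2";    shift 2 ;;
    --script) SCRIPT="$2"; shift 2 ;;
    --name)   NAME="$2";   shift 2 ;;
    --gpus)   GPUS="$2";   shift 2 ;;
    *)        shift ;;               # anything else stays an interactive question
  esac
done

# Default job name: script basename + date.
[ -n "$NAME" ] || NAME="$(basename "${SCRIPT:-job}" .py)-$(date +%Y-%m-%d)"
echo "env=$ENV gpus=$GPUS name=$NAME"
```

Any value still empty after this loop is asked interactively.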

### 3. Locate the project root


```bash
git rev-parse --show-toplevel 2>/dev/null || pwd
```

Also capture the short git commit hash:

```bash
git rev-parse --short HEAD 2>/dev/null || echo "no-git"
```

### 4. Generate the job script


Based on the environment type:

#### `type: slurm` (ibex, uw, or any SLURM cluster)


Read the SLURM template from `<installed-skill-dir>/templates/slurm_job.sh`.

Fill in all `{PLACEHOLDER}` variables:

| Placeholder | Value |
|---|---|
| `{PROJECT}` | project directory name |
| `{ENV_NAME}` | environment key (e.g., `ibex`) |
| `{ENV_DISPLAY}` | display name from profile |
| `{DATE}` | today's date, YYYY-MM-DD |
| `{COMMIT}` | short git SHA |
| `{JOB_NAME}` | user-provided job name |
| `{SCRIPT_NAME}` | filename of the generated script |
| `{PARTITION}` | from env profile defaults (or user override) |
| `{CPUS}` | `cpus_per_task` from profile (or user override) |
| `{GPUS}` | user-provided GPU count |
| `{MEM}` | from profile defaults (or user override) |
| `{WALLTIME}` | user-provided or profile default |
| `{LOG_DIR}` | `outputs/logs/<job-name>` |
| `{OUTPUT_DIR}` | `outputs/<job-name>` |
| `{PROJECT_ROOT}` | absolute project root path |
| `{CONDA_ENV}` | user-provided env name |
| `{RUN_COMMAND}` | user-provided command |
| `{SCRATCH}` | scratch path from env profile |

Uncomment the relevant `module load` lines based on the env profile's `common_modules`. Uncomment the `conda activate` or `source .venv/bin/activate` line based on the user's answer. If a scratch path is in the env profile, uncomment the TMPDIR block.

Output path: `jobs/<job-name>.sh`
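The substitution itself can be sketched with `sed`; the three-line template below is a stand-in for the real `templates/slurm_job.sh`:

```shell
#!/usr/bin/env bash
# Sketch: fill {PLACEHOLDER} variables in a template with sed.
set -euo pipefail

template='#SBATCH --job-name={JOB_NAME}
#SBATCH --gres=gpu:{GPUS}
{RUN_COMMAND}'

JOB_NAME="baseline-cifar10"
GPUS=2
RUN_COMMAND="uv run python train.py --lr 1e-3"

# '|' as the sed delimiter so values containing '/' survive.
script=$(printf '%s' "$template" | sed \
  -e "s|{JOB_NAME}|$JOB_NAME|g" \
  -e "s|{GPUS}|$GPUS|g" \
  -e "s|{RUN_COMMAND}|$RUN_COMMAND|g")
printf '%s\n' "$script"
```

In practice the filled result is written to `jobs/<job-name>.sh` rather than printed.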

#### `type: runai` (runai profile)


Read the RunAI template from `<installed-skill-dir>/templates/runai_job.sh`.

Fill in placeholders:

| Placeholder | Value |
|---|---|
| `{PROJECT}` | project directory name |
| `{DATE}` | today's date |
| `{COMMIT}` | short git SHA |
| `{JOB_NAME}` | user-provided job name |
| `{SCRIPT_NAME}` | filename of the generated script |
| `{IMAGE}` | from env profile `default_image` (ask user to confirm) |
| `{RUNAI_PROJECT}` | from env profile `project` |
| `{GPUS}` | GPU count |
| `{CPUS}` | CPU count from profile defaults |
| `{MEM}` | memory from profile defaults |
| `{PVC_FLAGS}` | generated from `pvc_mounts` in profile: `--pvc claim:path \` per mount |
| `{RUN_COMMAND}` | user-provided command |

Output path: `jobs/<job-name>-runai.sh`
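Generating `{PVC_FLAGS}` from a profile's `pvc_mounts` list can be sketched as below; the claim names are made up:

```shell
#!/usr/bin/env bash
# Sketch: turn "claim:path" mounts into repeated --pvc flags.
set -euo pipefail

mounts="home-claim:/home/user data-claim:/data"   # hypothetical pvc_mounts
PVC_FLAGS=""
for m in $mounts; do
  PVC_FLAGS="$PVC_FLAGS--pvc $m "
done
echo "runai submit my-job $PVC_FLAGS..."
```

The resulting string is spliced into the `runai submit` line of the generated script.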

#### `type: local`


Read the local template from `<installed-skill-dir>/templates/local_run.sh`. Fill in placeholders similarly. Uncomment conda/venv activation as appropriate.

Output path: `jobs/<job-name>-local.sh`

#### `type: other` / unknown


If the user specifies an environment not in `environments.yaml`:

1. Ask: "What scheduler does it use? (slurm / runai / other)"
2. If SLURM-compatible: use the SLURM template with the info the user provides.
3. If truly novel: generate a minimal generic wrapper and explain what to fill in.
4. Suggest: "Want me to add this environment to `environments.yaml` for future use?"
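A minimal generic wrapper might look like the sketch below; every value in it is a placeholder for the user to fill in, not a real template from this skill:

```shell
#!/usr/bin/env bash
# Sketch: minimal generic wrapper for a scheduler not in environments.yaml.
set -euo pipefail

JOB_NAME="my-job"                       # fill in
OUTPUT_DIR="outputs/$JOB_NAME"
LOG_DIR="outputs/logs/$JOB_NAME"
mkdir -p "$OUTPUT_DIR" "$LOG_DIR"

GIT_COMMIT=$(git rev-parse --short HEAD 2>/dev/null || echo "no-git")
export GIT_COMMIT
echo "$JOB_NAME @ $GIT_COMMIT" | tee "$LOG_DIR/start.log"

# Replace with the actual run command, e.g.:
# python train.py --output "$OUTPUT_DIR"
```

The scheduler-specific submission step (queue directives, resource flags) is what the user must supply on top of this.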

### 5. Write the job script and preview


Create the job script directory, log directory, and output directory before previewing or submitting:

```bash
mkdir -p <project-root>/jobs
mkdir -p <project-root>/outputs/logs/<job-name>
mkdir -p <output-dir>
```

Write the filled-in script to `jobs/<job-name>.sh` (or `-runai.sh` / `-local.sh`). Show the user the full generated script for review.

### 6. Show the submit command and ask to launch


Print the exact command(s) to submit, tailored to the environment:

#### SLURM (Ibex / UW / etc.)



If you're already on the login node:


```bash
sbatch jobs/<job-name>.sh
```

If submitting from your laptop (requires ssh access):


```bash
scp jobs/<job-name>.sh <ssh-alias>:<project-root>/jobs/
ssh <ssh-alias> "cd <project-root> && mkdir -p outputs/logs/<job-name> <output-dir> jobs && sbatch jobs/<job-name>.sh"
```

Monitor:


```bash
squeue -u $USER
sacct -j <jobid> --format=JobID,State,Elapsed,AllocGRES
tail -f outputs/logs/<job-name>/slurm-<jobid>.out
```

#### RunAI

```bash
bash jobs/<job-name>-runai.sh
```

Monitor:


```bash
runai list
runai logs <job-name> -f
```

#### Local



Attached (output in terminal):


```bash
bash jobs/<job-name>-local.sh
```

Detached in tmux:


```bash
tmux new-session -d -s <job-name> "bash jobs/<job-name>-local.sh"
tmux attach -t <job-name>
```

Background with nohup:


```bash
nohup bash jobs/<job-name>-local.sh &
```

Ask: **"Want me to run the submit command now?"**

- If yes and local: run it directly.
- If yes and remote: run the `scp` + `ssh sbatch` command (requires ssh key auth to be set up).
- If no: remind the user that the script is saved in `jobs/` and ready to submit.

### 7. Offer to add to jobs index (optional)


If a `jobs/README.md` or `jobs/index.md` exists, offer to append a one-line entry:

```text
| {DATE} | {JOB_NAME} | {ENV_NAME} | {COMMIT} | {RUN_COMMAND_BRIEF} |
```

If the repo follows the code evidence layout from `init-python-project`, also offer to create or update a short run pointer under:

```text
docs/runs/<DATE>-<job-name>.md
```

This file should contain the command, config, commit, output path, expected metric, and monitor command. It should not contain raw logs.
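Appending the index entry can be sketched as follows; the column values and the temporary repo set up at the top are examples only:

```shell
#!/usr/bin/env bash
# Sketch: append a one-line entry to jobs/README.md when it exists.
set -euo pipefail

# Stand-in repo with an existing index (for illustration):
cd "$(mktemp -d)"
mkdir -p jobs && echo "| Date | Job | Env | Commit | Command |" > jobs/README.md

DATE=2024-01-15 JOB_NAME=baseline-cifar10 ENV_NAME=ibex COMMIT=abc1234
RUN_COMMAND_BRIEF="train.py --lr 1e-3"

if [ -f jobs/README.md ]; then
  printf '| %s | %s | %s | %s | %s |\n' \
    "$DATE" "$JOB_NAME" "$ENV_NAME" "$COMMIT" "$RUN_COMMAND_BRIEF" >> jobs/README.md
fi
tail -n 1 jobs/README.md
```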

### 8. Update project memory when present


If the repo has `memory/` or a worktree `.agent/worktree-status.md`, update only verified run pointers:

- `memory/evidence-board.md`: add or update the linked `EXP-###` with job script path, commit, command, output directory, and status (`planned`, `submitted`, or `running` only if verified)
- `docs/runs/`: write a small run pointer when the code repo uses that convention
- `memory/action-board.md`: mark the launch action as `doing` or create a monitor action
- `memory/current-status.md`: record the latest known job and what must be checked next
- `<worktree>/.agent/worktree-status.md`: link the run to the worktree purpose and exit condition

Do not store queue state, job success, or final metric values as durable facts unless they were verified in this session. Use `needs-verification` for monitor tasks.


## Environment Reference


All environments are defined in `environments.yaml`. The current known environments:

| Key | Type | Cluster | Notes |
|---|---|---|---|
| `local` | local | Current machine | tmux/nohup |
| `ibex` | slurm | KAUST Ibex | `ilogin.ibex.kaust.edu.sa`; gpu/batch/himem partitions |
| `uw` | slurm | UW HPC | Placeholder: update `environments.yaml` with actual details |
| `runai` | runai | EPFL RunAI | Kubernetes; update project/image in `environments.yaml` |

## Adding a New Environment


Edit `<installed-skill-dir>/environments.yaml` and add a block:

```yaml
my-cluster:
  type: slurm                       # or runai / local
  display_name: "My University HPC"
  login_node: "login.cluster.edu"
  ssh_alias: mycluster
  scheduler: slurm
  partitions:
    gpu:
      name: gpu
      flag: "--partition=gpu"
      gpu_flag: "--gres=gpu:{count}"
      max_gpus_per_job: 4
  defaults:
    partition: gpu
    gpus: 1
    cpus_per_task: 4
    mem: "32G"
    walltime: "12:00:00"
    max_walltime: "48:00:00"
  storage:
    home: "/home/{user}"
    scratch: "/scratch/{user}"
  module_system: lmod
  common_modules:
    - "cuda/12.1"
    - "python/3.11"
  notes: "..."
```


## Reproducibility Conventions


Every generated job script includes:

- Git commit hash in the header and as an env var (`GIT_COMMIT`)
- Structured output directory: `outputs/<job-name>/` for checkpoints, `outputs/logs/<job-name>/` for logs
- Timestamped log files so reruns don't overwrite
- Exit code propagation so job arrays and downstream scripts detect failures

The `jobs/` directory should be committed to git (the scripts are small text files). Actual outputs go to `outputs/`, which is typically `.gitignore`d.
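These conventions typically translate to a few lines near the top of each generated script; the sketch below illustrates the idea and is not the actual template contents:

```shell
#!/usr/bin/env bash
# Sketch: reproducibility boilerplate shared by generated job scripts.
set -euo pipefail

JOB_NAME="baseline-cifar10"                        # example name
GIT_COMMIT=$(git rev-parse --short HEAD 2>/dev/null || echo "no-git")
export GIT_COMMIT                                  # visible to the training code

STAMP=$(date +%Y%m%d-%H%M%S)                       # reruns get fresh log files
LOG_FILE="outputs/logs/$JOB_NAME/run-$STAMP.log"
mkdir -p "$(dirname "$LOG_FILE")" "outputs/$JOB_NAME"

# Stand-in for the real run command; capture its exit code.
echo "commit=$GIT_COMMIT" > "$LOG_FILE"
exit_code=$?
# A real job script ends with: exit "$exit_code"
```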


## Example Invocations


```text
/run-experiment                                              # interactive wizard
/run-experiment --env ibex --script train.py --gpus 2
/run-experiment --env local --script eval.py --name eval-baseline
/run-experiment --env runai --gpus 4 --name big-run
/run-experiment --env ibex --script sweep.py --name sweep --gpus 1
```


## Common Patterns


### Job Array (SLURM): hyperparameter sweep


When the user says "I want to sweep over N configs":

1. Ask for the sweep configs or config file (e.g., `configs/sweep.yaml` with N entries).
2. Add `#SBATCH --array=0-{N-1}%{max_concurrent}` to the script.
3. Add to the run command: `--config configs/sweep.yaml --config-idx $SLURM_ARRAY_TASK_ID`
4. Output dir: `outputs/<job-name>/$SLURM_ARRAY_TASK_ID/`
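For example, with 8 configs throttled to 4 concurrent tasks, the array directive computes to:

```shell
#!/usr/bin/env bash
# Sketch: compute the SLURM array directive for an N-config sweep.
set -euo pipefail

N=8                 # number of sweep configs
MAX_CONCURRENT=4    # at most this many array tasks run at once

ARRAY_SPEC="0-$((N - 1))%$MAX_CONCURRENT"
echo "#SBATCH --array=$ARRAY_SPEC"
# Each task then runs:
#   python train.py --config configs/sweep.yaml --config-idx $SLURM_ARRAY_TASK_ID
```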

### Multi-GPU (DDP)


When GPUs > 1 and the env is SLURM:

- Add the `--ntasks-per-node={GPUS}` directive
- Wrap the command with `torchrun --nproc_per_node={GPUS}` or `srun python -m torch.distributed.launch`
- Ask the user which distributed launcher they use
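The wrapping step can be sketched as below, assuming the user's command starts with `python` (as in the examples earlier):

```shell
#!/usr/bin/env bash
# Sketch: wrap a single-GPU command with torchrun when GPUS > 1.
set -euo pipefail

GPUS=4
RUN_COMMAND="python train.py --lr 1e-3"

if [ "$GPUS" -gt 1 ]; then
  # torchrun takes the script directly, so drop the leading "python ".
  RUN_COMMAND="torchrun --nproc_per_node=$GPUS ${RUN_COMMAND#python }"
fi
echo "$RUN_COMMAND"
```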

### Interactive Session (Debugging)


When the user wants to debug interactively (not submit a batch job):

Ibex:

```bash
srun --partition=gpu --gres=gpu:1 --cpus-per-task=4 --mem=32G --time=2:00:00 --pty bash
```

RunAI:

```bash
runai submit <name> --image <image> --gpu 1 --interactive --stdin -- bash
runai bash <name>
```

Generate this command directly without creating a script file.