experiment-queue


Experiment Queue


Orchestrate large batches of ML experiments on SSH remote GPU servers with proper state tracking, OOM retry, stale cleanup, and wave transitions.

When to Use This Skill


Use when /run-experiment is insufficient:
  • ≥10 jobs that need batching across GPUs
  • Multi-seed sweeps (e.g., 21 seeds × 12 cells)
  • Wave transitions (run wave 1, wait, run wave 2, wait, run wave 3...)
  • Teacher+student chains (train teacher then distill; auto-trigger student after teacher done)
  • OOM-prone configs where you need to retry with a different GPU or wait
  • Mixed seed grids where failed cells need re-running

Do NOT use for:
  • Single ad-hoc experiments (use /run-experiment)
  • Modal/Vast.ai deployments (those have their own orchestration)
  • Experiments that need manual inspection between runs

Why This Exists


Based on session audit (2026-04-16), the major wall-clock sinks in multi-seed grid experiments are:
  1. Stale screens — python finishes, wandb uploads, screen hangs, next wave blocked
  2. OOM on shared GPU — previous job's memory not yet released
  3. Wave race — new wave launches before previous wave fully settles
  4. Missing checkpoints — student launches before teacher saved
  5. Parser duplication — rewriting multi-seed analysis python every batch
All of these are pure engineering friction that can be orchestrated.

Core Concepts


Job Manifest


A manifest lists jobs with explicit state:
```yaml
project: dllm_distill
cwd: /home/rfyang/rfyang_code/dllm_experiments_torch
conda: dllm
# Optional: override the conda hook path if conda is not at a standard location.
# Can be a bare path (wrapped automatically) or a full eval "$(... shell.bash hook)" string.
# Falls back to auto-detect of ~/anaconda3, ~/miniconda3, /opt/anaconda3, etc.,
# or the ARIS_CONDA_HOOK environment variable.
# conda_hook: /custom/path/to/conda
ssh: SJTUServer5
default_cmd: >
  python run_pc_distill_exp.py --backbone softmax --lam 0.5 --K 500 --L 96 --W 16
  --n_steps 30000 --batch_size 128 --lr 1e-4
preconditions:
  - type: checkpoint_exists
    path: checkpoints/transformer/pcc_softmax_L96_K500_N{N}_wikitext103.pt
gpus: [0, 1, 2, 3, 4, 5, 6, 7]
max_parallel: 8
gpu_free_threshold_mib: 500  # optional, default 500; raise for shared servers, lower for tight packing
oom_retry:
  delay: 120
  max_attempts: 3
jobs:
  - id: s200_N64_n50K
    args: {seed: 200, n_hidden: 64, n_train_subset: 50000, subset_seed: 2024}
  - id: s200_N128_n50K
    args: {seed: 200, n_hidden: 128, n_train_subset: 50000, subset_seed: 2024}
  # ... 14 more
```

Job State Machine


```text
pending → running → completed
                 ↘ failed_oom → pending (after delay) [retry up to N]
                 ↘ failed_other → stuck (needs manual inspection)
stale_screen_detected → cleaned → pending
```
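The same transitions can be written down as a small table plus a guard, which is roughly what the scheduler has to enforce for bounded OOM retry. A minimal sketch in Python: the state names come from the diagram above, while the Job record and advance() helper are illustrative, not the actual queue_manager.py API.

```python
from dataclasses import dataclass

# Legal edges of the job state machine above.
TRANSITIONS = {
    "pending": {"running"},
    "running": {"completed", "failed_oom", "failed_other", "stale_screen_detected"},
    "failed_oom": {"pending", "stuck"},          # retry after delay, or give up
    "failed_other": {"stuck"},                   # needs manual inspection
    "stale_screen_detected": {"cleaned"},
    "cleaned": {"pending"},
}

@dataclass
class Job:
    id: str
    status: str = "pending"
    oom_attempts: int = 0

def advance(job: Job, new_status: str, max_oom_attempts: int = 3) -> None:
    """Apply one transition, rejecting illegal edges and bounding OOM retries."""
    if new_status not in TRANSITIONS.get(job.status, set()):
        raise ValueError(f"illegal transition {job.status} -> {new_status} for {job.id}")
    if new_status == "failed_oom":
        job.oom_attempts += 1
        if job.oom_attempts >= max_oom_attempts:
            new_status = "stuck"                 # bounded retry exhausted
    job.status = new_status
```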

Wave Orchestration


A "wave" is a batch of jobs that fit available GPUs. Next wave only starts when:
  1. All current-wave python processes have exited
  2. No stale screens remain for current-wave tags
  3. GPU memory has dropped below threshold (≤500 MiB)
  4. Precondition checks pass for next-wave jobs
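A sketch of that settle check, assuming each wave's screen sessions carry a common tag in their names and the default 500 MiB threshold; the precondition check (condition 4) is omitted, and the function names are illustrative.

```python
import subprocess

def gpu_memory_used_mib() -> list[int]:
    """Used memory per GPU in MiB, queried via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [int(x) for x in out.split()]

def live_screens(tag: str) -> list[str]:
    """Names of screen sessions whose name contains the wave tag."""
    # `screen -ls` exits non-zero when no sessions exist, so the return code is not checked.
    out = subprocess.run(["screen", "-ls"], capture_output=True, text=True).stdout
    return [line.split()[0] for line in out.splitlines() if tag in line]

def wave_settled(wave_tag: str, threshold_mib: int = 500) -> bool:
    """Conditions 1-3: no wave processes or stale screens left, and GPU memory released."""
    if live_screens(wave_tag):
        return False
    return all(used <= threshold_mib for used in gpu_memory_used_mib())
```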

Workflow


Step 1: Parse Manifest / Build from Grid


Input can be:
  • YAML manifest (explicit job list, recommended for complex cases)
  • Grid spec (Cartesian product of param values, e.g., N=[64,128,256] × n=[50K,150K,500K,652K])
  • Natural language description (Claude parses into manifest)
Save the built manifest to <project>/experiment_queue/<timestamp>/manifest.json for reproducibility.

Step 2: Pre-flight


  • Check SSH connection works
  • Check conda env exists on remote
  • Check cwd exists on remote
  • Check all preconditions (checkpoints, input files)
  • Check GPU availability (at least max_parallel free GPUs)
If any precondition fails, show the user which jobs are blocked and why.
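A local-side sketch of these checks over SSH, using the manifest fields from the YAML example (ssh, conda, cwd, preconditions, jobs). The GPU-availability check is the same memory query shown under Wave Orchestration, so it is left out here; preflight() is an illustrative helper, not the bundled tool's interface.

```python
import subprocess

def ssh_run(host: str, cmd: str) -> subprocess.CompletedProcess:
    """Run a command on the remote host, capturing output and exit code."""
    return subprocess.run(["ssh", host, cmd], capture_output=True, text=True)

def preflight(manifest: dict) -> list[str]:
    """Return reasons why jobs would be blocked; an empty list means all checks passed."""
    host, problems = manifest["ssh"], []
    if ssh_run(host, "true").returncode != 0:
        return [f"cannot reach {host} over SSH"]
    # Assumes conda is on the remote's non-interactive PATH.
    if manifest["conda"] not in ssh_run(host, "conda env list").stdout:
        problems.append(f"conda env '{manifest['conda']}' not found on {host}")
    if ssh_run(host, f"test -d {manifest['cwd']}").returncode != 0:
        problems.append(f"cwd {manifest['cwd']} missing on {host}")
    # Precondition templates (e.g. teacher checkpoints) are resolved from each job's args;
    # placeholder names are assumed to match arg names in the manifest.
    for job in manifest.get("jobs", []):
        for pre in manifest.get("preconditions", []):
            try:
                path = pre["path"].format(**job.get("args", {}))
            except KeyError:
                problems.append(f"{job['id']}: unresolved placeholder in {pre['path']}")
                continue
            if ssh_run(host, f"test -e {manifest['cwd']}/{path}").returncode != 0:
                problems.append(f"{job['id']}: missing precondition {path}")
    return problems
```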

Step 3: Launch Scheduler


Run tools/queue_manager.py (bundled with this skill) as a detached nohup process on the SSH host:

```bash
ssh <server> 'nohup python3 ~/.aris_queue/queue_manager.py \
  --manifest /tmp/manifest.json \
  --state /tmp/queue_state.json \
  --log /tmp/queue.log \
  > /tmp/queue_mgr.log 2>&1 &'
```

The scheduler:
  • Reads the manifest
  • Loops: for each pending job, assign to a free GPU, launch via screen
  • Polls job status (every 60s)
  • Detects stale screens (python exited but screen detached → kill)
  • Detects OOM (CUDA OOM in log → mark failed_oom → retry after delay)
  • Detects completion (expected output JSON/file exists) → mark completed
  • Launches the next wave when the current wave settles
  • Writes state to queue_state.json continuously
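As one concrete piece of the loop above, a launch via screen might look like the sketch below: a detached, named session pinned to one GPU, with output tee'd to a per-job log so the OOM and completion checks have something to read. The command construction is illustrative; queue_manager.py builds its own.

```python
import shlex
import subprocess

def launch_in_screen(host: str, cwd: str, conda_env: str,
                     job_id: str, gpu: int, cmd: str, log_dir: str = "/tmp") -> None:
    """Start one job in a detached screen session named q_<job_id> on the remote host."""
    # Paths and env names are assumed to contain no shell metacharacters.
    inner = (
        f'eval "$(conda shell.bash hook)" && conda activate {conda_env} && '
        f"cd {cwd} && CUDA_VISIBLE_DEVICES={gpu} {cmd} 2>&1 | tee {log_dir}/{job_id}.log"
    )
    remote = f"screen -dmS q_{job_id} bash -lc {shlex.quote(inner)}"
    subprocess.run(["ssh", host, remote], check=True)
```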

Step 4: Monitoring


User can check state anytime:

```bash
ssh <server> cat /tmp/queue_state.json | jq '.jobs | group_by(.status) | map({(.[0].status): length}) | add'
```

Or invoke /monitor-experiment, which reads the state file.
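The same per-status count without jq, as a short local Python one-off; it assumes only the state-file layout implied by the jq expression, i.e. a top-level jobs list whose entries carry a status field.

```python
import json
import subprocess
from collections import Counter

def status_counts(host: str, state_path: str = "/tmp/queue_state.json") -> Counter:
    """Mirror the jq pipeline above, e.g. Counter({'completed': 30, 'running': 8, 'pending': 4})."""
    raw = subprocess.run(["ssh", host, f"cat {state_path}"],
                         capture_output=True, text=True, check=True).stdout
    return Counter(job["status"] for job in json.loads(raw)["jobs"])

if __name__ == "__main__":
    print(status_counts("SJTUServer5"))
```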

Step 5: Post-completion


When all jobs in manifest.json are completed or stuck:
  • Scheduler exits cleanly
  • Write the final summary to <project>/experiment_queue/<timestamp>/summary.md
  • Invoke /analyze-results if analyze_on_complete: true
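A sketch of the drain condition and the summary write; the output location matches the path above, but the helper and the exact fields it emits are illustrative (the real report carries per-phase stats, as in the example below).

```python
import json
import pathlib
from collections import Counter
from datetime import datetime

def finalize(state_path: str, out_dir: str) -> pathlib.Path | None:
    """If every job is completed or stuck, write summary.md and return its path."""
    state = json.loads(pathlib.Path(state_path).read_text())
    counts = Counter(j["status"] for j in state["jobs"])
    if set(counts) - {"completed", "stuck"}:
        return None                               # queue not drained yet; keep scheduling
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / "summary.md"
    path.write_text(
        "Experiment Queue Summary\n\n"
        f"Project: {state.get('project', '?')}\n"
        f"Finished: {datetime.now():%Y-%m-%d %H:%M:%S}\n"
        f"Jobs: {counts.get('completed', 0)} completed, {counts.get('stuck', 0)} stuck\n"
    )
    return path
```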

Grid Spec Syntax


Instead of writing 36 job entries manually:

```yaml
grid:
  N: [64, 128, 256]
  n: [50000, 150000, 500000, 652000]
  seed: [42, 200, 201]
template:
  id: "s${seed}_N${N}_n${n}"
  args: {seed: ${seed}, n_hidden: ${N}, n_train_subset: ${n}}
```

Expands to 36 jobs automatically.
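The expansion itself is just a Cartesian product substituted into the template. A minimal sketch using string.Template, whose ${name} placeholders match the syntax above; the bundled build_manifest.py may differ in details (here, substituted arg values come out as strings).

```python
from itertools import product
from string import Template

def expand_grid(grid: dict, template: dict) -> list[dict]:
    """Expand a grid spec into concrete job entries."""
    keys = list(grid)
    jobs = []
    for combo in product(*(grid[k] for k in keys)):
        ctx = {k: str(v) for k, v in zip(keys, combo)}
        jobs.append({
            "id": Template(template["id"]).substitute(ctx),
            "args": {name: Template(str(spec)).substitute(ctx)
                     for name, spec in template["args"].items()},
        })
    return jobs

grid = {"N": [64, 128, 256], "n": [50000, 150000, 500000, 652000], "seed": [42, 200, 201]}
template = {"id": "s${seed}_N${N}_n${n}",
            "args": {"seed": "${seed}", "n_hidden": "${N}", "n_train_subset": "${n}"}}
assert len(expand_grid(grid, template)) == 36   # 3 × 4 × 3
```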

Wave Chaining


For sequential phases (teacher → student):
```yaml
phases:
  - name: train_teachers
    grid:
      N: [384, 512]
    template:
      cmd: python run_pc_exp.py --direction c --backbone softmax --n_hidden ${N} ...
      output_check: checkpoints/transformer/pcc_softmax_L96_K500_N${N}_wikitext103.pt

  - name: distill_students
    depends_on: train_teachers
    grid:
      N: [384, 512]
      seed: [42, 200, 201]
    template:
      cmd: python run_pc_distill_exp.py --n_hidden ${N} --seed ${seed} ...
      output_check: figures/pcdistill_sw_N${N}_*_seed${seed}.json
```

The scheduler enforces depends_on: distill_students jobs stay pending until all train_teachers jobs are completed.
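A sketch of that launch-time gate, assuming each job record carries the name of its phase and phases maps phase names to their specs; field names mirror the YAML above and launchable() is an illustrative name.

```python
def launchable(job: dict, phases: dict[str, dict], all_jobs: list[dict]) -> bool:
    """A pending job may launch only once every job in its depends_on phase has completed."""
    if job["status"] != "pending":
        return False
    dep = phases[job["phase"]].get("depends_on")
    if dep is None:
        return True
    return all(j["status"] == "completed" for j in all_jobs if j["phase"] == dep)
```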

OOM Handling


Detect OOM from stdout:

```regex
torch\.OutOfMemoryError: CUDA out of memory
```

On detection:
  1. Mark the job failed_oom
  2. Kill the screen
  3. Wait oom_retry.delay seconds
  4. Check if the current GPU is free; if not, try another free GPU
  5. Requeue as pending
  6. After oom_retry.max_attempts attempts, mark the job stuck
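A sketch of the detect-and-requeue step against a per-job log. The pattern is the one above (broaden it if your PyTorch version reports OOM differently); the job and oom_retry fields follow the manifest, and GPU reassignment plus the screen kill are left to the surrounding scheduler.

```python
import pathlib
import re
import time

OOM_PATTERN = re.compile(r"torch\.OutOfMemoryError: CUDA out of memory")

def handle_possible_oom(job: dict, log_path: str, oom_retry: dict) -> None:
    """Steps 1, 3, 5, 6 above: mark failed_oom, wait, requeue, or give up after max attempts."""
    log = pathlib.Path(log_path).read_text(errors="ignore")
    if not OOM_PATTERN.search(log):
        return
    job["status"] = "failed_oom"
    job["oom_attempts"] = job.get("oom_attempts", 0) + 1
    if job["oom_attempts"] >= oom_retry["max_attempts"]:
        job["status"] = "stuck"            # bounded retry exhausted: alert for manual inspection
        return
    time.sleep(oom_retry["delay"])         # give the previous job's memory time to release
    job["status"] = "pending"              # scheduler will pick a free GPU on relaunch
```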

Stale Screen Detection


Every 60s, for each running screen:
  1. Check the screen exists (screen -ls)
  2. Check the python PID is still running (ps -p)
  3. If the screen exists but python exited:
     • If the expected output file exists → mark completed, kill the stale screen
     • If no output file → mark failed_other, kill the screen
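One poll iteration of that check for a single job, as it might run on the remote host. It assumes the screen session is named q_<job_id> and that the job record stores the python PID and the expected output path; all of these names are illustrative.

```python
import os
import subprocess

def check_stale(job: dict) -> None:
    """Reclassify a 'running' job whose python has exited while its screen lingers."""
    name = f"q_{job['id']}"
    screens = subprocess.run(["screen", "-ls"], capture_output=True, text=True).stdout
    if name not in screens:
        return                                    # no screen at all, so nothing stale to clean

    try:
        os.kill(job["pid"], 0)                    # signal 0 is an existence check, like `ps -p`
        return                                    # python still running, leave it alone
    except ProcessLookupError:
        pass                                      # screen alive, python gone: stale

    done = os.path.exists(job["expected_output"])
    job["status"] = "completed" if done else "failed_other"
    subprocess.run(["screen", "-S", name, "-X", "quit"])   # kill the stale screen
```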

Resume-on-restart


If scheduler crashes / is killed:
  1. Read queue_state.json
  2. For each running job: check its screen; if still alive, keep it; if not, re-evaluate its state
  3. For each pending job: continue normally
  4. Idempotent: safe to restart the scheduler without losing state
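A sketch of the restart path. It only reads queue_state.json and re-derives the status of jobs that were running when the scheduler died, which is what makes restarting idempotent; the screen-name and field conventions follow the earlier sketches.

```python
import json
import pathlib
import subprocess

def resume(state_path: str) -> dict:
    """Reload persisted state and re-evaluate jobs that were 'running' at crash time."""
    state = json.loads(pathlib.Path(state_path).read_text())
    screens = subprocess.run(["screen", "-ls"], capture_output=True, text=True).stdout
    for job in state["jobs"]:
        if job["status"] != "running":
            continue                               # pending/completed/stuck carry over as-is
        if f"q_{job['id']}" in screens:
            continue                               # screen still alive: keep treating as running
        out = job.get("expected_output")
        if out and pathlib.Path(out).exists():
            job["status"] = "completed"            # finished while the scheduler was down
        else:
            job["status"] = "pending"              # or failed_other, depending on policy
    return state
```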

Output: Summary Report



Experiment Queue Summary


Project: dllm_distill
Started: 2026-04-16 11:36:29
Completed: 2026-04-16 18:02:14
Total wall-clock: 6h 25m
Jobs: 40 completed, 2 OOM-retried then completed, 0 stuck

Phases


| Phase | Jobs | Success | OOM retries | Duration |
| --- | --- | --- | --- | --- |
| train_teachers | 2 | 2 | 0 | 58m |
| distill_students | 24 | 24 | 2 | 4h 02m |
| multi_seed_validation | 16 | 16 | 0 | 1h 25m |

Results Files


  • 42 JSON files in figures/pcdistill_sw_*.json

Next Steps


  • Run /analyze-results on output JSONs
  • Figures auto-regen via artifact-sync (if configured)

Comparison with /run-experiment


| Feature | /run-experiment | experiment-queue |
| --- | --- | --- |
| Single-shot experiment | ✅ | Overkill |
| Multi-GPU parallel | Basic | Proper scheduling |
| Wave transitions | Manual | Automatic |
| OOM retry | Manual | Automatic |
| Stale screen cleanup | Manual | Automatic |
| Teacher→student chain | Manual | Built-in |
| State persistence | No | Yes (JSON) |
| Resume on crash | No | Yes |
| Grid expansion | Manual | Declarative |

Rule: Use /run-experiment for ≤5 jobs. Use experiment-queue for ≥10 jobs or anything with phases.

Key Rules


  • Never overlap screens on the same GPU — always wait for memory.used < 500 MiB before launching a new job
  • Always write state to disk — every state change is flushed to queue_state.json
  • Idempotent scheduler — safe to restart; picks up from the state file
  • Expected-output-based completion — don't trust screen state alone; verify the output file exists
  • Bounded retry — max N OOM retries, then mark stuck and alert
  • Dependencies enforced at launch — never launch a student before its teacher checkpoint exists

Known Failure Modes


  • SSH connection drop during scheduling: the scheduler keeps running on the remote (nohup); just reconnect and check
  • GPU reservation by another user: the scheduler waits; it does not pre-empt
  • Disk full on remote: the scheduler detects the write failure, marks all pending jobs stuck, and alerts

Example Session


User: "Run all T5+T6 experiments: T5 = N∈{80,192} × n (4 values) × seed {200,201}; T6 = N∈{384,512} × n (4 values) × seed {42,200,201}; T6 needs the teacher trained first"
Claude invokes /experiment-queue:
  1. Parses description into 2-phase manifest
  2. Phase 1: T5 (16 jobs, no teacher dependency) + T6 teacher training (2 jobs)
  3. Phase 2: T6 distillation (24 jobs, depends on teachers)
  4. Deploys scheduler via nohup
  5. Reports: "Scheduler PID 93534, total 42 jobs, estimated 6-7h wall-clock"
Then user can check anytime or wait for summary report.

See Also


  • /run-experiment — single experiment deployment
  • /monitor-experiment — check progress (now reads from queue_state.json)
  • /analyze-results — post-hoc analysis
  • tools/queue_manager.py (bundled) — the scheduler implementation
  • tools/build_manifest.py (bundled) — build a manifest from a grid spec

Rationale / Source


Identified via 2026-04-16 post-mortem analysis (Codex GPT-5.4 xhigh) of a 1.5-day multi-seed paper experiment session:
  • Wall-clock sink: stale screens, OOM, wave transitions, manual parser
  • Token sink: re-writing orchestration code each session
  • Cognitive sink: tracking which cells succeeded, which failed, which to retry
This skill targets the wall-clock sink specifically; see artifact-sync and paper-fix-auto-apply for the other two.