experiment-queue


Experiment Queue


Orchestrate large batches of ML experiments on SSH remote GPU servers with proper state tracking, OOM retry, stale cleanup, and wave transitions.

When to Use This Skill


Use when /run-experiment is insufficient:
  • ≥10 jobs that need batching across GPUs
  • Multi-seed sweeps (e.g., 21 seeds × 12 cells)
  • Wave transitions (run wave 1, wait, run wave 2, wait, run wave 3...)
  • Teacher+student chains (train teacher then distill; auto-trigger student after teacher done)
  • OOM-prone configs where you need to retry with a different GPU or wait
  • Mixed seed grids where failed cells need re-running

Do NOT use for:
  • Single ad-hoc experiments (use /run-experiment)
  • Modal/Vast.ai deployments (those have their own orchestration)
  • Experiments that need manual inspection between runs

Why This Exists


Based on session audit (2026-04-16), the major wall-clock sinks in multi-seed grid experiments are:
  1. Stale screens — python finishes, wandb uploads, screen hangs, next wave blocked
  2. OOM on shared GPU — previous job's memory not yet released
  3. Wave race — new wave launches before previous wave fully settles
  4. Missing checkpoints — student launches before teacher saved
  5. Parser duplication — rewriting multi-seed analysis python every batch
All of these are pure engineering friction that can be orchestrated.

Core Concepts


Job Manifest


A manifest lists jobs with explicit state:
```yaml
project: dllm_distill
cwd: /home/rfyang/rfyang_code/dllm_experiments_torch
conda: dllm
# Optional: override the conda hook path if conda is not at a standard location.
# Can be a bare path (wrapped automatically) or a full eval "$(... shell.bash hook)" string.
# Falls back to auto-detect of ~/anaconda3, ~/miniconda3, /opt/anaconda3, etc.,
# or the ARIS_CONDA_HOOK environment variable.
# conda_hook: /custom/path/to/conda
ssh: SJTUServer5
default_cmd: >
  python run_pc_distill_exp.py --backbone softmax --lam 0.5 --K 500 --L 96 --W 16
  --n_steps 30000 --batch_size 128 --lr 1e-4
preconditions:
  - type: checkpoint_exists
    path: checkpoints/transformer/pcc_softmax_L96_K500_N{N}_wikitext103.pt
gpus: [0, 1, 2, 3, 4, 5, 6, 7]
max_parallel: 8
gpu_free_threshold_mib: 500  # optional, default 500; raise for shared servers, lower for tight packing
oom_retry:
  delay: 120
  max_attempts: 3
jobs:
  - id: s200_N64_n50K
    args: {seed: 200, n_hidden: 64, n_train_subset: 50000, subset_seed: 2024}
  - id: s200_N128_n50K
    args: {seed: 200, n_hidden: 128, n_train_subset: 50000, subset_seed: 2024}
  # ... 14 more
```

Job State Machine


```text
pending → running → completed
                 ↘ failed_oom → pending (after delay) [retry up to N]
                 ↘ failed_other → stuck (needs manual inspection)
stale_screen_detected → cleaned → pending
```
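The same transitions can be written down as a small table plus a guard, which is roughly what the scheduler has to enforce for bounded OOM retry. A minimal sketch in Python: the state names come from the diagram above, while the Job record and advance() helper are illustrative, not the actual queue_manager.py API.

```python
from dataclasses import dataclass

# Legal edges of the job state machine above.
TRANSITIONS = {
    "pending": {"running"},
    "running": {"completed", "failed_oom", "failed_other", "stale_screen_detected"},
    "failed_oom": {"pending", "stuck"},          # retry after delay, or give up
    "failed_other": {"stuck"},                   # needs manual inspection
    "stale_screen_detected": {"cleaned"},
    "cleaned": {"pending"},
}

@dataclass
class Job:
    id: str
    status: str = "pending"
    oom_attempts: int = 0

def advance(job: Job, new_status: str, max_oom_attempts: int = 3) -> None:
    """Apply one transition, rejecting illegal edges and bounding OOM retries."""
    if new_status not in TRANSITIONS.get(job.status, set()):
        raise ValueError(f"illegal transition {job.status} -> {new_status} for {job.id}")
    if new_status == "failed_oom":
        job.oom_attempts += 1
        if job.oom_attempts >= max_oom_attempts:
            new_status = "stuck"                 # bounded retry exhausted
    job.status = new_status
```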

Wave Orchestration


A "wave" is a batch of jobs that fit available GPUs. Next wave only starts when:
  1. All current-wave python processes have exited
  2. No stale screens remain for current-wave tags
  3. GPU memory has dropped below threshold (≤500 MiB)
  4. Precondition checks pass for next-wave jobs
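A sketch of that settle check, assuming each wave's screen sessions carry a common tag in their names and the default 500 MiB threshold; the precondition check (condition 4) is omitted, and the function names are illustrative.

```python
import subprocess

def gpu_memory_used_mib() -> list[int]:
    """Used memory per GPU in MiB, queried via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [int(x) for x in out.split()]

def live_screens(tag: str) -> list[str]:
    """Names of screen sessions whose name contains the wave tag."""
    # `screen -ls` exits non-zero when no sessions exist, so the return code is not checked.
    out = subprocess.run(["screen", "-ls"], capture_output=True, text=True).stdout
    return [line.split()[0] for line in out.splitlines() if tag in line]

def wave_settled(wave_tag: str, threshold_mib: int = 500) -> bool:
    """Conditions 1-3: no wave processes or stale screens left, and GPU memory released."""
    if live_screens(wave_tag):
        return False
    return all(used <= threshold_mib for used in gpu_memory_used_mib())
```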

Workflow


Step 1: Parse Manifest / Build from Grid


Input can be:
  • YAML manifest (explicit job list, recommended for complex cases)
  • Grid spec (Cartesian product of param values, e.g., N=[64,128,256] × n=[50K,150K,500K,652K])
  • Natural language description (Claude parses into manifest)
Save the built manifest to <project>/experiment_queue/<timestamp>/manifest.json for reproducibility.

Step 2: Pre-flight


  • Check SSH connection works
  • Check conda env exists on remote
  • Check cwd exists on remote
  • Check all preconditions (checkpoints, input files)
  • Check GPU availability (at least max_parallel free GPUs)
If any precondition fails, show the user which jobs are blocked and why.
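A local-side sketch of these checks over SSH, using the manifest fields from the YAML example (ssh, conda, cwd, preconditions, jobs). The GPU-availability check is the same memory query shown under Wave Orchestration, so it is left out here; preflight() is an illustrative helper, not the bundled tool's interface.

```python
import subprocess

def ssh_run(host: str, cmd: str) -> subprocess.CompletedProcess:
    """Run a command on the remote host, capturing output and exit code."""
    return subprocess.run(["ssh", host, cmd], capture_output=True, text=True)

def preflight(manifest: dict) -> list[str]:
    """Return reasons why jobs would be blocked; an empty list means all checks passed."""
    host, problems = manifest["ssh"], []
    if ssh_run(host, "true").returncode != 0:
        return [f"cannot reach {host} over SSH"]
    # Assumes conda is on the remote's non-interactive PATH.
    if manifest["conda"] not in ssh_run(host, "conda env list").stdout:
        problems.append(f"conda env '{manifest['conda']}' not found on {host}")
    if ssh_run(host, f"test -d {manifest['cwd']}").returncode != 0:
        problems.append(f"cwd {manifest['cwd']} missing on {host}")
    # Precondition templates (e.g. teacher checkpoints) are resolved from each job's args;
    # placeholder names are assumed to match arg names in the manifest.
    for job in manifest.get("jobs", []):
        for pre in manifest.get("preconditions", []):
            try:
                path = pre["path"].format(**job.get("args", {}))
            except KeyError:
                problems.append(f"{job['id']}: unresolved placeholder in {pre['path']}")
                continue
            if ssh_run(host, f"test -e {manifest['cwd']}/{path}").returncode != 0:
                problems.append(f"{job['id']}: missing precondition {path}")
    return problems
```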

Step 3: Launch Scheduler


Run tools/queue_manager.py (bundled with this skill) as a detached nohup process on the SSH host:

```bash
ssh <server> 'nohup python3 ~/.aris_queue/queue_manager.py \
  --manifest /tmp/manifest.json \
  --state /tmp/queue_state.json \
  --log /tmp/queue.log \
  > /tmp/queue_mgr.log 2>&1 &'
```

The scheduler:
  • Reads the manifest
  • Loops: for each pending job, assign to a free GPU, launch via screen
  • Polls job status (every 60s)
  • Detects stale screens (python exited but screen detached → kill)
  • Detects OOM (CUDA OOM in log → mark failed_oom → retry after delay)
  • Detects completion (expected output JSON/file exists) → mark completed
  • Launches the next wave when the current wave settles
  • Writes state to queue_state.json continuously
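As one concrete piece of the loop above, a launch via screen might look like the sketch below: a detached, named session pinned to one GPU, with output tee'd to a per-job log so the OOM and completion checks have something to read. The command construction is illustrative; queue_manager.py builds its own.

```python
import shlex
import subprocess

def launch_in_screen(host: str, cwd: str, conda_env: str,
                     job_id: str, gpu: int, cmd: str, log_dir: str = "/tmp") -> None:
    """Start one job in a detached screen session named q_<job_id> on the remote host."""
    # Paths and env names are assumed to contain no shell metacharacters.
    inner = (
        f'eval "$(conda shell.bash hook)" && conda activate {conda_env} && '
        f"cd {cwd} && CUDA_VISIBLE_DEVICES={gpu} {cmd} 2>&1 | tee {log_dir}/{job_id}.log"
    )
    remote = f"screen -dmS q_{job_id} bash -lc {shlex.quote(inner)}"
    subprocess.run(["ssh", host, remote], check=True)
```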

Step 4: Monitoring


User can check state anytime:

```bash
ssh <server> cat /tmp/queue_state.json | jq '.jobs | group_by(.status) | map({(.[0].status): length}) | add'
```

Or invoke /monitor-experiment, which reads the state file.
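The same per-status count without jq, as a short local Python one-off; it assumes only the state-file layout implied by the jq expression, i.e. a top-level jobs list whose entries carry a status field.

```python
import json
import subprocess
from collections import Counter

def status_counts(host: str, state_path: str = "/tmp/queue_state.json") -> Counter:
    """Mirror the jq pipeline above, e.g. Counter({'completed': 30, 'running': 8, 'pending': 4})."""
    raw = subprocess.run(["ssh", host, f"cat {state_path}"],
                         capture_output=True, text=True, check=True).stdout
    return Counter(job["status"] for job in json.loads(raw)["jobs"])

if __name__ == "__main__":
    print(status_counts("SJTUServer5"))
```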

Step 5: Post-completion


When all jobs in manifest.json are completed or stuck:
  • Scheduler exits cleanly
  • Write the final summary to <project>/experiment_queue/<timestamp>/summary.md
  • Invoke /analyze-results if analyze_on_complete: true
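A sketch of the drain condition and the summary write; the output location matches the path above, but the helper and the exact fields it emits are illustrative (the real report carries per-phase stats, as in the example below).

```python
import json
import pathlib
from collections import Counter
from datetime import datetime

def finalize(state_path: str, out_dir: str) -> pathlib.Path | None:
    """If every job is completed or stuck, write summary.md and return its path."""
    state = json.loads(pathlib.Path(state_path).read_text())
    counts = Counter(j["status"] for j in state["jobs"])
    if set(counts) - {"completed", "stuck"}:
        return None                               # queue not drained yet; keep scheduling
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / "summary.md"
    path.write_text(
        "Experiment Queue Summary\n\n"
        f"Project: {state.get('project', '?')}\n"
        f"Finished: {datetime.now():%Y-%m-%d %H:%M:%S}\n"
        f"Jobs: {counts.get('completed', 0)} completed, {counts.get('stuck', 0)} stuck\n"
    )
    return path
```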

Grid Spec Syntax


Instead of writing 36 job entries manually:

```yaml
grid:
  N: [64, 128, 256]
  n: [50000, 150000, 500000, 652000]
  seed: [42, 200, 201]
template:
  id: "s${seed}_N${N}_n${n}"
  args: {seed: ${seed}, n_hidden: ${N}, n_train_subset: ${n}}
```

Expands to 36 jobs automatically.
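The expansion itself is just a Cartesian product substituted into the template. A minimal sketch using string.Template, whose ${name} placeholders match the syntax above; the bundled build_manifest.py may differ in details (here, substituted arg values come out as strings).

```python
from itertools import product
from string import Template

def expand_grid(grid: dict, template: dict) -> list[dict]:
    """Expand a grid spec into concrete job entries."""
    keys = list(grid)
    jobs = []
    for combo in product(*(grid[k] for k in keys)):
        ctx = {k: str(v) for k, v in zip(keys, combo)}
        jobs.append({
            "id": Template(template["id"]).substitute(ctx),
            "args": {name: Template(str(spec)).substitute(ctx)
                     for name, spec in template["args"].items()},
        })
    return jobs

grid = {"N": [64, 128, 256], "n": [50000, 150000, 500000, 652000], "seed": [42, 200, 201]}
template = {"id": "s${seed}_N${N}_n${n}",
            "args": {"seed": "${seed}", "n_hidden": "${N}", "n_train_subset": "${n}"}}
assert len(expand_grid(grid, template)) == 36   # 3 × 4 × 3
```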

Wave Chaining


For sequential phases (teacher → student):
```yaml
phases:
  - name: train_teachers
    grid:
      N: [384, 512]
    template:
      cmd: python run_pc_exp.py --direction c --backbone softmax --n_hidden ${N} ...
      output_check: checkpoints/transformer/pcc_softmax_L96_K500_N${N}_wikitext103.pt

  - name: distill_students
    depends_on: train_teachers
    grid:
      N: [384, 512]
      seed: [42, 200, 201]
    template:
      cmd: python run_pc_distill_exp.py --n_hidden ${N} --seed ${seed} ...
      output_check: figures/pcdistill_sw_N${N}_*_seed${seed}.json
```

The scheduler enforces depends_on: distill_students jobs stay pending until all train_teachers jobs are completed.
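A sketch of that launch-time gate, assuming each job record carries the name of its phase and phases maps phase names to their specs; field names mirror the YAML above and launchable() is an illustrative name.

```python
def launchable(job: dict, phases: dict[str, dict], all_jobs: list[dict]) -> bool:
    """A pending job may launch only once every job in its depends_on phase has completed."""
    if job["status"] != "pending":
        return False
    dep = phases[job["phase"]].get("depends_on")
    if dep is None:
        return True
    return all(j["status"] == "completed" for j in all_jobs if j["phase"] == dep)
```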

OOM Handling


Detect OOM from stdout:

```regex
torch\.OutOfMemoryError: CUDA out of memory
```

On detection:
  1. Mark the job failed_oom
  2. Kill the screen
  3. Wait oom_retry.delay seconds
  4. Check if the current GPU is free; if not, try another free GPU
  5. Requeue as pending
  6. After oom_retry.max_attempts attempts, mark the job stuck
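A sketch of the detect-and-requeue step against a per-job log. The pattern is the one above (broaden it if your PyTorch version reports OOM differently); the job and oom_retry fields follow the manifest, and GPU reassignment plus the screen kill are left to the surrounding scheduler.

```python
import pathlib
import re
import time

OOM_PATTERN = re.compile(r"torch\.OutOfMemoryError: CUDA out of memory")

def handle_possible_oom(job: dict, log_path: str, oom_retry: dict) -> None:
    """Steps 1, 3, 5, 6 above: mark failed_oom, wait, requeue, or give up after max attempts."""
    log = pathlib.Path(log_path).read_text(errors="ignore")
    if not OOM_PATTERN.search(log):
        return
    job["status"] = "failed_oom"
    job["oom_attempts"] = job.get("oom_attempts", 0) + 1
    if job["oom_attempts"] >= oom_retry["max_attempts"]:
        job["status"] = "stuck"            # bounded retry exhausted: alert for manual inspection
        return
    time.sleep(oom_retry["delay"])         # give the previous job's memory time to release
    job["status"] = "pending"              # scheduler will pick a free GPU on relaunch
```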

Stale Screen Detection


Every 60s, for each running screen:
  1. Check the screen exists (screen -ls)
  2. Check the python PID is still running (ps -p)
  3. If the screen exists but python exited:
     • If the expected output file exists → mark completed, kill the stale screen
     • If no output file → mark failed_other, kill the screen
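One poll iteration of that check for a single job, as it might run on the remote host. It assumes the screen session is named q_<job_id> and that the job record stores the python PID and the expected output path; all of these names are illustrative.

```python
import os
import subprocess

def check_stale(job: dict) -> None:
    """Reclassify a 'running' job whose python has exited while its screen lingers."""
    name = f"q_{job['id']}"
    screens = subprocess.run(["screen", "-ls"], capture_output=True, text=True).stdout
    if name not in screens:
        return                                    # no screen at all, so nothing stale to clean

    try:
        os.kill(job["pid"], 0)                    # signal 0 is an existence check, like `ps -p`
        return                                    # python still running, leave it alone
    except ProcessLookupError:
        pass                                      # screen alive, python gone: stale

    done = os.path.exists(job["expected_output"])
    job["status"] = "completed" if done else "failed_other"
    subprocess.run(["screen", "-S", name, "-X", "quit"])   # kill the stale screen
```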

Resume-on-restart


If scheduler crashes / is killed:
  1. Read queue_state.json
  2. For each running job: check its screen; if still alive, keep it; if not, re-evaluate its state
  3. For each pending job: continue normally
  4. Idempotent: safe to restart the scheduler without losing state
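A sketch of the restart path. It only reads queue_state.json and re-derives the status of jobs that were running when the scheduler died, which is what makes restarting idempotent; the screen-name and field conventions follow the earlier sketches.

```python
import json
import pathlib
import subprocess

def resume(state_path: str) -> dict:
    """Reload persisted state and re-evaluate jobs that were 'running' at crash time."""
    state = json.loads(pathlib.Path(state_path).read_text())
    screens = subprocess.run(["screen", "-ls"], capture_output=True, text=True).stdout
    for job in state["jobs"]:
        if job["status"] != "running":
            continue                               # pending/completed/stuck carry over as-is
        if f"q_{job['id']}" in screens:
            continue                               # screen still alive: keep treating as running
        out = job.get("expected_output")
        if out and pathlib.Path(out).exists():
            job["status"] = "completed"            # finished while the scheduler was down
        else:
            job["status"] = "pending"              # or failed_other, depending on policy
    return state
```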

Output: Summary Report



Experiment Queue Summary


Project: dllm_distill
Started: 2026-04-16 11:36:29
Completed: 2026-04-16 18:02:14
Total wall-clock: 6h 25m
Jobs: 40 completed, 2 OOM-retried then completed, 0 stuck

Phases


| Phase | Jobs | Success | OOM retries | Duration |
| --- | --- | --- | --- | --- |
| train_teachers | 2 | 2 | 0 | 58m |
| distill_students | 24 | 24 | 2 | 4h 02m |
| multi_seed_validation | 16 | 16 | 0 | 1h 25m |

Results Files


  • 42 JSON files in figures/pcdistill_sw_*.json

Next Steps


  • Run /analyze-results on output JSONs
  • Figures auto-regen via artifact-sync (if configured)

Comparison with /run-experiment


| Feature | /run-experiment | experiment-queue |
| --- | --- | --- |
| Single-shot experiment | ✅ | Overkill |
| Multi-GPU parallel | Basic | Proper scheduling |
| Wave transitions | Manual | Automatic |
| OOM retry | Manual | Automatic |
| Stale screen cleanup | Manual | Automatic |
| Teacher→student chain | Manual | Built-in |
| State persistence | No | Yes (JSON) |
| Resume on crash | No | Yes |
| Grid expansion | Manual | Declarative |

Rule: Use /run-experiment for ≤5 jobs. Use experiment-queue for ≥10 jobs or anything with phases.

Key Rules


  • Never overlap screens on the same GPU — always wait for memory.used < 500 MiB before launching a new job
  • Always write state to disk — every state change is flushed to queue_state.json
  • Idempotent scheduler — safe to restart; picks up from the state file
  • Expected-output-based completion — don't trust screen state alone; verify the output file exists
  • Bounded retry — max N OOM retries, then mark stuck and alert
  • Dependencies enforced at launch — never launch a student before its teacher checkpoint exists

Known Failure Modes


  • SSH connection drop during scheduling: the scheduler keeps running on the remote (nohup); just reconnect and check
  • GPU reservation by another user: the scheduler waits; it does not pre-empt
  • Disk full on remote: the scheduler detects the write failure, marks all pending jobs stuck, and alerts

Example Session


User: "Run all T5+T6 experiments: T5 = N∈{80,192} × n (4 values) × seed {200,201}; T6 = N∈{384,512} × n (4 values) × seed {42,200,201}; T6 needs the teacher trained first"
Claude invokes /experiment-queue:
  1. Parses description into 2-phase manifest
  2. Phase 1: T5 (16 jobs, no teacher dependency) + T6 teacher training (2 jobs)
  3. Phase 2: T6 distillation (24 jobs, depends on teachers)
  4. Deploys scheduler via nohup
  5. Reports: "Scheduler PID 93534, total 42 jobs, estimated 6-7h wall-clock"
Then user can check anytime or wait for summary report.

See Also


  • /run-experiment — single experiment deployment
  • /monitor-experiment — check progress (now reads from queue_state.json)
  • /analyze-results — post-hoc analysis
  • tools/queue_manager.py (bundled) — the scheduler implementation
  • tools/build_manifest.py (bundled) — build a manifest from a grid spec

Rationale / Source


Identified via 2026-04-16 post-mortem analysis (Codex GPT-5.4 xhigh) of a 1.5-day multi-seed paper experiment session:
  • Wall-clock sink: stale screens, OOM, wave transitions, manual parser
  • Token sink: re-writing orchestration code each session
  • Cognitive sink: tracking which cells succeeded, which failed, which to retry
This skill targets the wall-clock sink specifically; see artifact-sync and paper-fix-auto-apply for the other two.