# Experiment Queue (experiment-queue)
Orchestrate large batches of ML experiments on SSH remote GPU servers with proper state tracking, OOM retry, stale cleanup, and wave transitions.
## When to Use This Skill
Use when `/run-experiment` is insufficient:
- ≥10 jobs that need batching across GPUs
- Multi-seed sweeps (e.g., 21 seeds × 12 cells)
- Wave transitions (run wave 1, wait, run wave 2, wait, run wave 3...)
- Teacher+student chains (train teacher then distill; auto-trigger student after teacher done)
- OOM-prone configs where you need to retry with different GPU or wait
- Mixed seed grids where failed cells need re-running
Do NOT use for:
- Single ad-hoc experiment (use `/run-experiment`)
- Modal/Vast.ai deployments (those have their own orchestration)
- Experiments that need manual inspection between runs
## Why This Exists
Based on session audit (2026-04-16), the major wall-clock sinks in multi-seed grid experiments are:
- **Stale screens** — python finishes, wandb uploads, screen hangs, next wave blocked
- **OOM on shared GPU** — previous job's memory not yet released
- **Wave race** — new wave launches before previous wave fully settles
- **Missing checkpoints** — student launches before teacher saved
- **Parser duplication** — rewriting multi-seed analysis python every batch
All of these are pure engineering friction that can be orchestrated.
## Core Concepts
### Job Manifest
A manifest lists jobs with explicit state:

```yaml
project: dllm_distill
cwd: /home/rfyang/rfyang_code/dllm_experiments_torch
conda: dllm
# Optional: override conda hook path if conda is not at a standard location.
# Can be a bare path (wrapped automatically) or a full eval "$(... shell.bash hook)" string.
# Falls back to auto-detect of ~/anaconda3, ~/miniconda3, /opt/anaconda3, etc.,
# or the ARIS_CONDA_HOOK environment variable.
conda_hook: /custom/path/to/conda
ssh: SJTUServer5
default_cmd: >
  python run_pc_distill_exp.py --backbone softmax --lam 0.5
  --K 500 --L 96 --W 16 --n_steps 30000 --batch_size 128 --lr 1e-4
preconditions:
  - type: checkpoint_exists
    path: checkpoints/transformer/pcc_softmax_L96_K500_N{N}_wikitext103.pt
gpus: [0, 1, 2, 3, 4, 5, 6, 7]
max_parallel: 8
gpu_free_threshold_mib: 500  # optional, default 500; raise for shared servers, lower for tight packing
oom_retry:
  delay: 120
  max_attempts: 3
jobs:
  - id: s200_N64_n50K
    args: {seed: 200, n_hidden: 64, n_train_subset: 50000, subset_seed: 2024}
  - id: s200_N128_n50K
    args: {seed: 200, n_hidden: 128, n_train_subset: 50000, subset_seed: 2024}
  # ... 14 more
```
### Job State Machine

```
pending → running → completed
        ↘ failed_oom → pending (after delay) [retry up to N]
        ↘ failed_other → stuck (needs manual inspection)
stale_screen_detected → cleaned → pending
```
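A minimal sketch of how these transitions could be encoded and guarded; the state names come from the diagram, while the `job` dict fields are illustrative rather than the exact `queue_state.json` schema.

```python
# Sketch only: state names follow the diagram above; the job dict fields
# ("status", "id") are assumed, not necessarily the queue_state.json schema.
ALLOWED_TRANSITIONS = {
    "pending": {"running"},
    "running": {"completed", "failed_oom", "failed_other", "stale_screen_detected"},
    "failed_oom": {"pending", "stuck"},   # requeue after delay, or give up
    "failed_other": {"stuck"},            # needs manual inspection
    "stale_screen_detected": {"cleaned"},
    "cleaned": {"pending"},
    "completed": set(),
    "stuck": set(),
}


def transition(job: dict, new_status: str) -> None:
    """Move a job to new_status, rejecting moves the state machine does not allow."""
    old = job["status"]
    if new_status not in ALLOWED_TRANSITIONS[old]:
        raise ValueError(f"illegal transition {old} -> {new_status} for job {job['id']}")
    job["status"] = new_status
```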
### Wave Orchestration
A "wave" is a batch of jobs that fit available GPUs. Next wave only starts when:
- All current-wave python processes have exited
- No stale screens remain for current-wave tags
- GPU memory has dropped below threshold (≤500 MiB)
- Precondition checks pass for next-wave jobs
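A sketch of that settle check, assuming nvidia-smi and screen are available on the host; the `screen_tag` and `precondition_paths` job fields are illustrative, not the skill's exact schema.

```python
import os
import subprocess


def gpu_used_mib(gpu_index: int) -> int:
    """memory.used for one GPU in MiB, via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits",
         "-i", str(gpu_index)],
        capture_output=True, text=True, check=True)
    return int(out.stdout.strip())


def screen_alive(tag: str) -> bool:
    """True if `screen -ls` still lists a session containing this tag."""
    out = subprocess.run(["screen", "-ls"], capture_output=True, text=True)
    return tag in out.stdout


def wave_settled(current_wave, next_wave, gpu_ids, gpu_free_threshold_mib=500) -> bool:
    """Mirror the four conditions above (a gone screen covers both 'python exited'
    and 'no stale screen' in this simplified check)."""
    if any(screen_alive(job["screen_tag"]) for job in current_wave):
        return False
    if any(gpu_used_mib(g) > gpu_free_threshold_mib for g in gpu_ids):
        return False
    return all(os.path.exists(p) for job in next_wave
               for p in job.get("precondition_paths", []))
```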
## Workflow
### Step 1: Parse Manifest / Build from Grid
Input can be:
- YAML manifest (explicit job list, recommended for complex cases)
- Grid spec (Cartesian product of param values, e.g., N=[64,128,256] × n=[50K,150K,500K,652K])
- Natural language description (Claude parses into manifest)

Save the built manifest to `<project>/experiment_queue/<timestamp>/manifest.json` for reproducibility.

### Step 2: Pre-flight
- Check SSH connection works
- Check conda env exists on remote
- Check `cwd` exists on remote
- Check all preconditions (checkpoints, input files)
- Check GPU availability (at least `max_parallel` free GPUs)
If any precondition fails, show user which jobs are blocked and why.
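A condensed sketch of these checks run over SSH; the remote command strings (e.g. `conda env list`) are assumptions about the remote shell, not necessarily what the bundled tooling runs.

```python
import subprocess


def remote(server: str, cmd: str) -> subprocess.CompletedProcess:
    """Run one command on the SSH host and capture the result."""
    return subprocess.run(["ssh", server, cmd], capture_output=True, text=True)


def preflight(server: str, manifest: dict) -> list:
    """Return human-readable blockers; an empty list means ready to launch."""
    blockers = []
    if remote(server, "true").returncode != 0:
        blockers.append("SSH connection failed")
    if manifest["conda"] not in remote(server, "conda env list").stdout:
        blockers.append(f"conda env {manifest['conda']!r} not found on remote")
    if remote(server, f"test -d {manifest['cwd']}").returncode != 0:
        blockers.append(f"cwd {manifest['cwd']} missing on remote")
    for pre in manifest.get("preconditions", []):
        if remote(server, f"test -e {pre['path']}").returncode != 0:
            blockers.append(f"missing precondition: {pre['path']}")
    # GPU availability (at least max_parallel free GPUs) can be checked the same
    # way via nvidia-smi; see the free-GPU helper sketched later in this doc.
    return blockers
```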
### Step 3: Launch Scheduler
Run `tools/queue_manager.py` (bundled with this skill) as a detached `nohup` process on the SSH host:

```bash
ssh <server> 'nohup python3 ~/.aris_queue/queue_manager.py \
  --manifest /tmp/manifest.json \
  --state /tmp/queue_state.json \
  --log /tmp/queue.log \
  > /tmp/queue_mgr.log 2>&1 &'
```

The scheduler:
- Reads manifest
- Loops: for each pending job, assign to free GPU, launch via `screen`
- Polls job status (every 60s)
- Detects stale screens (python exited but screen detached → kill)
- Detects OOM (CUDA OOM in log → mark failed_oom → retry after delay)
- Detects completion (expected output JSON/file exists) → mark completed
- Launches next wave when current wave settles
- Writes state to `queue_state.json` continuously
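The bundled `tools/queue_manager.py` is the actual implementation; the loop below is only a compressed sketch of the behaviour in that list, with illustrative job fields (`cmd`, `output_check`) and a deliberately minimal `poll_job`.

```python
import json
import os
import subprocess
import time

POLL_INTERVAL = 60  # seconds, matching the polling cadence above


def run(cmd: str) -> subprocess.CompletedProcess:
    """The scheduler runs on the GPU host itself, so commands are local."""
    return subprocess.run(cmd, shell=True, capture_output=True, text=True)


def free_gpus(gpu_ids, threshold_mib=500):
    """GPUs whose memory.used (per nvidia-smi) is below the free threshold."""
    out = run("nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits")
    used = dict(tuple(int(x) for x in line.split(","))
                for line in out.stdout.strip().splitlines())
    return [g for g in gpu_ids if used.get(g, 1 << 30) < threshold_mib]


def launch(job, gpu, manifest):
    """Start one job in a detached screen, pinned to a single GPU."""
    inner = (f"cd {manifest['cwd']} && conda activate {manifest['conda']} && "
             f"CUDA_VISIBLE_DEVICES={gpu} {job['cmd']} > logs/{job['id']}.log 2>&1")
    run(f"screen -dmS q_{job['id']} bash -lc '{inner}'")
    job.update(status="running", gpu=gpu, screen=f"q_{job['id']}")


def poll_job(job):
    """Minimal poll: when the screen is gone, decide by the expected output file
    (fuller OOM and stale-screen handling is sketched in the sections below)."""
    if job["screen"] in run("screen -ls").stdout:
        return
    job["status"] = "completed" if os.path.exists(job["output_check"]) else "failed_other"


def scheduler_loop(manifest, jobs, state_path):
    """Poll running jobs, launch pending ones onto free GPUs, persist state, repeat."""
    while not all(j["status"] in ("completed", "stuck") for j in jobs):
        for j in jobs:
            if j["status"] == "running":
                poll_job(j)
        gpus = free_gpus(manifest["gpus"], manifest.get("gpu_free_threshold_mib", 500))
        running = sum(j["status"] == "running" for j in jobs)
        for j in [j for j in jobs if j["status"] == "pending"]:
            if not gpus or running >= manifest["max_parallel"]:
                break
            launch(j, gpus.pop(), manifest)
            running += 1
        with open(state_path, "w") as f:  # every state change flushed to disk
            json.dump({"jobs": jobs}, f, indent=2)
        time.sleep(POLL_INTERVAL)
```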
### Step 4: Monitoring
User can check state anytime:

```bash
ssh <server> cat /tmp/queue_state.json | jq '.jobs | group_by(.status) | map({(.[0].status): length}) | add'
```

Or invoke `/monitor-experiment`, which reads the state file.

### Step 5: Post-completion
When all jobs in `manifest.json` are `completed` or `stuck`:
- Scheduler exits cleanly
- Write final summary to `<project>/experiment_queue/<timestamp>/summary.md`
- Invoke `/analyze-results` if `analyze_on_complete: true`
## Grid Spec Syntax
Instead of writing 36 job entries manually:

```yaml
grid:
  N: [64, 128, 256]
  n: [50000, 150000, 500000, 652000]
  seed: [42, 200, 201]
template:
  id: "s${seed}_N${N}_n${n}"
  args: {seed: ${seed}, n_hidden: ${N}, n_train_subset: ${n}}
```

Expands to 36 jobs automatically.
## Wave Chaining
For sequential phases (teacher → student):

```yaml
phases:
  - name: train_teachers
    grid:
      N: [384, 512]
    template:
      cmd: python run_pc_exp.py --direction c --backbone softmax --n_hidden ${N} ...
      output_check: checkpoints/transformer/pcc_softmax_L96_K500_N${N}_wikitext103.pt
  - name: distill_students
    depends_on: train_teachers
    grid:
      N: [384, 512]
      seed: [42, 200, 201]
    template:
      cmd: python run_pc_distill_exp.py --n_hidden ${N} --seed ${seed} ...
      output_check: figures/pcdistill_sw_N${N}_*_seed${seed}.json
```

Scheduler enforces `depends_on`: `distill_students` jobs stay `pending` until all `train_teachers` jobs are `completed`.
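A sketch of that gate; the `phase` field on each job is illustrative bookkeeping, not necessarily how the bundled scheduler stores it.

```python
def phase_ready(phase: dict, all_jobs: list) -> bool:
    """A phase may launch only once every job of the phase it depends on is completed."""
    dep = phase.get("depends_on")
    if dep is None:
        return True
    dep_jobs = [j for j in all_jobs if j.get("phase") == dep]
    return bool(dep_jobs) and all(j["status"] == "completed" for j in dep_jobs)
```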
## OOM Handling
Detect OOM from stdout:

```regex
torch\.OutOfMemoryError: CUDA out of memory
```

On detection:
- Mark job `failed_oom`
- Kill screen
- Wait `oom_retry.delay` seconds
- Check if current GPU is free; if not, try another free GPU
- Requeue as `pending`
- Max `oom_retry.max_attempts` before marking `stuck`
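A sketch of that policy (the regex is the one above; the `job` fields and the blocking `time.sleep` are simplifications of what a real scheduler would do asynchronously):

```python
import re
import time

OOM_RE = re.compile(r"torch\.OutOfMemoryError: CUDA out of memory")


def handle_possible_oom(job: dict, log_text: str, oom_retry: dict) -> None:
    """Mark, back off, requeue on a (possibly different) free GPU, or give up."""
    if not OOM_RE.search(log_text):
        return
    job["status"] = "failed_oom"
    job["oom_attempts"] = job.get("oom_attempts", 0) + 1
    # kill the screen here, e.g. `screen -S <name> -X quit` on the host
    if job["oom_attempts"] >= oom_retry.get("max_attempts", 3):
        job["status"] = "stuck"              # bounded retry: stop and alert
        return
    time.sleep(oom_retry.get("delay", 120))  # let the previous job's memory drain
    job["status"] = "pending"                # scheduler reassigns a free GPU on relaunch
```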
## Stale Screen Detection
Every 60s, for each running screen:
- Check screen exists (`screen -ls`)
- Check python PID still running (`ps -p`)
- If screen exists but python exited:
  - If expected output file exists → mark `completed`, kill stale screen
  - If no output file → mark `failed_other`, kill screen
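A sketch of that reconciliation; `screen`, `pid`, and `output_check` are illustrative job fields.

```python
import os
import subprocess


def check_stale_screen(job: dict) -> None:
    """Reconcile screen existence, python liveness, and the expected output file."""
    screens = subprocess.run(["screen", "-ls"], capture_output=True, text=True).stdout
    screen_up = job["screen"] in screens
    python_up = subprocess.run(["ps", "-p", str(job["pid"])],
                               capture_output=True).returncode == 0
    if screen_up and not python_up:
        # Stale screen: decide completed vs failed_other from the output file, then kill it.
        job["status"] = "completed" if os.path.exists(job["output_check"]) else "failed_other"
        subprocess.run(["screen", "-S", job["screen"], "-X", "quit"])
```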
## Resume-on-restart
If scheduler crashes / is killed:
- Read `queue_state.json`
- For each `running` job: check screen; if still alive, keep; if not, re-evaluate state
- For each `pending` job: continue normally
- Idempotent: safe to restart scheduler without losing state
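A sketch of the resume path, using the same illustrative job fields as above; requeuing an output-less job as `pending` is one reasonable policy, not necessarily the bundled scheduler's.

```python
import json
import os
import subprocess


def resume(state_path: str) -> list:
    """Rebuild in-memory job state from queue_state.json after a scheduler restart."""
    with open(state_path) as f:
        jobs = json.load(f)["jobs"]
    screens = subprocess.run(["screen", "-ls"], capture_output=True, text=True).stdout
    for job in jobs:
        if job["status"] != "running":
            continue                          # pending / completed / stuck carry over as-is
        if job.get("screen") and job["screen"] in screens:
            continue                          # still alive: keep as running
        # Screen died while the scheduler was down: re-evaluate from the expected output.
        job["status"] = "completed" if os.path.exists(job["output_check"]) else "pending"
    return jobs
```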
## Output: Summary Report
```markdown
# Experiment Queue Summary

Project: dllm_distill
Started: 2026-04-16 11:36:29
Completed: 2026-04-16 18:02:14
Total wall-clock: 6h 25m
Jobs: 40 completed, 2 OOM-retried then completed, 0 stuck

## Phases

| Phase | Jobs | Success | OOM retries | Duration |
|---|---|---|---|---|
| train_teachers | 2 | 2 | 0 | 58m |
| distill_students | 24 | 24 | 2 | 4h 02m |
| multi_seed_validation | 16 | 16 | 0 | 1h 25m |

## Results Files

- 42 JSON files in figures/pcdistill_sw_*.json

## Next Steps

- Run /analyze-results on output JSONs
- Figures auto-regen via artifact-sync (if configured)
```
## Comparison with `/run-experiment`
| Feature | `/run-experiment` | `experiment-queue` |
|---|---|---|
| Single-shot experiment | ✅ | ✅ (overkill) |
| Multi-GPU parallel | Basic | Proper scheduling |
| Wave transitions | Manual | Automatic |
| OOM retry | Manual | Automatic |
| Stale screen cleanup | Manual | Automatic |
| Teacher→student chain | Manual | Built-in |
| State persistence | No | Yes (JSON) |
| Resume on crash | No | Yes |
| Grid expansion | Manual | Declarative |

Rule: Use `/run-experiment` for ≤5 jobs. Use `experiment-queue` for ≥10 jobs or anything with phases.

## Key Rules
- Never overlap screens on the same GPU — always wait for `memory.used < 500 MiB` before launching a new job
- Always write state to disk — every state change flushed to `queue_state.json`
- Idempotent scheduler — safe to restart; picks up from state file
- Expected-output-based completion — don't trust screen state alone; verify output file exists
- Bounded retry — max N OOM retries, then mark `stuck` and alert
- Dependencies enforced at launch — never launch student before teacher checkpoint exists
## Known Failure Modes
- SSH connection drop during scheduling: scheduler keeps running on remote (nohup), just reconnect and check
- GPU reservation by another user: scheduler waits, does not pre-empt
- Disk full on remote: scheduler detects write failure, marks all pending jobs `stuck`, alerts
## Example Session
User: "跑 T5+T6 全部实验:T5 = N∈{80,192} × n 4 values × seed {200,201}, T6 = N∈{384,512} × n 4 values × seed {42,200,201}; T6 需要先 train teacher"
Claude invokes :
/experiment-queue- Parses description into 2-phase manifest
- Phase 1: T5 (16 jobs, no teacher dependency) + T6 teacher training (2 jobs)
- Phase 2: T6 distillation (24 jobs, depends on teachers)
- Deploys scheduler via nohup
- Reports: "Scheduler PID 93534, total 42 jobs, estimated 6-7h wall-clock"
Then user can check anytime or wait for summary report.
用户: "跑 T5+T6 全部实验:T5 = N∈{80,192} × n 4个值 × seed {200,201}, T6 = N∈{384,512} × n 4个值 × seed {42,200,201}; T6 需要先 train teacher"
Claude调用:
/experiment-queue- 将描述解析为两阶段清单
- 阶段1: T5(16个作业,无教师依赖)+ T6教师训练(2个作业)
- 阶段2: T6蒸馏(24个作业,依赖教师模型)
- 通过nohup部署调度器
- 报告: "调度器PID 93534,共42个作业,预计耗时6-7小时"
之后用户可随时检查状态,或等待总结报告。
## See Also
- `/run-experiment` — single experiment deployment
- `/monitor-experiment` — check progress (now reads from queue_state.json)
- `/analyze-results` — post-hoc analysis
- `tools/queue_manager.py` (bundled) — the scheduler implementation
- `tools/build_manifest.py` (bundled) — build manifest from grid spec
## Rationale / Source
Identified via 2026-04-16 post-mortem analysis (Codex GPT-5.4 xhigh) of a 1.5-day
multi-seed paper experiment session:
- Wall-clock sink: stale screens, OOM, wave transitions, manual parser
- Token sink: re-writing orchestration code each session
- Cognitive sink: tracking which cells succeeded, which failed, which to retry
This skill targets the wall-clock sink specifically; see `artifact-sync` and `paper-fix-auto-apply` for the other two.