training-check
Training Check
Periodically read WandB metrics during training to catch problems early. Do not wait until training finishes to discover it was a waste of GPU time.
Context: $ARGUMENTS
Constants
- WANDB_ENTITY and WANDB_PROJECT: read from CLAUDE.md or passed as argument (format: `entity/project/run_id`)
- CHECK_INTERVAL: starts at 10 minutes, then gradually increases if consistently healthy: 10 min → 20 min → 30 min → 60 min (cap)
- REVIEWER_MODEL = `gpt-5.4` — used via Codex MCP for ambiguous cases only
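The CHECK_INTERVAL schedule can be expressed as a small helper; a minimal sketch, assuming the caller only tracks whether the last check was healthy (names are illustrative, not part of the skill):

```python
# Check-interval schedule: 10 -> 20 -> 30 -> 60 minutes (cap),
# reset to 10 minutes after any anomaly.
INTERVALS_MIN = [10, 20, 30, 60]

def next_interval(current_min: int, healthy: bool) -> int:
    """Return the next check interval in minutes."""
    if not healthy:
        return INTERVALS_MIN[0]  # anomaly: reset to the shortest interval
    if current_min in INTERVALS_MIN:
        idx = INTERVALS_MIN.index(current_min)
        return INTERVALS_MIN[min(idx + 1, len(INTERVALS_MIN) - 1)]
    return INTERVALS_MIN[0]      # unknown value: start over conservatively
```

For example, `next_interval(10, healthy=True)` returns 20, while any anomaly sends the schedule back to 10 minutes.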
When to Use
- After training is confirmed running (session alive, loss decreasing for first few steps)
- Set up via CronCreate to fire periodically during training
- This skill checks training QUALITY, not process HEALTH. Process health (session alive, GPU utilization) is watchdog.py's job.
Workflow
Step 1: Read WandB Metrics
```python
import wandb

# Pull the run's logged metrics via the WandB public API
api = wandb.Api()
run = api.run("<entity>/<project>/<run_id>")
history = run.history()
```

If WandB is unreachable (API error, network issue), fall back to reading the log file directly via SSH:

```bash
ssh server "tail -100 /path/to/training.log"
```

Check these signals (a sketch of the checks follows this list):
- Loss trend: Is training loss decreasing over the last N steps?
- Eval metrics: Are evaluation metrics improving (or at least not degrading)?
- NaN / Inf: Any NaN or Inf values in loss or gradients?
- Spikes: Sudden large jumps in loss (>10x normal variance)?
- Learning rate: Is the schedule behaving as expected?
- Gradient norm: Exploding or vanishing?
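A minimal sketch of these checks against the `history` DataFrame above; the `train/loss` column name, the 50-step window, and reading ">10x normal variance" as 10 standard deviations are assumptions about what the run logs, not part of the skill:

```python
import numpy as np
import pandas as pd

def check_signals(history: pd.DataFrame, loss_col: str = "train/loss", window: int = 50) -> dict:
    """Compute basic health signals from a wandb history DataFrame.

    Caveat: history uses NaN for steps where a metric simply wasn't logged,
    so definitive NaN-loss detection may require the raw training log.
    """
    loss = history[loss_col].dropna()
    recent = loss.tail(window)
    previous = loss.tail(2 * window).head(window)

    return {
        # Inf anywhere in the logged loss values
        "has_inf": bool(np.isinf(loss.to_numpy()).any()),
        # Loss trend: mean of the last window vs. the window before it
        "loss_decreasing": bool(recent.mean() < previous.mean()) if len(loss) >= 2 * window else None,
        # Spikes: any recent point far above the recent mean (~10 local stds)
        "has_spike": bool((recent > recent.mean() + 10 * recent.std()).any()) if len(recent) > 1 else False,
    }
```

Divergence ("increasing for >N steps"), learning-rate, gradient-norm, and eval checks follow the same pattern against whichever columns the run actually logs.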
Step 2: Judgment
| Signal | Judgment | Action |
|---|---|---|
| NaN/Inf in loss | Clearly bad | Stop training, investigate |
| Loss diverging (increasing for >N steps) | Clearly bad | Stop training, investigate |
| Eval metrics significantly worse than baseline | Clearly bad | Stop training, investigate |
| Loss decreasing, metrics improving | Clearly fine | Continue, increase check interval |
| Loss flat but not diverging | Unsure | → Step 3 (Codex judgment) |
| Metrics noisy, can't tell trend | Unsure | → Step 3 (Codex judgment) |
| Slightly worse than baseline but still early | Unsure | → Step 3 (Codex judgment) |
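The table above collapses to a three-way decision. A hedged sketch; the thresholds behind "diverging", "significantly worse", and N are left to the operator:

```python
def judge(has_nan_inf: bool, loss_diverging: bool, eval_much_worse: bool,
          loss_decreasing: bool, evals_improving: bool) -> str:
    """Map the Step 2 table onto STOP, CONTINUE, or UNSURE (escalate to Codex)."""
    if has_nan_inf or loss_diverging or eval_much_worse:
        return "STOP"          # clearly bad: stop training and investigate
    if loss_decreasing and evals_improving:
        return "CONTINUE"      # clearly fine: continue, lengthen the check interval
    return "UNSURE"            # flat loss, noisy metrics, or still early: Step 3
```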
Step 3: Codex Judgment (only when unsure)
Only escalate to Codex when the signal is ambiguous. For clearly good or clearly bad signals, act directly.
```yaml
mcp__codex__codex:
  config: {"model_reasoning_effort": "high"}
  prompt: |
    TRAINING HEALTH CHECK — need your judgment on ambiguous metrics.
    Run: <entity>/<project>/<run_id>
    Current epoch/step: X / Y total
    Training loss (last 10 checkpoints): [values]
    Eval metrics (last 3 evals): [values]
    Baseline reference: [numbers from paper/reproduction]
    What I'm unsure about: [specific concern]
    Please respond with exactly one of:
    - STOP: clearly problematic, should kill training
    - CONTINUE: looks fine, check again next interval
    - WAIT: not enough data to judge, check again sooner
```
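A sketch of filling the prompt template from the WandB history before making the MCP call; the loss/eval column names and rounding are assumptions, and the baseline string stays a placeholder:

```python
def build_codex_prompt(run_path: str, history, step: int, total_steps: int,
                       concern: str, loss_col: str = "train/loss",
                       eval_col: str = "eval/accuracy",
                       baseline: str = "[numbers from paper/reproduction]") -> str:
    """Fill the Step 3 prompt with the most recent logged metrics."""
    losses = history[loss_col].dropna().tail(10).round(4).tolist()
    evals = history[eval_col].dropna().tail(3).round(4).tolist()
    return (
        "TRAINING HEALTH CHECK — need your judgment on ambiguous metrics.\n"
        f"Run: {run_path}\n"
        f"Current epoch/step: {step} / {total_steps} total\n"
        f"Training loss (last 10 checkpoints): {losses}\n"
        f"Eval metrics (last 3 evals): {evals}\n"
        f"Baseline reference: {baseline}\n"
        f"What I'm unsure about: {concern}\n"
        "Please respond with exactly one of:\n"
        "- STOP: clearly problematic, should kill training\n"
        "- CONTINUE: looks fine, check again next interval\n"
        "- WAIT: not enough data to judge, check again sooner"
    )
```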
Step 4: Act
| Decision | Action |
|---|---|
| Stop | Kill the training session. Save the WandB run URL, key metrics, and reason for stopping. Log to project notes for debugging. |
| Continue | Do nothing. Will be invoked again at next interval (increase interval if consistently healthy). |
| Wait | Do nothing but keep the current short interval (don't increase). |
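A hedged sketch of acting on the decision, reusing `next_interval` from the Constants sketch; the tmux session name, notes file, and local kill mechanism are assumptions about the training setup (a remote run would wrap the kill in ssh):

```python
import datetime
import subprocess

def act(decision: str, run_url: str, key_metrics: dict, reason: str,
        interval_min: int, session: str = "train",
        notes_path: str = "PROJECT_NOTES.md") -> int:
    """Apply the Step 4 table; return the check interval to use next."""
    if decision == "STOP":
        # Kill the training session (assumes training runs in a local tmux session).
        subprocess.run(["tmux", "kill-session", "-t", session], check=False)
        # Save the WandB URL, key metrics, and reason as evidence for debugging.
        with open(notes_path, "a") as f:
            f.write(f"\n## Training stopped {datetime.datetime.now():%Y-%m-%d %H:%M}\n"
                    f"Run: {run_url}\nMetrics: {key_metrics}\nReason: {reason}\n")
        return interval_min
    if decision == "CONTINUE":
        return next_interval(interval_min, healthy=True)  # lengthen if consistently healthy
    return interval_min  # WAIT: keep the current short interval
```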
Integration with Watchdog
Training-check and watchdog.py operate at different levels:
| Layer | Tool | What it checks | Frequency |
|---|---|---|---|
| Process health | watchdog.py | Session alive? GPU active? | Every 60s (continuous) |
| Training quality | training-check | Loss trend? Metrics improving? | Every 10-60 min (periodic) |
Use both together:
- Watchdog catches crashes and idle GPUs immediately
- Training-check catches subtle quality issues (loss plateau, metric degradation)
Rules
- Do not stop training on first sign of noise — some loss spikes are normal. Look at trends over multiple checkpoints.
- When stopping training, always save the WandB run URL and key metrics as evidence.
- If both WandB and log files are unreachable, report the connectivity issue and try again next interval. Do not assume training is broken.
- Gradually increase check interval when healthy (10 → 20 → 30 → 60 min). Reset to 10 min after any anomaly.
- This skill is meant to be automated via CronCreate — do not ask the user whether to set it up. Just set it.
CronCreate Setup Example
After training is confirmed stable:
```
CronCreate (recurring, every 10 minutes initially):
  "Run /training-check for wandb run <entity>/<project>/<run_id>"
```

As the check interval increases, delete the old CronCreate job and create a new one with the longer interval.