Training Check

Periodically read WandB metrics during training to catch problems early. Do not wait until training finishes to discover it was a waste of GPU time.

Context: $ARGUMENTS

Constants

  • WANDB_ENTITY and WANDB_PROJECT: read from CLAUDE.md or passed as argument (format: entity/project/run_id)
  • CHECK_INTERVAL: starts at 10 minutes, then gradually increases if consistently healthy: 10 min → 20 min → 30 min → 60 min (cap)
  • REVIEWER_MODEL = gpt-5.4, used via Codex MCP for ambiguous cases only

When to Use

  • After training is confirmed running (session alive, loss decreasing for the first few steps)
  • Set up via CronCreate to fire periodically during training
  • This skill checks training QUALITY, not process HEALTH. Process health (session alive, GPU utilization) is watchdog.py's job.

Workflow

Step 1: Read WandB Metrics

```python
import wandb
api = wandb.Api()
run = api.run("<entity>/<project>/<run_id>")
history = run.history()
```
If WandB is unreachable (API error, network issue), fall back to reading the log file directly via SSH:
```bash
ssh server "tail -100 /path/to/training.log"
```
Check these signals:
  • Loss trend: Is training loss decreasing over the last N steps?
  • Eval metrics: Are evaluation metrics improving (or at least not degrading)?
  • NaN / Inf: Any NaN or Inf values in loss or gradients?
  • Spikes: Sudden large jumps in loss (>10x normal variance)?
  • Learning rate: Is the schedule behaving as expected?
  • Gradient norm: Exploding or vanishing?
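
A minimal sketch of how these signals could be computed from the `history` DataFrame above. The column names (`train/loss`, `grad_norm`), the window size, and the spike cutoff are assumptions; substitute whatever keys and thresholds the training script actually uses.

```python
import numpy as np

def summarize_signals(history, loss_key="train/loss", grad_key="grad_norm", window=50):
    """Compute rough health signals from a WandB history DataFrame.

    Column names and thresholds are assumptions; adjust to the run's logged keys.
    """
    loss = history[loss_key].dropna().to_numpy()
    recent = loss[-window:]

    signals = {
        # NaN or Inf anywhere in the loss curve is an immediate red flag.
        "has_nan_inf": not bool(np.isfinite(loss).all()),
        # Negative slope over the recent window means loss is still decreasing.
        "loss_slope": (
            float(np.polyfit(np.arange(len(recent)), recent, 1)[0])
            if len(recent) > 1 else 0.0
        ),
        # A point far above the recent mean (here >10 standard deviations)
        # is treated as a spike; a rough stand-in for ">10x normal variance".
        "has_spike": (
            bool((recent > recent.mean() + 10 * recent.std()).any())
            if len(recent) > 1 else False
        ),
    }
    if grad_key in history.columns:
        grads = history[grad_key].dropna().to_numpy()
        # Very large or near-zero norms hint at exploding/vanishing gradients.
        signals["last_grad_norm"] = float(grads[-1]) if len(grads) else None
    return signals

signals = summarize_signals(history)
```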

Step 2: Judgment

| Signal | Judgment | Action |
|---|---|---|
| NaN/Inf in loss | Clearly bad | Stop training, investigate |
| Loss diverging (increasing for >N steps) | Clearly bad | Stop training, investigate |
| Eval metrics significantly worse than baseline | Clearly bad | Stop training, investigate |
| Loss decreasing, metrics improving | Clearly fine | Continue, increase check interval |
| Loss flat but not diverging | Unsure | → Step 3 (Codex judgment) |
| Metrics noisy, can't tell trend | Unsure | → Step 3 (Codex judgment) |
| Slightly worse than baseline but still early | Unsure | → Step 3 (Codex judgment) |
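
For illustration, the table could be encoded roughly as below, using the signals dictionary from the Step 1 sketch. The tolerances (`rise_tol`, the 10% baseline gap) are assumptions, not values fixed by this skill.

```python
def judge(signals, eval_delta=None, rise_tol=1e-3):
    """Map the Step 1 signals onto the decision table above.

    `eval_delta` is the latest eval metric minus baseline (higher is better).
    Thresholds are illustrative assumptions.
    """
    if signals["has_nan_inf"]:
        return "STOP"        # NaN/Inf in loss: clearly bad
    if signals["loss_slope"] > rise_tol:
        return "STOP"        # loss clearly rising over the window: diverging
    if eval_delta is not None and eval_delta < -0.10:
        return "STOP"        # eval metrics significantly worse than baseline
    if signals["loss_slope"] < -rise_tol and not signals["has_spike"]:
        return "CONTINUE"    # loss decreasing, no spikes: clearly fine
    return "ESCALATE"        # flat, noisy, or borderline: go to Step 3 (Codex)
```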

Step 3: Codex Judgment (only when unsure)

Only escalate to Codex when the signal is ambiguous. For clearly good or clearly bad signals, act directly.
```yaml
mcp__codex__codex:
  config: {"model_reasoning_effort": "high"}
  prompt: |
    TRAINING HEALTH CHECK — need your judgment on ambiguous metrics.

    Run: <entity>/<project>/<run_id>
    Current epoch/step: X / Y total
    Training loss (last 10 checkpoints): [values]
    Eval metrics (last 3 evals): [values]
    Baseline reference: [numbers from paper/reproduction]

    What I'm unsure about: [specific concern]

    Please respond with exactly one of:
    - STOP: clearly problematic, should kill training
    - CONTINUE: looks fine, check again next interval
    - WAIT: not enough data to judge, check again sooner
```

Step 4: Act

| Decision | Action |
|---|---|
| Stop | Kill the training session. Save the WandB run URL, key metrics, and reason for stopping. Log to project notes for debugging. |
| Continue | Do nothing. Will be invoked again at next interval (increase interval if consistently healthy). |
| Wait | Do nothing but keep the current short interval (don't increase). |
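
A sketch of how the Stop evidence could be recorded. The notes path and record format are assumptions, not a fixed convention of this skill; the run URL and metrics would come from the WandB API object and the Step 1 signals.

```python
import datetime
import json
import pathlib

def log_stop_evidence(run_url, metrics, reason, notes_path="notes/training_failures.jsonl"):
    """Append the evidence required by a Stop decision to project notes.

    `notes_path` is a placeholder; point it at wherever the project keeps notes.
    """
    record = {
        "time": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "run_url": run_url,      # WandB run URL
        "metrics": metrics,      # key metrics captured at the time of stopping
        "reason": reason,
    }
    path = pathlib.Path(notes_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a") as f:
        f.write(json.dumps(record, default=str) + "\n")
```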

Integration with Watchdog

Training-check and watchdog.py operate at different levels:
| Layer | Tool | What it checks | Frequency |
|---|---|---|---|
| Process health | watchdog.py | Session alive? GPU active? | Every 60s (continuous) |
| Training quality | training-check | Loss trend? Metrics improving? | Every 10-60 min (periodic) |
Use both together:
  • Watchdog catches crashes and idle GPUs immediately
  • Training-check catches subtle quality issues (loss plateau, metric degradation)

Rules

  • Do not stop training at the first sign of noise; some loss spikes are normal. Look at trends over multiple checkpoints.
  • When stopping training, always save the WandB run URL and key metrics as evidence.
  • If both WandB and log files are unreachable, report the connectivity issue and try again next interval. Do not assume training is broken.
  • Gradually increase the check interval when healthy (10 → 20 → 30 → 60 min). Reset to 10 min after any anomaly (see the sketch after this list).
  • This skill is meant to be automated via CronCreate — do not ask the user whether to set it up. Just set it.
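
A minimal sketch of the interval schedule described above. How the current interval is tracked between invocations is left to the caller; the function name is illustrative.

```python
# Escalation schedule from the Constants section: 10 -> 20 -> 30 -> 60 minutes.
SCHEDULE = [10, 20, 30, 60]

def next_interval(current, healthy):
    """Return the next check interval in minutes.

    Healthy checks walk up the schedule toward the 60-minute cap;
    any anomaly resets straight back to 10 minutes.
    """
    if not healthy:
        return SCHEDULE[0]
    idx = SCHEDULE.index(current) if current in SCHEDULE else 0
    return SCHEDULE[min(idx + 1, len(SCHEDULE) - 1)]
```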

CronCreate Setup Example

After training is confirmed stable:
  CronCreate (recurring, every 10 minutes initially):
    "Run /training-check for wandb run <entity>/<project>/<run_id>"
As the check interval increases, delete the old CronCreate job and create a new one with the longer interval.