Training Check

Periodically read WandB metrics during training to catch problems early. Do not wait until training finishes to discover it was a waste of GPU time.

Context: $ARGUMENTS

Constants

  • WANDB_ENTITY and WANDB_PROJECT: read from CLAUDE.md or passed as argument (format: entity/project/run_id)
  • CHECK_INTERVAL: starts at 10 minutes, then gradually increases if consistently healthy: 10 min → 20 min → 30 min → 60 min (cap)
  • REVIEWER_MODEL = gpt-5.4, used via Codex MCP for ambiguous cases only

When to Use

  • After training is confirmed running (session alive, loss decreasing for the first few steps)
  • Set up via CronCreate to fire periodically during training
  • This skill checks training QUALITY, not process HEALTH. Process health (session alive, GPU utilization) is watchdog.py's job.

Workflow

Step 1: Read WandB Metrics

```python
import wandb
api = wandb.Api()
run = api.run("<entity>/<project>/<run_id>")
history = run.history()
```
If WandB is unreachable (API error, network issue), fall back to reading the log file directly via SSH:
```bash
ssh server "tail -100 /path/to/training.log"
```
Check these signals:
  • Loss trend: Is training loss decreasing over the last N steps?
  • Eval metrics: Are evaluation metrics improving (or at least not degrading)?
  • NaN / Inf: Any NaN or Inf values in loss or gradients?
  • Spikes: Sudden large jumps in loss (>10x normal variance)?
  • Learning rate: Is the schedule behaving as expected?
  • Gradient norm: Exploding or vanishing?
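
A minimal sketch of how these signals could be computed from the `history` DataFrame above. The column names (`train/loss`, `grad_norm`), the window size, and the spike cutoff are assumptions; substitute whatever keys and thresholds the training script actually uses.

```python
import numpy as np

def summarize_signals(history, loss_key="train/loss", grad_key="grad_norm", window=50):
    """Compute rough health signals from a WandB history DataFrame.

    Column names and thresholds are assumptions; adjust to the run's logged keys.
    """
    loss = history[loss_key].dropna().to_numpy()
    recent = loss[-window:]

    signals = {
        # NaN or Inf anywhere in the loss curve is an immediate red flag.
        "has_nan_inf": not bool(np.isfinite(loss).all()),
        # Negative slope over the recent window means loss is still decreasing.
        "loss_slope": (
            float(np.polyfit(np.arange(len(recent)), recent, 1)[0])
            if len(recent) > 1 else 0.0
        ),
        # A point far above the recent mean (here >10 standard deviations)
        # is treated as a spike; a rough stand-in for ">10x normal variance".
        "has_spike": (
            bool((recent > recent.mean() + 10 * recent.std()).any())
            if len(recent) > 1 else False
        ),
    }
    if grad_key in history.columns:
        grads = history[grad_key].dropna().to_numpy()
        # Very large or near-zero norms hint at exploding/vanishing gradients.
        signals["last_grad_norm"] = float(grads[-1]) if len(grads) else None
    return signals

signals = summarize_signals(history)
```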

Step 2: Judgment

| Signal | Judgment | Action |
|---|---|---|
| NaN/Inf in loss | Clearly bad | Stop training, investigate |
| Loss diverging (increasing for >N steps) | Clearly bad | Stop training, investigate |
| Eval metrics significantly worse than baseline | Clearly bad | Stop training, investigate |
| Loss decreasing, metrics improving | Clearly fine | Continue, increase check interval |
| Loss flat but not diverging | Unsure | → Step 3 (Codex judgment) |
| Metrics noisy, can't tell trend | Unsure | → Step 3 (Codex judgment) |
| Slightly worse than baseline but still early | Unsure | → Step 3 (Codex judgment) |
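
For illustration, the table could be encoded roughly as below, using the signals dictionary from the Step 1 sketch. The tolerances (`rise_tol`, the 10% baseline gap) are assumptions, not values fixed by this skill.

```python
def judge(signals, eval_delta=None, rise_tol=1e-3):
    """Map the Step 1 signals onto the decision table above.

    `eval_delta` is the latest eval metric minus baseline (higher is better).
    Thresholds are illustrative assumptions.
    """
    if signals["has_nan_inf"]:
        return "STOP"        # NaN/Inf in loss: clearly bad
    if signals["loss_slope"] > rise_tol:
        return "STOP"        # loss clearly rising over the window: diverging
    if eval_delta is not None and eval_delta < -0.10:
        return "STOP"        # eval metrics significantly worse than baseline
    if signals["loss_slope"] < -rise_tol and not signals["has_spike"]:
        return "CONTINUE"    # loss decreasing, no spikes: clearly fine
    return "ESCALATE"        # flat, noisy, or borderline: go to Step 3 (Codex)
```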

Step 3: Codex Judgment (only when unsure)

Only escalate to Codex when the signal is ambiguous. For clearly good or clearly bad signals, act directly.
```yaml
mcp__codex__codex:
  config: {"model_reasoning_effort": "high"}
  prompt: |
    TRAINING HEALTH CHECK — need your judgment on ambiguous metrics.

    Run: <entity>/<project>/<run_id>
    Current epoch/step: X / Y total
    Training loss (last 10 checkpoints): [values]
    Eval metrics (last 3 evals): [values]
    Baseline reference: [numbers from paper/reproduction]

    What I'm unsure about: [specific concern]

    Please respond with exactly one of:
    - STOP: clearly problematic, should kill training
    - CONTINUE: looks fine, check again next interval
    - WAIT: not enough data to judge, check again sooner
```

Step 4: Act

| Decision | Action |
|---|---|
| Stop | Kill the training session. Save the WandB run URL, key metrics, and reason for stopping. Log to project notes for debugging. |
| Continue | Do nothing. Will be invoked again at next interval (increase interval if consistently healthy). |
| Wait | Do nothing but keep the current short interval (don't increase). |
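
A sketch of how the Stop evidence could be recorded. The notes path and record format are assumptions, not a fixed convention of this skill; the run URL and metrics would come from the WandB API object and the Step 1 signals.

```python
import datetime
import json
import pathlib

def log_stop_evidence(run_url, metrics, reason, notes_path="notes/training_failures.jsonl"):
    """Append the evidence required by a Stop decision to project notes.

    `notes_path` is a placeholder; point it at wherever the project keeps notes.
    """
    record = {
        "time": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "run_url": run_url,      # WandB run URL
        "metrics": metrics,      # key metrics captured at the time of stopping
        "reason": reason,
    }
    path = pathlib.Path(notes_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a") as f:
        f.write(json.dumps(record, default=str) + "\n")
```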

Integration with Watchdog

Training-check and watchdog.py operate at different levels:
| Layer | Tool | What it checks | Frequency |
|---|---|---|---|
| Process health | watchdog.py | Session alive? GPU active? | Every 60s (continuous) |
| Training quality | training-check | Loss trend? Metrics improving? | Every 10-60 min (periodic) |
Use both together:
  • Watchdog catches crashes and idle GPUs immediately
  • Training-check catches subtle quality issues (loss plateau, metric degradation)

Rules

  • Do not stop training at the first sign of noise; some loss spikes are normal. Look at trends over multiple checkpoints.
  • When stopping training, always save the WandB run URL and key metrics as evidence.
  • If both WandB and log files are unreachable, report the connectivity issue and try again next interval. Do not assume training is broken.
  • Gradually increase the check interval when healthy (10 → 20 → 30 → 60 min). Reset to 10 min after any anomaly (see the sketch after this list).
  • This skill is meant to be automated via CronCreate — do not ask the user whether to set it up. Just set it.
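
A minimal sketch of the interval schedule described above. How the current interval is tracked between invocations is left to the caller; the function name is illustrative.

```python
# Escalation schedule from the Constants section: 10 -> 20 -> 30 -> 60 minutes.
SCHEDULE = [10, 20, 30, 60]

def next_interval(current, healthy):
    """Return the next check interval in minutes.

    Healthy checks walk up the schedule toward the 60-minute cap;
    any anomaly resets straight back to 10 minutes.
    """
    if not healthy:
        return SCHEDULE[0]
    idx = SCHEDULE.index(current) if current in SCHEDULE else 0
    return SCHEDULE[min(idx + 1, len(SCHEDULE) - 1)]
```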

CronCreate Setup Example

After training is confirmed stable:
  CronCreate (recurring, every 10 minutes initially):
    "Run /training-check for wandb run <entity>/<project>/<run_id>"
As the check interval increases, delete the old CronCreate job and create a new one with the longer interval.