huggingface-trackio
Trackio - Experiment Tracking for ML Training
Trackio is an experiment tracking library for logging and visualizing ML training metrics. It syncs to Hugging Face Spaces for real-time monitoring dashboards.
Three Interfaces
| Task | Interface | Reference |
|---|---|---|
| Logging metrics during training | Python API | references/logging_metrics.md |
| Firing alerts for training diagnostics | Python API | references/alerts.md |
| Retrieving metrics & alerts after/during training | CLI | references/retrieving_metrics.md |
When to Use Each
Python API → Logging
Use `import trackio` in your training scripts to log metrics:
- Initialize tracking with `trackio.init()`
- Log metrics with `trackio.log()` or use TRL's `report_to="trackio"`
- Finalize with `trackio.finish()`

Key concept: For remote/cloud training, pass `space_id` so that metrics sync to a Space dashboard and persist after the instance terminates.

→ See references/logging_metrics.md for setup, TRL integration, and configuration options.
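
The init/log/finish sequence can be wrapped so that the finalize step still runs when training fails partway. The sketch below is only an illustration: `log_run` and `RecordingTracker` are hypothetical names, not part of Trackio, and the only assumption about the library is that the `trackio` module exposes the three calls listed above.

```python
def log_run(tracker, project, metric_stream, **init_kwargs):
    """Init a run, log every metrics dict, and always call finish()."""
    tracker.init(project=project, **init_kwargs)
    logged = 0
    try:
        for metrics in metric_stream:
            tracker.log(metrics)
            logged += 1
    finally:
        # Runs even if the training loop raises, so the run is still finalized.
        tracker.finish()
    return logged


class RecordingTracker:
    """Stand-in with the same three calls, so the sketch runs without trackio."""

    def __init__(self):
        self.calls = []

    def init(self, **kwargs):
        self.calls.append(("init", kwargs))

    def log(self, metrics):
        self.calls.append(("log", metrics))

    def finish(self):
        self.calls.append(("finish", None))


# Demo against the stand-in; the keyword arguments mirror the init() call in this document.
demo = RecordingTracker()
log_run(demo, "my-project", [{"loss": 0.1}, {"loss": 0.09}],
        space_id="username/trackio")
```

With the real library, the same helper would presumably be called as `log_run(trackio, "my-project", stream, space_id="username/trackio")` after `import trackio`.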
Python API → Alerts
Insert `trackio.alert()` calls in training code to flag important events, much like print statements for debugging, but structured and queryable:
- `trackio.alert(title="...", level=trackio.AlertLevel.WARN)`: fire an alert
- Three severity levels: `INFO`, `WARN`, `ERROR`
- Alerts are printed to terminal, stored in the database, shown in the dashboard, and optionally sent to webhooks (Slack/Discord)
Key concept for LLM agents: Alerts are the primary mechanism for autonomous experiment iteration. An agent should insert alerts into training code for diagnostic conditions (loss spikes, NaN gradients, low accuracy, training stalls). Since alerts are printed to the terminal, an agent that is watching the training script's output will see them automatically. For background or detached runs, the agent can poll via CLI instead.
→ See references/alerts.md for the full alerts API, webhook setup, and autonomous agent workflows.
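
One possible way to structure the diagnostic conditions mentioned above is to keep the detection logic separate from the alert call. The sketch below is an assumption-laden illustration: `LossMonitor`, its thresholds, and the mapping of conditions to severity levels are hypothetical choices, not Trackio features. It only decides what to report; firing the alert is left to the caller.

```python
import math
from collections import deque


class LossMonitor:
    """Flags non-finite losses, spikes against a running mean, and flat losses."""

    def __init__(self, window=20, spike_factor=3.0, stall_tol=1e-4):
        self.history = deque(maxlen=window)
        self.spike_factor = spike_factor
        self.stall_tol = stall_tol

    def check(self, step, loss):
        """Return a list of (title, text, level_name) tuples for this step."""
        if not math.isfinite(loss):
            # NaN/inf is reported immediately and kept out of the history.
            return [("Non-finite loss", f"loss={loss} at step {step}", "ERROR")]

        findings = []
        if len(self.history) == self.history.maxlen:
            mean = sum(self.history) / len(self.history)
            if mean > 0 and loss > self.spike_factor * mean:
                findings.append((
                    "Loss spike",
                    f"loss {loss:.4f} vs running mean {mean:.4f} at step {step}",
                    "WARN",
                ))

        self.history.append(loss)
        if len(self.history) == self.history.maxlen:
            if max(self.history) - min(self.history) < self.stall_tol:
                findings.append((
                    "Training stall",
                    f"loss flat over last {len(self.history)} steps at step {step}",
                    "INFO",
                ))
        return findings


# In a training loop (requires trackio; shown as a comment to keep this runnable):
#   for title, text, level in monitor.check(step, loss):
#       trackio.alert(title=title, text=text,
#                     level=getattr(trackio.AlertLevel, level))
```

Keeping the checks in a plain class means the thresholds can be exercised with synthetic loss values before a real run is launched.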
CLI → Retrieving
Use the `trackio` command to query logged metrics and alerts:
- `trackio list projects/runs/metrics`: discover what's available
- `trackio get project/run/metric`: retrieve summaries and values
- `trackio list alerts --project <name> --json`: retrieve alerts
- `trackio show`: launch the dashboard
- `trackio sync`: sync to HF Space

Key concept: Add `--json` for programmatic output suitable for automation and LLM agents.

→ See references/retrieving_metrics.md for all commands, workflows, and JSON output formats.
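
For automation, the `--json` output can be consumed from Python by shelling out to the CLI. The helper below is a minimal sketch: `trackio_json` is a hypothetical name, and it makes no assumption about the shape of the JSON beyond it being valid JSON on stdout (the actual formats are described in references/retrieving_metrics.md).

```python
import json
import subprocess


def trackio_json(args, runner=subprocess.run):
    """Run `trackio <args> --json` and return the parsed JSON payload."""
    cmd = ["trackio", *args, "--json"]
    # check=True raises CalledProcessError if the CLI exits with an error.
    result = runner(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


# Example calls (need the trackio CLI on PATH, so they are left as comments):
#   trackio_json(["list", "projects"])
#   trackio_json(["get", "metric", "--project", "my-project",
#                 "--run", "my-run", "--metric", "loss"])
```

The `runner` parameter exists only so the helper can be tested without the CLI installed; in normal use the default `subprocess.run` is enough.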
Minimal Logging Setup
```python
import trackio

trackio.init(project="my-project", space_id="username/trackio")
trackio.log({"loss": 0.1, "accuracy": 0.9})
trackio.log({"loss": 0.09, "accuracy": 0.91})
trackio.finish()
```

Minimal Retrieval
```bash
trackio list projects --json
trackio get metric --project my-project --run my-run --metric loss --json
```

Autonomous ML Experiment Workflow
When running experiments autonomously as an LLM agent, the recommended workflow is:
- Set up training with alerts: insert `trackio.alert()` calls for diagnostic conditions
- Launch training: run the script in the background
- Poll for alerts: use `trackio list alerts --project <name> --json --since <timestamp>` to check for new alerts
- Read metrics: use `trackio get metric ...` to inspect specific values
- Iterate: based on alerts and metrics, stop the run, adjust hyperparameters, and launch a new run
```python
import trackio

trackio.init(project="my-project", config={"lr": 1e-4})

# num_steps and train_step() come from your own training script.
for step in range(num_steps):
    loss = train_step()
    trackio.log({"loss": loss, "step": step})

    # Loss still high well into training: treat as divergence.
    if step > 100 and loss > 5.0:
        trackio.alert(
            title="Loss divergence",
            text=f"Loss {loss:.4f} still high after {step} steps",
            level=trackio.AlertLevel.ERROR,
        )

    # Loss effectively zero: flag for inspection.
    if step > 0 and abs(loss) < 1e-8:
        trackio.alert(
            title="Vanishing loss",
            text="Loss near zero — possible gradient collapse",
            level=trackio.AlertLevel.WARN,
        )

trackio.finish()
```

Then poll from a separate terminal/process:
```bash
trackio list alerts --project my-project --json --since "2025-01-01T00:00:00"
```
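
The polling step of this workflow could be scripted roughly as follows. This is a sketch under stated assumptions: `poll_alerts` and `utc_stamp` are hypothetical helpers, the timestamp format copies the `--since` example above, and whether the CLI interprets that timestamp as UTC or local time is not confirmed here. The JSON payload is returned unparsed beyond `json.loads`, since its schema is documented elsewhere.

```python
import json
import subprocess
import time
from datetime import datetime, timezone


def utc_stamp():
    """Current time formatted like the --since example (UTC is an assumption)."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")


def poll_alerts(project, polls, interval_s=30.0, runner=subprocess.run,
                sleep=time.sleep, clock=utc_stamp):
    """Yield one parsed JSON payload per poll, moving --since forward each time."""
    since = clock()
    for _ in range(polls):
        sleep(interval_s)
        # Taken before the query so a late alert is seen twice rather than missed.
        asked_at = clock()
        cmd = ["trackio", "list", "alerts", "--project", project,
               "--json", "--since", since]
        result = runner(cmd, capture_output=True, text=True, check=True)
        yield json.loads(result.stdout)
        since = asked_at


# Typical use (requires the trackio CLI and a running experiment):
#   for payload in poll_alerts("my-project", polls=10):
#       ...  # inspect payload, then decide whether to stop and relaunch
```

Because the next `--since` value is captured before each query, consecutive windows can overlap slightly, so an agent consuming the payloads may want to deduplicate alerts it has already handled.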