huggingface-trackio

Trackio - Experiment Tracking for ML Training

Trackio is an experiment tracking library for logging and visualizing ML training metrics. It syncs to Hugging Face Spaces for real-time monitoring dashboards.

Three Interfaces

| Task | Interface | Reference |
|------|-----------|-----------|
| Logging metrics during training | Python API | references/logging_metrics.md |
| Firing alerts for training diagnostics | Python API | references/alerts.md |
| Retrieving metrics & alerts after/during training | CLI | references/retrieving_metrics.md |

When to Use Each

Python API → Logging

Use `import trackio` in your training scripts to log metrics:

- Initialize tracking with `trackio.init()`
- Log metrics with `trackio.log()` or use TRL's `report_to="trackio"`
- Finalize with `trackio.finish()`

Key concept: For remote/cloud training, pass `space_id` so that metrics sync to a Space dashboard and persist after the instance terminates.
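As a rough sketch of the remote-training case, the snippet below builds the keyword arguments for `trackio.init()` and adds `space_id` only when a Space has been configured. The `TRAINING_SPACE_ID` environment variable and the `build_init_kwargs` helper are hypothetical names chosen for illustration; they are not part of Trackio.

```python
import os
from typing import Dict, Mapping, Optional


def build_init_kwargs(
    project: str,
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return kwargs for trackio.init(), adding space_id when one is configured.

    TRAINING_SPACE_ID is a hypothetical variable name used only in this sketch.
    """
    env = os.environ if env is None else env
    kwargs = {"project": project}
    space_id = env.get("TRAINING_SPACE_ID", "").strip()
    if space_id:
        # Remote/cloud run: sync metrics to a Space so they outlive the instance.
        kwargs["space_id"] = space_id
    return kwargs


# In a training script (requires trackio to be installed):
#   import trackio
#   trackio.init(**build_init_kwargs("my-project"))
```

Keeping the decision in one helper means the same script can run locally without a Space and remotely with one, without code changes.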

→ See references/logging_metrics.md for setup, TRL integration, and configuration options.

Python API → Alerts

Insert `trackio.alert()` calls in training code to flag important events. They work like print statements added for debugging, but are structured and queryable:

- `trackio.alert(title="...", level=trackio.AlertLevel.WARN)` fires an alert
- Three severity levels: `INFO`, `WARN`, `ERROR`
- Alerts are printed to the terminal, stored in the database, shown in the dashboard, and optionally sent to webhooks (Slack/Discord)

Key concept for LLM agents: Alerts are the primary mechanism for autonomous experiment iteration. An agent should insert alerts into training code for diagnostic conditions (loss spikes, NaN gradients, low accuracy, training stalls). Since alerts are printed to the terminal, an agent that is watching the training script's output will see them automatically. For background or detached runs, the agent can poll via CLI instead.
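One way to keep diagnostic conditions out of the training loop body is to collect them in a small helper that decides which alerts to fire. The sketch below covers two of the conditions named above (non-finite loss and loss spikes). `LossMonitor`, its thresholds, and the `AlertSpec` tuple are hypothetical names for illustration, not part of Trackio; only `trackio.alert()` and `trackio.AlertLevel` come from the library.

```python
import math
from collections import deque
from typing import Deque, List, Tuple

# (title, text, level_name); level_name is one of "INFO", "WARN", "ERROR".
AlertSpec = Tuple[str, str, str]


class LossMonitor:
    """Hypothetical helper that turns raw loss values into alert specs."""

    def __init__(self, window: int = 50, spike_factor: float = 3.0) -> None:
        self.history: Deque[float] = deque(maxlen=window)
        self.spike_factor = spike_factor

    def check(self, step: int, loss: float) -> List[AlertSpec]:
        alerts: List[AlertSpec] = []
        if math.isnan(loss) or math.isinf(loss):
            alerts.append(("Non-finite loss", f"loss={loss} at step {step}", "ERROR"))
            return alerts  # keep NaN/inf out of the running history
        # Only compare against the recent mean once the window is full.
        if len(self.history) == self.history.maxlen:
            mean = sum(self.history) / len(self.history)
            if mean > 0 and loss > self.spike_factor * mean:
                alerts.append(
                    ("Loss spike", f"loss={loss:.4f} vs recent mean {mean:.4f} at step {step}", "WARN")
                )
        self.history.append(loss)
        return alerts


# In a training loop (requires trackio to be installed):
#   for title, text, level in monitor.check(step, loss):
#       trackio.alert(title=title, text=text, level=getattr(trackio.AlertLevel, level))
```

Returning plain tuples rather than calling `trackio.alert()` directly keeps the thresholds easy to unit-test without a running tracker.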

→ See references/alerts.md for the full alerts API, webhook setup, and autonomous agent workflows.

CLI → Retrieving

Use the `trackio` command to query logged metrics and alerts:

- `trackio list projects/runs/metrics`: discover what's available
- `trackio get project/run/metric`: retrieve summaries and values
- `trackio list alerts --project <name> --json`: retrieve alerts
- `trackio show`: launch the dashboard
- `trackio sync`: sync to HF Space

Key concept: Add `--json` for programmatic output suitable for automation and LLM agents.
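For automation, the CLI can be wrapped in a few lines of Python. The sketch below is a minimal wrapper, assuming that the `trackio` executable is on `PATH` and that a command run with `--json` writes a single JSON document to stdout. `build_trackio_command` and `run_trackio_json` are hypothetical helper names, and the shape of the parsed result is deliberately left unspecified (see references/retrieving_metrics.md for the actual formats).

```python
import json
import subprocess
from typing import Any, List, Sequence


def build_trackio_command(args: Sequence[str]) -> List[str]:
    """Prefix args with the trackio executable and append --json if absent."""
    cmd = ["trackio", *args]
    if "--json" not in cmd:
        cmd.append("--json")
    return cmd


def run_trackio_json(args: Sequence[str]) -> Any:
    """Run a trackio CLI command and parse its stdout as JSON.

    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    result = subprocess.run(
        build_trackio_command(args),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


# Example (requires the trackio CLI to be installed):
#   projects = run_trackio_json(["list", "projects"])
```

Passing the command as a list rather than a shell string avoids quoting problems with project or run names.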

→ See references/retrieving_metrics.md for all commands, workflows, and JSON output formats.

Minimal Logging Setup

```python
import trackio

trackio.init(project="my-project", space_id="username/trackio")
trackio.log({"loss": 0.1, "accuracy": 0.9})
trackio.log({"loss": 0.09, "accuracy": 0.91})
trackio.finish()
```

Minimal Retrieval

```bash
trackio list projects --json
trackio get metric --project my-project --run my-run --metric loss --json
```

Autonomous ML Experiment Workflow

When running experiments autonomously as an LLM agent, the recommended workflow is:

1. Set up training with alerts: insert `trackio.alert()` calls for diagnostic conditions
2. Launch training: run the script in the background
3. Poll for alerts: use `trackio list alerts --project <name> --json --since <timestamp>` to check for new alerts
4. Read metrics: use `trackio get metric ...` to inspect specific values
5. Iterate: based on alerts and metrics, stop the run, adjust hyperparameters, and launch a new run

```python
import trackio

# num_steps and train_step() are placeholders supplied by your own training code.
trackio.init(project="my-project", config={"lr": 1e-4})

for step in range(num_steps):
    loss = train_step()
    trackio.log({"loss": loss, "step": step})

    # Loss is still high well past warm-up: treat as divergence.
    if step > 100 and loss > 5.0:
        trackio.alert(
            title="Loss divergence",
            text=f"Loss {loss:.4f} still high after {step} steps",
            level=trackio.AlertLevel.ERROR,
        )
    # Loss has collapsed to (almost) exactly zero: usually a bug, not success.
    if step > 0 and abs(loss) < 1e-8:
        trackio.alert(
            title="Vanishing loss",
            text="Loss near zero, possible gradient collapse",
            level=trackio.AlertLevel.WARN,
        )

trackio.finish()
```

Then poll from a separate terminal/process:

```bash
trackio list alerts --project my-project --json --since "2025-01-01T00:00:00"
```
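The polling step can be sketched as a small generator that advances the `--since` timestamp after every poll. This is an illustration under stated assumptions: `poll_alerts` and its parameters are hypothetical, `fetch` is a caller-supplied function that runs the `trackio list alerts ... --since <since>` command and returns its parsed output, and the timestamp is a naive local-time ISO string matching the format in the example above (how the CLI interprets time zones is not covered here).

```python
import time
from datetime import datetime
from typing import Any, Callable, Iterator, Optional, Tuple


def poll_alerts(
    fetch: Callable[[str], Any],
    start: str,
    interval_s: float = 30.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
    now: Callable[[], datetime] = datetime.now,
) -> Iterator[Tuple[str, Any]]:
    """Yield (since, payload) pairs, advancing `since` after every poll.

    fetch(since) is injected so the loop can be tested without the CLI.
    """
    since = start
    polls = 0
    while max_polls is None or polls < max_polls:
        # Record the time before fetching so alerts fired during the call
        # are picked up by the next poll rather than skipped.
        next_since = now().isoformat(timespec="seconds")
        payload = fetch(since)
        yield since, payload
        since = next_since
        polls += 1
        if max_polls is None or polls < max_polls:
            sleep(interval_s)
```

Because the timestamp is taken before each fetch, an alert fired at the boundary may be returned twice, so a caller that acts on alerts should deduplicate them.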
