huggingface-trackio

Trackio - Experiment Tracking for ML Training

Trackio is an experiment tracking library for logging and visualizing ML training metrics. It syncs to Hugging Face Spaces for real-time monitoring dashboards.

Three Interfaces

| Task | Interface | Reference |
|------|-----------|-----------|
| Logging metrics during training | Python API | references/logging_metrics.md |
| Firing alerts for training diagnostics | Python API | references/alerts.md |
| Retrieving metrics & alerts after/during training | CLI | references/retrieving_metrics.md |

When to Use Each

Python API → Logging

Use `import trackio` in your training scripts to log metrics:
  • Initialize tracking with `trackio.init()`
  • Log metrics with `trackio.log()` or use TRL's `report_to="trackio"`
  • Finalize with `trackio.finish()`
Key concept: For remote/cloud training, pass `space_id` so that metrics sync to a Space dashboard and persist after the instance terminates.
→ See references/logging_metrics.md for setup, TRL integration, and configuration options.

Python API → Alerts

Insert `trackio.alert()` calls in training code to flag important events. They work like print statements inserted for debugging, but are structured and queryable:
  • `trackio.alert(title="...", level=trackio.AlertLevel.WARN)`: fire an alert
  • Three severity levels: `INFO`, `WARN`, `ERROR`
  • Alerts are printed to the terminal, stored in the database, shown in the dashboard, and optionally sent to webhooks (Slack/Discord)
Key concept for LLM agents: Alerts are the primary mechanism for autonomous experiment iteration. An agent should insert alerts into training code for diagnostic conditions (loss spikes, NaN gradients, low accuracy, training stalls). Because alerts are printed to the terminal, an agent that is watching the training script's output will see them automatically. For background or detached runs, the agent can poll via the CLI instead.
→ See references/alerts.md for the full alerts API, webhook setup, and autonomous agent workflows.
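The diagnostic conditions mentioned above (loss spikes, non-finite values) can be kept separate from the alerting call itself, which makes them easy to test without a training run. The following is a minimal sketch under stated assumptions: `check_loss`, `spike_factor`, and `warmup` are hypothetical names chosen for illustration and are not part of Trackio, and the thresholds are arbitrary defaults.

```python
import math
from collections import deque


def check_loss(step, loss, recent, spike_factor=3.0, warmup=10):
    """Return a list of (title, text, level_name) findings for one step.

    `recent` is a deque of previous finite losses. Hypothetical helper,
    not part of Trackio; thresholds are illustrative only.
    """
    findings = []
    if math.isnan(loss) or math.isinf(loss):
        findings.append(("Non-finite loss", f"loss={loss} at step {step}", "ERROR"))
        return findings  # do not pollute the history with NaN/inf
    if len(recent) >= warmup:
        baseline = sum(recent) / len(recent)
        if baseline > 0 and loss > spike_factor * baseline:
            findings.append((
                "Loss spike",
                f"loss {loss:.4f} exceeds {spike_factor}x recent mean {baseline:.4f} at step {step}",
                "WARN",
            ))
    recent.append(loss)
    return findings


# Stand-in loss sequence so the sketch runs without a training loop.
recent = deque(maxlen=20)
for step, loss in enumerate([1.0] * 10 + [5.0, float("nan")]):
    for title, text, level in check_loss(step, loss, recent):
        # In a real script this line would instead be:
        # trackio.alert(title=title, text=text, level=getattr(trackio.AlertLevel, level))
        print(f"[{level}] {title}: {text}")
```

Returning plain tuples rather than calling `trackio.alert()` directly keeps the conditions unit-testable; the commented line shows where the real alert call would go.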

CLI → Retrieving

Use the `trackio` command to query logged metrics and alerts:
  • `trackio list projects/runs/metrics`: discover what's available
  • `trackio get project/run/metric`: retrieve summaries and values
  • `trackio list alerts --project <name> --json`: retrieve alerts
  • `trackio show`: launch the dashboard
  • `trackio sync`: sync to HF Space
Key concept: Add `--json` for programmatic output suitable for automation and LLM agents.
→ See references/retrieving_metrics.md for all commands, workflows, and JSON output formats.
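For automation, the `--json` output can be consumed from Python by shelling out to the CLI. The sketch below assumes only the subcommands and flags listed above; `build_command` and `run_json` are hypothetical helper names, and nothing is assumed about the JSON schema beyond it being valid JSON on stdout.

```python
import json
import subprocess


def build_command(*parts, **flags):
    """Build a `trackio ... --json` argument list (hypothetical helper)."""
    cmd = ["trackio", *parts]
    for name, value in flags.items():
        cmd += [f"--{name}", str(value)]
    cmd.append("--json")
    return cmd


def run_json(cmd, runner=subprocess.run):
    """Run the command and parse stdout as JSON.

    `runner` is injectable so the function can be exercised without the CLI.
    """
    result = runner(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


# Building the command does not require trackio to be installed.
print(build_command("get", "metric", project="my-project", run="my-run", metric="loss"))

# With the CLI available, a call might look like:
# alerts = run_json(build_command("list", "alerts", project="my-project"))
```

Passing the command as a list avoids shell quoting issues, and `check=True` surfaces a non-zero exit code as an exception instead of silently parsing empty output.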

Minimal Logging Setup

```python
import trackio

trackio.init(project="my-project", space_id="username/trackio")
trackio.log({"loss": 0.1, "accuracy": 0.9})
trackio.log({"loss": 0.09, "accuracy": 0.91})
trackio.finish()
```

Minimal Retrieval

```bash
trackio list projects --json
trackio get metric --project my-project --run my-run --metric loss --json
```

Autonomous ML Experiment Workflow

When running experiments autonomously as an LLM agent, the recommended workflow is:
  1. Set up training with alerts: insert `trackio.alert()` calls for diagnostic conditions
  2. Launch training: run the script in the background
  3. Poll for alerts: use `trackio list alerts --project <name> --json --since <timestamp>` to check for new alerts
  4. Read metrics: use `trackio get metric ...` to inspect specific values
  5. Iterate: based on alerts and metrics, stop the run, adjust hyperparameters, and launch a new run
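```python
import random

import trackio

num_steps = 1000  # placeholder: total number of training steps


def train_step():
    """Placeholder for one optimization step; replace with real training code."""
    return random.random()


trackio.init(project="my-project", config={"lr": 1e-4})

for step in range(num_steps):
    loss = train_step()
    trackio.log({"loss": loss, "step": step})

    # Loss is still large well past the start of training: likely divergence.
    if step > 100 and loss > 5.0:
        trackio.alert(
            title="Loss divergence",
            text=f"Loss {loss:.4f} still high after {step} steps",
            level=trackio.AlertLevel.ERROR,
        )
    # Loss is effectively zero: suspicious rather than good news.
    if step > 0 and abs(loss) < 1e-8:
        trackio.alert(
            title="Vanishing loss",
            text="Loss near zero: possible gradient collapse",
            level=trackio.AlertLevel.WARN,
        )

trackio.finish()
```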
Then poll from a separate terminal/process:
```bash
trackio list alerts --project my-project --json --since "2025-01-01T00:00:00"
```
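Steps 3 to 5 of the workflow above can be sketched as a polling loop that advances the `--since` cursor as alerts arrive. This is a sketch under loud assumptions: `poll_alerts` and `fetch` are hypothetical names, and the alert field names `timestamp`, `title`, and `level` are assumed for illustration only; check references/retrieving_metrics.md for the actual JSON output format. It also assumes timestamps are ISO 8601 strings in a uniform format, so that string comparison matches chronological order.

```python
import time


def poll_alerts(fetch, since, interval=30.0, max_polls=None, sleep=time.sleep):
    """Yield each new alert once, moving the `since` cursor forward.

    `fetch(since)` is a hypothetical callable returning a list of alert dicts,
    e.g. a wrapper around:
        trackio list alerts --project <name> --json --since <since>
    The keys "timestamp" and "title" are ASSUMED, not confirmed by the source.
    """
    seen = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        for alert in fetch(since):
            key = (alert.get("timestamp"), alert.get("title"))
            if key in seen:
                continue  # already reported in an earlier poll
            seen.add(key)
            ts = alert.get("timestamp")
            if ts and ts > since:
                since = ts  # next poll only asks for newer alerts
            yield alert
        polls += 1
        if max_polls is None or polls < max_polls:
            sleep(interval)


# Stand-in fetch function so the sketch runs without the CLI.
def fake_fetch(since):
    return [{"timestamp": "2025-01-01T00:00:05", "title": "Loss divergence", "level": "ERROR"}]


for alert in poll_alerts(fake_fetch, "2025-01-01T00:00:00", max_polls=2, sleep=lambda s: None):
    if alert.get("level") == "ERROR":
        print("stop the run and adjust hyperparameters:", alert["title"])
```

Deduplicating on a `(timestamp, title)` key guards against the same alert being returned twice if `--since` turns out to be inclusive, which the source does not specify.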