ai-tracking-experiments


Track Which Optimization Experiment Was Best


Guide the user through logging, comparing, and managing optimization experiments. The pattern: run experiments systematically, log everything, compare results, promote the winner to production.

When you need this


  • You've run 5+ optimization experiments and lost track of which was best
  • "The intern ran experiments, which .json file is the good one?"
  • You need to justify to stakeholders why you picked a specific approach
  • You want to reproduce last week's best experiment with more data
  • You're comparing optimizers, models, or hyperparameters

How it's different from improving accuracy


|  | Improving accuracy (/ai-improving-accuracy) | Tracking experiments (this skill) |
| --- | --- | --- |
| Focus | Running a single optimization pass | Managing the full experimental lifecycle |
| Output | An optimized program | A comparison of all runs, with the winner promoted |
| Question | "How do I make this better?" | "Which of our 8 optimization runs was best?" |

Step 1: Understand the setup


Ask the user:
  1. How many experiments have you run? (2-3 → file-based tracking. 10+ → consider W&B Weave or LangWatch)
  2. What varied between runs? (optimizer, model, training data, hyperparameters?)
  3. Do you have an existing tracking tool? (W&B, MLflow, etc.)
  4. Do multiple people run experiments? (solo → file-based. Team → shared tool)
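A rough rule of thumb can be distilled from these answers. The helper below is an illustrative sketch (the function name and thresholds are ours, not part of the skill):

```python
def recommend_tracking(num_experiments: int, team_size: int) -> str:
    """Heuristic mirroring the questions above: small solo efforts stay
    file-based; many runs or multiple people justify a shared tool."""
    if num_experiments >= 10 or team_size > 1:
        return "shared tool (W&B Weave or LangWatch)"
    return "file-based (JSONL)"

print(recommend_tracking(3, 1))   # → file-based (JSONL)
print(recommend_tracking(12, 1))  # → shared tool (W&B Weave or LangWatch)
```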

Step 2: Lightweight experiment tracking (no extra tools)


A JSONL file is all you need to start. Each line records one experiment run:
```python
import json
from datetime import datetime

EXPERIMENT_LOG = "experiments.jsonl"

def log_experiment(run):
    """Log a single experiment run."""
    run["timestamp"] = datetime.now().isoformat()
    with open(EXPERIMENT_LOG, "a") as f:
        f.write(json.dumps(run) + "\n")

def load_experiments(path=EXPERIMENT_LOG):
    """Load all experiment runs."""
    with open(path) as f:
        return [json.loads(line) for line in f]
```
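A quick round trip shows the format in action. This sketch adds a path parameter and writes to a temporary file so it can run standalone; otherwise the functions match the ones above:

```python
import json
import os
import tempfile
from datetime import datetime

def log_experiment(run, path):
    """Append one run as a single JSON line."""
    run["timestamp"] = datetime.now().isoformat()
    with open(path, "a") as f:
        f.write(json.dumps(run) + "\n")

def load_experiments(path):
    """Read all runs back, in logging order."""
    with open(path) as f:
        return [json.loads(line) for line in f]

# Round trip against a temp file
path = os.path.join(tempfile.mkdtemp(), "experiments.jsonl")
log_experiment({"name": "run-a", "score": 0.71}, path)
log_experiment({"name": "run-b", "score": 0.84}, path)

runs = load_experiments(path)
print([r["name"] for r in runs])  # → ['run-a', 'run-b']
```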

What to log for each run


```python
run = {
    "name": "mipro-medium-gpt4o-mini",         # Human-readable name
    "optimizer": "MIPROv2",                    # Which optimizer
    "optimizer_config": {"auto": "medium"},    # Optimizer settings
    "model": "openai/gpt-4o-mini",             # Which LM
    "trainset_size": 200,                      # Training examples used
    "devset_size": 50,                         # Evaluation examples
    "metric": "answer_quality",                # Which metric
    "score": 0.84,                             # Score on devset
    "baseline_score": 0.65,                    # Score before optimization
    "improvement": 0.19,                       # Delta
    "cost_usd": 4.50,                          # API cost for this run
    "duration_minutes": 12,                    # Wall clock time
    "artifact_path": "artifacts/mipro_medium_gpt4o_mini.json",  # Saved program
    "notes": "Best so far. Instruction quality seems high.",
}
log_experiment(run)
```
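Because each run is a plain dict, finding the current best is a single max() call. A small illustrative helper (best_run is our name, not part of the skill):

```python
def best_run(runs, key="score"):
    """Return the logged run with the highest score (or other numeric field)."""
    return max(runs, key=lambda r: r.get(key, float("-inf")))

runs = [
    {"name": "bootstrap-4demos", "score": 0.71},
    {"name": "mipro-medium", "score": 0.84},
    {"name": "mipro-light", "score": 0.78},
]
print(best_run(runs)["name"])  # → mipro-medium
```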

Step 3: Run and log experiments systematically


Template function that runs one experiment end-to-end:
```python
import os
import time

import dspy
from dspy.evaluate import Evaluate

def run_experiment(
    name,
    program_class,
    optimizer_class,
    optimizer_kwargs,
    trainset,
    devset,
    metric,
    model="openai/gpt-4o-mini",
    artifact_dir="artifacts",
):
    """Run one optimization experiment and log results."""
    os.makedirs(artifact_dir, exist_ok=True)

    # Configure
    lm = dspy.LM(model)
    dspy.configure(lm=lm)
    program = program_class()

    # Baseline
    evaluator = Evaluate(devset=devset, metric=metric, num_threads=4)
    baseline_score = evaluator(program)

    # Optimize
    start = time.time()
    optimizer = optimizer_class(**optimizer_kwargs)
    if optimizer_class == dspy.GEPA:
        optimized = optimizer.compile(program, trainset=trainset, metric=metric)
    else:
        optimized = optimizer.compile(program, trainset=trainset)
    duration = (time.time() - start) / 60

    # Evaluate optimized
    score = evaluator(optimized)

    # Save artifact
    artifact_path = f"{artifact_dir}/{name}.json"
    optimized.save(artifact_path)

    # Log. Keep only JSON-serializable config values: optimizer_kwargs often
    # contains the metric function itself, which json.dumps can't handle.
    run = {
        "name": name,
        "optimizer": optimizer_class.__name__,
        "optimizer_config": {
            k: v for k, v in optimizer_kwargs.items()
            if isinstance(v, (str, int, float, bool))
        },
        "model": model,
        "trainset_size": len(trainset),
        "devset_size": len(devset),
        "metric": metric.__name__,
        "baseline_score": baseline_score,
        "score": score,
        "improvement": score - baseline_score,
        "duration_minutes": round(duration, 1),
        "artifact_path": artifact_path,
    }
    log_experiment(run)

    print(f"[{name}] {baseline_score:.1f}% -> {score:.1f}% (+{score - baseline_score:.1f}%)")
    return optimized, run
```

Run a batch of experiments


```python
experiments = [
    {
        "name": "bootstrap-4demos",
        "optimizer_class": dspy.BootstrapFewShot,
        "optimizer_kwargs": {"metric": metric, "max_bootstrapped_demos": 4},
    },
    {
        "name": "bootstrap-8demos",
        "optimizer_class": dspy.BootstrapFewShot,
        "optimizer_kwargs": {"metric": metric, "max_bootstrapped_demos": 8},
    },
    {
        "name": "mipro-light",
        "optimizer_class": dspy.MIPROv2,
        "optimizer_kwargs": {"metric": metric, "auto": "light"},
    },
    {
        "name": "mipro-medium",
        "optimizer_class": dspy.MIPROv2,
        "optimizer_kwargs": {"metric": metric, "auto": "medium"},
    },
]

results = []
for exp in experiments:
    optimized, run = run_experiment(
        name=exp["name"],
        program_class=MyProgram,
        optimizer_class=exp["optimizer_class"],
        optimizer_kwargs=exp["optimizer_kwargs"],
        trainset=trainset,
        devset=devset,
        metric=metric,
    )
    results.append(run)
```

Step 4: Compare experiments


Display comparison table


```python
def compare_experiments(path=EXPERIMENT_LOG, sort_by="score"):
    """Load experiments and display a comparison table."""
    runs = load_experiments(path)
    runs.sort(key=lambda r: r.get(sort_by, 0), reverse=True)

    # Header
    print(f"{'Name':<30} {'Optimizer':<20} {'Model':<22} {'Score':>7} {'Improve':>8} {'Cost':>7}")
    print("-" * 120)

    for r in runs:
        name = r.get("name", "?")[:29]
        opt = r.get("optimizer", "?")[:19]
        model = r.get("model", "?")[:21]
        score = r.get("score", 0)
        improvement = r.get("improvement", 0)
        cost = r.get("cost_usd", 0)

        print(f"{name:<30} {opt:<20} {model:<22} {score:>6.1f}% {improvement:>+7.1f}% ${cost:>5.2f}")

compare_experiments()
```

```
Name                           Optimizer            Model                    Score  Improve    Cost
------------------------------------------------------------------------------------------------------------------------
mipro-medium                   MIPROv2              openai/gpt-4o-mini       84.0%   +19.0% $ 4.50
mipro-light                    MIPROv2              openai/gpt-4o-mini       78.0%   +13.0% $ 1.20
bootstrap-8demos               BootstrapFewShot     openai/gpt-4o-mini       74.0%    +9.0% $ 0.30
bootstrap-4demos               BootstrapFewShot     openai/gpt-4o-mini       71.0%    +6.0% $ 0.15
```

Filter experiments


```python
def filter_experiments(path=EXPERIMENT_LOG, **filters):
    """Filter experiments by optimizer, model, or minimum score."""
    runs = load_experiments(path)

    for key, value in filters.items():
        if key == "min_score":
            runs = [r for r in runs if r.get("score", 0) >= value]
        elif key == "optimizer":
            runs = [r for r in runs if r.get("optimizer") == value]
        elif key == "model":
            runs = [r for r in runs if r.get("model") == value]

    return runs
```

```python
# Only MIPROv2 runs
mipro_runs = filter_experiments(optimizer="MIPROv2")

# Runs scoring above 80%
good_runs = filter_experiments(min_score=80.0)
```
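Raw score is not the only useful ranking. When cost matters, improvement per dollar is a handy secondary sort; the helper below is an illustrative sketch using the fields logged above:

```python
def rank_by_efficiency(runs):
    """Sort runs by improvement per dollar spent, best first."""
    def efficiency(r):
        cost = r.get("cost_usd", 0)
        return r.get("improvement", 0) / cost if cost else 0.0
    return sorted(runs, key=efficiency, reverse=True)

runs = [
    {"name": "mipro-medium", "improvement": 19.0, "cost_usd": 4.50},
    {"name": "bootstrap-8demos", "improvement": 9.0, "cost_usd": 0.30},
]
print([r["name"] for r in rank_by_efficiency(runs)])
# → ['bootstrap-8demos', 'mipro-medium']
```

The cheaper run wins here despite the smaller absolute gain, which is exactly the tradeoff worth surfacing to stakeholders.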

Step 5: Promote best experiment to production


```python
import json
import os
import shutil
from datetime import datetime

def promote_experiment(name, production_path="production/optimized.json"):
    """Copy the winning experiment's artifact to the production path."""
    runs = load_experiments()

    run = next((r for r in runs if r["name"] == name), None)
    if not run:
        print(f"Experiment '{name}' not found")
        return

    os.makedirs(os.path.dirname(production_path), exist_ok=True)
    shutil.copy2(run["artifact_path"], production_path)

    # Log the promotion
    promotion = {
        "event": "promotion",
        "experiment_name": name,
        "score": run["score"],
        "source_artifact": run["artifact_path"],
        "production_path": production_path,
        "timestamp": datetime.now().isoformat(),
    }
    with open("promotions.jsonl", "a") as f:
        f.write(json.dumps(promotion) + "\n")

    print(f"Promoted '{name}' (score: {run['score']:.1f}%) to {production_path}")
```

```python
# Promote the best experiment
promote_experiment("mipro-medium")
# → Promoted 'mipro-medium' (score: 84.0%) to production/optimized.json
```

Load the promoted program in production


```python
# In your production code
program = MyProgram()
program.load("production/optimized.json")
```
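Because promotions are logged too, "what is in production right now?" is answered by reading the last line of promotions.jsonl. An illustrative sketch (the demo writes a temp file so it runs standalone):

```python
import json
import os
import tempfile

def current_production(path="promotions.jsonl"):
    """Return the most recent promotion record, or None if nothing was promoted."""
    last = None
    try:
        with open(path) as f:
            for line in f:
                last = json.loads(line)
    except FileNotFoundError:
        return None
    return last

# Demo against a temp file so the sketch runs standalone
path = os.path.join(tempfile.mkdtemp(), "promotions.jsonl")
with open(path, "a") as f:
    f.write(json.dumps({"experiment_name": "mipro-light", "score": 78.0}) + "\n")
    f.write(json.dumps({"experiment_name": "mipro-medium", "score": 84.0}) + "\n")

print(current_production(path)["experiment_name"])  # → mipro-medium
```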

Step 6: Use W&B Weave (for teams)


For teams running many experiments, W&B Weave adds visual dashboards and collaboration:
```bash
pip install weave
```

```python
import weave

weave.init("my-project")

@weave.op()
def run_optimization(optimizer_name, model, trainset, devset, metric):
    """Tracked optimization run — Weave logs inputs, outputs, and cost."""
    lm = dspy.LM(model)
    dspy.configure(lm=lm)

    program = MyProgram()
    optimizer = dspy.MIPROv2(metric=metric, auto="medium")
    optimized = optimizer.compile(program, trainset=trainset)

    evaluator = Evaluate(devset=devset, metric=metric, num_threads=4)
    score = evaluator(optimized)

    return {"score": score, "optimizer": optimizer_name, "model": model}
```

```python
# Weave auto-tracks everything — view at wandb.ai
result = run_optimization("mipro-medium", "openai/gpt-4o-mini", trainset, devset, metric)
```

Step 7: Use LangWatch (for real-time optimizer progress)


LangWatch shows optimizer progress as it runs — useful for long optimization runs:
```bash
pip install langwatch
```

```python
import langwatch

langwatch.init()
```

```python
# LangWatch tracks DSPy optimizer steps in real time
optimizer = dspy.MIPROv2(metric=metric, auto="heavy")
optimized = optimizer.compile(program, trainset=trainset)

# Watch progress at app.langwatch.ai
```

Key patterns


  • Log from day one: even if you only have 2 experiments now, you'll have 20 next month
  • Log the artifact path: an experiment without a saved .json file is useless
  • Compare on the same devset: scores from different devsets aren't comparable
  • Track cost: "20% better accuracy for 10x the cost" is a real tradeoff
  • Promote explicitly: don't just copy files — log which experiment is in production
  • Start file-based, upgrade later: JSONL tracking works fine until you have a team
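The "compare on the same devset" rule can be enforced mechanically before comparing two runs. The check below is a cheap proxy using fields already logged; matching sizes do not guarantee identical examples, so log a devset hash if you need certainty:

```python
def comparable(a, b):
    """True only if two runs were scored with the same metric on a
    devset of the same size (a proxy for 'the same devset')."""
    return (a.get("metric") == b.get("metric")
            and a.get("devset_size") == b.get("devset_size"))

run_a = {"metric": "answer_quality", "devset_size": 50, "score": 0.84}
run_b = {"metric": "answer_quality", "devset_size": 120, "score": 0.79}
print(comparable(run_a, run_b))  # → False
```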

Additional resources


  • For worked examples, see examples.md
  • Use /ai-improving-accuracy to run individual optimization passes
  • Use /ai-switching-models when comparing the same optimizer across different models
  • Use /ai-cutting-costs when experiment costs are a concern
  • Use /ai-monitoring to track how the promoted experiment performs in production