ai-tracking-experiments


Track Which Optimization Experiment Was Best


Guide the user through logging, comparing, and managing optimization experiments. The pattern: run experiments systematically, log everything, compare results, promote the winner to production.

When you need this


  • You've run 5+ optimization experiments and lost track of which was best
  • "The intern ran experiments, which .json file is the good one?"
  • You need to justify to stakeholders why you picked a specific approach
  • You want to reproduce last week's best experiment with more data
  • You're comparing optimizers, models, or hyperparameters

How it's different from improving accuracy


|  | Improving accuracy (/ai-improving-accuracy) | Tracking experiments (this skill) |
| --- | --- | --- |
| Focus | Running a single optimization pass | Managing the full experimental lifecycle |
| Output | An optimized program | A comparison of all runs, with the winner promoted |
| Question | "How do I make this better?" | "Which of our 8 optimization runs was best?" |

Step 1: Understand the setup


Ask the user:
  1. How many experiments have you run? (2-3 → file-based tracking. 10+ → consider W&B Weave or LangWatch)
  2. What varied between runs? (optimizer, model, training data, hyperparameters?)
  3. Do you have an existing tracking tool? (W&B, MLflow, etc.)
  4. Do multiple people run experiments? (solo → file-based. Team → shared tool)
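A rough rule of thumb can be distilled from these answers. The helper below is an illustrative sketch (the function name and thresholds are ours, not part of the skill):

```python
def recommend_tracking(num_experiments: int, team_size: int) -> str:
    """Heuristic mirroring the questions above: small solo efforts stay
    file-based; many runs or multiple people justify a shared tool."""
    if num_experiments >= 10 or team_size > 1:
        return "shared tool (W&B Weave or LangWatch)"
    return "file-based (JSONL)"

print(recommend_tracking(3, 1))   # → file-based (JSONL)
print(recommend_tracking(12, 1))  # → shared tool (W&B Weave or LangWatch)
```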

Step 2: Lightweight experiment tracking (no extra tools)


A JSONL file is all you need to start. Each line records one experiment run:
```python
import json
from datetime import datetime

EXPERIMENT_LOG = "experiments.jsonl"

def log_experiment(run):
    """Log a single experiment run."""
    run["timestamp"] = datetime.now().isoformat()
    with open(EXPERIMENT_LOG, "a") as f:
        f.write(json.dumps(run) + "\n")

def load_experiments(path=EXPERIMENT_LOG):
    """Load all experiment runs."""
    with open(path) as f:
        return [json.loads(line) for line in f]
```
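A quick round trip shows the format in action. This sketch adds a path parameter and writes to a temporary file so it can run standalone; otherwise the functions match the ones above:

```python
import json
import os
import tempfile
from datetime import datetime

def log_experiment(run, path):
    """Append one run as a single JSON line."""
    run["timestamp"] = datetime.now().isoformat()
    with open(path, "a") as f:
        f.write(json.dumps(run) + "\n")

def load_experiments(path):
    """Read all runs back, in logging order."""
    with open(path) as f:
        return [json.loads(line) for line in f]

# Round trip against a temp file
path = os.path.join(tempfile.mkdtemp(), "experiments.jsonl")
log_experiment({"name": "run-a", "score": 0.71}, path)
log_experiment({"name": "run-b", "score": 0.84}, path)

runs = load_experiments(path)
print([r["name"] for r in runs])  # → ['run-a', 'run-b']
```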

What to log for each run


```python
run = {
    "name": "mipro-medium-gpt4o-mini",         # Human-readable name
    "optimizer": "MIPROv2",                    # Which optimizer
    "optimizer_config": {"auto": "medium"},    # Optimizer settings
    "model": "openai/gpt-4o-mini",             # Which LM
    "trainset_size": 200,                      # Training examples used
    "devset_size": 50,                         # Evaluation examples
    "metric": "answer_quality",                # Which metric
    "score": 0.84,                             # Score on devset
    "baseline_score": 0.65,                    # Score before optimization
    "improvement": 0.19,                       # Delta
    "cost_usd": 4.50,                          # API cost for this run
    "duration_minutes": 12,                    # Wall clock time
    "artifact_path": "artifacts/mipro_medium_gpt4o_mini.json",  # Saved program
    "notes": "Best so far. Instruction quality seems high.",
}
log_experiment(run)
```
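Because each run is a plain dict, finding the current best is a single max() call. A small illustrative helper (best_run is our name, not part of the skill):

```python
def best_run(runs, key="score"):
    """Return the logged run with the highest score (or other numeric field)."""
    return max(runs, key=lambda r: r.get(key, float("-inf")))

runs = [
    {"name": "bootstrap-4demos", "score": 0.71},
    {"name": "mipro-medium", "score": 0.84},
    {"name": "mipro-light", "score": 0.78},
]
print(best_run(runs)["name"])  # → mipro-medium
```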

Step 3: Run and log experiments systematically


Template function that runs one experiment end-to-end:
```python
import os
import time

import dspy
from dspy.evaluate import Evaluate

def run_experiment(
    name,
    program_class,
    optimizer_class,
    optimizer_kwargs,
    trainset,
    devset,
    metric,
    model="openai/gpt-4o-mini",
    artifact_dir="artifacts",
):
    """Run one optimization experiment and log results."""
    os.makedirs(artifact_dir, exist_ok=True)

    # Configure
    lm = dspy.LM(model)
    dspy.configure(lm=lm)
    program = program_class()

    # Baseline
    evaluator = Evaluate(devset=devset, metric=metric, num_threads=4)
    baseline_score = evaluator(program)

    # Optimize
    start = time.time()
    optimizer = optimizer_class(**optimizer_kwargs)
    if optimizer_class == dspy.GEPA:
        optimized = optimizer.compile(program, trainset=trainset, metric=metric)
    else:
        optimized = optimizer.compile(program, trainset=trainset)
    duration = (time.time() - start) / 60

    # Evaluate optimized
    score = evaluator(optimized)

    # Save artifact
    artifact_path = f"{artifact_dir}/{name}.json"
    optimized.save(artifact_path)

    # Log. Keep only JSON-serializable config values: optimizer_kwargs often
    # contains the metric function itself, which json.dumps can't handle.
    run = {
        "name": name,
        "optimizer": optimizer_class.__name__,
        "optimizer_config": {
            k: v for k, v in optimizer_kwargs.items()
            if isinstance(v, (str, int, float, bool))
        },
        "model": model,
        "trainset_size": len(trainset),
        "devset_size": len(devset),
        "metric": metric.__name__,
        "baseline_score": baseline_score,
        "score": score,
        "improvement": score - baseline_score,
        "duration_minutes": round(duration, 1),
        "artifact_path": artifact_path,
    }
    log_experiment(run)

    print(f"[{name}] {baseline_score:.1f}% -> {score:.1f}% (+{score - baseline_score:.1f}%)")
    return optimized, run
```

Run a batch of experiments


```python
experiments = [
    {
        "name": "bootstrap-4demos",
        "optimizer_class": dspy.BootstrapFewShot,
        "optimizer_kwargs": {"metric": metric, "max_bootstrapped_demos": 4},
    },
    {
        "name": "bootstrap-8demos",
        "optimizer_class": dspy.BootstrapFewShot,
        "optimizer_kwargs": {"metric": metric, "max_bootstrapped_demos": 8},
    },
    {
        "name": "mipro-light",
        "optimizer_class": dspy.MIPROv2,
        "optimizer_kwargs": {"metric": metric, "auto": "light"},
    },
    {
        "name": "mipro-medium",
        "optimizer_class": dspy.MIPROv2,
        "optimizer_kwargs": {"metric": metric, "auto": "medium"},
    },
]

results = []
for exp in experiments:
    optimized, run = run_experiment(
        name=exp["name"],
        program_class=MyProgram,
        optimizer_class=exp["optimizer_class"],
        optimizer_kwargs=exp["optimizer_kwargs"],
        trainset=trainset,
        devset=devset,
        metric=metric,
    )
    results.append(run)
```

Step 4: Compare experiments


Display comparison table


```python
def compare_experiments(path=EXPERIMENT_LOG, sort_by="score"):
    """Load experiments and display a comparison table."""
    runs = load_experiments(path)
    runs.sort(key=lambda r: r.get(sort_by, 0), reverse=True)

    # Header
    print(f"{'Name':<30} {'Optimizer':<20} {'Model':<22} {'Score':>7} {'Improve':>8} {'Cost':>7}")
    print("-" * 120)

    for r in runs:
        name = r.get("name", "?")[:29]
        opt = r.get("optimizer", "?")[:19]
        model = r.get("model", "?")[:21]
        score = r.get("score", 0)
        improvement = r.get("improvement", 0)
        cost = r.get("cost_usd", 0)

        print(f"{name:<30} {opt:<20} {model:<22} {score:>6.1f}% {improvement:>+7.1f}% ${cost:>5.2f}")

compare_experiments()
```

```
Name                           Optimizer            Model                    Score  Improve    Cost
------------------------------------------------------------------------------------------------------------------------
mipro-medium                   MIPROv2              openai/gpt-4o-mini       84.0%   +19.0% $ 4.50
mipro-light                    MIPROv2              openai/gpt-4o-mini       78.0%   +13.0% $ 1.20
bootstrap-8demos               BootstrapFewShot     openai/gpt-4o-mini       74.0%    +9.0% $ 0.30
bootstrap-4demos               BootstrapFewShot     openai/gpt-4o-mini       71.0%    +6.0% $ 0.15
```

Filter experiments


```python
def filter_experiments(path=EXPERIMENT_LOG, **filters):
    """Filter experiments by optimizer, model, or minimum score."""
    runs = load_experiments(path)

    for key, value in filters.items():
        if key == "min_score":
            runs = [r for r in runs if r.get("score", 0) >= value]
        elif key == "optimizer":
            runs = [r for r in runs if r.get("optimizer") == value]
        elif key == "model":
            runs = [r for r in runs if r.get("model") == value]

    return runs
```

```python
# Only MIPROv2 runs
mipro_runs = filter_experiments(optimizer="MIPROv2")

# Runs scoring above 80%
good_runs = filter_experiments(min_score=80.0)
```
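Raw score is not the only useful ranking. When cost matters, improvement per dollar is a handy secondary sort; the helper below is an illustrative sketch using the fields logged above:

```python
def rank_by_efficiency(runs):
    """Sort runs by improvement per dollar spent, best first."""
    def efficiency(r):
        cost = r.get("cost_usd", 0)
        return r.get("improvement", 0) / cost if cost else 0.0
    return sorted(runs, key=efficiency, reverse=True)

runs = [
    {"name": "mipro-medium", "improvement": 19.0, "cost_usd": 4.50},
    {"name": "bootstrap-8demos", "improvement": 9.0, "cost_usd": 0.30},
]
print([r["name"] for r in rank_by_efficiency(runs)])
# → ['bootstrap-8demos', 'mipro-medium']
```

The cheaper run wins here despite the smaller absolute gain, which is exactly the tradeoff worth surfacing to stakeholders.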

Step 5: Promote best experiment to production


```python
import json
import os
import shutil
from datetime import datetime

def promote_experiment(name, production_path="production/optimized.json"):
    """Copy the winning experiment's artifact to the production path."""
    runs = load_experiments()

    run = next((r for r in runs if r["name"] == name), None)
    if not run:
        print(f"Experiment '{name}' not found")
        return

    os.makedirs(os.path.dirname(production_path), exist_ok=True)
    shutil.copy2(run["artifact_path"], production_path)

    # Log the promotion
    promotion = {
        "event": "promotion",
        "experiment_name": name,
        "score": run["score"],
        "source_artifact": run["artifact_path"],
        "production_path": production_path,
        "timestamp": datetime.now().isoformat(),
    }
    with open("promotions.jsonl", "a") as f:
        f.write(json.dumps(promotion) + "\n")

    print(f"Promoted '{name}' (score: {run['score']:.1f}%) to {production_path}")
```

```python
# Promote the best experiment
promote_experiment("mipro-medium")
# → Promoted 'mipro-medium' (score: 84.0%) to production/optimized.json
```

Load the promoted program in production


```python
# In your production code
program = MyProgram()
program.load("production/optimized.json")
```
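Because promotions are logged too, "what is in production right now?" is answered by reading the last line of promotions.jsonl. An illustrative sketch (the demo writes a temp file so it runs standalone):

```python
import json
import os
import tempfile

def current_production(path="promotions.jsonl"):
    """Return the most recent promotion record, or None if nothing was promoted."""
    last = None
    try:
        with open(path) as f:
            for line in f:
                last = json.loads(line)
    except FileNotFoundError:
        return None
    return last

# Demo against a temp file so the sketch runs standalone
path = os.path.join(tempfile.mkdtemp(), "promotions.jsonl")
with open(path, "a") as f:
    f.write(json.dumps({"experiment_name": "mipro-light", "score": 78.0}) + "\n")
    f.write(json.dumps({"experiment_name": "mipro-medium", "score": 84.0}) + "\n")

print(current_production(path)["experiment_name"])  # → mipro-medium
```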

Step 6: Use W&B Weave (for teams)


For teams running many experiments, W&B Weave adds visual dashboards and collaboration:
```bash
pip install weave
```

```python
import weave

weave.init("my-project")

@weave.op()
def run_optimization(optimizer_name, model, trainset, devset, metric):
    """Tracked optimization run — Weave logs inputs, outputs, and cost."""
    lm = dspy.LM(model)
    dspy.configure(lm=lm)

    program = MyProgram()
    optimizer = dspy.MIPROv2(metric=metric, auto="medium")
    optimized = optimizer.compile(program, trainset=trainset)

    evaluator = Evaluate(devset=devset, metric=metric, num_threads=4)
    score = evaluator(optimized)

    return {"score": score, "optimizer": optimizer_name, "model": model}
```

```python
# Weave auto-tracks everything — view at wandb.ai
result = run_optimization("mipro-medium", "openai/gpt-4o-mini", trainset, devset, metric)
```

Step 7: Use LangWatch (for real-time optimizer progress)


LangWatch shows optimizer progress as it runs — useful for long optimization runs:
```bash
pip install langwatch
```

```python
import langwatch

langwatch.init()
```

```python
# LangWatch tracks DSPy optimizer steps in real time
optimizer = dspy.MIPROv2(metric=metric, auto="heavy")
optimized = optimizer.compile(program, trainset=trainset)

# Watch progress at app.langwatch.ai
```

Key patterns


  • Log from day one: even if you only have 2 experiments now, you'll have 20 next month
  • Log the artifact path: an experiment without a saved .json file is useless
  • Compare on the same devset: scores from different devsets aren't comparable
  • Track cost: "20% better accuracy for 10x the cost" is a real tradeoff
  • Promote explicitly: don't just copy files — log which experiment is in production
  • Start file-based, upgrade later: JSONL tracking works fine until you have a team
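The "compare on the same devset" rule can be enforced mechanically before comparing two runs. The check below is a cheap proxy using fields already logged; matching sizes do not guarantee identical examples, so log a devset hash if you need certainty:

```python
def comparable(a, b):
    """True only if two runs were scored with the same metric on a
    devset of the same size (a proxy for 'the same devset')."""
    return (a.get("metric") == b.get("metric")
            and a.get("devset_size") == b.get("devset_size"))

run_a = {"metric": "answer_quality", "devset_size": 50, "score": 0.84}
run_b = {"metric": "answer_quality", "devset_size": 120, "score": 0.79}
print(comparable(run_a, run_b))  # → False
```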

Additional resources


  • For worked examples, see examples.md
  • Use /ai-improving-accuracy to run individual optimization passes
  • Use /ai-switching-models when comparing the same optimizer across different models
  • Use /ai-cutting-costs when experiment costs are a concern
  • Use /ai-monitoring to track how the promoted experiment performs in production