ai-tracking-experiments
Track Which Optimization Experiment Was Best
Guide the user through logging, comparing, and managing optimization experiments. The pattern: run experiments systematically, log everything, compare results, promote the winner to production.
When you need this
- You've run 5+ optimization experiments and lost track of which was best
- "The intern ran experiments, which .json file is the good one?"
- You need to justify to stakeholders why you picked a specific approach
- You want to reproduce last week's best experiment with more data
- You're comparing optimizers, models, or hyperparameters
How it's different from improving accuracy
| | Improving accuracy (/ai-improving-accuracy) | Tracking experiments (this skill) |
|---|---|---|
| Focus | Running a single optimization pass | Managing the full experimental lifecycle |
| Output | An optimized program | A comparison of all runs with the winner promoted |
| Question | "How do I make this better?" | "Which of our 8 optimization runs was best?" |
Step 1: Understand the setup
Ask the user:
- How many experiments have you run? (2-3 → file-based tracking. 10+ → consider W&B Weave or LangWatch)
- What varied between runs? (optimizer, model, training data, hyperparameters?)
- Do you have an existing tracking tool? (W&B, MLflow, etc.)
- Do multiple people run experiments? (solo → file-based. Team → shared tool)
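The rule of thumb above can be codified in a few lines. A minimal sketch (the function name and return strings are invented here for illustration):

```python
def suggest_tracking(n_experiments: int, team: bool = False) -> str:
    """Codify the rule of thumb: few solo experiments -> a JSONL file,
    many experiments or a team -> a hosted tool."""
    if team or n_experiments >= 10:
        return "hosted tool (W&B Weave or LangWatch)"
    return "file-based (JSONL)"

print(suggest_tracking(3))   # file-based (JSONL)
print(suggest_tracking(12))  # hosted tool (W&B Weave or LangWatch)
```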
Step 2: Lightweight experiment tracking (no extra tools)
A JSONL file is all you need to start. Each line records one experiment run:
```python
import json
from datetime import datetime

EXPERIMENT_LOG = "experiments.jsonl"

def log_experiment(run):
    """Log a single experiment run."""
    run["timestamp"] = datetime.now().isoformat()
    with open(EXPERIMENT_LOG, "a") as f:
        f.write(json.dumps(run) + "\n")

def load_experiments(path=EXPERIMENT_LOG):
    """Load all experiment runs."""
    with open(path) as f:
        return [json.loads(line) for line in f]
```
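A quick round trip shows the log in action. This sketch re-declares the helper with an explicit temp path so it runs standalone; the names and scores are illustrative:

```python
import json
import os
import tempfile
from datetime import datetime

path = os.path.join(tempfile.mkdtemp(), "experiments.jsonl")

def log_experiment(run, path=path):
    run["timestamp"] = datetime.now().isoformat()
    with open(path, "a") as f:
        f.write(json.dumps(run) + "\n")

log_experiment({"name": "baseline", "score": 0.65})
log_experiment({"name": "mipro-light", "score": 0.78})

with open(path) as f:
    runs = [json.loads(line) for line in f]

best = max(runs, key=lambda r: r["score"])
print(best["name"])  # mipro-light
```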
What to log for each run
```python
run = {
    "name": "mipro-medium-gpt4o-mini",       # Human-readable name
    "optimizer": "MIPROv2",                  # Which optimizer
    "optimizer_config": {"auto": "medium"},  # Optimizer settings
    "model": "openai/gpt-4o-mini",           # Which LM
    "trainset_size": 200,                    # Training examples used
    "devset_size": 50,                       # Evaluation examples
    "metric": "answer_quality",              # Which metric
    "score": 0.84,                           # Score on devset
    "baseline_score": 0.65,                  # Score before optimization
    "improvement": 0.19,                     # Delta
    "cost_usd": 4.50,                        # API cost for this run
    "duration_minutes": 12,                  # Wall clock time
    "artifact_path": "artifacts/mipro_medium_gpt4o_mini.json",  # Saved program
    "notes": "Best so far. Instruction quality seems high.",
}

log_experiment(run)
```
Step 3: Run and log experiments systematically
Template function that runs one experiment end-to-end:
```python
import os
import time

import dspy
from dspy.evaluate import Evaluate

def run_experiment(
    name,
    program_class,
    optimizer_class,
    optimizer_kwargs,
    trainset,
    devset,
    metric,
    model="openai/gpt-4o-mini",
    artifact_dir="artifacts",
):
    """Run one optimization experiment and log results."""
    os.makedirs(artifact_dir, exist_ok=True)

    # Configure
    lm = dspy.LM(model)
    dspy.configure(lm=lm)
    program = program_class()

    # Baseline
    evaluator = Evaluate(devset=devset, metric=metric, num_threads=4)
    baseline_score = evaluator(program)

    # Optimize
    start = time.time()
    optimizer = optimizer_class(**optimizer_kwargs)
    if optimizer_class == dspy.GEPA:
        optimized = optimizer.compile(program, trainset=trainset, metric=metric)
    else:
        optimized = optimizer.compile(program, trainset=trainset)
    duration = (time.time() - start) / 60

    # Evaluate optimized
    score = evaluator(optimized)

    # Save artifact
    artifact_path = f"{artifact_dir}/{name}.json"
    optimized.save(artifact_path)

    # Log
    run = {
        "name": name,
        "optimizer": optimizer_class.__name__,
        "optimizer_config": optimizer_kwargs,
        "model": model,
        "trainset_size": len(trainset),
        "devset_size": len(devset),
        "metric": metric.__name__,
        "baseline_score": baseline_score,
        "score": score,
        "improvement": score - baseline_score,
        "duration_minutes": round(duration, 1),
        "artifact_path": artifact_path,
    }
    log_experiment(run)
    print(f"[{name}] {baseline_score:.1f}% -> {score:.1f}% (+{score - baseline_score:.1f}%)")
    return optimized, run
```
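Note that the template never fills `cost_usd`. One way to estimate it, assuming your DSPy version records a `cost` field on each `lm.history` entry (treat that as an assumption; entries without it count as zero):

```python
def total_cost(history):
    """Sum per-call costs from an LM history list; missing costs count as 0."""
    return round(sum((entry.get("cost") or 0.0) for entry in history), 4)

# Inside run_experiment, after optimizing:
#   run["cost_usd"] = total_cost(lm.history)
```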
Run a batch of experiments
```python
experiments = [
    {
        "name": "bootstrap-4demos",
        "optimizer_class": dspy.BootstrapFewShot,
        "optimizer_kwargs": {"metric": metric, "max_bootstrapped_demos": 4},
    },
    {
        "name": "bootstrap-8demos",
        "optimizer_class": dspy.BootstrapFewShot,
        "optimizer_kwargs": {"metric": metric, "max_bootstrapped_demos": 8},
    },
    {
        "name": "mipro-light",
        "optimizer_class": dspy.MIPROv2,
        "optimizer_kwargs": {"metric": metric, "auto": "light"},
    },
    {
        "name": "mipro-medium",
        "optimizer_class": dspy.MIPROv2,
        "optimizer_kwargs": {"metric": metric, "auto": "medium"},
    },
]

results = []
for exp in experiments:
    optimized, run = run_experiment(
        name=exp["name"],
        program_class=MyProgram,
        optimizer_class=exp["optimizer_class"],
        optimizer_kwargs=exp["optimizer_kwargs"],
        trainset=trainset,
        devset=devset,
        metric=metric,
    )
    results.append(run)
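Batches get re-run after crashes, so it helps to skip configs that are already in the log. A standalone sketch (`logged_names` is a helper invented here, not part of the skill):

```python
import json
import os

def logged_names(path="experiments.jsonl"):
    """Names of experiments already recorded in the JSONL log."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return {json.loads(line)["name"] for line in f if line.strip()}

# In the batch loop:
#   if exp["name"] in logged_names():
#       continue  # already ran this one
```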
Step 4: Compare experiments
Display comparison table
```python
def compare_experiments(path=EXPERIMENT_LOG, sort_by="score"):
    """Load experiments and display a comparison table."""
    runs = load_experiments(path)
    runs.sort(key=lambda r: r.get(sort_by, 0), reverse=True)

    # Header
    print(f"{'Name':<30} {'Optimizer':<20} {'Model':<22} {'Score':>7} {'Improve':>8} {'Cost':>7}")
    print("-" * 120)
    for r in runs:
        name = r.get("name", "?")[:29]
        opt = r.get("optimizer", "?")[:19]
        model = r.get("model", "?")[:21]
        score = r.get("score", 0)
        improvement = r.get("improvement", 0)
        cost = r.get("cost_usd", 0)
        print(f"{name:<30} {opt:<20} {model:<22} {score:>6.1f}% {improvement:>+7.1f}% ${cost:>5.2f}")

compare_experiments()
```
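The fixed-width table reads well in a terminal; for a PR description or a stakeholder note, Markdown travels better. A sketch along the same lines (`to_markdown` is a helper added here, not part of the skill):

```python
def to_markdown(runs):
    """Render experiment runs as a Markdown table, best score first."""
    lines = [
        "| Name | Optimizer | Score | Improve | Cost |",
        "|---|---|---|---|---|",
    ]
    for r in sorted(runs, key=lambda x: x.get("score", 0), reverse=True):
        lines.append(
            f"| {r.get('name', '?')} | {r.get('optimizer', '?')} "
            f"| {r.get('score', 0):.1f}% | {r.get('improvement', 0):+.1f}% "
            f"| ${r.get('cost_usd', 0):.2f} |"
        )
    return "\n".join(lines)

# print(to_markdown(load_experiments()))
```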
```
Name                           Optimizer            Model                     Score  Improve    Cost
------------------------------------------------------------------------------------------------------------------------
mipro-medium                   MIPROv2              openai/gpt-4o-mini        84.0%   +19.0% $ 4.50
mipro-light                    MIPROv2              openai/gpt-4o-mini        78.0%   +13.0% $ 1.20
bootstrap-8demos               BootstrapFewShot     openai/gpt-4o-mini        74.0%    +9.0% $ 0.30
bootstrap-4demos               BootstrapFewShot     openai/gpt-4o-mini        71.0%    +6.0% $ 0.15
```

Filter experiments
```python
def filter_experiments(path=EXPERIMENT_LOG, **filters):
    """Filter experiments by any field."""
    runs = load_experiments(path)
    for key, value in filters.items():
        if key == "min_score":
            runs = [r for r in runs if r.get("score", 0) >= value]
        elif key == "optimizer":
            runs = [r for r in runs if r.get("optimizer") == value]
        elif key == "model":
            runs = [r for r in runs if r.get("model") == value]
    return runs
```
```python
# Only MIPROv2 runs
mipro_runs = filter_experiments(optimizer="MIPROv2")

# Runs scoring above 80%
good_runs = filter_experiments(min_score=80.0)
```

Step 5: Promote best experiment to production
```python
import os
import shutil

def promote_experiment(name, production_path="production/optimized.json"):
    """Copy the winning experiment's artifact to the production path."""
    runs = load_experiments()
    run = next((r for r in runs if r["name"] == name), None)
    if not run:
        print(f"Experiment '{name}' not found")
        return
    os.makedirs(os.path.dirname(production_path), exist_ok=True)
    shutil.copy2(run["artifact_path"], production_path)

    # Log the promotion
    promotion = {
        "event": "promotion",
        "experiment_name": name,
        "score": run["score"],
        "source_artifact": run["artifact_path"],
        "production_path": production_path,
        "timestamp": datetime.now().isoformat(),
    }
    with open("promotions.jsonl", "a") as f:
        f.write(json.dumps(promotion) + "\n")
    print(f"Promoted '{name}' (score: {run['score']:.1f}%) to {production_path}")
```
```python
# Promote the best experiment
promote_experiment("mipro-medium")
# Promoted 'mipro-medium' (score: 84.0%) to production/optimized.json
```

Load the promoted program in production
```python
# In your production code
program = MyProgram()
program.load("production/optimized.json")
```
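Before trusting the promoted file, it's worth re-scoring the loaded program on the same devset and comparing against the logged score. A minimal tolerance check (the helper and the 2-point threshold are illustrative; `fresh` comes from running your usual evaluator):

```python
def scores_match(fresh, logged, tolerance=2.0):
    """True when a re-evaluated score is within `tolerance` points of the logged one."""
    return abs(fresh - logged) <= tolerance

# After program.load("production/optimized.json"):
#   fresh = evaluator(program)
#   assert scores_match(fresh, 84.0), "promoted program no longer reproduces its logged score"
```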
Step 6: Use W&B Weave (for teams)
For teams running many experiments, W&B Weave adds visual dashboards and collaboration:
```bash
pip install weave
```

```python
import weave

weave.init("my-project")

@weave.op()
def run_optimization(optimizer_name, model, trainset, devset, metric):
    """Tracked optimization run — Weave logs inputs, outputs, and cost."""
    lm = dspy.LM(model)
    dspy.configure(lm=lm)
    program = MyProgram()
    optimizer = dspy.MIPROv2(metric=metric, auto="medium")
    optimized = optimizer.compile(program, trainset=trainset)
    evaluator = Evaluate(devset=devset, metric=metric, num_threads=4)
    score = evaluator(optimized)
    return {"score": score, "optimizer": optimizer_name, "model": model}
```
```python
# Weave auto-tracks everything — view at wandb.ai
result = run_optimization("mipro-medium", "openai/gpt-4o-mini", trainset, devset, metric)
```

Step 7: Use LangWatch (for real-time optimizer progress)
LangWatch shows optimizer progress as it runs — useful for long optimization runs:
```bash
pip install langwatch
```

```python
import langwatch

langwatch.init()

# LangWatch tracks DSPy optimizer steps in real time
optimizer = dspy.MIPROv2(metric=metric, auto="heavy")
optimized = optimizer.compile(program, trainset=trainset)
# Watch progress at app.langwatch.ai
```

Key patterns
- Log from day one: even if you only have 2 experiments now, you'll have 20 next month
- Log the artifact path: an experiment without a saved .json file is useless
- Compare on the same devset: scores from different devsets aren't comparable
- Track cost: "20% better accuracy for 10x the cost" is a real tradeoff
- Promote explicitly: don't just copy files — log which experiment is in production
- Start file-based, upgrade later: JSONL tracking works fine until you have a team
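The "compare on the same devset" rule can be enforced mechanically: log a fingerprint of the devset with each run, and refuse to compare runs whose fingerprints differ. A sketch that assumes each example converts to a plain dict (adapt the serialization to your example type; `devset_hash` is a field this guide doesn't otherwise define):

```python
import hashlib
import json

def devset_fingerprint(devset):
    """Stable short hash of a devset; log it as run["devset_hash"]."""
    payload = json.dumps([dict(ex) for ex in devset], sort_keys=True, default=str)
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

# Runs are only comparable when their devset_hash values match.
```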
Additional resources
- For worked examples, see examples.md
- /ai-improving-accuracy - Use to run individual optimization passes
- /ai-switching-models - Use when comparing the same optimizer across different models
- /ai-cutting-costs - Use when experiment costs are a concern
- /ai-monitoring - Use to track how the promoted experiment performs in production