bayesian-bandits

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese
<!-- Bundled files (accessible via ${CLAUDE_SKILL_DIR}): - SKILL.md — this file - scripts/demo.py — runnable marimo notebook with worked example -->
<!-- 捆绑文件(可通过 ${CLAUDE_SKILL_DIR} 访问): - SKILL.md — 本文件 - scripts/demo.py — 可运行的marimo笔记本,包含示例 -->

Bayesian Bandits — Thompson Sampling

贝叶斯老虎机——Thompson采样

For problems where you want to learn and exploit simultaneously, use multi-armed bandits with Thompson sampling. Unlike A/B testing (which fixes traffic allocation and decides at the end), bandits adapt allocation during the experiment — sending more traffic to better-performing arms and reducing the total number of users exposed to inferior variants.
对于需要同时学习与利用的问题,可结合Thompson采样使用多臂老虎机。与A/B测试(固定流量分配并在实验结束后做出决策)不同,老虎机在实验过程中会自适应调整分配策略——将更多流量导向表现更优的“臂”,减少接触劣质变体的用户总数。

When to use this skill

何时使用该技能

  • Multiple variants (ads, landing pages, recommendations, pricing tiers) and you want to minimize total regret, not just identify a winner
  • You can act on results in real time (update allocation each round or batch)
  • The cost of showing a bad variant is high enough that fixed 50/50 allocation is wasteful
  • You want an "always-on" optimization that converges to the best arm without a predefined stopping rule
  • Contextual: the best variant depends on user features (segment, device, geography)
  • 存在多个变体(广告、着陆页、推荐内容、定价层级),且你希望最小化总遗憾,而非仅识别最优变体
  • 你可以实时根据结果采取行动(每轮或批量更新分配策略)
  • 展示劣质变体的成本过高,固定50/50的分配方式存在浪费
  • 你需要一个“持续运行”的优化方案,无需预定义停止规则即可收敛到最优臂
  • 上下文相关:最优变体取决于用户特征(细分群体、设备、地域)

When NOT to use this skill

何时不使用该技能

  • You need a clean causal estimate of treatment effect with pre-registered sample size → use
    bayesian-ab-testing
  • The reward is delayed by days/weeks (bandits need fast feedback loops to adapt effectively)
  • You have more than ~50 arms with sparse rewards → consider collaborative filtering or neural bandits
  • The reward distribution changes over time (non-stationary) → use sliding-window or discounted Thompson sampling (not covered here)
  • 你需要通过预注册样本量获得清晰的处理效应因果估计 → 使用
    bayesian-ab-testing
  • 奖励存在数天/数周的延迟(老虎机需要快速反馈环才能有效自适应)
  • 存在约50个以上的臂且奖励稀疏 → 考虑协同过滤或神经老虎机
  • 奖励分布随时间变化(非平稳)→ 使用滑动窗口或折扣Thompson采样(本文未涵盖)

Project layout

项目结构

<project>/
├── src/
│   ├── bandit.py          # BernoulliBandit, ThompsonSampler classes
│   ├── baselines.py       # EpsilonGreedy, UCB1 for comparison
│   ├── simulate.py        # Run simulations, compute regret
│   └── contextual.py      # ContextualThompson (per-bin or logistic)
├── notebooks/
│   └── demo.py            # marimo walkthrough
└── results/               # regret curves, posterior snapshots
<project>/
├── src/
│   ├── bandit.py          # BernoulliBandit、ThompsonSampler类
│   ├── baselines.py       # 用于对比的EpsilonGreedy、UCB1
│   ├── simulate.py        # 运行模拟、计算遗憾值
│   └── contextual.py      # ContextualThompson(按区间或逻辑回归)
├── notebooks/
│   └── demo.py            # marimo分步演示
└── results/               # 遗憾曲线、后验快照

Core algorithm — Thompson sampling for Bernoulli arms

核心算法——针对伯努利臂的Thompson采样

python
import numpy as np

class ThompsonBernoulli:
    def __init__(self, n_arms: int):
        self.alphas = np.ones(n_arms)  # Beta prior: successes + 1
        self.betas = np.ones(n_arms)   # Beta prior: failures + 1

    def select_arm(self, rng: np.random.Generator) -> int:
        samples = rng.beta(self.alphas, self.betas)
        return int(np.argmax(samples))

    def update(self, arm: int, reward: int):
        self.alphas[arm] += reward
        self.betas[arm] += 1 - reward
That's the entire algorithm. No MCMC, no optimization, no hyperparameters. The Beta posterior is conjugate to the Bernoulli likelihood, so updates are exact and O(1).
python
import numpy as np

class ThompsonBernoulli:
    def __init__(self, n_arms: int):
        self.alphas = np.ones(n_arms)  # Beta先验:成功次数 + 1
        self.betas = np.ones(n_arms)   # Beta先验:失败次数 + 1

    def select_arm(self, rng: np.random.Generator) -> int:
        samples = rng.beta(self.alphas, self.betas)
        return int(np.argmax(samples))

    def update(self, arm: int, reward: int):
        self.alphas[arm] += reward
        self.betas[arm] += 1 - reward
这就是完整的算法,无需MCMC、无需优化、无需超参数。Beta后验与伯努利似然共轭,因此更新是精确的且时间复杂度为O(1)。

Why Thompson beats the alternatives

为什么Thompson采样优于其他方案

vs. Epsilon-greedy

对比Epsilon-greedy

Epsilon-greedy explores uniformly — it wastes pulls on arms it already knows are bad. Thompson concentrates exploration on arms that might be best (their posterior overlaps with the current leader).
Epsilon-greedy采用均匀探索——会在已知表现不佳的臂上浪费尝试次数。Thompson采样则将探索集中在可能最优的臂上(它们的后验分布与当前领先者重叠)。

vs. UCB1

对比UCB1

UCB is deterministic and optimistic — it always pulls the arm with the highest upper confidence bound. Thompson randomizes, which makes it:
  • More robust to delayed feedback (common in production)
  • Naturally handles batched updates
  • Easier to parallelize across concurrent requests
UCB是确定性且乐观的——它总是选择具有最高置信上限的臂。Thompson采样是随机的,这使其:
  • 对延迟反馈更鲁棒(生产环境中常见)
  • 自然支持批量更新
  • 更容易在并发请求中并行化

Regret bounds

遗憾边界

Thompson sampling achieves the Lai-Robbins lower bound for Bernoulli bandits:
$$R(T) = O\left(\sum_{i: p_i < p^} \frac{\ln T}{\text{KL}(p_i | p^)}\right)$$
This is optimal — no algorithm can do better asymptotically.
Thompson采样在伯努利老虎机上达到Lai-Robbins下界
$$R(T) = O\left(\sum_{i: p_i < p^} \frac{\ln T}{\text{KL}(p_i | p^)}\right)$$
这是最优的——没有算法能在渐近情况下表现更好。

Contextual bandits

上下文老虎机

When the best arm depends on context (user features), maintain separate posteriors per context:
当最优臂取决于上下文(用户特征)时,为每个上下文维护独立的后验分布:

Simple: discrete contexts

简单版:离散上下文

python
class ContextualThompson:
    def __init__(self, n_contexts: int, n_arms: int):
        self.agents = [
            ThompsonBernoulli(n_arms) for _ in range(n_contexts)
        ]

    def select_arm(self, context: int, rng) -> int:
        return self.agents[context].select_arm(rng)

    def update(self, context: int, arm: int, reward: int):
        self.agents[context].update(arm, reward)
python
class ContextualThompson:
    def __init__(self, n_contexts: int, n_arms: int):
        self.agents = [
            ThompsonBernoulli(n_arms) for _ in range(n_contexts)
        ]

    def select_arm(self, context: int, rng) -> int:
        return self.agents[context].select_arm(rng)

    def update(self, context: int, arm: int, reward: int):
        self.agents[context].update(arm, reward)

Advanced: continuous contexts (logistic Thompson)

进阶版:连续上下文(逻辑回归Thompson采样)

For continuous features, model the reward probability with a logistic regression per arm and use a Laplace approximation for the posterior. Each round, sample coefficients from the approximate posterior and pick the arm with the highest predicted reward. This is "logistic Thompson sampling" — see Chapelle & Li (2011).
对于连续特征,为每个臂使用逻辑回归建模奖励概率,并使用拉普拉斯近似后验分布。每一轮从近似后验分布中采样系数,选择预测奖励最高的臂。这就是“逻辑回归Thompson采样”——详见Chapelle & Li (2011)。

Regret analysis

遗憾分析

Always plot cumulative regret over time:
python
def cumulative_regret(arm_choices, rewards, true_probs):
    best_prob = true_probs.max()
    instant_regret = best_prob - true_probs[arm_choices]
    return np.cumsum(instant_regret)
Run multiple simulations and plot mean +/- std to get a stable picture. Thompson's regret curve should be sublinear (flattening over time), while epsilon-greedy's is linear (constant exploration waste).
务必绘制累计遗憾值随时间变化的曲线:
python
def cumulative_regret(arm_choices, rewards, true_probs):
    best_prob = true_probs.max()
    instant_regret = best_prob - true_probs[arm_choices]
    return np.cumsum(instant_regret)
运行多次模拟并绘制均值±标准差以获得稳定结果。Thompson采样的遗憾曲线应为次线性(随时间趋于平缓),而epsilon-greedy的曲线为线性(持续存在探索浪费)。

Batched Thompson sampling

批量Thompson采样

In production you often can't update after every single event. With batched updates (e.g., every 1000 impressions), Thompson still works — just accumulate successes/failures per batch and update once:
python
undefined
在生产环境中,你通常无法在每次事件后立即更新。Thompson采样在批量更新(例如每1000次展示更新一次)场景下仍然有效——只需累计每个批次的成功/失败次数,然后一次性更新:
python
undefined

End of batch

批次结束时

for arm in range(n_arms): agent.alphas[arm] += batch_successes[arm] agent.betas[arm] += batch_failures[arm]

The posterior is still exact. Batching only slows convergence; it
doesn't bias the algorithm.
for arm in range(n_arms): agent.alphas[arm] += batch_successes[arm] agent.betas[arm] += batch_failures[arm]

后验分布仍然是精确的。批量处理只会减慢收敛速度,不会使算法产生偏差。

MLflow logging

MLflow日志记录

For bandit experiments, log:
KindWhat
params
n_arms, horizon, algorithm, epsilon (if egreedy), prior_alpha, prior_beta, n_contexts (if contextual)
metrics
final_cumulative_regret, mean_regret_per_round, best_arm_pull_fraction, convergence_round (round at which best arm gets >90% of pulls)
tags
true_probs (as JSON), best_arm
artifacts
regret_curve.png, posterior_snapshots.png, arm_allocation.png
对于老虎机实验,需记录以下内容:
类型内容
params
臂数量n_arms、实验时长horizon、算法类型、epsilon(若使用egreedy)、先验alpha、先验beta、上下文数量n_contexts(若为上下文老虎机)
metrics
最终累计遗憾值、每轮平均遗憾值、最优臂被选择的比例、收敛轮次(最优臂获得超过90%流量的轮次)
tags
真实概率true_probs(JSON格式)、最优臂
artifacts
regret_curve.png、posterior_snapshots.png、arm_allocation.png

Common pitfalls

常见陷阱

  1. Using bandits when you need causal inference. Bandits optimize allocation but don't give you a clean estimate of treatment effect. If regulatory or scientific rigor requires a pre-registered sample size and unbiased estimator, use an A/B test.
  2. Ignoring delayed feedback. If conversions take 7 days to materialize, the bandit is making decisions based on stale data. Either shorten the feedback loop (use a proxy metric) or add a delay buffer before updating.
  3. Too many arms with sparse rewards. With 100 arms and 1% conversion, most arms will have zero conversions for a long time. Thompson still works but converges very slowly — consider hierarchical priors or neural bandits.
  4. Non-stationary rewards. If the true probabilities change over time, standard Thompson will be slow to adapt. Use a discounted version: decay alphas and betas each round by a factor gamma < 1.
  5. Forgetting the prior matters. Beta(1,1) is uniform — fine for most problems. But if you have historical data (last month's CTR was 3%), initialize with Beta(3, 97) to avoid wasting the first few hundred pulls on exploration you don't need.
  6. Comparing regret across different arm configurations. Regret depends on the gap between the best and second-best arm. A "higher regret" run might just have harder arms, not a worse algorithm. Always compare algorithms on the same arm configuration.
  1. 在需要因果推断时使用老虎机:老虎机优化分配策略,但无法提供清晰的处理效应估计。如果监管或科学严谨性要求预注册样本量和无偏估计器,请使用A/B测试。
  2. 忽略延迟反馈:如果转化需要7天才能完成,老虎机将基于过时数据做出决策。要么缩短反馈环(使用代理指标),要么在更新前添加延迟缓冲。
  3. 臂数量过多且奖励稀疏:100个臂且转化率为1%时,大多数臂在很长时间内都不会有转化。Thompson采样仍然有效,但收敛速度极慢——考虑分层先验或神经老虎机。
  4. 非平稳奖励:如果真实概率随时间变化,标准Thompson采样的自适应速度会很慢。使用折扣版本:每轮将alphas和betas乘以一个小于1的gamma因子进行衰减。
  5. 忽略先验的重要性:Beta(1,1)是均匀分布——适用于大多数问题。但如果你有历史数据(上月点击率为3%),请用Beta(3, 97)初始化,避免在不必要的探索上浪费前几百次尝试。
  6. 在不同臂配置间比较遗憾值:遗憾值取决于最优臂与次优臂之间的差距。“更高遗憾值”的运行可能只是臂的难度更大,而非算法更差。务必在相同的臂配置下对比算法。

Worked example

示例演示

See
demo.py
(marimo notebook). It simulates a multi-armed Bernoulli bandit, runs Thompson sampling, epsilon-greedy, and UCB1 head-to-head, and shows interactive regret curves, posterior evolution, arm allocation, and a contextual bandit example. Run it with:
marimo edit --sandbox demo.py
详见
demo.py
(marimo笔记本)。它模拟了多臂伯努利老虎机,将Thompson采样、epsilon-greedy和UCB1进行对比,展示了交互式遗憾曲线、后验演化、臂分配情况以及上下文老虎机示例。运行命令:
marimo edit --sandbox demo.py