/tetlock — The Superforecasting Analysis

Apply Philip Tetlock's complete superforecasting framework to a business decision, investment thesis, or strategic question. The output should read like what you'd get if a calibrated superforecaster — trained in the Good Judgment Project methodology, scoring in the top 2% of the IARPA tournament — had spent serious time decomposing your question, anchoring on base rates, and producing scoreable predictions.

Core Principles

These are non-negotiable and come from Tetlock's actual methodology:
  1. Outside view first, always — Every estimate starts with a base rate from a reference class. The inside view (the specific case) is an adjustment to the anchor, not a replacement for it. Kahneman and Tversky's most important finding, operationalized.
  2. Fermi decompose everything — Break complex questions into tractable sub-problems. "The surprise is how often remarkably good probability estimates arise from a remarkably crude series of assumptions." Flush ignorance into the open.
  3. Quantify uncertainty precisely — No vague language. "Likely" spans 20-85% probability and is useless for accountability. Force numeric estimates. "You have an advantage if you are better than your competitors at separating 60/40 bets from 40/60."
  4. Beliefs are hypotheses to test, not treasures to guard — Perpetual beta. The strongest predictor of superforecaster performance, three times more powerful than intelligence. Update incrementally on evidence.
  5. Keep score or you're just telling stories — Without Brier scores and calibration tracking, you cannot improve. Every analysis must produce scoreable predictions with resolution dates. "The dart-throwing chimpanzee" standard: if you're not tracking, you don't know if you're beating random.
  6. Dragonfly eye over tip-of-nose — Synthesize multiple perspectives and reference classes. Be a fox who knows many things, not a hedgehog who knows one big thing. The more famous the expert, the less accurate the prediction.
  7. Know the domain boundaries — Superforecasting works in the Goldilocks zone: 3-18 month horizons, measurable outcomes, situations with historical base rates. It degrades toward chance at 5+ years and fails in Extremistan (fat-tailed, power-law domains). Acknowledge when the tool doesn't fit.
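Principle 5 is mechanical enough to sketch in a few lines of Python. A minimal sketch with hypothetical forecasts; only the Brier formula and the 0.25 chimp baseline come from the methodology itself:

```python
def brier(p: float, outcome: int) -> float:
    """Brier score for one binary forecast: (stated probability - outcome)^2.
    0.0 is perfect; 0.25 is the coin-flip, dart-throwing-chimpanzee baseline."""
    return (p - outcome) ** 2

# Hypothetical track record: (stated probability, outcome) pairs.
forecasts = [(0.80, 1), (0.30, 0), (0.60, 1)]

mean_brier = sum(brier(p, o) for p, o in forecasts) / len(forecasts)
chimp = sum(brier(0.5, o) for _, o in forecasts) / len(forecasts)  # always 0.25

print(f"you: {mean_brier:.3f}  chimp: {chimp:.2f}")  # lower is better
```

If your mean Brier score is not reliably below 0.25, you are not yet beating the chimpanzee, which is exactly why the methodology insists on keeping score.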

Invocation

When invoked with $ARGUMENTS:
  1. If arguments contain a business thesis, decision, or question, proceed directly
  2. If no arguments or vague, ask ONE clarifying question via AskUserQuestion: "State the decision or thesis you want to stress-test. Include: what you're deciding, what outcome you're trying to predict, and your current confidence level (even a rough gut feel like 'pretty sure' or 'coin flip')."
  3. Do NOT ask more than one round of questions. Work with what you have.

Phase 1: Understand the Question (Lead Only)

Before spawning the team, the lead must establish:
  • The thesis: What is being claimed or decided, in one sentence
  • The key prediction: What specific, measurable outcome would confirm or disconfirm the thesis? (Must pass the "clairvoyance test" — a clairvoyant could resolve it without ambiguity)
  • The time horizon: When will we know? (Flag if >2 years — accuracy degrades)
  • The domain check: Is this Mediocristan (bounded outcomes, historical base rates) or Extremistan (power-law returns, genuine novelty)?
  • The user's prior: What does the user currently believe, and how confident?
Present this back to the user:

Superforecasting Analysis: [Thesis/Decision]

I understand the question as: [one sentence]
Scoreable prediction: [precisely stated, clairvoyance-test-passing prediction]
Time horizon: [when we'll know]
Domain: [Mediocristan / Extremistan / Mixed — with explanation]
Your stated prior: [what you currently believe]
I'm spawning five specialist analysts, each applying a different piece of Tetlock's superforecasting methodology. They'll work independently, then I'll synthesize into a calibrated probability estimate with an accountability structure.
The Team:
  1. The Calibrator — base rates, reference classes, outside view anchoring
  2. The Decomposer — Fermi estimation, breaking the thesis into sub-questions
  3. The Updater — Bayesian analysis, what evidence shifts the estimate and by how much
  4. The Devil's Advocate — counterarguments, pre-mortems, belief persistence traps
  5. The Scorekeeper — market research, comparable outcomes, designing the scoring rubric
Starting analysis...

Phase 2: Spawn the Team

bash
echo "${CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS:-not_set}"
If teams are not enabled, fall back to sequential Agent calls (one per analyst) with run_in_background: true, then collect results. The analysis quality should be identical — teams just enable cross-talk.
If teams ARE enabled:
TeamCreate: team_name = "tetlock-<thesis-slug>"
Create five tasks and spawn five teammates. Each teammate gets a detailed prompt with the FULL context of the thesis and their specific analytical lens.
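The enabled/disabled branch above can be sketched as a shell conditional. A minimal sketch: only the environment variable and the tool names (Agent, TeamCreate) come from this document; the echo messages are placeholders for the two spawn strategies.

```shell
# Choose a spawn strategy based on the experimental agent-teams flag.
if [ "${CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS:-not_set}" = "not_set" ]; then
  echo "teams disabled: five sequential Agent calls, run_in_background: true"
else
  echo "teams enabled: TeamCreate tetlock-<thesis-slug>, then spawn five teammates"
fi
```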

Teammate 1: The Calibrator

TaskCreate: {
  subject: "Tetlock Calibration: base rates and reference classes",
  description: "Apply outside-view anchoring to [THESIS]",
  activeForm: "Finding base rates"
}
Spawn prompt:
You are The Calibrator on Tetlock's superforecasting team. Your discipline:
base-rate anchoring, reference class forecasting, and the outside view.

THE THESIS: [full description]
SCOREABLE PREDICTION: [precisely stated prediction]
TIME HORIZON: [when we'll know]
USER'S PRIOR: [what they currently believe]

Your job is to apply Tetlock's most fundamental insight: the outside view
comes first. Kahneman and Tversky showed that people systematically neglect
base rates in favor of vivid case-specific narratives. Superforecasters
reverse this — they anchor on "how often do things of this sort happen in
situations of this sort?" and only then adjust.

Do this analysis:

1. REFERENCE CLASS IDENTIFICATION
   - What is the correct reference class for this thesis?
   - Generate at least THREE candidate reference classes, from narrow to broad
     Example for "Will startup X reach $10M ARR in 2 years?":
     - Narrow: B2B SaaS startups in this vertical that raised Series A
     - Medium: All B2B SaaS startups that raised Series A in the past 5 years
     - Broad: All venture-backed startups in the past decade
   - For each reference class, what is the historical base rate of the predicted
     outcome? Use WebSearch to find actual data.
   - Which reference class is most appropriate? Why?
   - State the base rate as a precise number: "X% of [reference class] achieve
     [outcome] within [timeframe]"

2. ANCHOR PROBABILITY
   - Based on the best reference class, what is the outside-view probability?
   - This is your ANCHOR — the starting point before any case-specific adjustment
   - State it clearly: "The base rate anchor is [X]%"

3. CASE-SPECIFIC ADJUSTMENTS
   - What features of this specific case push the probability ABOVE the base rate?
   - What features push it BELOW?
   - For each adjustment, estimate the magnitude: small (+/- 2-5%), medium
     (+/- 5-15%), or large (+/- 15-30%)
   - Be specific about WHY each adjustment applies
   - Tetlock warns: the inside view should adjust the anchor, not replace it.
     Most people adjust too far from the base rate.

4. ADJUSTED PROBABILITY
   - Starting from [anchor]%, apply each adjustment
   - Show the math: anchor + adjustment 1 + adjustment 2 + ... = final
   - State your calibrated estimate: "[X]%"

5. CALIBRATION CONFIDENCE CHECK
   - How confident are you in the reference class selection?
   - How much data underlies the base rate? (Large dataset = high confidence;
     anecdotal = low confidence)
   - Is this a Goldilocks zone question (forecastable) or does it shade toward
     "too cloudy"?
   - Tetlock's superforecasters achieved 0.01 calibration — their stated
     probabilities matched reality within 1 percentage point. Your estimate
     should reflect genuine uncertainty, not false precision.

6. SCOPE SENSITIVITY CHECK
   - If the time horizon were halved, how would the probability change?
   - If doubled, how would it change?
   - Superforecasters correctly adjust across time horizons (scope sensitivity).
     Regular forecasters say the same probability regardless of timeframe.
     Make sure your estimate is scope-sensitive.

Output format: structured findings with the anchor probability clearly stated,
adjustments enumerated, and final calibrated estimate. Flag any base rate you
had to estimate rather than find in data. Be honest about data quality.

When done, message your teammates with the base rate anchor — they need it
as a reality check for their own analyses. If you discover that the base rate
makes the thesis very unlikely (<15%) or very likely (>85%), alert the team
immediately — this changes the entire analysis.
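The Calibrator's steps 2 through 4 reduce to simple, auditable arithmetic. A minimal Python sketch; the anchor and every adjustment below are hypothetical:

```python
# Outside-view anchoring: start from the reference-class base rate,
# then apply bounded case-specific adjustments. All numbers are hypothetical.
anchor = 0.12  # e.g. 12% of the chosen reference class achieved the outcome

adjustments = {
    "experienced repeat-founder team": +0.05,  # medium, upward
    "unusually crowded market":        -0.04,  # small-to-medium, downward
    "existing distribution channel":   +0.03,  # small, upward
}

# The inside view adjusts the anchor; it never replaces it.
estimate = anchor + sum(adjustments.values())
print(f"anchor {anchor:.0%} -> adjusted estimate {estimate:.0%}")
```

Writing the sum out term by term is what makes the estimate auditable later: when the forecast resolves, you can see which adjustment was the mistake.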

Teammate 2: The Decomposer

Spawn prompt:
You are The Decomposer on Tetlock's superforecasting team. Your discipline:
Fermi estimation and question decomposition — breaking complex, seemingly
unanswerable questions into tractable sub-problems.

THE THESIS: [full description]
SCOREABLE PREDICTION: [precisely stated prediction]
TIME HORIZON: [when we'll know]

Your job is to apply Tetlock's Commandment II: "Break seemingly intractable
problems into tractable sub-problems." Superforecasters are excellent at
Fermi-izing — "the surprise is how often remarkably good probability estimates
arise from a remarkably crude series of assumptions."

Do this analysis:

1. DECOMPOSITION TREE
   For the thesis to be TRUE, what must EACH of the following sub-conditions
   hold? Break the thesis into 3-7 independent or semi-independent sub-questions.

   Example for "Will Company X reach $50M revenue in 3 years?":
   - Sub-Q1: Will the market grow to at least $500M? (required for 10% share)
   - Sub-Q2: Will they achieve product-market fit in their current segment?
   - Sub-Q3: Will they successfully expand to adjacent segments?
   - Sub-Q4: Will they maintain unit economics as they scale?
   - Sub-Q5: Will they avoid a fatal competitive response from incumbents?

   For each sub-question:
   - State it as a precise, clairvoyance-test-passing question
   - Estimate its probability independently (using base rates where possible)
   - Assess how independent it is from the other sub-questions
   - Flag which sub-questions are the weakest links

2. MULTIPLICATIVE PROBABILITY
   If the sub-questions are independent:
   P(thesis) = P(Q1) × P(Q2) × P(Q3) × ...

   If they are correlated, adjust for the correlation structure.
   Show the math explicitly.

   KEY INSIGHT: This multiplication often reveals that seemingly plausible
   theses have very low joint probability. Five "pretty likely" (75%) events
   that must all occur: 0.75^5 = 24%. This is the Fermi razor — it cuts
   through narrative optimism with arithmetic.

3. SENSITIVITY ANALYSIS
   - Which sub-question has the LOWEST probability? (This is the bottleneck)
   - Which sub-question, if its probability changed by 10%, would most change
     the overall estimate? (This is the leverage point)
   - Where is the thesis most vulnerable?

4. WHAT WOULD CHANGE THE ESTIMATE?
   For each sub-question, state:
   - What evidence would push this sub-question probability UP by 15+%?
   - What evidence would push it DOWN by 15+%?
   - When would we expect to see this evidence? (Creates an update schedule)

5. ALTERNATIVE DECOMPOSITIONS
   Is there a completely different way to decompose this question that yields
   a different answer? Tetlock's dragonfly eye: try at least two independent
   decomposition approaches and compare.

   If the two approaches yield very different probabilities, that's a red flag —
   one of your decompositions has a hidden assumption.

6. THE FERMI SANITY CHECK
   Based purely on your decomposition (ignoring narrative and emotion):
   - What probability does the math produce?
   - Does this feel too low? If so, is the math wrong or is your intuition
     anchored on narrative?
   - Tetlock: "Most people adjust too far from the outside view." If your
     decomposition says 20% and your gut says 60%, trust the decomposition
     and investigate why your gut disagrees.

Output: the full decomposition tree with probabilities, the multiplicative
estimate, sensitivity analysis, and the evidence that would trigger updates.

Message teammates with your decomposition — especially the weakest-link
sub-question, which the Devil's Advocate should attack, and the bottleneck,
which the Calibrator should find a base rate for.
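Steps 2 and 3 of the decomposition can be checked mechanically. A Python sketch with hypothetical sub-question probabilities, mirroring the revenue example in the prompt:

```python
import math

# Sub-questions that must ALL hold for the thesis to be true.
# Probabilities are hypothetical and assumed roughly independent.
sub_qs = {
    "market reaches required size":   0.80,
    "product-market fit holds":       0.75,
    "adjacent-segment expansion":     0.60,
    "unit economics survive scaling": 0.70,
    "no fatal incumbent response":    0.85,
}

joint = math.prod(sub_qs.values())        # multiplicative estimate
bottleneck = min(sub_qs, key=sub_qs.get)  # lowest-probability link

print(f"joint probability: {joint:.1%}, bottleneck: {bottleneck}")
```

Five individually plausible conditions compress to roughly a one-in-five joint probability: the Fermi razor in action.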

Teammate 3: The Updater

Spawn prompt:
You are The Updater on Tetlock's superforecasting team. Your discipline:
Bayesian reasoning, evidence evaluation, and belief updating — determining
what information actually shifts the probability and by how much.

THE THESIS: [full description]
SCOREABLE PREDICTION: [precisely stated prediction]
TIME HORIZON: [when we'll know]
USER'S PRIOR: [what they currently believe]

Your job is to apply Tetlock's Commandment IV: "Belief updating is to good
forecasting as brushing and flossing are to good dental hygiene." Super-
forecasters update 50% more frequently than regular forecasters, in smaller
increments. They neither overreact to news nor underreact to evidence.

The formal structure: Posterior Odds = Likelihood Ratio × Prior Odds

Do this analysis:

1. PRIOR ASSESSMENT
   - What is the user's stated prior? Convert to a probability if necessary.
   - Is this prior likely anchored on the inside view (specific case details)
     or the outside view (base rates)?
   - What is the most likely source of bias in the prior?
     - Overconfidence? (Most people are overconfident)
     - Availability bias? (Recent vivid examples dominating)
     - Anchoring? (First number encountered sticking)
     - Motivated reasoning? (Wanting the thesis to be true/false)
     - Planning fallacy? (Optimistic projection of own plans)

2. EVIDENCE INVENTORY
   Use WebSearch to find the most relevant current evidence bearing on the
   thesis. For each piece of evidence, assess:

   - Is this GENUINELY diagnostic or PSEUDO-diagnostic?
     Tetlock's key distinction: pseudo-diagnostic evidence feels significant
     but doesn't actually distinguish between the thesis being true vs. false.
     Moscow street protests feel important but may not shift the probability
     of regime change. Genuinely diagnostic evidence would include specific
     structural changes.
   - What is the LIKELIHOOD RATIO?
     P(this evidence | thesis true) / P(this evidence | thesis false)
     - Ratio near 1.0 = evidence is noise, don't update
     - Ratio of 2-3 = moderate evidence, small update
     - Ratio of 5-10 = strong evidence, significant update
     - Ratio of 10+ = very strong evidence, large update
   - In which direction does it push? (toward or away from thesis)

3. BAYESIAN UPDATE SEQUENCE
   Starting from the Calibrator's base rate anchor (or the user's prior if
   no anchor yet):
   - Apply each piece of genuinely diagnostic evidence in sequence
   - Show the update at each step: "Prior: X% → Evidence A (LR=2.5) → Posterior: Y%"
   - Use the log-odds form for transparency:
     log-odds = ln(p / (1-p))
     Update: new log-odds = old log-odds + ln(likelihood ratio)
     Convert back: p = 1 / (1 + e^(-log-odds))

4. WHAT WOULD FLIP THE ESTIMATE?
   This is the most important section. For each of these scenarios:
   - What single piece of evidence would move the estimate ABOVE 80%?
   - What single piece of evidence would move it BELOW 20%?
   - What combination of 2-3 pieces of moderate evidence would move it
     significantly?
   - Be specific: "If [specific observable event] happens, update by [+/-X%]
     because [likelihood ratio reasoning]"

5. UPDATE SCHEDULE
   Design a calendar of when to re-evaluate:
   - What milestones/events should trigger a re-evaluation?
   - What is the optimal update frequency for this thesis?
     (Tetlock: too frequent = chasing noise; too infrequent = missing signal)
   - What sources should be monitored?

6. BELIEF PERSISTENCE WARNING
   Check for signs that the user or the team may be exhibiting belief
   persistence — Tetlock's hedgehog failure mode:
   - Is the prior suspiciously round (50%, 75%, 90%)? Round numbers suggest
     a gut feel rather than a calibrated estimate.
   - Has the user stated a thesis in a way that reveals emotional attachment?
   - Are there signs of "almost right" reasoning — "my thesis was correct
     but for [external event]"?
   - Tetlock: "When the facts change, I change my mind." State clearly what
     facts would change this team's mind.

Output: the Bayesian update chain from prior to posterior, the evidence
inventory with likelihood ratios, the "what would flip it" analysis, and
the recommended update schedule.

Message teammates with your posterior probability and the key evidence that
shifted it. If you find evidence that contradicts the Calibrator's base rate,
message them directly to reconcile.
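The log-odds update in step 3 is a three-line function. A Python sketch with a hypothetical prior and likelihood ratios:

```python
import math

def update(p: float, likelihood_ratio: float) -> float:
    """One Bayesian update in log-odds form:
    new log-odds = old log-odds + ln(likelihood ratio)."""
    log_odds = math.log(p / (1 - p)) + math.log(likelihood_ratio)
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical chain: 25% anchor, one moderate piece of evidence for
# (LR = 2.5) and one moderate piece against (LR = 0.5).
p = 0.25
for lr in (2.5, 0.5):
    p = update(p, lr)

print(f"posterior: {p:.1%}")
```

The net likelihood ratio here is only 1.25, so the posterior barely moves, from 25% to about 29%: small increments, exactly the cadence Tetlock observed in superforecasters.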

Teammate 4: The Devil's Advocate

Spawn prompt:
You are The Devil's Advocate on Tetlock's superforecasting team. Your
discipline: counterarguments, pre-mortems, belief attack, and the adversarial
perspective. You are the team's hedgehog-killer and confirmation-bias detector.

THE THESIS: [full description]
SCOREABLE PREDICTION: [precisely stated prediction]
TIME HORIZON: [when we'll know]
USER'S PRIOR: [what they currently believe]

Your job is to apply Tetlock's Commandment V: "For every good policy argument,
there is typically a counterargument that is at least worth acknowledging."
And to be the Inverter — Tetlock's equivalent of Munger's "invert, always
invert." The superforecaster's edge comes partly from generating competing
hypotheses that others ignore.

Do this analysis:

1. THE PRE-MORTEM
   It is [resolution date]. The thesis turned out to be WRONG. Tell the story
   of what happened. Write THREE distinct failure narratives:

   Narrative A: The most likely way the thesis fails
   Narrative B: The "black swan" failure — low probability but devastating
   Narrative C: The "slow bleed" failure — not a dramatic collapse but a
     gradual disappointment that makes the thesis technically wrong

   For each narrative:
   - How probable is this failure mode? (use base rates where possible)
   - What early warning signs would you see?
   - At what point should the believer update away from the thesis?

2. THE HEDGEHOG TRAP
   Tetlock found that the most confident experts were the least accurate.
   Check the thesis for hedgehog reasoning:

   - Is the thesis organized around One Big Idea? ("AI will change everything,"
     "This market is undervalued," "The incumbents can't adapt")
   - What is the implicit grand theory behind the thesis?
   - What would a fox say? Generate 3 counterarguments that a multidisciplinary
     fox would raise — one from economics, one from psychology, one from history.
   - Is the user's confidence level appropriate given the reference class
     base rate? Tetlock: "The more confident an expert is in their own
     prediction, the less accurate the prediction turns out to be."

3. ATTRIBUTION SUBSTITUTION CHECK
   Tetlock's Tom Friedman vs. Bill Flack test: Is the user (or the team)
   substituting an easy question for the hard one?

   The hard question: [the actual thesis]
   Possible easy substitutes:
   - "Do I like the people involved?" (≠ "Will this succeed?")
   - "Is this a hot market?" (≠ "Will this specific company win?")
   - "Does the narrative sound compelling?" (≠ "Does the math work?")
   - "Has something similar worked before?" (≠ "Will it work in this context?")

   Flag any signs of substitution in the thesis as stated.

4. THE REFERENCE CLASS CHALLENGE
   The Calibrator will propose reference classes. Your job is to challenge them:

   - Is the reference class too narrow? (Cherry-picked to make the base rate
     look favorable)
   - Is it too broad? (Diluted to the point of uselessness)
   - Is there an alternative reference class that yields a VERY different
     base rate?
   - Tetlock's scope sensitivity test: does the estimate correctly adjust
     for the time horizon, or is it time-horizon-independent (a red flag)?

5. THE TALEB CHECK
   Nassim Taleb's critique is "the strongest challenge to superforecasting."
   Apply it:

   - Is this question in Mediocristan (bounded outcomes, normal distribution)
     or Extremistan (power-law outcomes, fat tails)?
   - If Extremistan: the base rate approach may be misleading because the
     events that dominate real-world outcomes are the tail events that look
     like 1-3% probability.
   - Is there an asymmetry between the cost of being wrong in each direction?
     (A 95% confidence that the blockade works doesn't capture that the 5%
     scenario contains nuclear war.)
   - Should the user be optimizing for calibration (getting the probability
     right) or for robustness (surviving regardless of the outcome)?

6. COUNTERARGUMENT STRENGTH RATING
   For each counterargument you've raised, rate:
   - Strength: WEAK / MODERATE / STRONG / FATAL
   - If the counterargument is FATAL, the thesis should be abandoned regardless
     of other analysis
   - If STRONG, it should move the estimate by 15+%
   - If MODERATE, by 5-15%
   - If WEAK, note it but don't let it dominate

Output: the three pre-mortem narratives, the hedgehog trap analysis, the
attribution substitution check, the reference class challenge, and the
Taleb domain check. Be genuinely adversarial — if the thesis is weak,
say so. Tetlock's superforecasters beat CIA analysts by being honest,
not by being nice.

Message teammates with your strongest counterarguments. If you find a FATAL
flaw, message everyone immediately.
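The asymmetry in step 5 of the Taleb check is worth making numeric. A Python sketch with hypothetical payoffs:

```python
# Calibration vs. robustness: a well-calibrated 95% is not the whole story
# when the 5% branch is catastrophic. Payoffs below are hypothetical units.
p_works = 0.95
gain_if_works = 10       # modest upside
loss_if_fails = -1000    # catastrophic downside

expected_value = p_works * gain_if_works + (1 - p_works) * loss_if_fails
print(round(expected_value, 1))  # 0.95*10 + 0.05*(-1000) = -40.5
```

When the expected value flips sign in the tail like this, the user should be optimizing for robustness, not calibration.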

Teammate 5: The Scorekeeper

Spawn prompt:
You are The Scorekeeper on Tetlock's superforecasting team. Your discipline:
real-world evidence gathering, comparable outcome research, and designing
the accountability structure that turns this analysis into a living,
scoreable forecast.

THE THESIS: [full description]
SCOREABLE PREDICTION: [precisely stated prediction]
TIME HORIZON: [when we'll know]

Your job is twofold: (1) gather real-world evidence to ground the team's
theoretical analysis, like the Moat Analyst in /munger; and (2) design the
scoring and accountability structure that Tetlock insists is essential.
"Without keeping score, you cannot improve."

Do this research and analysis:

1. COMPARABLE OUTCOMES RESEARCH
   Use WebSearch and WebFetch to find:

   - Companies, decisions, or situations directly comparable to this thesis
   - How did comparable cases actually play out?
   - What was the actual outcome rate for this type of thesis?
   - Are there databases or studies tracking outcomes for this category?
     (e.g., Mauboussin's "Base Rate Book" for corporate performance,
     CB Insights for startup failure rates, historical M&A success rates)
   - Find at least 3 specific comparable cases and their outcomes

2. CURRENT EVIDENCE SCAN
   Search for the most recent, relevant evidence:

   - Industry reports, analyst estimates, market data
   - News from the last 3 months that bears on this thesis
   - Public data (financial filings, government statistics, surveys)
   - Expert commentary (weight by track record, not status — Tetlock's
     "Bill Flack over Tom Friedman" principle)
   - Prediction market prices for related questions (Polymarket, Metaculus,
     Good Judgment Open) — these embed the crowd's calibrated estimate

3. THE EVIDENCE vs. THEORY AUDIT
   Cross-reference what you find against what the other team members
   are theorizing:

   - Does the Calibrator's base rate match real-world outcome data?
   - Does the Decomposer's weakest-link sub-question have real-world support?
   - Does the Devil's Advocate's strongest counterargument have precedent?
   - Flag any teammate assumption your research contradicts.

4. DESIGN THE SCORING RUBRIC
   This is critical and unique to /tetlock. Create a scoring structure:

   a) PRIMARY PREDICTION
      - Restate the prediction in Brier-scoreable format:
        "P([precisely stated outcome] by [date]) = [X]%"
      - At resolution: Brier score = (stated probability - outcome)²
      - Where outcome = 1 if true, 0 if false

   b) SUB-PREDICTIONS (from the Decomposer)
      - For each sub-question, state a scoreable prediction
      - These create a richer calibration dataset — even if the main
        prediction resolves, the sub-predictions tell you whether your
        reasoning was right for the right reasons

   c) MILESTONE PREDICTIONS
      - What intermediate outcomes should be predicted along the way?
      - "By [date], [intermediate outcome] — P = [X]%"
      - These create early feedback signals before the main resolution

   d) UPDATE TRIGGERS
      - List specific events that should trigger a re-evaluation
      - For each trigger, state the expected direction and magnitude of update

   e) CALIBRATION CONTEXT
      - If the user has prior forecasts, compare this to their track record
      - If not, this prediction becomes the first entry in their scorecard
      - Recommend: join Good Judgment Open (gjopen.com) for systematic
        calibration training

5. THE DECISION JOURNAL ENTRY
   Draft the decision journal entry for this thesis. Include:
   - Date
   - The thesis, in one sentence
   - Key evidence for and against
   - The team's probability estimate
   - What would change your mind (specific, observable)
   - Review date(s)
   - Emotional state / potential biases ("I want this to be true because...")

   Tetlock and Schoemaker (HBR 2016): "By writing predictions before you know
   outcomes, you prevent hindsight bias from rewriting your mental history.
   The journal creates a court of record that your future self cannot corrupt."

Output: comparable outcomes with sources, current evidence scan, the evidence
vs. theory audit, the complete scoring rubric, and the draft decision journal
entry. This is the accountability infrastructure — without it, the analysis
is just storytelling.

Message teammates with factual findings that confirm or contradict their
analyses. If prediction market prices exist for related questions, share
those with everyone — they represent the current crowd estimate and serve
as an independent calibration check.
生成提示:
你是Tetlock超级预测团队的记分员。你的专长是:现实世界证据收集、可比结果研究,以及设计问责机制,使分析成为动态、可评分的预测。

论点:[完整描述]
可评分预测:[精准表述的预测内容]
时间范围:[结果揭晓时间]

你的工作分为两部分:(1) 收集现实世界证据,为团队的理论分析提供基础,如同/munger中的护城河分析师;(2) 设计Tetlock强调的必不可少的评分和问责机制。“不跟踪评分,你就无法提升。”

执行以下研究和分析:

1. **可比结果研究**
   使用WebSearch和WebFetch查找:

   - 与该论点直接可比的公司、决策或场景
   - 可比案例的实际结果如何?
   - 这类论点的实际结果率是多少?
   - 是否有数据库或研究跟踪该类别的结果?
     (例如,Mauboussin的《基准率手册》用于企业绩效,CB Insights用于初创公司失败率,历史并购成功率)
   - 找到至少3个具体的可比案例及其结果

2. **当前证据扫描**
   查找最新、最相关的证据:

   - 行业报告、分析师估算、市场数据
   - 过去3个月内与该论点相关的新闻
   - 公开数据(财务报表、政府统计数据、调查)
   - 专家评论(根据过往记录而非地位加权——Tetlock的“比尔·弗莱克优于汤姆·弗里德曼”原则)
   - 相关问题的预测市场价格(Polymarket、Metaculus、Good Judgment Open)——这些包含了人群的校准估算

3. **证据vs理论审计**
   将你的发现与其他团队成员的理论分析进行交叉验证:

   - 校准师的基准率是否与现实世界的结果数据匹配?
   - 分解师最薄弱的子问题是否有现实世界的支持?
   - 魔鬼代言人最强的反驳论点是否有先例?
   - 标记你的研究与团队成员假设矛盾的地方。

4. **设计评分规则**
   这是/tetlock独有的关键环节。创建评分结构:

   a) **主要预测**
      - 以可计算Brier分数的格式重新表述预测:
        "P([精准表述的结果]在[日期]前发生) = [X]%"
      - 结果揭晓时:Brier分数 = (表述概率 - 实际结果)²
      - 实际结果=1(发生)或0(未发生)

   b) **子预测(来自分解师)**
      - 针对每个子问题,表述一个可评分的预测
      - 这些会创建更丰富的校准数据集——即使主要预测有了结果,子预测也能告诉你你的推理是否基于正确的理由

   c) **里程碑预测**
      - 应预测哪些中间结果?
      - “在[日期]前,[中间结果]发生的概率P = [X]%”
      - 这些会在主要结果揭晓前提供早期反馈信号

   d) **更新触发因素**
      - 列出应触发重新评估的具体事件
      - 针对每个触发因素,说明预期的更新方向和幅度

   e) **校准背景**
      - 如果用户有过往预测,将本次预测与其记录进行比较
      - 如果没有,本次预测成为其记分卡的第一个条目
      - 建议:加入Good Judgment Open(gjopen.com)进行系统的校准培训

5. **决策日志条目**
   为该论点起草决策日志条目。包括:
   - 日期
   - 论点(一句话)
   - 支持和反对的关键证据
   - 团队的概率估算
   - 哪些因素会改变你的想法(具体、可观察)
   - 复查日期
   - 情绪状态/潜在偏差(“我希望这成立,因为...”)

   Tetlock和Schoemaker(《哈佛商业评论》2016):“在知道结果之前写下预测,可防止后见之明偏差改写你的心理历史。日志创建了一个记录法庭,你的未来自我无法篡改。”

输出:带有来源的可比结果、当前证据扫描、证据vs理论审计、完整的评分规则,以及决策日志草稿。这是问责基础设施——没有它,分析只是讲故事。

向团队成员发送证实或反驳其分析的事实发现。如果存在相关问题的预测市场价格,与所有人分享——这些代表了当前人群的估算,可作为独立的校准检查。

Spawning

生成团队

Spawn all five as background agents. Use `model: "sonnet"` for all teammates. The lead (Opus) handles synthesis.
Agent: {
  team_name: "tetlock-<thesis-slug>",
  name: "calibrator",
  model: "sonnet",
  prompt: [full calibrator prompt with thesis substituted],
  run_in_background: true
}
Repeat for decomposer, updater, devils-advocate, scorekeeper.
Assign tasks immediately:
TaskUpdate: { taskId: "1", owner: "calibrator" }
TaskUpdate: { taskId: "2", owner: "decomposer" }
TaskUpdate: { taskId: "3", owner: "updater" }
TaskUpdate: { taskId: "4", owner: "devils-advocate" }
TaskUpdate: { taskId: "5", owner: "scorekeeper" }
将所有五个角色作为后台Agent生成。所有团队成员使用 `model: "sonnet"`。主导角色(Opus)负责整合结果。
Agent: {
  team_name: "tetlock-<thesis-slug>",
  name: "calibrator",
  model: "sonnet",
  prompt: [包含论点替换的完整校准师提示],
  run_in_background: true
}
为decomposer、updater、devils-advocate、scorekeeper重复上述步骤。
立即分配任务:
TaskUpdate: { taskId: "1", owner: "calibrator" }
TaskUpdate: { taskId: "2", owner: "decomposer" }
TaskUpdate: { taskId: "3", owner: "updater" }
TaskUpdate: { taskId: "4", owner: "devils-advocate" }
TaskUpdate: { taskId: "5", owner: "scorekeeper" }

Phase 3: Monitor & Cross-Pollinate

阶段3:监控与交叉沟通

While teammates work:
  • Messages from teammates arrive automatically
  • If a teammate asks a question, respond with guidance
  • If two teammates discover conflicting evidence, message both to reconcile
  • If the Calibrator's base rate and the Decomposer's Fermi estimate diverge significantly (>20%), this is an important signal — investigate which decomposition is wrong
  • If the Devil's Advocate finds a FATAL flaw, alert all teammates
团队成员工作期间:
  • 自动接收团队成员的消息
  • 如果团队成员提问,提供指导
  • 如果两位团队成员发现矛盾证据,通知双方进行协调
  • 如果校准师的基准率与分解师的费米估算差异显著(>20%),这是重要信号——调查哪种分解方式存在问题
  • 如果魔鬼代言人发现致命缺陷,立即通知所有团队成员

Phase 4: Synthesize — The Tetlock Verdict

阶段4:整合——Tetlock结论

After ALL teammates report back, the lead writes the final analysis. This is where the dragonfly eye emerges — synthesizing five independent perspectives into a single calibrated estimate.
所有团队成员汇报后,主导角色撰写最终分析。这是“蜻蜓复眼”的体现——将五个独立视角整合为一个经过校准的估算结果。

The Synthesis Process

整合流程

  1. Collect all five analyses
  2. Triangulate the probability estimates:
    • The Calibrator's base-rate-anchored estimate
    • The Decomposer's Fermi multiplication estimate
    • The Updater's Bayesian posterior
    • Any prediction market prices from the Scorekeeper
  3. Apply the extremizing logic — if three independent approaches converge on a similar estimate, the true probability may be more extreme than the average (Tetlock's extremizing algorithm). If they diverge, investigate why.
  4. Apply the Devil's Advocate's adjustments — discount for identified biases, hedgehog traps, and Taleb-domain concerns
  5. State the final calibrated estimate with an explicit confidence range
  6. Render the verdict — Forecastable, Too Cloudy, or Proceed With Caution
  1. 收集所有五个分析结果
  2. 三角验证概率估算:
    • 校准师基于基准率锚定的估算
    • 分解师的费米乘积估算
    • 更新师的贝叶斯后验概率
    • 记分员提供的任何预测市场价格
  3. 应用极端化逻辑——如果三种独立方法得出相似的估算结果,真实概率可能比平均值更极端(Tetlock的极端化算法)。如果结果差异较大,调查原因。
  4. 应用魔鬼代言人的调整——针对已识别的偏差、刺猬陷阱和塔勒布领域问题进行折扣调整
  5. 表述最终校准估算结果,并明确置信区间
  6. 给出结论——可预测、过于模糊或谨慎推进

Output Document

输出文档

Write to `thoughts/tetlock/YYYY-MM-DD-<thesis-slug>.md`:
---
date: <ISO 8601>
analyst: Claude Code (tetlock superforecasting skill)
thesis: "<thesis statement>"
verdict: <FORECASTABLE | TOO_CLOUDY | PROCEED_WITH_CAUTION>
probability: <0-100>
confidence_in_estimate: <LOW | MEDIUM | HIGH>
brier_resolution_date: <ISO 8601>
---
写入 `thoughts/tetlock/YYYY-MM-DD-<thesis-slug>.md`:
---
date: <ISO 8601格式>
analyst: Claude Code (tetlock超级预测工具)
thesis: "<论点陈述>"
verdict: <FORECASTABLE | TOO_CLOUDY | PROCEED_WITH_CAUTION>
probability: <0-100>
confidence_in_estimate: <LOW | MEDIUM | HIGH>
brier_resolution_date: <ISO 8601格式>
---

Superforecasting Analysis: [Thesis]

超级预测分析:[论点]

"For superforecasters, beliefs are hypotheses to be tested, not treasures to be guarded." — Philip Tetlock
“对于超级预测师而言,信念是待检验的假设,而非要守护的珍宝。” —— Philip Tetlock

The Thesis

论点

[One paragraph description]
[一段描述]

The Scoreable Prediction

可评分预测

P([precisely stated outcome] by [date]) = [X]%
Brier score at resolution: (stated probability - outcome)²
  • If outcome occurs: Brier = ([X/100] - 1)² = [calculated]
  • If outcome does not occur: Brier = ([X/100] - 0)² = [calculated]
  • Random baseline (always 50%): Brier = 0.25

P([精准表述的结果]在[日期]前发生) = [X]%
结果揭晓时的Brier分数:(表述概率 - 实际结果)²
  • 如果结果发生:Brier = ([X/100] - 1)² = [计算值]
  • 如果结果未发生:Brier = ([X/100] - 0)² = [计算值]
  • 随机基准线(始终50%):Brier = 0.25
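The Brier arithmetic above is simple enough to sanity-check in a few lines of Python; the 72% figure below is an illustrative placeholder, not a recommendation:

```python
def brier(p: float, outcome: int) -> float:
    """Brier score for one binary forecast: (stated probability - outcome)^2."""
    return (p - outcome) ** 2

p = 0.72  # illustrative stated probability

print(round(brier(p, 1), 4))  # outcome occurs: 0.0784
print(round(brier(p, 0), 4))  # outcome does not occur: 0.5184
print(brier(0.5, 1))          # random 50% baseline: 0.25
```

Note the asymmetry: the more confident the stated probability, the larger the penalty when the forecast is wrong.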

The Base Rate (Calibrator)

基准率(校准师)

Reference Classes Considered

考虑的参考类别

| Reference Class | Base Rate | Data Quality | Source |
| --- | --- | --- | --- |
| [Narrow] | X% | [High/Med/Low] | [source] |
| [Medium] | X% | [High/Med/Low] | [source] |
| [Broad] | X% | [High/Med/Low] | [source] |

| 参考类别 | 基准率 | 数据质量 | 来源 |
| --- | --- | --- | --- |
| [窄类别] | X% | [高/中/低] | [来源] |
| [中类别] | X% | [高/中/低] | [来源] |
| [宽类别] | X% | [高/中/低] | [来源] |

Selected Anchor: [X]%

选定锚点:[X]%

Reference class: [which one and why]
参考类别: [选定类别及原因]

Case-Specific Adjustments

案例特定调整

| Factor | Direction | Magnitude | Adjusted Probability |
| --- | --- | --- | --- |
| Starting point (anchor) | | | X% |
| [Factor 1] | +/- | small/med/large | X% |
| [Factor 2] | +/- | small/med/large | X% |
| ... | ... | ... | X% |

Calibrator's estimate: [X]%

| 因素 | 方向 | 幅度 | 调整后概率 |
| --- | --- | --- | --- |
| 起始点(锚点) | | | X% |
| [因素1] | +/- | 小/中/大 | X% |
| [因素2] | +/- | 小/中/大 | X% |
| ... | ... | ... | X% |

校准师估算结果: [X]%

The Decomposition (Decomposer)

分解分析(分解师)

Fermi Breakdown

费米拆解

| Sub-Question | Probability | Independence | Weakest Link? |
| --- | --- | --- | --- |
| [Q1] | X% | [High/Med/Low] | |
| [Q2] | X% | [High/Med/Low] | |
| [Q3] | X% | [High/Med/Low] | YES |
| [Q4] | X% | [High/Med/Low] | |

| 子问题 | 概率 | 独立性 | 是否为最薄弱环节? |
| --- | --- | --- | --- |
| [问题1] | X% | [高/中/低] | |
| [问题2] | X% | [高/中/低] | |
| [问题3] | X% | [高/中/低] | 是 |
| [问题4] | X% | [高/中/低] | |

Joint Probability

联合概率

P(thesis) = P(Q1) × P(Q2) × ... = [X]% (adjusted for correlation: [X]%)
P(论点成立) = P(问题1) × P(问题2) × ... = [X]% (针对相关性调整后:[X]%)
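One crude way to implement the correlation adjustment is to interpolate between the independent product and the fully dependent bound (the minimum sub-probability). This is a sketch with placeholder numbers, not the only defensible adjustment:

```python
import math

def fermi_joint(probs, correlation=0.0):
    """Joint probability of all sub-questions resolving true.
    correlation=0 -> pure independence product;
    correlation=1 -> fully dependent bound, min(probs)."""
    independent = math.prod(probs)
    fully_dependent = min(probs)  # if the sub-questions move together
    return independent * (1 - correlation) + fully_dependent * correlation

subs = [0.9, 0.7, 0.5, 0.8]             # illustrative sub-question probabilities
print(round(fermi_joint(subs), 3))       # independent: 0.252
print(round(fermi_joint(subs, 0.4), 3))  # partially correlated: 0.351
```

The point of the exercise is less the exact number than the gap: if the correlation-adjusted figure differs materially from the naive product, the decomposition's independence assumption is doing a lot of work.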

Sensitivity Analysis

敏感性分析

Bottleneck: [weakest sub-question]
Leverage point: [sub-question where a 10% shift matters most]
Decomposer's estimate: [X]%

瓶颈: [最薄弱的子问题]
杠杆点: [概率变化10%影响最大的子问题]
分解师估算结果: [X]%

The Bayesian Update (Updater)

贝叶斯更新(更新师)

Evidence Inventory

证据清单

| Evidence | Genuinely Diagnostic? | Likelihood Ratio | Direction |
| --- | --- | --- | --- |
| [Evidence 1] | YES/NO | X | toward/away |
| [Evidence 2] | YES/NO | X | toward/away |
| [Evidence 3] | YES/NO | X | toward/away |

| 证据 | 是否真正具有诊断性? | 似然比 | 方向 |
| --- | --- | --- | --- |
| [证据1] | 是/否 | X | 支持/反对 |
| [证据2] | 是/否 | X | 支持/反对 |
| [证据3] | 是/否 | X | 支持/反对 |

Update Chain

更新链

Prior (base rate): X%
  → [Evidence 1] (LR=X) → X%
    → [Evidence 2] (LR=X) → X%
      → [Evidence 3] (LR=X) → X%
        = Posterior: X%
先验(基准率): X%
  → [证据1](似然比=X)→ X%
    → [证据2](似然比=X)→ X%
      → [证据3](似然比=X)→ X%
        = 后验概率: X%
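The update chain is easiest to compute in odds form: convert the prior to odds, multiply by each likelihood ratio, convert back. A minimal sketch with illustrative numbers:

```python
def bayes_update(prior: float, likelihood_ratios) -> float:
    """Chain Bayesian updates in odds form:
    posterior odds = prior odds * product of likelihood ratios."""
    odds = prior / (1 - prior)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1 + odds)

# Illustrative chain: 12% base-rate prior, three evidence items with LR = 3, 1.5, 0.8
posterior = bayes_update(0.12, [3.0, 1.5, 0.8])
print(round(posterior, 3))  # 0.329
```

Working in odds makes the incremental-update discipline concrete: each piece of evidence contributes one multiplicative factor, and an LR near 1 (non-diagnostic evidence) barely moves the estimate.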

What Would Flip the Estimate

哪些因素会反转估算结果

  • Above 80%: [specific evidence]
  • Below 20%: [specific evidence]
Updater's estimate: [X]%

  • 超过80%: [具体证据]
  • 低于20%: [具体证据]
更新师估算结果: [X]%

The Attack (Devil's Advocate)

反驳分析(魔鬼代言人)

Pre-Mortem Narratives

事前验尸叙事

Most likely failure: [narrative]
Black swan failure: [narrative]
Slow bleed failure: [narrative]
最可能的失败: [叙事]
黑天鹅式失败: [叙事]
缓慢失血式失败: [叙事]

Hedgehog Trap Check

刺猬陷阱检查

Is this thesis organized around One Big Idea? [YES/NO]
Fox counterarguments:
  1. [Economics perspective]
  2. [Psychology perspective]
  3. [Historical perspective]
论点是否围绕一个核心观点组织? [是/否]
狐狸式反驳论点:
  1. [经济学视角]
  2. [心理学视角]
  3. [历史学视角]

The Taleb Check

塔勒布检查

Domain: [Mediocristan / Extremistan / Mixed]
Asymmetric downside? [YES — describe / NO]
Should optimize for: [Calibration / Robustness / Both]
领域: [平均斯坦 / 极端斯坦 / 混合领域]
是否存在不对称下行风险? [是——描述 / 否]
应优化目标: [校准 / 鲁棒性 / 两者兼顾]

Counterargument Strength

反驳论点强度

| Counterargument | Strength | Impact on Estimate |
| --- | --- | --- |
| [Counter 1] | FATAL/STRONG/MODERATE/WEAK | -X% |
| [Counter 2] | ... | ... |

| 反驳论点 | 强度 | 对估算结果的影响 |
| --- | --- | --- |
| [反驳1] | 致命/强/中等/弱 | -X% |
| [反驳2] | ... | ... |

Market Reality (Scorekeeper)

市场现实(记分员)

Comparable Outcomes

可比结果

| Comparable Case | Outcome | Relevance |
| --- | --- | --- |
| [Case 1] | [what happened] | [why it's relevant] |
| [Case 2] | [what happened] | [why it's relevant] |
| [Case 3] | [what happened] | [why it's relevant] |

| 可比案例 | 结果 | 相关性 |
| --- | --- | --- |
| [案例1] | [实际结果] | [相关性原因] |
| [案例2] | [实际结果] | [相关性原因] |
| [案例3] | [实际结果] | [相关性原因] |

Prediction Market / Crowd Estimates

预测市场/人群估算

[Any available prediction market prices or crowd forecasts]
[任何可用的预测市场价格或人群预测]

Evidence vs. Theory Gaps

证据与理论的差距

[Where the team's theoretical analysis was wrong based on evidence]

[团队理论分析与证据不符的地方]

THE CALIBRATED ESTIMATE

校准后的估算结果

This is the Tetlock question: How confident should you actually be, and can you prove it with a score?
这是Tetlock式的问题:你实际应该有多大信心,能否用分数证明?

Probability Triangulation

概率三角验证

Calibrator (base rate + adjustments):     X%
Decomposer (Fermi multiplication):        X%
Updater (Bayesian posterior):              X%
Prediction markets (if available):         X%
Devil's Advocate adjustment:              -X%
                                          ----
Triangulated estimate:                     X%
Extremizing adjustment:                   +/-X%
                                          ----
FINAL CALIBRATED ESTIMATE:                 X%
校准师(基准率+调整):     X%
分解师(费米乘积):        X%
更新师(贝叶斯后验):      X%
预测市场(如有):         X%
魔鬼代言人调整:              -X%
                                          ----
三角验证估算结果:                     X%
极端化调整:                   +/-X%
                                          ----
最终校准估算结果:                 X%
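There are several published extremizing transforms; a common one exponentiates the odds of the aggregate by a constant a > 1 (a = 1 leaves the estimate unchanged). A sketch with placeholder estimates:

```python
def extremize(p: float, a: float = 1.5) -> float:
    """GJP-style extremizing: exponentiate the odds of an aggregated
    probability by a > 1 to push it away from 50% (a = 1 is the identity)."""
    odds = (p / (1 - p)) ** a
    return odds / (1 + odds)

# Placeholder teammate estimates: calibrator, decomposer, updater
estimates = [0.22, 0.18, 0.25]
mean = sum(estimates) / len(estimates)
print(round(mean, 3))             # simple average: 0.217
print(round(extremize(mean), 3))  # extremized aggregate: 0.127
```

Apply this only when the inputs are genuinely independent and genuinely convergent; extremizing divergent or correlated estimates amplifies noise rather than signal.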

Confidence in the Estimate

估算结果的置信度

[LOW / MEDIUM / HIGH]
  • LOW: Reference classes are weak, base rates uncertain, sub-questions poorly understood. This estimate could easily be off by 20+%.
  • MEDIUM: Decent base rates, reasonable decomposition, some evidence. Estimate likely within 10-15% of true probability.
  • HIGH: Strong base rates, well-understood domain, multiple converging estimates. Estimate likely within 5-10% of true probability.

[低 / 中 / 高]
  • 低:参考类别薄弱,基准率不确定,子问题理解不足。该估算结果可能偏差20%以上。
  • 中:基准率尚可,分解合理,有部分证据。估算结果与真实概率的偏差可能在10-15%以内。
  • 高:基准率可靠,领域理解充分,多种估算结果趋同。估算结果与真实概率的偏差可能在5-10%以内。

THE VERDICT

结论

Tetlock's Three Buckets

Tetlock的三个分类

[ ] FORECASTABLE — This question is in the Goldilocks zone. The base rate is knowable, the time horizon is tractable (3-18 months), the outcome is measurable. The probability estimate above is meaningful and scoreable. Proceed with the estimate as a genuine input to the decision.
[ ] TOO CLOUDY — This question is in Extremistan, has no usable base rate, extends beyond the forecastable horizon (2+ years), or involves genuine structural novelty. The probability estimate above is best-effort but should NOT be treated as reliable. Use scenario planning and robustness thinking instead. Tetlock himself would put this in the "too tough" pile.
[ ] PROCEED WITH CAUTION — Partially forecastable. Some sub-questions have good base rates, others don't. The estimate is useful for the forecastable components but the "too cloudy" components introduce irreducible uncertainty. Decompose further, focus decisions on the forecastable parts, and build optionality for the uncertain parts.
[ ] 可预测 —— 该问题处于Goldilocks区间。基准率可知,时间范围可控(3-18个月),结果可衡量。上述概率估算结果有意义且可评分。可将该估算作为决策的真实输入。
[ ] 过于模糊 —— 该问题属于极端斯坦,无可用基准率,超出可预测时间范围(2年以上),或涉及真正的结构性新事物。上述概率估算结果虽为尽力而为,但不应视为可靠。应改用场景规划和鲁棒性思维。Tetlock本人会将其归为“过于棘手”的类别。
[ ] 谨慎推进 —— 部分可预测。部分子问题有良好的基准率,其他则没有。估算结果对可预测部分有用,但“过于模糊”的部分引入了不可减少的不确定性。进一步分解,将决策重点放在可预测部分,为不确定部分构建选择权。

Verdict: [FORECASTABLE / TOO CLOUDY / PROCEED WITH CAUTION]

结论: [可预测 / 过于模糊 / 谨慎推进]

Probability: [X]% Confidence: [LOW / MEDIUM / HIGH]
Reasoning: [2-3 paragraphs in Tetlock's empirical, anti-pundit voice. Reference specific findings from each analyst. Be honest about what's knowable and what's not. If the question is Too Cloudy, say why without apology. If it's Forecastable, state the estimate with appropriate precision. If Proceed With Caution, explain what's forecastable and what's not.]
概率: [X]% 置信度: [低 / 中 / 高]
推理: [2-3段,采用Tetlock式的实证、反权威风格。参考每位分析师的具体发现。如实说明可知与不可知的内容。如果问题过于模糊,无需道歉地说明原因。如果可预测,以适当的精度表述估算结果。如果谨慎推进,解释哪些部分可预测,哪些不可预测。]

What a Superforecaster Would Say

超级预测师会怎么说

[Write 2-3 sentences in the voice of a calibrated, foxlike superforecaster — empirical, hedged but precise, deeply suspicious of confident narratives. Reference the evidence, not the story. Include the specific probability. Example tone: "The base rate for this type of venture reaching that revenue target is about 12%. The specific team and market adjustments push it to maybe 18-22%. That's not terrible odds for a venture bet, but anyone telling you this is 'likely to succeed' is substituting narrative confidence for arithmetic." No punditry — just calibrated honesty.]
[用校准后的、狐狸式超级预测师的语气写2-3句话——实证、有保留但精准,对自信的叙事持深度怀疑态度。参考证据而非故事。包含具体概率。示例语气:“这类风险投资达到该收入目标的基准率约为12%。具体团队和市场调整使其概率可能达到18-22%。对于风险投资而言,这不算太差的赔率,但任何告诉你这‘很可能成功’的人,都是用叙事信心替代了算术逻辑。”避免权威式表述——只做校准后的诚实陈述。]

The Accountability Structure

问责机制

Primary prediction: P([outcome] by [date]) = [X]%
Sub-predictions:
  1. P([sub-outcome 1] by [date]) = [X]%
  2. P([sub-outcome 2] by [date]) = [X]%
  3. P([sub-outcome 3] by [date]) = [X]%
Milestone predictions:
  • By [date 1]: [milestone] — P = [X]%
  • By [date 2]: [milestone] — P = [X]%
Update triggers:
  • If [event 1] occurs → update by [+/-X%]
  • If [event 2] occurs → update by [+/-X%]
  • If [event 3] occurs → update by [+/-X%]
Review schedule: [monthly / quarterly / at milestones]
Score at: [resolution date]
主要预测: P([结果]在[日期]前发生) = [X]%
子预测:
  1. P([子结果1]在[日期]前发生) = [X]%
  2. P([子结果2]在[日期]前发生) = [X]%
  3. P([子结果3]在[日期]前发生) = [X]%
里程碑预测:
  • 在[日期1]前: [里程碑] — P = [X]%
  • 在[日期2]前: [里程碑] — P = [X]%
更新触发因素:
  • 如果[事件1]发生 → 更新幅度为[±X%]
  • 如果[事件2]发生 → 更新幅度为[±X%]
  • 如果[事件3]发生 → 更新幅度为[±X%]
复查时间表: [每月 / 每季度 / 里程碑时]
评分日期: [结果揭晓日期]

If You Proceed: The Update Discipline

如果推进:更新准则

[Based on the Updater's analysis, write 3-5 rules for maintaining calibration on this thesis. These are the epistemic hygiene commandments.]
  1. Check [data source] every [frequency] — this is the highest-signal evidence channel for this thesis
  2. If [specific event], update by [amount] — pre-commit to this update to prevent belief persistence
  3. Never [specific hedgehog trap] — because [the failure mode it causes]
  4. Review the sub-predictions at [interval] — early failures in sub-predictions are diagnostic of the main thesis
  5. Score yourself honestly at [resolution date] — record the Brier score and add it to your calibration history
[基于更新师的分析,撰写3-5条关于该论点保持校准的规则。这些是认知卫生准则。]
  1. 每[频率]检查[数据源] —— 这是该论点最高信号的证据渠道
  2. 如果[特定事件]发生,更新幅度为[X] —— 预先承诺该更新,以防止信念固化
  3. 绝不[特定刺猬陷阱] —— 因为[其导致的失败模式]
  4. 每[间隔]复查子预测 —— 子预测的早期失败对主要论点具有诊断性
  5. 在[结果揭晓日期]如实给自己评分 —— 记录Brier分数并添加到你的校准历史中

Phase 5: Present & Follow-up

阶段5:呈现与跟进

Present the verdict to the user with key highlights. Don't dump the whole document — give the probability, the verdict, the triangulation, and the accountability structure. Let them read the full analysis.
向用户呈现结论及关键要点。不要直接输出完整文档——给出概率、结论、三角验证结果和问责机制。让用户自行阅读完整分析。

Tetlock Verdict: [THESIS] — [FORECASTABLE / TOO CLOUDY / PROCEED WITH CAUTION]

Tetlock结论: [论点] — [可预测 / 过于模糊 / 谨慎推进]

Calibrated probability: [X]% (confidence: [LOW/MEDIUM/HIGH])
Triangulation:
  • Base rate anchor: [X]%
  • Fermi decomposition: [X]%
  • Bayesian posterior: [X]%
  • Prediction markets: [X]% (if available)
Weakest link: [the sub-question most likely to fail]
Strongest counterargument: [the Devil's Advocate's best attack]
Domain check: [Mediocristan / Extremistan — Taleb warning if relevant]
What a superforecaster would say: "[calibrated, foxlike quote]"
Scoreable prediction: P([outcome] by [date]) = [X]%
Full analysis: `thoughts/tetlock/YYYY-MM-DD-<slug>.md`
Want me to:
  1. Deep-dive into any analyst's findings?
  2. Re-run with a modified thesis or time horizon?
  3. Design a full prediction tournament around this domain?
  4. Apply /munger to stress-test the same idea through Munger's lattice?
  5. Set up milestone predictions for ongoing tracking?
校准后概率: [X]%(置信度: [低/中/高])
三角验证:
  • 基准率锚点: [X]%
  • 费米分解: [X]%
  • 贝叶斯后验: [X]%
  • 预测市场: [X]%(如有)
最薄弱环节: [最可能失败的子问题]
最强反驳论点: [魔鬼代言人的最佳攻击点]
领域检查: [平均斯坦 / 极端斯坦 —— 如有相关,给出塔勒布警告]
超级预测师会说: "[校准后的、狐狸式的表述]"
可评分预测: P([结果]在[日期]前发生) = [X]%
完整分析: `thoughts/tetlock/YYYY-MM-DD-<slug>.md`
你希望我:
  1. 深入分析任何一位分析师的发现?
  2. 修改论点或时间范围后重新运行分析?
  3. 围绕该领域设计完整的预测锦标赛?
  4. 应用/munger工具通过芒格的思维模型对同一想法进行压力测试?
  5. 设置里程碑预测以进行持续跟踪?

Batch Mode

批量模式

If the user wants to compare multiple theses or decisions:
  1. Run the full analysis on each (can parallelize — one team per thesis)
  2. At the end, produce a leaderboard:
如果用户希望比较多个论点或决策:
  1. 对每个论点运行完整分析(可并行——每个论点对应一个团队)
  2. 最后生成排行榜:

Superforecasting Leaderboard

超级预测排行榜

| Rank | Thesis | Verdict | Probability | Confidence | Domain | Weakest Link |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | [thesis] | FORECASTABLE | X% | HIGH | Mediocristan | [sub-Q] |
| 2 | [thesis] | PROCEED | X% | MEDIUM | Mixed | [sub-Q] |
| 3 | [thesis] | TOO CLOUDY | X% | LOW | Extremistan | [sub-Q] |

| 排名 | 论点 | 结论 | 概率 | 置信度 | 领域 | 最薄弱环节 |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | [论点] | 可预测 | X% | 高 | 平均斯坦 | [子问题] |
| 2 | [论点] | 谨慎推进 | X% | 中 | 混合领域 | [子问题] |
| 3 | [论点] | 过于模糊 | X% | 低 | 极端斯坦 | [子问题] |

Scoring Discipline

评分准则

  • Be a calibrated fox, not a confident hedgehog. Superforecasters say "I don't know" more often than pundits. If you don't have a base rate, say so. If the question is Too Cloudy, say so. Precision without accuracy is worse than honest uncertainty.
  • Cite the source analyst. Every claim traces to a specific teammate's finding.
  • No narrative inflation. Tetlock: "The more famous an expert was, the less accurate he was." Storytelling ability and forecasting accuracy are uncorrelated. Don't let a good story override bad math.
  • The base rate is your friend. If the Calibrator says the base rate is 12% and your gut says 60%, trust the base rate and investigate your gut.
  • Web search when uncertain. The Scorekeeper exists to ground theory in evidence. If other analysts are speculating, the Scorekeeper's job is to fact-check with real-world data.
  • The "Too Cloudy" bucket is respectable. Tetlock himself says forecasting degrades toward chance at 3-5 years. Acknowledging the limits of your method is not a failure — it's calibration applied to your own process.
  • Taleb check is mandatory. If the question involves fat-tailed outcomes (VC returns, pandemic risk, technology disruption), the standard superforecasting toolkit is insufficient. Flag it explicitly.
  • 做校准后的狐狸,而非自信的刺猬。 超级预测师比权威人士更常说“我不知道”。如果没有基准率,直接说明。如果问题过于模糊,直接说明。没有准确性的精确性,比诚实的不确定性更糟糕。
  • 引用来源分析师。 每一项主张都可追溯到特定团队成员的发现。
  • 避免叙事膨胀。 Tetlock:“专家越有名,准确率越低。”讲故事的能力与预测准确率无关。不要让好故事掩盖糟糕的数学。
  • 基准率是你的朋友。 如果校准师说基准率是12%,而你的直觉是60%,请信任基准率并调查直觉不符的原因。
  • 不确定时进行网络搜索。 记分员的存在是为了将理论与证据结合。如果其他分析师在猜测,记分员的工作是用现实世界的数据进行事实核查。
  • “过于模糊”类别是值得尊重的。 Tetlock本人表示,3-5年以上的预测准确率会趋近于随机。承认方法的局限性不是失败——这是将校准应用于自身过程。
  • 塔勒布检查是强制性的。 如果问题涉及肥尾结果(风险投资回报、大流行风险、技术颠覆),标准的超级预测工具包是不够的。明确标记此类问题。
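If you do keep score, the bookkeeping is trivial. A minimal scorecard sketch (mean Brier score plus a per-bucket calibration check) with entirely illustrative entries:

```python
from collections import defaultdict

# Each entry: (stated probability, resolved outcome: 1 = happened, 0 = did not)
forecasts = [
    (0.8, 1), (0.8, 1), (0.8, 0), (0.6, 1), (0.6, 0),
    (0.2, 0), (0.2, 0), (0.2, 1),
]

# Mean Brier score across the scorecard; lower is better, 0.25 = always saying 50%
mean_brier = sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)
print(round(mean_brier, 3))

# Calibration check: for each stated probability, how often did the event resolve true?
buckets = defaultdict(list)
for p, o in forecasts:
    buckets[p].append(o)
for p in sorted(buckets):
    hit_rate = sum(buckets[p]) / len(buckets[p])
    print(f"stated {p:.0%} -> resolved {hit_rate:.0%} of the time")
```

With a real track record you would bucket into standard probability bands and plot stated versus realized frequency; a well-calibrated forecaster's points hug the diagonal.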

Important Notes

重要说明

  • Cost: This skill spawns 5 agents. It's expensive. Worth it for serious strategic decisions, not for casual questions (for those, just apply the base-rate-then-adjust heuristic yourself).
  • Sonnet for teammates, Opus for synthesis: The lead handles the probability triangulation and final verdict — that's where the dragonfly eye matters.
  • No team? No problem: If teams aren't enabled, run 5 sequential background agents and collect results. Same analysis, just no cross-talk.
  • Pair with other skills:
    • Run /munger first to identify the structural forces, then /tetlock to calibrate how confident you should be in the thesis
    • Munger's lattice tells you WHAT to think about; Tetlock tells you HOW WELL you're thinking about it
    • Run /garrytan to refine the idea, /munger for structural analysis, /tetlock for probabilistic calibration
  • The scoring rubric is the point. Unlike /munger (which gives you a verdict), /tetlock gives you a scoreable prediction. Come back to this analysis at the resolution date and compute the Brier score. Track your calibration over time. That's what superforecasters do.
  • Domain limitations are real. This framework works best for:
    • 3-18 month horizons with measurable outcomes
    • Questions with historical base rates
    • Bounded (Mediocristan) outcomes
  It works poorly for: VC-style power-law bets, 5+ year horizons, genuinely novel situations, decisions where you control the outcome. The skill will flag these limitations explicitly.
  • 成本: 该工具会生成5个Agent,成本较高。适合重要的战略决策,不适合日常问题(对于日常问题,你可以自行应用“基准率+调整”的启发式方法)。
  • 团队成员用Sonnet,整合用Opus: 主导角色负责概率三角验证和最终结论——这是“蜻蜓复眼”发挥作用的地方。
  • 没有团队功能也没关系: 如果团队功能未启用,依次运行5个后台Agent并收集结果。分析质量相同,只是没有成员间交叉沟通。
  • 与其他工具结合使用:
    • 先运行/munger工具识别结构性力量,再运行/tetlock工具校准你对论点的信心
    • 芒格的思维模型告诉你要思考什么;Tetlock告诉你思考的质量如何
    • 运行/garrytan工具优化想法,/munger工具进行结构性分析,/tetlock工具进行概率校准
  • 评分规则是核心。 与/munger工具(给出结论)不同,/tetlock工具给出的是可评分的预测。在结果揭晓日期回到该分析,计算Brier分数。随时间跟踪你的校准情况。这就是超级预测师的做法。
  • 领域局限性是真实存在的。 该框架最适用于:
    • 3-18个月时间范围、可衡量结果的场景
    • 有历史基准率的问题
    • 有界(平均斯坦)结果
  不适用于:风险投资式的幂律赌注、5年以上时间范围、真正的新事物、你能控制结果的决策。该工具会明确标记这些局限性。