cheat-bump

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

/cheat-bump — Rubric / Bucket 升级

/cheat-bump — Rubric / Bucket Upgrade

两种模式:
模式触发做什么验证强度
完整 rubric bump
--propose "<新公式>"
改公式 / 维度 / 权重5 步 + 跨模型审核(强制)
bucket-only 重校
--bucket-only
只重新派生 bucket 边界数据自动派生,无审核
完整 rubric bump 严格遵守 shared-references/bump-validation-protocol.md 的 5 步。bucket-only 走轻量路径——见下方 Phase B。
Two modes:
ModeTriggerWhat it doesValidation Strength
Full rubric bump
--propose "<new formula>"
Modify formula / dimensions / weightsMandatory 5-step process + cross-model audit
--bucket-only recalibration
--bucket-only
Only re-derive bucket boundariesAutomatic data derivation, no audit
Full rubric bump strictly follows the 5 steps in shared-references/bump-validation-protocol.md. Bucket-only follows the lightweight path — see Phase B below.

Overview

Overview

入口:用户触发 /cheat-bump
[Phase A0: 检测调用模式]
  ├─ --bucket-only  →  [Phase B: 轻量 bucket 重校]
  └─ --propose      →  [Phase 0~6: 完整 rubric bump]
Entry: User triggers /cheat-bump
[Phase A0: Detect Call Mode]
  ├─ --bucket-only  →  [Phase B: Lightweight Bucket Recalibration]
  └─ --propose      →  [Phase 0~6: Full Rubric Bump]

Phase A0: 调用模式分流(先做)

Phase A0: Call Mode Diversion (Do First)

读用户参数:
  • --bucket-only
    → 走 Phase B(轻量重校)
  • --propose "<...>"
    → 走 Phase 0~8(完整 rubric bump)
  • 都没有 → 询问用户:"你想做什么?1) 调 rubric 公式 / 加减维度 → --propose;2) 只重新派生 bucket 边界 → --bucket-only"
如果用户说"我觉得 ER 太低了想调"→ 是
--propose
路径。 如果用户说"我账号长大了,bucket 不准了"→ 是
--bucket-only
路径。 两条路径不能混调——一次操作只做一种事。

Read user parameters:
  • Contains
    --bucket-only
    → proceed to Phase B (lightweight recalibration)
  • Contains
    --propose "<...>"
    → proceed to Phase 0~8 (full rubric bump)
  • Neither → Ask user: "What do you want to do? 1) Adjust rubric formula / add/remove dimensions → --propose; 2) Only re-derive bucket boundaries → --bucket-only"
If user says "I think ER is too low and want to adjust" → it's the
--propose
path. If user says "My account has grown, buckets are no longer accurate" → it's the
--bucket-only
path. The two paths cannot be mixed — only one type of operation per action.

完整 rubric bump 流程

Full Rubric Bump Workflow

[用户:升级 rubric --propose "ER×1.5→2.0,砍 NA,加 MS"]
[Phase 0: 前置门槛检查]
[Phase 1: 写出新公式完整方程]
[Phase 2: 校准池全量重打分]
[Phase 3: 计算排序一致性]
[Phase 4: 跨模型独立审核(强制)]
[Phase 5: 落地 + cleanup pass]
[Phase 6: 更新所有校准样本的 prediction 文件底部追加 Re-scored 行]
[User: upgrade rubric --propose "ER×1.5→2.0, remove NA, add MS"]
[Phase 0: Pre-threshold Check]
[Phase 1: Write Complete New Formula Equation]
[Phase 2: Full Re-scoring of Calibration Pool]
[Phase 3: Calculate Ranking Consistency]
[Phase 4: Mandatory Cross-Model Independent Audit]
[Phase 5: Implementation + Cleanup Pass]
[Phase 6: Append Re-scored Line to Bottom of All Calibration Sample Prediction Files]

Constants

Constants

  • READINESS_HEURISTIC
    • 默认参考:校准池 ≥ 5 样本 + 至少 1 个跨样本观察有 ≥3 样本支持
    • 但 Claude 可以提议 bump(即使样本少)如果观察信号特别强:
      • N=3 但出现完全推翻当前 rubric 假设的强反例(composite 8.5 vs 实绩 5w 这种 ≥3x 偏差)
      • 1 篇出现单点但极强的现象(如评论区出现 ≥2000 赞的单一模因)
    • Claude 也可以拒绝 bump(即使样本足)如果证据弱:
      • N=10 但观察都是低置信度的零碎 pattern,无清晰方向
      • 用户复盘时大量"随便看了下"的非严肃判断
    • 写在 prediction header 或 cheat-bump 输出时必说明:本次提议是 default-aligned 还是 judgment-driven,给用户审视依据
  • THRESHOLD = 0.8 — 新排序与实绩排序一致性阈值(4/5)。这条写死——bump 验证的统计刚性
  • CROSS_MODEL_AUDIT = true — 调外部 LLM 独立审核。false 仅用于离线
  • REQUIRE_CONFIRM = true — 落地前要求用户明确"yes, bump"
  • READINESS_HEURISTIC
    • Default Reference: Calibration pool ≥5 samples + at least 1 cross-sample observation supported by ≥3 samples
    • But Claude can propose bump (even with few samples) if observation signals are exceptionally strong:
      • N=3 but there's a strong counterexample that completely overturns current rubric assumptions (e.g., composite 8.5 vs actual performance 50k, a ≥3x deviation)
      • A single post shows an extreme phenomenon (e.g., a single meme with ≥2000 likes in the comment section)
    • Claude can also reject bump (even with sufficient samples) if evidence is weak:
      • N=10 but observations are low-confidence fragmented patterns with no clear direction
      • User reviews contain numerous non-serious judgments like "just glanced at it"
    • Must be stated in prediction header or cheat-bump output: Whether this proposal is default-aligned or judgment-driven, providing users with a basis for review
  • THRESHOLD = 0.8 — Consistency threshold between new ranking and actual performance ranking (4/5). This is hard-coded — statistical rigidity for bump validation
  • CROSS_MODEL_AUDIT = true — Call external LLM for independent audit. false is only used for offline scenarios
  • REQUIRE_CONFIRM = true — Require explicit user confirmation "yes, bump" before implementation

Inputs

Inputs

必填来源
--propose
文本
用户参数;缺失则询问
rubric_notes.md
用户项目根
predictions/*.md
全量
校准池数据
.cheat-state.json
状态
RequiredSource
--propose
text
User parameters; ask if missing
rubric_notes.md
User project root
All
predictions/*.md
files
Calibration pool data
.cheat-state.json
State file

Workflow

Workflow

Phase 0: 前置门槛检查

Phase 0: Pre-threshold Check

bump-validation-protocol.md 的"何时禁止"段,逐项检查:
检查失败处理
校准池总样本数 vs 观察强度Claude 判断——按 READINESS_HEURISTIC:默认 ≥5 样本但允许特例(强反例 / 强模因)。如不满足默认,Claude 必须显式说明为什么仍然提议 bump("虽然只 N=3 样本,但 X 这条出现 composite Y vs 实绩 Z,这是 W 倍偏差"),让用户审视
上次 bump 距今的新校准数 vs 观察成熟度Claude 判断——默认建议 ≥3 篇新样本,但如果连续 3 篇都强证据指向同一方向 → 不必再等
in_progress_session == null
拒绝:"你有 in-progress 预测未完成。先走完那条流程或清掉 state"
触发条件成立(系统性偏差 / 跨样本新观察 / 新维度证据足)警告但不阻塞——询问用户为什么现在 bump
通过 → 进入 Phase 1。
Check item by item according to the "When to Prohibit" section in bump-validation-protocol.md:
CheckFailure Handling
Total calibration pool samples vs observation strengthClaude's judgment — follow READINESS_HEURISTIC: default ≥5 samples but allow exceptions (strong counterexamples / strong memes). If default is not met, Claude must explicitly explain why bump is still proposed ("Although only N=3 samples, entry X has composite Y vs actual performance Z, a W-fold deviation"), allowing user review
Number of new calibrations since last bump vs observation maturityClaude's judgment — default suggests ≥3 new samples, but if 3 consecutive samples all provide strong evidence pointing to the same direction → no need to wait
in_progress_session == null
Reject: "You have an in-progress prediction that is not completed. Finish that process first or clear the state"
Trigger conditions are met (systematic deviation / new cross-sample observations / sufficient evidence for new dimensions)Warn but do not block — ask user why they want to bump now
Pass → proceed to Phase 1.

Phase 1: 写出新公式完整方程

Phase 1: Write Complete New Formula Equation

不能只接受用户的简短描述。把它展开为完整方程:
当前:v2  composite = (ER×1.5 + SR×1.5 + HP×1.5 + QL + NA + AB + SAT) / 8.5 × 2.0
提议:v2.1  composite = (ER×2.0 + HP×1.5 + MS×1.5 + QL + SR + TS + SAT) / 9.0 × 2.0

变化总结:
- ER ×1.5 → ×2.0(升)
- SR ×1.5 → ×1.0(降)
- 新增 MS ×1.5(Memetic Shareability)
- 新增 TS ×1.0(Topic Shareability)
- 删除 NA(与 HP 重叠)
- 删除 AB(被 TS 替代)
- 归一化常数 8.5 → 9.0
- 公式总维度数:7 → 7(净变化 0)
如果用户的提议含糊(如"ER 权重提一点")→ 询问具体数值,禁止自己猜
Do not only accept brief user descriptions. Expand it into a complete equation:
Current: v2  composite = (ER×1.5 + SR×1.5 + HP×1.5 + QL + NA + AB + SAT) / 8.5 × 2.0
Proposed: v2.1  composite = (ER×2.0 + HP×1.5 + MS×1.5 + QL + SR + TS + SAT) / 9.0 × 2.0

Summary of changes:
- ER ×1.5 → ×2.0 (increased)
- SR ×1.5 → ×1.0 (decreased)
- Added MS ×1.5 (Memetic Shareability)
- Added TS ×1.0 (Topic Shareability)
- Removed NA (overlaps with HP)
- Removed AB (replaced by TS)
- Normalization constant 8.5 → 9.0
- Total number of formula dimensions: 7 → 7 (net change 0)
If user's proposal is vague (e.g., "Increase ER weight a bit") → ask for specific values, do not guess.

Phase 2: 校准池全量重打分(强制走 blind sub-agent

Phase 2: Full Re-scoring of Calibration Pool (Mandatory Blind Sub-agent)

Glob
predictions/*.md
中所有有完整复盘段的文件 → 校准池。
bump 是工具最高风险动作——所有重打必须走 cheat-score-blind sub-agent。inline 重打 = 主 Claude 已经看过实绩,rank 一致性变成 overfit 而非真信号。
Glob all files in
predictions/*.md
with complete review sections → calibration pool.
Bump is the highest-risk action of the tool — all re-scoring must go through cheat-score-blind sub-agent. Inline re-scoring means the main Claude has already seen actual performance, making rank consistency overfitting rather than real signals.

强制约束

Mandatory Constraints

  • 不接受 self-scored fallback——
    /cheat-predict
    --skip-blind
    flag,但
    /cheat-bump
    没有。如果 Task tool 不可用 → abort bump,向用户报告"先解决 Task tool 再 bump"
  • 不接受"我只重算 composite 不重打 dim" —— 即使新公式只调权重不加维度,每条 prediction 的所有 dim 都要由 sub-agent 重新审 script。理由:旧 dim 分本身可能是污染的;权重变了不能保证旧 dim 还成立
  • No self-scored fallback accepted
    /cheat-predict
    has a
    --skip-blind
    flag, but
    /cheat-bump
    does not. If Task tool is unavailable → abort bump, report to user "Resolve Task tool issues first before bumping"
  • No "I only recalculate composite without re-scoring dimensions" — even if the new formula only adjusts weights without adding dimensions, all dimensions of each prediction must be re-reviewed by the sub-agent. Reason: Old dimension scores may already be contaminated; weight changes cannot guarantee old dimensions are still valid

对每篇 prediction:

For Each Prediction:

  1. 解析 prediction 文件拿到对应
    scripts/<id>.md
    路径(从
    Script Path
    header 字段)
  2. 校验 script 文件存在 + hash 跟 header
    Script Hash
    一致;不一致 → 警告(script 改过了)但仍 spawn sub-agent
  3. 通过 Task tool spawn cheat-score-blind sub-agent
    Spawn cheat-score-blind sub-agent.
    
    Input:
      script_path: <prediction header 的 Script Path>
      rubric_notes_path: rubric_notes.md
      sidecar_path: .cheat-cache/bump-rescores/<prediction-id>.json
    
    Task: 按 rubric_notes 当前公式(已是新版 vN+1)给 script 打分。
    返回严格 JSON。写 sidecar 文件用于 bump 主流程批量读取。
    
    不要读 state file / predictions/ / videos/ 任何其他文件。
    不要询问用户 —— 你没有用户。
    不要读这份 prediction 文件本身 —— 你只看 script + rubric。
  4. 等 sub-agent 完成 → 读 sidecar JSON → 主流程用新公式算 composite
  5. 写"重打表"到
    .cheat-cache/bump-rescores.json
    (汇总)。每条 entry 标
    blind: true
    —— bump phase 5 cleanup 时把这个字段连同新分数写到 prediction 文件的
    Re-scored under v<N+1>
  1. Parse the prediction file to get the corresponding
    scripts/<id>.md
    path (from the
    Script Path
    header field)
  2. Verify the script file exists + hash matches the header
    Script Hash
    ; if not → warn (script has been modified) but still spawn sub-agent
  3. Spawn cheat-score-blind sub-agent via Task tool:
    Spawn cheat-score-blind sub-agent.
    
    Input:
      script_path: <Script Path from prediction header>
      rubric_notes_path: rubric_notes.md
      sidecar_path: .cheat-cache/bump-rescores/<prediction-id>.json
    
    Task: Score the script according to the current formula in rubric_notes (already updated to new version vN+1).
    Return strict JSON. Write to sidecar file for batch reading by the main bump process.
    
    Do not read state file / predictions/ / videos/ or any other files.
    Do not ask user — you have no user.
    Do not read this prediction file itself — only look at script + rubric.
  4. Wait for sub-agent to complete → read sidecar JSON → main process calculates composite using new formula
  5. Write "re-score table" to
    .cheat-cache/bump-rescores.json
    (summary). Mark each entry with
    blind: true
    — during Phase 5 cleanup of bump, write this field along with the new score to the
    Re-scored under v<N+1>
    line in the prediction file

还污染没污染的诚实标注

Honest Labeling of Contamination

即使走 sub-agent,仍有两类残余 contamination 要在 bump report 里诚实标注
类型来源标注字段
模型 prior contaminationsub-agent 仍是 Claude,RLHF 共享
model_prior_warning: true
(默认 true,不可关)
用户自己 rubric design biasrubric_notes.md 是用户写的,自然 fit 自己内容
rubric_self_designed: true
(默认 true,不可关)
这两条提示用户 channel C(跨模型 audit)的不可省。bump 报告末尾必印:"上面的 rank 一致性是 channel A 内的一致性。最终决策必须等 channel C audit 通过。"
Even with sub-agent, two types of residual contamination must be honestly labeled in the bump report:
TypeSourceLabel Field
Model prior contaminationSub-agent is still Claude, sharing RLHF
model_prior_warning: true
(default true, cannot be turned off)
User's own rubric design biasrubric_notes.md is written by user, naturally fitting their own content
rubric_self_designed: true
(default true, cannot be turned off)
These two remind users that channel C (cross-model audit) is indispensable. The end of the bump report must state: "The above rank consistency is within channel A. Final decision must wait for channel C audit approval."

失败模式

Failure Modes

症状处理
某条 prediction 的 script 文件不见了sub-agent skip 该条,主流程汇总报告"N 条因 script 缺失被排除"。如剩余有效池 < MIN_SAMPLES → abort bump
sub-agent 返回
refusal != null
重发 Task 最多 3 次;仍败 → 该条标
rescore_failed: true
排除出校准池
Task tool 整个不可用abort bump,提示用户"Task tool 是 bump 的硬依赖。如真的离线环境,跑
/cheat-bump --bucket-only
走轻量分支"
sub-agent 输出含 contamination_signal
suspicious: true
但不排除——bump report 末尾列这些可疑条目让用户审
SymptomHandling
Script file for a prediction is missingSub-agent skips this entry, main process summarizes "N entries excluded due to missing script". If remaining valid pool < MIN_SAMPLES → abort bump
Sub-agent returns
refusal != null
Resend Task up to 3 times; if still failed → mark this entry as
rescore_failed: true
and exclude from calibration pool
Task tool is completely unavailableAbort bump, prompt user "Task tool is a hard dependency for bump. If in offline environment, run
/cheat-bump --bucket-only
to use lightweight branch"
Sub-agent output contains contamination_signalMark as
suspicious: true
but do not exclude — list these suspicious entries at the end of bump report for user review

Phase 3: 计算排序一致性

Phase 3: Calculate Ranking Consistency

每个样本:
  new_composite_rank: 用新公式排序的 rank
  actual_plays_rank: 用实际播放排序的 rank
  delta: |new_rank - actual_rank|

输出对照表:
| 样本 | composite (v2) | composite (v2.1) | rank (new) | actual | rank (actual) | delta |
|---|---|---|---|---|---|---|
| 仓鼠 | 9.41 | 9.55 | 1 | 124.8w | 1 | 0 |
| 停止期待 | 8.24 | 9.11 | 2 | 71.1w | 2 | 0 |
| 老板废话 | 7.65 | 8.11 | 4 | 39.6w | 3 | 1 |
| 求职悖论 | 8.47 | 7.56 | 5 | 16.8w | 4 | 1 |
| 谁问你了 | 8.24 | 7.00 | 6 | 11.7w | 5 | 1 |

排序一致性:4/5 在 |delta| ≤ 1
Pairwise no-regression:旧公式做对的所有 pair 在新公式下未颠倒 ✓
判定:
  • 排序一致性 < THRESHOLD(默认 0.8) → 本地拒绝,转 Phase 4 之前明确报告失败
  • pairwise 出现回归 → 本地拒绝
THRESHOLD
写死在协议里——不允许临时调低(那本身是另一个需要 bump 的元决策)。
For each sample:
  new_composite_rank: Rank sorted by new formula
  actual_plays_rank: Rank sorted by actual plays
  delta: |new_rank - actual_rank|

Output comparison table:
| Sample | composite (v2) | composite (v2.1) | rank (new) | actual | rank (actual) | delta |
|---|---|---|---|---|---|---|
| Hamster | 9.41 | 9.55 | 1 | 1.248M | 1 | 0 |
| Stop Expecting | 8.24 | 9.11 | 2 | 711K | 2 | 0 |
| Boss Nonsense | 7.65 | 8.11 | 4 | 396K | 3 | 1 |
| Job Hunting Paradox | 8.47 | 7.56 | 5 | 168K | 4 | 1 |
| Who Asked You | 8.24 | 7.00 | 6 | 117K | 5 | 1 |

Ranking consistency: 4/5 with |delta| ≤ 1
Pairwise no-regression: All pairs correctly ranked by old formula are not reversed under new formula ✓
Judgment:
  • Ranking consistency < THRESHOLD (default 0.8) → Local rejection, explicitly report failure before proceeding to Phase 4
  • Pairwise regression occurs → Local rejection
THRESHOLD
is hard-coded in the protocol — temporary lowering is not allowed (that itself is another meta-decision requiring bump).

Phase 4: 跨模型独立审核(强制,除非 escape hatch)

Phase 4: Mandatory Cross-Model Independent Audit (Mandatory, except escape hatch)

CROSS_MODEL_AUDIT=true
(默认):
调用
mcp__llm-chat__chat
prompt:
你是一个独立审稿人。下面是一个内容创作者准备升级的 rubric 公式。
请独立判定两件事:
1. 排序一致性:新公式给样本的排序与实际表现排序,是否真的在 ≥80% 样本上一致?
2. 解释力:新公式相比旧公式,是否更好地解释了校准池的实绩分布?

数据:
旧公式:(ER×1.5 + SR×1.5 + HP×1.5 + QL + NA + AB + SAT) / 8.5 × 2.0
新公式:(ER×2.0 + HP×1.5 + MS×1.5 + QL + SR + TS + SAT) / 9.0 × 2.0

校准池:
[Phase 2 重打表的完整 JSON]

排序对照:
[Phase 3 表格的完整 JSON]

输出格式:
- 判定:PASS 或 REJECT
- 理由:≥100 字
- 关键风险:[如有,列出新公式的潜在问题]
收到外部 LLM 回复 → 解析判定。
判定逻辑:
  • 本地 PASS + 外部 PASS → 通过,进入 Phase 5
  • 本地 PASS + 外部 REJECT → 视为 REJECT。冲突意味着至少一方解读不稳定
  • 本地 REJECT → 已在 Phase 3 终止
  • mcp__llm-chat__chat 不可用 → 优雅降级到
    CROSS_MODEL_AUDIT=false
    ,state file 标
    last_bump_self_audited: true
CROSS_MODEL_AUDIT=false
  • 仅依赖本地判定
  • state file 持续标记,cheat-status 持续提示用户"这次 bump 是自审,建议配置 mcp__llm-chat__chat"
CROSS_MODEL_AUDIT=true
(default):
Call
mcp__llm-chat__chat
:
prompt:
You are an independent reviewer. Below is a rubric formula that a content creator is preparing to upgrade.
Please independently judge two things:
1. Ranking consistency: Is the ranking of samples by the new formula consistent with the ranking of actual performance in ≥80% of samples?
2. Explanatory power: Does the new formula better explain the actual performance distribution of the calibration pool compared to the old formula?

Data:
Old formula: (ER×1.5 + SR×1.5 + HP×1.5 + QL + NA + AB + SAT) / 8.5 × 2.0
New formula: (ER×2.0 + HP×1.5 + MS×1.5 + QL + SR + TS + SAT) / 9.0 × 2.0

Calibration pool:
[Full JSON of re-score table from Phase 2]

Ranking comparison:
[Full JSON of table from Phase 3]

Output format:
- Judgment: PASS or REJECT
- Reason: ≥100 words
- Key risks: [List potential issues of new formula if any]
Receive external LLM response → parse judgment.
Judgment logic:
  • Local PASS + External PASS → Pass, proceed to Phase 5
  • Local PASS + External REJECT → Treat as REJECT. Conflict means at least one party's interpretation is unstable
  • Local REJECT → Already terminated in Phase 3
  • mcp__llm-chat__chat unavailable → Gracefully degrade to
    CROSS_MODEL_AUDIT=false
    , mark
    last_bump_self_audited: true
    in state file
CROSS_MODEL_AUDIT=false
:
  • Only rely on local judgment
  • Continuously mark in state file, cheat-status continuously prompts user "This bump was self-audited, it is recommended to configure mcp__llm-chat__chat"

Phase 5: 落地 + cleanup pass

Phase 5: Implementation + Cleanup Pass

通过审核后,REQUIRE_CONFIRM=true → 询问用户:"新公式 PASS 本地与外部审核。最后确认:执行 bump 落地?这会修改 rubric_notes.md + rubric-memo.md 并删除若干已被吸收的观察。回答 'yes, bump' 才执行。"
用户确认后:
After passing audit, REQUIRE_CONFIRM=true → Ask user: "New formula passed local and external audits. Final confirmation: Execute bump implementation? This will modify rubric_notes.md + rubric-memo.md and delete several absorbed observations. Only execute if you answer 'yes, bump'."
After user confirmation:

5a. 更新
rubric_notes.md
只放通用语言,不含视频名 / 实绩

5a. Update
rubric_notes.md
(Only use general language, no video names / actual performance data)

  • 顶部 metadata 更新:
    • **当前版本**: vN+1
    • **Last bumped at**: <ISO 8601>
    • **Upgrade memos**: 见 [rubric-memo.md](rubric-memo.md)
      (指针,不复制 Memo 内容)
  • 版本速查表加一行(只含版本号 + 公式签名,不含证据样本)
  • 更新"当前评分维度"段(删 NA / AB,加 MS / TS)
  • 派生证据段 如新维度需要锚点解释 → 用通用语言
    • ✅ 允许:「派生证据:高抽象密度样本 → CC=1 → 低 reach」
    • ❌ 禁止:「派生证据:「停止期待」CC=1 → 实绩 13.7w」(视频名 + 实绩 数字)
    • 命中违禁 pattern → 把该段抽到 rubric-memo.md 的"派生证据"子段,原位用通用语言替代
  • Update top metadata:
    • **Current Version**: vN+1
    • **Last bumped at**: <ISO 8601>
    • **Upgrade memos**: See [rubric-memo.md](rubric-memo.md)
      (pointer, do not copy memo content)
  • Add a line to version quick reference table (only include version number + formula signature, no evidence samples)
  • Update "Current Scoring Dimensions" section (remove NA / AB, add MS / TS)
  • Derived evidence section if new dimensions need anchor explanation → Use general language:
    • ✅ Allowed: "Derived evidence: High abstract density samples → CC=1 → Low reach"
    • ❌ Prohibited: "Derived evidence: 'Stop Expecting' CC=1 → Actual performance 137K" (video name + actual performance number)
    • If prohibited pattern is hit → move this section to "Derived Evidence" sub-section in rubric-memo.md, replace with general language in place

5b. 写 Memo 到
rubric-memo.md
append 模式,不覆盖历史

5b. Write Memo to
rubric-memo.md
(Append mode, do not overwrite history)

bump-validation-protocol.md Step 5 + templates/rubric-memo.template.md 格式 append 一段 Memo 到文件末尾:
  • 触发观察(含真实观察 ID)
  • 证据数据(校准池重打表 + 排序对照,含真实视频名 + 实绩
  • 派生证据(含真实样本名 + 实绩
  • 诊断
  • 新公式
  • 跨模型审核结论引用(含模型名 + 判定 + 理由摘录)
  • 已知局限
绝不覆盖 rubric-memo.md 已有内容——bump memo 按时间顺序累积。
Append a memo section to the end of the file according to Step 5 in bump-validation-protocol.md + templates/rubric-memo.template.md:
  • Trigger observation (include real observation ID)
  • Evidence data (Full re-score table + ranking comparison of calibration pool, include real video names + actual performance)
  • Derived evidence (Include real sample names + actual performance)
  • Diagnosis
  • New formula
  • Cross-model audit conclusion reference (include model name + judgment + reason excerpt)
  • Known limitations
Never overwrite existing content in rubric-memo.md — bump memos accumulate in chronological order.

5c. cleanup pass(按 observation-lifecycle.md 的"cleanup pass 强制时机")

5c. Cleanup Pass (According to "Mandatory Cleanup Pass Timing" in observation-lifecycle.md)

rubric_notes.md
内执行(动 rubric-memo.md):
  • 已被吸收为新维度的观察 → 删(如观察 E 被吸收为 MS → 删观察 E)
  • 被新数据推翻的观察 → 删
  • 仍未解决的观察 → 迁移到新版本"待验证假设"段
  • 已被验证的"规律"→ 移到"规律沉淀区"
Execute within
rubric_notes.md
(Do not modify rubric-memo.md):
  • Observations that have been absorbed into new dimensions → Delete (e.g., Observation E absorbed into MS → delete Observation E)
  • Observations overturned by new data → Delete
  • Unresolved observations → Move to "Pending Validation Hypotheses" section of new version
  • Validated "rules" → Move to "Rule Precipitation Area"

5d. 整理 + 自检

5d. Organize + Self-Check

  • 重新读
    rubric_notes.md
    全文,确保读者能在 60 秒内理解当下规则——超出 600 行触发额外清算
  • 自检 leak guard:对
    rubric_notes.md
    grep -E '\\d+\\s*[wWmMkK万]|播放|实绩|实际'
    → 如有命中 → abort bump + 回滚,提示用户"rubric_notes.md 写入了违禁内容(实绩 / 播放数)"。这些内容应在 rubric-memo.md,不在 rubric_notes.md
  • Re-read entire
    rubric_notes.md
    to ensure readers can understand current rules within 60 seconds → trigger additional cleanup if exceeding 600 lines
  • Self-check leak guard: Run
    grep -E '\\d+\\s*[wWmMkK]|plays|actual performance|actual'
    on
    rubric_notes.md
    → if any hits → abort bump + rollback, prompt user "rubric_notes.md contains prohibited content (actual performance / play counts)". These contents should be in rubric-memo.md, not rubric_notes.md

Phase 6: 校准样本批量更新

Phase 6: Batch Update of Calibration Samples

对每个校准样本的 prediction 文件,底部追加(不动预测段、不动复盘段):
markdown

---
**Re-scored under v2.1 on 2026-05-04**: composite=8.24 → 9.11 (blind: true)
(rubric bump 时全量重算,由 cheat-score-blind sub-agent 独立打分;详见 rubric-memo.md 的 v2 → v2.1 升级 Memo)
blind: true
字段必填——告诉未来读这条记录的人"这是 channel B 隔离打分,不是主 Claude 自评"。如果某条 prediction 在 Phase 2 因 sub-agent 失败被排除 → 不会有 Re-scored 行(保持原样)。
用 Edit 工具,匹配每个文件的最末尾。
For each calibration sample's prediction file, append to the bottom (do not modify prediction section or review section):
markdown

---
**Re-scored under v2.1 on 2026-05-04**: composite=8.24 → 9.11 (blind: true)
(Full re-calculated during rubric bump, independently scored by cheat-score-blind sub-agent; see v2 → v2.1 upgrade memo in rubric-memo.md)
The
blind: true
field is required — tell future readers of this record "This is channel B isolated scoring, not self-scored by main Claude". If a prediction was excluded in Phase 2 due to sub-agent failure → no Re-scored line will be added (keep as is).
Use Edit tool to match the end of each file.

Phase 7: 更新 state file

Phase 7: Update State File

json
{
  "rubric_version": "v2.1",
  "last_bump_at": "<ISO timestamp>",
  "last_bump_self_audited": false,
  "consecutive_directional_errors": [],
  "calibration_samples_at_last_bump": <current value>
}
清空
consecutive_directional_errors
——新 rubric 重新计数。
json
{
  "rubric_version": "v2.1",
  "last_bump_at": "<ISO timestamp>",
  "last_bump_self_audited": false,
  "consecutive_directional_errors": [],
  "calibration_samples_at_last_bump": <current value>
}
Clear
consecutive_directional_errors
— new rubric starts counting again.

Phase 8: 控制台报告

Phase 8: Console Report

✅ Rubric 已升级 v2 → v2.1

变化:
- ER ×1.5 → ×2.0
- SR ×1.5 → ×1.0
- 新增 MS / TS
- 删除 NA / AB

校准池重打:5/5 通过排序检查(4/5 一致 + 0 pairwise 回归)
跨模型审核:✅ PASS
Cleanup pass:删除观察 D 和 E(已吸收为 QL 重定义和 MS 维度)

下一篇预测起按 v2.1 公式打分。
所有历史预测文件已追加 Re-scored 标记。

✅ Rubric upgraded from v2 → v2.1

Changes:
- ER ×1.5 → ×2.0
- SR ×1.5 → ×1.0
- Added MS / TS
- Removed NA / AB

Calibration pool re-scoring: 5/5 passed ranking check (4/5 consistent + 0 pairwise regression)
Cross-model audit: ✅ PASS
Cleanup pass: Deleted Observations D and E (absorbed into QL redefinition and MS dimension)

Scoring will use v2.1 formula starting from next prediction.
All historical prediction files have been appended with Re-scored marks.

Phase B:bucket-only 重校(轻量分支)

Phase B: Bucket-Only Recalibration (Lightweight Branch)

/cheat-bump --bucket-only [--scheme ratio|absolute|percentile]
与完整 bump 的本质区别:bucket 边界不是规则的一部分,是数据派生量。重新派生它不需要跨模型审核——派生算法是确定性的,没有"判断"成分。
/cheat-bump --bucket-only [--scheme ratio|absolute|percentile]
Essential difference from full bump: Bucket boundaries are not part of the rules, they are data-derived quantities. Re-deriving them does not require cross-model audit — the derivation algorithm is deterministic with no "judgment" component.

B1: 选择算法(按可用样本数自动派生,state 不存 scheme

B1: Select Algorithm (Automatically Derived Based on Available Sample Count, Scheme Not Stored in State)

算法适用边界派生方式
ratio
(默认 N=1-4)
小样本上一篇 / 最近 3 篇中位数 × {0.3 / 1 / 3 / 10 / 30}
absolute
(默认 N=5-9)
中等样本校准池中位数 × {0.3 / 1 / 3 / 10 / 30},固定边界
percentile
(默认 N≥10)
大样本校准池实绩 percentile {30 / 60 / 85 / 95 / 100}
--scheme
参数允许用户显式覆盖默认
  • --scheme ratio
    强制用 ratio(即使 N≥5)
  • --scheme absolute
    强制用 absolute
  • --scheme percentile
    强制用 percentile(要求 N≥3,否则报错)
未指定
--scheme
→ 按上表自动派生。
旧设计有
bucket_scheme
state 字段——v1.1 删了。所有 skill 实时按 calibration_samples 派生算法,不需要持久化"当前用哪个"。这避免了"切换 scheme 后忘了同步"的状态不一致问题。
AlgorithmApplicableBoundary Derivation Method
ratio
(default for N=1-4)
Small sample sizeMedian of last 1 / last 3 samples × {0.3 / 1 / 3 / 10 / 30}
absolute
(default for N=5-9)
Medium sample sizeMedian of entire calibration pool × {0.3 / 1 / 3 / 10 / 30}, fixed boundaries
percentile
(default for N≥10)
Large sample sizeActual performance percentiles of calibration pool {30 / 60 / 85 / 95 / 100}
The
--scheme
parameter allows users to explicitly override default:
  • --scheme ratio
    forces use of ratio (even if N≥5)
  • --scheme absolute
    forces use of absolute
  • --scheme percentile
    forces use of percentile (requires N≥3, otherwise error)
If
--scheme
is not specified → automatically derived according to the table above.
Old design had
bucket_scheme
state field — removed in v1.1. All skills derive algorithm in real-time based on calibration_samples, no need to persist "which one is currently used". This avoids state inconsistency issues like "forgot to sync after switching scheme".

B2: 派生新边界

B2: Derive New Boundaries

predictions/*.md
中所有有
actual_plays
的样本。
ratio 模式
baseline = median(最近 3 篇 actual_plays)
buckets = {
  "退步": (-inf, baseline * 0.3),
  "持平": (baseline * 0.3, baseline * 1),
  "命中": (baseline * 1, baseline * 3),
  "小爆": (baseline * 3, baseline * 10),
  "大爆": (baseline * 10, +inf),
}
absolute 模式
baseline = median(全部校准池 actual_plays)
buckets = {
  "底部": (-inf, baseline * 0.3),
  "基础盘": (baseline * 0.3, baseline * 1),
  "命中": (baseline * 1, baseline * 3),
  "爆款": (baseline * 3, baseline * 10),
  "现象级": (baseline * 10, +inf),
}
percentile 模式
sorted_plays = sorted(全部校准池 actual_plays)
buckets = {
  "底部":   ≤ p30,
  "基础盘": p30 - p60,
  "命中":   p60 - p85,
  "小爆":   p85 - p95,
  "大爆":   ≥ p95,
}
Read all samples with
actual_plays
in
predictions/*.md
.
Ratio Mode:
baseline = median(last 3 actual_plays)
buckets = {
  "Decline": (-inf, baseline * 0.3),
  "Stable": (baseline * 0.3, baseline * 1),
  "Hit": (baseline * 1, baseline * 3),
  "Small Viral": (baseline * 3, baseline * 10),
  "Big Viral": (baseline * 10, +inf),
}
Absolute Mode:
baseline = median(all calibration pool actual_plays)
buckets = {
  "Bottom": (-inf, baseline * 0.3),
  "Base Audience": (baseline * 0.3, baseline * 1),
  "Hit": (baseline * 1, baseline * 3),
  "Viral": (baseline * 3, baseline * 10),
  "Phenomenal": (baseline * 10, +inf),
}
Percentile Mode:
sorted_plays = sorted(all calibration pool actual_plays)
buckets = {
  "Bottom":   ≤ p30,
  "Base Audience": p30 - p60,
  "Hit":   p60 - p85,
  "Small Viral":   p85 - p95,
  "Big Viral":   ≥ p95,
}

B3: 报告变化 + 用户确认

B3: Report Changes + User Confirmation

当前 bucket scheme: ratio
proposed scheme: absolute
baseline: 4.2w 中位数(基于 5 篇校准样本)

新边界:
- 底部:   < 1.3w
- 基础盘: 1.3w - 4.2w
- 命中:   4.2w - 12.6w
- 爆款:   12.6w - 42w
- 现象级: > 42w

派生说明:
- 5 篇实绩:1.5w / 3.8w / 4.2w / 5.6w / 18w
- 中位数 4.2w,新桶按 ×{0.3, 1, 3, 10} 派生

确认应用?(yes / no)
Current bucket scheme: ratio
Proposed scheme: absolute
Baseline: 42K median (based on 5 calibration samples)

New boundaries:
- Bottom:   < 12.6K
- Base Audience: 12.6K - 42K
- Hit:   42K - 126K
- Viral:   126K - 420K
- Phenomenal: > 420K

Derivation explanation:
- 5 actual performances: 15K / 38K / 42K / 56K / 180K
- Median is 42K, new buckets derived by ×{0.3, 1, 3, 10}

Confirm application? (yes / no)

B4: 落地

B4: Implementation

用户确认后:
  1. 编辑
    rubric_notes.md
    的 "Bucket 方案" 段,替换为新表
  2. 更新
    .cheat-state.json
    baseline_plays
    字段(bucket scheme 不持久化——下次 cheat-predict 实时派生)
  3. rubric_notes.md
    的 bucket 段顶部追加一行变更记录:
    v2 buckets recalibrated on YYYY-MM-DD: scheme=absolute, baseline=4.2w (基于 N=10 个样本)
  4. 修改任何 prediction 文件——历史预测的 bucket 标签保持原样(在该样本写入时的方案下做出的判断)
After user confirmation:
  1. Edit the "Bucket Scheme" section in
    rubric_notes.md
    , replace with new table
  2. Update
    baseline_plays
    field in
    .cheat-state.json
    (bucket scheme is not persisted — derived in real-time during next cheat-predict)
  3. Append a change record to the top of the bucket section in
    rubric_notes.md
    :
    v2 buckets recalibrated on YYYY-MM-DD: scheme=absolute, baseline=42K (based on N=10 samples)
  4. Do not modify any prediction files — bucket tags in historical predictions remain as they are (judgments made under the scheme at the time of writing the sample)

B5: 对未来预测的影响

B5: Impact on Future Predictions

下一次
/cheat-predict
起按新 bucket 派生。历史 prediction 文件里的 bucket 标签不重算——bucket 是预测时的语义判断,事后改写会破坏盲度。
Starting from the next
/cheat-predict
, new buckets will be derived. Bucket tags in historical prediction files will not be recalculated — buckets are semantic judgments made at prediction time, post-hoc rewriting will destroy blindness.

Phase B 不做的事

What Phase B Does Not Do

  • 不重打 composite(公式没变)
  • 不重新审核观察段(rubric 没变)
  • 不调跨模型审核(确定性派生无需判断)
  • 不要求严格的样本数门槛(按 READINESS_HEURISTIC 由 Claude 判断;ratio 模式 N=1 就能跑)

  • No re-calculation of composite (formula remains unchanged)
  • No re-review of observation sections (rubric remains unchanged)
  • No cross-model audit (deterministic derivation requires no judgment)
  • No strict sample count threshold (judged by Claude according to READINESS_HEURISTIC; ratio mode can run with N=1)

Key Rules

Key Rules

  1. 5 步不可跳(仅完整 rubric bump)。任何"先简化跑一下"的请求都拒绝
  2. THRESHOLD 写死(仅完整 rubric bump)。不允许动态调整
  3. 跨模型审核是默认(仅完整 rubric bump)。关闭审核需要在 state file 显式标记
  4. cleanup pass 是 bump 的一部分(仅完整 rubric bump)。不允许 bump 完不清理观察段
  5. REQUIRE_CONFIRM(两种模式都要)。最后落地前必须用户明确说 "yes, bump" 或 "yes, recalibrate"
  6. bucket 重校不动历史预测。bucket 是预测时语义,事后改写破坏盲度
  1. 5 steps cannot be skipped (only for full rubric bump). Reject any request to "run a simplified version first"
  2. THRESHOLD is hard-coded (only for full rubric bump). Dynamic adjustment is not allowed
  3. Cross-model audit is default (only for full rubric bump). Turning off audit requires explicit marking in state file
  4. Cleanup pass is part of bump (only for full rubric bump). Bump cannot be completed without cleaning observation sections
  5. REQUIRE_CONFIRM (both modes). Must get explicit user confirmation "yes, bump" or "yes, recalibrate" before final implementation
  6. Bucket recalibration does not modify historical predictions. Buckets are prediction-time semantics, post-hoc rewriting destroys blindness

Refusals

Refusals

  • 「跳过校准池重打,直接换公式」 → 拒绝。原则 #2
  • 「跳过 cheat-score-blind sub-agent,主 Claude 直接重打就行」 → 拒绝。bump 不接受任何 self-scored fallback——sub-agent 不可用 → abort bump,不接受"自审"
  • 「跳过外部 LLM 审核」 → 仅当
    CROSS_MODEL_AUDIT=false
    显式设置
  • 「这次 THRESHOLD 调到 3/5 让它过」 → 拒绝。改 THRESHOLD 是元层级 bump
  • 「保留所有旧观察作为历史」 → 违反原则 #3
  • 「先 bump,cleanup 下次再做」 → 拒绝。cleanup 是 bump 的一部分
  • 「只重算 composite 不重打 dim」 → 拒绝。新权重 × 旧 dim 仍是旧污染。每个 dim 都由 sub-agent 重审 script
  • 「把 Memo 全文写进 rubric_notes.md 顶部,方便我读」 → 拒绝。rubric_notes.md 是 blind sub-agent 白名单——含视频名 / 实绩 → 通过白名单泄漏。Memo 写 rubric-memo.md(白名单),rubric_notes.md 只放公式 + 通用语言维度定义 + 指针
  • 「派生证据段保留真实视频名,让 rubric 读起来更具体」 → 拒绝。在 rubric_notes.md 必须用通用语言("高抽象密度样本");带视频名的派生证据写 rubric-memo.md
  • "Skip calibration pool re-scoring, directly change formula" → Reject. Rule #2
  • "Skip cheat-score-blind sub-agent, main Claude can re-score directly" → Reject. Bump does not accept any self-scored fallback — if sub-agent is unavailable → abort bump, do not accept "self-audit"
  • "Skip external LLM audit" → Only allowed if
    CROSS_MODEL_AUDIT=false
    is explicitly set
  • "Adjust THRESHOLD to 3/5 this time to let it pass" → Reject. Changing THRESHOLD is a meta-level bump
  • "Keep all old observations as history" → Violates Rule #3
  • "Bump first, do cleanup next time" → Reject. Cleanup is part of bump
  • "Only recalculate composite without re-scoring dimensions" → Reject. New weights × old dimensions are still old contamination. Each dimension must be re-reviewed by sub-agent
  • "Write full memo into top of rubric_notes.md for easy reading" → Reject. rubric_notes.md is whitelisted for blind sub-agent — containing video names / actual performance → leaks through whitelist. Memo is written in rubric-memo.md (outside whitelist), rubric_notes.md only contains formula + general language dimension definitions + pointer
  • "Keep real video names in derived evidence section to make rubric more specific" → Reject. Must use general language in rubric_notes.md ("high abstract density samples"); derived evidence with video names is written in rubric-memo.md

Integration

Integration

  • 上游:
    /cheat-retro
    检测到 ≥3 同向偏差 → 提议跑
    /cheat-bump
  • 依赖:
    mcp__llm-chat__chat
    (如配置)+ Task tool(spawn cheat-score-blind)
  • 修改:
    • rubric_notes.md
      (结构性更新,绝不写真实视频名 / 实绩)
    • rubric-memo.md
      ——append Memo 全文,含证据 + 派生证据)
    • 所有
      predictions/*.md
      (追加 Re-scored 行,不动预测段)
    • .cheat-state.json
  • 下游:下一篇
    /cheat-predict
    自动按新 rubric_version 打分
  • Upstream:
    /cheat-retro
    detects ≥3 same-direction deviations → propose running
    /cheat-bump
  • Dependencies:
    mcp__llm-chat__chat
    (if configured) + Task tool (spawn cheat-score-blind)
  • Modifications:
    • rubric_notes.md
      (structural update, never write real video names / actual performance)
    • rubric-memo.md
      (new — append full memo, including evidence + derived evidence)
    • All
      predictions/*.md
      (append Re-scored line, do not modify prediction section)
    • .cheat-state.json
  • Downstream: Next
    /cheat-predict
    automatically uses new rubric_version for scoring