cheat-bump
/cheat-bump — Rubric / Bucket Upgrade
Two modes:
| Mode | Trigger | What it does | Validation strength |
|---|---|---|---|
| Full rubric bump | --propose "<...>" | Modifies formula / dimensions / weights | 5 steps + cross-model audit (mandatory) |
| Bucket-only recalibration | --bucket-only | Only re-derives bucket boundaries | Derived automatically from data, no audit |
A full rubric bump strictly follows the 5 steps in shared-references/bump-validation-protocol.md. Bucket-only takes the lightweight path; see Phase B below.
Overview
Entry: user triggers /cheat-bump
↓
[Phase A0: Detect Call Mode]
↓
├─ --bucket-only → [Phase B: Lightweight Bucket Recalibration]
└─ --propose → [Phase 0~8: Full Rubric Bump]
Phase A0: Call Mode Routing (Do First)
Read the user's parameters:
- Contains --bucket-only → proceed to Phase B (lightweight recalibration)
- Contains --propose "<...>" → proceed to Phase 0~8 (full rubric bump)
- Neither → ask the user: "What do you want to do? 1) Adjust the rubric formula / add or remove dimensions → --propose; 2) Only re-derive bucket boundaries → --bucket-only"
If the user says "I think ER is too low and I want to adjust it" → that is the --propose path.
If the user says "My account has grown and the buckets are no longer accurate" → that is the --bucket-only path.
The two paths cannot be mixed: each invocation does exactly one thing.
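The routing rule above can be sketched in a few lines. This is a minimal illustration, not the skill's actual implementation: the function name detect_mode and the return labels are hypothetical, and only the two flags named in this document are recognized.

```python
import shlex


def detect_mode(raw_args: str) -> str:
    """Classify a /cheat-bump argument string (hypothetical helper).

    Returns "bucket-only", "propose", or "ask" (neither flag present,
    so the user must be asked which path they want).
    """
    tokens = shlex.split(raw_args)
    has_bucket = "--bucket-only" in tokens
    has_propose = "--propose" in tokens
    # The two paths cannot be mixed in a single invocation.
    if has_bucket and has_propose:
        raise ValueError("--propose and --bucket-only cannot be combined")
    if has_bucket:
        return "bucket-only"
    if has_propose:
        return "propose"
    return "ask"
```

Raising on a mixed invocation is one way to enforce "one thing per invocation"; asking the user to pick a path would be an equally reasonable choice.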
Full Rubric Bump Workflow
[User: upgrade rubric --propose "ER×1.5→2.0, remove NA, add MS"]
↓
[Phase 0: Pre-threshold Check]
↓
[Phase 1: Write Complete New Formula Equation]
↓
[Phase 2: Full Re-scoring of Calibration Pool]
↓
[Phase 3: Calculate Ranking Consistency]
↓
[Phase 4: Mandatory Cross-Model Independent Audit]
↓
[Phase 5: Implementation + Cleanup Pass]
↓
[Phase 6: Append Re-scored Line to Bottom of All Calibration Sample Prediction Files]
Constants
- READINESS_HEURISTIC:
  - Default reference: calibration pool ≥5 samples + at least 1 cross-sample observation supported by ≥3 samples
  - Claude may still propose a bump with fewer samples if the observed signal is exceptionally strong:
    - N=3, but a strong counterexample completely overturns a current rubric assumption (e.g., composite 8.5 vs. actual performance 50K, a ≥3x deviation)
    - A single post shows an isolated but extreme phenomenon (e.g., a single meme with ≥2000 likes in the comment section)
  - Claude may also refuse a bump with enough samples if the evidence is weak:
    - N=10, but the observations are fragmented, low-confidence patterns with no clear direction
    - The user's reviews contain many non-serious judgments of the "just glanced at it" kind
  - The prediction header or cheat-bump output must state whether the proposal is default-aligned or judgment-driven, so the user has a basis for review
- THRESHOLD = 0.8: consistency threshold between the new ranking and the actual-performance ranking (4/5). Hard-coded; this is the statistical rigor of bump validation
- CROSS_MODEL_AUDIT = true: call an external LLM for an independent audit. false is only for offline use
- REQUIRE_CONFIRM = true: require the user to say "yes, bump" explicitly before implementation
Inputs
| Required | Source |
|---|---|
| | User parameters; ask if missing |
| | User project root |
| All predictions/*.md | Calibration pool data |
| .cheat-state.json | State file |
Workflow
Phase 0: Pre-threshold Check
Check item by item against the "When to Prohibit" section of bump-validation-protocol.md:
| Check | Failure handling |
|---|---|
| Total calibration pool samples vs. observation strength | Claude's judgment, per READINESS_HEURISTIC: default ≥5 samples, with exceptions allowed (strong counterexample / strong meme). If the default is not met, Claude must explicitly explain why it still proposes the bump ("Although there are only N=3 samples, entry X shows composite Y vs. actual performance Z, a W-fold deviation") so the user can review |
| New calibrations since the last bump vs. observation maturity | Claude's judgment: the default suggestion is ≥3 new samples, but if 3 consecutive samples all give strong evidence in the same direction → no need to wait |
| | Reject: "You have an in-progress prediction that is not finished. Complete that flow first or clear the state" |
| Trigger condition holds (systematic deviation / new cross-sample observation / sufficient evidence for a new dimension) | Warn but do not block; ask the user why they want to bump now |
Pass → proceed to Phase 1.
Phase 1: Write Complete New Formula Equation
Do not accept the user's brief description as-is. Expand it into complete equations:
Current: v2 composite = (ER×1.5 + SR×1.5 + HP×1.5 + QL + NA + AB + SAT) / 8.5 × 2.0
Proposed: v2.1 composite = (ER×2.0 + HP×1.5 + MS×1.5 + QL + SR + TS + SAT) / 9.0 × 2.0
Summary of changes:
- ER ×1.5 → ×2.0 (increased)
- SR ×1.5 → ×1.0 (decreased)
- Added MS ×1.5 (Memetic Shareability)
- Added TS ×1.0 (Topic Shareability)
- Removed NA (overlaps with HP)
- Removed AB (replaced by TS)
- Normalization constant 8.5 → 9.0
- Total number of formula dimensions: 7 → 7 (net change 0)
If the user's proposal is vague (e.g., "raise the ER weight a bit") → ask for specific values; never guess.
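To make the expansion concrete, the two equations can be written as weight tables and evaluated. This is a sketch under stated assumptions: the names V2_WEIGHTS, V2_1_WEIGHTS and composite are illustrative, and the per-dimension score range is not given in this section. The normalization constant is simply the sum of the weights, which is why it moves from 8.5 to 9.0.

```python
# Weights copied from the two equations above; unweighted dimensions count as 1.0.
V2_WEIGHTS = {"ER": 1.5, "SR": 1.5, "HP": 1.5, "QL": 1.0, "NA": 1.0, "AB": 1.0, "SAT": 1.0}
V2_1_WEIGHTS = {"ER": 2.0, "HP": 1.5, "MS": 1.5, "QL": 1.0, "SR": 1.0, "TS": 1.0, "SAT": 1.0}


def composite(scores: dict, weights: dict, scale: float = 2.0) -> float:
    """Weighted sum of dimension scores, normalized by total weight, times scale."""
    missing = set(weights) - set(scores)
    if missing:
        # A dimension required by the formula was never scored.
        raise KeyError(f"missing dimension scores: {sorted(missing)}")
    normalizer = sum(weights.values())  # 8.5 for v2, 9.0 for v2.1
    weighted = sum(scores[dim] * w for dim, w in weights.items())
    return weighted / normalizer * scale
```

Writing the formula this way also makes a vague proposal fail fast: a weight change cannot be expressed without a concrete number.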
Phase 2: Full Re-scoring of Calibration Pool (Mandatory Blind Sub-agent)
Glob all files in predictions/*.md that have a complete review section → calibration pool.
A bump is the tool's highest-risk action: all re-scoring must go through the cheat-score-blind sub-agent. Inline re-scoring means the main Claude has already seen the actual performance, so rank consistency becomes overfitting rather than real signal.
Mandatory Constraints
- No self-scored fallback: /cheat-predict has a --skip-blind flag, but /cheat-bump does not. If the Task tool is unavailable → abort the bump and tell the user "Resolve the Task tool issue first, then bump"
- No "I only recalculate the composite without re-scoring dimensions": even if the new formula only adjusts weights and adds no dimensions, every dimension of every prediction must be re-reviewed against the script by the sub-agent. Reason: the old dimension scores may themselves be contaminated, and a weight change does not guarantee the old dimension scores still hold
For Each Prediction:
- Parse the prediction file to get the corresponding scripts/<id>.md path (from the Script Path header field)
- Verify that the script file exists and its hash matches the Script Hash header; if not → warn (the script has been modified) but still spawn the sub-agent
- Spawn the cheat-score-blind sub-agent via the Task tool:
  Spawn cheat-score-blind sub-agent.
  Input:
    script_path: <Script Path from prediction header>
    rubric_notes_path: rubric_notes.md
    sidecar_path: .cheat-cache/bump-rescores/<prediction-id>.json
  Task: Score the script according to the current formula in rubric_notes (already the new version vN+1).
  Return strict JSON. Write the sidecar file for batch reading by the main bump process.
  Do not read the state file / predictions/ / videos/ or any other files.
  Do not ask the user; you have no user.
  Do not read this prediction file itself; you only look at script + rubric.
- Wait for the sub-agent to finish → read the sidecar JSON → the main process calculates the composite with the new formula
- Write the "re-score table" to .cheat-cache/bump-rescores.json (summary). Mark each entry blind: true; during the Phase 5 cleanup of the bump, write this field together with the new score to the Re-scored under v<N+1> line of the prediction file
Honest Labeling of Residual Contamination
Even with the sub-agent, two kinds of residual contamination must be honestly labeled in the bump report:
| Type | Source | Label field |
|---|---|---|
| Model prior contamination | The sub-agent is still Claude and shares the same RLHF | |
| User's own rubric design bias | rubric_notes.md is written by the user, so it naturally fits their own content | |
These two labels remind the user that channel C (cross-model audit) cannot be skipped. The bump report must end with: "The rank consistency above is consistency within channel A. The final decision must wait until the channel C audit passes."
Failure Modes
| Symptom | Handling |
|---|---|
| The script file for a prediction is missing | The sub-agent skips that entry; the main process reports "N entries excluded due to missing script". If the remaining valid pool < MIN_SAMPLES → abort bump |
| Sub-agent returns | Resend the Task up to 3 times; if it still fails → mark the entry as |
| Task tool is entirely unavailable | Abort bump and tell the user "Task tool is a hard dependency for bump. If you are truly in an offline environment, run |
| Sub-agent output contains contamination_signal | Mark as |
Phase 3: Calculate Ranking Consistency
For each sample:
new_composite_rank: rank when sorted by the new formula
actual_plays_rank: rank when sorted by actual plays
delta: |new_rank - actual_rank|
Output a comparison table:
| Sample | composite (v2) | composite (v2.1) | rank (new) | actual | rank (actual) | delta |
|---|---|---|---|---|---|---|
| Hamster | 9.41 | 9.55 | 1 | 1.248M | 1 | 0 |
| Stop Expecting | 8.24 | 9.11 | 2 | 711K | 2 | 0 |
| Boss Nonsense | 7.65 | 8.11 | 4 | 396K | 3 | 1 |
| Job Hunting Paradox | 8.47 | 7.56 | 5 | 168K | 4 | 1 |
| Who Asked You | 8.24 | 7.00 | 6 | 117K | 5 | 1 |
Ranking consistency: 4/5 with |delta| ≤ 1
Pairwise no-regression: no pair that the old formula ranked correctly is reversed under the new formula ✓
Judgment:
- Ranking consistency < THRESHOLD (default 0.8) → local rejection; report the failure explicitly and stop before Phase 4
- Pairwise regression occurs → local rejection
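The two local checks can be sketched as follows. This is an illustrative reading of the text, not a confirmed implementation: consistency is taken as the share of samples whose new rank is within 1 of their actual rank, and a pairwise regression is a pair the old formula ordered correctly that the new formula no longer orders correctly. All function names are hypothetical, and ties are broken by input order as an assumption.

```python
from itertools import combinations


def ranks(values):
    """Rank 1 = largest value; ties keep input order (an assumption)."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    out = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = rank
    return out


def rank_consistency(new_composites, actual_plays, tolerance=1):
    """Fraction of samples with |new_rank - actual_rank| <= tolerance."""
    new_r, act_r = ranks(new_composites), ranks(actual_plays)
    hits = sum(abs(a - b) <= tolerance for a, b in zip(new_r, act_r))
    return hits / len(new_r)


def pairwise_regressions(old, new, actual):
    """Pairs (i, j) ordered correctly by the old scores but not by the new ones."""
    regressions = []
    for i, j in combinations(range(len(actual)), 2):
        if actual[i] == actual[j]:
            continue  # no ground-truth order to regress against
        truth = actual[i] > actual[j]
        old_ok = old[i] != old[j] and (old[i] > old[j]) == truth
        new_ok = new[i] != new[j] and (new[i] > new[j]) == truth
        if old_ok and not new_ok:
            regressions.append((i, j))
    return regressions


def local_verdict(old, new, actual, threshold=0.8):
    """PASS only if consistency meets the threshold and nothing regressed."""
    if rank_consistency(new, actual) < threshold:
        return "REJECT"
    if pairwise_regressions(old, new, actual):
        return "REJECT"
    return "PASS"
```

Keeping the threshold as a default argument mirrors the constant, but a caller should not override it, since THRESHOLD is hard-coded by rule.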
Phase 4: Cross-Model Independent Audit (Mandatory Unless Escape Hatch)
CROSS_MODEL_AUDIT=true
Call mcp__llm-chat__chat with the prompt:
You are an independent reviewer. Below is a rubric formula that a content creator is preparing to upgrade.
Please independently judge two things:
1. Ranking consistency: Is the ranking of samples by the new formula really consistent with the actual-performance ranking on ≥80% of samples?
2. Explanatory power: Does the new formula explain the actual-performance distribution of the calibration pool better than the old formula?
Data:
Old formula: (ER×1.5 + SR×1.5 + HP×1.5 + QL + NA + AB + SAT) / 8.5 × 2.0
New formula: (ER×2.0 + HP×1.5 + MS×1.5 + QL + SR + TS + SAT) / 9.0 × 2.0
Calibration pool:
[Full JSON of the re-score table from Phase 2]
Ranking comparison:
[Full JSON of the table from Phase 3]
Output format:
- Judgment: PASS or REJECT
- Reason: ≥100 words
- Key risks: [list potential issues of the new formula, if any]
On receiving the external LLM's reply → parse the judgment.
Judgment logic:
- Local PASS + external PASS → pass, proceed to Phase 5
- Local PASS + external REJECT → treat as REJECT. A conflict means at least one side's reading is unstable
- Local REJECT → already terminated in Phase 3
- mcp__llm-chat__chat unavailable → degrade gracefully to CROSS_MODEL_AUDIT=false and mark last_bump_self_audited: true in the state file
CROSS_MODEL_AUDIT=false
- Rely on the local judgment only
- The mark persists in the state file, and cheat-status keeps reminding the user: "This bump was self-audited; configuring mcp__llm-chat__chat is recommended"
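The judgment logic reduces to a small decision function. The sketch below is only an illustration: final_verdict is a hypothetical name, the verdict strings follow the PASS / REJECT format above, and an unavailable external reviewer is modeled as None.

```python
from typing import Optional


def final_verdict(local: str, external: Optional[str], cross_model_audit: bool = True) -> dict:
    """Combine the local (Phase 3) and external (Phase 4) verdicts.

    external is None when the external LLM could not be reached.
    """
    if local != "PASS":
        # A local rejection already ended the bump in Phase 3.
        return {"verdict": "REJECT", "last_bump_self_audited": False}
    if not cross_model_audit or external is None:
        # Escape hatch or graceful degradation: only the local verdict counts,
        # and the state file records that this bump was self-audited.
        return {"verdict": "PASS", "last_bump_self_audited": True}
    # Any external answer other than PASS is treated as a rejection.
    verdict = "PASS" if external == "PASS" else "REJECT"
    return {"verdict": verdict, "last_bump_self_audited": False}
```

Returning the self-audit flag alongside the verdict keeps the state-file mark from being forgotten on the degraded path.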
Phase 5: Implementation + Cleanup Pass
After the audit passes, with REQUIRE_CONFIRM=true → ask the user: "The new formula passed both the local and the external audit. Final confirmation: execute the bump? This will modify rubric_notes.md + rubric-memo.md and delete several observations that have been absorbed. The bump runs only if you answer 'yes, bump'."
After the user confirms:
5a. Update rubric_notes.md
(General language only; no video names / actual performance data)
- Update the top metadata:
  - **当前版本** (Current Version): vN+1
  - **Last bumped at**: <ISO 8601>
  - **Upgrade memos**: See [rubric-memo.md](rubric-memo.md) (a pointer; do not copy memo content)
- Add a row to the version quick-reference table (version number + formula signature only, no evidence samples)
- Update the "Current Scoring Dimensions" section (remove NA / AB, add MS / TS)
- Derived evidence section: if a new dimension needs an anchor explanation → use general language:
  - ✅ Allowed: "Derived evidence: high abstract density samples → CC=1 → low reach"
  - ❌ Prohibited: "Derived evidence: 'Stop Expecting' CC=1 → actual performance 137K" (video name + actual performance number)
  - If a prohibited pattern is hit → move that passage to the "Derived Evidence" sub-section of rubric-memo.md and replace it in place with general language
5b. Write Memo to rubric-memo.md
(Append mode; do not overwrite history)
Append a memo section to the end of rubric-memo.md, following Step 5 of bump-validation-protocol.md and the format in templates/rubric-memo.template.md:
- Trigger observation (with the real observation ID)
- Evidence data (calibration pool re-score table + ranking comparison, with real video names + actual performance)
- Derived evidence (with real sample names + actual performance)
- Diagnosis
- New formula
- Cross-model audit conclusion (with model name + judgment + excerpt of the reason)
- Known limitations
Never overwrite existing content in rubric-memo.md; bump memos accumulate in chronological order.
5c. Cleanup Pass (According to "Mandatory Cleanup Pass Timing" in observation-lifecycle.md)
Execute within rubric_notes.md (do not touch rubric-memo.md):
- Observations absorbed into new dimensions → delete (e.g., Observation E absorbed into MS → delete Observation E)
- Observations overturned by new data → delete
- Observations still unresolved → move to the "Pending Validation Hypotheses" section of the new version
- Validated "rules" → move to the "Rule Precipitation Area"
5d. Organize + Self-Check
- Re-read rubric_notes.md in full and make sure a reader can understand the current rules within 60 seconds; exceeding 600 lines triggers an additional cleanup
- Leak-guard self-check: run grep -E '[0-9]+[[:space:]]*[wWmMkK万]|播放|实绩|实际' on rubric_notes.md (播放 = plays, 实绩 = actual performance, 实际 = actual) → on any hit → abort bump + roll back, and tell the user "rubric_notes.md contains prohibited content (actual performance / play counts)". That content belongs in rubric-memo.md, not rubric_notes.md
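The same leak guard can be expressed in Python for environments without grep. This is a sketch, not the tool's actual check: LEAK_PATTERN mirrors the grep pattern from the original document, and find_leaks is a hypothetical helper that returns the offending lines so the abort message can point at them.

```python
import re

# Digits followed by a magnitude suffix (w / m / k / 万), or the words
# 播放 (plays), 实绩 (actual performance), 实际 (actual).
LEAK_PATTERN = re.compile(r"\d+\s*[wWmMkK万]|播放|实绩|实际")


def find_leaks(text: str):
    """Return (line_number, line) for every line that matches the leak pattern."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if LEAK_PATTERN.search(line)
    ]
```

The pattern is deliberately broad, so it can flag harmless text such as "v2 weights" (a digit, a space, then "w"); hits may need a quick manual look before rolling back.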
Phase 6: Batch Update of Calibration Samples
For each calibration sample's prediction file, append to the bottom (do not touch the prediction section or the review section):
---
**Re-scored under v2.1 on 2026-05-04**: composite=8.24 → 9.11 (blind: true)
(Fully recalculated during the rubric bump, scored independently by the cheat-score-blind sub-agent; see the v2 → v2.1 upgrade memo in rubric-memo.md)
The blind: true field is required: it tells future readers of this record "this is channel B isolated scoring, not self-scored by the main Claude". If a prediction was excluded in Phase 2 because the sub-agent failed → do not add a Re-scored line (leave the file as is).
Use the Edit tool, matching the very end of each file.
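For illustration only, the footer can be generated and appended programmatically. The skill itself uses the Edit tool; the helper names rescored_footer and append_rescored below are hypothetical, and opening the file in append mode is one way to guarantee the prediction and review sections are never rewritten.

```python
from pathlib import Path


def rescored_footer(old_version: str, new_version: str, date: str,
                    old_composite: float, new_composite: float) -> str:
    """Build the Re-scored footer in the format shown above."""
    return (
        "\n---\n"
        f"**Re-scored under {new_version} on {date}**: "
        f"composite={old_composite:.2f} → {new_composite:.2f} (blind: true)\n"
        "(Fully recalculated during the rubric bump, scored independently by the "
        f"cheat-score-blind sub-agent; see the {old_version} → {new_version} "
        "upgrade memo in rubric-memo.md)\n"
    )


def append_rescored(path: Path, footer: str) -> None:
    """Append the footer; existing content is left byte-for-byte intact."""
    with path.open("a", encoding="utf-8") as handle:
        handle.write(footer)
```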
Phase 7: Update State File
{
  "rubric_version": "v2.1",
  "last_bump_at": "<ISO timestamp>",
  "last_bump_self_audited": false,
  "consecutive_directional_errors": [],
  "calibration_samples_at_last_bump": <current value>
}
Clear consecutive_directional_errors: the new rubric starts counting from zero.
Phase 8: Console Report
✅ Rubric upgraded v2 → v2.1
Changes:
- ER ×1.5 → ×2.0
- SR ×1.5 → ×1.0
- Added MS / TS
- Removed NA / AB
Calibration pool re-scoring: 5/5 passed the ranking check (4/5 consistent + 0 pairwise regressions)
Cross-model audit: ✅ PASS
Cleanup pass: deleted Observations D and E (absorbed into the QL redefinition and the MS dimension)
Scoring uses the v2.1 formula starting from the next prediction.
A Re-scored mark has been appended to all historical prediction files.
Phase B: Bucket-Only Recalibration (Lightweight Branch)
/cheat-bump --bucket-only [--scheme ratio|absolute|percentile]
Essential difference from a full bump: bucket boundaries are not part of the rules; they are data-derived quantities. Re-deriving them needs no cross-model audit, because the derivation algorithm is deterministic and has no "judgment" component.
B1: Select Algorithm (Derived Automatically from Available Sample Count; Scheme Not Stored in State)
| Algorithm | Applies to | Boundary derivation |
|---|---|---|
| ratio | Small sample | Previous post / median of last 3 posts × {0.3 / 1 / 3 / 10 / 30} |
| absolute | Medium sample | Calibration pool median × {0.3 / 1 / 3 / 10 / 30}, fixed boundaries |
| percentile | Large sample | Calibration pool actual-performance percentiles {30 / 60 / 85 / 95 / 100} |
The --scheme parameter lets the user explicitly override the default:
- --scheme ratio forces ratio (even if N≥5)
- --scheme absolute forces absolute
- --scheme percentile forces percentile (requires N≥3, otherwise error)
If --scheme is not specified → derive automatically per the table above.
The old design had a bucket_scheme state field; v1.1 removed it. All skills derive the algorithm in real time from calibration_samples, so there is no need to persist "which one is in use". This avoids the state inconsistency of "forgot to sync after switching scheme".
B2: Derive New Boundaries
Read all samples in predictions/*.md that have actual_plays.
Ratio mode:
baseline = median(last 3 actual_plays)
buckets = {
    "Decline": (-inf, baseline * 0.3),
    "Stable": (baseline * 0.3, baseline * 1),
    "Hit": (baseline * 1, baseline * 3),
    "Small Viral": (baseline * 3, baseline * 10),
    "Big Viral": (baseline * 10, +inf),
}
Absolute mode:
baseline = median(all calibration pool actual_plays)
buckets = {
    "Bottom": (-inf, baseline * 0.3),
    "Base Audience": (baseline * 0.3, baseline * 1),
    "Hit": (baseline * 1, baseline * 3),
    "Viral": (baseline * 3, baseline * 10),
    "Phenomenal": (baseline * 10, +inf),
}
Percentile mode:
sorted_plays = sorted(all calibration pool actual_plays)
buckets = {
    "Bottom": ≤ p30,
    "Base Audience": p30 - p60,
    "Hit": p60 - p85,
    "Small Viral": p85 - p95,
    "Big Viral": ≥ p95,
}
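The ratio and absolute derivations above share one shape (a baseline times fixed multipliers), so a runnable sketch can cover both. Assumptions are labeled: the function names are hypothetical, plays are passed oldest first, and buckets are treated as half-open intervals [low, high) because the pseudocode does not say which side owns a boundary. Percentile mode is left out, since the interpolation method is not specified.

```python
import math
from statistics import median

RATIO_LABELS = ["Decline", "Stable", "Hit", "Small Viral", "Big Viral"]
ABSOLUTE_LABELS = ["Bottom", "Base Audience", "Hit", "Viral", "Phenomenal"]
MULTIPLIERS = [0.3, 1, 3, 10]  # interior boundaries, as multiples of the baseline


def multiplier_buckets(baseline: float, labels: list) -> dict:
    """Map each label to a (low, high) interval derived from the baseline."""
    edges = [-math.inf] + [baseline * m for m in MULTIPLIERS] + [math.inf]
    return {label: (low, high) for label, low, high in zip(labels, edges, edges[1:])}


def ratio_buckets(plays_oldest_first: list) -> dict:
    """Ratio mode: baseline is the median of the last 3 actual_plays."""
    return multiplier_buckets(median(plays_oldest_first[-3:]), RATIO_LABELS)


def absolute_buckets(plays: list) -> dict:
    """Absolute mode: baseline is the median of the whole calibration pool."""
    return multiplier_buckets(median(plays), ABSOLUTE_LABELS)


def classify(plays: float, buckets: dict) -> str:
    """Return the label whose half-open interval [low, high) contains plays."""
    for label, (low, high) in buckets.items():
        if low <= plays < high:
            return label
    raise ValueError(f"no bucket for {plays}")
```

Because the derivation is a pure function of the play counts, running it twice on the same pool gives the same boundaries, which is the property that lets Phase B skip the audit.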
B3: Report Changes + User Confirmation
Current bucket scheme: ratio
Proposed scheme: absolute
Baseline: 42K median (based on 5 calibration samples)
New boundaries:
- Bottom: < 12.6K
- Base Audience: 12.6K - 42K
- Hit: 42K - 126K
- Viral: 126K - 420K
- Phenomenal: > 420K
Derivation notes:
- 5 actual results: 15K / 38K / 42K / 56K / 180K
- Median is 42K; new buckets derived by ×{0.3, 1, 3, 10}
Apply? (yes / no)
B4: Implementation
After the user confirms:
- Edit the "Bucket Scheme" section in rubric_notes.md, replacing it with the new table
- Update the baseline_plays field in .cheat-state.json (the bucket scheme is not persisted; the next cheat-predict derives it in real time)
- Add a change record at the top of the bucket section in rubric_notes.md: v2 buckets recalibrated on YYYY-MM-DD: scheme=absolute, baseline=42K (based on N=10 samples)
- Do not modify any prediction files: bucket labels in historical predictions stay as they are (they were judgments made under the scheme in force when the sample was written)
B5: Impact on Future Predictions
Starting from the next /cheat-predict, buckets are derived under the new scheme. Bucket labels in historical prediction files are not recalculated: a bucket is a semantic judgment made at prediction time, and rewriting it after the fact would destroy blindness.
What Phase B Does Not Do
- No re-scoring of composite (the formula is unchanged)
- No re-review of observation sections (the rubric is unchanged)
- No cross-model audit (deterministic derivation involves no judgment)
- No strict sample-count threshold (Claude judges per READINESS_HEURISTIC; ratio mode can run with N=1)
Key Rules
- The 5 steps cannot be skipped (full rubric bump only). Refuse any request to "run a simplified version first"
- THRESHOLD is hard-coded (full rubric bump only). Dynamic adjustment is not allowed
- Cross-model audit is the default (full rubric bump only). Turning the audit off requires an explicit mark in the state file
- The cleanup pass is part of the bump (full rubric bump only). A bump may not finish without cleaning up the observation sections
- REQUIRE_CONFIRM (both modes). Before final implementation the user must explicitly say "yes, bump" or "yes, recalibrate"
- Bucket recalibration does not touch historical predictions. Buckets are prediction-time semantics; rewriting them after the fact destroys blindness
Refusals
- "Skip calibration pool re-scoring and just switch the formula" → Reject. Principle #2
- "Skip the cheat-score-blind sub-agent; the main Claude can re-score directly" → Reject. A bump accepts no self-scored fallback: if the sub-agent is unavailable → abort bump; "self-audit" is not accepted
- "Skip the external LLM audit" → Only allowed if CROSS_MODEL_AUDIT=false is explicitly set
- "Lower THRESHOLD to 3/5 this once so it passes" → Reject. Changing THRESHOLD is a meta-level bump
- "Keep all old observations as history" → Violates Principle #3
- "Bump first, do the cleanup next time" → Reject. Cleanup is part of the bump
- "Only recalculate composite without re-scoring dimensions" → Reject. New weights × old dimension scores still carry the old contamination. Every dimension is re-reviewed against the script by the sub-agent
- "Write the full memo at the top of rubric_notes.md so it is easy for me to read" → Reject. rubric_notes.md is on the blind sub-agent's whitelist; if it contains video names / actual performance, they leak through the whitelist. The memo goes in rubric-memo.md (outside the whitelist); rubric_notes.md holds only the formula + general-language dimension definitions + a pointer
- "Keep real video names in the derived evidence section so the rubric reads more concretely" → Reject. rubric_notes.md must use general language ("high abstract density samples"); derived evidence with video names goes in rubric-memo.md
Integration
- Upstream: /cheat-retro detects ≥3 same-direction deviations → proposes running /cheat-bump
- Dependencies: mcp__llm-chat__chat (if configured) + Task tool (to spawn cheat-score-blind)
- Modifies:
  - rubric_notes.md (structural update; never write real video names / actual performance)
  - rubric-memo.md (new; append the full memo, including evidence + derived evidence)
  - All predictions/*.md (append the Re-scored line; do not touch the prediction section)
  - .cheat-state.json
- Downstream: the next /cheat-predict automatically scores with the new rubric_version